Today’s challenge was about checking whether the collection of integers forms a valid UTF8 string or not.
Problem description
Given an integer array data
representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).
A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:
- For a 1-byte character, the first bit is a
0
, followed by its Unicode code. - For an n-bytes character, the first
n
bits are all one’s, then + 1
bit is0
, followed byn - 1
bytes with the most significant 2 bits being10
.
This is how the UTF-8 encoding would work:
Number of Bytes | UTF-8 Octet Sequence (binary) |
---|---|
1 | 0xxxxxxx |
2 | 110xxxxx 10xxxxxx |
3 | 1110xxxx 10xxxxxx 10xxxxxx |
4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
x
denotes a bit in the binary form of a byte that may be either 0 or 1.
note
The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
Examples
Input:
data = [197, 130, 1]
Output:
true
Explanation: data represents the octet sequence:
11000101 10000010 00000001
. It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.
Input:
data = [235, 140, 4]
Output:
false
Explanation: data represented the octet sequence:
11101011 10001100 00000100
. The first 3 bits are all one’s and the 4th bit is 0 means it is a 3-bytes character. The next byte is a continuation byte which starts with 10 and that’s correct. But the second continuation byte does not start with 10, so it is invalid.
You can find a brief description about the problem here.
Experience
I took the simplest approach to solve the challenge, but I believe it can be improved more. What I did was straightforward. Here is the algorithm:
Loop over each element of the list
- Initiate an iterator variable from zero
- Get the element on the iteratorth position of the list
- Convert it to a string representation of its binary code
- Count the number of 1s in front
- Check if all the elements represented by the number of 1s in the first element all start by 10
- If yes, advance to the (iterator + number of 1s)th element and repeat the steps 2 - 5
- If not, return False
The results
- Runtime: 248 ms, faster than 47.34% of Python3 online submissions.
- Memory Usage: 14.1 MB, less than 71.16% of Python3 online submissions.
The code
class Solution:
def validUtf8(self, data: List[int]) -> bool:
iterator = 0
while iterator < len(data):
current_binary = f"{data[iterator]:08b}"
zero_position = current_binary.find('0')
if zero_position == 0: iterator += 1
elif 1 < zero_position < 5:
sub_bytes = data[iterator + 1:iterator + zero_position]
if len(sub_bytes) + 1 < zero_position: return False
for next_byte in data[iterator + 1:iterator + zero_position]:
if f"{next_byte:08b}"[:2] != "10": return False
iterator += zero_position
else: return False
return True