My experience of solving algorithmic problem of UTF-8 Validation.

A photo of Betizazu Alemu looking professional.
Betizazu Alemu
  • 2 min read

Today’s challenge was about checking whether the collection of integers forms a valid UTF8 string or not.

Problem description

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

  1. For a 1-byte character, the first bit is a 0, followed by its Unicode code.
  2. For an n-bytes character, the first n bits are all one’s, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

Number of BytesUTF-8 Octet Sequence (binary)
1 0xxxxxxx
2 110xxxxx 10xxxxxx
3 1110xxxx 10xxxxxx 10xxxxxx
4 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

x denotes a bit in the binary form of a byte that may be either 0 or 1.

note

The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

Examples

Input: data = [197, 130, 1]

Output: true

Explanation: data represents the octet sequence: 11000101 10000010 00000001. It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.

Input: data = [235, 140, 4]

Output: false

Explanation: data represented the octet sequence: 11101011 10001100 00000100. The first 3 bits are all one’s and the 4th bit is 0 means it is a 3-bytes character. The next byte is a continuation byte which starts with 10 and that’s correct. But the second continuation byte does not start with 10, so it is invalid.

You can find a brief description about the problem here.

Experience

I took the simplest approach to solve the challenge, but I believe it can be improved more. What I did was straightforward. Here is the algorithm:

Loop over each element of the list

  1. Initiate an iterator variable from zero
  2. Get the element on the iteratorth position of the list
  3. Convert it to a string representation of its binary code
  4. Count the number of 1s in front
  5. Check if all the elements represented by the number of 1s in the first element all start by 10
    1. If yes, advance to the (iterator + number of 1s)th element and repeat the steps 2 - 5
    2. If not, return False

The results

  • Runtime: 248 ms, faster than 47.34% of Python3 online submissions.
  • Memory Usage: 14.1 MB, less than 71.16% of Python3 online submissions.

The code

python
class Solution:
    def validUtf8(self, data: List[int]) -> bool:
        iterator = 0
    
        while iterator < len(data):
            current_binary = f"{data[iterator]:08b}"
            zero_position = current_binary.find('0')
    
            if zero_position == 0: iterator += 1
            elif 1 < zero_position < 5:
                sub_bytes = data[iterator + 1:iterator + zero_position]
    
                if len(sub_bytes) + 1 < zero_position: return False
    
                for next_byte in data[iterator + 1:iterator + zero_position]:
                    if f"{next_byte:08b}"[:2] != "10": return False
    
                iterator += zero_position
            else: return False
    
        return True

Discover related blog posts that delve into similar themes, share valuable insights, and offer fresh perspectives.