Day Seven

Today’s challenge was about checking whether the collection of integers forms a valid UTF8 string or not.

Problem description

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

For a 1-byte character, the first bit is a 0, followed by its Unicode code.
For an n-bytes character, the first n bits are all one’s, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

Number of Bytes	UTF-8 Octet Sequence (binary)
1	0xxxxxxx
2	110xxxxx 10xxxxxx
3	1110xxxx 10xxxxxx 10xxxxxx
4	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

x denotes a bit in the binary form of a byte that may be either 0 or 1.

note

The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

Examples

Input: data = [197, 130, 1]
Output: true
Explanation: data represents the octet sequence: 11000101 10000010 00000001. It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.

Input: data = [235, 140, 4]
Output: false
Explanation: data represented the octet sequence: 11101011 10001100 00000100. The first 3 bits are all one’s and the 4th bit is 0 means it is a 3-bytes character. The next byte is a continuation byte which starts with 10 and that’s correct. But the second continuation byte does not start with 10, so it is invalid.

You can find a brief description about the problem here.

Experience

I took the simplest approach to solve the challenge, but I believe it can be improved more. What I did was straightforward. Here is the algorithm:

Loop over each element of the list

Initiate an iterator variable from zero
Get the element on the iterator^th position of the list
Convert it to a string representation of its binary code
Count the number of 1s in front
Check if all the elements represented by the number of 1s in the first element all start by 10
1. If yes, advance to the (iterator + number of 1s)th element and repeat the steps 2 - 5
2. If not, return False

The results

Runtime: 248 ms, faster than 47.34% of Python3 online submissions.
Memory Usage: 14.1 MB, less than 71.16% of Python3 online submissions.

The code

python

class Solution:
    def validUtf8(self, data: List[int]) -> bool:
        iterator = 0
    
        while iterator < len(data):
            current_binary = f"{data[iterator]:08b}"
            zero_position = current_binary.find('0')
    
            if zero_position == 0: iterator += 1
            elif 1 < zero_position < 5:
                sub_bytes = data[iterator + 1:iterator + zero_position]
    
                if len(sub_bytes) + 1 < zero_position: return False
    
                for next_byte in data[iterator + 1:iterator + zero_position]:
                    if f"{next_byte:08b}"[:2] != "10": return False
    
                iterator += zero_position
            else: return False
    
        return True

previous Day Six next Day Eight

Day Seven

Problem description

Examples

Experience

The results

The code

Adding SSH Key to GitHub

Basic terminologies continued

Basic terminologies

Data Storages

Subscribe to Newsletter

Day Seven

Problem description

Examples

Experience

The results

The code

Related Articles

Adding SSH Key to GitHub

Basic terminologies continued

Basic terminologies

Data Storages