This article is the second in the series about data. If you missed the first part, which talks about data, how we measure it, and how we represent numbers in computer memory, you can quickly reference it here.

As discussed in the previous article, computers cannot process data directly in human language. Any type of data, be it numbers, letters, special symbols, sounds, or pictures, must first be converted into machine-readable form, i.e., binary format. Why? Because, as we discussed in this article, everything comes down to turning switches on and off. I believe this is the perfect time to talk about the mechanisms used to represent text values in binary format.

Text is made up of individual characters, each represented by a string of bits on computers. These strings are assembled to form digital words, sentences, paragraphs, romance novels, etc. Sure, this idea might seem simple and easy for any human being. But I dare you to think about it from the perspective of having hundreds of thousands of different characters used in various languages worldwide, plus the fact that new emojis 😑 (don't forget that 😑 is text as well) are being created to capture every human emotion. Still, you have a single system that understands only binaries. Is that a simple task or idea? I doubt it.

For numbers, it might seem easy since you only need to convert them to binary and store their value in memory with a specified size, as we have already seen in this article. But texts are not numerical values; they are just characters to which we human beings attached sounds as their values rather than numbers. If you go to someone and ask what the value of 'A' is, I am sure you will confuse them, or they might think you are some sort of psycho. On the other hand, if you ask a computer what the value of 'A' is, you will get an answer of 65. Wait, what 🤔? Yeah, that is right; without a doubt, you will get 65. How? The answer is character encoding.

What is character encoding?

We already know how to convert any positional number system to binary (if not, read this article). Still, as mentioned above, alphabetic characters do not have any numeric representation on a computer. Suppose we could define an encoding system that assigns numeric values to alphabetic characters. In that case, we could address the problem efficiently, because converting numeric values back and forth is not a problem for any computer. In general, character encoding is a method of assigning a numeric value to each character.
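To make this concrete, here is a minimal sketch in Python (used here only for illustration); its built-in ord() and chr() functions expose exactly this character-to-number mapping:

```python
# ord() gives the numeric value a computer associates with a character,
# and chr() turns a numeric value back into its character.
print(ord('A'))  # 65  -- the "value" of 'A' mentioned earlier
print(ord('a'))  # 97
print(chr(65))   # 'A' -- converting the number back into a character
```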

There are over 60 character encodings at the moment, but ASCII and Unicode are the two mentioned most frequently, because most of the other encodings are extended versions of ASCII. So, let's discuss ASCII and Unicode one by one.

ASCII

The term stands for American Standard Code for Information Interchange, an early character encoding scheme. It came into existence in the days when memory usage was a big deal since, you know, computers had so little memory space. Each character fits in 8 bits (1 B); strictly speaking, standard ASCII defines only 7-bit codes, but each character is stored in a full byte. It only covers Latin letters (i.e., A-Z and a-z), digits (i.e., 0-9), and some special characters like #, @, $, etc.

The table below shows examples of ASCII characters with their associated codes and bytes.

| CHARACTER | ASCII CODE | BYTE |
| --- | --- | --- |
| A | 065 | 01000001 |
| a | 097 | 01100001 |
| B | 066 | 01000010 |
| b | 098 | 01100010 |
| Z | 090 | 01011010 |
| z | 122 | 01111010 |
| 0 | 048 | 00110000 |
| 9 | 057 | 00111001 |
| ! | 033 | 00100001 |

Just as characters come together to form words and sentences in a language, binary codes do so in text files. So, the sentence “One byte is equal to eight bits.” represented in ASCII binary would be:

01001111 01101110 01100101 00100000 01100010 01111001 01110100 01100101 
00100000 01101001 01110011 00100000 01100101 01110001 01110101 01100001 
01101100 00100000 01110100 01101111 00100000 01100101 01101001 01100111 
01101000 01110100 00100000 01100010 01101001 01110100 01110011 00101110

The above example doesn't make any sense to us, but that's how computers process text. They first convert each character to its corresponding integer code and then to binary.
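You can reproduce that binary string yourself. Here is a small Python sketch (again, Python is only used as an illustration) that looks up each character's code and formats it as an 8-bit binary string:

```python
sentence = "One byte is equal to eight bits."

# Look up each character's numeric code with ord() and format it as 8 binary digits.
binary = ' '.join(format(ord(ch), '08b') for ch in sentence)
print(binary)
# 01001111 01101110 01100101 00100000 01100010 ... 00101110
```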

The downside of ASCII encoding is its limit on how many character sets it can address. With 8 bits, we can only represent 256 unique characters. Curious how? It's simply 2⁸ = 256. That might be enough if we all used the same character set, i.e., the Latin alphabet. To address this issue, people started creating their own encoding systems as extended versions of ASCII, keeping backward compatibility with the original ASCII codes.

Although having a different character encoding system for each language might seem like a good solution to the problem of managing every character in every language, it is very confusing for computers to understand what is what. It is just like having two computers that speak two different languages. That's where Unicode came into play.

Unicode

This encoding addresses the significant problems in ASCII. The main difference between Unicode and ASCII is that Unicode allows characters to be up to 32 bits wide. That's over 4 billion possible values, although for various reasons not all of them are ever used (Unicode itself only defines code points up to U+10FFFF, roughly 1.1 million).

But if each character can now take up to 32 bits instead of ASCII's 8, won't our documents grow up to four times bigger? Well, luckily, no. Unicode comes with several mechanisms to represent, or encode, its characters. These are primarily the UTF-8 and UTF-16 encoding schemes, which take an intelligent approach to the size problem.

Unicode encoding schemes like UTF-8 are more efficient in how they use their bits. With UTF-8, if a single character can be represented with 8 bits, that's all it will use. If a character needs 32 bits, it'll get all 32 bits. This is called variable-length encoding, and it's more efficient memory-wise. Thus, Unicode became the universal standard for encoding all human languages. And yes, it even includes emojis.
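You can see this variable-length behaviour directly in Python: str.encode('utf-8') returns the raw bytes, and how many there are depends on the character.

```python
# UTF-8 spends only as many bytes as each character actually needs.
for ch in ['A', 'Ø', 'ಚ', '😁']:
    print(ch, len(ch.encode('utf-8')), 'byte(s)')
# A 1 byte(s)
# Ø 2 byte(s)
# ಚ 3 byte(s)
# 😁 4 byte(s)
```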

The first 128 Unicode characters are precisely the same as the ASCII codes, so it also preserves backward compatibility: plain ASCII text is already valid UTF-8.
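Here is a quick Python check of that backward compatibility: encoding ASCII-only text as ASCII or as UTF-8 produces exactly the same bytes.

```python
text = "One byte is equal to eight bits."

# For characters in the ASCII range, the UTF-8 bytes are exactly the ASCII bytes.
assert text.encode('ascii') == text.encode('utf-8')
print(text.encode('utf-8'))  # b'One byte is equal to eight bits.'
```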

UTF-8 can translate any Unicode character to a matching unique binary string and translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”

The table below shows examples of Unicode characters with their associated code points and bytes. Each code point is written as “U+” for “Unicode,” followed by a unique hexadecimal number identifying the character.

| CHARACTER | UNICODE VALUE | EQUIVALENT UTF-8 BINARY ENCODING |
| --- | --- | --- |
| A | U+0041 | 01000001 |
| a | U+0061 | 01100001 |
| 0 | U+0030 | 00110000 |
| 9 | U+0039 | 00111001 |
| ! | U+0021 | 00100001 |
| Ø | U+00D8 | 11000011 10011000 |
| ڃ | U+0683 | 11011010 10000011 |
| ಚ | U+0C9A | 11100000 10110010 10011010 |
| 𠜎 | U+2070E | 11110000 10100000 10011100 10001110 |
| 😁 | U+1F601 | 11110000 10011111 10011000 10000001 |
| መ | U+1218 | 11100001 10001000 10011000 |

As you can see from the table above, when 8 bits are enough to represent a character, UTF-8 uses just 8 bits; when a character needs more than that, it uses more bytes. If you are curious to find out how it really works internally, you can refer to this in-depth explanation.
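If you'd like to verify the table yourself, here is a short Python sketch that prints a character's code point and its UTF-8 bytes in binary, matching the rows above:

```python
def utf8_bits(ch):
    """Return a character's code point and its UTF-8 bytes as binary strings."""
    bits = ' '.join(format(byte, '08b') for byte in ch.encode('utf-8'))
    return f"U+{ord(ch):04X} -> {bits}"

for ch in ['A', 'Ø', '😁']:
    print(ch, utf8_bits(ch))
# A U+0041 -> 01000001
# Ø U+00D8 -> 11000011 10011000
# 😁 U+1F601 -> 11110000 10011111 10011000 10000001
```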

This is how we represent text characters, including emojis, on computers. In the following article, we will see how colors, images, sounds, and videos are represented in computer memory.