This article is the second in a series on Data Representation in Computer Memory. If you missed the first part, which talks about data, how we measure it, and how we represent numbers in computer memory, you can quickly reference it here.
As discussed in the previous article, data cannot be processed by computers directly in human language. Any type of data, be it numbers, letters, special symbols, sounds, or pictures, must first be converted into machine-readable form, i.e., binary format. Why? Because, as discussed in the What is Programming article, everything comes down to turning a switch on and off, in other words, 0s and 1s. I believe this is the perfect time to talk about the mechanisms used to represent text values in binary format.
Text is made up of individual characters, each of which a computer stores as a string of bits. These strings are assembled to form digital words, sentences, paragraphs, romance novels, etc. Sure, this idea might seem simple and easy for any human being. But I dare you to think about it from the perspective of the hundreds of thousands of different characters used in various languages worldwide, plus the fact that new emojis 😑 (don’t forget that 😑 are text as well) are being created to capture every human emotion. Still, you have a single computer system that understands only binary. Is that a simple task or idea? I doubt it.
For numbers, it might seem easy since you only need to convert them to binary and store the value in memory with a specified size, as we have already seen in the What is Data article. Texts are not numerical values; they are just characters to which we human beings attach sounds rather than numbers. If you go to someone and ask the value of “A”, I am sure you will confuse them, or they might think you are some sort of psycho. On the other hand, if you ask a computer the value of “A”, you will get an answer of 65. Wait, what 🤔? Yeah, that is right; without a doubt, you will get 65. How? The answer is character encoding.
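You can try this yourself. Here is a minimal sketch in Python (my choice of language for illustration; the idea is the same in any language), asking the interpreter for the number behind “A”:

```python
# Ask Python for the numeric code a computer associates with a character,
# and go the other way, from a number back to its character.
print(ord("A"))  # 65  -> the code assigned to "A"
print(chr(65))   # A   -> the character assigned to code 65
```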
What is character encoding?
We already know how to convert any positional number system to binary (if not, read this article). Still, as mentioned above, alphabetic characters do not have any inherent numeric representation on a computer. Suppose we could define an encoding system that assigns a numeric value to each character. In that case, we could address the problem efficiently, because converting numeric values to binary and back is not a problem for any computer. In general, character encoding is a method of assigning numeric values to characters.
There are dozens of character encodings in use, but ASCII and Unicode are the ones mentioned most frequently, mainly because most other character encodings are extended versions of ASCII. So, let’s discuss ASCII and Unicode one by one.
ASCII Encoding
The term stands for American Standard Code for Information Interchange, an early character encoding mechanism. It came into existence in the days when memory usage was a big deal since, you know, computers had very little memory space. Standard ASCII defines its characters with 7 bits, though each one is typically stored in a full byte (8 bits), and it only covers Latin letters (i.e., A-Z and a-z), digits (i.e., 0-9), and some special characters like #, @, $, etc.
The table below shows examples of ASCII characters with their associated codes and bytes.
Character | ASCII Code | Byte |
---|---|---|
A | 65 | 01000001 |
a | 97 | 01100001 |
B | 66 | 01000010 |
b | 98 | 01100010 |
Z | 90 | 01011010 |
z | 122 | 01111010 |
0 | 48 | 00110000 |
9 | 57 | 00111001 |
! | 33 | 00100001 |
You can get the whole ASCII code table here.
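If you would rather compute these values than look them up, any row of the table can be reproduced with a short Python sketch (again, Python is just my pick for illustration):

```python
# Print the ASCII code and the 8-bit binary pattern for a few characters.
for ch in ["A", "a", "Z", "0", "9", "!"]:
    code = ord(ch)              # the numeric ASCII code, e.g. 65 for "A"
    byte = format(code, "08b")  # the same value written as an 8-bit binary string
    print(f"{ch} -> {code} -> {byte}")
```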
Just as characters come together to form words and sentences in a language, their binary codes come together to form text files on a computer. So, the sentence “One byte is equal to eight bits.” represented in ASCII binary would be:
01001111 01101110 01100101 00100000 01100010 01111001 01110100 01100101
00100000 01101001 01110011 00100000 01100101 01110001 01110101 01100001
01101100 00100000 01110100 01101111 00100000 01100101 01101001 01100111
01101000 01110100 00100000 01100010 01101001 01110100 01110011 00101110
The above collection of 0s and 1s doesn’t make any sense to us, but that’s how computers process text: they first convert each character to its corresponding integer value in the encoding and then to binary.
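That whole conversion fits in a couple of lines of Python, shown here only as a sketch of the idea:

```python
sentence = "One byte is equal to eight bits."

# Convert each character to its ASCII code, then to an 8-bit binary string.
binary = " ".join(format(ord(ch), "08b") for ch in sentence)
print(binary)  # 01001111 01101110 01100101 00100000 ... 00101110
```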
The downside of ASCII encoding is the limited set of characters it can address. With one byte (8 bits) per character, we can represent only 256 unique characters. Curious how? It’s simply 2^8 = 256, and standard ASCII itself uses just 128 of those (2^7 = 128). 256 might be enough if we all used the same character set, i.e., the Latin alphabet, but in reality, we don’t. To address this issue, people started creating their own encoding systems, extended versions of ASCII that keep backward compatibility with the original ASCII codes.
Although having a different character encoding system for each language might seem like a good way to manage every character in every language, it is very confusing for a computer to figure out what is what. It is just like two people talking to each other in their native languages while neither understands what the other is saying. That’s where Unicode came into play.
Unicode
This encoding addresses the significant problems of ASCII. The main difference between Unicode and ASCII is that Unicode allows characters to be up to 32 bits wide. That’s over 4 billion possible values, although for various reasons not all of them are ever used. Unicode itself only assigns a number to each character; how that character is drawn is up to each platform, which is why the same emoji can look different on different systems.
But won’t the size of a document increase dramatically when the size of the character encoding increases? We were talking about just 8 bits per character in ASCII, and 32 bits per character is 4x bigger than that. Does that mean a text document encoded with Unicode will be 4x bigger than its ASCII representation? Well, luckily, no. Unicode comes with several mechanisms to represent, or encode, its characters. These are primarily the UTF-8 and UTF-16 encoding schemes, which take an intelligent approach to the size problem.
Unicode encoding schemes like UTF-8 are more efficient in how they use their bits. With UTF-8, if a single character can be represented with 8 bits, that’s all it will use. If a character needs 32 bits, it will get all 32 bits. This is called variable-length encoding, and it is more efficient memory-wise. Thus, Unicode became the universal standard for encoding all human languages. And yes, it even includes emojis.
The first 128 characters are precisely the same as the ASCII codes, i.e., UTF-8 is backward compatible with ASCII.
UTF-8 can translate any Unicode character to a matching unique binary string and translate that binary string back to the Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
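Here is a minimal round-trip sketch in Python; `encode` and `decode` are the standard-library way of moving between characters and UTF-8 bytes:

```python
text = "Aመ😁"  # a 1-byte, a 3-byte, and a 4-byte character in UTF-8

encoded = text.encode("utf-8")     # Unicode characters -> a sequence of bytes
decoded = encoded.decode("utf-8")  # the same bytes -> back to the characters

print(len(text), len(encoded))  # 3 characters, 8 bytes (1 + 3 + 4)
print(decoded == text)          # True: the transformation loses nothing
```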
The table below shows examples of Unicode characters with their associated code points and UTF-8 bytes. Each code point begins with “U+” for “Unicode,” followed by a unique hexadecimal number that identifies the character.
Character | Unicode Value | Equivalent UTF-8 Binary Encoding |
---|---|---|
A | U+0041 | 01000001 |
a | U+0061 | 01100001 |
0 | U+0030 | 00110000 |
9 | U+0039 | 00111001 |
! | U+0021 | 00100001 |
Ø | U+00D8 | 11000011 10011000 |
ڃ | U+0683 | 11011010 10000011 |
ಚ | U+0C9A | 11100000 10110010 10011010 |
𠜎 | U+2070E | 11110000 10100000 10011100 10001110 |
😁 | U+1F601 | 11110000 10011111 10011000 10000001 |
መ | U+1218 | 11100001 10001000 10011000 |
As you can see from the table above, when 8 bits are enough to represent a character, UTF-8 uses just 8 bits; when a character needs more than that, it uses more bytes. If you are curious to find out how it really works internally, you can refer to this in-depth explanation.
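The binary column of that table can also be reproduced with a few lines of Python, included here purely as an illustrative sketch:

```python
# Show the UTF-8 byte sequence of each character, one 8-bit group per byte.
for ch in ["A", "Ø", "ಚ", "😁", "መ"]:
    utf8_bytes = ch.encode("utf-8")                        # 1 to 4 bytes per character
    bits = " ".join(format(b, "08b") for b in utf8_bytes)  # each byte as 8 binary digits
    print(f"{ch}  U+{ord(ch):04X}  {bits}")
```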
This is how we represent text, including emojis, on computers. In the following article, we will see how we represent sounds, colors, images, and videos in computer memory.