Many people know that computers only understand binary (1s and 0s), so when we type characters such as ‘a’ or ‘b’, how does the computer understand and store these characters?
The answer is encoding: characters are mapped to numbers, which the computer can easily translate to binary. There are different standards of encoding. Let’s examine 2 notable ones: ASCII and Unicode (most used today).
Until the 1980s, ASCII was the leading character encoding standard. ASCII maps the most common Western characters (plus a few other things, such as control codes) to the numbers 0 to 127. An example is shown below for a.
a maps to the number 97, which ASCII encodes to binary as 01100001.
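The mapping can be checked in a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of any standard.

```python
# Look up the number that 'a' maps to, then show it as 8 binary digits.
char = "a"
code = ord(char)              # the number the character maps to
bits = format(code, "08b")    # the same number written as 8 bits

# Encoding with ASCII stores that number in a single byte.
encoded = char.encode("ascii")

print(code, bits, list(encoded))
```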
The overall idea is simple, and you can see the full chart here: https://www.ascii-code.com. Most computers back then used 8-bit bytes, which can store numbers from 0 to 255.
Since ASCII only uses the numbers 0 to 127, all the codes from 128 to 255 were left unspecified by ASCII. For any of these higher numbers, countries would assign different characters as they saw fit. One Latin version of ASCII had Ù mapping to 217, but a Japanese version of ASCII had ﾙ mapping to 217. Fragmentation was rampant, and many languages needed more numbers than were left available. A new standard was sorely needed; this came in the form of Unicode.
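To see the fragmentation in practice, the sketch below decodes the same byte under two different 8-bit conventions. It assumes Python's built-in latin-1 and shift_jis codecs are reasonable stand-ins for the "Latin" and "Japanese" variants mentioned above; that choice is an assumption of this example.

```python
# One byte value, two interpretations.
raw = bytes([217])

# ISO 8859-1 (latin-1): a Western European extension of ASCII.
as_latin = raw.decode("latin-1")

# Shift JIS: its single-byte upper range holds half-width katakana.
as_japanese = raw.decode("shift_jis")

print(as_latin, as_japanese)
```

The byte never changes; only the table used to read it does.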
Unicode is a character set in which every character maps to a number called a code point. A code point is still just a number, but the number corresponds to a character. It’s important to emphasize that the Unicode Consortium has been painstakingly mapping every character (yes, every single one) to a code point. Even uncommon characters such as θ map to a code point: θ is the number 952, which in hexadecimal is 03B8. A Unicode code point is written in a particular way: the number is given in hexadecimal and prefixed with U+. So, more precisely, θ maps to U+03B8 in Unicode.
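As a quick check, Python's built-in ord and chr convert between a character and its code point. The sketch below formats the number in the U+ notation; the variable names are illustrative only.

```python
char = "θ"

# ord() returns the code point as a plain integer.
code_point = ord(char)

# Format it the way Unicode charts write it: U+ followed by hex digits.
notation = f"U+{code_point:04X}"

# chr() goes the other way, from code point back to character.
round_trip = chr(0x03B8)

print(code_point, notation, round_trip)
```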
Unfortunately, there’s a missing link, as we can’t say yet what the character maps to in binary. Do we need to allocate 2 or 3 bytes? What should we do with the leading bits of the bytes?
Right now θ just maps to the U+03B8 code point. In fact, all Unicode does is map a character to a code point — it doesn’t say how a computer should understand the code point. This is where encoding comes in. Unicode encoding allows us to turn a code point into bytes, the raw data a computer can understand.
There are many encoding standards. One option is to store every character as 2 bytes (UCS-2). Another is UTF-8, which is the most popular these days. UTF-8 is slightly complex, but its main features are:
- code points 0 to 127 (hex 0000 to 007F) are stored in a single byte (8 bits), the same as ASCII
- code points 128 (hex 0080) and above are stored in 2 to 4 bytes (the original design allowed up to 6, but the current standard caps it at 4). There are some extra rules about how the leading bits of each byte are set (for parsing reasons)
If you’re curious, you can read the full spec here.
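The rules above can be sketched as a hand-written encoder. The function name utf8_encode is hypothetical, and this is a teaching sketch rather than a replacement for the built-in str.encode. It covers the one- to four-byte forms, which is enough for every code point up to U+10FFFF.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the UTF-8 bit layout."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:
        # 0xxxxxxx: one byte, identical to ASCII
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-written version with Python's own encoder.
for ch in ["a", "θ", "€", "😀"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(ch, len(utf8_encode(ord(ch))), "byte(s)")
```

The leading bits of the first byte tell a parser how many bytes follow, and every continuation byte starts with 10, which is what the "extra rules" refer to.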
So, θ in UTF-8 is the two bytes 11001110 10111000 (hex CE B8). For comparison with our previous ASCII example, a in UTF-8 is the single byte 01100001.
The missing link has been found! We can use an encoding standard such as UTF-8 to transform code points to binary. Note: the binary for a is the same as in ASCII because of the first UTF-8 rule listed above.
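The same result can be reproduced with Python's built-in encoder. The helper name utf8_bits is made up for this sketch.

```python
def utf8_bits(char: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated bit strings."""
    return " ".join(format(byte, "08b") for byte in char.encode("utf-8"))


for char in ["a", "θ"]:
    print(f"{char}  U+{ord(char):04X}  {utf8_bits(char)}")
```

a comes out as a single byte and θ as two, matching the rules described earlier.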
Computers cannot understand characters the way humans do; characters must be mapped to code points (numbers), which can be easily encoded to binary. In ASCII, each character is mapped to a code point that is stored in a single 8-bit byte. In Unicode, each character is mapped to a code point, which an encoding standard such as UTF-8 then turns into one or more bytes.
Did you know?
- the garbled text ����� you may have seen in emails and Word documents is the result of characters being decoded with an unintended character encoding. The characters might have been originally encoded with UCS-2, but the recipient might be unintentionally decoding them with UTF-8.
- if you were to open up the header in emails/network requests, you’d find something like below. It’s telling clients what the content’s character encoding standard is (in our example, it’s UTF-8).
Content-Type: text/html; charset=UTF-8
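Both points can be tried out in Python. The sketch below reads the charset from a header like the one above using the standard library's email.message.Message, then shows what happens when bytes are decoded with the wrong encoding. Using utf-16-le as a stand-in for UCS-2 is an assumption of this example (the two agree for characters such as θ), and the sample text is made up.

```python
from email.message import Message

# Parse the header and pull out the declared charset.
header = Message()
header["Content-Type"] = "text/html; charset=UTF-8"
charset = header.get_content_charset()   # returned in lowercase

text = "θ is theta"

# Matching encodings: the receiver decodes with the declared charset.
payload = text.encode(charset)
decoded = payload.decode(charset)

# Mismatch: bytes written in a 2-byte encoding but read as UTF-8.
# errors="replace" substitutes U+FFFD (the � symbol) for invalid bytes.
wrong_payload = text.encode("utf-16-le")
garbled = wrong_payload.decode("utf-8", errors="replace")

print(charset, decoded, repr(garbled))
```

Declaring the charset in the header is what lets the receiver avoid the second outcome.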