![]() |
VOOZH | about |
In the digital world, every letter, number, and symbol you see on your screenโfrom the "A" in "Apple" to the "@" in an email addressโis ultimately represented by a series of bits, the fundamental 0s and 1s that computers understand. Character encoding is the process of converting characters (letters, numbers, symbols) into a format that computers can understand and store.
Every letter, number, or symbol has a unique code number. When you type 'A', the computer looks it up in its codebook, finds the number for 'A', and then converts that number into 0s and 1s. When it needs to show you 'A' on the screen, it does the reverse.
Text encoding is how computers understand our words. It turns letters, numbers, and symbols into a code computers can read, usually as binary (1) and (0). Over time, different encoding systems have been created to handle all the different languages and symbols we use, like:
Also known as American Standard Code for Information Interchange, it is arguably the most fundamental and widely recognized character encoding. Developed in the 1960s for teletypes, it laid the groundwork for how computers worldwide communicate text. The idea is so simple just assign a number to each character, like A is assigned as 65, and so on.
ASCII is a 7-bit encoding, meaning it can represent 27= 128 different characters, like
1. Non-printable, system codes between 0 and 31.
2. Lower ASCII, between 32 and 127.
3. Higher ASCII, between 128 and 255.
See Complete ASCII Table
The 7-bit nature of ASCII limits it to English characters and a basic set of symbols. It cannot represent characters from other languages (like accented letters, Cyrillic, Arabic, Chinese, Japanese, Korean, etc.) or specialized symbols. This limitation led to the development of "extended ASCII" variants (using the 8th bit for an additional 128 characters), but these were inconsistent and caused "mojibake" (garbled text) when files were opened on systems using a different extended ASCII variant.
While Unicode defines the code points, it doesn't dictate how these code points are stored as sequences of bytes in computer memory or files. This is where Unicode Transformation Formats (UTFs) come into play.
It's a standardized method for encoding Unicode characters into a sequence of bytes for storage or transmission. Unicode is a universal character set that aims to represent all written languages, symbols, and emojis, while UTF defines how these characters are stored in binary form.
Example:
Character: โAโ
Code point: U+0041
UTF-8: 41
Example (BUS):
00 42 00 55 00 53