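Older character encodings such as ASCII date from the (pre-)8-bit era and tried to cram the dominant language of computing at the time, English, into the numbers 0 to 127 (7 bits). With 26 letters in the alphabet, in both upper and lower case, plus digits and punctuation, that worked pretty well. For other, non-English languages, ASCII was extended by an 8th bit, but the additional 128 numbers/code points this made available were mapped to different characters depending on the language being displayed. The ISO-8859 standards are the most common forms of this mapping, notably ISO-8859-1 and ISO-8859-15 (also known as ISO-Latin-1, latin1, and yes, there are two different versions of the 8859 ISO standard as well).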
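To make the "same number, different character" problem concrete, here is a minimal Python 3 sketch. The byte value 0xA4 is picked purely for illustration; it is one of the positions where ISO-8859-1 and ISO-8859-15 disagree.

```python
# One byte above 127, decoded under two different ISO-8859 mappings.
raw = bytes([0xA4])

as_latin1 = raw.decode("iso-8859-1")   # U+00A4, the generic currency sign
as_latin9 = raw.decode("iso-8859-15")  # U+20AC, the euro sign

# Same byte on disk, two different characters on screen.
print(repr(as_latin1), repr(as_latin9))
```

Nothing in the byte itself says which mapping was intended, which is why text in these encodings has to be labelled out of band.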
UTF-8 is one of the ways to encode these code points in a form a computer can understand, namely bits. In other words, it is an algorithm for converting each code point to a sequence of bits, or a sequence of bits back to the equivalent code points. Note that Unicode has many alternative encodings, such as UTF-16 and UTF-32.
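As a rough illustration of "one code point, several encodings", the following Python 3 sketch encodes a single character with three of Python's built-in Unicode codecs and decodes it again. The character U+2550 is just a sample; any code point would do.

```python
# One code point, three different byte sequences.
text = "\u2550"  # BOX DRAWINGS DOUBLE HORIZONTAL

utf8 = text.encode("utf-8")       # variable length, 1 to 4 bytes per code point
utf16 = text.encode("utf-16-be")  # 2 or 4 bytes per code point, big-endian
utf32 = text.encode("utf-32-be")  # always 4 bytes per code point, big-endian

print(utf8.hex(), utf16.hex(), utf32.hex())

# Decoding with the matching codec recovers the original code point.
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-be") == text
assert utf32.decode("utf-32-be") == text
```

The explicit `-be` suffix fixes the byte order and suppresses the byte order mark, which keeps the hex output easy to compare.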
Joel gives a really nice explanation and an overview of the history here.
30 42 (hex) -> UTF-8 encoding -> E3 81 82 (hex), which is the binary result above written in hex.
30 42 (hex) -> UTF-16 encoding (big-endian) -> 30 42 (hex), which is the binary result above written in hex.
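The UTF-8 step can be sketched by hand to show where E3 81 82 comes from. This is an illustrative Python 3 implementation, not how any particular library does it; `utf8_encode` is a hypothetical helper name, and the result is checked against Python's own codec.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


encoded = utf8_encode(0x3042)
print(encoded.hex(" "))  # the three bytes for U+3042

# Cross-check against the built-in codec.
assert encoded == "\u3042".encode("utf-8")
```

For U+3042 the 16 payload bits `0011 000001 000010` are split 4/6/6 and placed behind the `1110`, `10`, `10` prefixes, giving `11100011 10000001 10000010`.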
# In PowerShell, print the character for a Unicode code point (U+<hexcode>) with:
[char]0x2550
# With Python installed, you can print the Unicode character for U+xxxx with:
python -c 'print(u"\u2585")'
# To print the UTF-8 bytes of a Unicode character as hex escapes (3 bytes for this one):
printf "\\\x%s" $(printf '═'|xxd -p -c1 -u)
#\xE2\x95\x90
# To print the Unicode character represented by hex string:
printf '\xE2\x96\x85'
#▅
# To convert from UTF-16LE to big-endian UTF-16 (named UNICODEFFFE here; that name may not be available in every iconv build):
echo -en "════"| iconv -f UTF-16LE -t UNICODEFFFE
# To convert a string into hex:
echo -en '═�'| xxd -g 1
#00000000: e2 95 90 ef bf bd
# To convert a string into binary:
echo -en '═�\n'| xxd -b
#00000000: 11100010 10010101 10010000 11101111 10111111 10111101 ......
#00000006: 00001010
# To convert a binary string into hex:
printf '%x\n' "$((2#111000111000000110000010))"
#e38182
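For comparison, the last few shell conversions can be sketched in Python 3 as well. This is only an illustrative equivalent using built-ins; the variable names are made up for the example.

```python
# Binary string -> integer -> hex string, like printf '%x' "$((2#...))".
bits = "111000111000000110000010"
value = int(bits, 2)
hex_str = format(value, "x")

# The same bits as raw bytes, decoded as UTF-8 to get the character back.
raw = value.to_bytes(len(bits) // 8, "big")
char = raw.decode("utf-8")

# Character -> UTF-8 bytes -> space-separated binary, like xxd -b.
back = " ".join(format(b, "08b") for b in char.encode("utf-8"))

print(hex_str, char, back)
```

Grouping the output in 8-bit chunks makes the `1110`/`10`/`10` UTF-8 prefixes easy to spot.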