What is the difference between UTF-8 and Unicode?

UTF-8 is one possible encoding scheme for Unicode text.

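Unicode is a broad-scoped standard that defines over 140,000 characters and assigns each one a numeric code (a code point). It also defines rules for how to sort this text, normalize it, change its case, and more. A character in Unicode is represented by a code point from 0 up to and including 0x10FFFF, though some code points are reserved and cannot be used for characters.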

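There is more than one way to encode a string of Unicode code points into a binary stream. These are called "encodings". The most straightforward encoding is UTF-32, which stores each code point as a 32-bit integer, 4 bytes wide. Since code points only go up to 0x10FFFF (which requires 21 bits), this encoding is somewhat wasteful.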
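As a rough illustration of the fixed-width layout described above, the following Python sketch encodes a few characters as big-endian UTF-32 and prints the 4 bytes used for each code point. Python and the sample string are choices made here for illustration only; they are not part of the original answer.

```python
# Sketch: every code point becomes one 32-bit (4-byte) integer in UTF-32.
# The sample characters are arbitrary: U+0041, U+3042, U+1F600.
text = "A\u3042\U0001F600"

code_points = [ord(ch) for ch in text]
encoded = text.encode("utf-32-be")  # big-endian, no byte order mark

for i, cp in enumerate(code_points):
    unit = encoded[4 * i : 4 * i + 4]  # the 4 bytes holding this code point
    print(f"U+{cp:04X} -> {unit.hex()}")
```

Even the ASCII letter takes 4 bytes here, which is the waste the paragraph above refers to.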

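UTF-8 is another encoding, and it is becoming the de facto standard because of a number of advantages over UTF-32 and other encodings. UTF-8 encodes each code point as a sequence of 1, 2, 3 or 4 byte values. Code points in the ASCII range are encoded as a single byte value, for compatibility with ASCII. Code points outside this range use 2, 3 or 4 bytes each, depending on the range they fall in.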
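The range-to-length rule can be sketched in Python as follows. The helper name `utf8_length` is hypothetical and introduced only for this example; the range boundaries are those of the UTF-8 definition, and the sample characters are arbitrary.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a code point (hypothetical helper)."""
    if code_point <= 0x7F:      # ASCII range
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4                    # up to U+10FFFF

# Compare the rule with what Python's built-in encoder produces.
for ch in ["A", "\u00e9", "\u3042", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    assert len(encoded) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), {encoded.hex()}")
```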

UTF-8 was designed with these properties in mind:

ASCII characters are encoded exactly as they are in ASCII, such that an ASCII string is also a valid UTF-8 string representing the same characters.

More efficient: Text strings in UTF-8 almost always occupy less space than the same strings in either UTF-32 or UTF-16, with just a few exceptions.

Binary sorting: Sorting UTF-8 strings using a binary sort will still result in all code points being sorted in numerical order.

When a code point uses multiple bytes, none of those bytes contain values in the ASCII range, ensuring that no part of them could be mistaken for an ASCII character. This is also a security feature.

UTF-8 can be easily validated, and distinguished from other character encodings by a validator. Text in other 8-bit or multi-byte encodings will very rarely also validate as UTF-8 due to the very specific structure of UTF-8.

Random access: At any point in a UTF-8 string it is possible to tell if the byte at that position is the first byte of a character or not, and to find the start of the next or current character, without needing to scan forwards or backwards more than 3 bytes or to know how far into the string we started reading from.
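A few of these properties can be checked with a short Python sketch. The helper names (`is_continuation`, `char_start`, `is_valid_utf8`) are hypothetical and exist only for this example; validation simply relies on Python's built-in decoder rather than a hand-written validator.

```python
def is_continuation(byte: int) -> bool:
    # Continuation bytes always have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80

def char_start(data: bytes, index: int) -> int:
    """Index of the first byte of the character that contains data[index]."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index

def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

data = "a\u3042b".encode("utf-8")      # bytes: 61 e3 81 82 62
print(char_start(data, 3))             # byte 3 is inside U+3042, which starts at 1

# Sorting by raw bytes gives the same order as sorting by code point.
words = ["z", "\u00e9", "\u3042", "\U0001F600", "a"]
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
print(by_bytes == sorted(words))       # Python compares str by code point

# A Latin-1 encoded "e with acute" is a lone lead byte, so it fails validation.
print(is_valid_utf8("\u00e9".encode("latin-1")))
```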

2017-09-26 05:05:13

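Unicode is just a standard that defines a character set (UCS) and encodings (UTF) for encoding that character set. In general, though, "Unicode" is used to refer to the character set rather than the standard.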

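Read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" and "Unicode In 5 Minutes".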

2009-03-13 17:37:07

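Unicode only defines code points, that is, numbers that each represent a character. How these code points are stored in memory depends on the encoding being used. UTF-8 is one way of encoding Unicode characters.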

2009-03-13 17:14:36

This article explains all the details: http://kunststube.net/encoding/

Writing to a buffer

If you write the symbol あ, UTF-8 encoded, into a 4-byte buffer, the binary will look like this:

00000000 11100011 10000001 10000010

If you write the symbol あ, UTF-16 encoded, into a 4-byte buffer, the binary will look like this:

00000000 00000000 00110000 01000010

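As you can see, the language of your content determines how the choice of encoding affects memory use.

For example, for this particular symbol, あ, UTF-16 is more efficient, since 2 spare bytes remain for the next symbol. That does not mean you must use UTF-16 for Japanese text.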
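To make the size trade-off concrete, this Python sketch counts the bytes needed for an English and a Japanese sample under three encodings. The function name `encoded_sizes` and both sample strings are assumptions made for the example.

```python
def encoded_sizes(text: str) -> dict:
    """Byte count of text under each encoding (hypothetical helper)."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")  # BE variants add no BOM
    return {enc: len(text.encode(enc)) for enc in encodings}

english = "hello"
japanese = "\u3053\u3093\u306b\u3061\u306f"  # five hiragana characters

print(encoded_sizes(english))   # {'utf-8': 5, 'utf-16-be': 10, 'utf-32-be': 20}
print(encoded_sizes(japanese))  # {'utf-8': 15, 'utf-16-be': 10, 'utf-32-be': 20}
```

UTF-8 is smallest for the ASCII sample, while UTF-16 is smallest for the hiragana sample, which matches the point made above about あ.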

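Reading from a buffer

Now, if you want to read the bytes above, you have to know which encoding they were written in and decode them back correctly.

Example: if you decode 00000000 11100011 10000001 10000010 as UTF-16 rather than UTF-8, you get the wrong text instead of あ; for instance, the bytes E3 81 read as a single little-endian UTF-16 unit give 臣 (U+81E3).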
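The effect of decoding with the wrong encoding can be reproduced with a small Python sketch. It assumes one specific misreading, namely taking the first two bytes of the UTF-8 sequence as a single little-endian UTF-16 unit; other byte orders or offsets give different wrong characters.

```python
utf8_bytes = "\u3042".encode("utf-8")         # e3 81 82

# Correct: decode with the encoding the bytes were written in.
right = utf8_bytes.decode("utf-8")

# Wrong: treat the first two bytes (e3 81) as one UTF-16LE code unit, 0x81E3.
wrong = utf8_bytes[:2].decode("utf-16-le")

print(f"right: U+{ord(right):04X}")
print(f"wrong: U+{ord(wrong):04X}")
```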

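Note: an encoding and Unicode are two different things. Unicode is a big table in which each symbol is mapped to a unique code point. For example, the symbol あ has the code point U+3042 (30 42 in hex). An encoding, on the other hand, is an algorithm that converts symbols into a form more suitable for storing on hardware.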

30 42 (hex) -> UTF-8 encoding -> E3 81 82 (hex), which is the UTF-8 result shown above in binary.

30 42 (hex) -> UTF-16 encoding -> 30 42 (hex), which is the UTF-16 result shown above in binary.
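The two mappings above can be worked through by hand. The sketch below, in Python, packs the bits of U+3042 into the 3-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx; the function name `encode_3byte` is hypothetical, and the function deliberately covers only the 3-byte range rather than being a full encoder.

```python
def encode_3byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as three UTF-8 bytes (sketch only)."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the 3-byte range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not valid in UTF-8")
    return bytes([
        0xE0 | (code_point >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])

code_point = 0x3042
print(encode_3byte(code_point).hex())        # e38182
print(code_point.to_bytes(2, "big").hex())   # 3042, the big-endian UTF-16 unit
```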

2019-10-12 04:30:59

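You usually start from Google and then want to try different things. But how do you print and convert all these character sets?

Here I list a few useful one-liners.

In PowerShell: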

# Print character with the Unicode point (U+<hexcode>) using this: 
[char]0x2550

# With Python installed, you can print the unicode character from U+xxxx with:
python -c 'print(u"\u2585")'

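If you have more PowerShell tricks or shortcuts, please comment.

In Bash, you will appreciate iconv, hexdump and xxd from the libiconv and util-linux packages (probably named differently on other *nix distributions).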

# To print the 3-byte hex code for a Unicode character:
printf "\\\x%s" $(printf '═'|xxd -p -c1 -u)
#\xE2\x95\x90

# To print the Unicode character represented by hex string:
printf '\xE2\x96\x85'
#▅

# To convert from UTF-16LE to Unicode
echo -en "════"| iconv -f UTF-16LE -t UNICODEFFFE

# To convert a string into hex: 
echo -en '═�'| xxd -g 1
#00000000: e2 95 90 ef bf bd

# To convert a string into binary:
echo -en '═�\n'| xxd -b
#00000000: 11100010 10010101 10010000 11101111 10111111 10111101  ......
#00000006: 00001010

# To convert a binary string into hex:
printf  '%x\n' "$((2#111000111000000110000010))"
#e38182

2022-01-04 14:50:54
