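I can read the MySQL documentation, and it is pretty clear. But how does one decide which character set to use? On what data does collation have an effect?

I'm asking for an explanation of the two and how to choose them.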


From the MySQL docs:

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set. Suppose that we have an alphabet with four letters: 'A', 'B', 'a', 'b'. We give each letter a number: 'A' = 0, 'B' = 1, 'a' = 2, 'b' = 3. The letter 'A' is a symbol, the number 0 is the encoding for 'A', and the combination of all four letters and their encodings is a character set. Now, suppose that we want to compare two string values, 'A' and 'B'. The simplest way to do this is to look at the encodings: 0 for 'A' and 1 for 'B'. Because 0 is less than 1, we say 'A' is less than 'B'. Now, what we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): "compare the encodings." We call this simplest of all possible collations a binary collation. But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters 'a' and 'b' as equivalent to 'A' and 'B'; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation. In real life, most character sets have many characters: not just 'A' and 'B' but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an "accent" is a mark attached to a character as in German 'ö') and multiple-character mappings (such as the rule that 'ö' = 'OE' in one of the two German collations).
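The imaginary four-letter character set from the quoted passage can be sketched in a few lines of Python. This is only an illustration of the idea, not how MySQL implements collations; the names `ENCODING`, `binary_key`, and `case_insensitive_key` are made up for this example.

```python
# Toy character set from the quoted example: four symbols and their encodings.
ENCODING = {"A": 0, "B": 1, "a": 2, "b": 3}

def binary_key(s):
    """Binary collation: compare the encodings directly."""
    return [ENCODING[ch] for ch in s]

def case_insensitive_key(s):
    """Case-insensitive collation: fold lowercase onto uppercase,
    then compare the encodings."""
    return [ENCODING[ch.upper()] for ch in s]

words = ["b", "A", "a", "B"]

binary_sorted = sorted(words, key=binary_key)
ci_sorted = sorted(words, key=case_insensitive_key)

print(binary_sorted)  # ['A', 'B', 'a', 'b']
print(ci_sorted)      # ['A', 'a', 'b', 'B'] ('A'/'a' and 'b'/'B' compare equal)
```

Under the binary collation 'a' and 'A' are different values; under the case-insensitive one they compare equal, so their relative order is simply whatever the input order was.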

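A character encoding is a way to encode characters so that they fit in memory. That is, if the charset is ISO-8859-15, the euro symbol € will be encoded as 0xa4, whereas in UTF-8 it will be encoded as 0xe282ac.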
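You can check the two encodings of the euro sign yourself. A small sketch using Python's built-in codecs (nothing MySQL-specific here):

```python
euro = "\u20ac"  # the euro sign

latin9_bytes = euro.encode("iso-8859-15")
utf8_bytes = euro.encode("utf-8")

print(latin9_bytes.hex())  # a4
print(utf8_bytes.hex())    # e282ac
```

Same character, one byte in one encoding and three bytes in the other.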

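The collation is how to compare characters. In latin9 there are the letters e é è ê f. Sorted by their binary representation they come out as e f è é ê (in ISO-8859-15, è is 0xe8, é is 0xe9 and ê is 0xea), but if the collation is set to, for example, French, you get them in the order you would expect: e é è ê are all treated as equal, and then comes f.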
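Here is a rough way to see the difference between the two orderings. The accent-insensitive key below just strips combining marks via Unicode decomposition; it is a stand-in for illustration, not the actual French collation and not MySQL's algorithm. The function names are made up.

```python
import unicodedata

# Normalize to precomposed form so each letter is a single code point.
letters = [unicodedata.normalize("NFC", ch) for ch in ["e", "é", "è", "ê", "f"]]

def binary_key(ch):
    """Binary collation: order by the raw ISO-8859-15 byte value."""
    return ch.encode("iso-8859-15")

def accent_insensitive_key(ch):
    """Simplified accent-insensitive collation: drop the accents,
    then compare the base letters."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

binary_order = sorted(letters, key=binary_key)
accent_insensitive_order = sorted(letters, key=accent_insensitive_key)

print(binary_order)              # ['e', 'f', 'è', 'é', 'ê']
print(accent_insensitive_order)  # ['e', 'é', 'è', 'ê', 'f']
```

In the second result all the e variants compare equal, so they keep their input order and f comes after them.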

A character set is a subset of all written glyphs. A character encoding specifies how those characters are mapped to numeric values. Some character encodings, like UTF-8 and UTF-16, can encode any character in the Universal Character Set. Others, like US-ASCII or ISO-8859-1 can only encode a small subset, since they use 7 and 8 bits per character, respectively. Because many standards specify both a character set and a character encoding, the term "character set" is often substituted freely for "character encoding".

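A collation comprises rules that specify how characters can be compared for sorting. Collation rules can be locale-specific: the proper order of two characters varies from language to language.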
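To make the locale point concrete: German dictionary ordering sorts 'ö' together with 'o', while in Swedish 'ö' is a separate letter that comes after 'z'. The sketch below fakes both rules with hand-written sort keys; `german_key` and `swedish_key` are hypothetical helpers that only handle 'ö', and real collations cover far more cases.

```python
words = ["zebra", "\u00f6l", "ost"]  # "\u00f6" is 'ö'

def german_key(word):
    """Simplified German dictionary rule: treat 'ö' as 'o'."""
    return word.replace("\u00f6", "o")

def swedish_key(word):
    """Simplified Swedish rule: 'ö' is its own letter after 'z'.
    '{' is the ASCII character right after 'z', so it sorts last."""
    return word.replace("\u00f6", "{")

german_order = sorted(words, key=german_key)
swedish_order = sorted(words, key=swedish_key)

print(german_order)   # ['öl', 'ost', 'zebra']
print(swedish_order)  # ['ost', 'zebra', 'öl']
```

Same data, same character set, two different "correct" orders depending on who is reading the list.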

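Choosing a character set and collation comes down to whether your application is internationalized or not. If not, what locale are you targeting?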

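In order to choose which character set to support, you have to consider your application. If you are storing user-supplied input, it might be hard to foresee all the locales in which your software will eventually be used. To support them all, it might be best to support the UCS (Unicode) from the start. However, there is a cost to this: many Western European characters (accented letters in UTF-8, for instance) will now require two bytes of storage per character instead of one.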
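The storage cost is easy to measure. A quick sketch comparing the same short string in a single-byte encoding and in two Unicode encodings (the sample word is arbitrary):

```python
text = "caf\u00e9"  # "café", with a precomposed 'é'

latin1_size = len(text.encode("iso-8859-1"))  # one byte per character
utf8_size = len(text.encode("utf-8"))         # ASCII stays 1 byte, 'é' takes 2
utf16_size = len(text.encode("utf-16-le"))    # 2 bytes per character here

print(latin1_size, utf8_size, utf16_size)  # 4 5 8
```

With UTF-8 only the non-ASCII characters grow, so for mostly-ASCII text the overhead tends to be small; with UTF-16 every character in this string takes two bytes.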

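Choosing the correct collation can help performance if your database uses the collation to create an index, and later uses that index to provide sorted results. However, since collation rules are often locale-specific, that index will be worthless if you need to sort results according to the rules of another locale.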

I suggest using utf8mb4_unicode_ci, which is based on the Unicode standard for sorting and comparison and sorts accurately in a very wide range of languages.