What's the difference between UTF-8 and UTF-8 with BOM? Which is better?


Current answer

Note that some files must not have a BOM, even on Windows. Examples are SQL*plus and VBScript files. If such a file contains a BOM, you get an error when you try to execute it.

Other answers

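The other excellent answers have already covered this:

There is no official difference between UTF-8 and BOM-ed UTF-8. A BOM-ed UTF-8 string starts with the following three bytes: EF BB BF. If those bytes are present, they must be ignored when extracting the string from the file/stream.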
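As a minimal sketch of "ignoring" those three bytes, the following Python snippet strips a leading UTF-8 BOM from raw bytes. The helper name `strip_utf8_bom` is made up for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = b"\xef\xbb\xbfABC"
without_bom = b"ABC"

print(strip_utf8_bom(with_bom))      # b'ABC'
print(strip_utf8_bom(without_bom))   # b'ABC'

# The "utf-8-sig" codec skips an optional BOM while decoding.
print(with_bom.decode("utf-8-sig"))    # ABC
# The plain "utf-8" codec keeps it as the character U+FEFF.
print(repr(with_bom.decode("utf-8")))  # '\ufeffABC'
```

Decoding with `utf-8-sig` is usually the simpler choice when the input may or may not carry a BOM, since it accepts both forms.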

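But, as additional information, the UTF-8 BOM can be a good way to "sniff out" whether a string is encoded in UTF-8... or the same bytes can be a legitimate string in some other encoding...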

For example, the data [EF BB BF 41 42 43] could be either:

The legitimate ISO-8859-1 string "ï»¿ABC"
The legitimate UTF-8 string "ABC"

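So while it can be cool to recognize the encoding of a file's content by looking at its first bytes, you should not rely on it, as the example above shows.

Encodings should be known, not guessed.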
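The ambiguity can be reproduced directly. This short Python sketch decodes the same six bytes under both assumptions; the variable names are only illustrative.

```python
data = bytes([0xEF, 0xBB, 0xBF, 0x41, 0x42, 0x43])

# Interpreted as ISO-8859-1, every byte is one character.
as_latin1 = data.decode("iso-8859-1")
# Interpreted as UTF-8 with a signature, the first three bytes are the BOM.
as_utf8 = data.decode("utf-8-sig")

print(as_latin1)  # ï»¿ABC
print(as_utf8)    # ABC
```

Both decodes succeed without error, which is why the bytes alone cannot settle which encoding was intended.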

What's different between UTF-8 and UTF-8 without a BOM?

Short answer: in UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.

Long answer:

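Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it is necessary to indicate which order those two bytes are in, and a common convention for doing so is to include the character U+FEFF as a "byte order mark" at the beginning of the data. The character U+FFFE is permanently unassigned, so its presence can be used to detect the wrong byte order.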
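To make the byte-order convention concrete, here is a small Python sketch. The function `detect_utf16_order` is a hypothetical helper, and it deliberately ignores UTF-32 (whose little-endian BOM also begins with FF FE), so treat it as an illustration rather than a complete detector.

```python
import codecs

text = "A"
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

print(le.hex(" "))  # ff fe 41 00
print(be.hex(" "))  # fe ff 00 41


def detect_utf16_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading BOM (UTF-32 not handled)."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return "unknown"


print(detect_utf16_order(le))  # little-endian
print(detect_utf16_order(be))  # big-endian

# Reading a big-endian BOM with the wrong byte order yields U+FFFE,
# the permanently unassigned code point mentioned above.
wrong = be[:2].decode("utf-16-le")
print(hex(ord(wrong)))  # 0xfffe
```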

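UTF-8 has the same byte order regardless of platform endianness, so a byte order mark is not needed. However, one may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.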
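The following Python sketch shows one way a BOM can survive a UTF-16 to UTF-8 conversion. It assumes the input is little-endian UTF-16 and that the converter decodes with an explicit byte order, which leaves U+FEFF in the text as an ordinary character.

```python
# UTF-16LE encoding of "AB", preceded by the BOM FF FE.
utf16_data = b"\xff\xfeA\x00B\x00"

# "utf-16-le" does not consume the BOM; U+FEFF stays in the decoded text.
kept = utf16_data.decode("utf-16-le").encode("utf-8")
print(kept.hex(" "))  # ef bb bf 41 42

# The generic "utf-16" codec uses the BOM to pick the byte order and drops it.
dropped = utf16_data.decode("utf-16").encode("utf-8")
print(dropped.hex(" "))  # 41 42
```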

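Which is better?

Without. As Martin Cote answered, the Unicode standard does not recommend the BOM for UTF-8. It causes problems with software that is not BOM-aware.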

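A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about which byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.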
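A validity check of this kind can be sketched in Python by attempting a strict decode. The function name `looks_like_utf8` is invented for this example; note that pure ASCII data also passes, since ASCII is a subset of UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("héllo".encode("utf-8")))       # True
print(looks_like_utf8("héllo".encode("iso-8859-1")))  # False
print(looks_like_utf8(b"plain ASCII"))                # True
```

The ISO-8859-1 sample fails because the byte E9 announces a three-byte UTF-8 sequence but is followed by an ASCII letter rather than a continuation byte.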

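As mentioned above, UTF-8 with a BOM may cause problems with software that is not BOM-aware (or BOM-compatible). I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, because a client required that WYSIWYG program.

The layout invariably got destroyed on saving. It took me some time to work my way around this. The files then worked well in Firefox, but showed a CSS quirk in Internet Explorer that destroyed the layout again. After fiddling with the linked CSS files for hours, I discovered that Internet Explorer did not like the BOM-fed HTML file. Never again.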

Also, I just found this on Wikipedia:

The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns
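As a rough illustration of the shebang problem, this Python sketch checks whether a script's first bytes are a UTF-8 BOM followed by `#!`. The helper name `shebang_hidden_by_bom` is hypothetical, and the check only mirrors the byte comparison described in the quote, not any particular kernel's behavior.

```python
import codecs

SHEBANG = b"#!"  # bytes 0x23 0x21


def shebang_hidden_by_bom(data: bytes) -> bool:
    """Return True if a UTF-8 BOM sits in front of an otherwise valid shebang."""
    return data.startswith(codecs.BOM_UTF8 + SHEBANG)


clean_script = b"#!/bin/sh\necho hello\n"
bom_script = codecs.BOM_UTF8 + clean_script

print(clean_script.startswith(SHEBANG))     # True
print(bom_script.startswith(SHEBANG))       # False
print(shebang_hidden_by_bom(bom_script))    # True
print(shebang_hidden_by_bom(clean_script))  # False
```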

The Unicode Byte Order Mark (BOM) FAQ provides a concise answer:

Q: How I should deal with BOMs?

A: Here are some guidelines to follow:

1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
2. Some protocols allow optional BOMs in the case of untagged text. In those cases,
   - Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
   - Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.