UTF-8和UTF-8与BOM有什么不同?哪个更好?
当前回答
如上所述,带有BOM的UTF-8可能会导致非BOM感知(或兼容)软件出现问题。我曾经用基于mozilla的KompoZer编辑UTF-8 + BOM编码的HTML文件,因为客户需要WYSIWYG程序。
保存时,布局总是会被破坏。我花了一些时间来解决这个问题。这些文件在Firefox中运行良好,但在Internet Explorer中显示了一个CSS怪癖,再次破坏了布局。在摆弄了几个小时链接的CSS文件后,我发现Internet Explorer不喜欢BOMfed HTML文件。我再也不会见你了。
还有,我刚在维基百科上找到了这个:
The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns
其他回答
一个实际的区别是,如果你为Mac OS X编写一个shell脚本,并将其保存为普通的UTF-8,你将得到响应:
#!/bin/bash: No such file or directory
在shebang行指定您希望使用哪个shell的响应中:
#!/bin/bash
如果你保存为UTF-8,没有BOM(说在BBEdit),一切都会很好。
UTF-8 BOM是文本流开头的字节序列(0xEF, 0xBB, 0xBF),它允许读者更可靠地猜测文件是否以UTF-8编码。
通常,BOM用于表示编码的字节顺序,但由于字节顺序与UTF-8无关,因此BOM是不必要的。
根据Unicode标准,不建议使用UTF-8文件的BOM:
2.6编码方案 ... 对于UTF-8,既不要求也不建议使用BOM,但在将UTF-8数据从使用BOM的其他编码形式转换或将BOM用作UTF-8签名的上下文中可能会遇到这种情况。有关更多信息,请参阅第16.8节特殊项中的“字节顺序标记”小节。
只有当文件实际包含一些非ascii字符时,UTF-8和BOM才有用。如果包含了它,而没有任何ASCII,那么它可能会破坏旧的应用程序,否则将文件解释为纯ASCII。当遇到非ASCII字符时,这些应用程序肯定会失败,因此在我看来,只有当文件可以并且不应该再被解释为纯ASCII时,才应该添加BOM。
我想说清楚的是,我宁愿没有BOM。如果一些旧的垃圾没有它就坏了,那么就添加它,替换遗留应用程序是不可行的。
不要制作UTF-8的BOM之外的任何东西。
Unicode字节顺序标记(BOM)常见问题解答提供了一个简明的答案:
Q: How I should deal with BOMs? A: Here are some guidelines to follow: A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM. Some protocols allow optional BOMs in the case of untagged text. In those cases, Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything. Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
应该注意的是,对于某些文件,即使在Windows上也不能有BOM。例如SQL*plus或VBScript文件。如果这样的文件包含BOM,则在尝试执行它们时会出现错误。
推荐文章
- Java标识符中的“连接字符”是什么?
- 使用Javascript的atob解码base64不能正确解码utf-8字符串
- 为什么字符集名称不是常量?
- 编码字符串为UTF-8
- 有没有统一码符号来表示"搜索"
- 我如何在PHP中输出一个UTF-8 CSV, Excel将正确读取?
- Python __str__与__unicode__
- 如何在Python中将字符串转换为utf-8
- Unicode和UTF-8的区别是什么?
- 我真的需要将“&”编码为“&”吗?
- 用Python写入UTF-8文件
- c++中的_tmain()和main()有什么区别?
- HTML编码问题-显示“”字符而不是“ ”
- 将Unicode文本写入文本文件?
- PHP DOMDocument loadHTML没有正确编码UTF-8