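What is the difference between UTF-8 and UTF-8 with BOM?

What is different between UTF-8 and UTF-8 with BOM? Which is better?

Current answer

If you use UTF-8 in HTML files, and you use Serbian Cyrillic, Serbian Latin, German, Hungarian, or some other exotic language on the same page, then UTF-8 with BOM is better.

That is my opinion, after 30 years in the computing and IT industry.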

2013-03-15 10:01:53

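Other answers

Q: What is different between UTF-8 and UTF-8 without a BOM? Which is better?

Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.

On the meaning of the BOM and UTF-8:

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8.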
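To make the quoted point concrete, here is a small Python sketch (my illustration, not part of the Wikipedia excerpt). The UTF-8 BOM is simply the code point U+FEFF encoded as UTF-8, which gives the three bytes EF BB BF. Python's standard library exposes those bytes as codecs.BOM_UTF8, and its "utf-8-sig" codec writes the BOM when encoding and strips it when decoding.

```python
import codecs

# U+FEFF encoded as UTF-8 is the three-byte signature EF BB BF.
bom = "\ufeff".encode("utf-8")
assert bom == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

text = "héllo"
plain = text.encode("utf-8")        # no BOM
signed = text.encode("utf-8-sig")   # BOM prepended

# The two encodings differ only by the leading signature.
assert signed == codecs.BOM_UTF8 + plain

# "utf-8-sig" strips a leading BOM if present and also accepts input without one.
assert signed.decode("utf-8-sig") == text
assert plain.decode("utf-8-sig") == text
```

Apart from those three leading bytes, the two files are byte-for-byte identical.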

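Argument for NOT using a BOM:

The primary motivation for not using a BOM is backwards compatibility with software that is not Unicode-aware... Another motivation for not using a BOM is to encourage UTF-8 as the "default" encoding.

Argument FOR using a BOM: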

The argument for using a BOM is that without it, heuristic analysis is required to determine what character encoding a file is using. Historically such analysis, to distinguish various 8-bit encodings, is complicated, error-prone, and sometimes slow. A number of libraries are available to ease the task, such as Mozilla Universal Charset Detector and International Components for Unicode. Programmers mistakenly assume that detection of UTF-8 is equally difficult (it is not because of the vast majority of byte sequences are invalid UTF-8, while the encodings these libraries are trying to distinguish allow all possible byte sequences). Therefore not all Unicode-aware programs perform such an analysis and instead rely on the BOM. In particular, Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.
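The claim that most byte sequences are invalid UTF-8 can be checked with a short experiment. The sketch below is my own (the helper name is_valid_utf8 is made up for this example). It enumerates every two-byte sequence containing at least one non-ASCII byte and counts how many pass a strict UTF-8 decode, compared with Latin-1, which accepts everything.

```python
from itertools import product

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Every two-byte sequence that contains at least one non-ASCII byte.
candidates = [bytes(p) for p in product(range(256), repeat=2) if max(p) >= 0x80]

valid = sum(is_valid_utf8(c) for c in candidates)
latin1_ok = sum(len(c.decode("latin-1")) == 2 for c in candidates)

print(f"UTF-8:   {valid} of {len(candidates)} sequences are valid")
print(f"Latin-1: {latin1_ok} of {len(candidates)} sequences are valid")
```

Only the 1920 well-formed two-byte characters (lead byte C2 to DF followed by a continuation byte 80 to BF) survive out of 49152 candidates, roughly 4%. The proportion shrinks further for longer inputs, which is why a successful strict decode of non-ASCII text is strong evidence of UTF-8.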

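On which is better, WITH or WITHOUT the BOM:

The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."

My conclusion:

Use the BOM only if compatibility with a software application is absolutely essential.

Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by @barlop, when using the Windows Command Prompt with UTF-8†, commands such as type and more do not expect the BOM to be present. If the BOM is present, it can cause problems, just as it does for other applications.

† The chcp command offers support for UTF-8 (without the BOM) via code page 65001.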

2014-10-02 20:24:24

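This question already has a million and one answers, and many of them are quite good, but I wanted to try to clarify when a BOM should or should not be used.

As mentioned, any use of the UTF BOM (byte order mark) to determine whether a string is UTF-8 is educated guesswork. If proper metadata is available (such as charset="utf-8"), then you already know what you are supposed to be using. Otherwise you will need to test and make some assumptions. This involves checking whether the file the string comes from begins with the hexadecimal byte sequence EF BB BF.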

If a byte code corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8 and you can go from there. When forced to make this guess, however, additional error checking while reading would still be a good idea in case something comes up garbled. You should only assume a BOM is not UTF-8 (i.e. latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, however, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding.
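A minimal Python sketch of that guessing procedure, assuming the only candidates are UTF-8 and a single legacy 8-bit code page. The names guess_encoding and read_text, and the cp1252 fallback, are assumptions made for this example rather than any standard API.

```python
import codecs

def guess_encoding(data: bytes, legacy: str = "cp1252") -> str:
    """Guess a codec name for data (illustrative helper, not a standard API)."""
    # A leading EF BB BF is a strong hint that the rest is UTF-8.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # No BOM: validate the bytes against UTF-8 instead.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is the legacy code page.
        return legacy

def read_text(data: bytes) -> str:
    # errors="replace" keeps reading even if the guess turns out to be wrong,
    # marking undecodable bytes with U+FFFD instead of raising.
    return data.decode(guess_encoding(data), errors="replace")
```

For example, read_text(b"\xef\xbb\xbfabc") returns "abc" with the BOM stripped, while bytes produced by "é".encode("cp1252") are routed to the legacy decoder because a lone 0xE9 is not valid UTF-8.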

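Why is a BOM not recommended?

Software that is not Unicode-aware, or is poorly compliant, may assume the text is latin-1 or ANSI and will not strip the BOM from the string, which obviously causes problems. The BOM is also not really needed: just check whether the contents are valid in a given encoding, and always use UTF-8 as the fallback when no matching encoding can be found.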
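To show what "won't strip the BOM" looks like in practice, here is a short Python sketch. The sample header line is made up for illustration.

```python
data = "name,age\n".encode("utf-8-sig")   # bytes of a file saved with a BOM

# A UTF-8 reader that does not know about the BOM keeps U+FEFF in the string.
as_utf8 = data.decode("utf-8")
print(repr(as_utf8))      # '\ufeffname,age\n'

# A reader that assumes Windows-1252 shows the three BOM bytes as junk characters.
as_cp1252 = data.decode("cp1252")
print(repr(as_cp1252))    # 'ï»¿name,age\n'

# Either way, the first field no longer compares equal to "name".
print(as_utf8.split(",")[0] == "name")     # False
print(as_cp1252.split(",")[0] == "name")   # False
```

This is the usual source of the stray "ï»¿" at the top of a page or an unrecognized first column name.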

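When should you encode with a BOM?

If you are unable to record the metadata in any other way (through a charset tag or file system metadata), and the programs consuming the file expect BOMs, you should encode with a BOM. This is especially true on Windows, where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode, and here is the encoding used.

When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, a CSV file either must have a BOM or must not have one. For example, if you are using Excel 2007+ on Windows, the file must be encoded with a BOM if you want to open it smoothly without resorting to importing the data.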
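For the CSV case, a minimal Python sketch of writing a file with a BOM so that it can be opened directly in a program that expects one. The file name and sample rows are placeholders invented for this example.

```python
import csv
import os
import tempfile

rows = [["name", "city"], ["Zoë", "Kraków"]]
path = os.path.join(tempfile.gettempdir(), "people_bom_demo.csv")  # placeholder path

# "utf-8-sig" prepends EF BB BF; newline="" is what the csv module docs recommend.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading back with "utf-8-sig" strips the BOM again, so the header is intact.
with open(path, encoding="utf-8-sig", newline="") as f:
    assert list(csv.reader(f)) == rows
```

If the consumer must not see a BOM, switching the encoding to plain "utf-8" when writing is the only change needed.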

2016-01-25 16:03:13

The Unicode Byte Order Mark (BOM) FAQ provides a concise answer:

Q: How I should deal with BOMs?

A: Here are some guidelines to follow:

1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
2. Some protocols allow optional BOMs in the case of untagged text. In those cases:
   - Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
   - Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
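The last guideline can be seen in how Python's codecs behave. This is a sketch of one implementation's behavior, not something the FAQ itself shows: the explicit-endian UTF-16 codecs never emit a BOM, while the generic "utf-16" codec writes one because it does not declare a byte order.

```python
import codecs

text = "A"

# Explicit-endian codecs emit no BOM: the byte order is already declared.
be = text.encode("utf-16-be")   # b'\x00A'
le = text.encode("utf-16-le")   # b'A\x00'

# The generic "utf-16" codec does not declare an order, so it writes a BOM first.
generic = text.encode("utf-16")
assert generic[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
assert generic.decode("utf-16") == text   # BOM consumed while decoding

# If a BOM sneaks into a stream declared as UTF-16BE, it is not stripped;
# it comes through as a U+FEFF character in the decoded text.
stray = (codecs.BOM_UTF16_BE + be).decode("utf-16-be")
assert stray == "\ufeff" + text
```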

2018-03-08 13:58:08

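From http://en.wikipedia.org/wiki/Byte-order_mark:

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.

My real problem with the absence of a BOM is the following. Suppose we have a file which contains: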