I received some encoded text, but I have no idea what charset was used. Is there a way to determine the encoding of a text file using Python? (The related question How can I detect the encoding/codepage of a text file deals with C#.)


Current answer

You can use the python-magic package, which does not load the whole file into memory:

import magic


def detect(file_path):
    return magic.Magic(mime_encoding=True).from_file(file_path)

The output is the encoding name, for example:

iso-8859-1
us-ascii
utf-8

Other answers

# Function: OpenRead(file)

# A text file can be encoded using:
#   (1) the default operating system code page, or
#   (2) utf8 with a BOM header
#
#  If a text file is encoded with utf8 and does not have a BOM header,
#  the user can manually add a BOM header to the text file
#  using a text editor such as notepad++ and rerun the python script;
#  otherwise the file is read as a codepage file with the
#  invalid codepage characters removed

import sys
if sys.version_info[0] != 3:
    print('Aborted: Python 3.x required')
    sys.exit(1)

def bomType(file):
    """
    returns file encoding string for open() function

    EXAMPLE:
        bom = bomType(file)
        open(file, encoding=bom, errors='ignore')
    """

    f = open(file, 'rb')
    b = f.read(4)
    f.close()

    if (b[0:3] == b'\xef\xbb\xbf'):
        return "utf8"

    # Check utf32 before utf16: the utf-32 LE BOM (\xff\xfe\x00\x00)
    # begins with the utf-16 LE BOM (\xff\xfe)
    if ((b[0:4] == b'\x00\x00\xfe\xff')
            or (b[0:4] == b'\xff\xfe\x00\x00')):
        return "utf32"

    # Python automatically detects endianness if a utf-16 BOM is present
    # write endianness is generally determined by the endianness of the CPU
    if ((b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe')):
        return "utf16"

    # If no BOM is present, assume the codepage
    #     used by your operating system
    return "cp1252"
    # For the United States it's: cp1252


def OpenRead(file):
    bom = bomType(file)
    return open(file, 'r', encoding=bom, errors='ignore')


#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()

fout = open("myfile2.txt", "w", encoding="utf8")
fout.write("\u2022 hi there (utf8)")
fout.close()

# this case is still treated like codepage cp1252
#   (User responsible for making sure that all utf8 files
#   have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
fout.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L = fin.readline()
print(L)  # requires QtConsole to view; Cmd.exe is cp1252
fin.close()

# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L = fin.readline()
print(L)
fin.close()

# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
fin.read(20)
fin.close()
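The hard-coded BOM byte strings above can also be taken from the standard-library codecs module, which avoids typos in the magic numbers. A minimal sketch of the same idea (sniff_bom is a hypothetical helper, not part of the answer above):

```python
import codecs

# (BOM, encoding) pairs, longest BOMs first so utf32 is not
# mistaken for utf16 (\xff\xfe is a prefix of \xff\xfe\x00\x00)
BOMS = [
    (codecs.BOM_UTF32_BE, "utf32"),
    (codecs.BOM_UTF32_LE, "utf32"),
    (codecs.BOM_UTF8, "utf8"),
    (codecs.BOM_UTF16_BE, "utf16"),
    (codecs.BOM_UTF16_LE, "utf16"),
]

def sniff_bom(prefix: bytes, default: str = "cp1252") -> str:
    """Return an encoding name for open(), based on a leading BOM."""
    for bom, name in BOMS:
        if prefix.startswith(bom):
            return name
    return default
```

With this, bomType(file) reduces to reading the first four bytes and calling sniff_bom on them.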

Another way to work out the encoding is to use libmagic (which is the code behind the file command). There are plenty of Python bindings available.

The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. They can determine the encoding of a file by doing:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc

There is a pip package on PyPI with the same name, python-magic, which is incompatible but also uses libmagic. It can also get the encoding, by doing:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)

Edit: chardet seems to be unmaintained, but most of this answer still applies. Check https://pypi.org/project/charset-normalizer/ for an alternative.

Correctly detecting the encoding all the time is impossible.

(From the chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

The chardet library uses that study to detect encodings. chardet is a port of the auto-detection code in Mozilla.
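A minimal sketch of typical chardet usage, assuming the chardet package is installed; detect() takes raw bytes and returns a guess together with a confidence score:

```python
import chardet

# enough non-ASCII UTF-8 text for the detector to work with
raw = ("caf\u00e9 na\u00efve r\u00e9sum\u00e9 \u2014 " * 20).encode("utf-8")

result = chardet.detect(raw)
# result is a dict like {'encoding': ..., 'confidence': ..., 'language': ...}
text = raw.decode(result["encoding"])
```

Note that the confidence is only a heuristic; on short or ASCII-only input the guess can be None or wrong.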

You can also use UnicodeDammit. It will try the following methods:

An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
An encoding sniffed by the chardet library, if you have it installed.
UTF-8
Windows-1252

Depending on your platform, I just opt to use the Linux shell file command. This works for me since I use it in a script that runs exclusively on one of our Linux machines.

Obviously this isn't an ideal solution or answer, but it can be modified to fit your needs. In my case I just need to determine whether a file is UTF-8 or not.

import subprocess

def is_utf8(path):
    p = subprocess.Popen(['file', path], stdout=subprocess.PIPE)
    cmd_output = p.stdout.readlines()
    # the first output line begins with the file type,
    # as observed when running the 'file' command directly;
    # decode, since the pipe yields bytes
    x = cmd_output[0].decode().split(": ")[1]
    return x.startswith('UTF-8')

Some text files are aware of their encoding, most are not. Aware:

text files with a BOM
XML files, encoded in UTF-8 or with the encoding given in the prolog
JSON files, always encoded in UTF-8

Not aware:

CSV files
random text files

Some encodings are versatile, i.e. they can decode any sequence of bytes, and some are not. US-ASCII is not versatile, because any byte greater than 127 is not mapped to any character. UTF-8 is not versatile, because not every sequence of bytes is valid.

On the contrary, Latin-1, Windows-1252, etc. are versatile (even if some bytes are not officially mapped to a character):

>>> [b.to_bytes(1, 'big').decode("latin-1") for b in range(256)]
['\x00', ..., 'ÿ']
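The difference is easy to demonstrate: a byte such as 0x81 decodes fine under Latin-1 but is rejected by the non-versatile codecs:

```python
data = b"\x81\xfe\xff"  # invalid in ASCII and UTF-8, fine in Latin-1

results = {}
for enc in ("latin-1", "ascii", "utf-8"):
    try:
        results[enc] = data.decode(enc)
    except UnicodeDecodeError:
        results[enc] = None  # the codec rejected the bytes

# Latin-1 maps every byte to the code point with the same value,
# so only results["latin-1"] holds a string; the other two are None.
```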

Given a random text file encoded as a sequence of bytes, you cannot determine the encoding unless the file is aware of its encoding, because some encodings are versatile. But you can sometimes rule out the non-versatile encodings; all versatile encodings remain possible. The chardet module uses byte frequencies to guess which encoding fits the encoded text best.

If you don't want to use this module or a similar one, here is a simple method:

check whether the file is aware of its encoding (has a BOM)
check the non-versatile encodings and accept the first one that can decode the bytes (ASCII before UTF-8, because it is stricter)
pick a fallback encoding.

The second step is a bit risky if you check only a sample of the file, because some bytes in the rest of the file may be invalid.
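The risk can be illustrated with a UTF-8 sample cut in the middle of a multi-byte character; an incremental decoder with final=False tolerates such a trailing truncated sequence, since the missing bytes could legitimately arrive in the next chunk:

```python
import codecs

full = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'
sample = full[:4]                   # cut inside the 2-byte sequence for 'é'

# A plain decode of the truncated sample raises UnicodeDecodeError...
try:
    sample.decode("utf-8")
    plain_ok = True
except UnicodeDecodeError:
    plain_ok = False

# ...but an incremental decoder with final=False accepts it and
# simply buffers the incomplete sequence for later.
partial = codecs.getincrementaldecoder("utf_8")().decode(sample, final=False)
# plain_ok is False, partial == 'caf'
```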

Code:

import codecs


def guess_encoding(data: bytes, fallback: str = "iso8859_15") -> str:
    """
    A basic encoding detector.
    """
    for bom, encoding in [
        (codecs.BOM_UTF32_BE, "utf_32_be"),
        (codecs.BOM_UTF32_LE, "utf_32_le"),
        (codecs.BOM_UTF16_BE, "utf_16_be"),
        (codecs.BOM_UTF16_LE, "utf_16_le"),
        (codecs.BOM_UTF8, "utf_8_sig"),
    ]:
        if data.startswith(bom):
            return encoding

    if all(b < 128 for b in data):
        return "ascii"  # you may want to use the fallback here if data is only a sample.

    decoder = codecs.getincrementaldecoder("utf_8")()
    try:
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return fallback
    else:
        return "utf_8"  # not certain if data is only a sample

Keep in mind that a non-versatile encoding may fail. The errors parameter of the decode method can be set to 'ignore', 'replace' or 'backslashreplace' to avoid exceptions.
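A quick sketch of what each errors mode does with an undecodable byte (0x81 is undefined in cp1252):

```python
data = b"hi \x81 there"

ignored = data.decode("cp1252", errors="ignore")            # byte dropped
replaced = data.decode("cp1252", errors="replace")          # U+FFFD inserted
escaped = data.decode("cp1252", errors="backslashreplace")  # literal '\x81'

# ignored == 'hi  there', replaced == 'hi \ufffd there',
# escaped == 'hi \\x81 there'
```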