我收到了一些编码的文本,但我不知道使用的是什么字符集。是否有一种方法可以使用Python确定文本文件的编码?如何检测文本文件的编码/代码页处理c#。


当前回答

很久以前,我有这样的需求。

阅读我的旧代码,我发现了这个:

    import urllib.request
    import chardet
    import os
    import settings

    [...]
    file = 'sources/dl/file.csv'
    media_folder = settings.MEDIA_ROOT
    file = os.path.join(media_folder, str(file))
    if os.path.isfile(file):
        file_2_test = urllib.request.urlopen('file://' + file).read()
        encoding = (chardet.detect(file_2_test))['encoding']
        return encoding

这为我工作,并返回ascii

其他回答

另一种计算编码的方法是使用 的代码 文件命令)。有大量的 可用的Python绑定。

文件源树中的python绑定可作为 Python-magic(或python3-magic) debian软件包。它可以通过执行以下操作来确定文件的编码:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc

在pypi上有一个名称相同但不兼容的python-magic pip包,它也使用libmagic。它也可以得到编码,通过这样做:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)

如果你知道文件的一些内容,你可以尝试用几种编码来解码它,看看哪个丢失了。一般来说没有办法,因为文本文件就是文本文件,这些都是愚蠢的;)

这个网站有python代码识别ascii,编码bom和utf8 no bom: https://unicodebook.readthedocs.io/guess_encoding.html。将文件读入字节数组(data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array。举个例子。我在osx。

#!/usr/bin/python                                                                                                  

import sys

def isUTF8(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True

def get_bytes_from_file(filename):
    return open(filename, "rb").read()

filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)


PS /Users/js> ./isutf8.py hi.txt                                                                                     
True

编辑:chardet似乎无人维护,但大部分答案适用。请登录https://pypi.org/project/charset-normalizer/查看其他选择

始终正确地检测编码是不可能的。

(来自chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

有一个chardet库利用这项研究来检测编码。chardet是Mozilla中自动检测代码的一个端口。

你也可以使用UnicodeDammit。它将尝试以下方法:

An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document. An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII. An encoding sniffed by the chardet library, if you have it installed. UTF-8 Windows-1252

# Function: OpenRead(file)

# A text file can be encoded using:
#   (1) The default operating system code page, Or
#   (2) utf8 with a BOM header
#
#  If a text file is encoded with utf8, and does not have a BOM header,
#  the user can manually add a BOM header to the text file
#  using a text editor such as notepad++, and rerun the python script,
#  otherwise the file is read as a codepage file with the 
#  invalid codepage characters removed

import sys
if int(sys.version[0]) != 3:
    print('Aborted: Python 3.x required')
    sys.exit(1)

def bomType(file):
    """
    returns file encoding string for open() function

    EXAMPLE:
        bom = bomtype(file)
        open(file, encoding=bom, errors='ignore')
    """

    f = open(file, 'rb')
    b = f.read(4)
    f.close()

    if (b[0:3] == b'\xef\xbb\xbf'):
        return "utf8"

    # Python automatically detects endianess if utf-16 bom is present
    # write endianess generally determined by endianess of CPU
    if ((b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe')):
        return "utf16"

    if ((b[0:5] == b'\xfe\xff\x00\x00') 
              or (b[0:5] == b'\x00\x00\xff\xfe')):
        return "utf32"

    # If BOM is not provided, then assume its the codepage
    #     used by your operating system
    return "cp1252"
    # For the United States its: cp1252


def OpenRead(file):
    bom = bomType(file)
    return open(file, 'r', encoding=bom, errors='ignore')


#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()

fout = open("myfile2.txt", "w", encoding="utf8")
fout.write("\u2022 hi there (utf8)")
fout.close()

# this case is still treated like codepage cp1252
#   (User responsible for making sure that all utf8 files
#   have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
fout.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L =fin.readline() 
print(L) #requires QtConsole to view, Cmd.exe is cp1252
fin.close()

# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L =fin.readline() 
print(L)
fin.close()

# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
fin.read(20)
fin.close()