我使用Python 2从ASCII编码的文本文件解析JSON。

当用json或simplejson加载这些文件时,我的所有字符串值都转换为Unicode对象而不是字符串对象。问题是,我必须将数据与一些只接受字符串对象的库一起使用。我不能更改库也不能更新它们。

是否有可能获得字符串对象而不是Unicode对象?

例子

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b']  # I want these to be of type `str`, not `unicode`

(2017年一个简单而干净的解决方案是使用最新版本的Python——即Python 3和更高版本。)


当前回答

没有内置选项让json模块函数返回字节字符串而不是Unicode字符串。然而,这个简短而简单的递归函数将任何解码的JSON对象从使用Unicode字符串转换为utf -8编码的字节字符串:

def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

只需在从json中获得的输出上调用此函数。加载或json。负载的电话。

几点注意事项:

To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren't supported until Python 2.7. Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that manages to pull this off correctly, although as a consequence, it's significantly more complicated than my approach.

其他回答

我构建了这个递归施法者。它符合我的需要,我认为它是相对完整的。

def _parseJSON(self, obj):
    newobj = {}

    for key, value in obj.iteritems():
        key = str(key)

        if isinstance(value, dict):
            newobj[key] = self._parseJSON(value)
        elif isinstance(value, list):
            if key not in newobj:
                newobj[key] = []
                for i in value:
                    newobj[key].append(self._parseJSON(i))
        elif isinstance(value, unicode):
            val = str(value)
            if val.isdigit():
                val = int(val)
            else:
                try:
                    val = float(val)
                except ValueError:
                    val = str(val)
            newobj[key] = val

    return newobj

只需要像这样传递一个JSON对象:

obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)

我把它作为一个类的私有成员,但您可以根据需要重新使用该方法。

问题在于simplejson和json是两个不同的模块,至少在处理Unicode的方式上是这样。你在Python 2.6+中有json,它给你Unicode值,而simplejson返回字符串对象。

在您的环境中尝试easy_installing -ing simplejson,看看是否有效。对我来说确实如此。

我从Mark Amery的回答中改编了代码,特别是为了摆脱isinstance的鸭子键入的优点。

编码是手动完成的,ensure_ascii是禁用的。json的Python文档。Dump说:

如果ensure_ascii为True(默认值),输出中的所有非ascii字符将使用\uXXXX序列转义

免责声明:在文档测试中,我使用了匈牙利语。一些著名的与匈牙利语相关的字符编码有:cp852,在DOS中使用的IBM/OEM编码(有时被称为ASCII)。我认为这是不正确的,因为它取决于代码页设置)。Windows-1250用于Windows(有时称为ANSI,取决于区域设置),ISO 8859-1有时用于HTTP服务器。

测试文本Tüskéshátú kígyóbűvölő来自Koltai László(本地个人姓名形式),来自维基百科。

# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json

def encode_items(input, encoding='utf-8'):
    u"""original from: https://stackoverflow.com/a/13101776/611007
    adapted by SO/u/611007 (20150623)
    >>>
    >>> ## run this with `python -m doctest <this file>.py` from command line
    >>>
    >>> txt = u"Tüskéshátú kígyóbűvölő"
    >>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
    >>> txt3 = u"uúuutifu"
    >>> txt4 = b'u\\xfauutifu'
    >>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
    >>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
    >>> txt4u = txt4.decode('cp1250')
    >>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
    >>> txt5 = b"u\\xc3\\xbauutifu"
    >>> txt5u = txt5.decode('utf-8')
    >>> txt6 = u"u\\u251c\\u2551uutifu"
    >>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
    >>> assert txt == there_and_back_again(txt)
    >>> assert txt == there_and_back_again(txt2)
    >>> assert txt3 == there_and_back_again(txt3)
    >>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
    >>> assert txt3 == txt4u,(txt3,txt4u)
    >>> assert txt3 == there_and_back_again(txt5)
    >>> assert txt3 == there_and_back_again(txt5u)
    >>> assert txt3 == there_and_back_again(txt4u)
    >>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
    >>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
    >>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
    >>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
    >>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
    >>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
    >>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
    >>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
    """
    try:
        input.iteritems
        return {encode_items(k): encode_items(v) for (k,v) in input.iteritems()}
    except AttributeError:
        if isinstance(input, unicode):
            return input.encode(encoding)
        elif isinstance(input, str):
            return input
        try:
            iter(input)
            return [encode_items(e) for e in input]
        except TypeError:
            return input

def alt_dumps(obj, **kwargs):
    """
    >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
    '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
    """
    if 'ensure_ascii' in kwargs:
        del kwargs['ensure_ascii']
    return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)

我还想强调Jarret Hardie引用JSON规范的答案,引用如下:

字符串是零个或多个Unicode字符的集合

在我的用例中,我有带有JSON内容的文件。它们是UTF-8编码的文件。ensure_ascii的结果是正确转义,但不是非常可读的JSON文件,这就是为什么我改编了Mark Amery的答案来满足我的需要。

doctest不是特别周到,但我分享了代码,希望它对某人有用。

我也遇到了这个问题,不得不处理JSON,我想出了一个小循环,将Unicode键转换为字符串。(GAE上的simplejson不返回字符串键。)

obj是从JSON解码的对象:

if NAME_CLASS_MAP.has_key(cls):
    kwargs = {}
    for i in obj.keys():
        kwargs[str(i)] = obj[i]
    o = NAME_CLASS_MAP[cls](**kwargs)
    o.save()

kwargs是我传递给GAE应用程序的构造函数的内容(它不喜欢**kwargs中的Unicode键)。

它不如Wells的解决方案健壮,但要小得多。

使用钩子支持Python 2和3(来自Mirec Miskuf的回答):

import requests
import six
from six import iteritems

requests.packages.urllib3.disable_warnings()  # @UndefinedVariable
r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)

def _byteify(data):
    # If this is a Unicode string, return its string representation
    if isinstance(data, six.string_types):
        return str(data.encode('utf-8').decode())

    # If this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item) for item in data ]

    # If this is a dictionary, return dictionary of byteified keys and values,
    # but only if we haven't already byteified it
    if isinstance(data, dict):
        return {
            _byteify(key): _byteify(value) for key, value in iteritems(data)
        }
    # If it's anything else, return it in its original form
    return data

w = r.json(object_hook=_byteify)
print(w)

返回:

 {'three': '', 'key': 'value', 'one': 'two'}