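I have data saved in a PostgreSQL database. I am querying this data using Python 2.7 and turning it into a pandas DataFrame. However, the last column of this DataFrame holds a dictionary of values. The DataFrame df looks like this: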

Station ID     Pollutants
8809           {"a": "46", "b": "3", "c": "12"}
8810           {"a": "36", "b": "5", "c": "8"}
8811           {"b": "2", "c": "7"}
8812           {"c": "11"}
8813           {"a": "82", "c": "15"}

I need to split this column into separate columns, so that the DataFrame df2 looks like this:

Station ID     a      b       c
8809           46     3       12
8810           36     5       8
8811           NaN    2       7
8812           NaN    NaN     11
8813           82     NaN     15

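The main issue I'm having is that the lists are not all the same length. But all of the lists contain at most the same 3 keys: 'a', 'b', and 'c'. And they always appear in the same order ('a' first, 'b' second, 'c' third).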


objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)


IndexError: out-of-bounds on slice (end) 




#My data format 
u{'a': '1', 'b': '2', 'c': '3'}

#and not
{u'a': '1', u'b': '2', u'c': '3'}




In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

In [3]: df
   a                   b
0  1           {u'c': 1}
1  2           {u'd': 3}
2  3  {u'c': 5, u'd': 6}

In [4]: df['b'].apply(pd.Series)
     c    d
0  1.0  NaN
1  NaN  3.0
2  5.0  6.0


In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0


In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0

Try this: the data returned from SQL has to be converted into a dict. Or could it be that "Pollutant Levels" is now "Pollutants"?

   StationID                   Pollutants
0       8809  {"a":"46","b":"3","c":"12"}
1       8810   {"a":"36","b":"5","c":"8"}
2       8811            {"b":"2","c":"7"}
3       8812                   {"c":"11"}
4       8813          {"a":"82","c":"15"}

df2["Pollutants"] = df2["Pollutants"].apply(lambda x : dict(eval(x)) )
df3 = df2["Pollutants"].apply(pd.Series )

    a    b   c
0   46    3  12
1   36    5   8
2  NaN    2   7
3  NaN  NaN  11
4   82  NaN  15

result = pd.concat([df2, df3], axis=1).drop('Pollutants', axis=1)

   StationID    a    b   c
0       8809   46    3  12
1       8810   36    5   8
2       8811  NaN    2   7
3       8812  NaN  NaN  11
4       8813   82  NaN  15
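
As a variation on the eval-based conversion above, here is a minimal sketch that parses the strings with json.loads, which does not execute arbitrary code. It assumes the column holds valid JSON text (double-quoted keys and values, as in the sample); the three-row frame is a made-up stand-in for the real query result.

```python
import json

import pandas as pd

# Hypothetical sample mirroring the frame shown above: dicts stored as JSON text.
df2 = pd.DataFrame({
    "StationID": [8809, 8811, 8812],
    "Pollutants": ['{"a":"46","b":"3","c":"12"}', '{"b":"2","c":"7"}', '{"c":"11"}'],
})

# json.loads turns each string into a dict without evaluating it as Python code.
parsed = df2["Pollutants"].apply(json.loads)

# Build the new columns from the list of dicts; keys missing from a row become NaN.
df3 = pd.DataFrame(parsed.tolist(), index=df2.index)

# Put the new columns next to the remaining original columns.
result = pd.concat([df2.drop("Pollutants", axis=1), df3], axis=1)
print(result)
```

If the strings use single quotes (Python repr rather than JSON), ast.literal_eval is the usual safe alternative.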
df = pd.concat([df['a'], df.b.apply(pd.Series)], axis=1)



# step 1: convert the `Pollutants` column to Pandas dataframe series
df_pol_ps = df['Pollutants'].apply(pd.Series)

    a   b   c
0   46  3   12
1   36  5   8
2   NaN 2   7
3   NaN NaN 11
4   82  NaN 15

# step 2: concat columns `a, b, c` and drop/remove the `Pollutants` 
df_final = pd.concat([df, df_pol_ps], axis = 1).drop('Pollutants', axis = 1)

    StationID   a   b   c
0   8809    46  3   12
1   8810    36  5   8
2   8811    NaN 2   7
3   8812    NaN NaN 11
4   8813    82  NaN 15


df_final = pd.concat([df, df['Pollutants'].apply(pd.Series)], axis = 1).drop('Pollutants', axis = 1)

    StationID   a   b   c
0   8809    46  3   12
1   8810    36  5   8
2   8811    NaN 2   7
3   8812    NaN NaN 11
4   8813    82  NaN 15

You can use join with pop + tolist. Performance is comparable to concat with drop + tolist, but some may find this syntax cleaner:

res = df.join(pd.DataFrame(df.pop('b').tolist()))


df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

def joris1(df):
    return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)

def joris2(df):
    return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)

def jpp(df):
    return df.join(pd.DataFrame(df.pop('b').tolist()))

df = pd.concat([df]*1000, ignore_index=True)

%timeit joris1(df.copy())  # 1.33 s per loop
%timeit joris2(df.copy())  # 7.42 ms per loop
%timeit jpp(df.copy())     # 7.68 ms per loop


import pandas as pd

df2 = pd.json_normalize(df['Pollutant Levels'])



df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)


df_pollutants = df['Pollutants'].apply(pd.Series)



>>> df = pd.concat([df['Station ID'], df['Pollutants'].apply(pd.Series)], axis=1)
>>> print(df)
   Station ID    a    b   c
0        8809   46    3  12
1        8810   36    5   8
2        8811  NaN    2   7
3        8812  NaN  NaN  11
4        8813   82  NaN  15


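from typing import Dict

import pandas as pd


def expand_dataframe(dw: pd.DataFrame, column_to_expand: str) -> pd.DataFrame:
    """
    dw: DataFrame with some column which contains a dict to expand
        in columns
    column_to_expand: String with column name of dw
    """

    def convert_to_dict(sequence: str) -> Dict:
        import json
        # turn a Python-style dict string into JSON by swapping the quotes
        json_acceptable_string = sequence.replace("'", "\"")
        return json.loads(json_acceptable_string)

    # The second concat operand was cut off in the source; applying
    # convert_to_dict and then pd.Series is an assumed reconstruction.
    expanded_dataframe = pd.concat([dw.drop([column_to_expand], axis=1),
                                    dw[column_to_expand]
                                    .apply(convert_to_dict)
                                    .apply(pd.Series)],
                                   axis=1)
    return expanded_dataframe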

my_df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['my_col'])

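... would have parsed the dict properly (putting each dict key into a separate df column, and the values into df rows), so the dicts would not get squashed into a single column in the first place.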
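
A minimal sketch of that idea, assuming the source data is available as a dict of dicts before it ever becomes a single column. The contents of my_dict are hypothetical, and the columns argument is left out here so that every inner key gets its own column.

```python
import pandas as pd

# Hypothetical input: a dict keyed by station ID whose values are dicts.
my_dict = {
    8809: {"a": "46", "b": "3", "c": "12"},
    8811: {"b": "2", "c": "7"},
    8812: {"c": "11"},
}

# orient="index" turns the outer keys into the row index and the inner
# keys into columns; keys missing from a row are filled with NaN.
my_df = pd.DataFrame.from_dict(my_dict, orient="index")
print(my_df)
```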

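According to the timing analysis performed by Shijith in this answer, the fastest method to normalize a column of flat, one-level dicts is: df.join(pd.DataFrame(df.pop('Pollutants').values.tolist())). It will not resolve the other issues with columns of lists or dicts that are addressed below, such as rows with NaN or nested dicts.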
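
The one-liner above can be sketched step by step as follows. The three-row frame is illustrative only, and the approach assumes flat dicts, no NaN rows, and a default 0..n-1 index.

```python
import pandas as pd

df = pd.DataFrame({
    "Station ID": [8809, 8811, 8813],
    "Pollutants": [
        {"a": "46", "b": "3", "c": "12"},
        {"b": "2", "c": "7"},
        {"a": "82", "c": "15"},
    ],
})

# pop removes the column from df in place and returns it as a Series;
# the list of dicts is then handed to the plain DataFrame constructor.
expanded = pd.DataFrame(df.pop("Pollutants").values.tolist())

# join aligns on the index, which is 0..n-1 on both sides here.
result = df.join(expanded)
print(result)
```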

pd.json_normalize(df.Pollutants) is significantly faster than df.Pollutants.apply(pd.Series). See the %%timeit below. For 1M rows, .json_normalize is 47 times faster than .apply.

Whether reading data from a file, or from an object returned by a database or API, it may not be clear whether the dict column has dict or str type. If the dictionaries in the column are str type, they must be converted back to a dict type, using ast.literal_eval or json.loads(…).

Use pd.json_normalize to convert the dicts, with keys as headers and values for rows. There are additional parameters (e.g. record_path & meta) for dealing with nested dicts.

Use pandas.DataFrame.join to combine the original DataFrame, df, with the columns created using pd.json_normalize. If the index isn't integers (as in the example), first use df.reset_index() to get an index of integers, before doing the normalize and join.

pandas.DataFrame.pop is used to remove the specified column from the existing dataframe. This removes the need to drop the column later, using pandas.DataFrame.drop.

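Note: if the column has any NaN, they must be filled with an empty dict: df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index}). If the 'Pollutants' column is strings, use '{}' instead. Also see "How to json_normalize a column with NaNs".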

import pandas as pd
from ast import literal_eval
import numpy as np

data = {'Station ID': [8809, 8810, 8811, 8812, 8813, 8814],
        'Pollutants': ['{"a": "46", "b": "3", "c": "12"}', '{"a": "36", "b": "5", "c": "8"}', '{"b": "2", "c": "7"}', '{"c": "11"}', '{"a": "82", "c": "15"}', np.nan]}

df = pd.DataFrame(data)

# display(df)
   Station ID                        Pollutants
0        8809  {"a": "46", "b": "3", "c": "12"}
1        8810   {"a": "36", "b": "5", "c": "8"}
2        8811              {"b": "2", "c": "7"}
3        8812                       {"c": "11"}
4        8813            {"a": "82", "c": "15"}
5        8814                               NaN

# check the type of the first value in Pollutants
>>> print(type(df.iloc[0, 1]))
<class 'str'>

# replace NaN with '{}' if the column is strings, otherwise replace with {}
df.Pollutants = df.Pollutants.fillna('{}')  # if the NaN is in a column of strings
# df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})  # if the column is not strings

# Convert the column of stringified dicts to dicts
# skip this line, if the column contains dicts
df.Pollutants = df.Pollutants.apply(literal_eval)

# reset the index if the index is not unique integers from 0 to n-1
# df.reset_index(inplace=True)  # uncomment if needed

# remove and normalize the column of dictionaries, and join the result to df
df = df.join(pd.json_normalize(df.pop('Pollutants')))

# display(df)
   Station ID    a    b    c
0        8809   46    3   12
1        8810   36    5    8
2        8811  NaN    2    7
3        8812  NaN  NaN   11
4        8813   82  NaN   15
5        8814  NaN  NaN  NaN
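
The index caveat mentioned earlier (resetting a non-integer index before the join) can be sketched like this. The station labels are made up for illustration, and the column is passed to json_normalize as a plain list of dicts.

```python
import pandas as pd

# Hypothetical frame whose index is labels rather than 0..n-1.
df = pd.DataFrame(
    {"Pollutants": [{"a": "46", "c": "12"}, {"b": "2"}]},
    index=["north", "south"],
)

# The normalized frame gets a fresh 0..n-1 index, so joining it directly
# to a label-indexed frame would not line the rows up. Reset first.
df = df.reset_index()  # the old index becomes a column named "index"
df = df.join(pd.json_normalize(df.pop("Pollutants").tolist()))

# Optionally restore the original labels as the index.
df = df.set_index("index")
print(df)
```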


# dataframe with 1M rows
dfb = pd.concat([df]*20000).reset_index(drop=True)

46.9 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

pd.concat([dfb.drop(columns=['Pollutants']), dfb.Pollutants.apply(pd.Series)], axis=1)
7.75 s ± 52.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


>>> df

   Station ID                        Pollutants
0        8809  {"a": "46", "b": "3", "c": "12"}
1        8810   {"a": "36", "b": "5", "c": "8"}
2        8811              {"b": "2", "c": "7"}
3        8812                       {"c": "11"}
4        8813            {"a": "82", "c": "15"}


>>> df = pd.concat([df]*2000000).reset_index(drop=True)
>>> print(df.shape)
(10000000, 2)
def apply_drop(df):
    return df.join(df['Pollutants'].apply(pd.Series)).drop('Pollutants', axis=1)  

def json_normalise_drop(df):
    return df.join(pd.json_normalize(df.Pollutants)).drop('Pollutants', axis=1)  

def tolist_drop(df):
    return df.join(pd.DataFrame(df['Pollutants'].tolist())).drop('Pollutants', axis=1)  

def vlues_tolist_drop(df):
    return df.join(pd.DataFrame(df['Pollutants'].values.tolist())).drop('Pollutants', axis=1)  

def pop_tolist(df):
    return df.join(pd.DataFrame(df.pop('Pollutants').tolist()))  

def pop_values_tolist(df):
    return df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))

>>> %timeit apply_drop(df.copy())
1 loop, best of 3: 53min 20s per loop
>>> %timeit json_normalise_drop(df.copy())
1 loop, best of 3: 54.9 s per loop
>>> %timeit tolist_drop(df.copy())
1 loop, best of 3: 6.62 s per loop
>>> %timeit vlues_tolist_drop(df.copy())
1 loop, best of 3: 6.63 s per loop
>>> %timeit pop_tolist(df.copy())
1 loop, best of 3: 5.99 s per loop
>>> %timeit pop_values_tolist(df.copy())
1 loop, best of 3: 5.94 s per loop
| Function            | Time      |
|---------------------|-----------|
| apply_drop          | 53min 20s |
| json_normalise_drop |    54.9 s |
| tolist_drop         |    6.62 s |
| vlues_tolist_drop   |    6.63 s |
| pop_tolist          |    5.99 s |
| pop_values_tolist   |    5.94 s |

df.join(pd.DataFrame(df.pop('Pollutants').values.tolist())) is the fastest.








df['val'].apply(pd.Series) is extremely slow for large N, because pandas constructs a Series object for each row and then builds a DataFrame from them. For larger N the run time grows to the order of minutes or hours.

pd.json_normalize(df['val']) is slower simply because json_normalize is meant to work with much more complex input data, particularly deeply nested JSON with multiple record paths and metadata. Here we have simple flat dicts, for which pd.DataFrame suffices, so use that if your dicts are flat.

Some answers suggest df.pop('val').values.tolist() or df.pop('val').to_numpy().tolist(). I don't think it makes much of a difference whether you listify the Series or the NumPy array. Listifying the Series directly is one operation less and really isn't slower, so I'd recommend skipping the intermediate NumPy array.
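
To make the comparison concrete, here is a small sketch showing that the three approaches give the same table for flat dicts, so the choice comes down to speed. The column name val follows the text above; the three-row frame is illustrative and no timings are claimed for it.

```python
import pandas as pd

df = pd.DataFrame({"val": [{"c": 1}, {"d": 3}, {"c": 5, "d": 6}]})

# One Series is built per row and then combined: the slow path.
via_apply = df["val"].apply(pd.Series)

# Designed for nested JSON; works on flat dicts but does extra work.
via_normalize = pd.json_normalize(df["val"].tolist())

# Plain constructor on a list of flat dicts: the cheapest option.
via_tolist = pd.DataFrame(df["val"].tolist())

for name, frame in [("apply", via_apply),
                    ("json_normalize", via_normalize),
                    ("tolist", via_tolist)]:
    print(name)
    print(frame)
```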