我有一个pandas数据框架,其中一列文本字符串包含逗号分隔的值。我想拆分每个CSV字段,并为每个条目创建一个新行(假设CSV是干净的,只需要在','上拆分)。例如,a应该变成b:

In [7]: a
Out[7]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [8]: b
Out[8]: 
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

到目前为止,我已经尝试了各种简单的函数,但是.apply方法在轴上使用时似乎只接受一行作为返回值,而且我不能让.transform工作。任何建议都将不胜感激!

示例数据:

from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
               {'var1': 'b', 'var2': 1},
               {'var1': 'c', 'var2': 1},
               {'var1': 'd', 'var2': 2},
               {'var1': 'e', 'var2': 2},
               {'var1': 'f', 'var2': 2}])

我知道这不会起作用,因为我们通过numpy丢失了DataFrame元数据,但它应该给你一个我试图做的感觉:

def fun(row):
    letters = row['var1']
    letters = letters.split(',')
    out = np.array([row] * len(letters))
    out['var1'] = letters
a['idx'] = range(a.shape[0])
z = a.groupby('idx')
z.transform(fun)

当前回答

我提出了一个具有任意列数的数据框架的解决方案(同时一次仍然只分离一列的条目)。

def splitDataFrameList(df,target_column,separator):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split

    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row,row_accumulator,target_column,separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
    new_df = pandas.DataFrame(new_rows)
    return new_df

其他回答

使用赋值和爆炸的一行代码:

    col1  col2
0  a,b,c     1
1  d,e,f     2

df.assign(col1 = df.col1.str.split(',')).explode('col1', ignore_index=True)

输出:

  col1  col2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

这是我为这个常见任务写的函数。它比Series/stack方法更有效。列的顺序和名称被保留。

def tidy_split(df, column, sep='|', keep=False):
    """
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.

    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as it's own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

有了这个函数,原来的问题就像这样简单:

tidy_split(a, 'var1', sep=',')

Try:

vals = np.array(a.var1.str.split(",").values.tolist())    
var = np.repeat(a.var2, vals.shape[1])

out = pd.DataFrame(np.column_stack((var, vals.ravel())), columns=a.columns)
display(out)

      var1 var2
    0   1   a
    1   1   b
    2   1   c
    3   2   d
    4   2   e
    5   2   f

这样怎么样:

In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))              
                    for _, row in a.iterrows()]).reset_index()
Out[55]: 
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

然后你只需要重命名列

我提出了一个具有任意列数的数据框架的解决方案(同时一次仍然只分离一列的条目)。

def splitDataFrameList(df,target_column,separator):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split

    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row,row_accumulator,target_column,separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
    new_df = pandas.DataFrame(new_rows)
    return new_df