我试图突出显示两个数据帧之间发生了什么变化。
假设我有两个Python Pandas数据框架:
"StudentRoster Jan-1":
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.11 False Graduated
113 Zoe 4.12 True
"StudentRoster Jan-2":
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.21 False Graduated
113 Zoe 4.12 False On vacation
我的目标是输出一个HTML表,它:
标识已更改的行(可以是int, float, boolean,字符串)
输出具有相同的OLD和NEW值的行(理想情况下是HTML表),以便消费者可以清楚地看到两个数据框架之间发生了什么变化:
“StudentRoster差异Jan-1 - Jan-2”:
id名称分数isregistered评论
尼克是1.11|现在1.21假毕业
113佐伊4.12是真的|现在是假的|现在“度假”
我想我可以逐行逐列比较,但有没有更简单的方法?
这个答案只是扩展了@Andy Hayden的答案,使其能够适应数值字段为nan的情况,并将其包装成一个函数。
import pandas as pd
import numpy as np
def diff_pd(df1, df2):
"""Identify differences between two pandas DataFrames"""
assert (df1.columns == df2.columns).all(), \
"DataFrame column names are different"
if any(df1.dtypes != df2.dtypes):
"Data Types are different, trying to convert"
df2 = df2.astype(df1.dtypes)
if df1.equals(df2):
return None
else:
# need to account for np.nan != np.nan returning True
diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
ne_stacked = diff_mask.stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ['id', 'col']
difference_locations = np.where(diff_mask)
changed_from = df1.values[difference_locations]
changed_to = df2.values[difference_locations]
return pd.DataFrame({'from': changed_from, 'to': changed_to},
index=changed.index)
所以对于你的数据(稍微编辑一下,在分数列中有一个NaN):
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
DF1 = StringIO("""id Name score isEnrolled Comment
111 Jack 2.17 True "He was late to class"
112 Nick 1.11 False "Graduated"
113 Zoe NaN True " "
""")
DF2 = StringIO("""id Name score isEnrolled Comment
111 Jack 2.17 True "He was late to class"
112 Nick 1.21 False "Graduated"
113 Zoe NaN False "On vacation" """)
df1 = pd.read_table(DF1, sep='\s+', index_col='id')
df2 = pd.read_table(DF2, sep='\s+', index_col='id')
diff_pd(df1, df2)
输出:
from to
id col
112 score 1.11 1.21
113 isEnrolled True False
Comment On vacation
下面是另一种使用选择和合并的方法:
In [6]: # first lets create some dummy dataframes with some column(s) different
...: df1 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': range(20,25)})
...: df2 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': [20] + list(range(101,105))})
In [7]: df1
Out[7]:
a b c
0 -5 10 20
1 -4 11 21
2 -3 12 22
3 -2 13 23
4 -1 14 24
In [8]: df2
Out[8]:
a b c
0 -5 10 20
1 -4 11 101
2 -3 12 102
3 -2 13 103
4 -1 14 104
In [10]: # make condition over the columns you want to comapre
...: condition = df1['c'] != df2['c']
...:
...: # select rows from each dataframe where the condition holds
...: diff1 = df1[condition]
...: diff2 = df2[condition]
In [11]: # merge the selected rows (dataframes) with some suffixes (optional)
...: diff1.merge(diff2, on=['a','b'], suffixes=('_before', '_after'))
Out[11]:
a b c_before c_after
0 -4 11 21 101
1 -3 12 22 102
2 -2 13 23 103
3 -1 14 24 104
以下是来自Jupyter的截图:
pandas >= 1.1: DataFrame.compare
使用pandas 1.1,基本上可以用一个函数调用复制Ted Petrou的输出。例子摘自文档:
pd.__version__
# '1.1.0'
df1.compare(df2)
score isEnrolled Comment
self other self other self other
1 1.11 1.21 NaN NaN NaN NaN
2 NaN NaN 1.0 0.0 NaN On vacation
这里,“self”指的是LHS数据帧,而“other”指的是RHS数据帧。默认情况下,相等的值将被nan替换,因此您可以只关注差异。如果您想显示相同的值,请使用
df1.compare(df2, keep_equal=True, keep_shape=True)
score isEnrolled Comment
self other self other self other
1 1.11 1.21 False False Graduated Graduated
2 4.12 4.12 True False NaN On vacation
你也可以使用align_axis改变比较轴:
df1.compare(df2, align_axis='index')
score isEnrolled Comment
1 self 1.11 NaN NaN
other 1.21 NaN NaN
2 self NaN 1.0 NaN
other NaN 0.0 On vacation
这是逐行比较值,而不是逐列比较值。
import pandas as pd
import io
texts = ['''\
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.11 False Graduated
113 Zoe 4.12 True ''',
'''\
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.21 False Graduated
113 Zoe 4.12 False On vacation''']
df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,21,20])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,21,20])
df = pd.concat([df1,df2])
print(df)
# id Name score isEnrolled Comment
# 0 111 Jack 2.17 True He was late to class
# 1 112 Nick 1.11 False Graduated
# 2 113 Zoe 4.12 True NaN
# 0 111 Jack 2.17 True He was late to class
# 1 112 Nick 1.21 False Graduated
# 2 113 Zoe 4.12 False On vacation
df.set_index(['id', 'Name'], inplace=True)
print(df)
# score isEnrolled Comment
# id Name
# 111 Jack 2.17 True He was late to class
# 112 Nick 1.11 False Graduated
# 113 Zoe 4.12 True NaN
# 111 Jack 2.17 True He was late to class
# 112 Nick 1.21 False Graduated
# 113 Zoe 4.12 False On vacation
def report_diff(x):
return x[0] if x[0] == x[1] else '{} | {}'.format(*x)
changes = df.groupby(level=['id', 'Name']).agg(report_diff)
print(changes)
打印
score isEnrolled Comment
id Name
111 Jack 2.17 True He was late to class
112 Nick 1.11 | 1.21 False Graduated
113 Zoe 4.12 True | False nan | On vacation