我试图准确强调两个数据框之间的变化。
假设我有两个 Python Pandas 数据框:
"StudentRoster Jan-1":
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.11 False Graduated
113 Zoe 4.12 True
"StudentRoster Jan-2":
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.21 False Graduated
113 Zoe 4.12 False On vacation
我的目标是输出一个 HTML 表格:
输出具有相同 OLD 和 NEW 值的行(理想情况下输出到 HTML 表中),因此消费者可以清楚地看到两个数据帧之间发生了什么变化:
"StudentRoster Difference Jan-1 - Jan-2":
id Name score isEnrolled Comment
112 Nick was 1.11| now 1.21 False Graduated
113 Zoe 4.12 was True | now False was "" | now "On vacation"
我想我可以逐行逐列比较,但是有没有更简单的方法?
最佳答案
第一部分和Constantine类似,可以得到空行的 bool 值*:
In [21]: ne = (df1 != df2).any(1)
In [22]: ne
Out[22]:
0 False
1 True
2 True
dtype: bool
然后我们可以看到哪些条目发生了变化:
In [23]: ne_stacked = (df1 != df2).stack()
In [24]: changed = ne_stacked[ne_stacked]
In [25]: changed.index.names = ['id', 'col']
In [26]: changed
Out[26]:
id col
1 score True
2 isEnrolled True
Comment True
dtype: bool
这里第一个条目是索引,第二个条目是已更改的列。
In [27]: difference_locations = np.where(df1 != df2)
In [28]: changed_from = df1.values[difference_locations]
In [29]: changed_to = df2.values[difference_locations]
In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
from to
id col
1 score 1.11 1.21
2 isEnrolled True False
Comment None On vacation
* 注意:重要的是 df1
和 df2
在这里共享相同的索引。为了克服这种歧义,您可以确保只使用 df1.index & df2.index
查看共享标签,但我想我将把它留作练习。
https://stackoverflow.com/questions/17095101/