I want to be able to show the differences between two data frames in a clear way. It turns out to be a little involved.
What I want is to be able to show the exact row and columns that differ, showing both values from the two data frames.
Lets start by creating some data frames.
Code
import pandas as pd
import numpy as np
one = pd.DataFrame([
{"name" : "orlando" , "animal" : "cat" , "age" : 3 },
{"name" : "rupert" , "animal" : "bear" , "age" : 4 },
{"name" : "rollo" , "animal" : "dog" , "age" : 5 },
])
two = pd.DataFrame([
{"name" : "orlando" , "animal" : "cat" , "age" : 4 },
{"name" : "ruperts" , "animal" : "bear" , "age" : 4 },
{"name" : "rollo" , "animal" : "dogs" , "age" : 5 },
])
So each row is slightly different, I want to just show the differences. The start is to flatten the dataframe so it is index, column name, column value.
Code
(
one
.melt(var_name= "column" , ignore_index= False )
.reset_index()
.sort_values(by= ["index" , "column" ])
.set_index(["index" , "column" ])
)
value
index
column
0
age
3
animal
cat
name
orlando
1
age
4
animal
bear
name
rupert
2
age
5
animal
dog
name
rollo
This looks great. Lets try comparing them.
Code
def flatten(df: pd.DataFrame) -> pd.DataFrame:
return (
df
.melt(var_name= "column" , ignore_index= False )
.reset_index()
.sort_values(by= ["index" , "column" ])
.set_index(["index" , "column" ])
)
Code
one_flat = flatten(one)
two_flat = flatten(two)
one_flat == two_flat
value
index
column
0
age
False
animal
True
name
True
1
age
True
animal
True
name
False
2
age
True
animal
False
name
True
Code
def find_differences(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
left_flat = flatten(left)
right_flat = flatten(right)
differences = left_flat != right_flat
return (
left_flat[differences].dropna()
.merge(
right_flat[differences].dropna(),
left_index= True ,
right_index= True
)
.rename(columns= {"value_x" : "left" , "value_y" : "right" })
)
find_differences(one, two)
left
right
index
column
0
age
3
4
1
name
rupert
ruperts
2
animal
dog
dogs
This looks great.
Now I need to see how it handles extra rows.
Code
name
animal
age
0
orlando
cat
3
1
rupert
bear
4
Code
find_differences(one[:2 ], two)
ValueError: Can only compare identically-labeled DataFrame objects
So I need to ensure that the index matches for both dataframes. I wonder how I can do that.
I think I need to find the rows that are not in the other index and handle them separately.
Out of interest, for two dataframes with an equal index there is a built in compare
method:
Code
one.compare(two, align_axis= 0 )
name
animal
age
0
self
NaN
NaN
3.0
other
NaN
NaN
4.0
1
self
rupert
NaN
NaN
other
ruperts
NaN
NaN
2
self
NaN
dog
NaN
other
NaN
dogs
NaN
Code
def find_differences(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
left_extended = left.append(right[~ right.index.isin(left.index)])
left_extended.loc[~ left_extended.index.isin(left.index), :] = np.nan
right_extended = right.append(left[~ left.index.isin(right.index)])
right_extended.loc[~ right_extended.index.isin(right.index), :] = np.nan
left_flat = flatten(left_extended)
right_flat = flatten(right_extended)
return left_flat.compare(right_flat)
Code
find_differences(one[:2 ], two)
value
self
other
index
column
0
age
3.0
4.0
1
name
rupert
ruperts
2
age
NaN
5.0
animal
NaN
dogs
name
NaN
rollo
Code
find_differences(one, two[:2 ])
value
self
other
index
column
0
age
3.0
4.0
1
name
rupert
ruperts
2
age
5.0
NaN
animal
dog
NaN
name
rollo
NaN
Code
find_differences(one, one)
Code
find_differences(one, one).empty
I’m much happier with this than my previous approach which was from this stack overflow question . For me that approach failed at the (df1 != df2).stack()
expression.