Matthew’s Blog - Comparing Data Frames

I want to be able to show the differences between two data frames in a clear way. It turns out to be a little involved.

What I want is to be able to show the exact row and columns that differ, showing both values from the two data frames.

Lets start by creating some data frames.

Code

import pandas as pd
import numpy as np

one = pd.DataFrame([
    {"name": "orlando", "animal": "cat", "age": 3},
    {"name": "rupert", "animal": "bear", "age": 4},
    {"name": "rollo", "animal": "dog", "age": 5},
])
two = pd.DataFrame([
    {"name": "orlando", "animal": "cat", "age": 4},
    {"name": "ruperts", "animal": "bear", "age": 4},
    {"name": "rollo", "animal": "dogs", "age": 5},
])

So each row is slightly different, I want to just show the differences. The start is to flatten the dataframe so it is index, column name, column value.

Code

(
    one
        .melt(var_name="column", ignore_index=False)
        .reset_index()
        .sort_values(by=["index", "column"])
        .set_index(["index", "column"])
)

		value
index	column
0	age	3
	animal	cat
	name	orlando
1	age	4
	animal	bear
	name	rupert
2	age	5
	animal	dog
	name	rollo

This looks great. Lets try comparing them.

Code

def flatten(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df
            .melt(var_name="column", ignore_index=False)
            .reset_index()
            .sort_values(by=["index", "column"])
            .set_index(["index", "column"])
    )

Code

one_flat = flatten(one)
two_flat = flatten(two)

one_flat == two_flat

		value
index	column
0	age	False
	animal	True
	name	True
1	age	True
	animal	True
	name	False
2	age	True
	animal	False
	name	True

Code

def find_differences(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    left_flat = flatten(left)
    right_flat = flatten(right)
    differences = left_flat != right_flat
    
    return (
        left_flat[differences].dropna()
            .merge(
                right_flat[differences].dropna(),
                left_index=True, 
                right_index=True
            )
            .rename(columns={"value_x": "left", "value_y": "right"})
    )

find_differences(one, two)

		left	right
index	column
0	age	3	4
1	name	rupert	ruperts
2	animal	dog	dogs

This looks great.

Now I need to see how it handles extra rows.

Code

one[:2]

	name	animal	age
0	orlando	cat	3
1	rupert	bear	4

Code

find_differences(one[:2], two)

ValueError: Can only compare identically-labeled DataFrame objects

So I need to ensure that the index matches for both dataframes. I wonder how I can do that.

I think I need to find the rows that are not in the other index and handle them separately.

Out of interest, for two dataframes with an equal index there is a built in compare method:

Code

one.compare(two, align_axis=0)

		name	animal	age
0	self	NaN	NaN	3.0
0	other	NaN	NaN	4.0
1	self	rupert	NaN	NaN
1	other	ruperts	NaN	NaN
2	self	NaN	dog	NaN
2	other	NaN	dogs	NaN

Code

def find_differences(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    left_extended = left.append(right[~right.index.isin(left.index)])
    left_extended.loc[~left_extended.index.isin(left.index), :] = np.nan
    right_extended = right.append(left[~left.index.isin(right.index)])
    right_extended.loc[~right_extended.index.isin(right.index), :] = np.nan
    
    left_flat = flatten(left_extended)
    right_flat = flatten(right_extended)
    
    return left_flat.compare(right_flat)

Code

find_differences(one[:2], two)

		value
		self	other
index	column
0	age	3.0	4.0
1	name	rupert	ruperts
2	age	NaN	5.0
	animal	NaN	dogs
	name	NaN	rollo

Code

find_differences(one, two[:2])

		value
		self	other
index	column
0	age	3.0	4.0
1	name	rupert	ruperts
2	age	5.0	NaN
	animal	dog	NaN
	name	rollo	NaN

Code

find_differences(one, one)

index	column

Code

find_differences(one, one).empty

True

I’m much happier with this than my previous approach which was from this stack overflow question. For me that approach failed at the (df1 != df2).stack() expression.