Comparing Data Frames

Visualize differences between two dataframes
Published

January 29, 2021

I want to be able to show the differences between two data frames in a clear way. It turns out to be a little involved.

What I want is to be able to show the exact row and columns that differ, showing both values from the two data frames.

Lets start by creating some data frames.

Code
import pandas as pd
import numpy as np

one = pd.DataFrame([
    {"name": "orlando", "animal": "cat", "age": 3},
    {"name": "rupert", "animal": "bear", "age": 4},
    {"name": "rollo", "animal": "dog", "age": 5},
])
two = pd.DataFrame([
    {"name": "orlando", "animal": "cat", "age": 4},
    {"name": "ruperts", "animal": "bear", "age": 4},
    {"name": "rollo", "animal": "dogs", "age": 5},
])

So each row is slightly different, I want to just show the differences. The start is to flatten the dataframe so it is index, column name, column value.

Code
(
    one
        .melt(var_name="column", ignore_index=False)
        .reset_index()
        .sort_values(by=["index", "column"])
        .set_index(["index", "column"])
)
value
index column
0 age 3
animal cat
name orlando
1 age 4
animal bear
name rupert
2 age 5
animal dog
name rollo

This looks great. Lets try comparing them.

Code
def flatten(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df
            .melt(var_name="column", ignore_index=False)
            .reset_index()
            .sort_values(by=["index", "column"])
            .set_index(["index", "column"])
    )
Code
one_flat = flatten(one)
two_flat = flatten(two)

one_flat == two_flat
value
index column
0 age False
animal True
name True
1 age True
animal True
name False
2 age True
animal False
name True
Code
def find_differences(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    left_flat = flatten(left)
    right_flat = flatten(right)
    differences = left_flat != right_flat
    
    return (
        left_flat[differences].dropna()
            .merge(
                right_flat[differences].dropna(),
                left_index=True, 
                right_index=True
            )
            .rename(columns={"value_x": "left", "value_y": "right"})
    )

find_differences(one, two)
left right
index column
0 age 3 4
1 name rupert ruperts
2 animal dog dogs

This looks great.

Now I need to see how it handles extra rows.

Code
one[:2]
name animal age
0 orlando cat 3
1 rupert bear 4
Code
find_differences(one[:2], two)
ValueError: Can only compare identically-labeled DataFrame objects

So I need to ensure that the index matches for both dataframes. I wonder how I can do that.

I think I need to find the rows that are not in the other index and handle them separately.

Out of interest, for two dataframes with an equal index there is a built in compare method:

Code
one.compare(two, align_axis=0)
name animal age
0 self NaN NaN 3.0
other NaN NaN 4.0
1 self rupert NaN NaN
other ruperts NaN NaN
2 self NaN dog NaN
other NaN dogs NaN
Code
def find_differences(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    left_extended = left.append(right[~right.index.isin(left.index)])
    left_extended.loc[~left_extended.index.isin(left.index), :] = np.nan
    right_extended = right.append(left[~left.index.isin(right.index)])
    right_extended.loc[~right_extended.index.isin(right.index), :] = np.nan
    
    left_flat = flatten(left_extended)
    right_flat = flatten(right_extended)
    
    return left_flat.compare(right_flat)
Code
find_differences(one[:2], two)
value
self other
index column
0 age 3.0 4.0
1 name rupert ruperts
2 age NaN 5.0
animal NaN dogs
name NaN rollo
Code
find_differences(one, two[:2])
value
self other
index column
0 age 3.0 4.0
1 name rupert ruperts
2 age 5.0 NaN
animal dog NaN
name rollo NaN
Code
find_differences(one, one)
index column
Code
find_differences(one, one).empty
True

I’m much happier with this than my previous approach which was from this stack overflow question. For me that approach failed at the (df1 != df2).stack() expression.