Evaluating different visualization techniques for this blog
Published
March 10, 2023
I want to learn Bayesian networks to contrast them with the TabPFN (Hollmann et al. 2022) model. Being able to play with a network interactively would help me understand them much better. It has always been possible to embed working Jupyter widgets into Quarto posts, but this will be my first attempt at doing so.
Hollmann, Noah, Samuel Müller, Katharina Eggensperger, and Frank Hutter. 2022. “TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second.” arXiv. https://doi.org/10.48550/ARXIV.2207.01848.
This post is concerned with finding the right way to visualize Bayesian networks. That will require being able to visualize tabular data and set multiple variables into fixed states.
Interactive Visualizations
To really get to grips with this I want to be able to see how the probabilities change as certain predicates are locked into a state. This is because a lot of Bayesian probability is of the form \(P(X | Y)\) (probability of X given Y). If I cannot create such a fixed state then I’m not really learning this, and it should be fun to make the blog more interactive.
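To make the \(P(X | Y)\) idea concrete, here is a tiny self-contained sketch (toy numbers of my own, not from any dataset in this post) that computes a conditional distribution from a joint one:

```python
# A toy joint distribution over two binary variables X and Y,
# stored as {(x, y): probability}. Illustrative numbers only.
joint = {
    (True, True): 0.3,
    (True, False): 0.1,
    (False, True): 0.2,
    (False, False): 0.4,
}

def conditional(joint, y):
    """P(X | Y=y) = P(X, Y=y) / P(Y=y)."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return {x: joint[(x, y)] / p_y for x in (True, False)}

print(conditional(joint, True))  # P(X=True | Y=True) = 0.3 / 0.5 = 0.6
```

Locking a predicate into a state is exactly the `Y=y` step here: filter to the matching rows, then renormalize.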
Python Widget
This will create a button which is coupled to a label. Changing the state of the button updates the label.
It’s written with ipywidgets, which I hope are supported; the Quarto website states that Jupyter Widgets are.
```python
import ipywidgets as widgets

button = widgets.ToggleButton(
    value=False,
    description="Click me",
    disabled=False,
    button_style="",  # 'success', 'info', 'warning', 'danger' or ''
    tooltip="Description",
    icon="check",  # (FontAwesome names without the `fa-` prefix)
)
label = widgets.Label(value="The button is not selected")

def observe_state_change(event: dict) -> None:
    if event["name"] != "value":
        return
    if event["new"]:
        label.value = "The button is selected"
    else:
        label.value = "The button is not selected"

button.observe(observe_state_change, type="change")
widgets.VBox([label, button])
```
As you can probably guess that didn’t work.
This would have been the best option, as it lets me write arbitrary Python code. Quarto is not translating arbitrary Python into JavaScript, and presumably the Jupyter Widgets that are linked are specific components that have already been translated.
Plotly
Another interactable component is a plotly graph. I would like to be able to choose which nodes of the Bayesian network to lock in. Let’s see if it works at all first.
```python
import plotly.express as px
import plotly.io as pio

df = px.data.iris()
fig = px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
    color="species",
    marginal_y="violin",
    marginal_x="box",
    trendline="ols",
    template="simple_white",
)
fig.show()
```
After a bit of fiddling this has now worked, both on the blog and in Jupyter. Ultimately I want to be able to lock in probabilities, so displaying a table with filters would be sufficient at this point.
I’m going to use the XOR truth table as my dataset.
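The construction of that table isn’t shown in the post, so here is a minimal sketch of how the XOR truth table could be built with pandas (my own code; the column names are my choice):

```python
import pandas as pd

# XOR truth table: the output is 1 exactly when the two inputs differ
xor = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]})
xor["output"] = xor["a"] ^ xor["b"]
print(xor)
```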
This is great because it suggests that I can use arbitrary python code to calculate the result for a given filter. I think that the code is run once to establish the values which are then encoded in the resulting javascript.
The next thing is to be able to filter by more than one condition. It turns out that this is quite complex, as one filter does not reference another. I’ve found another Stack Overflow answer which has a lot of code to achieve this. I’m going to review the code carefully and then try to create a pattern out of it that I can use going forward.
I must say that for this the Plotly documentation is basically non-existent.
```python
# Imports
import plotly.graph_objs as go
import pandas as pd
import numpy as np

# source data
df = pd.DataFrame(
    {
        0: {"num": 1, "label": "A", "color": "red", "value": 0.4},
        1: {"num": 2, "label": "A", "color": "blue", "value": 0.2},
        2: {"num": 3, "label": "A", "color": "green", "value": 0.3},
        3: {"num": 4, "label": "A", "color": "red", "value": 0.6},
        4: {"num": 5, "label": "A", "color": "blue", "value": 0.7},
        5: {"num": 6, "label": "A", "color": "green", "value": 0.4},
        6: {"num": 7, "label": "B", "color": "blue", "value": 0.2},
        7: {"num": 8, "label": "B", "color": "green", "value": 0.4},
        8: {"num": 9, "label": "B", "color": "red", "value": 0.4},
        9: {"num": 10, "label": "B", "color": "green", "value": 0.2},
        10: {"num": 11, "label": "C", "color": "red", "value": 0.1},
        11: {"num": 12, "label": "C", "color": "blue", "value": 0.3},
        12: {"num": 13, "label": "D", "color": "red", "value": 0.8},
        13: {"num": 14, "label": "D", "color": "blue", "value": 0.4},
        14: {"num": 15, "label": "D", "color": "green", "value": 0.6},
        15: {"num": 16, "label": "D", "color": "yellow", "value": 0.5},
        16: {"num": 17, "label": "E", "color": "purple", "value": 0.68},
    }
).T
df_input = df.copy()

# split df by labels
labels = df["label"].unique().tolist()
dates = df["num"].unique().tolist()

# dataframe collection grouped by labels
dfs = {}
for label in labels:
    dfs[label] = pd.pivot_table(
        df[df["label"] == label],
        values="value",
        index=["num"],
        columns=["color"],
        aggfunc=np.sum,
    )

# find row and column unions
common_cols = []
common_rows = []
for df in dfs.keys():
    common_cols = sorted(list(set().union(common_cols, list(dfs[df]))))
    common_rows = sorted(list(set().union(common_rows, list(dfs[df].index))))

# find dimensionally common dataframe
df_common = pd.DataFrame(np.nan, index=common_rows, columns=common_cols)

# reshape each dfs[df] into common dimensions
dfc = {}
for df_item in dfs:
    df1 = dfs[df_item].copy()
    s = df_common.combine_first(df1)
    df_reshaped = df1.reindex_like(s)
    dfc[df_item] = df_reshaped

# plotly start
fig = go.Figure()

# one trace for each column per dataframe
for col in common_cols:
    fig.add_trace(
        go.Scatter(
            x=dates,
            visible=True,
            marker=dict(size=12, line=dict(width=2)),
            marker_symbol="diamond",
            name=col,
        )
    )

# menu setup
updatemenu = []

# buttons for menu 1, names
buttons = []

# build argVals for buttons and create buttons
for df in dfc.keys():
    argList = []
    for col in dfc[df]:
        argList.append(dfc[df][col].values)
    argVals = [{"y": argList}]
    buttons.append(dict(method="update", label=df, visible=True, args=argVals))

# buttons for menu 2, colors
b2_labels = common_cols

# matrix to feed all visible arguments for all traces
# so that they can be shown or hidden by choice
b2_show = [list(b) for b in [e == 1 for e in np.eye(len(b2_labels))]]
buttons2 = []
buttons2.append(
    {
        "method": "update",
        "label": "All",
        "args": [{"visible": [True] * len(common_cols)}],
    }
)

# create buttons to show or hide
for i in range(0, len(b2_labels)):
    buttons2.append(
        dict(method="update", label=b2_labels[i], args=[{"visible": b2_show[i]}])
    )

# add option for button two to hide all
buttons2.append(
    dict(method="update", label="None", args=[{"visible": [False] * len(common_cols)}])
)

# some adjustments to the updatemenus
updatemenu = []
your_menu = dict()
updatemenu.append(your_menu)
your_menu2 = dict()
updatemenu.append(your_menu2)

updatemenu[0]["buttons"] = buttons
updatemenu[0]["direction"] = "down"
updatemenu[0]["showactive"] = True
updatemenu[1]["buttons"] = buttons2
updatemenu[1]["y"] = 0.6

fig.update_layout(showlegend=False, updatemenus=updatemenu)
fig.update_layout(yaxis=dict(range=[0, df_input["value"].max() + 0.4]))

# title
fig.update_layout(
    title=dict(
        text="<i>Filtering with multiple dropdown buttons</i>",
        font={"size": 18},
        y=0.9,
        x=0.5,
        xanchor="center",
        yanchor="top",
    )
)

# button annotations
fig.update_layout(
    annotations=[
        dict(
            text="<i>Label</i>",
            x=-0.2,
            xref="paper",
            y=1.1,
            yref="paper",
            align="left",
            showarrow=False,
            font=dict(size=16, color="steelblue"),
        ),
        dict(
            text="<i>Color</i>",
            x=-0.2,
            xref="paper",
            y=0.7,
            yref="paper",
            align="left",
            showarrow=False,
            font=dict(size=16, color="steelblue"),
        ),
    ]
)
fig.show()
```
This code is very verbose and not particularly clear. I believe that the code is applying filters to the table to determine the values to display. What is interesting is that the first filter (label) has an argument of a 2d array, while the second filter (color) has an argument of a 1d array.

That might suggest that each new filter pushes the previous ones up by a dimension. I’ve now had a few goes at writing this, and it seems that the plotly code dramatically increases in complexity as the interactions get richer. This doesn’t feel like the solution, even though it is very pretty.
Observable JS
Now I can try Observable JS. The problem with this will be coupling the javascript code to python data. I’m going to start by just copying the example.
This has worked; however, I’m not that comfortable mixing different languages in this blog. Furthermore, it doesn’t render in Jupyter, so I can’t even see what it is showing.
I’m kinda stuck though. After reading more about it I am confident that Observable JS can render logically separate filters which can be composed. It seems like this is the solution, even though working with it is going to be more complex as it spans Python and JavaScript.
Iris Probability Distribution
The plotly example loaded the iris dataset which provides measurements for three different kinds of flower. With this we should be able to view the probability of the different species, then set the different input variables to see how that changes the probability.
```python
import plotly.express as px

df_iris = px.data.iris()
df_iris
```
|     | sepal_length | sepal_width | petal_length | petal_width | species   | species_id |
|-----|--------------|-------------|--------------|-------------|-----------|------------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         | setosa    | 1          |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         | setosa    | 1          |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         | setosa    | 1          |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         | setosa    | 1          |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         | setosa    | 1          |
| ... | ...          | ...         | ...          | ...         | ...       | ...        |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | virginica | 3          |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | virginica | 3          |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | virginica | 3          |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | virginica | 3          |
| 149 | 5.9          | 3.0         | 5.1          | 1.8         | virginica | 3          |

150 rows × 6 columns
To load this dataframe into observable js the Quarto documentation recommends using ojs_define. The example is simple so let’s try it out:
The problem here is that I cannot run the ojs_define function, as it is not available in JupyterLab; it is only available when the notebook is executed via Quarto. I don’t execute the notebooks using Quarto as some of them take a long time to complete.
Since I want to be able to work with this using jupyter I’m going to have to write out files locally and then operate over them.
```python
import plotly.express as px

df_iris = px.data.iris()
df_iris.to_csv("iris.csv")
```
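If you want a single notebook that works in both contexts, one hedged pattern (my own sketch, assuming ojs_define is only injected as a global when Quarto executes the notebook) is to guard the call and fall back to a CSV:

```python
import pandas as pd

# stand-in dataframe for the sketch
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# ojs_define exists only when Quarto executes the notebook; in a plain
# Jupyter kernel the name is undefined, so fall back to writing a CSV
# that the Observable JS cells can read instead.
try:
    ojs_define(data=df)
except NameError:
    df.to_csv("data.csv", index=False)
```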
We can see an acceptable table. If we take this table, then what is the probability of each of the three species? The checkboxes here restrict the dataset to only the selected species.
```js
function unique(values) {
  function distinct(accumulated, value) {
    if (!accumulated.includes(value)) {
      accumulated.push(value);
    }
    return accumulated;
  }
  values = values.reduce(distinct, [])
  values.sort()
  return values
}

function probability(values, selectedSpecies) {
  function count(values, name) {
    return values.filter(({ species }) => name == species).length
  }
  const names = unique(values.map(({ species }) => species))
  const filtered = values.filter(({ species }) => selectedSpecies.includes(species))
  const total = filtered.length
  return names.map((name) => ({
    name,
    probability: count(filtered, name) / total
  }))
}

species = unique(df_iris.map(({ species }) => species))

viewof selectedSpecies = Inputs.checkbox(species, {
  label: "Species",
  value: species
})

Inputs.table(probability(df_iris, selectedSpecies))
```
At this point we have a very simple way to filter the table and calculate the probability of a given iris species. Playing with the checkboxes shows that this is a balanced dataset.
What we want now is a way to restrict the values of the different flower features and see how it changes the distribution. This would be a good way to test that composing multiple filters works.
```js
function range(values, column) {
  const filtered = values.map(row => row[column])
  const min = Math.min.apply(Math, filtered)
  const max = Math.max.apply(Math, filtered)
  const minValue = Inputs.range([min, max], {
    step: 0.1,
    label: `${column} minimum`,
    value: min
  })
  const maxValue = Inputs.range([min, max], {
    step: 0.1,
    label: `${column} maximum`,
    value: max
  })
  return [minValue, maxValue]
}

function composedProbability(values, filters, columns) {
  function count(values, name) {
    return values.filter(({ species }) => name == species).length
  }
  const names = unique(values.map(({ species }) => species))
  const filtered = values.filter(row =>
    columns.map((column, index) => {
      let min = filters[index * 2]
      let max = filters[index * 2 + 1]
      return row[column] >= min && row[column] <= max
    }).reduce((a, b) => a && b, true)
  )
  const total = filtered.length
  return names.map((name) => ({
    name,
    probability: count(filtered, name) / total
  }))
}

columns = df_iris.columns.filter(column =>
  column != "species" && column != "species_id" && column.length
)

columnFilters = columns.map(column => range(df_iris, column))
  .reduce((accumulator, [min, max]) => {
    accumulator.push(min)
    accumulator.push(max)
    return accumulator
  }, [])

viewof filterValues = Inputs.form(columnFilters)

Inputs.table(composedProbability(df_iris, filterValues, columns))
```
The display of all of these bars is not great, but the filters compose, which is what I really wanted. If you play around you can see that sepal_length >= 5.2 and petal_width >= 0.5 excludes all setosa. We can check this with the original dataframe: