Evaluating different visualization techniques for this blog
Published
March 10, 2023
I want to learn Bayesian networks to contrast them with the TabPFN (Hollmann et al. 2022) model. Being able to play with a network interactively would help me understand them much better. It has always been possible to embed working Jupyter widgets into Quarto posts, but this will be my first attempt at doing so.
Hollmann, Noah, Samuel Müller, Katharina Eggensperger, and Frank Hutter. 2022. “TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second.” arXiv. https://doi.org/10.48550/ARXIV.2207.01848.
This post is concerned with finding the right way to visualize Bayesian networks. That will require being able to visualize tabular data and set multiple variables into fixed states.
Interactive Visualizations
To really get to grips with this I want to be able to see how the probabilities change as certain predicates are locked into a state. This is because a lot of Bayesian probability is of the form \(P(X | Y)\) (probability of X given Y). If I cannot create such a fixed state then I’m not really learning this, and it should be fun to make the blog more interactive.
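To make the \(P(X | Y)\) idea concrete, here is a tiny self-contained sketch (toy numbers of my own, not from any dataset in this post) that computes a conditional distribution from a joint one:

```python
# A toy joint distribution over two binary variables X and Y,
# stored as {(x, y): probability}. Illustrative numbers only.
joint = {
    (True, True): 0.3,
    (True, False): 0.1,
    (False, True): 0.2,
    (False, False): 0.4,
}

def conditional(joint, y):
    """P(X | Y=y) = P(X, Y=y) / P(Y=y)."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return {x: joint[(x, y)] / p_y for x in (True, False)}

print(conditional(joint, True))  # P(X=True | Y=True) = 0.3 / 0.5 = 0.6
```

Locking a predicate into a state is exactly the `Y=y` step here: filter to the matching rows, then renormalize.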
Python Widget
This will create a button which is coupled to a label. Changing the state of the button updates the label.
It’s written with ipywidgets, which I hope are supported; the Quarto website states that Jupyter Widgets are.
```python
import ipywidgets as widgets

button = widgets.ToggleButton(
    value=False,
    description="Click me",
    disabled=False,
    button_style="",  # 'success', 'info', 'warning', 'danger' or ''
    tooltip="Description",
    icon="check",  # (FontAwesome names without the `fa-` prefix)
)
label = widgets.Label(value="The button is not selected")

def observe_state_change(event: dict) -> None:
    if event["name"] != "value":
        return
    if event["new"]:
        label.value = "The button is selected"
    else:
        label.value = "The button is not selected"

button.observe(observe_state_change, type="change")
widgets.VBox([label, button])
```
As you can probably guess that didn’t work.
This would have been the best option, as it lets me write arbitrary Python code. Quarto is not translating arbitrary Python into JavaScript, and presumably the Jupyter Widgets that are linked are specific components that have already been translated.
Plotly
Another interactable component is a plotly graph. I would like to be able to choose which nodes of the Bayesian network to lock in. Let’s see if it works at all first.
```python
import plotly.express as px
import plotly.io as pio

df = px.data.iris()
fig = px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
    color="species",
    marginal_y="violin",
    marginal_x="box",
    trendline="ols",
    template="simple_white",
)
fig.show()
```
After a bit of fiddling this has now worked, both on the blog and in Jupyter. Ultimately I want to be able to lock in probabilities, so displaying a table with filters would be sufficient at this point.
I’m going to use the XOR truth table as my dataset.
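The construction of that table isn’t shown in the post, so here is a minimal sketch of how the XOR truth table could be built with pandas (my own code; the column names are my choice):

```python
import pandas as pd

# XOR truth table: the output is 1 exactly when the two inputs differ
xor = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]})
xor["output"] = xor["a"] ^ xor["b"]
print(xor)
```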
This is great because it suggests that I can use arbitrary python code to calculate the result for a given filter. I think that the code is run once to establish the values which are then encoded in the resulting javascript.
The next thing is to be able to filter by more than one condition. It turns out that this is quite complex, as one filter does not reference another. I’ve found another Stack Overflow answer which has a lot of code to achieve this. I’m going to review the code carefully and then try to create a pattern out of it that I can use going forward.
I must say that for this the Plotly documentation is basically non-existent.
```python
# Imports
import plotly.graph_objs as go
import pandas as pd
import numpy as np

# source data
df = pd.DataFrame(
    {
        0: {"num": 1, "label": "A", "color": "red", "value": 0.4},
        1: {"num": 2, "label": "A", "color": "blue", "value": 0.2},
        2: {"num": 3, "label": "A", "color": "green", "value": 0.3},
        3: {"num": 4, "label": "A", "color": "red", "value": 0.6},
        4: {"num": 5, "label": "A", "color": "blue", "value": 0.7},
        5: {"num": 6, "label": "A", "color": "green", "value": 0.4},
        6: {"num": 7, "label": "B", "color": "blue", "value": 0.2},
        7: {"num": 8, "label": "B", "color": "green", "value": 0.4},
        8: {"num": 9, "label": "B", "color": "red", "value": 0.4},
        9: {"num": 10, "label": "B", "color": "green", "value": 0.2},
        10: {"num": 11, "label": "C", "color": "red", "value": 0.1},
        11: {"num": 12, "label": "C", "color": "blue", "value": 0.3},
        12: {"num": 13, "label": "D", "color": "red", "value": 0.8},
        13: {"num": 14, "label": "D", "color": "blue", "value": 0.4},
        14: {"num": 15, "label": "D", "color": "green", "value": 0.6},
        15: {"num": 16, "label": "D", "color": "yellow", "value": 0.5},
        16: {"num": 17, "label": "E", "color": "purple", "value": 0.68},
    }
).T
df_input = df.copy()

# split df by labels
labels = df["label"].unique().tolist()
dates = df["num"].unique().tolist()

# dataframe collection grouped by labels
dfs = {}
for label in labels:
    dfs[label] = pd.pivot_table(
        df[df["label"] == label],
        values="value",
        index=["num"],
        columns=["color"],
        aggfunc=np.sum,
    )

# find row and column unions
common_cols = []
common_rows = []
for df in dfs.keys():
    common_cols = sorted(list(set().union(common_cols, list(dfs[df]))))
    common_rows = sorted(list(set().union(common_rows, list(dfs[df].index))))

# find dimensionally common dataframe
df_common = pd.DataFrame(np.nan, index=common_rows, columns=common_cols)

# reshape each dfs[df] into common dimensions
dfc = {}
for df_item in dfs:
    df1 = dfs[df_item].copy()
    s = df_common.combine_first(df1)
    df_reshaped = df1.reindex_like(s)
    dfc[df_item] = df_reshaped

# plotly start
fig = go.Figure()

# one trace for each column per dataframe
for col in common_cols:
    fig.add_trace(
        go.Scatter(
            x=dates,
            visible=True,
            marker=dict(size=12, line=dict(width=2)),
            marker_symbol="diamond",
            name=col,
        )
    )

# menu setup
updatemenu = []

# buttons for menu 1, names
buttons = []

# build argVals for buttons and create buttons
for df in dfc.keys():
    argList = []
    for col in dfc[df]:
        argList.append(dfc[df][col].values)
    argVals = [{"y": argList}]
    buttons.append(dict(method="update", label=df, visible=True, args=argVals))

# buttons for menu 2, colors
b2_labels = common_cols

# matrix to feed all visible arguments for all traces
# so that they can be shown or hidden by choice
b2_show = [list(b) for b in [e == 1 for e in np.eye(len(b2_labels))]]
buttons2 = []
buttons2.append(
    {
        "method": "update",
        "label": "All",
        "args": [{"visible": [True] * len(common_cols)}],
    }
)

# create buttons to show or hide
for i in range(0, len(b2_labels)):
    buttons2.append(
        dict(method="update", label=b2_labels[i], args=[{"visible": b2_show[i]}])
    )

# add option for button two to hide all
buttons2.append(
    dict(method="update", label="None", args=[{"visible": [False] * len(common_cols)}])
)

# some adjustments to the updatemenus
updatemenu = []
your_menu = dict()
updatemenu.append(your_menu)
your_menu2 = dict()
updatemenu.append(your_menu2)

updatemenu[0]["buttons"] = buttons
updatemenu[0]["direction"] = "down"
updatemenu[0]["showactive"] = True
updatemenu[1]["buttons"] = buttons2
updatemenu[1]["y"] = 0.6

fig.update_layout(showlegend=False, updatemenus=updatemenu)
fig.update_layout(yaxis=dict(range=[0, df_input["value"].max() + 0.4]))

# title
fig.update_layout(
    title=dict(
        text="<i>Filtering with multiple dropdown buttons</i>",
        font={"size": 18},
        y=0.9,
        x=0.5,
        xanchor="center",
        yanchor="top",
    )
)

# button annotations
fig.update_layout(
    annotations=[
        dict(
            text="<i>Label</i>",
            x=-0.2,
            xref="paper",
            y=1.1,
            yref="paper",
            align="left",
            showarrow=False,
            font=dict(size=16, color="steelblue"),
        ),
        dict(
            text="<i>Color</i>",
            x=-0.2,
            xref="paper",
            y=0.7,
            yref="paper",
            align="left",
            showarrow=False,
            font=dict(size=16, color="steelblue"),
        ),
    ]
)
fig.show()
```
This code is very verbose and not particularly clear. I believe that the code is applying filters to the table to determine the values to display. What is interesting is that the first filter (label) has an argument of a 2d array, while the second filter (color) has an argument of a 1d array.

That might suggest that each new filter pushes the previous ones up by a dimension. I’ve now had a few goes at writing this, and it seems that the plotly code dramatically increases in complexity as the interactions get richer. This doesn’t feel like the solution, even though it is very pretty.
Observable JS
Now I can try Observable JS. The problem with this will be coupling the javascript code to python data. I’m going to start by just copying the example.
This has worked; however, I’m not that comfortable mixing different languages in this blog. Furthermore, it doesn’t render in Jupyter, so I can’t even see what it is showing.
I’m kinda stuck though. After reading more about it I am confident that Observable JS can render logically separate filters which can be composed. It seems like this is the solution, even though working with it is going to be more complex as it spans Python and JavaScript.
Iris Probability Distribution
The plotly example loaded the iris dataset which provides measurements for three different kinds of flower. With this we should be able to view the probability of the different species, then set the different input variables to see how that changes the probability.
```python
import plotly.express as px

df_iris = px.data.iris()
df_iris
```
|     | sepal_length | sepal_width | petal_length | petal_width | species   | species_id |
|-----|--------------|-------------|--------------|-------------|-----------|------------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         | setosa    | 1          |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         | setosa    | 1          |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         | setosa    | 1          |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         | setosa    | 1          |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         | setosa    | 1          |
| ... | ...          | ...         | ...          | ...         | ...       | ...        |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | virginica | 3          |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | virginica | 3          |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | virginica | 3          |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | virginica | 3          |
| 149 | 5.9          | 3.0         | 5.1          | 1.8         | virginica | 3          |

150 rows × 6 columns
To load this dataframe into observable js the Quarto documentation recommends using ojs_define. The example is simple so let’s try it out:
The problem here is that I cannot run the ojs_define function, as it is not available in JupyterLab; it is only available when the notebook is executed via Quarto. I don’t execute the notebooks using Quarto as some of them take a long time to complete.
Since I want to be able to work with this using jupyter I’m going to have to write out files locally and then operate over them.
```python
import plotly.express as px

df_iris = px.data.iris()
df_iris.to_csv("iris.csv")
```
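If you want a single notebook that works in both contexts, one hedged pattern (my own sketch, assuming ojs_define is only injected as a global when Quarto executes the notebook) is to guard the call and fall back to a CSV:

```python
import pandas as pd

# stand-in dataframe for the sketch
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# ojs_define exists only when Quarto executes the notebook; in a plain
# Jupyter kernel the name is undefined, so fall back to writing a CSV
# that the Observable JS cells can read instead.
try:
    ojs_define(data=df)
except NameError:
    df.to_csv("data.csv", index=False)
```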
We can see an acceptable table. If we take this table, then what is the probability of each of the three species? The checkboxes here restrict the dataset to only the selected species.
```js
function unique(values) {
  function distinct(accumulated, value) {
    if (!accumulated.includes(value)) {
      accumulated.push(value);
    }
    return accumulated;
  }
  values = values.reduce(distinct, [])
  values.sort()
  return values
}

function probability(values, selectedSpecies) {
  function count(values, name) {
    return values.filter(({ species }) => name == species).length
  }
  const names = unique(values.map(({ species }) => species))
  const filtered = values.filter(({ species }) => selectedSpecies.includes(species))
  const total = filtered.length
  return names.map((name) => ({
    name,
    probability: count(filtered, name) / total
  }))
}

species = unique(df_iris.map(({ species }) => species))

viewof selectedSpecies = Inputs.checkbox(species, {
  label: "Species",
  value: species
})

Inputs.table(probability(df_iris, selectedSpecies))
```
At this point we have a very simple way to filter the table and calculate the probability of a given iris species. Playing with the checkboxes shows that this is a balanced dataset.
What we want now is a way to restrict the values of the different flower features and see how it changes the distribution. This would be a good way to test that composing multiple filters works.
```js
function range(values, column) {
  const filtered = values.map(row => row[column])
  const min = Math.min.apply(Math, filtered)
  const max = Math.max.apply(Math, filtered)
  const minValue = Inputs.range([min, max], {
    step: 0.1,
    label: `${column} minimum`,
    value: min
  })
  const maxValue = Inputs.range([min, max], {
    step: 0.1,
    label: `${column} maximum`,
    value: max
  })
  return [minValue, maxValue]
}

function composedProbability(values, filters, columns) {
  function count(values, name) {
    return values.filter(({ species }) => name == species).length
  }
  const names = unique(values.map(({ species }) => species))
  const filtered = values.filter(row =>
    columns.map((column, index) => {
      let min = filters[index * 2]
      let max = filters[index * 2 + 1]
      return row[column] >= min && row[column] <= max
    }).reduce((a, b) => a && b, true)
  )
  const total = filtered.length
  return names.map((name) => ({
    name,
    probability: count(filtered, name) / total
  }))
}

columns = df_iris.columns.filter(column =>
  column != "species" && column != "species_id" && column.length
)

columnFilters = columns.map(column => range(df_iris, column))
  .reduce((accumulator, [min, max]) => {
    accumulator.push(min)
    accumulator.push(max)
    return accumulator
  }, [])

viewof filterValues = Inputs.form(columnFilters)

Inputs.table(composedProbability(df_iris, filterValues, columns))
```
The display of all of these bars is not great, but the filters compose, which is what I really wanted. If you play around you can see that sepal_length >= 5.2 and petal_width >= 0.5 excludes all setosa. We can check this with the original dataframe: