Can I find relationships between scents in perfumes using Bayesian Networks?
Published
April 25, 2023
I’ve come across a perfume dataset that looks interesting. I want to see if I can use Bayesian techniques to investigate it.
For each perfume there is a list of smells which form the overall perfume. Can we use Bayes to predict the probability of one smell given another?
Code
from pathlib import Path

import pandas as pd

DATA_FOLDER = Path("/data/perfumes/processed/")
PERFUME_FILE = DATA_FOLDER / "perfumes.gz.parquet"

perfumes_df = pd.read_parquet(PERFUME_FILE)
smell_columns = perfumes_df.columns[
    perfumes_df.columns.str.startswith("accord_")
]
perfumes_df[smell_columns].iloc[0].dropna().to_frame().T
|   | accord_animalic | accord_floral | accord_fruity | accord_green | accord_powdery | accord_sweet | accord_tuberose | accord_white floral | accord_woody | accord_yellow floral |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 45.7181 | 56.0413 | 56.3575 | 44.2255 | 57.6156 | 67.1769 | 53.0983 | 100.0 | 61.1251 | 62.7636 |
We can see that we have proportions for the different smells. There are a lot of them.
Code
print(f"There are {len(smell_columns):,} different smells")
There are 73 different smells
If we take any value as an indication that the smell is present, then we can treat this as a discrete-state problem that can be modelled. So for the perfume above we can say that accord_animalic is present.
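A minimal sketch of that discretisation, treating any non-null accord value as presence (this cell is illustrative rather than from the original analysis):

```python
# treat any recorded accord value as "the scent is present"
scent_present_df = ~perfumes_df[smell_columns].isna()

# the first perfume, as shown above, has accord_animalic among its present scents
scent_present_df.iloc[0][scent_present_df.iloc[0]].index.tolist()
```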
The next thing is to look at the dataset to see if the presence or absence of one scent alters the probability distribution of another.
Unsurprisingly, Bayes’ theorem holds exactly on these counts. Given how the values are calculated we can see why it must. First let’s define the probabilities by the fractions they represent:
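With \(N\) perfumes in total, \(N_A\) of them containing scent \(A\), \(N_B\) containing scent \(B\) and \(N_{A \cap B}\) containing both, the fractions are:

\[
P(A) = \frac{N_A}{N}, \qquad
P(B) = \frac{N_B}{N}, \qquad
P(A \mid B) = \frac{N_{A \cap B}}{N_B}, \qquad
P(B \mid A) = \frac{N_{A \cap B}}{N_A}
\]

Rearranging the conditional definitions gives Bayes’ theorem directly:

\[
P(A \mid B)
= \frac{N_{A \cap B}}{N_B}
= \frac{(N_{A \cap B} / N_A) \cdot (N_A / N)}{N_B / N}
= \frac{P(B \mid A) \, P(A)}{P(B)}
\]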
It’s nice that Bayes’ theorem is a simple bit of algebra over the underlying probability definitions.
Markov Chains
A Markov chain is a model where the distribution over the current state depends on the previous outcome. A chain like this can be extended to incorporate more prior states.
We can write this out as the conditional probability \(P(state_n \mid state_{n-1})\). With this dataset we can create such a model easily. What is more important is a way to evaluate its quality.

A measure of the level of surprise seems appropriate given that this is dealing with information. If we construct a two-deep Markov chain, how well does the distribution of predicted scents match up to the actual ones?
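Concretely, a two-deep chain conditions on the two previous states:

\[
P(state_n \mid state_{n-1}, state_{n-2})
\]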
Let’s start by taking the top 10 scents to work with. Using these should prevent highly uneven splits.
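The top_10_scents series is used throughout what follows; a minimal sketch of how it could be built, assuming it holds the number of perfumes containing each scent with the most common first:

```python
# count how many perfumes contain each scent and keep the ten most common;
# this is an assumed reconstruction of the top_10_scents series used below
top_10_scents = (
    perfumes_df[smell_columns]
    .notna()
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
top_10_scents.index.tolist()
```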
The probability distribution can be calculated by taking each scent as a given and then calculating the distribution for the other scents. Since I want to make a dataframe out of this the other scents will include the given scent.
Each row of this dataframe is a scent that the perfume already has. The columns are then the probability of the other scent being paired with the given scent.
Code
import pandas as pd


def probability(df: pd.DataFrame, condition: str, *givens: str) -> float:
    for given in givens:
        df = df[~df[given].isna()]
    return len(df[~df[condition].isna()]) / len(df)


markov_df = pd.DataFrame(
    [
        {
            condition: probability(perfumes_df, condition, given)
            for condition in top_10_scents.index
        }
        for given in top_10_scents.index
    ],
    index=top_10_scents.index,
)
markov_df
|   | accord_woody | accord_powdery | accord_citrus | accord_sweet | accord_aromatic | accord_warm spicy | accord_floral | accord_fresh spicy | accord_amber | accord_white floral |
|---|---|---|---|---|---|---|---|---|---|---|
| accord_woody | 1.000000 | 0.516177 | 0.515061 | 0.462254 | 0.512830 | 0.499070 | 0.390108 | 0.423949 | 0.410933 | 0.339160 |
| accord_powdery | 0.713992 | 1.000000 | 0.483539 | 0.536523 | 0.373971 | 0.413066 | 0.494856 | 0.307099 | 0.381173 | 0.397119 |
| accord_citrus | 0.716874 | 0.486542 | 1.000000 | 0.465321 | 0.550207 | 0.368530 | 0.436853 | 0.495859 | 0.308489 | 0.436853 |
| accord_sweet | 0.652493 | 0.547507 | 0.471916 | 1.000000 | 0.319685 | 0.412598 | 0.458793 | 0.255643 | 0.371129 | 0.414173 |
| accord_aromatic | 0.785755 | 0.414245 | 0.605698 | 0.347009 | 1.000000 | 0.499145 | 0.303704 | 0.651852 | 0.398291 | 0.260969 |
| accord_warm spicy | 0.797386 | 0.477124 | 0.423054 | 0.467023 | 0.520499 | 1.000000 | 0.279857 | 0.445039 | 0.549020 | 0.256684 |
| accord_floral | 0.639244 | 0.586228 | 0.514321 | 0.532602 | 0.324802 | 0.287020 | 1.000000 | 0.279098 | 0.286411 | 0.450945 |
| accord_fresh spicy | 0.763563 | 0.399866 | 0.641661 | 0.326189 | 0.766242 | 0.501674 | 0.306765 | 1.000000 | 0.385131 | 0.240455 |
| accord_amber | 0.755297 | 0.506494 | 0.407382 | 0.483254 | 0.477785 | 0.631579 | 0.321258 | 0.393028 | 1.000000 | 0.277512 |
| accord_white floral | 0.658484 | 0.557401 | 0.609386 | 0.569675 | 0.330686 | 0.311913 | 0.534296 | 0.259206 | 0.293141 | 1.000000 |
To determine if this is a good model we can calculate this distribution when we already have a scent chosen. The most popular scent is woody so let’s try that.
Each row of this dataframe is now a scent that the perfume already has, paired with woody. The columns are then the probability of the other scent appearing alongside that pair.

The column for accord_woody is the probability of it being present given that it is present, so that is always 1. The row for accord_woody is the probability of the other scent being present given that woody is present twice; since it cannot really be present twice, this row is unchanged from the original table.
Code
given_woody_df = pd.DataFrame(
    [
        {
            condition: probability(perfumes_df, condition, given, "accord_woody")
            for condition in top_10_scents.index
        }
        for given in top_10_scents.index
    ],
    index=top_10_scents.index,
)
given_woody_df
|   | accord_woody | accord_powdery | accord_citrus | accord_sweet | accord_aromatic | accord_warm spicy | accord_floral | accord_fresh spicy | accord_amber | accord_white floral |
|---|---|---|---|---|---|---|---|---|---|---|
| accord_woody | 1.0 | 0.516177 | 0.515061 | 0.462254 | 0.512830 | 0.499070 | 0.390108 | 0.423949 | 0.410933 | 0.339160 |
| accord_powdery | 1.0 | 1.000000 | 0.497839 | 0.499280 | 0.417147 | 0.471182 | 0.461095 | 0.336455 | 0.402017 | 0.384006 |
| accord_citrus | 1.0 | 0.498917 | 1.000000 | 0.433935 | 0.600722 | 0.423827 | 0.404332 | 0.527798 | 0.339350 | 0.408664 |
| accord_sweet | 1.0 | 0.557522 | 0.483508 | 1.000000 | 0.359614 | 0.472245 | 0.434433 | 0.271118 | 0.397426 | 0.412711 |
| accord_aromatic | 1.0 | 0.419869 | 0.603336 | 0.324148 | 1.000000 | 0.531545 | 0.277012 | 0.654822 | 0.419144 | 0.237854 |
| accord_warm spicy | 1.0 | 0.487332 | 0.437407 | 0.437407 | 0.546200 | 1.000000 | 0.263785 | 0.457526 | 0.543964 | 0.257824 |
| accord_floral | 1.0 | 0.610105 | 0.533842 | 0.514776 | 0.364156 | 0.337464 | 1.000000 | 0.308866 | 0.301239 | 0.449952 |
| accord_fresh spicy | 1.0 | 0.409649 | 0.641228 | 0.295614 | 0.792105 | 0.538596 | 0.284211 | 1.000000 | 0.414912 | 0.217544 |
| accord_amber | 1.0 | 0.504977 | 0.425339 | 0.447059 | 0.523077 | 0.660633 | 0.285973 | 0.428054 | 1.000000 | 0.246154 |
| accord_white floral | 1.0 | 0.584430 | 0.620614 | 0.562500 | 0.359649 | 0.379386 | 0.517544 | 0.271930 | 0.298246 | 1.000000 |
The new distributions look very different to the previous ones. It’s easy to get distracted by the woody column and row. Is there a more methodical way to compare the two models?
Model Comparisons
The distribution has changed. How much has it changed, and is this Markov model better than just the marginal probability, which takes no account of existing scents?
Here we want to measure the difference between two probability distributions. Each cell in this table is its own small distribution, the probability of that single outcome being present or not. I think the Kullback-Leibler divergence would be an appropriate measure.
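The kl_div helper used in the following cells is not defined here; a plausible sketch, assuming it computes the element-wise KL divergence between Bernoulli (present/absent) distributions, is:

```python
import numpy as np
import pandas as pd


def kl_div(p, q, eps: float = 1e-12):
    """Element-wise KL divergence between Bernoulli(p) and Bernoulli(q).

    This is an assumed reconstruction; the helper actually used in this
    post may differ (for example scipy.special.kl_div).
    """
    p_values = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    q_values = np.clip(np.asarray(q, dtype=float), eps, 1 - eps)
    result = p_values * np.log(p_values / q_values) + (1 - p_values) * np.log(
        (1 - p_values) / (1 - q_values)
    )
    if isinstance(p, pd.DataFrame):
        # keep the index and columns so the result lines up with the inputs
        return pd.DataFrame(result, index=p.index, columns=p.columns)
    return result
```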
Computed over the table, this does show me the difference for each combination of conditional and given. As a measure of the quality of the model I have a problem though: how can I effectively combine these individual statistics into one number for the whole model? Is this even the right approach?
A better way to approach this would be to split the data into test and train sets. The distributions can be trained and then the ability to predict the test set is the quality of the model.
I did want to use the train set as the quality measure as that would make it possible to generalize this approach to Bayesian networks. That can come later though.

This does show what I was hoping for: the Markov model is higher quality than the naive Bayes classifier.
Does this approach work if I skip the train/test split? What am I measuring against? I need a way to make the naive Bayes classifier produce output of the same shape as the Markov model. I can repeat the probabilities across every given, as the naive Bayes classifier does not take them into account.
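A minimal sketch of that reshaping, reusing the probability function from earlier (the original construction of naive_bayes_df is an assumption here):

```python
# repeat the unconditional P(scent) across every "given" row so the naive
# Bayes table has the same 10x10 shape as markov_df
naive_bayes_df = pd.DataFrame(
    [
        {
            condition: probability(perfumes_df, condition)
            for condition in top_10_scents.index
        }
        for _given in top_10_scents.index
    ],
    index=top_10_scents.index,
)
```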
These predictions are clearly going to be bad. Most obviously the classifier considers \(P(woody \mid woody)\) and similar to be less than 1.
We can measure the difference between these models though.
Code
markov_df = markov(
    df=has_scent_df,
    columns=top_10_scents.index,
)
divergence = (
    kl_div(markov_df, naive_bayes_df)
    .to_numpy()
    .mean()
)
print(f"kl-divergence for naive bayes vs markov chain: {divergence:0.4f}")
kl-divergence for naive bayes vs markov chain: 29.8922
This is a huge difference. For fun we could consider fixing \(P(woody \mid woody)\) and similar to 1 and comparing again.
Code
naive_bayes_np = naive_bayes_df.to_numpy()
naive_bayes_np[
    range(len(top_10_scents)),
    range(len(top_10_scents)),
] = 1.0
divergence = (
    kl_div(markov_df, naive_bayes_np)
    .to_numpy()
    .mean()
)
print(f"kl-divergence for adjusted naive bayes vs markov chain: {divergence:0.4f}")
kl-divergence for adjusted naive bayes vs markov chain: 27.0876
The difference is still very large. What does this difference represent? Does it show that the markov chain is a better model?
I don’t think so. It is showing that these models differ from each other. I can imagine two models which are equally bad and are this different from each other.
If we imagine the most precise model then that would produce the correct probability of a scent given every other scent on the perfume. We could try measuring against this for each perfume in the dataset. Given that the number of scents on a perfume varies the first step will be to produce the distinct list of every scent combination. Then from there produce the true probability distribution and compare it to the two classifiers.
Code
distinct_scents = set(
    frozenset(
        row[top_10_scents.index][~row[top_10_scents.index].isna()].index
    )
    for row in has_scent_df.iloc
)
scent_count = len(distinct_scents)
naive_bayes_count = sum(len(scents) for scents in distinct_scents)
markov_chain_count = sum(
    len(scents) * (len(scents) - 1) for scents in distinct_scents
)

print(f"there are {scent_count:,} distinct sets of scents")
print(f"there are {naive_bayes_count:,} comparisons to perform for naive bayes")
print(f"there are {markov_chain_count:,} comparisons to perform for markov chain")
there are 742 distinct sets of scents
there are 3,641 comparisons to perform for naive bayes
there are 16,100 comparisons to perform for markov chain
This is an achievable number of comparisons to perform. The Markov model requires more comparisons because each scent in a set can also act as the given scent.
Code
import pandas as pd
from tqdm.auto import tqdm


def naive_bayes_comparisons(
    perfume_df: pd.DataFrame,
    probabilities: pd.Series,
    distinct_scents: set[frozenset[str]],
) -> pd.Series:
    divergences = []
    for scents in tqdm(distinct_scents):
        for predicted_scent in scents:
            # filter perfumes to those with the given scents
            given_scents = scents - {predicted_scent}
            df = perfume_df
            for scent in given_scents:
                df = df[~df[scent].isna()]
            # calculate true probability
            perfumes_with_scent = (~df[predicted_scent].isna()).sum()
            probability = perfumes_with_scent / len(df)
            divergences.append(
                kl_div(
                    probabilities.loc[predicted_scent],
                    probability,
                )
            )
    return pd.Series(divergences)


def markov_chain_comparisons(
    perfume_df: pd.DataFrame,
    probabilities: pd.DataFrame,
    distinct_scents: set[frozenset[str]],
) -> pd.Series:
    divergences = []
    for scents in tqdm(distinct_scents):
        for given_scent in scents:
            for predicted_scent in scents - {given_scent}:
                # filter perfumes to those with the given scents
                given_scents = scents - {predicted_scent}
                df = perfume_df
                for scent in given_scents:
                    df = df[~df[scent].isna()]
                # calculate true probability
                perfumes_with_scent = (~df[predicted_scent].isna()).sum()
                probability = perfumes_with_scent / len(df)
                divergences.append(
                    kl_div(
                        probabilities.loc[given_scent][predicted_scent],
                        probability,
                    )
                )
    return pd.Series(divergences)
The naive Bayes model divergence from the true distribution can then be computed, along with the Markov chain's.
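A sketch of how the two comparison functions would be run; the inputs mirror the earlier cells, and the exact outputs are not reproduced here:

```python
# divergence of each model from the per-combination "true" probabilities
naive_bayes_divergences = naive_bayes_comparisons(
    perfume_df=has_scent_df,
    probabilities=naive_bayes_df.iloc[0],  # unconditional P(scent) per scent
    distinct_scents=distinct_scents,
)
markov_chain_divergences = markov_chain_comparisons(
    perfume_df=has_scent_df,
    probabilities=markov_df,
    distinct_scents=distinct_scents,
)
print(naive_bayes_divergences.mean(), markov_chain_divergences.mean())
```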
Every comparison I come up with seems to make the difference between the two more dramatic.
What am I really doing with these though? These are classifiers. I could perform the same comparison but treat it as a classification where the majority class of the model is the one that it predicts. This should be more consistent with how these would actually be used.
Code
import pandas as pd
from tqdm.auto import tqdm


def naive_bayes_classifications(
    perfume_df: pd.DataFrame,
    probabilities: pd.Series,
    distinct_scents: set[frozenset[str]],
) -> float:
    correct_classifications = 0
    total_classifications = 0
    for scents in tqdm(distinct_scents):
        for predicted_scent in scents:
            # filter perfumes to those with the given scents
            given_scents = scents - {predicted_scent}
            df = perfume_df
            for scent in given_scents:
                df = df[~df[scent].isna()]
            # does the model predict this scent?
            predicted = probabilities.loc[predicted_scent] >= 0.5
            # assign the appropriate number of classifications
            perfumes_with_scent = (~df[predicted_scent].isna()).sum()
            if predicted:
                correct_classifications += perfumes_with_scent
            else:
                correct_classifications += len(df) - perfumes_with_scent
            total_classifications += len(df)
    return correct_classifications / total_classifications


def markov_chain_classifications(
    perfume_df: pd.DataFrame,
    probabilities: pd.DataFrame,
    distinct_scents: set[frozenset[str]],
) -> float:
    correct_classifications = 0
    total_classifications = 0
    for scents in tqdm(distinct_scents):
        for given_scent in scents:
            for predicted_scent in scents - {given_scent}:
                # filter perfumes to those with the given scents
                given_scents = scents - {predicted_scent}
                df = perfume_df
                for scent in given_scents:
                    df = df[~df[scent].isna()]
                # does the model predict this scent?
                predicted = probabilities.loc[given_scent][predicted_scent] >= 0.5
                # assign the appropriate number of classifications
                perfumes_with_scent = (~df[predicted_scent].isna()).sum()
                if predicted:
                    correct_classifications += perfumes_with_scent
                else:
                    correct_classifications += len(df) - perfumes_with_scent
                total_classifications += len(df)
    return correct_classifications / total_classifications
To me this shows the quality of the classifiers much better. The Markov chain classifier is better than random while the naive Bayes classifier is not.
Is it possible to use this technique to construct a Bayesian network? We could iterate through all possible connections and take the one which improves the model by the greatest margin. It's likely that this would be very slow, as evaluating every candidate edge is \(O(N^2)\).
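A rough sketch of that greedy search; score_network is a hypothetical scoring callable (for example the classification accuracy above) and is not part of the original analysis:

```python
def greedy_structure_search(scents, score_network, max_edges=10):
    """Greedily add the edge that improves the network score the most."""
    edges: set[tuple[str, str]] = set()
    best_score = score_network(edges)
    for _ in range(max_edges):
        # every ordered pair of scents is a candidate edge, so each step
        # evaluates O(N^2) candidates
        candidates = [
            (parent, child)
            for parent in scents
            for child in scents
            if parent != child and (parent, child) not in edges
        ]
        if not candidates:
            break
        score, edge = max(
            (score_network(edges | {candidate}), candidate)
            for candidate in candidates
        )
        if score <= best_score:
            break
        edges.add(edge)
        best_score = score
    return edges
```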
Checking My Work
At this point I need to consult someone who has done this already. I like computers, so I am consulting the pomegranate library. This has an implementation of Markov chains, so we can start by working with that.
Part of the problem with using this is that it bakes in the chain ordering (unsurprisingly), so I have to enforce an order on the scents. I am going to categorize them by column index and then sort them by accord value.
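A sketch of that first attempt, assuming "value" means the accord strength (the cell that produced the error below is reconstructed here):

```python
# map each scent to a category index, then order a perfume's scents by
# their accord value so each perfume becomes one ordered chain
scent_to_category = {
    name: index for index, name in enumerate(top_10_scents.index)
}


def row_to_chain(row: pd.Series) -> list[list[int]]:
    row = row[top_10_scents.index].dropna()
    ordered = row.sort_values(ascending=False).index.map(scent_to_category)
    return [[scent] for scent in ordered]


chains = [row_to_chain(row) for row in perfumes_df.iloc]
```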
from pomegranate.markov_chain import MarkovChain

model = MarkovChain(k=2)
model.fit(chains)
ValueError: expected sequence of length 5 at dim 1 (got 4)
This wants the different chain lengths to be identical. I’m surprised by this restriction as I had always thought of Markov chains as being a way to predict over arbitrary length sequences (e.g. next word prediction).
At the moment I’m using this to verify the work that I have done so far. Let’s try creating the full set of “chains” which are only length two. Over a big dataset with lots of features this would be impractically large, but we have 10 scents that may be present and a few thousand perfumes.
Code
scent_to_category = {
    name: index
    for index, name in enumerate(top_10_scents.index)
}


def row_to_2_chains(row: pd.Series) -> list[list[list[int]]]:
    row = row[top_10_scents.index]
    row = row.dropna()
    scents = set(row.index.map(scent_to_category))
    return [
        [[first], [second]]
        for first in sorted(scents)
        for second in sorted(scents - {first})
    ]


chains = [
    chain
    for row in perfumes_df.iloc
    for chain in row_to_2_chains(row)
]
Code
from pomegranate.markov_chain import MarkovChain

model = MarkovChain(k=2)
model.fit(chains)
TypeError: 'NoneType' object cannot be interpreted as an integer
Now it appears that each sequence needs to be at least one element longer than the Markov chain order.
Code
scent_to_category = {
    name: index
    for index, name in enumerate(top_10_scents.index)
}


def row_to_3_chains(row: pd.Series) -> list[list[list[int]]]:
    row = row[top_10_scents.index]
    row = row.dropna()
    scents = set(row.index.map(scent_to_category))
    return [
        [[first], [second], [third]]
        for first in sorted(scents)
        for second in sorted(scents - {first})
        for third in sorted(scents - {first, second})
    ]


chains = [
    chain
    for row in perfumes_df.iloc
    for chain in row_to_3_chains(row)
]
Code
from pomegranate.markov_chain import MarkovChain

model = MarkovChain(k=2)
model.fit(chains) ; None  # the trailing None stops the notebook printing the fitted model
The fitted model cannot be used to predict, which is surprising as this library appears to copy the sklearn estimator style. Instead you can get the probability of a given chain of events.
We can see how this has been normalized by summing all of the possible second events for a given scent.
model.probability([
    [[scent_to_category["accord_woody"]], [second]]
    for second in scent_to_category.values()
]).sum()
tensor(0.1436)
This probability is unlike the previous computation that I did. When I calculated the probabilities I was considering whether the scent was present or missing. The construction of the chains has not replicated this, because the chains are not exclusive: if we have woody followed by powdery, that does not mean that citrus is not present.
Reviewing the construction of this Markov chain shows me how my own model was incorrect. Instead of creating a Markov chain I have modelled something closer to a Bayesian network. Since I want to investigate Bayesian networks, let's press on.
Bayesian Networks
Notes
- Evaluation should be a regression model evaluation.
- MAPE (mean absolute percentage error) is a suitable metric.
- For naive Bayes, calibration is a problem even if the model is otherwise ok. A Bayesian network can be better calibrated, if you get the correct structure.
- A Bayesian network is good for optimization problems: given user preferences as a prior, maximize a continuous variable such as the rating.
- Bayesian optimization: BoTorch is a Bayesian optimizer in PyTorch.
- Trial design relates to Bayesian experimental design; Bayesian search-space filling can be combined with Bayesian search.
- From https://ax.dev/docs/bayesopt: "Bayesian optimization (BO) allows us to tune parameters in relatively few iterations by building a smooth model from an initial set of parameterizations (referred to as the 'surrogate model') in order to predict the outcomes for as yet unexplored parameterizations. BO is an adaptive approach where the observations from previous evaluations are used to decide what parameterizations to evaluate next. The same strategy can be used to predict the expected gain from all future evaluations and decide on early termination, if the expected benefit is smaller than what is worthwhile for the problem at hand."
- Libraries to look at: pomegranate, PyMC, pgmpy.
- There are two types of experiment: given a dataset, what is the best model? This adds a layer of complexity as I need to fit the network as well as understand it.
- There are lots of potential graphs that can fit small datasets and give good results. It can be a subgraph issue, where the proposed network has excess edges, or influence spreading through different paths, resulting in non-subgraph relationships between some ideal network and the current one.
- Structure learning is the process of learning these networks. I can start with an artificial dataset where the structure is known and then attempt to derive it through the search algorithms, varying the training dataset size and exploring the results.