Can I find relationships between scents in perfumes using Bayesian Networks?
Published
April 25, 2023
I’ve come across a perfume dataset that looks interesting. I want to see if I can use Bayesian techniques to investigate it.
For each perfume there is a list of smells which form the overall perfume. Can we use Bayes to predict the probability of one smell given another?
Code
from pathlib import Path

import pandas as pd

DATA_FOLDER = Path("/data/perfumes/processed/")
PERFUME_FILE = DATA_FOLDER / "perfumes.gz.parquet"

perfumes_df = pd.read_parquet(PERFUME_FILE)
smell_columns = perfumes_df.columns[
    perfumes_df.columns.str.startswith("accord_")
]
perfumes_df[smell_columns].iloc[0].dropna().to_frame().T
|   | accord_animalic | accord_floral | accord_fruity | accord_green | accord_powdery | accord_sweet | accord_tuberose | accord_white floral | accord_woody | accord_yellow floral |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 45.7181 | 56.0413 | 56.3575 | 44.2255 | 57.6156 | 67.1769 | 53.0983 | 100.0 | 61.1251 | 62.7636 |
We can see that we have proportions for the different smells. There are a lot of them.
Code
print(f"There are {len(smell_columns):,} different smells")
There are 73 different smells
If we take any value as an indication that the smell is present, then we can treat this as a discrete-state problem that can be modelled. So for the perfume above we can say that accord_animalic is present.
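A minimal sketch of that discretisation, treating any non-null accord value as presence (this cell is illustrative rather than from the original analysis):

```python
# treat any recorded accord value as "the scent is present"
scent_present_df = ~perfumes_df[smell_columns].isna()

# the first perfume, as shown above, has accord_animalic among its present scents
scent_present_df.iloc[0][scent_present_df.iloc[0]].index.tolist()
```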
The next thing is to look at the dataset to see if the presence or absence of one scent alters the probability distribution of another.
Unsurprisingly, Bayes’ theorem holds exactly on these counts. Given how the values are calculated we can see why it must. First let’s define the probabilities by the fractions they represent:
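With \(N\) perfumes in total, \(N_A\) of them containing scent \(A\), \(N_B\) containing scent \(B\) and \(N_{A \cap B}\) containing both, the fractions are:

\[
P(A) = \frac{N_A}{N}, \qquad
P(B) = \frac{N_B}{N}, \qquad
P(A \mid B) = \frac{N_{A \cap B}}{N_B}, \qquad
P(B \mid A) = \frac{N_{A \cap B}}{N_A}
\]

Rearranging the conditional definitions gives Bayes’ theorem directly:

\[
P(A \mid B)
= \frac{N_{A \cap B}}{N_B}
= \frac{(N_{A \cap B} / N_A) \cdot (N_A / N)}{N_B / N}
= \frac{P(B \mid A) \, P(A)}{P(B)}
\]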
It’s nice that Bayes’ theorem is a simple bit of algebra over the underlying probability definitions.
Markov Chains
A Markov chain is a model where the distribution over the current state depends on the previous outcome. A chain like this can be extended to incorporate more prior states.
We can write this out as the conditional probability \(P(state_n \mid state_{n-1})\). With this dataset we can create such a model easily. What is more important is a way to evaluate its quality.

A measure of the level of surprise seems appropriate given that this is dealing with information. If we construct a two-deep Markov chain, how well does the distribution of predicted scents match up to the actual ones?
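Concretely, a two-deep chain conditions on the two previous states:

\[
P(state_n \mid state_{n-1}, state_{n-2})
\]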
Let’s start by taking the top 10 scents to work with. Using these should prevent highly uneven splits.
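The top_10_scents series is used throughout what follows; a minimal sketch of how it could be built, assuming it holds the number of perfumes containing each scent with the most common first:

```python
# count how many perfumes contain each scent and keep the ten most common;
# this is an assumed reconstruction of the top_10_scents series used below
top_10_scents = (
    perfumes_df[smell_columns]
    .notna()
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
top_10_scents.index.tolist()
```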
The probability distribution can be calculated by taking each scent as a given and then calculating the distribution for the other scents. Since I want to make a dataframe out of this the other scents will include the given scent.
Each row of this dataframe is a scent that the perfume already has. The columns are then the probability of the other scent being paired with the given scent.
Code
import pandas as pd


def probability(df: pd.DataFrame, condition: str, *givens: str) -> float:
    for given in givens:
        df = df[~df[given].isna()]
    return len(df[~df[condition].isna()]) / len(df)


markov_df = pd.DataFrame(
    [
        {
            condition: probability(perfumes_df, condition, given)
            for condition in top_10_scents.index
        }
        for given in top_10_scents.index
    ],
    index=top_10_scents.index,
)
markov_df
|   | accord_woody | accord_powdery | accord_citrus | accord_sweet | accord_aromatic | accord_warm spicy | accord_floral | accord_fresh spicy | accord_amber | accord_white floral |
|---|---|---|---|---|---|---|---|---|---|---|
| accord_woody | 1.000000 | 0.516177 | 0.515061 | 0.462254 | 0.512830 | 0.499070 | 0.390108 | 0.423949 | 0.410933 | 0.339160 |
| accord_powdery | 0.713992 | 1.000000 | 0.483539 | 0.536523 | 0.373971 | 0.413066 | 0.494856 | 0.307099 | 0.381173 | 0.397119 |
| accord_citrus | 0.716874 | 0.486542 | 1.000000 | 0.465321 | 0.550207 | 0.368530 | 0.436853 | 0.495859 | 0.308489 | 0.436853 |
| accord_sweet | 0.652493 | 0.547507 | 0.471916 | 1.000000 | 0.319685 | 0.412598 | 0.458793 | 0.255643 | 0.371129 | 0.414173 |
| accord_aromatic | 0.785755 | 0.414245 | 0.605698 | 0.347009 | 1.000000 | 0.499145 | 0.303704 | 0.651852 | 0.398291 | 0.260969 |
| accord_warm spicy | 0.797386 | 0.477124 | 0.423054 | 0.467023 | 0.520499 | 1.000000 | 0.279857 | 0.445039 | 0.549020 | 0.256684 |
| accord_floral | 0.639244 | 0.586228 | 0.514321 | 0.532602 | 0.324802 | 0.287020 | 1.000000 | 0.279098 | 0.286411 | 0.450945 |
| accord_fresh spicy | 0.763563 | 0.399866 | 0.641661 | 0.326189 | 0.766242 | 0.501674 | 0.306765 | 1.000000 | 0.385131 | 0.240455 |
| accord_amber | 0.755297 | 0.506494 | 0.407382 | 0.483254 | 0.477785 | 0.631579 | 0.321258 | 0.393028 | 1.000000 | 0.277512 |
| accord_white floral | 0.658484 | 0.557401 | 0.609386 | 0.569675 | 0.330686 | 0.311913 | 0.534296 | 0.259206 | 0.293141 | 1.000000 |
To determine if this is a good model we can calculate this distribution when we already have a scent chosen. The most popular scent is woody so let’s try that.
Each row of this dataframe is now a scent that the perfume already has, paired with woody. The columns are then the probability of the other scent appearing alongside that pair.

The column for accord_woody is the probability of it being present given that it is present, so that is always 1. The row for accord_woody is the probability of the other scent being present given that woody is present twice; since it cannot really be present twice, this row is unchanged from the original table.
Code
given_woody_df = pd.DataFrame(
    [
        {
            condition: probability(perfumes_df, condition, given, "accord_woody")
            for condition in top_10_scents.index
        }
        for given in top_10_scents.index
    ],
    index=top_10_scents.index,
)
given_woody_df
|   | accord_woody | accord_powdery | accord_citrus | accord_sweet | accord_aromatic | accord_warm spicy | accord_floral | accord_fresh spicy | accord_amber | accord_white floral |
|---|---|---|---|---|---|---|---|---|---|---|
| accord_woody | 1.0 | 0.516177 | 0.515061 | 0.462254 | 0.512830 | 0.499070 | 0.390108 | 0.423949 | 0.410933 | 0.339160 |
| accord_powdery | 1.0 | 1.000000 | 0.497839 | 0.499280 | 0.417147 | 0.471182 | 0.461095 | 0.336455 | 0.402017 | 0.384006 |
| accord_citrus | 1.0 | 0.498917 | 1.000000 | 0.433935 | 0.600722 | 0.423827 | 0.404332 | 0.527798 | 0.339350 | 0.408664 |
| accord_sweet | 1.0 | 0.557522 | 0.483508 | 1.000000 | 0.359614 | 0.472245 | 0.434433 | 0.271118 | 0.397426 | 0.412711 |
| accord_aromatic | 1.0 | 0.419869 | 0.603336 | 0.324148 | 1.000000 | 0.531545 | 0.277012 | 0.654822 | 0.419144 | 0.237854 |
| accord_warm spicy | 1.0 | 0.487332 | 0.437407 | 0.437407 | 0.546200 | 1.000000 | 0.263785 | 0.457526 | 0.543964 | 0.257824 |
| accord_floral | 1.0 | 0.610105 | 0.533842 | 0.514776 | 0.364156 | 0.337464 | 1.000000 | 0.308866 | 0.301239 | 0.449952 |
| accord_fresh spicy | 1.0 | 0.409649 | 0.641228 | 0.295614 | 0.792105 | 0.538596 | 0.284211 | 1.000000 | 0.414912 | 0.217544 |
| accord_amber | 1.0 | 0.504977 | 0.425339 | 0.447059 | 0.523077 | 0.660633 | 0.285973 | 0.428054 | 1.000000 | 0.246154 |
| accord_white floral | 1.0 | 0.584430 | 0.620614 | 0.562500 | 0.359649 | 0.379386 | 0.517544 | 0.271930 | 0.298246 | 1.000000 |
The new distributions look very different to the previous ones. It’s easy to get distracted by the woody column and row. Is there a more methodical way to compare the two models?
Model Comparisons
The distribution has changed. How much has it changed, and is this Markov model better than just the marginal probability, which takes no account of existing scents?
Here we want to measure the difference between two probability distributions. Each cell in this table is its own small distribution, the probability of that single outcome being present or not. I think the Kullback-Leibler divergence would be an appropriate measure.
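The kl_div helper used in the following cells is not defined here; a plausible sketch, assuming it computes the element-wise KL divergence between Bernoulli (present/absent) distributions, is:

```python
import numpy as np
import pandas as pd


def kl_div(p, q, eps: float = 1e-12):
    """Element-wise KL divergence between Bernoulli(p) and Bernoulli(q).

    This is an assumed reconstruction; the helper actually used in this
    post may differ (for example scipy.special.kl_div).
    """
    p_values = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    q_values = np.clip(np.asarray(q, dtype=float), eps, 1 - eps)
    result = p_values * np.log(p_values / q_values) + (1 - p_values) * np.log(
        (1 - p_values) / (1 - q_values)
    )
    if isinstance(p, pd.DataFrame):
        # keep the index and columns so the result lines up with the inputs
        return pd.DataFrame(result, index=p.index, columns=p.columns)
    return result
```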
Computed over the table, this does show me the difference for each combination of conditional and given. As a measure of the quality of the model I have a problem though: how can I effectively combine these individual statistics into one number for the whole model? Is this even the right approach?
A better way to approach this would be to split the data into test and train sets. The distributions can be trained and then the ability to predict the test set is the quality of the model.
I did want to use the train set as the quality measure as that would make it possible to generalize this approach to Bayesian networks. That can come later though.

This does show what I was hoping for: the Markov model is higher quality than the naive Bayes classifier.
Does this approach work if I skip the train/test split? What am I measuring against? I need a way to make the naive Bayes classifier produce output of the same shape as the Markov model. I can repeat the probabilities across every given, as the naive Bayes classifier does not take them into account.
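A minimal sketch of that reshaping, reusing the probability function from earlier (the original construction of naive_bayes_df is an assumption here):

```python
# repeat the unconditional P(scent) across every "given" row so the naive
# Bayes table has the same 10x10 shape as markov_df
naive_bayes_df = pd.DataFrame(
    [
        {
            condition: probability(perfumes_df, condition)
            for condition in top_10_scents.index
        }
        for _given in top_10_scents.index
    ],
    index=top_10_scents.index,
)
```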
These predictions are clearly going to be bad. Most obviously the classifier considers \(P(woody \mid woody)\) and similar to be less than 1.
We can measure the difference between these models though.
Code
markov_df = markov(
    df=has_scent_df,
    columns=top_10_scents.index,
)
divergence = (
    kl_div(markov_df, naive_bayes_df)
    .to_numpy()
    .mean()
)
print(f"kl-divergence for naive bayes vs markov chain: {divergence:0.4f}")
kl-divergence for naive bayes vs markov chain: 29.8922
This is a huge difference. For fun we could consider fixing \(P(woody \mid woody)\) and similar to 1 and comparing again.
Code
naive_bayes_np = naive_bayes_df.to_numpy()
naive_bayes_np[
    range(len(top_10_scents)),
    range(len(top_10_scents)),
] = 1.0
divergence = (
    kl_div(markov_df, naive_bayes_np)
    .to_numpy()
    .mean()
)
print(f"kl-divergence for adjusted naive bayes vs markov chain: {divergence:0.4f}")
kl-divergence for adjusted naive bayes vs markov chain: 27.0876
The difference is still very large. What does this difference represent? Does it show that the markov chain is a better model?
I don’t think so. It is showing that these models differ from each other. I can imagine two models which are equally bad and are this different from each other.
If we imagine the most precise model then that would produce the correct probability of a scent given every other scent on the perfume. We could try measuring against this for each perfume in the dataset. Given that the number of scents on a perfume varies the first step will be to produce the distinct list of every scent combination. Then from there produce the true probability distribution and compare it to the two classifiers.
Code
distinct_scents = set(
    frozenset(
        row[top_10_scents.index][~row[top_10_scents.index].isna()].index
    )
    for row in has_scent_df.iloc
)
scent_count = len(distinct_scents)
naive_bayes_count = sum(len(scents) for scents in distinct_scents)
markov_chain_count = sum(
    len(scents) * (len(scents) - 1) for scents in distinct_scents
)

print(f"there are {scent_count:,} distinct sets of scents")
print(f"there are {naive_bayes_count:,} comparisons to perform for naive bayes")
print(f"there are {markov_chain_count:,} comparisons to perform for markov chain")
there are 742 distinct sets of scents
there are 3,641 comparisons to perform for naive bayes
there are 16,100 comparisons to perform for markov chain
This is an achievable number of comparisons to perform. The Markov model requires more comparisons because each scent in a set can also act as the given scent.
Code
import pandas as pd
from tqdm.auto import tqdm


def naive_bayes_comparisons(
    perfume_df: pd.DataFrame,
    probabilities: pd.Series,
    distinct_scents: set[frozenset[str]],
) -> pd.Series:
    divergences = []
    for scents in tqdm(distinct_scents):
        for predicted_scent in scents:
            # filter perfumes to those with the given scents
            given_scents = scents - {predicted_scent}
            df = perfume_df
            for scent in given_scents:
                df = df[~df[scent].isna()]
            # calculate true probability
            perfumes_with_scent = (~df[predicted_scent].isna()).sum()
            probability = perfumes_with_scent / len(df)
            divergences.append(
                kl_div(
                    probabilities.loc[predicted_scent],
                    probability,
                )
            )
    return pd.Series(divergences)


def markov_chain_comparisons(
    perfume_df: pd.DataFrame,
    probabilities: pd.DataFrame,
    distinct_scents: set[frozenset[str]],
) -> pd.Series:
    divergences = []
    for scents in tqdm(distinct_scents):
        for given_scent in scents:
            for predicted_scent in scents - {given_scent}:
                # filter perfumes to those with the given scents
                given_scents = scents - {predicted_scent}
                df = perfume_df
                for scent in given_scents:
                    df = df[~df[scent].isna()]
                # calculate true probability
                perfumes_with_scent = (~df[predicted_scent].isna()).sum()
                probability = perfumes_with_scent / len(df)
                divergences.append(
                    kl_div(
                        probabilities.loc[given_scent][predicted_scent],
                        probability,
                    )
                )
    return pd.Series(divergences)
The naive Bayes model divergence from the true distribution can then be computed, along with the Markov chain's.
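A sketch of how the two comparison functions would be run; the inputs mirror the earlier cells, and the exact outputs are not reproduced here:

```python
# divergence of each model from the per-combination "true" probabilities
naive_bayes_divergences = naive_bayes_comparisons(
    perfume_df=has_scent_df,
    probabilities=naive_bayes_df.iloc[0],  # unconditional P(scent) per scent
    distinct_scents=distinct_scents,
)
markov_chain_divergences = markov_chain_comparisons(
    perfume_df=has_scent_df,
    probabilities=markov_df,
    distinct_scents=distinct_scents,
)
print(naive_bayes_divergences.mean(), markov_chain_divergences.mean())
```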
Every comparison I come up with seems to make the difference between the two more dramatic.
What am I really doing with these though? These are classifiers. I could perform the same comparison but treat it as a classification where the majority class of the model is the one that it predicts. This should be more consistent with how these would actually be used.
Code
import pandas as pd
from tqdm.auto import tqdm


def naive_bayes_classifications(
    perfume_df: pd.DataFrame,
    probabilities: pd.Series,
    distinct_scents: set[frozenset[str]],
) -> float:
    correct_classifications = 0
    total_classifications = 0
    for scents in tqdm(distinct_scents):
        for predicted_scent in scents:
            # filter perfumes to those with the given scents
            given_scents = scents - {predicted_scent}
            df = perfume_df
            for scent in given_scents:
                df = df[~df[scent].isna()]
            # does the model predict this scent?
            predicted = probabilities.loc[predicted_scent] >= 0.5
            # assign the appropriate number of classifications
            perfumes_with_scent = (~df[predicted_scent].isna()).sum()
            if predicted:
                correct_classifications += perfumes_with_scent
            else:
                correct_classifications += len(df) - perfumes_with_scent
            total_classifications += len(df)
    return correct_classifications / total_classifications


def markov_chain_classifications(
    perfume_df: pd.DataFrame,
    probabilities: pd.DataFrame,
    distinct_scents: set[frozenset[str]],
) -> float:
    correct_classifications = 0
    total_classifications = 0
    for scents in tqdm(distinct_scents):
        for given_scent in scents:
            for predicted_scent in scents - {given_scent}:
                # filter perfumes to those with the given scents
                given_scents = scents - {predicted_scent}
                df = perfume_df
                for scent in given_scents:
                    df = df[~df[scent].isna()]
                # does the model predict this scent?
                predicted = probabilities.loc[given_scent][predicted_scent] >= 0.5
                # assign the appropriate number of classifications
                perfumes_with_scent = (~df[predicted_scent].isna()).sum()
                if predicted:
                    correct_classifications += perfumes_with_scent
                else:
                    correct_classifications += len(df) - perfumes_with_scent
                total_classifications += len(df)
    return correct_classifications / total_classifications
To me this shows the quality of the classifiers much better. The Markov chain classifier is better than random while the naive Bayes classifier is not.
Is it possible to use this technique to construct a Bayesian network? We could iterate through all possible connections and take the one which improves the model by the greatest margin. It's likely that this would be very slow, as evaluating every candidate edge is \(O(N^2)\).
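A rough sketch of that greedy search; score_network is a hypothetical scoring callable (for example the classification accuracy above) and is not part of the original analysis:

```python
def greedy_structure_search(scents, score_network, max_edges=10):
    """Greedily add the edge that improves the network score the most."""
    edges: set[tuple[str, str]] = set()
    best_score = score_network(edges)
    for _ in range(max_edges):
        # every ordered pair of scents is a candidate edge, so each step
        # evaluates O(N^2) candidates
        candidates = [
            (parent, child)
            for parent in scents
            for child in scents
            if parent != child and (parent, child) not in edges
        ]
        if not candidates:
            break
        score, edge = max(
            (score_network(edges | {candidate}), candidate)
            for candidate in candidates
        )
        if score <= best_score:
            break
        edges.add(edge)
        best_score = score
    return edges
```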
Checking My Work
At this point I need to consult someone who has done this already. I like computers, so I am consulting the pomegranate library. This has an implementation of Markov chains, so we can start by working with that.
Part of the problem with using this is that it bakes in the chain ordering (unsurprisingly), so I have to enforce an order on the scents. I am going to categorize them by column index and then sort them by accord value.
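A sketch of that first attempt, assuming "value" means the accord strength (the cell that produced the error below is reconstructed here):

```python
# map each scent to a category index, then order a perfume's scents by
# their accord value so each perfume becomes one ordered chain
scent_to_category = {
    name: index for index, name in enumerate(top_10_scents.index)
}


def row_to_chain(row: pd.Series) -> list[list[int]]:
    row = row[top_10_scents.index].dropna()
    ordered = row.sort_values(ascending=False).index.map(scent_to_category)
    return [[scent] for scent in ordered]


chains = [row_to_chain(row) for row in perfumes_df.iloc]
```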
from pomegranate.markov_chain import MarkovChain

model = MarkovChain(k=2)
model.fit(chains)
ValueError: expected sequence of length 5 at dim 1 (got 4)
This wants the different chain lengths to be identical. I’m surprised by this restriction as I had always thought of Markov chains as being a way to predict over arbitrary length sequences (e.g. next word prediction).
At the moment I’m using this to verify the work that I have done so far. Let’s try creating the full set of “chains” which are only length two. Over a big dataset with lots of features this would be impractically large, but we have 10 scents that may be present and a few thousand perfumes.
Code
scent_to_category = {
    name: index
    for index, name in enumerate(top_10_scents.index)
}


def row_to_2_chains(row: pd.Series) -> list[list[list[int]]]:
    row = row[top_10_scents.index]
    row = row.dropna()
    scents = set(row.index.map(scent_to_category))
    return [
        [[first], [second]]
        for first in sorted(scents)
        for second in sorted(scents - {first})
    ]


chains = [
    chain
    for row in perfumes_df.iloc
    for chain in row_to_2_chains(row)
]
Code
from pomegranate.markov_chain import MarkovChain

model = MarkovChain(k=2)
model.fit(chains)
TypeError: 'NoneType' object cannot be interpreted as an integer
Now it appears that each sequence needs to be at least one element longer than the Markov chain order.
Code
scent_to_category = {
    name: index
    for index, name in enumerate(top_10_scents.index)
}


def row_to_3_chains(row: pd.Series) -> list[list[list[int]]]:
    row = row[top_10_scents.index]
    row = row.dropna()
    scents = set(row.index.map(scent_to_category))
    return [
        [[first], [second], [third]]
        for first in sorted(scents)
        for second in sorted(scents - {first})
        for third in sorted(scents - {first, second})
    ]


chains = [
    chain
    for row in perfumes_df.iloc
    for chain in row_to_3_chains(row)
]
Code
from pomegranate.markov_chain import MarkovChain

model = MarkovChain(k=2)
model.fit(chains) ; None  # the trailing None stops the notebook printing the fitted model
The fitted model cannot be used to predict, which is surprising as this library appears to copy the sklearn estimator style. Instead you can get the probability of a given chain of events.
We can see how this has been normalized by summing all of the possible second events for a given scent.
model.probability([
    [[scent_to_category["accord_woody"]], [second]]
    for second in scent_to_category.values()
]).sum()
tensor(0.1436)
This probability is unlike the previous computation that I did. When I calculated the probabilities I was considering whether the scent was present or missing. The construction of the chains has not replicated this, because the chains are not exclusive: if we have woody followed by powdery, that does not mean that citrus is not present.
Reviewing the construction of this Markov chain shows me how my own model was incorrect. Instead of creating a Markov chain I have modelled something closer to a Bayesian network. Since I want to investigate Bayesian networks, let's press on.
Bayesian Networks
Notes
- Evaluation should be a regression model evaluation.
- MAPE (mean absolute percentage error) is a suitable metric.
- For naive Bayes, calibration is a problem even if the model is otherwise ok. A Bayesian network can be better calibrated, if you get the correct structure.
- A Bayesian network is good for optimization problems: given user preferences as a prior, maximize a continuous variable such as the rating.
- Bayesian optimization: BoTorch is a Bayesian optimizer in PyTorch.
- Trial design relates to Bayesian experimental design; Bayesian search-space filling can be combined with Bayesian search.
- From https://ax.dev/docs/bayesopt: "Bayesian optimization (BO) allows us to tune parameters in relatively few iterations by building a smooth model from an initial set of parameterizations (referred to as the 'surrogate model') in order to predict the outcomes for as yet unexplored parameterizations. BO is an adaptive approach where the observations from previous evaluations are used to decide what parameterizations to evaluate next. The same strategy can be used to predict the expected gain from all future evaluations and decide on early termination, if the expected benefit is smaller than what is worthwhile for the problem at hand."
- Libraries to look at: pomegranate, PyMC, pgmpy.
- There are two types of experiment: given a dataset, what is the best model? This adds a layer of complexity as I need to fit the network as well as understand it.
- There are lots of potential graphs that can fit small datasets and give good results. It can be a subgraph issue, where the proposed network has excess edges, or influence spreading through different paths, resulting in non-subgraph relationships between some ideal network and the current one.
- Structure learning is the process of learning these networks. I can start with an artificial dataset where the structure is known and then attempt to derive it through the search algorithms, varying the training dataset size and exploring the results.