(He et al. 2017)
Data Cleaning
The comment data for the perfumes is nice and has already been analysed for sentiment. However, the number of comments that each individual has given varies quite substantially.
In the MovieLens dataset each reviewer has reviewed at least 20 movies, which ensures there is enough data to train the reviewer embeddings on. Let's see if we can come up with a similar threshold for the perfumes and commenters.
The first thing to do is to look at the distribution of comments.
Code
import pandas as pd

comments_df = pd.read_parquet("/data/perfumes/processed/comments.gz.parquet")
comments_df["comment_length"] = comments_df.content.str.strip().str.len()
comments_df = comments_df[comments_df.comment_length > 0]
comments_df = comments_df.sort_values(by="comment_length", ascending=False)
comments_df = comments_df.drop_duplicates(
    subset=["object_id", "author_id"],
    keep="first",  # keep the longest comment per author and perfume
)
(
    comments_df
    .author_name
    .value_counts()
    .to_frame()
    .reset_index(drop=True)
    .plot(
        title="comments per author",
        ylabel="comments",
        legend=False,
    )
)
(
    comments_df
    .object_id
    .value_counts()
    .to_frame()
    .reset_index(drop=True)
    .plot(
        title="comments per perfume",
        ylabel="comments",
        legend=False,
    )
); None
We can see here that the number of comments per commenter or per perfume drops off rapidly. This distribution is quite typical, as the most popular perfumes are more visible on the site, leading to more comments.
Is there a suitable threshold that we can use to make sure there is enough data per perfume or commenter? Let’s try checking a few thresholds.
Code
import pandas as pd

def threshold(df: pd.DataFrame, count: int) -> pd.DataFrame:
    def subset(values: pd.Series) -> set[str]:
        counts = values.value_counts()
        return set(counts[counts >= count].index)

    # dropping sparse authors can push perfumes below the threshold and
    # vice versa, so iterate until the dataframe stops shrinking
    last_size = None
    while last_size != len(df):
        last_size = len(df)
        authors = subset(df.author_name)
        perfumes = subset(df.object_id)
        df = df[df.author_name.isin(authors) & df.object_id.isin(perfumes)]
    df = df.reset_index(drop=True)
    df["author_name"] = df.author_name.astype("category")
    df["object_id"] = df.object_id.astype("category")
    return df

def threshold_stats(df: pd.DataFrame, count: int) -> dict:
    df = threshold(df=df, count=count)
    author_count = len(df.author_name.unique())
    perfume_count = len(df.object_id.unique())
    return {
        "threshold": count,
        "comments": len(df),
        "authors": author_count,
        "perfumes": perfume_count,
    }

threshold_df = pd.DataFrame([
    threshold_stats(df=comments_df, count=count)
    for count in range(21)
]).set_index("threshold")
threshold_df.plot()
threshold_df.loc[[5, 10, 20]]
           comments  authors  perfumes
threshold
5            125836    10834      2296
10            75394     3760      1517
20             3635      137        85
Using the MovieLens threshold of 20 would destroy this dataset. I think that a threshold of 5 or 10 is achievable, as there is still enough data to make this interesting.
Code
comments_5_df = threshold(df=comments_df, count=5)
comments_10_df = threshold(df=comments_df, count=10)
We then need a target value to train against. Each comment has been through a sentiment model to determine its overall sentiment. I am going to use this as a proxy for the commenter's preference.
To train collaborative filters I need a single value expressing this. Keeping the process simple should help:

* The target value is between -1 and 1
* If the sentiment is negative then the target is -1
* If the sentiment is neutral then the target is 0
* If the sentiment is positive then the target is 1
This scale is nice because tanh can be applied to the output of the model to scale it to this range.
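As a quick sanity check of that claim, tanh maps any real-valued output into (-1, 1), so an unbounded model head can still land on this target scale. A minimal sketch:
Code
import torch

# tanh squashes any real-valued output into (-1, 1),
# matching the -1 / 0 / 1 target scale
raw_outputs = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
print(torch.tanh(raw_outputs))
# tensor([-0.9951, -0.4621,  0.0000,  0.4621,  0.9951])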
Code
import pandas as pd

comments_5_df["target_class"] = comments_5_df[["negative", "neutral", "positive"]].T.idxmax()
comments_5_df["target"] = comments_5_df.target_class.map({
    "negative": -1.0,
    "neutral": 0.0,
    "positive": 1.0,
})
comments_10_df["target_class"] = comments_10_df[["negative", "neutral", "positive"]].T.idxmax()
comments_10_df["target"] = comments_10_df.target_class.map({
    "negative": -1.0,
    "neutral": 0.0,
    "positive": 1.0,
})
pd.DataFrame(
    {
        "threshold 5": comments_5_df.target_class.value_counts() / len(comments_5_df),
        "threshold 10": comments_10_df.target_class.value_counts() / len(comments_10_df),
    }
).loc[["negative", "neutral", "positive"]]
              threshold 5  threshold 10
target_class
negative         0.375799      0.396610
neutral          0.113688      0.108351
positive         0.510514      0.495039
These are similar distributions, and neutral is rarely expressed. It may well be that we can combine neutral with negative, as a neutral review of a perfume is not an endorsement, and that would make these datasets nearly balanced.
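That merge isn't applied here, but as a sketch it would be a one-line change to the mapping:
Code
# hypothetical follow-up, not used below: fold neutral into negative,
# treating anything short of positive sentiment as a non-endorsement
binary_target = comments_5_df.target_class.map({
    "negative": -1.0,
    "neutral": -1.0,  # a neutral review is not an endorsement
    "positive": 1.0,
})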
The final thing is to split these into test and train datasets. Comments have a date, so taking the most recent comment by each user produces a systematic test split. Since every user has at least 5 (or 10) comments, holding out one comment per user yields a test dataset of at most 20% (or 10%) of the data.
Code
import pandas as pd

def test_train_split_idx(df: pd.DataFrame) -> tuple[pd.Index, pd.Index]:
    # the most recent comment per author becomes the test set
    most_recent = (
        df
        .sort_values(by="date", ascending=False)
        .drop_duplicates(subset=["author_name"], keep="first")
    )
    recent_index = set(most_recent.index)
    without_most_recent = df[~df.index.isin(recent_index)]
    return most_recent.index, without_most_recent.index
Code
import pandas as pd

comments_5_test_idx, comments_5_train_idx = test_train_split_idx(comments_5_df)
comments_10_test_idx, comments_10_train_idx = test_train_split_idx(comments_10_df)
pd.DataFrame([
    {"threshold": 5, "test size": len(comments_5_test_idx), "train size": len(comments_5_train_idx)},
    {"threshold": 10, "test size": len(comments_10_test_idx), "train size": len(comments_10_train_idx)},
]).set_index("threshold")
           test size  train size
threshold
5              10834      115002
10              3760       71634
Scikit-Learn
I've done this sort of training before. These models are dramatically simpler than what I usually use, and their performance heavily depends on the choice of hyperparameters. Given that, it might be a good time to try structuring this code as a sklearn estimator.
If this code is structured in that way then it is possible to perform a grid search over the hyperparameters. Let’s start by defining the structure of such an estimator.
Code
from dataclasses import dataclass
from typing import TypedDict, Optional

from tqdm.auto import tqdm
from torch.utils.data import DataLoader, Dataset
from torch import nn
import torch
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator

optimizer_fn = {
    "sgd": torch.optim.SGD,
    "adam": torch.optim.Adam,
}

class Entry(TypedDict):
    user: int  # categorical code for the author
    product: int  # categorical code for the perfume
    target: float

@dataclass
class CommentDataset(Dataset):
    df: pd.DataFrame

    def __post_init__(self) -> None:
        self.df = self.df.reset_index(drop=True)

    def __len__(self) -> int:
        return len(self.df)

    def __getitem__(self, index) -> Entry:
        return {
            "user": self.df.author_name.cat.codes[index],
            "product": self.df.object_id.cat.codes[index],
            "target": self.df.target[index],
        }

class CollaborativeEstimator(BaseEstimator):
    def __init__(
        self,
        learning_rate: float,
        epochs: int,
        batch_size: int,
        optimizer: str,
        device: str,
    ) -> None:
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size
        self.optimizer = optimizer
        self.device = device
        self.model = None

    def _create_model(self) -> nn.Module:
        raise NotImplementedError()

    def fit(self, X, y, quiet: bool = True) -> "CollaborativeEstimator":
        df = pd.DataFrame(X)
        df["target"] = y
        self.model = self.train(train_df=df, quiet=quiet)
        return self  # sklearn convention: fit returns the estimator

    @torch.inference_mode()
    def predict(self, X) -> np.ndarray:
        assert self.model is not None, "fit the model first"
        users = X.author_name.cat.codes.tolist()
        products = X.object_id.cat.codes.tolist()
        self.model.eval()
        predictions = [
            self.model(
                users=torch.tensor(users[index:index + self.batch_size], device=self.device),
                products=torch.tensor(products[index:index + self.batch_size], device=self.device),
            ).cpu().numpy()
            for index in range(0, len(users), self.batch_size)
        ]
        return np.concatenate(predictions)

    def train(self, train_df: pd.DataFrame, test_df: Optional[pd.DataFrame] = None, quiet: bool = True) -> nn.Module:
        train_dl = DataLoader(
            dataset=CommentDataset(train_df),
            batch_size=self.batch_size,
            shuffle=True,
        )
        if test_df is not None:
            test_dl = DataLoader(
                dataset=CommentDataset(test_df),
                batch_size=self.batch_size,
                shuffle=False,
            )
        else:
            test_dl = None
        model = self._create_model()
        model.to(self.device)
        optimizer = optimizer_fn[self.optimizer](
            model.parameters(),
            lr=self.learning_rate,
        )
        for epoch in tqdm(range(self.epochs), disable=quiet):
            train_loss = self._train_iter(model=model, optimizer=optimizer, dl=train_dl, quiet=quiet)
            train_loss /= len(train_df)
            if test_dl is not None:
                test_loss = self._test_iter(model=model, dl=test_dl, quiet=quiet) / len(test_df)
                if not quiet:
                    print(f"epoch {epoch:02d}: train loss {train_loss:0.4f}, test loss {test_loss:0.4f}")
            elif not quiet:
                print(f"epoch {epoch:02d}: train loss {train_loss:0.4f}")
        model.eval()
        return model

    def _train_iter(self, model: nn.Module, optimizer: torch.optim.Optimizer, dl: DataLoader, quiet: bool) -> float:
        model.train()
        train_loss = 0.0
        for batch in tqdm(dl, disable=quiet, leave=False):
            users = batch["user"].to(self.device)
            products = batch["product"].to(self.device)
            targets = batch["target"].to(self.device)
            optimizer.zero_grad()
            output = model(users=users, products=products)
            # flatten a possible (batch, 1) output so it does not
            # broadcast against the (batch,) targets
            loss = (output.squeeze(-1) - targets) ** 2
            train_loss += loss.sum().item()
            loss = loss.mean()
            loss.backward()
            optimizer.step()
        return train_loss

    @torch.inference_mode()
    def _test_iter(self, model: nn.Module, dl: DataLoader, quiet: bool) -> float:
        model.eval()
        test_loss = 0.0
        for batch in tqdm(dl, disable=quiet, leave=False):
            users = batch["user"].to(self.device)
            products = batch["product"].to(self.device)
            targets = batch["target"].to(self.device)
            output = model(users=users, products=products)
            loss = (output.squeeze(-1) - targets) ** 2
            test_loss += loss.sum().item()
        return test_loss
This implementation lacks a key part: the _create_model method. I also need to ensure that this can work with a custom data split. Let's try it out with a trivial model.
This model takes a single hyperparameter that it will return as the score.
Code
from dataclasses import dataclass

from torch import nn
import torch

@dataclass
class FixedConfig:
    score: float

class FixedModel(nn.Module):
    def __init__(self, config: FixedConfig) -> None:
        super().__init__()
        self.config = config
        # need a parameter to train
        self.score = nn.Parameter(torch.ones(1) * config.score)

    def forward(self, users: torch.IntTensor, products: torch.IntTensor) -> torch.Tensor:
        # create the constant output on the same device as the parameter
        return torch.ones(users.shape[0], device=self.score.device) * self.score

class FixedEstimator(CollaborativeEstimator):
    def __init__(
        self,
        score: float,
        learning_rate: float,
        epochs: int,
        batch_size: int,
        optimizer: str,
        device: str,
    ) -> None:
        super().__init__(
            learning_rate=learning_rate,
            epochs=epochs,
            batch_size=batch_size,
            optimizer=optimizer,
            device=device,
        )
        self.score = score

    def _create_model(self) -> nn.Module:
        config = FixedConfig(score=self.score)
        return FixedModel(config)
We can then try this out. The aim here is to make sure that the model is trained the correct number of times with the correct dataset.
Code
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, make_scorer

parameter_grid = {
    "score": [-1, -0.5, 0, 0.5, 1],
}
base_estimator = FixedEstimator(
    score=0.0,
    learning_rate=0.0,  # disable changing the score
    epochs=1,
    batch_size=64,
    optimizer="sgd",
    device="cpu",
)
grid = GridSearchCV(
    estimator=base_estimator,
    param_grid=parameter_grid,
    n_jobs=-1,
    scoring=make_scorer(mean_absolute_error, greater_is_better=False),
    cv=[(comments_5_train_idx, comments_5_test_idx)],
    error_score="raise",
)
grid.fit(
    X=comments_5_df[["object_id", "author_name"]],
    y=comments_5_df["target"],
)
GridSearchCV(cv=[(Index([     0,      1,      2,      4,      5,      6,      7,      8,     10,
                     11,
                  ...
                  125826, 125827, 125828, 125829, 125830, 125831, 125832, 125833, 125834,
                  125835],
                 dtype='int64', length=115002),
           Index([  6997, 106583,   2096,  99857,  75533, 116425, 103546,  91366,  20675,
                   46071,
                  ...
                   89531,  12221,  71600, 102132, 111257,  87469, 124689, 110351,  77345,
                  115926],
                 dtype='int64', length=10834))],
       error_score='raise',
       estimator=FixedEstimator(batch_size=64, device='cpu', epochs=1,
                                learning_rate=0.0, optimizer='adam',
                                score=0.0),
       n_jobs=-1, param_grid={'score': [-1, -0.5, 0, 0.5, 1]},
       scoring=make_scorer(mean_absolute_error, greater_is_better=False))
Code
grid.cv_results_
{'mean_fit_time': array([10.15469885, 9.80105376, 10.18585205, 10.24359465, 9.83500171]),
'std_fit_time': array([0., 0., 0., 0., 0.]),
'mean_score_time': array([0.00424838, 0.00418973, 0.00426602, 0.00428987, 0.00427222]),
'std_score_time': array([0., 0., 0., 0., 0.]),
'param_score': masked_array(data=[-1, -0.5, 0, 0.5, 1],
mask=[False, False, False, False, False],
fill_value='?',
dtype=object),
'params': [{'score': -1},
{'score': -0.5},
{'score': 0},
{'score': 0.5},
{'score': 1}],
'split0_test_score': array([-1.13716079, -0.99843087, -0.85970094, -0.86127008, -0.86283921]),
'mean_test_score': array([-1.13716079, -0.99843087, -0.85970094, -0.86127008, -0.86283921]),
'std_test_score': array([0., 0., 0., 0., 0.]),
'rank_test_score': array([5, 4, 1, 2, 3], dtype=int32)}
The results here show that 5 models were evaluated, each with its own score parameter. Those 5 models have scores tracked by the splitN_test_score values, of which there is only one (split0_test_score). This confirms that the single cv split I passed in was respected.
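The winning configuration can be read straight out of those results. A minimal sketch, assuming the fitted grid object from above:
Code
# the entry ranked 1 in rank_test_score is the best combination
best_index = grid.cv_results_["rank_test_score"].argmin()
print(grid.cv_results_["params"][best_index])  # {'score': 0}
print(grid.best_params_)  # equivalent shortcut provided by GridSearchCV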
The score itself is negative, which surprised me. It turns out that using a loss function with greater_is_better=False results in the scorer negating the underlying score, so that the grid search can keep maximizing the score.
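A minimal demonstration of that negation, using sklearn's DummyRegressor as a stand-in rather than our estimator:
Code
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, make_scorer
import numpy as np

X = np.zeros((4, 1))
y = np.array([0.0, 1.0, 1.0, 0.0])
dummy = DummyRegressor(strategy="constant", constant=0.0).fit(X, y)

scorer = make_scorer(mean_absolute_error, greater_is_better=False)
print(mean_absolute_error(y, dummy.predict(X)))  # 0.5, the raw loss
print(scorer(dummy, X, y))  # -0.5, negated so higher is better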
The best estimator should be available via grid.estimator, so we can check that it returns a fixed value.
Code
grid.estimator.predict(comments_5_df.loc[comments_5_test_idx].head(10))
AssertionError: fit the model first
It looks like grid.estimator is the unfitted template that the grid search clones for each fit, rather than the trained winner (that would be grid.best_estimator_).
Let's fit it manually and then evaluate it.
Code
grid.estimator.fit(
    X=comments_5_df.loc[comments_5_train_idx][["object_id", "author_name"]],
    y=comments_5_df.loc[comments_5_train_idx]["target"],
)
grid.estimator.predict(
    X=comments_5_df.loc[comments_5_test_idx][["object_id", "author_name"]].head(10),
)
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)
We can see that the estimator is returning the fixed value of 0.0 for every input, which is what the model is designed to do.
This model achieved a mean absolute error of 0.8597, so we should bear that in mind for future models.
For a stronger baseline we can even infer the ideal single value to return. When we generated the data we set the target value for each row, so the mean over the training dataset is easy to calculate.
Code
comments_5_df.loc[comments_5_train_idx].target.mean()
Code
from sklearn.metrics import mean_absolute_error

estimator = FixedEstimator(
    score=comments_5_df.loc[comments_5_train_idx].target.mean(),
    learning_rate=0.0,  # disable changing the score
    epochs=1,
    batch_size=64,
    optimizer="sgd",
    device="cpu",
)
estimator.fit(
    X=comments_5_df.loc[comments_5_train_idx][["object_id", "author_name"]],
    y=comments_5_df.loc[comments_5_train_idx]["target"],
)
predictions = estimator.predict(
    X=comments_5_df.loc[comments_5_test_idx][["object_id", "author_name"]],
)
mean_absolute_error(
    y_true=comments_5_df.loc[comments_5_test_idx]["target"],
    y_pred=predictions,
)
This shows that the training-set mean is not the ideal constant for this metric: the mean minimizes squared error, while mean absolute error is minimized by the median, so the mean scores no better here than simply predicting 0. Either way, a score of around 0.86 is our constant-prediction baseline, and any real model needs to beat it.
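A quick check of that claim, as a hypothetical follow-up, is to compare the two optimal constants directly:
Code
# the mean is the MSE-optimal constant; the median is the MAE-optimal one
train_targets = comments_5_df.loc[comments_5_train_idx].target
print(train_targets.mean())
print(train_targets.median())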
Hadamard
The simplest collaborative filter is the Hadamard product of the user and perfume embeddings, summed to a single score. This can optionally incorporate a per-user and per-product bias, which is added outside the product.
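Here is a toy version of that scoring rule before wrapping it in a module; the numbers are made up for illustration:
Code
import torch

# the elementwise (Hadamard) product summed over the embedding
# dimension is just a dot product between user and perfume
user = torch.tensor([0.5, -1.0, 2.0])
perfume = torch.tensor([1.0, 0.5, 0.25])
score = (user * perfume).sum()  # 0.5 - 0.5 + 0.5 = 0.5
score = score + 0.1 + (-0.2)  # optional user and perfume biases
print(torch.tanh(score))  # squashed into (-1, 1): tanh(0.4) ≈ 0.38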
Code
from dataclasses import dataclass

from torch import nn
import torch

@dataclass
class HadamardConfig:
    product_count: int
    user_count: int
    embedding_size: int
    bias: bool
    tanh: bool

class HadamardModel(nn.Module):
    def __init__(self, config: HadamardConfig) -> None:
        super().__init__()
        self.config = config
        self.user_embedding = nn.Embedding(
            num_embeddings=config.user_count,
            embedding_dim=config.embedding_size,
        )
        self.product_embedding = nn.Embedding(
            num_embeddings=config.product_count,
            embedding_dim=config.embedding_size,
        )
        if config.bias:
            self.user_bias = nn.Parameter(torch.zeros(config.user_count))
            self.product_bias = nn.Parameter(torch.zeros(config.product_count))

    def forward(self, users: torch.IntTensor, products: torch.IntTensor) -> torch.Tensor:
        users = users.long()
        products = products.long()
        user_embedding = self.user_embedding(users)
        product_embedding = self.product_embedding(products)
        hadamard = user_embedding * product_embedding
        result = hadamard.sum(dim=-1)
        if self.config.bias:
            result = result + self.user_bias[users] + self.product_bias[products]
        if self.config.tanh:
            return torch.tanh(result)
        return result

class HadamardEstimator(CollaborativeEstimator):
    def __init__(
        self,
        product_count: int,
        user_count: int,
        embedding_size: int,
        bias: bool,
        tanh: bool,
        learning_rate: float,
        epochs: int,
        batch_size: int,
        optimizer: str,
        device: str,
    ) -> None:
        super().__init__(
            learning_rate=learning_rate,
            epochs=epochs,
            batch_size=batch_size,
            optimizer=optimizer,
            device=device,
        )
        self.product_count = product_count
        self.user_count = user_count
        self.embedding_size = embedding_size
        self.bias = bias
        self.tanh = tanh

    def _create_model(self) -> nn.Module:
        config = HadamardConfig(
            product_count=self.product_count,
            user_count=self.user_count,
            embedding_size=self.embedding_size,
            bias=self.bias,
            tanh=self.tanh,
        )
        return HadamardModel(config)
Code
estimator = HadamardEstimator(
    product_count=len(comments_5_df.object_id.cat.categories),
    user_count=len(comments_5_df.author_name.cat.categories),
    embedding_size=32,
    bias=True,
    tanh=True,
    learning_rate=1e-3,
    epochs=5,
    batch_size=64,
    optimizer="adam",
    device="cpu",
)
estimator.fit(
    X=comments_5_df.loc[comments_5_train_idx][["object_id", "author_name"]],
    y=comments_5_df.loc[comments_5_train_idx]["target"],
    quiet=False,
)
predictions = estimator.predict(
    X=comments_5_df.loc[comments_5_test_idx][["object_id", "author_name"]],
)
mean_absolute_error(
    y_true=comments_5_df.loc[comments_5_test_idx].target,
    y_pred=predictions,
)
epoch 00: train loss 1.7357
epoch 01: train loss 1.5551
epoch 02: train loss 1.4020
epoch 03: train loss 1.2910
epoch 04: train loss 1.2088
This is performing poorly. The range of values that can be produced is -1 to 1, and the model is forced to stay within that range by the tanh activation function, yet it is still far from the targets.
Remember that returning a fixed value of 0.0 got a mean absolute error of 0.8597. I can't even say this model has overfit, as the reported loss is the training mean squared error: a final MSE of 1.2088 means a root mean squared error of about 1.1 on the very data it was trained on.
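Why phrase that in RMSE rather than mean absolute error? A mean squared error above one does not force the mean absolute error above one, because large errors dominate the square. A made-up example:
Code
import numpy as np

# one large error pushes the MSE over 1 while the MAE stays under 1
errors = np.array([0.1, 0.1, 2.4])
print((errors ** 2).mean())  # 1.9267 (MSE > 1)
print(np.abs(errors).mean())  # 0.8667 (MAE < 1)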
There are several hyperparameters available for this model, and the model is so simple that it runs quite quickly. Part of the reason for using a sklearn estimator as the structure is to allow a grid search over these hyperparameters.
Code
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, make_scorer

parameter_grid = {
    "embedding_size": [32, 64],
    "bias": [True, False],
    "tanh": [True, False],
    "batch_size": [64, 128],
}
base_estimator = HadamardEstimator(
    product_count=len(comments_5_df.object_id.cat.categories),
    user_count=len(comments_5_df.author_name.cat.categories),
    embedding_size=32,
    bias=True,
    tanh=True,
    learning_rate=1e-1,
    epochs=5,
    batch_size=64,
    optimizer="sgd",
    device="cpu",
)
grid = GridSearchCV(
    estimator=base_estimator,
    param_grid=parameter_grid,
    n_jobs=-1,
    scoring=make_scorer(mean_absolute_error, greater_is_better=False),
    cv=[(comments_5_train_idx, comments_5_test_idx)],
    error_score="raise",
)
grid.fit(
    X=comments_5_df[["object_id", "author_name"]],
    y=comments_5_df["target"],
)
combination_count = len(grid.cv_results_["mean_test_score"])
best_score = -grid.cv_results_["mean_test_score"].max()
print(
    f"grid search of {combination_count:,} combinations "
    f"results in best score of {best_score:0.4f}"
)
grid search of 16 combinations results in best score of 0.9748
Code
grid.best_estimator_
HadamardEstimator(batch_size=64, bias=True, device='cpu', embedding_size=32,
                  epochs=5, learning_rate=0.1, optimizer='sgd',
                  product_count=2296, tanh=True, user_count=10834)
While this grid search has improved the score slightly, the model itself is still significantly worse than just predicting a single value each time.
Hadamard then Linear
This uses a linear layer to turn the embedding product into a single value. I'm removing the per-user and per-perfume bias just to make this model simpler. I don't expect this to be the solution.
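The difference from the previous model, in miniature: instead of summing the Hadamard vector, a linear head learns its own weight per embedding dimension. A sketch with made-up sizes:
Code
import torch
from torch import nn

# the linear head consumes the full elementwise product vector,
# weighting each embedding dimension separately instead of summing
hadamard = torch.randn(8)  # toy per-dimension interaction vector
head = nn.Linear(in_features=8, out_features=1)
print(head(hadamard).shape)  # torch.Size([1])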
Code
from dataclasses import dataclass

from torch import nn
import torch

@dataclass
class HadamardThenLinearConfig:
    product_count: int
    user_count: int
    embedding_size: int
    bias: bool
    tanh: bool

class HadamardThenLinearModel(nn.Module):
    def __init__(self, config: HadamardThenLinearConfig) -> None:
        super().__init__()
        self.config = config
        self.user_embedding = nn.Embedding(
            num_embeddings=config.user_count,
            embedding_dim=config.embedding_size,
        )
        self.product_embedding = nn.Embedding(
            num_embeddings=config.product_count,
            embedding_dim=config.embedding_size,
        )
        self.linear = nn.Linear(config.embedding_size, 1, bias=config.bias)

    def forward(self, users: torch.IntTensor, products: torch.IntTensor) -> torch.Tensor:
        users = users.long()
        products = products.long()
        user_embedding = self.user_embedding(users)
        product_embedding = self.product_embedding(products)
        hadamard = user_embedding * product_embedding
        result = self.linear(hadamard)
        if self.config.tanh:
            return torch.tanh(result)
        return result

class HadamardThenLinearEstimator(CollaborativeEstimator):
    def __init__(
        self,
        product_count: int,
        user_count: int,
        embedding_size: int,
        bias: bool,
        tanh: bool,
        learning_rate: float,
        epochs: int,
        batch_size: int,
        optimizer: str,
        device: str,
    ) -> None:
        super().__init__(
            learning_rate=learning_rate,
            epochs=epochs,
            batch_size=batch_size,
            optimizer=optimizer,
            device=device,
        )
        self.product_count = product_count
        self.user_count = user_count
        self.embedding_size = embedding_size
        self.bias = bias
        self.tanh = tanh

    def _create_model(self) -> nn.Module:
        config = HadamardThenLinearConfig(
            product_count=self.product_count,
            user_count=self.user_count,
            embedding_size=self.embedding_size,
            bias=self.bias,
            tanh=self.tanh,
        )
        return HadamardThenLinearModel(config)
Code
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, make_scorer

parameter_grid = {
    "embedding_size": [32, 64],
    "bias": [True, False],
    "tanh": [True, False],
    "batch_size": [64, 128],
}
base_estimator = HadamardThenLinearEstimator(
    product_count=len(comments_5_df.object_id.cat.categories),
    user_count=len(comments_5_df.author_name.cat.categories),
    embedding_size=32,
    bias=True,
    tanh=True,
    learning_rate=1e-1,
    epochs=5,
    batch_size=64,
    optimizer="sgd",
    device="cpu",
)
grid = GridSearchCV(
    estimator=base_estimator,
    param_grid=parameter_grid,
    n_jobs=-1,
    scoring=make_scorer(mean_absolute_error, greater_is_better=False),
    cv=[(comments_5_train_idx, comments_5_test_idx)],
    error_score="raise",
)
grid.fit(
    X=comments_5_df[["object_id", "author_name"]],
    y=comments_5_df["target"],
)
combination_count = len(grid.cv_results_["mean_test_score"])
best_score = -grid.cv_results_["mean_test_score"].max()
print(
    f"grid search of {combination_count:,} combinations "
    f"results in best score of {best_score:0.4f}"
)
grid search of 16 combinations results in best score of 0.8598
Code
HadamardThenLinearEstimator(batch_size=64, bias=True, device='cpu',
embedding_size=32, epochs=5, learning_rate=0.1,
optimizer='sgd', product_count=2296, tanh=True,
user_count=10834) In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
With the linear layer the model is able to achieve a near-identical score to the fixed-score model (0.8598 vs 0.8597). This is hardly a resounding success. I wonder if this has devolved into making fixed predictions.
Code
grid.estimator.fit(
    X=comments_5_df.loc[comments_5_train_idx][["object_id", "author_name"]],
    y=comments_5_df.loc[comments_5_train_idx]["target"],
)
grid.estimator.predict(
    X=comments_5_df.loc[comments_5_test_idx][["object_id", "author_name"]].head(10),
)
array([[0.05531971],
[0.1267522 ],
[0.10492838],
[0.10181362],
[0.19049501],
[0.10646662],
[0.06197954],
[0.06103778],
[0.05164405],
[0.12245701]], dtype=float32)
There is some variation in the output, but it still hasn't managed to do better than a fixed prediction. As a last experiment, let's deepen the model: a small MLP head on top of the Hadamard product, trained against a softer regression target on the full, unthresholded dataset.
Code
from dataclasses import dataclass

import torch
from torch import nn
import pandas as pd

activation_name_to_function = {
    "gelu": nn.GELU,
    "tanh": nn.Tanh,
}

@dataclass
class NeuralCollaborativeFilterConfig:
    embedding_size: int
    linear_layers: list[int]
    activation_function: str = "gelu"

class NeuralCollaborativeFilter(nn.Module):
    def __init__(
        self,
        config: NeuralCollaborativeFilterConfig,
        comments_df: pd.DataFrame,
    ) -> None:
        super().__init__()
        self.user_to_ord = {
            name: index
            for index, name in enumerate(
                sorted(comments_df.author_name.unique())
            )
        }
        self.product_to_ord = {
            product: index
            for index, product in enumerate(
                sorted(comments_df.object_id.unique())
            )
        }
        self.user_embedding = nn.Embedding(
            num_embeddings=len(self.user_to_ord),
            embedding_dim=config.embedding_size,
        )
        self.product_embedding = nn.Embedding(
            num_embeddings=len(self.product_to_ord),
            embedding_dim=config.embedding_size,
        )
        linear = [
            nn.Linear(
                in_features=config.embedding_size,
                out_features=config.linear_layers[0],
            )
        ]
        activation_function = activation_name_to_function[config.activation_function]
        for in_features, out_features in zip(config.linear_layers, config.linear_layers[1:]):
            linear.extend([
                activation_function(),
                nn.Linear(
                    in_features=in_features,
                    out_features=out_features,
                ),
            ])
        self.linear = nn.Sequential(*linear)

    @property
    def device(self) -> torch.device:
        return self.user_embedding.weight.device

    def predict(self, user: str, product: int) -> torch.Tensor:
        user_tensor = torch.tensor([self.user_to_ord[user]], device=self.device)
        product_tensor = torch.tensor([self.product_to_ord[product]], device=self.device)
        return self.forward(users=user_tensor, products=product_tensor)[0]

    def forward(self, users: torch.IntTensor, products: torch.IntTensor) -> torch.Tensor:
        user_embedding = self.user_embedding(users)
        product_embedding = self.product_embedding(products)
        hadamard = torch.mul(user_embedding, product_embedding)
        return self.linear(hadamard)
Code
import numpy as np

def regression_value(df: pd.DataFrame) -> np.ndarray:
    """
    Turns the positive/neutral/negative scores into a single value.

    The range is split as follows:

        -1 ... negative ... -0.33 ... neutral ... 0.33 ... positive ... 1

    This gives each class a range of 0.66.

    This is a regression task so the exact value is calculated as follows:

    * the majority class has a score between 0 and 1
    * the majority class must have a score of at least one third,
      or it would not be the majority
    * we take the score of the larger of positive and negative and
      place it on the -negative / positive line

    E.g.
        positive: 0.4, neutral: 0.3, negative: 0.3 -> result: 0.4
        positive: 0.3, neutral: 0.3, negative: 0.4 -> result: -0.4

    I think this is simple enough for now.
    """
    values = df[["negative", "positive"]].to_numpy()
    index = values.argmax(axis=1)
    # map column index {0, 1} to sign {-1, +1}
    sign = (index - 0.5) * 2
    # select the maximum value per row
    values = values[np.arange(values.shape[0]), index]
    return values * sign
Code
import pandas as pd
comments_df = pd.read_parquet("/data/perfumes/processed/comments.gz.parquet")
comments_df["target"] = regression_value(comments_df)
comments_df["target"].plot.hist(bins=20)
<Axes: ylabel='Frequency'>
This doesn't seem well balanced. Is it consistent with the original data?
What I need to do is compare the ratio of the different class ranges to the actual distribution of comment sentiment. I can calculate the ratio of class ranges by thresholding the target values.
Code
regression_distribution = pd.DataFrame([
    {"label": "negative", "count": (comments_df.target <= -1 / 3).sum()},
    {"label": "neutral", "count": ((comments_df.target > -1 / 3) & (comments_df.target < 1 / 3)).sum()},
    {"label": "positive", "count": (comments_df.target >= 1 / 3).sum()},
]).set_index("label")
regression_distribution / len(comments_df)
             count
label
negative  0.380840
neutral   0.074362
positive  0.544798
And I can compare this to the majority class from the sentiment analysis.
Code
true_distribution = (
    comments_df[["negative", "neutral", "positive"]]
    .T
    .idxmax()
    .value_counts()
    .loc[["negative", "neutral", "positive"]]
    .to_frame()
)
true_distribution / len(comments_df)
             count
negative  0.356156
neutral   0.123791
positive  0.520053
With these we can calculate the relative difference between the classes.
Code
(regression_distribution - true_distribution) / len(comments_df)
             count
label
negative  0.024685
neutral  -0.049429
positive  0.024744
As you can see, this regression approach has reduced neutral and boosted negative and positive approximately evenly. Positive is the majority class overall, so negative has seen a slight increase in relative frequency. This doesn't seem terrible though.
Code
from typing import TypedDict

from tqdm.auto import tqdm
from torch.utils.data import Dataset, DataLoader
import torch
import pandas as pd

class Entry(TypedDict):
    user: int
    product: int
    target: float

class DataframeDataset(Dataset):
    def __init__(self, df: pd.DataFrame, model: NeuralCollaborativeFilter) -> None:
        self.df = df
        self.model = model

    def __len__(self) -> int:
        return len(self.df)

    def __getitem__(self, index) -> Entry:
        row = self.df.iloc[index]
        return {
            "user": self.model.user_to_ord[row.author_name],
            "product": self.model.product_to_ord[row.object_id],
            "target": row.target,
        }

def train(
    model: NeuralCollaborativeFilter,
    comments_df: pd.DataFrame,
    epochs: float,
    batch_size: int,
    learning_rate: float = 1e-2,
) -> None:
    steps_per_epoch = len(comments_df) // batch_size
    max_steps = max(int(steps_per_epoch * epochs), 1)
    step = 0
    dl = DataLoader(
        dataset=DataframeDataset(df=comments_df, model=model),
        batch_size=batch_size,
        shuffle=True,
    )
    model.cuda()
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    with tqdm(total=max_steps) as progress:
        while step < max_steps:
            total_loss = 0.0
            count = 0
            for batch in dl:
                users = batch["user"].to(model.device)
                products = batch["product"].to(model.device)
                targets = batch["target"].to(model.device)
                optimizer.zero_grad()
                output = model(users=users, products=products)
                # flatten the (batch, 1) head output so it does not
                # broadcast against the (batch,) targets
                loss = (output.squeeze(-1) - targets) ** 2
                loss = loss.mean()
                loss.backward()
                optimizer.step()
                total_loss += loss.item() * users.shape[0]
                count += users.shape[0]
                step += 1
                progress.update(n=1)
                if step >= max_steps:
                    break
            print(f"loss: {total_loss / count:0.4f}")
Code
model = NeuralCollaborativeFilter(
    config=NeuralCollaborativeFilterConfig(embedding_size=10, linear_layers=[5, 1]),
    comments_df=comments_df,
)
Code
train(model=model, comments_df=comments_df, epochs=5.0, batch_size=1024)
loss: 0.3669
loss: 0.3594
loss: 0.3589
loss: 0.3586
loss: 0.3585
Code
model
NeuralCollaborativeFilter(
(user_embedding): Embedding(95239, 10)
(product_embedding): Embedding(3301, 10)
(linear): Sequential(
(0): Linear(in_features=10, out_features=5, bias=True)
(1): GELU(approximate=none)
(2): Linear(in_features=5, out_features=1, bias=True)
)
)
Code
import torch
import pandas as pd

@torch.inference_mode()
def predict(row: pd.Series) -> dict:
    output = model.predict(user=row.author_name, product=row.object_id)
    return {
        "user": row.author_name,
        "product": row.object_id,
        "target": row.target,
        "actual": output.item(),
    }

prediction_df = pd.DataFrame([
    predict(row)
    for row in comments_df.sample(n=10).iloc
])
display(prediction_df)
             user  product    target    actual
0        Arapaima    30164  0.416576  0.172515
1        MrBolton    11649  0.384884  0.051200
2           Roge'    29223 -0.395069  0.150277
3    muncierobson    28246  0.532335  0.203679
4     AuthenticAF      153  0.811077  0.157944
5        zaramona      839  0.408400  0.112785
6            kbot     1833 -0.938313  0.153833
7         dlnkmch    44035 -0.638388  0.169346
8         Yiorgos      143  0.923327  0.176568
9  Firas Mohammad     1834  0.544264  0.143746
This is not very good: the predictions all sit in a narrow band around 0.15, regardless of the target.
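To put a rough number on that, a quick hypothetical check against a constant baseline on this same sample:
Code
from sklearn.metrics import mean_absolute_error

# compare the model's error against always predicting the sample mean
print(mean_absolute_error(prediction_df.target, prediction_df.actual))
print(mean_absolute_error(
    prediction_df.target,
    [prediction_df.target.mean()] * len(prediction_df),
))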
Code
model = NeuralCollaborativeFilter(
    config=NeuralCollaborativeFilterConfig(
        embedding_size=50,
        linear_layers=[50, 1],
        activation_function="tanh",
    ),
    comments_df=comments_df,
)
train(model=model, comments_df=comments_df, epochs=5.0, batch_size=64)
loss: 0.3602
loss: 0.3583
loss: 0.3582
loss: 0.3581
loss: 0.3581
Code
import torch
import pandas as pd

@torch.inference_mode()
def predict(row: pd.Series) -> dict:
    output = model.predict(user=row.author_name, product=row.object_id)
    return {
        "user": row.author_name,
        "product": row.object_id,
        "target": row.target,
        "actual": output.item(),
    }

prediction_df = pd.DataFrame([
    predict(row)
    for row in comments_df.sample(n=10).iloc
])
display(prediction_df)
           user  product    target    actual
0      Lsquared    51694 -0.903351  0.130655
1   dolcethadon     3169  0.123378  0.136765
2         Nuppu     4375 -0.942973  0.142797
3  gennarowilde    28246  0.896267  0.143619
4   Eveningrose    31666 -0.918491  0.131670
5     Lightning      705  0.574902  0.141479
6      Lerochek    33519 -0.688771  0.138597
7         Sarah    14982  0.768827  0.139419
8  DarlingNikki      707 -0.782034  0.135745
9       Madyana    18590  0.404112  0.150562