Cross Language Prompt Internalization - Article Feature Clustering

Clustering the Wikipedia linked article features for Word Sense Induction
prompt internalization
multilingual prompt internalization
cross language word sense induction
Published

July 19, 2022

I have calculated features using the XLM-RoBERTa base model for the different links on Wikipedia. If the student model is to be used for word sense induction then these features should cluster. A lack of clustering would show that this approach is not viable.

If clustering is viable the next step would be to work out an efficient way to test whether new points fall within a cluster, as well as to calculate an approximate minimum number of features required. The Wikipedia processing was only about 2% complete and I estimated the full run at around 20 days, so a way to reduce the work required would be great.

Load Features

The features have been written to a set of files, each of which contains 1,000 articles. Each article has an average of 30 links in it. Only 201 files were processed; however, that still gives slightly over 6 million features.
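
A quick sanity check on those numbers (a sketch; the 2% figure is the rough estimate mentioned above):

Code
files_processed = 201
articles_per_file = 1_000
links_per_article = 30  # average

features_so_far = files_processed * articles_per_file * links_per_article
print(f"{features_so_far:,} features so far")  # slightly over 6 million

# if 201 files is roughly 2% of the dump, the full run would be around
total_files = files_processed / 0.02
print(f"~{total_files:,.0f} files, ~{total_files * articles_per_file * links_per_article:,.0f} features in total")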

These features are not evenly distributed among the different linked articles, so some of them will have a lot of features and some will have very few or none.

Code
from pathlib import Path
import pandas as pd

DATA_FOLDER = Path("/data/prompt-internalization/multilingual/wikipedia/enwiki/20220701/")

synonyms_df = pd.read_parquet(DATA_FOLDER / "synonyms.gz.parquet")
features_df = pd.concat([
    pd.read_parquet(file)
    for file in sorted((DATA_FOLDER / "features").glob("*.gz.parquet"))
])
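
The unevenness can be quantified quickly from the target column (a sketch):

Code
# how many features each linked article received
features_df.target.value_counts().describe()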
Code
top_10_features = features_df.target.value_counts()[:10].index
top_10_features
Index(['united states', 'world war ii', 'new york city', 'france', 'england',
       'democratic party (united states)', 'united kingdom', 'london', 'india',
       'world war i'],
      dtype='object')
Code
top_10_rows = features_df[features_df.target.isin(top_10_features)]
top_10_rows = top_10_rows.copy()
top_10_rows["target"] = top_10_rows.target.astype("category")
top_10_rows = top_10_rows.reset_index(drop=True)
top_10_rows
target probability index
0 france [0.67140967, 0.21350534, 0.032488544, 0.031876... [74831, 6557, 90788, 119862, 83658, 16843, 825...
1 united states [0.6089455, 0.16102183, 0.053454213, 0.0463878... [74831, 90788, 119862, 6557, 79200, 22836, 596...
2 world war ii [0.20548585, 0.07137032, 0.066305555, 0.059100... [48962, 42552, 72944, 65786, 5550, 60457, 7064...
3 france [0.91993374, 0.043690596, 0.0077915695, 0.0053... [74831, 6557, 83658, 90788, 119862, 16843, 228...
4 united states [0.62910575, 0.114415355, 0.045016088, 0.01939... [74831, 90788, 6557, 22836, 98148, 83658, 1198...
... ... ... ...
39958 new york city [0.6512393, 0.1441473, 0.04517786, 0.028335713... [90788, 74831, 6406, 79200, 6557, 16843, 49990...
39959 united states [0.63683414, 0.089536354, 0.066114485, 0.03989... [74831, 119862, 90788, 6557, 22836, 60457, 836...
39960 united states [0.58584386, 0.1483111, 0.12411963, 0.02704905... [74831, 90788, 6557, 119862, 82581, 22836, 168...
39961 new york city [0.71458614, 0.06404259, 0.053600095, 0.039861... [90788, 74831, 79200, 6406, 16843, 49990, 9814...
39962 england [0.86271405, 0.06343852, 0.024254996, 0.014000... [74831, 6557, 90788, 83658, 119862, 82581, 168...

39963 rows × 3 columns

We have almost 40,000 features for the top 10 articles. These articles are convenient as they fall into several distinct themes, so we should expect articles like united states and england to be near each other, and united states and democratic party (united states) to be well separated.

To work with this I need to create a matrix that approximates what the model produced. Each feature is the top 100 tokens, split between token index and probability. The full matrix would be large though - the model has a vocabulary of 250,002 tokens - which would cause memory issues. Restricting the matrix to the tokens that are actually non-zero would be much better.
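
To get a feel for the saving, a rough memory estimate for a dense float64 matrix (a sketch; the 2,192 distinct token count is calculated further below):

Code
rows = 39_963          # links for the top 10 articles
full_vocab = 250_002   # XLM-RoBERTa vocabulary size
observed = 2_192       # distinct tokens that actually appear (calculated below)
bytes_per_float = 8

print(f"full vocabulary: {rows * full_vocab * bytes_per_float / 2**30:.1f} GiB")
print(f"observed tokens: {rows * observed * bytes_per_float / 2**20:.0f} MiB")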

Code
import pandas as pd
import numpy as np

def to_probability_matrix(df: pd.DataFrame) -> np.ndarray:
    # map each token id that actually occurs to a dense column index
    unique_tokens = sorted(df["index"].explode().unique())
    token_map = {
        token_index: index
        for index, token_index in enumerate(unique_tokens)
    }
    rows = len(df)
    token_count = len(unique_tokens)

    def map_tokens(tokens: list[int]) -> list[int]:
        return list(map(token_map.get, tokens))

    # scatter each row's 100 probabilities into its dense columns using flat indexing
    token_index = np.array(df["index"].apply(map_tokens).tolist())
    probability = np.zeros((rows, token_count))
    probability.flat[
        token_index + (token_count * np.arange(rows)[:, None])
    ] = np.array(df.probability.tolist())

    return probability
Code
%%time

probability = to_probability_matrix(top_10_rows)

sums = probability.sum(axis=1)
print(sums.min(), sums.max(), sums.mean())
del sums
print(probability.shape)
print()
0.9999997086397343 1.0000002902579581 1.0000000003762612
(39963, 2192)

CPU times: user 923 ms, sys: 148 ms, total: 1.07 s
Wall time: 1.07 s

There are 10 separate articles and 100 tokens per link. If each article were always described in an identical way, and every different article were completely different, there would be 1,000 distinct tokens. We can see that there are 2,192 distinct tokens, so this ideal state has not been achieved.
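
The distinct token count can also be read straight off the link data, without building the matrix (a sketch):

Code
# distinct token ids across all of the top 10 link features
top_10_rows["index"].explode().nunique()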

It should be possible to review how well the tokens overlap between the different labels.

Code
import pandas as pd

def get_iou_crosstab(df: pd.DataFrame) -> pd.DataFrame:
    labels = sorted(df.target.unique())
    # one set of token ids per label
    token_sets = (
        df[["target", "index"]]
            .explode("index")
            .groupby("target")
            .agg(set)["index"]
    )

    intersection = {
        (left, right): len(token_sets[left] & token_sets[right])
        for left in labels
        for right in labels
    }
    union = {
        (left, right): len(token_sets[left] | token_sets[right])
        for left in labels
        for right in labels
    }
    return pd.DataFrame([
        {other: intersection[(label, other)] / union[(label, other)] for other in labels}
        for label in labels
    ], index=labels)
Code
iou_crosstab = get_iou_crosstab(top_10_rows)
iou_crosstab.style.highlight_quantile(q_left=0.5, axis=None, color='lightgreen')
  democratic party (united states) england france india london new york city united kingdom united states world war i world war ii
democratic party (united states) 1.000000 0.334888 0.308453 0.354062 0.273659 0.320273 0.355752 0.346124 0.311338 0.271583
england 0.334888 1.000000 0.613659 0.608515 0.573287 0.511668 0.598338 0.537795 0.355219 0.306073
france 0.308453 0.613659 1.000000 0.592105 0.459889 0.441463 0.568365 0.489827 0.314562 0.276911
india 0.354062 0.608515 0.592105 1.000000 0.477408 0.481350 0.589595 0.531915 0.360320 0.304940
london 0.273659 0.573287 0.459889 0.477408 1.000000 0.598842 0.475347 0.462697 0.337808 0.280827
new york city 0.320273 0.511668 0.441463 0.481350 0.598842 1.000000 0.513912 0.490647 0.414894 0.339520
united kingdom 0.355752 0.598338 0.568365 0.589595 0.475347 0.513912 1.000000 0.589783 0.404762 0.350670
united states 0.346124 0.537795 0.489827 0.531915 0.462697 0.490647 0.589783 1.000000 0.359353 0.299329
world war i 0.311338 0.355219 0.314562 0.360320 0.337808 0.414894 0.404762 0.359353 1.000000 0.640577
world war ii 0.271583 0.306073 0.276911 0.304940 0.280827 0.339520 0.350670 0.299329 0.640577 1.000000

This crosstab is pretty encouraging. With the threshold of 0.5 it shows that:

  • the world war articles overlap only with each other
  • the political party doesn’t overlap with anything else
  • most of the countries overlap with each other and the cities

This is a really rough way to cluster this data. Let’s see the results of a more structured analysis.

2d Projection

The first thing to try is projecting this data into 2 dimensions so we can visualize it. I’m going to use my two favourite approaches, PCA and T-SNE.

Code
from sklearn.decomposition import PCA

pca_output = PCA(
    n_components=2,
    random_state=0,
).fit_transform(probability)
pca_df = pd.DataFrame(pca_output)
pca_df["target"] = top_10_rows.target.values

pca_df.plot.scatter(x=0, y=1, s=0.1) ; None

Code
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

n_components = 2

# initialize from PCA explicitly to avoid the FutureWarning about T-SNE's default init changing
tsne_init = (
    PCA(
        n_components=n_components,
        svd_solver="randomized",
        random_state=0,
    )
        .fit_transform(probability)
        .astype(np.float32, copy=False)
)

tsne_output = TSNE(
    n_components=n_components,
    learning_rate="auto",
    init=tsne_init,
).fit_transform(probability)
tsne_df = pd.DataFrame(tsne_output)
tsne_df["target"] = top_10_rows.target.values

tsne_df.plot.scatter(x=0, y=1, s=0.1) ; None

With PCA the data looks like it forms a triangle, while T-SNE produces four clearly separated islands. This is interesting as it shows both clear structure and fewer clusters than labels (remember these are the top 10 labels).

Let’s try coloring this data by the label.

Code
from typing import List, Optional
import matplotlib.pyplot as plt
import pandas as pd

def plot_2d(df: pd.DataFrame, targets: Optional[List[str]] = None) -> None:
    fig, ax = plt.subplots(figsize=(12, 9))

    for group_name, group_idx in df.groupby("target").groups.items():
        if targets and group_name not in targets:
            continue
        ax.scatter(
            *df.iloc[group_idx, [0, 1]].T.values,
            s=0.1,
            label=group_name
        )
    ax.legend(markerscale=10)
Code
plot_2d(pca_df)

Code
plot_2d(tsne_df)

The clusters do have mostly consistent color. I’m pleased that the democratic party is a separate isolated cluster which has very little overlap with the other labels. The T-SNE projection suggests that there are four distinct clusters in this data, being:

  • The democratic party (bottom center)
  • The world wars (left)
  • London and New York (top)
  • The countries (right)

This is encouraging as it suggests that the cities can be cleanly separated from the countries.

PCA looks like it extends in another dimension as the shape reminds me of an old origami fold that I was fond of:

hyperbolic plane

I wonder if the labels separate more clearly in 3d.

3d Projection

PCA and T-SNE are not restricted to projecting to 2 dimensions. Given that the PCA projection in particular looks like it has a significant third dimension I’m interested in visualizing it.

Code
from sklearn.decomposition import PCA

# fit PCA on the transpose and use the components as 3d coordinates for each row
pca = PCA(n_components=3, random_state=0)
pca.fit(probability.T)
pca_3d_df = pd.DataFrame(pca.components_.T)
pca_3d_df["target"] = top_10_rows["target"].values
Code
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

n_components = 3

# initialize from PCA explicitly to avoid the FutureWarning about T-SNE's default init changing
tsne_init = (
    PCA(
        n_components=n_components,
        svd_solver="randomized",
        random_state=0,
    )
        .fit_transform(probability)
        .astype(np.float32, copy=False)
)

tsne_output = TSNE(
    n_components=n_components,
    learning_rate="auto",
    init=tsne_init,
).fit_transform(probability)
tsne_3d_df = pd.DataFrame(tsne_output)
tsne_3d_df["target"] = top_10_rows.target.values
Code
from typing import List, Optional
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd

def plot_3d(df: pd.DataFrame, targets: Optional[List[str]] = None) -> None:
    fig = plt.figure(figsize=(12, 9))
    ax = Axes3D(fig, auto_add_to_figure=False)
    fig.add_axes(ax)

    for group_name, group_idx in df.groupby("target").groups.items():
        if targets and group_name not in targets:
            continue
        ax.scatter(
            *df.iloc[group_idx, [0, 1, 2]].T.values,
            s=0.1,
            label=group_name
        )
    ax.legend(markerscale=10)
Code
plot_3d(pca_3d_df)

Projecting this into 3d shows something close to the shape I suspected. It’s almost a shame that I can’t rotate that projection a little to see it better in this static render.
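
One small workaround, assuming the same matplotlib setup as plot_3d, is to set the camera angle explicitly with view_init before rendering (a sketch; the elevation and azimuth values are arbitrary):

Code
import matplotlib.pyplot as plt

# same layout as plot_3d, but with an explicit camera angle
fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(projection="3d")
ax.view_init(elev=20, azim=120)  # arbitrary viewpoint, worth experimenting with

for group_name, group_idx in pca_3d_df.groupby("target").groups.items():
    ax.scatter(*pca_3d_df.iloc[group_idx, [0, 1, 2]].T.values, s=0.1, label=group_name)
ax.legend(markerscale=10)

We can now try separating out the different labels into related groups to see if they cluster near each other.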

Code
plot_3d(pca_3d_df, ["world war i", "world war ii"])

Code
plot_3d(pca_3d_df, ["united kingdom", "united states", "england", "france", "india"])

Code
plot_3d(pca_3d_df, ["london", "new york city"])

Code
plot_3d(pca_3d_df, ["democratic party (united states)", "united states"])

The 3d projection helps a lot. The separation between the different kinds of thing (political party / world war / city / country) seems pretty good. The countries in particular are blended together, so it may well be that they are quite similar and that distinguishing between them will be tricky.

We can repeat this with T-SNE; however, that projection already looks like a complex structure, so I am not sure how much it will help.

Code
plot_3d(tsne_3d_df)

Code
plot_3d(tsne_3d_df, ["world war i", "world war ii"])

Code
plot_3d(tsne_3d_df, ["united kingdom", "united states", "england", "france", "india"])

Code
plot_3d(tsne_3d_df, ["london", "new york city"])

Code
plot_3d(tsne_3d_df, ["democratic party (united states)", "united states"])

The T-SNE projection might look pretty but the clarity of the separate clusters is somewhat lacking. I’m satisfied with the PCA projection results though.

Clustering New Points

Having these clusters is nice. What I need is to define them in a way that lets me test whether new points lie within them.

Given that the labels do not form nice spherical shapes I do not think that KMeans would be appropriate. Something that can track the shape more accurately could be better.

I like using (H)DBSCAN as it can track arbitrary shapes. Working with that over a large number of labels could be tricky so I may need to optimize this in future.

The biggest problem with using DBSCAN is coming up with appropriate values for eps and min_samples.

Code
probability_df = pd.DataFrame(
    {"probability": list(probability)},
    index=top_10_rows.index
)
probability_df["target"] = top_10_rows.target.values
Code
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
import numpy as np

def dbscan_label(df: pd.DataFrame, min_samples_ratio: float) -> tuple[np.ndarray, int]:
    dbscan, most_common_label = run_dbscan(
        df=df,
        min_samples_ratio=min_samples_ratio,
    )
    return dbscan.labels_, most_common_label

def run_dbscan(df: pd.DataFrame, min_samples_ratio: float) -> tuple[DBSCAN, int]:
    probability = np.array(df.probability.tolist())
    min_samples = max(1, int(len(df) * min_samples_ratio))

    neighbors = NearestNeighbors(n_neighbors=min_samples)
    neighbors.fit(probability)

    neighbor_distance, _ = neighbors.kneighbors(X=probability, n_neighbors=min_samples)
    # eps is the 75% quantile of the distance needed to reach the min_samples-th neighbour
    eps = np.quantile(neighbor_distance[:, -1], 0.75)

    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    dbscan.fit(probability)

    # the largest non-noise cluster is treated as the primary cluster for the label
    labels = pd.Series(dbscan.labels_)
    labels = labels[labels != -1]
    most_common_label = labels.value_counts().index[0]

    return dbscan, most_common_label
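
Before plotting everything, a quick look at what the heuristic actually picks for a single label (a sketch; france is just an arbitrary example):

Code
# inspect the derived eps and the resulting cluster sizes for one label
dbscan, primary_cluster = run_dbscan(
    df=probability_df[probability_df.target == "france"],
    min_samples_ratio=0.1,
)
print(f"eps: {dbscan.eps:0.3f}, primary cluster: {primary_cluster}")
print(pd.Series(dbscan.labels_).value_counts())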
Code
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
import numpy as np
import pandas as pd


def plot_dbscan_by_label(
    df: pd.DataFrame,
    pca_df: pd.DataFrame,
    tsne_df: pd.DataFrame,
    min_samples_ratio: float,
) -> None:
    labels = sorted(df.target.unique())
    rows = len(labels)

    fig, axes = plt.subplots(rows, 2, figsize=(10, 5 * rows))
    for label, (ax_pca, ax_tsne) in zip(labels, axes):
        label_df = df[df.target == label].copy()
        cluster_labels, primary_cluster = dbscan_label(
            df=label_df,
            min_samples_ratio=min_samples_ratio,
        )
        ratio = (np.array(cluster_labels) == primary_cluster).sum() / len(label_df)

        probability_cluster_df = pca_df[pca_df.target == label].copy()
        probability_cluster_df["color"] = cluster_labels
        probability_cluster_df["color"] = probability_cluster_df.color == primary_cluster
        probability_cluster_df["color"] = probability_cluster_df["color"].map({True: "green", False: "grey"})
        probability_cluster_df.plot.scatter(
            x=0,
            y=1,
            c="color",
            s=0.1,
            ax=ax_pca,
            legend=False,
            title=f"{label} PCA ({ratio:0.3f})",
            colorbar=False,
            xlabel=None,
            ylabel=None,
        )

        probability_cluster_df = tsne_df[tsne_df.target == label].copy()
        probability_cluster_df["color"] = cluster_labels
        probability_cluster_df["color"] = probability_cluster_df.color == primary_cluster
        probability_cluster_df["color"] = probability_cluster_df["color"].map({True: "green", False: "grey"})
        probability_cluster_df.plot.scatter(
            x=0,
            y=1,
            c="color",
            s=0.1,
            colormap="viridis",
            ax=ax_tsne,
            legend=False,
            title=f"{label} TSNE ({ratio:0.3f})",
            colorbar=False,
            xlabel=None,
            ylabel=None,
        )

    plt.tight_layout()  # Optional ... often improves the layout 
Code
plot_dbscan_by_label(
    df=probability_df,
    pca_df=pca_df,
    tsne_df=tsne_df,
    min_samples_ratio=0.1
)

Code
plot_dbscan_by_label(df=probability_df, pca_df=pca_df, tsne_df=tsne_df, min_samples_ratio=0.01)

This doesn’t seem terrible. What matters more is what the labelling would be for the wrong classes, i.e. how often a point is claimed by the clusters of other labels. As I see it DBSCAN is a very on/off classification, and I may need a more continuous metric to allow for dispute resolution.

Code
def dbscan_predict(model: DBSCAN, points: np.ndarray) -> np.ndarray:
    # assign each point to the cluster of its nearest core sample,
    # or -1 if that core sample is further away than eps
    point_count = points.shape[0]
    result = np.ones(shape=point_count, dtype=int) * -1
    for i in range(point_count):
        difference = model.components_ - points[i, :]
        distance = np.linalg.norm(difference, axis=1)
        closest_idx = np.argmin(distance)
        if distance[closest_idx] < model.eps:
            result[i] = model.labels_[model.core_sample_indices_[closest_idx]]
    return result
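
For the more continuous metric mentioned above, one option would be the distance to the nearest core sample scaled by eps, so that values below 1 correspond to points dbscan_predict would assign to a cluster and larger values measure how far outside a point sits (a sketch, not used in the rest of this post):

Code
def dbscan_score(model: DBSCAN, points: np.ndarray) -> np.ndarray:
    # distance to the nearest core sample, scaled by eps:
    # below 1 means dbscan_predict would place the point in a cluster,
    # larger values mean the point lies further outside it
    scores = np.empty(points.shape[0])
    for i in range(points.shape[0]):
        distance = np.linalg.norm(model.components_ - points[i, :], axis=1)
        scores[i] = distance.min() / model.eps
    return scores

This keeps the nearest-core-sample logic of dbscan_predict but produces a score that could be used to break ties when several labels claim the same point.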
Code
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
import pandas as pd
from tqdm.auto import tqdm

def dbscan_iou_crosstab(df: pd.DataFrame, min_samples_ratio: float) -> pd.DataFrame:
    # for each label, fit DBSCAN on that label's rows, then measure what
    # fraction of every target's rows falls inside the primary cluster
    labels = sorted(df.target.unique())
    probability = np.array(df.probability.tolist())

    classifiers: dict[str, tuple[DBSCAN, int]] = {
        label: run_dbscan(
            df=df[df.target == label],
            min_samples_ratio=min_samples_ratio,
        )
        for label in labels
    }

    def get_predictions(model: DBSCAN, index: int) -> pd.Series:
        cluster_indices = dbscan_predict(model, probability)
        predictions = cluster_indices == index
        return (
            pd.DataFrame({"target": df.target, "label": predictions})
                .groupby("target")
                .agg("sum")
                .label
        )

    result = pd.DataFrame([
        get_predictions(*classifiers[label])
        for label in tqdm(labels)
    ], index=labels)
    rows_per_label = df.groupby("target").agg(len).probability
    return result.divide(rows_per_label, axis="columns")
Code
dbscan_crosstab_01 = dbscan_iou_crosstab(probability_df, min_samples_ratio=0.1)
dbscan_crosstab_01.style.highlight_between(left=0.1, axis=None, color='lightgreen')
target democratic party (united states) england france india london new york city united kingdom united states world war i world war ii
democratic party (united states) 0.950913 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
england 0.000000 0.883395 0.820830 0.657761 0.000646 0.000000 0.315442 0.484727 0.000000 0.000000
france 0.000000 0.848814 0.860951 0.670612 0.000000 0.000000 0.241379 0.241993 0.000000 0.000000
india 0.000000 0.953701 0.929651 0.910720 0.001291 0.000000 0.595502 0.690540 0.000000 0.000000
london 0.000000 0.005716 0.000000 0.004058 0.926081 0.897688 0.039880 0.066578 0.000000 0.000000
new york city 0.000000 0.000000 0.000000 0.000000 0.358618 0.895171 0.000300 0.000000 0.000000 0.000000
united kingdom 0.000000 0.975707 0.964001 0.967535 0.057456 0.000000 0.913343 0.946916 0.000000 0.000000
united states 0.000000 0.937696 0.884858 0.876902 0.041963 0.000000 0.709145 0.900208 0.000000 0.000000
world war i 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000300 0.000000 0.963760 0.933968
world war ii 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.840330 0.945915
Code
dbscan_crosstab_001 = dbscan_iou_crosstab(probability_df, min_samples_ratio=0.01)
dbscan_crosstab_001.style.highlight_between(left=0.1, axis=None, color='lightgreen')
target democratic party (united states) england france india london new york city united kingdom united states world war i world war ii
democratic party (united states) 0.906615 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
england 0.000000 0.844813 0.805441 0.533311 0.000000 0.000000 0.165517 0.228944 0.000000 0.000000
france 0.000000 0.709060 0.837318 0.510991 0.000000 0.000000 0.088756 0.049822 0.000000 0.000000
india 0.000000 0.909403 0.897499 0.860670 0.000000 0.000000 0.354723 0.358986 0.000000 0.000000
london 0.000000 0.001715 0.000000 0.001691 0.856359 0.722820 0.010495 0.027432 0.000000 0.000000
new york city 0.000000 0.000000 0.000000 0.000000 0.344416 0.845502 0.000000 0.000000 0.000000 0.000000
united kingdom 0.000000 0.957702 0.944490 0.923909 0.023564 0.000000 0.851574 0.883452 0.000000 0.000000
united states 0.000000 0.903973 0.820280 0.778154 0.017753 0.000000 0.587406 0.851868 0.000000 0.000000
world war i 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000300 0.000000 0.921780 0.808847
world war ii 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.753140 0.907491

Using DBSCAN for the clustering shows much stronger separation: the cities and the countries no longer claim each other’s points. Just from clustering this seems good enough, as the actual text that was linked can be used as a further signal to resolve the remaining intermingled clusters into specific articles.
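
As a sketch of that resolution step (hypothetical: the synonyms_df column names "synonym" and "target" are assumptions, and classifiers stands for the label to (DBSCAN, primary cluster) mapping built inside dbscan_iou_crosstab):

Code
def resolve_link(text: str, point: np.ndarray) -> str | None:
    # restrict the cluster tests to the articles this surface text can refer to
    candidates = synonyms_df[synonyms_df.synonym == text].target.unique()
    for label in candidates:
        if label not in classifiers:
            continue
        model, primary_cluster = classifiers[label]
        # accept the first candidate whose primary cluster claims the point
        if dbscan_predict(model, point[None, :])[0] == primary_cluster:
            return label
    return None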

Clustering with OPTICS

The sklearn documentation for DBSCAN also mentions OPTICS as an alternative that requires less configuration. Given that eps / min_samples are hard to choose, it might be worth trying.

Code
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
import numpy as np

labels = sorted(probability_df.target.unique())[:2]
rows = len(labels)

fig, axes = plt.subplots(rows, 2, figsize=(10, 5 * rows))
for label, (ax_pca, ax_tsne) in zip(labels, axes):
    labelled_probability = probability_df[probability_df.target == label]

    # OPTICS with default settings - no eps or min_samples to tune
    cluster_output = (
        OPTICS()
            .fit(np.array(labelled_probability.probability.tolist()))
    )

    probability_cluster_df = pca_df[pca_df.target == label].copy()
    probability_cluster_df["color"] = cluster_output.labels_
    probability_cluster_df = probability_cluster_df[probability_cluster_df.color != -1]
    probability_cluster_df.plot.scatter(
        x=0,
        y=1,
        c="color",
        s=0.1,
        colormap="viridis",
        ax=ax_pca,
        legend=False,
        title=f"{label} PCA",
        colorbar=False,
        xlabel=None,
        ylabel=None,
    )
    
    probability_cluster_df = tsne_df[tsne_df.target == label].copy()
    probability_cluster_df["color"] = cluster_output.labels_
    probability_cluster_df = probability_cluster_df[probability_cluster_df.color != -1]
    probability_cluster_df.plot.scatter(
        x=0,
        y=1,
        c="color",
        s=0.1,
        colormap="viridis",
        ax=ax_tsne,
        legend=False,
        title=f"{label} TSNE",
        colorbar=False,
        xlabel=None,
        ylabel=None,
    )

plt.tight_layout()  # Optional ... often improves the layout 
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.10/lib/python3.10/site-packages/sklearn/cluster/_optics.py:903: RuntimeWarning: divide by zero encountered in divide
  ratio = reachability_plot[:-1] / reachability_plot[1:]
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.10/lib/python3.10/site-packages/sklearn/cluster/_optics.py:903: RuntimeWarning: divide by zero encountered in divide
  ratio = reachability_plot[:-1] / reachability_plot[1:]

This is incredibly slow to run and it smashes the data into tiny clusters. Given the lack of configuration options it’s not really a viable approach.