Conversations with a Paper

Can I use an LLM to discuss a paper?
Published

March 27, 2024

Recently someone at work suggested using ChatGPT to summarize a paper. The idea is that it provides a summary and you can then have a conversation with ChatGPT to explore the paper further.

I quite like this idea as there are many papers that I want to read. An initial filter that tells me whether a paper is sufficiently interesting would help me focus my efforts.

To do this I am going to converse with a paper using Mistral and the PyMuPDF library (confusingly imported as fitz).

Dataset

I am going to test this on two papers. The first is the Attention Is All You Need paper that introduced transformers (Vaswani et al. 2017). Transformers have vastly improved language modelling. It would be interesting to see what the model can extract from this paper.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” https://arxiv.org/abs/1706.03762.
Kusupati, Aditya, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, et al. 2024. “Matryoshka Representation Learning.” https://arxiv.org/abs/2205.13147.

As a comparison I want to use the Matryoshka Representation Learning paper, which is concerned with more efficiently encoding embeddings (Kusupati et al. 2024). This is a much more recent paper, so it is far less likely to be well represented in the training data for the model. I also want to see if this paper is interesting enough to read in full.

Code
from pathlib import Path

ATTENTION_PAPER = Path("attention-is-all-you-need.pdf")
REPRESENTATION_PAPER = Path("matryoshka-representation-learning.pdf")

Model

I am, once again, going to use Mistral 7B Instruct. This has worked well when I have used it recently and supports conversational usage.

Code
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    load_in_4bit=True,
)
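
The bare load_in_4bit argument works here, but transformers also accepts an explicit BitsAndBytesConfig, which newer versions prefer. A sketch of the equivalent call (not run here):

Code
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)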

Chat Interface

I can quickly adjust the code from my previous work to make the chat interface. It doesn’t need to be particularly slick, but I can make it output markdown so that it renders nicely in this blog.

Since it might be difficult to distinguish between the chat and my commentary, I will add separators between them.

Code
from __future__ import annotations
from dataclasses import dataclass
from typing import Literal
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from IPython.display import Markdown

@dataclass
class Utterance:
    role: Literal["user", "assistant"]
    content: str

    def as_json(self) -> dict[str, str]:
        return {"role": self.role, "content": self.content}

    def as_markdown(self) -> str:
        return "\n".join([
            f"#### {self.role.capitalize()}",
            *self.content.splitlines(),
        ])

@dataclass
class Chat:
    utterances: list[Utterance]

    def as_json(self) -> list[dict[str, str]]:
        return list(map(Utterance.as_json, self.utterances))

    def as_markdown(self) -> Markdown:
        return Markdown("\n\n".join(map(Utterance.as_markdown, self.utterances)))

    def next(self, utterance: Utterance) -> Chat:
        return Chat(self.utterances + [utterance])

    def assistant(self, content: str) -> Chat:
        return self.next(Utterance(role="assistant", content=content))

    def user(self, content: str) -> Chat:
        return self.next(Utterance(role="user", content=content))

    def __getitem__(self, index: int | slice) -> Chat:
        item = self.utterances[index]
        if not isinstance(item, list):
            item = [item]
        return Chat(item)
    

@torch.inference_mode()
def generate_chat(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    chat: str | Chat,
    max_new_tokens: int = 100,
    do_sample: bool = False,
    **kwargs,
) -> Chat:
    if isinstance(chat, str):
        chat = Chat([Utterance(role="user", content=chat)])
    chat_input = tokenizer.apply_chat_template(
        chat.as_json(),
        return_tensors="pt",
        padding="longest",
    )
    chat_input = chat_input.to(model.device)
    generated_ids = model.generate(
        chat_input,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        pad_token_id=tokenizer.pad_token_id,
        **kwargs,
    )
    output = tokenizer.decode(
        generated_ids[0, chat_input.shape[1] :],
        skip_special_tokens=True,
    )
    output = output.strip()
    response = Utterance(role="assistant", content=output)
    return chat.next(response)
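
As a quick illustration of how these helpers compose, a chat is built up immutably and can be sliced to render only part of it; the chat[-2:] form is what I use later to show just the most recent exchange (the messages here are placeholders):

Code
demo = Chat([]).user("Hello").assistant("Hi, how can I help?")
demo[-1].as_markdown()   # renders only the assistant reply
demo[-2:].as_markdown()  # renders the last user/assistant pair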

We also need a starting prompt for the model:

Code
task = """
You are an expert in the field of deep learning.
I am going to provide you with the text of a
paper and I want you to summarize the paper.
We will then have a conversation about the paper.

You should only use details from the paper. If the
paper does not answer a question then you should
state that you do not know the answer.
"""

Now we can try it out and see if this is likely to work.

Code
chat = generate_chat(
    model=model,
    tokenizer=tokenizer,
    chat=task,
)

chat.as_markdown()

User

You are an expert in the field of deep learning. I am going to provide you with the text of a paper and I want you to summarize the paper. We will then have a conversation about the paper.

You should only use details from the paper. If the paper does not answer a question then you should state that you do not know the answer.

Assistant

Understood. I will summarize the paper based on the information provided and limit my responses to details from the text. If the paper does not address a specific question, I will indicate that I do not have an answer. Please provide the text of the paper for me to summarize.


This is a good start. I now need code to read the PDF and then provide its text as part of the prompt.

Code
from pathlib import Path
import fitz

def read_paper(file: Path) -> str:
    with fitz.open(file) as doc:
        return "\n\n".join(page.get_text() for page in doc)

Attention is all you need

How well can it summarize this paper?

Code
chat = (
    Chat([])
        .user(task)
        .assistant(
            "Understood. I will summarize the paper based "
            "on the information provided and limit my responses "
            "to details from the text. If the paper does not "
            "address a specific question, I will indicate that "
            "I do not have an answer. Please provide the text "
            "of the paper for me to summarize."
        )
)

attention_text = read_paper(ATTENTION_PAPER)
chat = generate_chat(
    model=model,
    tokenizer=tokenizer,
    chat=chat.user(attention_text),
    max_new_tokens=1_500,
)

chat[-1].as_markdown()

Assistant

The paper introduces the Transformer, a new sequence transduction model that replaces recurrent layers in encoder-decoder architectures with multi-headed self-attention. The Transformer achieves significant time savings for translation tasks, outperforming architectures based on recurrent or convolutional layers on both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks. The authors plan to apply the Transformer to other tasks, extend it to handle input and output modalities other than text, and investigate local, restricted attention mechanisms for large inputs and outputs. The code used to train and evaluate the models is available at https://github.com/tensorflow/tensor2tensor.

The Transformer is a neural network architecture for sequence transduction tasks, such as machine translation, that relies solely on attention mechanisms, replacing the recurrent layers commonly used in encoder-decoder architectures. The model achieves a new state-of-the-art in translation tasks, outperforming architectures based on recurrent or convolutional layers. The Transformer can be trained significantly faster than these architectures.

The paper discusses the motivation for using self-attention, comparing it to recurrent and convolutional layers in terms of computational complexity, parallelizability, and maximum path length between input and output positions. The authors also present the Transformer architecture, including the encoder and decoder stacks, self-attention layers, and position-wise feed-forward networks.

The Transformer uses multi-head attention in three different ways: encoder-decoder attention, self-attention layers in the encoder, and self-attention layers in the decoder. The model also includes position-wise feed-forward networks and positional encodings to make use of the order of the sequence.

The paper also discusses the training regime for the Transformer, including the training data and batching, hardware and schedule, optimizer, and regularization. The authors present results on the WMT 2014 English-to-German and English-to-French translation tasks, achieving new state-of-the-art BLEU scores while using fewer training resources than previous models. The authors also evaluate the importance of different components of the Transformer by varying the model in various ways and measuring the change in performance on English-to-German translation.

The Transformer also generalizes well to English constituency parsing, outperforming previously reported models in both WSJ only and semi-supervised settings. The authors present results in Table 4, which shows the Transformer’s performance compared to various parsing models.

The paper concludes by discussing the future of attention-based models and the plans to apply them to other tasks, extend the Transformer to handle input and output modalities other than text, and investigate local, restricted attention mechanisms for large inputs and outputs. The code used to train and evaluate the models is available at https://github.com/tensorflow/tensor2tensor.


This is a verbose response, but it does seem to be accurate. Let’s try asking it for further detail about the English constituency parsing evaluation.

Code
chat = generate_chat(
    model=model,
    tokenizer=tokenizer,
    chat=chat.user(
        "I'm not familiar with the English constituency "
        "parsing task that the transformer model was "
        "evaluated against. Can you tell me more about it?"
    ),
    max_new_tokens=1_500,
)

chat[-2:].as_markdown()

User

I’m not familiar with the English constituency parsing task that the transformer model was evaluated against. Can you tell me more about it?

Assistant

English constituency parsing is a task in natural language processing (NLP) that involves analyzing the structure of a given English sentence and identifying its constituents, or the groups of words that form meaningful units in the sentence. The goal is to build a model that can accurately identify the hierarchical structure of a sentence, which can be represented as a tree-like structure called a parse tree.

In the context of the paper you provided, the Transformer model was evaluated on English constituency parsing as a way to demonstrate its ability to generalize to other tasks beyond machine translation. The authors trained a 4-layer Transformer model on the Wall Street Journal (WSJ) portion of the Penn Treebank, which contains approximately 40,000 training sentences, and also in a semi-supervised setting using the larger high-confidence and BerkleyParser corpora.

The results showed that the Transformer model performed surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar in both the WSJ only and semi-supervised settings. The Transformer outperformed the Berkeley-Parser even when training only on the WSJ training set of 40,000 sentences.

English constituency parsing is an important task in NLP as it provides a deeper understanding of the structure of a sentence, which can be useful for various applications such as information extraction, question answering, and machine translation. The ability of the Transformer model to perform well on this task demonstrates its versatility and generalization capabilities.


I like this description as it clearly explains the task. What is interesting here is that the description in the paper (section 6.3) does not explain the task at this level of detail. It seems that the model has incorporated its existing knowledge into this answer.

Let’s see if the model thinks this paper is worth reading fully.

Code
chat = generate_chat(
    model=model,
    tokenizer=tokenizer,
    chat=chat.user(
        "Do you think this is a significant paper? "
        "Should I read it in more detail?"
    ),
    max_new_tokens=1_500,
)

chat[-2:].as_markdown()

User

Do you think this is a significant paper? Should I read it in more detail?

Assistant

Yes, the “Attention is All You Need” paper by Vaswani et al. is considered a significant paper in the field of deep learning and natural language processing. It introduced the Transformer model, which has since become a popular and influential architecture in the NLP community. The Transformer model dispenses with recurrence and convolutions entirely, relying solely on attention mechanisms, and it achieves state-of-the-art results on machine translation tasks while being more parallelizable and requiring significantly less time to train.

The paper also demonstrated the effectiveness of the Transformer model on other tasks, such as English constituency parsing, and showed that it generalizes well to other tasks beyond machine translation. The paper’s findings have led to a surge of interest in attention mechanisms and self-attention in the NLP community, and the Transformer model has become a popular choice for various NLP tasks.

If you are interested in deep learning and natural language processing, I would recommend reading the paper in more detail. It provides a clear and concise explanation of the Transformer model, its motivation, and its advantages over previous architectures. The paper also includes extensive experiments and results, which demonstrate the effectiveness of the Transformer model on various tasks. Overall, the paper is a seminal work in the field of deep learning and natural language processing, and it is worth your time to read and understand it in more detail.


Here it has clearly incorporated existing knowledge about the impact of this paper. While this historical information is handy, I want to be sure that this will work on new papers. Let’s try it out on the embedding paper.

Embedding Paper

This time the paper is less well known so it may be unfamiliar to the model. It was initially released in 2022 so it’s possible that this forms part of the training set. If that is a problem then I can try to find a more recent paper.

Code
chat = (
    Chat([])
        .user(task)
        .assistant(
            "Understood. I will summarize the paper based "
            "on the information provided and limit my responses "
            "to details from the text. If the paper does not "
            "address a specific question, I will indicate that "
            "I do not have an answer. Please provide the text "
            "of the paper for me to summarize."
        )
)

paper_text = read_paper(REPRESENTATION_PAPER)
chat = generate_chat(
    model=model,
    tokenizer=tokenizer,
    chat=chat.user(paper_text),
    max_new_tokens=1_500,
)

chat[-1].as_markdown()
OutOfMemoryError: CUDA out of memory. Tried to allocate 6.50 GiB. GPU 0 has a total capacity of 23.64 GiB of which 4.54 GiB is free. Including non-PyTorch memory, this process has 19.10 GiB memory in use. Of the allocated memory 15.35 GiB is allocated by PyTorch, and 3.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This is using a huge amount of memory. The paper is 35 pages long, and I suspect the number of tokens in the extracted text is simply too large.
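
A rough check of the prompt size before doing anything clever (the exact count depends on the tokenizer):

Code
paper_tokens = tokenizer(read_paper(REPRESENTATION_PAPER)).input_ids
print(f"{len(paper_tokens):,} tokens")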

I wonder if it would be possible to cut this down by stripping the references and everything after them (a shame, as that material contains interesting details of the studies). There are also images embedded in the extracted text, which show up as large <latexit> blocks.

Code
from pathlib import Path
import fitz
import re

IMAGE_PATTERN = re.compile(r'\<latexit[^>]*\>.*?\</latexit\>', flags=re.DOTALL)

def read_paper(file: Path) -> str:
    def _read() -> str:
        with fitz.open(file) as doc:
            return "\n\n".join(page.get_text() for page in doc)
    text = _read()
    text = IMAGE_PATTERN.sub("", text)
    lines = text.splitlines()
    lines = lines[:lines.index("References")]
    text = "\n".join(lines)
    return text
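
A quick check that the pattern does what I expect. The sample blob below is made up for illustration; the real ones in the extracted text are long base64 strings:

Code
sample = 'before <latexit sha1_base64="abc">AAAB9...</latexit> after'
print(IMAGE_PATTERN.sub("", sample))
# before  after
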
Code
chat = (
    Chat([])
        .user(task)
        .assistant(
            "Understood. I will summarize the paper based "
            "on the information provided and limit my responses "
            "to details from the text. If the paper does not "
            "address a specific question, I will indicate that "
            "I do not have an answer. Please provide the text "
            "of the paper for me to summarize."
        )
)

paper_text = read_paper(REPRESENTATION_PAPER)
chat = generate_chat(
    model=model,
    tokenizer=tokenizer,
    chat=chat.user(paper_text),
    max_new_tokens=1_500,
)

chat[-1].as_markdown()

Assistant

The paper introduces Matryoshka Representation Learning (MRL), a method for encoding information at different granularities within a single embedding vector. MRL aims to adapt to the statistical complexity and available compute resources of a downstream task. The paper demonstrates the effectiveness of MRL for large-scale adaptive classification and adaptive retrieval. MRL matches the accuracy of the fixed-feature baseline while using 14× smaller representation size on average. Additionally, an adaptive shortlisting and re-ranking system using MRL ensures comparable mAP@10 to the baseline while being 128× cheaper in FLOPs and 14× faster in wall-clock time. The paper also discusses related work, contributions, and applications of MRL. The paper is authored by researchers from the University of Washington, Google Research, and Harvard University.


This is a good start. I wonder if the shorter summary is because the model has less existing knowledge to use, or because of the truncation.

Code
chat = generate_chat(
    model=model,
    tokenizer=tokenizer,
    chat=chat.user(
        "How does it encode the information at different granularities?"
        "If you could write a simple program that approximates the process "
        "that would be helpful."
    ),
    max_new_tokens=1_500,
)

chat[-2:].as_markdown()

User

How does it encode the information at different granularities? If you could write a simple program that approximates the process that would be helpful.

Assistant

Matryoshka Representation Learning (MRL) encodes information at different granularities by learning a hierarchical representation, where each level of the hierarchy corresponds to a different granularity. This is achieved by optimizing the original loss function at multiple representation sizes, which are chosen based on a set of explicitly optimized nested dimensions.

The process of learning MRL involves the following steps:

  1. Choose a set of representation sizes, M, which contains less than log(d) elements.
  2. For each nested dimension m in M, optimize the multi-class classification loss using a separate linear classifier, W(m), and aggregate the losses after scaling with their relative importance.
  3. Solve the optimization problem using standard empirical risk minimization.
  4. Despite only optimizing for O(log(d)) nested dimensions, MRL results in accurate representations that interpolate for dimensions that fall between the chosen granularity of the representations.

Unfortunately, it’s not possible to write a simple program that approximates the entire process of learning MRL, as it involves complex optimization techniques and neural network architectures. However, the paper provides a detailed explanation of the process and includes code snippets and examples to help illustrate the concept.

If you’re interested in a simpler approximation of the process, you might consider using a dimensionality reduction technique like t-SNE or PCA to visualize the different granularities of information within a high-dimensional representation. These techniques can help you understand how the information is encoded at different scales, but they don’t capture the full flexibility and adaptability of MRL.


This is a good answer. I cannot confirm that it is accurate, but at least the model was able to admit that it couldn’t produce the code I asked for.
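
Since the model declined to write code, here is my own rough sketch of the nested-loss idea as described in the answer above. The embedding size, class count, granularities, and equal loss weights are placeholders I have chosen, not values from the paper:

Code
import torch
from torch import nn

# placeholder sizes, not taken from the paper
embedding_dim = 2048
num_classes = 1000
granularities = [8, 16, 32, 64, 128, 256, 512, 1024, 2048]  # roughly O(log d) nested sizes

# one linear classifier per nested dimension m, acting on the first m features
classifiers = nn.ModuleList([nn.Linear(m, num_classes) for m in granularities])
loss_fn = nn.CrossEntropyLoss()

def matryoshka_loss(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # optimize the same classification loss at every granularity and
    # aggregate the per-granularity losses (equal weights here for simplicity)
    losses = [
        loss_fn(classifier(embeddings[:, :m]), labels)
        for m, classifier in zip(granularities, classifiers)
    ]
    return torch.stack(losses).sum()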

Verification of this summary is very important though. Should I spend my time reading this paper?

Code
chat = generate_chat(
    model=model,
    tokenizer=tokenizer,
    chat=chat.user(
        "Do you think this is a significant paper? "
        "Should I read it in more detail?"
    ),
    max_new_tokens=1_500,
)

chat[-2:].as_markdown()

User

Do you think this is a significant paper? Should I read it in more detail?

Assistant

Yes, I believe the paper “Matryoshka Representation Learning” is significant as it introduces a novel approach to representation learning that enables a single embedding vector to adapt to multiple downstream tasks with varying computational resources. The paper demonstrates that MRL can lead to up to 14× smaller embedding sizes for image classification, up to 14× real-world speed-ups for large-scale retrieval, and up to 2% accuracy improvements for long-tail few-shot classification, all while being as robust as the original representations.

The paper also shows that MRL can be easily adapted to various representation learning frameworks, including masked language modeling, contrastive learning, and efficient neural networks. Furthermore, the paper provides extensive evaluations of MRL on various datasets and tasks, demonstrating its effectiveness and flexibility.

Given the significance of the paper and its potential impact on the field of representation learning, I would recommend reading it in more detail to gain a better understanding of the concept, the experimental results, and the potential applications of MRL. The paper is well-written and includes clear explanations, detailed experiments, and useful code and model examples.


That’s encouraging. When I get round to reading it I can check whether the detailed summary was correct.

Broadly this seems like a very viable way to read papers. I’m excited to use this in future.