Retrieval Augmented Generation over … this blog

Creating a Q&A bot for this blog
Published

February 8, 2025

I made this blog as a way to note down the tools and techniques that I use so that I could refer to them in future. I’ve also used it to present ideas to work colleagues as well as an easy way to investigate things that catch my eye. Quarto provides a search function for the website, but is it the best option?

Since Retrieval Augmented Generation (RAG) has emerged I feel that a better solution could be produced. Can I make a Q&A bot that can work over the content of the blog and provide answers with code?

This is a simple technique that is well explored by this point. My goal with this post is more to intentionally practice something simple to build up strength in this area (a “rep” if you will).

Dataset

This blog is a set of jupyter notebooks. There are also quite a few python files that get imported. The aim will be to index them.

Jupyter notebooks are json files so parsing them and extracting the sections will be required. The python files can be treated as entire documents, and it would be possible to use the ast module to break them down into logical parts.
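
As a quick sketch of that second idea (I don’t actually run this here), the ast module can pull the top-level functions and classes out of a python file. The logical_parts helper and the file name below are made up for illustration:

Code
import ast
from pathlib import Path

def logical_parts(path: Path) -> list[str]:
    """Return the source of every top-level function and class in a python file."""
    source = path.read_text()
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]

# each returned string could then be indexed as its own document
for part in logical_parts(Path("some_module.py")):
    print(part.splitlines()[0])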

We can start by inspecting the structure of this notebook itself. Let’s read the json of this file:

Code
from pathlib import Path
import json

THIS_NOTEBOOK = Path("rag-over-this-blog.ipynb")
data = json.loads(THIS_NOTEBOOK.read_text())

markdown_cells = [
    cell
    for cell in data["cells"]
    if cell["cell_type"] == "markdown"
]
markdown_cells[:3]
[{'cell_type': 'markdown',
  'id': '6613f22d-167b-419f-90eb-92d0806be1af',
  'metadata': {},
  'source': ['I made this blog as a way to note down the tools and techniques that I use so that I could refer to them in future.\n',
   "I've also used it to present ideas to work colleagues as well as an easy way to investigate things that catch my eye.\n",
   'Quarto provides a search function for the website, but is it the best option?\n',
   '\n',
   'Since Retrieval Augmented Generation (RAG) has emerged I feel that a better solution could be produced.\n',
   'Can I make a Q&A bot that can work over the content of the blog and provide answers with code?\n',
   '\n',
   'This is a simple technique that is well explored by this point.\n',
   'My goal with this post is more to intentionally practice something simple to build up strength in this area (a "rep" if you will).']},
 {'cell_type': 'markdown',
  'id': 'e50ab6ae-52d4-4f53-ad6b-259765b5a543',
  'metadata': {},
  'source': ['## Dataset\n',
   '\n',
   'This blog is a set of jupyter notebooks.\n',
   'There are also quite a few python files that get imported.\n',
   'The aim will be to index them.\n',
   '\n',
   'Jupyter notebooks are json files so parsing them and extracting the sections will be required.\n',
   'The python files can be treated as entire documents, and it would be possible to use the ast module to break them down into logical parts.']},
 {'cell_type': 'markdown',
  'id': '8b5fbdc0-b17b-4596-b987-b66000151f52',
  'metadata': {},
  'source': ['We can start by inspecting the structure of this notebook itself.\n',
   "Let's read the json of this file:"]}]

Here we can see the three markdown blocks that form the content of this blog post. The source of the cells contains the text of each part, so extracting it will be easy. The structure can even be made into a pydantic model, which would then allow a consistent extraction of the post metadata as well. The post metadata sits in a raw cell at the top of the blog post and contains the title, date and description.
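
That raw cell is Quarto front matter: yaml between --- markers. As a sketch, here is how parsing one could look; the raw_source below is my reconstruction of this post’s own front matter rather than the real cell:

Code
import yaml

# a guess at the raw cell for this post, line by line as it appears in the notebook json
raw_source = [
    "---\n",
    "title: Retrieval Augmented Generation over … this blog\n",
    "description: Creating a Q&A bot for this blog\n",
    "date: 2025-02-08\n",
    "---\n",
]

# drop the --- fences and parse what remains as yaml
lines = [line for line in raw_source if line.strip() != "---"]
metadata = yaml.safe_load("".join(lines))
print(metadata["title"])
print(metadata["description"])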

Let’s see the start of this blog again, this time converting the data into pydantic types:

Code
from __future__ import annotations
from IPython.display import Markdown
from typing import Annotated, Literal, Any
from pydantic import BaseModel, Field
import re
import json
import yaml
from collections.abc import Iterator

class MetadataCell(BaseModel):
    metadata: dict[str, Any]
    cell_type: Literal["raw"]

    @staticmethod
    def from_json(source: list[str], cell_type: str, **kwargs) -> MetadataCell:
        # start and end have ---
        lines = [
            line
            for line in source
            if line.strip() != "---"
        ]
        content = "\n".join(lines)
        metadata = yaml.safe_load(content)
        return MetadataCell(metadata=metadata, cell_type=cell_type)

    @property
    def title(self) -> str:
        return self.metadata["title"]

    @property
    def description(self) -> str:
        if "description" not in self.metadata:
            return ""
        return self.metadata["description"]

    def _repr_markdown_(self) -> str:
        if self.description:
            return f"### {self.title}\n\n{self.description}"
        return f"### {self.title}"

class TextCell(BaseModel):
    paragraphs: list[str]
    cell_type: Literal["markdown"]

    @staticmethod
    def from_json(source: list[str], cell_type: str, **kwargs) -> TextCell:
        text = "".join(source)
        paragraphs = re.split(r"\n\n+", text)
        paragraphs = map(str.strip, paragraphs)
        paragraphs = list(paragraphs)
        return TextCell(paragraphs=paragraphs, cell_type=cell_type)

    def _repr_markdown_(self) -> str:
        content = "\n\n".join(self.paragraphs)
        content = content.replace("##", "####")
        return content

class CodeCell(BaseModel):
    source: str
    cell_type: Literal["code"]

    @staticmethod
    def from_json(source: list[str], cell_type: str, **kwargs) -> CodeCell:
        text = "".join(source)
        text = text.strip()
        return CodeCell(source=text, cell_type=cell_type)

    def _repr_markdown_(self) -> str:
        content = f"```\n{self.source}\n```"
        return content

# a discriminated union: pydantic picks the right cell class using the cell_type field
CellTypes = MetadataCell | TextCell | CodeCell
CellTypesAnnotation = Annotated[CellTypes, Field(discriminator="cell_type")]

class Notebook(BaseModel):
    path: Path
    cells: list[CellTypesAnnotation]

    @staticmethod
    def from_path(path: Path) -> Notebook | None:
        data = json.loads(path.read_text())
        nb = Notebook.from_json(path=path, **data)
        if not isinstance(nb.cells[0], MetadataCell):
            return None
        return nb

    @staticmethod
    def from_json(path: Path, cells: list[dict], **kwargs) -> Notebook:
        def to_cell(data: dict) -> CellTypes:
            match data["cell_type"]:
                case "raw":
                    return MetadataCell.from_json(**data)
                case "markdown":
                    return TextCell.from_json(**data)
                case "code":
                    return CodeCell.from_json(**data)
            raise ValueError(f"unknown cell type: {data['cell_type']} in {data}")
        converted_cells = list(map(to_cell, cells))
        return Notebook(path=path, cells=converted_cells)

    @property
    def title(self) -> str:
        assert isinstance(self.cells[0], MetadataCell)
        return self.cells[0].title

    @property
    def description(self) -> str:
        assert isinstance(self.cells[0], MetadataCell)
        return self.cells[0].description

    def __getitem__(self, key) -> Notebook:
        values = self.cells[key]
        if not isinstance(values, list):
            values = [values]
        return Notebook(path=self.path, cells=values)

    def iterate_paragraph_windows(self) -> Iterator[NotebookSection]:
        paragraph_and_location = [
            (paragraph, cell_index, paragraph_index)
            for cell_index, cell in enumerate(self.cells)
            if isinstance(cell, TextCell)
            for paragraph_index, paragraph in enumerate(cell.paragraphs)
        ]
        if len(paragraph_and_location) < 3:
            # each window is three consecutive paragraphs, so shorter posts yield nothing
            return
        for i in range(2, len(paragraph_and_location)):
            preceding, current, following = paragraph_and_location[i-2:i+1]
            _, cell_index, paragraph_index = current
            content = "\n\n".join([preceding[0], current[0], following[0]])
            yield NotebookSection(
                notebook=self,
                text=content,
                cell=cell_index,
                paragraph=paragraph_index,
            )

    def _repr_markdown_(self) -> str:
        content = "\n\n".join(cell._repr_markdown_() for cell in self.cells)
        return content

# posts live at posts/YYYY/MM/DD, so the blog root is four levels up from this post
BLOG_ROOT = Path(".").resolve().parents[3]

class NotebookSection(BaseModel):
    notebook: Notebook
    text: str
    cell: int
    paragraph: int

    @property
    def id(self) -> str:
        file = self.notebook.path.resolve()
        relative_path = file.relative_to(BLOG_ROOT)
        return f"{relative_path} cell {self.cell} paragraph {self.paragraph}"

    @property
    def content(self) -> str:
        title = self.notebook.title
        description = self.notebook.description
        return f"#### {title}\n\n{description}\n\n{self.text}"

    def _repr_markdown_(self) -> str:
        return self.content

THIS_NOTEBOOK = Path("rag-over-this-blog.ipynb")
data = json.loads(THIS_NOTEBOOK.read_text())
notebook = Notebook.from_json(path=THIS_NOTEBOOK, **data)
notebook[:4]

Retrieval Augmented Generation over … this blog

Creating a Q&A bot for this blog

I made this blog as a way to note down the tools and techniques that I use so that I could refer to them in future. I’ve also used it to present ideas to work colleagues as well as an easy way to investigate things that catch my eye. Quarto provides a search function for the website, but is it the best option?

Since Retrieval Augmented Generation (RAG) has emerged I feel that a better solution could be produced. Can I make a Q&A bot that can work over the content of the blog and provide answers with code?

This is a simple technique that is well explored by this point. My goal with this post is more to intentionally practice something simple to build up strength in this area (a “rep” if you will).

Dataset

This blog is a set of jupyter notebooks. There are also quite a few python files that get imported. The aim will be to index them.

Jupyter notebooks are json files so parsing them and extracting the sections will be required. The python files can be treated as entire documents, and it would be possible to use the ast module to break them down into logical parts.

We can start by inspecting the structure of this notebook itself. Let’s read the json of this file:


Very nice. I even used the IPython _repr_markdown_ display hook to render the content nicely.
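
If you haven’t seen that hook before, here is a minimal made-up example: when an object defines _repr_markdown_, IPython renders the returned string as markdown instead of the plain repr.

Code
from IPython.display import display

class Greeting:
    # IPython calls this when the object is displayed, rendering the result as markdown
    def _repr_markdown_(self) -> str:
        return "### Hello\n\nThis is rendered as **markdown** rather than a plain repr."

display(Greeting())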

My blog posts can be quite long. It’s normal to split longer documents up into smaller sections and then index those. This is a simple way to improve generation, as it allows fragments from different documents to be remixed. Let’s try creating a sliding window over this blog post:

Code
next(notebook.iterate_paragraph_windows())

Retrieval Augmented Generation over … this blog

Creating a Q&A bot for this blog

I made this blog as a way to note down the tools and techniques that I use so that I could refer to them in future. I’ve also used it to present ideas to work colleagues as well as an easy way to investigate things that catch my eye. Quarto provides a search function for the website, but is it the best option?

Since Retrieval Augmented Generation (RAG) has emerged I feel that a better solution could be produced. Can I make a Q&A bot that can work over the content of the blog and provide answers with code?

This is a simple technique that is well explored by this point. My goal with this post is more to intentionally practice something simple to build up strength in this area (a “rep” if you will).


With this I can then embed the text and perform a search.

The Whole Elephant

Let’s embed and search over the whole blog then. I’m going to time this to give you an idea of how fast this is. To index the blog I need to:

  • load every notebook
  • split them into sections
  • index the sections

After that we will be able to search over them.

Code
%%time

from pathlib import Path
import json
import chromadb

# the posts folder is three levels up from this post (posts/YYYY/MM/DD)
POSTS_ROOT = Path(".").resolve().parents[2]
notebooks = [
    Notebook.from_path(path)
    for path in sorted(POSTS_ROOT.glob("**/*.ipynb"))
    if ".ipynb_checkpoints" not in str(path)
]
notebooks = list(filter(None, notebooks))

client = chromadb.Client(chromadb.Settings(anonymized_telemetry=False))

if "blog" in client.list_collections():
    client.delete_collection("blog")
collection = client.create_collection("blog")

sections = [
    section
    for notebook in notebooks
    for section in notebook.iterate_paragraph_windows()
]

collection.add(
    documents=[section.content for section in sections],
    metadatas=[{"section": section.model_dump_json()} for section in sections],
    ids=[section.id for section in sections],
)
CPU times: user 31min 54s, sys: 1.85 s, total: 31min 56s
Wall time: 1min 27s

This has indexed 6,660 sections in 97 seconds. That’s about 68 sections/second. I happen to know that the underlying embedding model can run extremely quickly, which makes me think that the majority of this time was spent on file io.
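
One rough way to test that hunch would be to time the load-and-parse step on its own, reusing the Notebook class and POSTS_ROOT from above. This is a sketch that I didn’t time for this post:

Code
import time

start = time.perf_counter()
# repeat the load-and-parse step without touching the embedding model
reloaded = [
    Notebook.from_path(path)
    for path in sorted(POSTS_ROOT.glob("**/*.ipynb"))
    if ".ipynb_checkpoints" not in str(path)
]
elapsed = time.perf_counter() - start
print(f"load and parse: {elapsed:.1f}s for {len(reloaded)} notebooks")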

We can now try out a simple query. I want to find out what distillation is.

Code
results = collection.query(
    query_texts=["What is model distillation?"],
    n_results=1,
)
for document_id, metadata in zip(results["ids"][0], results["metadatas"][0]):
    print(document_id)
    section = NotebookSection(**json.loads(metadata["section"]))
    display(section)
    display(Markdown("<hr/>"))
posts/2022/04/13/huggingface-distillation-workshop.ipynb cell 5 paragraph 2

Huggingface Distillation Workshop

Create a distilled model

My initial notes on the workshop were:

What is Distillation?

in distillation the student is trained on the task, and is also trained to produce the same class distribution as the teacher (referred to as a knowledge distillation loss parameter). It has a composite loss function where the accuracy of the student on the task is combined with the degree to which the student output matches the teacher.


This has worked well, identifying the exact post where I initially discussed the subject. It really is a very simple approach, yet it works.

Generation using Retrieved Documents

The next part is to incorporate this into the generation of an answer. DeepSeek is very popular right now so using that should produce something interesting.

I’m going to collect more results for each query and then filter them down so that each one comes from a different post, increasing diversity.

Code
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
Code
from IPython.display import Markdown

def answer_question(question: str) -> Markdown:
    results = collection.query(
        query_texts=[question],
        n_results=10,
    )
    sections = [
        NotebookSection(**json.loads(document["section"]))
        for document in results["metadatas"][0]
    ]

    # keep only the highest ranked section from each post to increase diversity
    seen_paths = set()
    documents = []
    for section in sections:
        if section.notebook.path in seen_paths:
            continue
        documents.append(section)
        seen_paths.add(section.notebook.path)

    prompt_documents = [
        f"""
Title: {document.notebook.title}
File: {document.notebook.path}
Content:
{document.text}
""".strip()
        for document in documents
    ]
    document_str = "\n\n".join(prompt_documents)

    prompt = f"""
You are a helpful assistant that answers questions about my blog using the content on my blog.
The user has asked a question and I am going to provide you with some sourced context from the blog.
Please answer the question using only the context and include the file that the content is sourced from.

The question is: {question}
The documents available are:
{document_str}
    """.strip()

    tokens = tokenizer.apply_chat_template(
        conversation=[
            {"role": "user", "content": prompt},
        ],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    tokens = tokens.to(model.device)

    outputs = model.generate(
        tokens,
        max_new_tokens=1_000,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
    output_tokens = outputs[0][len(tokens[0]):]
    output_tokens = output_tokens.tolist()
    # DeepSeek-R1 models emit their reasoning before </think>; keep only what follows
    end_of_think_token = tokenizer.get_added_vocab()["</think>"]
    if end_of_think_token in output_tokens:
        think_index = output_tokens.index(end_of_think_token)
        output_tokens = output_tokens[think_index + 1:]
    answer = tokenizer.decode(output_tokens, skip_special_tokens=True)
    answer = "\n".join(
        f"> {line}"
        for line in answer.splitlines()
    )
    
    return Markdown(f"""
Question: {question}

Answer:
{answer}
>
> <small>DeepSeek-R1-Distill-Qwen-1.5B</small>
""")
Code
answer_question("What is model distillation?")

Question: What is model distillation?

Answer:

> Model distillation is a training technique where a student neural network is trained alongside a teacher model. The goal is to improve the student’s performance and diversity by leveraging the teacher’s knowledge. The composite loss function combines the student’s accuracy on the target task and the similarity of its outputs to those of the teacher. This process often involves adjusting the temperature parameter in the loss function to fine-tune the model’s predictions, allowing the student to produce more accurate and diverse outputs.
>
> <small>DeepSeek-R1-Distill-Qwen-1.5B</small>

Code
answer_question("What is prompt internalization?")

Question: What is prompt internalization?

Answer:

> Prompt internalization refers to the process by which a model learns and incorporates the original prompt into its output. This is measured using the KL Divergence, which quantifies the difference between the model’s output and the intended prompt. The user is testing this concept on more complex tasks to assess its effectiveness.
>
> <small>DeepSeek-R1-Distill-Qwen-1.5B</small>

This works well. I’m pleased that it was able to provide a reasonable answer to the prompt internalization question, as I was playing around with that some time ago and I don’t think the technique is referred to by that name elsewhere. It’s extremely slow though, so not really practical for the blog.

As always, my thoughts now turn to how this could be improved. Recently I tried to decompose documents into atomic facts that were then linked together to form a graph. Doing RAG over that graph would be very interesting, as it would allow answers to be formed from content spread across multiple documents more effectively than the snippet approach. It would also be helpful to include code snippets in the embeddings, as I often try to locate previous code samples to remind myself how I did something before.

Still, a good rep.