I made this blog as a way to note down the tools and techniques that I use so that I can refer to them in future. I’ve also used it to present ideas to work colleagues, as well as an easy way to investigate things that catch my eye. There is a search function for the website that Quarto provides, but is it the best option?
Since Retrieval Augmented Generation (RAG) has emerged, I feel that a better solution could be produced. Can I make a Q&A bot that works over the content of the blog and provides answers with code?
This is a simple technique that is well explored by this point. My goal with this post is more to intentionally practice something simple to build up strength in this area (a “rep” if you will).
Dataset
This blog is a set of Jupyter notebooks. There are also quite a few Python files that get imported. The aim will be to index them all.
Jupyter notebooks are JSON files, so parsing them and extracting the sections will be required. The Python files can be treated as entire documents, and it would also be possible to use the ast module to break them down into logical parts, as sketched below.
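As an aside, a minimal sketch of that ast approach (a hypothetical helper, not the code used in this post) could be:
Code
import ast
from pathlib import Path

def top_level_chunks(path: Path) -> list[str]:
    """Split a Python file into its top-level function and class definitions."""
    source = path.read_text()
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]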
We can start by inspecting the structure of this notebook itself. Let’s read the JSON of this file:
Code
from pathlib import Path
import json

THIS_NOTEBOOK = Path("rag-over-this-blog.ipynb")
data = json.loads(THIS_NOTEBOOK.read_text())
markdown_cells = [
    cell
    for cell in data["cells"]
    if cell["cell_type"] == "markdown"
]
markdown_cells[:3]
[{'cell_type': 'markdown',
'id': '6613f22d-167b-419f-90eb-92d0806be1af',
'metadata': {},
'source': ['I made this blog as a way to note down the tools and techniques that I use so that I could refer to them in future.\n',
"I've also used it to present ideas to work colleagues as well as an easy way to investigate things that catch my eye.\n",
'There is a search function for the website that Quarto provides, is it the best though?\n',
'\n',
'Since Retrieval Augmented Generation (RAG) has emerged I feel that a better solution could be produced.\n',
'Can I make a Q&A bot that can work over the content of the blog and provide answers with code?\n',
'\n',
'This is a simple technique that is well explored by this point.\n',
'My goal with this post is more to intentionally practice something simple to build up strength in this area (a "rep" if you will).']},
{'cell_type': 'markdown',
'id': 'e50ab6ae-52d4-4f53-ad6b-259765b5a543',
'metadata': {},
'source': ['## Dataset\n',
'\n',
'This blog is a set of jupyter notebooks.\n',
'There are also quite a few python files that get imported.\n',
'The aim will be to index them.\n',
'\n',
'Jupyter notebooks are json files so parsing them and extracting the sections will be required.\n',
'The python files can be treated as entire documents, and it would be possible to use the ast module to break them down into logical parts.']},
{'cell_type': 'markdown',
'id': '8b5fbdc0-b17b-4596-b987-b66000151f52',
'metadata': {},
'source': ['We can start by inspecting the structure of this notebook itself.\n',
"Let's read the json of this file:"]}]
Here we can see the three markdown blocks that form the content of this blog post. The source of each cell contains the text of that part, so extracting it will be easy. The structure can even be modelled with pydantic, which would also allow consistent extraction of the post metadata. The post metadata sits in a raw cell at the top of the blog post and contains the title, date and description.
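The model definitions themselves are folded away in this post; a rough sketch of the shape they take (the fields here are illustrative assumptions) might be:
Code
from pathlib import Path
from pydantic import BaseModel

class Cell(BaseModel):
    cell_type: str
    source: list[str]

    @property
    def text(self) -> str:
        # the source is stored as a list of lines, so join them back up
        return "".join(self.source)

class Notebook(BaseModel):
    path: Path
    cells: list[Cell]

    @classmethod
    def from_json(cls, path: Path, **data) -> "Notebook":
        return cls(path=path, cells=data["cells"])

The real Notebook model also parses the title, date and description out of that raw metadata cell.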
Let’s see the start of this blog again, this time converting the data into pydantic types:
I made this blog as a way to note down the tools and techniques that I use so that I could refer to them in future. I’ve also used it to present ideas to work colleagues as well as an easy way to investigate things that catch my eye. There is a search function for the website that Quarto provides, is it the best though?
Since Retrieval Augmented Generation (RAG) has emerged I feel that a better solution could be produced. Can I make a Q&A bot that can work over the content of the blog and provide answers with code?
This is a simple technique that is well explored by this point. My goal with this post is more to intentionally practice something simple to build up strength in this area (a “rep” if you will).
Dataset
This blog is a set of jupyter notebooks. There are also quite a few python files that get imported. The aim will be to index them.
Jupyter notebooks are json files so parsing them and extracting the sections will be required. The python files can be treated as entire documents, and it would be possible to use the ast module to break them down into logical parts.
We can start by inspecting the structure of this notebook itself. Let’s read the json of this file:
My blog posts can be quite long. It’s normal to split longer documents into smaller sections and then index those. This is a simple way to improve generation, as it allows answers to remix fragments from different documents.
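The window code itself is folded away in this post; a minimal sketch of the idea, assuming the notebook has already been reduced to a list of paragraph strings, might look like this:
Code
from collections.abc import Iterator

def paragraph_windows(paragraphs: list[str]) -> Iterator[str]:
    """Yield each paragraph joined with its neighbouring paragraphs for context."""
    for index in range(len(paragraphs)):
        window = paragraphs[max(0, index - 1) : index + 2]
        yield "\n\n".join(window)

The real iterate_paragraph_windows also carries the title, description and position metadata along with each window. Let’s try creating a sliding window over this blog post: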
Code
next(notebook.iterate_paragraph_windows())
Retrieval Augmented Generation over … this blog
Creating a Q&A bot for this blog
I made this blog as a way to note down the tools and techniques that I use so that I could refer to them in future. I’ve also used it to present ideas to work colleagues as well as an easy way to investigate things that catch my eye. There is a search function for the website that Quarto provides, is it the best though?
Since Retrieval Augmented Generation (RAG) has emerged I feel that a better solution could be produced. Can I make a Q&A bot that can work over the content of the blog and provide answers with code?
This is a simple technique that is well explored by this point. My goal with this post is more to intentionally practice something simple to build up strength in this area (a “rep” if you will).
With this I can then embed the text and perform a search.
Embedding and Vector Search
The embedding model I use most often is sentence-transformers/all-MiniLM-L6-v2, which is “good enough until you can prove it isn’t”. One limitation is that its input is capped at 512 tokens, which will be a problem when embedding longer sections of code. Let’s not worry about that too much right now; this is a quick rep after all!
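If the truncation limit ever becomes a real concern, it’s easy to check what the model will actually keep; a quick sketch (the exact limit reported depends on the sentence-transformers configuration):
Code
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# anything beyond this many tokens is silently truncated before embedding
print(embedder.max_seq_length)
# how many tokens a candidate document would actually use
print(len(embedder.tokenizer.tokenize("some long section of code ...")))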
The Chroma vector database has embedding built in and uses this model by default, so it will be the fastest way to get going. I want to embed every paragraph (with the preceding and following paragraphs for context) as well as the title and description.
The aim is to iterate through each notebook and generate these windows of content. Each will carry metadata indicating where in the document it came from, so matches can be traced back to their source.
As a quick test we can embed this notebook and then search for “What is good enough?” which should return this paragraph.
Code
import chromadb
import json
from IPython.display import Markdown

# chroma embeds documents on insert, using all-MiniLM-L6-v2 by default
client = chromadb.Client(chromadb.Settings(anonymized_telemetry=False))
if "blog" in client.list_collections():
    client.delete_collection("blog")
collection = client.create_collection("blog")

# parse this notebook and generate the paragraph windows
THIS_NOTEBOOK = Path("rag-over-this-blog.ipynb")
data = json.loads(THIS_NOTEBOOK.read_text())
notebook = Notebook.from_json(path=THIS_NOTEBOOK, **data)
sections = list(notebook.iterate_paragraph_windows())

collection.add(
    documents=[section.content for section in sections],
    metadatas=[{"section": section.model_dump_json()} for section in sections],
    ids=[section.id for section in sections],
)

results = collection.query(
    query_texts=["What is good enough?"],
    n_results=1,
)
result_sections = [
    NotebookSection(**json.loads(metadata["section"]))
    for metadata in results["metadatas"][0]
]
for result in result_sections:
    display(result)
    display(Markdown("---"))
As a quick test we can embed this notebook and then search for “What is good enough?” which should return this paragraph.
Well, that worked and started getting very meta, heh.
Let’s continue.
The Whole Elephant
Let’s embed and search over the whole blog then. I’m going to time this to give you an idea of how fast this is. To index the blog I need to:
load every notebook
split them into sections
index the sections
After that, we will be able to search over them.
Code
%%time
from pathlib import Path
import json
import chromadb

# load every notebook in the blog, skipping checkpoint copies
POSTS_ROOT = Path(".").resolve().parents[2]
notebooks = [
    Notebook.from_path(path)
    for path in sorted(POSTS_ROOT.glob("**/*.ipynb"))
    if ".ipynb_checkpoints" not in str(path)
]
notebooks = list(filter(None, notebooks))

client = chromadb.Client(chromadb.Settings(anonymized_telemetry=False))
if "blog" in client.list_collections():
    client.delete_collection("blog")
collection = client.create_collection("blog")

# split every notebook into sections and index them all
sections = [
    section
    for notebook in notebooks
    for section in notebook.iterate_paragraph_windows()
]
collection.add(
    documents=[section.content for section in sections],
    metadatas=[{"section": section.model_dump_json()} for section in sections],
    ids=[section.id for section in sections],
)
CPU times: user 31min 54s, sys: 1.85 s, total: 31min 56s
Wall time: 1min 27s
This has indexed 6,660 sections in 87 seconds. That’s about 77 sections/second. I happen to know that the underlying embedding model can run extremely quickly, which makes me think that the majority of this time was spent on file I/O.
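One way to test that suspicion would be to time the embedding model on its own, with the sections already in memory; a sketch along these lines (not run here):
Code
import time
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [section.content for section in sections]

# time only the embedding, with no file reading or parsing involved
start = time.perf_counter()
embedder.encode(texts, batch_size=64, show_progress_bar=False)
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:.0f} sections/second")

If that rate came out far higher than 77 sections/second, then the bottleneck really is loading and parsing the notebooks rather than the model.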
We can now try out a simple query. I want to find out what distillation is.
Code
results = collection.query(
    query_texts=["What is model distillation?"],
    n_results=1,
)
for document_id, metadata in zip(results["ids"][0], results["metadatas"][0]):
    print(document_id)
    section = NotebookSection(**json.loads(metadata["section"]))
    display(section)
    display(Markdown("<hr/>"))
in distillation the student is trained on the task, and is also trained to produce the same class distribution as the teacher (referred to as a knowledge distillation loss parameter). It has a composite loss function where the accuracy of the student on the task is combined with the degree to which the student output matches the teacher.
This has worked well, identifying the exact post where I first discussed the subject. This really is a very simple approach, and it has worked.
Generation using Retrieved Documents
The next part is to incorporate this into the generation of an answer. DeepSeek is very popular right now, so using that should produce something interesting.
I’m going to collect more results for each query and then filter them down to one per post to increase diversity.
Code
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
Code
from IPython.display import Markdown

def answer_question(question: str) -> Markdown:
    results = collection.query(
        query_texts=[question],
        n_results=10,
    )
    sections = [
        NotebookSection(**json.loads(document["section"]))
        for document in results["metadatas"][0]
    ]
    # keep only the best section from each post to increase diversity
    seen_paths = set()
    documents = []
    for section in sections:
        if section.notebook.path in seen_paths:
            continue
        documents.append(section)
        seen_paths.add(section.notebook.path)
    prompt_documents = [
        f"""Title: {document.notebook.title}
File: {document.notebook.path}
Content:
{document.text}""".strip()
        for document in documents
    ]
    document_str = "\n\n".join(prompt_documents)
    prompt = f"""You are a helpful assistant that answers questions about my blog using the content on my blog.
The user has asked a question and I am going to provide you with some sourced context from the blog.
Please answer the question using only the context and include the file that the content is sourced from.

The question is: {question}

The documents available are:

{document_str}
""".strip()
    tokens = tokenizer.apply_chat_template(
        conversation=[
            {"role": "user", "content": prompt},
        ],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    tokens = tokens.to(model.device)
    outputs = model.generate(
        tokens,
        max_new_tokens=1_000,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
    output_tokens = outputs[0][len(tokens[0]):].tolist()
    # drop the DeepSeek reasoning prefix, keeping only the text after </think>
    end_of_think_token = tokenizer.get_added_vocab()["</think>"]
    if end_of_think_token in output_tokens:
        think_index = output_tokens.index(end_of_think_token)
        output_tokens = output_tokens[think_index + 1:]
    answer = tokenizer.decode(output_tokens, skip_special_tokens=True)
    # render the answer as a blockquote
    answer = "\n".join(
        f"> {line}"
        for line in answer.splitlines()
    )
    return Markdown(f"""Question: {question}

Answer:

{answer}
>
> <small>DeepSeek-R1-Distill-Qwen-1.5B</small>""")
Code
answer_question("What is model distillation?")
Question: What is model distillation?
Answer:
> Model distillation is a training technique where a student neural network is trained alongside a teacher model. The goal is to improve the student’s performance and diversity by leveraging the teacher’s knowledge. The composite loss function combines the student’s accuracy on the target task and the similarity of its outputs to those of the teacher. This process often involves adjusting the temperature parameter in the loss function to fine-tune the model’s predictions, allowing the student to produce more accurate and diverse outputs.
>
> DeepSeek-R1-Distill-Qwen-1.5B
Code
answer_question("What is prompt internalization?")
Question: What is prompt internalization?
Answer:
> Prompt internalization refers to the process by which a model learns and incorporates the original prompt into its output. This is measured using the KL Divergence, which quantifies the difference between the model’s output and the intended prompt. The user is testing this concept on more complex tasks to assess its effectiveness.
>
> DeepSeek-R1-Distill-Qwen-1.5B
This works well. I’m pleased that it was able to provide a reasonable answer to the prompt internalization question, as I was playing around with that some time ago and I don’t think the technique is referred to in that way elsewhere. It’s extremely slow though, so not really practical for the blog.
As always, my thoughts now turn to how this could be improved. Recently I tried to decompose documents into atomic facts that were then linked together to form a graph. Doing RAG over that graph would be very interesting, as it would allow answers to be formed from content spread across multiple documents more effectively than the snippet approach. It would also be helpful to include code snippets in the embeddings, as I often try to locate previous code samples to remind myself how I did something before.