In the previous post, I learned how LLMs work from an LLM itself (Copilot, to be specific). In this post, I will learn about RAG, again from Copilot.
Methodology:
- Ask Copilot questions, starting from “What is RAG?”, and read the answers
- Write down my understanding, asking Copilot if I’m correct
- Revise my writing until my understanding is solid
Let’s start today’s learning.
What is RAG?
RAG (Retrieval-Augmented Generation) is a workflow that retrieves relevant external information and injects it into user questions for an LLM to consume. RAG compensates for these LLM weak points:
- no knowledge of internal/confidential documents (eg, Confluence pages)
- relatively small context window
RAG uses a vector database that stores embedded documents the LLM doesn’t have access to, and supplies them as additional context. The vector database measures the similarity between the user’s question and those documents using similarity metrics (eg, cosine similarity, dot product).
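As a toy illustration of this similarity search, here is a minimal sketch (not part of the PoC; real embeddings have hundreds of dimensions, not three) that ranks document vectors by cosine similarity to a query vector:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); 1.0 means the vectors point the same way
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.9, 0.1, 0.0]
doc_vecs = {"doc_a": [0.8, 0.2, 0.1], "doc_b": [0.0, 0.1, 0.9]}

# Rank documents by similarity to the query (highest first)
ranked = sorted(doc_vecs,
                key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                reverse=True)
print(ranked)  # doc_a points roughly the same way as the query, so it ranks first
```

A vector database does essentially this, but over millions of vectors with approximate-nearest-neighbor indexes instead of a linear scan.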
User questions are processed in a RAG pipeline, and appropriate context is added to them for the LLM to consume. RAG follows these steps:
- embed query
- retrieve related chunks
- add them to prompt
- LLM generates answers
Chunking
Before using RAG, we need to create a vector database and ingest embedded documents into it. Since a single embedding cannot meaningfully represent a 100+ page document, we need to split documents into many chunks.
In the database, each chunk represents a meaningful block of a document. When the pipeline receives a question, the question is embedded and the database looks for the chunks most relevant to it.
How to split a document into chunks is an important design decision for a pipeline: for example, using a fixed number of tokens (eg, 300-500) or splitting at document headers.
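As an example of the header-based approach, here is a hypothetical sketch that chunks an org-mode file by splitting at heading lines starting with “*” (the PoC below uses the fixed-size approach instead):

```python
def chunk_by_headers(text):
    # Split an org-mode document into chunks, starting a new chunk
    # at every heading line (lines beginning with "*")
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("*") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "* Intro\nSome text.\n* Usage\nMore text."
print(chunk_by_headers(doc))  # two chunks, one per heading
```

Header-based chunks tend to align with the document’s logical sections, at the cost of uneven chunk sizes.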
The PoC pipeline
I’m going to build the following pipeline for this PoC.
┌──────────────────────────┐
│ User Query │
└─────────────┬────────────┘
│
▼
┌────────────────────────┐
│ 1. Embed the Query │
│ (MiniLM via Ollama) │
└─────────────┬──────────┘
│
▼
┌────────────────────────┐
│ 2. Retrieve Top-k │
│ Chunks from Chroma │
└─────────────┬──────────┘
│
▼
┌────────────────────────┐
│ 3. Build Prompt with │
│ Retrieved Context │
└─────────────┬──────────┘
│
▼
┌────────────────────────┐
│ 4. Generate Answer via │
│ LLM (llama3.2:3b) │
└─────────────┬──────────┘
│
▼
┌──────────────────────────┐
│ Final Answer │
└──────────────────────────┘
High-level system architecture
In this PoC, I will create a single Python script that controls the entire pipeline.
Ollama will run as a server, providing both the embedding model and the LLM; on the backend, a vector DB (Chroma DB) will be running. The DB will store the embedded document chunks and compute the similarity between queries and chunks.
┌──────────────────────────────┐
│ Python App │
│ (Ingestion + RAG Pipeline) │
└───────────────┬──────────────┘
│
▼
┌────────────────────────────────────────────────────┐
│ Local Services │
│ │
│ ┌────────────────────┐ ┌──────────────────┐ │
│ │ Ollama Server │ │ ChromaDB │ │
│ │ - Embedding Model │◀───▶│ - Vector Store │ │
│ │ - LLM (1B/3B) │ │ - Persistent DB │ │
│ └────────────────────┘ └──────────────────┘ │
└────────────────────────────────────────────────────┘
The document will flow like this:
Document (README.org)
│
▼
Chunking (80 words, 20 overlap)
│
▼
Embedding (MiniLM via Ollama)
│
▼
ChromaDB (Persistent Collection)
Setup demo environment
For this PoC demo, I’ve chosen ChromaDB and Ollama because they are lightweight and free. My homelab server isn’t powerful (an N150-based mini PC without a GPU).
Set up Python 3.10+. I already have it.
$ python --version
Python 3.13.5
Create a project directory and activate a venv.
mkdir ragdemo; cd ragdemo
python -m venv ve
source ve/bin/activate
Install ChromaDB
pip install chromadb
Install ollama and pull an embedding model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull nomic-embed-text
ollama list # to verify
Create a test script (t.py).
import chromadb
import requests

def embed(text):
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text}
    )
    return r.json()["embedding"]

client = chromadb.Client()
collection = client.create_collection("demo")

chunk = "This is a test chunk."
collection.add(
    ids=["chunk_1"],
    embeddings=[embed(chunk)],
    documents=[chunk]
)

query = "What is this about?"
results = collection.query(
    query_embeddings=[embed(query)],
    n_results=3
)
print(results)
This creates a Chroma DB with a collection “demo” in memory, embeds a single chunk “This is a test chunk.” into it, and searches for “What is this about?”. It’s interesting to see that this is not a simple keyword search.
Run it.
(ve) achiwa@act:~/py/ragdemo$ python t.py
{'ids': [['chunk_1']], 'embeddings': None, 'documents': [['This is a test chunk.']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[None]], 'distances': [[481.20458984375]]}
Looks like it’s working. As the sole document was “This is a test chunk.” and the query was “What is this about?”, the distance between these strings was rather large: 481.20…, but this shouldn’t be an issue.
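As far as I can tell from its documentation, Chroma’s default distance metric is (squared) L2, which grows with embedding magnitude, so the absolute number matters less than the relative ranking. A small self-contained sketch of why raw L2 distance can look huge even when two vectors point in the same direction:

```python
import math

def l2_sq(a, b):
    # Squared Euclidean distance: sensitive to vector magnitude
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity: depends only on direction, not magnitude
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

a = [3.0, 4.0]    # same direction...
b = [30.0, 40.0]  # ...but 10x the magnitude
print(l2_sq(a, b))   # 2025.0: large raw distance
print(cosine(a, b))  # 1.0: identical direction
```

So a large distance such as 481 is not alarming by itself; what matters is which chunks rank closest to the query.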
Using a real document
Next, let’s use a real document as the test document. I chose the README.org file from one of my GitHub repos: https://raw.githubusercontent.com/achiwa912/vbs/refs/heads/main/README.org
Download the file to the project directory.
wget https://raw.githubusercontent.com/achiwa912/vbs/refs/heads/main/README.org
mv README.org vbs_readme.org
Try vbs_readme.org.
import chromadb
import requests

def embed(text):
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60
    )
    data = r.json()
    if "embedding" not in data:
        print("Ollama returned an error:", data)
        return None
    return data["embedding"]

def chunk_text(text, size=300, overlap=50):
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = words[i:i+size]
        chunks.append(" ".join(chunk))
        i += size - overlap
    return chunks

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("demo")

with open("vbs_readme.org") as f:
    text = f.read()

chunks = chunk_text(text, size=120, overlap=30)
for i, chunk in enumerate(chunks):
    vec = embed(chunk)
    if vec is None:
        print(f"Skipping chunk {i} due to embedding error")
        continue
    collection.add(
        ids=[f"vbs_{i}"],
        embeddings=[vec],
        documents=[chunk],
        metadatas=[{"source": "vbs_readme.org"}]
    )

results = collection.query(
    query_embeddings=[embed("How does the shifting leaning window work?")],
    n_results=3
)
print(results)
It took about 10 minutes to process the short-to-medium-sized vbs_readme.org file. Embedding this document with Ollama alone was a bit too much for the Intel N150 CPU without a GPU.
Let’s use a more lightweight embedding model.
$ ollama pull all-minilm
$ ollama list
NAME ID SIZE MODIFIED
all-minilm:latest 1b226e2802db 45 MB 10 hours ago
nomic-embed-text:latest 0a109f422b47 274 MB 10 hours ago
Change the chunk size to 80 words (chunk_text splits on whitespace, so the size is in words, not tokens).
chunks = chunk_text(text, size=80, overlap=20)
Delete the existing database so the document is re-ingested from scratch.
$ rm -rf chroma/
Run the script again.
$ date; python t.py; date
Mon Mar 16 06:28:30 AM EDT 2026
{'ids': [['vbs_5', 'vbs_4', 'vbs_25']], 'embeddings': None, 'documents': [['window can have 10 words, and initially has the 1st 10 words in a word book. (More precicely, words are randomly picked) As you repeat 10 words in the learning window, you memorize a word or two. Then, the learning window replaces the memorized word with a new word. This way, the learning window has 10 words that you are actively working on. It shifts through a word book and eventually reaches the end of the book. Then, the next', 'most likely to forget almost everything when you encountr the same word for the 2nd, 3rd or 4th time. vocaBull addresses this issue by introducing a combination of high-frequency repetitions of small number of words, and low-frequency repetitions of medium number of words - the Shifting Learning Window method. [[./images/vbs_SLW.jpg]] The high-frequenry repetitions are realized with the "learning window". Learning window can have 10 words, and initially has the 1st 10 words in a word book. (More precicely, words are', 'Vocabull Server is under [[https://en.wikipedia.org/wiki/MIT_License][MIT license]]. * Contact Kyosuke Achiwa - achiwa912+gmail.com (please replace + with @) Project link: [[https://github.com/achiwa912/vbs]] Blog article: https://achiwa912.github.io/vbs_eng.html * Acknowledgments - Vocabull uses user management and other parts from the fabulous =Flask Web Development= (by Miguel Grinberg) [[https://www.oreilly.com/library/view/flask-web-development/9781491991725/][book]] and [[https://github.com/miguelgrinberg/flasky][companion github repository]] - Vocabull uses a bootstrap 4 theme =litera= from [[bootswatch CDN]]']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'source': 'vbs_readme.org'}, {'source': 'vbs_readme.org'}, {'source': 'vbs_readme.org'}]], 'distances': [[1.2157325744628906, 1.7700687646865845, 1.789774775505066]]}
Mon Mar 16 06:29:01 AM EDT 2026
Looks good. For the question “How does the shifting leaning window work?”, the script retrieved relevant context. It took about 30 seconds to run without errors. Not too bad!
Asking LLM with context
The context is ready. Now, let’s ask the LLM the question. First, pull an LLM.
$ ollama pull llama3.2:1b
<snip>
$ ollama list
NAME ID SIZE MODIFIED
llama3.2:1b baf6a787fdff 1.3 GB 23 minutes ago
all-minilm:latest 1b226e2802db 45 MB 11 hours ago
nomic-embed-text:latest 0a109f422b47 274 MB 12 hours ago
llama3.2:1b is a small, lightweight LLM that can run on CPUs like the N150. Of course, it is nowhere near as powerful as ChatGPT or Gemini, but it should be sufficient for my PoC.
Add a few functions and use them in the script.
def ask_llm(question, context):
    prompt = f"Answer the question using ONLY the context.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:1b", "prompt": prompt, "stream": False}
    )
    # print("RAW LLM OUTPUT:", r.json())  # for debug
    return r.json()["response"]

def rag(question):
    qvec = embed(question)
    results = collection.query(query_embeddings=[qvec], n_results=3)
    if not results["documents"] or not results["documents"][0]:
        return "No relevant context found."
    context = "\n\n".join(results["documents"][0])
    return ask_llm(question, context)

results = rag("How does the shifting leaning window work?")
And run it.
The Shifting Learning Window method works by introducing two types of word repetitions: high-frequency repetitions and low-frequency repetitions.
High-frequency repetitions involve repeating the same small number of words (typically 1-5) multiple times, such as “hello”, “world”, or “python”. These are called “learning windows”.
Low-frequency repetitions involve repeating a medium number of words (typically 10-20) less frequently, such as “machine learning”, “artificial intelligence”, or “computer vision”.
The learning window shifts through the word book by replacing the memorized word with a new word at random intervals. This ensures that the learner is constantly faced with novel information and increases the likelihood of forgetting previously learned words.
This is awesome! A tiny LLM is actually running on a mini PC! But I see some hallucinations. The model is probably too small for this task.
- High-frequency repetition should be about 10 words
- Where did “hello”, “world” and “python” come from?, etc.
Let’s try a larger model:
ollama pull llama3.2:3b
It is about 2 GB. A bit heavy on storage, but that’s all right. Change this line to use the new model.
json={"model": "llama3.2:3b", "prompt": prompt, "stream": False}
This time, it took 4 minutes to complete.
The Shifting Learning Window method works by introducing a combination of high-frequency repetitions of small number of words and low-frequency repetitions of medium number of words. It uses a “learning window” that can have 10 words, initially filled with the first 10 words from a word book. As you repeat 10 words in the learning window, you memorize one or two words, and then the learning window replaces the memorized words with new ones. This process continues until the learning window reaches the end of the book, at which point it repeats the cycle.
Wow, what an improvement! Now I see no hallucinations, and the model summarized the shifting learning window very well.
Lessons learned
From this simple PoC, I’ve learned:
- a PoC-level RAG pipeline is not difficult to implement. It took a day and a half for both the PoC and writing this blog article.
- an Intel N150-based, low-power PC can host a lightweight LLM
- chunk size is very important in a low-power environment. I needed to set it as low as 80 to avoid errors
- choosing the smallest LLM is not always a good idea. I needed to upgrade from a 1.3 GB model to a 2.0 GB model to get useful, hallucination-free results
Future improvements
This PoC was done in an environment with very limited resources. A more powerful PC with a GPU should be able to handle larger chunks and host a smarter local LLM. Integrating with OpenAI or other APIs (paid!) would enable much more powerful LLMs and thus produce better results.
For the python script, there are possible enhancements:
- Make it a web application using Flask, possibly with Bootstrap
- Containerize
- Add web and PDF crawlers to add more context information
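As a sketch of the Flask idea, here is a minimal, hypothetical JSON endpoint. The rag() function here is a stub; in a real version it would be replaced by the rag() from ragdemo.py, and the route name and payload shape are my own assumptions:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stub standing in for rag() from ragdemo.py
def rag(question):
    return f"(answer for: {question})"

@app.route("/ask", methods=["POST"])
def ask():
    # Expect a JSON body like {"question": "..."}
    payload = request.get_json(silent=True) or {}
    question = payload.get("question", "")
    if not question:
        return jsonify({"error": "question is required"}), 400
    return jsonify({"answer": rag(question)})

if __name__ == "__main__":
    app.run(port=5000)
```

With this in place, a Bootstrap front end would only need to POST the question and render the returned answer.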
Conclusion
I conducted a PoC of an end-to-end RAG workflow:
- Prepared context information
- Embedded it and added it to a vector DB
- Gave a question related to the context
- Embedded the question and searched for similar context information
- Generated a prompt with the additional context information for the LLM to consume
- Gave it to the LLM, and the LLM answered using the context information
The LLM’s answer was a good summary of the context information without hallucinations, suggesting that this setup can be practical and useful in daily professional life.
Appendix
Learning with Copilot
This is not directly related to the PoC itself, but here is what I learned about Copilot.
- Code snippets Copilot shows sometimes do not work as-is and require troubleshooting, which is a great opportunity to acquire deeper knowledge on the subject
- Just following troubleshooting steps from Copilot is not effective. They might do more harm. Asking why it suggests the steps is important
- Copilot sometimes forgets about my environment
- Copilot can be seen and treated as a junior teacher, but one with exceptionally broad knowledge. It’s a very unbalanced combination, so use it wisely.
- Copilot is always encouraging. It’s a great learning partner.
System diagram
┌──────────────────────────┐
│ Source Document │
│ (vbs_readme.org) │
└─────────────┬────────────┘
│
▼
┌──────────────────────────┐
│ Chunking (80/20) │
└─────────────┬────────────┘
│
▼
┌──────────────────────────┐
│ Embedding Model (MiniLM)│
│ via Ollama │
└─────────────┬────────────┘
│
▼
┌──────────────────────────┐
│ Chroma Vector Store │
│ (Persistent Collection) │
└─────────────┬────────────┘
│
▼
┌────────────────────────────────────┐
│ RAG Query Flow │
│ (embed -> retrieve -> generate) │
└─────────────┬──────────────────────┘
│
▼
┌──────────────────────────┐
│ Query Embedding (MiniLM)│
└─────────────┬────────────┘
│
▼
┌──────────────────────────┐
│ Retrieve Top-k Chunks │
│ from Chroma │
└─────────────┬────────────┘
│
▼
┌──────────────────────────┐
│ LLM (llama3.2:3b) via │
│ Ollama │
└─────────────┬────────────┘
│
▼
┌──────────────────────────┐
│ Final Answer │
└──────────────────────────┘
Project environment
I’ve created a GitHub repository for this PoC. Please take a look if you are interested.
Directory structure:
$ tree -L 2
.
├── chroma
│ ├── b8810b00-c991-49e9-8148-70be3f602df8
│ └── chroma.sqlite3
├── t.py
├── vbs_readme.org
└── ve
├── bin
├── include
├── lib
├── lib64 -> lib
├── pyvenv.cfg
└── share
Installed ollama models:
$ ollama list
NAME ID SIZE MODIFIED
llama3.2:3b a80c4f17acd5 2.0 GB 35 minutes ago
llama3.2:1b baf6a787fdff 1.3 GB About an hour ago
all-minilm:latest 1b226e2802db 45 MB 12 hours ago
The complete source code of the script (ragdemo.py):
import chromadb
import requests

def embed(text):
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "all-minilm", "prompt": text},
        timeout=60
    )
    data = r.json()
    if "embedding" not in data:
        print("Ollama returned an error:", data)
        return None
    return data["embedding"]

def chunk_text(text, size=300, overlap=50):
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = words[i:i+size]
        chunks.append(" ".join(chunk))
        i += size - overlap
    return chunks

def ask_llm(question, context):
    prompt = f"Answer the question using ONLY the context.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:3b", "prompt": prompt, "stream": False}
    )
    # print("RAW LLM OUTPUT:", r.json())
    return r.json()["response"]

def rag(question):
    qvec = embed(question)
    results = collection.query(query_embeddings=[qvec], n_results=3)
    if not results["documents"] or not results["documents"][0]:
        return "No relevant context found."
    context = "\n\n".join(results["documents"][0])
    return ask_llm(question, context)

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("demo")

with open("vbs_readme.org") as f:
    text = f.read()

chunks = chunk_text(text, size=80, overlap=20)
for i, chunk in enumerate(chunks):
    vec = embed(chunk)
    if vec is None:
        print(f"Skipping chunk {i} due to embedding error")
        continue
    collection.add(
        ids=[f"vbs_{i}"],
        embeddings=[vec],
        documents=[chunk],
        metadatas=[{"source": "vbs_readme.org",
                    "chunk_id": i,
                    }]
    )

results = rag("How does the shifting leaning window work?")
print(results)