In the previous post, I learned how LLMs work from an LLM itself (Copilot, to be specific). In this post, I will learn about RAG, again from Copilot.
Methodology:
- Ask Copilot questions, starting from “What is RAG?”, and read the answers
- Write down my understanding, asking Copilot if I’m correct
- Revise my writing until my understanding is solid
Let’s start today’s learning.
What is RAG?
RAG (Retrieval-Augmented Generation) is a workflow that retrieves relevant external information and injects it into user questions for an LLM to consume. RAG compensates for these LLM weak points:
- no knowledge of internal/confidential documents (eg, Confluence pages)
- relatively small context window
RAG uses a vector database that stores embedded documents the LLM doesn’t have access to, and supplies them as additional context. The vector database measures the similarity between the user’s question and those documents using similarity metrics (eg, cosine similarity, dot product).
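As a toy illustration of this similarity search, here is a minimal sketch (not part of the PoC; real embeddings have hundreds of dimensions, not three) that ranks document vectors by cosine similarity to a query vector:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); 1.0 means the vectors point the same way
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.9, 0.1, 0.0]
doc_vecs = {"doc_a": [0.8, 0.2, 0.1], "doc_b": [0.0, 0.1, 0.9]}

# Rank documents by similarity to the query (highest first)
ranked = sorted(doc_vecs,
                key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                reverse=True)
print(ranked)  # doc_a points roughly the same way as the query, so it ranks first
```

A vector database does essentially this, but over millions of vectors with approximate-nearest-neighbor indexes instead of a linear scan.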
User questions are processed in a RAG pipeline, and appropriate context is added to them for the LLM to consume. RAG follows these steps:
- embed query
- retrieve related chunks
- add them to prompt
- LLM generates answers
Chunking
Before using RAG, we need to create a vector database and ingest embedded documents into it. Since a single embedding cannot meaningfully represent a 100+ page document, we need to split documents into many chunks.
In the database, each chunk represents a meaningful block of a document. When the pipeline receives a question, the question is embedded and the database looks for the chunks most relevant to it.
How to split a document into chunks is an important design decision for a pipeline: for example, using a fixed number of tokens (eg, 300-500) or splitting at document headers.
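As an example of the header-based approach, here is a hypothetical sketch that chunks an org-mode file by splitting at heading lines starting with “*” (the PoC below uses the fixed-size approach instead):

```python
def chunk_by_headers(text):
    # Split an org-mode document into chunks, starting a new chunk
    # at every heading line (lines beginning with "*")
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("*") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "* Intro\nSome text.\n* Usage\nMore text."
print(chunk_by_headers(doc))  # two chunks, one per heading
```

Header-based chunks tend to align with the document’s logical sections, at the cost of uneven chunk sizes.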
The PoC pipeline
I’m going to build the following pipeline for this PoC.
┌──────────────────────────┐
│ User Query │
└─────────────┬────────────┘
│
▼
┌────────────────────────┐
│ 1. Embed the Query │
│ (MiniLM via Ollama) │
└─────────────┬──────────┘
│
▼
┌────────────────────────┐
│ 2. Retrieve Top-k │
│ Chunks from Chroma │
└─────────────┬──────────┘
│
▼
┌────────────────────────┐
│ 3. Build Prompt with │
│ Retrieved Context │
└─────────────┬──────────┘
│
▼
┌────────────────────────┐
│ 4. Generate Answer via │
│ LLM (llama3.2:3b) │
└─────────────┬──────────┘
│
▼
┌──────────────────────────┐
│ Final Answer │
└──────────────────────────┘
High-level system architecture
In this PoC, I will create a single Python script that controls the entire pipeline.
Ollama will run as a server, providing both the embedding model and the LLM; on the backend, a vector DB (Chroma DB) will be running. The DB will store the embedded document chunks and compute the similarity between queries and chunks.
┌──────────────────────────────┐
│ Python App │
│ (Ingestion + RAG Pipeline) │
└───────────────┬──────────────┘
│
▼
┌────────────────────────────────────────────────────┐
│ Local Services │
│ │
│ ┌────────────────────┐ ┌──────────────────┐ │
│ │ Ollama Server │ │ ChromaDB │ │
│ │ - Embedding Model │◀───▶│ - Vector Store │ │
│ │ - LLM (1B/3B) │ │ - Persistent DB │ │
│ └────────────────────┘ └──────────────────┘ │
└────────────────────────────────────────────────────┘
The document will flow like this:
Document (README.org)
│
▼
Chunking (80 words, 20 overlap)
│
▼
Embedding (MiniLM via Ollama)
│
▼
ChromaDB (Persistent Collection)
Setup demo environment
For this PoC demo, I’ve chosen ChromaDB and Ollama because they are lightweight and free. My homelab server isn’t powerful (an N150-based mini PC without a GPU).
Set up Python 3.10+. I already have it.
$ python --version
Python 3.13.5
Create a project directory and activate a venv.
mkdir ragdemo; cd ragdemo
python -m venv ve
source ve/bin/activate
Install ChromaDB
pip install chromadb
Install ollama and pull an embedding model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull nomic-embed-text
ollama list # to verify
Create a test script (t.py).
import chromadb
import requests

def embed(text):
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text}
    )
    return r.json()["embedding"]

client = chromadb.Client()
collection = client.create_collection("demo")

chunk = "This is a test chunk."
collection.add(
    ids=["chunk_1"],
    embeddings=[embed(chunk)],
    documents=[chunk]
)

query = "What is this about?"
results = collection.query(
    query_embeddings=[embed(query)],
    n_results=3
)
print(results)
This creates a Chroma DB with a collection “demo” in memory, embeds a single chunk “This is a test chunk.” into it, and searches for “What is this about?”. It’s interesting to see that this is not a simple keyword search.
Run it.
(ve) achiwa@act:~/py/ragdemo$ python t.py
{'ids': [['chunk_1']], 'embeddings': None, 'documents': [['This is a test chunk.']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[None]], 'distances': [[481.20458984375]]}
Looks like it’s working. As the sole document was “This is a test chunk.” and the query was “What is this about?”, the distance between these strings was rather large: 481.20…, but this shouldn’t be an issue.
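As far as I can tell from its documentation, Chroma’s default distance metric is (squared) L2, which grows with embedding magnitude, so the absolute number matters less than the relative ranking. A small self-contained sketch of why raw L2 distance can look huge even when two vectors point in the same direction:

```python
import math

def l2_sq(a, b):
    # Squared Euclidean distance: sensitive to vector magnitude
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity: depends only on direction, not magnitude
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

a = [3.0, 4.0]    # same direction...
b = [30.0, 40.0]  # ...but 10x the magnitude
print(l2_sq(a, b))   # 2025.0: large raw distance
print(cosine(a, b))  # 1.0: identical direction
```

So a large distance such as 481 is not alarming by itself; what matters is which chunks rank closest to the query.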
Using a real document
Next, let’s use a real document as the test document. I chose the README.org file from one of my GitHub repos: https://raw.githubusercontent.com/achiwa912/vbs/refs/heads/main/README.org
Download the file to the project directory.
wget https://raw.githubusercontent.com/achiwa912/vbs/refs/heads/main/README.org
mv README.org vbs_readme.org
Try vbs_readme.org.
import chromadb
import requests

def embed(text):
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60
    )
    data = r.json()
    if "embedding" not in data:
        print("Ollama returned an error:", data)
        return None
    return data["embedding"]

def chunk_text(text, size=300, overlap=50):
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = words[i:i+size]
        chunks.append(" ".join(chunk))
        i += size - overlap
    return chunks

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("demo")

with open("vbs_readme.org") as f:
    text = f.read()

chunks = chunk_text(text, size=120, overlap=30)
for i, chunk in enumerate(chunks):
    vec = embed(chunk)
    if vec is None:
        print(f"Skipping chunk {i} due to embedding error")
        continue
    collection.add(
        ids=[f"vbs_{i}"],
        embeddings=[vec],
        documents=[chunk],
        metadatas=[{"source": "vbs_readme.org"}]
    )

results = collection.query(
    query_embeddings=[embed("How does the shifting leaning window work?")],
    n_results=3
)
print(results)
It took about 10 minutes to process the short-to-medium-sized vbs_readme.org file. Embedding this document with Ollama alone was a bit too much for the Intel N150 CPU without a GPU.
Let’s use a more lightweight embedding model.
$ ollama pull all-minilm
$ ollama list
NAME ID SIZE MODIFIED
all-minilm:latest 1b226e2802db 45 MB 10 hours ago
nomic-embed-text:latest 0a109f422b47 274 MB 10 hours ago
Change the chunk size to 80 words (chunk_text splits on whitespace, so the size is in words, not tokens).
chunks = chunk_text(text, size=80, overlap=20)
Delete the existing database so the document is re-ingested from scratch.
$ rm -rf chroma/
Run the script again.
$ date; python t.py; date
Mon Mar 16 06:28:30 AM EDT 2026
{'ids': [['vbs_5', 'vbs_4', 'vbs_25']], 'embeddings': None, 'documents': [['window can have 10 words, and initially has the 1st 10 words in a word book. (More precicely, words are randomly picked) As you repeat 10 words in the learning window, you memorize a word or two. Then, the learning window replaces the memorized word with a new word. This way, the learning window has 10 words that you are actively working on. It shifts through a word book and eventually reaches the end of the book. Then, the next', 'most likely to forget almost everything when you encountr the same word for the 2nd, 3rd or 4th time. vocaBull addresses this issue by introducing a combination of high-frequency repetitions of small number of words, and low-frequency repetitions of medium number of words - the Shifting Learning Window method. [[./images/vbs_SLW.jpg]] The high-frequenry repetitions are realized with the "learning window". Learning window can have 10 words, and initially has the 1st 10 words in a word book. (More precicely, words are', 'Vocabull Server is under [[https://en.wikipedia.org/wiki/MIT_License][MIT license]]. * Contact Kyosuke Achiwa - achiwa912+gmail.com (please replace + with @) Project link: [[https://github.com/achiwa912/vbs]] Blog article: https://achiwa912.github.io/vbs_eng.html * Acknowledgments - Vocabull uses user management and other parts from the fabulous =Flask Web Development= (by Miguel Grinberg) [[https://www.oreilly.com/library/view/flask-web-development/9781491991725/][book]] and [[https://github.com/miguelgrinberg/flasky][companion github repository]] - Vocabull uses a bootstrap 4 theme =litera= from [[bootswatch CDN]]']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'source': 'vbs_readme.org'}, {'source': 'vbs_readme.org'}, {'source': 'vbs_readme.org'}]], 'distances': [[1.2157325744628906, 1.7700687646865845, 1.789774775505066]]}
Mon Mar 16 06:29:01 AM EDT 2026
Looks good. For the question “How does the shifting leaning window work?”, the script retrieved relevant context. It took about 30 seconds to run without errors. Not too bad!
Asking LLM with context
The context is ready. Now, let’s ask the LLM the question. First, pull an LLM.
$ ollama pull llama3.2:1b
<snip>
$ ollama list
NAME ID SIZE MODIFIED
llama3.2:1b baf6a787fdff 1.3 GB 23 minutes ago
all-minilm:latest 1b226e2802db 45 MB 11 hours ago
nomic-embed-text:latest 0a109f422b47 274 MB 12 hours ago
llama3.2:1b is a small, lightweight LLM that can run on CPUs like the N150. Of course, it is nowhere near as powerful as ChatGPT or Gemini, but it should be sufficient for my PoC.
Add a few functions and use them in the script.
def ask_llm(question, context):
    prompt = f"Answer the question using ONLY the context.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:1b", "prompt": prompt, "stream": False}
    )
    # print("RAW LLM OUTPUT:", r.json())  # for debug
    return r.json()["response"]

def rag(question):
    qvec = embed(question)
    results = collection.query(query_embeddings=[qvec], n_results=3)
    if not results["documents"] or not results["documents"][0]:
        return "No relevant context found."
    context = "\n\n".join(results["documents"][0])
    return ask_llm(question, context)

results = rag("How does the shifting leaning window work?")
And run it.
The Shifting Learning Window method works by introducing two types of word repetitions: high-frequency repetitions and low-frequency repetitions.
High-frequency repetitions involve repeating the same small number of words (typically 1-5) multiple times, such as “hello”, “world”, or “python”. These are called “learning windows”.
Low-frequency repetitions involve repeating a medium number of words (typically 10-20) less frequently, such as “machine learning”, “artificial intelligence”, or “computer vision”.
The learning window shifts through the word book by replacing the memorized word with a new word at random intervals. This ensures that the learner is constantly faced with novel information and increases the likelihood of forgetting previously learned words.
This is awesome! A tiny LLM is actually running on a mini PC! But I see some hallucinations. The model is probably too small for this task.
- High-frequency repetition should be about 10 words
- Where did “hello”, “world” and “python” come from?, etc.
Let’s try a larger model:
ollama pull llama3.2:3b
It is about 2 GB. A bit heavy on storage, but that’s all right. Change this line to use the new model.
json={"model": "llama3.2:3b", "prompt": prompt, "stream": False}
This time, it took 4 minutes to complete.
The Shifting Learning Window method works by introducing a combination of high-frequency repetitions of small number of words and low-frequency repetitions of medium number of words. It uses a “learning window” that can have 10 words, initially filled with the first 10 words from a word book. As you repeat 10 words in the learning window, you memorize one or two words, and then the learning window replaces the memorized words with new ones. This process continues until the learning window reaches the end of the book, at which point it repeats the cycle.
Wow, what an improvement! Now I see no hallucinations, and the model summarized the shifting learning window very well.
Lessons learned
From this simple PoC, I’ve learned:
- a PoC-level RAG pipeline is not difficult to implement. It took a day and a half for both the PoC and writing this blog article.
- an Intel N150-based, low-power PC can host a lightweight LLM
- chunk size is very important in a low-power environment. I needed to set it as low as 80 to avoid errors
- choosing the smallest LLM is not always a good idea. I needed to upgrade from a 1.3 GB model to a 2.0 GB model to get useful, hallucination-free results
Future improvements
This PoC was done in an environment with very limited resources. A more powerful PC with a GPU should be able to handle larger chunks and host a smarter local LLM. Integrating with OpenAI or other APIs (paid!) would enable much more powerful LLMs and thus produce better results.
For the python script, there are possible enhancements:
- Make it a web application using Flask, possibly with Bootstrap
- Containerize
- Add web and PDF crawlers to add more context information
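As a sketch of the Flask idea, here is a minimal, hypothetical JSON endpoint. The rag() function here is a stub; in a real version it would be replaced by the rag() from ragdemo.py, and the route name and payload shape are my own assumptions:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stub standing in for rag() from ragdemo.py
def rag(question):
    return f"(answer for: {question})"

@app.route("/ask", methods=["POST"])
def ask():
    # Expect a JSON body like {"question": "..."}
    payload = request.get_json(silent=True) or {}
    question = payload.get("question", "")
    if not question:
        return jsonify({"error": "question is required"}), 400
    return jsonify({"answer": rag(question)})

if __name__ == "__main__":
    app.run(port=5000)
```

With this in place, a Bootstrap front end would only need to POST the question and render the returned answer.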
Conclusion
I conducted a PoC of an end-to-end RAG workflow:
- Prepared context information
- Embedded it and added it to a vector DB
- Gave a question related to the context
- Embedded the question and searched for similar context information
- Generated a prompt with the additional context information for the LLM to consume
- Gave it to the LLM, and the LLM answered using the context information
The LLM’s answer was a good summary of the context information without hallucinations, suggesting that this setup can be practical and useful in daily professional life.
Appendix
Learning with Copilot
This is not directly related to the PoC itself, but here is what I learned about Copilot.
- Code snippets Copilot shows sometimes do not work as-is and require troubleshooting, which is a great opportunity to acquire deeper knowledge on the subject
- Just following troubleshooting steps from Copilot is not effective. They might do more harm. Asking why it suggests the steps is important
- Copilot sometimes forgets about my environment
- Copilot can be seen and treated as a junior teacher, but one with exceptionally broad knowledge. It’s a very unbalanced combination, so use it wisely.
- Copilot is always encouraging. It’s a great learning partner.
System diagram
┌──────────────────────────┐
│ Source Document │
│ (vbs_readme.org) │
└─────────────┬────────────┘
│
▼
┌──────────────────────────┐
│ Chunking (80/20) │
└─────────────┬────────────┘
│
▼
┌──────────────────────────┐
│ Embedding Model (MiniLM)│
│ via Ollama │
└─────────────┬────────────┘
│
▼
┌──────────────────────────┐
│ Chroma Vector Store │
│ (Persistent Collection) │
└─────────────┬────────────┘
│
▼
┌────────────────────────────────────┐
│ RAG Query Flow │
│ (embed -> retrieve -> generate) │
└─────────────┬──────────────────────┘
│
▼
┌──────────────────────────┐
│ Query Embedding (MiniLM)│
└─────────────┬────────────┘
│
▼
┌──────────────────────────┐
│ Retrieve Top-k Chunks │
│ from Chroma │
└─────────────┬────────────┘
│
▼
┌──────────────────────────┐
│ LLM (llama3.2:3b) via │
│ Ollama │
└─────────────┬────────────┘
│
▼
┌──────────────────────────┐
│ Final Answer │
└──────────────────────────┘
Project environment
I’ve created a GitHub repository for this PoC. Please take a look if you are interested.
Directory structure:
$ tree -L 2
.
├── chroma
│ ├── b8810b00-c991-49e9-8148-70be3f602df8
│ └── chroma.sqlite3
├── t.py
├── vbs_readme.org
└── ve
├── bin
├── include
├── lib
├── lib64 -> lib
├── pyvenv.cfg
└── share
Installed ollama models:
$ ollama list
NAME ID SIZE MODIFIED
llama3.2:3b a80c4f17acd5 2.0 GB 35 minutes ago
llama3.2:1b baf6a787fdff 1.3 GB About an hour ago
all-minilm:latest 1b226e2802db 45 MB 12 hours ago
The complete source code of the script (ragdemo.py):
import chromadb
import requests

def embed(text):
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "all-minilm", "prompt": text},
        timeout=60
    )
    data = r.json()
    if "embedding" not in data:
        print("Ollama returned an error:", data)
        return None
    return data["embedding"]

def chunk_text(text, size=300, overlap=50):
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = words[i:i+size]
        chunks.append(" ".join(chunk))
        i += size - overlap
    return chunks

def ask_llm(question, context):
    prompt = f"Answer the question using ONLY the context.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:3b", "prompt": prompt, "stream": False}
    )
    # print("RAW LLM OUTPUT:", r.json())
    return r.json()["response"]

def rag(question):
    qvec = embed(question)
    results = collection.query(query_embeddings=[qvec], n_results=3)
    if not results["documents"] or not results["documents"][0]:
        return "No relevant context found."
    context = "\n\n".join(results["documents"][0])
    return ask_llm(question, context)

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("demo")

with open("vbs_readme.org") as f:
    text = f.read()

chunks = chunk_text(text, size=80, overlap=20)
for i, chunk in enumerate(chunks):
    vec = embed(chunk)
    if vec is None:
        print(f"Skipping chunk {i} due to embedding error")
        continue
    collection.add(
        ids=[f"vbs_{i}"],
        embeddings=[vec],
        documents=[chunk],
        metadatas=[{"source": "vbs_readme.org",
                    "chunk_id": i,
                    }]
    )

results = rag("How does the shifting leaning window work?")
print(results)