## My Data & Reason For This Chatbot

My portfolio website contains a decent amount of **text data**. It is my digital real estate where I can host my work afterall.

**Components:**
- Personal data: Social links, experience, tech skills
- Blogs: 10 as of today, 2000 words on average
- Project descriptions: 4 as of today

![chatbot illustration](/images/blog/chatbot/chatbot.png)

I believed if I have so much amount of text data, it could be a great **feature to allow visitors to query it** and help in the process know more about my work.

**For example:**
- **User**: "Tell me about his recent projects"
- **_Assistant_**: "Piyush has 4 recent projects: CapybaraDB, Knowflow, Post Automation Agent, TrackML"
- **User**: "Elaborate on CapybaraDB"
- **_Assistant_**: "CapybaraDB is a minimalistic vector database implementation designed for educational purposes and prototyping. The major concepts include:
Simplicity
Provides a minimal API (add_document, search, get_document) for easy integration
No complex configuration required to get started
Flexibility
Supports multiple file formats and precision levels (float32, float16, binary)
Offers in-memory or on-disk storage options
Includes GPU acceleration capabilities
Minimal Dependencies
Relies only on essential libraries for core functionality
Lightweight footprint ideal for learning and experimentation
"
- **User**: "What are all the precision types supported?"
- **_Assistant_**: "CapybaraDB supports: float32, float16, binary precision"

Above is a very simple example, at the end of the day **I want users to be able to query any and every minute detail mentioned on the website.**

## Tech Stack

**Here's the tech stack I decided to go with:**

1. **Supabase (PostgresSQL)**: I chose this for my vector database, it is simple to setup, generous free tier and other features like RPC functions
2. **Vercel's AI SDK**: my website already uses typescript (Next.js), integrating AI SDK was a very natural process with great developer experience
3. **Groq & Qwen-32B**: I needed blazing fast inference speeds, hence decided to go with Groq. Plus a LLM which has enough parameters to provide mature reponses and maintain a decent context window but not take ages to respond like Gemini-3-Pro

## Workflow

![chatbot workflow](/images/blog/chatbot/workflow.png)

I decided to keep the indexing phase and query phases completely seperate. There should be absolutely no writes to the DB at runtime in prod. 

**Phases:**
1. **Indexing Phase**: A Python script processes Markdown files from blog and project directories, chunks the content, and stores it in Supabase for vector/full-text search.
2. **Query Phase**: A Next.js API route uses Groq (with Qwen model) to handle user messages, augmented by a dynamic system prompt and a `searchKnowledgeBase` tool that queries Supabase for relevant content chunks.

One major issue I thought a lot about was the context management for easy queries vs difficul queries where database querying is actually needed.

**The Issue & The Fix:**
- For simple queries like "What his name?", "What are his tech skills?", "What is the name of his recent employer?", **searching the database everytime was not work it.**
- The fix was **putting simple information + meta data inside the system prompt** itself as the global context and performing database **queries on complex queries only.**

To implement the routing for the complex queries, I decided to **use tooling through Vercel's AI SDK**. I provided a `searchKnowledgeBase` tool so the LLM can generate special keywords from user's input and query the database. Then after the context from the db is fetched, the LLM can provide informed responses.

## Database Design

I used **Supabase** to store my search vectors. Below is the **schema design.**

```sql
CREATE TABLE content_chunks (
  id BIGSERIAL PRIMARY KEY,
  content_type TEXT NOT NULL,
  title TEXT NOT NULL,
  file_path TEXT NOT NULL,
  chunk_text TEXT NOT NULL,
  chunk_index INTEGER NOT NULL,
  url TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  search_vector TSVECTOR GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(chunk_text, '')), 'B')
  ) STORED,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX content_chunks_search_idx ON content_chunks
USING gin(search_vector);

CREATE INDEX content_chunks_type_idx ON content_chunks(content_type);
CREATE INDEX content_chunks_metadata_idx ON content_chunks USING gin(metadata);

ALTER TABLE content_chunks ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Public read access" ON content_chunks
  FOR SELECT USING (true);

```

It stores chunks as search vectors which can **queried by postgres built-in full text search.**

### How PostgreSQL's full text search works

It is an **FTS engine**, an alternative to Elastic search. It aims to understand the nuances of linguistics to give relevance based ranking.

**Components:**
- **Document**: a unit of content you will be searching through
- **Lexemes**: normalised word forms will represent the core meaning of the word. Includes, lowercasing, removing punctuation, removing stop words
- **Stemming**: reducing words to their root form (running -> ran)

**Data types:**
1. **tsvector (Text search vector)**: sorted list of lexemes extracted from a document and also stores positional information 
2. **tsquery (Text search query)**: represents search terms and logical operators (&, |, !)

Relevance is determined using the ts_rank() function, which calculates a score based on how often the search terms appear and how close they are to each other. You can further refine this by assigning weights to different parts of a document using setweight(). Weight 'A' is typically given to the most important fields, like a title. Weights 'B', 'C', and 'D' are used for less critical data like subtitles or body content. Matches in 'A' weighted fields will result in a higher relevance rank than matches in 'C' weighted fields

**Some limitations:**
- Lack of Inverse Document Frequency
- Expensive scoring
- Elasticsearch is better highly complex aggregations, faceted search and geospatial or image based search

## Conclusion

**Try the [Chatbot](https://piyushchoudhari.me/chat)**

Making a chatbot which has the ability to pull context from my website was a pretty fun implementation as **I see myself using this more often as I forget what I wrote or implemented 6 months down line, lol**. Will try to perform more robust and quantitative tests on the chatbot's responses in the future.