My Codebases Have an AI Receptionist Now
1. Introduction

Every developer has been in this situation: a colleague asks “how does auth work in that service?” and you spend five minutes digging through your own codebase to remember. Now multiply that by every recruiter, collaborator, or future-you who wants to understand a repo they haven’t touched in months.
I built askmycode to solve this. It’s a Streamlit chat app where you can ask natural-language questions about any of your GitHub repositories and get back a precise, source-cited answer , file paths, line numbers, actual code snippets, no hand-waving.
How is rate limiting implemented in @capynodes-backend?
Rate limiting is enforced in capynodes-backend via the check_rate_limit
function in middleware/rate_limit.py:L14-38. It uses a sliding-window
counter stored in Redis...
What makes this interesting is how it works under the hood. There’s no search index, no vector database, no pre-processing step. The agent reads your code the same way you would on a fresh machine: it scans directory trees, greps for symbols, and opens files until it has enough to answer. This post walks through every design decision.
2. Indexing vs. Agentic Retrieval
The conventional approach to “chat with your codebase” is Retrieval-Augmented Generation (RAG):
- Chunk every file into small pieces.
- Embed each chunk with a model like
text-embedding-ada-002. - Store embeddings in a vector database (Pinecone, Chroma, Weaviate, …).
- At query time, embed the question, retrieve the nearest chunks, and pass them to an LLM.
RAG is fast and scales to massive corpora. But for code, it has real problems:
| Problem | Why it matters for code |
|---|---|
| Index staleness | Every git push potentially invalidates cached embeddings. You need a sync pipeline or your answers drift from reality. |
| Chunking is lossy | A function split across a chunk boundary loses context. A class split from its imports becomes ambiguous. |
| Retrieval errors are silent | If the relevant chunk ranked 6th instead of 1st, the LLM never sees it and may hallucinate a plausible-but-wrong answer. |
| Fixed granularity | Embeddings can’t decide “I need the whole file” vs. “I only need lines 20-35”. |
Agentic retrieval takes a different approach. Instead of pre-building an index, the agent plans and acts at query time:
User query
↓
Agent decides: "I should search for the symbol first"
↓
search_code("check_rate_limit") → file: middleware/rate_limit.py, line 14
↓
Agent decides: "I need to read that file"
↓
read_file_tool(repo, "middleware/rate_limit.py") → actual source code
↓
Agent decides: "I have enough to answer"
↓
Synthesize grounded answer
The key insight from the AGENTIC_LOOP.md design doc:
grepnever hallucinates a function signature.
Every tool call returns ground-truth data from the actual file system. The agent can’t invent a result , it either finds the code or it doesn’t. This makes answers verifiable and the failure mode honest (“I couldn’t find evidence of X”) rather than subtly wrong.
The trade-off is latency. A RAG system retrieves in milliseconds; an agentic loop may take 3-10 tool calls. For code Q&A, where correctness matters more than speed, this trade-off is almost always worth making.
3. Tech Stack
| Layer | Technology | Why |
|---|---|---|
| LLM | Gemma 4 via OpenRouter | Fast inference, strong tool-calling, excellent code reasoning |
| Agentic framework | LangGraph | First-class state machine primitives, conditional edges, streaming |
| UI | Streamlit | Zero-boilerplate chat UI with real-time status updates |
| Code search | grep -rn -E (subprocess) | Fast, dependency-free, exact-match on live files |
| Repo management | git clone (subprocess) | Snapshots repos locally; no remote API calls at query time |
| Config | config.json | Simple {name: owner/repo} whitelist |
| Logging | Python logging (rotating file) | Structured key=value pairs, 5 MB rotating log |
| Evals | pytest + LLM-as-judge | Tool sequencing and grounding checks |
The stack is deliberately minimal. No ORM, no message queue, no container orchestration , just Python, an OpenRouter API key, and git. The repo runs locally with uv sync && uv run streamlit run src/app.py.
Repos are configured in config.json and auto-cloned on first startup:
{
"repos": {
"capybaradb": "capybara-brain346/capybaradb",
"knowflow": "capybara-brain346/knowflow"
}
}
Any directory already present under repos/ is also picked up automatically, so local-only repos work too.
4. LangGraph Flow
The agent is modelled as a state machine with three nodes and two conditional edges. Here’s the actual graph definition from src/graph.py:
from langgraph.graph import END, StateGraph
graph = StateGraph(AgentState)
graph.add_node("plan", plan_node)
graph.add_node("tools_node", tools_node)
graph.add_node("observe", observe_node)
graph.set_entry_point("plan")
# plan → tools_node (if LLM emitted a tool call)
# plan → END (if LLM answered directly)
graph.add_conditional_edges("plan", should_continue,
{"tools_node": "tools_node", END: END})
graph.add_edge("tools_node", "observe")
# observe → plan (if more evidence needed)
# observe → END (if ready to synthesize, or MAX_HOPS reached)
graph.add_conditional_edges("observe", route_from_observe,
{"plan": "plan", END: END})
Visually:
┌─────────────────────────────────────────┐
│ │
query ──►│ plan_node ──► tools_node ──► observe_node │──► END (synthesize)
│ ▲ │ │
│ └────────── loop ───────────┘ │
└─────────────────────────────────────────┘
The State Object
All data flows through a single AgentState TypedDict, appended to (never overwritten) by each node:
class AgentState(TypedDict):
messages: Annotated[list[dict], operator.add] # full conversation + tool messages
tool_results: Annotated[list[ToolResult], operator.add] # accumulated evidence
files_read: list[list[str]] # (repo, path) pairs , deduplication guard
query: str # original question, never mutated
hop_count: int # iteration counter, checked against MAX_HOPS=10
answer: str | None # READY sentinel set by observe_node
tagged_repos: list[str] # @-mentioned repos for scoping
Node Responsibilities
plan_node , Calls the LLM with the full system prompt and accumulated message history. The LLM decides which tool to call next (or answers directly if it already has enough). The system prompt encodes the exploration strategy:
- Go targeted first:
search_codebeforeget_file_treebeforeread_file_tool. - Use regex alternation:
jwt|token|OAuth|authenticateto cover synonyms in one call. - Don’t re-read: the
files_readlist is injected into context so the agent skips already-read files.
tools_node , Executes the tool calls emitted by plan_node. Each tool is dispatched from the TOOL_FUNCTIONS registry. Results are appended to tool_results and returned as role: tool messages for the LLM’s next turn. Errors (whitelist violations, file-not-found, grep timeouts) are caught and returned as error strings , the agent sees them and can adapt.
observe_node , This is the decision node. It asks the LLM a focused question in JSON mode: “Do you have enough to answer, or do you need more?” The response is a simple {"decision": "loop" | "synthesize", "reasoning": "..."}. If hop_count >= MAX_HOPS, it forces synthesize regardless of the LLM’s preference.
The routing function:
def route_from_observe(state: AgentState) -> Literal["plan", "__end__"]:
if state.get("answer") is not None or state["hop_count"] >= MAX_HOPS:
return END
return "plan"
Synthesis , After the loop ends, build_synthesize_messages constructs a final prompt with all accumulated tool_results and the original query. The LLM writes a grounded answer, citing specific files and line numbers. This step happens outside the graph in app.py using a streaming call so the user sees tokens arrive in real time.
5. Tool Details
All tools are plain Python functions that access cloned repos on the local filesystem. The tool layer enforces whitelist boundaries independently of the LLM , no amount of prompt manipulation can make a tool read outside the config-defined repos.
Security: Double Validation
Every tool validates both the repo name and the resolved path:
def _validate_repo(repo: str) -> Path:
whitelist = get_whitelist()
if repo not in whitelist:
raise WhitelistViolation(f"Repo '{repo}' is not in the whitelist.")
return whitelist[repo].resolve()
def _validate_path(repo_root: Path, path: str) -> Path:
candidate = (repo_root / path.lstrip("/")).resolve()
if not candidate.is_relative_to(repo_root):
raise WhitelistViolation("Path resolves outside the repo root.")
return candidate
The is_relative_to check defeats path traversal attacks (../../etc/passwd). The whitelist check means even a jailbroken LLM prompt can’t read an unconfigured repo.
Tool Reference
| Tool | Signature | Purpose |
|---|---|---|
list_repos | () → list[str] | Returns all whitelisted repo names. Called first when the agent needs to establish scope. |
get_file_tree | (repo, path="/") → dict | Lists immediate files and dirs at a path. Returns name + size for files. |
read_file_tool | (repo, path) → str | Returns raw file content. Truncates at 200 KB with a [truncated at 200 KB] notice. |
search_code | (query, repos=[]) → list[dict] | Runs grep -rn -E across one or more repos. Returns {repo, file, line_number, snippet} objects, capped at 50 results. |
get_repo_metadata | (repo) → dict | Returns file count, last-modified timestamp, top file extensions, and the first 500 chars of the README. Fast and cheap , no full file reads. |
How the Agent Uses Them
A typical query like “How does auth work in capynodes-backend?” follows this pattern:
search_codewithjwt|token|OAuth|authenticate|login→ findsmiddleware/auth_jwt.py:L14read_file_toolon that file → reads the actual implementation- observe decides: “I have the definition. Synthesize.”
A high-level overview query like “What is capybaradb?” follows a different pattern:
get_repo_metadata→ gets file count, README excerpt, primary language- observe decides: “README answers the question. Synthesize.”
The search_code tool is the workhorse. Because it takes a full extended regex, the agent can cover many synonyms in a single call, keeping the hop count low.
6. Evals and Safeguards
E2E Eval Results (20-query golden set)
| Metric | Result | Target | Status |
|---|---|---|---|
| Keyword accuracy | 85.0% | ≥ 80% | ✓ Pass |
| Mean hop count | 1.3 | ≤ 5 | ✓ Pass |
| Mean grounding score | 0.887 | ≥ 0.95 | ✗ Fail |
| Whitelist violations | 0 | = 0 | ✓ Pass |
| MAX_HOPS hit rate | 0.0% | ≤ 10% | ✓ Pass |
Four of five metrics pass. The agent is efficient (average 1.3 hops per query , well inside budget), never hit the loop cap, and never attempted to read outside a whitelisted repo. Keyword accuracy at 85% clears the 80% floor, meaning the agent surfaces the right symbols and file references in the large majority of cases.
The one miss is grounding score at 0.887, which falls short of the 0.95 target. A grounding score below threshold means the LLM-as-judge found claims in answers that weren’t directly traceable back to tool results , the agent occasionally extrapolates slightly beyond what the retrieved evidence strictly supports. The likely causes are synthesis over truncated files (the 200 KB cap can cut off the tail of a large module) and the agent generalising from a single code path to a broader architectural claim.
Enforcing Tool Calls at the Protocol Level
A key design concern was preventing the model from answering out of prior weights instead of retrieved evidence. The system addresses this with two independent layers:
1. PLAN_SYSTEM prompt (nodes.py) , a ## Mandatory rule block at the top of the system prompt makes the expectation explicit in natural language:
You have NO prior knowledge of these repositories. You MUST call at least one tool on every turn before synthesizing an answer.
2. plan_node API constraint (nodes.py) , plan_node calls call_llm(messages, tools=TOOL_SCHEMAS, tool_choice="required"). The OpenRouter API enforces a tool call on every plan hop at the protocol level, making it impossible for the model to short-circuit the retrieval loop regardless of its confidence. The call_llm helper in utils.py exposes tool_choice as a parameter (defaulting to "auto") so all other callers are unaffected.
The two-layer approach means the agent is reminded of the rule in natural language and mechanically prevented from breaking it.
Shipping an LLM-based system without evals is like shipping code without tests: you might be fine, but you won’t know when you break something. askmycode has two eval suites under tests/evals/.
T1: Tool Sequencing Evals
These tests assert that the agent calls tools in a sensible order , they don’t care about the content of answers, only about the structural properties of the tool call sequence.
Tools are patched with instant stubs that return canned data, so only the LLM’s planning decisions are under test. This makes the tests fast and deterministic on the infrastructure side.
| Test | Query | Assertion |
|---|---|---|
| T1-01 | ”What repos are available?” | list_repos is called, ≤ 2 hops |
| T1-02 | ”What does the auth module do?” | search_code appears before read_file_tool |
| T1-03 | ”What is capybaradb about?” | First call is a cheap orientation tool (not a blind file read) |
| T1-04 | ”Where is rate limiting implemented?” | search_code is called |
| T1-05 | ”Compare vector storage across repos” | list_repos is the very first call |
T1-02 is particularly important. A naive agent might blindly call read_file_tool on a guessed path. The correct strategy is to search first, then read what the search points to. The test enforces this:
if has_read:
first_search = next(i for i, t in enumerate(seq) if t == "search_code")
first_read = next(i for i, t in enumerate(seq) if t == "read_file_tool")
assert first_search < first_read
G2: Grounding Evals
These tests use an LLM-as-judge pattern. Given a (tool_results, answer) pair, a judge LLM scores the answer on grounding: is every claim in the answer traceable back to something in tool_results?
| Test | Scenario | Expectation |
|---|---|---|
| G2-01 | Answer correctly cites real file content | Grounding score ≥ 0.95 |
| G2-02 | Empty tool results, answer invents a function | At least one violation flagged |
| G2-03 | Answer references content beyond a truncation point | Violation flagged |
| G2-04 | Answer attributes code to the wrong repo | Violation flagged |
| G2-05 | Answer cites :L999 when the actual line is L5 | Violation flagged |
G2-02 through G2-05 are negative tests , they construct adversarial inputs and assert the judge catches the problem. This gives confidence that the grounding evaluator itself is calibrated, not just rubber-stamping everything as correct.
E2E: Golden Set
The E2E suite runs the full agent against 20 golden queries across five categories (function lookup, dependency tracing, repo overview, implementation location, feature presence). Each golden case specifies expected keywords that must appear in the answer:
{
"id": "E-08",
"category": "dependency_tracing",
"query": "Where is JWT authentication implemented in knowflow?",
"repo_scope": "knowflow",
"expected_keywords": ["jwt", "auth_service"]
}
Results are logged to logs/eval_results.csv so you can track pass rate across runs.
Hard Constraints
Beyond evals, several hard limits are enforced in code:
| Constraint | Value | Effect |
|---|---|---|
MAX_HOPS | 10 | Loop hard-stops; agent synthesizes with whatever it has |
MAX_FILE_SIZE | 200 KB | Large files truncated with a notice |
CONTEXT_BUDGET_CHARS | 320,000 (~80K tokens) | Oldest tool results dropped from context if exceeded |
TIMEOUT_SECONDS | 60 | Per-query wall-clock limit |
GREP_MAX_RESULTS | 50 | Search results capped to prevent context flooding |
| Whitelist enforcement | Strict | Path traversal blocked at the filesystem level |
7. Conclusion
The entire thing runs locally with one API key and no external databases. If you want to point it at your own repos, edit config.json and run:
uv sync
uv run streamlit run src/app.py
Askmycode is live on: askmycode
The source code is on GitHub: capybara-brain346/askmycode