Building AI Coding Agents That Actually Understand Your Codebase
You're building an AI agent to automate some tedious coding task. Maybe it's generating tests, refactoring legacy code, or updating API documentation. You feed it a decent prompt, point it at your repo, and wait.
It comes back with code that compiles but breaks three integration tests. It confidently updates a function that's been deprecated for six months. It generates beautiful documentation for the wrong API version.
The problem isn't the LLM. The problem is your agent has no fucking idea what it's looking at.
Why Context Matters More Than Prompts
Most AI coding agents fail at the same choke point: they treat your codebase like a pile of text files. They read functions in isolation, miss dependencies, and have no concept of which code is actually being used versus which is dead weight from 2019.
Let's say you're building an agent to refactor authentication logic. Your LLM is Claude 3.5 Sonnet — perfectly capable. You write a thoughtful prompt explaining the task. The agent starts reading files.
It finds auth.py. Looks good. It finds legacy_auth.py. Also looks relevant. It finds AuthService, AuthManager, and AuthHelper classes. They all seem to do similar things. Which one is canonical? Which ones are actively maintained?
Without context, your agent is guessing. And guessing at scale creates garbage.
The Three Types of Context That Actually Matter
Forget RAG over raw code for a minute. If you want agents that work, you need three specific types of codebase intelligence:
1. Feature-level understanding
Your codebase isn't organized around features — it's organized around files and folders. But your users think in features. "The notification system", "the checkout flow", "the admin dashboard".
An agent that understands features can answer questions like "which files implement user authentication?" without grepping for string matches. It knows that auth involves database migrations, middleware, route handlers, and frontend components even if they're scattered across different directories.
This is where most agents fall apart immediately. They read the function you pointed them at but miss the six other places that depend on it.
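As a sketch, a feature-level index can be as simple as a map from feature names to the artifacts that implement them. Everything here is hypothetical (the feature names, paths, and routes), and in practice the map would be generated by tooling rather than maintained by hand:

```python
# Hypothetical feature index: feature name -> everything that implements it.
FEATURE_MAP = {
    "user authentication": {
        "files": [
            "app/auth.py",
            "app/middleware/session.py",
            "migrations/0042_add_auth_tokens.py",
            "frontend/src/components/LoginForm.tsx",
        ],
        "routes": ["/login", "/logout", "/password-reset"],
    },
}

def files_for_feature(feature: str) -> list[str]:
    """Answer 'which files implement X?' without grepping for string matches."""
    return FEATURE_MAP.get(feature, {}).get("files", [])

print(files_for_feature("user authentication"))
```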
2. Call graphs and dependencies
Static analysis tools can generate call graphs, but the raw output is noisy as hell. You need to know which dependencies matter. If a function is called once from a test file and 400 times from production code, that context changes everything.
An agent refactoring that function needs to know: Is this public API? Is it used by other services? Is it hot path code that runs on every request? Without this, you get technically correct changes that break prod.
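One rough way to surface that signal: tag each call-graph edge as test or production, so the agent can tell a one-off test usage from hot-path code. The call-graph shape and the path heuristic below are assumptions, not any particular tool's output:

```python
from collections import Counter

# Hypothetical call graph: callee -> list of (caller_file, caller_function).
CALL_GRAPH = {
    "auth.validate_token": [
        ("tests/test_auth.py", "test_validate_token_expired"),
        ("app/middleware/session.py", "require_login"),
        ("app/api/payments.py", "charge_card"),
    ],
}

def is_test_path(path: str) -> bool:
    # Crude heuristic; adjust for your repo layout.
    return "/tests/" in path or path.startswith("tests/") or path.endswith("_test.py")

def caller_profile(symbol: str) -> Counter:
    """Count production vs. test call sites for a symbol."""
    profile = Counter()
    for caller_file, _ in CALL_GRAPH.get(symbol, []):
        profile["test" if is_test_path(caller_file) else "production"] += 1
    return profile

print(caller_profile("auth.validate_token"))  # e.g. Counter({'production': 2, 'test': 1})
```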
3. Ownership and health metrics
Who knows this code? When was it last touched? How complex is it? How often does it change?
This sounds soft, but it's brutally practical. If your agent is supposed to update database queries and it touches a file that hasn't been modified in 18 months and is owned by the team that just got laid off — that's a red flag. Conversely, if it's suggesting changes to code with high churn and active ownership, those changes might get reviewed and merged quickly.
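These signals are cheap to pull straight from git history. A rough sketch, assuming it runs from inside the repo; the 18-month window is arbitrary:

```python
import subprocess
from collections import Counter

def ownership_signals(path: str, window: str = "18 months ago") -> dict:
    """Rough churn and ownership signals for one file, read from git history."""
    authors = subprocess.run(
        ["git", "log", f"--since={window}", "--format=%an", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    last_touched = subprocess.run(
        ["git", "log", "-1", "--format=%ci", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return {
        "recent_commits": len(authors),
        "top_authors": Counter(authors).most_common(3),
        "last_touched": last_touched,
        "looks_stale": len(authors) == 0,  # nothing in the window: tread carefully
    }

print(ownership_signals("app/auth.py"))
```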
How to Build Context Without Losing Your Mind
Building this context from scratch is miserable. You need to:
Parse every language in your polyglot codebase
Build and maintain ASTs
Track symbol usage across files
Map database schemas to the code that queries them
Identify dead code vs. active code
Connect frontend routes to backend handlers
Keep all of this updated as code changes
This is why most teams don't do it. They throw more tokens at the problem instead.
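To get a feel for why, here is roughly what one of those steps (tracking symbol usage) looks like for a single Python file, using only the standard library's ast module. Now multiply it by every language in the repo, add cross-file resolution, and keep it fresh on every commit:

```python
import ast
from collections import defaultdict
from pathlib import Path

def symbol_usage(path: str) -> dict[str, list[int]]:
    """Map every called name in one Python file to the lines where it's called."""
    tree = ast.parse(Path(path).read_text())
    usage: dict[str, list[int]] = defaultdict(list)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                usage[func.id].append(node.lineno)
            elif isinstance(func, ast.Attribute):
                usage[func.attr].append(node.lineno)
    return dict(usage)

# One file, one language, no cross-file resolution, and already non-trivial.
print(symbol_usage("app/auth.py"))
```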
But there's a better path. Tools like Glue automatically index your codebase and extract the structure your agents need. Files, symbols, API routes, database schemas — all mapped and queryable. More importantly, Glue uses AI to discover features in your codebase, so your agents can reason about "the billing system" instead of just billing_v2_final_ACTUALLY_FINAL.py.
A Practical Example: Test Generation
Let's make this concrete. You're building an agent that generates integration tests for API endpoints.
Without context:
Your agent reads the endpoint handler. It sees function parameters and return types. It writes a test that calls the endpoint with valid data and checks the response status code.
The test passes on your machine. In CI, it fails because it didn't set up the required database state. It also didn't mock the external payment service that the endpoint calls. And it used a test user ID that exists in your local DB but not in CI.
With context:
Your agent queries for the feature "payment processing". It gets back a map showing:
The API handler
The database models it touches
External services it calls
Existing tests for similar features
The team that owns this code
Now it can generate a test that:
Uses the same database fixtures as other payment tests
Mocks the external payment service correctly
Includes the right authentication setup
Follows the conventions this team already uses
The test passes in CI on the first try.
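The difference is ultimately what ends up in the prompt. In the sketch below, the context dict stands in for the result of that feature query; its keys and values are hypothetical:

```python
def build_test_prompt(endpoint: str, context: dict) -> str:
    """Fold structured context into the generation prompt for one endpoint."""
    return "\n".join([
        f"Write an integration test for {endpoint}.",
        f"Database models involved: {', '.join(context['models'])}.",
        f"External services to mock: {', '.join(context['external_services'])}.",
        f"Reuse fixtures from: {', '.join(context['similar_tests'])}.",
        "Follow the auth setup and naming conventions used in those tests.",
    ])

# Hypothetical result of a feature query for "payment processing".
context = {
    "models": ["Payment", "Invoice"],
    "external_services": ["stripe"],
    "similar_tests": ["tests/integration/test_refunds.py"],
}
print(build_test_prompt("POST /api/payments", context))
```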
Integrating Context Into Your Agent Architecture
You have options for how to wire this up:
Option 1: Retrieval at inference time
Your agent calls a context API before generating code. "Give me everything related to user authentication." It gets back a structured response with files, dependencies, and metadata. Then it generates code with that context in its prompt.
This is clean but can be slow. Every agent action requires a context lookup.
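A minimal sketch of that flow, with a hypothetical context endpoint; the extra round trip per agent action is exactly where the latency shows up:

```python
import json
import urllib.parse
import urllib.request

CONTEXT_API = "http://localhost:8080/context"  # hypothetical context service

def fetch_context(query: str) -> dict:
    """One structured-context lookup; every agent action pays this round trip."""
    url = f"{CONTEXT_API}?q={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

def generate_with_context(task: str, llm_call) -> str:
    """Fetch context first, then hand it to the model alongside the task."""
    ctx = fetch_context(task)
    prompt = (
        f"Task: {task}\n"
        f"Relevant files: {ctx.get('files', [])}\n"
        f"Dependencies: {ctx.get('dependencies', [])}\n"
        f"Owners: {ctx.get('owners', [])}\n"
        "Generate the change, following the conventions in those files."
    )
    return llm_call(prompt)
```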
Option 2: Pre-computed context in system prompts
You build a knowledge graph of your codebase and serialize relevant chunks into your agent's system prompt. Faster at inference time but requires careful prompt management to stay under token limits.
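The core of this option is the packing step. A sketch, assuming the chunks arrive already ranked by relevance; the four-characters-per-token estimate is a crude stand-in for a real tokenizer:

```python
def approx_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic, not a real tokenizer

def pack_system_prompt(chunks: list[str], token_budget: int = 6000) -> str:
    """Greedily pack pre-ranked context chunks under an approximate token budget."""
    packed, used = [], 0
    for chunk in chunks:  # assumed sorted most-relevant-first
        cost = approx_tokens(chunk)
        if used + cost > token_budget:
            break
        packed.append(chunk)
        used += cost
    return "You are working in this codebase:\n\n" + "\n\n".join(packed)
```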
Option 3: MCP (Model Context Protocol)
If you're using Claude or any other MCP-compatible setup, you can expose codebase context as tools. Your agent can call get_feature_map() or find_dependencies() just like it would call a search API or a calculator.
This is actually elegant. Glue supports MCP, so your agents can query codebase structure the same way they query anything else. No custom APIs to maintain.
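If you were exposing your own index this way, a minimal server might look like the sketch below, built with the MCP Python SDK's FastMCP helper. The tool bodies are placeholders for whatever index you actually query; with Glue's MCP support you wouldn't write this yourself.

```python
# A hypothetical MCP server exposing codebase context as tools.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("codebase-context")

@mcp.tool()
def get_feature_map(feature: str) -> dict:
    """Return files, routes, and schemas that implement a named feature."""
    # Placeholder: delegate to whatever index you actually maintain.
    return {"feature": feature, "files": [], "routes": [], "schemas": []}

@mcp.tool()
def find_dependencies(symbol: str) -> list[str]:
    """Return the call sites and modules that depend on a symbol."""
    return []  # Placeholder: query your call graph here.

if __name__ == "__main__":
    mcp.run()
```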
The Agent Loop That Actually Works
Here's a pattern that's proven effective:
Understand the task — What feature is this related to? What's the desired outcome?
Query for context — Get the feature map, relevant files, dependency graph, ownership info
Generate code — Use the LLM with full context
Verify against constraints — Check that changes don't violate architecture rules (don't add new DB migrations to deprecated services, don't modify code with unclear ownership, etc.)
Run validation — Tests, linters, type checkers
Learn from failures — If tests fail, use the failure output to refine context queries
The context query in step 2 is what separates working agents from toys. Without it, you're just throwing code at a wall.
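Sketched as code, with every helper reduced to a placeholder you'd wire to your own context index, LLM client, and CI tooling:

```python
# Placeholder hooks: wire these to your context index, LLM client, and CI tooling.
def identify_feature(task): return task
def query_context(feature): return {"feature": feature, "files": [], "owners": []}
def generate_code(task, context): return f"# change for {task}"
def check_constraints(change, context): return []
def run_validation(change): return []
def refine_context(context, feedback): return {**context, "feedback": feedback}

def run_agent(task: str, max_attempts: int = 3) -> str | None:
    """The six-step loop above, as a skeleton."""
    feature = identify_feature(task)                     # 1. understand the task
    context = query_context(feature)                     # 2. query for context
    for _ in range(max_attempts):
        change = generate_code(task, context)            # 3. generate with full context
        violations = check_constraints(change, context)  # 4. verify against constraints
        failures = violations or run_validation(change)  # 5. run validation
        if not failures:
            return change
        context = refine_context(context, failures)      # 6. learn from failures
    return None  # escalate to a human

print(run_agent("add logging to the checkout handler"))
```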
What About RAG Over Raw Code?
Sure, you can embed your entire codebase and retrieve relevant chunks based on semantic similarity. It's better than nothing.
But it's also how you end up with agents that find the old deprecated implementation because it has similar variable names to what you're working on. Or miss critical dependencies because they're in a file with a non-obvious name.
Raw code embeddings don't capture structure. They don't know that UserService and UserRepository are tightly coupled even if the code looks different. They don't know that the 50 lines of code in your checkout handler touch four databases, two external APIs, and a message queue.
You need structured context. Embeddings can augment it, but they can't replace it.
Actually Shipping This
If you're building agents for your own team, start small:
Pick one repetitive task (generating similar tests, updating API docs, adding logging)
Build context queries that answer the specific questions your agent needs
Wire them into your agent loop
Measure success rate against human-written code (see the sketch after this list)
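For that last step, the measurement doesn't need to be fancy. A sketch with hypothetical hooks; the point is to run the agent's output through the same checks you'd apply to a human PR:

```python
def evaluate(tasks, generate_change, run_checks) -> float:
    """Fraction of tasks where the agent's change passes the same checks as a human's."""
    passed = 0
    for task in tasks:
        change = generate_change(task)  # your agent loop
        if run_checks(change):          # same tests/linters you'd run on a human PR
            passed += 1
    return passed / len(tasks)

# Usage sketch: wire in real tasks and a real check runner.
tasks = ["add logging to checkout", "test the refund endpoint"]
rate = evaluate(tasks, generate_change=lambda t: f"# change for {t}", run_checks=lambda c: True)
print(f"agent pass rate: {rate:.0%}")
```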
Don't try to build a general-purpose coding agent on day one. Build something that solves a specific problem better than a human can, then expand from there.
And don't build the context infrastructure yourself unless you want that to become your full-time job. Glue already does this — it indexes your codebase, discovers features, and provides APIs your agents can query. You focus on the agent logic, not the parsing.
The Next Bottleneck
Once your agents have context, the next problem is coordination. Multiple agents working on the same codebase need to know what the others are doing. That's a harder problem than context, and nobody's really solved it yet.
But you can't even start thinking about coordination until your agents understand what they're modifying. Context first. Everything else later.