Building a Blast Radius Oracle: How I Designed Impact Analysis
Most engineers learn about blast radius the hard way. You ship what looks like a simple refactor. Two hours later, checkout is broken in production. The payment service is throwing errors. Customer support is lighting up Slack.
That "simple" utility function you moved? It was called by seventeen different features across nine services. Nobody knew.
I spent the last year building Glue's blast radius analysis. The goal: predict what breaks before you ship. The reality: way harder than it sounds.
Why Traditional Dependency Graphs Fail
When I started, I assumed this was a solved problem. Just trace the dependency graph, right? Parse imports, build a tree, done.
Dependency graphs tell you what's connected. They don't tell you what matters.
Here's a real example from a codebase we indexed. A shared formatCurrency() function had 247 import references. The graph said changing it would impact 247 files. Technically true, useless in practice.
Most of those files? Background scripts, admin tools, internal dashboards. Three files actually mattered: checkout flow, invoice generation, subscription billing. Those were the blast radius. The rest was noise.
Static analysis gives you syntax. It doesn't give you meaning.
The Feature Mapping Problem
The breakthrough came when I stopped thinking about files and started thinking about features.
A feature isn't a file. It's not even a set of files. It's a user-visible capability that spans multiple layers of your stack. The checkout feature might touch:
CheckoutButton.tsx
cart.service.ts
payment-processor.go
order-confirmation-email.html
inventory-reserve.sql
Change any of these, and you're potentially impacting checkout. But your dependency graph doesn't know that. It sees five unrelated files.
We solved this by building a feature discovery system. Glue indexes your codebase and uses AI to identify features by analyzing code patterns, test files, route handlers, and commit history. It groups files into feature clusters based on how they actually work together, not just how they import each other.
The mapping isn't perfect. AI isn't magic. But it's right 80% of the time, which beats manual documentation that's right 0% of the time because nobody maintains it.
Three Layers of Impact
Once you have features mapped, you can build real blast radius analysis. We track impact at three layers:
Direct Impact: Files that import your changes. This is the easy part. Parse the dependency graph, find direct references. Every static analysis tool does this.
Feature Impact: Features that depend on the files you're changing. This is where it gets interesting. If you modify db-connection-pool.ts, we look at which features use it. Maybe that's "User Authentication" and "Report Generation". Now you know what to test.
Downstream Impact: Features that depend on impacted features. This is the blast radius. If you break authentication, you break everything that requires login. The cascade effect.
Most tools stop at layer one. We needed all three.
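To make the three layers concrete, here's roughly the shape such a report could take. This extends the db-connection-pool.ts scenario above; the type and the example values are illustrative, not Glue's actual schema.

```typescript
// Illustrative report shape for the three impact layers; not Glue's real schema.
interface ImpactReport {
  changedFile: string;
  directImpact: string[];     // layer 1: files that import the changed file
  featureImpact: string[];    // layer 2: features that depend on those files
  downstreamImpact: string[]; // layer 3: features that depend on impacted features
}

// Hypothetical values extending the db-connection-pool.ts example.
const report: ImpactReport = {
  changedFile: 'db-connection-pool.ts',
  directImpact: ['user-repository.ts', 'report-query.ts'],
  featureImpact: ['User Authentication', 'Report Generation'],
  downstreamImpact: ['Admin Dashboard', 'Scheduled Exports'],
};
```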
The Runtime vs. Compile Time Problem
Static analysis has a fundamental limitation: it only sees what's written in code. It doesn't see what happens at runtime.
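Picture a dispatcher that picks its handler based on user input. This is a hypothetical sketch; the file layout and names are invented to illustrate the pattern.

```typescript
// Hypothetical dispatcher: the target module is chosen at runtime from user input,
// so no static import edge connects this file to handlers/payment.ts.
export async function dispatch(action: string, payload: unknown): Promise<void> {
  const mod = await import(`./handlers/${action}`);
  await mod.handle(payload);
}
```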
What's the blast radius of changing handlers/payment.ts? A static analyzer can't tell you. The dependency is resolved at runtime based on user input.
We handle this by combining static analysis with dynamic inference. We look at:
Route definitions and API endpoints
Event handlers and message queues
Database query patterns
Test execution paths
If payment.ts is only called when action === 'payment', and that only happens in the checkout flow, we can infer the feature boundary even without a direct import.
This is probabilistic, not deterministic. We assign confidence scores. High confidence means multiple signals agree. Low confidence means we're guessing based on limited data.
I'd rather have a 70% confidence answer than pretend I can be 100% certain about dynamic behavior.
Code Health Meets Blast Radius
Here's where it got interesting. We already had code health metrics in Glue — churn rate, complexity scores, ownership data. Combining these with blast radius created something new.
A file with high complexity and high blast radius? That's a critical risk zone. Touch it carefully. Have someone review who understands the full impact.
A file with high churn and high blast radius? That's a refactoring priority. Too many people are changing code that affects too many things. Either stabilize it or break it apart.
A file with low ownership clarity and high blast radius? That's a knowledge gap. Nobody owns the code that impacts multiple features. Document it or assign an owner before something breaks.
The blast radius alone isn't actionable. Combined with health metrics, it tells you where to focus.
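As a rough sketch, the combination can be expressed as a simple triage rule. The field names and thresholds below are invented for illustration, not Glue's actual scoring.

```typescript
// Hypothetical triage rule combining code health signals with blast radius.
interface FileHealth {
  path: string;
  complexity: number;       // e.g. cyclomatic complexity
  churnPerQuarter: number;  // commits touching the file in the last 90 days
  ownershipClarity: number; // 0..1, how concentrated ownership of the file is
  blastRadius: number;      // number of features a change would impact
}

type Triage =
  | 'low risk'
  | 'critical risk zone'
  | 'refactoring priority'
  | 'knowledge gap'
  | 'monitor';

function triage(file: FileHealth): Triage {
  if (file.blastRadius < 3) return 'low risk';
  if (file.complexity > 20) return 'critical risk zone';        // review with someone who knows the full impact
  if (file.churnPerQuarter > 15) return 'refactoring priority'; // stabilize it or break it apart
  if (file.ownershipClarity < 0.3) return 'knowledge gap';      // document it or assign an owner
  return 'monitor';
}
```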
The Ownership Pyramid
One pattern emerged consistently: blast radius follows ownership structure.
In codebases with clear ownership — each feature has a designated team, responsibilities are documented — the blast radius tends to be contained. Changes in one team's code rarely cascade to another team's features.
In codebases with diffuse ownership — shared utilities everywhere, no clear boundaries — the blast radius is huge. Everything affects everything.
We started visualizing this as ownership pyramids. At the top: core infrastructure that everyone depends on. In the middle: shared services used by multiple features. At the bottom: feature-specific code with minimal external impact.
Good architectures have a narrow top: a few well-maintained foundational components that rarely change. Bad architectures have a wide top: tons of shared code that's constantly being modified.
You can reshape the pyramid through refactoring. Extract feature-specific logic out of shared utilities. Make core infrastructure more stable and slower to change. Create ownership boundaries.
But first you need to see the pyramid. That's what blast radius analysis gives you.
Building the Oracle
The actual implementation is messier than the theory. We run multiple analysis passes:
Pass 1: Syntax Tree Analysis
Parse every file. Build import graphs. Identify function calls, class hierarchies, interface implementations. This is standard compiler stuff.
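In sketch form, the import-graph step looks something like this. A real pass parses a proper AST; the regex here is a stand-in to keep the example short.

```typescript
import { readFileSync } from 'node:fs';

// Crude approximation of import extraction; a real implementation walks the AST.
function extractImports(filePath: string): string[] {
  const source = readFileSync(filePath, 'utf8');
  const importRegex = /import\s+(?:[\s\S]*?\s+from\s+)?['"]([^'"]+)['"]/g;
  const imports: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = importRegex.exec(source)) !== null) {
    imports.push(match[1]);
  }
  return imports;
}

// Map each file to the modules it imports.
function buildImportGraph(files: string[]): Map<string, string[]> {
  const graph = new Map<string, string[]>();
  for (const file of files) graph.set(file, extractImports(file));
  return graph;
}
```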
Pass 2: Feature Clustering
Use AI to group files into features based on naming patterns, directory structure, test coverage, and commit co-occurrence. Files that change together probably work together.
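The co-change signal is cheap to compute once you have each commit's changed-file list. A minimal sketch:

```typescript
// Count how often pairs of files change in the same commit.
// Pairs that co-change frequently are candidates for the same feature cluster.
function coChangeCounts(commits: string[][]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const files of commits) {
    const sorted = [...files].sort();
    for (let i = 0; i < sorted.length; i++) {
      for (let j = i + 1; j < sorted.length; j++) {
        const key = `${sorted[i]}::${sorted[j]}`;
        counts.set(key, (counts.get(key) ?? 0) + 1);
      }
    }
  }
  return counts;
}
```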
Pass 3: Runtime Inference
Analyze dynamic imports, event handlers, config-driven behavior. Look for patterns that indicate runtime dependencies not visible in static code.
Pass 4: Impact Propagation
Simulate changes. If file A changes, what features use A? If feature X breaks, what features depend on X? Build the cascade tree.
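At its core this pass is a breadth-first walk over the feature dependency graph. A minimal sketch, assuming a map from each feature to the features that depend on it:

```typescript
// Walk outward from the directly impacted features, recording cascade distance.
function propagateImpact(
  seedFeatures: string[],
  featureDeps: Map<string, string[]>, // feature -> features that depend on it
): Map<string, number> {
  const depth = new Map<string, number>(); // feature -> hops from the original change
  const queue: Array<[string, number]> = seedFeatures.map((f): [string, number] => [f, 0]);
  while (queue.length > 0) {
    const [feature, d] = queue.shift()!;
    if (depth.has(feature)) continue;
    depth.set(feature, d);
    for (const dependent of featureDeps.get(feature) ?? []) {
      if (!depth.has(dependent)) queue.push([dependent, d + 1]);
    }
  }
  return depth;
}
```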
Pass 5: Confidence Scoring
Weight each connection by signal strength. Direct imports get high confidence. Inferred runtime dependencies get lower confidence. Show both the likely impact and the uncertainty.
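A sketch of what that weighting might look like; the signal names and weights below are invented and would need tuning against real data.

```typescript
// Hypothetical confidence per evidence type.
const SIGNAL_WEIGHTS: Record<string, number> = {
  staticImport: 0.95,
  routeDefinition: 0.8,
  testCoverage: 0.7,
  commitCoChange: 0.5,
  inferredRuntime: 0.4,
};

// Treat signals as independent evidence: confidence that at least one edge is real.
function combinedConfidence(signals: string[]): number {
  let pNone = 1;
  for (const s of signals) pNone *= 1 - (SIGNAL_WEIGHTS[s] ?? 0.3);
  return 1 - pNone;
}
```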
Each pass takes seconds for small codebases, minutes for large ones. We cache aggressively and only reanalyze what changed.
What I Got Wrong
Early versions were garbage. The feature detection was too aggressive — everything was part of the authentication feature if it checked user.id. The blast radius was too conservative — we flagged every change as high-risk.
I learned that precision matters more than recall. It's okay to miss some edge cases. It's not okay to cry wolf constantly. Engineers ignore tools that have high false positive rates.
We tuned the confidence thresholds aggressively. Now we only surface high-confidence impacts by default. Low-confidence connections are available if you dig, but we don't lead with them.
Also: the UI mattered more than I expected. Showing a giant dependency graph is overwhelming. Showing "This change impacts 3 features: Checkout, Billing, Reports" is actionable. Design the visualization for the decision you want people to make.
Using This in Practice
When we integrate with Cursor and other editors through MCP, the blast radius analysis runs on every file you open. You see the impact zone immediately.
Before you refactor that utility function, you know it's used by critical payment flows. Before you delete that seemingly unused import, you know it's actually required by the background job processor.
The goal isn't to prevent all changes. It's to give you context. Sometimes you ship the change anyway because it's worth the risk. But you do it knowingly. You write better tests. You notify the right teams. You watch the right metrics after deploy.
That's the oracle's job: not to predict the future, but to help you prepare for it.
What's Next
We're adding temporal analysis next. Not just what breaks, but when. Some impacts are immediate — API errors right after deploy. Some are delayed — background jobs that run hourly, email templates that render once a day.
Understanding the timing of blast radius helps with rollout strategy. Deploy during low-traffic hours if the impact is immediate. Stage the rollout if the impact is delayed.
We're also building team impact scores. When you make a change, which teams need to be notified? Which teams should review? Which teams own the monitoring once it ships?
Blast radius isn't just technical. It's organizational. The code structure reflects the team structure. Analyzing one gives you insight into the other.
The hardest problems in software aren't algorithmic. They're about understanding systems — both code and people — and predicting how changes ripple through them.
That's what we're building. One blast radius at a time.