The Refactor That Broke Production (And How Graph Analysis Would Have Prevented It)
John Doe
Last Tuesday, I deployed what looked like a straightforward refactor. Thirty minutes later, our checkout service was returning 500s and our Slack was on fire.
The change? Moving user authentication logic from our monolithic user-service into a new auth-service. Clean separation of concerns. Textbook microservices. What could go wrong?
Everything.
The "Simple" Refactor
We had a classic monolith problem. The user-service handled user profiles, preferences, billing history, AND authentication. It was becoming a maintenance nightmare (you know how this goes).
The plan:
Extract the authentication logic into a new auth-service
Make user-service depend on auth-service for token validation
Keep existing APIs intact with internal service calls
Ship it gradually
Here's what the dependency flow looked like in my head:
Frontend → user-service → auth-service → database
Clean. Linear. Simple.
But distributed systems laugh at your diagrams.
The Hidden Web
The refactor went smoothly. Unit tests passed. Integration tests passed. Code review was uneventful. I deployed to staging and everything worked perfectly.
(This should have been my first red flag. When has anything ever worked perfectly in staging?)
The problem surfaced during user preference updates. Turns out, when a user changed their email, we had this "helpful" audit trail that logged the change:
# In user-service
def update_email(user_id, new_email):
    old_email = get_current_email(user_id)
    update_user_email(user_id, new_email)

    # Log the change with user context
    audit_log.info("Email changed", extra={
        'user_id': user_id,
        'old_email': old_email,
        'new_email': new_email,
        'user_details': get_user_context(user_id)  # <-- the problem
    })
That get_user_context() function seemed innocent. It grabbed user details for audit logs. Made sense.
But look closer:
def get_user_context(user_id):
    user = User.objects.get(id=user_id)
    return {
        'username': user.username,
        'account_type': user.account_type,
        'is_admin': check_admin_status(user_id),  # <-- oh no
        'permissions': get_user_permissions(user_id)
    }
The check_admin_status() call? It hit an internal endpoint to verify admin privileges. And after my refactor, that endpoint lived in... the auth-service.
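To make the callback concrete, here's roughly what check_admin_status() amounted to after the refactor. This is a sketch: the endpoint path and timeout are illustrative, not our exact code.

import requests

def check_admin_status(user_id):
    # user-service calling back into auth-service over HTTP
    resp = requests.get(
        f"http://auth-service/internal/admin-status/{user_id}",
        timeout=2,
    )
    resp.raise_for_status()
    return resp.json()['is_admin']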
So the actual flow was:
user-service → auth-service → user-service (for user context)
Circular dependency. Classic.
Why This Broke Everything
The circular call created a deadlock during high-traffic periods. Here's what happened:
1. User update hits user-service
2. user-service calls auth-service for token validation
3. auth-service processes request, returns success
4. user-service updates database
5. Audit logging triggers get_user_context()
6. get_user_context() calls back to auth-service to check admin status
7. Under load, connection pools exhaust (see the sketch below)
8. Both services start timing out on each other
9. Circuit breakers trip
10. Everything falls over
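You can reproduce the mechanics of steps 7 and 8 in a few lines. This is a toy model, not our production code: two single-worker thread pools stand in for each service's connection pool, and the cross-call leaves each pool stuck waiting on the other.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

user_pool = ThreadPoolExecutor(max_workers=1)  # user-service's only "connection"
auth_pool = ThreadPoolExecutor(max_workers=1)  # auth-service's only "connection"

def get_user_context():
    return {'is_admin': False}

def check_admin_status():
    # auth-service calls back into user-service, whose only worker
    # is still busy handling the original update below
    return user_pool.submit(get_user_context).result(timeout=2)

def update_email():
    # user-service blocks its only worker waiting on auth-service
    return auth_pool.submit(check_admin_status).result(timeout=3)

try:
    user_pool.submit(update_email).result(timeout=5)
except TimeoutError:
    print("Deadlock: each pool is stuck waiting on the other")

With real pool sizes in the dozens, it takes sustained traffic to fill every slot, which is exactly why this never showed up on a quiet staging box.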
The kicker? This only happened during email updates (rare) and only under moderate load (so not in our load tests).
Graph Analysis Would Have Caught This
After the post-mortem (and several strong coffees), I built a dependency analyzer. It maps every HTTP call, database query, and service interaction in our codebase.
Here's the simplified version:
import ast
from pathlib import Path
from urllib.parse import urlparse

import networkx as nx

class ServiceDependencyAnalyzer:
    def __init__(self, services_dir):
        self.graph = nx.DiGraph()
        self.services_dir = Path(services_dir)

    def analyze_service_calls(self, service_name):
        """Extract HTTP calls from a service's codebase"""
        service_path = self.services_dir / service_name
        for py_file in service_path.rglob("*.py"):
            tree = ast.parse(py_file.read_text())
            for node in ast.walk(tree):
                if isinstance(node, ast.Call):
                    call_target = self._extract_service_call(node)
                    if call_target:
                        self.graph.add_edge(service_name, call_target)

    def _extract_service_call(self, node):
        """Parse AST node to identify service calls"""
        # Look for requests.get/post/put/delete with a literal URL
        if (isinstance(node.func, ast.Attribute) and
                node.func.attr in ('get', 'post', 'put', 'delete')):
            if (node.args and isinstance(node.args[0], ast.Constant) and
                    isinstance(node.args[0].value, str)):
                return self._extract_service_from_url(node.args[0].value)
        return None

    def _extract_service_from_url(self, url):
        """Map http://auth-service/... to 'auth-service'"""
        # Our internal URLs use the service name as the hostname
        return urlparse(url).hostname

    def find_cycles(self):
        """Find circular dependencies"""
        # simple_cycles yields nothing when the graph is acyclic
        return list(nx.simple_cycles(self.graph))

    def analyze_impact(self, service_name):
        """Find all services that (transitively) depend on this one"""
        if service_name not in self.graph:
            return []
        # All other nodes that have a path TO this service
        return [node for node in self.graph.nodes()
                if node != service_name
                and nx.has_path(self.graph, node, service_name)]
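Wiring it up takes a few more lines (the services/ directory layout here is ours; adjust to yours):

analyzer = ServiceDependencyAnalyzer("services/")
for service in ("user-service", "auth-service", "billing-service",
                "notification-service", "admin-dashboard"):
    analyzer.analyze_service_calls(service)

print(analyzer.find_cycles())
print(analyzer.analyze_impact("auth-service"))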
Running this on our pre-refactor code would have shown:
Circular dependencies found:
[['user-service', 'auth-service']]
Impact analysis for auth-service changes:
- user-service (direct dependency)
- billing-service (depends on user-service)
- notification-service (depends on user-service)
- admin-dashboard (depends on user-service)
That circular dependency would have been obvious. We could have fixed it before production.
The Real Fix
The solution wasn't just breaking the circle (though we did that). We needed to fundamentally rethink our service boundaries.
Instead of auth-service needing user details, we inverted the dependency:
# New approach: auth-service returns minimal context
def validate_token(token):
    # PyJWT needs the key and allowed algorithms; SECRET_KEY comes from config
    payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
    return {
        'user_id': payload['user_id'],
        'permissions': payload['permissions'],  # Embedded in token
        'expires_at': payload['exp']
    }

# user-service handles its own admin checks
def check_admin_status(user_id):
    user = User.objects.get(id=user_id)
    return user.is_admin  # No service call needed
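For that to work, permissions have to be baked in when the token is minted. The issuing side looks roughly like this; a sketch assuming PyJWT, with an illustrative 15-minute expiry:

import time
import jwt

def issue_token(user, secret_key):
    return jwt.encode(
        {
            'user_id': user.id,
            'permissions': list(user.permissions),  # captured at login
            'exp': int(time.time()) + 900,          # 15 minutes
        },
        secret_key,
        algorithm="HS256",
    )

The trade-off: permission changes only take effect when the token is reissued, so you want the expiry short.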
But the bigger win was making dependency analysis part of our CI pipeline:
# .github/workflows/dependency-check.yml
name: Dependency Analysis

on: [pull_request]

jobs:
  analyze:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run dependency analysis
        run: |
          if ! python scripts/analyze_dependencies.py; then
            echo "Circular dependencies detected!"
            exit 1
          fi
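The script itself just walks the services directory and exits nonzero on any cycle. A sketch (the analyzer import path is whatever you named the module above):

# scripts/analyze_dependencies.py
import sys

from analyzer import ServiceDependencyAnalyzer  # hypothetical module name

analyzer = ServiceDependencyAnalyzer("services/")
for service_dir in sorted(analyzer.services_dir.iterdir()):
    if service_dir.is_dir():
        analyzer.analyze_service_calls(service_dir.name)

cycles = analyzer.find_cycles()
if cycles:
    print(f"Circular dependencies found:\n{cycles}")
    sys.exit(1)
print("No circular dependencies found.")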
Now every PR gets checked for circular dependencies. No more surprises in production.
The Numbers
Before the fix:
12-hour outage
$47K in lost revenue (based on our checkout conversion rates)
847 customer support tickets
Team worked until 3 AM
After implementing dependency analysis:
Zero circular dependency incidents in 8 months
23 potential issues caught in PR review
Average resolution time: 15 minutes (vs. 12 hours)
The analysis script takes 3 minutes to run and has saved us from at least two other potential outages.
Actually Learn From This
Don't just read this and nod along. If you're running microservices, you probably have circular dependencies lurking in your codebase right now.
Build the analyzer. Run it on your services. You'll be surprised what you find.
Because the next refactor that looks "simple" might be the one that takes your site down for half a day. And unlike me, you won't have the excuse that nobody warned you.