Your new CTO wants a "complete picture" of the 10 million line monorepo. You have 48 hours.
I've been in this exact situation three times. The last company had 847 microservices (yes, someone counted) living in a single repo with zero documentation about who owns what. The dependency graph looked like someone threw spaghetti at a wall and called it architecture.
Here's the thing — you can't read 10 million lines of code. But you can make the code tell you its secrets.
The Panic Response (Don't Do This)
Your first instinct is to start reading. Maybe begin with the README files. Open a few directories. Look at some package.json files.
Stop.
At 200 lines per minute (if you're speed reading), you'd need 833 hours just to scan everything once. That's 20 work weeks. For scanning.
I watched a principal engineer spend six months manually cataloging services before I joined. He had a beautiful spreadsheet with 127 entries. We actually had 400+ services. His approach was admirable and completely useless.
What You Actually Need to Know
Before you touch any code, figure out what questions you're answering:
- Who owns what? (Because when it breaks at 3 AM, someone's getting called)
- What depends on what? (So you don't accidentally delete the authentication service)
- How does code flow to production? (The deployment pipeline maze)
- Where are the integration points? (The places that will hurt when you split things)
That's it. Everything else is noise for your 48-hour window.
The Three-Layer Approach
Layer 1: File System Intelligence (2 hours)
Start with what the filesystem tells you. This isn't about reading code — it's about reading structure.
# Get the lay of the land
find . -name "package.json" | head -20
find . -name "pom.xml" | head -20
find . -name "Dockerfile" | head -20
find . -name "*.yaml" -path "*/k8s/*" | head -20
# Count everything that matters
echo "Services/Apps:"
find . -name "package.json" | wc -l
echo "Docker containers:"
find . -name "Dockerfile" | wc -l
echo "K8s resources:"
find . -name "*.yaml" -path "*/k8s/*" | wc -l
This gives you the shape of the problem. One repo I analyzed had 89 Dockerfiles but 156 package.json files. That told me immediately we had library confusion and partial containerization.
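If you want that mismatch as a list rather than two raw counts, a few lines of Python will show you which directories have a package.json but no Dockerfile. This is a rough sketch that assumes each service keeps both files in the same directory; adjust the walk if your layout differs:
import os

# Directories that contain a package.json vs. a Dockerfile
pkg_dirs, docker_dirs = set(), set()
for root, dirs, files in os.walk("."):
    # Skip vendored dependencies and git internals
    dirs[:] = [d for d in dirs if d not in ("node_modules", ".git")]
    if "package.json" in files:
        pkg_dirs.add(root)
    if "Dockerfile" in files:
        docker_dirs.add(root)

print(f"package.json without Dockerfile (libraries or un-containerized apps): {len(pkg_dirs - docker_dirs)}")
print(f"Dockerfile without package.json (non-Node services): {len(docker_dirs - pkg_dirs)}")
for d in sorted(pkg_dirs - docker_dirs)[:20]:
    print("  ", d)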
Now get the ownership signals:
# Who touched what recently?
git log --format="%an" --since="6 months ago" | sort | uniq -c | sort -nr | head -20
# What are the hotspots?
git log --format=format: --name-only --since="6 months ago" | grep -v '^$' | sort | uniq -c | sort -nr | head -30
The hotspots are where your problems live. Files that change constantly are either core infrastructure (good to know) or poorly designed (also good to know).
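If you want churn per service instead of per file, roll the same query up to a directory prefix. A quick sketch, assuming the first two path components identify a service (services/auth, apps/checkout, and so on); tweak the depth for your layout:
import subprocess
from collections import Counter

# Same git query as above: files touched in the last six months, one path per line
out = subprocess.run(
    ["git", "log", "--format=format:", "--name-only", "--since=6 months ago"],
    capture_output=True, text=True, check=True,
).stdout

churn = Counter()
for path in out.splitlines():
    if not path:
        continue
    # Assume the first two path components name the service, e.g. services/auth
    churn["/".join(path.split("/")[:2])] += 1

for service, count in churn.most_common(20):
    print(f"{count:6d}  {service}")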
Layer 2: Dependency Mapping (4 hours)
This is where most people get stuck. Don't manually read import statements. Automate it.
For Node.js ecosystems:
# Extract all package.json dependencies
find . -name "package.json" -exec jq -r '.name + "," + (.dependencies // {} | keys | join("|"))' {} \; > deps.csv
For Java:
# Maven dependencies (assuming mvn dependency:tree works)
find . -name "pom.xml" -execdir mvn dependency:tree -DoutputFile=deps.txt \; 2>/dev/null
But here's the real trick — use existing tools instead of building your own parser:
# Install dependency-cruiser (works for JS/TS)
npm install -g dependency-cruiser
# Run it on major subdirectories
for dir in services/* apps/*; do
  echo "Analyzing $dir"
  dependency-cruiser --no-config --output-type json "$dir" > "deps-$(basename "$dir").json" 2>/dev/null
done
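To get a quick read on those per-directory reports without opening them, a few lines of Python will summarize each one. This assumes the modules/dependencies shape of dependency-cruiser's JSON reporter, so double-check it against the output your version actually produces:
import glob
import json

# Roll each dependency-cruiser report up into a one-line summary
for path in sorted(glob.glob("deps-*.json")):
    try:
        with open(path) as f:
            report = json.load(f)
    except (json.JSONDecodeError, OSError):
        print(f"{path}: unreadable (the cruise probably failed)")
        continue
    modules = report.get("modules", [])
    edges = sum(len(m.get("dependencies", [])) for m in modules)
    print(f"{path}: {len(modules)} modules, {edges} import edges")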
I prefer madge for Node.js projects because it's faster and handles circular dependencies:
npm install -g madge
# Generate one merged dependency map (jq -s 'add' combines the per-package JSON objects into a single valid JSON file)
find . -name "package.json" -not -path "*/node_modules/*" -execdir madge --json . \; | jq -s 'add' > service-deps.json
Layer 3: Production Intelligence (2 hours)
The filesystem lies. Git history lies. But production doesn't lie.
Look at your deployment configs:
# Find all the ways code gets deployed
find . -name "*.yaml" | grep -E "(deploy|k8s|kubernetes)" | head -20
find . -name "Dockerfile" -exec dirname {} \; | sort -u
find . -name ".github" -type d -exec find {} -name "*.yml" \;
Check the CI/CD signals:
# What actually gets built?
grep -r "docker build" .github/ . | head -20
grep -r "npm run build" .github/ . | head -20
# What services are exposed?
grep -r "port:" . | grep -E "\.(yaml|yml):" | head -20
This tells you what actually runs in production vs. what exists in the repo. I once found 47 services in CI/CD configs but 89 in the filesystem. The difference? 42 dead services no one had the courage to delete.
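You can automate that comparison instead of eyeballing it. The sketch below flags Dockerfile directories whose names never appear in any CI or k8s config; it's pure name matching and assumes GitHub Actions under .github/, so expect false positives and extend it for whatever CI you actually run:
from pathlib import Path

# Directories that contain a Dockerfile = things that *could* run in production
service_dirs = sorted({p.parent for p in Path(".").rglob("Dockerfile")})

# Everything the CI and deployment configs mention, as one blob of text (crude but fast)
config_text = ""
if Path(".github").is_dir():
    for cfg in Path(".github").rglob("*.y*ml"):
        config_text += cfg.read_text(errors="ignore")
for cfg in Path(".").rglob("*.yaml"):
    if "k8s" in cfg.parts:
        config_text += cfg.read_text(errors="ignore")

# A service directory whose name never appears in any config is a dead-service candidate.
# Short or generic directory names will produce false positives, so verify before deleting anything.
dead = [d for d in service_dirs if d.name not in config_text]
print(f"{len(dead)} of {len(service_dirs)} Dockerfile directories never appear in CI/k8s configs:")
for d in dead[:30]:
    print("  ", d)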
Building the Map
Don't create a perfect diagram. Create a useful one.
I use a simple three-bucket approach:
Core Services (< 10 items)
- Authentication, payments, user management
- The stuff that breaks everything when it's down
- Usually has the most inbound dependencies
Feature Services (10-50 items)
- Product features, business logic
- Depend on core services
- Change frequently
Support Services (Everything else)
- Utilities, data processing, reporting
- Often can be ignored for architectural decisions
Here's my quick-and-dirty visualization script:
import json
from collections import defaultdict

import networkx as nx

# Load dependency data (from the merged madge output)
deps = defaultdict(set)
with open('service-deps.json') as f:
    data = json.load(f)
    for service, dependencies in data.items():
        deps[service].update(dependencies)

# Build a directed graph: an edge A -> B means A depends on B
G = nx.DiGraph()
for service, dependencies in deps.items():
    for dep in dependencies:
        G.add_edge(service, dep)

# Rank services by importance (incoming connections = how many things depend on them)
by_importance = sorted(G.nodes(), key=lambda n: G.in_degree(n), reverse=True)
print("Top 20 most depended-on services:")
for service in by_importance[:20]:
    print(f"{service}: {G.in_degree(service)} dependents")
The Outcome
After 8 hours of automated analysis, you'll have:
- Service inventory — What exists and what actually runs
- Dependency hotspots — The 10-20 services everything else needs
- Ownership patterns — Who to blame when things break
- Deployment complexity — How many ways code reaches production
This isn't a perfect architectural diagram. It's actionable intelligence.
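If you want something to hand over rather than a pile of terminal output, stitching the artifacts into a short summary file takes a few more lines. A minimal sketch, assuming deps.csv and service-deps.json exist from the earlier steps:
import csv
import json
from pathlib import Path

# Stitch the artifacts from the three layers into a one-page summary
lines = ["# Monorepo survey (48-hour pass)", ""]

lines.append(f"- Dockerfiles found: {sum(1 for _ in Path('.').rglob('Dockerfile'))}")

if Path("deps.csv").exists():
    with open("deps.csv") as f:
        lines.append(f"- Node packages (deps.csv): {sum(1 for row in csv.reader(f) if row)}")

if Path("service-deps.json").exists():
    with open("service-deps.json") as f:
        lines.append(f"- Modules in the dependency graph: {len(json.load(f))}")

Path("SUMMARY.md").write_text("\n".join(lines) + "\n")
print("\n".join(lines))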
The last time I did this exercise, we discovered that 23% of our services were unused, 31% had circular dependencies, and three junior engineers had accidentally become the maintainers of our most critical infrastructure.
Your CTO gets a data-driven overview instead of hand-waving. You get to keep your sanity. Everyone wins.
Well, except the person who has to actually fix the mess. But at least now you know where the bodies are buried.