Testing Strategies for Vibe Coding: When AI Writes Code You Don't Understand
You just shipped a feature in 45 minutes that would've taken two days last year. Cursor wrote most of it. You tweaked a few things, the tests passed, and it works in staging. You feel productive as hell.
You also have no idea what half the code does.
This is vibe coding. You're directing AI like a contractor — describing what you want, reviewing the output, iterating until it feels right. It's insanely productive until something breaks and you're staring at code that might as well be written in Sumerian.
The real problem isn't that AI writes bad code. Modern LLMs are pretty good. The problem is you're now responsible for code you didn't fully internalize, in a codebase you might not completely understand, with test coverage that might be theater.
Your CI is green. All tests pass. You're safe, right?
Wrong. Tests tell you what works, not what matters.
I watched a team ship a payment feature where the AI-generated tests validated every edge case perfectly. The feature worked. The problem? It bypassed the fraud detection middleware that every other payment flow used. The tests didn't know to check that because the engineer didn't know that pattern existed.
When you vibe code, you skip the painful process of understanding context. You don't grep for similar patterns. You don't check how the existing code handles authentication, or rate limiting, or error reporting. You just... ship.
Your tests validate the code you wrote, not the code you should've written.
Test What AI Doesn't See
AI is terrible at understanding implicit requirements. It knows your function signature and maybe some surrounding code. It doesn't know:
Your API rate limits and where they're enforced
Which database queries need specific indexes
Your team's error handling patterns
Security policies buried in middleware
Feature flags that control rollout
You need tests that validate these invisible constraints.
Integration smoke tests are your friend. Not the comprehensive kind — the paranoid kind:
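Something like this, for the payment example above. The route, test token, and flagged test card are placeholders for whatever your stack actually uses; the point is asserting that the new code sits behind the same middleware as everything else:

import request from 'supertest';
import { app } from '../src/app'; // hypothetical app entry point
// assumes describe/it/expect come from your test runner (Jest or Vitest)

const testToken = 'a-valid-test-user-token'; // placeholder: however your suite mints auth tokens
const FLAGGED_TEST_CARD = '0000-0000-0000-0000'; // placeholder: a card your fraud rules flag

describe('Paranoid smoke tests: new payment endpoint', () => {
  it('rejects unauthenticated requests', async () => {
    // If this comes back 200, you just bypassed the auth middleware
    const res = await request(app).post('/api/payments').send({ amount: 100 });
    expect(res.status).toBe(401);
  });

  it('gets rate limited like every other endpoint', async () => {
    const responses = await Promise.all(
      Array.from({ length: 50 }, () =>
        request(app)
          .post('/api/payments')
          .set('Authorization', `Bearer ${testToken}`)
          .send({ amount: 100 })
      )
    );
    // At least some of these should hit the rate limiter
    expect(responses.some((r) => r.status === 429)).toBe(true);
  });

  it('goes through fraud detection, not around it', async () => {
    const res = await request(app)
      .post('/api/payments')
      .set('Authorization', `Bearer ${testToken}`)
      .send({ amount: 100, card: FLAGGED_TEST_CARD });
    // However your fraud path responds, it should not be a clean 200
    expect(res.status).not.toBe(200);
  });
});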
These tests are obnoxious to write. They're also the ones that catch the "oh shit" bugs at 2 AM.
Property-Based Testing for Unknown Unknowns
When you don't understand the code deeply, you can't predict edge cases. So stop trying.
Property-based testing generates hundreds of random inputs and checks invariants:
import fc from 'fast-check';

// You don't know what breaks this parser
// So test what MUST be true
fc.assert(
  fc.property(fc.string(), (input) => {
    const result = aiGeneratedParser(input);
    // Must never crash
    expect(result).toBeDefined();
    // Output must be serializable
    expect(JSON.stringify(result)).toBeTruthy();
    // Must be idempotent
    expect(aiGeneratedParser(input)).toEqual(result);
  })
);
This finds the weird shit. The input that's valid Unicode but breaks your regex. The edge case where your AI-generated state machine gets stuck in a loop.
Tools like fast-check (JS) or hypothesis (Python) do this. They're annoying to set up and feel like overkill until they catch a production bug that only reproduces 3% of the time.
Mutation Testing: Are Your Tests Lying?
Your tests pass. Great. Would they fail if the code was wrong?
Mutation testing finds out. It makes small, deliberate breaks (mutations) in your code and reruns your tests. If the tests still pass, your coverage is fake:
# Using Stryker for JS
npx stryker run

# It'll mutate your code, e.g.:
#   - if (user.isAdmin) {
#   + if (false) {
# If tests still pass, you have a problem
This is brutal. It'll tell you that 40% of your "tested" code has no real coverage. That's normal for vibe-coded features because you validated the happy path and moved on.
You don't need 100% mutation coverage. But run it on your critical paths — payment processing, auth, data deletion. The stuff that gets you fired.
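You can point Stryker at just those paths instead of the whole repo. A minimal config sketch, assuming @stryker-mutator/core with the Jest runner, and that your critical code happens to live under src/payments and src/auth:

// stryker.conf.js
module.exports = {
  // Only mutate the code that gets you fired
  mutate: ['src/payments/**/*.ts', 'src/auth/**/*.ts'],
  testRunner: 'jest',
  reporters: ['clear-text', 'html'],
  // Fail the run if too many mutants survive in these paths
  thresholds: { high: 80, low: 70, break: 60 },
};

Run it with npx stryker run as above; the break threshold turns a weak mutation score into a failing exit code you can wire into CI.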
Context-Aware Testing: Know What You're Changing
Here's where things get real. You can't test what you don't understand. And when AI writes code, you often don't understand the full picture.
You need codebase intelligence. Not documentation (it's wrong). Not comments (they lie). Actual runtime behavior and relationships.
This is where something like Glue helps. Before you let Cursor write a feature, you should know:
What features already touch this code (feature maps)
Which endpoints call this function (call graphs)
Who owns this and what's the churn rate (ownership + health)
What similar patterns exist in the codebase
You're not looking for permission. You're looking for traps.
I've seen engineers ask Cursor to "add caching to this API endpoint" without knowing there's already a Redis layer three functions up the stack. The AI happily adds another cache, the tests pass, and now you have a nasty invalidation bug.
Quick context check before you vibe code:
// Before asking AI to modify this
async function getUserData(userId: string) {
  // Who calls this? What features use it?
  // Is there existing caching? Logging? Error tracking?
  return db.users.findOne({ id: userId });
}
Glue's feature discovery shows you this automatically. It maps where getUserData is actually used, what API routes depend on it, and whether it's in a high-churn area (read: unstable).
You make better testing decisions when you know if you're touching legacy code that serves 60% of users vs. a new experimental feature with three callers.
The Ownership Test
Who's going to maintain this code? When it breaks at 3 AM, who gets paged?
If you're vibe coding in someone else's domain, write tests that prove you didn't break their shit:
// This code is owned by the payments team
// Your tests need to validate their contracts
describe('Contract: payments team', () => {
  it('preserves existing error codes', async () => {
    // They depend on specific error.code values
    const error = await getPaymentError();
    expect(error.code).toMatch(/^PAY_/);
  });

  it('maintains response time SLA', async () => {
    const start = Date.now();
    await processPayment(testData);
    expect(Date.now() - start).toBeLessThan(200);
  });
});
These are social tests. They catch when your changes violate unstated assumptions that other teams rely on.
Glue's ownership maps show you this. You can see which team owns what code, what their patterns are, and what's likely to explode if you change it. Not because of code quality — because of organizational dependencies.
Load Testing the Weird Stuff
AI loves generating code that works perfectly at n=1 and dies at n=1000.
I've seen Cursor generate a beautiful user import function that loaded a CSV, validated each row, and inserted into Postgres. Worked great in tests with 10 rows. In production with 50,000 rows, it OOM'd the server because the AI used an in-memory array for everything.
Your unit tests won't catch this. You need load tests:
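A rough sketch for the CSV import story above. importUsers and the row format are hypothetical, and the memory check is crude, but it's enough to flag the load-everything-into-an-array version:

// assumes describe/it/expect from your test runner (the third arg to it() is a timeout)
import { importUsers } from '../src/import-users'; // hypothetical AI-generated import

describe('Load: user import', () => {
  it('survives a production-sized CSV without eating all the memory', async () => {
    // ~50,000 rows, roughly what production actually sends
    const rows = Array.from(
      { length: 50_000 },
      (_, i) => `user${i},user${i}@example.com,2024-01-01`
    );
    const csv = 'name,email,created_at\n' + rows.join('\n');

    const before = process.memoryUsage().heapUsed;
    await importUsers(csv);
    const growth = process.memoryUsage().heapUsed - before;

    // Crude, but it catches "hold every row in an in-memory array"
    expect(growth).toBeLessThan(200 * 1024 * 1024); // < 200 MB of heap growth
  }, 120_000); // two minutes before we call it a hang
});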
Not comprehensive. Just enough to catch obvious algorithmic disasters.
Manual Testing: The Dirty Secret
You need to actually use the feature. Not just run tests. Actually click through it like a user.
This sounds obvious but nobody does it. You ship the PR as soon as CI is green.
Spend 10 minutes in your feature:
Click everything twice
Try invalid inputs
Check the network tab for sketchy requests
Look at what got logged
See if error states are sane
You'll find bugs. Every time. Because AI optimizes for "tests pass" not "feels right to use."
When to Trust, When to Rewrite
Some AI-generated code you can ship with light testing. Some you need to tear apart and rebuild.
Ship with confidence:
CRUD operations with clear specs
UI components with visual tests
API clients with schema validation (see the sketch after this list)
Data transformations with property tests
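For the API-client case, schema validation is cheap insurance. A sketch using zod; the schema fields and fetchUser are placeholders for whatever your client actually returns:

import { z } from 'zod';
import { fetchUser } from '../src/api/users'; // hypothetical AI-generated client

// The shape the rest of the app expects, written down once
const UserResponse = z.object({
  id: z.string(),
  email: z.string().email(),
  createdAt: z.string(),
});

it('client output matches the schema the rest of the app relies on', async () => {
  const user = await fetchUser('user-123');
  // Throws (and fails the test) if the AI-generated client drifted from the contract
  UserResponse.parse(user);
});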
Rewrite or heavily verify:
Security-critical code (auth, permissions, data access)
Performance-sensitive paths (hot loops, DB queries)
Error handling and recovery logic
Anything with complex state machines
If the AI-generated code has deeply nested conditionals or clever optimizations you don't understand, don't ship it. Simplify it, even if it's less "elegant."
Clear and tested beats clever and fast.
The Real Test: Can You Debug It?
Here's your litmus test. Six months from now, when this code breaks in production at 2 AM, can you debug it?
If the answer is "maybe" or "I'd need to re-read everything," your testing strategy failed. You didn't build enough understanding.
Good tests are also documentation. They show how code is supposed to behave, what edge cases exist, and what assumptions matter:
it('handles concurrent updates with optimistic locking', async () => {
  // This test documents a critical behavior
  // Future you (or your teammate) needs to know this exists
  const user = await createUser();
  const [update1, update2] = await Promise.allSettled([
    updateUser(user.id, { version: 1 }),
    updateUser(user.id, { version: 1 })
  ]);
  expect(update1.status === 'rejected' || update2.status === 'rejected')
    .toBe(true);
});
If your tests are just "verify output matches input," they're not helping future you.
Vibe Coding Isn't Going Away
AI tools are going to get better. You're going to ship faster. The gap between "writing code" and "understanding code" will keep growing.
Testing is how you manage that gap. Not 100% coverage theater. Practical, paranoid tests that validate the things AI can't see — system context, organizational requirements, unstated assumptions.
Use tools that give you codebase intelligence (like Glue) so you know what you're changing. Write property tests for unknown unknowns. Load test the weird stuff. And for fuck's sake, manually test your features.
Vibe coding is powerful. Just don't be the engineer debugging AI-generated code at 3 AM with no idea what it does.