Software development is entering an era where the bottleneck isn’t writing code—it’s ensuring that code is correct. As AI tools like GitHub Copilot, Amazon CodeWhisperer, and Google’s AI assistants churn out billions of lines of code annually, enterprises face an uncomfortable truth: human review processes are drowning in volume. A San Francisco-based startup called Theorem is betting that the solution lies in automating the verification process itself, using AI to prove that AI-generated software behaves as intended.
The company, which emerged from Y Combinator’s Spring 2025 batch, has secured $6 million in seed funding led by Khosla Ventures, with participation from Y Combinator, e14, SAIF, Halcyon, and angel investors including Blake Borgesson and Arthur Breitman. Theorem’s approach combines formal verification—a mathematically rigorous method for proving software correctness—with AI-driven proof generation, slashing the time and expertise required to validate complex systems.
Why Formal Verification Matters Now
Formal verification isn’t new. For decades it has been the gold standard for mission-critical systems like avionics, nuclear reactors, and cryptographic protocols. The catch? It’s expensive. Traditional methods demand eight lines of mathematical proof for every single line of code, requiring PhD-level engineers and years of work. Even the verified cryptography code that now secures trillions of HTTPS connections took 15 person-years to complete.
Theorem’s co-founder, Jason Gross, knows this firsthand. Before founding the company, he led MIT research that produced verified cryptography code for HTTPS. Today, Theorem’s AI automates the proof-writing process, transforming what once required years into weeks—or even days. The result? A system that can validate AI-generated code at scale without overwhelming human reviewers.
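To make "lines of proof per line of code" concrete, here is what a machine-checked proof looks like in miniature. This is an illustrative toy in Lean 4, not Theorem's code: a one-line function, followed by a theorem that the proof assistant verifies mechanically.

```lean
-- A one-line function: double a natural number.
def double (n : Nat) : Nat := n + n

-- A machine-checked theorem: `double` agrees with multiplication by two.
-- The `omega` tactic discharges the linear-arithmetic goal automatically.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega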
How It Works: AI That Proves Itself
Theorem’s technology operates on a principle called fractional proof decomposition. Instead of exhaustively checking every possible code behavior, a computationally infeasible task for complex systems, the platform allocates verification effort according to the importance of each component. This targeted approach recently uncovered a bug in Anthropic’s Claude chatbot that had evaded traditional testing.
In a technical demonstration called SFBench, Theorem’s AI translated 1,276 formal proofs from one verification language (Rocq) to another (Lean) and automatically verified their equivalence. A human team would have taken roughly 2.7 person-years to complete the same work. The system also handles interdependent code—where solutions span dozens of files—something conventional AI agents struggle with due to limited context windows.
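As a flavor of what such a translation involves (a toy example for illustration, not drawn from SFBench), here is the same trivial statement, commutativity of addition on natural numbers, expressed first in Rocq and then in Lean 4. SFBench's task is to perform this kind of translation at scale and mechanically check that the translated proof establishes the same statement.

```coq
(* Rocq: commutativity of addition, via the standard library lemma. *)
Require Import PeanoNat.

Lemma add_comm_nat : forall a b : nat, a + b = b + a.
Proof. intros a b. apply Nat.add_comm. Qed.
```

```lean
-- Lean 4: the same statement, proved with the corresponding core lemma.
theorem add_comm_nat (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The hard part is not this one-liner but doing the same faithfully for proofs that span many interdependent files, which is where automation pays off.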
Theorem’s customers, ranging from AI research labs to GPU-accelerated computing firms, are already seeing results. One client brought a 1,500-page specification and a legacy system riddled with memory leaks and crashes. Their goal: boost throughput from 10 Mbps to 1 Gbps, a 100x improvement, without introducing new errors. Theorem generated 16,000 lines of production-ready code, which the client deployed without manual review. The confidence came from a compact, executable specification paired with automated equivalence checks.
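The idea of checking a rewrite against a compact executable specification can be sketched in miniature. The example below is hypothetical and is not Theorem's tooling: a simple reference function serves as the spec, an "optimized" rewrite must match it, and a randomized harness checks agreement across many inputs.

```python
import random

def spec_checksum(data: bytes) -> int:
    """Executable specification: simple, obviously-correct reference behavior."""
    total = 0
    for byte in data:
        total = (total + byte) % 65521
    return total

def fast_checksum(data: bytes) -> int:
    """Optimized rewrite whose behavior must match the spec exactly."""
    return sum(data) % 65521

def check_equivalence(trials: int = 1000) -> bool:
    """Randomized equivalence check: spec and rewrite must agree on every trial."""
    rng = random.Random(0)  # fixed seed for reproducibility
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        if spec_checksum(data) != fast_checksum(data):
            return False
    return True
```

A randomized harness like this only samples the input space; a formal equivalence check of the kind the article describes instead proves agreement for all inputs. The sketch is meant only to show the spec-versus-implementation structure that makes such checks possible.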
A Race Against AI’s Own Limitations
The stakes are high. AI is accelerating software development in sectors where reliability is non-negotiable: financial systems, medical devices, power grids, and transportation networks. Yet as AI tools become more capable, the risks grow with them. Gross warns that the cost of exploiting vulnerabilities is plummeting, creating an asymmetric threat landscape. The only viable defense? Verification that scales with AI’s speed.
Software security is a delicate balance, Gross notes: if we want a solution that outlasts the next generation of models, verification is the only path forward. He adds that regulators may soon treat the absence of formal verification in critical systems as gross negligence, now that the technology is affordable.
Theorem isn’t the only player in this space, but its focus on software oversight—rather than broader mathematical verification—sets it apart. The team, which includes Gross and co-founder Rajashree Agrawal (a machine learning research engineer), is targeting industries like robotics, renewable energy, cryptocurrency, and drug synthesis. With the new funding, the company plans to expand its team, ramp up compute resources, and push into new verticals.
The race is on. AI is writing the future of software—faster than ever before. Now, someone has to ensure it’s done right.