Traditional AI memory systems struggle to keep up with the demands of modern agentic workflows. While retrieval-augmented generation (RAG) excels at pulling in fresh external data, it often fails to retain the nuanced, long-term context that production agents require. Now, a new approach called observational memory—developed by Mastra, a startup founded by the original creators of Gatsby—is turning the problem on its head by eliminating retrieval entirely.

Instead of fetching context dynamically, observational memory relies on two specialized agents: the Observer, which compresses raw conversation history into dated, event-based logs, and the Reflector, which periodically reorganizes those logs to remove redundancy. The result is a system that maintains stable context windows, cuts token costs by up to 90%, and outperforms RAG in benchmarks measuring long-term memory retention.

For enterprises deploying AI agents in customer support, document processing, or site reliability workflows, the implications are significant. Forgetting a user’s past requests or preferences isn’t just inefficient—it’s a usability killer. Observational memory addresses that by treating memory as a first-class primitive, not an afterthought.

How It Works: Compression Over Retrieval

Most memory systems today rely on RAG’s dynamic retrieval model: when an AI agent needs context, it queries a vector database and injects the results into the prompt. The problem? Each retrieval changes the prompt, which invalidates the provider’s prompt cache, so APIs from OpenAI and Anthropic bill the full input-token rate on every turn, even when most of the underlying context hasn’t changed.

Observational memory flips this model. The system divides the context window into two sections: a stable observation log (compressed, dated notes) and a volatile raw history buffer (current session messages). When the raw buffer hits 30,000 tokens, the Observer agent processes it, extracting key decisions and events into new observations. These are appended to the log, while the raw messages are discarded. When the log itself reaches 40,000 tokens, the Reflector agent kicks in, merging related observations and pruning outdated entries.
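As a rough sketch of the mechanism described above, the two-tier window can be modeled as a stable log plus a volatile buffer with threshold-triggered compaction. The type names, token estimator, and `observe`/`reflect` callbacks below are illustrative assumptions, not Mastra’s actual API:

```typescript
// Hypothetical sketch of the two-tier context window described above.
// Names and the token estimate are illustrative, not Mastra's API.
interface Observation {
  date: string; // when the event happened
  note: string; // compressed, decision-focused summary
}

const OBSERVE_AT = 30_000; // raw-buffer threshold (tokens)
const REFLECT_AT = 40_000; // observation-log threshold (tokens)

class ObservationalMemory {
  log: Observation[] = [];
  raw: string[] = [];

  // crude token estimate: ~4 characters per token
  private tokens(texts: string[]): number {
    return Math.ceil(texts.join("\n").length / 4);
  }

  addMessage(
    msg: string,
    observe: (raw: string[]) => Observation[],
    reflect: (log: Observation[]) => Observation[],
  ): void {
    this.raw.push(msg);
    if (this.tokens(this.raw) >= OBSERVE_AT) {
      // Observer: compress raw history into dated observations,
      // then discard the raw messages.
      this.log.push(...observe(this.raw));
      this.raw = [];
    }
    if (this.tokens(this.log.map((o) => o.note)) >= REFLECT_AT) {
      // Reflector: merge related observations, prune outdated ones.
      this.log = reflect(this.log);
    }
  }

  // Prompt = stable observation prefix + volatile raw suffix.
  prompt(): string {
    const prefix = this.log.map((o) => `[${o.date}] ${o.note}`).join("\n");
    return `${prefix}\n---\n${this.raw.join("\n")}`;
  }
}
```

The key property is that between observation runs, the prefix stays byte-identical from turn to turn, which is what makes provider-side prompt caching effective.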

The compression isn’t just about space—it’s about preserving actionable details. Traditional compaction methods (like those used in coding agents) summarize conversations into documentation-style blobs, losing granularity. Observational memory, by contrast, generates a decision log: a chronological record of what was decided, when, and why. This structure ensures agents can act consistently over weeks or months without losing critical context.
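To make the distinction concrete, here is a hypothetical side-by-side of the two styles; the content is invented for illustration and does not come from Mastra:

```typescript
// Compaction-style summary: collapses the "when" and "why" into a blob.
const summaryBlob =
  "The user configured the deployment pipeline and discussed rollback policy.";

// Decision log: dated entries recording what was decided, when, and why.
const decisionLog = [
  "[2024-03-04] Decided on blue-green releases; user cited zero-downtime requirement.",
  "[2024-03-11] Rollback window set to 24h after user reported a failed canary.",
];
```

An agent reading the decision log weeks later can still answer "why blue-green?"; the summary blob cannot.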

  • Text compression ratios: 3–6x for standard conversations, 5–40x for tool-heavy workloads.
  • No vector databases required: Pure text-based storage simplifies deployment.
  • Configurable thresholds: Observation and reflection triggers are adjustable.

On the LongMemEval benchmark—designed to test an AI’s ability to retain context across extended interactions—observational memory scored 94.87% using GPT-5-mini, with a completely stable context window. When tested against Mastra’s own RAG implementation on GPT-4o, it achieved 84.23% accuracy compared to RAG’s 80.05%. The margin may seem small, but in production systems where forgetting a single preference can disrupt workflows, those points matter.


The real advantage, however, lies in cost efficiency. By maintaining a static observation prefix, the system achieves near-full cache hits for most interactions—reducing token costs by 4–10x compared to dynamic retrieval. Even during reflection (which occurs infrequently), only a portion of the cache is invalidated, preserving budget predictability.
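A back-of-envelope sketch shows why the stable prefix pays off. The prices and the 10% cached-token discount below are illustrative placeholders (actual rates and discounts vary by provider):

```typescript
// Illustrative cost comparison: stable observation prefix vs. dynamic retrieval.
// Prices are hypothetical, not any provider's actual rates.
const PRICE_PER_TOKEN = 1e-6; // uncached input token
const CACHED_DISCOUNT = 0.1;  // cached tokens billed at ~10% of full rate (assumption)

const prefixTokens = 40_000;  // stable observation log
const suffixTokens = 5_000;   // volatile raw history

// Dynamic retrieval rewrites the prompt each turn: nothing hits the cache.
const ragCost = (prefixTokens + suffixTokens) * PRICE_PER_TOKEN;

// Observational memory keeps the prefix byte-identical, so it stays cached.
const obsCost =
  prefixTokens * PRICE_PER_TOKEN * CACHED_DISCOUNT +
  suffixTokens * PRICE_PER_TOKEN;

console.log((ragCost / obsCost).toFixed(1)); // prints "5.0" under these assumptions
```

A 5x saving at these made-up numbers sits inside the 4–10x range the article cites; the exact multiple depends on prefix size and the provider’s cache pricing.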

Enterprise Adoption: Where RAG Falls Short

Observational memory isn’t a replacement for RAG—it’s a specialized alternative for use cases where long-term stability outweighs the need for broad external knowledge. Early adopters include:

  • AI-powered CMS agents (e.g., Sanity, Contentful) that must remember user-specific content requests across sessions.
  • SRE tools tracking alert investigations and resolution steps over weeks.
  • Document automation systems processing multi-step workflows (e.g., contract reviews, compliance checks).

For these applications, forgetting is a dealbreaker. A support agent that can’t recall a user’s past interactions or an SRE tool that loses track of resolved incidents creates friction that users notice immediately. Observational memory solves this by treating memory as a system of record, not just a utility.

Availability and Integration

The system is now available as part of Mastra 1.0, with plug-ins released this week for LangChain, Vercel’s AI SDK, and other frameworks. This allows developers to integrate observational memory into existing agentic workflows without rewriting core infrastructure.

For teams evaluating memory solutions, the key questions are:

  • Do your agents need dynamic external retrieval (RAG) or stable internal context (observational)?
  • How cost-sensitive are your workflows? Observational memory’s caching advantages could slash spending by 90%.
  • What’s your tolerance for lossy compression? RAG preserves broader knowledge but at the cost of stability.
  • Are your agents tool-heavy? Observational memory excels at compressing large outputs (e.g., API interactions, multi-step decisions).

As AI agents move from lab experiments to embedded systems—where reliability and cost efficiency are non-negotiable—memory architecture may become as critical as model choice. Observational memory proves that sometimes, simpler isn’t just easier—it’s more powerful.