AI Summary: Google DeepMind is dramatically expanding AI context windows from thousands to millions of tokens, rapidly deploying the capability in Gemini 1.5 Pro and effectively giving models a much larger working memory. The expanded window lets AI process vast amounts of information at once and complements systems like RAG for even larger datasets. It has been tested up to 10 million tokens, though cost is currently the limiting factor, and future work focuses on improving the quality and efficiency of using these expanded contexts.
May 13, 2025, 16:38

At Google DeepMind, Nikolay Savinov and his team were attempting something that seemed almost impossibly ambitious: teaching AI models to process not just thousands, but millions of tokens of context at once.
"When I started working on long context, the competition was about 128K or 200K tokens at most," says Savinov, Staff Research Scientist at Google DeepMind and co-lead for long context pre-training. "I thought, well, 1 million is an ambitious enough step forward. It was about 5x compared to 200K."
What happened next surprised even the researchers themselves. The breakthrough came faster than expected, leading to rapid deployment in Google's Gemini 1.5 Pro model in early 2024 – and, soon after, an expansion to 2 million tokens.
This exponential leap in what AI systems can "remember" during a conversation isn't just a technical achievement. It represents a fundamental shift in how these systems can be used, opening pathways to applications that weren't previously feasible.
Understanding the Basics: What Is a Token?
Before diving deeper, it's worth understanding what exactly we mean by "tokens" – the basic unit of measurement for context windows in AI language models.
"The way you should think about a token, it's basically slightly less than one word, in case of text," Savinov explains. "A token could be a word, part of a word, or things like punctuation – commas, full stops, et cetera."
This way of breaking down language creates some peculiar challenges. When asked why AI uses tokens instead of processing text character-by-character like humans do, Savinov points to efficiency: "The generation is going to be slower because you generate roughly one token at a time. If you are generating a word in one go, it's going to be much faster than generating every character separately."
This tokenization approach is why some seemingly simple tasks – like counting the number of R's in "strawberry" – can sometimes trip up even advanced AI systems. Because the model may see "strawberry" as a single token rather than ten individual characters, it doesn't naturally develop the ability to analyze characters within tokens without specific training.
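To make this concrete, here is a minimal sketch using the open-source tiktoken tokenizer as a stand-in (an assumption for illustration only; Gemini uses its own tokenizer, which isn't shown here). It shows how a word becomes a short list of subword ids rather than individual characters, which is why letter counting needs an extra, character-level step.

```python
# Rough illustration with the open-source tiktoken BPE tokenizer
# (pip install tiktoken). This is a stand-in, not Gemini's tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "strawberry"
token_ids = enc.encode(text)

print(token_ids)                              # a short list of subword ids
print([enc.decode([t]) for t in token_ids])   # the subword pieces the model "sees"
print(len(text), "characters vs", len(token_ids), "tokens")

# The model predicts token ids, not characters, so character-level questions
# ("how many r's?") require reasoning the tokenization doesn't give for free.
print(text.count("r"), "r's when counted character by character")
```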
The Context Window: An AI's Working Memory
The context window represents the amount of information an AI model can access and reference during a conversation – essentially functioning as its working memory.
"Context windows are basically exactly these context tokens that we are feeding into an LLM," Savinov says. "It could be the current prompt or the previous interactions with the user. It could be files that the user uploaded, like videos or PDFs."
AI systems actually draw knowledge from two distinct sources. The first is what Savinov calls "in-weight" or "pre-training memory" – knowledge the system absorbed during its initial training on internet data. The second is "in-context memory" – information explicitly provided during the current conversation.
This distinction matters because updating in-context memory is significantly easier than updating knowledge embedded in the model's weights. Context windows thus become crucial for providing current information, private data, or rare facts the model wouldn't have encountered enough during training to reliably remember.
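As a rough illustration of the second source, in-context memory, the sketch below assembles prior turns, uploaded material, and the current question into one budgeted prompt. The class, the field names, and the crude token estimate are hypothetical stand-ins, not an actual API.

```python
from dataclasses import dataclass, field

# Toy model of "in-context memory": everything packed into the prompt at
# request time, as opposed to "in-weight" knowledge learned in pre-training.
approx_tokens = lambda s: max(1, len(s) // 4)   # rough heuristic, not a real tokenizer

@dataclass
class ContextWindow:
    max_tokens: int
    parts: list[str] = field(default_factory=list)

    def add(self, text: str) -> bool:
        used = sum(approx_tokens(p) for p in self.parts)
        if used + approx_tokens(text) > self.max_tokens:
            return False            # window full: drop, summarize, or retrieve instead
        self.parts.append(text)
        return True

    def render(self) -> str:
        return "\n\n".join(self.parts)

# Fresh or private facts have to be placed in context; the model's weights
# won't reliably contain them.
ctx = ContextWindow(max_tokens=1_000_000)
ctx.add("Prior turn: user asked about Q3 revenue.")
ctx.add("Uploaded report (extracted text): ...")
ctx.add("Based on the information above, summarize the key risks.")
prompt = ctx.render()
```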
Long Context and RAG: Complementary Rather Than Competitive
One of the most hotly debated topics in AI circles today is whether expanding context windows will render Retrieval Augmented Generation (RAG) systems obsolete. RAG systems work by searching through large databases for relevant information, then feeding only the most pertinent pieces into an AI's context window.
Savinov firmly believes they'll coexist: "Enterprise knowledge bases constitute billions of tokens, not millions. For this scale, you still need RAG."
Rather than competing approaches, he sees a powerful synergy. "What I think is going to happen in practice is that long context and RAG are going to work together. The benefit of long context for RAG is that you will be able to retrieve more relevant needles from the haystack by using RAG."
With shorter context windows, developers had to be extremely selective about what information to include. Longer windows allow for more comprehensive retrieval, potentially improving the quality of responses by ensuring important context isn't left out.
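A minimal sketch of this combined pattern follows: a retriever ranks chunks, and the size of the context window determines how many of them actually make it into the prompt. The keyword scorer and the token estimate are toy stand-ins for a real embedding search and tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float

def keyword_score(question: str, text: str) -> float:
    # Toy relevance score standing in for a real embedding / vector search.
    q, t = set(question.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

def build_long_context(question: str, corpus: list[str], budget_tokens: int) -> str:
    approx_tokens = lambda s: max(1, len(s) // 4)   # crude token estimate
    ranked = sorted((Chunk(c, keyword_score(question, c)) for c in corpus),
                    key=lambda ch: ch.score, reverse=True)

    # A million-token budget keeps far more candidates than an 8K window would,
    # so relevant "needles" are less likely to be filtered out prematurely.
    selected, used = [], 0
    for ch in ranked:
        cost = approx_tokens(ch.text)
        if used + cost > budget_tokens:
            break
        selected.append(ch.text)
        used += cost

    # The question goes after the retrieved context (this also helps caching).
    return "\n\n".join(selected) + "\n\nBased on the information above: " + question
```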
Testing the Limits: From 1 Million to 10 Million Tokens
How far can context windows expand? According to Savinov, there don't seem to be fundamental technical barriers to scaling much further – the team has already tested systems with 10 million token contexts.
"When we released the 1.5 Pro model, we actually ran some inference tests at 10 million, and we got some quality numbers as well. For single-needle retrieval, it was almost perfect for the whole 10 million context. We could have shipped this model," Savinov reveals.
The primary limiting factor right now is cost. "It's pretty expensive to run this inference. I guess we weren't sure if people are ready to pay a lot of money for this, so we started with something more reasonable, in terms of the price."
Looking ahead, Savinov believes scaling beyond current limits will require more innovations, not just brute-force scaling: "My feeling is that we actually need more innovations. It's not just a matter of brute-force scaling. To actually have close to perfect 10 million context, we need more innovations."
Best Practices for Developers
For developers looking to leverage long context windows effectively, Savinov offers several practical recommendations:
- Use context caching: "The first time you supply a long context to the model and you're asking a question, it's going to take longer and it's going to cost more. If you're asking the second question after the first one on the same context, then you can rely on context caching to make it both cheaper and faster." (A sketch of this pattern follows the list.)
- Place questions after the context: "If you want to rely on caching and profit from cost saving, put the question after the context. If you put it at the beginning, your caching is going to start from scratch."
- Combine with RAG for larger datasets: "If you need to go into billions of tokens of context, then you need to combine with RAG."
- Don't include irrelevant information: "Don't pack the context with irrelevant stuff. It's going to affect multi-needle retrieval."
- Give explicit prompting guidance: "It's beneficial to resolve contradiction explicitly by careful prompting. For example, start your question with saying, 'based on the information above.' This gives a hint to the model that it actually has to rely on in-context memory instead of in-weight memory."
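The sketch below illustrates the first two recommendations conceptually: when the shared context forms a stable prefix of the prompt, it can be cached, whereas putting the question first would change the prefix on every request. This is a local illustration using a hash-keyed dictionary, not the actual Gemini caching API, where the cached state lives server-side.

```python
import hashlib

# Conceptual illustration of context caching and question placement.
_prefix_cache: dict[str, str] = {}

def ask(shared_context: str, question: str) -> str:
    # The question comes after the context, so the expensive part of the
    # prompt is a stable prefix that can be reused across requests.
    prompt = shared_context + "\n\nBased on the information above: " + question
    key = hashlib.sha256(shared_context.encode()).hexdigest()
    if key in _prefix_cache:
        print("cache hit: reusing the already-processed context")
    else:
        print("cache miss: paying to process the full context")
        _prefix_cache[key] = "processed"   # stand-in for cached model state
    return prompt                          # stand-in for the model's answer

# If the question were placed before the context, the prompt prefix would
# differ on every request and the cache would never hit.
doc = "(millions of tokens of project documentation)"
ask(doc, "Summarize the open risks.")
ask(doc, "List the action items.")   # second call reuses the cached prefix
```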
The Connection Between Long Context and Reasoning
Perhaps one of the most intriguing aspects of long context models is their relationship to reasoning capabilities. Savinov sees a deep connection between the two.
"If the next token prediction task improves with the increasing context length, then you can interpret this in two ways," he explains. "One way is to say, I'm going to load more context into the input, and predictions for my short answer are going to improve as well. But another way to look at this is that output tokens are very similar to input tokens. If you allow the model to feed the output into its own input, it kind of becomes like input."
This creates a powerful dynamic where the model can essentially write to its own memory, overcoming limitations imposed by network depth. "If you need to make many logical jumps through the context when making a prediction, then you are limited by the network depth. But if you imagine that you are feeding the output into the input, then you are not limited anymore."
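The mechanism Savinov describes is ordinary autoregressive decoding, sketched below: each generated token is appended to the input before the next prediction, so intermediate reasoning tokens become context the model can condition on later. The `model_next_token` callable is a hypothetical stand-in for a single forward pass.

```python
def generate(model_next_token, prompt_tokens: list[int],
             max_new_tokens: int, eos_id: int) -> list[int]:
    """Plain autoregressive decoding: every output token is appended to the
    input, so intermediate tokens act as extra working memory the model can
    read on later steps, beyond what its fixed network depth allows."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = model_next_token(tokens)   # hypothetical single forward pass
        tokens.append(nxt)               # output becomes input for the next step
        if nxt == eos_id:
            break
    return tokens

# Toy usage with a dummy "model" that emits a counter until it reaches EOS.
dummy = lambda toks: min(len(toks), 5)
print(generate(dummy, [0, 1], max_new_tokens=10, eos_id=5))
```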
This relationship explains why models capable of handling longer contexts often demonstrate improved reasoning capabilities as well – they're fundamentally connected abilities.
Current Limitations and Future Directions
Long context enables entirely new applications like comprehensive document analysis, code generation spanning entire repositories, and more sophisticated reasoning across complex information landscapes. It's particularly transformative for knowledge-intensive fields where connecting information across vast datasets is essential.
As context windows continue to expand and models get better at utilizing this expanded memory effectively, we're likely to see AI systems that can maintain much more coherent understanding across long conversations, analyze entire books or codebases at once, and synthesize information from diverse sources with greater accuracy.
While long context models have made enormous strides, several limitations remain. Output length still lags behind input capacity, and handling difficult retrieval tasks with similar distractors remains challenging. Looking to the future, Savinov offers some predictions:
"What I think is going to happen first is, the quality of the current 1 or 2 million contexts is going to increase dramatically, and we are going to max out pretty much all the retrieval-like tasks quite soon."
The most significant developments may come in improving quality rather than simply expanding window size further. This means better handling of complex, multi-part queries across large documents and improved integration of information throughout extremely long contexts.