Using Prompt Compression to Reduce Agentic Loop Costs

In this article, you will learn what prompt compression is, why it matters in agentic AI loops, and how to apply it using summarization and instruction distillation.
Topics we will cover include:
- Why agentic loops make token costs grow quadratically, and how prompt compression addresses this.
- An overview of key prompt compression techniques, including instruction distillation, recursive summarization, vector database retrieval, and LLMLingua.
- A working Python example that combines recursive summarization and instruction distillation to achieve meaningful token savings.
Introduction
Agentic loops in production can translate into high costs, particularly when LLMs and external applications are consumed via APIs, where billing is often tied closely to token usage.
The good news: prompt compression is one of the most effective strategies you can use to manage the high cost of agentic loops. This article presents several prompt compression methods and discusses how they can ease the financial burden of running agent loops.
Prompt Compression: Motivation and Common Strategies
Many agent frameworks, such as LangGraph and AutoGPT, have the agent keep the context of what it did in previous steps. Let's say your agent needs to take 10 to 20 steps to solve a problem. For step 1, it sends 500 tokens. In step 2, it has to send those previous 500 tokens plus the new information associated with this step, say about 1,000 tokens in total. This may grow to about 1,500 tokens in step 3, and so on. By the time we get to step 20, we have been "paying" to re-send the same information over and over again.
In the example above, the number of tokens sent per step (the full prompt size) grows linearly. The cumulative cost of the whole agent loop, however, grows quadratically rather than linearly, which leads to a cost explosion in long-running loops. This is where prompt compression comes in, with techniques such as selecting only the relevant content, summarizing, and more, as we'll discuss shortly.
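To make the arithmetic concrete, here is a minimal sketch that assumes, as in the example above, that every step adds a fixed 500 tokens and re-sends everything that came before. The function name and the fixed per-step size are illustrative assumptions, not part of any framework.

```python
def cumulative_tokens(steps: int, tokens_per_step: int) -> int:
    """Total tokens sent when each step re-sends the full history so far."""
    # Step k sends k * tokens_per_step tokens, so the total is an
    # arithmetic series: tokens_per_step * steps * (steps + 1) / 2.
    return tokens_per_step * steps * (steps + 1) // 2


steps, per_step = 20, 500

with_resend = cumulative_tokens(steps, per_step)  # full history every step
new_only = steps * per_step                       # only new tokens each step

print(f"Re-sending history: {with_resend} tokens")
print(f"New tokens only:    {new_only} tokens")
print(f"Ratio: {with_resend / new_only:.1f}x")
```

Under these assumptions, the 20-step loop sends 105,000 tokens in total, compared with 10,000 tokens of genuinely new information. Doubling the number of steps roughly quadruples the total.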
Example cost curve for agent loops with and without prompt compression
The issue is not only financial: there are also hidden costs related to latency, since long prompts take longer to process, and not all users are willing to wait 30 seconds for a response. Compressed prompts also enable faster inference and reduce compute overhead.
To put this in perspective, a 500K-token context can theoretically be reduced to a 32K-token compressed window that retains all the relevant information, while elements like repetitive JSON structures, stopwords, and low-value conversational parts are removed. Here are some cost-effective approaches and frameworks to consider when implementing your prompt compression strategy:
- Instruction distillation: this involves creating a "compressed" version of a long system prompt that is sent repeatedly, using symbols or shorthand that the model will still understand and interpret correctly.
- Recursive summarization: every few steps in the loop, use the agent itself or a small, cheap model like Llama 3 or GPT-4o-mini to summarize the context of the previous steps into a short paragraph that describes the current state of the task.
- Vector database (RAG) for history retrieval: instead of sending the full history over and over again, save it to a free local vector database like FAISS or Chroma. For any given prompt, only the most relevant past actions are retrieved and included in the context.
- LLMLingua: an open-source framework that is gaining popularity, focused on detecting and removing "irrelevant" tokens from the prompt before it is sent to a larger, more expensive language model.
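The history retrieval idea from the list above can be sketched without any external dependencies. The snippet below is only a toy stand-in: it uses bag-of-words counts and cosine similarity in place of a real embedding model and a vector database such as FAISS or Chroma, and the `embed`, `cosine`, and `retrieve` helpers and the sample history are hypothetical names and data introduced for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(history: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k history entries most similar to the query."""
    q = embed(query)
    ranked = sorted(history, key=lambda h: cosine(embed(h), q), reverse=True)
    return ranked[:k]


history = [
    "Step 1: searched the web for quarterly revenue figures",
    "Step 2: parsed the PDF report and extracted revenue tables",
    "Step 3: sent a status email to the project owner",
    "Step 4: looked up office weather forecast",
]

# Only the retrieved entries, not the full history, go into the next prompt.
relevant = retrieve(history, "What revenue numbers did we find?", k=2)
for entry in relevant:
    print(entry)
```

In a real system, the embedding and the nearest-neighbor search would be delegated to a proper embedding model and vector store; the point here is only that the prompt carries the top-k relevant entries rather than every past step.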
A Practical Example: Summarizing Agent
Below is an example of a cost-effective prompt compression technique that combines recursive summarization and instruction distillation using Python. The code is intended as a template for what the prompt compression mindset should look like when translated into a real, large-scale environment. It shows a simplified simulation of an agent loop, emphasizing the summarization and distillation steps:
    import tiktoken

    def count_tokens(text, model="gpt-4o"):
        # Look up the tokenizer for the given model and count the tokens in the text
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))

    def compress_history(history_list):
        """
        Function that simulates 'summarization'.
        In a real app, this would send the input to a small language model
        (like gpt-4o-mini) to condense it.
        """
        print("--- Compressing History ---")
        # In production, pass 'combined' to the summarization model
        combined = " ".join(history_list)
        # Distillation: shorthand version of the events
        summary = f"Summary of {len(history_list)} steps: Tasks A & B completed. Result: Success."
        return summary

    # 1. Distilled System Prompt (uses shorthand instead of prose)
    system_prompt = "Role: ResearchBot. Task: Find X. Output: JSON only. Constraints: No fluff."

    # 2. The Agentic Loop
    history = []

    for step in range(1, 6):
        action = f"Step {step}: Agent performed a lengthy search for data point {step}..."
        history.append(action)

        # Calculate what the prompt WOULD look like without compression
        current_full_context = system_prompt + " ".join(history)
        raw_tokens = count_tokens(current_full_context)
        print(f"Loop {step} | Full Context Tokens: {raw_tokens}")

    # 3. Applying Compression
    compressed_context = system_prompt + compress_history(history)
    compressed_tokens = count_tokens(compressed_context)

    print(f"\nFinal Uncompressed Tokens: {raw_tokens}")
    print(f"Final Compressed Tokens: {compressed_tokens}")
    print(f"Savings: {((raw_tokens - compressed_tokens) / raw_tokens) * 100:.1f}%")
This code shows how to periodically replace the accumulated list of actions with a summary consisting of a single string, which avoids paying for the same context tokens in every iteration of the loop. Try using a smaller, cheaper, or local model like Llama 3 for the summarization step.
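The periodic replacement described here could look roughly like the sketch below, which collapses the history every few steps inside the loop rather than once at the end. It rests on several assumptions: `summarize` is a hypothetical stub standing in for a call to a small summarization model, `summarize_every` is an illustrative parameter, and word counts are used as a rough proxy for tokens so the snippet runs without a tokenizer.

```python
def summarize(entries, steps_covered):
    """Hypothetical stub for a call to a small, cheap summarization model."""
    return f"Summary of steps 1-{steps_covered}: tasks completed so far."


def run_loop(total_steps, summarize_every=3):
    """Simulate an agent loop; return the context size (in words) per step."""
    history = []  # context carried into the next step
    sizes = []    # words sent at each step (rough proxy for tokens)
    for step in range(1, total_steps + 1):
        history.append(f"Step {step}: agent action and observation.")
        sizes.append(sum(len(entry.split()) for entry in history))
        # Every few steps, collapse the whole history into a single summary
        if summarize_every and step % summarize_every == 0:
            history = [summarize(history, step)]
    return sizes


compressed = run_loop(9, summarize_every=3)
uncompressed = run_loop(9, summarize_every=None)

print("Per-step size with summarization:   ", compressed)
print("Per-step size without summarization:", uncompressed)
print("Total words sent:", sum(compressed), "vs", sum(uncompressed))
```

With summarization, the per-step context size resets after each summary instead of growing without bound, so the cumulative cost stays close to linear in the number of steps. The trade-off is that detail dropped by the summarizer is no longer available to later steps.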
Regarding distillation, this example shows what you actually do:
A standard 42-token prompt such as "You are a helpful research assistant. Your goal is to find information about X. Please provide the output in a valid JSON format and do not include any conversational filler." can be distilled into this 12-token prompt: "Role: ResearchBot. Task: Find X. Output: JSON. No fluff." The model will understand it in almost the same way. Imagine a 100-step loop: this 30-token difference alone saves about 3,000 tokens on the system prompt (30 tokens × 100 steps).
Output (exact token counts may vary with the tokenizer version):
    Loop 1 | Full Context Tokens: 37
    Loop 2 | Full Context Tokens: 55
    Loop 3 | Full Context Tokens: 73
    Loop 4 | Full Context Tokens: 91
    Loop 5 | Full Context Tokens: 109
    --- Compressing History ---

    Final Uncompressed Tokens: 109
    Final Compressed Tokens: 36
    Savings: 67.0%
Wrapping up
Prompt compression is not a minor optimization; it is a practical requirement for any agent system that runs more than a few steps. The techniques discussed here, from instruction distillation and recursive summarization to RAG-based history retrieval and LLMLingua, each address a different angle of the quadratic cost problem, and they can be combined for even greater savings. As a starting point, recursive summarization paired with a distilled system prompt requires no additional infrastructure and can cut token usage significantly, as the example above shows.



