AI Sparks

Using Prompt Compression to Reduce Agentic Loop Costs

In this article, you will learn what prompt compression is, why it matters in agentic AI loops, and how to apply it using summarization and instruction distillation.

Topics we will cover include:

  • Why agentic loops make token costs grow quadratically, and how prompt compression addresses this.
  • A review of prompt compression techniques, including instruction distillation, recursive summarization, vector database retrieval, and LLMLingua.
  • A working Python example that combines recursive summarization and instruction distillation to achieve meaningful token savings.

Introduction

Running agentic loops in production can mean high costs, especially when both the LLM and external applications are accessed through APIs, where billing is usually tied closely to token usage.

The good news: prompt compression is one of the most effective strategies for keeping the cost of agentic loops under control. This article presents several prompt compression methods and discusses how they can ease the cost of running agentic loops.

Prompt Compression: Motivation and Common Strategies

Many agent frameworks, such as LangGraph and AutoGPT, require the agent to keep the context of what it did in previous steps. Let’s say your agent needs 10 to 20 steps to solve a problem. For step 1, it sends 500 tokens. In step 2, it has to send those previous 500 tokens plus the new information associated with this step, say about 1,000 tokens in total. This grows to about 1,500 tokens in step 3, and so on. By the time we get to step 20, we have been “paying” to re-send the same information over and over again.

In the example above, the number of tokens sent per step (the full size of the prompt) grows linearly. The cumulative cost of the whole agentic loop, however, grows quadratically rather than linearly, which is what makes long-running loops so expensive. This is where prompt compression comes in, with techniques such as selecting only the relevant content, summarization, and more, as we’ll discuss shortly.
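The arithmetic behind this can be checked with a few lines of Python. This is a minimal sketch that assumes, as in the example above, a constant 500 new tokens per step and a full re-send of the history at every step; the function names are illustrative only.

```python
def tokens_sent_at_step(step: int, tokens_per_step: int = 500) -> int:
    """Prompt size at a given step when the full history is re-sent."""
    return step * tokens_per_step


def cumulative_tokens(num_steps: int, tokens_per_step: int = 500) -> int:
    """Total tokens billed over the loop: t * (1 + 2 + ... + n) = t * n * (n + 1) / 2."""
    return sum(tokens_sent_at_step(s, tokens_per_step) for s in range(1, num_steps + 1))


for n in (5, 10, 20):
    # Per-step size grows linearly; the running total grows quadratically.
    print(f"steps={n:>2}  last prompt={tokens_sent_at_step(n):>6}  total={cumulative_tokens(n):>7}")
```

Under these assumptions, the final prompt at step 20 is 10,000 tokens, but the loop as a whole has sent 105,000 tokens. Doubling the number of steps roughly quadruples the total.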

Example cost curve for agentic loops with and without prompt compression

The issue is not only financial: there are also hidden costs in latency, since long prompts take longer to process, and not all users are willing to wait 30 seconds. Compressed prompts also enable faster inference and reduce compute overhead.

To put this in perspective, a 500K-token context can, in theory, be reduced to a compressed 32K-token window that retains all the relevant information, while elements such as repetitive JSON structures, stopwords, and low-value conversational filler are removed. Here are some cost-effective strategies and frameworks to consider when implementing prompt compression:

  • Instruction distillation: this involves creating a “compressed” version of a long system prompt that is sent repeatedly, using symbols or shorthand that the model will still understand and interpret.
  • Recursive summarization: every few steps in the loop, use the agent itself or a small, cheap model such as Llama 3 or GPT-4o-mini to summarize the context of the previous steps into a short paragraph that describes the current state of the task.
  • Vector database (RAG) for historical retrieval: instead of sending the full history over and over again, store it in a free, local vector store such as FAISS or Chroma. For any given prompt, only the most relevant past actions are retrieved and included in the context.
  • LLMLingua: an open-source framework that is gaining popularity, which focuses on detecting and removing “non-essential” tokens from a prompt before it is sent to a larger, more expensive language model.
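To make the retrieval idea concrete, here is a toy sketch in pure Python. It is not FAISS or Chroma: the bag-of-words `embed` function and the `retrieve` helper are hypothetical stand-ins for a real embedding model and vector store, used only to show the shape of "store every step, send back only the top matches".

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words counts (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(history: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k past steps most similar to the current query."""
    q = embed(query)
    return sorted(history, key=lambda step: cosine(embed(step), q), reverse=True)[:k]


history = [
    "Searched the web for quarterly revenue figures",
    "Parsed the PDF report and extracted revenue table",
    "Sent a status email to the project owner",
    "Checked the weather API for tomorrow's forecast",
]
relevant = retrieve(history, "summarize the revenue figures from the report", k=2)
print(relevant)  # only these steps would be added to the next prompt
```

In a real system the similarity search would be delegated to the vector store, but the effect on the prompt is the same: the agent sends two relevant steps instead of the whole history.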

A Practical Example: Summarizing Agent

Below is an example of a cost-effective prompt compression approach in Python that combines recursive summarization and instruction distillation. The code is intended as a template for how a prompt compression mindset might translate into a real, large-scale environment. It is a simplified simulation of the agent loop that emphasizes the summarization and distillation steps.
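A minimal sketch of such a loop might look like the following. Everything in it is an assumption made for illustration: `summarize` is a hypothetical stub where a call to a small, cheap model would go, `count_tokens` uses whitespace-separated words as a rough proxy for real tokens, and each step is pretended to produce a 50-word observation.

```python
def count_tokens(text: str) -> int:
    """Rough proxy: whitespace-separated words (a real tokenizer would differ)."""
    return len(text.split())


def summarize(text: str, max_words: int = 20) -> str:
    """Hypothetical stub for a cheap summarizer model; here it just keeps the last words."""
    return " ".join(text.split()[-max_words:])


# Distilled system prompt (instruction distillation).
SYSTEM_PROMPT = "Role: ResearchBot. Task: Find X. Output: JSON. No fluff."
SUMMARIZE_EVERY = 5  # compress the history every 5 steps


def run_agent(num_steps: int, compress: bool) -> int:
    """Simulate the loop and return the total tokens sent across all steps."""
    history: list[str] = []
    total = 0
    for step in range(1, num_steps + 1):
        # Every call re-sends the system prompt plus whatever history is kept.
        prompt = SYSTEM_PROMPT + "\n" + "\n".join(history)
        total += count_tokens(prompt)
        # Pretend the agent acted and produced a 50-word observation.
        history.append(f"step {step}: " + "observation " * 48)
        if compress and step % SUMMARIZE_EVERY == 0:
            # Recursive summarization: replace the list with one summary string.
            history = [summarize(" ".join(history))]
    return total


baseline = run_agent(20, compress=False)
compressed = run_agent(20, compress=True)
print(f"baseline={baseline} tokens, compressed={compressed} tokens")
```

The exact numbers depend entirely on the stubbed observation size and summary length, so they should not be read as a benchmark. The point is the shape: without compression the history keeps growing, while with periodic summarization it is reset to a short string every few steps.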

This code shows how to periodically replace the accumulated list of actions with a summary consisting of a single string, which avoids paying for the same context tokens in every iteration of the loop. Try using a smaller, cheaper, or local model such as Llama 3 for the summarization step.

Regarding distillation, here is what it looks like in practice:

A standard 42-token prompt such as “You are a helpful research assistant. Your goal is to find information about X. Please provide the output in a valid JSON format and do not include any conversational filler.” can be condensed into this 12-token prompt: “Role: ResearchBot. Task: Find X. Output: JSON. No fluff.” The model will interpret it in almost the same way. Now imagine a 100-step loop: this difference of 30 tokens alone saves about 3,000 tokens on the system prompt.
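The savings follow directly from the figures quoted above. As a quick check, assuming the 42-token and 12-token counts stated in the text (actual counts depend on the tokenizer) and a system prompt that is re-sent on every step:

```python
VERBOSE_TOKENS = 42    # token count quoted in the text for the verbose prompt
DISTILLED_TOKENS = 12  # token count quoted in the text for the distilled prompt


def system_prompt_savings(num_steps: int) -> int:
    """Tokens saved over a loop when the distilled prompt is sent at every step."""
    return (VERBOSE_TOKENS - DISTILLED_TOKENS) * num_steps


print(system_prompt_savings(100))  # 3000
```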

Wrapping up

Prompt compression is not a minor optimization; it is a practical requirement for any agentic system that runs for more than a few steps. The techniques discussed here, from instruction distillation and recursive summarization to RAG-based history retrieval and LLMLingua, each address a different angle of the quadratic cost problem, and they can be combined for even greater savings. As a starting point, recursive summarization paired with a distilled system prompt requires no additional infrastructure and can cut token usage significantly, as the example above shows.
