This project is a proof-of-concept middleware designed to reduce the token cost of long conversations with large language models (LLMs) by converting extensive message histories into a single image.
For a detailed walkthrough and explanation of this project, check out the Medium article: Cutting LLM Costs by Converting Long Text into Images
When interacting with LLMs like GPT-4, the entire conversation history is typically sent with each new request. As the conversation grows, so does the number of tokens, leading to significantly higher API costs. A conversation with thousands of words can quickly become expensive to maintain.
This project introduces a Flask-based proxy server that acts as middleware between your application and the LLM API (e.g., OpenAI's). The middleware monitors the conversation's word count; when the count exceeds a predefined threshold (currently 750 words), it performs the following steps:
- Image Conversion: The entire conversation history is rendered as a text-based image.
- Payload Reconstruction: The original message history is replaced with a new payload suitable for a vision-capable model (such as GPT-4o), sketched below. The new payload contains:
  - The generated image of the conversation history.
  - A final prompt instructing the model to use the image as context when answering the user's latest message.
- Cost Savings: By representing thousands of words as a single image, the middleware drastically reduces the number of tokens sent, leading to significant savings on long conversations.
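To make the reconstruction step concrete, here is a minimal sketch of what the rebuilt payload might look like. The function name `build_image_payload` and the exact prompt wording are illustrative assumptions, not the middleware's actual code; it assumes the OpenAI-style vision message format, where images are passed as base64 data URLs.

```python
import base64

def build_image_payload(model, image_png_bytes, latest_user_message):
    """Illustrative sketch: swap a long text history for one image + a prompt.

    Names and prompt text here are assumptions, not the project's
    actual implementation.
    """
    b64 = base64.b64encode(image_png_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "The attached image contains our conversation so far. "
                            "Use it as context to answer my latest message: "
                            + latest_user_message
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }
```

Embedding the image as a base64 data URL keeps the proxy stateless: no separate image hosting is needed.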
The project consists of two main components:

- server.py, a Flask application that listens for requests on a local port.
  - It intercepts each request and calculates the total word count of the messages payload.
  - If the word count is over 750, it uses the TextToImageOptimizer class to generate a PNG image from the text (a sketch of this rendering step follows the list).
  - It then forwards a modified request to the target API with the image and the final user prompt.
  - If the word count is below the threshold, it simply forwards the request as is.
- request.py, a simple command-line interface for interacting with the proxy server.
  - It maintains a local message history and lets you chat in a loop.
  - After each message, it prints the conversation's total word count, so you can see when the image conversion will be triggered.
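The internals of TextToImageOptimizer aren't shown in this README; the sketch below illustrates how such a class might render the conversation text to a PNG with Pillow, including the font fallback described in the setup notes. The function name, layout parameters, and DejaVu font choice are assumptions, not the project's actual implementation.

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_to_png(text, path="history.png", width_chars=100,
                       font_size=14, margin=10):
    """Illustrative sketch: wrap conversation text and draw it onto a PNG."""
    try:
        # Prefer a real TrueType font (e.g. from the fonts-dejavu package).
        font = ImageFont.truetype("DejaVuSans.ttf", font_size)
    except OSError:
        # Fall back to PIL's built-in bitmap font if none is installed.
        font = ImageFont.load_default()

    # Wrap each paragraph to a fixed character width.
    lines = []
    for paragraph in text.splitlines():
        lines.extend(textwrap.wrap(paragraph, width=width_chars) or [""])

    # Rough canvas size estimate; a real implementation would measure text.
    line_height = font_size + 4
    img = Image.new(
        "RGB",
        (width_chars * font_size // 2 + 2 * margin,
         len(lines) * line_height + 2 * margin),
        "white",
    )
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line,
                  fill="black", font=font)
    img.save(path)
    return path
```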
Install the required Python libraries using the requirements file:
```bash
pip install -r requirements.txt
```

You may also need to install a font for the image generation. The script tries to find a suitable font; if none is available, it falls back to a default PIL font, which may not be ideal. On Debian/Ubuntu, you can install fonts with:

```bash
sudo apt-get install fonts-dejavu
```

When using the client or sending direct requests, ensure you provide your API key. For the client script, open the request script and set the API_KEY variable.
Start the middleware by running the server script:
```bash
python server.py
```

You will see a message indicating that the server is running on port 8000.
In a separate terminal, run the interactive client to chat:
```bash
python request.py
```

You can also interact with the proxy server directly from your own application or using tools like curl.
Send a POST request to the proxy's endpoint with the following headers:
- Content-Type: application/json
- Authorization: Bearer YOUR_API_KEY (Your actual API key)
- X-Target-Url: The URL of the target LLM API
The JSON payload should follow the standard structure of the target API. For OpenAI-compatible APIs, this includes:
- model: The name of the model you want to use.
- messages: A list of message objects, each with a role (system, user, or assistant) and content.
- Other optional parameters like stream, temperature, etc.
```python
import requests

PROXY_URL = "http://localhost:8000/v1/chat/completions"
API_KEY = "YOUR_API_KEY"
TARGET_URL = "https://api.openai.com/v1/chat/completions"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
    "X-Target-Url": TARGET_URL,
}

payload = {
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, what is the capital of France?"},
    ],
    "stream": False,
}

try:
    response = requests.post(PROXY_URL, headers=headers, json=payload)
    response.raise_for_status()
    print(response.json())
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")
```

Here is a basic example of how to send the same request using curl.
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "X-Target-Url: https://api.openai.com/v1/chat/completions" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello, what is the capital of France?"
      }
    ],
    "stream": false
  }'
```

Some directions for future improvement:

- Dynamic Threshold: Allow the word count threshold to be set via an environment variable or header (see the sketch after this list).
- Smarter History Truncation: Instead of converting the entire history, keep the last few messages as text and convert only the older parts to an image.
- Support for Other Media: Extend the middleware to handle other types of attachments or data.
- Async Processing: Improve performance by using an asynchronous web framework.
- Upload to PyPI: Package the project and upload it to the Python Package Index (PyPI) for easier distribution.
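As a rough illustration of the dynamic-threshold idea, the middleware could read the limit from the environment instead of hard-coding 750; the variable name below is hypothetical.

```python
import os

# Hypothetical: override the 750-word default via an environment variable.
WORD_THRESHOLD = int(os.environ.get("WORD_THRESHOLD", "750"))
```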