API Documentation

Summarization controls and endpoints for compressing long-running conversations into durable context artifacts with lower token overhead.

Summarization

Long conversations can exceed context window limits and increase costs. Mnexium's Summarization feature automatically compresses older messages into concise summaries while preserving recent messages verbatim.

When enabled, Mnexium generates rolling summaries of your conversation history. Summaries are cached and reused across requests, so you only pay for summarization once per conversation segment.

Use the `summarize` parameter in your `mnx` object to enable automatic summarization. Choose a preset mode based on your cost/fidelity tradeoff:

| Mode | Start At | Keep Recent | Summary Target | Best For |
|------|----------|-------------|----------------|----------|
| off | n/a | All | n/a | Maximum fidelity (default) |
| light | 70K tokens | 25 msgs | ~1,800 tokens | Safe compression |
| balanced | 55K tokens | 15 msgs | ~1,100 tokens | Best cost/performance |
| aggressive | 35K tokens | 8 msgs | ~700 tokens | Cheapest possible |
Using a preset mode

```json
{
  "model": "gpt-4o-mini",
  "messages": [{ "role": "user", "content": "..." }],
  "mnx": {
    "subject_id": "user_123",
    "chat_id": "550e8400-e29b-41d4-a716-446655440000",
    "summarize": "balanced"
  }
}
```
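The request body above can be sent like any other JSON POST. A minimal sketch using only the standard library is shown below; the endpoint URL and bearer-token auth scheme are assumptions, so substitute your actual Mnexium base URL and API key.

```python
import json
import urllib.request

# Request body from the preset-mode example above.
body = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "..."}],
    "mnx": {
        "subject_id": "user_123",
        "chat_id": "550e8400-e29b-41d4-a716-446655440000",
        "summarize": "balanced",
    },
}

req = urllib.request.Request(
    "https://api.mnexium.example/v1/chat/completions",  # assumed endpoint
    data=json.dumps(body).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",  # assumed auth scheme
    },
)
# response = urllib.request.urlopen(req)  # uncomment with real credentials
```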
Using custom config

```json
{
  "model": "gpt-4o-mini",
  "messages": [{ "role": "user", "content": "..." }],
  "mnx": {
    "subject_id": "user_123",
    "chat_id": "550e8400-e29b-41d4-a716-446655440000",
    "summarize_config": {
      "start_at_tokens": 40000,
      "chunk_size": 15000,
      "keep_recent_messages": 10,
      "summary_target": 800
    }
  }
}
```
- `start_at_tokens`: Token threshold that triggers summarization. History below this is sent verbatim.
- `chunk_size`: How many tokens to summarize at a time once history exceeds the threshold.
- `keep_recent_messages`: Always keep this many recent messages verbatim (never summarized).
- `summary_target`: Target token count for each generated summary.
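The interplay of these settings can be sketched as a simple split decision. The field names come from the config above; `plan_summarization` is a hypothetical helper, not part of Mnexium's API.

```python
def plan_summarization(history_tokens: int, messages: list, config: dict):
    """Decide which messages are summarized vs. sent verbatim."""
    if history_tokens < config["start_at_tokens"]:
        return [], messages  # below threshold: everything goes verbatim

    keep = config["keep_recent_messages"]
    recent = messages[-keep:]   # always preserved verbatim
    older = messages[:-keep]    # candidates for summarization
    return older, recent

config = {
    "start_at_tokens": 40000,
    "chunk_size": 15000,
    "keep_recent_messages": 10,
    "summary_target": 800,
}

msgs = [{"role": "user", "content": f"msg {i}"} for i in range(30)]
older, recent = plan_summarization(45000, msgs, config)
print(len(older), len(recent))  # 20 10
```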
How it works

1. When a chat request comes in, Mnexium counts the tokens in the conversation history using tiktoken.
2. If the history exceeds `start_at_tokens`, older messages are summarized.
3. The summary is generated with gpt-4o-mini and cached in the database.
4. Future requests reuse the cached summary until new messages push past the threshold again.
5. The final context sent to the LLM is: [Summary] + [Recent Messages] + [New Message]
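The five steps above can be sketched end to end. This is an illustration, not Mnexium's implementation: token counting is simplified to word counts (the real system uses tiktoken), and `summarize()` is a placeholder for the internal gpt-4o-mini call.

```python
START_AT_TOKENS = 100  # toy threshold for the demo
KEEP_RECENT = 2

def count_tokens(messages):
    # Simplified: word count stands in for tiktoken (step 1).
    return sum(len(m["content"].split()) for m in messages)

def summarize(messages):
    # Placeholder for the gpt-4o-mini summarization call (step 3).
    return {"role": "system", "content": f"[Summary of {len(messages)} earlier messages]"}

summary_cache = {}  # cached per chat_id (step 4)

def build_context(chat_id, history, new_message):
    if count_tokens(history) > START_AT_TOKENS:  # steps 1-2
        older, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
        if chat_id not in summary_cache:  # reuse cached summary if present
            summary_cache[chat_id] = summarize(older)
        # Step 5: [Summary] + [Recent Messages] + [New Message]
        return [summary_cache[chat_id], *recent, new_message]
    return [*history, new_message]

history = [{"role": "user", "content": "lorem " * 60} for _ in range(3)]
ctx = build_context("chat_1", history, {"role": "user", "content": "new question"})
print(len(ctx))  # 4: summary + 2 recent + new message
```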

Mnexium uses a rolling summary by default: we maintain a single condensed memory block for older messages and inject that plus the most recent turns into the model.

This is the most token-efficient strategy and is recommended for almost all workloads.

For specialized use cases that need more detailed historical context in the prompt (at higher token cost), a future release will add granular summaries, which keep multiple smaller summary blocks instead of a single rolling one.