ALTK-Evolve: Long-term memory that improves AI agents
Imagine a brilliant cook who memorizes cookbooks but forgets your kitchen every morning. Excellent at following recipes, yet unable to learn that your oven runs hot or that customers ask for more salt. Sound familiar? That’s what happens today with many agents: re-reading logs instead of real learning.
What problem does it solve?
Most agents repeat mistakes because they don’t extract principles from experience. Dumping transcripts back into a prompt is like reading a diary: it gives context, not judgment. ALTK-Evolve aims to change that. Instead of storing recipes (transcripts), it extracts and refines operational principles (rules, guidelines, and policies) that can actually transfer to new situations.
The idea? Turn raw trajectories into reusable best practices. The benefit? Your agent doesn’t have to start like a newbie every time.
How ALTK-Evolve works (operational architecture)
Evolve runs a continuous loop between observability, refinement and just-in-time retrieval. There are two main flows:
Downstream flow (observation and extraction): captures the agent’s full trajectories (utterances, thoughts, tool calls, outcomes) inside an Interaction Layer —for example Langfuse or any OpenTelemetry-based tool. Plug-in extractors look for structural patterns and generate candidate entities (guidelines, policies, SOPs) that are persisted.
Upstream flow (refinement and retrieval): a background job consolidates duplicates, prunes weak rules and promotes confirmed strategies. That’s how a high-quality library evolves. At runtime, the application layer requests only the relevant entities and injects them into context when it matters.
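As a rough illustration of the downstream flow, a plug-in extractor might scan a trajectory for a failure-then-successful-retry pattern and emit a candidate guideline. All names and fields below are hypothetical stand-ins for illustration, not the ALTK-Evolve API:

```python
# Minimal sketch of a plug-in extractor for the downstream flow.
# The trajectory format and entity fields are illustrative assumptions.

def extract_candidates(trajectory):
    """Scan tool calls for an error followed by a successful retry
    of the same tool, and turn that pattern into a candidate guideline."""
    candidates = []
    for prev, curr in zip(trajectory, trajectory[1:]):
        if (prev["tool"] == curr["tool"]
                and prev["outcome"] == "error"
                and curr["outcome"] == "ok"):
            candidates.append({
                "type": "guideline",
                "text": f"When {prev['tool']} fails with '{prev['error']}', "
                        f"retry with {curr['args']}.",
                "evidence": [prev, curr],   # raw events backing the rule
                "confidence": 0.5,          # starts low; consolidation adjusts it
            })
    return candidates

trajectory = [
    {"tool": "search_api", "args": {"q": "flights"}, "outcome": "error",
     "error": "missing date"},
    {"tool": "search_api", "args": {"q": "flights", "date": "today"},
     "outcome": "ok"},
]
print(extract_candidates(trajectory)[0]["text"])
```

The point is the shape of the transformation: raw events in, a structured, evidence-backed candidate entity out.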
Some key components:
Interaction Layer: observability and single extraction/retrieval point.
Plug-in extractors: detect patterns in traces to create candidates.
Consolidate-and-score job: deduplication, scoring and garbage collection.
Retrieval with Progressive Disclosure: only what’s relevant enters context, avoiding prompt bloat.
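A hedged sketch of that last component: score stored guidelines against the current task and inject only the top-k into context. Word-overlap scoring here is a deliberately crude stand-in for whatever similarity measure the real retriever uses:

```python
# Progressive-disclosure retrieval sketch: only the k most relevant
# guidelines enter the prompt, never the full library.
# Word-overlap scoring is an illustrative stand-in for a real retriever.

def retrieve(task, library, k=2):
    task_words = set(task.lower().split())
    scored = [
        (len(task_words & set(g.lower().split())), g)
        for g in library
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [g for score, g in scored[:k] if score > 0]

library = [
    "Always confirm the date before calling the booking API.",
    "Prefer pagination when a list endpoint returns many items.",
    "Log out of the session after finishing a task.",
]
print(retrieve("confirm booking date", library, k=1))
```

Capping at `k` entities is what keeps the prompt small as the library grows: the memory can hold thousands of rules while the context only ever sees a handful.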
Why does it work? Technical principles
Teaches judgment: turns one-off events into portable strategies that generalize.
Controls noise: scoring keeps memory useful, not a junk drawer.
Progressive Disclosure: retrieval is just-in-time, not a massive history dump.
Technically, this is an episodic memory that produces and maintains structured entities (guidelines, policies, SOPs) with trust metadata and usefulness signals. The key is the consolidation pipeline that prevents uncontrolled growth and prioritizes empirical evidence.
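The consolidation pipeline can be caricatured in a few lines: merge near-duplicates, weight each rule's score by its supporting evidence, and garbage-collect anything below a threshold. The entity fields, scoring formula, and the 0.2 cutoff are all assumptions made for illustration:

```python
# Consolidate-and-score sketch: dedup, evidence-weighted scoring, and GC.
# Fields, the scoring formula, and the pruning threshold are illustrative.

def consolidate(entities, min_score=0.2):
    merged = {}
    for e in entities:
        key = e["text"].lower()
        if key in merged:
            # Duplicate rule: merge evidence instead of storing it twice.
            merged[key]["evidence"] += e["evidence"]
        else:
            merged[key] = {**e, "evidence": list(e["evidence"])}
    for e in merged.values():
        # More independent confirmations -> higher trust, capped at 1.0.
        e["score"] = min(1.0, 0.2 * len(e["evidence"]))
    # Garbage-collect weak rules so memory stays a library, not a junk drawer.
    return [e for e in merged.values() if e["score"] >= min_score]

entities = [
    {"text": "Retry failed API calls once.", "evidence": ["run-1"]},
    {"text": "retry failed api calls once.", "evidence": ["run-2"]},
    {"text": "Never paginate.", "evidence": []},
]
kept = consolidate(entities)
print(len(kept), kept[0]["score"])
```

Note how the duplicate pair survives with a higher score than either copy alone, while the unsupported rule is pruned: that is the "prioritizes empirical evidence" property in miniature.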
Evaluation in AppWorld: technical results
Evolve was tested on AppWorld, a benchmark of realistic multi-step tasks with API calls (on average 9.5 APIs across 1.8 apps). A ReAct agent received the task instruction plus the top-5 guidelines retrieved from prior runs (train/dev) and was evaluated on an unseen test set (test-normal).
The main metric was Scenario Goal Completion (SGC), which demands consistency: a scenario counts as solved only if the agent succeeds across all of its variants.
| Difficulty | Baseline SGC | + Memory | Δ |
|------------|--------------|----------|------|
| Easy       | 79.0%        | 84.2%    | +5.2 |
| Medium     | 56.2%        | 62.5%    | +6.3 |
| Hard       | 19.1%        | 33.3%    | +14.2 |
| Aggregate  | 50.0%        | 58.9%    | +8.9 |
Important takeaways from the evaluation:
Generalization: the gain on Test-Normal indicates the agent learns principles, not exact memorization.
Scaling with complexity: harder tasks see larger percentage improvements; in Hard there’s a 74% relative improvement in success.
Consistency: SGC improves more than raw pass-rate, which reduces “flaky” behavior across variants.
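For concreteness, here is one way SGC can be computed next to raw pass-rate: a scenario scores only if every variant passes, which is exactly why it punishes flaky behavior. The data below is made up for the example, not AppWorld results:

```python
# SGC vs. raw pass-rate on toy data: a scenario counts only if
# *all* of its variants succeed, so flakiness is penalized.

def scenario_goal_completion(results):
    scenarios = list(results.values())
    solved = sum(all(variants) for variants in scenarios)
    return solved / len(scenarios)

def pass_rate(results):
    flat = [ok for variants in results.values() for ok in variants]
    return sum(flat) / len(flat)

# Toy outcomes: scenario -> per-variant success flags.
results = {
    "book_flight": [True, True, True],   # consistent -> counts for SGC
    "refund":      [True, False, True],  # flaky -> fails SGC entirely
}
print(pass_rate(results), scenario_goal_completion(results))
```

Here the raw pass-rate looks healthy (5 of 6 variants pass) while SGC sits at 50%, because one flaky variant sinks the whole "refund" scenario.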
ALTK-Evolve is designed to fit existing stacks. Highlighted options:
Plugin for Claude Code:
```shell
claude plugin marketplace add AgentToolkit/altk-evolve
claude plugin install evolve@evolve-marketplace
```
This extracts entities and saves them to the filesystem using Claude Code hooks for automatic retrieval.
Lite mode: quick demo, easy to try but limited (no cross-session consolidation or GC).
Low-code / Pro-code: adds cross-session capture, consolidation and garbage collection. Compatible with popular LLM clients and agent frameworks (OpenAI, LiteLLM, Hugging Face agents).
Simple integration with Codex and IBM Bob: import altk_evolve.auto and flip a flag to emit traces to a UI like Arize Phoenix; then sync to generate guidelines without changing your architecture.
Integration with CUGA via MCP: before each run call get_guidelines and afterwards save_trajectory to feed the learning loop. Result: a low-overhead loop that improves with use.
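That retrieve-run-save loop can be sketched as follows. `get_guidelines` and `save_trajectory` are the tool names described above, but the stubs, the in-memory store, and the wiring here are hypothetical stand-ins, not the real MCP integration:

```python
# Sketch of the retrieve -> run -> save learning loop around an agent.
# The two tool names come from the integration described above;
# everything else (stubs, in-memory store) is an illustrative stand-in.

store = []  # stand-in for the persisted entity store

def get_guidelines(task):
    # Real version: MCP call returning the top-k relevant entities.
    return [g for g in store if g["topic"] in task]

def save_trajectory(task, trajectory):
    # Real version: MCP call persisting the trajectory for extraction.
    store.append({"topic": task.split()[0], "trace": trajectory})

def run_agent(task, guidelines):
    # Stand-in for the actual agent run; returns a fake trajectory.
    return [{"step": "act", "task": task, "hints": len(guidelines)}]

task = "refund a duplicate charge"
guidelines = get_guidelines(task)          # before each run
trajectory = run_agent(task, guidelines)   # agent executes with hints
save_trajectory(task, trajectory)          # after the run, feed the loop

print(len(store))
```

Each pass through the loop leaves the store slightly richer, so the next run on a similar task starts with more hints than the last.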
In short: you can try it in minutes with Claude Code, or add traces to your existing pipeline and let the consolidate-and-score job deliver real gains.
Technical recommendations for adoption
Start by instrumenting structured traces: without good observability, extraction fails.
Design extractors with clear entity schemas (what is a guideline, what metadata does it have, what evidence signals).
Tune scoring and TTL policies so useless entries don’t accumulate.
Prioritize Progressive Disclosure: less is more in context.
Measure with consistency metrics (SGC) in addition to pass-rate to detect flaky behavior.
Final reflection
ALTK-Evolve isn’t magic; it’s memory engineering: turning experiences into actionable principles. If your agent seems to learn little despite logging everything, it probably needs this intermediate step of extraction, consolidation and retrieval. For simple tasks the gains are nice; for complex tasks they can be transformative.
If you want to try it quickly, there are demos and tutorials ready. Star the repo if you test it — that helps prioritize improvements and share real use cases.