🎯 Background & Motivation
Xiaoduan AI's core architecture is built on "incremental pruning": to keep the context window at a constant size and keep long conversations smooth, the system prunes old historical information over time. This design, however, runs into a structural conflict with the underlying KV cache mechanism.
Current Problem
State-of-the-art KV cache reuse techniques (such as standard prefix caching) rely on strict prefix matching: a cache entry can be reused only if the new token sequence is identical to the cached one from the very first token onward.
| Step | What happens |
|---|---|
| ❌ Prefix changes | Incremental pruning alters the context prefix |
| ❌ KV cache invalidated | The remaining KV cache becomes unusable: absolute positions shift and the prefix no longer matches |
| ❌ Full prefill triggered | The engine falls back to an expensive prefill recomputation of the entire context |
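The failure mode above can be made concrete with a small sketch (not framework code): prefix caching reuses KV entries only for the longest common prefix of the cached and new token sequences, so pruning a single early token invalidates everything after it.

```python
# Sketch: why strict prefix caching breaks under incremental pruning.
# Only the longest common prefix of the cached and new sequences is reusable.

def reusable_prefix_len(cached: list[int], new: list[int]) -> int:
    """Number of leading tokens whose KV entries can still be reused."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

system = [1, 2, 3]          # system prompt tokens (stable)
history = [10, 11, 12, 13]  # conversation history
cached = system + history

# Incremental pruning drops the oldest history token (10).
pruned = system + history[1:]

print(reusable_prefix_len(cached, pruned))  # only the 3 system tokens survive
```

Pruning one token out of seven discards the cache for every token that followed it, which is exactly the "prune a little, recompute everything" deadlock the document describes.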
Personal Exploration Journey
As an independent developer, I conducted extensive local experiments:
| Attempted Approach | Result |
|---|---|
| Adjusting Prompt organization order | Limited effectiveness |
| Preloading a stable prefix | Does not address the root cause |
| Exploring physical splitting/merging of raw KV data | Limited by framework constraints |
Ultimately I settled on a compromise:
Preload fixed-format prompts and tool definitions at local model startup, so their KV cache is computed once and reused.
- ✅ Static parts (prompts + tools) → Permanently reuse KV cache
- ❌ Dynamic parts (memory, tool results, etc.) → Continue bearing recalculation costs
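The compromise amounts to a prompt-layout discipline, sketched below (function and argument names are illustrative, not Xiaoduan AI's actual code): immutable content goes first so it always survives strict prefix matching, while dynamic content sits at the tail and pays recomputation whenever it changes.

```python
# Sketch of the compromise layout (names are illustrative).

def build_prompt(static_system: str, tools: str,
                 dynamic_memory: str, user_turn: str) -> str:
    # Static parts first  -> stable prefix -> KV cache permanently reusable.
    # Dynamic parts last  -> only their KV is recomputed when they change.
    return "\n".join([static_system, tools, dynamic_memory, user_turn])

print(build_prompt("SYSTEM PROMPT", "TOOL DEFS", "MEMORY SNAPSHOT", "Hello"))
```

Any edit to `dynamic_memory` leaves the `static_system` + `tools` prefix byte-identical, so a prefix cache keeps matching it.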
Industry Gap
This is not giving up exploration, but recognizing an industry reality from practice: Current inference frameworks do not yet provide native, flexible KV reuse support for intelligent memory management architectures like "incremental pruning."
🎯 Core Problem Being Solved
This project aims to solve this industry-wide challenge through the core mechanism of "Chunk-Naming-Indexing":
- ✅ Achieve flexible, non-prefix-dependent KV cache reuse
- ✅ When model context changes due to incremental pruning, remaining KV cache can still be stably reused
- ✅ No longer trigger full recalculation due to history removal
⚙️ Core Technical Vision
1️⃣ Context Chunking & Solidification
Long Context → Split by Token Count → Multiple Logical Chunks
- Each chunk, after generation, is treated as an independently manageable KV cache unit
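The chunking step can be sketched in a few lines, assuming fixed-size chunks with a possibly shorter final chunk (the exact policy is not specified in the document):

```python
# Sketch: split a long token sequence into fixed-size logical chunks.

def chunk_tokens(tokens: list[int], chunk_size: int) -> list[list[int]]:
    """Each returned sublist becomes one independently manageable KV unit."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

print(chunk_tokens(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```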
2️⃣ Global Naming & Indexing
- Assign a globally incrementing unique identifier to each generated chunk (e.g., auto-incrementing ID)
- Upper-layer memory scheduler maintains a lightweight in-memory index table
- The index table records which chunk IDs are currently in use and where their KV data resides
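A minimal sketch of this naming scheme, assuming an auto-incrementing ID and a dict-based index (the metadata fields are hypothetical):

```python
import itertools

class ChunkIndex:
    """Lightweight in-memory index table: chunk id -> metadata (fields illustrative)."""

    _next_id = itertools.count(1)  # globally incrementing unique identifier

    def __init__(self) -> None:
        self.table: dict[int, dict] = {}

    def register(self, token_count: int) -> int:
        """Assign a fresh global ID to a newly generated chunk and record it."""
        cid = next(ChunkIndex._next_id)
        self.table[cid] = {"tokens": token_count, "live": True}
        return cid

idx = ChunkIndex()
a = idx.register(512)
b = idx.register(384)
print(a, b)  # IDs are globally increasing, never reused
```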
3️⃣ Incremental Pruning & Precise Removal
Memory Scheduler Issues Pruning Command
↓
Notify Inference Engine of Expired Chunk IDs to Prune
↓
Engine Releases VRAM/RAM Storage Space Based on Index
↓
Update Index Table
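The pruning pipeline above can be sketched as follows, with a plain dict standing in for the engine's KV storage (all names are hypothetical, since no engine exposes this today):

```python
# Sketch of the pruning command path: the scheduler names expired chunk ids;
# the engine frees their storage; the index table is updated in step.

kv_store = {1: b"kv-A", 2: b"kv-B", 3: b"kv-C"}  # chunk id -> cached KV data
index = {1: "live", 2: "live", 3: "live"}        # scheduler's index table

def prune(expired_ids: list[int]) -> None:
    for cid in expired_ids:
        kv_store.pop(cid, None)  # engine releases VRAM/RAM for this chunk
        index.pop(cid, None)     # index table updated

prune([1])  # memory scheduler issues the pruning command for chunk 1
print(sorted(index))  # [2, 3] -- the surviving chunks stay addressable
```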
4️⃣ Selective Reuse
| Step | Operation |
|---|---|
| ① | Before a subsequent inference request, load the required KV chunks, layer by layer, from the cached KV data |
| ② | Re-derive each token's position on the fly ("dynamic position decoding"), then run fused inference on the assembled context |
| ③ | Remaining unaffected KV chunks continue to be reused for subsequent generation |
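The three steps above can be sketched as assembling the surviving chunks in scheduler order and assigning them fresh absolute positions; this is purely illustrative of the "dynamic position decoding" idea, not a working attention implementation:

```python
# Sketch of selective reuse: surviving chunks are concatenated and given
# newly derived positions, so a pruned chunk leaves no positional hole.

def assemble(chunks: dict[int, list[int]], order: list[int]):
    """Return (tokens, positions) for the surviving chunks in scheduler order."""
    tokens, positions, pos = [], [], 0
    for cid in order:
        for tok in chunks[cid]:
            tokens.append(tok)
            positions.append(pos)  # re-derived position, not the original one
            pos += 1
    return tokens, positions

chunks = {2: [11, 12], 3: [13, 14]}  # chunk 1 was pruned
print(assemble(chunks, [2, 3]))      # ([11, 12, 13, 14], [0, 1, 2, 3])
```

The open problem is the last line: the cached KV entries were computed at the *old* positions, which is exactly the position-encoding compatibility challenge discussed below.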
⚠️ Current Limitations & Expectations
Current Limitations
| Limitation | Description |
|---|---|
| Lack of Native Framework Support | KV cache management in mainstream frameworks such as vLLM and llama.cpp inherently depends on strict prefix matching |
| Position Encoding Mechanism | Compatibility challenges between absolute and relative position encodings |
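A two-dimensional toy example shows why rotary-style (relative) encodings are more amenable to chunk relocation than absolute ones: with RoPE, moving a cached key from old position p to new position q is, in principle, a rotation by the angle difference, not a full recomputation. Real RoPE rotates many frequency pairs per head; this sketch uses a single pair.

```python
import math

def rope_2d(vec: tuple[float, float], pos: float, theta: float = 0.1):
    """Rotate a 2-D vector by pos * theta (toy stand-in for one RoPE pair)."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x, y = vec
    return (x * c - y * s, x * s + y * c)

k_at_7 = rope_2d((1.0, 0.0), 7)       # key cached at old position 7
shifted = rope_2d(k_at_7, 3 - 7)      # re-rotate by the position delta
direct = rope_2d((1.0, 0.0), 3)       # what encoding at position 3 would give

# Rotations compose additively, so the shifted key matches the direct one.
print(all(abs(a - b) < 1e-9 for a, b in zip(shifted, direct)))  # True
```

Absolute (learned or sinusoidal-additive) encodings have no such cheap correction, which is why the document flags position encoding as a core compatibility challenge.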
Expected Directions
- 🎯 Inference engines provide a set of intelligent, general-purpose KV Cache management APIs
- 🔗 Enable decoupling between upper-layer applications and engines
- 💾 Allow memory schedulers to manage KV caches like operating files
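One possible shape for such a "KV cache as files" API, sketched as an in-memory stand-in; nothing here exists in vLLM, llama.cpp, or any other engine today, and every method name is hypothetical:

```python
class FileLikeKVStore:
    """Hypothetical facade letting a scheduler manage KV chunks like files."""

    def __init__(self) -> None:
        self._chunks: dict[int, bytes] = {}

    def save_chunk(self, chunk_id: int, kv_blob: bytes) -> None:
        """Persist one chunk's KV data under its global id (like writing a file)."""
        self._chunks[chunk_id] = kv_blob

    def load_chunk(self, chunk_id: int) -> bytes:
        """Read a chunk's KV data back for reuse (like opening a file)."""
        return self._chunks[chunk_id]

    def drop_chunk(self, chunk_id: int) -> None:
        """Release a chunk's storage (like deleting a file)."""
        del self._chunks[chunk_id]

store = FileLikeKVStore()
store.save_chunk(1, b"kv-bytes")
print(store.load_chunk(1))  # b'kv-bytes'
```

With an interface like this, the upper-layer memory scheduler and the engine would be decoupled: the scheduler only speaks in chunk IDs, never in engine-internal cache layouts.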
🚀 Technical Innovation & Value
Core Philosophy
Traditional Mode: Rigid Chain (Forced Prefix Matching)
↓ Transformation
New Mode: Flexible Building Blocks (Chunk-based Combinable Reuse)
Core Logic
"KV Split into 100 Parts is Equally Linked to 100 Parts of Memory"
Implementation Effects Comparison
| Scenario | Traditional Approach | New Solution |
|---|---|---|
| Incremental Pruning | Full Recalculation ❌ | Drop one chunk, keep reusing the rest in sync ✅ |
| Cache Utilization Rate | Low | High |
| Memory-KV Integration | Disconnected | Intelligent Linkage |
Value Summary
- ✅ Fundamentally change the "Incremental Pruning = Full Recalculation" deadlock
- ✅ Form true intelligent linkage between upper-layer memory management and lower-layer KV cache
- ✅ Break free from rigid dependency on strict prefix matching
- ✅ Pave the way for more complex memory scheduling strategies

The hope is that engine developers will target application scenarios like "incremental pruning" with dedicated optimization and development at the inference engine level:
- 💡 Provide more flexible KV cache management APIs
- 🔧 Innovate from the ground up in position encoding mechanisms
Whichever direction is taken, it will have a far-reaching impact on the upper-layer application ecosystem.
As a humble application developer, I will continue to explore and validate within my capabilities.
📬 Contact & Communication
- QQ Group: 362422425 (Admin)
This document aims to share technical vision and promote discussion and practice in AI inference efficiency optimization.
📋 Metadata
| Field | Value |
|---|---|
| Author | Nuo Yan (yiliu666) |
| Project | Xiaoduan AI |
| First Published | November 2025 (ModelScope) |
| License | Apache 2.0 |