Xiaoduan KV Chunk-based Incremental Reuse Architecture


🎯 Background & Motivation

Xiaoduan AI's core architecture is built on "Incremental Pruning" — to maintain a constant context window and ensure smooth long-conversation experiences, the system prunes old historical information over time. However, this design encounters a structural contradiction with the underlying KV cache mechanism.

Current Problem

State-of-the-art KV cache reuse techniques (like standard Prefix Caching) rely on strict prefix matching — the prerequisite for cache reuse is that the token sequence "starting from the beginning must be exactly the same."

| Problem | Description |
| --- | --- |
| ❌ Context prefix changes | Incremental pruning alters the token sequence at the start of the context |
| ❌ KV cache invalidated | The remaining KV cache becomes invalid because absolute positions shift and the prefix no longer matches |
| ❌ Prefill recalculation triggered | The engine must rerun an expensive Prefill pass over the context |
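To make the failure mode concrete, here is a minimal, hypothetical model of strict prefix caching (the `prefix_keys` helper is an illustration, not an API of any real framework):

```python
# A strict prefix cache keys reusable KV state by the exact leading token
# sequence, so every cache entry after a pruned token stops matching.
def prefix_keys(tokens):
    """Cache keys of a strict prefix cache: one key per prefix length."""
    return [tuple(tokens[:i]) for i in range(1, len(tokens) + 1)]

before = [1, 2, 3, 4, 5]   # original context
after = [1, 3, 4, 5]       # token 2 pruned from history
cached = set(prefix_keys(before))
reusable = [k for k in prefix_keys(after) if k in cached]
print(reusable)  # [(1,)] -- only the prefix before the pruned token survives
```

Everything after the first pruned token must be recomputed, even though most of the KV content is semantically unchanged.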

Personal Exploration Journey

As an independent developer, I conducted extensive local experiments:

| Attempted approach | Result |
| --- | --- |
| Adjusting prompt organization order | Limited effectiveness |
| Preloading a stable prefix | Cannot fundamentally solve the problem |
| Physically splitting/merging raw KV data | Blocked by framework constraints |

Ultimately chose a compromise solution:

Preload fixed-format prompts + tools during local model startup for pre-stored KV reuse

  • ✅ Static parts (prompts + tools) → Permanently reuse KV cache
  • ❌ Dynamic parts (memory, tool results, etc.) → Continue bearing recalculation costs

Industry Gap

This is not a retreat from exploration but a recognition, drawn from practice, of an industry reality: current inference frameworks do not yet provide native, flexible KV reuse support for intelligent memory management architectures like "incremental pruning."


🎯 Core Problem Being Solved

This project aims to solve this industry-wide challenge through the core mechanism of "Chunk-Naming-Indexing":

  • ✅ Achieve flexible, non-prefix-dependent KV cache reuse
  • ✅ When model context changes due to incremental pruning, remaining KV cache can still be stably reused
  • ✅ No longer trigger full recalculation due to history removal

⚙️ Core Technical Vision

1️⃣ Context Chunking & Solidification

Long Context → Split by Token Count → Multiple Logical Chunks
  • Each chunk, after generation, is treated as an independently manageable KV cache unit
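The chunking step can be sketched as follows; `Chunk`, `CHUNK_SIZE`, and `split_into_chunks` are illustrative assumptions, not names from any existing framework:

```python
from dataclasses import dataclass
from typing import List

CHUNK_SIZE = 4  # tiny for demonstration; real chunks would span far more tokens

@dataclass
class Chunk:
    tokens: List[int]  # token IDs covered by this chunk
    # A real engine would also hold a reference to the chunk's KV tensors.

def split_into_chunks(tokens: List[int], chunk_size: int = CHUNK_SIZE) -> List[Chunk]:
    """Split a long context into independently manageable chunk units."""
    return [Chunk(tokens[start:start + chunk_size])
            for start in range(0, len(tokens), chunk_size)]

chunks = split_into_chunks(list(range(10)))
print([c.tokens for c in chunks])  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```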

2️⃣ Global Naming & Indexing

  • Assign a globally incrementing unique identifier to each generated chunk (e.g., auto-incrementing ID)
  • Upper-layer memory scheduler maintains a lightweight in-memory index table
  • The index table records which chunk IDs are currently in use
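A minimal sketch of such an index table; the `ChunkIndex` class and its method names are assumptions for illustration:

```python
import itertools

class ChunkIndex:
    """Lightweight in-memory index table: chunk ID -> metadata."""
    def __init__(self):
        self._next_id = itertools.count(1)  # globally incrementing IDs
        self._table = {}

    def register(self, token_span):
        """Assign the next unique ID to a newly generated chunk."""
        chunk_id = next(self._next_id)
        self._table[chunk_id] = {"span": token_span}
        return chunk_id

    def active_ids(self):
        """Chunk IDs currently in use."""
        return sorted(self._table)

index = ChunkIndex()
first = index.register((0, 256))
second = index.register((256, 512))
print(first, second, index.active_ids())  # 1 2 [1, 2]
```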

3️⃣ Incremental Pruning & Precise Removal

Memory Scheduler Issues Pruning Command
        ↓
Notify Inference Engine of Expired Chunk IDs to Prune
        ↓
Engine Releases VRAM/RAM Storage Space Based on Index
        ↓
Update Index Table
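The pruning flow above can be sketched like this; `kv_store` stands in for per-chunk VRAM blocks and `index_table` for the scheduler's view, both illustrative assumptions:

```python
kv_store = {1: b"kv-1", 2: b"kv-2", 3: b"kv-3"}          # stand-in for VRAM blocks
index_table = {1: (0, 256), 2: (256, 512), 3: (512, 768)}  # chunk ID -> token span

def prune_chunks(expired_ids):
    """Release storage for expired chunks and keep the index table in sync."""
    for chunk_id in expired_ids:
        kv_store.pop(chunk_id, None)     # engine frees VRAM/RAM for the chunk
        index_table.pop(chunk_id, None)  # index updated in the same step

prune_chunks([2])
print(sorted(index_table))  # [1, 3] -- chunk 2 removed, neighbors untouched
```

The key property is precision: only the named chunk is released, while its neighbors remain addressable for reuse.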

4️⃣ Selective Reuse

| Step | Operation |
| --- | --- |
| 1 | Before each subsequent inference request, read the required KV chunks, layer by layer, from the cached KV data |
| 2 | Perform dynamic position decoding, then efficiently complete the fused inference |
| 3 | Continue reusing the remaining, unaffected KV chunks for subsequent generation |
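The bookkeeping side of selective reuse can be sketched as below. This only models the position re-indexing; real dynamic position decoding would also have to rewrite the position encodings baked into the cached KV tensors (e.g. re-rotating RoPE), which is the hard part. `plan_reuse` is a hypothetical name:

```python
def plan_reuse(surviving_chunks, new_tokens):
    """Re-index surviving chunks into fresh contiguous positions.

    surviving_chunks: list of (chunk_id, tokens) still held in the KV store.
    new_tokens: tokens appended after pruning (only these need Prefill).
    """
    position = 0
    position_map = {}  # chunk_id -> (start, end) after re-indexing
    for chunk_id, tokens in surviving_chunks:
        position_map[chunk_id] = (position, position + len(tokens))
        position += len(tokens)
    return position_map, len(new_tokens)

pmap, prefill_len = plan_reuse([(1, [10, 11, 12]), (3, [30, 31])], [99])
print(pmap, prefill_len)  # {1: (0, 3), 3: (3, 5)} 1
```

Note how chunk 3 slides forward to fill the gap left by a pruned chunk, and only one new token requires a Prefill pass.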

⚠️ Current Limitations & Expectations

Current Limitations

| Limitation | Description |
| --- | --- |
| Lack of native framework support | KV cache management in vLLM, llama.cpp, and other mainstream frameworks inherently depends on strict prefix matching |
| Position encoding mechanism | Compatibility challenges between absolute and relative position encodings |

Expected Directions

  • 🎯 Inference engines provide a set of intelligent, general-purpose KV Cache management APIs
  • 🔗 Enable decoupling between upper-layer applications and engines
  • 💾 Allow memory schedulers to manage KV caches like operating files
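What "managing KV caches like operating files" could look like, as a purely wished-for interface. None of these methods exist in vLLM, llama.cpp, or any other engine today; `KVCacheFS` and every name in it are assumptions:

```python
class KVCacheFS:
    """A file-system-flavored facade over named KV chunks (wish-list sketch)."""
    def __init__(self):
        self._chunks = {}

    def save(self, name, kv_blob):   # like writing a file
        self._chunks[name] = kv_blob

    def load(self, name):            # like reading a file back
        return self._chunks[name]

    def delete(self, name):         # like deleting a file
        self._chunks.pop(name, None)

    def listdir(self):              # like listing a directory
        return sorted(self._chunks)

fs = KVCacheFS()
fs.save("chunk-0007", b"...kv bytes...")
print(fs.listdir())  # ['chunk-0007']
```

With such an API, the upper-layer memory scheduler would never need to know how the engine lays out KV tensors internally, which is exactly the decoupling described above.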

🚀 Technical Innovation & Value

Core Philosophy

Traditional Mode: Rigid Chain (Forced Prefix Matching)
        ↓ Transformation
New Mode: Flexible Building Blocks (Chunk-based Combinable Reuse)

Core Logic

"Splitting the KV cache into 100 parts links it one-to-one with 100 parts of memory"

Implementation Effects Comparison

| Scenario | Traditional approach | New solution |
| --- | --- | --- |
| Incremental pruning | Full recalculation ❌ | Remove-one/add-one synchronized reuse ✅ |
| Cache utilization rate | Low | High |
| Memory-KV integration | Disconnected | Intelligent linkage |

Value Summary

  1. ✅ Fundamentally change the "Incremental Pruning = Full Recalculation" deadlock
  2. ✅ Form true intelligent linkage between upper-layer memory management and lower-layer KV cache
  3. ✅ Break free from rigid dependency on strict prefix matching
  4. ✅ Pave the way for more complex memory scheduling strategies

I hope inference engine teams will take note of the "Incremental Pruning" application scenario and pursue targeted optimization and development at the engine level:
  • 💡 Provide more flexible KV cache management APIs
  • 🔧 Rethink position encoding mechanisms from the ground up

Whichever direction prevails, it will have a far-reaching impact on the upper-layer application ecosystem.

As a humble application developer, I will continue to explore and validate within my capabilities.


📬 Contact & Communication

  • QQ Group: 362422425 (Admin)

This document aims to share technical vision and promote discussion and practice in AI inference efficiency optimization.


📋 Metadata

| Field | Value |
| --- | --- |
| Author | Nuo Yan (yiliu666) |
| Project | Xiaoduan AI |
| First Published | November 2025 (ModelScope) |
| License | Apache 2.0 |