The State of AI-Assisted Code Intelligence 2025: What We Built, What We Learned

Billion Bian · January 6, 2026

As 2025 comes to a close, I want to look back at some of the most important developments in our code intelligence systems—specifically the Context Graph engine and the Studio development platform—reflect on the problems we actually solved (and the ones we didn't), and share a few thoughts on what might come next.

Unlike the broader AI discourse that often gets lost in speculation about AGI timelines or model capabilities, I want to focus on the practical engineering of making LLMs genuinely useful for code understanding and modification. Because the reality is: having a smart model is table stakes; the hard part is building the infrastructure around it.


1. The Year We Stopped Trusting Vector Search Alone

Let me start with a confession: at the beginning of 2025, we had a fairly standard RAG (Retrieval-Augmented Generation) setup for code search. Embed the code, store in Qdrant, retrieve top-K, feed to LLM. It worked... sort of.

The problem was precision. When a developer asked "where is the payment flow implemented?", our system would return files that mentioned "payment" but missed the actual orchestration logic buried in a generically named useTransaction.ts hook. The embedding model didn't understand that this file was the hub through which all payment logic flowed.

1.1 The Graph Database Insight

The breakthrough came when we stopped treating code files as independent documents and started treating the codebase as a graph.

┌─────────────────────────────────────────────────────────────────┐
│                    The Problem with Pure RAG                    │
│                                                                 │
│  Query: "payment flow"                                          │
│                                                                 │
│  ❌ Vector Search Returns:            ✅ What We Actually Need: │
│                                                                 │
│  1. PaymentButton.tsx (0.89)         1. useTransaction.ts       │
│     - mentions "payment"                - orchestrates flow     │
│                                                                 │
│  2. payment-types.ts (0.87)          2. PaymentService.ts       │
│     - type definitions                  - API integration       │
│                                                                 │
│  3. PaymentIcon.svg (0.85)           3. PaymentContext.tsx      │
│     - literally just an icon           - state management      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Figure 1: The gap between vector similarity and actual code relevance

Our solution was to layer graph traversal on top of vector search. When you search for a concept, we don't just return the top embeddings—we also traverse the import/export graph to find files that are structurally central to the query.

We call this "n-hop traversal":

  • 0-hop: Pure vector search (what everyone does)
  • 1-hop: Include files that directly import, or are imported by, the top results
  • 2-hop: Include files two edges away (discovering orchestrators)

The implementation uses Nebula Graph to store edges extracted via Tree-sitter:

// Simplified relationship extraction
interface CodeRelationship {
  source: string;      // importing file
  target: string;      // imported module
  type: 'imports' | 'invokes' | 'inherits';
  weight: number;      // based on usage frequency
}
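
To make this concrete, here's a minimal sketch of the n-hop expansion, with vectorSearch and getNeighbors as hypothetical stand-ins for our Qdrant and Nebula Graph clients:

// n-hop expansion sketch (vectorSearch / getNeighbors are illustrative stubs)
declare function vectorSearch(query: string, topK: number): Promise<{ filePath: string }[]>;
declare function getNeighbors(filePath: string): Promise<string[]>;

async function searchWithGraph(query: string, hops = 2): Promise<Set<string>> {
  // 0-hop: pure vector search
  const seeds = await vectorSearch(query, 10);
  const results = new Set(seeds.map((s) => s.filePath));

  // 1-hop and beyond: expand along import/invoke edges
  let frontier = [...results];
  for (let hop = 0; hop < hops; hop++) {
    const neighbors = (await Promise.all(frontier.map((f) => getNeighbors(f)))).flat();
    frontier = [...new Set(neighbors)].filter((n) => !results.has(n));
    frontier.forEach((n) => results.add(n));
  }
  return results;
}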

1.2 Quantifying the Improvement

┌────────────────────────────────────────────────────────────────────┐
│                    Search Quality Metrics                          │
│                                                                    │
│  Metric              Pure RAG     RAG + 2-hop Graph    Δ          │
│  ─────────────────────────────────────────────────────────────    │
│  Precision@5         0.42         0.71                 +69%       │
│  Recall@10           0.56         0.83                 +48%       │
│  Developer Rating    3.1/5        4.4/5                +42%       │
│  "Found what I       47%          81%                  +72%       │
│   needed" rate                                                     │
│                                                                    │
│  Based on 847 production queries, October-December 2025           │
└────────────────────────────────────────────────────────────────────┘

Figure 2: Impact of graph-augmented retrieval

The lesson here is that structural understanding matters. An LLM can only be as good as the context you provide it. If your retrieval system doesn't understand that useTransaction.ts is the hub of payment logic, no amount of prompt engineering will fix that.


2. The Expert System: Multi-Agent Before It Was Cool

Around March 2025, we noticed a pattern in our failures. The code search might return the right files, but the generated plans still missed critical constraints:

  • "Use the shared design system components" (defined in a docs folder)
  • "This route must be registered in config/routes.tsx" (an architectural convention)
  • "Don't call this API directly; use the wrapper in utils/http.ts" (institutional knowledge)

The code itself didn't encode these rules. They lived in documentation, AGENTS.md files, commit messages, and the collective memory of the team.

2.1 Seven Experts, One Plan

Our solution was to build an expert system—not in the 1980s rule-based AI sense, but as a set of specialized retrieval agents that each query a different knowledge source:

┌──────────────────────────────────────────────────────────────────┐
│                   Expert System Architecture                      │
│                                                                   │
│                        ┌──────────────┐                           │
│                        │  User Query  │                           │
│                        └──────┬───────┘                           │
│                               │                                   │
│          ┌────────────────────┼────────────────────┐              │
│          ▼                    ▼                    ▼              │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐         │
│  │  PRD Expert   │  │ Codebase Exp. │  │  Rule Expert  │         │
│  │ Priority: 10  │  │  Priority: 9  │  │  Priority: 8  │         │
│  │               │  │               │  │               │         │
│  │ Data Sources: │  │ Data Sources: │  │ Data Sources: │         │
│  │ - Docs API    │  │ - Qdrant      │  │ - AGENTS.md   │         │
│  │ - Memory Bank │  │ - Nebula Graph│  │ - Conventions │         │
│  │ - Flow History│  │ - Git History │  │ - Lint Rules  │         │
│  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘         │
│          │                  │                  │                  │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐         │
│  │ Skill Expert  │  │ Figma Expert  │  │Guidance Expert│         │
│  │  Priority: 7  │  │  Priority: 6  │  │  Priority: 5  │         │
│  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘         │
│          │                  │                  │                  │
│          └──────────────────┼──────────────────┘                  │
│                             ▼                                     │
│                    ┌────────────────┐                             │
│                    │ Expert Manager │                             │
│                    │ Parallel Exec  │                             │
│                    │ ~2-3 seconds   │                             │
│                    └───────┬────────┘                             │
│                            ▼                                      │
│                    ┌────────────────┐                             │
│                    │  Planner LLM   │                             │
│                    │ (Claude Opus)  │                             │
│                    │ System Prompt: │                             │
│                    │ + Expert Ctx   │                             │
│                    └────────────────┘                             │
└──────────────────────────────────────────────────────────────────┘

Figure 3: The seven-expert architecture

Key design decisions:

  1. Parallel execution: All experts run concurrently via Promise.all. The total latency is bounded by the slowest expert (~2-3s), not the sum.

  2. Graceful degradation: Each expert is wrapped with a @Resilient() decorator. If the Figma API is down, we proceed without design context rather than failing the entire request.

  3. Priority-weighted aggregation: The Planner LLM sees all expert contexts in its system prompt, but higher-priority experts get more prominent positioning.

// Decorator pattern for expert resilience
// (illustrative class wrapper: TS decorators attach to class methods)
class FigmaExpert {
  @Timed()
  @Cached(300_000)  // 5-minute cache
  @Resilient({ fallbackConfidence: 0 })  // degrade to empty context on failure
  async analyze(input: ExpertInput): Promise<ExpertContext> {
    // Expert-specific logic
  }
}
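
Putting these pieces together, here's a sketch of the Expert Manager's fan-out. The types are illustrative, and because @Resilient guarantees each analyze() resolves (falling back to confidence 0 on failure), a plain Promise.all never rejects:

// Expert Manager fan-out (illustrative types, not our exact internal API)
interface ExpertContextResult { expert: string; priority: number; confidence: number; text: string; }
interface ExpertAgent { analyze(input: { query: string }): Promise<ExpertContextResult>; }

async function gatherExpertContexts(
  experts: ExpertAgent[],
  input: { query: string },
): Promise<ExpertContextResult[]> {
  // All experts run concurrently; latency is bounded by the slowest one
  const contexts = await Promise.all(experts.map((e) => e.analyze(input)));
  return contexts
    .filter((c) => c.confidence > 0)           // drop experts that degraded
    .sort((a, b) => b.priority - a.priority);  // higher priority first in the prompt
}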

2.2 The Human-in-the-Loop Revelation

Perhaps the most counterintuitive finding of 2025: adding human confirmation steps made the system faster overall.

When we analyzed production failures, we found a pattern:

  • AI confidently generates a plan
  • Developer starts implementing
  • 3 hours in, discovers a critical constraint was missed
  • Has to restart or significantly revise

The cost of these "confident wrong" plans was enormous.

Our solution was a three-phase workflow:

┌──────────────────────────────────────────────────────────────────┐
│               Three-Phase Planning Workflow                       │
│                                                                   │
│  Phase 1: prepare_plan (~28s)                                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ 1. Extract migration keywords from reference docs        │    │
│  │ 2. Search codebase for matching patterns                 │    │
│  │ 3. Cross-verify: only flag issues with code evidence     │    │
│  │ 4. Return confirmation items + Session ID                │    │
│  └─────────────────────────────────────────────────────────┘    │
│                               │                                   │
│                               ▼                                   │
│  Phase 2: submit_confirmation (loop until resolved)              │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Human reviews each flagged issue:                        │    │
│  │                                                          │    │
│  │ Q: "Found 2 files using legacy HTTP client. Is this      │    │
│  │    intended for the v2.0 API migration?"                 │    │
│  │                                                          │    │
│  │ Evidence:                                                │    │
│  │ - src/utils/http.ts: import { get } from '@company/http' │    │
│  │ - src/app.ts: import { post } from '@company/http'       │    │
│  │                                                          │    │
│  │ Human: "Yes, these need to be migrated to new HTTP API"  │    │
│  └─────────────────────────────────────────────────────────┘    │
│                               │                                   │
│                               ▼                                   │
│  Phase 3: generate_final_plan (~120s)                            │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ With human confirmations injected as high-priority ctx:  │    │
│  │ → Generate detailed, actionable implementation plan      │    │
│  │ → Include risk assessment based on confirmed constraints │    │
│  │ → Estimate effort per step                               │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                   │
│  Total: ~150s with 2 confirmation rounds                         │
│  vs. Hours of wasted effort from overconfident plans             │
└──────────────────────────────────────────────────────────────────┘

Figure 4: The human-in-the-loop workflow

The key insight: asking 2 targeted questions upfront is faster than fixing 10 wrong assumptions later.

But we couldn't ask just anything. We implemented a "zero false positive" policy: every confirmation question must include codebaseEvidence—proof that the issue actually exists in the code. This prevents the system from asking about theoretical problems that don't apply to this specific codebase.
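
In practice, each confirmation item might look something like this (a sketch; the field names are illustrative):

// Confirmation item under the zero-false-positive policy (illustrative shape)
interface ConfirmationItem {
  question: string;          // e.g. "Is this intended for the v2.0 API migration?"
  codebaseEvidence: Array<{
    filePath: string;        // where the issue was found
    snippet: string;         // the exact line(s) proving it exists
  }>;
  sessionId: string;         // ties submit_confirmation back to prepare_plan
}

// An item with an empty codebaseEvidence array is never shown to the user.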


3. MCP: The Protocol That Changed Everything

If there's one technology that defined our 2025, it's the Model Context Protocol (MCP).

Before MCP, integrating our code intelligence tools with different AI clients (Cursor, Claude Desktop, custom applications) meant building separate integrations for each. Every client had its own way of:

  • Defining tool schemas
  • Handling streaming responses
  • Managing authentication
  • Dealing with long-running operations

MCP gave us a single interface:

// One MCP server, works everywhere
{
  name: 'search_with_planner',
  description: '🎯 Search code entities and generate AI implementation plan',
  inputSchema: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'Natural language query' },
      collectionName: { type: 'string' },
      scopePath: { type: 'string' },
      relateDepth: { type: 'number', default: 2 },
      enableExperts: { type: 'boolean', default: true },
      // ...
    }
  }
}
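
On the server side, exposing a tool is a few lines. Here's a minimal sketch assuming the official TypeScript SDK (@modelcontextprotocol/sdk); runPlanner is a hypothetical stand-in for our planning pipeline, and the exact SDK surface may differ across versions:

// Minimal MCP server registration (sketch)
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

declare function runPlanner(query: string, depth: number): Promise<string>;

const server = new McpServer({ name: "context-graph", version: "1.0.0" });

server.tool(
  "search_with_planner",
  { query: z.string(), relateDepth: z.number().default(2) },
  async ({ query, relateDepth }) => {
    const plan = await runPlanner(query, relateDepth);
    return { content: [{ type: "text", text: plan }] };
  },
);

await server.connect(new StdioServerTransport());
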
┌──────────────────────────────────────────────────────────────────┐
│                   MCP Integration Landscape                       │
│                                                                   │
│      ┌───────────┐    ┌───────────┐    ┌───────────┐            │
│      │  Cursor   │    │  Claude   │    │  Custom   │            │
│      │   IDE     │    │  Desktop  │    │   Apps    │            │
│      └─────┬─────┘    └─────┬─────┘    └─────┬─────┘            │
│            │                │                │                   │
│            └────────────────┼────────────────┘                   │
│                             │                                    │
│                             ▼                                    │
│                    ┌────────────────┐                            │
│                    │  MCP Protocol  │                            │
│                    │    (stdio)     │                            │
│                    └───────┬────────┘                            │
│                            │                                     │
│                            ▼                                     │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                    Context Graph MCP                         │ │
│  │                                                             │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │ │
│  │  │ search_      │  │ search_with_ │  │ get_related_ │     │ │
│  │  │ entities     │  │ planner      │  │ entities     │     │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘     │ │
│  │                                                             │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │ │
│  │  │ prepare_plan │  │ submit_      │  │ get_doc_     │     │ │
│  │  │              │  │ confirmation │  │ context      │     │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘     │ │
│  │                                                             │ │
│  └─────────────────────────┬──────────────────────────────────┘ │
│                            │                                     │
│        ┌───────────────────┼───────────────────┐                │
│        │                   │                   │                │
│        ▼                   ▼                   ▼                │
│  ┌──────────┐       ┌──────────┐       ┌──────────┐            │
│  │  Qdrant  │       │  Nebula  │       │  Redis   │            │
│  │ (Vectors)│       │ (Graph)  │       │ (Cache)  │            │
│  └──────────┘       └──────────┘       └──────────┘            │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Figure 5: MCP as the unifying layer

The impact on our development velocity was dramatic. Instead of maintaining N client integrations, we maintain one MCP server. When we added the human-confirmation workflow in November, it worked in Cursor immediately—no client-side changes needed.


4. Lessons Learned: What Actually Matters

4.1 Never Trust Embeddings Alone—Always Keep the Full Content

Here's my unpopular opinion: I never fully trusted embeddings. Not because they're useless—they're excellent for retrieval—but because they're a lossy compression of meaning.

The moment you reduce 500 lines of code to a 3072-dimensional vector, you've lost information. That vector might tell you "this file is semantically related to authentication," but it can't tell you how it implements authentication, what edge cases it handles, or which APIs it calls.

My design principle was simple: embeddings are for finding, not for understanding. The full code content must always be preserved and returned to the LLM.

┌──────────────────────────────────────────────────────────────────┐
│                    Layered Index Architecture                     │
│                                                                   │
│  Layer 1: Graph Index (Shallow)                                  │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  Nebula Graph: Stores relationships only                    │ │
│  │  - imports: ["./utils", "@lib/http"]                        │ │
│  │  - calls: ["fetchUser", "validateToken"]                    │ │
│  │  - inherits: ["BaseService"]                                │ │
│  │                                                              │ │
│  │  Purpose: Fast traversal (find all files importing X)       │ │
│  │  Speed: O(1) edge lookup, ~5ms for 2-hop                    │ │
│  └────────────────────────────────────────────────────────────┘ │
│                              │                                   │
│                              ▼                                   │
│  Layer 2: Vector Index (Precise)                                │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  Qdrant: Stores vector + FULL content                       │ │
│  │  - vector: [0.023, -0.156, ...]  (for retrieval)           │ │
│  │  - content: "export class AuthService { ... }" (complete)   │ │
│  │  - metadata: { summary, tags, fingerprint, ... }            │ │
│  │                                                              │ │
│  │  Purpose: Semantic search + complete code for LLM           │ │
│  │  Speed: ~50ms for top-K, returns full code                  │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                   │
│  Search Flow:                                                     │
│  1. Vector search in Qdrant → candidate files                   │
│  2. Graph traversal in Nebula → expand to related files         │
│  3. Return full content for all matches → LLM has complete code │
│                                                                   │
└──────────────────────────────────────────────────────────────────┘

Figure 6: The two-layer index—vector for finding, content for understanding

The key insight: vectors are indexes, not representations. Just like a book's index helps you find pages about "authentication" but doesn't replace reading those pages, embeddings help you find relevant code but don't replace sending the actual code to the LLM.

interface CodeEntity {
  // Vector index (Layer 2): for finding
  vector: number[];           // Semantic search

  // CRITICAL: Full content always preserved
  content: string;            // Complete source code - never summarized

  // Graph index (Layer 1): for traversing
  imports: string[];          // Outgoing edges
  calls: string[];            // Function call edges

  // Metadata for ranking
  summary: string;            // AI-generated one-liner
  tags: string[];             // Extracted concepts
  fingerprint: string;        // MD5 for change detection
}

This architecture means:

  • Search is fast: Vector similarity in Qdrant (~50ms)
  • Context is complete: LLM always sees full code, not summaries
  • Dependencies are traversable: Graph edges enable n-hop expansion

4.2 Latency Budgets Are Real

Every search request has an implicit user expectation. For code search:

  • < 3s: Feels instant, users stay in flow
  • 3-10s: Acceptable if results are good
  • > 10s: Users start multitasking, lose context

We designed the system around a 15-second budget for the full planning workflow:

  • Phase 1 (Search): 3s max
  • Phase 2 (Experts, parallel): 3s max
  • Phase 3 (Planner LLM): 9s max

When expert APIs were slow, we added aggressive caching (5-minute TTL for documentation, 10-minute for commit history). When the Planner was slow, we moved to streaming responses so users see progress.
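
The enforcement itself can be simple. Here's a sketch of a per-phase budget guard (withBudget is a hypothetical helper; a production version would also cancel the underlying work):

// Race a phase against its budget; fall back instead of blocking the request
async function withBudget<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  const timeout = new Promise<T>((resolve) => setTimeout(() => resolve(fallback), ms));
  return Promise.race([work, timeout]);
}

// Usage: the expert phase gets 3s; on timeout we plan without expert context
// const expertContexts = await withBudget(gatherExpertContexts(experts, input), 3_000, []);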

4.3 The Codebase Is Not Static

A common failure mode: build a beautiful index, ship it, and watch it rot.

Code changes constantly. A file that was central last month might be deprecated now. A new utility gets added that should appear in search results but isn't indexed yet.

Our solution was incremental indexing with two modes:

  • Warm path: On-commit webhook triggers re-indexing of changed files (~5s)
  • Cold path: Nightly full rebuild for drift correction (~15 min for 10k files)
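
Here's a sketch of the warm path, reusing the fingerprint field from Section 4.1 to skip unchanged files (the declared helpers are hypothetical stand-ins for our Qdrant, Nebula, and Redis clients):

// Warm-path incremental reindex (helpers are illustrative stubs)
import { createHash } from "node:crypto";

declare function getIndexedFingerprint(path: string): Promise<string | null>;
declare function computeEmbedding(content: string): Promise<number[]>;
declare function upsertEntity(e: { path: string; content: string; vector: number[]; fingerprint: string }): Promise<void>;
declare function updateGraphEdges(path: string, content: string): Promise<void>;
declare function invalidateCache(path: string): Promise<void>;

async function reindexChangedFiles(files: { path: string; content: string }[]): Promise<void> {
  for (const file of files) {
    const fingerprint = createHash("md5").update(file.content).digest("hex");
    if ((await getIndexedFingerprint(file.path)) === fingerprint) continue; // unchanged

    await upsertEntity({ ...file, vector: await computeEmbedding(file.content), fingerprint });
    await updateGraphEdges(file.path, file.content);  // refresh import/export edges
    await invalidateCache(file.path);                 // drop stale cached results
  }
}
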
┌──────────────────────────────────────────────────────────────────┐
│                    Index Freshness Strategy                       │
│                                                                   │
│  Time ──────────────────────────────────────────────────────▶    │
│                                                                   │
│  9:00  │  Developer pushes commit                                │
│        │      │                                                   │
│        │      ▼                                                   │
│  9:00  │  Webhook fires                                          │
│        │      │                                                   │
│        │      ▼                                                   │
│  9:05  │  Incremental index: 3 changed files re-embedded         │
│        │      │                                                   │
│        │      ▼                                                   │
│  9:05  │  Graph edges updated (imports/exports)                  │
│        │      │                                                   │
│        │      ▼                                                   │
│  9:05  │  Cache invalidated for affected modules                 │
│        │                                                          │
│        │                                                          │
│  3:00  │  Nightly: Full reindex (catch any missed updates)       │
│   AM   │                                                          │
│                                                                   │
└──────────────────────────────────────────────────────────────────┘

Figure 7: Keeping the index fresh


5. What's Next: Predictions for 2026

5.1 Agent Orchestration Will Mature

In 2025, we built specialized experts that each query a data source and return context. In 2026, I expect these experts to become more autonomous—capable of multi-step reasoning, tool calling, and self-correction.

The pattern I see emerging:

Query: "Add a new payment method"
  │
  ├─▶ Expert 1 discovers: "PaymentService uses strategy pattern"
  │       │
  │       └─▶ Triggers sub-search: "Find all existing strategies"
  │               │
  │               └─▶ Returns: CreditCardStrategy, PayPalStrategy, etc.
  │
  ├─▶ Expert 2 discovers: "New payment methods require compliance review"
  │       │
  │       └─▶ Triggers: Check Jira for compliance templates
  │
  └─▶ Planner synthesizes both findings into coherent plan

This is essentially multi-agent architecture, but grounded in practical code intelligence rather than abstract reasoning.

5.2 Personalization Will Become Critical

Right now, our system treats all developers the same. But a senior engineer working on payments has different context than a new hire exploring the codebase.

I expect 2026 to bring:

  • Developer-specific context: "Show files I've modified frequently"
  • Team-aware search: "Show files my team owns"
  • Task-aware ranking: "I'm debugging → prioritize error handling code"

5.3 The Rise of "Code Memory"

GitHub Copilot Workspace and similar tools hint at a future where the AI maintains persistent memory of ongoing tasks. Instead of each request being stateless, the system remembers:

  • "Last week you were working on the payment refactor"
  • "You asked about error handling in PaymentService 3 times"
  • "Based on your commit patterns, you prefer functional style"

We're already seeing this with our Redis-backed session management. I expect this to become a first-class feature across the industry.
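
Our current version is modest. A sketch of that session memory, assuming ioredis (the key layout is illustrative):

// Session-scoped query memory in Redis (illustrative key layout)
import Redis from "ioredis";

const redis = new Redis();

async function rememberQuery(sessionId: string, query: string): Promise<void> {
  const key = `session:${sessionId}:queries`;
  await redis.lpush(key, query);           // most-recent-first history
  await redis.ltrim(key, 0, 49);           // keep the last 50 queries
  await redis.expire(key, 7 * 24 * 3600);  // sessions expire after a week
}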


6. Reflections on AI-Assisted Development

6.1 The Productivity Paradox

A recurring theme in 2025 was the productivity paradox: AI tools that seem to save time can actually cost time if they're wrong.

Example from our data:

  • Time saved per correct plan: ~4 hours
  • Time lost per incorrect plan: ~6 hours
  • Break-even: 4a = 6(1 − a) gives a = 60%, so below 60% accuracy the net impact is negative

This is why we obsessed over precision. A system that says "I don't know" is more valuable than one that confidently hallucinates.

6.2 The Human-AI Interface Is the Product

The most important code we wrote in 2025 wasn't the vector search or the graph traversal—it was the prompts and interfaces that shape how humans interact with the system.

Good interface:

Found 16 files related to "payment flow".
Top 3 structural hubs:
1. src/services/PaymentService.ts (imported by 23 files)
2. src/hooks/useTransaction.ts (orchestrates flow)
3. src/types/payment.ts (defines interfaces)

Would you like to: [See all files] [Generate plan] [Refine search]

Bad interface:

Results: [file1, file2, file3, ... file16]

The first interface builds understanding. The second offloads cognitive work to the human.

6.3 Institutional Knowledge Is the Real Moat

OpenAI can train on public GitHub. They can't train on:

  • Your company's architectural decisions
  • Your team's code review patterns
  • The context behind why that weird edge case exists
  • The Slack conversation where a design was decided

Our bet is that local context wins. A smaller model with access to your company's Memory Bank beats a larger model without it.


7. Technical Appendix: Key Architecture Decisions

For those interested in the implementation details, here are the key decisions we made:

7.1 Storage Layer

Component            Technology      Rationale
──────────────────────────────────────────────────────────────────
Vector embeddings    Qdrant          Good TypeScript SDK, on-prem option
Code relationships   Nebula Graph    Native graph model, efficient traversal
Session/cache        Redis           Speed, TTL support
Document metadata    MySQL           Relational queries, joins

7.2 LLM Selection

Task            Model                    Rationale
──────────────────────────────────────────────────────────────────
Embedding       text-embedding-3-large   Best quality/cost for code
AI Ranking      Claude 3.5 Sonnet        Fast, accurate for classification
Planning        Claude Opus 4.5          Complex reasoning, long context
Summarization   GPT-4o-mini              Cost-effective for simple tasks

7.3 Key Metrics (December 2025)

  • Total indexed files: 847,000 across 12 repositories
  • Average search latency: 2.3s (p50), 4.1s (p95)
  • Planning with experts: 14.2s (p50), 22.8s (p95)
  • Human confirmation adoption: 73% of planning requests
  • Plan accuracy (user-rated): 4.2/5 average

Conclusion

2025 taught us that AI for code intelligence is less about the models and more about the infrastructure: retrieval quality, context richness, human-AI interaction design.

The gap between "demo works" and "production works" is enormous. A vector search that returns plausible results is easy. A system that reliably surfaces the right files for any query, explains why they matter, and generates plans that developers actually follow—that's hard.

Looking forward to 2026, I expect the focus to shift from "make AI code better" to "make AI code trustworthy". The tools that win will be the ones developers rely on without second-guessing.

Thanks for reading. If you're building similar systems, I'd love to hear what you've learned.

— Billion


Thanks to everyone who contributed to Context Graph and Studio in 2025.

Last updated: January 6, 2026