Tags: rag, emergent-behavior, file-system, mcp, llm-tooling

RAG Without Vectors: Emergent Retrieval

Forget embeddings. Forget chunks. Give an LLM good file system tools and it will teach itself to retrieve—using reasoning as the similarity function.

Vario aka Mnehmos

🔍 The Heretical Question

What if you didn't need embeddings, vector databases, or chunking strategies? What if reasoning itself could be the similarity function?

The Standard RAG Pipeline

Everyone knows how Retrieval-Augmented Generation works. You take your documents, you chunk them, you embed them with something like text-embedding-3-small, you store them in Pinecone or Chroma, and at query time you embed the query, find similar vectors, pull the chunks, and stuff them into context.

TRADITIONAL RAG FLOW
1 Documents → Chunk into 500-token pieces
2 Chunks → Embed with neural model
3 Embeddings → Store in vector database
4 Query → Embed → Search → Top-K chunks
5 Chunks → Stuff into context → LLM answers
~$0.02/1M tokens embedded • Vector DB costs • Chunking complexity
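
For contrast, here is roughly what the retrieval step amounts to in a few lines of TypeScript. This is a minimal in-memory sketch, not a production pipeline: the embed callback stands in for whatever embedding model you call, and a real system would use a vector database rather than a flat array.

// Minimal sketch of the traditional retrieval step (illustrative, in-memory only)
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function retrieve(
  query: string,
  chunks: { text: string; vector: number[] }[],
  embed: (text: string) => Promise<number[]>,   // stand-in for an embedding model call
  k = 5
) {
  const queryVector = await embed(query);                           // embed the query
  return chunks
    .map(c => ({ text: c.text, score: cosine(queryVector, c.vector) }))
    .sort((x, y) => y.score - x.score)                              // rank by geometric similarity
    .slice(0, k);                                                   // top-K chunks go into context
}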

It works. It's battle-tested. But it's also expensive, brittle, and requires constant tuning of chunk sizes, overlap, retrieval thresholds, and reranking strategies.

The Heresy: Let the LLM Be the Retriever

What if instead of pre-computing similarity through embeddings, you gave the LLM tools to explore and let it figure out what's relevant?

Vector RAG

Similarity is pre-computed by a frozen embedding model. It can't reason. "Authentication" and "login" might be far apart in embedding space despite being conceptually identical.

File System RAG

Similarity is computed at query time by the LLM's reasoning. It can understand "I need authentication code" means searching for auth, login, session, and jwt.

// The LLM's retrieval "algorithm"

User: "How does authentication work?"

LLM thinks:
├── Auth code is probably in /src/auth/ or /lib/auth/
├── Maybe there's a middleware folder
└── JWT handling might be in utils/

LLM acts:
├── search_files("*.auth.ts", "src/")
├── search_in_file("jwt", "src/middleware/index.ts")
└── read_file_lines("src/auth/session.ts", 1, 50)

// The LLM IS the similarity function

The Tool Stack That Enables This

Building the OODA MCP server taught me what tools an LLM needs to perform emergent retrieval. It's not one tool—it's a progression from broad discovery to surgical extraction.

DISCOVER list_directory, search_files

"What's in this project?" → Browse folder structure, find files by pattern. Like scanning a library's catalog.

LOCATE search_in_file, batch_search_in_files

"Where is this concept?" → Search for patterns across files with context lines. Supports regex and fuzzy matching with configurable thresholds.

EXTRACT read_file_lines, read_file

"Show me this specific part." → Read line ranges, tail files, get exactly what's needed. offset: -50 reads last 50 lines like Unix tail.

BATCH batch_read_files, batch_search_in_files

"Search all of these at once." → Parallel operations reduce round-trips. One tool call can search 20 files simultaneously.

The magic happens in the progression. The LLM doesn't read everything—it narrows down iteratively, just like a human developer would.

The Secret Sauce: Fuzzy Matching

Vector embeddings capture "semantic similarity" through high-dimensional geometry. But you can get 80% of the benefit with simple Levenshtein distance.

// From ooda.mcp: batch_search_in_files with fuzzy matching

batch_search_in_files({
  searches: [
    { path: "src/auth/session.ts", pattern: "autentication" }
  ],
  isFuzzy: true,
  fuzzyThreshold: 0.7, // 70% similarity
  contextLines: 3
})

// Returns matches for "authentication" despite typo
// similarity: 0.85 → "authentication"

This tolerance is critical because LLMs are imprecise: the model might misspell a term, use a synonym, or guess at a naming convention. Fuzzy matching catches these cases without the overhead of computing embeddings.

LLM types "autentication" → Levenshtein similarity 0.85 → matches "authentication"

When to Use Which Approach

Emergent RAG isn't a replacement—it's an alternative for specific use cases.

Factor              | Vector RAG                              | File System RAG
Setup Cost          | High (embeddings, vector DB, chunking)  | Zero (files already exist)
Query Cost          | Low (one embedding + vector search)     | Higher (multiple tool calls)
Reasoning           | None (geometric similarity only)        | Full LLM reasoning at retrieval time
Data Freshness      | Stale (must re-embed on changes)        | Always current (reads live files)
Best For            | Large static corpora, semantic search   | Codebases, live documents, exploration
Structure Awareness | Lost (flat chunks)                      | Preserved (folder hierarchy, file names)

The Real Insight: Context Is Navigation

Traditional RAG treats retrieval as a lookup problem: given a query, return the most similar chunks. But when you give an LLM file system tools, it treats retrieval as a navigation problem: given a goal, explore until you find what you need.

This is closer to how humans actually work. When I need to understand authentication in a codebase, I don't compute embeddings—I:

  1. Look at the folder structure (list_directory)
  2. Find likely candidates (search_files("*auth*"))
  3. Search for key terms (search_in_file("jwt"))
  4. Read the relevant sections (read_file_lines)

The LLM does exactly this—but faster, in parallel, and with perfect recall of everything it's seen in the conversation.

RETRIEVAL AS NAVIGATION
Vector RAG
Query → Top-K → Done
One-shot lookup
vs
File System RAG
Query → Explore → Refine → Extract
Iterative navigation
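
In code, the difference is a loop. The sketch below is schematic only: callModel and runTool are hypothetical stand-ins for the model API and the MCP tool dispatch, not anything from the OODA server itself.

// Schematic retrieval-as-navigation loop (illustrative; names are hypothetical)
type Action =
  | { kind: "tool"; name: string; args: Record<string, unknown> }
  | { kind: "answer"; text: string };

async function navigate(
  goal: string,
  callModel: (history: string[]) => Promise<Action>,
  runTool: (name: string, args: Record<string, unknown>) => Promise<string>
): Promise<string> {
  const history: string[] = [`Goal: ${goal}`];
  for (let step = 0; step < 10; step++) {               // bounded exploration
    const action = await callModel(history);            // the model reasons over everything seen so far
    if (action.kind === "answer") return action.text;   // done: it found what it needed
    const observation = await runTool(action.name, action.args); // e.g. search_files, read_file_lines
    history.push(`${action.name} → ${observation}`);    // refine the next step with new evidence
  }
  return history.join("\n");                            // fall back to whatever was gathered
}

The one-shot top-K lookup of vector RAG is roughly what you get if this loop is forced to stop after a single iteration.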

Implementation: The Tool Stack

1. Discovery Layer

list_directory + search_files

// Broad discovery
search_files({
  directory: "src/",
  pattern: "*.auth.ts",
  recursive: true,
  maxResults: 100
})

// Returns: ["src/auth/session.auth.ts", "src/middleware/jwt.auth.ts", ...]

2. Location Layer

search_in_file + batch_search_in_files

// Find specific patterns with context
search_in_file({
  path: "src/auth/session.ts",
  pattern: "validateToken",
  contextLines: 5,  // 5 lines before and after
  maxMatches: 10
})

// Returns matches with surrounding context
// Line 47: "export function validateToken(token: string) {"
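
Server-side, a handler for a tool like this does not need to be elaborate. Here is a minimal Node sketch of the idea, not the OODA server's actual implementation:

// Minimal sketch of a search-in-file handler (illustrative only)
import { readFileSync } from "node:fs";

interface Match { line: number; text: string; context: string[] }

function searchInFile(path: string, pattern: string, contextLines = 3, maxMatches = 10): Match[] {
  const lines = readFileSync(path, "utf8").split("\n");
  const re = new RegExp(pattern, "i");
  const matches: Match[] = [];
  for (let i = 0; i < lines.length && matches.length < maxMatches; i++) {
    if (re.test(lines[i])) {
      matches.push({
        line: i + 1,
        text: lines[i],
        // Include surrounding lines so the LLM sees the match in context
        context: lines.slice(Math.max(0, i - contextLines), i + contextLines + 1),
      });
    }
  }
  return matches;
}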

3. Extraction Layer

read_file_lines with surgical precision

// Read just the function, not the whole file
read_file_lines({
  path: "src/auth/session.ts",
  startLine: 47,
  endLine: 72,
  includeLineNumbers: true
})

// Or read the last 50 lines of a log file
read_file_lines({
  path: "logs/auth.log",
  offset: -50  // Negative = from end, like tail
})
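
4. Batch Layer

batch_read_files and batch_search_in_files collapse many of the calls above into one round-trip. Here is a sketch of what a batched read might look like; the parameter names (paths, includeLineNumbers) are assumptions for illustration, not the server's documented schema.

// Pull several candidate files in a single tool call
batch_read_files({
  paths: [
    "src/auth/session.ts",
    "src/middleware/jwt.auth.ts",
    "src/utils/token.ts"
  ],
  includeLineNumbers: true
})

// One call, three files: fewer round-trips for the same context budget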

The Emergent Property

When you give an LLM these tools, something interesting emerges: it teaches itself retrieval strategies. Not through training, but through in-context reasoning about what it needs.

Ask it to find authentication code and it will search for auth, login, session, jwt, and token—without being told that these concepts are related. The semantic knowledge is already in the model; you just need to give it tools to act on that knowledge.

The Punchline

Vector RAG precomputes similarity and throws away reasoning.
File System RAG computes similarity through reasoning.

The LLM is the embedding model.

Is this always better than vector RAG? No. Is it simpler, cheaper, and more transparent for many use cases? Absolutely. And for codebases—where structure, naming conventions, and conceptual organization matter—it's often superior.

The tools discussed are part of the OODA MCP server, available in the Mnehmos MCP ecosystem.