It's Just Vectors
Semantic Code Search in Go: Indexing Functions and Learning Where It Fails
Semantic search over source code is often introduced as “grep, but smarter.” That framing is both directionally correct and operationally useless. The actual problem is whether your retrieval layer can consistently surface the right code unit under ambiguity, noise, and naming collisions. This post builds a minimal retrieval system over Go functions and then lets it fail in predictable ways.
The point is to have something concrete to measure against before adding complexity.
From strings to meaning (sort of) #
Traditional search tools like grep operate on exact matches. They are fast, deterministic, and blind to intent. If you search for “database,” you get lines containing that token, but if your code calls the function Connect(), you are on your own.
Semantic search replaces string matching with vector similarity. Each piece of text is embedded into a high-dimensional vector space. Queries are embedded the same way. Retrieval becomes a nearest-neighbor problem.
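Concretely, retrieval comes down to one similarity function applied between the query vector and every stored vector. A minimal sketch in Go, assuming []float32 embeddings (the function name is illustrative, not the tutorial's exact API):

import "math"

// cosineSimilarity scores how aligned two embeddings are, independent of
// magnitude: 1.0 means identical direction, values near 0 mean unrelated.
func cosineSimilarity(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}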
This sounds clean. It is not.
Embedding models do not “understand” code in a structural sense. They do not parse ASTs or resolve types. They operate on statistical patterns in text. That is good enough to map “write JSON response” to WriteJSON(). It is not good enough to reliably distinguish two unrelated functions named Execute().
Choosing the unit of indexing #
The first non-trivial decision is what constitutes a document.
Indexing entire files is tempting, but wrong for most cases: a file can contain multiple unrelated functions, and embedding the whole thing produces a blended representation that dilutes signal.
Indexing one function per document is a better default. Each embedding represents a single unit of behavior, queries match against focused context rather than mixed concerns, and retrieval granularity aligns with how developers actually think about code.
The implementation uses Go’s AST to extract function declarations:
for _, decl := range f.Decls {
	fn, ok := decl.(*ast.FuncDecl)
	if !ok {
		continue // skip imports, types, and package-level variables
	}
	// Slice the original source by byte offsets so each chunk keeps the
	// function's exact formatting and any inline comments.
	start := fset.Position(fn.Pos()).Offset
	end := fset.Position(fn.End()).Offset
	body := src[start:end]
	docs = append(docs, Document{
		ID:      fmt.Sprintf("%s::%s", path, fn.Name.Name),
		Content: body,
	})
}
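The loop assumes the usual go/parser setup around it. A sketch of how ExtractFunctions might be wrapped, with an assumed minimal Document type (the repo's actual signature and fields may differ):

import (
	"go/parser"
	"go/token"
	"os"
)

// Document is one indexed unit: a single function and its identifier.
type Document struct {
	ID      string // e.g. "corpus/db/db.go::Connect"
	Content string // the function's source text
}

func ExtractFunctions(path string) ([]Document, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	src := string(data)

	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, path, src, parser.ParseComments)
	if err != nil {
		return nil, err
	}

	var docs []Document
	// ... the declaration loop from above goes here ...
	return docs, nil
}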
This works well enough, with one caveat: method receivers and package context are ignored, and that omission becomes important later.
Storing vectors without infrastructure #
A retrieval system needs persistence, because recomputing embeddings on every query is wasteful and slow.
The implementation uses a lightweight, in-process vector store. No server, no orchestration, no Docker. Just files on disk.
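A store that small can be a slice of structs serialized to JSON. A sketch under that assumption (StoredDoc, Save, and Load are illustrative names, not necessarily the repo's):

import (
	"encoding/json"
	"os"
)

// StoredDoc pairs an indexed function with its embedding vector.
type StoredDoc struct {
	ID        string    `json:"id"`
	Content   string    `json:"content"`
	Embedding []float32 `json:"embedding"`
}

// Save writes the whole index to a single JSON file on disk.
func Save(path string, docs []StoredDoc) error {
	data, err := json.MarshalIndent(docs, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// Load reads the index back for querying.
func Load(path string) ([]StoredDoc, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var docs []StoredDoc
	if err := json.Unmarshal(data, &docs); err != nil {
		return nil, err
	}
	return docs, nil
}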
This constraint is intentional. Adding infrastructure early hides the core problem behind operational noise. At this stage, the interesting questions are about retrieval quality, not cluster configuration.
Indexing the codebase #
The indexer walks a small synthetic repository and embeds each function. The corpus is deliberately constructed to expose failure modes: duplicate function names across packages, short functions dominated by boilerplate, and mixed concerns within similar naming patterns.
The result is a persisted collection of function embeddings. Once indexed, queries only need to embed the input string and compute similarity against stored vectors.
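Under the same assumptions as the sketches above, the query path is a linear scan over the stored vectors. A sketch of what SearchSimilar might look like, with the embedding call abstracted behind a caller-supplied function:

import "sort"

// Result pairs a document ID with its similarity to the query.
type Result struct {
	ID    string
	Score float64
}

// SearchSimilar embeds the query, scores every stored document with
// cosineSimilarity (defined earlier), and returns the topK best matches.
// A brute-force scan is fine at this corpus size; no index structure needed.
func SearchSimilar(query string, docs []StoredDoc, embed func(string) ([]float32, error), topK int) ([]Result, error) {
	qvec, err := embed(query)
	if err != nil {
		return nil, err
	}
	results := make([]Result, 0, len(docs))
	for _, d := range docs {
		results = append(results, Result{ID: d.ID, Score: cosineSimilarity(qvec, d.Embedding)})
	}
	sort.Slice(results, func(i, j int) bool { return results[i].Score > results[j].Score })
	if topK < len(results) {
		results = results[:topK]
	}
	return results, nil
}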
Querying and early wins #
Basic queries behave as expected:
- “how do I connect to the database” → Connect()
- “check if a token is valid” → ValidateToken()
- “write a JSON response” → WriteJSON()
These are the easy cases. The function bodies contain enough semantic signal for the embedding model to align them with the query.
This is where most demos stop. It is also where things start getting misleading.
Where it goes wrong #
Two failure modes show up immediately.
Name collisions with weak context #
Consider two functions: db.Execute() and cli.Execute(). They share a name and contain similar structural patterns, including error handling and control flow. The embedding model sees overlapping tokens and assigns similar vectors.
Query:
"run a command"
Result:
- cli.Execute()
- db.Execute()
The scores are nearly identical.
This is not a bug but a limitation of the input representation: the model only sees function bodies and has no awareness of package context or intent.
This part is confusing the first time through because the results feel almost right. The system is not failing catastrophically. It is failing subtly, which is worse.
Short functions and signal dilution #
Short functions are dominated by boilerplate:
if err != nil {
	return err
}
When the meaningful content is small, common patterns take over the embedding. Two unrelated functions can appear similar simply because they share idiomatic structure.
Example query:
"how long before a request times out"
Result:
- getTimeout() (correct)
- cache.Set() (not even close)
The second result leaks in because both functions are short and structurally similar.
This is where it got weird. The model is not wrong in a human sense. It is consistent with the data it sees. The issue is that the data is not rich enough to differentiate intent.
What the numbers tell you #
On this synthetic dataset, function-body embeddings land around 60 to 70 percent top-1 accuracy. That number shifts with the model, the corpus, and the query set, but measuring it at all is the point: without a baseline, every tweak feels like progress, and that is how bad retrieval systems acquire suspiciously confident defenders.
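Measuring it does not take much machinery. A sketch of a top-1 evaluation loop, reusing the Result type from the search sketch above (the query set is whatever you wrote the corpus around):

// EvalCase pairs a query with the document ID that should rank first.
type EvalCase struct {
	Query  string
	WantID string
}

// topOneAccuracy runs each query through the supplied search function and
// counts how often the expected document comes back in first place.
func topOneAccuracy(cases []EvalCase, search func(string) ([]Result, error)) (float64, error) {
	if len(cases) == 0 {
		return 0, nil
	}
	hits := 0
	for _, c := range cases {
		results, err := search(c.Query)
		if err != nil {
			return 0, err
		}
		if len(results) > 0 && results[0].ID == c.WantID {
			hits++
		}
	}
	return float64(hits) / float64(len(cases)), nil
}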
The design choices here were deliberately simplified. Skipping package and receiver context kept things tractable but hurt disambiguation. Using raw function bodies meant less preprocessing but left boilerplate dominating embeddings for short functions. These were not oversights; they were constraints to keep the failure modes visible rather than papered over.
And the failures are not random. They point toward enriching documents with package context, reducing boilerplate influence, and introducing weighting strategies, but those changes are only meaningful once you have seen what the baseline actually does. Skipping straight to fixes leads to systems that feel tuned without being understood.
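For concreteness, the package-context enrichment is the smallest of those changes: prefix each chunk with where the function lives before embedding it. A sketch of that direction, not Part 7's actual code:

import (
	"fmt"
	"go/ast"
	"go/types"
)

// contextHeader builds a short prefix naming the package, receiver, and
// function, so the embedded text carries more than the bare body.
func contextHeader(pkgName string, fn *ast.FuncDecl) string {
	recv := ""
	if fn.Recv != nil && len(fn.Recv.List) > 0 {
		// types.ExprString renders the receiver type, e.g. "*Client".
		recv = types.ExprString(fn.Recv.List[0].Type) + "."
	}
	return fmt.Sprintf("package %s\nfunc %s%s\n\n", pkgName, recv, fn.Name.Name)
}

In the extraction loop this amounts to Content: contextHeader(f.Name.Name, fn) + body instead of the bare body.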
The gap that matters #
Semantic search over code works just well enough to be dangerous. It produces convincing results in simple cases and quietly degrades in edge cases that matter in real systems.
This baseline is intentionally imperfect. It surfaces the gap between matching meaning and retrieving the right thing in context. That gap is where most of the real work lives.
Finding Connect() is the easy part. The question worth sitting with is why the model confuses two Execute() functions, and what you choose to do about it.
Getting started #
cd tutorial/part6/
complete/ has the full implementation. start/ has ExtractFunctions and SearchSimilar stubbed out; the AST walking setup, document storage, and embedding plumbing are intact.
export OPENAI_API_KEY="sk-your-key-here"
cd start/
go mod tidy
go build -o embed .
./embed index --dir ./corpus
./embed search "how do I connect to the database"
# or Ollama (no API key required)
./embed index --dir ./corpus --provider ollama --model embeddinggemma
./embed search "how do I connect to the database" --provider ollama --model embeddinggemma
Run the name collision query "run a command", and look at the scores for db.Execute() and cli.Execute(). They will be close, and that’s the number to beat.
What’s next #
The failures here are specific: structural context is missing, boilerplate dominates short functions, and names collide without disambiguation. Part 7 addresses all three — enriching chunks with package and file context, filtering by namespace, and combining keyword and vector search using reciprocal rank fusion. The accuracy improvement is real, and so is the reason it works.
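Reciprocal rank fusion itself is small. The standard formulation gives each document 1/(k + rank) from every ranked list it appears in, with k conventionally around 60; a sketch, not necessarily Part 7's exact code:

// reciprocalRankFusion merges ranked lists of document IDs from different
// retrievers (e.g. keyword and vector search). Documents near the top of
// any list get a large 1/(k+rank) contribution, and agreement compounds.
func reciprocalRankFusion(rankings [][]string, k float64) map[string]float64 {
	fused := make(map[string]float64)
	for _, ranking := range rankings {
		for i, id := range ranking {
			fused[id] += 1.0 / (k + float64(i+1))
		}
	}
	return fused
}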
The tutorial repository and all code: rikdc/semantic-search-experiments