It’s Just Vectors — Series Curriculum & Context #
Overview #
A 10-part tutorial series on AI embeddings, built with Go. Each part teaches one concept by building a real CLI tool. The series runs from “what is a vector” to “arithmetic on meaning.”
- Blog: https://claydon.co/series/its-just-vectors/
- Code repo: https://github.com/rikdc/semantic-search-experiments
- Providers: OpenAI text-embedding-3-small (cloud) and Ollama/Qwen (local, free)
- CLI framework: Cobra
- Embedding dimensions: 1536 (OpenAI), variable (Ollama)
Repository Structure #
tutorial/
├── shared/ # Shared embedder package (OpenAI + Ollama clients)
│ └── embedder/ # embedder.NewOpenAIClient / embedder.NewOllamaClient
├── partN/
│ ├── start/ # Stubbed — reader implements the TODOs
│ ├── complete/ # Reference solution, runs as standalone CLI
│ ├── walkthrough.md # Step-by-step guide for the start/ exercises
│ └── questions.md # Checkpoint questions to verify understanding
Each complete/ directory is a self-contained Go module with a working embed binary.
start/ has the same structure with core functions replaced by // TODO stubs and
failing unit tests that pass once the reader implements them correctly.
Writing Status #
| Part | Status | Title |
|---|---|---|
| Appendix | ✅ Written | Vector Math Primer |
| 1 | ✅ Written | Cosine Similarity & the embed analyze command |
| 2 | ✅ Written | Per-Category Centroids & the embed classify command |
| 3 | ✅ Written | Visualization & Dimensionality Reduction |
| 4 | ✅ Written | Synthetic Strings & Anomaly Detection |
| 5 | ❌ Not written | Z-scores & Statistical Thresholds |
| 6 | ❌ Not written | RAG Baseline — Semantic Code Search |
| 7 | ❌ Not written | Signal Cleaning — Enriched Chunks, Filtering, RRF |
| 8 | ❌ Not written | Vector Arithmetic & Embedding Steering |
| 9 | ❌ Not written | Contextual Synthesis — The “R” in RAG |
| 10 | ❌ Not written | Knowledge Graphs — Automated Code Relationship Mapping |
Phase Map #
Phase 1 — Classification & Visualization (Parts 1–3) #
Domain: product/transaction classification
Theme: the geometry of meaning; making vectors visible
Phase 2 — Anomaly Detection (Parts 4–5) #
Domain: credit card fraud detection
Theme: structured data → embeddings; statistical distance
Phase 3 — Semantic Code Search (Parts 6–7) #
Domain: indexing a Go repository
Theme: retrieval quality; why naive search fails and how to fix it
Phase 4 — Vector Arithmetic (Part 8) #
Domain: all prior domains
Theme: manipulating meaning with math; the payoff of the series
Phase 5 — LLM Synthesis & Knowledge Graphs (Parts 9–10) #
Domain: production RAG patterns
Theme: presentation layer (Part 9) and data layer (Part 10) on top of good vectors
Per-Part Detail #
Part 3 — Visualization & Dimensionality Reduction #
Command: embed visualize
Key concept: PCA projects 1536D vectors to 2D while preserving variance structure.
Go challenge: Implement PCA using gonum.org/v1/gonum/mat. Output an SVG scatter plot
with category-colored dots to output.svg.
Critical implementation note: The query vector must be appended to the data slice
BEFORE PCA runs — it cannot be projected independently, because PCA axes are computed
from the full dataset and the query must share that basis.
Verification: Run embed classify on the same inputs; clusters the classifier separates
must appear spatially distinct in the scatter plot.
Blog hook: You have 1536 numbers. You can’t see them. Here’s how to look.
Part 4 — Synthetic Strings & Anomaly Detection #
Command: embed detect
Key concept: Embedding raw structured data (e.g. 54.20) loses context. A synthetic
string that combines fields into a sentence gives the model the full picture.
Field order matters: put semantically heavy fields first (merchant, amount), modifiers
after (time, location). Models weight earlier tokens more.
Go challenge: FormatTransaction(t Transaction) string → compare raw vs synthetic
embeddings against the “Normal” centroid; show the anomaly sitting far from the cluster.
inputs.json example:
[
{ "label": "Normal_1", "category": "Grocery", "text": "Transaction: 45.00 USD at Whole Foods Market at 11:00 AM" },
{ "label": "Anomaly", "category": "Fraud_Candidate", "text": "Transaction: 4999.00 USD at High-End Jewelry Store at 03:45 AM" }
]
Threshold starting point: similarity < 0.8 to the “Normal” centroid = flagged.
Blog hook: Embedding 54.20 teaches the model about the number. Embedding a sentence
teaches it about a transaction.
Part 5 — Z-scores & Statistical Thresholds #
Command: embed detect --zscores
Key concept: Fixed thresholds don’t adapt to the data’s spread. A z-score of –2
means “statistically unusual” regardless of what the raw similarity value happens to be.
Go challenge: CalculateMeanStdDev(sims []float32) (float64, float64), then add
--zscores flag to embed detect. Flag anything below z = –2 (configurable).
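A sketch of CalculateMeanStdDev, assuming the population standard deviation (divide by n); per the shared principles, it accumulates in float64:

```go
package main

import (
	"fmt"
	"math"
)

// CalculateMeanStdDev returns the mean and population standard deviation
// of a slice of similarity scores, accumulating in float64 throughout.
func CalculateMeanStdDev(sims []float32) (float64, float64) {
	var sum float64
	for _, s := range sims {
		sum += float64(s)
	}
	mean := sum / float64(len(sims))
	var sqDiff float64
	for _, s := range sims {
		d := float64(s) - mean
		sqDiff += d * d
	}
	return mean, math.Sqrt(sqDiff / float64(len(sims)))
}

func main() {
	// A tight band around 0.95: even 0.85 is a statistical outlier here.
	mean, std := CalculateMeanStdDev([]float32{0.94, 0.95, 0.96, 0.95})
	z := (0.85 - mean) / std
	fmt.Printf("mean=%.3f std=%.4f z=%.1f\n", mean, std, z)
}
```

Flagging is then a single comparison: anything with z below –2 gets reported.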
Blog hook: 0.85 similarity might be fine if your data ranges 0.75–0.95, or alarming
if it ranges 0.94–0.98. The number alone doesn’t tell you which.
Connection to Part 1: Part 1 mentioned z-scores as the upgrade path from fixed
thresholds. This is where that seed pays off.
Part 6 — RAG Baseline #
Command: embed search
Key concept: Chunk source code by function, embed each chunk, store in a vector DB,
search by embedding a natural-language query. Semantic search finds InitDatabase when
you ask “how do I connect to the database.”
Go dependencies: github.com/philippgille/chromem-go (pure Go, no server required,
stores to a flat file on disk)
Chunking strategy: one function = one document. Not one file = one document.
End the post honestly: name the baseline wall. Simple cosine on raw function bodies
is 60–70% accurate. Execute() in a DB package scores nearly identically to Execute() in
a CLI package. Short functions get retrieved on the strength of a single variable name. Set up the problem
that Part 7 solves.
Optional comparison: run the same query against BM25 keyword search; show where
semantic wins and where it loses.
Blog hook: You search for “func ConnectDB”. You should be able to search for
“how do I link to the database.”
Part 7 — Signal Cleaning #
Command: embed search --hybrid
Key concept: Three techniques that each address a specific failure mode from Part 6.
Technique A — Enriched chunks (addresses structural context problem):
"File: db/connection.go | Package: db | Function: Connect | Body: func Connect() { ... }"
Same synthetic string pattern as Part 4, applied to code. “Pins” the vector to its
location in the project.
Technique B — Namespace filtering (addresses boilerplate noise):
--package flag on embed search. Hard filtering by directory beats soft scoring
when the domain is known. More effective than people expect.
Technique C — Reciprocal Rank Fusion / hybrid search (the surprising one):
Run keyword (BM25/regex) search AND vector search in parallel. Combine rankings using RRF.
FinalScore = 1/(k + rank_keyword) + 1/(k + rank_vector) where k=60 is standard.
Professional search engines (Elasticsearch, Algolia) do this. Neither search alone matches
the combined result.
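A sketch of the fusion step, assuming both searches return ranked slices of document IDs (function names here) and k = 60 as in the formula above:

```go
package main

import (
	"fmt"
	"sort"
)

// RRFFuse combines two ranked result lists with Reciprocal Rank Fusion:
// score(d) = 1/(k + rank_keyword) + 1/(k + rank_vector), ranks starting at 1.
// A document missing from one list simply contributes nothing for that list.
func RRFFuse(keyword, vector []string, k float64) []string {
	scores := make(map[string]float64)
	for i, id := range keyword {
		scores[id] += 1.0 / (k + float64(i+1))
	}
	for i, id := range vector {
		scores[id] += 1.0 / (k + float64(i+1))
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool { return scores[ids[a]] > scores[ids[b]] })
	return ids
}

func main() {
	keyword := []string{"ConnectDB", "Execute"}
	vector := []string{"InitDatabase", "ConnectDB"}
	// ConnectDB appears in both rankings, so it tops the fused list.
	fmt.Println(RRFFuse(keyword, vector, 60)) // [ConnectDB InitDatabase Execute]
}
```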
Blog structure: comparison table showing baseline vs enriched vs hybrid accuracy.
Blog hook: The results from Part 6 felt broken. Here’s why, and here are three fixes
you can implement today.
Part 8 — Vector Arithmetic & Embedding Steering #
Command: embed steer
Key concept: Directions in embedding space encode semantic relationships. You can
add and subtract them. king – man + woman ≈ queen is the classic example; in your
domain, query – boilerplate_centroid shifts search results away from if err != nil
blocks.
Core function:
// Vq = query vector, Vb = boost/suppress concept centroid, W = weight
func SteerEmbedding(Vq, Vb []float32, W float32, suppress bool) []float32 {
// result = Normalize(Vq + (W * Vb)) for boost
// result = Normalize(Vq - (W * Vb)) for suppress
}
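A runnable sketch of the stub above; Normalize is inlined as L2 normalization, which is an assumption about the tutorial's helper:

```go
package main

import (
	"fmt"
	"math"
)

// SteerEmbedding shifts a query vector toward (boost) or away from (suppress)
// a concept centroid, then re-normalizes so cosine comparisons stay valid.
func SteerEmbedding(vq, vb []float32, w float32, suppress bool) []float32 {
	if suppress {
		w = -w
	}
	out := make([]float32, len(vq))
	var norm float64 // accumulate in float64, per the shared principles
	for i := range vq {
		out[i] = vq[i] + w*vb[i]
		norm += float64(out[i]) * float64(out[i])
	}
	n := math.Sqrt(norm)
	for i := range out {
		out[i] = float32(float64(out[i]) / n)
	}
	return out
}

func main() {
	vq := []float32{3, 0}
	vb := []float32{0, 4}
	fmt.Println(SteerEmbedding(vq, vb, 1, false)) // boosted toward vb: [0.6 0.8]
}
```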
CLI flags: --boost "logging" and --suppress "error handling" on embed search.
Concrete demo: embed search "error handling" --suppress "if err != nil" shifts
results from trivial nil checks toward actual log implementations.
Why this comes before Parts 9–10: Vector arithmetic improves the intelligence of
your vectors. Parts 9–10 are engineering layers (presentation, data structure) that sit
on top. Better to get the geometry right first.
Blog hook: You’ve spent seven parts learning to find meaning in vector space. Now
you can move it.
Part 9 — Contextual Synthesis (The “R” in RAG) #
Command: embed search --explain
Key concept: Retrieval finds the right functions; synthesis explains how they answer
the specific question. The LLM is a presentation layer, not the intelligence layer —
that’s the vectors.
Implementation: After embed search returns top-K results, send the retrieved code
plus the original query to OpenAI/Qwen. Prompt: “Given this Go code, explain how it
answers the question: {query}. Be specific about which lines are relevant and why.”
Why this comes after Part 8: If your retrieval is noisy, the LLM summarizes noise.
Get the vectors right first (Parts 6–8), then add the explanation layer.
Go challenge: Stream the LLM response to the terminal while also showing the
similarity scores and source locations.
Blog hook: Getting the right function back is retrieval. Getting an explanation of
why it answers your question is RAG. Those are different problems.
Part 10 — Knowledge Graphs #
Command: embed graph
Key concept: If func A calls func B, they are semantically related even if their
text embeddings are distant. Call graphs encode structural relationships that vectors miss.
Implementation: Use Go’s go/ast and go/parser packages for static analysis.
Walk the AST to build a call graph. Store relationships alongside embeddings in chromem-go.
At query time, expand the top-K results by one hop in the call graph.
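A minimal sketch of the AST walk, assuming the tutorial resolves only direct calls by identifier (method calls and cross-package selectors need more work):

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// BuildCallGraph records, for each top-level function, the names of the
// functions it calls directly. Simplified: plain identifiers only.
func BuildCallGraph(src string) map[string][]string {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		panic(err)
	}
	graph := make(map[string][]string)
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue // skip non-functions and bodiless declarations
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			if call, ok := n.(*ast.CallExpr); ok {
				if ident, ok := call.Fun.(*ast.Ident); ok {
					graph[fn.Name.Name] = append(graph[fn.Name.Name], ident.Name)
				}
			}
			return true
		})
	}
	return graph
}

func main() {
	src := `package db
func Connect() { open() }
func open() {}`
	fmt.Println(BuildCallGraph(src)) // map[Connect:[open]]
}
```

Because the parser hands back full *ast.FuncDecl nodes, chunking from the AST also sidesteps the broken-signature problem: the declaration's position range covers the whole signature, however many lines it spans.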
The broken-signature problem: Chunking by function boundary can split a function
signature from its body if the signature spans multiple lines. The AST walk must capture
the full declaration including the opening brace, not just the func keyword line.
Contextual injection: when returning results, include the callers and callees of the
matched function, not just the function itself.
Why last: This is the most complex implementation in the series (static analysis +
graph traversal + vector retrieval). It also requires the enriched indexer from Part 7
and the explanation layer from Part 9 to be fully useful.
Blog hook: Two functions can be textually different and semantically identical if one
calls the other. Embeddings can’t see that. The call graph can.
Shared Design Principles (apply to every part) #
Before any API call #
Test the math against hand-crafted vectors where the expected output is obvious.
wirelessMouse := []float32{8.5, 1.2, 9.2} → verify before touching embeddings.
float64 for intermediate calculations #
Accumulating float32 products introduces rounding error. The diagonal of a similarity matrix (a vector compared to itself) should be exactly 1.0000. Use float64 internally, return float32.
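Both principles fit in one small function. A sketch, assuming a hypothetical CosineSimilarity helper: accumulate in float64, return float32, and verify the self-similarity diagonal against hand-crafted vectors before any API call:

```go
package main

import (
	"fmt"
	"math"
)

// CosineSimilarity accumulates dot product and norms in float64 to avoid
// float32 rounding error, then narrows the result back to float32.
func CosineSimilarity(a, b []float32) float32 {
	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}
	return float32(dot / (math.Sqrt(normA) * math.Sqrt(normB)))
}

func main() {
	// Hand-crafted vector with an obvious expected output:
	// a vector compared to itself must score exactly 1.
	wirelessMouse := []float32{8.5, 1.2, 9.2}
	fmt.Println(CosineSimilarity(wirelessMouse, wirelessMouse))
}
```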
The series arc #
Every part teaches a skill that the next part depends on:
- Parts 1–2 establish the geometry
- Part 3 makes it visible
- Parts 4–5 apply it to structured data with statistical rigor
- Parts 6–7 show where naive retrieval fails and how to fix it
- Part 8 is the payoff: you can manipulate the geometry directly
- Parts 9–10 are production engineering on top of a reliable foundation
Blog voice (from blog-writer skill) #
- Open with the problem, not with context
- Concrete before abstract — worked example first, then formula
- Name failures honestly; don’t hide the baseline wall
- Dry, first-person, technically specific
- End at the right place, not with a summary of what you just said
Key Go Packages Used Across the Series #
| Package | Used in | Purpose |
|---|---|---|
| gonum.org/v1/gonum/mat | Part 3 | PCA / matrix operations |
| github.com/philippgille/chromem-go | Parts 6–10 | Local vector store |
| go/ast, go/parser | Part 10 | AST walking for call graph |
| github.com/spf13/cobra | All parts | CLI framework |
| shared/embedder | All parts | OpenAI + Ollama client wrapper |