It’s Just Vectors — Series Curriculum & Context #

Overview #

A 10-part tutorial series on AI embeddings, built with Go. Each part teaches one concept by building a real CLI tool. The series runs from “what is a vector” to “arithmetic on meaning.”


Repository Structure #

tutorial/
├── shared/              # Shared embedder package (OpenAI + Ollama clients)
│   └── embedder/        # embedder.NewOpenAIClient / embedder.NewOllamaClient
├── partN/
│   ├── start/           # Stubbed — reader implements the TODOs
│   ├── complete/        # Reference solution, runs as standalone CLI
│   ├── walkthrough.md   # Step-by-step guide for the start/ exercises
│   └── questions.md     # Checkpoint questions to verify understanding

Each complete/ directory is a self-contained Go module with a working embed binary. start/ has the same structure with core functions replaced by // TODO stubs and failing unit tests that pass once the reader implements them correctly.


Writing Status #

| Part | Status | Title |
|------|--------|-------|
| Appendix | ✅ Written | Vector Math Primer |
| 1 | ✅ Written | Cosine Similarity & the embed analyze command |
| 2 | ✅ Written | Per-Category Centroids & the embed classify command |
| 3 | ✅ Written | Visualization & Dimensionality Reduction |
| 4 | ✅ Written | Synthetic Strings & Anomaly Detection |
| 5 | ❌ Not written | Z-scores & Statistical Thresholds |
| 6 | ❌ Not written | RAG Baseline — Semantic Code Search |
| 7 | ❌ Not written | Signal Cleaning — Enriched Chunks, Filtering, RRF |
| 8 | ❌ Not written | Vector Arithmetic & Embedding Steering |
| 9 | ❌ Not written | Contextual Synthesis — The “R” in RAG |
| 10 | ❌ Not written | Knowledge Graphs — Automated Code Relationship Mapping |

Phase Map #

Phase 1 — Classification & Visualization (Parts 1–3) #

Domain: product/transaction classification
Theme: the geometry of meaning; making vectors visible

Phase 2 — Anomaly Detection (Parts 4–5) #

Domain: credit card fraud detection
Theme: structured data → embeddings; statistical distance

Phase 3 — Semantic Code Search (Parts 6–7) #

Domain: indexing a Go repository
Theme: retrieval quality; why naive search fails and how to fix it

Phase 4 — Vector Arithmetic (Part 8) #

Domain: all prior domains
Theme: manipulating meaning with math; the payoff of the series

Phase 5 — LLM Synthesis & Knowledge Graphs (Parts 9–10) #

Domain: production RAG patterns
Theme: presentation layer (Part 9) and data layer (Part 10) on top of good vectors


Per-Part Detail #

Part 3 — Visualization & Dimensionality Reduction #

Command: embed visualize
Key concept: PCA projects 1536D vectors to 2D while preserving variance structure.
Go challenge: Implement PCA using gonum.org/v1/gonum/mat. Output an SVG scatter plot with category-colored dots to output.svg.
Critical implementation note: The query vector must be appended to the data slice BEFORE PCA runs — it cannot be projected independently, because PCA axes are computed from the full dataset and the query must share that basis.
Verification: Run embed classify on the same inputs; clusters the classifier separates must appear spatially distinct in the scatter plot.
Blog hook: You have 1536 numbers. You can’t see them. Here’s how to look.

Part 4 — Synthetic Strings & Anomaly Detection #

Command: embed detect
Key concept: Embedding raw structured data (e.g. 54.20) loses context. A synthetic string that combines fields into a sentence gives the model the full picture. Field order matters: put semantically heavy fields first (merchant, amount), modifiers after (time, location). Models weight earlier tokens more.
Go challenge: FormatTransaction(t Transaction) string → compare raw vs synthetic embeddings against the “Normal” centroid; show the anomaly sitting far from the cluster.
inputs.json example:

[
  { "label": "Normal_1", "category": "Grocery", "text": "Transaction: 45.00 USD at Whole Foods Market at 11:00 AM" },
  { "label": "Anomaly",  "category": "Fraud_Candidate", "text": "Transaction: 4999.00 USD at High-End Jewelry Store at 03:45 AM" }
]

Threshold starting point: similarity < 0.8 to the “Normal” centroid = flagged.
Blog hook: Embedding 54.20 teaches the model about the number. Embedding a sentence teaches it about a transaction.
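A minimal sketch of FormatTransaction. The Transaction field set here is an assumption (the actual struct is up to the reader); the output format matches the inputs.json example above:

```go
package main

import "fmt"

// Transaction is a hypothetical field set; adjust to the real data.
type Transaction struct {
	Amount   float64
	Currency string
	Merchant string
	Time     string
}

// FormatTransaction builds a synthetic string: semantically heavy fields
// (amount, merchant) first, modifiers (time) after.
func FormatTransaction(t Transaction) string {
	return fmt.Sprintf("Transaction: %.2f %s at %s at %s",
		t.Amount, t.Currency, t.Merchant, t.Time)
}

func main() {
	t := Transaction{Amount: 45.00, Currency: "USD", Merchant: "Whole Foods Market", Time: "11:00 AM"}
	fmt.Println(FormatTransaction(t))
	// → Transaction: 45.00 USD at Whole Foods Market at 11:00 AM
}
```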

Part 5 — Z-scores & Statistical Thresholds #

Command: embed detect --zscores
Key concept: Fixed thresholds don’t adapt to the data’s spread. A z-score of –2 means “statistically unusual” regardless of what the raw similarity value happens to be.
Go challenge: CalculateMeanStdDev(sims []float32) (float64, float64), then add --zscores flag to embed detect. Flag anything below z = –2 (configurable).
Blog hook: 0.85 similarity might be fine if your data ranges 0.75–0.95, or alarming if it ranges 0.94–0.98. The number alone doesn’t tell you which.
Connection to Part 1: Part 1 mentioned z-scores as the upgrade path from fixed thresholds. This is where that seed pays off.

Part 6 — RAG Baseline #

Command: embed search
Key concept: Chunk source code by function, embed each chunk, store in a vector DB, search by embedding a natural-language query. Semantic search finds InitDatabase when you ask “how do I connect to the database.”
Go dependencies: github.com/philippgille/chromem-go (pure Go, no server required, stores to a flat file on disk)
Chunking strategy: one function = one document. Not one file = one document.
End the post honestly: name the baseline wall. Simple cosine on raw function bodies is 60–70% accurate. Execute() in a DB package scores nearly identically to Execute() in a CLI package. Short functions get pulled up by single variable names. Set up the problem that Part 7 solves.
Optional comparison: run the same query against BM25 keyword search; show where semantic wins and where it loses.
Blog hook: You search for “func ConnectDB”. You should be able to search for “how do I link to the database.”
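The one-function-one-document chunking can be done with the standard library alone. A sketch (embedding and chromem-go storage omitted; the source string and filename are invented):

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const src = `package db

// Connect opens the database connection.
func Connect() error { return nil }

func Execute(q string) error { return nil }
`

// chunkByFunction returns one document per top-level function:
// the full declaration text, ready to embed.
func chunkByFunction(filename, source string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, source, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var chunks []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok {
			continue // skip imports, types, vars
		}
		start := fset.Position(fn.Pos()).Offset
		end := fset.Position(fn.End()).Offset
		chunks = append(chunks, source[start:end])
	}
	return chunks, nil
}

func main() {
	chunks, err := chunkByFunction("connection.go", src)
	if err != nil {
		panic(err)
	}
	for i, c := range chunks {
		fmt.Printf("--- chunk %d ---\n%s\n", i, c)
	}
}
```

Slicing by `fn.Pos()`/`fn.End()` offsets captures the whole declaration regardless of how many lines the signature spans, which matters again in Part 10.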

Part 7 — Signal Cleaning #

Command: embed search --hybrid
Key concept: Three techniques that each address a specific failure mode from Part 6.
Technique A — Enriched chunks (addresses the structural context problem):

"File: db/connection.go | Package: db | Function: Connect | Body: func Connect() { ... }"

Same synthetic string pattern as Part 4, applied to code. “Pins” the vector to its location in the project.
Technique B — Namespace filtering (addresses boilerplate noise): --package flag on embed search. Hard filtering by directory beats soft scoring when the domain is known. More effective than people expect.
Technique C — Reciprocal Rank Fusion / hybrid search (the surprising one): Run keyword (BM25/regex) search AND vector search in parallel, then combine rankings with RRF: FinalScore = 1/(k + rank_keyword) + 1/(k + rank_vector), where k = 60 is the standard constant. Professional search engines (Elasticsearch, Algolia) do this. Neither search alone matches the combined result.
Blog structure: a comparison table showing baseline vs enriched vs hybrid accuracy.
Blog hook: The results from Part 6 felt broken. Here’s why, and here are three fixes you can implement today.
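The RRF formula in a few lines of Go. The document IDs and rankings are invented; each list is 1-based-ranked output from one search backend:

```go
package main

import (
	"fmt"
	"sort"
)

// rrf fuses two ranked result lists with Reciprocal Rank Fusion.
// Ranks are 1-based; k = 60 is the standard damping constant.
func rrf(keyword, vector []string, k float64) []string {
	score := map[string]float64{}
	for i, id := range keyword {
		score[id] += 1 / (k + float64(i+1))
	}
	for i, id := range vector {
		score[id] += 1 / (k + float64(i+1))
	}
	ids := make([]string, 0, len(score))
	for id := range score {
		ids = append(ids, id)
	}
	// Highest fused score first.
	sort.Slice(ids, func(a, b int) bool { return score[ids[a]] > score[ids[b]] })
	return ids
}

func main() {
	keyword := []string{"db.Connect", "cli.Execute", "db.Query"}
	vector := []string{"db.Query", "db.Connect", "log.Setup"}
	fmt.Println(rrf(keyword, vector, 60))
	// → [db.Connect db.Query cli.Execute log.Setup]
}
```

db.Connect wins because both backends rank it highly, even though neither ranks it first in both lists; that is the whole trick.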

Part 8 — Vector Arithmetic & Embedding Steering #

Command: embed steer
Key concept: Directions in embedding space encode semantic relationships. You can add and subtract them. king – man + woman ≈ queen is the classic example; in your domain, query – boilerplate_centroid shifts search results away from if err != nil blocks.
Core function:

// Vq = query vector, Vb = boost/suppress concept centroid, W = weight
func SteerEmbedding(Vq, Vb []float32, W float32, suppress bool) []float32 {
    if suppress {
        W = -W // subtract the concept instead of adding it
    }
    result := make([]float32, len(Vq))
    for i := range Vq {
        result[i] = Vq[i] + W*Vb[i] // Vq ± W*Vb
    }
    return Normalize(result) // re-normalize so cosine comparisons stay valid
}

CLI flags: --boost "logging" and --suppress "error handling" on embed search.
Concrete demo: embed search "error handling" --suppress "if err != nil" shifts results from trivial nil checks toward actual log implementations.
Why this comes before Parts 9–10: Vector arithmetic improves the intelligence of your vectors. Parts 9–10 are engineering layers (presentation, data structure) that sit on top. Better to get the geometry right first.
Blog hook: You’ve spent seven parts learning to find meaning in vector space. Now you can move it.

Part 9 — Contextual Synthesis (The “R” in RAG) #

Command: embed search --explain
Key concept: Retrieval finds the right functions; synthesis explains how they answer the specific question. The LLM is a presentation layer, not the intelligence layer — that’s the vectors.
Implementation: After embed search returns top-K results, send the retrieved code plus the original query to OpenAI/Qwen. Prompt: “Given this Go code, explain how it answers the question: {query}. Be specific about which lines are relevant and why.”
Why this comes after Part 8: If your retrieval is noisy, the LLM summarizes noise. Get the vectors right first (Parts 6–8), then add the explanation layer.
Go challenge: streaming the LLM response to the terminal while also showing the similarity scores and source locations.
Blog hook: Getting the right function back is retrieval. Getting an explanation of why it answers your question is RAG. Those are different problems.

Part 10 — Knowledge Graphs #

Command: embed graph
Key concept: If func A calls func B, they are semantically related even if their text embeddings are distant. Call graphs encode structural relationships that vectors miss.
Implementation: Use Go’s go/ast and go/parser packages for static analysis. Walk the AST to build a call graph. Store relationships alongside embeddings in chromem-go. At query time, expand the top-K results by one hop in the call graph.
The broken-signature problem: Chunking by function boundary can split a function signature from its body if the signature spans multiple lines. The AST walk must capture the full declaration including the opening brace, not just the func keyword line.
Contextual injection: when returning results, include the callers and callees of the matched function, not just the function itself.
Why last: This is the most complex implementation in the series (static analysis + graph traversal + vector retrieval). It also requires the enriched indexer from Part 7 and the explanation layer from Part 9 to be fully useful.
Blog hook: Two functions can be textually different and semantically identical if one calls the other. Embeddings can’t see that. The call graph can.
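The call-graph core fits in one AST walk. A sketch on an invented source string, recording only plain-identifier calls (method and package-qualified calls would need the *ast.SelectorExpr case as well):

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const src = `package demo

func A() { B() }
func B() { helper() }
func helper() {}
`

// buildCallGraph maps each top-level function to the identifiers it calls.
func buildCallGraph(source string) map[string][]string {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", source, 0)
	if err != nil {
		panic(err)
	}
	graph := map[string][]string{}
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok {
			continue
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			if call, ok := n.(*ast.CallExpr); ok {
				if ident, ok := call.Fun.(*ast.Ident); ok {
					graph[fn.Name.Name] = append(graph[fn.Name.Name], ident.Name)
				}
			}
			return true
		})
	}
	return graph
}

func main() {
	g := buildCallGraph(src)
	fmt.Println("A ->", g["A"]) // → A -> [B]
	fmt.Println("B ->", g["B"]) // → B -> [helper]
}
```

One-hop expansion at query time is then a map lookup on the matched function's name, in both directions if a reverse (callers) map is built the same way.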


Shared Design Principles (apply to every part) #

Before any API call #

Test the math against hand-crafted vectors where the expected output is obvious. wirelessMouse := []float32{8.5, 1.2, 9.2} → verify before touching embeddings.

float64 for intermediate calculations #

Accumulating float32 products introduces rounding error. The diagonal of a similarity matrix (a vector compared to itself) should be exactly 1.0000. Use float64 internally, return float32.
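A cosine similarity that follows both principles: float64 accumulation internally, float32 in and out, verified first against the hand-crafted vector from above:

```go
package main

import (
	"fmt"
	"math"
)

// CosineSimilarity accumulates in float64 so the self-similarity
// diagonal comes out as 1.0000 instead of 0.9999 or 1.0001.
func CosineSimilarity(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return float32(dot / (math.Sqrt(na) * math.Sqrt(nb)))
}

func main() {
	// Hand-crafted vector: verify the math before touching embeddings.
	wirelessMouse := []float32{8.5, 1.2, 9.2}
	fmt.Printf("self-similarity: %.4f\n", CosineSimilarity(wirelessMouse, wirelessMouse))
	// → self-similarity: 1.0000
}
```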

The series arc #

Every part teaches a skill that the next part depends on:

  • Parts 1–2 establish the geometry
  • Part 3 makes it visible
  • Parts 4–5 apply it to structured data with statistical rigor
  • Parts 6–7 show where naive retrieval fails and how to fix it
  • Part 8 is the payoff: you can manipulate the geometry directly
  • Parts 9–10 are production engineering on top of a reliable foundation

Blog voice (from blog-writer skill) #

  • Open with the problem, not with context
  • Concrete before abstract — worked example first, then formula
  • Name failures honestly; don’t hide the baseline wall
  • Dry, first-person, technically specific
  • End at the right place, not with a summary of what you just said

Key Go Packages Used Across the Series #

| Package | Used in | Purpose |
|---------|---------|---------|
| gonum.org/v1/gonum/mat | Part 3 | PCA / matrix operations |
| github.com/philippgille/chromem-go | Parts 6–10 | Local vector store |
| go/ast, go/parser | Part 10 | AST walking for call graph |
| github.com/spf13/cobra | All parts | CLI framework |
| shared/embedder | All parts | OpenAI + Ollama client wrapper |