It's Just Vectors
Z-Scores Over Gut Feel: Teaching Similarity Scores What “Weird” Actually Means
A fixed similarity threshold is easy to implement and surprisingly hard to justify. It assumes your data behaves consistently, your model behaves consistently, and your definition of “normal” does not drift. None of those assumptions tend to hold for long.
Up to this point in the series, we have treated cosine similarity as a raw signal. A number comes out, we compare it to a cutoff, and we move on. Here we start treating those numbers as a distribution instead.
Once similarity becomes a distribution, you can stop asking “is this low?” and start asking “how unusual is this compared to normal?”
The problem with fixed thresholds #
In Part 4, we used a cutoff of 0.80, and it worked because the anomaly was extreme. A different dataset would have required a different number, and there is no principled way to know what that number should be until something slips through.
The issue is that similarity scores are not absolute. They only make sense relative to the data that produced them.
Consider two datasets:
- Dataset A: similarities range from 0.75 to 0.95
- Dataset B: similarities range from 0.94 to 0.98
A score of 0.85 sits in the middle of Dataset A and would be a red flag in Dataset B… Same number, completely different situation.
A fixed threshold ignores this: it treats similarity as if it exists on a stable scale. In practice, that scale shifts with the model, the dataset, the preprocessing choices, and, occasionally, your luck.
From scores to distributions #
Once you compute similarities across a set of normal examples, those numbers have shape: a center, a spread, a range of what the system considers unremarkable.
Instead of comparing a value to an arbitrary cutoff, you compare a new score against that shape, and z-scores are how you make that comparison.
What a z-score measures #
The formula is familiar:
z = (s - μ) / σ
Where:
- s is the similarity score
- μ is the mean similarity of normal transactions
- σ is the standard deviation of those similarities
Standard deviation is effectively the “typical distance” from the mean. Dividing by it rescales everything so that instead of asking “is 0.66 low?”, you can ask “how many typical deviations from normal is this?” — a question you can answer without knowing anything about the underlying model.
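In code, the conversion is a one-liner. A minimal sketch (the zScore name is just for illustration, not something from the series code):

// zScore reports how many "typical deviations" a similarity sits from the
// calibration mean. Negative values are below normal.
func zScore(sim float32, mean, stddev float64) float64 {
	return (float64(sim) - mean) / stddev
}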
Wiring it into the pipeline #
The pipeline barely changes.
- Compute embeddings
- Build the normal centroid
- Compute cosine similarities
- Collect similarities for normal samples
- Compute mean and standard deviation
- Convert similarities to z-scores
- Flag based on a z-score threshold
One thing to be aware of at step 4: the normal similarities are computed against a centroid built from those same normal examples, so each point is partly scoring itself. This is in-sample calibration, a teaching shortcut that keeps the code simple. In a production setting you would calibrate on a held-out reference set to avoid inflating the similarity distribution. Here it does not matter much because the anomaly is far enough outside the normal cluster that the leakage does not change the conclusion, but the z-score scale is somewhat optimistic as a result.
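For contrast, held-out calibration might look roughly like this. The split ratio and the Centroid / CosineSimilarities helpers are illustrative stand-ins, not functions from the series code:

// Illustrative only: fit the statistics on one slice of normal vectors and
// score a separate slice against them, so no point calibrates on itself.
calib := normalVecs[:len(normalVecs)*8/10]   // e.g. 80% builds the reference centroid
holdout := normalVecs[len(normalVecs)*8/10:] // 20% held out for mean/stddev

centroid := Centroid(calib)                           // hypothetical helper
holdoutSims := CosineSimilarities(holdout, centroid)  // hypothetical helper
mean, stddev := math.CalculateMeanStdDev(holdoutSims)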
The detection loop ends up looking like this:
normalSims := make([]float32, 0, len(normalVecs))
for i, tx := range transactions {
	if tx.Category == "Normal" {
		normalSims = append(normalSims, sims[i])
	}
}
mean, stddev := math.CalculateMeanStdDev(normalSims)

// Teaching note: normalSims are scored against a centroid built from those same
// normals — each point is partly scoring itself. This inflates similarity and
// compresses variance, making z-scores look more tightly distributed than they
// would on held-out data. In production you would calibrate on a separate
// holdout set, or use leave-one-out scoring for each normal point.
// For this tutorial the effect is small relative to how extreme the anomaly is,
// but it is worth knowing the scale is optimistic.
for i, tx := range transactions {
	z := (float64(sims[i]) - mean) / stddev
	flag := "✓ normal"
	if z < zscoreThreshold {
		flag = "✗ FLAGGED"
	}
	fmt.Printf("%-14s (%-16s) sim: %.4f z: %6.2f %s\n",
		tx.Label, tx.Category, sims[i], z, flag)
}
The code is straightforward; what changes is what those numbers mean once they leave the loop.
One edge case to watch for: if your normal examples are all very similar to each other, the standard deviation can be near zero, and the division will blow up or produce meaningless extreme values. This is common with small, homogeneous toy datasets. In production you would guard against it: return an error, fall back to a fixed threshold, or reject the calibration set as too uniform. For this tutorial the dataset is varied enough that it doesn’t arise, but if you see z-scores in the thousands, a near-zero stddev is usually why.
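A minimal guard might look like this; the function name and the epsilon value are illustrative choices, not part of the series code:

// validateCalibration rejects a calibration set whose spread is too small to
// produce meaningful z-scores. The 1e-6 cutoff is an example, not a recommendation.
func validateCalibration(stddev float64) error {
	const minStdDev = 1e-6
	if stddev < minStdDev {
		return fmt.Errorf("calibration set too uniform: stddev %.2g below %.2g", stddev, minStdDev)
	}
	return nil
}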
Computing mean and standard deviation #
The implementation uses a two-pass approach:
func CalculateMeanStdDev(sims []float32) (mean, stddev float64) {
	if len(sims) == 0 {
		return 0, 0
	}
	n := float64(len(sims))
	for _, s := range sims {
		mean += float64(s)
	}
	mean /= n
	for _, s := range sims {
		diff := float64(s) - mean
		stddev += diff * diff
	}
	stddev = math.Sqrt(stddev / n) // population stddev — treats the calibration set as the full reference
	return mean, stddev
}
This uses population standard deviation (dividing by n, not n-1), which treats the normal reference set as the complete definition of “normal.” That is a reasonable stance when the calibration set is your entire reference population, but it is worth naming: with only a handful of examples, population stddev will be slightly smaller than sample stddev, which pushes z-scores a little farther from zero than they should be. The practical effect is small on extreme outliers, but it can matter for borderline cases. If your calibration set is small, treat your threshold as approximate and expect to tune it.
There are ways to compute this in a single pass. They are slightly faster, though naive one-pass formulas come with a catch: they tend to be numerically fragile. (Welford’s online algorithm avoids this, but adds complexity that is not warranted here.)
The fragility problem shows up with naive approaches when your similarities are tightly clustered, which is exactly the case here. Standard deviation is small, so you end up subtracting numbers that are very close together. With floating-point math, that can lose precision and skew the result just enough to matter. Not enough to crash anything, but enough to move borderline cases across your threshold.
The two-pass version avoids that by computing the mean first, then measuring deviations from it directly. It is a bit more work, but it is stable and easy to reason about.
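For reference, Welford’s update looks roughly like this; the tutorial sticks with the two-pass version, and the function name here is made up for the sketch:

// Single-pass, numerically stable mean/stddev via Welford's online algorithm.
// m2 accumulates the sum of squared deviations without ever subtracting two
// large, nearly equal numbers.
func welfordMeanStdDev(sims []float32) (mean, stddev float64) {
	if len(sims) == 0 {
		return 0, 0
	}
	var count, m2 float64
	for _, s := range sims {
		x := float64(s)
		count++
		delta := x - mean
		mean += delta / count
		m2 += delta * (x - mean) // second factor uses the updated mean
	}
	return mean, math.Sqrt(m2 / count) // population stddev, matching the two-pass version
}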
Also note the type conversion. Inputs are float32, outputs are float64. When your standard deviation is small, that extra precision helps avoid turning “slightly unusual” into “apparently catastrophic.”
A quick sanity check #
Imagine these are cosine similarities of normal transactions to the centroid:
0.80, 0.84, 0.86, 0.90
You get:
- Mean ≈ 0.85
- Standard deviation ≈ 0.036
Now take a candidate with similarity 0.70:
z ≈ (0.70 - 0.85) / 0.036 ≈ -4.2
That is more than four standard deviations below normal, exactly where intuition says it should be.
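If you want to check the arithmetic, the numbers drop straight into the function from earlier (a throwaway snippet; the printed values are approximate):

sims := []float32{0.80, 0.84, 0.86, 0.90}
mean, stddev := CalculateMeanStdDev(sims)
z := (0.70 - mean) / stddev
fmt.Printf("mean=%.2f stddev=%.3f z=%.1f\n", mean, stddev, z)
// prints roughly: mean=0.85 stddev=0.036 z=-4.2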
Choosing a threshold #
The default threshold is -2.0. That is a heuristic for “two standard deviations below the mean of your normal reference set”; a useful starting point, but not a statistical significance test. Cosine similarity is bounded and often skewed, so a z-score of -2 does not carry the same meaning it does in a symmetric Gaussian distribution. What it does tell you is how far below the normal band a point sits, relative to the spread you observed in calibration.
You can tune it:
- -3.0: stricter, catches only extreme outliers
- -2.0: reasonable starting point
- -1.5: more sensitive, more false positives
The right value depends on your calibration data, your tolerance for false positives, and how much the similarity distribution resembles a normal curve. Treat -2.0 as a reasonable first guess, not a principled answer.
Unlike a fixed threshold, -2.0 has meaning relative to your own data: you are deciding how far below the normal band is acceptable, not just picking a raw similarity value.
Robustness and limits #
Z-scores normalize the scale, which can reduce how often you need to retune thresholds. Swap embedding models and the raw values shift, but if the outlier is comparably extreme in the new distribution, the z-score will reflect that — and your threshold may not need to move. The same goes for gradual dataset drift: recalibrate on a fresh reference set and the scale updates automatically.
That only holds when the shape of the distribution stays roughly stable. Different models can produce different skew or heavier tails. Moving domains can change how normal and anomalous points separate, not just where they land on the scale. In those cases you need to rebuild the calibration baseline and re-evaluate the threshold. Z-scores reduce tuning effort; they do not eliminate it.
There are a few other limits worth naming. Small calibration sets make the statistics noisy: the threshold you land on should be treated as approximate, not authoritative. Non-Gaussian distributions can make z-score interpretation unreliable. And if your normal data contains two distinct behaviors, the standard deviation widens; values that should look suspicious can fall inside that band and slip through. This implementation assumes a single coherent cluster of normal transactions. That assumption is doing more work than it looks like.
What this changes #
From the sample run:
Anomaly sim: 0.6593 z: -16.03
A similarity of 0.6593 is hard to act on without knowing the distribution. Sixteen standard deviations below normal is not: you don’t need to know anything about the model to recognise that something is wrong.
The code change is modest. The more useful shift is in how you think about the numbers: once similarity is a position within a distribution rather than a standalone value, the same framing extends naturally to ranking results, evaluating retrieval quality, or comparing outputs across models. Z-scores are self-contextualising in a way that raw similarity scores are not.
Which is a modest improvement over guessing whether 0.80 felt right at the time.
Parts 1–5 worked with transaction and product data. Part 6 shifts domain to source code… where the same embedding approach runs into a problem that structured data doesn’t have.