Why are we accepting silent data corruption in Vector Search? (x86 vs. ARM)

by varshith17 on 12/23/2025, 4:57 PM with 10 comments

I spent the last week chasing a "ghost" in a RAG pipeline and I think I’ve found something that the industry is collectively ignoring.

We assume that if we generate an embedding and store it, the "memory" is stable. But I found that f32 distance calculations (the backbone of FAISS, Chroma, etc.) act as a "Forking Path."

If you run the exact same insertion sequence on an x86 server (AVX-512) and an ARM MacBook (NEON), the memory states diverge at the bit level. It’s not just "floating point noise"; it’s a deterministic drift caused by differences in FMA (fused multiply-add) instruction behavior.
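The core mechanism is easy to reproduce on one machine: an FMA computes round(a*b + c) with a single rounding, while separate instructions compute round(round(a*b) + c). A minimal sketch (the constants are chosen to expose the single-rounding difference, not taken from the original pipeline):

```rust
fn main() {
    let (a, b, c) = (0.1f32, 10.0f32, -1.0f32);

    // Two roundings: the product a*b is rounded to f32 first (here it
    // rounds to exactly 1.0), so the final sum collapses to 0.0.
    let separate = a * b + c;

    // One rounding: mul_add keeps the exact product before adding c,
    // recovering the representation error of 0.1f32 (2^-26).
    let fused = a.mul_add(b, c);

    println!("separate = {separate}"); // separate == 0.0
    println!("fused    = {fused}");    // fused == 2^-26, not 0.0
    println!("bits differ: {}", separate.to_bits() != fused.to_bits());
}
```

Whether a compiler or SIMD kernel emits the fused or the unfused form depends on the target and the build flags, which is exactly how two hosts running "the same" code can write different bits.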

I wrote a script to inspect the raw bits of a sentence-transformers vector across my M3 Max and a Xeon instance. Semantic similarity was 0.9999, but the raw storage differed at the bit level.
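The original inspection script isn't shown; a minimal sketch of the idea in Rust, hashing the raw IEEE-754 bit patterns so two hosts can compare storage exactly rather than semantically (the `bit_fingerprint` name and the placeholder vectors are mine, not from the post):

```rust
// Fingerprint the raw bits of an embedding: any single-bit divergence
// between hosts changes the hash, even when cosine similarity is ~1.0.
fn bit_fingerprint(v: &[f32]) -> u64 {
    // FNV-1a over the little-endian bytes of each f32's bit pattern.
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for x in v {
        for byte in x.to_bits().to_le_bytes() {
            h ^= byte as u64;
            h = h.wrapping_mul(0x1_0000_0001_b3);
        }
    }
    h
}

fn main() {
    let a = [0.12f32, -0.98, 0.33];
    let b = [0.12f32, -0.98, 0.33 + f32::EPSILON]; // a few ulps of drift
    println!("{:#018x}", bit_fingerprint(&a));
    println!("{:#018x}", bit_fingerprint(&b)); // differs despite near-identical cosine
}
```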

For a regulated AI agent (Finance/Healthcare), this is a nightmare. It means your audit trail is technically hallucinating depending on which server processed the query. You cannot have "Write Once, Run Anywhere" index portability.

The Fix (Going no_std)

I got so frustrated that I bypassed the standard libraries and wrote a custom kernel (Valori) in Rust using Q16.16 fixed-point arithmetic. By relying strictly on integer associativity, I got 100% bit-identical snapshots across x86, ARM, and WASM.
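To make the determinism argument concrete, here is a minimal Q16.16 sketch (illustrative only, not the actual Valori kernel; `to_q16` and `dot_q16` are names I made up): values are `i32` with 16 fractional bits, products are exact in `i64`, and since integer addition is associative, the accumulation order cannot change the result.

```rust
const FRAC_BITS: u32 = 16;

// Convert an f32 into Q16.16 (one rounding, at ingest time only).
fn to_q16(x: f32) -> i32 {
    (x * (1 << FRAC_BITS) as f32).round() as i32
}

// Fixed-point dot product. Each i32*i32 product fits exactly in i64,
// so the sum is bit-identical on every platform regardless of SIMD
// width or summation order. The final shift drops the extra 16
// fractional bits (arithmetic shift: rounds toward -infinity).
fn dot_q16(a: &[i32], b: &[i32]) -> i64 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| x as i64 * y as i64)
        .sum::<i64>()
        >> FRAC_BITS
}

fn main() {
    let a: Vec<i32> = [0.5f32, -0.25, 0.125].iter().map(|&x| to_q16(x)).collect();
    let b: Vec<i32> = [0.25f32, 0.5, -1.0].iter().map(|&x| to_q16(x)).collect();
    println!("{}", dot_q16(&a, &b)); // -8192 in Q16.16, i.e. -0.125
}
```

The design trade-off: the only rounding happens once at quantization, so any two machines that ingest the same f32 inputs hold the same integer state forever after.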

Recall Loss: Negligible (99.8% Recall@10 vs standard f32).

Performance: < 500µs latency (comparable to unoptimized f32).

The Ask / Paper

I’ve written a formal preprint analyzing this "Forking Path" problem and the Q16.16 proofs. I am currently trying to submit it to arXiv (Distributed Computing / cs.DC) but I'm stuck in the endorsement queue.

If you want to tear apart my Rust code: https://github.com/varshith-Git/Valori-Kernel

If you are an arXiv endorser for cs.DC (or cs.DB) and want to see the draft, I’d love to send it to you.

Am I the only one worried about building "reliable" agents on such shaky numerical foundations?

by realitydrift on 12/30/2025, 2:35 PM

This reads more like a semantic fidelity problem at the infrastructure layer. We’ve normalized drift because embeddings feel fuzzy, but the moment they’re persisted and reused, they become part of system state, and silent divergence across hardware breaks auditability and coordination. Locking down determinism where we still can feels like a prerequisite for anything beyond toy agents, especially once decisions need to be replayed, verified, or agreed upon.

by codingdave on 12/24/2025, 1:56 PM

> We assume that if we generate an embedding and store it, the "memory" is stable.

Why do you assume that? In my experience, the "memory" is never stable. You seem to have higher expectations of reliability than would be reasonable.

If you have proven that unreliability, that proof is actually interesting. But it seems less like a bug and more like an observation of how things work.

by varshith17 on 12/23/2025, 4:58 PM

GitHub repo: https://github.com/varshith-Git/Valori-Kernel

by chrisjj on 12/24/2025, 11:02 AM

> Am I the only one worried about building "reliable" agents on such shaky numerical foundations?

You might be the only one expecting a reliable "AI" agent period.