A visual exploration of vector embeddings

by pamelafox on 5/28/2025, 8:21 PM with 39 comments

by godelski on 5/29/2025, 5:51 PM

The more I've studied this stuff, the less useful I actually think the visualizations are. Pamela uses the classic approach, and I'm not trying to call her wrong, but I think our intuitions really fail us outside 2D and 3D.

Once you move up in dimensionality, things get really messy really fast. Distances concentrate (the spread of pairwise distances shrinks relative to their average), so the meaning of distance becomes much more fuzzy: you can't differentiate your nearest neighbor from your furthest. Angles get much harder too; in high dimensions, almost everything is nearly orthogonal to almost everything else. So I'm not all that surprised that "god" and "dog" come out similar. I EXPECT them to be. After all, they are the reverse of one another. The question is rather "similar in which direction?"

There's no reason to believe you've measured along a direction that is human-meaningful. It doesn't have to be semantics, and it doesn't have to be letter permutations either, just as in 2D you can rotate your x-y axes and flip their directions and still describe the same points.

So these things can really trick us. At the very least, be very careful not to become overly reliant on them.
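
To make that concrete, here's a quick numpy toy (my own illustration, nothing to do with the article's code): random unit vectors in a few dimensionalities, showing how the average |cosine| falls toward 0 while the nearest and furthest neighbors become almost indistinguishable by distance.

  import numpy as np

  rng = np.random.default_rng(0)
  for d in (2, 3, 1536):  # 1536 is text-embedding-ada-002's dimensionality
      X = rng.standard_normal((1000, d))
      X /= np.linalg.norm(X, axis=1, keepdims=True)   # 1000 random unit vectors
      cos = X[1:] @ X[0]                              # cosine similarity to one reference vector
      dist = np.linalg.norm(X[1:] - X[0], axis=1)     # Euclidean distance to it
      # mean |cos| heads toward 0 ("everything is orthogonal"),
      # nearest/furthest ratio heads toward 1 (distance stops discriminating)
      print(d, np.abs(cos).mean().round(3), (dist.min() / dist.max()).round(3))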

by tanelpoder on 5/30/2025, 1:06 AM

I took a completely different path to visualizing embedding vectors’ physical layout [1] - mainly to explain how the data structures, data volumes, and comparison work radically differ from your regular B-tree index searches. I made sure to mention that you can’t draw any conclusions just from eyeballing these vector heatmaps, but the database people I’ve demoed this to seem to have reached some a-ha moments about how radically different a vector search is from the usual database index lookup work:

[1] https://tanelpoder.com/posts/comparing-vectors-of-the-same-r...

by pamelafox on 5/29/2025, 7:46 PM

I forgot that I also put together this little website, if you want to compare vectors for word2vec versus text-embedding-ada-002: https://pamelafox.github.io/vectors-comparison/

(I never added text-embedding-3 to it)

by antirez on 5/29/2025, 6:14 PM

Here I tried to use a 2D visualization, which may be more immediate:

https://antirez.com/news/150

by persedes on 5/30/2025, 12:22 AM

It would be interesting to see how, e.g., sentence-transformer models compare to this. My takeaway with the OpenAI embedding models was that they were better suited for larger chunks of text, so getting god + dog with a higher similarity might indicate that they're not a good model for such short texts?

  from sentence_transformers import SentenceTransformer
  from sklearn.metrics.pairwise import cosine_similarity

  emb = SentenceTransformer("all-MiniLM-L6-v2")
  embeddings = emb.encode(["dog", "god"])
  cosine_similarity(embeddings)  # pairwise cosine similarities: "dog" vs "god" comes out ~0.41
  Out[16]: 
  array([[1.        , 0.41313702],
         [0.41313702, 1.0000004 ]], dtype=float32)

by kaycebasques on 5/29/2025, 11:07 PM

The post title reminds me of something that I researched a little a couple months back. Practically all embeddings are implemented as vectors, right? Definitionally, an embedding doesn't have to be a vector. But in practice there's not really any such thing as a non-vector embedding, is there?

One thing I learned recently is that, if your embedding model supports task types (clustering, STS, retrieval, etc.), then that can have a non-trivial impact on the generated embedding for a given text: https://technicalwriting.dev/ml/embeddings/tasks/index.html
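
For instance, with Gemini's embedding API it looks roughly like this (a rough sketch from my memory of the google-generativeai client; the model name, task types, and example text are my assumptions, see the linked post for the real details):

  import google.generativeai as genai

  genai.configure(api_key="...")  # assumes an API key
  text = "How do I rotate a page in a PDF?"
  # Same text, two different task types -> noticeably different vectors
  q = genai.embed_content(model="models/text-embedding-004",
                          content=text, task_type="retrieval_query")
  d = genai.embed_content(model="models/text-embedding-004",
                          content=text, task_type="retrieval_document")
  print(q["embedding"][:5])
  print(d["embedding"][:5])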

Parquet and Polars sound very promising for reducing embeddings storage requirements. Still haven't tinkered with them: https://minimaxir.com/2025/02/embeddings-parquet/
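
From skimming that post, the basic pattern looks like it would be something along these lines (an untested sketch of my own, not code from the post):

  import numpy as np
  import polars as pl

  texts = ["dog", "god", "cat"]
  vectors = np.random.rand(3, 1536).astype(np.float32)  # stand-ins for real embeddings

  # Each embedding is stored as a float list column; Parquet handles that natively
  df = pl.DataFrame({"text": texts, "embedding": vectors.tolist()})
  df.write_parquet("embeddings.parquet")

  # Round-trip back into a numpy matrix for similarity math
  back = pl.read_parquet("embeddings.parquet")
  matrix = np.array(back["embedding"].to_list(), dtype=np.float32)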

And this post made me a lot more careful about how exactly I'm comparing embeddings. OP's post seems to do a good job explaining common techniques, too. https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity/

by minimaxir on 5/29/2025, 4:29 PM

Since this was oriented toward a Python audience, it might also have been useful to demonstrate on the poster how, in Python, you can create the embeddings (e.g. using requests or the OpenAI client to hit OpenAI's embeddings API) and calculate the similarities (e.g. using numpy), since most won't read the linked notebooks. Mostly as a good excuse to show off Python's rarely used @ operator for dot products in numpy.
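
Something like this sketch, for instance (the model name and client usage are my own assumptions, not taken from the poster):

  from openai import OpenAI
  import numpy as np

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
  resp = client.embeddings.create(model="text-embedding-ada-002",
                                  input=["dog", "god"])
  a, b = (np.array(e.embedding) for e in resp.data)
  # OpenAI embeddings come back unit-length, so the dot product
  # (numpy's @ operator) already is the cosine similarity
  print(a @ b)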

As a tangent, what root data source are you using to calculate the movie embeddings?

by isjustintime on 5/29/2025, 3:59 PM

I love the visual approaches used to explain these concepts. Words and math hurt my brain, but when accompanied by charts and diagrams, my brain hurts much less.

by podgietaru on 5/29/2025, 9:52 PM

I wrote the traditional blog post about this and used it to create an RSS aggregator website using AWS Bedrock.

https://aws.amazon.com/blogs/machine-learning/use-language-e...

The website is unfortunately down now, since I no longer work at Amazon, but the code is still readily available if you want to run it yourself.

https://github.com/aws-samples/rss-aggregator-using-cohere-e...

by galaxyLogic on 5/29/2025, 11:36 PM

> The text-embedding-ada-002 model accepts up to 8192 "tokens", where a "token" is the unit of measurement for the model (typically corresponding to a word or syllable),

So the "input" is up to 8192 "units of measurement". What would that mean in practice? How are the units of measurement produced? Can they be anything?
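
From what I can tell, those units come from the model's tokenizer, a fixed subword vocabulary. A quick way to poke at them (assuming OpenAI's tiktoken library is the right tool here):

  import tiktoken

  enc = tiktoken.encoding_for_model("text-embedding-ada-002")
  tokens = enc.encode("A visual exploration of vector embeddings")
  print(len(tokens), tokens)                # token count and the integer ids the model sees
  print([enc.decode([t]) for t in tokens])  # the subword pieces those ids stand for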

by ConteMascetti71 on 5/30/2025, 9:51 AM

I've done some experimentation with vector arithmetic. The results in the image domain are very interesting. https://github.com/vagrillo/CLIPSemanticImageArythmetics
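
The repo works on images; here's just a rough, generic sketch of the same kind of arithmetic on CLIP text embeddings via Hugging Face transformers (not the repo's code, and the model and analogy are only illustrative):

  import torch
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  inputs = processor(text=["king", "man", "woman", "queen"],
                     return_tensors="pt", padding=True)
  with torch.no_grad():
      feats = model.get_text_features(**inputs)
  feats = feats / feats.norm(dim=-1, keepdim=True)

  # king - man + woman should land closest to queen among these four
  combo = feats[0] - feats[1] + feats[2]
  print(torch.cosine_similarity(combo.unsqueeze(0), feats, dim=-1))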

by jamesk_au on 5/30/2025, 12:20 AM

Does anyone have any insight (or informed guesses) that might explain the strange downward "spike" that was consistently observed at dimension 196 in OpenAI's text-embedding-ada-002 model?