Achieving 10,000x training data reduction with high-fidelity labels

by badmonster on 8/7/2025, 9:11 PM with 28 comments

by ericyd on 8/7/2025, 11:18 PM

> in production traffic only very few (<1%) ads are actually clickbait

That's a fascinating claim, and it does not align with my anecdotal experience using the web for many years.

by abhgh on 8/8/2025, 6:08 AM

Active Learning is a very tricky area to get right ... over the years I have had mixed luck with it for text classification, to the point that my colleague and I decided to perform a thorough empirical study [1] that normalized the various experiment settings individual papers had reported. We observed that, post-normalization, randomly picking instances to label is better! (A rough sketch of the two selection strategies follows below.)

[1] https://aclanthology.org/2024.emnlp-main.1240/
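
To make the comparison concrete, here's a minimal sketch of pool-based uncertainty sampling next to a random-selection baseline. The model, synthetic data, rounds, and batch size are purely illustrative, not the setup from the study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def label_loop(X_pool, y_pool, X_lab, y_lab, strategy, rounds=10, batch=20):
    """Pool-based selection loop; strategy is 'uncertainty' or 'random'."""
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        if strategy == "uncertainty":
            # Query the examples the model is least sure about (prob near 0.5).
            probs = clf.predict_proba(X_pool)[:, 1]
            idx = np.argsort(np.abs(probs - 0.5))[:batch]
        else:
            idx = rng.choice(len(X_pool), size=batch, replace=False)
        # "Labeling" here just reveals the held-back pool labels.
        X_lab = np.vstack([X_lab, X_pool[idx]])
        y_lab = np.concatenate([y_lab, y_pool[idx]])
        X_pool = np.delete(X_pool, idx, axis=0)
        y_pool = np.delete(y_pool, idx)
    return clf

X, y = make_classification(n_samples=2000, random_state=0)
clf_unc = label_loop(X[50:], y[50:], X[:50], y[:50], "uncertainty")
clf_rnd = label_loop(X[50:], y[50:], X[:50], y[:50], "random")
```

The study's point is that once you hold everything else in this loop fixed, the `"random"` branch can match or beat the `"uncertainty"` branch.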

by unixhero on 8/8/2025, 6:06 PM

Why were high-fidelity labels not used from the start?

by scribu on 8/8/2025, 10:22 AM

I’m confused by the clustering step:

> To find the most informative examples, we separately cluster examples labeled clickbait and examples labeled benign, which yields some overlapping clusters

How can you get overlapping clusters if the two sets of labelled examples are disjoint?
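
My best guess at a charitable reading: the two labeled sets are disjoint as sets of examples, but their clusters can still overlap as regions of embedding space. A sketch of that interpretation (the cluster count, k-means, and the sphere-overlap test are all my assumptions, not from the post):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_spheres(embeddings, k=10):
    """Cluster one class's embeddings; summarize each cluster as a sphere."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    spheres = []
    for c in range(k):
        members = embeddings[km.labels_ == c]
        centroid = km.cluster_centers_[c]
        radius = np.linalg.norm(members - centroid, axis=1).max()
        spheres.append((centroid, radius))
    return spheres

def overlapping_pairs(spheres_a, spheres_b):
    """Two spheres overlap when centroid distance < sum of radii."""
    return [(i, j)
            for i, (ca, ra) in enumerate(spheres_a)
            for j, (cb, rb) in enumerate(spheres_b)
            if np.linalg.norm(ca - cb) < ra + rb]

# clickbait_emb and benign_emb would be disjoint point sets, yet their
# clusters can still intersect as regions of the embedding space:
# pairs = overlapping_pairs(cluster_spheres(clickbait_emb),
#                           cluster_spheres(benign_emb))
```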

by patresh on 8/8/2025, 11:08 AM

What is the clustering performed on? Is another embedding model used to produce the embeddings or do they come from the LLM?

Typically, LLMs don't produce embeddings that are usable for clustering or retrieval; embedding models trained with contrastive learning are used instead. But there seems to be no mention of any models other than LLMs.

I'm also curious about what type of clustering is used here.
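
For reference, the usual pattern would look something like this: embed with a contrastively trained sentence-embedding model, then cluster. Both the specific checkpoint and k-means here are guesses on my part, not anything the post confirms:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# A contrastively trained embedding model; the post names no model,
# so this particular checkpoint is an assumption for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

ads = [
    "You won't BELIEVE what happens next...",
    "Spring sale: 20% off running shoes",
]
embeddings = model.encode(ads, normalize_embeddings=True)

# k-means is one common choice; the post doesn't say which algorithm it uses.
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
```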

by ghm2180 on 8/8/2025, 12:10 PM

Is it just me, or is the display of hyperspheres deliberately meant to obfuscate some kind of trade secret, namely how they select the examples to send to a human?

The obfuscation being the use of a support vector machine, which is the go-to for selecting support vectors and ignoring outliers, with distance defined between embedding vectors.

I could be wrong; they could be using something different for clustering, or something fancier like a variant of DBSCAN.
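
If it really is an SVM, that would amount to margin sampling: fit a classifier on the current labels and send the unlabeled examples nearest the decision boundary to humans. A sketch of that guess, not the post's confirmed method:

```python
import numpy as np
from sklearn.svm import SVC

def select_near_boundary(X_labeled, y_labeled, X_unlabeled, budget=100):
    """Return indices of the unlabeled embeddings closest to the SVM boundary."""
    clf = SVC(kernel="rbf").fit(X_labeled, y_labeled)
    # decision_function gives a signed distance to the separating surface;
    # a small magnitude means the model is least certain there, which is
    # exactly where a human label is most informative.
    margins = np.abs(clf.decision_function(X_unlabeled))
    return np.argsort(margins)[:budget]
```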

by trhway on 8/8/2025, 3:58 AM

Reminds me of how one of the winners of Andrew Ng's 2021 Data-Centric AI competition analyzed embedding separation to choose training data: https://rensdimmendaal.com/posts/data-centric-ai
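
Roughly, the idea there (my paraphrase of the linked post, not an exact reproduction) is to score each example by how close its embedding sits to the opposite class and prioritize the least-separated ones:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def separation_scores(embeddings, labels):
    """Distance from each example to its nearest opposite-class neighbor.

    Small scores mark examples that sit near the class boundary; sorting
    ascending surfaces the least-separated (most ambiguous) training data.
    """
    dists = pairwise_distances(embeddings)
    return np.array([dists[i, labels != labels[i]].min()
                     for i in range(len(labels))])
```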