I made an app to fuzzy-deduplicate my Google Sheets and CRM records
- No manual configuration required
- Works out of the box on most data types (e.g. people, companies, product catalogs)
Implementation details:
- Embeds records using an E5 model
- Performs similarity search using DuckDB w/ vector similarity extension
- Does last-mile comparison and merges duplicates using Claude
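For anyone who wants the shape of that pipeline without reading the repo, here is a rough stdlib-only sketch of the three steps. It is not the actual dedupe_it code, and every piece is a stand-in I'm assuming for illustration: `embed` uses character trigram counts instead of an E5 model, `candidate_pairs` brute-forces cosine similarity instead of querying DuckDB, and `is_duplicate` is a hypothetical string rule where the app would ask Claude. The function names and the 0.5 threshold are made up too.

```python
import math
import re
from collections import Counter
from itertools import combinations


def embed(text: str, n: int = 3) -> Counter:
    """Toy stand-in for an E5 embedding: counts of character n-grams."""
    s = f" {text.lower().strip()} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def candidate_pairs(records, threshold=0.5):
    """Brute-force stand-in for the vector similarity search step."""
    vecs = [embed(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if cosine(vecs[i], vecs[j]) >= threshold
    ]


def is_duplicate(a: str, b: str) -> bool:
    """Hypothetical placeholder for the last-mile LLM comparison:
    same number of tokens, and each token is a prefix of its counterpart."""
    ta = re.findall(r"[a-z0-9]+", a.lower())
    tb = re.findall(r"[a-z0-9]+", b.lower())
    return len(ta) == len(tb) and all(
        x.startswith(y) or y.startswith(x) for x, y in zip(ta, tb)
    )


def dedupe(records, threshold=0.5, judge=is_duplicate):
    """Group records that the judge confirms as duplicates (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Only pairs that pass the cheap similarity filter reach the judge.
    for i, j in candidate_pairs(records, threshold):
        if judge(records[i], records[j]):
            parent[find(j)] = find(i)

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


if __name__ == "__main__":
    rows = ["Acme Corp", "ACME Corp.", "Globex LLC", "Acme Corporation", "Initech"]
    for group in dedupe(rows):
        print(group)
```

The point of the two-stage design is that the cheap similarity filter keeps the number of expensive last-mile comparisons small. The sketch stops at grouping; merging each group into a single record is left out.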
Demo video: https://youtu.be/7mZ0kdwXBwM
GitHub repo (Apache 2.0 licensed): https://github.com/SnowPilotOrg/dedupe_it
Lmk any feedback on how to make this better!
Curious how this scales. Just tried this with the test dataset and it was probably the slickest deduplication experience I've had.