Ask HN: How do AI companies collect data to train models?

by piotrkeon on 3/5/2024, 6:33 AM with 1 comment

From the Internet, obviously, but how? Do they crawl every website out there, working from lists of IPs or domain names? Do they piggyback on Google? Or is there some all-Internet data store where you can just download the latest 'Internet data' dump?

by richardjam73 on 3/5/2024, 2:44 PM

They use datasets like Common Crawl.
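
To make the "download a dump" idea concrete, here is a rough Python sketch of how one might look up a site in Common Crawl's public URL index and pull a single captured page out of the archive. It is a sketch under assumptions, not a description of what any AI company actually runs: the crawl ID `CC-MAIN-2024-10` is only an example, the helper names are made up for illustration, and the sketch assumes the index returns one JSON object per line with `filename`, `offset`, and `length` fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public Common Crawl endpoints (index lookups and archive downloads).
INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"


def build_index_query(url_pattern, crawl_id="CC-MAIN-2024-10"):
    """Build an index query URL. The default crawl ID is an example only."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_lines(text):
    """Parse the index response: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def byte_range(record):
    """HTTP Range header value covering exactly one record in a WARC file."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # Range ends are inclusive
    return f"bytes={start}-{end}"


def fetch_warc_record(record):
    """Download one gzip-compressed WARC record (makes a network request)."""
    request = Request(
        f"{DATA_HOST}/{record['filename']}",
        headers={"Range": byte_range(record)},
    )
    with urlopen(request) as response:
        return response.read()
```

In practice you would fetch `build_index_query("example.com/*")`, pass the response body to `parse_index_lines`, and call `fetch_warc_record` on each hit. For bulk training data people usually skip the index and download whole WARC or extracted-text files instead.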