From the Internet, obviously, but how? Are they crawling every website out there based on IPs or domain names? Do they piggyback on Google? Or is there an all-internet data store where you can just download the latest 'Internet data' dump?
They use publicly available datasets like Common Crawl, which publishes regular snapshots of a large slice of the web, essentially the 'Internet data dump' you're describing.
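To make that concrete, here's a minimal sketch of how you might look up pages in a Common Crawl snapshot via its public CDX index API. The crawl ID below (CC-MAIN-2024-10) is just one example snapshot; the current list lives at https://index.commoncrawl.org/.

```python
import json
import requests

# Example snapshot ID; newer crawls are listed at index.commoncrawl.org
CRAWL_ID = "CC-MAIN-2024-10"
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL_ID}-index"

# Ask the index which captures exist for a given URL pattern
resp = requests.get(
    INDEX_URL,
    params={"url": "example.com/*", "output": "json"},
    timeout=30,
)
resp.raise_for_status()

# Each response line is a JSON record pointing into a WARC archive
# (filename, byte offset, length) that you can range-request from
# Common Crawl's storage to retrieve the raw captured page.
for line in resp.text.strip().splitlines():
    record = json.loads(line)
    print(record["url"], record["filename"], record["offset"], record["length"])
```

That's only the lookup step: bulk training pipelines skip the index and download the WARC files directly, then do heavy filtering and deduplication on top.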