In a similar vein, I have found success using request interception [1] for some websites where the HTML and API authentication scheme is unstable, but the API responses themselves are stable.
If you can drive the browser using simple operations like keyboard commands, you can get the underlying data reliably by listening for matching 'response' events and handling the data as it comes in.
[1] https://github.com/puppeteer/puppeteer/blob/main/docs/api.md...
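The basic listener pattern looks roughly like this (the /api/items filter, the example URL, and the payload handling are placeholders for whatever internal endpoint the target site actually calls):

    // Sketch of the 'response'-listener approach described above.
    const puppeteer = require("puppeteer");

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      // Collect matching API responses as the page triggers them.
      page.on("response", async (response) => {
        const type = response.headers()["content-type"] || "";
        if (!response.url().includes("/api/items") || !type.includes("application/json")) return;
        console.log("captured payload:", await response.json());
      });

      await page.goto("https://example.com/some-listing", { waitUntil: "networkidle2" });
      // Drive the page with simple inputs; the listener picks up the responses.
      await page.keyboard.press("PageDown");
      await new Promise((resolve) => setTimeout(resolve, 2000)); // let late responses land

      await browser.close();
    })();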
Awesome. I wonder if it would be possible to create a Chrome extension that works like the Vue devtools and shows the heap and its changes in real time, maybe even allowing editing. That would be amazing for learning / debugging.
> We use the --no-headless argument to boot a windowed Chrome instance (i.e. not headless) because Google can detect and thwart headless Chrome - but that's a story for another time.
Use `puppeteer-extra-plugin-stealth`(1) for such sites. It defeats a lot of bot identification, including reCAPTCHA v3.
(1) https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
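The wiring is just a couple of lines on top of a normal Puppeteer script, roughly:

    // Drop-in usage of puppeteer-extra with the stealth plugin; everything
    // else stays ordinary Puppeteer code.
    const puppeteer = require("puppeteer-extra");
    const StealthPlugin = require("puppeteer-extra-plugin-stealth");

    puppeteer.use(StealthPlugin());

    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto("https://example.com");
      // ...scrape as usual...
      await browser.close();
    })();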
That's an exceedingly clever idea, thanks for sharing it!
Please consider adding an actual license text file to your repo, since (a) I don't think GitHub's licensee looks inside package.json, and (b) I bet most of the "license" properties of package.json files are a "yeah, yeah, whatever" rather than an intentional choice: https://github.com/adriancooney/puppeteer-heap-snapshot/blob... I'm not saying that applies to you, but an explicit license file in the repo would make your wishes clearer.
If this catches on, web developers may start employing memory obfuscation techniques like game developers do.
https://technology.riotgames.com/news/riots-approach-anti-ch...
Love this approach, thanks for sharing!
I am trying this on a website that Puppeteer has trouble loading, so I took a heap snapshot directly in Chrome. I tried searching for the relevant objects in the Chrome heap viewer, but I don't think its search looks inside objects.
I think your tool would work ("puppeteer-heap-snapshot query -f /tmp/file.heapsnapshot -p property1"), or really any JSON parser, but that requires extra steps. Would you say this is the easiest way to view/debug a heap snapshot?
Wow this is brilliant. I've sometimes tried to reverse engineer APIs in the past, but this is definitely the next level.
I used to think ML models could be good for scraping too, but this seems better.
I think this + a network request interception tool (to get data that is embedded into HTML) could be the future.
The article brings up two interesting points for web preservation:
1. The reliance on externally hosted APIs
2. Source code obfuscation
For 1, in order to fully preserve a webpage, you'd have to go down the rabbit hole of externally hosted APIs and preserve those as well. For example, sometimes a webpage won't render LaTeX notation because a MathJax endpoint can't be reached. Were we to save this webpage, we would need a copy of the MathJax JS too.
For 2, I think WASM makes things more interesting. With WebAssembly, I'd imagine it's much easier to obfuscate source code: a preservationist would need a WASM decompiler for whatever source language was used.
This is great, thanks a lot.
It's my understanding that Playwright is the "new Puppeteer" (even with core devs migrating). I presume this sort of technique would be feasible in Playwright too? Do you think there's any advantage or disadvantage to using one over the other for this use case, or is it basically the same (or am I off base and they're not so interchangeable)?
I'm basing my personal "scraping toolbox" on Scrapy, which I think has decent Playwright integration, hence the question in case I try to reproduce this strategy in Playwright.
A neat idea for sure, I just wanted to point out that this is why I prefer XPath over CSS selectors.
We all know the display of the page and the structure of the page should be kept separate, so why would you base your selectors on display? Particularly if you're looking for something on a semantically designed page, why look for .article, a class that may disappear with the next redesign, when they're unlikely to stop using the <article> HTML tag?
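To make the contrast concrete, a throwaway sketch (the .article class, the .headline class, and the page structure are invented for illustration):

    // Two ways to grab the same headline. The CSS selector leans on styling
    // classes that a redesign can rename; the XPath leans on the semantic
    // <article>/<h1> structure, which tends to outlive redesigns.
    const viaCss = document.querySelector(".article .headline");

    const viaXPath = document.evaluate(
      "//article//h1",
      document,
      null,
      XPathResult.FIRST_ORDERED_NODE_TYPE,
      null
    ).singleNodeValue;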
> Developers no longer need to label their data with class-names or ids - it's only a courtesy to screen readers now.
In general, screen readers don't use class names or IDs. In principle they can, to enable site-specific workarounds for accessibility problems. But of course, that's as fragile as scraping. Perhaps you were thinking of semantic HTML tag names and ARIA roles.
Scraping is inherently fragile due to all the small changes that can happen to the data model as a website evolves. The important thing is to fix these things quickly. This article discusses a related approach of debugging such failures directly on the server: https://talktotheduck.dev/debugging-jsoup-java-code-in-produ...
It's in Java (using JSoup), but the approach will work for Node, Python, Kotlin, etc. The core concept is to discover the cause of the regression instantly on the server and deploy a fix fast. There are also user-specific regressions in scraping that are again very hard to debug.
This isn't future-proof at all. Game devs have been using automatic memory obfuscation forever. If this becomes popular, it will take no more than adding a webpack plugin to defeat, with no data structure changes required.
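For what it's worth, it doesn't even need a custom plugin; terser's property mangling (exposed via terser-webpack-plugin) already does this. A sketch with a deliberately conservative regex:

    // webpack.config.js (sketch). mangle.properties renames object keys that
    // match the regex, so the stable property names the heap technique greps
    // for stop existing in the shipped bundle. Mangling everything is easy to
    // get wrong (DOM/external API names must be reserved), which is probably
    // why few sites bother today.
    const TerserPlugin = require("terser-webpack-plugin");

    module.exports = {
      optimization: {
        minimizer: [
          new TerserPlugin({
            terserOptions: {
              mangle: {
                properties: { regex: /^_/ }, // widen with care
              },
            },
          }),
        ],
      },
    };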
Very interesting! I have a feeling this will break if people use the advanced mode of the Closure Compiler, which can rename object property names as part of its optimizations. Is this not something commonly done anymore?
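For reference, enabling that through the google-closure-compiler npm wrapper looks roughly like this (details from memory, so treat it as a sketch):

    // Sketch: ADVANCED_OPTIMIZATIONS renames object property names (among
    // other things), so heap objects no longer carry the original names.
    const ClosureCompiler = require("google-closure-compiler").compiler;

    new ClosureCompiler({
      js: "src/app.js",
      js_output_file: "dist/app.min.js",
      compilation_level: "ADVANCED_OPTIMIZATIONS",
    }).run((exitCode, stdout, stderr) => {
      if (exitCode !== 0) console.error(stderr);
    });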
Nice, this won't work anymore then.
Awesome experimentation! I'd be curious to see how you navigate the heap dump on some real websites.
I've used a similar technique on some web pages that get returned from the server with an intact Redux state object just sitting in a <script> tag. Instead of parsing the HTML, I just pull out the state object. Super.
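A rough sketch of that, assuming the conventional window.__PRELOADED_STATE__ global from the Redux SSR docs (plenty of sites use a different global or an inline JSON <script> tag instead):

    // Sketch: grab the server-rendered Redux store instead of parsing markup.
    // window.__PRELOADED_STATE__ is an assumption; check what the target site
    // actually exposes.
    const puppeteer = require("puppeteer");

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto("https://example.com/product/123", { waitUntil: "domcontentloaded" });

      const state = await page.evaluate(() => window.__PRELOADED_STATE__);
      console.log(state);

      await browser.close();
    })();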
This sadly does not help if the JS code is minified/obfuscated and the data is exchanged over some binary or binary-like protocol such as gRPC. Unfortunately this is increasingly common.
The only long-term way is to parse the visible text.
Is he scraping the heap because the data wasn't present in the HTML, or is he doing it because the API response present in the heap changes less often than the HTML does?
Seems easy to defeat by deleting objects after generating the HTML or DOM nodes? Although I suppose taking heap snapshots before the deletions would avoid that.
Depending on how exactly the page loads its data, it might be easier to use something like mitmproxy to observe the data flow and intercept it there.
Would this method work if the website obfuscated its HTML as per the usual techniques, but also rendered everything server side?
Does anyone know if a Chrome browser extension has access to heap snapshots?
Why doesn't the example chosen, YouTube, use something like Cloudflare's "anti-bot" protection or Google reCAPTCHA?
When I request a video page, I can see the JSON in the page without needing to examine a heap snapshot.
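A rough sketch of that route; the ytInitialData variable name and the markup around it are YouTube internals that can change without notice, so treat the regex as a guess (Node 18+ for global fetch, VIDEO_ID is a placeholder):

    // Sketch: pull the inline JSON out of a watch page without a browser.
    // Region/consent interstitials may get in the way.
    (async () => {
      const html = await (await fetch("https://www.youtube.com/watch?v=VIDEO_ID")).text();
      const match = html.match(/ytInitialData\s*=\s*(\{.+?\})\s*;\s*<\/script>/);
      if (match) {
        const data = JSON.parse(match[1]);
        console.log(Object.keys(data)); // shape of the object is not guaranteed
      }
    })();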
Really cool approach, great work
Very interesting. Can't wait to give it a shot.
I personally use a combination of XPath, basic math and regex, so this class/id obfuscation isn't a major deterrent. A couple of times I did find it a hassle to scrape data embedded in iframes, and I can see that heap snapshots treat iframes differently.
Also, if a website takes the extra steps to block web scrapers, identification of elements is never the main problem. It is always IP bans and other security measures.
After all that, I do look forward to using something like this and switching to a Node.js-based solution soon. But if you are doing web scraping at scale, reverse engineering should always be your first choice. Not only does it give you a faster solution, it is also more ethical (IMO) because you minimize your impact on the site's resources. Rendering the full website with all its resources is always my last choice.