Today is going to be a lot of thoughts comparing cloud web scraping to local web scraping. I haven’t reached any conclusions yet, which is why this is part 1. It is also focused on showing the process I am going through to reach my (hopefully inevitable, and soon to come) conclusions.
Websites that try to prevent robots or automated access are most often trying to protect themselves, usually from a security point of view, so that they don’t get hacked or abused in some other way. This is why they place a captcha on a login page or some other access point.
A lot of the services that are employed to protect against these attacks supposedly detect things like speed of requests, user agents, and suspicious IP addresses. I’ve recently had some experience with two different sites where I was never prompted for a captcha when scraping from my local, residential IP address but was prompted for a captcha 100% of the time when web scraping from the cloud. This was with the exact same code.
The fact that the same code works almost 100% of the time from my local computer and residential IP address, and almost 0% of the time from the cloud, tells me that the trigger isn’t the user agent, the speed of requests, or the user actions. All of those are identical in both environments.
All of the above points to the IP address. The target website sees a cloud IP address and reacts differently. There is a really cool tool at https://tools.keycdn.com/ that shows the provider of an IP address. You’ll often see something like “Digital Ocean” or “GOOGLE”. This is an easy way to tell if an IP address is on the cloud.
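Here is a rough sketch of how that check could be automated once you have the provider name in hand. It assumes you have already copied the provider string from a lookup tool; the function name and the keyword list are my own placeholders, and only "Digital Ocean" and "GOOGLE" come from what I actually saw. A keyword match like this is crude and will misclassify some providers (a company can run both cloud and consumer networks), so treat it as a first-pass filter only.

```python
# Hypothetical keyword list: extend it with the providers you actually see.
CLOUD_PROVIDER_KEYWORDS = ("digital ocean", "digitalocean", "google", "amazon")


def looks_like_cloud_provider(provider: str) -> bool:
    """Return True if the provider name contains a known cloud keyword.

    `provider` is the organization string reported by an IP lookup tool.
    The match is case-insensitive and substring-based, so it is only a hint.
    """
    normalized = provider.strip().lower()
    return any(keyword in normalized for keyword in CLOUD_PROVIDER_KEYWORDS)


if __name__ == "__main__":
    # "Example Home ISP" is a made-up residential provider name.
    for name in ("Digital Ocean", "GOOGLE", "Example Home ISP"):
        print(name, "->", looks_like_cloud_provider(name))
```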
So, I tried a rotating proxy. And guess what? It didn’t help. Not even a little bit. Running the script from the cloud with a proxy behaved exactly the same way as running it with no proxy. Rotating the proxy from my home? Never got the captcha. Rotating the proxy from the cloud? Captcha 100% of the time. What gives?
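To keep the comparison honest, it helps to record every run and tally the captcha rate per combination of origin and proxy. The sketch below is one way to do that. The record format and the function name are assumptions of mine, and the sample records are made-up placeholders that only mirror the pattern described above. They are not my real logs.

```python
from collections import defaultdict


def captcha_rate_by_condition(runs):
    """Compute the captcha rate for each (origin, used_proxy) combination.

    `runs` is an iterable of (origin, used_proxy, got_captcha) tuples,
    e.g. ("cloud", True, True) for a proxied cloud run that hit a captcha.
    """
    totals = defaultdict(int)
    captchas = defaultdict(int)
    for origin, used_proxy, got_captcha in runs:
        key = (origin, used_proxy)
        totals[key] += 1
        if got_captcha:
            captchas[key] += 1
    return {key: captchas[key] / totals[key] for key in totals}


# Placeholder records for illustration only (not measured data).
sample_runs = [
    ("home", False, False),
    ("home", True, False),
    ("cloud", False, True),
    ("cloud", True, True),
]

if __name__ == "__main__":
    for condition, rate in sorted(captcha_rate_by_condition(sample_runs).items()):
        print(condition, rate)
```

If the proxy were really hiding where the request comes from, the rate for proxied cloud runs should drop toward the home rate. In my case it didn’t.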
Being the webserver
The next thing I tried was creating an endpoint and analysing everything from the webserver access logs to the request itself as it arrived at the server. I did this in a previous post here.
What I found in that post is the same thing I found out this week. There is not much discernible difference when scraping with Puppeteer from the cloud vs locally. If we proxy the IP address to a residential location and spoof the user agent, the requests look identical.
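A comparison like this is easier to trust when it is done mechanically instead of by eye. The sketch below assumes you have captured the request headers from a local run and a cloud run as plain dictionaries (how you capture them is up to your endpoint). The function name and the sample header values are hypothetical.

```python
def diff_headers(local, cloud):
    """Return headers that differ between two captured requests.

    Header names are compared case-insensitively. The result maps each
    differing header name to a (local_value, cloud_value) tuple, with None
    standing in for a header that is missing on one side.
    """
    local_norm = {name.lower(): value for name, value in local.items()}
    cloud_norm = {name.lower(): value for name, value in cloud.items()}
    differences = {}
    for name in sorted(set(local_norm) | set(cloud_norm)):
        local_value = local_norm.get(name)
        cloud_value = cloud_norm.get(name)
        if local_value != cloud_value:
            differences[name] = (local_value, cloud_value)
    return differences


if __name__ == "__main__":
    # Placeholder header sets for illustration only.
    local_headers = {"User-Agent": "ExampleAgent/1.0", "Accept": "*/*"}
    cloud_headers = {"user-agent": "ExampleAgent/1.0", "Accept": "*/*", "Via": "1.1 example-proxy"}
    print(diff_headers(local_headers, cloud_headers))
```

An empty result means the two requests are indistinguishable at the header level, which matches what I saw. Whatever is tipping off the target site is probably not in the headers.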
My quest is going to continue into next week. There is something tipping off the target site that the request is coming from a cloud webserver. I can hit a site 100 times from my home address and maybe hit a captcha once. Then I can scrape it 5 times from the cloud and hit a captcha every time. There is something different there.
The results in this post aren’t great, I know. Not much to show yet. Don’t worry though, work is in progress.