This is a sponsored blog post by scrapestack. All reviews and opinions expressed here are, however, based on my personal experience.
I'm a regular reader of the scraping hub subreddit. Just last week someone posted about getting product status from Amazon without getting blocked. Questions like this are fairly common there and on other subreddits that discuss web scraping.
The slowdown
In my experience, it's very rare for companies to block you outright. You are their customer, and blocking is final: they want you to use their products, and if they block you, you can't pay them.
A much more effective way of deterring web scraping is to slow you down with CAPTCHAs or rate limiting. If you get timed out for a minute, your scraping takes a serious hit, and if a CAPTCHA pops up, you're pretty much out of luck. The website keeps you as a customer while still slowing down the scraping.
scrapestack is an extremely simple, affordable tool that helps with the slowdown and the risk of getting blocked. This is literally all you need to get started:
http://api.scrapestack.com/scrape?access_key=YOUR_ACCESS_KEY&url=https://apple.com
Bam. Done. It handles proxies, IP addresses, and rotating geolocations. It has a very high success rate and, most impressive of all, it's FAST. Way faster than I would have thought possible using proxies. I did a post earlier this year where I rolled my own proxying, and the slowdown there was a lot more significant than what I experienced with scrapestack.
I tested scrapestack with both axios and puppeteer. It performed amazingly well with axios; using it with puppeteer, while still pretty good, was a bit more complicated. I think the end goal with something like scrapestack is to use it with something like axios exclusively. If you need to do some page manipulation, however, puppeteer is still the way to go.
scrapestack + axios
Axios and scrapestack were made to be friends forever. I tested against both Amazon and Google.
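For reference, my test harness looked roughly like this. It's a minimal sketch, assuming a Node environment with axios installed; the `scrapeUrl` helper and the `YOUR_ACCESS_KEY` placeholder are my own conveniences, not part of scrapestack's API.

```javascript
// Build the scrapestack request URL for a target page.
// YOUR_ACCESS_KEY is a placeholder for your real key.
function scrapeUrl(target, accessKey = 'YOUR_ACCESS_KEY') {
    const params = new URLSearchParams({ access_key: accessKey, url: target });
    return `http://api.scrapestack.com/scrape?${params}`;
}

// Hit the same page N times in a row and time each request.
async function timeRequests(target, times = 10) {
    const axios = require('axios'); // required lazily; assumes axios is installed
    for (let i = 0; i < times; i++) {
        const start = Date.now();
        await axios.get(scrapeUrl(target));
        console.log(`request ${i + 1}: ${Date.now() - start}ms`);
    }
}

// timeRequests('https://www.amazon.com/dp/B00170EB1Q');
```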
When testing with Amazon, I hit the same product page 10 times in a row, pretty much as fast as I could: https://www.amazon.com/dp/B00170EB1Q. Here are the results:
As you can see, the speed difference is very small. That was the most impressive part to me. I'm classifying an error here as getting blocked, CAPTCHA'd, or detected in some way. In my axios attempts against Amazon, I didn't have any problems.
When testing with Google, the results were still crazy fast, but Google is a lot stricter about robot checking, and it showed.
There were a couple of times when Google detected the automation, and that is where the errors came in. In those instances, the request hit Google's reCAPTCHA screen:
The nice thing about the reCAPTCHA screen is that it displays the IP address, so I could easily confirm that IP rotation was happening on every request. Since the same IP wasn't being reused, the errors suggest that some of the IPs in the rotation are already known (to Google) as bad. Because the blocking wasn't consistent, in this case I'd just add a check for the CAPTCHA and retry the request whenever it appears.
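That check-and-retry could look something like this. A minimal sketch: the `fetchPage` argument is any function returning the page HTML (an axios call in practice), and scanning the body for reCAPTCHA markers is my own heuristic, not anything scrapestack provides.

```javascript
// Returns true if the HTML looks like Google's reCAPTCHA interstitial.
// This substring check is a simple heuristic, not an official detection method.
function looksLikeCaptcha(html) {
    return /recaptcha|unusual traffic/i.test(html);
}

// Retry a request until it comes back without a CAPTCHA, up to maxAttempts.
// Because scrapestack rotates IPs, a retry usually lands on a clean address.
async function fetchWithRetry(fetchPage, maxAttempts = 3) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const html = await fetchPage();
        if (!looksLikeCaptcha(html)) return html;
        console.log(`attempt ${attempt} hit a CAPTCHA, retrying...`);
    }
    throw new Error(`still blocked after ${maxAttempts} attempts`);
}
```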
Note: it's worth mentioning that I saw something similar when doing my own proxying. A lot of the IP addresses I proxied through were blocked on my very first request, which leads me to believe there is a list of IP addresses that are already blacklisted.
scrapestack + puppeteer
Using scrapestack and puppeteer together had a few more issues. I seemed to get flagged by Google more often, though that could be down to my smaller sample size.
The other interesting thing: when I hit Google through scrapestack, it served up a different, more lightweight version of Google, and as a result it was a LOT faster. I think this was less about scrapestack itself and more about that lightweight version being what Google serves to the IP addresses or geolocations scrapestack uses.
The speed test here isn't really apples to apples because of the different versions of Google being served, but that is still pretty quick for a proxy; barely distinguishable from what I'd expect with basic puppeteer usage. It should also be noted that I ran these Google tests concurrently, which is always a bit tricky with puppeteer, since performance depends more on the machine running it than on the speed of the HTTP requests alone.
For the Amazon test I ran the requests synchronously, so the timings wouldn't suffer from RAM or processor load caused by Chrome opening 10 windows at the exact same time. As a result, the overall time for both is a lot slower, but I feel it's a more accurate comparison between scrapestack and normal puppeteer usage.
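A sketch of that synchronous loop is below. It assumes puppeteer is installed, and it routes traffic through scrapestack by simply pointing `page.goto` at the scrape endpoint with the target in the `url` parameter; that's one straightforward way to combine the two, not an official integration.

```javascript
// Wrap a target URL in the scrapestack endpoint so puppeteer loads it
// through the proxy. YOUR_ACCESS_KEY is a placeholder.
function proxiedUrl(target, accessKey = 'YOUR_ACCESS_KEY') {
    return 'http://api.scrapestack.com/scrape?' +
        new URLSearchParams({ access_key: accessKey, url: target });
}

// Visit the same page N times, one after another, so Chrome's resource
// usage doesn't skew the timings (no 10 windows opening at once).
async function timeSequential(target, times = 10) {
    const puppeteer = require('puppeteer'); // lazy require; assumes puppeteer is installed
    const browser = await puppeteer.launch();
    try {
        for (let i = 0; i < times; i++) {
            const page = await browser.newPage();
            const start = Date.now();
            await page.goto(proxiedUrl(target), { waitUntil: 'domcontentloaded' });
            console.log(`visit ${i + 1}: ${Date.now() - start}ms`);
            await page.close();
        }
    } finally {
        await browser.close();
    }
}
```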
With this, the speed difference is more noticeable when using the proxy. Still, it honestly isn't much slower than other proxy services I've used.
One final note on puppeteer: click navigation didn't work well when proxying. I think it's because the links use routes relative to the original domain, so whenever I clicked a link, it broke. The better approach is to collect the URLs you want to visit and navigate to them directly rather than clicking through.
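In practice that means resolving each collected href against the real domain before visiting it. A sketch: `new URL`'s second argument does the relative-to-absolute work, and the `a.product-link` selector is just an illustrative assumption.

```javascript
// Resolve a (possibly relative) href against the original site's domain.
// Relative links break when the page is served from the proxy's domain.
function toAbsolute(href, baseDomain) {
    return new URL(href, baseDomain).href;
}

// Collect hrefs from a puppeteer page, then visit each absolute URL
// directly instead of clicking through.
async function visitLinks(page, baseDomain) {
    const hrefs = await page.$$eval('a.product-link', links =>
        links.map(a => a.getAttribute('href')));
    for (const href of hrefs) {
        await page.goto(toAbsolute(href, baseDomain));
        // ...scrape the page here...
    }
}
```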
Honestly, I’m super impressed. I realize this is a sponsored post but the numbers speak for themselves. The speed is barely impacted when using scrapestack and their pricing is crazy affordable. They even have a free plan that allows you to make 10,000 requests per month!
10/10, will scrape with again. I highly recommend scrapestack!