Jordan Mass Scrapes Amazon for Potential Products (Part 2 of 2)

Demo code here

This is the second and final part of mass scraping Amazon. The first part is here. In this part, I want to talk more about the actual scraping of both the results and details pages. We used a lot of this code in my original post on scraping one individual category, but it’s a bit more refactored here.

I also want to dig more into the complications I ran into scraping Amazon from a Linux web server compared to Windows. Finally, I want to talk a bit more about the possible consequences of scraping something like Amazon, where it is against their Terms of Service.

It’s all turned out to be pretty technical, so feel free to ask questions in the comments or reach out to me directly on Twitter or by email. If you’d rather not deal with the technical bits, I set up an emailer that sends out a weekly email with 10-30 random products that I’ve scraped using this code, and it’s 100% free. Find more details here or just sign up for the list.

Get FREE lists of potential products!

As of writing this post, I have more than 15,000 products scraped and the list is growing. A lot of them are pretty solid leads. A few are total garbage. Sorry about that. If you hate waiting and just want the whole list right now, you can get more details here.

For now, though, let’s get into the code. In the last post we mentioned a function called scrapeResults, which is called like below from our src/massSearch.ts file. I tried to make this a pretty pure function, so we pass in the category we want to search for along with our other arguments for the kinds of products we want to find.

Starting off, it’s really important to note that this was a bit tricky to set up on a Linux web server. Puppeteer doesn’t work out of the box, even with our Ubuntu-specific script, and these additional dependencies had to be installed:

sudo apt-get install libx11-xcb1 libxcomposite1 libxdamage1 libxi6 libxext6 libxtst6 libnss3 libcups2 libxss1 libxrandr2 libasound2 libpangocairo-1.0-0 libatk1.0-0 libatk-bridge2.0-0 libgtk-3-0

A nice guy named Michael brought this to my attention and I’m grateful to him for it.

const products = await scrapeResults(category, numberOfPagesToSearch, wantSoldByAmazon, minimumAllowedNumberOfVendors, minimumPrice);

That function starts by setting up the Puppeteer browser (we’ll talk about this later) and then initializing a Puppeteer page. Our baseUrl, as I went over in my first post about scraping Amazon, is just the direct route to the results page. We pass in the term we want to search for as the field-keywords query parameter and it brings us straight to the results. We also initialize potentialProducts as an empty array. This is where we are going to dump the products we find and then return them.

export async function scrapeResults(searchParam: string, numberOfPagesToSearch: number, wantSoldByAmazon: boolean, minimumAllowedNumberOfVendors: number, minimumPrice: number) {
    try {
        let browser: Browser = await setUpBrowser();      
        let page = await setUpNewPage(browser);

        const baseUrl = 'https://www.amazon.com/s?field-keywords=';
        const url = baseUrl + searchParam;

        const potentialProducts: any = [];

This next part is where we start getting into the loop. Because I am actively using this in production, I want to keep track of errors and handle them accordingly, so we also start a categoryError count. This was one of the first things I did to start handling the cases where Amazon blocked my IP address. I’ll come back to this in just a moment.

        let categoryError = 0;
        for (let i = 1; i < numberOfPagesToSearch + 1; i++) {  
            await page.goto(`${url}&page=${i}`);
            let productsOnPage = await page.$$('.s-result-item');
            const resultsCol = await getPropertyBySelector(page, '#resultsCol', 'innerHTML');

We do three main things in this chunk of code besides starting the loop and initializing the error count. We go to the url we built earlier from the baseUrl and the category. Then we grab all the listings on the page with let productsOnPage = await page.$$('.s-result-item');. This was all I was doing originally, and then I kept getting errors that there weren’t any products on the page. I knew that something like “Bird food” would have results, so I was guessing that I was being blocked by Amazon.
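As an aside, getPropertyBySelector in the snippet above is a small helper from the demo code that isn’t shown in this post. It just reads a single property off the first element matching a selector; a minimal sketch of that kind of helper (not necessarily the exact code in the repo) looks like this:

import { Page } from 'puppeteer';

// Minimal sketch of a getPropertyBySelector-style helper: find the first element
// matching the selector and read one property (e.g. 'innerHTML') off it.
// Returns null if the element isn't on the page.
async function getPropertyBySelector(page: Page, selector: string, property: string): Promise<any> {
    const element = await page.$(selector);
    if (!element) {
        return null;
    }
    const handle = await element.getProperty(property);
    return handle.jsonValue();
}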

const resultsCol = await getPropertyBySelector(page, '#resultsCol', 'innerHTML'); is what I used to check this. #resultsCol was the selector that held all the products, and if it was empty, I knew Amazon was serving me their robot check page. So I built this next part to detect that, back off my scraper so it wasn’t hitting Amazon as hard, and then continue.

            if (!resultsCol) {
                i--;
                categoryError++;
                if (categoryError > 4) {
                    return Promise.reject('Category not resolving. Skipping');
                }
                await page.close();
                await browser.close();
                await timeout(3000);
                browser = await setUpBrowser();
                page = await setUpNewPage(browser);
                continue;
            }

If there wasn’t a resultsCol, I would increment my error counter and reduce my index by one so we could try that page of the category again. If we got more than 4 errors, I’d exit this category altogether and try another by rejecting the promise with an error message. That rejection bubbles up to my src/massSearch.ts file, which uses our Discord webhook to notify me.
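The Discord side of that isn’t shown in this post, but notifying a channel from a webhook is just a POST with a content field. Something like this sketch would do it (the DISCORD_WEBHOOK_URL environment variable and axios are my assumptions here, not necessarily what the demo code uses):

import axios from 'axios';

// Sketch only: post an error message to a Discord webhook.
// DISCORD_WEBHOOK_URL is an assumed environment variable, not from the demo code.
async function notifyDiscord(message: string) {
    const webhookUrl = process.env.DISCORD_WEBHOOK_URL;
    if (!webhookUrl) {
        return;
    }
    // Discord webhooks accept a JSON body with a "content" field
    await axios.post(webhookUrl, { content: message });
}

// e.g. scrapeResults(...).catch(err => notifyDiscord(`Category failed: ${err}`));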

Before hitting that limit, I’d try closing the page and browser, waiting 3 seconds, and then trying again. The blocking was actually kind of bizarre. They would consistently block maybe 3 out of every 5 categories searched. I could run through 100 categories and they might block the first three but let the next two through just fine. They never unilaterally shut me down. Most bizarre of all, I was ONLY seeing this on my Linux web server and not on my local Windows machine.
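Quick aside before moving on: the timeout call in that retry block isn’t shown in this post. It doesn’t need to be anything more than a promisified setTimeout, something along these lines:

// Promisified setTimeout so we can `await timeout(ms)` between retries.
function timeout(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
}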

Back to the blocking: the short answer is that the Amazon blocking and error handling code above really didn’t make much difference. Backing off for three seconds didn’t suddenly make Amazon like me and let me search. So I went back to the drawing board, guessing there must be something different about the Chrome browser Puppeteer launches on Linux.

After a bit of trying different things, I found that setting up the Puppeteer browser differently on Ubuntu did the trick. Remember earlier when we talked about setting up the browser? That function is at the bottom of src/resultsPage.ts, and the additional launch options fixed the blocking problem almost universally.

async function setUpBrowser() {
    let browser: Browser;
    let ubuntu = false;
    let headless = false;

    // Check the command line arguments for the 'ubuntu' and 'headless' flags
    if (process.argv[2] === 'ubuntu' || process.argv[3] === 'ubuntu') {
        ubuntu = true;
    }
    if (process.argv[2] === 'headless' || process.argv[3] === 'headless') {
        headless = true;
    }

    if (ubuntu) {
        // On Ubuntu we always run headless and pass a desktop Chrome user agent.
        // These launch arguments are what stopped Amazon's robot check page.
        browser = await puppeteer.launch({ headless: true, ignoreHTTPSErrors: true, args: [
            '--no-sandbox',
            '--disable-setuid-sandbox',
            '--disable-infobars',
            '--window-position=0,0',
            '--ignore-certificate-errors',
            '--ignore-certificate-errors-spki-list',
            '--user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36"'
        ] });
    }
    else {
        // Locally (Windows) the defaults are fine; just give it a big window
        browser = await puppeteer.launch({ headless: headless, args: [`--window-size=${1800},${1200}`] });
    }

    return Promise.resolve(browser);
}

I think it’s pretty odd that they didn’t block us all the time without these additions. If Amazon checks the user agent, for example, why does it sometimes allow the Linux Puppeteer user agent through but not always? Kind of interesting and noteworthy.
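While we’re on browser setup, setUpNewPage isn’t shown in this post either. It doesn’t need to do much; a minimal sketch (my guess at its shape, assuming it just opens a tab and sets a viewport) would be:

import { Browser, Page } from 'puppeteer';

// Sketch of a setUpNewPage-style helper: open a tab and give it a reasonable viewport.
// The real version in the demo code may set more options than this.
async function setUpNewPage(browser: Browser): Promise<Page> {
    const page = await browser.newPage();
    await page.setViewport({ width: 1440, height: 900 });
    return page;
}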

The rest of this is pretty much identical to what we did in the first post, so I’m going to go through it quickly. We just check the name and price to make sure we were able to find them and that the price is greater than our minimum ($25, in my case), and then we proceed into scraping the details page, which checks who has the buy box and how many vendors there are. Then we take all of those details and push them into our array if they meet our criteria.
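To make that concrete, the shape of that filtering logic is roughly the sketch below. The selectors, the scrapeDetailsPage helper, and the field names are stand-ins of mine rather than the repo’s actual code, which lives in the demo code linked above:

// Sketch only: selectors, scrapeDetailsPage, and field names are placeholders
// to show the shape of the checks described above.
for (const productHandle of productsOnPage) {
    const name = await productHandle.$eval('h2', el => el.textContent).catch(() => null);
    const priceText = await productHandle.$eval('.a-offscreen', el => el.textContent).catch(() => null);
    const link = await productHandle.$eval('a', el => (el as HTMLAnchorElement).href).catch(() => null);
    const price = priceText ? parseFloat(priceText.replace(/[^0-9.]/g, '')) : NaN;

    // Make sure we actually found a name, a link, and a price that clears the minimum
    if (!name || !link || isNaN(price) || price < minimumPrice) {
        continue;
    }

    // Details page check: who has the buy box and how many vendors there are
    const details = await scrapeDetailsPage(browser, link);
    if (wantSoldByAmazon && !details.soldByAmazon) {
        continue;
    }
    if (details.numberOfVendors < minimumAllowedNumberOfVendors) {
        continue;
    }

    potentialProducts.push({ name, price, link, ...details });
}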

I had to make a decision here and I think I’m happy with it. It’s really the decision every single-player video gamer has to make: how often do I save? In this case, how often do I dump to my database? I elected to do it after every category search. If I’m going through 100 categories, I hit my database pretty often. The alternative is to risk the script breaking, which it still occasionally does, and losing everything we just scraped.
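In code, that save-often pattern in src/massSearch.ts boils down to something like this sketch (categories and insertProducts are stand-in names of mine, not necessarily what the repo uses):

// Sketch of the save-often pattern: write to the database after every category
// instead of once at the end. insertProducts is a hypothetical helper name.
for (const category of categories) {
    try {
        const products = await scrapeResults(category, numberOfPagesToSearch, wantSoldByAmazon, minimumAllowedNumberOfVendors, minimumPrice);
        if (products && products.length > 0) {
            // Dump this category's results right away so a later crash can't lose them
            await insertProducts(products);
        }
    }
    catch (error) {
        // A rejected category ('Category not resolving. Skipping') lands here
        console.log(`Error scraping ${category}:`, error);
    }
}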

Now, the consequences of scraping in this way. My conscience doesn’t feel too terrible. Maybe it should. Maybe I’m justifying. I don’t feel that I hit Amazon that hard, and I feel that if they really wanted to block this kind of thing, they could put additional measures in place to stop it. Justifying, sure. I also like to think that my purpose for scraping them is in line with Amazon’s purpose: selling more awesome products.

Besides guilt, I do feel a bit of fear. There is always the chance that somehow they find out who I am and permanently ban me. I don’t think this is likely, but it is a chance. I’m sure there are ways I could get around a ban with VPNs and IP address changes and account remakes, but that all sounds quite painful. Plus, my sweet, sweet Audible subscription. I’m not sure what would happen to those books.

I don’t recommend doing this from your home IP. At a minimum, use a VPN (I use TunnelBear from home because it’s so dang easy), and better yet, use a cloud server where you can easily rotate your IP address. I use DigitalOcean and REALLY like it. I got all my code installed and working, then made a snapshot. Whenever the IP got blocked (and while I was improving this script, I got blocked a LOT), I would just destroy the droplet and remake another from that same snapshot.

Anyway, if that’s all a risk you’d like to avoid or you don’t want to mess with all of this technical crap, then join the free email list below.

Get FREE lists of potential products!

Or if you don’t want to wait you can buy the whole list.

Demo code here
