Jordan Mass Scrapes Amazon for Potential Products (Part 1 of 2)

Demo code here

A few weeks ago I shared a post on Reddit that I wrote about scraping Amazon for potential products. It got some cool responses, and some of them got me thinking. I’ll touch on some of those in this post.

There were also quite a few people who messaged me asking if there was an easy way for non-coders to get access to this. I built a partial solution by extending the functionality of this code to mass search Amazon. It goes through a TON of different categories and keeps scraping until it dies, runs out of categories, or its IP address gets blocked by Amazon. That last one happened to me several times while building this.

So, it didn’t really turn out much more user friendly for non-coders. But in the meantime, I was able to curate a huge list of potential products. I also set up a mailer that sends out somewhere between 10 and 30 products a week. If anyone else wants access to this, it’s 100% free. Find more details here or just sign up for the list.

Get FREE lists of potential products!

As of writing this post I have more than 15,000 products scraped, and the list is growing. A lot of them are pretty solid leads. A few are total garbage. Sorry about that. If you hate waiting and just want the whole list right now, you can get more details here.

Okay, now into the details of how this code works and how you can get it to work for you.

The toughest parts of this endeavor were:

  • Dealing with Amazon stopping almost all of my requests once I was on a Linux box (but I was fine from home?)
  • The script just stopping periodically without any errors
  • Mulling over the consequences, ethical and otherwise, of mass scraping Amazon

Jumping off from the last post about this, I’ve refactored the code quite a bit to make the functions we had before reusable. There are two main parts of Amazon that we need to scrape: what I call the results page, which shows the 17 or so products per page after we search, and the details page, which has all the details of an individual product.

The script starts in our massSearch file. There are some things that aren’t strictly necessary in this script but that I feel really make it a lot better. For example, I use Discord to handle notifications when errors happen. If you don’t want this, you will need to go through and remove all references to the Discord webhooks. Look for lines that have hook or webhook.

I also have a Mongo database set up in this code. The reason for that is that it gets so many listings that it’s not realistic to just log out the list. You could put them all into a local CSV as an alternative. In that case, I would remove all references to dbHelpers and db and then pipe everything out into a CSV file; a quick sketch of what that could look like is below.
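Here is a rough sketch of that CSV route. writeProductsToCsv is just a hypothetical helper I’m making up for illustration; it isn’t in the repo, and the product fields are placeholders.

// A hypothetical replacement for the database inserts: append scraped products to a local CSV
import * as fs from 'fs';

// Placeholder shape; use whatever fields your scrape actually returns
interface ScrapedProduct {
    asin: string;
    title: string;
    price: number;
}

const writeProductsToCsv = (products: ScrapedProduct[], filePath: string) => {
    // Write a header row only if the file doesn't exist yet
    if (!fs.existsSync(filePath)) {
        fs.writeFileSync(filePath, 'asin,title,price\n');
    }
    // Escape any double quotes in titles so the CSV stays valid
    const rows = products
        .map(p => `${p.asin},"${p.title.replace(/"/g, '""')}",${p.price}`)
        .join('\n');
    fs.appendFileSync(filePath, rows + '\n');
};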

I should also say, as a disclaimer and as firmly stated by /u/formergamedev here, that scraping IS against Amazon’s TOS. While I don’t agree that using the MWS API is quick and easy, or really even useful, doing it this way risks getting your IP blocked and, if they tie it to your account, possibly a ban. I don’t know if they realistically ban people’s accounts for this, temporarily or permanently (I looked and couldn’t find anyone complaining about it), but it’s a risk.

(Image: my wife’s reaction when I asked how she would feel if we were banned from Amazon.)

I don’t recommend doing it from your home IP. At the minimum, use a VPN (I use TunnelBear from home because it’s so dang easy); even better is a cloud server where you can easily rotate your IP address. I use DigitalOcean and REALLY like it. I got all my code installed and working, then made a snapshot. Whenever the IP got blocked (and while I was improving this script, I got blocked a LOT), I would just destroy the droplet and create another from that same snapshot.

I don’t think they can track and ban my personal account. Coming through a cloud web server with a rotating IP address and no ties to my personal account makes me skeptical that they can find me. Puppeteer is also not super fast, so while it’s faster than a human doing it, it’s not pounding their servers too hard. I don’t, however, doubt Amazon’s considerable power, and if they really wanted to, they could probably track me down and ban me. Do this at your own risk.

Starting off, it’s really important to note that this was a bit tricky to set up on a Linux web server. Puppeteer doesn’t work out of the box, even with our Ubuntu-specific script. These additional dependencies had to be installed:

sudo apt-get install libx11-xcb1 libxcomposite1 libxdamage1 libxi6 libxext6 libxtst6 libnss3 libcups2 libxss1 libxrandr2 libasound2 libpangocairo-1.0-0 libatk1.0-0 libatk-bridge2.0-0 libgtk-3-0

A nice guy named Michael brought this to my attention and I’m grateful to him for it.

We do something similar to last time, laying out our specifications at the top: how many results pages to go through (numberOfPagesToSearch), whether we want products that are sold by Amazon (wantSoldByAmazon), that there are at least three vendors (minimumAllowedNumberOfVendors), and that the minimum price is at least $25 (minimumPrice).

// How many pages of results to go through
const numberOfPagesToSearch = 10;

// Do we want to compete on products that Amazon sells?
const wantSoldByAmazon = false;

// How many other vendors do we want there to be?
const minimumAllowedNumberOfVendors = 3;

// What is the minimum price we want?
const minimumPrice = 25;

const webHookName = 'Amazon Product Scraper';

I then start into my self-invoking async function. Puppeteer is all promise-based, so this lets us use async/await. Because it’s awesome.

(async () => {
  // awesome code here

})();

Check this for more explanation on async/await.
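If you haven’t used the pattern before, here’s a tiny standalone example (not code from the scraper itself) of awaiting Puppeteer calls inside that kind of function:

import * as puppeteer from 'puppeteer';

(async () => {
    // Every Puppeteer call returns a promise, so we can just await each step
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://www.amazon.com');
    console.log(await page.title());
    await browser.close();
})();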

I then initialize my database. I included a sample-config file; if you are using a Mongo database, rename it to config.ts and replace my fake values with real ones. As long as you keep the variable names the same, it should just work.

I also have my Discord webhook URL. Like I mentioned above, I use Discord to easily manage my notifications. That way I get notified if something breaks or my IP gets blocked while I’m away from my computer.

// src/sample-config.ts
export const config = {
    mongoUser: "<mongo-user-name>",
    mongoPass: "<mongo-password>",
    mongoUrl: "ip.address.goes.here:portHere",
    mongoDb: "<mongo-db-name>",
    mongoCollection: "<mongo-collection>",
    webhookUrl: '<discord-webhook-url>'
};

I have also included a huge list of product categories in the src/taxonomy.ts file. You are welcome to replace them with your own categories or add some of your own; as long as each one is a string, it’ll work fine. Because I’m not a monster, I limit each scraping run to 100 categories. The taxonomy file has something like 5,000 categories ranging from yachts to bird food, so they won’t all be wins. I thought about going through and removing the categories for huge things like yachts, but then I thought, “hey, maybe it’ll include some smaller yacht product that would work,” so I kept them.
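For reference, the taxonomy file is just an exported array of strings. A trimmed-down sketch of its shape (these example values are mine, not pulled from the actual file):

// src/taxonomy.ts (trimmed-down sketch; the real file has roughly 5,000 entries)
export const categories: string[] = [
    'Bird Food',
    'Yacht Covers',
    'Garden Hand Tools',
    // ...thousands more category strings
];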

const dbUrl = `mongodb://${config.mongoUser}:${config.mongoPass}@${config.mongoUrl}/${config.mongoDb}`;
const db = await dbHelpers.initializeMongo(dbUrl);
const hook = new Webhook.Webhook(config.webhookUrl);
const sampleCategories = getRandomFromArray(categories, 100);
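getRandomFromArray isn’t anything fancy. Here’s a minimal sketch of the kind of helper it could be (my rough version, not necessarily the exact implementation in the repo):

// Pick `count` random items from an array without modifying the original
const getRandomFromArray = (array: string[], count: number): string[] => {
    // A quick-and-dirty shuffle; not perfectly uniform, but fine for sampling categories
    const shuffled = [...array].sort(() => Math.random() - 0.5);
    return shuffled.slice(0, count);
};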

So, I take 100 random categories from the taxonomy and then start looping through them. Throughout all of this I tried to be pretty diligent with my exception handling, wrapping things in try/catch and notifying myself of all errors via the Discord webhook.

I want to finish off what happens in this file, and in the script generally, before we go too deep into some of the other functions. There are really two parts: scrapeResults, which does the work of going through the results pages and then the details page of each result on those pages, and then the part where we store what it finds.
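scrapeResults gets its own deep dive in part two. For now, all that matters is that it resolves to an array of product objects. The field names below are my placeholder guess at the shape, not the actual interface from the code:

// A rough placeholder for what scrapeResults resolves to; the real shape is covered in part two
interface Product {
    asin: string;             // Amazon's unique product id, used below to avoid duplicate inserts
    title: string;
    price: number;            // compared against minimumPrice
    soldByAmazon: boolean;    // compared against wantSoldByAmazon
    numberOfVendors: number;  // compared against minimumAllowedNumberOfVendors
    category: string;         // the taxonomy category it was found under
}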

// loop through the categories and then scrape the results
for (let index = 0; index < sampleCategories.length; index++) {
    const category = sampleCategories[index];
    try {
        const products = await scrapeResults(category, numberOfPagesToSearch, wantSoldByAmazon, minimumAllowedNumberOfVendors, minimumPrice);

Because I don’t want duplicates, I check my database to see if there is any other product with the same ASIN. If there is not, I proceed with the insert.

// Insert to database
// Check if it already exists
if (products.length > 0) {
    try {
        for (let i = 0; i < products.length; i++) {
            const product = products[i];
            try {
                const matches = await dbHelpers.getAllFromMongo(db, config.mongoCollection, { asin: product.asin });
                if (matches.length < 1) {
                    await dbHelpers.insertToMongo(db, config.mongoCollection, product);
                }
            }
            catch (e) {
                console.log(e);
                await hook.info(webHookName, `Unexpected error - ${e}`);
            }
        }
    }
    catch (e) {
        console.log('Unexpected error', e);
        await hook.info(webHookName, `Unexpected error - ${e}`);
    }
}
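The dbHelpers functions used above are thin wrappers around the official mongodb driver. The real file is in the repo; this is just a rough approximation of what they do, assuming these signatures:

// src/dbHelpers.ts (rough approximation, not the exact code from the repo)
import { MongoClient, Db } from 'mongodb';

// Connect and return the database named in the connection string
export async function initializeMongo(dbUrl: string): Promise<Db> {
    const client = await MongoClient.connect(dbUrl);
    return client.db();
}

// Find all documents in a collection matching a query
export async function getAllFromMongo(db: Db, collectionName: string, query: any = {}) {
    return db.collection(collectionName).find(query).toArray();
}

// Insert a single document into a collection
export async function insertToMongo(db: Db, collectionName: string, document: any) {
    return db.collection(collectionName).insertOne(document);
}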

Okay, that’s going to end part one of this series. Next week I’ll have part two lined out, where I go a bit deeper into the details and results pages. I’ll also spend some time talking about the web server implementation and some of the trouble I had using this in a production environment.

If this all sounds terrible and you don’t want to mess with it, remember that I have an email list where I send out these leads in weekly batches of 10 to 30. You can sign up here.

Get FREE lists of potential products!

And finally, if you want all the leads right now, they are available for purchase.

Demo code here
