Jordan Scrapes Amazon Looking For Products To Sell

Update on February 2, 2019 for the mass searching of categories:

You can get guides on how to do this here (part 1) and here (part 2). The new mass-search script also adds the requirement of a config file. This config file doesn’t do anything for the script in this guide, but it is necessary for the mass searching and the project needs it in order to compile. If you just want to scrape a single category, you only need to rename sample-config.ts in the src directory to config.ts. After doing that, everything should work with an npm start.

Demo code here

I’m back to the Amazon well. I think that this is a growing area for any entrepreneur and I’ve done some selling here. Amazon has really made it simple for people to run side businesses from their home.

I’ve spent several years selling on Amazon, and one of the trickiest parts for me was finding a product to sell. This bit of code is a way to automate that search and make it a bit easier. I tried using Amazon’s Product Advertising API, but it felt cumbersome, and because I wasn’t an active affiliate selling anything, they were constantly killing my API keys.

This is an alternative to that and, in my opinion, superior. You search by keyword and what you see is what you expect to see. This code is aimed at people looking to wholesale products, meaning they want to find brands whose products they would like to sell on Amazon for them.

I personally searched for good products to wholesale based on three primary criteria: number of vendors (I preferred greater than three so it was more likely the brand wasn’t exclusive), buybox price (greater than $25), and whether it’s sold by Amazon (I don’t want to fight against Amazon. It’s their platform and they are allowed to cheat on it).

I start the code declaring these criteria, along with my search terms, so we can easily change them.

// Whatever search param you want to use
const searchParam = 'pet food';

// How many pages of results to go through
const numberOfPagesToSearch = 3;

// Do we want to compete on products that Amazon sells?
const wantSoldByAmazon = false;

// How many other vendors do we want there to be?
const minimumAllowedNumberOfVendors = 3;

// ...minimum price
const minimumPrice = 25;

numberOfPagesToSearch is probably the setting you’ll adjust most. When you do a search, Amazon returns a TON of pages of results. We pretty much always want more than just the first page, since a lot of the first-page results are sold by Amazon. How many pages of results you want is up to you. I started here with 3 but in practice would probably always use more.

I then start into my self-invoking async function. Puppeteer is entirely promise based, so this allows us to use async/await. Because it’s awesome.

(async () => {
  // awesome code here

})();

Check this for more explanation on async/await.

I also run a lot of my Puppeteer scrapers on a DigitalOcean (which I love) Ubuntu box, so I have a block at the top that makes Puppeteer work on Ubuntu. I have scripts set up in package.json that pass in arguments for running headless and/or on Ubuntu.

// package.json
...
"scripts": {
    "start": "tsc && node ./dist/index.js",
    "start:ubuntu": "tsc && node ./dist/index.js ubuntu",
    "start:headless": "tsc && node ./dist/index.js headless"
},
...

We just check whether process.argv[2] or process.argv[3] (the first two arguments after the script path) is one of the items we are looking for (maybe I want to run both options and I don’t want to care about the order?). There is probably a better way to pass arguments to the script, but that’s not really the focus of this code.

let ubuntu = false;
let headless = false;
if (process.argv[2] === 'ubuntu' || process.argv[3] === 'ubuntu') {
	ubuntu = true;
}
if (process.argv[2] === 'headless' || process.argv[3] === 'headless') {
	headless = true;
}
if (ubuntu) {
	browser = await puppeteer.launch({ headless: true, args: [`--window-size=${1800},${1200}`, '--no-sandbox', '--disable-setuid-sandbox'] });
}
else {
	browser = await puppeteer.launch({ headless: headless, args: [`--window-size=${1800},${1200}`] });
}
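One possible way to make the flag check independent of position is sketched below. This is only an illustration, not the code from the demo repository: parseRunFlags and RunFlags are hypothetical names, and the function takes the argument list as a parameter so it can be tried without launching anything.

```typescript
// Hypothetical helper (not in the original script): look for each flag
// anywhere in the argument list instead of checking fixed positions.
interface RunFlags {
	ubuntu: boolean;
	headless: boolean;
}

function parseRunFlags(args: string[]): RunFlags {
	return {
		ubuntu: args.indexOf('ubuntu') !== -1,
		headless: args.indexOf('headless') !== -1
	};
}

// In the scraper this would be called as parseRunFlags(process.argv.slice(2)),
// since the first two entries of process.argv are the node binary and the script path.
const exampleFlags = parseRunFlags(['headless', 'ubuntu']);
console.log(exampleFlags);
```

With something like this, the order of the arguments stops mattering, and adding a third flag doesn’t require checking another index.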

It’s worth noting that I am depending on a homemade npm package that has some useful Puppeteer helper functions. I found it helpful, but it’s definitely not required. You would just have to use Puppeteer’s own functions in place of the ones I use. Up at the top you can see the functions I’m importing – import { getPropertyBySelector, setUpNewPage } from 'puppeteer-helpers';

Back to the code. We set up a base URL for Amazon’s search page and concatenate the search param that we declared at the beginning of our code. One cool thing to always look for when scraping is a direct URL with query params. Instead of starting at Amazon.com’s homepage, typing in the search params, and clicking “Search”, we navigate directly to the results by passing the search we want in the query param. In this case https://www.amazon.com/s?field-keywords= does the trick.

const baseUrl = 'https://www.amazon.com/s?field-keywords=';
const url = baseUrl + searchParam;

const potentialProducts: any = [];

const page = await setUpNewPage(browser);

We initialize an array that we are going to fill later with the delicious products that we find. We also use one of those helper functions, setUpNewPage. It really just sets up a new page and stops downloading images to save bandwidth.

From here, we start into the loops. There are two: one goes through the pages of results, and the other goes through the products found on each page.

// Loop through the pages of results
for (let i = 1; i < numberOfPagesToSearch + 1; i++) {
    await page.goto(`${url}&page=${i}`);
	...
}

In our loop we start from 1 and increment until we reach our max number of pages. Because we started from 1 instead of the normal 0 we need to add an additional 1 to our numberOfPagesToSearch to make sure we search the correct amount.

Why did we start with 1 instead of 0 in our loop? Because, like we referenced above, Amazon has a query parameter for the page number! This is great because now we don’t need to find the “next” or “continue” or whatever button it is. When we finish going through our results, we can just navigate directly to the next page of results.
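Here is a small sketch of how the search URL and the page number fit together. It reuses the baseUrl from the post; buildSearchUrl and buildSearchUrls are hypothetical helper names, and the encodeURIComponent call is my addition so that a search term containing a space (like 'pet food') is escaped explicitly instead of relying on the browser to do it.

```typescript
const baseUrl = 'https://www.amazon.com/s?field-keywords=';

// Hypothetical helper: builds the results URL for one 1-based page number.
function buildSearchUrl(searchParam: string, pageNumber: number): string {
	// encodeURIComponent escapes spaces and other reserved characters in the keyword
	return `${baseUrl}${encodeURIComponent(searchParam)}&page=${pageNumber}`;
}

// Hypothetical helper: one URL per page, starting at page 1 like the loop in the post.
function buildSearchUrls(searchParam: string, numberOfPagesToSearch: number): string[] {
	const urls: string[] = [];
	for (let i = 1; i <= numberOfPagesToSearch; i++) {
		urls.push(buildSearchUrl(searchParam, i));
	}
	return urls;
}

console.log(buildSearchUrls('pet food', 3));
```

Writing the loop condition as i <= numberOfPagesToSearch is equivalent to the i < numberOfPagesToSearch + 1 used above; it is only a matter of taste.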

Back to the results. We exercise the strength of Puppeteer and pick out the products on the page by selecting .s-result-item. When we use $$ it returns an array of items and that is what we want. We throw them into productsOnPage and then we loop through that.

for (let productOnPage of productsOnPage) {
	const name = await getPropertyBySelector(productOnPage, 'img', 'alt');
	const price = await getPropertyBySelector(productOnPage, '.sx-price-whole', 'innerHTML');
	if (name && name !== '' && parseInt(price) > minimumPrice) {
		...
	}
	...
}

We use another of the helper functions here, getPropertyBySelector, which uses Puppeteer’s functions to take a Puppeteer ElementHandle, a selector, and the property we want, and then returns that property to us. We use the alt property of the image to get the product name and then just use the .sx-price-whole selector to get the price.

I want to take a second here to talk about scraping in general. It’s never a science. We are depending on other people’s servers, and on their code not changing in a way that breaks what we are scraping. This is an example. Sometimes…there isn’t a product name in the alt property. We could take the time to dig into when exactly Amazon is serving a product name in the alt property and code an alternative…or, we could just skip the ones that don’t have product names. I chose to skip them. My goal is to get a lot of products. I don’t need specific products in this case. If one is malformed or doesn’t match the common pattern, I felt it was worth just skipping it and scraping more pages. Hence our condition – if (name && name !== '' && parseInt(price) > minimumPrice). We check to make sure our price is greater than our minimum and we make sure that there is a product name. If not, we just go on to the next product.
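One thing to watch with parseInt on scraped prices: it stops at the first non-digit character, so a string like '1,299' parses as 1. The sketch below shows one way to guard against that. parseWholePrice and passesListingFilter are hypothetical names, and I’m assuming the scraped text is the whole-dollar part of the price, possibly with thousands separators or stray whitespace.

```typescript
// Hypothetical helper: turn scraped price text into a whole-dollar number, or null.
function parseWholePrice(raw: string | null | undefined): number | null {
	if (!raw) {
		return null;
	}
	// Keep only what comes before a decimal point, then drop commas, spaces, markup, etc.
	const digitsOnly = raw.split('.')[0].replace(/[^0-9]/g, '');
	if (digitsOnly === '') {
		return null;
	}
	return parseInt(digitsOnly, 10);
}

// Hypothetical helper mirroring the condition in the post:
// a non-empty name and a price above the minimum.
function passesListingFilter(
	name: string | null | undefined,
	rawPrice: string | null | undefined,
	minimumPrice: number
): boolean {
	const price = parseWholePrice(rawPrice);
	return !!name && price !== null && price > minimumPrice;
}

console.log(passesListingFilter('Dog food', '1,299', 25));
```

Listings with a missing name or an unparseable price are simply skipped, which matches the “skip it and scrape more pages” approach.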

I now separate out the next part as its own function, which handles everything on the details page. This function accepts the product ElementHandle, the Puppeteer browser, and our required conditions: minimumAllowedNumberOfVendors and wantSoldByAmazon. const detailsObject = await goToDetailsPage(productOnPage, browser, minimumAllowedNumberOfVendors, wantSoldByAmazon);

We’ll set up a new puppeteer page and then get the url of the details page. Then…we go there and start getting the stuff we want!

const page = await setUpNewPage(browser);
const url = await getPropertyBySelector(product, '.s-access-detail-page', 'href');
await page.goto(url);
let buyboxVendor = (await getPropertyBySelector(page, '#merchant-info a', 'innerHTML'));
if (buyboxVendor) {
	buyboxVendor = buyboxVendor.trim();
}
const brand = await getPropertyBySelector(page, '#bylineInfo', 'innerHTML');
let numberOfVendors = await getPropertyBySelector(page, '#olp_feature_div a', 'innerHTML');
if (numberOfVendors) {
	numberOfVendors = numberOfVendors.split('</b>')[1].trim();
}
if (numberOfVendors) {
	numberOfVendors = numberOfVendors.split(')')[0];
}
if (numberOfVendors) {
	numberOfVendors = numberOfVendors.split('(')[1].trim();
}

The things we really want to keep here are the url, the buybox vendor, the brand name, and the number of vendors. Both buybox vendor and number of vendors are pretty ugly. Amazon uses several different details pages, so we can’t depend on always getting what we want. It’s also formatted differently if it’s sold by Amazon or if it’s using ‘easy-to-open packaging’. Or more. Who knows. Amazon A/B tests like crazy and they mix things up.

Some examples:

Buybox vendor is Amazon, and this one says it’s exclusively for Prime members.
Buybox vendor is Petco. Not eligible for Prime.
Buybox vendor is Amazon.
Easy-to-open packaging isn’t the buybox vendor! It’s Amazon. Tricky.

These are just some of the varieties, and it’s really where web scraping can get fragile. We have to decide how many varieties we are okay handling. To handle the buybox vendor, I see if there are any anchor tags in #merchant-info, and then down below I do a couple of checks for some other items that may sneak through as vendors, such as ‘Details’ and ‘easy-to-open packaging’ – buyboxVendor && (buyboxVendor !== 'Details' && buyboxVendor !== 'easy-to-open packaging'). The two comparisons need to be joined with &&; joined with ||, the check is true for every vendor string.

The next tricky one is number of vendors. This one isn’t as bad in the number of formats it has; it just doesn’t have a CSS selector that takes us right to it. So we have to do some splitting. (Split is awesome in web scraping.)

The innerHTML looks something like this from #olp_feature_div a – “<b>New</b> (26) from $21.99”. So we split on </b> and take the second item from the split, which should leave us “(26) from $21.99”. Then we split on ) and take the first item, leaving us “(26”. Then we do one final split on ( and take the second item because we want everything after the (. Then we trim and we’re done.

I was also getting some instances where numberOfVendors would be null due to some malformed HTML, so I have some conditional checks to ensure we have something to split on.

let numberOfVendors = await getPropertyBySelector(page, '#olp_feature_div a', 'innerHTML');
if (numberOfVendors) {
	numberOfVendors = numberOfVendors.split('</b>')[1].trim();
}
if (numberOfVendors) {
	numberOfVendors = numberOfVendors.split(')')[0];
}
if (numberOfVendors) {
	numberOfVendors = numberOfVendors.split('(')[1].trim();
}
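As an alternative to the three splits above, a single regular expression can pull the number out of the parentheses. This is just a sketch, not what the demo code does; parseNumberOfVendors is a hypothetical name, and it assumes the count always appears as digits wrapped in parentheses, as in the “<b>New</b> (26) from $21.99” example.

```typescript
// Hypothetical helper: extract the vendor count from text like '<b>New</b> (26) from $21.99'.
// Returns null when the input is missing or has no '(digits)' group.
function parseNumberOfVendors(innerHtml: string | null | undefined): number | null {
	if (!innerHtml) {
		return null;
	}
	const match = /\((\d+)\)/.exec(innerHtml);
	return match ? parseInt(match[1], 10) : null;
}

console.log(parseNumberOfVendors('<b>New</b> (26) from $21.99'));
```

Because a failed match returns null instead of throwing, malformed HTML falls through the same way the null checks above intend.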

Finally…we check that we were able to get all the required fields that we want. If we weren’t, we resolve our Promise with no value.

// If it's sold by Amazon the data parsed from buyboxVendor will be 'Details', 'easy-to-open packaging', or null

// The two !== checks are joined with &&; with || the test would be true for every vendor string
if ((!wantSoldByAmazon && buyboxVendor && (buyboxVendor !== 'Details' && buyboxVendor !== 'easy-to-open packaging')) && parseInt(numberOfVendors) >= minimumAllowedNumberOfVendors) {

	const dataToReturn = { numberOfVendors: numberOfVendors, buyboxVendor: buyboxVendor, brand: brand, url: url };
	await page.close();

	return Promise.resolve(dataToReturn);
}
else {
	await page.close();
	return Promise.resolve();
}
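The condition above is easier to reason about when pulled out into a small pure function. The sketch below is one interpretation, not the demo code: looksSoldByAmazon and meetsCriteria are hypothetical names, and I’m assuming that when wantSoldByAmazon is true the vendor check should simply be skipped (the original condition rejects every product in that case).

```typescript
// Values the post says show up in place of a vendor name when Amazon holds the buybox.
const amazonPlaceholders = ['Details', 'easy-to-open packaging'];

// Hypothetical helper: a missing vendor or one of the placeholders is treated as Amazon.
function looksSoldByAmazon(buyboxVendor: string | null | undefined): boolean {
	return !buyboxVendor || amazonPlaceholders.indexOf(buyboxVendor) !== -1;
}

// Hypothetical helper: applies the vendor and vendor-count criteria.
function meetsCriteria(
	buyboxVendor: string | null | undefined,
	numberOfVendors: number | null,
	wantSoldByAmazon: boolean,
	minimumAllowedNumberOfVendors: number
): boolean {
	// Assumption: only filter on the buybox vendor when we want to avoid Amazon.
	if (!wantSoldByAmazon && looksSoldByAmazon(buyboxVendor)) {
		return false;
	}
	return numberOfVendors !== null && numberOfVendors >= minimumAllowedNumberOfVendors;
}

console.log(meetsCriteria('Petco', 26, false, 3));
```

Keeping the placeholder strings in one array also means a newly discovered variant only has to be added in one place.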

And that’s it. We’re pretty much done. Back in our main function we confirm that a detailsObject was returned and make sure that we haven’t already found this product, using an array filter – if (potentialProducts.filter(product => product.name === name).length === 0) – and then we push it into our array of potential products.

From here on out it’s up to you. You could store all these in a database or just look through them in the console. We wanted to get the brand name since that is who we need to call and ask if we can set up a wholesale account.
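The duplicate check can also be written with Array.prototype.some, which stops at the first match instead of building a filtered array. This is a sketch under assumptions: the PotentialProduct shape and the addIfNew name are hypothetical, and I’m assuming the product name is stored alongside the details returned from the details page.

```typescript
// Hypothetical shape for an entry in potentialProducts.
interface PotentialProduct {
	name: string;
	brand: string | null;
	buyboxVendor: string;
	numberOfVendors: string;
	url: string;
}

// Hypothetical helper: pushes the candidate only if no product with the same name exists.
// Returns true when the candidate was added.
function addIfNew(potentialProducts: PotentialProduct[], candidate: PotentialProduct): boolean {
	const alreadyFound = potentialProducts.some(product => product.name === candidate.name);
	if (!alreadyFound) {
		potentialProducts.push(candidate);
	}
	return !alreadyFound;
}

const foundProducts: PotentialProduct[] = [];
const sampleProduct: PotentialProduct = {
	name: 'Sample dog food',
	brand: 'Sample Brand',
	buyboxVendor: 'Petco',
	numberOfVendors: '26',
	url: 'https://www.amazon.com/'
};
addIfNew(foundProducts, sampleProduct);
addIfNew(foundProducts, sampleProduct);
console.log(foundProducts.length);
```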

Demo code here

2 thoughts on “Jordan Scrapes Amazon Looking For Products To Sell”

  1. Hello, I stumbled across this post and I was trying to get the code working to follow along your article. I cloned to a debian box in google cloud, but when I run ‘npm start’ I get this error:

    ```
    mmuser@test-node-01:~/amz_test$ npm start

    > jordan-scrapes-amazon-for-products@1.0.0 start /home/mmuser/amz_test
    > npm i && tsc && node ./dist/index.js

    audited 124 packages in 0.884s
    found 0 vulnerabilities

    src/massSearch.ts:5:24 - error TS2307: Cannot find module './config'.

    5 import { config } from './config';
    ~~~~~~~~~~

    src/massSearch.ts:25:30 - error TS2304: Cannot find name 'getRandom'.

    25 const sampleCategories = getRandom(categories, 100);
    ~~~~~~~~~

    Found 2 errors.

    npm ERR! code ELIFECYCLE
    npm ERR! errno 2
    npm ERR! jordan-scrapes-amazon-for-products@1.0.0 start: `npm i && tsc && node ./dist/index.js`
    npm ERR! Exit status 2
    npm ERR!
    npm ERR! Failed at the jordan-scrapes-amazon-for-products@1.0.0 start script.
    npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

    npm ERR! A complete log of this run can be found in:
    npm ERR! /home/mmuser/.npm/_logs/2019-01-16T21_13_08_825Z-debug.log
    ```

    1. I saw this comment after your email so I’m going to just paste my email response here.

      I’ve actually been working on some additional changes to this script which involves mass searching through many categories and that is where the error is coming from.

      1) For your scenario, you should be able to just rename the sample-config file to config.ts. It should be noted that this doesn’t have valid credentials and so the massSearch script will still fail. For your case, though, it should all compile and work since you aren’t trying the massSearch script.
      2) If you want to, you can definitely add a mongoDb and then replace the config credentials with yours. massSearch should work for you then and if you wanted you could save information from the npm start script or whatever else.

      The second bug to me (the getRandom one) is a surprise. I messed up there and didn’t update the function name. I’ve updated it and pushed the updated code. If you’re familiar with git, you can just pull the latest code. If not, I’d go to src/massSearch.ts and replace all instances of “getRandom(” with “getRandomFromArray(”.

      Side note, I’ll have a blog post out I think next week addressing how to use massSearch, though it won’t go much into mongodb.

      Good luck!
