Jordan Continues Working on Scraping for Private Label Products

This post is going to be a bit different. It’s kind of a half post that walks through my thought process as I work to improve the scrape for private label products. I’m going to cover some of the problems I’m having and the steps I’m taking to try to overcome them. I’m writing this as I work, so things could be a bit chaotic.

Fix memory overflow.

I found that some pages don’t have a title, so the title lookup would fail. When it failed, the page wasn’t getting closed, so after hitting a bunch of these the droplet would run out of memory from all the open browser tabs. Now the page gets closed before the error is passed along:

    let title: string;
    try {
        title = await getPropertyBySelector(page, '#productTitle', 'innerHTML');
        title = title.trim();
    }
    catch (e) {
        await page.close();
        return Promise.reject(e);
    }
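A `try`/`finally` wrapper is another way to make this cleanup harder to forget, since the tab gets closed whether the scrape succeeds or throws. This is only a minimal sketch: `withPage` and `ClosablePage` are hypothetical names, and the only assumption about the page is that it has an async `close()`, as a Puppeteer page does.

```typescript
// Hypothetical minimal shape: anything with an async close(), e.g. a Puppeteer Page.
interface ClosablePage {
    close(): Promise<void>;
}

// Runs `work` against the page and always closes the page afterwards,
// whether `work` resolves or rejects. The original error still propagates.
async function withPage<P extends ClosablePage, T>(
    page: P,
    work: (page: P) => Promise<T>
): Promise<T> {
    try {
        return await work(page);
    }
    finally {
        await page.close();
    }
}
```

With something like this, the title lookup could run inside `withPage(page, async p => ...)` and every later step that might throw would get the same cleanup for free.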

Rabbit holing.

I currently search like this: take a high-selling item, get the urls of its related items, search those items, get more urls from their related items, and so on forever. As a result the list keeps growing without bound and will very likely be all closely related products. I’ve started by putting a limit on the number of related urls it grabs per details page, like this:


        if (howManyDetailUrls) {
            productUrls = productUrls.concat(detailsResults.productUrls.slice(0, howManyDetailUrls));
        }
        else {
            productUrls = productUrls.concat(detailsResults.productUrls);
        }

I’m really not sure this will fix the problem. The list will still grow forever, and it will keep growing within the same category. We probably need to set a hard limit per category and then swap to a different category once that limit is reached.
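The hard limit per category could look roughly like the sketch below. Everything here is an assumption for illustration: `crawlWithCategoryLimit`, `getRelated`, `CategorySeed` and `maxPerCategory` are made-up names, and `getRelated` is a synchronous stand-in for the real (async) details-page scrape.

```typescript
// Hypothetical stand-in for the details-page scrape: given a product url,
// return the related product urls found on that page.
type GetRelated = (url: string) => string[];

// Hypothetical seed: one high-selling product url per category.
interface CategorySeed {
    category: string;
    url: string;
}

// Breadth-first crawl from each seed, stopping once `maxPerCategory` urls
// have been visited for that category, then moving on to the next category.
function crawlWithCategoryLimit(
    seeds: CategorySeed[],
    getRelated: GetRelated,
    maxPerCategory: number,
    howManyDetailUrls: number
): Record<string, string[]> {
    const visitedByCategory: Record<string, string[]> = {};
    for (const seed of seeds) {
        const queue: string[] = [seed.url];
        const seen = new Set<string>(queue);
        const visited: string[] = [];
        while (queue.length > 0 && visited.length < maxPerCategory) {
            const url = queue.shift() as string;
            visited.push(url);
            // Cap the related urls taken per page and queue each url only once.
            for (const related of getRelated(url).slice(0, howManyDetailUrls)) {
                if (!seen.has(related)) {
                    seen.add(related);
                    queue.push(related);
                }
            }
        }
        visitedByCategory[seed.category] = visited;
    }
    return visitedByCategory;
}
```

The per-page cap keeps any single product from flooding the queue, and the per-category cap is what actually stops the growth and forces the switch to the next category.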

Specificity

This is probably the biggest problem so far. The pages we do find are too specific. It’s not really what a human would search for, so you can’t really compare against your competition.

I mean, look at these. Pretty specific.

How do I get a better search term from a product page? Maybe I could look for some of the best selling category deals and extrapolate something?

Here’s a thing I’m trying now to improve the search term. I’ve found that it’s pretty common for sellers to put a bunch of things in their title to increase their SEO value. So I remove everything after commas and dashes. This does expose me to empty searches like “https://www.amazon.com/s?k=” (when nothing is left in front of the first comma or dash once the brand is removed), but it’s probably okay to get a few of those.

Next thing on the list for specificity is trying to limit it to four words.

    // Remove the brand, then keep only the text before the first comma or dash
    let searchTerm: string;
    try {
        searchTerm = title.replace(brand, '').split(',')[0].split('-')[0];
        // Get the first four words; trim first so the space left behind by
        // the removed brand doesn't count as an (empty) word
        searchTerm = searchTerm.trim().split(/\s+/).slice(0, 4).join(' ');
    }
    catch (e) {
        await page.close();
        return Promise.reject(e);
    }
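The same steps can be pulled into a small pure function, which makes them easy to try against sample titles without opening a browser. This is a rough sketch under assumptions: `buildSearchTerm` and `maxWords` are hypothetical names, and the title in the usage note is invented.

```typescript
// Hypothetical helper: derive a short search term from a product title.
function buildSearchTerm(title: string, brand: string, maxWords = 4): string {
    // Drop the first occurrence of the brand, then keep only the text
    // before the first comma or dash.
    const withoutBrand = title.replace(brand, '');
    const firstChunk = withoutBrand.split(/[,-]/)[0];
    // Split on runs of whitespace so stray spaces don't count as words.
    const words = firstChunk.trim().split(/\s+/).filter(word => word.length > 0);
    return words.slice(0, maxWords).join(' ');
}
```

For example, `buildSearchTerm('Acme Stainless Steel Water Bottle 32oz, Leak-Proof Lid - Blue', 'Acme')` returns `'Stainless Steel Water Bottle'`. A title like `'Acme - Blue'` returns an empty string, so the caller can check for that and skip the search instead of hitting an empty `?k=` url. Note that splitting on every dash also cuts hyphenated words short.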

Duplicates

When pushing to my keeper urls I was getting duplicates, especially within the same category. So I added a check to make sure we don’t push the exact same url twice. We’re still going to get similar ones, but that’s okay for now.

            // Only keep urls we haven't already stored
            if (!keepers.includes(results.url)) {
                keepers.push(results.url);
                console.log('***** Added a keeper in strict mode ******', results.url, keepers.length);
            }
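If the keeper list gets long, scanning the whole array on every push starts to add up. One option is to track seen urls in a `Set` next to the array. This is a small sketch, not the scraper's actual code: `KeeperList` is a hypothetical name, and it still only catches exact string matches, same as the check above.

```typescript
// Hypothetical wrapper around the keeper list: the Set gives fast
// "have we seen this url?" checks, while the array keeps insertion order.
class KeeperList {
    private readonly seen = new Set<string>();
    readonly urls: string[] = [];

    // Returns true if the url was new and got added, false if it was a duplicate.
    add(url: string): boolean {
        if (this.seen.has(url)) {
            return false;
        }
        this.seen.add(url);
        this.urls.push(url);
        return true;
    }
}
```

The boolean return value makes it easy to log only when a keeper was actually added.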
