Jordan Scrapes Alibaba

Demo code here

I’ll admit it, this scrape is super light. It really just shows the basics of a scrape using Puppeteer. I’d like to extend it in the future, possibly by combining it with Amazon product research, if I ever figure out how to scrape efficiently for private label products.

The goal with this scrape is to run a search and return the title, price, and minimum order for each result. An example search is “pasta maker,” and what Alibaba returns is something like this:

Alibaba pasta maker results.

It’s a pretty classic results page. For now we’re only getting the results from the first page, on the assumption that if you have to dig any deeper, the results probably won’t be something you are interested in. I only need to find one result that works for me, so digging deeper isn’t really worth it.

I start with a self-calling async function. Puppeteer is entirely promise-based, so this allows us to use async/await. Because it’s awesome.

(async () => {
  // awesome code here

})();

Check this for more explanation on async/await.

I start by setting our baseDomain and searchText as separate variables so it’s easy to adjust the searchText. I then just concatenate them together into the url. I really think it’s worthwhile to look at how websites set up their searches. Puppeteer allows me to act exactly like a user would: I could enter my search query in the input and click the “search” magnifying glass. But I can almost always skip that, and use less code, by running a search by hand once and looking at the url for the query parameters.
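Here is a rough sketch of what that url building could look like. The path `/trade/search` and the `SearchText` parameter name are assumptions based on what I see in the address bar, so check them against the url of a real search. I’m also swapping plain concatenation for `URLSearchParams` here so the space in the search text gets encoded for me; `buildSearchUrl` is just a name I made up for illustration.

```typescript
// Assumed values: confirm the path and parameter name against a real search url.
const baseDomain = 'https://www.alibaba.com';
const searchText = 'pasta maker';

// Hypothetical helper: builds the search url from the domain and the search text.
function buildSearchUrl(domain: string, text: string): string {
    // URLSearchParams handles the encoding (a space becomes "+").
    const params = new URLSearchParams({ SearchText: text });
    return `${domain}/trade/search?${params.toString()}`;
}

const url = buildSearchUrl(baseDomain, searchText);
console.log(url); // prints "https://www.alibaba.com/trade/search?SearchText=pasta+maker"
```

Changing the search is then a one line change to searchText.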

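searchText in the query param.

    // Launch the browser (assumes puppeteer is imported at the top of the file)
    const browser = await puppeteer.launch();
    // setUpNewPage is a helper function, not part of Puppeteer itself
    const page = await setUpNewPage(browser);

    await page.goto(url);

    // $$ returns an array of element handles, one per matching element
    const products = await page.$$('.m-product-item');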

Next, I set up Puppeteer and go to the url. From there, I grab all the product elements on the page using the .m-product-item class. I just look for whichever selector repeats for every result and use that. Puppeteer’s `$$` function returns an array of element handles. I also think it’s kind of funny to see this class: searchitem-new-theme. I’ve definitely used the “new” identifier in code, but I always think it’s a mistake afterward. I mean…new compared to what? Is it still new? Or was it new before? I think more specific identifiers are better.

.m-product-item as our selector

I set up an empty array that will hold the product data from each of the products that I loop through. I go through the loop, pick out the data that I want and then push it into the array.

    const parsedProducts: any[] = [];
    for (const product of products) {
        const parsedProduct = {
            title: '',
            price: '',
            minimumOrder: ''
        };
        parsedProduct.title = await getPropertyBySelector(product, '.title a', 'innerHTML');
        parsedProduct.price = await getPropertyBySelector(product, '.price b', 'innerHTML');
        // Strip any line breaks and surrounding whitespace from the price
        if (parsedProduct.price) {
            parsedProduct.price = parsedProduct.price.replace(/\n/g, '').trim();
        }
        parsedProduct.minimumOrder = await getPropertyBySelector(product, '.min-order b', 'innerHTML');

        console.log('parsed Product', parsedProduct);

        if (parsedProduct.title) {
            parsedProducts.push(parsedProduct);
        }

    }
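The price cleanup in the loop is easy to try out on its own. Below is a small sketch that pulls the same replace and trim chain into a function; `cleanInnerHtml` is a hypothetical name and the sample price string is made up, not something copied from Alibaba.

```typescript
// Hypothetical helper mirroring the replace/trim chain used on the price in the loop.
function cleanInnerHtml(raw: string | null | undefined): string {
    if (!raw) {
        // Nothing was found for the selector, so fall back to an empty string.
        return '';
    }
    // Remove every line break, then trim the leftover leading/trailing whitespace.
    return raw.replace(/\n/g, '').trim();
}

// Made-up sample of what innerHTML might look like with its surrounding whitespace.
const rawPrice = '\n            US $25.00-$40.00\n        ';
console.log(cleanInnerHtml(rawPrice)); // prints "US $25.00-$40.00"
```

The same function would work for the title and minimum order if they ever come back with stray whitespace.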

I’ve created a few Puppeteer helper functions and I use one here. getPropertyBySelector takes an element handle, a selector, and a property name, and returns that property from the element the selector matches inside the handle. I use it for each of the fields that I want.
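If you don’t have a helper like that, here is a minimal sketch of how one could work. This is an illustration and not necessarily the code in my helper package: the function name `getPropertyBySelectorSketch` and the two small interfaces are made up for this example. The interfaces are modeled on the `$`, `getProperty`, and `jsonValue` methods that Puppeteer’s element handles provide, so the sketch doesn’t need Puppeteer installed to compile.

```typescript
// Minimal stand-in types, modeled on the Puppeteer handle methods used below.
interface JsonValueHandle {
    jsonValue(): Promise<unknown>;
}
interface QueryableHandle {
    $(selector: string): Promise<QueryableHandle | null>;
    getProperty(propertyName: string): Promise<JsonValueHandle>;
}

// Hypothetical helper: find a child by selector and read one property from it.
async function getPropertyBySelectorSketch(
    parent: QueryableHandle,
    selector: string,
    property: string
): Promise<string | null> {
    const child = await parent.$(selector);
    if (!child) {
        // The selector matched nothing inside this element.
        return null;
    }
    const propertyHandle = await child.getProperty(property);
    const value = await propertyHandle.jsonValue();
    // Only hand back string values; anything else is treated as missing.
    return typeof value === 'string' ? value : null;
}
```

Returning null when nothing matches is a design choice in this sketch; it lets a check like the `if (parsedProduct.title)` in the loop skip elements that aren’t real products.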

That’s it! All done. From here you can save the data off into a database if you want or just use it from the console.
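If you don’t want to set up a database yet, one simple option is to dump the array to a JSON file. This is only a sketch: `saveProducts`, the `ParsedProduct` interface, the output file name, and the sample product are all made up for illustration, and in the real scrape you would pass in parsedProducts instead.

```typescript
import { writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Shape of the objects pushed into parsedProducts in the loop above.
interface ParsedProduct {
    title: string;
    price: string;
    minimumOrder: string;
}

// Hypothetical helper: write the products to disk as pretty-printed JSON.
function saveProducts(products: ParsedProduct[], filePath: string): void {
    writeFileSync(filePath, JSON.stringify(products, null, 2), 'utf8');
}

// Made-up sample data standing in for the scraped results.
const sampleProducts: ParsedProduct[] = [
    { title: 'Example pasta maker', price: 'US $10.00', minimumOrder: '1 Piece' },
];

// Assumed location: the OS temp directory, so the sketch doesn't clutter the project.
const outputPath = join(tmpdir(), 'alibaba-products.json');
saveProducts(sampleProducts, outputPath);
```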
