Jordan Scrapes Contact Pages

Demo Code Here

In my last post I went through scraping Amazon to find some product that we could potentially sell. Our goal was to find a product and the brand that owns it that would fit our criteria.

This post starts from there. We have the brand and now we want to contact them and ask if we can represent their product. The code we will go through today will access a website and search for any contact phone numbers.

We use Puppeteer as our main tool here. We’ve used it in other scrapes and it just does a great job for all around web scraping.

We start with a self-invoking async function to handle the many promises that we will be working with through Puppeteer.

(async () => {
  // awesome code here

})();

Check this for more explanation on async/await.

I also run a lot of my Puppeteer scrapers on a Digital Ocean (which I love) Ubuntu box, so I have a block at the top that makes Puppeteer work on Ubuntu. I have scripts set up in package.json that pass in arguments for running headless and/or on Ubuntu.

// package.json
...
"scripts": {
    "start": "tsc && node ./dist/index.js",
    "start:ubuntu": "tsc && node ./dist/index.js ubuntu",
    "start:headless": "tsc && node ./dist/index.js headless"
},
...

We start at the top by setting the url we are going to go to. You can adjust this to whatever you would like.

const desiredUrl = 'http://clenera.com';

We just check whether process.argv[2] or process.argv[3] (the first two arguments after the script path) is one of the items we are looking for (maybe I want to run both options and I don’t want to care about the order?). There is probably a better way to pass arguments to the script, but that’s not really the focus of this code.

let ubuntu = false;
let headless = false;
if (process.argv[2] === 'ubuntu' || process.argv[3] === 'ubuntu') {
	ubuntu = true;
}
if (process.argv[2] === 'headless' || process.argv[3] === 'headless') {
	headless = true;
}
let browser: Browser;
if (ubuntu) {
	// Ubuntu server: always headless, with the sandbox disabled so Chromium can launch
	browser = await puppeteer.launch({ headless: true, args: ['--window-size=1800,1200', '--no-sandbox', '--disable-setuid-sandbox'] });
}
else {
	browser = await puppeteer.launch({ headless: headless, args: ['--window-size=1800,1200'] });
}
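
Since there is probably a better way to pass in arguments, here is a minimal sketch of one option: check whether a flag appears anywhere in the argument list, so order and position stop mattering. The names parseFlags, buildLaunchOptions, ScraperFlags and SketchLaunchOptions are hypothetical and are not part of Puppeteer or puppeteer-helpers; the options object only mirrors the headless and args values used in the launch calls above.

```typescript
// Hypothetical helpers for illustration; neither comes from Puppeteer.
interface ScraperFlags {
    ubuntu: boolean;
    headless: boolean;
}

interface SketchLaunchOptions {
    headless: boolean;
    args: string[];
}

// Pass in the arguments that follow the script path, e.g. process.argv.slice(2).
function parseFlags(args: string[]): ScraperFlags {
    return {
        ubuntu: args.indexOf('ubuntu') !== -1,
        headless: args.indexOf('headless') !== -1
    };
}

// Builds the same kind of options object the launch calls above pass to puppeteer.launch().
function buildLaunchOptions(flags: ScraperFlags): SketchLaunchOptions {
    const args = ['--window-size=1800,1200'];
    if (flags.ubuntu) {
        args.push('--no-sandbox', '--disable-setuid-sandbox');
    }
    // The ubuntu flag always implies headless, matching the original block.
    return { headless: flags.ubuntu || flags.headless, args };
}
```

With something like this, the launch could shrink to a single call such as puppeteer.launch(buildLaunchOptions(parseFlags(process.argv.slice(2)))), and "ubuntu headless" would behave the same as "headless ubuntu".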

It’s worth noting that I am depending upon a homemade npm package that has some helpful puppeteer helper functions. I found it helpful for me but it’s definitely not required. You would just have to use puppeteer functions to replace the functions I use. Up at the top you can see the functions I’m importing:

import { getPropertyByHandle, setUpNewPage } from 'puppeteer-helpers';

Back to the code. I use http://clenera.com as an example because I know their website has phone numbers available and a contact page. So we go there first with await page.goto('http://clenera.com');.

I have built out two fairly simple functions that are the staples of this. The first one just grabs the whole HTML as a string and then runs a regex that checks for phone numbers. It should be noted that this will not be perfect. There are a ton of different phone number formats in the world. This does cover most of the scenarios in the US. You can change this regex to one that works for your country.

export async function getPhoneNumber(page: Page, phoneNumbers: string[] = []): Promise<string[]> {
    const body = await getPropertyByHandle(await page.$('body'), 'innerHTML');
    const potentialPhoneNumbers = body.match(/(<a href.*?>.*?([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4})).*?<\/a>)|([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4}))/g);
    if (potentialPhoneNumbers) {
        for (let number of potentialPhoneNumbers) {
            // Check if we already have some of these numbers
            if (phoneNumbers.filter(phoneNumber => phoneNumber === number).length === 0) {
                phoneNumbers.push(number);
            }
        }
    }

    return Promise.resolve(phoneNumbers);
}

We get the list of potentialPhoneNumbers and then loop through them and check to see if we already have these numbers. This function is designed to be reusable (and we’ll reuse it later) so you’ll see that we accept an array of phoneNumbers and if we don’t receive anything we just set it equal to an empty array.

Back to where we were: if we don’t already have a phone number, we push it into our phoneNumbers array, and the function then resolves its promise with those phone numbers.
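
One thing to watch with the duplicate check: it compares the raw matched strings, so (801) 555-1234 and 801.555.1234 count as two different numbers, and a match that came from inside a link keeps its <a> markup. A minimal sketch of one way around that, assuming US-style numbers, is to reduce every match to a single canonical form before comparing. normalizePhoneNumber and dedupePhoneNumbers are hypothetical names, the pattern is only the plain-number half of the regex above, and the numbers are made-up examples.

```typescript
// The plain-number half of the regex used in getPhoneNumber, without the global flag
// so that match() exposes the three capture groups.
const plainPhonePattern = /[(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4})/;

// Hypothetical helper: returns the first phone number found in `raw` as 123-456-7890,
// or null when nothing in the string looks like a phone number.
function normalizePhoneNumber(raw: string): string | null {
    const match = raw.match(plainPhonePattern);
    if (!match) {
        return null;
    }
    return `${match[1]}-${match[2]}-${match[3]}`;
}

// Hypothetical helper: normalizes every raw match and keeps each number once.
function dedupePhoneNumbers(rawNumbers: string[]): string[] {
    const unique: string[] = [];
    for (const raw of rawNumbers) {
        const normalized = normalizePhoneNumber(raw);
        if (normalized !== null && unique.indexOf(normalized) === -1) {
            unique.push(normalized);
        }
    }
    return unique;
}

// Three raw matches that are all the same (made-up) number.
const cleaned = dedupePhoneNumbers([
    '(801) 555-1234',
    '801.555.1234',
    '<a href="/contact">801 555 1234</a>'
]);
console.log(cleaned); // logs a single entry: 801-555-1234
```

Running the array returned by getPhoneNumber through a step like this at the very end would leave the scraping functions themselves untouched.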

Note that we do this on the homepage before we search for any contact pages, with let potentialPhoneNumbers: any = await getPhoneNumber(page);. Some sites have phone numbers right on their home page, so we should grab them while we are here.

Then we use the next function that we built. This function accepts a phoneNumbers array and returns an array that includes everything from the array that was passed in. If no array is passed in, it will initialize an empty array. This is why we can just reassign the potentialPhoneNumbers variable with the return value from our function: potentialPhoneNumbers = await getPhoneNumbersFromContactPage(page, browser, potentialPhoneNumbers);.

The function we call here will use Puppeteer to find all the links on the page and loop through them all, checking if any of them contain the word “contact”. You can certainly change this to be “About” or any other option that you think would be used to hold the phone numbers that you want.

export async function getPhoneNumbersFromContactPage(page: Page, browser: Browser, phoneNumbers: string[] = []): Promise<string[]> {
    // Get all the links and go to the contact pages to look for phone numbers
    const links = await page.$$('a');
    for (let link of links) {
        if ((await getPropertyByHandle(link, 'innerHTML')).toLowerCase().includes('contact')) {
            let contactUrl;
            try {
                contactUrl = await getPropertyByHandle(link, 'href');
                // Skip email links before opening a new page for them
                if (!contactUrl.includes('mailto:')) {
                    const contactPage = await setUpNewPage(browser);
                    try {
                        await contactPage.goto(contactUrl, {
                            waitUntil: 'networkidle0',
                            timeout: 3500
                        });
                        phoneNumbers = await getPhoneNumber(contactPage, phoneNumbers);
                    }
                    finally {
                        // Close the page even if navigation times out
                        await contactPage.close();
                    }
                }
            }
            catch (err) {
                console.log('Err while going to contact page', err, contactUrl);
            }
        }
    }

    return Promise.resolve(phoneNumbers);
}

If the link text contains “contact”, we confirm that it isn’t a “mailto:” link and then we navigate to that page. Once on the page, we just call the first function we built earlier to get phone numbers. We take the returned array, return it to the calling code, and then we are done.
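
The link check is easy to pull out into a small function that can be tested without opening a browser, which also makes it simple to look for words other than “contact”. This is only a sketch: isContactLink and its default keyword list are hypothetical, example.com is a placeholder, and skipping tel: links as well as mailto: links is an extra assumption on top of what the scraper does.

```typescript
// Hypothetical helper: decides whether a link is worth visiting to look for phone numbers.
// `linkHtml` is the link's innerHTML and `href` its URL, as read in the loop above.
function isContactLink(linkHtml: string, href: string, keywords: string[] = ['contact']): boolean {
    const lowerHref = href.toLowerCase();
    // Email and phone links are not pages we can navigate to (tel: is an added assumption).
    if (lowerHref.indexOf('mailto:') === 0 || lowerHref.indexOf('tel:') === 0) {
        return false;
    }
    const lowerHtml = linkHtml.toLowerCase();
    // Keep the link when its text mentions any of the keywords, ignoring case.
    return keywords.some(keyword => lowerHtml.indexOf(keyword.toLowerCase()) !== -1);
}

// A few made-up links to show the behavior.
console.log(isContactLink('Contact Us', 'https://example.com/contact'));
console.log(isContactLink('Contact', 'mailto:hello@example.com'));
console.log(isContactLink('About', 'https://example.com/about', ['contact', 'about']));
```

Inside the loop, the condition could then read the link's innerHTML and href once and pass them in along with whatever keywords fit the sites being scraped, such as ['contact', 'about'].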

