Jordan Unit Tests Web Scraping Done with Request

Demo code here

Okay, it’s not actually request I’m testing here but request-promise. I am just going to test the code I built last week that would go out and scrape Craigslist.

It required a bit of refactoring, which definitely made me want to start doing more test driven development because I can see how my code now is more testable and, frankly, cleaner. And having good automated tests is very important!

Let me tell you a story. Once upon a time there was a sweet SaaS application. It was the fairest in all the land and all of the customers wanted to pay their monies for it. The people developing this SaaS application also loved it and wanted it to continue to be the fairest in the land.

So they would develop more and more amazing things and release them to the customers. The customers would be very happy but would soon notice that some of the original stuff they loved about the SaaS had broken since the new things had come. The beloved SaaS application was feeling…regressionitis.

Bill Watterson. Fixing a bug or adding a new feature lets new bugs in.

The SaaS application had a deep, dark secret. It didn’t have great automated tests. It had manual QA testers that were amazing but because the automated tests were not good enough or expansive enough to trust fully, they had to spend three weeks going through the SaaS application to try and reduce the regressionitis.

And since the very intelligent, good looking, rugged people creating the new features spent about three weeks making new features, a customer of this SaaS application would have to wait over six weeks (three weeks development, three weeks QA) for a new feature. This caused great sadness and pain.

Customers, developers, QA, parents of developers, parents of QA, parents of customers, friends of customers’ parents all feeling great pain and sadness.

Don’t let this story of sadness and pain happen to you! Test your code! Build automated tests into it so you can have an increased level of confidence that future features and fixes won’t break existing functionality. All of this will allow you to iterate quicker and get feedback quicker. Plus, look at those green checkmarks! Doesn’t that feel good?

Green checkmarks!

The three libraries we use here for testing are Mocha, Chai, and Nock. This website does a great job explaining the difference between Mocha and Chai. Nock is just an amazing HTTP interceptor, essentially. You tell it which URLs you want to have intercepted and then you can return whatever status and/or content that you want.

After the refactor, the code we want to test is our simple scrapeCraigslist function.

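// Assumed imports: the article does not show them, so adjust to your own setup.
// ICategory is the project's own interface.
import * as cheerio from 'cheerio';
import * as requestPromise from 'request-promise';

export async function scrapeCraigslist(categories: ICategory[]) {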

    for (let category of categories) {

        let html;
        try {
            html = await requestPromise.get(category.url);
        }
        catch (e) {
            return Promise.reject(`Failure while getting category url ${category.url}, ${e}`);
        }

        const $ = cheerio.load(html);

        $('.result-row').each((index, element) => {
            const car = {
                title: $(element).find('p .result-title').text(),
                price: $(element).find('a .result-price').text(),
                location: $(element).find('.result-meta .result-hood').text(),
                url: $(element).find('a').attr('href')
            };
            category.foundCars.push(car);
        });
    }

    return Promise.resolve(categories);

}

This is a pretty simple function to test because I really only have to mock out the requestPromise function. If you can refactor your code to be as simple and testable as this, you are a very blessed man. In reality, most of the time the functions are more complicated and when testing you’ll have to do a lot of mocking and spying.
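
For functions that are harder to mock, one option is to pass the fetching function in as a parameter so a test can hand over a stub. The sketch below is not the code from this post; `fetchPages`, `SketchFetchHtml`, and both stubs are hypothetical names, and the URLs are placeholders.

```typescript
// Hypothetical type: anything that turns a URL into an HTML string.
type SketchFetchHtml = (url: string) => Promise<string>;

interface SketchPage {
    url: string;
    html: string;
}

// Fetches each URL with the injected function and wraps any failure
// in the same style of message the scraper in this post uses.
async function fetchPages(urls: string[], fetchHtml: SketchFetchHtml): Promise<SketchPage[]> {
    const pages: SketchPage[] = [];
    for (const url of urls) {
        try {
            pages.push({ url, html: await fetchHtml(url) });
        } catch (e) {
            throw new Error(`Failure while getting category url ${url}, ${e}`);
        }
    }
    return pages;
}

// Stubs that stand in for the network inside a test.
const stubFetch: SketchFetchHtml = async (url) => `<html>${url}</html>`;
const failingFetch: SketchFetchHtml = async () => {
    throw new Error('offline');
};
```

With this shape there is nothing to intercept: a test calls `fetchPages(urls, stubFetch)` for the happy path and `fetchPages(urls, failingFetch)` for the error path.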

I find it helpful to be very explicit and liberal with your describes. It makes it a lot easier to debug tests when something is going wrong. In a full-scale application, it’s very realistic to have thousands of tests, so being able to quickly find which test is breaking is important. I break them up into two sections within the scrapeCraigslist() section, one for handling successes and one for errors.
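
To see why nested describes help, here is a toy model of how suite names stack up into the title of each test. This is not Mocha; `sketchDescribe` and `sketchIt` are hypothetical stand-ins, and the 'Successes' suite name is an assumption.

```typescript
// Names of the suites we are currently inside, outermost first.
const suiteStack: string[] = [];
// Full titles of every registered test, in registration order.
const collectedTitles: string[] = [];

// Enters a suite, runs its body, then leaves the suite again.
function sketchDescribe(name: string, body: () => void): void {
    suiteStack.push(name);
    body();
    suiteStack.pop();
}

// Records a test under the full path of the suites that enclose it.
function sketchIt(name: string): void {
    collectedTitles.push(suiteStack.concat(name).join(' > '));
}

sketchDescribe('scrapeCraigslist()', () => {
    sketchDescribe('Successes', () => {
        sketchIt('should find four cars');
    });
    sketchDescribe('Errors', () => {
        sketchIt('should handle a request error');
    });
});

// collectedTitles now holds:
//   'scrapeCraigslist() > Successes > should find four cars'
//   'scrapeCraigslist() > Errors > should handle a request error'
```

When a test breaks, the full path tells you which function and which kind of scenario to look at before you even open the file.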

It’s really important to test error scenarios! Maybe even more so than success scenarios. It forces you to think if your code is prepared to handle when something goes wrong. When developing, we are already thinking along the happy path and often forget to plan for when things will inevitably go wrong.

In my success cases, I am always going to be calling to the same url, so I set up my nock mock outside of the tests like this:



        nock('https://boise.craigslist.org').get('/search/cta?auto_make_model=toyota%20camry&auto_title_status=1&max_auto_miles=100000&max_price=11000&min_auto_year=2010&min_price=7000').reply(200, successCraigslistHtml).persist();


        after(() => {
            nock.cleanAll();
        });

Nock looks at the domain and then the path and replies with whatever I want. In this case, that is a 200 status and a huge string containing the page we are requesting. I just opened “View Page Source”, copied everything, and pasted it into an exported string in src/test/craigslist-page.stub.ts. So we are actually returning a Craigslist page that looks exactly like what we will get back when the function is actually running.

Making the nock .persist() allows it to continue to work after it does an intercept. We use nock.cleanAll() within the after block so that after all these tests are done, we clear the mocked route and response so we can use a different one when we get to our error block.
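
If the one-shot versus persistent behavior is hard to picture, the toy registry below models it. It is not nock's code and makes no real HTTP calls; every name in it is hypothetical, and the '/search/cta' and '/once' paths are shortened placeholders.

```typescript
interface ToyReply {
    status: number;
    body: string;
}

interface ToyInterceptor {
    origin: string;
    path: string;
    reply: ToyReply;
    persist: boolean;
}

class ToyInterceptors {
    private interceptors: ToyInterceptor[] = [];

    // Registers a canned reply for one origin + path.
    add(origin: string, path: string, reply: ToyReply, persist = false): void {
        this.interceptors.push({ origin, path, reply, persist });
    }

    // Returns the canned reply for a URL, or undefined when nothing matches.
    // A non-persistent interceptor is used up by its first match.
    match(url: string): ToyReply | undefined {
        for (let i = 0; i < this.interceptors.length; i++) {
            const candidate = this.interceptors[i];
            if (url === candidate.origin + candidate.path) {
                if (!candidate.persist) {
                    this.interceptors.splice(i, 1);
                }
                return candidate.reply;
            }
        }
        return undefined;
    }

    // Forgets every registered interceptor, persistent or not.
    cleanAll(): void {
        this.interceptors = [];
    }
}

const toyRegistry = new ToyInterceptors();
toyRegistry.add('https://boise.craigslist.org', '/search/cta', { status: 200, body: '<html></html>' }, true);
toyRegistry.add('https://boise.craigslist.org', '/once', { status: 200, body: 'once' });

const persistentFirst = toyRegistry.match('https://boise.craigslist.org/search/cta');
const persistentSecond = toyRegistry.match('https://boise.craigslist.org/search/cta');
const oneShotFirst = toyRegistry.match('https://boise.craigslist.org/once');
const oneShotSecond = toyRegistry.match('https://boise.craigslist.org/once'); // undefined: already used up
toyRegistry.cleanAll();
const afterCleanAll = toyRegistry.match('https://boise.craigslist.org/search/cta'); // undefined: cleared
```

The persistent entry answers every success test, and clearing it at the end leaves the route free for the error test to register its own reply.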

Then we just start testing. You will know your test scenarios best based on what your function does. What I typically try to test is valid scenarios of what I want my function to be able to do and any errors that may occur with it. Then as bugs pop up, I extend my unit tests to include those scenarios.

        it('should return the same amount of categories passed in', async () => {
            const categories: ICategory[] = [
                {
                    name: 'Toyota Camrys',
                    url: 'https://boise.craigslist.org/search/cta?auto_make_model=toyota%20camry&auto_title_status=1&max_auto_miles=100000&max_price=11000&min_auto_year=2010&min_price=7000',
                    foundCars: [] as ICar[]
                }
            ];

            const categoriesResponse: any = await scrapeCraigslist(categories);

            expect(categoriesResponse.length).to.equal(categories.length);
        });

So, this function should search through the categories that I pass in and just extend the objects. It shouldn’t add any additional categories so I test for that here.

        it('should have a price of $8495 on the first car', async () => {
            const categories: ICategory[] = [
                {
                    name: 'Toyota Camrys',
                    url: 'https://boise.craigslist.org/search/cta?auto_make_model=toyota%20camry&auto_title_status=1&max_auto_miles=100000&max_price=11000&min_auto_year=2010&min_price=7000',
                    foundCars: [] as ICar[]
                }
            ];

            try {
                const categoriesResponse: ICategory[] = await scrapeCraigslist(categories);
                expect(categoriesResponse[0].foundCars[0].price).to.equal('$8495');
            }
            catch (e) {
                fail(`Shouldn't fail, ${e}`);
            }
        });

This part tests that I am successfully getting prices from the page for the cars. I am only testing that the first and second car have a price and then I’m assuming that it’s working for the rest of the vehicles. If later somehow this assumption is broken, I can extend my tests to test all of the vehicles.

        it('should have a price of $10871 on the second car', async () => {
            const categories: ICategory[] = [
                {
                    name: 'Toyota Camrys',
                    url: 'https://boise.craigslist.org/search/cta?auto_make_model=toyota%20camry&auto_title_status=1&max_auto_miles=100000&max_price=11000&min_auto_year=2010&min_price=7000',
                    foundCars: [] as ICar[]
                }
            ];

            try {
                const categoriesResponse: ICategory[] = await scrapeCraigslist(categories);
                expect(categoriesResponse[0].foundCars[1].price).to.equal('$10871');
            }
            catch (e) {
                fail(`Shouldn't fail, ${e}`);
            }
        });

Now I just go through and test the additional data pieces that I am looking for, in this case both title and location.

        it('should have a title of 2012 TOYOTA CAMRY SE on the first car', async () => {
            const categories: ICategory[] = [
                {
                    name: 'Toyota Camrys',
                    url: 'https://boise.craigslist.org/search/cta?auto_make_model=toyota%20camry&auto_title_status=1&max_auto_miles=100000&max_price=11000&min_auto_year=2010&min_price=7000',
                    foundCars: [] as ICar[]
                }
            ];

            try {
                const categoriesResponse: ICategory[] = await scrapeCraigslist(categories);
                expect(categoriesResponse[0].foundCars[0].title).to.equal('2012 TOYOTA CAMRY SE');
            }
            catch (e) {
                fail(`Shouldn't fail, ${e}`);
            }
        });

        it('should have a location of " (BOISE)" on the first car', async () => {
            const categories: ICategory[] = [
                {
                    name: 'Toyota Camrys',
                    url: 'https://boise.craigslist.org/search/cta?auto_make_model=toyota%20camry&auto_title_status=1&max_auto_miles=100000&max_price=11000&min_auto_year=2010&min_price=7000',
                    foundCars: [] as ICar[]
                }
            ];

            try {
                const categoriesResponse: ICategory[] = await scrapeCraigslist(categories);
                expect(categoriesResponse[0].foundCars[0].location).to.equal(' (BOISE)');
            }
            catch (e) {
                fail(`Shouldn't fail, ${e}`);
            }
        });

Then finally for the successes, we just ensure that the number of cars returned is the amount we expect.

        it('should find four cars', async () => {
            const categories: ICategory[] = [
                {
                    name: 'Toyota Camrys',
                    url: 'https://boise.craigslist.org/search/cta?auto_make_model=toyota%20camry&auto_title_status=1&max_auto_miles=100000&max_price=11000&min_auto_year=2010&min_price=7000',
                    foundCars: [] as ICar[]
                }
            ];

            try {
                const categoriesResponse: ICategory[] = await scrapeCraigslist(categories);
                expect(categoriesResponse[0].foundCars.length).to.equal(4);
            }
            catch (e) {
                fail(`Shouldn't fail, ${e}`);
            }
        });
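
Every success test above rebuilds the same category fixture by hand. Since scrapeCraigslist pushes into foundCars, each test does need its own fresh copy, but a small factory can cut the repetition. This is only a sketch: `buildCamryCategories` is a hypothetical helper, and the ICar and ICategory shapes are inferred from how the code in this post uses them.

```typescript
// Assumed shapes, inferred from the scraper and the tests.
interface ICar {
    title: string;
    price: string;
    location: string;
    url?: string;
}

interface ICategory {
    name: string;
    url: string;
    foundCars: ICar[];
}

const camryUrl = 'https://boise.craigslist.org/search/cta?auto_make_model=toyota%20camry&auto_title_status=1&max_auto_miles=100000&max_price=11000&min_auto_year=2010&min_price=7000';

// Returns a brand-new fixture on every call, so cars pushed into
// foundCars during one test never leak into the next one.
function buildCamryCategories(): ICategory[] {
    return [
        {
            name: 'Toyota Camrys',
            url: camryUrl,
            foundCars: []
        }
    ];
}

const firstFixture = buildCamryCategories();
firstFixture[0].foundCars.push({ title: '2012 TOYOTA CAMRY SE', price: '$8495', location: ' (BOISE)' });

// A second call is unaffected by what happened to the first fixture.
const secondFixture = buildCamryCategories();
```

Sharing one fixture object across tests instead would let foundCars grow with every call, and the "should find four cars" test would start failing for reasons that have nothing to do with the scraper.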

As for errors, I just test if the request fails and that I handle it ok.

    describe('Errors', () => {


        it('should handle a request error', async () => {
            nock('https://boise.craigslist.org').get('/search/cta?auto_make_model=toyota%20camry&auto_title_status=1&max_auto_miles=100000&max_price=11000&min_auto_year=2010&min_price=7000').replyWithError('Broken');
            const categories: ICategory[] = [
                {
                    name: 'Toyota Camrys',
                    url: 'https://boise.craigslist.org/search/cta?auto_make_model=toyota%20camry&auto_title_status=1&max_auto_miles=100000&max_price=11000&min_auto_year=2010&min_price=7000',
                    foundCars: [] as ICar[]
                }
            ];

            let error;
            try {
                await scrapeCraigslist(categories);
            }
            catch (e) {
                error = e;
            }

            expect(error).to.equal(`Failure while getting category url ${categories[0].url}, RequestError: Error: Broken`);

        });

    });
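
The expected string in that assertion can look odd with its "RequestError: Error: Broken" tail. The sketch below shows where it comes from, assuming the rejection is an error named RequestError whose message is the stringified underlying error, which is what the expected string implies. `SketchRequestError` and `failureMessage` are hypothetical stand-ins, not request-promise code, and the URL is shortened.

```typescript
// Stand-in for the rejection value. Assumption: name is 'RequestError'
// and message is the original error converted to a string.
class SketchRequestError extends Error {
    constructor(cause: Error) {
        super(String(cause));
        this.name = 'RequestError';
    }
}

// Mirrors the template string in the scraper's catch block.
// Interpolating an error calls its toString(), giving "name: message".
function failureMessage(url: string, e: unknown): string {
    return `Failure while getting category url ${url}, ${e}`;
}

const sketchMessage = failureMessage(
    'https://boise.craigslist.org/search/cta',
    new SketchRequestError(new Error('Broken'))
);
// sketchMessage is:
// 'Failure while getting category url https://boise.craigslist.org/search/cta, RequestError: Error: Broken'
```

Because the rejection is a plain string built this way, the test can compare it with a simple equality check.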

And that’s it! Simple as that. It’s a pretty simple test scenario, but nock really makes it easy to unit test code that uses request-promise.
