Jordan Productizes the Link Checker

What is productizing?

I’ve really started to like this link checker. I decided it was time to make it more product ready. A lot of times when I’m building a piece of code I’ll build it quicker at the beginning to test it out. If I decide I like it, I’ll refactor as much as I can and then build out unit tests. And so that is what I have done here.

Accept user input

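import * as readline from 'readline';

// Wraps readline's callback-style question() in a promise so it can be awaited.
function readLinePromise(question: string): Promise<any> {
    return new Promise((resolve) => {
        const rl = readline.createInterface({
            input: process.stdin,
            output: process.stdout
        });

        rl.question(question, (answer) => {
            // Close the interface so the process isn't left waiting on stdin.
            rl.close();
            resolve(answer);
        });
    });
}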

I needed some of the script’s parameters to be settable when it starts. I used Node’s readline module and it worked pretty well. I wrapped it in a function that returns a promise so the calling code stays cleaner.

I wanted a user to be able to set the domain they wanted to test. I tried to make it a bit smart and hopefully it handles most scenarios pretty well. The smarter you try to make code, the more likely the added complexity introduces errors.

The problem was that domains can come in quite a few different forms. What if they didn’t put http? What protocol should I assume? I took the stance that if there isn’t an http, I’m going to assume they meant http and not https. This relies on the hope that most domains will have some kind of automatic redirect if they are supposed to be https.

export function formatDomain(desiredDomain?: string): string {
    let domain: string = '';

    if (desiredDomain) {
        // Check for a leading protocol; includes('http') would also match names like httpbin.org.
        if (!/^https?:\/\//i.test(desiredDomain)) {
            console.log('\x1b[36m%s\x1b[0m', 'WARN: Http protocol was not included. Prepending http. Hope this works.');
            domain = `http://${desiredDomain}`;
        }
        else {
            domain = desiredDomain;
        }
    }
    else {
        domain = "https://javascriptwebscrapingguy.com";
    }

    return domain;
}

Another nice-to-have would be validating the domain before making the request. A malformed domain is definitely going to throw an error at request time. Better validation here could give the user feedback if they enter just javascriptwebscrapingguy without a top level domain like .com.
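As a rough sketch of that validation idea: the helper below, `validateDomain`, is hypothetical and not part of the original script. It leans on Node’s built-in `URL` class and assumes that a hostname without a dot is missing its top level domain, which would wrongly reject something like localhost.

```typescript
// Hypothetical helper, not part of the original script.
// Returns a message for the user, or null if the domain looks usable.
function validateDomain(domain: string): string | null {
    let parsed: URL;

    try {
        // The URL constructor throws on input it cannot parse, such as a missing protocol.
        parsed = new URL(domain);
    } catch {
        return `"${domain}" is not a valid URL.`;
    }

    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        return `Unsupported protocol "${parsed.protocol}". Use http or https.`;
    }

    // Assumption: a hostname with no dot is missing a top level domain.
    if (!parsed.hostname.includes('.')) {
        return `"${parsed.hostname}" is missing a top level domain like .com.`;
    }

    return null;
}
```

Something like this could run on the result of `formatDomain()`, printing the message and asking again whenever it returns anything other than null.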

Next, I wanted to let them set the number of I/O threads that would be allowed. I default to 4 because that is where I saw my speeds peak, but ultimately I wanted the user to be in control of that. Here’s what both look like with readline.

        const desiredDomain = await readLinePromise('Desired website to check links? (https://javascriptwebscrapingguy.com) ');

        domain = formatDomain(desiredDomain);

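        const desiredIOThreadsAnswer = await readLinePromise('Desired I/O threads? (4) ');

        // readline hands back a string, so parse it and fall back to 4 when the answer is empty, zero, or not a number.
        const desiredIOThreads = parseInt(desiredIOThreadsAnswer, 10) || 4;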

Unit tests

A bit disappointed

I’ll admit I was a bit disappointed with my tests and I’m going to spend some more time on them over the next week. I just had a bear of a time getting my spies to work properly. I was able to cover the easy functions pretty well.

describe('formatDomain()', () => {

    it('should properly format a domain without an http protocol', () => {
        let domain = 'aarmora.com';
        domain = formatDomain(domain);

        expect(domain).to.equal('http://aarmora.com');
    });

    it('should return the domain if it includes an http protocol', () => {
        let domain = 'https://aarmora.com';
        domain = formatDomain(domain);

        expect(domain).to.equal('https://aarmora.com');
    });

    it('should return https://javascriptwebscrapingguy.com if nothing is passed in', () => {
        const domain = formatDomain();

        expect(domain).to.equal('https://javascriptwebscrapingguy.com');
    });

});

formatDomain() was a simple one where I just wanted to make sure I was handling all of the scenarios I expected.

domainCheck() was another simple function to test. The goal of the function is just to make sure the link and newDomain I am looking at both contain our home domain. I want to make sure I am comparing apples to apples, so it has to do some formatting first.

export function domainCheck(link: string, domain: string, newDomain: string) {

    link = link.replace('www.', '');
    domain = domain.replace('www.', '');
    newDomain = newDomain.replace('www.', '');

    return link.includes(domain) && newDomain.includes(domain);
}

describe('domainCheck()', () => {

    it('should return true if the link and newDomain include the original domain', () => {
        const link = 'http://pizzalink.com/whoa';
        const newDomain = 'http://pizzalink.com';
        const domain = 'http://pizzalink.com';

        expect(domainCheck(link, domain, newDomain)).to.equal(true);

    });

    it('should return true if the link and newDomain include the original domain even if the link has www', () => {
        const link = 'http://www.pizzalink.com/whoa';
        const newDomain = 'http://pizzalink.com';
        const domain = 'http://pizzalink.com';

        expect(domainCheck(link, domain, newDomain)).to.equal(true);

    });

    it('should return true if the link and newDomain include the original domain even if the domain has www', () => {
        const link = 'http://pizzalink.com/whoa';
        const newDomain = 'http://pizzalink.com';
        const domain = 'http://www.pizzalink.com';

        expect(domainCheck(link, domain, newDomain)).to.equal(true);

    });

    it('should return true if the link and newDomain include the original domain even if the newDomain has www', () => {
        const link = 'http://pizzalink.com/whoa';
        const newDomain = 'http://www.pizzalink.com';
        const domain = 'http://pizzalink.com';

        expect(domainCheck(link, domain, newDomain)).to.equal(true);

    });

    it('should return false if the link does not include the original domain', () => {
        const link = 'http://pizzalinkerer.com/whoa';
        const newDomain = 'http://pizzalink.com';
        const domain = 'http://pizzalink.com';

        expect(domainCheck(link, domain, newDomain)).to.equal(false);

    });

    it('should return false if the newDomain does not include the original domain', () => {
        const link = 'http://pizzalinker.com/whoa';
        const newDomain = 'http://pizzalinker.com';
        const domain = 'http://pizzalink.com';

        expect(domainCheck(link, domain, newDomain)).to.equal(false);

    });
});

This is one of those scenarios where the tests have a lot more lines of code than the actual code under test. I think it’s worth it. It’ll give me some confidence against regressions when doing any future changes to this bit of code.

getLinks() is a more complicated bit of code but not a terrible one to test. It doesn’t call any external functions so I didn’t have my spy problem. It’s still a lot more lines of code for the tests than the function. Still worth it for me.

I wanted to make sure I was only pushing the proper links into my next checks (no javascript:, tel:, or mailto: hrefs). I also wanted to test that no hrefs that include #comment or #respond are included. Finally, I wanted to make sure it could handle a page with multiple links but that it wouldn’t include the same link twice.
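To make those rules concrete, here is a minimal sketch of the filtering on its own. `filterHrefs` is a hypothetical helper, not the real getLinks(); it assumes the hrefs have already been pulled out of the HTML and only mirrors the rules described above plus the query parameter and leading slash handling the tests exercise.

```typescript
// Hypothetical helper: a simplified stand-in for the filtering inside getLinks().
function filterHrefs(hrefs: string[], domain: string): string[] {
    const skippedPrefixes = ['javascript:', 'mailto:', 'tel:'];
    const skippedFragments = ['#comment', '#respond'];
    // A Set drops duplicates and keeps insertion order.
    const seen = new Set<string>();

    for (const rawHref of hrefs) {
        const href = rawHref.trim();
        const lower = href.toLowerCase();

        if (!href) continue;
        // Skip hrefs that are not pages we can request.
        if (skippedPrefixes.some((prefix) => lower.startsWith(prefix))) continue;
        // Skip comment and reply anchors.
        if (skippedFragments.some((fragment) => lower.includes(fragment))) continue;

        // Drop everything from the first "?" on.
        let link = href.split('?')[0];

        // Turn a site-relative path like /about into a full URL on the domain.
        if (link.startsWith('/')) {
            link = `${domain}${link}`;
        }

        seen.add(link);
    }

    return Array.from(seen);
}
```

Keeping the rules in a pure function like this is what makes them cheap to test: no HTTP and no HTML parsing, just strings in and strings out.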

describe('getLinks()', () => {

    it('should return an array with a length of 26 if passed in html with 26 a tags in it', async () => {
        const domain = 'https://javascriptwebscrapingguy.com';
        const currentUrl = 'https://javascriptwebscrapingguy.com/jordan-plays-pool-multi-threading-with-a-pool-queue/';
        const links = await getLinks(testHTMLPage, domain, currentUrl);

        expect(links.length).to.equal(26);
    });

    it('should have an array filled with ILinkObjects', async () => {
        const domain = 'https://javascriptwebscrapingguy.com';
        const currentUrl = 'https://javascriptwebscrapingguy.com/jordan-plays-pool-multi-threading-with-a-pool-queue/';
        const links = await getLinks(testHTMLPage, domain, currentUrl);
        const testLinkObject: ILinkObject = {
            link: 'https://javascriptwebscrapingguy.com/#content',
            status: null,
            locationOfLink: currentUrl
        };

        expect(links[0]).to.deep.equal(testLinkObject);
    });

    it('should not include an href with "javascript:"', async () => {
        const domain = 'https://javascriptwebscrapingguy.com';
        const currentUrl = 'https://javascriptwebscrapingguy.com/jordan-plays-pool-multi-threading-with-a-pool-queue/';
        const testHTML = '<a href="javascript:void()">click me </a>';
        const links = await getLinks(testHTML, domain, currentUrl);

        expect(links.length).to.equal(0);
    });

    it('should not include an href with "mailto:"', async () => {
        const domain = 'https://javascriptwebscrapingguy.com';
        const currentUrl = 'https://javascriptwebscrapingguy.com/jordan-plays-pool-multi-threading-with-a-pool-queue/';
        const testHTML = '<a href="mailto:bill@pizza.com">click me </a>';
        const links = await getLinks(testHTML, domain, currentUrl);

        expect(links.length).to.equal(0);
    });

    it('should not include an href with "tel:"', async () => {
        const domain = 'https://javascriptwebscrapingguy.com';
        const currentUrl = 'https://javascriptwebscrapingguy.com/jordan-plays-pool-multi-threading-with-a-pool-queue/';
        const testHTML = '<a href="tel:1234567890">click me </a>';
        const links = await getLinks(testHTML, domain, currentUrl);

        expect(links.length).to.equal(0);
    });

    it('should not include an href that has "#comment"', async () => {
        const domain = 'https://javascriptwebscrapingguy.com';
        const currentUrl = 'https://javascriptwebscrapingguy.com/jordan-plays-pool-multi-threading-with-a-pool-queue/';
        const testHTML = `<a href="${currentUrl}#comment">click me </a>`;
        const links = await getLinks(testHTML, domain, currentUrl);

        expect(links.length).to.equal(0);
    });

    it('should not include an href that has "#respond"', async () => {
        const domain = 'https://javascriptwebscrapingguy.com';
        const currentUrl = 'https://javascriptwebscrapingguy.com/jordan-plays-pool-multi-threading-with-a-pool-queue/';
        const testHTML = `<a href="${currentUrl}#respond">click me </a>`;
        const links = await getLinks(testHTML, domain, currentUrl);

        expect(links.length).to.equal(0);
    });

    it('should split off all query parameters if not doing a deep check', async () => {
        const domain = 'https://javascriptwebscrapingguy.com';
        const currentUrl = 'https://javascriptwebscrapingguy.com/jordan-plays-pool-multi-threading-with-a-pool-queue/';
        const testHTML = `<a href="${currentUrl}?pizza=true">click me </a>`;
        const links = await getLinks(testHTML, domain, currentUrl);

        expect(links[0].link).to.equal(currentUrl);
    });

    it('should split off the slash at the first character if it is there and then add the domain with a slash', async () => {
        const domain = 'https://javascriptwebscrapingguy.com';
        const currentUrl = 'https://javascriptwebscrapingguy.com/jordan-plays-pool-multi-threading-with-a-pool-queue/';
        const testHTML = `<a href="/mysterio-man">click me </a>`;
        const links = await getLinks(testHTML, domain, currentUrl);

        expect(links[0].link).to.equal(`${domain}/mysterio-man`);
    });

    it('should handle multiple links', async () => {
        const domain = 'https://javascriptwebscrapingguy.com';
        const currentUrl = 'https://javascriptwebscrapingguy.com/jordan-plays-pool-multi-threading-with-a-pool-queue/';
        const testHTML = `<a href="/mysterio-man">click me </a> <a href="/mysterio-manners">click me again </a>`;
        const links = await getLinks(testHTML, domain, currentUrl);

        expect(links.length).to.equal(2);
    });

    it('should not add the same link twice', async () => {
        const domain = 'https://javascriptwebscrapingguy.com';
        const currentUrl = 'https://javascriptwebscrapingguy.com/jordan-plays-pool-multi-threading-with-a-pool-queue/';
        const testHTML = `<a href="/mysterio-man">click me </a> <a href="/mysterio-man">click me again </a>`;
        const links = await getLinks(testHTML, domain, currentUrl);

        expect(links.length).to.equal(1);
    });

});

This next part is where I didn’t do very well. It’s the checkLink function and it is recursive. I’d like to spy on how many times it calls itself but my spy kept failing. I’ll dig more into this over the next week.

For now, I test how I handle the returns of this function depending on what the http request returns. The clear winner for this part of the testing is nock. It makes it so easy to mock out http requests. It intercepts wherever I’m calling to and returns any html or status that I want. It also can time out which came in handy in my tests below.


describe('checkLink()', () => {

    it('should return an object with an ILinkObject with a status of 200 if request is successful', async () => {
        const originalLinkObject: ILinkObject = {
            link: 'https://javascriptwebscrapingguy.com/jordan-takes-advantage-of-multithreaded-i-o-in-nodejs/',
            status: null,
            locationOfLink: 'https://javascriptwebscrapingguy.com'
        };
        const originalLinks = [];
        const domain = 'https://javascriptwebscrapingguy.com';
        const desiredIOThreads = 4;

        nock('https://javascriptwebscrapingguy.com').get('/jordan-takes-advantage-of-multithreaded-i-o-in-nodejs/').reply(200, '');

        const checkLinkResponse = await checkLink(originalLinkObject, originalLinks, domain, desiredIOThreads);

        expect(checkLinkResponse.link.status).to.equal(200);

    });

    it('should return an object with an ILinkObject with a status of 404 if request returns 404', async () => {
        const originalLinkObject: ILinkObject = {
            link: 'https://javascriptwebscrapingguy.com/jordan-takes-advantage-of-multithreaded-i-o-in-nodejs/',
            status: null,
            locationOfLink: 'https://javascriptwebscrapingguy.com'
        };
        const originalLinks = [];
        const domain = 'https://javascriptwebscrapingguy.com';
        const desiredIOThreads = 4;

        nock('https://javascriptwebscrapingguy.com').get('/jordan-takes-advantage-of-multithreaded-i-o-in-nodejs/').reply(404, '');

        const checkLinkResponse = await checkLink(originalLinkObject, originalLinks, domain, desiredIOThreads);

        expect(checkLinkResponse.link.status).to.equal(404);

    });

    it('should return an object with an ILinkObject with a status of 999 if request takes longer than 10000ms', async () => {
        const originalLinkObject: ILinkObject = {
            link: 'https://javascriptwebscrapingguy.com/jordan-takes-advantage-of-multithreaded-i-o-in-nodejs/',
            status: null,
            locationOfLink: 'https://javascriptwebscrapingguy.com'
        };
        const originalLinks = [];
        const domain = 'https://javascriptwebscrapingguy.com';
        const desiredIOThreads = 4;

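        // nock only registers an interceptor once reply() is called, so the delayed request still needs one.
        nock('https://javascriptwebscrapingguy.com').get('/jordan-takes-advantage-of-multithreaded-i-o-in-nodejs/').delayConnection(10050).reply(200, '');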

        const checkLinkResponse = await checkLink(originalLinkObject, originalLinks, domain, desiredIOThreads);

        expect(checkLinkResponse.link.status).to.equal(999);

    });
});
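The 999 case is the least obvious of the three, so here is a minimal sketch of one way a slow request could be turned into that status. `statusOrTimeout` is a hypothetical helper and not the actual checkLink(); the request is assumed to be a promise that resolves with a status code, and treating a failed request as 999 is also an assumption.

```typescript
// Hypothetical helper, not the author's checkLink(): resolves with the real
// status code, or with 999 when the request outlasts timeoutMs.
function statusOrTimeout(request: Promise<number>, timeoutMs: number): Promise<number> {
    return new Promise<number>((resolve) => {
        // Whichever settles first wins; later resolve() calls are ignored.
        const timer = setTimeout(() => resolve(999), timeoutMs);

        request
            .then((status) => resolve(status))
            // Assumption: a request that fails outright is also reported as 999.
            .catch(() => resolve(999))
            // Clear the timer so it doesn't keep the process alive.
            .then(() => clearTimeout(timer));
    });
}
```

With a 10000 ms limit, a mocked request delayed by 10050 ms would lose the race and come back as 999, which is the behavior the last test above expects.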

I don’t have tests in place for the initial function, findDeadLinks, since it really is dependent on the spying as well. I’ll get that part working over this next week and get tests added for the rest. As for now, it’s in pretty good shape!
