Jordan Does Dead Link Checking

Update as of June 21, 2019: I moved the branch from master to singlethreaded-io since I decided this wasn’t the ideal way to go. I have several posts trying different methods but ended up going with a multithreaded-io approach to speed it up. I’ve updated the links here to go to the correct branch.

Sample Code Here

Starting with the shame (the bad stuff)


I have to start out with the known issue. I’m not happy about it, but I feel that coming out up front and saying it makes me a more honest fella. Sometimes this code fails; I have some kind of memory leak, or something like it, in the code.

Killed my memory heap.

So far, it has only happened on one website and only after checking somewhere around its 153rd link. That website happens to have podcasts and videocasts on it, so maybe it’s related to that. I used it on our friends at Citadel Packaging and it ran through over 5,000 links and did just fine.

The good stuff

There are some things about this script that I’m definitely happy with. The goal was to go through a website and find any dead links, whether those links point to internal or external websites.

I had to find a way for it to discover new links as it went, without adding links from other domains during the search. So I compare the domain I’m checking for dead links against the link I’m currently searching. If the current link contains the domain, I search it for more links; if it doesn’t, I don’t. Things only get a bit tricky if the domain has an auto redirect or a link to itself with a ‘www’ in it.

The okay stuff

I think I’m handling exceptions pretty well. Occasionally my test request would fail without a status code. I’d go back and check the links and they appeared to be fine, so I’m guessing the website being linked to was doing something to block me. So I set a 10 second timeout on the request and gave those failures a status code of 999. I still filter these as bad requests, so it could get annoying to keep seeing this false negative.

I think the speed is just okay, and it’s something I would like to improve. I have some ideas about using the asynchronous nature of Node.js to keep something like five searches going at once. The bad part is that this would increase the load on the website under test by five times.
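This isn’t code from the repo, just a sketch of that batching idea. It assumes the checkLink function and ILinkObject interface shown later in this post, and the batch size of five is arbitrary.

async function checkLinksConcurrently(links: ILinkObject[], domain: string, batchSize: number = 5) {
    for (let i = 0; i < links.length; i += batchSize) {
        // Grab the next slice of links that haven't been checked yet
        const batch = links.slice(i, i + batchSize).filter(link => !link.status);

        // Fire the whole batch off at once and wait for every request to settle
        await Promise.all(batch.map(async linkObject => {
            const result = await checkLink(linkObject, links, domain);
            linkObject.status = result.link.status;
        }));
    }
}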

What’s a valid hash vs just a hash that takes me to another part of the page?

Handling hashes. I’m not sure of the best way to handle these yet. A link like https://laurenslatest.com/fail-proof-pizza-dough-and-cheesy-garlic-bread-sticks-just-like-in-restaurants/#comment-26820 isn’t really a new link. We can’t just cut out everything after a # though, because if a site is using something like Angular 1.x it could very likely have something like https://domain.com/#/some/valid/path. So right now I’m probably testing more links than is strictly necessary. Potentially a lot more links.
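One possible rule, sketched below but not something the current code does, would be to drop in-page anchors like #comment-26820 while keeping Angular 1.x style routes that start with #/. The function name here is hypothetical.

function stripPageAnchor(link: string): string {
    const hashIndex = link.indexOf('#');
    if (hashIndex === -1) {
        return link;
    }

    // '#/' usually means a client-side route, so leave the link alone
    if (link.charAt(hashIndex + 1) === '/') {
        return link;
    }

    // Otherwise it's just an anchor to a spot on the same page, so drop it
    return link.slice(0, hashIndex);
}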

I’m assuming that all links will somehow be linked to from the home page of the site. This assumption won’t always be correct, so some links could be missed. I’m not sure of a better way, at this time, to dig through and find the rest.

The code

The dependencies
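For reference, the imports and the ILinkObject shape that the snippets below rely on look something like this (the interface body is reconstructed from how the code uses it):

import * as requestPromise from 'request-promise';
import * as cheerio from 'cheerio';

// Reconstructed from how the functions below use their link objects
interface ILinkObject {
    link: string;
    status: number | null;
    locationOfLink: string;
}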

I think the code here is pretty simple. I have built four functions. The first is the exported parent function findDeadLinks.

export async function findDeadLinks(domain: string) {
    let html: any;
    try {
        const options = {
            method: 'GET',
            resolveWithFullResponse: true
        };
        const response: any = await requestPromise.get(`${domain}`, options);
        html = response.body;
    }
    catch (e) {
        console.log('Error trying to request the domain', e.statusCode);
        throw `Error requesting base domain - ${domain}, ${e.statusCode}`;
    }
    let links: ILinkObject[] = await getLinks(html, domain, domain);

    // Only check links that don't have a status code yet; the array keeps growing as checkLink finds new links
    for (let i = 0; i < links.length; i++) {
        if (!links[i].status) {
            const checkLinkResponse = await checkLink(links[i], links, domain);
            links[i] = checkLinkResponse.link;
            links = checkLinkResponse.links;

            console.log('after link is checked link', links[i], i);
        }
    }

    console.log('links', links.length);
    console.log('bad links', links.filter(link => link.status && link.status > 399));

}

I’ve used request-promise before in my craigslist scraper but I hadn’t tried getting status codes with it. That was pretty easy to do with the resolveWithFullResponse option. As I look at it now, I’m pretty sure I could remove the method: 'GET' from that options block.

Since this list is going to get bigger and bigger, I’m only going to check links that don’t have a status code yet. I’m making the assumption here that if the status code is greater than 399, it’s a bad link, and we output those at the end.

async function checkLink(linkObject: ILinkObject, links: ILinkObject[], domain: string) {
    let html: any;
    let newDomain: any;
    try {
        const options: requestPromise.RequestPromiseOptions = {
            method: 'GET',
            resolveWithFullResponse: true,
            timeout: 10000
        };
        const response: any = await requestPromise.get(linkObject.link, options);
        newDomain = `${response.request.uri.protocol}//${response.request.uri.host}`;
        linkObject.status = response.statusCode;
        html = response.body;
    }
    catch (e) {
        if (e.statusCode) {
            console.log(`Error trying to request url ${linkObject.link}`, e.statusCode);
            linkObject.status = e.statusCode;
        }
        else {
            console.log(`Error trying to request url ${linkObject.link}`, e);
            // Some other error happened so let's give it a 999
            linkObject.status = 999;
        }
    }

    // Let's not get further links if we are on someone else's domain
    if (newDomain) {
        if (html && domainCheck(linkObject.link, domain, newDomain)) {
            links = await getLinks(html, domain, linkObject.link, false, links);
        }
    }

    return Promise.resolve({
        link: linkObject,
        links: links
    });

}

The findDeadLinks function calls checkLink(). This function checks to see if the link is valid and then pushes additional links into our list of links to check. The goal of this function is to put a status code on the link object.

I perform a domain check in this function. With resolveWithFullResponse I’m able to get the protocol and the host, which gives me something to check against with ${response.request.uri.protocol}//${response.request.uri.host}. Then I pass this newDomain into our smallest function, domainCheck().

function domainCheck(link: string, domain: string, newDomain:string) {
    link = link.replace('www.', '');
    domain = domain.replace('www.', '');
    newDomain = newDomain.replace('www.', '');

    // console.log('in domain checker **************', link, domain, newDomain);

    return link.includes(domain) && newDomain.includes(domain);
}

I just remove the ‘www’s to ensure we are comparing apples to apples. I want to make sure that I’m not adding links from a domain that is not being checked. For example, on Amateur Dota 2 League, I have a link to Steam. I want to check that the link to Steam is good, but I don’t want to grab all the links on Steam and check them, then check all of Steam’s links, then all of Steam’s links’ links, and so on.

This gets tricky on auto redirects. In the AD2L example above, I have a link to https://dota.playon.gg/auth/steam which auto redirects to https://steamcommunity.com/openid/login?…. So my first check ensures that the original link we are checking (dota.playon.gg/auth/steam) has my check domain in it. It does, yay. My second check looks at the response host and protocol to ensure that also has the domain in it. In this case it does not, so we don’t continue adding links.

async function getLinks(html: any, domain: string, currentUrl: string, deep: boolean = false, links: ILinkObject[] = []) {
    const $ = cheerio.load(html);

    $('a').each((index, element) => {
        let link = $(element).attr('href');
        if (link && (!link.includes('javascript:') && !link.includes('tel:') && !link.includes('mailto:'))) {
            // Sometimes the first character of the link isn't the domain and has a slash. Let's clean it up
            link = link.charAt(0) === '/' ? link.slice(1) : link;

            let linkToPush = link.includes('http') ? link : `${domain}/${link}`;
            // If we're doing a deep check, we'll check the same urls with just different query params
            linkToPush = deep ? linkToPush : linkToPush.split('?')[0];
            if (links.filter(linkObject => linkObject.link === linkToPush).length < 1) {
                // console.log('adding new link', linkToPush, link)
                links.push({
                    link: linkToPush,
                    status: null,
                    locationOfLink: currentUrl
                });
            }
        }
    });

    console.log('current links length ***************', links.length);
    return links;

}

The final function, getLinks(), uses Cheerio to parse the html for all anchor tags and loops through them. The goal is to add to my array of links so we keep checking, but I don’t want to add a link I already have. I also don’t want to worry about links like javascript:, tel:, and mailto:.

And that about does it. The function keeps running, finding more and more links until it stops finding new ones, and then it completes.
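Kicking it off is just a matter of passing in the base domain; the URL below is only a placeholder.

// Hypothetical usage; swap in whatever site you want to check
findDeadLinks('https://example.com')
    .then(() => console.log('Finished checking for dead links'))
    .catch(error => console.log('Something went wrong', error));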

Sample Code Here

