Jordan scrapes Craigslist and then automates emailing the results to himself

Demo code here

Note that this code has been refactored to be more testable on 3/4/2019.

I’m in the market for a car. I have a truck right now, and since I work from home it really just sits in the garage. Since we never use it, I’d like a smaller vehicle to drive around, one with better gas mileage that’s a little bit less expensive for things like tires.

I’ve found myself checking Craigslist almost daily, and whenever I notice this kind of thing, I try to think of how I can just automate the process. So I decided to build a scraper that checks the car categories I’m looking into and then mails the results to me.

Craigslist does have a pretty good “alert” feature, and while it didn’t do exactly what I wanted (like seeing the changes in the current listings), I’ll admit that I mostly did this scrape for theoretical purposes and good practice.

The node packages I used in this project were Request-Promise (which requires Request), Cheerio, and Nodemailer. With these three packages, this scrape was extremely easy. I opted for Request-Promise rather than Puppeteer (which I still think is awesome) because I didn’t need to do any page interaction and I didn’t expect Craigslist to take any measures to actively stop the scraping.

I wrote this, like I normally do, completely in TypeScript. I mention it in this post because I actually used interfaces in this scrape. It’s not required, but because I have an array of categories that each holds an array of found cars, I thought interfaces would make the IntelliSense cleaner.
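
The interfaces aren’t shown in the snippets below, so here’s roughly what they’d look like based on the fields the scrape collects. This is my reconstruction; the repo’s actual definitions may differ slightly:

    interface ICar {
        title: string;
        price: string;
        location: string;
        url: string;
    }

    interface ICategory {
        name: string;
        url: string;
        foundCars: ICar[];
    }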

I start with my self-invoking async function. The Request-Promise package is all promise based, so this allows us to use async/await. Because it’s awesome.

import * as cheerio from 'cheerio';
import * as nodemailer from 'nodemailer';
import * as requestPromise from 'request-promise';

(async () => {
  // awesome code here

})();

    const categories: ICategory[] = [
        {
            name: 'Toyota Camrys',
            url: 'https://boise.craigslist.org/search/cta?auto_make_model=toyota%20camry&auto_title_status=1&max_auto_miles=100000&max_price=11000&min_auto_year=2010&min_price=7000',
            foundCars: [] as ICar[]
        },
        {
            name: 'Mazda 3s',
            url: 'https://boise.craigslist.org/search/cta?min_price=7000&max_price=12000&auto_make_model=mazda&min_auto_year=2014&max_auto_year=2016&max_auto_miles=100000&auto_title_status=1',
            foundCars: [] as ICar[]
        }
    ];

To start with, go to craigslist.org and perform the search that you want with all the filters set. This will give you a URL with all of the filters in the query parameters. You can use however many URLs you need; just make an object in the array for each one and replace the url field with the URL you want to scrape.

The name of the category is just for readability, so replace it with whatever makes sense for you. After that, I start into the actual scraping. There really is barely any code to this; these packages make it all very clean and simple.

    for (let category of categories) {

        // Fetch the raw HTML of this category's search results page
        const html = await requestPromise.get(category.url);

        // Load the HTML into Cheerio for jQuery-style selecting
        const $ = cheerio.load(html);

        // Every listing on a Craigslist search page is a .result-row
        $('.result-row').each((index, element) => {
            const car = {
                title: $(element).find('p .result-title').text(),
                price: $(element).find('a .result-price').text(),
                location: $(element).find('.result-meta .result-hood').text(),
                url: $(element).find('a').attr('href')
            };
            category.foundCars.push(car);
        });
    }

We just loop through our categories, open each url and then use the very handy jQuery style selectors that Cheerio uses. Once we find a car and the data we want, we push it into our foundCars array and we’re done.
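
For completeness, the tail end of the self-invoking function presumably just hands the populated categories off to the mail function, something like:

    // After the scraping loop, once every category's foundCars
    // array is populated, pass the results to the mailer
    await sendMail(categories);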

Next I set up the email function. This has more code, but it’s mostly around the formatting of the HTML. It’s pretty straightforward, and the documentation around Nodemailer makes it really simple. Note that Nodemailer is super simple for things like this (where I’m only emailing myself once a day) but would not work for things like mass emailing.

async function sendMail(categories: ICategory[]) {

    let transporter = nodemailer.createTransport({
        service: 'gmail',
        auth: {
            user: credentials.email,
            pass: credentials.password
        }
    });
    let html = `Here's some cars that may be of general interest <br><br>`;

    for (let category of categories) {
        html += `${category.name} <br><br>`
        
        for (let car of category.foundCars) {
            html += `<a href="${car.url}">${car.title}</a>, ${car.price}, ${car.location}<br>`;
        }

        html += `<br><br>`;
    }


    const mailOptions = {
        from: 'jbhansen84@gmail.com',
        to: 'jbhansen84@gmail.com',
        subject: 'Craigslist updater',
        html: html
    };

    try {
        // With no callback passed, sendMail returns a promise we can await
        const info = await transporter.sendMail(mailOptions);
        return info;
    }
    catch (err) {
        console.log('Err', err);
        throw err;
    }
}

UPDATE as of 3/5/2019: I’ve changed from using a credentials file to using environment variables so that I can do automated builds. I have included a .sample.env file that you will need to rename to .env, replacing the sample email and password with your actual credentials. You’ll also need to import the dotenv module as I do in src/index.ts.

import * as dotenv from 'dotenv';

dotenv.config();
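
With dotenv loaded, the transporter can pull the credentials from process.env instead of a credentials file. Here’s a minimal sketch; I’m assuming EMAIL and PASSWORD as the variable names, so check .sample.env for the names the repo actually uses:

    let transporter = nodemailer.createTransport({
        service: 'gmail',
        auth: {
            // Variable names are assumptions; match them to .sample.env
            user: process.env.EMAIL,
            pass: process.env.PASSWORD
        }
    });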

Before that update, the repo included a sample-gmail-credentials.ts file that just needed to be renamed to gmail-credentials.ts and updated with your valid Gmail credentials. You can also modify your Nodemailer setup and use something besides Gmail. Gmail sometimes (maybe always?) requires an application-specific password when doing this kind of SMTP mailing. If you use your normal password, it’ll throw an error directing you where to go to get your application-specific password.

Now, let’s automate this guy. I use DigitalOcean and REALLY like it. It makes spinning up new Linux boxes incredibly simple, and it’s very inexpensive. I just cloned this repository to my Linux box and then set up a simple crontab to run every day. Cron is a simple tool that you can use for scheduled tasks, and there are handy online generators to help you put a crontab entry together.

Make sure to run npm run build after pulling the code, and then something like this in your crontab (crontab -e) should do the trick:

    0 4 * * * /root/.nvm/versions/node/v6.9.5/bin/node /home/jordan-scrapes-craigslist/dist/index.js

Now you are set!
