Jordan Scrapes Pinterest for Recipes (with infinite scrolling)

Demo code here

I have a confession to make. I’m not on the Pinterest bandwagon. My wife loves it. I don’t know how to use it. So learning how it all works was the first prerequisite for me for this script.

And…honestly, I can say that I now think it’s probably one of the better social media platforms. I don’t even know if I fully count it as social media. It’s not people posting about how amazing their family is but people sharing ideas and cool tricks. You know, helpful stuff.

The idea behind this script is to show how to scrape recipes on Pinterest. I found it helpful to have a Pinterest account and be logged in. The results returned when I was logged in were different from the ones returned when I was unauthenticated. Plus, as you scroll down the page unauthenticated, a modal pops up suggesting that you log in, which complicates things.

The biggest complication after logging in was handling the infinite scrolling. With infinite scrolling, the page only loads the first few items (about six, in this case) and then polls the server for more as the user scrolls down. In this case, it also appeared that the UI dumps the ones you scroll past. It was a fun problem to deal with.

As a final note, Pinterest makes this easier for me by formatting all of the recipe information in the same way. This is kind of interesting because Pinterest isn’t just a site for recipes so I’m unsure how they are making the decision of what gets formatted in the “recipe” way and what doesn’t. I also think it’s possible that I could stumble upon a “Pinned” card that doesn’t have the same format since maybe it doesn’t recognize it as a recipe.

To start off, like I said, you should create a Pinterest account or use your existing Pinterest credentials. Rename .sample.env to .env and then replace the sample credentials with your real ones. You’ll also want to run npm install.
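Here is a minimal sketch of checking those credentials up front. It is only an illustration: readCredentials and the Credentials type are hypothetical names that are not in the demo repo. It assumes the variables are called email and password (the names the script reads later) and that something has already loaded .env into process.env.

```typescript
// Hypothetical helper (not in the demo repo): validates the two variables
// the script later reads as process.env.email and process.env.password.
interface Credentials {
    email: string;
    password: string;
}

function readCredentials(env: Record<string, string | undefined>): Credentials {
    const { email, password } = env;

    // Fail early with a real Error so the stack trace points here.
    if (!email || !password) {
        throw new Error('You need to set your email and password in the .env file');
    }

    return { email, password };
}
```

Calling readCredentials(process.env) once at startup would fail fast, before a browser is ever launched.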

The src/index.ts has my base setup for the scrape. I define my query, ‘cookies’ in this case, set up Puppeteer and start into the code. I start with an async block because Puppeteer is all promise based. Check this for more explanation on async/await.

(async () => {
  // awesome code here

})();

My src/index.ts file is pretty simple. It just calls the two main functions that I use, getItemLinks and the getDetailsFromItemLinks function. I also define how many times I want to scroll down.

    const query = 'cookies';
    const browser = await puppeteer.launch({headless: false});
    const scrolls = 10;

    const itemLinks = await getItemLinks(query, browser, scrolls);

    const recipes: any[] = [];    

    for (let href of itemLinks) {

        recipes.push(await getDetailsFromItemLinks(href, browser));
    }

    console.log('the end', recipes);
    process.exit();

The first function I start with, getItemLinks, opens up Pinterest, logs in, and gets the item links. As I think about it further, to make this a better function I probably should have done the navigating and logging in outside of it, so that it only does the one thing it’s responsible for.

export async function getItemLinks(query: string, browser: Browser, scrolls: number = 10) {
    const url = `https://www.pinterest.com/search/pins/?q=${query}`;

    const page = await browser.newPage();
    await page.goto(url);
    await page.waitForSelector('.headerLoginButton');

    // Sign in
    await page.click('.headerLoginButton');

    if (!process.env.email || !process.env.password) {
        throw new Error('You need to set your email and password in the .env file');
    }
    await page.type('#email', process.env.email);
    await page.type('#password', process.env.password);
    // Start waiting for the navigation before clicking so it isn't missed
    await Promise.all([
        page.waitForNavigation({ waitUntil: 'networkidle2' }),
        page.click('.SignupButton')
    ]);

Note that I use the Pinterest credentials defined in our .env file. Make sure to rename the .sample.env to .env and replace the sample credentials with your own. Another fun thing that I do here is an await Promise.all block. This makes it so that the combined promise won’t fulfill until both of the promises inside have resolved. I waitForNavigation until networkidle2. This, discussed here, will wait for navigation until there are at most two network connections still active for at least 500 ms. This handles a lot of things like long polling or pings that are constantly going back and forth. It’s a pretty neat trick to help get a more accurate waitForNavigation end.
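One detail worth spelling out about Promise.all: the entries need to be promises that are already in flight. If the click is awaited on its own first, the navigation can finish before anything is listening for it. The sketch below shows the difference. FakePage is a hypothetical stand-in written just for this illustration, not a Puppeteer class, and its "navigation" fires the moment click() runs.

```typescript
// FakePage is a hypothetical stand-in for a Puppeteer page, only here to
// show the ordering issue. Its "navigation" fires as soon as click() runs.
class FakePage {
    private waiters: Array<() => void> = [];

    // Resolves when a navigation happens, rejects if none arrives in time.
    waitForNavigation(timeoutMs: number = 50): Promise<string> {
        return new Promise((resolve, reject) => {
            const timer = setTimeout(() => reject(new Error('navigation timeout')), timeoutMs);
            this.waiters.push(() => {
                clearTimeout(timer);
                resolve('navigated');
            });
        });
    }

    // Triggers the navigation for whoever is already waiting.
    async click(): Promise<void> {
        const waiting = this.waiters;
        this.waiters = [];
        waiting.forEach((notify) => notify());
    }
}

// Both promises are created before either is awaited, so the listener
// is registered by the time the click fires the navigation.
async function clickAndWait(page: FakePage): Promise<string> {
    const [result] = await Promise.all([
        page.waitForNavigation(),
        page.click()
    ]);
    return result;
}

// Awaiting the click first means nobody is listening when navigation fires.
async function clickThenWait(page: FakePage): Promise<string> {
    await page.click();
    return page.waitForNavigation();
}
```

With this stand-in, clickAndWait resolves, while clickThenWait rejects with the timeout because the navigation already happened by the time it starts waiting.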

I then start into the scrolling. I get an array of all the gridItems (each Pinned card) and then call getGridItemHrefs which gets the hrefs from each card.

    let gridItems = await page.$$('div[data-grid-item]');
    let itemLinks: string[] = [];

    itemLinks = await getGridItemHrefs(gridItems);

And now to show the getGridItemHrefs function.

async function getGridItemHrefs(gridItems: ElementHandle[], itemLinks: string[] = []) {

    for (let item of gridItems) {
        const href = await getPropertyBySelector(item, 'a', 'href');

        if (!itemLinks.includes(href)) {
            itemLinks.push(href);
        }
    }

    return Promise.resolve(itemLinks);
}

getPropertyBySelector is a function that comes from puppeteer-helpers. It just combines multiple Puppeteer functions in order to get the property you want from an HTML element, given a Puppeteer element handle and a CSS selector. getGridItemHrefs, then, just goes through all the items passed into it and adds the href associated with each item to the itemLinks array that is also passed in, skipping any href that is already there.
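To give a feel for what a helper like that does, here is a rough sketch. It is an assumption about the behavior, not the actual puppeteer-helpers source. The small interfaces are declared only to keep the example self-contained; they mirror the Puppeteer methods $, getProperty and jsonValue. The name getPropertyBySelectorSketch and the choice to return an empty string for a missing element are mine.

```typescript
// Minimal shapes of the Puppeteer handle methods used below, declared here
// so the sketch is self-contained. In real code these come from Puppeteer's
// Page / ElementHandle types.
interface PropertyHandle {
    jsonValue(): Promise<unknown>;
}

interface QueryableHandle {
    $(selector: string): Promise<ElementLike | null>;
}

interface ElementLike extends QueryableHandle {
    getProperty(name: string): Promise<PropertyHandle>;
}

// Assumed behavior, not the actual puppeteer-helpers source: find the first
// match for the selector under the handle, then read one property from it.
async function getPropertyBySelectorSketch(
    handle: QueryableHandle,
    selector: string,
    property: string
): Promise<string> {
    const element = await handle.$(selector);

    if (!element) {
        // Returning '' for a missing element is a choice made for this sketch.
        return '';
    }

    const propertyHandle = await element.getProperty(property);
    return String(await propertyHandle.jsonValue());
}
```

Because both a page and an element handle expose $, the same function works for page-level lookups and for lookups scoped to a single card.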

Now to the real scrolling part.

    for (let i = 0; i < scrolls; i++) {
        try {
            await gridItems[gridItems.length - 1].hover();
            await page.waitFor(1000);
        }
        catch (e) {
            console.log('Error hovering', e);
        }

        try {
            gridItems = await page.$$('div[data-grid-item]');
        }
        catch (e) {
            console.log('error in resetting grid items again', e);
        }

        try {
            itemLinks = await getGridItemHrefs(gridItems, itemLinks);
        }
        catch (e) {
            console.log('error in getting item hrefs', e);
        }
    }
    return Promise.resolve(itemLinks);

I use Puppeteer’s hover function to hover over the last gridItem that I had found, which scrolls the screen to that item. Then I get all the grid items again, because more are shown now, and get the itemLinks of those grid items. I keep doing that until I have gone through as many scrolls as I asked for. After the loop finishes, I resolve the Promise back to the index file, which handles going to the details page of each of these results and getting the ingredients.
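Because the UI drops cards that have been scrolled past, each pass only sees a window of the results, and consecutive passes overlap. The accumulate-and-dedupe step can be sketched on its own with plain arrays standing in for what each pass found. collectUniqueLinks is a hypothetical helper and the pin paths are made-up example data.

```typescript
// Hypothetical helper: merges the hrefs found on each scroll pass into one
// ordered list without duplicates. A Set keeps the membership check cheap
// compared with Array.includes on a growing array.
function collectUniqueLinks(passes: string[][]): string[] {
    const seen = new Set<string>();
    const links: string[] = [];

    for (const hrefs of passes) {
        for (const href of hrefs) {
            // Skip empty values and anything an earlier pass already found.
            if (href && !seen.has(href)) {
                seen.add(href);
                links.push(href);
            }
        }
    }

    return links;
}

// Made-up example data: the second pass overlaps the first because some
// cards are still on screen after the scroll.
const links = collectUniqueLinks([
    ['/pin/1/', '/pin/2/', '/pin/3/'],
    ['/pin/3/', '/pin/4/']
]);
console.log(links.length); // 4
```

For ten scrolls the includes check in the script is perfectly fine; a Set mostly starts to matter if the number of scrolls grows a lot.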

export async function getDetailsFromItemLinks(url: string, browser: Browser) {

    const recipe = {
        title: '',
        diets: '',
        servings: '',
        ingredients: [] as string[]
    };

    const page = await browser.newPage();
    await page.goto(url);

    recipe.title = await getPropertyBySelector(page, 'h5 div', 'innerHTML');
    recipe.diets = await getPropertyBySelector(page, 'div[data-test-id="diets-section"] div', 'innerHTML');
    recipe.servings = await getPropertyBySelector(page, 'div[data-test-id="serving-summary-section"] div', 'innerHTML');

    const ingredientHandles = await page.$$('li[data-test-id="recipe-ingredient"]');

    for (let ingredientHandle of ingredientHandles) {
        recipe.ingredients.push(await getPropertyBySelector(ingredientHandle, 'div > div', 'innerHTML'));
    }

    await page.close();

    return Promise.resolve(recipe);

}

The details page is the simple part. Because Pinterest lays out all the recipes in the same format (at least in all the ones I checked), I just use the selectors for each section and collect all the ingredients.
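Since a pinned card might not use the recipe layout at all, a lookup on the details page could fail or come back empty. One way to keep a single odd pin from stopping the whole run is a small wrapper like the sketch below. withFallback is a hypothetical helper, and I have not checked whether getPropertyBySelector throws or returns nothing when the element is missing, so the wrapper covers both cases.

```typescript
// Hypothetical wrapper: runs a lookup and substitutes a fallback when the
// lookup throws or comes back empty, so one odd pin does not stop the run.
async function withFallback<T>(
    lookup: () => Promise<T | null | undefined>,
    fallback: T
): Promise<T> {
    try {
        const value = await lookup();
        // null and undefined both count as "nothing found".
        return value ?? fallback;
    } catch (e) {
        return fallback;
    }
}

// Usage inside getDetailsFromItemLinks might look like:
// recipe.servings = await withFallback(
//     () => getPropertyBySelector(page, 'div[data-test-id="serving-summary-section"] div', 'innerHTML'),
//     ''
// );
```

The trade-off is that real errors get swallowed too, so logging inside the catch is worth considering.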

And that’s it! It’s a pretty simple scrape where the biggest part is handling the infinite scroll.
