Jordan Scrapes Allrecipes.com
This scraping was meant to be a simple one where I could show that all you need in order to scrape is split(). Then they blocked my IP. I was kind of surprised. I really hadn’t even been hitting it that fast or hard, but as soon as I used a VPN with a different IP, I was able to get at that data again.
So, if you are using this and suddenly you can’t access the site anymore…sorry. Use TunnelBear like I did. The block is temporary and a few days later I was able to access allrecipes normally.
The mission: Get the cookies.
The tools:
// package.json
{
  "name": "all-recipes",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "start": "node index.js"
  },
  "author": "Jordan Hansen",
  "license": "ISC",
  "dependencies": {
    "request": "^2.88.0"
  }
}
There were really two parts to this. I wanted to create a JSON file with the ingredients and the instructions. I formed it like this:
{
title: string;
ingredients: string[];
instructions: string[];
}
The first thing I normally do with a place where I’m searching for something is to see if there is a direct URL so I don’t have to submit a search query every time. You can replace this URL with your own if you don’t like cookies (monster). This starts us off:
const searchForCookiesUrl = 'https://www.allrecipes.com/search/results/?wt=cookies&sort=re';
That is our baseline and we do our request from that. That gives us a big list of cookies. I found a regex that finds all URLs on a page and then just loop through the matches:
const allUrls = html.match(/\bhttps?:\/\/\S+/gi);
I found that all recipes contained /recipe/ specifically in the URL, so I made sure to only keep the URLs that included this. Because there was often more than one link to the same recipe, I created an array of recipe URLs, and once I found a valid one I put it in there. I then just checked this array to make sure I wasn’t duplicating.
for (let i = 0; i < allUrls.length; i++) {
  // The regex match can end with the attribute's closing quote, so strip it
  // first. Otherwise the duplicate check compares a quoted url to unquoted ones.
  const recipeUrl = allUrls[i].replace('"', '');
  // Specific recipes contain '/recipe/' in the url.
  // Let's make sure we don't put a duplicate into our array.
  if (recipeUrl.includes('/recipe/') && !recipeUrls.includes(recipeUrl)) {
    recipeUrls.push(recipeUrl);
    try {
      listOfRecipes.push(await getRecipeDetails(recipeUrl));
    }
    catch (e) {
      console.log('error: ', e);
    }
  }
}
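The filter-and-dedupe step can also be written with a Set, which takes care of duplicates on its own. This is only a sketch: extractRecipeUrls is a made-up helper, the sample HTML and its URLs are invented for illustration, and it assumes the links sit inside double-quoted href attributes.

```javascript
// Hypothetical helper: pull the unique recipe urls out of a page of html.
function extractRecipeUrls(html) {
  // Same regex as above: http(s):// followed by everything up to the next whitespace.
  const allUrls = html.match(/\bhttps?:\/\/\S+/gi) || [];
  const unique = new Set();
  for (const rawUrl of allUrls) {
    // The match can run past the end of the href, so cut it at the closing quote.
    const url = rawUrl.split('"')[0];
    if (url.includes('/recipe/')) {
      unique.add(url);
    }
  }
  return [...unique];
}

// Made-up sample: two links to the same recipe and one category link.
const sampleHtml = [
  '<a href="https://www.allrecipes.com/recipe/1/sample-cookies/">Sample</a>',
  '<a href="https://www.allrecipes.com/recipe/1/sample-cookies/"><img></a>',
  '<a href="https://www.allrecipes.com/recipes/362/desserts/cookies/">All cookies</a>',
].join('\n');

console.log(extractRecipeUrls(sampleHtml));
```

A Set keeps insertion order, so the recipes come back in the order they appeared on the page.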
From there I’d make another request to each of those recipe URLs. (Honestly, this is probably why they blocked me. I’d get like 50 recipes and then make a request to each of them much faster than any mere human could.) I grabbed the title, and to keep getRecipeDetails() from becoming such a monster, I split off functions to get the list of ingredients and the instructions. Also, I know there is a request promise library that, you guessed it, handles promises with request. I’ll need to check that out.
const request = require('request');

function getRecipeDetails(url) {
  return new Promise((resolve, reject) => {
    request.get(url, (err, res, html) => {
      if (err) {
        // Return here so we don't keep going after rejecting.
        return reject('err in getting details');
      }
      if (html) {
        let recipeDetails = {
          // The title sits between this marker and the closing </h1>.
          title: html.split('recipe-main-content" class="recipe-summary__h1" itemprop="name">')[1].split('</h1>')[0],
          ingredients: [],
          instructions: []
        };
        recipeDetails = addIngredientList(html, recipeDetails);
        recipeDetails = addInstructions(html, recipeDetails);
        resolve(recipeDetails);
      }
      else {
        resolve(null);
      }
    });
  });
}
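Since the rapid-fire requests are the likely reason for the block, one option is to pause between calls. A minimal sketch, assuming a one-second gap is polite enough (that number is a guess, not anything allrecipes documents). sleep and getAllWithDelay are hypothetical helpers, and fetchDetails stands in for getRecipeDetails.

```javascript
// Hypothetical helper: resolve after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Fetch each url one at a time, waiting `delayMs` between requests.
// `fetchDetails` is any function that returns a promise, e.g. getRecipeDetails.
async function getAllWithDelay(urls, fetchDetails, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    // No need to wait before the very first request.
    if (i > 0) {
      await sleep(delayMs);
    }
    try {
      results.push(await fetchDetails(urls[i]));
    } catch (e) {
      // One failed recipe shouldn't stop the rest.
      console.log('error: ', e);
    }
  }
  return results;
}
```

Usage would look like getAllWithDelay(recipeUrls, getRecipeDetails), called from inside an async function. Whether any particular delay avoids a block is something only testing against the site would show.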
The add ingredient list function was pretty simple. We were lucky enough to find a string that only the ingredients had, so we could split on it and just loop through the list:
ingredientsSection.split('itemprop="recipeIngredient">');
I probably could have dug into why there was occasionally an undefined in my array.
function addIngredientList(html, recipeDetails) {
  const ingredientsSection = html.split('polaris-app">')[1].split('<label class="checkList__item" id="btn-addtolist">')[0];
  const listOfIngredients = ingredientsSection.split('itemprop="recipeIngredient">');
  // Declare i with let so it doesn't leak out as a global.
  for (let i = 0; i < listOfIngredients.length; i++) {
    const splitIngredient = listOfIngredients[i].split('title="')[1];
    // splitIngredient is sometimes undefined so let's check it
    if (splitIngredient) {
      recipeDetails.ingredients.push(splitIngredient.split('">\r\n')[0]);
    }
  }
  return recipeDetails;
}
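Chained split() calls throw as soon as a marker is missing, because [1] comes back undefined and the next .split() blows up. A small guard makes that easier to deal with. This is a sketch, not what the scraper above uses: between is a hypothetical helper and the HTML is a made-up sample.

```javascript
// Hypothetical helper: return the text between two markers, or null if either is missing.
function between(text, startMarker, endMarker) {
  const afterStart = text.split(startMarker)[1];
  if (afterStart === undefined) {
    return null;
  }
  const parts = afterStart.split(endMarker);
  // If the end marker never appears, split returns a single-element array.
  return parts.length > 1 ? parts[0] : null;
}

// Made-up sample markup, not copied from the real page.
const sample = '<h1 itemprop="name">Sample Cookies</h1>';
console.log(between(sample, 'itemprop="name">', '</h1>'));
console.log(between(sample, 'itemprop="missing">', '</h1>'));
```

With a helper like this, a page that changes its markup gives back null instead of crashing the whole run.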
Adding the instructions was similarly fortunate in that each instruction had a string that we could split and loop on:
instructionsSection.split('recipe-directions__list--item">');
Thanks allrecipes.com engineers! It is also worth noting that there is often whitespace around the strings you want. Using trim() takes care of that.
function addInstructions(html, recipeDetails) {
  const instructionsSection = html.split('itemprop="recipeInstructions">')[1].split("</ol>")[0];
  const listOfInstructions = instructionsSection.split('recipe-directions__list--item">');
  // Declare i with let so it doesn't leak out as a global.
  for (let i = 0; i < listOfInstructions.length; i++) {
    const instruction = listOfInstructions[i].split('\n')[0].trim();
    // Some chunks are only whitespace, which trim() turns into an empty string. Skip those.
    if (instruction) {
      recipeDetails.instructions.push(instruction);
    }
  }
  return recipeDetails;
}
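The goal at the top was a JSON file, and that last step is not shown. A minimal sketch of one way to do it with Node's built-in fs module. saveRecipes, the file name and the sample recipe are all assumptions for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: write the scraped recipes to disk as pretty-printed JSON.
function saveRecipes(recipes, filePath) {
  // getRecipeDetails resolves null when there is no html, so drop those entries first.
  const cleaned = recipes.filter((recipe) => recipe !== null);
  fs.writeFileSync(filePath, JSON.stringify(cleaned, null, 2));
  return cleaned.length;
}

// Made-up data in the same shape as the recipe object described earlier.
const outputPath = path.join(os.tmpdir(), 'recipes.json');
const saved = saveRecipes([
  { title: 'Sample Cookies', ingredients: ['1 cup sugar'], instructions: ['Mix.', 'Bake.'] },
  null,
], outputPath);
console.log(`wrote ${saved} recipe(s) to ${outputPath}`);
```

The sample writes to the temp directory so it doesn't clutter the project. In the scraper you would pass listOfRecipes and whatever path you like.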
BAM. Done. Now if you’ve used this code, you’ve been blocked from allrecipes.com and your wife is mad at you because she can’t make her favorite cookie recipe.
Difficulty level: 6/10
It is difficult at that level mostly due to the IP blocking. I have scraped a lot of other sites that were a lot harder and didn’t block me. I was in denial that they were blocking me until I used TunnelBear and saw that it worked fine with a different IP.