All you need to scrape websites is knowledge of split()


And how to get the actual site’s html. And maybe some knowledge of how arrays work. Okay okay, and some looping. But that’s not too hard, right?

I know there are a lot of REALLY powerful tools out there, like Puppeteer and Cheerio, that make web scraping easier. Honestly, though, I often just use split().

Why? Well, it’s just really simple. Let’s do an example. The extremely complicated code below was taken from W3Schools.

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>This is a Heading</h1>
<p>This is a paragraph.</p>

</body>
</html>

Let’s say we want to get the title and the innerHTML of the <p> tag.

// Get the html with request or some other XHR method.
const html = someWayToGetTheHTML();

const title = html.split('<title>')[1].split('</')[0];
const paragraph = html.split('<p>')[1].split('</')[0];

// title => 'Page Title'
// paragraph => 'This is a paragraph.'

There. Done. Pretty simple, right?

Let’s break down what we did. We take the html and split() it on the opening tag of the element we want. Because each of these tags appears only once in the page, each split gives us an array with two elements: all of the string before the opening tag, and all of the string after it. Note that the string you split on ends up in neither element.

const splitTitle = html.split('<title>');

// splitTitle => ["<!DOCTYPE html>↵<html>↵<head>↵", "Page Title</title>↵</head>↵<body>↵↵<h1>This is a H…/h1>↵<p>This is a paragraph.</p>↵↵</body>↵</html>"]

const splitParagraph = html.split('<p>');

// splitParagraph => ["<!DOCTYPE html>↵<html>↵<head>↵<title>Page Title</title>↵</head>↵<body>↵↵<h1>This is a Heading</h1>↵", "This is a paragraph.</p>↵↵</body>↵</html>"]

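So you select the second element [1], split it again on '</' (the start of the closing tag), and keep the first piece [0]. That first piece is everything between the opening and closing tags.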
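If you find yourself typing that double split a lot, you could wrap it in a tiny helper. This is just a minimal sketch: the name `between` and the choice to return null when a tag isn't found are my own assumptions, and the sample html is inlined so the block runs without fetching anything.

```javascript
// Hypothetical helper (not from any library): returns the text between the
// first occurrence of `open` and the next occurrence of `close`, or null if
// either one is missing.
function between(html, open, close) {
  const afterOpen = html.split(open)[1];
  if (afterOpen === undefined) return null; // opening tag not found

  const parts = afterOpen.split(close);
  if (parts.length < 2) return null; // closing tag not found

  return parts[0];
}

// The same sample page used above, inlined so this runs on its own.
const sampleHtml = [
  '<!DOCTYPE html>',
  '<html>',
  '<head>',
  '<title>Page Title</title>',
  '</head>',
  '<body>',
  '',
  '<h1>This is a Heading</h1>',
  '<p>This is a paragraph.</p>',
  '',
  '</body>',
  '</html>',
].join('\n');

const pageTitle = between(sampleHtml, '<title>', '</title>');
const firstParagraph = between(sampleHtml, '<p>', '</p>');
const missingTag = between(sampleHtml, '<h2>', '</h2>');

console.log(pageTitle);      // 'Page Title'
console.log(firstParagraph); // 'This is a paragraph.'
console.log(missingTag);     // null
```

The null check matters because html.split('<h2>')[1] is undefined when the tag isn't there, and calling split() on undefined throws.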

Not so bad, right? I know, you are probably saying…

“The page I’m scraping has 322 p tags, Jordan. This method is garbage.”

Don’t despair, my young friend. And don’t worry, you didn’t offend me by calling my method garbage. I mean, maybe it thinks you are garbage.

Yes, 322 p tags do make it more complicated, but you just have to get a bit more creative with your splitting and maybe dig a bit deeper. Split and then split again. And then again. And maybe even again.
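Here is roughly what "split and then split again" can look like. Treat it as a sketch under assumptions: the markup, the `posts` id, and the variable names are all made up for illustration, and it only works when the tags appear exactly as written (no extra attributes on the p tags).

```javascript
// Made-up page with several <p> tags, only some of which we want.
const listingHtml = [
  '<div id="intro"><p>Skip me.</p></div>',
  '<div id="posts">',
  '<p>First post</p>',
  '<p>Second post</p>',
  '<p>Third post</p>',
  '</div>',
].join('\n');

// Split once to narrow things down to the section we care about...
const postsSection = listingHtml
  .split('<div id="posts">')[1]
  .split('</div>')[0];

// ...then split again. The first chunk is whatever came before the first
// <p>, so we drop it; every other chunk starts with a paragraph's text.
const posts = postsSection
  .split('<p>')
  .slice(1)
  .map((chunk) => chunk.split('</p>')[0]);

console.log(posts); // [ 'First post', 'Second post', 'Third post' ]
```

Narrowing to a container first is what keeps the "Skip me." paragraph out of the results.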

Why do I like this method? Sometimes I just don’t want to dig into the docs of Cheerio or Puppeteer. I think I’m a pretty sharp guy but man, this way is easy. I don’t have to learn or remember how to use one of the other tools.

Anyway, hope it helps. Please don’t share this with anyone. It’s a secret. They wouldn’t understand.
