To protect my dignity, I suggest you stop reading this. I’m not proud of how my solution turned out here. I can’t say I’m unhappy, though, because while the execution isn’t perfect, the end goal is being met.
I have a scraper that has a lot of work to do. It runs through a huge list of websites, and at this point I want it to be running nonstop. I set up a droplet on DigitalOcean dedicated to running this script.
To keep the script running perpetually, I use the PM2 package. It’s extremely simple and has been very reliable. Just run
npm install -g pm2 and then
pm2 start path/to/script. It also has a neat tool that lets your scripts restart automatically when the server boots.
Eventually the script ends, so I set a cron job to restart it every hour. That way it’ll pull a new list of websites to check and continue its work. I just use
crontab -e and add
0 */1 * * * /root/.nvm/versions/node/v12.8.1/bin/node /root/.nvm/versions/node/v12.8.1/bin/pm2 restart all. This restarts every script PM2 is managing. Simple, right?
This is where the ugly comes in. Any time I’m doing something like web scraping, I worry about issues coming up with long-running scripts. Response times vary a lot between websites, and there are many different error scenarios. This scraper complicates things even more by using a combination of Axios and Puppeteer.
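One common way to keep a single slow site from stalling a long-running loop is to put a time limit on each check. The following is only a minimal sketch, not the code from my scraper: withTimeout, checkSites, and the checkSite callback are hypothetical names, and the callback stands in for whatever Axios or Puppeteer work is done per site.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying work.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Check sites one at a time, recording failures instead of crashing.
// `checkSite` is a hypothetical async function supplied by the caller.
async function checkSites(sites, checkSite, ms) {
  const failures = [];
  for (const site of sites) {
    try {
      await withTimeout(checkSite(site), ms, site);
    } catch (err) {
      failures.push({ site, message: err.message });
    }
  }
  return failures;
}
```

Because the timeout only stops waiting, a stuck request or browser page still has to be closed separately, or it may keep holding memory.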
It happened mostly as I expected. It ran for a while without having any problems (hours and hours, even) and then it started to fail with
heap out of memory. I was confused by this because it got worse as time went on. Wouldn’t restarting the script every hour clean it up? I thought that maybe a server restart would help, so I started doing that every four hours.
This seemed to work a little better. Still, as time went on, the script would lock up completely. I could restart the droplet, start the script, and it would lock up in less than a minute. This made no sense to me. Is the memory not freed when the script restarts? I mean, surely restarting the box frees up the memory?
The short answer is…I’m not sure what’s up. I have to keep investigating to figure out what is going on.
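As a starting point for that investigation, one option is to have the script log its own memory use at a fixed interval, so the growth can be lined up against what the scraper was doing at the time. This is a minimal sketch using Node's built-in process.memoryUsage(); formatMemory and startMemoryLogger are made-up names, and the 30-second interval is an arbitrary assumption.

```javascript
// Convert a byte count to megabytes with one decimal place.
function toMB(bytes) {
  return (bytes / 1024 / 1024).toFixed(1);
}

// Turn a process.memoryUsage()-style object into one readable log line.
function formatMemory(usage) {
  return (
    `rss=${toMB(usage.rss)}MB ` +
    `heapUsed=${toMB(usage.heapUsed)}MB ` +
    `heapTotal=${toMB(usage.heapTotal)}MB`
  );
}

// Log memory use every `intervalMs` milliseconds (hypothetical helper).
function startMemoryLogger(intervalMs = 30000) {
  const timer = setInterval(() => {
    console.log(new Date().toISOString(), formatMemory(process.memoryUsage()));
  }, intervalMs);
  // unref() so the logger alone never keeps the process alive.
  timer.unref();
  return timer;
}

// Usage: call once near the top of the script.
// startMemoryLogger();
```

The output ends up in the PM2 logs alongside everything else the script prints. Keep in mind that process.memoryUsage() only covers the Node process itself, so any Chromium processes launched by Puppeteer would have to be watched separately.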
What I have found that handles it better in the meantime is the following:
pm2 start file.js --max-memory-restart 100M
This will have PM2 watch how much memory the script is using and automatically restart it once it goes over 100 MB. It’s really like taking a sledgehammer to crack a nut. It sucks. My script is slowly, steadily, painfully making its way through the list.