This is a tutorial on how to set up Puppeteer on Ubuntu 16.04 using a Digital Ocean droplet. While I'll be going through steps specific to Digital Ocean, most of them should work just as well on any web server or plain Ubuntu Linux box.
I really love Digital Ocean. It's a very simple and inexpensive platform for hosting any kind of web application. I'll be targeting Ubuntu 16.04 here, and using an Amazon scraper that I've discussed in previous posts as the project to get working.
Getting into it: after creating an account with Digital Ocean, you will want to create a new droplet. A droplet in Digital Ocean is essentially just a Linux machine that comes with an IP address, ready to go. When you select it, you should come to the configuration screen. For our purposes, we can use the smallest droplet at only $5/mo.
If you are unfamiliar with cloud hosting, one of its beauties is that $5/mo is the cost only if you leave the droplet running all month. You can also scale up the power at any time. This is ideal for web scraping: if we are just scraping something daily for 10–15 minutes, we can spin up the droplet for that window and spin it down when we are done. It's pretty great and makes it super economical.
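To make the economics concrete, here's a rough sketch. The $5/730 hourly rate is an assumption based on how DigitalOcean's hourly billing has historically worked for the smallest droplet; check the current pricing page for real numbers.

```shell
# Back-of-the-envelope estimate, assuming hourly billing of roughly $5 / 730 hours
# for the smallest droplet (an assumption -- verify against current pricing).
runs_per_month=30                     # one scrape per day
hourly_rate=$(awk 'BEGIN { printf "%.3f", 5 / 730 }')
# Each 10-15 minute run still bills a full hour, so cost = runs * hourly rate.
monthly_cost=$(awk -v n="$runs_per_month" -v r="$hourly_rate" 'BEGIN { printf "%.2f", n * r }')
echo "~\$${monthly_cost}/mo instead of \$5.00/mo"
```

In other words, spinning the droplet up only for the daily scrape costs a small fraction of leaving it running all month.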
The rest of the options should be pretty self-explanatory. For accessing our droplet, I strongly recommend using an SSH key. It's more secure and quicker, since you don't have to enter your password each time. I'm not going to go into how to set one up here, but Digital Ocean has a great article walking through it.
Once the droplet is created, you can copy the IP address and connect via ssh from your bash terminal of choice. There are some basic server-setup steps that are recommended, and Digital Ocean has another great article about them. For example, you probably shouldn't stick with the root user. I'm going to live on the edge here and just stick with root for demonstration purposes.
I like to host my stuff in /home/, but you can put it wherever you prefer. I'm also using git to pull down my files. Using the Amazon scraper example, we clone it with:
git clone https://github.com/aarmora/jordan-scrapes-amazon-for-products-to-sell.git
We will need Node.js to get all of our dependencies, including Puppeteer, working on our web server. Digital Ocean has yet another article on how to do this. The necessary commands you will need to run, however, are the following:
sudo apt-get update
sudo apt-get install build-essential libssl-dev
curl -sL https://raw.githubusercontent.com/creationix/nvm/v0.33.8/install.sh -o install_nvm.sh
bash install_nvm.sh
source ~/.profile
nvm install 8.15.0
This will install Node version 8.15.0. You can verify it by running
node -v. From here we just go into our scraper directory and run npm install to pull down the project's dependencies.
As I discussed in my post about running this Amazon scraper, you will need to rename the sample config file to
src/config.ts. If you do not, you'll see an error stating that it cannot find the config file.
Now, if we were on a Windows (and probably Mac?) machine, we'd be good to go. On Ubuntu, however, even running
npm run start:ubuntu, which should run Puppeteer ready for Ubuntu, will still throw the following error:
An error has occurred: Something went wrong in the initial set up: Error: Failed to launch chrome! /home/jordan-scrapes-amazon-for-products-to-sell/node_modules/puppeteer/.local-chromium/linux-609904/chrome-linux/chrome: error while loading shared libraries: libX11-xcb.so.1: cannot open shared object file: No such file or directory
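The error names only the first library the loader failed on. You can ask the loader for the full list up front; this is a sketch that reuses the .local-chromium path from the error above (the Chromium revision in your node_modules may differ):

```shell
# Path taken from the error message above; your Chromium revision may differ.
CHROME=node_modules/puppeteer/.local-chromium/linux-609904/chrome-linux/chrome
# ldd marks every shared library the dynamic loader cannot resolve as "not found"
ldd "$CHROME" | grep "not found" || echo "all shared libraries resolved"
```

Each "not found" line corresponds to one of the packages we're about to install.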
You will need to install some additional dependencies:
sudo apt-get install libx11-xcb1 libxcomposite1 libxi6 libxext6 libxtst6 libnss3 libcups2 libxss1 libxrandr2 libasound2 libpangocairo-1.0-0 libatk1.0-0 libatk-bridge2.0-0 libgtk-3-0
and then you should be good to go! Our scraper is working like a charm!