As of today, I have three scrapers that I run regularly. Two scrape government websites and one scrapes Amazon. With at least two of these, I have taken measures to avoid being blocked. As I do this, I wonder why these sites block scrapers and what damage, if any, my scraping causes them.
Ethics, not legality
I want to be clear: my focus here is the ethics of web scraping, not the legality. I'll mention some court cases at the end of the post, but for the majority of it I want to talk about ethics.
I really enjoyed this article, in which one developer lays out his ethics of web scraping. I would like to go through some of his points of ethical web scraping here.
1 – Use the public API if it provides the data I’m looking for?
I generally agree with this, assuming the API is reasonable to use. That isn't always the case: Amazon's API options, for example, sometimes make it very difficult to get the information I need.
2 – Identifying myself in the User Agent string?
I think this would vary depending on what I am scraping. I believe that scraping something like a competitor's public prices is ethical. I accept that web scraping will happen; I accept that competitors will scrape my public prices, and if I ran such a site I would have to weigh the value of making my prices public against the cost of letting everyone, competitors included, see them.
For me, identifying myself in the User Agent string is not required to be an ethical web scraper.
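For anyone who does choose to identify themselves, a minimal sketch of what that looks like with Python's standard library follows. The scraper name and contact URL here are hypothetical placeholders; the point is just that a site operator who sees this User-Agent in their logs has a way to reach you.

```python
import urllib.request

# Hypothetical identification string: scraper name, version, and a
# contact URL so the site operator can reach you about the traffic.
USER_AGENT = "my-price-scraper/1.0 (+https://example.com/about-my-bot)"

def make_request(url):
    """Build a request that identifies the scraper in its User-Agent
    header instead of using the library's default one."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

req = make_request("https://example.com/products")
# Note: urllib normalizes header names to "User-agent" internally.
print(req.get_header("User-agent"))
```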
3 – Request at a reasonable rate?
I believe my choices should be guided by their consequences. If I'm scraping a web site so hard that it prevents them from doing business, I am hurting both of us: I am obviously trying to gain value by scraping them, and if their site stops operating, I have destroyed that value for them and for myself.
Performance often comes up in discussions of web scraping. Puppeteer, for example, is a slower scraping tool and often gets criticized for it. I actually think that's a perk: its pace is closer to how a human would use the site, so the site is better equipped to handle the load. The beauty of scraping is that it is automated; I can run the scraper while I'm sleeping. Speed matters much less when I'm free to do other things concurrently.
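Requesting at a reasonable rate can be as simple as pausing between requests. Here is a minimal sketch; the `fetch` callback stands in for whatever request function you use, and the delay value is an assumption you would tune per site rather than a recommendation.

```python
import time

def polite_fetch_all(urls, fetch, delay_seconds=5.0):
    """Fetch each URL in turn, sleeping between requests so the
    traffic looks more like a person browsing than a burst of
    automated hits. `fetch` is any callable that takes a URL;
    `delay_seconds` is a hypothetical per-site pause."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request after the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Since the scraper runs unattended anyway, a delay like this costs nothing but wall-clock time while I sleep.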
There have been quite a few court cases, but the main one I want to mention is hiQ v. LinkedIn. The details of the case are better explained here, but the takeaway for me was that, as of this writing, scraping publicly accessible data is legal; the line is drawn at content behind a password.