The legality of web scraping is a complicated topic that comes with many caveats. While web scraping is generally legal, it’s still possible to get into trouble if it isn’t done with proper care.
Before we dive deeper into the legality of web scraping, note that this article does not constitute legal advice. Before attempting the extraction of any web data, always consult with a professional.
What is web scraping?
Web scraping is the process of extracting data from internet sources through automated access to websites. There are various ways to achieve this, but it’s usually done by running bots that go through large numbers of URLs and download the content stored on each page.
Depending on the sophistication of the web scraping tools in use, the content can then be parsed to make it more suitable for analysis. Parsing becomes necessary at some stage of the pipeline, but the scraping step itself is technically possible without it.
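The two stages described above, downloading and parsing, can be illustrated with a minimal sketch that uses only the Python standard library. The names `download`, `parse`, and `TitleAndLinkParser`, as well as the `example-scraper/0.1` user agent, are assumptions made for illustration; production scraping tools are usually far more elaborate.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleAndLinkParser(HTMLParser):
    """Collects the page title and all link targets from raw HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self._in_title:
            self.title += data


def download(url, timeout=10):
    """Stage 1: download the raw content of a page (needs network access)."""
    # The user agent string is a hypothetical placeholder.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse(html):
    """Stage 2: turn raw HTML into structured data that is easier to analyze."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "links": parser.links}
```

Calling `parse(download("https://example.com/"))` would run both stages on a single page, while calling `download` alone corresponds to scraping without parsing.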
Web scraping tools are widely used by companies for various business purposes. As the essence of the practice is to automate access to, and download of, web data from many sources, such as search engines, companies can use scraping to gain an informational advantage or to develop revenue-generating operations.
Common applications for web scraping tools include market research, price monitoring, product development, and sentiment analysis. Each of these use cases can generate significant revenue, so it’s natural that businesses are drawn to them.
Web scraping, however, has a somewhat shaky reputation, primarily due to the lack of any worldwide regulation or legislation. Whether web scraping is legal remains a relatively open question, and the answers change often and develop quickly.
Is web scraping legal?
The short answer is yes, web scraping is legal, but with a lot of caveats and requirements. There’s still plenty of room to get into trouble without professional advice.
One of the most important aspects of the entire endeavor is that there is currently no legislation that directly addresses web scraping. However, since scraping inevitably acquires data, many of the laws regulating the acquisition and processing of data apply.
In turn, that makes data laws and regulations some of the most important pieces of legislation to consider before engaging in web scraping. GDPR, CCPA, the Computer Fraud and Abuse Act (CFAA), and related US Supreme Court rulings all apply to web scraping, among others.
As such, following the legal context of web scraping and its latest developments is vital, especially US Supreme Court rulings, as these can overturn earlier case law. In general, case law should be monitored closely, as there’s still a lot of back-and-forth that changes how web scraping is understood.
A great example is the contention surrounding the CFAA and whether it applies to web scraping. Only recently did a case (hiQ Labs v. LinkedIn) conclude with a ruling that the CFAA is not applicable to web scraping.
Web scraping publicly available data
According to the latest understanding of the web scraping legal context, public, non-personal data is fair game. While websites attempt to add clauses that would prevent bots from acquiring publicly available data, so far it seems that the worst that can happen is getting the web scraping tool banned.
There are three important caveats, however. First, publicly available data is everything that’s accessible without a login. In other words, everyone should be able to see that data through their regular browser, and no extra steps should be needed to access it.
Secondly, anything that can be considered creative work, or is in any other way subject to copyright or trademarks, cannot be scraped. Since web scraping creates a copy of the work, doing so might infringe on these rights. There are some possible ways out of the copyright conundrum, but these are few and far between and come with many requirements.
Thirdly, not all publicly available data is up for grabs. Some of what’s stored publicly is personal data, which is protected by legislation such as GDPR. Scraping public, personal data opens up an entirely new field of requirements and regulations.
So, publicly available data has a few important traits as it:
- Can be accessed by anyone without any registration, logging in, or usage of passwords.
- Is not copyrighted, subject to trademarks, or any other intellectual property protection.
- Is not private, personally identifiable information, or any other data of a similar nature.
Risks of scraping personal data
As mentioned above, personal data opens up a whole host of new restrictions and regulations that have to be met for the information to be processed. Technically, it might be possible to acquire such data. In practice, however, the requirements are way too strict.
For example, to process personal data under GDPR, the company must receive explicit consent from each individual whose information will be scraped. In other words, you would have to manually contact every person whose data you scraped and get their consent.
Additionally, while GDPR only protects individuals in the EU, you’d likely need to contact everyone regardless of region, as it may not always be clear where a person is from. Feigning ignorance wouldn’t work out well.
There are other important pieces of legislation related to the processing of personal data, but GDPR by itself almost kills off the entire endeavor. Most business-level web scraping integrations go through thousands, if not hundreds of thousands, of pages per day. Contacting each person whose data might be stored there is impossible.
Finally, there’s another important aspect of personal data. There may be small data points that are not identifying by themselves, but combining many of them might change that. For example, acquiring a single car route on a random day might not identify anyone. But getting all routes and some vehicle data could quickly identify a person.
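The combination effect described above can be shown with a small sketch. The records, field names, and the `unique_matches` helper are all fabricated for illustration and do not come from any real dataset; the function simply counts how many records are the only ones with a given combination of values.

```python
from collections import Counter

# Fabricated example records; none of this is real data.
records = [
    {"home_area": "north", "car_model": "hatchback", "departure": "08:00"},
    {"home_area": "north", "car_model": "sedan", "departure": "09:00"},
    {"home_area": "south", "car_model": "hatchback", "departure": "09:00"},
    {"home_area": "south", "car_model": "sedan", "departure": "08:00"},
]


def unique_matches(rows, field_names):
    """Count rows whose combination of the given fields appears exactly once."""
    keys = [tuple(row[name] for name in field_names) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1)


# Each field alone is shared by two people, so nobody stands out.
print(unique_matches(records, ["home_area"]))  # 0
print(unique_matches(records, ["car_model"]))  # 0

# Combining two fields singles out every record.
print(unique_matches(records, ["home_area", "car_model"]))  # 4
```

In this toy dataset no single field points to one person, yet two fields taken together are enough to make every record unique, which is why such data may still count as personal.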
In the end, scraping personal data should, in practice, be considered impossible. All personal data will either directly or indirectly identify a particular person. Additionally, some data might identify someone only in combination with other data, but it would still be considered personal.
Web scraping legal cases
|Web scraping case|Ruling|
|---|---|
|Ryanair v. Expedia (2019)|Details confidential. Case involved whether CFAA applies to scraping.|
|hiQ Labs v. LinkedIn (2019)|Court ruled that CFAA does not apply to web scraping (i.e. web scraping is not hacking).|
Web scraping is legal. However, there are many regulations involved with the processing of data, and all of them need to be followed for lawful use. Additionally, many of the current rules surrounding web scraping can change quickly, so case law and legislation should be followed closely.
In the end, web scraping is extremely legally complicated. Always consult with a professional before engaging in any activity that involves data acquisition, whether it’s public or not.