Web scraping is a topic that gets a lot of coverage online. While the practice has existed almost since the birth of the World Wide Web, it only recently became accessible and valuable to the public at large.
In short, it’s the process of automated data collection from publicly available sources on the internet. There are three criteria for deciding whether data can be scraped: it has to exist, it has to be online, and it has to be publicly accessible. Scraping data behind logins and the like is theoretically possible, but that’s a can of worms that ends in a hefty fine (at best).
How does web scraping work?
Web scraping is not unlike a more complicated version of copying and pasting data from any website to some locally stored file. Fortunately, it’s also more scalable, faster, and cheaper.
On the surface, all that happens is that an automated script goes through a set of URLs (either predefined or collected during its journey), downloads each page, and temporarily stores it in local memory. A search function then goes through all the collected data to extract the desired information, and the results are output to a preferred file or database.
There will always be differences between implementations of web scraping. Some may choose to carefully curate URL lists, go through multiple pages, and then dump the entire HTML code collection somewhere. Others may search the HTML code as they go instead of keeping the entire thing in memory. Nevertheless, the idea remains the same.
In that sense, a scraper or web crawler is not that different from a regular internet user browsing a website. Some might even load the page through a browser. The most popular approach, however, is to send HTTP requests to the website, which then delivers the same content, only the crawler doesn’t draw anything on the screen. Such an approach saves computing power.
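The HTTP-request approach described above can be sketched with Python’s standard library alone. The target URL and the User-Agent string here are purely illustrative:

```python
# A minimal sketch of the HTTP-request approach using only the standard
# library. The URL and User-Agent are placeholders, not real targets.
from urllib.request import Request, urlopen

def build_request(url: str) -> Request:
    # A browser-like User-Agent header makes the request look like a normal visit.
    return Request(url, headers={"User-Agent": "Mozilla/5.0 (example scraper)"})

def fetch_html(url: str, timeout: float = 10.0) -> str:
    # Downloads the raw HTML without rendering anything on screen.
    with urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)

# html = fetch_html("https://example.com")  # raw HTML, ready for searching
```

In a production scraper, a dedicated library such as Requests would typically replace `urllib`, but the shape of the code stays the same.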
Web data extraction has a significant drawback: HTML is intended to be rendered by browsers, not analyzed. Most of what a web scraper gathers is a garbled mess of unusable information, and parsers have to be developed to combat that issue.
A parser tool turns unstructured data into a structured format. In other words, it’s code that goes through the garbled mess of information and turns it into something that’s easy for humans (or other software) to understand.
In the end, web scraping is a process of automatically collecting data left on the internet, turning it into usable information, and storing it somewhere safe.
What is web scraping used for?
There’s an inordinately large amount of use cases for web scraping. Enumerating all of them is likely impossible as a web scraper is useful everywhere where data is required. As modern businesses thrive off of data, it’s no surprise that automated collection of it can be so important.
One of the most popular use cases for web scraping is dynamic pricing. Used by the largest e-commerce stores, travel fare aggregators, and many other businesses, it’s a strategy that’s becoming increasingly popular as it gives a slight edge over the competition.
A web scraper is used to continually collect pricing data on matching products across all available competitor stores. That data is then matched to the pricing information of identical products the company sells itself, and prices are adjusted whenever a competitor makes a change.
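The matching-and-adjusting step can be illustrated with a toy repricing function. The SKUs, prices, and undercut margin below are invented for the example:

```python
# Toy dynamic-pricing sketch: all SKUs and prices are invented.
def reprice(ours: dict, theirs: dict, undercut: float = 0.01) -> dict:
    # For every product we sell, drop just below the competitor's price
    # whenever they are cheaper; otherwise keep our current price.
    updated = {}
    for sku, price in ours.items():
        rival = theirs.get(sku)
        if rival is not None and rival < price:
            updated[sku] = round(rival - undercut, 2)
        else:
            updated[sku] = price
    return updated

our_prices = {"SKU-1": 19.99, "SKU-2": 5.49}
competitor_prices = {"SKU-1": 18.49, "SKU-2": 5.99, "SKU-3": 12.00}
print(reprice(our_prices, competitor_prices))
# {'SKU-1': 18.48, 'SKU-2': 5.49}
```

A real system would add the modeling described below (price floors, stock levels, product matching), but the core loop is just this comparison.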
Dynamic pricing can involve more complicated frameworks. They can, potentially, use mathematical modeling to predict ceilings and floors for prices across product categories, match closely related products, and include available stock information. The underlying principle, however, remains the same.
Professional data acquisition
Social media websites, primarily LinkedIn, store vast arrays of important information about businesses. Competitors can use web data extraction to collect such data and make informed decisions.
Professional data can then be used for lead generation, sourcing job candidates, and estimating company health. Finance companies, on the other hand, can use the movement of highly skilled employees to predict stock valuations.
Additionally, social media is a treasure trove of information for companies who want to see how competitors’ products are performing. People leave lots of reviews and comments about products and services, all of which can be used for sentiment analysis and other insight-generating purposes.
Alternative data is a rising trend in many industries. Previously, most data-driven decisions were informed by either internal information, acquired through CRMs and similar software, or large-scale publications such as financial reports.
Web scraping affords businesses the opportunity to extract data from numerous sources nearly instantaneously. The term “alternative data” was coined to differentiate such collections of information.
It has been used by businesses in many industries; financial services, however, have taken a particular liking to it. Numerous studies show that alternative data has a place in investment decisions. In “Does Alternative Data Improve Financial Forecasting? The Horizon Effect”, for example, the authors confirm that short-term forecasting can be made better with it.
Real estate companies have uses for alternative data as well. Property valuations are subject to many factors, some of which can be a pain to collect and evaluate. Reviews of nearby businesses can be used as a secondary signal pointing to the quality of the property in question. While such reviews are really a predictor of the quality of the block rather than of the property itself, they still give real estate companies something to hang on to.
Is web scraping legal?
Web harvesting is in a tricky legal state of affairs. While there is no direct legislation against automated data collection or web scraping software, many companies use information protection legislation such as the GDPR as the foundation for challenging it.
There have been several high-profile cases that have built our current understanding of the legitimacy of web scraping. So far, it seems that users can scrape data that is publicly accessible, but not personal data or anything behind a login.
We would, however, always recommend consulting a lawyer. Our blog does not constitute legal advice; we are only describing the general trend of web scraping and its legality.
The process of data scraping
As mentioned previously, data scraping goes through several steps. We will now delve a little deeper into the technical side of things.
Writing a script
A script is the starting point of every web scraper. Usually, it’s written in Python, as the language has many libraries that make web scraping a lot easier. A Python library is, in simple terms, a pre-packaged piece of code that can be easily reused by others.
Beautiful Soup is a great example of a Python library, as it’s frequently used for HTML parsing. Instead of having to write the entire thing anew, developers can parse HTML by simply downloading and importing the library.
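To show what that looks like in practice, here is a small Beautiful Soup sketch. It assumes the `beautifulsoup4` package is installed (`pip install beautifulsoup4`), and the page HTML is invented for the example:

```python
# Parsing sketch with Beautiful Soup; the HTML below is a made-up page.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Example Store</h1>
  <span class="price">$19.99</span>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text()             # find the first <h1> tag
price = soup.select_one("span.price").get_text()  # CSS-selector lookup
print(title, price)  # Example Store $19.99
```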
Additionally, Python has higher accessibility for newcomers than some other programming languages. Due to its philosophical guidelines, called the Zen of Python, much of the language’s development has focused on simplicity and readability.
After a script that can communicate with websites and download HTMLs is written, a collection of URLs will be required. The script has to go somewhere, after all.
There are two approaches to it. The first one is to manually collect URLs and add them to the code or some file from which the script will read. It’s suitable for small-scale projects that may need only a few data points.
The second approach is to write a small section within the script that extracts URLs from the downloaded HTML. Since the web scraping software visits a website and downloads everything from it, that includes any links; with some parsing, these can be extracted and added to the collection automatically. This is the preferred method for anything above a home-brew project.
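Link extraction of this kind can be done with the standard library alone. The page below is a made-up snippet standing in for a downloaded HTML document:

```python
# Extract every href from <a> tags using only the standard library.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects hrefs from anchor tags to feed the URL queue."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Invented HTML standing in for a downloaded page.
page = '<html><body><a href="/page2">next</a><a href="https://example.com">home</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/page2', 'https://example.com']
```

The collected links would then be normalized (relative paths resolved against the current URL) and appended to the scraper’s queue.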
Once everything is prepared, the script will be able to go through the assembled collection of URLs. Pythoneers and Pythonistas frequently use libraries such as Requests, Selenium, or Pyppeteer to send requests to websites. The content of each response is then, as mentioned, stored in a variable.
Since the entire HTML is almost always unnecessary, the script has to initiate some form of search. There are three main methods used to coax the data out of its HTML shell: Regular Expressions (RegEx), XPath, and CSS selectors. The first is best used for extracting strings (i.e. text) from content-heavy HTML files. The differences between the other two are a little too complicated for this article.
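The RegEx approach, at least, is easy to demonstrate. The HTML fragment and pattern here are invented; XPath and CSS-selector equivalents would typically use lxml or Beautiful Soup:

```python
# RegEx extraction sketch; the HTML fragment is invented for the example.
import re

html = '<div class="price">$19.99</div><div class="price">$5.49</div>'

# Capture the numeric part of every price-tagged div.
prices = re.findall(r'<div class="price">\$([0-9.]+)</div>', html)
print(prices)  # ['19.99', '5.49']
```

RegEx is brittle against markup changes, which is why parser-based XPath and CSS-selector lookups are preferred for anything structural.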
No one wants to read data straight from a terminal. Data analysts usually want to see data represented in a suitable format; two popular options in web scraping are JSON and CSV. Sticking with our Python example, libraries such as Pandas make the exporting process a lot easier.
Before continuing onwards, we should note that we have described the basic process of web scraping. Advanced solutions include artificial intelligence, machine learning, and many other applications that make the process more efficient.
Proxies and web scraping
We’ve left out an important aspect when answering the question of what web scraping is: proxies. You might have been wondering why websites allow users to scrape data if it’s all automated. Administrators don’t have a penchant for bots.
The short answer is that, in many cases, they don’t. Bots by themselves might not be terrible, but a huge swath of them constantly attempting to scrape data can put strain on servers, which can dampen user experience. Differentiating between people who extract data responsibly and those who don’t is difficult, so all too often both get hit by the ban hammer.
As such, your carefully crafted web scraping software can get blocked even if you do everything to minimize negative impact. Proxies are an answer to the problem.
Most websites will block suspected web scraping software by adding its IP address to a block list. Proxies, on the other hand, are intermediary devices connected to the internet that hold IP addresses of their own. When they are in use, the website sees the requests as coming from someone else.
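Routing traffic through a proxy is a small configuration change in most HTTP libraries. This sketch uses the Requests library; the proxy endpoint and credentials are placeholders you would replace with your provider’s:

```python
# Proxy routing sketch with the Requests library.
# The proxy URL below is a placeholder, not a real endpoint.
import requests

PROXY_URL = "http://user:pass@proxy.example.com:8080"

def make_proxied_session(proxy_url: str) -> requests.Session:
    # Every request on this session is routed through the proxy, so the
    # target site sees the proxy's IP address instead of ours.
    session = requests.Session()
    session.proxies = {"http": proxy_url, "https": proxy_url}
    return session

session = make_proxied_session(PROXY_URL)
# html = session.get("https://example.com").text  # fetched via the proxy
```

Rotating through a pool of such proxies between requests is the usual way to spread load and avoid IP-based blocks.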
For scraping, residential proxies are the most commonly used type. They are regular household devices with ISP-provided IP addresses. Datacenter proxies are an option, but as they are hosted on business-owned servers, it’s a lot easier to detect that something is amiss.
Additionally, residential proxies are better if you need to acquire data that may be geographically sensitive. As they come from household devices, getting a particular location is generally easier than with datacenter proxies.
Web scraping is a somewhat complicated process with enormous potential. It can serve as the launchpad for businesses and as a way to support existing processes. Whether you use it for lead generation or other purposes, there’s no reason to avoid giving it a shot to see if it can produce value.