Why would you build a scraper?

Agustin Payaslian
devartis
Aug 31, 2020

According to Internet Live Stats, there are currently more than one and a half billion websites. That means there is a lot of public information out there for anybody to use. So, how can we make use of all that available data? We will get to that in a moment, but first let's go over a few related concepts.

Crawler

A crawler is the component that navigates a website by following its links, simulating the behaviour of a person browsing the web. For example, when you search for movies on IMDB and click Next to see the next page of results, a crawler does the same thing programmatically.

If you want to crawl all the action movies, you only need to increment the ‘start’ argument in the URL by fifty, and it will be just like a person clicking the Next button.
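As a rough sketch, here is what that pagination loop could look like in Python with the requests library. The exact URL and query parameters are an assumption based on how IMDB's search pages were structured; check the real URLs before relying on them.

```python
import requests

BASE_URL = "https://www.imdb.com/search/title/"  # assumed IMDB search endpoint

def crawl_action_movies(pages=5):
    """Download successive result pages by bumping the 'start' argument by 50."""
    htmls = []
    for page in range(pages):
        start = page * 50 + 1  # IMDB listed 50 titles per page: 1, 51, 101, ...
        response = requests.get(
            BASE_URL,
            params={"genres": "action", "start": start},
            timeout=10,
        )
        response.raise_for_status()
        htmls.append(response.text)
    return htmls
```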

Scraper

A scraper is the component that extracts information from a web page. It parses the page's HTML and pulls out the data you want, for example a movie's title, its rating, and so on.

By inspecting how the HTML maps to what is rendered on the page, we can work out how to parse it and extract the information we want.
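For instance, with Beautiful Soup (mentioned again at the end of this post) a scraper for one of those downloaded pages could look like the sketch below. The CSS selectors are hypothetical placeholders; the real ones depend on the page's actual HTML.

```python
from bs4 import BeautifulSoup

def scrape_movies(html):
    """Parse a search-results page and extract title and rating for each movie."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    # '.lister-item' and the selectors below are made-up class names;
    # inspect the real page to find the right ones.
    for item in soup.select(".lister-item"):
        title = item.select_one("h3 a")
        rating = item.select_one(".ratings-imdb-rating strong")
        movies.append({
            "title": title.get_text(strip=True) if title else None,
            "rating": float(rating.get_text()) if rating else None,
        })
    return movies
```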

How do these tools work together?

As the crawler visits each page, it downloads the HTML and saves it to a database or pushes it onto a queue. In parallel, the scraper consumes from that database or queue and extracts the data you want from each HTML document. You can then store the processed data in another database to analyze later. As I mentioned before, there are a lot of web pages, so how can these tools help you? Let's look at an example.
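A minimal version of that pipeline, using an in-memory queue to connect a crawler thread and a scraper thread, could look like this. A real system would typically use a database or a message broker instead of queue.Queue, and real parsing logic instead of just grabbing the page title.

```python
import queue
import threading

import requests
from bs4 import BeautifulSoup

html_queue = queue.Queue()
results = []

def crawl(urls):
    """Producer: download each page and push its HTML onto the queue."""
    for url in urls:
        response = requests.get(url, timeout=10)
        html_queue.put(response.text)
    html_queue.put(None)  # sentinel: nothing left to crawl

def scrape():
    """Consumer: parse every HTML document the crawler enqueued."""
    while True:
        html = html_queue.get()
        if html is None:
            break
        soup = BeautifulSoup(html, "html.parser")
        results.append(soup.title.get_text() if soup.title else None)

urls = ["https://example.com"]  # placeholder list of pages to crawl
producer = threading.Thread(target=crawl, args=(urls,))
consumer = threading.Thread(target=scrape)
producer.start()
consumer.start()
producer.join()
consumer.join()
```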

Imagine you are a manager at an important electronics shop. You want to do a market analysis to price your products. To get the public market data, you would have to visit every competitor's website and record the price of each product they have for sale. You would need to repeat this process every time you want to refresh the analysis. That takes a lot of time, time you may need for something else.

With a scraper, this isn't necessary. The crawler crawls all the pages and downloads their HTML, so you don't have to visit each one yourself. Then the scraper parses every HTML document the crawler downloaded and extracts the data you want. Now you have all your data in one place, ready to analyze and use to price your products.

Perks of building a scraper

All data in one place

Perhaps you just need to find the cheapest TV, or maybe you need to store the data to train a machine learning algorithm. Either way, you don't have to hunt for that information all over the web; you have it all in one place, ready to use.

Some data may change

Data may change. If you don't have a scraper, you have to check periodically whether what you found before has changed. With a scraper, you are continuously crawling and scraping the web, keeping your information up to date.

Saves time

Because the process is fully automated, you save all the time you would otherwise spend hunting for information on the internet; the scraper gets it for you.

Important decisions

Having all the data together gives you a clearer picture when making decisions. For example, if you have your competitors' market data, you can analyze it and decide which pricing option is best.

Disadvantages of building a scraper

All websites are different

Scraping a page like IMDB is not the same as scraping Rotten Tomatoes: their data is laid out in different HTML structures. To scrape both sites, you have to analyze each one and write a specific scraper for it. You have the same problem with the crawler: the URLs are structured differently on every website, so you need to design a specific crawler for each of them as well.
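One common way to keep this manageable, sketched below with made-up selectors, is to isolate the site-specific parts in a configuration, so each new site only needs its own entry rather than a whole new program:

```python
from bs4 import BeautifulSoup

# Hypothetical per-site selectors; the real ones come from inspecting each page.
SITE_CONFIG = {
    "imdb": {"item": ".lister-item", "title": "h3 a"},
    "rottentomatoes": {"item": ".movie-row", "title": ".movie-title"},
}

def scrape_titles(site, html):
    """Extract movie titles using the selectors configured for this site."""
    config = SITE_CONFIG[site]
    soup = BeautifulSoup(html, "html.parser")
    return [
        item.select_one(config["title"]).get_text(strip=True)
        for item in soup.select(config["item"])
        if item.select_one(config["title"])
    ]
```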

A website can change

Maybe one day your scraper starts throwing errors while parsing the HTML. That's when you have to update it. Sites periodically change their HTML and how their data is displayed; when that happens, your scraper becomes outdated and you have to adapt it to keep scraping the website.

Great power comes with great responsibility

If your scraper makes too many requests to a server, you might get banned, or worse, you could cause a denial of service and take the page down. To be gentle to the server, you need to limit the rate of requests you make. You don't want to mount a DoS attack on the server; you just want to extract some public information to use later.
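A simple way to throttle, assuming one request per second is an acceptable rate for the site in question, is to sleep between requests:

```python
import time

import requests

def polite_get(urls, delay=1.0):
    """Fetch each URL, waiting `delay` seconds between requests."""
    for url in urls:
        response = requests.get(url, timeout=10)
        yield response.text
        time.sleep(delay)  # be gentle: space out the load on the server
```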

Proxies

Sometimes you might get banned even if you make few requests, because the crawler's behaviour is detected as a bot and its requests are denied from that point on. To avoid this, you can use proxies and user agents in your requests to better simulate normal user behaviour while crawling.

A proxy is an intermediary server between you and the server you want to scrape; its function is to hide your identity from that server. A user agent, on the other hand, tells the server what kind of device is visiting. By combining different proxies and user agents, you can avoid being detected and banned.
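In practice, that usually means rotating through a pool of proxies and user agents on each request. Here is a rough sketch; the proxy addresses are placeholders, and in a real setup they would come from a proxy provider.

```python
import random

import requests

# Placeholder proxy pool (documentation IP range, not real proxies).
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]

def fetch(url):
    """Make a request through a random proxy with a random user agent."""
    proxy = random.choice(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
```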

Some advice

There are a lot of sites out there offering proxies, but I don't recommend using just any of them. Proxies have different levels of anonymity; always try to use those with the highest level so you don't get caught. You should also use proxies with high availability, so most of your requests succeed, and rotate them periodically, because they may get banned or go down. All of this improves your request success rate, which matters a great deal for performance.

And performance really matters for a scraper. Imagine you need to scrape one million pages and you make one request per second to avoid mounting a DoS attack on the server: that alone would take about eleven and a half days. You will never have a 100% success rate; at 50%, it would take twice as long.
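The arithmetic behind that estimate is worth spelling out: a million requests at one per second is a million seconds, and failed requests stretch it further.

```python
pages = 1_000_000
seconds_per_request = 1
success_rate = 0.5  # only half of the requests succeed

days = pages * seconds_per_request / 86_400  # ~11.6 days at 100% success
days_with_retries = days / success_rate      # ~23.1 days at 50% success
print(f"{days:.1f} days ideal, {days_with_retries:.1f} days at 50% success")
```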

Next steps

In today's world, where you need a lot of information to make the best decisions, a scraper is one of the best options when there is no API for the data you need. There are plenty of guides you can follow, and libraries that make your life easier, like Beautiful Soup for Python. So now that you know what a scraper is, think about how this tool could be helpful for you. Maybe you can use it at your company, or maybe you just want to enjoy the experience of building one. Go on, and happy scraping.
