How to Extract Data from Web Pages

A tutorial on data extraction from pages on the Web

Posted by Josh on 04-08-2018

The main virtue of the Web is its openness. Every modern website is an application whose source code (HTML and JavaScript) can be seen and downloaded by anyone.

This openness creates an enormous opportunity for building online businesses. One can build a website that combines data from multiple independent websites into something completely new. An online service that combines data from other online services and data sources is called a mashup.

Traditionally, mashups are built using APIs (Application Programming Interfaces). Many sites on the Web expose their data through RESTful APIs that others can use. Wikipedia has an API, as do NASA, GitHub, StackOverflow, Imgur, and many other online services and content providers.
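As a quick illustration, here is a minimal sketch (in Python, using the requests library) that asks GitHub's public REST API for basic metadata about a repository; the repository name is only an example, any public repository works:

import requests

# Query GitHub's public REST API for basic repository metadata.
# "scrapy/scrapy" is only an example; any public repository works.
response = requests.get("https://api.github.com/repos/scrapy/scrapy")
response.raise_for_status()

repo = response.json()
print(repo["full_name"])         # e.g. "scrapy/scrapy"
print(repo["description"])       # short project description
print(repo["stargazers_count"])  # number of stars

The response is structured JSON, so no HTML parsing is needed.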

But what to do if you need some specific data, and the website that makes it publicly available doesn't have an API? The solution is to extract the data you need from traditional web pages. Such web pages don't contain the data in a pure form: data elements are usually surrounded by design elements, advertisements, and navigation links, which makes it difficult to distinguish the data from the "noise".

To extract data and avoid extracting noise, web services hire developers to build scrapers. A scraper is a software program that downloads a web page and extracts data from it.

Usually, a scraper consists of a collection of XPath expressions and software that applies those expressions to a given web page to extract the data.

An XPath expression specifies a "path" to an element you want to extract from a web page. For example, let's say we want to extract the price from every product page of an online store, and that the part of the HTML that contains the price looks like this:

<div>
    <ul>
        <li>Name: <b>Frisbie</b></li>
        <li>Price: <b>$12</b></li>
    </ul>
</div>

To extract the price, we need to select the <b> element that contains the text "$12". However, there are at least two <b> elements on the page. This is where XPath comes in handy: we can pinpoint the right <b> by describing the path to it from one of its ancestors in the DOM tree:

//div/ul/li[2]/b

The reader can find more information on XPath syntax in this tutorial.
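To check that this expression really selects the price, we can evaluate it with a library that understands XPath. Here is a minimal sketch using Python's lxml library (one of several possible choices, assuming it is installed):

from lxml import html

page = """
<div>
    <ul>
        <li>Name: <b>Frisbie</b></li>
        <li>Price: <b>$12</b></li>
    </ul>
</div>
"""

tree = html.fromstring(page)
# text() returns the text content of the matched <b> element
print(tree.xpath("//div/ul/li[2]/b/text()"))  # ['$12']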

The problem, however, is that we need software that will apply the XPaths we define to a web page. One possibility is to install a browser extension that applies XPaths to the loaded page. For example, the Scraper extension for Chrome does exactly that: it applies the XPaths you define to pages and saves the extracted data to a spreadsheet.

However, this is still a manual approach. If you want to build an automated solution, you can use a programming framework such as Scrapy. You will need to define the list of pages you want to extract data from (in other words, pages you want to "scrape") as well as XPaths for each data element of interest on those pages.
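For example, a minimal Scrapy spider for our price example could look like the sketch below; the start URL is hypothetical, and the XPaths assume the HTML structure shown above:

import scrapy

class PriceSpider(scrapy.Spider):
    name = "prices"
    # Hypothetical URL; list here the real pages you want to scrape.
    start_urls = ["https://example-store.com/products/frisbie"]

    def parse(self, response):
        # Apply the XPaths to each downloaded page and yield one item per page.
        yield {
            "name": response.xpath("//div/ul/li[1]/b/text()").extract_first(),
            "price": response.xpath("//div/ul/li[2]/b/text()").extract_first(),
        }

Saved as price_spider.py, the spider can be run with scrapy runspider price_spider.py -o prices.json, which writes the extracted items to a JSON file.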

Another popular tool for extracting data from HTML is BeautifulSoup. If you are familiar with the Python programming language, you will find it easy to build web data extractors using either Scrapy or BeautifulSoup.
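Note that BeautifulSoup navigates the parsed tree with its own methods (find, find_all, CSS selectors) rather than XPath. A minimal sketch for the same HTML fragment might look like this:

from bs4 import BeautifulSoup

page = """
<div>
    <ul>
        <li>Name: <b>Frisbie</b></li>
        <li>Price: <b>$12</b></li>
    </ul>
</div>
"""

soup = BeautifulSoup(page, "html.parser")
# In our example, the second <li> inside the <ul> holds the price.
items = soup.find("div").find("ul").find_all("li")
print(items[1].find("b").get_text())  # $12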

A major limitation of automated scraping frameworks such as Scrapy is that they download web pages directly rather than loading them in a browser the way normal users do. Many modern web pages contain so-called dynamic content, which appears only after the browser executes JavaScript code. Furthermore, many content providers detect that a page is being downloaded directly and restrict access to it to reduce the load on the web server.

If you are not a programmer or want to avoid downloading web pages directly, there are web services that will scrape a page or a complete website for you. You only need to point out the elements of interest by clicking on them; the service will try to build the XPaths for you and then apply them to other pages of the website to extract data. Multiple such services exist, including import.io, Mozenda, and Portia.

Whichever way you decide to go, scraping with XPaths or their alternatives has one big drawback: you fix the extraction rules and then hope that the website you extract data from will not change. If the website design changes, even slightly, an XPath may stop working or, worse, it may keep working but select the wrong element. It can take time before you notice the error, and then you will need to fix the XPath and reprocess the bad extractions. If your web service depends on scraping thousands of sources, scraper repairs will become your daily routine.
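To see why this is dangerous, imagine the store adds a "Discount" line above the price. Our XPath from the earlier example still matches something, but it is no longer the price (a sketch using lxml again; the extra line is purely hypothetical):

from lxml import html

# The same page after a small redesign: a "Discount" line was inserted.
page = """
<div>
    <ul>
        <li>Name: <b>Frisbie</b></li>
        <li>Discount: <b>20%</b></li>
        <li>Price: <b>$12</b></li>
    </ul>
</div>
"""

tree = html.fromstring(page)
# The old expression still "works", but now it silently picks the discount.
print(tree.xpath("//div/ul/li[2]/b/text()"))  # ['20%'] instead of ['$12']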

Finally, as a modern alternative to fixed XPath-based scrapers tailored to each source manually (or semi-automatically, with import.io, Mozenda, or Portia), one can use recent advances in Artificial Intelligence and Computer Vision to train a Machine Learning model that "looks" at a web page and detects the elements that look like what you want to extract: prices, titles, dates, locations, companies, labels, authors, descriptions, images, captions, quotes, etc.

An AI-based scraper will not break when the source page design changes. It will continue to extract the price because it looks like a price, and it will still extract the article title because it looks like an article title.

If you have enough expertise, you can build your own AI-based scraper and train it on real page examples. Alternatively, you can use an existing solution, such as semanti.ca.

As of August 2018, semanti.ca extracts data from any web article: news, blog posts, online magazine articles and similar pages. The AI of semanti.ca recognizes authors, titles, headlines, publication dates, tags and categories, images and captions, quotes, section names, and, of course, text paragraphs.

If you need to scrape something else, contact us and most likely we will be able to train a Machine Learning model that recognizes the data elements you care about, such as the product name and price on online-store product pages, stock market tickers, or, perhaps, the scores online reviewers give to gadgets, music albums, or movies. An AI-based web scraper will not break if the website design changes, so you can concentrate your effort on building your unique solution rather than repairing broken scrapers.


Read our previous post "The Most Useful Linux Commands You Probably Need to Know" or subscribe to our RSS feed.

Found a typo or an inconsistency in the text? Let us know and we will fix it.


Like it? Share it!