Scraping websites: Having fun with node.js and jQuery

I’ve been looking for an excuse to use node.js for a while, and the opportunity presented itself last week. While thinking about different alternatives for extracting certain information from a website, I realized it would be very useful to use jQuery selectors for this task. But, unlike the common jQuery use case, I wanted to execute this code on the server side, and I also wanted access to a database to persist the scraped data, so using jQuery within a browser was out of the question. This was when node.js appeared as a possible solution to my problem. After a bit of research I realized it was simple to use jQuery from a node.js application.

The problem was reduced to three simple steps:

  • Perform a request to the website we want to scrape.
  • Use jQuery to extract relevant information.
  • Persist the results to a database.

Each of these steps required a specific node.js module. The number of available modules is huge and selecting which one to use is not always a simple task, but after a bit of testing I ended up using request, jQuery and sqlite3, all of them installed using npm.
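For reference, all three modules can be installed from the command line with npm. The package name casing (particularly for the jQuery package) may vary between npm registry versions, so treat the exact command as a rough guide:

npm install request jQuery sqlite3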

With everything set up, the implementation was very simple:

var $ = require('jQuery');
var request = require('request');
var sqlite3 = require('sqlite3');

// Open sqlite database and prepare insert statement
var db = new sqlite3.Database('sqlite3.db');
var stmt = db.prepare("insert into country(name, code, flat_image) values(?, ?, ?);");

// Perform GET request
request('http://en.wikipedia.org/wiki/ISO_3166-1', function(error, response, body) {
    if (error) { throw error; }

    // Find every country in the page:
    // Since there is no id or class to identify each country
    // we must rely on the page structure:
    // The first table with class wikitable contains the country list
    $(body).find('table.wikitable').first().find('tr').each(function(index) {
        // TR layout:
        // <tr>
        //   <td>
        //     <span class="flagicon">
        //       <img alt="" src="22px-Flag_of_Afghanistan.svg.png" width="22" height="15" class="thumbborder">&nbsp;
        //     </span>
        //     <a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan</a>
        //   </td>
        //   <td><a href="/wiki/ISO_3166-1_alpha-2#AF" title="ISO 3166-1 alpha-2"><tt>AF</tt></a></td>
        //   <td><tt>AFG</tt></td>
        //   <td><tt>004</tt></td>
        //   <td><a href="/wiki/ISO_3166-2:AF" title="ISO 3166-2:AF">ISO 3166-2:AF</a></td>
        // </tr>
        var country = $(this);
        var code = $(country.children()[1]).text();
        var flag = country.find('img').attr('src');
        var name = $(country.children()[0]).text();

        console.log(name + ' (' + code + '): ' + flag);
        stmt.run(name, code, flag);
    });
});

In this case we are scraping a list of countries from Wikipedia with some additional information for each one. Using jQuery we can easily obtain the list of countries (in this case a list of TR elements) and extract the name, code and flag from each TD. The selectors used and how each value is retrieved will vary from site to site, but if you are familiar with jQuery it should not be hard to figure out.
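One detail worth keeping in mind: the selector above also matches the table’s header row, which has no flag image. A minimal guard along these lines (a hypothetical variation, not part of the original script) would skip such rows before inserting:

$(body).find('table.wikitable').first().find('tr').each(function() {
    var country = $(this);
    // Skip the header row (or any row without a flag image)
    if (country.find('img').length === 0) { return; }
    // ...extract name, code and flag and call stmt.run() as shown above
});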

Using sqlite3 from node.js is also pretty straightforward. We just open the database, prepare the insert statement and execute it multiple times. It’s important to note that everything is asynchronous, but in this simple example it really doesn’t affect us. Using a different database is very similar; you’ll just have to find the right module for the job.
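For completeness, the script above assumes the country table already exists and never finalizes the statement or closes the database. A minimal setup sketch using the same sqlite3 module could look like the following; the table and column names simply mirror the insert statement above, and the sample row is taken from the TR layout comment:

var sqlite3 = require('sqlite3');

var db = new sqlite3.Database('sqlite3.db');

// serialize() guarantees the statements below run in order
db.serialize(function() {
    // Create the table the insert statement expects
    db.run("create table if not exists country(name text, code text, flat_image text);");

    var stmt = db.prepare("insert into country(name, code, flat_image) values(?, ?, ?);");
    stmt.run('Afghanistan', 'AF', '22px-Flag_of_Afghanistan.svg.png');

    // Release the prepared statement and close the database when done
    stmt.finalize();
    db.close();
});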

That’s it. With a few lines of code we can scrape any website using a familiar and powerful library like jQuery.


One Response to “Scraping websites: Having fun with node.js and jQuery”

  1. rata Says:

    You may want to take a look at http://scrapy.org/, a fast, high-level screen scraping and web crawling framework for Python. It’s awesome
