Scraping websites: Having fun with node.js and jQuery

January 5th, 2012 by ger

I’ve been looking for an excuse to use node.js for a while and the opportunity presented itself last week. While thinking about different alternatives for extracting certain information from a website I realized it would be very useful to use jQuery selectors for this task. But, unlike the common jQuery use case, I wanted to execute this code on the server side and I also wanted to have access to a database to persist the scraped data. So using jQuery within a browser was out of the question. This was when node.js appeared as a possible solution to my problem. After a bit of research I realized it was simple to use jQuery from a node.js application.

The problem was reduced to three simple steps:

  • Perform a request to the website we want to scrape.
  • Use jQuery to extract relevant information.
  • Persist the results to a database.

Each of these steps required a specific node.js module. The number of available modules is huge and selecting which one to use is not always simple, but after a bit of testing I ended up using request, jQuery and sqlite3, all of them installed using npm.

With everything set up the implementation was very simple:

var $ = require('jQuery');
var request = require('request');
var sqlite3 = require('sqlite3');

// Open sqlite database and prepare insert statement
var db = new sqlite3.Database('sqlite3.db');
var stmt = db.prepare("insert into country(name, code, flat_image) values(?, ?, ?);");

// Perform GET request
request('http://en.wikipedia.org/wiki/ISO_3166-1', function(error, response, body) {
    if (error) { throw error; }

    // Find every country in the page:
    // Since there is no id or class to identify each country
    // we must rely on the page structure:
    // The first table with class wikitable contains the country list
    $(body).find('table.wikitable').first().find('tr').each(function(index) {
        // TR layout:
        // <tr>
        //   <td>
        //     <span class="flagicon">
        //       <img alt="" src="22px-Flag_of_Afghanistan.svg.png" width="22" height="15" class="thumbborder">&nbsp;
        //     </span>
        //     <a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan</a>
        //   </td>
        //   <td><a href="/wiki/ISO_3166-1_alpha-2#AF" title="ISO 3166-1 alpha-2"><tt>AF</tt></a></td>
        //   <td><tt>AFG</tt></td>
        //   <td><tt>004</tt></td>
        //   <td><a href="/wiki/ISO_3166-2:AF" title="ISO 3166-2:AF">ISO 3166-2:AF</a></td>
        // </tr>
        var country = $(this);
        var code = $(country.children()[1]).text();
        var flag = country.find('img').attr('src');
        var name = $(country.children()[0]).text();

        console.log(name + ' (' + code + '): ' + flag);
        stmt.run(name, code, flag);
    });
});

In this case we are scraping a list of countries from Wikipedia with some additional information for each one. Using jQuery we can easily obtain the list of countries (in this case a list of TR elements) and extract the name, code and flag from each TD. The selectors used and how each value is retrieved will vary from site to site but if you are familiar with jQuery it should not be hard to figure out.
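For readers who want to see the same structural idea without jQuery, here is a rough sketch using only Python's standard html.parser module. It is a comparison, not the code used in this post: the RowExtractor class is made up for illustration, the sample markup is a trimmed copy of the TR layout shown in the comments above, and, unlike the jQuery version, it reads every row it is fed instead of selecting the first table.wikitable.

```python
from html.parser import HTMLParser

class RowExtractor(HTMLParser):
    """Collects (first cell text, second cell text, first img src) for each <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None  # cells of the row being parsed, None outside a row
        self._text = None   # text chunks of the current cell, None outside a cell
        self._flag = None   # src of the first <img> found in the current row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._flag = [], None
        elif tag == "td" and self._cells is not None:
            self._text = []
        elif tag == "img" and self._cells is not None and self._flag is None:
            self._flag = dict(attrs).get("src")

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._cells is not None:
            # Rows without at least two <td> cells (e.g. header rows) are skipped
            if len(self._cells) >= 2:
                self.rows.append((self._cells[0], self._cells[1], self._flag))
            self._cells = None

# Trimmed sample row, based on the TR layout documented in the script above
sample = """
<table class="wikitable"><tr>
  <td><span class="flagicon"><img alt="" src="22px-Flag_of_Afghanistan.svg.png">&nbsp;</span>
      <a href="/wiki/Afghanistan">Afghanistan</a></td>
  <td><a href="/wiki/ISO_3166-1_alpha-2#AF"><tt>AF</tt></a></td>
  <td><tt>AFG</tt></td>
</tr></table>
"""
parser = RowExtractor()
parser.feed(sample)
parser.close()
rows = parser.rows
print(rows)
```

The jQuery version is clearly shorter, which is exactly why selectors are attractive for this kind of job.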

Using sqlite3 from node.js is also pretty straightforward. We just open the database, prepare the insert statement and execute it multiple times. It’s important to note that everything is asynchronous, but in this simple example it really doesn’t affect us. Using a different database is very similar; you’ll just have to find the right module for the job.
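For comparison, the same prepare-once, run-many pattern can be sketched with Python's built-in sqlite3 module. This is only an illustration: the in-memory database, the sample rows, the column types and the save_countries helper are assumptions, and the column names simply mirror the insert statement in the script above.

```python
import sqlite3

def save_countries(conn, rows):
    """Insert (name, code, flag image) tuples using one parameterized statement."""
    # Column types are assumed; the post does not show the table definition
    conn.execute(
        "create table if not exists country(name text, code text, flat_image text)"
    )
    # executemany runs the same statement once per row
    conn.executemany(
        "insert into country(name, code, flat_image) values(?, ?, ?)", rows
    )
    conn.commit()
    return conn.execute("select count(*) from country").fetchone()[0]

# Illustrative data only; a real run would pass the scraped rows
conn = sqlite3.connect(":memory:")
inserted = save_countries(conn, [
    ("Afghanistan", "AF", "22px-Flag_of_Afghanistan.svg.png"),
    ("Albania", "AL", "22px-Flag_of_Albania.svg.png"),
])
print(inserted)
```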

That’s it. With a few lines of code we can scrape any website using a familiar and powerful framework like jQuery.

Simple MVC for PHP

January 17th, 2011 by ger

I consider myself a developer. Not a Java, Python or .NET developer. Just a developer. I don’t believe the language should have a significant impact in the kind of code you create and that’s why I enjoy creating well written applications in any language I happen to be using.

This is also true for PHP. Why do I mention PHP specifically? Because this is a language that has usually been associated with hard-to-understand, ugly code. I’m not going to get into this subject since a lot has already been written. Fortunately, thanks in part to frameworks like CakePHP, this idea is changing. I would strongly recommend using it for any moderate size application. But what should you do if you need a simple two-page site and have little time to learn a new framework? I want to show you that it is easy to follow well known practices (like MVC in this case) in PHP with little effort.

In my case, I’ve recently been working on a big PHP site developed years ago and it was actually a nightmare. Business and view logic were mixed up in ways you can’t even imagine. So when I started a small site for a marketing campaign, I wanted to do it the right way. Since I didn’t want to use CakePHP (or any other similar framework) for something so small I decided to implement MVC by hand.

My application would consist of a simple model (plain PHP classes like User, Country, Score), several views (mainly HTML with minimal inline PHP) and the controllers. The main task here was to create a base controller to handle the interaction between the view and the model. I wanted to implement controllers like this:

class IndexController extends Controller {
    protected function get() {
        return new View('../resources/view/home.php');
    }
}

$controller = new IndexController();
$controller->start();

This is a simple controller with no logic. It just displays home.php (which is purely HTML).

A more complex case could be something like this:

class UserHomeController extends Controller {

    protected function get() {
        // Default to no scores so the view always receives a defined value
        $scores = array();
        $user = User::getLoggedUser();
        if ($user != null && $user->isRegistered()) {
            $scores = Score::get($user);
        }

        return new View('../resources/view/homeuser.php', array(
            'scores' => $scores,
            'user' => $user));
    }
}

$controller = new UserHomeController();
$controller->start();

In this case we retrieve the logged-in user and their scores from the database. This information is passed to the view using the View object.

Finally, the controller for a registration form which needs to handle GET and POST requests in different ways:

class RegistrationController extends Controller {

    protected function get() {
        // Render registration form
        return new View('register.php', null, View::REDIRECT_ACTION);
    }

    protected function post() {
        // Persist user
        $user = new User();
        $user->name = $_POST['name'];
        $user->email = $_POST['email'];
        $user->setBirthdate($_POST['birthdate_year'], $_POST['birthdate_month'], $_POST['birthdate_day']);
        $user->registration_date = Date::now();

        if ($user->isValid()) {
            User::save($user);
        } else {
            header('HTTP/1.1 500 Internal Server Error');
            exit;
        }

        // Redirect to user home
        return new View('userhome.php', null, View::REDIRECT_ACTION);
    }
}

$controller = new RegistrationController();
$controller->start();

As you can see, the main idea is to separate logic completely from the view and, additionally, we can place common controller logic (session management for instance) in one place.

The base Controller itself is a simple class that can be reused in any project:

/**
 * Base Controller for all pages
 * Handles session management, GET/POST requests and response rendering
 *
 * @author german
 */
class Controller {

    /**
     * This method should be called from the controller PHP to handle
     * the current request
     */
    public function start() {
        session_start();
        $this->init();

        if ($_SERVER['REQUEST_METHOD'] == 'POST') {
            $view = $this->post();
        } else {
            $view = $this->get();
        }

        if ($view != null) {
            $this->display($view);
        }
    }

    /**
     * Override this method to initialize the controller before handling
     * the request
     */
    protected function init() {
    }

    /**
     * GET request handler
     */
    protected function get() {
        $this->process();
    }

    /**
     * POST request handler
     */
    protected function post() {
        $this->process();
    }

    /**
     * Request handler. This method will be called if no method specific handler
     * is defined
     */
    protected function process() {
        throw new Exception($_SERVER['REQUEST_METHOD'] . ' request not handled');
    }

    /**
     * Populates the given object with POST data.
     * If no object is given a StdClass is created.
     * @param StdClass $obj
     * @return StdClass
     */
    protected function populateWithPost($obj = null) {
        if(!is_object($obj)) {
            $obj = new StdClass();
        }

        foreach ($_POST as $var => $value) {
            $obj->$var = trim($value); //here you can add a filter, like htmlentities ...
        }

        return $obj;
    }

    private function display($view) {
        if ($view->action == View::RENDER_ACTION) {
            $context = $view->context;
            include($view->url);
        } else if ($view->action == View::REDIRECT_ACTION) {
            header('Location: ' . $view->url);
        } else {
            throw new Exception('Unknown view action: ' . $view->action);
        }
    }
}

class View {
    const RENDER_ACTION = 'render';
    const REDIRECT_ACTION = 'redirect';

    public $url;
    public $context;
    public $action;

    public function __construct($url, $context=array(), $action=View::RENDER_ACTION) {
        $this->url = $url;
        $this->context = $context;
        $this->action = $action;
    }
}

I don’t want to keep adding code to this post but you can imagine how the views are implemented. Anything included in the $context variable (View object) is available in the view to display it.

As you can see there’s really no reason to write spaghetti code in PHP (or any other language!) and even a small project can be implemented in a nice and elegant way.

This is my small contribution to eradicate that old PHP myth…

Meet Marvin, our Twitter meta-bot

August 26th, 2010 by ger

It all started with a small Twitter project for a client. Then we began thinking about different ideas for Twitter bots (not many of them became an actual product but we had a lot of ideas at that time).

It didn’t take too much time to realize that implementing each bot from scratch didn’t make sense, especially when changes on Twitter’s side have an important impact on every bot already developed (like deprecating basic authentication in favor of OAuth).

So we decided to create Marvin: our own meta-bot in charge of handling everything except creating the tweet itself.

Design and Implementation

The idea was to simplify the process of creating a new Twitter bot. Marvin will handle all the repetitive tasks and let each bot concentrate on doing what needs to be done and nothing else.

What Marvin offers us:

  • OAuth authentication
  • Logging
  • Data persistence
  • Debugging
  • More to come…

We implemented Marvin using oauth-python-twitter which provides us with a simple API to communicate with Twitter. Instead of wrapping this API within Marvin, we let each bot use it directly since that gives us a lot of flexibility and it’s very simple to use. We might review this in the future to allow a more powerful debug mechanism.

A bot is implemented as a simple Python module which must define the consumer key and consumer secret for the application (defined in Twitter) and implement a run function that will be executed by Marvin:

# Twitter application configuration
# These are fake keys!
CONSUMER_KEY = "j28Rup5fr4thUwruXAP2f"
CONSUMER_SECRET = "j28Rup5fr4thUwruXAP2fj28Rup5fr4thUwruXAP2f"

from datetime import datetime

def run(twitter, data, log):
    # Create your tweet, probably getting information
    # from some external source
    count = data.get("count", 1)
    message = "Tweet number %d: Marvin has something to tell you..." % count

    # Post the message using oauth-python-twitter API
    log.info("Tweeting: " + message)
    twitter.PostUpdate(message)

    # Save the count number for the next execution
    data["count"] = count + 1

As you can see, implementing a bot is very simple since we only have to worry about the tweet and nothing else!

One of the features implemented by Marvin is data persistence. Normally, a bot will have to save information from one execution to the next and we need to provide a way of doing that easily. In the previous example we stored the count number just to show you how it’s done.

To solve that problem we use pickle, a Python module for serializing and deserializing objects. As expected, Marvin will handle saving and restoring the data for you. You just need to store what you need there and read it in the next execution…
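A minimal sketch of that save/restore cycle, assuming one pickle file per bot. The load_data and save_data helpers and the file name are hypothetical; this post does not show Marvin's actual persistence code.

```python
import os
import pickle
import tempfile

def load_data(path):
    """Return the dict saved by the previous execution, or an empty dict on first run."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)

def save_data(path, data):
    """Persist the bot's dict so the next execution can read it."""
    with open(path, "wb") as f:
        pickle.dump(data, f)

# Simulate two executions of a bot that keeps a counter, as in the example bot
path = os.path.join(tempfile.mkdtemp(), "bot.pickle")
for _ in range(2):
    data = load_data(path)
    count = data.get("count", 1)
    data["count"] = count + 1
    save_data(path, data)

print(load_data(path))  # {'count': 3}
```

Keep in mind that pickle should only be used to load files you wrote yourself, which is the case for a bot's own state file.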

Running Marvin

Once Marvin has successfully logged in to an account (for a particular bot instance) we tell him to execute the actual bot process using the following command:

ger@piazzolla:~/projects/marvin$ python marvin.py run bot

In this example, our bot is implemented in the bot.py module.
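How Marvin locates and calls the bot is not shown here, but the command suggests something like the following sketch: import the module named on the command line and call its run function. The run_bot helper, the in-memory fake_bot module and the None stand-in for the Twitter client are assumptions for illustration; only the run(twitter, data, log) signature and the two constants come from the example bot.

```python
import importlib
import logging
import sys
import types

def run_bot(module_name, twitter, data):
    """Import the bot module by name and execute its run() function."""
    bot = importlib.import_module(module_name)
    # Every bot module is expected to define these two constants
    for required in ("CONSUMER_KEY", "CONSUMER_SECRET"):
        if not hasattr(bot, required):
            raise AttributeError("%s must define %s" % (module_name, required))
    log = logging.getLogger(module_name)
    bot.run(twitter, data, log)
    return data

# Stand-in bot registered in memory so the sketch runs without a bot.py file
fake_bot = types.ModuleType("fake_bot")
fake_bot.CONSUMER_KEY = "fake-key"
fake_bot.CONSUMER_SECRET = "fake-secret"

def _run(twitter, data, log):
    data["count"] = data.get("count", 1) + 1

fake_bot.run = _run
sys.modules["fake_bot"] = fake_bot

result = run_bot("fake_bot", twitter=None, data={})
print(result)  # {'count': 2}
```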

To execute the bot periodically we decided to use cron, which led us to the next section…

Why cron?

We thought about using an in-process task scheduler for Python, like APScheduler, but cron’s simplicity and stability were really appealing to us.
With cron, running Marvin is as simple as adding this line to the crontab:

*/1 * * * * cd $MARVIN_HOME; python marvin.py run bot

In this example Marvin will execute the bot every single minute (don’t worry, we won’t do that!)

We haven’t dismissed APScheduler for future versions, but for now we are very happy with the results we get from using cron.

How to handle OAuth authentication with a command line bot?

Using OAuth from the command line is not as simple as it seems. From a web application you are normally redirected to Twitter where you authorize it to access your Twitter account. Everything is simple and with a couple of clicks you are ready to go. But we don’t have a web application or a web browser to be redirected to!

The solution for us was to use the PIN-based OAuth mechanism created by Twitter. We just tell Marvin that we want to log in using the following command:

ger@piazzolla:~/projects/marvin$ python marvin.py login bot
Authorization URL: http://twitter.com/oauth/authorize?&oauth_consumer_key=FjZYXIe7z8X6rRHvXjGzrA&oauth_signature_method=HMAC-SHA1
Enter OAuth PIN:

Using this approach we need to open the authorization URL, allow the application to access Twitter (defined by CONSUMER_KEY and CONSUMER_SECRET) and enter the given PIN as shown below:

Enter OAuth PIN: 9823410
Marvin logged in successfully. You can now start using your bot!

Marvin will use this PIN to retrieve the access token from Twitter and will save it for us. Using this access token we will be able to access the authorized Twitter account transparently.
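The login steps can be summarized in a short sketch. The three callables passed in (one that returns the authorization URL, one that exchanges a PIN for an access token, and one that stores the token) are hypothetical placeholders for whatever the OAuth library and Marvin actually provide; nothing here talks to Twitter.

```python
def login(get_authorization_url, exchange_pin, store_token, ask=input, show=print):
    """Run a PIN-based OAuth login from the command line."""
    show("Authorization URL: " + get_authorization_url())
    pin = ask("Enter OAuth PIN: ").strip()
    token = exchange_pin(pin)  # hypothetical: trades the PIN for an access token
    store_token(token)         # hypothetical: persists the token for later runs
    return token

# Dry run with fake callables instead of real Twitter requests
saved = []
token = login(
    get_authorization_url=lambda: "http://example.invalid/authorize",
    exchange_pin=lambda pin: {"pin": pin, "access_token": "fake-token"},
    store_token=saved.append,
    ask=lambda prompt: " 9823410 ",
    show=lambda line: None,
)
print(token["access_token"])
```

Passing the prompt and output functions as parameters keeps the flow testable without a terminal.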

Is this all?

Definitely not. We are planning to add more features to Marvin, like autofollow based on different queries, automated DM responses and much more. You’ll be hearing more about this once we have implemented these new features.

Also, based on the reaction we get from this post, we might consider publishing Marvin as devartis’ first open-source project.

We now invite you all to follow @futbolar, our first bot implemented using Marvin.
