Managing Elasticsearch data in Python

Lucas Lavandeira
Published in devartis
Aug 14, 2018


Over the last few years, Elasticsearch has become the most popular search engine choice in environments that embrace open source. At devartis it’s our first choice whenever we need to implement text search in our projects or perform complex aggregations on our data. Our latest use case has been an open source Time Series API for Argentina’s national government, where ES works both as a text analyzer for the time series and as a data store for the numeric values we run aggregations on. Furthermore, for web services we often end up using Django, Python’s biggest web framework. This post covers accessing Elasticsearch features from Python through two libraries developed by the Elasticsearch team: elasticsearch-py and elasticsearch-dsl.

This post assumes an ES cluster is already set up and available, either on your local machine (I recommend the official ES Docker image for this) or on a different environment. Both Python libraries can be easily installed with pip.

Elasticsearch-py and Elasticsearch-dsl

Elasticsearch exposes a RESTful API: you handle all of your data by making HTTP requests to your Elasticsearch cluster and receiving response data, with both requests and responses formatted as JSON. The first library, elasticsearch-py, is a wrapper over most of the service’s endpoints, with a body argument to specify a Python dict that is then translated to JSON. So, for instance, this curl command in bash and this Python code snippet are equivalent:
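A minimal sketch of that equivalence, assuming a local cluster; the index name, document id, and fields are illustrative:

```python
# The bash version, for reference:
#   curl -X PUT "localhost:9200/blog/doc/1" \
#        -H 'Content-Type: application/json' \
#        -d '{"title": "Managing Elasticsearch data", "published": "2018-08-14"}'

from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=["localhost:9200"])

# The same request through elasticsearch-py: the body dict is serialized to JSON.
client.index(
    index="blog",
    doc_type="doc",
    id=1,
    body={"title": "Managing Elasticsearch data", "published": "2018-08-14"},
)
```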

The whole translation process is very simple, and it can be useful if you have a lot of bash scripts and want to quickly rewrite them in Python. But for more advanced usage there is elasticsearch-dsl, which builds on top of the first library to provide a more Pythonic, object-oriented approach.

Setting up your data

The main advantage of using the DSL over the plain wrapper library is the management of indexes and their mappings and settings: the Index class, along with DocType, makes configuring your index or indexes very simple, without having to write a single line of JSON!

First off, we’ll create a new index with custom settings. By default, Elasticsearch will create an index for you if it doesn’t exist, but this step is often necessary to fine-tune several options.
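A minimal sketch, assuming a local cluster; the index name and settings values are illustrative:

```python
from elasticsearch_dsl import Index
from elasticsearch_dsl.connections import connections

# Register a default connection for the DSL library to use.
connections.create_connection(hosts=["localhost:9200"])

blog_index = Index("blog")
blog_index.settings(
    number_of_shards=1,
    number_of_replicas=0,
)
blog_index.create()
```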

The Index class presented here has a number of methods bound to its instances, with features to create, delete, or clone an index, or to set up custom text analyzers, among others.
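For example, a custom analyzer can be attached before the index is created (a sketch; the analyzer name and token filters are illustrative):

```python
from elasticsearch_dsl import analyzer

# A custom analyzer; register it on the index before calling create().
folding_analyzer = analyzer(
    "folding",
    tokenizer="standard",
    filter=["lowercase", "asciifolding"],
)
blog_index.analyzer(folding_analyzer)
```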

This example of course works, but without any document type defined the index is of little use. Next, we’ll set up a DocType class with its corresponding mapping. Keep in mind that doc types are deprecated as of ES 6.0 and will be removed in a future version, so it’s best to define a single doc type per index, to make an eventual migration easier.
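A sketch of such a class; the field names and types are illustrative:

```python
from elasticsearch_dsl import Date, DocType, Text

class BlogPost(DocType):
    title = Text(analyzer="standard")
    body = Text()
    published = Date()

    class Meta:
        # Optional: pin this doc type to a specific index (and connection).
        index = "blog"

# Create the mapping for this doc type in the index.
BlogPost.init()
```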

The Meta class defined here is completely optional. Defining your client and index on the class itself means you don’t have to specify them later on when searching, which is handy in common use cases. They can be left blank if, for instance, you’re handling multiple ES clusters and want to set them dynamically.

Creating documents, then, is as simple as instantiating an object of a class inheriting from DocType and calling its save() method. The DSL library takes care of the type conversions, the JSON formatting, and the request to the cluster.
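A minimal sketch, reusing the BlogPost class from above with illustrative values:

```python
post = BlogPost(
    title="Managing Elasticsearch data in Python",
    body="Elasticsearch exposes a RESTful API...",
    published="2018-08-14",
)
post.save()  # one HTTP request to the index API per call
```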

Just like the Index class, DocType instances have a lot of bound and unbound (class) methods to manage your documents. You can create, delete, and update your docs from this API, execute text searches (as sketched below), or perform different types of filter operations. Moreover, each individual field (Text, Date, ...) can be configured with its own settings, such as specific index-time and search-time analyzers, copies of the data indexed into other fields, and anything else that Elasticsearch offers in the Mapping parameters documentation.
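A sketch of a text search through the class-level search() method; the query field and term are illustrative:

```python
# A match query against the title field; iterating the Search object executes it.
search = BlogPost.search().query("match", title="elasticsearch")
for hit in search:
    print(hit.meta.id, hit.title)
```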

A useful tip at this step: specify a unique identifier for each document with the _id keyword argument to the constructor if you want to be able to reference and update it later.
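For instance (a sketch with an illustrative id):

```python
post = BlogPost(_id=1, title="A post we can reference later")
post.save()

# The same id retrieves (and can update) the document later on.
same_post = BlogPost.get(id=1)
```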

Bulk indexing data with helpers

When indexing multiple docs, though, the above process simply doesn’t scale: each save() call implies one request to the Elasticsearch index API, which in an ETL type of process is nowhere near performant enough. Fortunately, Elasticsearch provides us with a Bulk API, and the elasticsearch library goes even further, giving us a helper function that indexes several documents from a single Python iterable. There are a couple of tricks to integrating our DSL document objects with this helper function:

The basic elasticsearch-py wrapper does not understand the DSL models and expects JSON-like action dicts to pass on to the HTTP API, so we have to use the doc’s .to_dict(include_meta=True) method to get a dict in the shape the bulk helper understands.
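A sketch of the whole routine; generate_posts() is a hypothetical source of BlogPost instances:

```python
from elasticsearch.helpers import bulk
from elasticsearch_dsl.connections import connections

def generate_posts():
    # Hypothetical document source, e.g. rows from a database or a CSV file.
    for i, title in enumerate(["first post", "second post", "third post"]):
        yield BlogPost(_id=i, title=title)

# to_dict(include_meta=True) yields action dicts with _index, _type, _id
# and _source keys, which is exactly what the bulk helper expects.
actions = (post.to_dict(include_meta=True) for post in generate_posts())
bulk(connections.get_connection(), actions)
```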

There are a couple of other bulk helpers. streaming_bulk, in the same module, is a generator of results that lets you iterate over them to check for any possible errors. parallel_bulk is similar to the former, but makes use of Python’s multiprocessing library to execute your indexing in parallel.
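A sketch of error checking with streaming_bulk, reusing the action stream from above:

```python
from elasticsearch.helpers import streaming_bulk

# streaming_bulk yields an (ok, result) tuple per document as it goes.
actions = (post.to_dict(include_meta=True) for post in generate_posts())
for ok, result in streaming_bulk(connections.get_connection(), actions):
    if not ok:
        print("Failed to index a document:", result)
```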

Aside from just indexing, these helper functions can be set up to perform delete or update actions by manually adding an _op_type key to each action dict. By default its value is "index", meaning a document with the same _id will be overwritten.
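For example, turning the same stream into delete actions (a sketch):

```python
def delete_actions():
    for post in generate_posts():
        action = post.to_dict(include_meta=True)
        action["_op_type"] = "delete"  # instead of the default "index"
        yield action

# The helper ignores the _source body for delete actions.
bulk(connections.get_connection(), delete_actions())
```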

Conclusion

Configuring your ES index and indexing data from Python is a very simple process; the last example showed how the main indexing routine can be written in about ten lines of code once everything is set up. The whole experience is a breeze, with well-defined classes that serve as an abstraction over the raw REST API. Obviously, you’ll still need to understand what’s going on behind the scenes when debugging, but hopefully the process of writing complex index configurations and mappings will be completely streamlined.

Visit us!
