Complementing Cassandra with Elasticsearch

Gonzalo Bordanzi
Published in devartis · Oct 23, 2018


In this post, I want to talk a bit about these two technologies and how they can be used to complement each other.

First, I will give you a brief introduction to both Cassandra and Elasticsearch: what they are and how they are normally used. So let us begin!

Apache Cassandra

Cassandra is a distributed, wide-column NoSQL database. It offers its own query language, CQL (Cassandra Query Language), which has some similarities to SQL.

For example, the following statement is a query written for Cassandra using CQL:

SELECT date, total FROM tickets WHERE ticketid IN (10, 11, 12);

Pretty SQL-like. But beware: there are some things that cannot be done with CQL. Joins, subqueries and some aggregations are not allowed in CQL by design.

For more information about CQL, I recommend checking out the official documentation.

Since Cassandra is a database, you will need a driver to connect to it from your code. Depending on which language you are planning to use, you may have several options; the official DataStax drivers, available for Java, Python, Node.js, C# and other languages, are the usual starting point, and each comes with its own documentation.

These drivers handle your connection with Cassandra. However, they do not provide any kind of model mapping or query building. For that, you will need an object mapper on top of the driver, such as cqlengine (bundled with the Python driver) or an equivalent library for your language.

These are known as ORMs, and they work at a different level of abstraction: they use a Cassandra driver underneath to connect, and they also provide you with model mapping structures and query helpers.
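
To make this a bit more concrete, here is a minimal sketch using the DataStax Python driver and its bundled cqlengine mapper. The keyspace, table and column names are assumptions made up for this example:

    # Minimal sketch: connect with the DataStax Python driver's cqlengine mapper,
    # define a model, and run a query equivalent to the CQL example above.
    # Keyspace, table and column names are hypothetical.
    from datetime import datetime
    from decimal import Decimal

    from cassandra.cqlengine import columns, connection
    from cassandra.cqlengine.management import sync_table
    from cassandra.cqlengine.models import Model


    class Ticket(Model):
        __keyspace__ = 'ticketing'      # hypothetical keyspace (assumed to already exist)
        __table_name__ = 'tickets'
        ticketid = columns.Integer(primary_key=True)
        date = columns.DateTime()
        total = columns.Decimal()


    connection.setup(['127.0.0.1'], 'ticketing')   # connect to a local node
    sync_table(Ticket)                             # create the table if it does not exist

    Ticket.create(ticketid=10, date=datetime.utcnow(), total=Decimal('99.90'))

    # Equivalent to: SELECT date, total FROM tickets WHERE ticketid IN (10, 11, 12);
    for ticket in Ticket.objects.filter(ticketid__in=[10, 11, 12]):
        print(ticket.date, ticket.total)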

Elasticsearch

Elasticsearch is also a distributed technology: a RESTful engine used for searching and processing data through indexes. An index stores the portions of your data that you define, so they can later be searched or aggregated.

One important thing to take into account when using Elasticsearch is that, even though it stores your data, it should not be treated as a database. The data in your indexes should only be used for querying, and you must be able to delete and re-index it should the situation require it.

Another important matter is the indexes themselves. Since you decide what is stored in them, you should plan ahead and base their content on the queries you are likely to run.

As mentioned before, Elasticsearch is a RESTful engine, and that means that we communicate with it through a REST API.

For example, a simple query using the API could be the following:

curl -X POST 'http://localhost:9200/_search' \
  -H 'Content-Type: application/json' \
  -d '{
  "query": {
    "bool": {
      "must": [
        { "match_all": {} }
      ],
      "must_not": [],
      "should": []
    }
  },
  "from": 0,
  "size": 10,
  "sort": [],
  "aggs": {}
}'

This query matches all of the documents in the index; it is just to show that you connect to Elasticsearch through its REST API, in this case with a curl request.

Again, I recommend visiting the official documentation for the Elasticsearch Query DSL for more details.
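
If you would rather stay in application code than use curl, the same request can be sent through one of the official client libraries. Here is a minimal sketch with the Python client; the index name is an assumption for this example:

    # Sketch: the Python Elasticsearch client wraps the same REST API used above.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(['http://localhost:9200'])

    response = es.search(
        index='tickets',                                   # hypothetical index name
        body={
            'query': {'bool': {'must': [{'match_all': {}}]}},
            'from': 0,
            'size': 10,
        },
    )

    for hit in response['hits']['hits']:
        print(hit['_id'], hit['_source'])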

Why use them together?

To answer this question, we must think about the problem we are trying to solve. First of all, Cassandra is able to handle large amounts of inserts without much trouble and at high speed. This is possible thanks to Cassandra’s storage engine, which writes data sequentially to disk for each table. But of course, nothing comes for free: the price for those fast writes is paid when you issue a read. Reads are slower because the engine may have to look in several places on disk to find the most recent version of the row it is looking for. To sum this up, Cassandra will perform much better in environments with more inserts than reads.

You may now be thinking: what good is storing large amounts of data if I am penalized every time I want to read it? And what happens when I want to issue more complex queries over this data (range queries, aggregations, etc.)? Well, this is where something like Elasticsearch comes in handy.

Elasticsearch provides a wide spectrum of queries and aggregations to apply to the data you hold in your indexes; the whole point of using it is being able to query your data. “Bool”, “full text” and “range” are some of the most commonly used query types in Elasticsearch, and of course there are the aggregations.

This is a simple range query with a terms aggregation, and its result:

Query:

{
  "query": {
    "range": {
      "date": {
        "gte": "2017-08-29T02:00:00Z",
        "lte": "2017-08-29T08:00:00Z",
        "boost": 2
      }
    }
  },
  "from": 0,
  "size": 10,
  "sort": [],
  "fields": [
    "date",
    "tags"
  ],
  "aggs": {
    "status": {
      "terms": {
        "field": "tags"
      }
    }
  }
}

As you can observe, the query filters documents from 2017-08-29 between 2 AM and 8 AM. It also has a terms aggregation on the “tags” field, and it specifies the fields it wants in the response (to avoid showing each entire document and focus only on the fields we care about for this example).

I will now split the response in two parts, mainly because I want to focus on its two different sections (the query hits and the aggregation), and also because it would be pretty long otherwise:

Response (Part corresponding to the query):

"hits":[  
{
"fields":{
"date":[
"2017-08-29T02:46:00.000Z"
],
"tags":[
"tag1",
"tag2"
]
}
},
{
"fields":{
"date":[
"2017-08-29T02:42:49.000Z"
],
"tags":[
"tag1",
"tag2"
]
}
},
{
"fields":{
"date":[
"2017-08-29T02:12:20.000Z"
],
"tags":[
"tag3",
"tag1"
]
}
},
{
"fields":{
"date":[
"2017-08-29T07:20:29.000Z"
],
"tags":[
"tag3",
"tag1"
]
}
},
{
"fields":{
"date":[
"2017-08-29T05:45:47.000Z"
],
"tags":[
"tag3",
"tag1"
]
}
},
{
"fields":{
"date":[
"2017-08-29T04:59:45.000Z"
],
"tags":[
"tag1",
"tag2"
]
}
},
{
"fields":{
"date":[
"2017-08-29T03:05:27.000Z"
],
"tags":[
"tag3",
"tag1"
]
}
}
]

Again, just to make it more readable, I removed some control fields that were not useful for this example. As you can see, we have 7 hits, each with an array of “tags”. Notice that all of the hits fall, as expected, inside the requested date range.

Response (Part corresponding to the aggregation):

"aggregations":{  
"status":{
"doc_count_error_upper_bound":0,
"sum_other_doc_count":0,
"buckets":[
{
"key":"tag1",
"doc_count":7
},
{
"key":"tag3",
"doc_count":4
},
{
"key":"tag2",
"doc_count":3
}
]
}
}

As you can see, the aggregation was calculated successfully as well, counting the number of hits corresponding to each possible tag.

What about the interaction between them?

Even though Elasticsearch provides you with awesome queries and aggregations to use your data in all sorts of ways, as I stated at the beginning, it is not a database! Even if you use Elasticsearch as your query API, you still need a proper database as the safe, persistent copy of your data. If anything goes wrong with an index, the usual advice is to delete it and re-index, and to be able to do that, you need your data persisted elsewhere.

If you have decided that you want to use Elasticsearch, that usually means the amount of data you are dealing with is not small, and since Elasticsearch’s biggest advantage is querying, you will not be querying your database that much. For these reasons, Cassandra pops up as a suitable candidate: you can take full advantage of its fast write speed and minimize its read penalty, since you will be doing most, if not all, of your reads through Elasticsearch.

As for how the interaction between them should work in your project, that depends entirely on what you want to do with the data. One thing is certain: you will want to save your data in Cassandra and also index in Elasticsearch the parts of it that you need to query.

A more practical example

Let us assume you receive events, with the following structure:

{
  "id": 1234,
  "status": "some status",
  "message": "some message",
  "field1": "",
  ...
  "field50": ""
}

The first thing you should consider is which of these fields are worth indexing and which are not. To do this, you must think ahead to the queries you are likely to run over this data. Every field you choose to index in Elasticsearch should be there because there is a query that uses it. Here, let us assume that id, status and message will be the ones that get indexed.
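
As a sketch, creating such an index with an explicit mapping could look like the following. The index name and field types are assumptions, and the exact mapping syntax depends on your Elasticsearch version:

    # Sketch: create an 'events' index that maps only the fields we plan to query.
    # The typeless mapping syntax shown here is the one used by Elasticsearch 7.x+.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(['http://localhost:9200'])

    es.indices.create(
        index='events',                            # hypothetical index name
        body={
            'mappings': {
                'properties': {
                    'id': {'type': 'long'},
                    'status': {'type': 'keyword'},   # exact matches and aggregations
                    'message': {'type': 'text'},     # full-text search
                }
            }
        },
    )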

Once you have figured out which fields will be indexed, you should also design your model for Cassandra. For this example, we will keep all of the fields in Cassandra. Remember that your data in Cassandra should be enough to re-index everything should you need to do so; this means that, at the very least, you will need to save in Cassandra the same fields that you have chosen to index.
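
Putting the write path together, a rough sketch could look like this. The keyspace is assumed to already exist, and only a couple of the fifty fields are shown:

    # Write-path sketch: Cassandra keeps every field (it is the source of truth
    # for re-indexing), Elasticsearch receives only the queryable subset.
    from cassandra.cluster import Cluster
    from elasticsearch import Elasticsearch

    session = Cluster(['127.0.0.1']).connect('events_ks')   # hypothetical keyspace
    es = Elasticsearch(['http://localhost:9200'])

    session.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id int PRIMARY KEY,
            status text,
            message text,
            field1 text,
            field50 text
        )
    """)

    def save_event(event):
        # Persist the whole event in Cassandra.
        session.execute(
            "INSERT INTO events (id, status, message, field1, field50) "
            "VALUES (%s, %s, %s, %s, %s)",
            (event['id'], event['status'], event['message'],
             event['field1'], event['field50']),
        )
        # Index only the fields chosen for querying, keyed by the same id.
        es.index(
            index='events',
            id=event['id'],
            body={'id': event['id'], 'status': event['status'],
                  'message': event['message']},
        )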

When it comes to querying, you will always want to prioritize Elasticsearch. But what happens if you need to display field1, field2 and field3 for the results of a query? These fields are not indexed, so to access them you need to go to Cassandra; but as I stated before, querying Cassandra can be slow, and you may not even be able to express the query you want in CQL. One solution to this particular problem is to issue the query through Elasticsearch and, with the result, access your data in Cassandra directly by ID (a primary-key lookup, which is much faster than an arbitrary Cassandra query in most cases).
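
Here is a rough sketch of that read path, reusing the hypothetical names from the write-path example above:

    # Read-path sketch: Elasticsearch answers the complex query and returns ids,
    # Cassandra serves the full rows through cheap primary-key lookups.
    from cassandra.cluster import Cluster
    from elasticsearch import Elasticsearch

    session = Cluster(['127.0.0.1']).connect('events_ks')   # hypothetical keyspace
    es = Elasticsearch(['http://localhost:9200'])

    # 1. Run the rich query in Elasticsearch, asking only for the ids.
    hits = es.search(
        index='events',
        body={
            'query': {'match': {'message': 'some message'}},
            '_source': ['id'],
            'size': 100,
        },
    )['hits']['hits']
    ids = [hit['_source']['id'] for hit in hits]

    # 2. Fetch the non-indexed fields from Cassandra, one lookup per id.
    rows = [
        session.execute(
            'SELECT id, field1, field2, field3 FROM events WHERE id = %s',
            (event_id,),
        ).one()
        for event_id in ids
    ]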

Conclusions

I find that this use case of Elasticsearch + Cassandra really shows how two technologies that at first sight might be seen as “two things for the same purpose” are, in reality, not. This realization, which I came across while working on a project that featured both of these technologies, encouraged me to keep an open mind when thinking about which technologies to use in different situations or projects, and about how to combine their strengths and weaknesses without overlap to ultimately get a better result.

If you end up deciding to use Cassandra and Elasticsearch, I encourage you to look at their respective documentation, since I have only touched the tip of the iceberg in this post.

Visit us!
