How ElasticSearch Works (Basic Concepts)

Published on Author gryzli

I’m going to skip the intro about Elasticsearch and its primary application (which is search) and go straight to the point.

If you need a basic understanding of what Elasticsearch is and how to use it, I suggest starting with the official documentation, which is some of the best software documentation I’ve read.

 

Important notice: Everything written below is based on ElasticSearch 6.x (6.4.2). If you are using a different major version (especially an older one), some of it may not apply, so check against your own version.

 

Let’s first start with some

Basic components

in the elastic ecosystem.

 

Index

If you try to understand Elastic components by analogy with an RDBMS (which is actually not the right thing to do), the Index is your “database”. Before you can put any data into ES, an Index must exist: either you create it first, or you let ES create it for you automatically during the data insertion process.

 

Once your Index is created, you can store information in it in the form of “Documents”, which are JSON objects containing your data.

As in an RDBMS, your Index is going to have a Schema, called Mappings, which tells ES how to store your data in the right (or maybe the wrong) way, so that you can search it the right (or maybe the wrong) way later.

 

A good practice is to design your Index Mappings up front, before you put any data into your index. As with most things in ES, your mappings can be created dynamically by Elastic during data insertion, but then you have to hope that Elastic recognizes your data the right way.

Under the hood, each index consists of one or more Shards, which are the storage containers for your data. A Shard is actually an (Apache) Lucene index.

 

Document  (the core data containing unit)

The ‘Document’ inside Elastic is like the ‘row’ inside an RDBMS. It is your actual data entry inside Elastic.

The document is always a JSON object, the structure of which depends on what you have inserted.

Each document consists of two things:

  1. Document meta-data
  2. Document body (the actual data)

Both of these live inside the same JSON structure.

If we insert a simple document into our test index (ourindex) with the following call:

# Insert document with id: 1 into "ourindex" index
PUT ourindex/_doc/1
{
   "user" : "someuser",
   "nickname" : "some_nickname"
}

 

 

Then we can retrieve our newly inserted document

# This is to be executed inside Kibana Dev Console 
GET ourindex/_doc/1

Result
------
{
   "_index": "ourindex",
   "_type": "_doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "user": "someuser",
      "nickname": "some_nickname"
   }
}

 

So let’s decode what we have:

  1. Document metadata – The fields whose names start with an underscore are metadata added by ElasticSearch during the document indexing process. Each document has the “_index”, “_type”, “_id”, “_version” and “_source” meta fields (“found” only tells you whether the lookup succeeded).
    _index   – the name of the index the document belongs to
    _type    – the type of the document (a concept that is going to be deprecated starting with ES 7)
    _id      – the unique ID of our document
    _version – a version counter, which changes when we write data again under the same document ID
    _source  – always the container of the actual data we have inserted
  2. The actual data – Everything inside “_source”. This is the data we inserted in the first place.
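To make the split between metadata and data concrete, here is a minimal Python sketch that takes the response shown above and separates the two parts. The helper name `split_document` is made up for this illustration; it is not part of any Elasticsearch client.

```python
import json

# Response copied from the GET ourindex/_doc/1 call above.
raw_response = """
{
  "_index": "ourindex",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {"user": "someuser", "nickname": "some_nickname"}
}
"""


def split_document(response):
    """Return (metadata, body) for a GET-by-id response dict."""
    # Everything except _source is treated as metadata here. Note that
    # "found" is a flag on the response rather than a stored meta field.
    metadata = {key: value for key, value in response.items() if key != "_source"}
    body = response.get("_source", {})
    return metadata, body


metadata, body = split_document(json.loads(raw_response))
print("meta:", metadata)
print("data:", body)
```
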

 

Shard (so-called Primary Shard)

As we stated earlier, a Shard is an Apache Lucene index, and it is the main building block of an Elastic Index.

By default, ES creates your Index with 5 primary shards if you don’t provide this setting during the Index creation process.

Important notice: Once your Index is created, you cannot change its number of Primary shards. If you want a different number of shards, you will need to re-create your index (and index your data again).
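The usual explanation for this restriction is document routing: Elasticsearch picks the shard for a document from a hash of its routing value (the document ID by default), taken modulo the number of primary shards. The sketch below only illustrates that idea. `pick_shard` is a hypothetical helper, and CRC32 is a stand-in for the hash function Elasticsearch really uses.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Simplified routing: hash(routing value) modulo the primary shard count."""
    # Stand-in hash for illustration only; Elasticsearch uses its own hash.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


doc_ids = [str(i) for i in range(1, 11)]
with_5 = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
with_6 = {doc_id: pick_shard(doc_id, 6) for doc_id in doc_ids}

# Documents whose computed shard changes when the shard count changes.
moved = [doc_id for doc_id in doc_ids if with_5[doc_id] != with_6[doc_id]]
print("documents that would be looked up on a different shard:", moved)
```

If the shard count could change in place, documents already stored would be looked up on the wrong shard, which is why the data has to be indexed again.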

Each shard can store up to 2,147,483,519 documents. Knowing this limit, you should carefully plan the number of Shards your index is going to have.
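As a quick sanity check, the per-shard limit gives a hard lower bound on the number of primary shards for a given document count. The function below is a small sketch with a made-up name. It only covers this one limit and says nothing about a sensible shard size for performance.

```python
MAX_DOCS_PER_SHARD = 2_147_483_519  # per-shard document limit quoted above


def min_primary_shards(expected_documents):
    """Smallest number of primary shards that can hold the documents at all."""
    if expected_documents < 0:
        raise ValueError("expected_documents must not be negative")
    # Integer ceiling division; an index always has at least one primary shard.
    needed = -(-expected_documents // MAX_DOCS_PER_SHARD)
    return max(1, needed)


for documents in (1_000_000, 2_147_483_519, 2_147_483_520, 10_000_000_000):
    print(documents, "documents ->", min_primary_shards(documents), "primary shard(s)")
```
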

Also, if you plan to run an Elastic Cluster with multiple machines, you may want to choose your number of Shards in relation to the number of cluster nodes. Elastic will try to place the Shards on different cluster nodes, for both redundancy and performance reasons.

 

Deciding the right number of shards to use 

There are a lot of things to consider when you try to choose the right number of primary (P) shards for your index. Most of the time it is tightly related to the work you are going to do with this index, for example:

  • How much information you are going to store inside your index
  • How fast / parallel you want ES to be able to write the data
  • How much time you will want to wait when ES relocates your shard (during addition of a new cluster node)
  • How fast / parallel you want to read data (this depends also on the replica shards)

 

Most of the time you can stick to the ES default configuration (5 P shards per Index).

Another rule of thumb is to use as many primary shards as the number of nodes you have in your cluster.

I also suggest reading the following article when considering the number of Shards:

https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

 

Replica Shard

By default, ES will not only create your Index with five Primary Shards, but will also create one replica shard for each of the primary shards. So in total you will have 10 shards (5 Primary + 5 Replicas).
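The shard arithmetic can be sketched in a few lines of Python. The script builds the body of an index-creation request with the `number_of_shards` and `number_of_replicas` settings and computes the resulting total. The helper names are invented for this example, and the values simply spell out the 6.x defaults described above.

```python
import json


def index_settings(primary_shards=5, replicas_per_primary=1):
    """Body for an index-creation request with explicit shard settings."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_primary,
        }
    }


def total_shards(primary_shards, replicas_per_primary):
    """Each primary shard is counted once, plus once per replica."""
    return primary_shards * (1 + replicas_per_primary)


settings_body = index_settings()
print("PUT ourindex")
print(json.dumps(settings_body, indent=2))
print("total shards:", total_shards(5, 1))
```

The printed body can be pasted under a `PUT ourindex` line in the Kibana Dev Console when creating a new index.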

As the name suggests, a Replica Shard is a read-only copy of your Primary Shard (similar to MySQL Master-Slave replication).

 

Replica Shards exist for two main reasons:

  • Redundancy
  • Performance

 

Redundancy  

When you are using ElasticSearch in a cluster configuration, Elastic makes sure that each Replica Shard physically resides on a different Cluster Node than its corresponding Primary Shard.

If a cluster node holding a primary shard goes down, ES will automatically promote the replica shard to a primary shard.

 

Performance 

Having a copy of each Primary Shard gives ES the ability to use both shards when searching/reading data.

 

Good practice: This is another thing to consider while creating your Index; make sure you have the right number of replica shards for your redundancy/performance needs.

Important notice: ES lets you read/search from replica shards even if they are not fully synchronized with the primary shard. This means there is a chance of receiving results that are not fully up to date, when ES serves your search from a replica that has not caught up yet.

 

The good of having more replica shards

By having more R (replica) Shards you can load-balance your reads/searches, because ES is able to read from more copies of your data.

 

The bad of having more replica shards

As with everything in life, Replica Shards do not come for free!

Write overhead – Each write operation on your P Shard results in the same write operation on all of its Replica shards, so each additional replica adds another 100% to your write operations. Having 5 replicas means you will do 6 write operations instead of 1 for each chunk of data you store in your Index. The same applies to all modifying operations: write / delete / update.

 

Disk space consumption – A Replica shard contains a mirror of its Primary shard’s data, which means that if your P Shard contains 1G of documents, so will each of its replica shards. Having 1G of documents in an Index with 5 replicas means spending 5 times the size of your actual data on copies (6G on disk in total).
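Both costs follow from the same multiplier, the number of copies (1 primary + N replicas). The small calculator below is only a sketch of the arithmetic in the two paragraphs above. The function name and the returned keys are made up for illustration.

```python
def replica_cost(primary_data_gb, replicas, writes=1):
    """Write operations and disk usage for one primary plus its replicas."""
    copies = 1 + replicas  # the primary itself plus every replica
    return {
        "write_operations": writes * copies,
        "total_disk_gb": primary_data_gb * copies,
        "extra_disk_gb": primary_data_gb * replicas,
    }


# The example from the text: 1G of documents and 5 replicas.
print(replica_cost(primary_data_gb=1, replicas=5))
```
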

 

Index Mapping (Index Schema)

There are so-called ‘field datatypes’, which define the type of data a field contains.

By ‘field’ we mean a key (and its value) inside an indexed JSON Document.

If you have a JSON document like this:

{
   "Name" : "Some Name",
   "Age" : 15
}

Each of the fields “Name” and “Age” will have its own datatype (part of the Index Mapping).

You either pre-define your mappings during the Index creation process, or ElasticSearch dynamically assigns a mapping to each field the first time it sees that field in an inserted Document.
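As an illustration of pre-defining mappings, the sketch below builds an index-creation body for the “Name” and “Age” fields in the 6.x format, where the mapping is nested under the type name (`_doc`). The chosen datatypes (`text` and `long`) and the index name are assumptions for this example, not a recommendation.

```python
import json

# Explicit mappings for the example document, in the ES 6.x layout.
mapping_body = {
    "mappings": {
        "_doc": {  # 6.x nests the properties under the type name
            "properties": {
                "Name": {"type": "text"},  # assumed: full-text searchable
                "Age": {"type": "long"},   # assumed: whole numbers
            }
        }
    }
}

print("PUT my_test_index")
print(json.dumps(mapping_body, indent=2))
```
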

If you let ElasticSearch choose the mappings for you, always double-check that the right ones have been chosen for your scenario. You can easily check an Index’s mappings by issuing this ES query:

GET /my_test_index/_mapping?pretty

 

 

Currently, the field datatypes you can choose from include:

basic: text, keyword, date, long, double, boolean, ip

hierarchy support: object, nested

special types: geo_point, geo_shape, completion

 

 

 

Important notice: Once the mapping for a field is created, you cannot change it! If you want to change any of your existing mappings, you will need to re-create your Index!
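One common way to re-create an index is to create a new index with the corrected mappings and copy the documents over with the `_reindex` API. The sketch below only prints the two requests involved. The function name, the target index name `my_test_index_v2` and the new datatype are hypothetical, and cleaning up the old index afterwards is left out.

```python
import json


def reindex_plan(old_index, new_index, new_mappings):
    """Return the (method, path, body) requests for a re-create and copy."""
    return [
        # 1. Create the new index with the corrected mappings.
        ("PUT", new_index, {"mappings": new_mappings}),
        # 2. Copy all documents from the old index into the new one.
        ("POST", "_reindex", {
            "source": {"index": old_index},
            "dest": {"index": new_index},
        }),
    ]


# Hypothetical change: store "Age" as a keyword in the new index.
new_mappings = {"_doc": {"properties": {"Age": {"type": "keyword"}}}}

for method, path, request_body in reindex_plan("my_test_index", "my_test_index_v2", new_mappings):
    print(method, path)
    print(json.dumps(request_body, indent=2))
```
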

 

I have shared some experience about choosing the right index field mappings.

Useful resources

How To Install ElasticSearch On Centos7

How To Install Kibana