I’m going to skip the intro about Elasticsearch and its primary application (search) and go straight to the point.
If you need a basic understanding of what Elasticsearch is and how to use it, I suggest you start with the official documentation, which is some of the best software documentation I’ve read.
Important notice: Everything written below is based on Elasticsearch 6.x (6.4.2). If you are using a different major version (especially a lower one), some of what follows may not apply, so do your own checks.
Basic components
Let’s start with some basic components of the Elastic ecosystem.
Index
If you try to understand Elastic components in RDBMS terms (which is not really the right thing to do), the Index is your “database”. Before you can put any data into ES, you need an Index, either created up front or created automatically by ES during data insertion.
Once your Index exists, you can store information in the form of “Documents”, which are JSON objects containing your data.
As with an RDBMS, your Index has a Schema, called Mappings, which tells ES how to store your data in the right (or maybe the wrong) way, so you can search it the right (or maybe the wrong) way later.
It is good practice to design your Index Mappings at the beginning, before you put any data into your index. As with most things in ES, mappings can be created dynamically by Elastic during data insertion, but then you have to pray that Elastic recognizes your data correctly.
Under the hood, each index is made up of multiple Shards, which are the storage containers for your data. Each Shard is actually an (Apache) Lucene index.
Document (the core data containing unit)
The ‘Document’ in Elastic is like a ‘row’ in an RDBMS. It is your actual data entry inside Elastic.
The document is always a JSON object, whose structure depends on what you have inserted.
Each document consists of two things:
- Document meta-data
- Document body (the actual data)
Both of these live inside the same JSON structure.
If we insert a simple document into our test index (ourindex) with the following call:
# Insert document with id: 1 into "ourindex" index
PUT ourindex/_doc/1
{
  "user" : "someuser",
  "nickname" : "some_nickname"
}
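If you are not working from the Kibana Dev Console, the same call can be sketched in plain Python. This is only a minimal illustration: the address in `ES_URL` and the helper name `build_index_request` are my assumptions, and the request is built but not sent, so it runs without a cluster.

```python
import json
import urllib.request

# Assumption: Elasticsearch listens on this address (not stated in the article).
ES_URL = "http://localhost:9200"


def build_index_request(index, doc_id, body, base_url=ES_URL):
    """Build (but do not send) the PUT request that indexes one document."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc/{doc_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request(
    "ourindex", 1, {"user": "someuser", "nickname": "some_nickname"}
)
print(request.get_method(), request.full_url)

# To actually send it against a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```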
Then we can retrieve our newly inserted document:
# This is to be executed inside Kibana Dev Console
GET ourindex/_doc/1

Result
------
{
  "_index": "ourindex",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "user": "someuser",
    "nickname": "some_nickname"
  }
}
So let’s decode what we have:
- Document meta data – The underscore-prefixed fields are meta data added by Elasticsearch during the document indexing process. Each document has “_index”, “_type”, “_id”, “_version” and “_source” meta fields.
_index – the name of the index the document is part of
_type – the type of the document (a concept that is being deprecated starting with ES 7)
_id – the unique ID of our document
_version – a field used for versioning, for when we index different data under the same document id
_source – always the container of the actual data we have inserted
- The actual data – Everything inside “_source”. This is the data we inserted in the first place.
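To make the split between meta data and actual data concrete, here is a small Python sketch that separates the two in the GET response shown earlier. The helper name `split_document` is mine, not part of any Elasticsearch client.

```python
import json

# The response of GET ourindex/_doc/1, copied from the example in this post.
raw_response = """
{
  "_index": "ourindex",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {"user": "someuser", "nickname": "some_nickname"}
}
"""


def split_document(response):
    """Return (meta_data, body) for a parsed GET response."""
    # Meta fields start with an underscore; "_source" holds the body itself.
    meta = {
        key: value
        for key, value in response.items()
        if key.startswith("_") and key != "_source"
    }
    body = response.get("_source", {})
    return meta, body


meta, body = split_document(json.loads(raw_response))
print("meta data:", meta)
print("actual data:", body)
```

Note that “found” ends up in neither part: it describes the result of the lookup, not the document.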
Shard (so-called Primary Shard)
As stated earlier, a Shard is the logical representation of an Apache Lucene index, and is the main building block of an Elastic Index.
By default, ES creates your Index with 5 shards if you don’t provide this setting during Index creation.
Important notice: Once your Index is created, you cannot change the number of Primary shards. If you want to modify the number of shards your index has, you will need to re-create your index.
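Since the number of primary shards is fixed at creation time, it pays to set it explicitly. Below is a minimal Python sketch that builds the request body for index creation using the `number_of_shards` and `number_of_replicas` settings. The helper name and the values 3 and 1 are illustrative assumptions only.

```python
import json


def index_settings(primary_shards, replicas_per_primary):
    """Build the body for an index creation request (PUT <index name>)."""
    if primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas_per_primary < 0:
        raise ValueError("the number of replicas cannot be negative")
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_primary,
        }
    }


# Hypothetical values: 3 primary shards with 1 replica each.
settings_body = index_settings(3, 1)
print("PUT ourindex")
print(json.dumps(settings_body, indent=2))
```

The printed output can be pasted into the Kibana Dev Console as is.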
Each shard is able to store up to 2,147,483,519 documents. Knowing this limitation, you should plan carefully the number of Shards your index is going to have.
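As a rough sketch, the per-shard document limit gives you a hard lower bound on the number of primary shards. The function name is hypothetical, and this bound looks at document count only; in practice data size and query load will usually push you to more shards long before you hit it.

```python
# Per-shard document limit quoted in this post.
MAX_DOCS_PER_SHARD = 2_147_483_519


def min_primary_shards(expected_docs):
    """Smallest number of primary shards that can hold expected_docs documents."""
    if expected_docs < 0:
        raise ValueError("expected_docs cannot be negative")
    # Integer ceiling division; an index always has at least one shard.
    return max(1, -(-expected_docs // MAX_DOCS_PER_SHARD))


for docs in (1_000_000, 2_147_483_519, 2_147_483_520, 5_000_000_000):
    print(docs, "documents ->", min_primary_shards(docs), "primary shard(s) at minimum")
```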
Also, if you plan to run an Elastic cluster with multiple machines, you may want to choose your Shard number in relation to the number of cluster nodes. Elastic will try to place each Shard on a different cluster node, for both redundancy and performance.
Deciding the right number of shards to use
There are a lot of things to consider when choosing the right number of primary (P) shards for your index. Most of the time this is tightly related to the work you are going to do with this index, such as:
- How much information you are going to store inside your index
- How fast / parallel you want ES to be able to write the data
- How long you are willing to wait while ES relocates your shards (for example, when a new cluster node is added)
- How fast / parallel you want to read data (this depends also on the replica shards)
Most of the time you can stick to the ES default configuration (5 P shards per Index).
Another rule of thumb is to use as many primary shards as there are nodes in your cluster.
I suggest you also read the following article when considering the number of Shards:
https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
Replica Shard
By default, ES creates your Index not only with five Primary Shards, but also with one replica shard for each primary shard. So in total you will have 10 shards (5 Primary + 5 Replicas).
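The arithmetic generalizes to any configuration. A tiny Python sketch (the function name is mine), using the 6.x defaults described here:

```python
def total_shards(primary_shards=5, replicas_per_primary=1):
    """Total shard copies in an index: every primary plus all of its replicas."""
    return primary_shards * (1 + replicas_per_primary)


print(total_shards())      # the defaults: 5 primaries + 5 replicas
print(total_shards(3, 2))  # 3 primaries, each with 2 replicas
```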
As the name suggests, a Replica Shard is a read-only copy (similar to MySQL Master-Slave replication) of your Primary Shard.
Replica Shards exist for two main reasons:
- Redundancy
- Performance
Redundancy
When you use Elasticsearch in a cluster configuration, Elastic makes sure that each Replica Shard physically resides on a cluster node different from the node holding the corresponding Primary Shard.
If a cluster node containing a primary shard goes down, ES will automatically promote a replica shard to primary.
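Here is a toy Python model of that failover behavior. It is not how Elasticsearch is implemented: the data layout, the node names and the function `fail_node` are all made up for illustration, and a real cluster would also try to rebuild the missing replicas on the surviving nodes.

```python
def fail_node(allocation, failed_node):
    """Return the allocation after failed_node goes down.

    allocation maps shard id -> {"primary": node, "replicas": [nodes]}.
    If a shard loses its primary, its first surviving replica is promoted.
    """
    result = {}
    for shard, copies in allocation.items():
        primary = copies["primary"]
        replicas = [node for node in copies["replicas"] if node != failed_node]
        if primary == failed_node:
            if not replicas:
                raise RuntimeError(f"shard {shard} is lost: no replica to promote")
            primary, replicas = replicas[0], replicas[1:]
        result[shard] = {"primary": primary, "replicas": replicas}
    return result


# Hypothetical two-node cluster: each replica sits on a different node
# than its primary, as described in this section.
allocation = {
    0: {"primary": "node-a", "replicas": ["node-b"]},
    1: {"primary": "node-b", "replicas": ["node-a"]},
}

after_failure = fail_node(allocation, "node-a")
print(after_failure)
```

After `node-a` fails, shard 0 is served by the promoted copy on `node-b`, and no data is lost. Without replicas, the same failure would raise the error instead.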
Performance
Having a copy of each Primary Shard gives ES the ability to use both shards when searching/reading data.
Good practice: Consider this as well while creating your Index, and make sure you have the right number of replica shards for your redundancy/performance needs.
Important notice: ES lets you read/search from replica shards even if they are not fully synchronized with the primary shard. This means there is a chance of receiving a result that is not fully up to date when ES serves your search from a replica that has not caught up yet.
The good of having more replica shards
With more replica (R) Shards you can load-balance your read/search traffic, because ES can read from more copies of your data.
The bad of having more replica shards
As with everything in life, Replica Shards do not come for free!
Write overhead – Each write operation on your P Shard results in the same write operation on all of its Replica shards. This means each additional replica shard adds 100% to your write operations. With 5 replicas, you will do 6 write operations instead of 1 for each chunk of data you store in your Index. The same applies to all modifying operations: write / delete / update.
Disk space consumption – A Replica shard contains a mirror of its Primary shard’s data. So if your P Shard contains 1G of documents, so will each of its replica shards. Having 1G of documents in an Index with 5 replicas means spending an extra 5 times the space of your actual data (6G in total).
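Both costs follow from the same simple arithmetic, sketched below in Python. The function name is mine, and the model only counts copies; it ignores everything else that affects real write throughput and disk usage.

```python
def replica_cost(primary_data_gb, replicas):
    """Estimate the write and disk cost of keeping `replicas` copies of each primary."""
    if replicas < 0:
        raise ValueError("replicas cannot be negative")
    return {
        # One write on the primary plus one on every replica.
        "writes_per_operation": 1 + replicas,
        # Disk used by the replicas alone.
        "extra_disk_gb": primary_data_gb * replicas,
        # Primaries and replicas together.
        "total_disk_gb": primary_data_gb * (1 + replicas),
    }


# The example from this section: 1G of documents and 5 replicas.
cost = replica_cost(1, 5)
print(cost)
```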
Index Mapping (Index Schema)
There are so-called ‘field datatypes’, which define the type of data a field contains.
By ‘field’ we mean the value of a key inside an indexed JSON Document.
If you have a JSON document like this:
{
"Name" : "Some Name",
"Age" : 15
}
Each of the fields “Name” and “Age” will have its own datatype (part of the Index Mapping).
You either pre-define your mappings during Index creation, or Elasticsearch will dynamically assign a mapping to each field the first time a Document containing that field is inserted.
If you let Elasticsearch choose the mapping for you, always double-check that the right mappings have been chosen for your scenario. You can easily check an Index’s mappings by issuing this ES query:
GET /my_test_index/_mapping?pretty
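To double-check what ES picked, you can reduce the mapping response to a simple field-to-type table. The Python sketch below works on a hand-written sample in the 6.x response layout (index name, then “mappings”, then the document type). The sample shows what I would expect dynamic mapping to produce for the “Name”/“Age” document, so treat it as an illustration rather than captured output.

```python
# Hand-written sample in the 6.x layout; not captured from a real cluster.
sample_response = {
    "my_test_index": {
        "mappings": {
            "_doc": {
                "properties": {
                    "Name": {
                        "type": "text",
                        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
                    },
                    "Age": {"type": "long"},
                }
            }
        }
    }
}


def field_types(mapping_response, index, doc_type="_doc"):
    """Map each top-level field name to its datatype."""
    properties = mapping_response[index]["mappings"][doc_type]["properties"]
    # Object fields carry no explicit "type" key, only nested "properties".
    return {name: spec.get("type", "object") for name, spec in properties.items()}


types = field_types(sample_response, "my_test_index")
for name, datatype in types.items():
    print(f"{name}: {datatype}")
```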
Currently, the field data types you can choose from include:
basic: text, keyword, date, long, double, boolean, ip
hierarchy support: object, nested
special types: geo_point, geo_shape, completion
Important notice: Once a field mapping is created, you cannot change it! If you want to change any of your mappings, you will need to re-create your Index!
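Because of this, it is worth defining mappings explicitly when you create the Index. Here is a minimal Python sketch that builds such a request body for the “Name”/“Age” document. It uses the 6.x layout, where mappings are nested under the document type (“_doc”). The helper name and the chosen datatypes (“text” and “long”) are my assumptions, so pick whatever fits your data.

```python
import json


def create_index_body(properties, doc_type="_doc"):
    """Build an index creation body with explicit field mappings (6.x layout)."""
    return {"mappings": {doc_type: {"properties": properties}}}


# Assumed datatypes for the example document; adjust to your own data.
mapping_body = create_index_body(
    {
        "Name": {"type": "text"},
        "Age": {"type": "long"},
    }
)

print("PUT my_test_index")
print(json.dumps(mapping_body, indent=2))
```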
I have shared some experience about choosing the right index field mappings.