ElasticSearch Choosing Field Mappings

Author: gryzli

If you want to have a good time with ElasticSearch, you must choose your index field mappings very carefully. Proper field mappings are essential for searching your data correctly.

Keep in mind that ElasticSearch differs a lot between major versions. This article is written for the stable ES version at the time of writing, which is 6.5.x.

Difference Between Keyword And Text DataTypes For Storing Your Data

Before going into some real-life examples for using different data-types, there are some things to know about the most used ES data types for storing data.

 

When to use “keyword” field datatype in ElasticSearch

Keep the following properties in mind when deciding whether to use keyword:

  • Keywords are stored as-is (not analyzed) inside the Lucene index
  • Keywords can be used for filtering and aggregations. That is really important if you try to do things like “GROUP BY” or “WHERE id='5'”
  • Keywords can be searched with term-level queries like: term, range, exists, regexp, wildcard

Information about all the term-level queries supported by ES can be found here:

https://www.elastic.co/guide/en/elasticsearch/reference/current/term-level-queries.html

 

Summary: If you are not going to use the field for full-text search and you won't store very big chunks of data inside it, keyword is the datatype to go with.

 

When to use “text” field datatype in ElasticSearch

The text datatype should be used when you plan to do full-text search on the field.

It can also be used for storing big chunks of information, like multiline logs, text files or scripts, that you still want to be able to search.
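To make the difference more concrete, here is a minimal Python sketch of how the same value ends up as index terms under each datatype. The function names are made up for this illustration, and text_terms is only a rough approximation of analysis (lowercase, then split on non-word characters), not the actual ElasticSearch analyzer.

```python
import re


def keyword_terms(value):
    """keyword: the whole value is indexed as one unmodified term."""
    return [value]


def text_terms(value):
    """text: the value is analyzed into separate terms.

    Rough approximation only: lowercase, then split on non-word characters.
    """
    return re.findall(r"\w+", value.lower())


value = "Connection refused by Remote Host"
print(keyword_terms(value))  # a single term, exactly as given
print(text_terms(value))     # several lowercase terms
```

With a keyword field only the exact full string matches a term query, while with a text field a full-text query for a single word such as "refused" can find the document.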

 

How To Store IP Addresses Inside ElasticSearch

Many people use ES for storing network-related data (logs, metrics or something similar), which will almost certainly contain IP addresses.

You can store an IP address as keyword, text or ip datatype, but most of the time you will choose either the ip datatype or the keyword datatype.

 

Storing IP Addresses As Keyword Datatype

The advantage of storing your IP inside a keyword field is that you can execute queries that use regexp or wildcards, like:

 

# Give me all addresses beginning with 192.168. or 192.169.
# The dots are escaped, because an unescaped "." matches any character

# Assuming your index is named: test_index 
# Assuming your ip field is "client_ip"
POST /test_index/_search
{
  "query": {
    "regexp": {
      "client_ip": "192\\.16[89]\\..*"
    }
  }
}

 

…or with wildcard…

# Give me all addresses beginning with 10.

POST /test_index/_search
{
  "query": {
    "wildcard": {
      "client_ip": "10.*"
    }
  }
}

 

 

Storing IP Addresses As IP Datatype

If you store your IP addresses as the ip datatype, you get the following benefits:

  • Elastic validates that each value is a valid IPv4 / IPv6 address
  • You can search for IP addresses by CIDR
  • Faster search times than keyword

 

If you used ip for your mapping, you can issue the following query:

# Assuming your index is named: test_index
# Assuming your ip field is "client_ip"
POST /test_index/_search
{
  "query": {
    "term": {
      "client_ip": "1.1.2.0/17"
    }
  }
}
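To see which addresses a CIDR such as 1.1.2.0/17 covers, here is a small sketch using Python's standard ipaddress module. It only illustrates the CIDR arithmetic locally and does not talk to ElasticSearch; the helper name matches_cidr and the sample addresses are made up for the example.

```python
import ipaddress


def matches_cidr(ip, cidr):
    """Return True if ip falls inside the cidr range."""
    # strict=False accepts a CIDR whose host bits are set, like 1.1.2.0/17
    network = ipaddress.ip_network(cidr, strict=False)
    return ipaddress.ip_address(ip) in network


network = ipaddress.ip_network("1.1.2.0/17", strict=False)
print(network)                  # normalized network: 1.1.0.0/17
print(network[0], network[-1])  # first and last address: 1.1.0.0 1.1.127.255

for ip in ["1.1.2.10", "1.1.127.255", "1.1.128.1"]:
    print(ip, matches_cidr(ip, "1.1.2.0/17"))
```

A /17 prefix fixes the first 17 bits, so the range runs from 1.1.0.0 to 1.1.127.255, and 1.1.128.1 falls just outside it.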

 

Storing IP Addresses As Both IP/Keyword

This combines the two options above. By having your IP as both keyword and ip, you get the benefits of both datatypes. The only drawback is that each record will take a bit more storage.

By using the “fields” mapping setting (multi-fields), you can tell ElasticSearch to automatically index the value of one field a second time under a different mapping type.

There is no need to create two separate top-level fields for the different types.

Here is what such a definition looks like (inside Kibana):

# This creates mapping for my_test_index
# When you import IP inside "ip" field ,
# it will be automatically copied into "ip.keyword" as keyword type

PUT my_test_index/_mapping/_doc 
{
  "properties": {
    "ip": {
      "type": "ip",
      "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 128
          }
      }
    }
  }
}

With the mapping above, you can access your IP as the ip type (my_test_index->ip) or as keyword (my_test_index->ip.keyword).

 

 

How To Store Binary Files Inside ElasticSearch

If you are planning to store some binary data inside your ES, for example uploads, user files, images or things like that, you won't be able to store them in their original format.

Before storing binary content you need to encode it as a string, preferably in Base64 (the binary datatype expects a Base64 encoded string).

After you have encoded your binary content, you can store it in either a “text” or a “binary” datatype field. Note that a binary field is not searchable and is not stored by default.
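The encoding step can be sketched in Python with the standard base64 module. The field names "filename" and "content" are hypothetical, chosen only for this example; the resulting dictionary is the kind of JSON body you would then index.

```python
import base64
import json


def build_binary_document(filename, raw_bytes):
    """Return a JSON-serializable document with Base64 encoded content."""
    # b64encode returns bytes, so decode to get a plain string for JSON
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {"filename": filename, "content": encoded}


raw = b"\x00\x01binary\xff"
doc = build_binary_document("hello.bin", raw)
print(json.dumps(doc))

# Decoding the stored string gives back the original bytes
restored = base64.b64decode(doc["content"])
print(restored == raw)
```

Keep in mind that Base64 makes the payload roughly one third larger than the original bytes.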

 

More about the binary datatype can be found in the Elastic docs:

https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html

 

How To Store Geo Coordinates Inside ElasticSearch

If you plan to use Kibana visualizations to draw your data on a geo map, you need to have your geo coordinates stored in a field mapped with a special datatype called “geo_point”.

There is a very well written example of how to use geo_point in the Elastic docs:

https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-point.html
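As a minimal sketch, the mapping body and a matching document can be built in Python like this. The field name "location" and both helper functions are assumptions made for the illustration; the object form with "lat" and "lon" keys is one of the formats geo_point accepts.

```python
import json


def geo_point_mapping(field="location"):
    """Mapping body that declares the given field as geo_point."""
    return {"properties": {field: {"type": "geo_point"}}}


def geo_point_value(lat, lon):
    """Return a geo_point in object form after a basic range check."""
    if not -90 <= lat <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= lon <= 180:
        raise ValueError("longitude must be between -180 and 180")
    return {"lat": lat, "lon": lon}


print(json.dumps(geo_point_mapping(), indent=2))
print(json.dumps({"location": geo_point_value(42.6977, 23.3219)}))
```

The range check is there only to catch swapped or malformed coordinates before they are sent for indexing.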

 

How To Store Arrays Inside ElasticSearch

Currently there is no dedicated array datatype inside Elastic. Every field can hold multiple values in the form of an array, as long as all values in the array are of the same datatype.

Here is an example of this:

# This will insert 'data' with value as array 
# ..... 

PUT  my_test_index/_doc/1
{
  "data":["value11", "value22", "value33"] 
}

PUT  my_test_index/_doc/2
{
  "data":["value44", "value66", "value55"] 
}

 

Now if we search the index, we get the following:

# Search by one of the elements inside the array 

POST /my_test_index/_search
{
  "query": {
    "wildcard": {
      "data": "value4*"
    }
  }
}

---------
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_test_index",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "data": [
            "value44",
            "value66",
            "value55"
          ]
        }
      }
    ]
  }
}

 

What happens here is that Elastic stores each of the array values as a separate, fully functional value. So when you execute a search based on some criteria, the document matches if any one of its array values matches your criteria.
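The "any value matches" behaviour can be imitated locally with a few lines of Python. This is only a simulation of the idea using the two documents from above, with the standard fnmatch module standing in for the wildcard query; it is not how ElasticSearch is implemented, and the function name matching_ids is made up.

```python
from fnmatch import fnmatchcase

# The two documents indexed in the example above
documents = {
    "1": {"data": ["value11", "value22", "value33"]},
    "2": {"data": ["value44", "value66", "value55"]},
}


def matching_ids(docs, field, pattern):
    """Return ids of documents where at least one value matches pattern."""
    hits = []
    for doc_id, source in docs.items():
        values = source.get(field, [])
        # A single value is treated as an array with one element
        if not isinstance(values, list):
            values = [values]
        if any(fnmatchcase(str(value), pattern) for value in values):
            hits.append(doc_id)
    return hits


print(matching_ids(documents, "data", "value4*"))
```

Only document 2 is returned, because "value44" is the only value across both arrays that starts with "value4".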

 

How To Store Hash (Associative Arrays) Inside ElasticSearch

 

The easiest way to explain when you may need to use hashes or associative arrays inside Elastic is by example …of course.

 

Let's say you insert one document per HTTP request into your index. So you are actually putting something like this:

PUT /my_http_requests/_doc/1
{
  "request_url":"https://gryzli.info",
  "request_method":"POST", 
  "request_ip":"1.1.1.1"
}

 

Time passes, and you decide that you want to parse the “Cookie” HTTP header and store each of the cookies as a separate key/value pair inside your index.

Cookie headers look like this:

Cookie: name1=value1 ; name2=value2 ; name3=value3 ……

 

After parsing your Cookie field, it is very likely that you will end up with index inputs like these (at least I did):

# Add the parsed cookie fields to the index 

PUT /my_http_requests/_doc/1
{
  "request_url":"https://gryzli.info",
  "request_method":"POST", 
  "request_ip":"1.1.1.1",
  "cookie": {
     "$cookie_name1" : "$cookie_value1",
     "$cookie_name2" : "$cookie_value2",
     "$cookie_nameNN" : "$cookie_valueNN"
  }
}

 

“$cookie_name” is an arbitrary “name” extracted from your Cookie header.

Because $cookie_name is “random”, Elastic will apply dynamic field mapping for each new “$cookie_name” you insert in the index.

This design has one HUGE DRAWBACK/PROBLEM: it will lead you to the famous “ElasticSearch Mapping Explosion”.

There is a nice post on the Elastic blog about this:

https://www.elastic.co/blog/found-crash-elasticsearch

 

This happens because, over time, you add more and more new/unique fields to your index mapping, which at a certain point will fill up your node’s memory.

 

How To Prevent Mapping Explosion In ElasticSearch?

or the right way of storing hash/associative array structures inside ElasticSearch…

 

1) First create your field as “nested” data type

# Creating additional mapping for our cookie field 
# which is of type nested
PUT my_http_requests 
{
  "mappings": {
    "_doc": { 
      "properties": { 
        "request_url":        { "type": "keyword"  },
        "request_method":     { "type": "keyword"  }, 
        "request_ip":         { "type": "ip"       },  
        "cookie":             { "type": "nested"   }
      }
    }
  }
}

 

 

2) Put your data inside your nested structure

# Putting nested data 
PUT /my_http_requests/_doc/1
{
  "request_url":"https://gryzli.info",
  "request_method":"POST", 
  "request_ip":"1.1.1.1",
  "cookie": [
    {
      "name":"cookie_name1",
      "value":"cookie_value1"
    },
    {
      "name":"cookie_name2",
      "value":"cookie_value2"
    }
  ]
}

As you may have noticed already, the field names are always the same (name and value) and only the values inside them change. That is exactly what keeps the mapping from growing.
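The conversion from a raw Cookie header value to this nested structure could be sketched in Python as follows. The function name cookies_to_nested is made up for the example, and the parsing is deliberately simple (split on ";" and then on the first "="), so treat it as an illustration rather than a complete cookie parser.

```python
import json


def cookies_to_nested(cookie_header):
    """Turn 'a=1 ; b=2' into [{'name': 'a', 'value': '1'}, ...]."""
    pairs = []
    for part in cookie_header.split(";"):
        # partition splits on the first "=" only, so values may contain "="
        name, separator, value = part.strip().partition("=")
        if not separator or not name.strip():
            continue  # skip empty or malformed fragments
        pairs.append({"name": name.strip(), "value": value.strip()})
    return pairs


header_value = "name1=value1 ; name2=value2 ; name3=value3"
document = {
    "request_url": "https://gryzli.info",
    "request_method": "POST",
    "request_ip": "1.1.1.1",
    "cookie": cookies_to_nested(header_value),
}
print(json.dumps(document, indent=2))
```

However many distinct cookie names show up, the mapping only ever contains cookie.name and cookie.value.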

 

3) Finally, let’s try to search by cookie_name or cookie_value

# Searching by cookie_name
GET /my_http_requests/_search
{
  "query": {
    "nested": {
      "path": "cookie",
      "query": {
        "bool": {
          "must": [
            { "match": { "cookie.name": "cookie_name1" }}
          ]
        }
      }
    }
  }
}

----

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.18232156,
    "hits": [
      {
        "_index": "my_http_requests",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.18232156,
        "_source": {
          "request_url": "https://gryzli.info",
          "request_method": "POST",
          "request_ip": "1.1.1.1",
          "cookie": [
            {
              "name": "cookie_name1",
              "value": "cookie_value1"
            },
            {
              "name": "cookie_name2",
              "value": "cookie_value2"
            }
          ]
        }
      }
    ]
  }
}