
What is Elasticsearch?

You know, for search (and analysis)


Elasticsearch is the distributed search and analytics engine at the heart of the
Elastic Stack. Logstash and Beats facilitate collecting, aggregating, and enriching
your data and storing it in Elasticsearch. Kibana enables you to interactively
explore, visualize, and share insights into your data and manage and monitor the
stack. Elasticsearch is where the indexing, search, and analysis magic happens.
Elasticsearch provides near real-time search and analytics for all types of data.
Whether you have structured or unstructured text, numerical data, or geospatial
data, Elasticsearch can efficiently store and index it in a way that supports fast
searches. You can go far beyond simple data retrieval and aggregate information to
discover trends and patterns in your data. And as your data and query volume
grows, the distributed nature of Elasticsearch enables your deployment to grow
seamlessly right along with it.
While not every problem is a search problem, Elasticsearch offers speed and
flexibility to handle data in a wide variety of use cases:
 Add a search box to an app or website
 Store and analyze logs, metrics, and security event data
 Use machine learning to automatically model the behavior of your data in
real time
 Use Elasticsearch as a vector database to create, store, and search vector
embeddings
 Automate business workflows using Elasticsearch as a storage engine
 Manage, integrate, and analyze spatial information using Elasticsearch as a
geographic information system (GIS)
 Store and process genetic data using Elasticsearch as a bioinformatics
research tool
We’re continually amazed by the novel ways people use search. But whether your
use case is similar to one of these, or you’re using Elasticsearch to tackle a new
problem, the way you work with your data, documents, and indices in Elasticsearch
is the same.
Data in: documents and indices
Elasticsearch is a distributed document store. Instead of storing information as rows
of columnar data, Elasticsearch stores complex data structures that have been
serialized as JSON documents. When you have multiple Elasticsearch nodes in a
cluster, stored documents are distributed across the cluster and can be accessed
immediately from any node.
When a document is stored, it is indexed and fully searchable in near real time
(within 1 second). Elasticsearch uses a data structure called an inverted index that
supports very fast full-text searches. An inverted index lists every unique word that
appears in any document and identifies all of the documents each word occurs in.
An index can be thought of as an optimized collection of documents and each
document is a collection of fields, which are the key-value pairs that contain your
data. By default, Elasticsearch indexes all data in every field and each indexed field
has a dedicated, optimized data structure. For example, text fields are stored in
inverted indices, and numeric and geo fields are stored in BKD trees. The ability to
use the per-field data structures to assemble and return search results is what
makes Elasticsearch so fast.
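To make the document-and-field model concrete, here is a minimal sketch of an indexing request; the index name, document ID, and fields are hypothetical:
PUT my-index/_doc/1
{
  "title": "Introduction to Elasticsearch",
  "in_print": true,
  "page_count": 120,
  "release_date": "2020-01-15"
}
Each key-value pair in the JSON body becomes a field of the stored document, and by default every field is indexed.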
Elasticsearch also has the ability to be schema-less, which means that documents
can be indexed without explicitly specifying how to handle each of the different
fields that might occur in a document. When dynamic mapping is enabled,
Elasticsearch automatically detects and adds new fields to the index. This default
behavior makes it easy to index and explore your data—just start indexing
documents and Elasticsearch will detect and map booleans, floating point and
integer values, dates, and strings to the appropriate Elasticsearch data types.
Ultimately, however, you know more about your data and how you want to use it
than Elasticsearch can. You can define rules to control dynamic mapping and
explicitly define mappings to take full control of how fields are stored and indexed.
Defining your own mappings enables you to:
 Distinguish between full-text string fields and exact value string fields
 Perform language-specific text analysis
 Optimize fields for partial matching
 Use custom date formats
 Use data types such as geo_point and geo_shape that cannot be
automatically detected
It’s often useful to index the same field in different ways for different purposes. For
example, you might want to index a string field as both a text field for full-text
search and as a keyword field for sorting or aggregating your data. Or, you might
choose to use more than one language analyzer to process the contents of a string
field that contains user input.
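As a sketch, an explicit mapping that indexes a string field both ways and declares a geo_point field might look like the following; the index and field names are hypothetical:
PUT my-index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "location": { "type": "geo_point" }
    }
  }
}
With this mapping, queries can use title for full-text search and title.keyword for sorting or aggregations.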
The analysis chain that is applied to a full-text field during indexing is also used at
search time. When you query a full-text field, the query text undergoes the same
analysis before the terms are looked up in the index.
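You can inspect what an analysis chain produces with the _analyze API; this sketch assumes the built-in english analyzer:
GET _analyze
{
  "analyzer": "english",
  "text": "The quick foxes jumped over the lazy dogs"
}
The response lists the terms that would be written to the inverted index (lowercased, stemmed, with stop words removed), and the same analysis is applied to query text at search time.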
Information out: search and analyze
While you can use Elasticsearch as a document store and retrieve documents and
their metadata, the real power comes from being able to easily access the full suite
of search capabilities built on the Apache Lucene search engine library.
Elasticsearch provides a simple, coherent REST API for managing your cluster and
indexing and searching your data. For testing purposes, you can easily submit
requests directly from the command line or through the Developer Console in
Kibana. From your applications, you can use the Elasticsearch client for your
language of choice: Java, JavaScript, Go, .NET, PHP, Perl, Python or Ruby.
Searching your data
The Elasticsearch REST APIs support structured queries, full text queries, and
complex queries that combine the two. Structured queries are similar to the types of
queries you can construct in SQL. For example, you could search
the gender and age fields in your employee index and sort the matches by
the hire_date field. Full-text queries find all documents that match the query string
and return them sorted by relevance—how good a match they are for your search
terms.
In addition to searching for individual terms, you can perform phrase searches,
similarity searches, and prefix searches, and get autocomplete suggestions.
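For example, a phrase search against the books index used later in this guide might look like the following sketch:
GET books/_search
{
  "query": {
    "match_phrase": {
      "name": "brave new world"
    }
  }
}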
Have geospatial or other numerical data that you want to search? Elasticsearch
indexes non-textual data in optimized data structures that support high-
performance geo and numerical queries.
You can access all of these search capabilities using Elasticsearch’s comprehensive
JSON-style query language (Query DSL). You can also construct SQL-style queries to
search and aggregate data natively inside Elasticsearch, and JDBC and ODBC drivers
enable a broad range of third-party applications to interact with Elasticsearch via
SQL.
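A sketch of a query that combines the two styles against the employee example above; the index name and the full-text about field are assumptions added for illustration:
GET employees/_search
{
  "query": {
    "bool": {
      "must": { "match": { "about": "distributed search" } },
      "filter": { "range": { "age": { "gte": 30 } } }
    }
  },
  "sort": [
    { "hire_date": "desc" }
  ]
}
The match clause contributes to relevance scoring, while the range filter narrows results without affecting the score.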
Analyzing your data
Elasticsearch aggregations enable you to build complex summaries of your data and
gain insight into key metrics, patterns, and trends. Instead of just finding the
proverbial “needle in a haystack”, aggregations enable you to answer questions
like:
 How many needles are in the haystack?
 What is the average length of the needles?
 What is the median length of the needles, broken down by manufacturer?
 How many needles were added to the haystack in each of the last six
months?
You can also use aggregations to answer more subtle questions, such as:
 What are your most popular needle manufacturers?
 Are there any unusual or anomalous clumps of needles?
Because aggregations leverage the same data structures used for search, they are
also very fast. This enables you to analyze and visualize your data in real time. Your
reports and dashboards update as your data changes so you can take action based
on the latest information.
What’s more, aggregations operate alongside search requests. You can search
documents, filter results, and perform analytics at the same time, on the same data,
in a single request. And because aggregations are calculated in the context of a
particular search, you’re not just displaying a count of all size 70 needles, you’re
displaying a count of the size 70 needles that match your users' search criteria—for
example, all size 70 non-stick embroidery needles.
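Continuing the needle example, here is a sketch of a single request that both filters and aggregates; the index name and fields (a keyword manufacturer field and a numeric length_mm field) are hypothetical:
GET needles/_search
{
  "size": 0,
  "query": {
    "match": { "description": "size 70 embroidery" }
  },
  "aggs": {
    "by_manufacturer": {
      "terms": { "field": "manufacturer" },
      "aggs": {
        "median_length": {
          "percentiles": { "field": "length_mm", "percents": [ 50 ] }
        }
      }
    }
  }
}
The aggregations are computed only over the documents that match the query, so the per-manufacturer counts and median lengths reflect the search context.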
But wait, there’s more
Want to automate the analysis of your time series data? You can use machine
learning features to create accurate baselines of normal behavior in your data and
identify anomalous patterns. With machine learning, you can detect:
 Anomalies related to temporal deviations in values, counts, or frequencies
 Statistical rarity
 Unusual behaviors for a member of a population
And the best part? You can do this without having to specify algorithms, models, or
other data science-related configurations.
Scalability and resilience: clusters, nodes, and shards
Elasticsearch is built to be always available and to scale with your needs. It does
this by being distributed by nature. You can add servers (nodes) to a cluster to
increase capacity and Elasticsearch automatically distributes your data and query
load across all of the available nodes. There is no need to overhaul your application;
Elasticsearch knows how to balance multi-node clusters to provide scale and high
availability. The more nodes, the merrier.
How does this work? Under the covers, an Elasticsearch index is really just a logical
grouping of one or more physical shards, where each shard is actually a self-
contained index. By distributing the documents in an index across multiple shards,
and distributing those shards across multiple nodes, Elasticsearch can ensure
redundancy, which both protects against hardware failures and increases query
capacity as nodes are added to a cluster. As the cluster grows (or shrinks),
Elasticsearch automatically migrates shards to rebalance the cluster.
There are two types of shards: primaries and replicas. Each document in an index
belongs to one primary shard. A replica shard is a copy of a primary shard. Replicas
provide redundant copies of your data to protect against hardware failure and
increase capacity to serve read requests like searching or retrieving a document.
The number of primary shards in an index is fixed at the time that an index is
created, but the number of replica shards can be changed at any time, without
interrupting indexing or query operations.
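As a sketch, the primary shard count is set when the index is created, and the replica count can be changed later; the index name is hypothetical:
PUT my-index
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }
}

PUT my-index/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}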
It depends…
There are a number of performance considerations and trade-offs with respect to
shard size and the number of primary shards configured for an index. The more
shards, the more overhead there is simply in maintaining those indices. The larger
the shard size, the longer it takes to move shards around when Elasticsearch needs
to rebalance a cluster.
Querying lots of small shards makes the processing per shard faster, but more
queries means more overhead, so querying a smaller number of larger shards might
be faster. In short…it depends.
As a starting point:
 Aim to keep the average shard size between a few GB and a few tens of GB.
For use cases with time-based data, it is common to see shards in the 20GB
to 40GB range.
 Avoid the gazillion shards problem. The number of shards a node can hold is
proportional to the available heap space. As a general rule, the number of
shards per GB of heap space should be less than 20.
The best way to determine the optimal configuration for your use case is
through testing with your own data and queries.
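To see how your current shards compare with these guidelines, one option is to list shards sorted by store size using the cat shards API:
GET _cat/shards?v=true&s=store:desc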
In case of disaster
A cluster’s nodes need good, reliable connections to each other. To provide better
connections, you typically co-locate the nodes in the same data center or nearby
data centers. However, to maintain high availability, you also need to avoid any
single point of failure. In the event of a major outage in one location, servers in
another location need to be able to take over. The answer? Cross-cluster replication
(CCR).
CCR provides a way to automatically synchronize indices from your primary cluster
to a secondary remote cluster that can serve as a hot backup. If the primary cluster
fails, the secondary cluster can take over. You can also use CCR to create secondary
clusters to serve read requests in geo-proximity to your users.
Cross-cluster replication is active-passive. The index on the primary cluster is the
active leader index and handles all write requests. Indices replicated to secondary
clusters are read-only followers.
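A minimal sketch of creating a follower index on the secondary cluster, assuming a remote cluster connection named primary has already been configured and that it holds a leader index named logs:
PUT /logs-copy/_ccr/follow
{
  "remote_cluster": "primary",
  "leader_index": "logs"
}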
Care and feeding
As with any enterprise system, you need tools to secure, manage, and monitor your
Elasticsearch clusters. Security, monitoring, and administrative features that are
integrated into Elasticsearch enable you to use Kibana as a control center for
managing a cluster. Features like downsampling and index lifecycle
management help you intelligently manage your data over time.
What’s new in 8.14
Here are the highlights of what’s new and improved in Elasticsearch 8.14! For
detailed information about this release, see the Release notes and Migration guide.
Query phase KNN now supports query_vector_builder
It is now possible to pass model_text and model_id within a knn query in the query
DSL to convert a text query into a dense vector and run the nearest neighbor query
on it, instead of requiring the dense vector to be directly passed (within
the query_vector parameter). Similar to the top-level knn query (executed in the
DFS phase), it is possible to supply a query_vector_builder object containing
a text_embedding object with model_text (the text query to be converted into a
dense vector) and model_id (the identifier of a deployed model responsible for
transforming the text query into a dense vector). Note that an embedding model
with the referenced model_id needs to be deployed on an ML node in the cluster.
#106068
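A sketch of such a query; the index, vector field, model ID, and query text below are hypothetical:
GET my-index/_search
{
  "query": {
    "knn": {
      "field": "content_embedding",
      "query_vector_builder": {
        "text_embedding": {
          "model_id": "my-text-embedding-model",
          "model_text": "how do I reset my password"
        }
      },
      "num_candidates": 100
    }
  }
}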
A SIMD (Neon) optimised vector distance function for merging int8 Scalar Quantized
vectors has been added
An optimised int8 vector distance implementation for aarch64 has been added. This
implementation is currently only used during merging. It outperforms Lucene's
Panama Vector implementation for binary comparisons by approximately 5x
(depending on the number of dimensions). It does so by means of SIMD (Neon)
intrinsics compiled into a separate native library and linked via Panama's FFI.
Comparisons are performed on off-heap mmap'ed vector data. Macro benchmarks
(SO_Dense_Vector with scalar quantization enabled) show significant improvements
in merge times, approximately 3 times faster.
#106133
Preview: Support for the Anonymous IP and Enterprise databases in the geoip
processor
As a Technical Preview, the geoip processor can now use the
commercial GeoIP2 Enterprise and GeoIP2 Anonymous IP databases from MaxMind.
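As a sketch, such a database would be referenced from an ingest pipeline through the processor's database_file option; the pipeline name, source field, and database filename below are assumptions, and the database itself must be made available to the cluster separately:
PUT _ingest/pipeline/flag-anonymous-ips
{
  "processors": [
    {
      "geoip": {
        "field": "source.ip",
        "database_file": "GeoIP2-Anonymous-IP.mmdb"
      }
    }
  ]
}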
Run Elasticsearch locally in Docker (without security)
DO NOT USE THESE INSTRUCTIONS FOR PRODUCTION DEPLOYMENTS
The instructions on this page are for local development only. Do not use these
instructions for production deployments, because they are not secure. While this
approach is convenient for experimenting and learning, you should never run the
service in this way in a production environment.
The following commands help you very quickly spin up a single-node Elasticsearch
cluster, together with Kibana in Docker. Note that if you don’t need the Kibana UI,
you can skip those instructions.
When would I use this setup?
Use this setup if you want to quickly spin up Elasticsearch (and Kibana) for local
development or testing.
For example, you might:
 Want to run a quick test to see how a feature works.
 Follow a tutorial or guide that requires an Elasticsearch cluster, like our quick
start guide.
 Experiment with the Elasticsearch APIs using different tools, like the Dev Tools
Console, cURL, or an Elastic programming language client.
 Quickly spin up an Elasticsearch cluster to test an executable Python
notebook locally.
Prerequisites
If you don’t have Docker installed, download and install Docker Desktop for your
operating system.
Set environment variables
Configure the following environment variables.
export ELASTIC_PASSWORD="<ES_PASSWORD>"  # password for the "elastic" username
export KIBANA_PASSWORD="<KIB_PASSWORD>"  # used _internally_ by Kibana, must be at least 6 characters long
Create a Docker network
To run both Elasticsearch and Kibana, you’ll need to create a Docker network:
docker network create elastic-net
Run Elasticsearch
Start the Elasticsearch container with the following command:
docker run -p 127.0.0.1:9200:9200 -d --name elasticsearch --network elastic-net \
-e ELASTIC_PASSWORD=$ELASTIC_PASSWORD \
-e "discovery.type=single-node" \
-e "xpack.security.http.ssl.enabled=false" \
-e "xpack.license.self_generated.type=trial" \
docker.elastic.co/elasticsearch/elasticsearch:8.14.3
Run Kibana (optional)
To run Kibana, you must first set the kibana_system password in the Elasticsearch
container.
# configure the Kibana password in the ES container
curl -u elastic:$ELASTIC_PASSWORD \
-X POST \
http://localhost:9200/_security/user/kibana_system/_password \
-d '{"password":"'"$KIBANA_PASSWORD"'"}' \
-H 'Content-Type: application/json'
Start the Kibana container with the following command:
docker run -p 127.0.0.1:5601:5601 -d --name kibana --network elastic-net \
-e ELASTICSEARCH_URL=http://elasticsearch:9200 \
-e ELASTICSEARCH_HOSTS=http://elasticsearch:9200 \
-e ELASTICSEARCH_USERNAME=kibana_system \
-e ELASTICSEARCH_PASSWORD=$KIBANA_PASSWORD \
-e "xpack.security.enabled=false" \
-e "xpack.license.self_generated.type=trial" \
docker.elastic.co/kibana/kibana:8.14.3
The service is started with a trial license. The trial license enables all features of
Elasticsearch for a trial period of 30 days. After the trial period expires, the license is
downgraded to a basic license, which is free forever. If you prefer to skip the trial
and use the basic license, set the value of
the xpack.license.self_generated.type variable to basic instead. For a detailed
feature comparison between the different licenses, refer to our subscriptions page.
Connecting to Elasticsearch with language clients
To connect to the Elasticsearch cluster from a language client, you can use basic
authentication with the elastic username and the password you set in the
environment variable.
You’ll use the following connection details:
 Elasticsearch endpoint: http://localhost:9200
 Username: elastic
 Password: $ELASTIC_PASSWORD (Value you set in the environment variable)
For example, to connect with the Python elasticsearch client:
import os
from elasticsearch import Elasticsearch

username = 'elastic'
password = os.getenv('ELASTIC_PASSWORD')  # Value you set in the environment variable

client = Elasticsearch(
    "http://localhost:9200",
    basic_auth=(username, password)
)

print(client.info())
Here’s an example curl command using basic authentication:
curl -u elastic:$ELASTIC_PASSWORD \
-X PUT \
http://localhost:9200/my-new-index \
-H 'Content-Type: application/json'
Next steps
Use our quick start guide to learn the basics of Elasticsearch: how to add data and
query it.
Moving to production
This setup is not suitable for production use. For production deployments, we
recommend using our managed service on Elastic Cloud. Sign up for a free trial (no
credit card required).
Otherwise, refer to Install Elasticsearch to learn about the various options for
installing Elasticsearch in a self-managed production environment, including using
Docker.
Quick start guide
This guide helps you learn how to:
 Run Elasticsearch and Kibana (using Elastic Cloud or in a local Docker dev
environment),
 add a simple (non-timestamped) dataset to Elasticsearch,
 run basic searches.
If you’re interested in using Elasticsearch with Python, check out Elastic Search
Labs. This is the best place to explore AI-powered search use cases, such as working
with embeddings, vector search, and retrieval augmented generation (RAG).
 Tutorial: this walks you through building a complete search solution with
Elasticsearch, from the ground up.
 elasticsearch-labs repository: it contains a range of
Python notebooks and example apps.
Run Elasticsearch
The simplest way to set up Elasticsearch is to create a managed deployment with
Elasticsearch Service on Elastic Cloud. If you prefer to manage your own test
environment, install and run Elasticsearch using Docker.
To create a managed deployment on Elastic Cloud:
1. Get a free trial.
2. Log into Elastic Cloud.
3. Click Create deployment.
4. Give your deployment a name.
5. Click Create deployment and download the password for the elastic user.
6. Click Continue to open Kibana, the user interface for Elastic Cloud.
7. Click Explore on my own.
Send requests to Elasticsearch
You send data and other requests to Elasticsearch using REST APIs. This lets you
interact with Elasticsearch using any client that sends HTTP requests, such as curl.
You can also use Kibana’s Console to send requests to Elasticsearch.
Use Kibana
1. Open Kibana’s main menu ("☰" near Elastic logo) and go to Dev Tools >
Console.
2. Run the following test API request in Console:
GET /
Use curl
To communicate with Elasticsearch using curl or another client, you need your
cluster’s endpoint.
1. Open Kibana’s main menu and click Manage this deployment.
2. From your deployment menu, go to the Elasticsearch page. Click Copy
endpoint.
3. To submit an example API request, run the following curl command in a new
terminal session. Replace <password> with the password for the elastic user.
Replace <elasticsearch_endpoint> with your endpoint.
curl -u elastic:<password> <elasticsearch_endpoint>/
Add data
You add data to Elasticsearch as JSON objects called documents. Elasticsearch
stores these documents in searchable indices.
Add a single document
Submit the following indexing request to add a single document to the books index.
The request automatically creates the index.
POST books/_doc
{"name": "Snow Crash", "author": "Neal Stephenson", "release_date": "1992-06-01",
"page_count": 470}
The response includes metadata that Elasticsearch generates for the document
including a unique _id for the document within the index.
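The response will look something like the following sketch; the generated values, such as _id, will differ:
{
  "_index": "books",
  "_id": "O0lG2IsBaSa7VYx_rEia",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}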
Add multiple documents
Use the _bulk endpoint to add multiple documents in one request. Bulk data must
be newline-delimited JSON (NDJSON). Each line must end in a newline character (\n),
including the last line.
POST /_bulk
{ "index" : { "_index" : "books" } }
{"name": "Revelation Space", "author": "Alastair Reynolds", "release_date": "2000-03-15", "page_count": 585}
{ "index" : { "_index" : "books" } }
{"name": "1984", "author": "George Orwell", "release_date": "1985-06-01", "page_count": 328}
{ "index" : { "_index" : "books" } }
{"name": "Fahrenheit 451", "author": "Ray Bradbury", "release_date": "1953-10-15", "page_count": 227}
{ "index" : { "_index" : "books" } }
{"name": "Brave New World", "author": "Aldous Huxley", "release_date": "1932-06-01", "page_count": 268}
{ "index" : { "_index" : "books" } }
{"name": "The Handmaids Tale", "author": "Margaret Atwood", "release_date": "1985-06-01", "page_count": 311}
You should receive a response indicating there were no errors.
Search data
Indexed documents are available for search in near real-time.
Search all documents
Run the following command to search the books index for all documents:
GET books/_search
The _source of each hit contains the original JSON object submitted during indexing.
match query
You can use the match query to search for documents that contain a specific value
in a specific field. This is the standard query for performing full-text search,
including fuzzy matching and phrase searches.
Run the following command to search the books index for documents
containing brave in the name field:
GET books/_search
{
  "query": {
    "match": {
      "name": "brave"
    }
  }
}
Next steps
Now that Elasticsearch is up and running and you’ve learned the basics, you’ll
probably want to test out larger datasets, or index your own data.
Learn more about search queries
 Search your data. Jump here to learn about exact value search, full-text
search, vector search, and more, using the search API.
Add more data
 Learn how to install sample data using Kibana. This is a quick way to test out
Elasticsearch on larger workloads.
 Learn how to use the upload data UI in Kibana to add your own CSV, TSV, or
JSON files.
 Use the bulk API to ingest your own datasets to Elasticsearch.
Elasticsearch programming language clients
 Check out our client library to work with your Elasticsearch instance in your
preferred programming language.
 If you’re using Python, check out Elastic Search Labs for a range of examples
that use the Elasticsearch Python client. This is the best place to explore AI-
powered search use cases, such as working with embeddings, vector search,
and retrieval augmented generation (RAG).
 This extensive, hands-on tutorial walks you through building a
complete search solution with Elasticsearch, from the ground up.
 elasticsearch-labs contains a range of executable
Python notebooks and example apps.
Set up Elasticsearch
This section includes information on how to set up Elasticsearch and get it running,
including:
 Downloading
 Installing
 Starting
 Configuring
Supported platforms
The matrix of officially supported operating systems and JVMs is available
here: Support Matrix. Elasticsearch is tested on the listed platforms, but it is
possible that it will work on other platforms too.
Use dedicated hosts
In production, we recommend you run Elasticsearch on a dedicated host or as a
primary service. Several Elasticsearch features, such as automatic JVM heap sizing,
assume it’s the only resource-intensive application on the host or container. For
example, you might run Metricbeat alongside Elasticsearch for cluster statistics, but
a resource-heavy Logstash deployment should be on its own host.
Installing Elasticsearch
Hosted Elasticsearch Service
Elastic Cloud offers all of the features of Elasticsearch, Kibana, and Elastic’s
Observability, Enterprise Search, and Elastic Security solutions as a hosted service
available on AWS, GCP, and Azure.
To set up Elasticsearch in Elastic Cloud, sign up for a free Elastic Cloud trial.
Self-managed Elasticsearch options
If you want to install and manage Elasticsearch yourself, you can:
 Run Elasticsearch using a Linux, MacOS, or Windows install package.
 Run Elasticsearch in a Docker container.
 Set up and manage Elasticsearch, Kibana, Elastic Agent, and the rest of the
Elastic Stack on Kubernetes with Elastic Cloud on Kubernetes.
To try out Elasticsearch on your own machine, we recommend using Docker and
running both Elasticsearch and Kibana. For more information, see Run Elasticsearch
locally. Please note that this setup is not suitable for production use.
Elasticsearch install packages
Elasticsearch is provided in the following package formats:
Linux and MacOS tar.gz archives
  The tar.gz archives are available for installation on any Linux distribution and MacOS.
  Install Elasticsearch from archive on Linux or MacOS

Windows .zip archive
  The zip archive is suitable for installation on Windows.
  Install Elasticsearch with .zip on Windows

deb
  The deb package is suitable for Debian, Ubuntu, and other Debian-based systems. Debian packages may be downloaded from the Elasticsearch website or from our Debian repository.
  Install Elasticsearch with Debian Package

rpm
  The rpm package is suitable for installation on Red Hat, Centos, SLES, OpenSuSE, and other RPM-based systems. RPMs may be downloaded from the Elasticsearch website or from our RPM repository.
  Install Elasticsearch with RPM
For a step-by-step example of setting up the Elastic Stack on your own premises, try
out our tutorial: Installing a self-managed Elastic Stack.
Elasticsearch container images
You can also run Elasticsearch inside a container image.
docker
  Docker container images may be downloaded from the Elastic Docker Registry.
  Install Elasticsearch with Docker
Java (JVM) Version
Elasticsearch is built using Java, and includes a bundled version of OpenJDK from
the JDK maintainers (GPLv2+CE) within each distribution. The bundled JVM is the
recommended JVM.
To use your own version of Java, set the ES_JAVA_HOME environment variable. If you
must use a version of Java that is different from the bundled JVM, it is best to use
the latest release of a supported LTS version of Java. Elasticsearch is closely coupled
to certain OpenJDK-specific features, so it may not work correctly with other JVMs.
Elasticsearch will refuse to start if a known-bad version of Java is used.
If you use a JVM other than the bundled one, you are responsible for reacting to
announcements related to its security issues and bug fixes, and must yourself
determine whether each update is necessary or not. In contrast, the bundled JVM is
treated as an integral part of Elasticsearch, which means that Elastic takes
responsibility for keeping it up to date. Security issues and bugs within the bundled
JVM are treated as if they were within Elasticsearch itself.
The bundled JVM is located within the jdk subdirectory of the Elasticsearch home
directory. You may remove this directory if using your own JVM.
JVM and Java agents
Don’t use third-party Java agents that attach to the JVM. These agents can reduce
Elasticsearch performance, including freezing or crashing nodes.
Install Elasticsearch with Docker
Docker images for Elasticsearch are available from the Elastic Docker registry. A list
of all published Docker images and tags is available at www.docker.elastic.co. The
source code is in GitHub.
This package contains both free and subscription features. Start a 30-day trial to try
out all of the features.
If you just want to test Elasticsearch in local development, refer to Local dev setup
(Docker). Please note that this setup is not suitable for production environments.
Run Elasticsearch in Docker
Use Docker commands to start a single-node Elasticsearch cluster for development
or testing. You can then run additional Docker commands to add nodes to the test
cluster or run Kibana.
This setup doesn’t run multiple Elasticsearch nodes or Kibana by default. To create a
multi-node cluster with Kibana, use Docker Compose instead. See Start a multi-node
cluster with Docker Compose.
Start a single-node cluster
1. Install Docker. Visit Get Docker to install Docker for your environment.
If using Docker Desktop, make sure to allocate at least 4GB of memory. You can
adjust memory usage in Docker Desktop by going to Settings > Resources.
2. Create a new docker network.
docker network create elastic
3. Pull the Elasticsearch Docker image.
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.14.3
4. Optional: Install Cosign for your environment. Then use Cosign to verify the
Elasticsearch image's signature.
wget https://artifacts.elastic.co/cosign.pub
cosign verify --key cosign.pub docker.elastic.co/elasticsearch/elasticsearch:8.14.3
The cosign command prints the check results and the signature payload in JSON
format:
Verification for docker.elastic.co/elasticsearch/elasticsearch:8.14.3 --
The following checks were performed on each of these signatures:
- The cosign claims were validated
- Existence of the claims in the transparency log was verified offline
- The signatures were verified against the specified public key
5. Start an Elasticsearch container.
docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB
docker.elastic.co/elasticsearch/elasticsearch:8.14.3
Use the -m flag to set a memory limit for the container. This removes the need
to manually set the JVM size.
The command prints the elastic user password and an enrollment token for Kibana.
6. Copy the generated elastic password and enrollment token. These credentials
are only shown when you start Elasticsearch for the first time. You can
regenerate the credentials using the following commands.
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token -s kibana
We recommend storing the elastic password as an environment variable in your
shell. Example:
export ELASTIC_PASSWORD="your_password"
7. Copy the http_ca.crt SSL certificate from the container to your local machine.
docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
8. Make a REST API call to Elasticsearch to ensure the Elasticsearch container is running.
curl --cacert http_ca.crt -u elastic:$ELASTIC_PASSWORD https://localhost:9200
Add more nodes
1. Use an existing node to generate an enrollment token for the new node.
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token -s node
The enrollment token is valid for 30 minutes.
2. Start a new Elasticsearch container. Include the enrollment token as an
environment variable.
docker run -e ENROLLMENT_TOKEN="<token>" --name es02 --net elastic -it -m 1GB
docker.elastic.co/elasticsearch/elasticsearch:8.14.3
3. Call the cat nodes API to verify the node was added to the cluster.
curl --cacert http_ca.crt -u elastic:$ELASTIC_PASSWORD
https://localhost:9200/_cat/nodes
Run Kibana
1. Pull the Kibana Docker image.
docker pull docker.elastic.co/kibana/kibana:8.14.3
2. Optional: Verify the Kibana image's signature.
wget https://artifacts.elastic.co/cosign.pub
cosign verify --key cosign.pub docker.elastic.co/kibana/kibana:8.14.3
3. Start a Kibana container.
docker run --name kib01 --net elastic -p 5601:5601
docker.elastic.co/kibana/kibana:8.14.3
4. When Kibana starts, it outputs a unique generated link to the terminal. To
access Kibana, open this link in a web browser.
5. In your browser, enter the enrollment token that was generated when you
started Elasticsearch.
To regenerate the token, run:
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token -s kibana
6. Log in to Kibana as the elastic user with the password that was generated
when you started Elasticsearch.
To regenerate the password, run:
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic
Remove containers
To remove the containers and their network, run:
# Remove the Elastic network
docker network rm elastic

# Remove Elasticsearch containers
docker rm es01
docker rm es02

# Remove the Kibana container
docker rm kib01
Next steps
You now have a test Elasticsearch environment set up. Before you start serious
development or go into production with Elasticsearch, review the requirements and
recommendations to apply when running Elasticsearch in Docker in production.
Start a multi-node cluster with Docker Compose
Use Docker Compose to start a three-node Elasticsearch cluster with Kibana. Docker
Compose lets you start multiple containers with a single command.
Configure and start the cluster
1. Install Docker Compose. Visit the Docker Compose docs to install Docker
Compose for your environment.
If you’re using Docker Desktop, Docker Compose is installed automatically. Make
sure to allocate at least 4GB of memory to Docker Desktop. You can adjust memory
usage in Docker Desktop by going to Settings > Resources.
2. Create or navigate to an empty directory for the project.
3. Download and save the following files in the project directory:
 .env
 docker-compose.yml
4. In the .env file, specify a password for
the ELASTIC_PASSWORD and KIBANA_PASSWORD variables.
The passwords must be alphanumeric and can’t contain special characters, such
as ! or @. The bash script included in the docker-compose.yml file only works with
alphanumeric characters. Example:
# Password for the 'elastic' user (at least 6 characters)
ELASTIC_PASSWORD=changeme

# Password for the 'kibana_system' user (at least 6 characters)
KIBANA_PASSWORD=changeme
...
5. In the .env file, set STACK_VERSION to the current Elastic Stack version.
...
# Version of Elastic products
STACK_VERSION=8.14.3
...
6. By default, the Docker Compose configuration exposes port 9200 on all
network interfaces.
To avoid exposing port 9200 to external hosts, set ES_PORT to 127.0.0.1:9200 in
the .env file. This ensures Elasticsearch is only accessible from the host machine.
...
# Port to expose Elasticsearch HTTP API to the host
#ES_PORT=9200
ES_PORT=127.0.0.1:9200
...
7. To start the cluster, run the following command from the project directory.
docker-compose up -d
8. After the cluster has started, open http://localhost:5601 in a web browser to
access Kibana.
9. Log in to Kibana as the elastic user using the ELASTIC_PASSWORD you set
earlier.
Stop and remove the cluster
To stop the cluster, run docker-compose down. The data in the Docker volumes is
preserved and loaded when you restart the cluster with docker-compose up.
docker-compose down
To delete the network, containers, and volumes when you stop the cluster, specify
the -v option:
docker-compose down -v
Next steps
You now have a test Elasticsearch environment set up. Before you start serious
development or go into production with Elasticsearch, review the requirements and
recommendations to apply when running Elasticsearch in Docker in production.
Using the Docker images in production
The following requirements and recommendations apply when running Elasticsearch
in Docker in production.
Set vm.max_map_count to at least 262144
The vm.max_map_count kernel setting must be set to at least 262144 for
production use.
How you set vm.max_map_count depends on your platform.
Linux
To view the current value for the vm.max_map_count setting, run:
grep vm.max_map_count /etc/sysctl.conf
vm.max_map_count=262144
To apply the setting on a live system, run:
sysctl -w vm.max_map_count=262144
To permanently change the value for the vm.max_map_count setting, update the
value in /etc/sysctl.conf.
macOS with Docker for Mac
The vm.max_map_count setting must be set within the xhyve virtual machine:
1. From the command line, run:
screen ~/Library/Containers/com.docker.docker/Data/vms/0/tty
2. Press enter and use sysctl to configure vm.max_map_count:
sysctl -w vm.max_map_count=262144
3. To exit the screen session, type Ctrl a d.
Windows and macOS with Docker Desktop
The vm.max_map_count setting must be set via docker-machine:
docker-machine ssh
sudo sysctl -w vm.max_map_count=262144
Windows with Docker Desktop WSL 2 backend
The vm.max_map_count setting must be set in the "docker-desktop" WSL instance
before the Elasticsearch container will properly start. There are several ways to do
this, depending on your version of Windows and your version of WSL.
If you are on Windows 10 before version 22H2, or if you are on Windows 10 version
22H2 using the built-in version of WSL, you must either manually set it every time
you restart Docker before starting your Elasticsearch container, or (if you do not
wish to do so on every restart) you must globally set every WSL2 instance to have
the vm.max_map_count changed. This is because these versions of WSL do not
properly process the /etc/sysctl.conf file.
To manually set it every time you reboot, you must run the following commands in a
command prompt or PowerShell window every time you restart Docker:
wsl -d docker-desktop -u root
sysctl -w vm.max_map_count=262144
If you are on these versions of WSL and you do not want to have to run those
commands every time you restart Docker, you can globally change every WSL
distribution with this setting by modifying your %USERPROFILE%\.wslconfig as
follows:
[wsl2]
kernelCommandLine = "sysctl.vm.max_map_count=262144"
This will cause all WSL2 VMs to have that setting assigned when they start.
If you are on Windows 11, or Windows 10 version 22H2 and have installed the
Microsoft Store version of WSL, you can modify the /etc/sysctl.conf within the
"docker-desktop" WSL distribution, perhaps with commands like this:
wsl -d docker-desktop -u root
vi /etc/sysctl.conf
and appending a line which reads:
vm.max_map_count = 262144
Configuration files must be readable by the elasticsearch user
By default, Elasticsearch runs inside the container as user elasticsearch using
uid:gid 1000:0.
One exception is Openshift, which runs containers using an arbitrarily assigned user
ID. Openshift presents persistent volumes with the gid set to 0, which works without
any adjustments.
If you are bind-mounting a local directory or file, it must be readable by
the elasticsearch user. In addition, this user must have write access to the config,
data and log dirs (Elasticsearch needs write access to the config directory so that it
can generate a keystore). A good strategy is to grant group access to gid 0 for the
local directory.
For example, to prepare a local directory for storing data through a bind-mount:
mkdir esdatadir
chmod g+rwx esdatadir
chgrp 0 esdatadir
You can also run an Elasticsearch container using both a custom UID and GID. You
must ensure that file permissions will not prevent Elasticsearch from executing. You
can use one of two options:
 Bind-mount the config, data and logs directories. If you intend to install
plugins and prefer not to create a custom Docker image, you must also bind-
mount the plugins directory.
 Pass the --group-add 0 command line option to docker run. This ensures that
the user under which Elasticsearch is running is also a member of
the root (GID 0) group inside the container.
Increase ulimits for nofile and nproc
Increased ulimits for nofile and nproc must be available for the Elasticsearch
containers. Verify the init system for the Docker daemon sets them to acceptable
values.
To check the Docker daemon defaults for ulimits, run:
docker run --rm docker.elastic.co/elasticsearch/elasticsearch:8.14.3 /bin/bash -c
'ulimit -Hn && ulimit -Sn && ulimit -Hu && ulimit -Su'
If needed, adjust them in the Daemon or override them per container. For example,
when using docker run, set:
--ulimit nofile=65535:65535
Disable swapping
Swapping needs to be disabled for performance and node stability. For information
about ways to do this, see Disable swapping.
If you opt for the bootstrap.memory_lock: true approach, you also need to define
the memlock: true ulimit in the Docker Daemon, or explicitly set for the container as
shown in the sample compose file. When using docker run, you can specify:
-e "bootstrap.memory_lock=true" --ulimit memlock=-1:-1
Randomize published ports
The image exposes TCP ports 9200 and 9300. For production clusters, randomizing
the published ports with --publish-all is recommended, unless you are pinning one
container per host.
Manually set the heap size
By default, Elasticsearch automatically sizes JVM heap based on a node's roles and
the total memory available to the node’s container. We recommend this default
sizing for most production environments. If needed, you can override default sizing
by manually setting JVM heap size.
To manually set the heap size in production, bind mount a JVM options file
under /usr/share/elasticsearch/config/jvm.options.d that includes your desired heap
size settings.
For testing, you can also manually set the heap size using
the ES_JAVA_OPTS environment variable. For example, to use 1GB, use the following
command.
docker run -e ES_JAVA_OPTS="-Xms1g -Xmx1g" -e ENROLLMENT_TOKEN="<token>"
--name es01 -p 9200:9200 --net elastic -it
docker.elastic.co/elasticsearch/elasticsearch:8.14.3
The ES_JAVA_OPTS variable overrides all other JVM options. We do not recommend
using ES_JAVA_OPTS in production.
Pin deployments to a specific image version
Pin your deployments to a specific version of the Elasticsearch Docker image. For
example docker.elastic.co/elasticsearch/elasticsearch:8.14.3.
Always bind data volumes
You should use a volume bound on /usr/share/elasticsearch/data for the following
reasons:
1. The data of your Elasticsearch node won’t be lost if the container is killed
2. Elasticsearch is I/O sensitive and the Docker storage driver is not ideal for fast
I/O
3. It allows the use of advanced Docker volume plugins
Avoid using loop-lvm mode
If you are using the devicemapper storage driver, do not use the default loop-
lvm mode. Configure docker-engine to use direct-lvm.
Centralize your logs
Consider centralizing your logs by using a different logging driver. Also note that the
default json-file logging driver is not ideally suited for production use.
Configuring Elasticsearch with Docker
When you run in Docker, the Elasticsearch configuration files are loaded
from /usr/share/elasticsearch/config/.
To use custom configuration files, you bind-mount the files over the configuration
files in the image.
You can set individual Elasticsearch configuration parameters using Docker
environment variables. The sample compose file and the single-node example use
this method. You can use the setting name directly as the environment variable
name. If you cannot do this, for example because your orchestration platform
forbids periods in environment variable names, then you can use an alternative
style by converting the setting name as follows.
1. Change the setting name to uppercase
2. Prefix it with ES_SETTING_
3. Escape any underscores (_) by duplicating them
4. Convert all periods (.) to underscores (_)
For example, -e bootstrap.memory_lock=true becomes -e
ES_SETTING_BOOTSTRAP_MEMORY__LOCK=true.
You can use the contents of a file to set the value of
the ELASTIC_PASSWORD or KEYSTORE_PASSWORD environment variables, by
suffixing the environment variable name with _FILE. This is useful for passing
secrets such as passwords to Elasticsearch without specifying them directly.
For example, to set the Elasticsearch bootstrap password from a file, you can bind
mount the file and set the ELASTIC_PASSWORD_FILE environment variable to the
mount location. If you mount the password file
to /run/secrets/bootstrapPassword.txt, specify:
-e ELASTIC_PASSWORD_FILE=/run/secrets/bootstrapPassword.txt
You can override the default command for the image to pass Elasticsearch
configuration parameters as command line options. For example:
docker run <various parameters> bin/elasticsearch -Ecluster.name=mynewclustername
While bind-mounting your configuration files is usually the preferred method in
production, you can also create a custom Docker image that contains your
configuration.
Mounting Elasticsearch configuration files
Create custom config files and bind-mount them over the corresponding files in the
Docker image. For example, to bind-mount custom_elasticsearch.yml with docker
run, specify:
-v full_path_to/custom_elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
If you bind-mount a custom elasticsearch.yml file, ensure it includes
the network.host: 0.0.0.0 setting. This setting ensures the node is reachable for
HTTP and transport traffic, provided its ports are exposed. The Docker image’s built-
in elasticsearch.yml file includes this setting by default.
The container runs Elasticsearch as user elasticsearch using uid:gid 1000:0.
Bind mounted host directories and files must be accessible by this user, and the
data and log directories must be writable by this user.
Create an encrypted Elasticsearch keystore
By default, Elasticsearch will auto-generate a keystore file for secure settings. This
file is obfuscated but not encrypted.
To encrypt your secure settings with a password and have them persist outside the
container, use a docker run command to manually create the keystore instead. The
command must:
 Bind-mount the config directory. The command will create
an elasticsearch.keystore file in this directory. To avoid errors, do not directly
bind-mount the elasticsearch.keystore file.
 Use the elasticsearch-keystore tool with the create -p option. You’ll be
prompted to enter a password for the keystore.
For example:
docker run -it --rm \
-v full_path_to/config:/usr/share/elasticsearch/config \
docker.elastic.co/elasticsearch/elasticsearch:8.14.3 \
bin/elasticsearch-keystore create -p
You can also use a docker run command to add or update secure settings in the
keystore. You’ll be prompted to enter the setting values. If the keystore is
encrypted, you’ll also be prompted to enter the keystore password.
docker run -it --rm \
-v full_path_to/config:/usr/share/elasticsearch/config \
docker.elastic.co/elasticsearch/elasticsearch:8.14.3 \
bin/elasticsearch-keystore \
add my.secure.setting \
my.other.secure.setting
If you’ve already created the keystore and don’t need to update it, you can bind-
mount the elasticsearch.keystore file directly. You can use
the KEYSTORE_PASSWORD environment variable to provide the keystore password
to the container at startup. For example, a docker run command might have the
following options:
-v full_path_to/config/elasticsearch.keystore:/usr/share/elasticsearch/config/elasticsearch.keystore
-e KEYSTORE_PASSWORD=mypassword
Using custom Docker images
In some environments, it might make more sense to prepare a custom image that
contains your configuration. A Dockerfile to achieve this might be as simple as:
FROM docker.elastic.co/elasticsearch/elasticsearch:8.14.3
COPY --chown=elasticsearch:elasticsearch elasticsearch.yml /usr/share/elasticsearch/config/
You could then build and run the image with:
docker build --tag=elasticsearch-custom .
docker run -ti -v /usr/share/elasticsearch/data elasticsearch-custom
Some plugins require additional security permissions. You must explicitly accept
them either by:
 Attaching a tty when you run the Docker image and allowing the permissions
when prompted.
 Inspecting the security permissions and accepting them (if appropriate) by
adding the --batch flag to the plugin install command.
See Plugin management for more information.
Troubleshoot Docker errors for Elasticsearch
Here’s how to resolve common errors when running Elasticsearch with Docker.
elasticsearch.keystore is a directory
Exception in thread "main" org.elasticsearch.bootstrap.BootstrapException:
java.io.IOException: Is a directory:
SimpleFSIndexInput(path="/usr/share/elasticsearch/config/elasticsearch.keystore")
Likely root cause: java.io.IOException: Is a directory
A keystore-related docker run command attempted to directly bind-mount
an elasticsearch.keystore file that doesn’t exist. If you use the -v or --volume flag to
mount a file that doesn’t exist, Docker instead creates a directory with the same
name.
To resolve this error:
1. Delete the elasticsearch.keystore directory in the config directory.
2. Update the -v or --volume flag to point to the config directory path rather
than the keystore file’s path. For an example, see Create an encrypted
Elasticsearch keystore.
3. Retry the command.
elasticsearch.keystore: Device or resource busy
Exception in thread "main" java.nio.file.FileSystemException:
/usr/share/elasticsearch/config/elasticsearch.keystore.tmp ->
/usr/share/elasticsearch/config/elasticsearch.keystore: Device or resource busy
A docker run command attempted to update the keystore while directly bind-
mounting the elasticsearch.keystore file. To update the keystore, the container
requires access to other files in the config directory, such as keystore.tmp.
To resolve this error:
1. Update the -v or --volume flag to point to the config directory path rather
than the keystore file’s path. For an example, see Create an encrypted
Elasticsearch keystore.
2. Retry the command.
Important Elasticsearch configuration
Elasticsearch requires very little configuration to get started, but there are a
number of items which must be considered before using your cluster in production:
 Path settings
 Cluster name setting
 Node name setting
 Network host settings
 Discovery settings
 Heap size settings
 JVM heap dump path setting
 GC logging settings
 Temporary directory settings
 JVM fatal error log setting
 Cluster backups
Our Elastic Cloud service configures these items automatically, making your cluster
production-ready by default.
Path settings
Elasticsearch writes the data you index to indices and data streams to
a data directory. Elasticsearch writes its own application logs, which contain
information about cluster health and operations, to a logs directory.
For macOS .tar.gz, Linux .tar.gz, and Windows .zip installations, data and logs are
subdirectories of $ES_HOME by default. However, files in $ES_HOME risk deletion
during an upgrade.
In production, we strongly recommend you set
the path.data and path.logs in elasticsearch.yml to locations outside
of $ES_HOME. Docker, Debian, and RPM installations write data and logs to locations
outside of $ES_HOME by default.
Supported path.data and path.logs values vary by platform:
Unix-like systems Windows
Linux and macOS installations support Unix-style paths:
path:
data: /var/data/elasticsearch
logs: /var/log/elasticsearch
Don’t modify anything within the data directory or run processes that might
interfere with its contents. If something other than Elasticsearch modifies the
contents of the data directory, then Elasticsearch may fail, reporting corruption or
other data inconsistencies, or may appear to work correctly having silently lost
some of your data. Don’t attempt to take filesystem backups of the data directory;
there is no supported way to restore such a backup. Instead, use Snapshot and
restore to take backups safely. Don’t run virus scanners on the data directory. A
virus scanner can prevent Elasticsearch from working correctly and may modify the
contents of the data directory. The data directory contains no executables so a virus
scan will only find false positives.
Multiple data paths
Deprecated in 7.13.0.
If needed, you can specify multiple paths in path.data. Elasticsearch stores the
node’s data across all provided paths but keeps each shard’s data on the same
path.
Elasticsearch does not balance shards across a node’s data paths. High disk usage
in a single path can trigger a high disk usage watermark for the entire node. If
triggered, Elasticsearch will not add shards to the node, even if the node’s other
paths have available disk space. If you need additional disk space, we recommend
you add a new node rather than additional data paths.
Linux and macOS installations support multiple Unix-style paths in path.data:
path:
  data:
    - /mnt/elasticsearch_1
    - /mnt/elasticsearch_2
    - /mnt/elasticsearch_3
Migrate from multiple data paths
Support for multiple data paths was deprecated in 7.13 and will be removed in a
future release.
As an alternative to multiple data paths, you can create a filesystem which spans
multiple disks with a hardware virtualisation layer such as RAID, or a software
virtualisation layer such as Logical Volume Manager (LVM) on Linux or Storage
Spaces on Windows. If you wish to use multiple data paths on a single machine then
you must run one node for each data path.
If you currently use multiple data paths in a highly available cluster then you can
migrate to a setup that uses a single path for each node without downtime using a
process similar to a rolling restart: shut each node down in turn and replace it with
one or more nodes each configured to use a single data path. In more detail, for
each node that currently has multiple data paths, follow this
process. In principle you can perform this migration during a rolling upgrade to 8.0,
but we recommend migrating to a single-data-path setup before starting to
upgrade.
1. Take a snapshot to protect your data in case of disaster.
2. Optionally, migrate the data away from the target node by using an allocation filter:
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": "target-node-name"
  }
}
You can use the cat allocation API to track progress of this data migration. If some
shards do not migrate then the cluster allocation explain API will help you to
determine why.
3. Follow the steps in the rolling restart process up to and including shutting the target node down.
4. Ensure your cluster health is yellow or green, so that there is a copy of every shard assigned to at least one of the other nodes in your cluster.
5. If applicable, remove the allocation filter applied in the earlier step.
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": null
  }
}
6. Discard the data held by the stopped node by deleting the contents of its data
paths.
7. Reconfigure your storage. For instance, combine your disks into a single
filesystem using LVM or Storage Spaces. Ensure that your reconfigured storage has
sufficient space for the data that it will hold.
8. Reconfigure your node by adjusting the path.data setting in
its elasticsearch.yml file. If needed, install more nodes, each with its
own path.data setting pointing at a separate data path.
9. Start the new nodes and follow the rest of the rolling restart process for them.
10. Ensure your cluster health is green, so that every shard has been assigned.
You can alternatively add some number of single-data-path nodes to your cluster,
migrate all your data over to these new nodes using allocation filters, and then
remove the old nodes from the cluster. This approach will temporarily double the
size of your cluster so it will only work if you have the capacity to expand your
cluster like this.
If you currently use multiple data paths but your cluster is not highly available then
you can migrate to a non-deprecated configuration by taking a snapshot, creating a
new cluster with the desired configuration and restoring the snapshot into it.
Cluster name setting
A node can only join a cluster when it shares its cluster.name with all the other
nodes in the cluster. The default name is elasticsearch, but you should change it to
an appropriate name that describes the purpose of the cluster.
cluster.name: logging-prod
Do not reuse the same cluster names in different environments. Otherwise, nodes
might join the wrong cluster.
Changing the name of a cluster requires a full cluster restart.
Node name setting
Elasticsearch uses node.name as a human-readable identifier for a particular
instance of Elasticsearch. This name is included in the response of many APIs. The
node name defaults to the hostname of the machine when Elasticsearch starts, but
can be configured explicitly in elasticsearch.yml:
node.name: prod-data-2
Network host setting
By default, Elasticsearch only binds to loopback addresses such
as 127.0.0.1 and [::1]. This is sufficient to run a cluster of one or more nodes on a
single server for development and testing, but a resilient production cluster must
involve nodes on other servers. There are many network settings but usually all you
need to configure is network.host:
network.host: 192.168.1.10
When you provide a value for network.host, Elasticsearch assumes that you are
moving from development mode to production mode, and upgrades a number of
system startup checks from warnings to exceptions. See the differences
between development and production modes.
Discovery and cluster formation settings
Configure two important discovery and cluster formation settings before going to
production so that nodes in the cluster can discover each other and elect a master
node.
discovery.seed_hosts
Out of the box, without any network configuration, Elasticsearch will bind to the
available loopback addresses and scan local ports 9300 to 9305 to connect with
other nodes running on the same server. This behavior provides an auto-clustering
experience without having to do any configuration.
When you want to form a cluster with nodes on other hosts, use
the static discovery.seed_hosts setting. This setting provides a list of other nodes in
the cluster that are master-eligible and likely to be live and contactable to seed
the discovery process. This setting accepts a YAML sequence or array of the
addresses of all the master-eligible nodes in the cluster. Each address can be either
an IP address or a hostname that resolves to one or more IP addresses via DNS.
discovery.seed_hosts:
- 192.168.1.10:9300
- 192.168.1.11
- seeds.mydomain.com
- [0:0:0:0:0:ffff:c0a8:10c]:9301
The port is optional and defaults to 9300, but can be overridden.
If a hostname resolves to multiple IP addresses, the node will attempt to discover
other nodes at all resolved addresses.
IPv6 addresses must be enclosed in square brackets.
If your master-eligible nodes do not have fixed names or addresses, use
an alternative hosts provider to find their addresses dynamically.
cluster.initial_master_nodes
When you start an Elasticsearch cluster for the first time, a cluster
bootstrapping step determines the set of master-eligible nodes whose votes are
counted in the first election. In development mode, with no discovery settings
configured, this step is performed automatically by the nodes themselves.
Because auto-bootstrapping is inherently unsafe, when starting a new cluster in
production mode, you must explicitly list the master-eligible nodes whose votes
should be counted in the very first election. You set this list using
the cluster.initial_master_nodes setting.
After the cluster forms successfully for the first time, remove
the cluster.initial_master_nodes setting from each node’s configuration. Do not use
this setting when restarting a cluster or adding a new node to an existing cluster.
discovery.seed_hosts:
- 192.168.1.10:9300
- 192.168.1.11
- seeds.mydomain.com
- [0:0:0:0:0:ffff:c0a8:10c]:9301
cluster.initial_master_nodes:
- master-node-a
- master-node-b
- master-node-c
Identify the initial master nodes by their node.name, which defaults to their
hostname. Ensure that the value in cluster.initial_master_nodes matches
the node.name exactly. If you use a fully qualified domain name (FQDN) such
as master-node-a.example.com for your node names, then you must use the FQDN in
this list. Conversely, if node.name is a bare hostname without any trailing
qualifiers, you must also omit the trailing qualifiers in cluster.initial_master_nodes.
See bootstrapping a cluster and discovery and cluster formation settings.
Heap size settings
By default, Elasticsearch automatically sets the JVM heap size based on a
node’s roles and total memory. We recommend the default sizing for most
production environments.
If needed, you can override the default sizing by manually setting the JVM heap size.
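If you do override the defaults, a minimal sketch (assuming 4 GB is an appropriate heap for the node; the file name heap.options is arbitrary) is to add a file under config/jvm.options.d/ that sets the minimum and maximum heap to the same value:
# config/jvm.options.d/heap.options (example file name)
# Set minimum and maximum heap size to the same value
-Xms4g
-Xmx4g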
JVM heap dump path setting
By default, Elasticsearch configures the JVM to dump the heap on out of memory
exceptions to the default data directory. On RPM and Debian packages, the data
directory is /var/lib/elasticsearch. On Linux, macOS, and Windows distributions,
the data directory is located under the root of the Elasticsearch installation.
If this path is not suitable for receiving heap dumps, modify the
-XX:HeapDumpPath=... entry in jvm.options:
 If you specify a directory, the JVM will generate a filename for the heap dump
based on the PID of the running instance.
 If you specify a fixed filename instead of a directory, the file must not exist
when the JVM needs to perform a heap dump on an out of memory exception.
Otherwise, the heap dump will fail.
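As a sketch, the override can live in its own file under config/jvm.options.d/ (the file name and the directory /var/lib/elasticsearch/heapdumps are assumptions; the path must be writable by the user Elasticsearch runs as):
# config/jvm.options.d/heapdump.options (example file name)
# Point heap dumps at a dedicated directory; the JVM picks the file name
-XX:HeapDumpPath=/var/lib/elasticsearch/heapdumps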
GC logging settings
By default, Elasticsearch enables garbage collection (GC) logs. These are configured
in jvm.options and output to the same default location as the Elasticsearch logs. The
default configuration rotates the logs every 64 MB and can consume up to 2 GB of
disk space.
You can reconfigure JVM logging using the command line options described in JEP
158: Unified JVM Logging. Unless you change the default jvm.options file directly,
the Elasticsearch default configuration is applied in addition to your own settings. To
disable the default configuration, first disable logging by supplying the -
Xlog:disable option, then supply your own command line options. This
disables all JVM logging, so be sure to review the available options and enable
everything that you require.
To see further options not contained in the original JEP, see Enable Logging with the
JVM Unified Logging Framework.
Examples
Change the default GC log output location to /opt/my-app/gc.log by
creating $ES_HOME/config/jvm.options.d/gc.options with some sample options:
# Turn off all previous logging configurations
-Xlog:disable

# Default settings from JEP 158, but with `utctime` instead of `uptime` to match the next line
-Xlog:all=warning:stderr:utctime,level,tags

# Enable GC logging to a custom location with a variety of options
-Xlog:gc*,gc+age=trace,safepoint:file=/opt/my-app/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m
Configure an Elasticsearch Docker container to send GC debug logs to standard
error (stderr). This lets the container orchestrator handle the output. If using
the ES_JAVA_OPTS environment variable, specify:
MY_OPTS="-Xlog:disable -Xlog:all=warning:stderr:utctime,level,tags -Xlog:gc=debug:stderr:utctime"
docker run -e ES_JAVA_OPTS="$MY_OPTS" # etc
Temporary directory settings
By default, Elasticsearch uses a private temporary directory that the startup script
creates immediately below the system temporary directory.
On some Linux distributions, a system utility will clean files and directories
from /tmp if they have not been recently accessed. This behavior can lead to the
private temporary directory being removed while Elasticsearch is running if features
that require the temporary directory are not used for a long time. Removing the
private temporary directory causes problems if a feature that requires this directory
is subsequently used.
If you install Elasticsearch using the .deb or .rpm packages and run it
under systemd, the private temporary directory that Elasticsearch uses is excluded
from periodic cleanup.
If you intend to run the .tar.gz distribution on Linux or macOS for an extended
period, consider creating a dedicated temporary directory for Elasticsearch that is
not under a path that will have old files and directories cleaned from it. This
directory should have permissions set so that only the user that Elasticsearch runs
as can access it. Then, set the $ES_TMPDIR environment variable to point to this
directory before starting Elasticsearch.
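As a sketch, assuming the archive is installed under /usr/share/elasticsearch and runs as a dedicated elasticsearch user (both are assumptions), you might prepare such a directory like this:
# Create a dedicated temporary directory owned by the Elasticsearch user
mkdir -p /usr/share/elasticsearch/tmp
chown elasticsearch:elasticsearch /usr/share/elasticsearch/tmp
chmod 700 /usr/share/elasticsearch/tmp

# Point Elasticsearch at it before startup
export ES_TMPDIR=/usr/share/elasticsearch/tmp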
JVM fatal error log setting
By default, Elasticsearch configures the JVM to write fatal error logs to the default
logging directory. On RPM and Debian packages, this directory
is /var/log/elasticsearch. On Linux, macOS, and Windows distributions,
the logs directory is located under the root of the Elasticsearch installation.
These are logs produced by the JVM when it encounters a fatal error, such as a
segmentation fault. If this path is not suitable for receiving logs, modify the -
XX:ErrorFile=... entry in jvm.options.
Cluster backups
In a disaster, snapshots can prevent permanent data loss. Snapshot lifecycle
management is the easiest way to take regular backups of your cluster. For more
information, see Create a snapshot.
Taking a snapshot is the only reliable and supported way to back up a
cluster. You cannot back up an Elasticsearch cluster by making copies of the data
directories of its nodes. There are no supported methods to restore any data from a
filesystem-level backup. If you try to restore a cluster from such a backup, it may
fail with reports of corruption or missing files or other data inconsistencies, or it may
appear to have succeeded having silently lost some of your data.
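For illustration, a minimal SLM policy might look like the following sketch (the policy name, schedule, and repository name my_repository are assumptions; the snapshot repository must already be registered):
PUT _slm/policy/nightly-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "my_repository",
  "config": {
    "indices": "*",
    "include_global_state": true
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
This takes a snapshot of all indices every night and keeps between 5 and 50 snapshots for up to 30 days.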
Mapping
Mapping is the process of defining how a document, and the fields it contains, are
stored and indexed.
Each document is a collection of fields, which each have their own data type. When
mapping your data, you create a mapping definition, which contains a list of fields
that are pertinent to the document. A mapping definition also includes metadata
fields, like the _source field, which customize how a document’s associated
metadata is handled.
Use dynamic mapping and explicit mapping to define your data. Each method
provides different benefits based on where you are in your data journey. For
example, explicitly map fields where you don’t want to use the defaults, or to gain
greater control over which fields are created. You can then allow Elasticsearch to
add other fields dynamically.
Before 7.0.0, the mapping definition included a type name. Elasticsearch 7.0.0 and
later no longer accept a default mapping. See Removal of mapping types.
Experiment with mapping options
Define runtime fields in a search request to experiment with different mapping
options, and also fix mistakes in your index mapping values by overriding values in
the mapping during the search request.
Dynamic mapping
Dynamic mapping allows you to experiment with and explore data when you’re just
getting started. Elasticsearch adds new fields automatically, just by indexing a
document. You can add fields to the top-level mapping, and to
inner object and nested fields.
Use dynamic templates to define custom mappings that are applied to dynamically
added fields based on the matching condition.
Explicit mapping
Explicit mapping allows you to precisely choose how to define the mapping
definition, such as:
 Which string fields should be treated as full text fields.
 Which fields contain numbers, dates, or geolocations.
 The format of date values.
 Custom rules to control the mapping for dynamically added fields.
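As a minimal illustration (the index name my-index-000001 and the field names are assumptions), an explicit mapping that makes these choices might look like:
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "title":      { "type": "text" },
      "status":     { "type": "keyword" },
      "view_count": { "type": "integer" },
      "created_at": { "type": "date", "format": "yyyy-MM-dd" }
    }
  }
}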
Use runtime fields to make schema changes without reindexing. You can use
runtime fields in conjunction with indexed fields to balance resource usage and
performance. Your index will be smaller, but with slower search performance.
Settings to prevent mapping explosion
Defining too many fields in an index can lead to a mapping explosion, which can
cause out of memory errors and difficult situations to recover from.
Consider a situation where every new document inserted introduces new fields,
such as with dynamic mapping. Each new field is added to the index mapping,
which can become a problem as the mapping grows.
Use the mapping limit settings to limit the number of field mappings (created
manually or dynamically) and prevent documents from causing a mapping
explosion.
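One of these limits is index.mapping.total_fields.limit, which by default caps an index at 1000 field mappings. A sketch of raising it for an existing index (my-index-000001 is an assumed name):
PUT /my-index-000001/_settings
{
  "index.mapping.total_fields.limit": 2000
}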
Runtime fields
A runtime field is a field that is evaluated at query time. Runtime fields enable you
to:
 Add fields to existing documents without reindexing your data
 Start working with your data without understanding how it’s structured
 Override the value returned from an indexed field at query time
 Define fields for a specific use without modifying the underlying schema
You access runtime fields from the search API like any other field, and Elasticsearch
sees runtime fields no differently. You can define runtime fields in the index
mapping or in the search request. Your choice, which is part of the inherent
flexibility of runtime fields.
Use the fields parameter on the _search API to retrieve the values of runtime fields.
Runtime fields won’t display in _source, but the fields API works for all fields, even
those that were not sent as part of the original _source.
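A minimal sketch of a runtime field defined directly in a search request (the index name, the indexed numeric price field, and the 20% uplift are all assumptions):
GET /my-index-000001/_search
{
  "runtime_mappings": {
    "price_with_tax": {
      "type": "double",
      "script": {
        "source": "emit(doc['price'].value * 1.2)"
      }
    }
  },
  "fields": ["price_with_tax"],
  "query": {
    "range": { "price_with_tax": { "gte": 100 } }
  }
}
The runtime field is queried, and its values are returned via the fields parameter, without anything being added to the index.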
Runtime fields are useful when working with log data (see examples), especially
when you’re unsure about the data structure. Your search speed decreases, but your
index size is much smaller and you can more quickly process logs without having to
index them.
Benefits
Because runtime fields aren’t indexed, adding a runtime field doesn’t increase the
index size. You define runtime fields directly in the index mapping, saving storage
costs and increasing ingestion speed. You can more quickly ingest data into the
Elastic Stack and access it right away. When you define a runtime field, you can
immediately use it in search requests, aggregations, filtering, and sorting.
If you change a runtime field into an indexed field, you don’t need to modify any
queries that refer to the runtime field. Better yet, you can refer to some indices
where the field is a runtime field, and other indices where the field is an indexed
field. You have the flexibility to choose which fields to index and which ones to keep
as runtime fields.
At its core, the most important benefit of runtime fields is the ability to add fields to
documents after you’ve ingested them. This capability simplifies mapping decisions
because you don’t have to decide how to parse your data up front, and can use
runtime fields to amend the mapping at any time. Using runtime fields allows for a
smaller index and faster ingest time, which combined use less resources and reduce
your operating costs.
Incentives
Runtime fields can replace many of the ways you can use scripting with
the _search API. How you use a runtime field is impacted by the number of
documents that the included script runs against. For example, if you’re using
the fields parameter on the _search API to retrieve the values of a runtime field, the
script runs only against the top hits just like script fields do.
You can use script fields to access values in _source and return calculated values
based on a script evaluation. Runtime fields have the same capabilities, but provide
greater flexibility because you can query and aggregate on runtime fields in a
search request. Script fields can only fetch values.
Similarly, you could write a script query that filters documents in a search request
based on a script. Runtime fields provide a very similar feature that is more flexible.
You write a script to create field values and they are available everywhere, such
as fields, all queries, and aggregations.
You can also use scripts to sort search results, but that same script works exactly
the same in a runtime field.
If you move a script from any of these sections in a search request to a runtime field
that is computing values from the same number of documents, the performance
should be about the same. The performance for these features is largely dependent
upon the calculations that the included script is running and how many documents
the script runs against.
Compromises
Runtime fields use less disk space and provide flexibility in how you access your
data, but can impact search performance based on the computation defined in the
runtime script.
To balance search performance and flexibility, index fields that you’ll frequently
search for and filter on, such as a timestamp. Elasticsearch automatically uses
these indexed fields first when running a query, resulting in a fast response time.
You can then use runtime fields to limit the number of fields that Elasticsearch
needs to calculate values for. Using indexed fields in tandem with runtime fields
provides flexibility in the data that you index and how you define queries for other
fields.
Use the asynchronous search API to run searches that include runtime fields. This
method of search helps to offset the performance impacts of computing values for
runtime fields in each document containing that field. If the query can’t return the
result set synchronously, you’ll get results asynchronously as they become
available.
Queries against runtime fields are considered expensive.
If search.allow_expensive_queries is set to false, expensive queries are not allowed
and Elasticsearch will reject any queries against runtime fields.
Field data types
Each field has a field data type, or field type. This type indicates the kind of data the
field contains, such as strings or boolean values, and its intended use. For example,
you can index strings to both text and keyword fields. However, text field values
are analyzed for full-text search while keyword strings are left as-is for filtering and
sorting.
Field types are grouped by family. Types in the same family have exactly the same
search behavior but may have different space usage or performance characteristics.
Currently, there are two type families, keyword and text. Other type families have
only a single field type. For example, the boolean type family consists of one field
type: boolean.
Common types
binary
Binary value encoded as a Base64 string.
boolean
true and false values.
Keywords
The keyword family, including keyword, constant_keyword, and wildcard.
Numbers
Numeric types, such as long and double, used to express amounts.
Dates
Date types, including date and date_nanos.
alias
Defines an alias for an existing field.
Objects and relational types
object
A JSON object.
flattened
An entire JSON object as a single field value.
nested
A JSON object that preserves the relationship between its subfields.
join
Defines a parent/child relationship for documents in the same index.
Structured data types
Range
Range types, such as long_range, double_range, date_range, and ip_range.
ip
IPv4 and IPv6 addresses.
version
Software versions. Supports Semantic Versioning precedence rules.
murmur3
Computes and stores hashes of values.
Aggregate data types
aggregate_metric_double
Pre-aggregated metric values.
histogram
Pre-aggregated numerical values in the form of a histogram.
Text search types
text fields
The text family, including text and match_only_text. Analyzed, unstructured text.
annotated-text
Text containing special markup. Used for identifying named entities.
completion
Used for auto-complete suggestions.
search_as_you_type
A text-like type for as-you-type completion.
semantic_text
token_count
A count of tokens in a text.
Document ranking types
dense_vector
Records dense vectors of float values.
sparse_vector
Records sparse vectors of float values.
rank_feature
Records a numeric feature to boost hits at query time.
rank_features
Records numeric features to boost hits at query time.
Spatial data types
geo_point
Latitude and longitude points.
geo_shape
Complex shapes, such as polygons.
point
Arbitrary cartesian points.
shape
Arbitrary cartesian geometries.
Other types
percolator
Indexes queries written in Query DSL.
Arrays
In Elasticsearch, arrays do not require a dedicated field data type. Any field can
contain zero or more values by default; however, all values in the array must be of
the same field type. See Arrays.
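For instance (my-index-000001 and the field names are assumptions), an array is indexed simply by sending multiple values for a field:
PUT /my-index-000001/_doc/1
{
  "message": "first post",
  "tags": ["search", "analytics", "elastic"]
}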
Multi-fields
It is often useful to index the same field in different ways for different purposes. For
instance, a string field could be mapped as a text field for full-text search, and as
a keyword field for sorting or aggregations. Alternatively, you could index a text
field with the standard analyzer, the english analyzer, and the french analyzer.
This is the purpose of multi-fields. Most field types support multi-fields via
the fields parameter.
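A minimal multi-field sketch (index and field names are assumptions): the city field is indexed as text for full-text search, with a city.raw keyword sub-field for sorting and aggregations.
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}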
Metadata fields
Each document has metadata associated with it, such as
the _index and _id metadata fields. The behavior of some of these metadata fields
can be customized when a mapping is created.
Identity metadata fields
_index
The index to which the document belongs.
_id
The document’s ID.
Document source metadata fields
_source
The original JSON representing the body of the document.
_size
The size of the _source field in bytes, provided by the mapper-size plugin.
Doc count metadata field
_doc_count
A custom field used for storing doc counts when a document represents pre-
aggregated data.
Indexing metadata fields
_field_names
All fields in the document which contain non-null values.
_ignored
All fields in the document that have been ignored at index time because
of ignore_malformed.
Routing metadata field
_routing
A custom routing value which routes a document to a particular shard.
Other metadata field
_meta
Application specific metadata.
_tier
The current data tier preference of the index to which the document belongs.
Text analysis
Text analysis is the process of converting unstructured text, like the body of an
email or a product description, into a structured format that’s optimized for search.
When to configure text analysis
Elasticsearch performs text analysis when indexing or searching text fields.
If your index doesn’t contain text fields, no further setup is needed; you can skip the
pages in this section.
However, if you use text fields or your text searches aren’t returning results as
expected, configuring text analysis can often help. You should also look into analysis
configuration if you’re using Elasticsearch to:
 Build a search engine
 Mine unstructured data
 Fine-tune search for a specific language
 Perform lexicographic or linguistic research
Text analysis overview
Text analysis enables Elasticsearch to perform full-text search, where the search
returns all relevant results rather than just exact matches.
If you search for Quick fox jumps, you probably want the document that contains A
quick brown fox jumps over the lazy dog, and you might also want documents that
contain related words like fast fox or foxes leap.
Tokenization
Analysis makes full-text search possible through tokenization: breaking a text down
into smaller chunks, called tokens. In most cases, these tokens are individual words.
If you index the phrase the quick brown fox jumps as a single string and the user
searches for quick fox, it isn’t considered a match. However, if you tokenize the
phrase and index each word separately, the terms in the query string can be looked
up individually. This means they can be matched by searches for quick fox, fox
brown, or other variations.
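You can see tokenization in action with the analyze API; a minimal sketch:
POST _analyze
{
  "tokenizer": "standard",
  "text": "the quick brown fox jumps"
}
The response lists each token (the, quick, brown, fox, jumps) along with its position and character offsets.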
Normalization
Tokenization enables matching on individual terms, but each token is still matched
literally. This means:
 A search for Quick would not match quick, even though you likely want either
term to match the other
 Although fox and foxes share the same root word, a search for foxes would
not match fox or vice versa.
 A search for jumps would not match leaps. While they don’t share a root
word, they are synonyms and have a similar meaning.
To solve these problems, text analysis can normalize these tokens into a standard
format. This allows you to match tokens that are not exactly the same as the search
terms, but similar enough to still be relevant. For example:
 Quick can be lowercased: quick.
 foxes can be stemmed, or reduced to its root word: fox.
 jump and leap are synonyms and can be indexed as a single word: jump.
To ensure search terms match these words as intended, you can apply the same
tokenization and normalization rules to the query string. For example, a search
for Foxes leap can be normalized to a search for fox jump.
Customize text analysis
Text analysis is performed by an analyzer, a set of rules that govern the entire
process.
Elasticsearch includes a default analyzer, called the standard analyzer, which works
well for most use cases right out of the box.
If you want to tailor your search experience, you can choose a different built-in
analyzer or even configure a custom one. A custom analyzer gives you control over
each step of the analysis process, including:
 Changes to the text before tokenization
 How text is converted to tokens
 Normalization changes made to tokens before indexing or search
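A minimal sketch of a custom analyzer defined in the index settings (index and analyzer names are assumptions): it strips HTML before tokenization, tokenizes with the standard tokenizer, then lowercases and ASCII-folds the tokens.
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}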
Anatomy of an analyzer
An analyzer — whether built-in or custom — is just a package which contains three
lower-level building blocks: character filters, tokenizers, and token filters.
The built-in analyzers pre-package these building blocks into analyzers suitable for
different languages and types of text. Elasticsearch also exposes the individual
building blocks so that they can be combined to define new custom analyzers.
Character filters
A character filter receives the original text as a stream of characters and can
transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Hindu-Arabic numerals (٠١
٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or to strip HTML
elements like <b> from the stream.
An analyzer may have zero or more character filters, which are applied in order.
Tokenizer
A tokenizer receives a stream of characters, breaks it up into
individual tokens (usually individual words), and outputs a stream of tokens.
For instance, a whitespace tokenizer breaks text into tokens whenever it sees
any whitespace. It would convert the text "Quick brown fox!" into the
terms [Quick, brown, fox!].
The tokenizer is also responsible for recording the order or position of each term
and the start and end character offsets of the original word which the term
represents.
An analyzer must have exactly one tokenizer.
Token filters
A token filter receives the token stream and may add, remove, or change
tokens. For example, a lowercase token filter converts all tokens to
lowercase, a stop token filter removes common words (stop words)
like the from the token stream, and a synonym token filter introduces
synonyms into the token stream.
Token filters are not allowed to change the position or character offsets of each
token.
An analyzer may have zero or more token filters, which are applied in order.
Index and search analysis
Text analysis occurs at two times:
Index time
When a document is indexed, any text field values are analyzed.
Search time
When running a full-text search on a text field, the query string (the text the user is
searching for) is analyzed.
Search time is also called query time.
The analyzer, or set of analysis rules, used at each time is called the index
analyzer or search analyzer respectively.
How the index and search analyzer work together
In most cases, the same analyzer should be used at index and search time. This
ensures the values and query strings for a field are changed into the same form of
tokens. In turn, this ensures the tokens match as expected during a search.
Example
A document is indexed with the following value in a text field:
The QUICK brown foxes jumped over the dog!
The index analyzer for the field converts the value into tokens and normalizes them.
In this case, each of the tokens represents a word:
[ quick, brown, fox, jump, over, dog ]
These tokens are then indexed.
Later, a user searches the same text field for:
"Quick fox"
The user expects this search to match the sentence indexed earlier, The QUICK
brown foxes jumped over the dog!.
However, the query string does not contain the exact words used in the document’s
original text:
 Quick vs QUICK
 fox vs foxes
To account for this, the query string is analyzed using the same analyzer. This
analyzer produces the following tokens:
[ quick, fox ]
To execute the search, Elasticsearch compares these query string tokens to the
tokens indexed in the text field.
Token   Query string   text field
quick   X              X
brown                  X
fox     X              X
jump                   X
over                   X
dog                    X
Because the field value and query string were analyzed in the same way, they
created similar tokens. The tokens quick and fox are exact matches. This means the
search matches the document containing "The QUICK brown foxes jumped over the
dog!", just as the user expects.
When to use a different search analyzer
While less common, it sometimes makes sense to use different analyzers at index
and search time. To enable this, Elasticsearch allows you to specify a separate
search analyzer.
Generally, a separate search analyzer should only be specified when using the same
form of tokens for field values and query strings would create unexpected or
irrelevant search matches.
Example
Elasticsearch is used to create a search engine that matches only words that start
with a provided prefix. For instance, a search for tr should return tram or trope—but
never taxi or bat.
A document is added to the search engine’s index; this document contains one such
word in a text field:
"Apple"
The index analyzer for the field converts the value into tokens and normalizes them.
In this case, each of the tokens represents a potential prefix for the word:
[ a, ap, app, appl, apple]
These tokens are then indexed.
Later, a user searches the same text field for:
"appli"
The user expects this search to match only words that start with appli, such
as appliance or application. The search should not match apple.
However, if the index analyzer is used to analyze this query string, it would produce
the following tokens:
[ a, ap, app, appl, appli ]
When Elasticsearch compares these query string tokens to the ones indexed
for apple, it finds several matches.
Token   appli   apple
a       X       X
ap      X       X
app     X       X
appl    X       X
appli   X
This means the search would erroneously match apple. Not only that, it would
match any word starting with a.
To fix this, you can specify a different search analyzer for query strings used on
the text field.
In this case, you could specify a search analyzer that produces a single token rather
than a set of prefixes:
[ appli ]
This query string token would only match tokens for words that start with appli,
which better aligns with the user’s search expectations.
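One way to sketch this setup (all names and gram sizes are assumptions) is an edge_ngram token filter in the index analyzer combined with a plain search analyzer declared via the search_analyzer mapping parameter:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
At index time each word is expanded into its prefixes; at search time the standard analyzer leaves the query string as whole lowercased terms, so appli matches only prefix tokens for words beginning with appli.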
Stemming
Stemming is the process of reducing a word to its root form. This ensures variants of
a word match during a search.
For example, walking and walked can be stemmed to the same root word: walk.
Once stemmed, an occurrence of either word would match the other in a search.
Stemming is language-dependent but often involves removing prefixes and suffixes
from words.
In some cases, the root form of a stemmed word may not be a real word. For
example, jumping and jumpiness can both be stemmed to jumpi. While jumpi isn’t a
real English word, it doesn’t matter for search; if all variants of a word are reduced
to the same root form, they will match correctly.
Stemmer token filters
In Elasticsearch, stemming is handled by stemmer token filters. These token filters
can be categorized based on how they stem words:
 Algorithmic stemmers, which stem words based on a set of rules
 Dictionary stemmers, which stem words by looking them up in a dictionary
Because stemming changes tokens, we recommend using the same stemmer token
filters during index and search analysis.
Algorithmic stemmers
Algorithmic stemmers apply a series of rules to each word to reduce it to its root
form. For example, an algorithmic stemmer for English may remove the -s and -
es suffixes from the end of plural words.
Algorithmic stemmers have a few advantages:
 They require little setup and usually work well out of the box.
 They use little memory.
 They are typically faster than dictionary stemmers.
However, most algorithmic stemmers only alter the existing text of a word. This
means they may not work well with irregular words that don’t contain their root
form, such as:
 be, are, and am
 mouse and mice
 foot and feet
The following token filters use algorithmic stemming:
 stemmer, which provides algorithmic stemming for several languages, some
with additional variants.
 kstem, a stemmer for English that combines algorithmic stemming with a
built-in dictionary.
 porter_stem, our recommended algorithmic stemmer for English.
 snowball, which uses Snowball-based stemming rules for several languages.
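A minimal sketch of an analyzer that applies the stemmer token filter, which defaults to English (index and analyzer names are assumptions):
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stemmed_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stemmer"]
        }
      }
    }
  }
}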
Dictionary stemmers
Dictionary stemmers look up words in a provided dictionary, replacing unstemmed
word variants with stemmed words from the dictionary.
In theory, dictionary stemmers are well suited for:
 Stemming irregular words
 Discerning between words that are spelled similarly but not related
conceptually, such as:
 organ and organization
 broker and broken
In practice, algorithmic stemmers typically outperform dictionary stemmers. This is
because dictionary stemmers have the following disadvantages:
 Dictionary quality
A dictionary stemmer is only as good as its dictionary. To work well, these
dictionaries must include a significant number of words, be updated regularly,
and change with language trends. Often, by the time a dictionary has been
made available, it’s incomplete and some of its entries are already outdated.
 Size and performance
Dictionary stemmers must load all words, prefixes, and suffixes from the
dictionary into memory. This can use a significant amount of RAM. Low-quality
dictionaries may also be less efficient with prefix and suffix removal, which
can slow the stemming process significantly.
You can use the hunspell token filter to perform dictionary stemming.
If available, we recommend trying an algorithmic stemmer for your language before
using the hunspell token filter.
Control stemming
Sometimes stemming can produce shared root words that are spelled similarly but
not related conceptually. For example, a stemmer may reduce
both skies and skiing to the same root word: ski.
To prevent this and better control stemming, you can use the following token filters:
 stemmer_override, which lets you define rules for stemming specific tokens.
 keyword_marker, which marks specified tokens as keywords. Keyword tokens
are not stemmed by subsequent stemmer token filters.
 conditional, which can be used to mark tokens as keywords, similar to
the keyword_marker filter.
For built-in language analyzers, you also can use the stem_exclusion parameter to
specify a list of words that won’t be stemmed.
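As a sketch of controlling stemming (names and rules are assumptions), a stemmer_override filter placed before the stemmer rewrites specific tokens and protects them from further stemming, so that, for example, skiing is never conflated with skies:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stem_overrides": {
          "type": "stemmer_override",
          "rules": [
            "skies => sky",
            "skiing => ski"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stem_overrides", "stemmer"]
        }
      }
    }
  }
}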
Token graphs
When a tokenizer converts a text into a stream of tokens, it also records the
following:
 The position of each token in the stream
 The positionLength, the number of positions that a token spans
Using these, you can create a directed acyclic graph, called a token graph, for a
stream. In a token graph, each position represents a node. Each token represents an
edge or arc, pointing to the next position.
Synonyms
Some token filters can add new tokens, like synonyms, to an existing token stream.
These synonyms often span the same positions as existing tokens.
In the following graph, quick and its synonym fast both have a position of 0. They
span the same positions.
Multi-position tokens
Some token filters can add tokens that span multiple positions. These can include
tokens for multi-word synonyms, such as using "atm" as a synonym for "automatic
teller machine."
However, only some token filters, known as graph token filters, accurately record
the positionLength for multi-position tokens. These filters include:
 synonym_graph
 word_delimiter_graph
Some tokenizers, such as the nori_tokenizer, also accurately decompose compound
tokens into multi-position tokens.
In the following graph, domain name system and its synonym, dns, both have a
position of 0. However, dns has a positionLength of 3. Other tokens in the graph
have a default positionLength of 1.
Using token graphs for search
Indexing ignores the positionLength attribute and does not support token graphs
containing multi-position tokens.
However, queries, such as the match or match_phrase query, can use these graphs
to generate multiple sub-queries from a single query string.
Example
A user runs a search for the following phrase using the match_phrase query:
domain name system is fragile
During search analysis, dns, a synonym for domain name system, is added to the
query string’s token stream. The dns token has a positionLength of 3.
The match_phrase query uses this graph to generate sub-queries for the following
phrases:
dns is fragile
domain name system is fragile
This means the query matches documents containing either dns is fragile or domain
name system is fragile.
Invalid token graphs
The following token filters can add tokens that span multiple positions but only
record a default positionLength of 1:
 synonym
 word_delimiter
This means these filters will produce invalid token graphs for streams containing
such tokens.
In the following graph, dns is a multi-position synonym for domain name system.
However, dns has the default positionLength value of 1, resulting in an invalid
graph.
Avoid using invalid token graphs for search. Invalid graphs can cause unexpected
search results.
Data streams
A data stream lets you store append-only time series data across multiple indices
while giving you a single named resource for requests. Data streams are well-suited
for logs, events, metrics, and other continuously generated data.
You can submit indexing and search requests directly to a data stream. The stream
automatically routes the request to backing indices that store the stream’s data. You
can use index lifecycle management (ILM) to automate the management of these
backing indices. For example, you can use ILM to automatically move older backing
indices to less expensive hardware and delete unneeded indices. ILM can help you
reduce costs and overhead as your data grows.
Should you use a data stream?
To determine whether you should use a data stream for your data, you should
consider the format of the data, and your expected interaction. A good candidate
for using a data stream will match the following criteria:
 Your data contains a timestamp field, or one could be automatically
generated.
 You mostly perform indexing requests, with occasional updates and deletes.
 You index documents without an _id, or when indexing documents with an
explicit _id you expect first-write-wins behavior.
For most time series data use-cases, a data stream will be a good fit. However, if
you find that your data doesn’t fit into these categories (for example, if you
frequently send multiple documents using the same _id expecting last-write-wins),
you may want to use an index alias with a write index instead. See documentation
for managing time series data without a data stream for more information.
Keep in mind that some features such as Time Series Data Streams (TSDS) and data
stream lifecycles require a data stream.
Backing indices
A data stream consists of one or more hidden, auto-generated backing indices.
A data stream requires a matching index template. The template contains the
mappings and settings used to configure the stream’s backing indices.
Every document indexed to a data stream must contain a @timestamp field,
mapped as a date or date_nanos field type. If the index template doesn’t specify a
mapping for the @timestamp field, Elasticsearch maps @timestamp as a date field
with default options.
The same index template can be used for multiple data streams. You cannot delete
an index template in use by a data stream.
The name pattern for the backing indices is an implementation detail and no
intelligence should be derived from it. The only invariant that holds is that each
data stream generation index will have a unique name.
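A minimal sketch of such a template (all names are assumptions; the empty data_stream object marks names matching the pattern as data streams):
PUT _index_template/my-logs-template
{
  "index_patterns": ["my-logs-*"],
  "data_stream": {},
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message":    { "type": "text" }
      }
    }
  }
}
You could then create a matching stream explicitly with PUT _data_stream/my-logs-web, or let Elasticsearch create it automatically on the first indexing request that targets a matching name.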
Read requests
When you submit a read request to a data stream, the stream routes the request to
all its backing indices.
Write index
The most recently created backing index is the data stream’s write index. The
stream adds new documents to this index only.
You cannot add new documents to other backing indices, even by sending requests
directly to the index.
You also cannot perform operations on a write index that may hinder indexing, such
as:
 Clone
 Delete
 Shrink
 Split
Rollover
A rollover creates a new backing index that becomes the stream’s new write index.
We recommend using ILM to automatically roll over data streams when the write
index reaches a specified age or size. If needed, you can also manually roll over a
data stream.
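A manual rollover is a single request (my-data-stream is an assumed name):
POST /my-data-stream/_rollover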
Generation
Each data stream tracks its generation: a six-digit, zero-padded integer starting
at 000001.
When a backing index is created, the index is named using the following
convention:
.ds-<data-stream>-<yyyy.MM.dd>-<generation>
<yyyy.MM.dd> is the backing index’s creation date. Backing indices with a higher
generation contain more recent data. For example, the web-server-logs data stream
has a generation of 34. The stream’s most recent backing index, created on 7 March
2099, is named .ds-web-server-logs-2099.03.07-000034.
Some operations, such as a shrink or restore, can change a backing index’s name.
These name changes do not remove a backing index from its data stream.
The generation of the data stream can change without a new index being added to
the data stream (e.g. when an existing backing index is shrunk). This means the
backing indices for some generations will never exist. You should not derive any
intelligence from the backing index names.
Append-only (mostly)
Data streams are designed for use cases where existing data is rarely updated. You
cannot send update or deletion requests for existing documents directly to a data
stream. However, you can still update or delete documents in a data stream by
submitting requests directly to the document’s backing index.
If you need to update a larger number of documents in a data stream, you can use
the update by query and delete by query APIs.
If you frequently send multiple documents using the same _id expecting last-write-
wins, you may want to use an index alias with a write index instead. See Manage
time series data without data streams.
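For example, a sketch of updating matching documents across a whole data stream (the stream name, field, and script are assumptions):
POST /my-data-stream/_update_by_query
{
  "query": {
    "match": { "user.id": "abc123" }
  },
  "script": {
    "source": "ctx._source.user.status = 'inactive'",
    "lang": "painless"
  }
}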
Dimensions
Dimensions are field names and values that, in combination, identify a document’s
time series. In most cases, a dimension describes some aspect of the entity you’re
measuring. For example, documents related to the same weather sensor may
always have the same sensor_id and location values.
A TSDS document is uniquely identified by its time series and timestamp, both of
which are used to generate the document _id. So, two documents with the same
dimensions and the same timestamp are considered to be duplicates. When you use
the _bulk endpoint to add documents to a TSDS, a second document with the same
timestamp and dimensions overwrites the first. When you use the PUT
/<target>/_create/<_id> format to add an individual document and a document
with the same _id already exists, an error is generated.
You mark a field as a dimension using the boolean time_series_dimension mapping
parameter. The following field types support the time_series_dimension parameter:
Metrics
Metrics are fields that contain numeric measurements, as well as aggregations
and/or downsampling values based on those measurements. While not required,
documents in a TSDS typically contain one or more metric fields.
Metrics differ from dimensions in that while dimensions generally remain constant,
metrics are expected to change over time, even if rarely or slowly.
To mark a field as a metric, you must specify a metric type using
the time_series_metric mapping parameter. The following field types support
the time_series_metric parameter:
 aggregate_metric_double
 histogram
 All numeric field types
Accepted metric types vary based on the field type:
Valid values for time_series_metric
counter
A cumulative metric that only monotonically increases or resets to 0 (zero). For
example, a count of errors or completed tasks.
A counter field has additional semantic meaning, because it represents a
cumulative counter. This works well with the rate aggregation, since a rate can be
derived from a cumulative monotonically increasing counter. However a number of
aggregations (for example sum) compute results that don’t make sense for a
counter field, because of its cumulative nature.
Only numeric and aggregate_metric_double fields support the counter metric type.
Due to the cumulative nature of counter fields, the following aggregations are
supported and expected to provide meaningful results with
the counter field: rate, histogram, range, min, max, top_metrics and variable_width_
histogram. In order to prevent issues with existing integrations and custom
dashboards, we also allow the following aggregations, even if the result might be
meaningless on counters: avg, box plot, cardinality, extended stats, median
absolute deviation, percentile ranks, percentiles, stats, sum and value count.
gauge
A metric that represents a single numeric value that can arbitrarily increase or decrease.
For example, a temperature or available disk space.
Only numeric and aggregate_metric_double fields support the gauge metric type.
null (Default)
Not a time series metric.
Time series mode
The matching index template for a TSDS must contain a data_stream object with
the index_mode: time_series option. This option ensures the TSDS creates backing
indices with an index.mode setting of time_series. This setting enables most TSDS-
related functionality in the backing indices.
If you convert an existing data stream to a TSDS, only backing indices created after
the conversion have an index.mode of time_series. You can’t change
the index.mode of an existing backing index.
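A minimal sketch of a TSDS index template (all names are assumptions; depending on your version the time series option may be expressed as the index.mode setting shown here or via the data_stream object's index_mode option mentioned above; index.routing_path is described under Dimension-based routing below):
PUT _index_template/metrics-weather-template
{
  "index_patterns": ["metrics-weather-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": ["sensor_id", "location"]
    },
    "mappings": {
      "properties": {
        "@timestamp":  { "type": "date" },
        "sensor_id":   { "type": "keyword", "time_series_dimension": true },
        "location":    { "type": "keyword", "time_series_dimension": true },
        "temperature": { "type": "double",  "time_series_metric": "gauge" }
      }
    }
  }
}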
_tsid metadata field
When you add a document to a TSDS, Elasticsearch automatically generates
a _tsid metadata field for the document. The _tsid is an object containing the
document’s dimensions. Documents in the same TSDS with the same _tsid are part
of the same time series.
The _tsid field is not queryable or updatable. You also can’t retrieve a
document’s _tsid using a get document request. However, you can use
the _tsid field in aggregations and retrieve the _tsid value in searches using
the fields parameter.
The format of the _tsid field shouldn’t be relied upon. It may change from version to
version.
Time-bound indices
In a TSDS, each backing index, including the most recent backing index, has a range
of accepted @timestamp values. This range is defined by
the index.time_series.start_time and index.time_series.end_time index settings.
When you add a document to a TSDS, Elasticsearch adds the document to the
appropriate backing index based on its @timestamp value. As a result, a TSDS can
add documents to any TSDS backing index that can receive writes. This applies
even if the index isn’t the most recent backing index.
Some ILM actions mark the source index as read-only, or expect the index to not be
actively written anymore in order to provide good performance. These actions are:
 Delete
 Downsample
 Force merge
 Read only
 Searchable snapshot
 Shrink
Index lifecycle management will not proceed with executing these actions until the
upper time-bound for accepting writes, represented by
the index.time_series.end_time index setting, has lapsed.
If no backing index can accept a document’s @timestamp value, Elasticsearch
rejects the document.
Elasticsearch automatically
configures index.time_series.start_time and index.time_series.end_time settings as
part of the index creation and rollover process.
Look-ahead time
Use the index.look_ahead_time index setting to configure how far into the future
you can add documents to an index. When you create a new write index for a TSDS,
Elasticsearch calculates the index’s index.time_series.end_time value as:
now + index.look_ahead_time
At the time series poll interval (controlled via time_series.poll_interval setting),
Elasticsearch checks if the write index has met the rollover criteria in its index
lifecycle policy. If not, Elasticsearch refreshes the now value and updates the write
index’s index.time_series.end_time to:
now + index.look_ahead_time + time_series.poll_interval
This process continues until the write index rolls over. When the index rolls over,
Elasticsearch sets a final index.time_series.end_time value for the index. This value
borders the index.time_series.start_time for the new write index. This ensures
the @timestamp ranges for neighboring backing indices always border but never
overlap.
Look-back time
Use the index.look_back_time index setting to configure how far in the past you can
add documents to an index. When you create a data stream for a TSDS,
Elasticsearch calculates the index’s index.time_series.start_time value as:
now - index.look_back_time
This setting is only used when a data stream gets created and controls
the index.time_series.start_time index setting of the first backing index. Configuring
this index setting can be useful to accept documents with @timestamp field values
that are older than 2 hours (the index.look_back_time default).
Accepted time range for adding data
A TSDS is designed to ingest current metrics data. When the TSDS is first created
the initial backing index has:
 an index.time_series.start_time value set to now - index.look_back_time
 an index.time_series.end_time value set to now + index.look_ahead_time
Only data that falls inside that range can be indexed.
You can use the get data stream API to check the accepted time range for writing to
any TSDS.
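For example (the stream name is an assumption):
GET _data_stream/metrics-weather-sensors
For a TSDS, the response includes the temporal ranges of the stream's backing indices, from which you can read the currently accepted @timestamp range.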
Dimension-based routing
Within each TSDS backing index, Elasticsearch uses the index.routing_path index
setting to route documents with the same dimensions to the same shards.
When you create the matching index template for a TSDS, you must specify one or
more dimensions in the index.routing_path setting. Each document in a TSDS must
contain one or more dimensions that match the index.routing_path setting.
Dimensions in the index.routing_path setting must be plain keyword fields.
The index.routing_path setting accepts wildcard patterns (for example dim.*) and
can dynamically match new fields. However, Elasticsearch will reject any mapping
updates that add scripted, runtime, or non-dimension, non-keyword fields that
match the index.routing_path value.
TSDS documents don’t support a custom _routing value. Similarly, you can’t require
a _routing value in mappings for a TSDS.
Index sorting
Elasticsearch uses compression algorithms to compress repeated values. This
compression works best when repeated values are stored near each other — in the
same index, on the same shard, and side-by-side in the same shard segment.
Most time series data contains repeated values. Dimensions are repeated across
documents in the same time series. The metric values of a time series may also
change slowly over time.
Internally, each TSDS backing index uses index sorting to order its shard segments
by _tsid and @timestamp. This makes it more likely that these repeated values are
stored near each other for better compression. A TSDS doesn’t support
any index.sort.* index settings.
What’s next?
Now that you know the basics, you’re ready to create a TSDS or convert an existing
data stream to a TSDS.