introduction to
elasticsearch.
Ruslan Zavacky
@ruslanzavacky | ruslan.zavacky@gmail.com
Released in 2010
In 2014, $70 million in Series C
funding
2
real time data
Data flows into your system all the time. The question is … how quickly can that data become an insight? With Elasticsearch, real-time is the only time.
real time analytics
Search isn’t just free text search anymore - it’s about exploring your data. Understanding it. Gaining insights that will make your business better or improve your product.
high availability
Elasticsearch clusters are resilient - they will detect and remove failed nodes, and reorganise themselves to ensure that your data is safe and accessible.
multi-tenancy
A cluster can host multiple indices which can be queried independently or as a group. Index aliases allow you to add indexes on the fly, while being transparent to your application.
3
full text search
Elasticsearch uses Lucene under the covers to provide the most powerful full text search capabilities available in any open source product. Search comes with multi-language support, a powerful query language, support for geolocation, context aware did-you-mean suggestions, autocomplete and search snippets.
document oriented
Store complex real world entities in Elasticsearch as structured JSON documents. All fields are indexed by default, and all the indices can be used in a single query, to return results at breathtaking speed.
conflict management
Optimistic version control can be used where needed to ensure that data is never lost due to conflicting changes from multiple processes.
schema free
Elasticsearch allows you to get started easily. Toss it a JSON document and it will try to detect the data structure, index the data and make it searchable. Later, apply your domain specific knowledge of your data to customise how your data is indexed.
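The optimistic version control mentioned under conflict management can be sketched in a few lines. This is a toy in-memory model, not Elasticsearch code: the names `VersionedStore`, `VersionConflict` and `expected_version` are made up for illustration. The idea is that a writer states which version it read, and the write is rejected if the document has changed since.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    """Toy in-memory document store with optimistic version checks (illustrative only)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, source, expected_version=None):
        # A document that does not exist yet is treated as version 0.
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(source))
        return new_version


store = VersionedStore()
v1 = store.put("1", {"pages": 230})                       # first write
v2 = store.put("1", {"pages": 231}, expected_version=v1)  # update based on v1
try:
    # A second process still holding v1 tries to write and is rejected.
    store.put("1", {"pages": 999}, expected_version=v1)
    conflict = False
except VersionConflict:
    conflict = True
```

The rejected writer is expected to re-read the document and retry, so neither change is silently overwritten.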
4
restful api
Elasticsearch is API driven. Almost any action can be performed using a simple RESTful API using JSON over HTTP. An API already exists in the language of your choice.
per-operation persistence
Elasticsearch puts your data safety first. Document changes are recorded in transaction logs on multiple nodes in the cluster to minimise the chance of any data loss.
apache 2 open source license
Elasticsearch can be downloaded, used and modified free of charge. It is available under the Apache 2 license, one of the most flexible open source licenses available.
built on top of apache lucene™
Apache Lucene is a high performance, full-featured Information Retrieval library, written in Java. Elasticsearch uses Lucene internally to build its state of the art distributed search and analytics capabilities.
5
who
6
I
7
8
Unstructured search
9
Structured search
10
Enrichment
11
Sorting
12
Pagination
13
Aggregation
14
Suggestions
15
Elasticsearch in 10 seconds
• Schema-free, REST & JSON based distributed
document store
• Open Source: Apache License 2.0
• Zero configuration
• Written in Java, extensible
16
The most
important question
17
18
Exploding kittens
on Kickstarter
> 195,794 backers
> $7,840,830 pledged
… and yes, Kickstarter uses
Elasticsearch
19
Capabilities
20
Capabilities
Store schema-less data
Or create a schema for your data
Manipulate your data record by record
Or use multi-document APIs for bulk operations
Run queries and filters on your data for insights
Or, if you are a DevOps person, use the APIs for monitoring
Do not forget the built-in full-text search and analysis
APIs: Document API, Search APIs, Indices API, Cat APIs, Cluster API, Query DSL, Validate API, Search API, More Like This API, Mapping, Analysis, Modules
21
Auto Completion
SELECT name
FROM product
WHERE name LIKE 'd%'
1k records 500k records 20m records
22
Auto Completion
Yea, sure…
23
Auto Completion: FST
24
Auto Completion
• Multiple inputs
• Single unified output
• Scoring
• Payloads
• Synonyms
• Ignoring stopwords
• Going fuzzy
• Statistics
25
Auto Completion
curl -X PUT localhost:9200/hotels/hotel/2 -d '
{
  "name" : "Hotel Monaco",
  "city" : "Munich",
  "name_suggest" : {
    "input" : [
      "Monaco Munich",
      "Hotel Monaco"
    ],
    "output": "Hotel Monaco",
    "weight": 10
  }
}'
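To make the input/output/weight fields above concrete, here is a minimal Python sketch of prefix completion. It is not how Elasticsearch does it (the deck names an FST for that). A sorted list with binary search stands in for the FST, and `ToySuggester` and the second hotel are hypothetical. It shows several inputs mapping to one unified output, ranked by weight.

```python
from bisect import bisect_left


class ToySuggester:
    """Prefix completion over a sorted list (a stand-in for an FST, illustrative only)."""

    def __init__(self):
        self._entries = []  # sorted list of (lowercased input, output, weight)

    def add(self, inputs, output, weight=1):
        for text in inputs:
            self._entries.append((text.lower(), output, weight))
        self._entries.sort()

    def suggest(self, prefix, size=5):
        prefix = prefix.lower()
        # Jump straight to the first entry that could start with the prefix.
        start = bisect_left(self._entries, (prefix,))
        best = {}  # output -> highest weight seen
        for text, output, weight in self._entries[start:]:
            if not text.startswith(prefix):
                break  # sorted order: no later entry can match
            best[output] = max(weight, best.get(output, 0))
        ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
        return [output for output, _ in ranked[:size]]


suggester = ToySuggester()
suggester.add(["Monaco Munich", "Hotel Monaco"], "Hotel Monaco", weight=10)
# Hypothetical second document, added only to show ranking by weight.
suggester.add(["Mercure Munich", "Hotel Mercure"], "Hotel Mercure", weight=5)

by_m = suggester.suggest("m")
by_hotel_mo = suggester.suggest("hotel mo")
no_match = suggester.suggest("x")
```

Unlike the `LIKE 'd%'` scan, the lookup only touches entries that share the prefix, which is the property that keeps completion fast as the data grows.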
26
Faceted Navigation
27
Aggregation & Filtering
Documents → Query → Buckets → Metrics (123, 344, 545)
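The documents, query, buckets, metrics flow can be sketched in plain Python. This is an in-memory illustration under assumed data: the product documents, field names and the `aggregate` helper are all hypothetical and are not the Elasticsearch aggregation API.

```python
from collections import defaultdict

# Hypothetical product documents.
documents = [
    {"name": "Laptop A", "brand": "acme", "price": 900, "in_stock": True},
    {"name": "Laptop B", "brand": "acme", "price": 1100, "in_stock": True},
    {"name": "Laptop C", "brand": "globex", "price": 700, "in_stock": False},
    {"name": "Laptop D", "brand": "globex", "price": 800, "in_stock": True},
]


def aggregate(docs, query, bucket_key, metric_field):
    """Filter docs with a query, group them into buckets, compute metrics per bucket."""
    matched = [doc for doc in docs if query(doc)]  # Query
    buckets = defaultdict(list)                     # Buckets
    for doc in matched:
        buckets[doc[bucket_key]].append(doc)
    return {                                        # Metrics
        key: {
            "doc_count": len(members),
            "avg": sum(d[metric_field] for d in members) / len(members),
        }
        for key, members in buckets.items()
    }


result = aggregate(documents, lambda d: d["in_stock"], "brand", "price")
```

Faceted navigation is the same idea shown in a UI: each bucket becomes a clickable facet and `doc_count` is the number displayed next to it.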
32
Faceted Navigation
33
Snapshot / Restore
Snapshot
curl -XPUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
Restore
curl -XPOST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore"
34
Percolate API
Store queries in Elasticsearch.
Send documents instead of queries.
Get back the stored queries that match.
WUT?
35
Percolate API
Use Case
You promise customers to notify them
when a cheaper plane ticket becomes
available.
Solution
Store each customer's criteria for the desired flight
- departure, destination, max price
When you store flight data, match it against
the saved percolator queries.
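The flight alert use case can be sketched as reverse search in plain Python. This is a toy model, not the Percolate API: the customer names, airport codes and the `percolate` helper are hypothetical. Criteria are stored first, and each incoming flight document is matched against them.

```python
# Stored "queries": each customer's criteria (hypothetical data).
alerts = {
    "alice": {"departure": "RIX", "destination": "LHR", "max_price": 100},
    "bob": {"departure": "RIX", "destination": "LHR", "max_price": 60},
    "carol": {"departure": "RIX", "destination": "JFK", "max_price": 400},
}


def matches(criteria, flight):
    """True if the flight satisfies one customer's stored criteria."""
    return (
        criteria["departure"] == flight["departure"]
        and criteria["destination"] == flight["destination"]
        and flight["price"] <= criteria["max_price"]
    )


def percolate(flight, stored):
    """Return the names of all stored criteria that the flight document matches."""
    return sorted(name for name, criteria in stored.items() if matches(criteria, flight))


# A new flight arrives: who should be notified?
notified = percolate({"departure": "RIX", "destination": "LHR", "price": 80}, alerts)
```

Only alice is notified here: bob's price limit is too low and carol wants a different destination.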
36
Percolate API
Store Query
curl -XPUT 'localhost:9200/my-index/.percolator/1' -d '{
  "query" : {
    "match" : {
      "message" : "bonsai tree"
    }
  }
}'
Match document
curl -XGET 'localhost:9200/my-index/my-type/_percolate' -d '{
  "doc" : {
    "message" : "A new bonsai tree in the office"
  }
}'
37
Percolate API
{
  "took" : 19,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "total" : 1,
  "matches" : [
    {
      "_index" : "my-index",
      "_id" : "1"
    }
  ]
}
38
More like this API
curl -XGET 'http://localhost:9200/memes/meme/1/_mlt?mlt_fields=face&min_doc_freq=1'
39
scalability
40
Distributed & scalable
• Replication
  - Read scalability
  - Removing SPOF
• Sharding
  - Split logical data over several machines
  - Write scalability
  - Control data flows
41
Distributed & scalable
node 1
orders: shards 1 2 3 4
products: shards 1 2
curl -X PUT localhost:9200/orders -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 4,
      "number_of_replicas" : 1
    }
  }
}'
curl -X PUT localhost:9200/products -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 2,
      "number_of_replicas" : 0
    }
  }
}'
42
Distributed & scalable
node 1 node 2
orders orders
1 2 1 2
3 4 3 4
products products
1 2
43
Distributed & scalable
node 1 node 2 node 3
orders orders orders
1 2 2 1
4 3 3 4
products products products
1 2
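The shard counts in the diagrams above follow from simple arithmetic, sketched here in Python. The helper names are hypothetical. The one rule assumed is that a replica is never placed on the same node as another copy of the same shard, so a single node can only hold the primaries.

```python
def total_copies(primaries, replicas):
    """Every primary shard plus its replicas."""
    return primaries * (1 + replicas)


def assignable_copies(primaries, replicas, nodes):
    """Copies that can actually be placed: at most one copy of a shard per node."""
    return primaries * min(1 + replicas, nodes)


# (number_of_shards, number_of_replicas) as configured in the deck.
indices = {"orders": (4, 1), "products": (2, 0)}

summary = {}
for nodes in (1, 2, 3):
    summary[nodes] = {
        name: assignable_copies(primaries, replicas, nodes)
        for name, (primaries, replicas) in indices.items()
    }
```

With one node, only the 4 `orders` primaries are placed and their replicas stay unassigned. From two nodes on, all 8 `orders` copies are placed, and adding a third node spreads the same 8 copies more thinly rather than creating new ones.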
44
API tour
45
Create
» curl -X PUT localhost:9200/books/book/1 -d '
{
  "title" : "Elasticsearch - The definitive guide",
  "authors" : "Clinton Gormley",
  "started" : "2013-02-04",
  "pages" : 230
}'
46
Update
» curl -X PUT localhost:9200/books/book/1 -d '
{
  "title" : "Elasticsearch - The definitive guide",
  "authors" : [ "Clinton Gormley", "Zachary Tong"],
  "started" : "2013-02-04",
  "pages" : 230
}'
47
Delete
» curl -X DELETE localhost:9200/books/book/1
Get
» curl -X GET localhost:9200/books/book/1
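The same create, get and delete calls can be built from Python's standard library, which shows that the API is nothing more than HTTP verbs plus JSON. This is a sketch: `BASE_URL` assumes a local node on the default port, `build_request` is a hypothetical helper, and the requests are only constructed here, not sent.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: local node on the default port


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


book = {
    "title": "Elasticsearch - The definitive guide",
    "authors": ["Clinton Gormley", "Zachary Tong"],
    "started": "2013-02-04",
    "pages": 230,
}

create = build_request("PUT", "/books/book/1", book)   # create or replace
fetch = build_request("GET", "/books/book/1")          # read
delete = build_request("DELETE", "/books/book/1")      # delete

# To actually send one against a running node:
#   from urllib.request import urlopen
#   print(urlopen(create).read().decode("utf-8"))
```

Note that the update slide uses the same `PUT` as create: sending the full document to an existing ID replaces it.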
48
Search
» curl -X GET 'localhost:9200/books/_search?q=elasticsearch'
{
  "took" : 2, "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : {
    "total" : 1, "max_score" : 0.076713204,
    "hits" : [ {
      "_index" : "books", "_type" : "book", "_id" : "1",
      "_score" : 0.076713204, "_source" : {
        "title" : "Elasticsearch - The definitive guide",
        "authors" : [ "Clinton Gormley", "Zachary Tong" ],
        "started" : "2013-02-04", "pages" : 230
      }
    }]
  }
}
49
Search Query DSL
» curl -XGET 'localhost:9200/books/book/_search' -d '{
  "query": {
    "filtered" : {
      "query" : {
        "match": {
          "text" : {
            "query" : "To Be Or Not To Be",
            "cutoff_frequency" : 0.01
          }
        }
      },
      "filter" : {
        "range": {
          "price": {
            "gte": 20.0,
            "lte": 50.0
…
}
}'
50
Use case: Product Search Engine
51
Product Search Engine
Just index all your products and be happy?
Search is not that easy
Synonyms, Suggestions, Faceting, De-compounding,
Custom scoring, Analytics, Price agents,
Query optimisation, beyond search
52
Neutrality? Really?
Is full-text search relevancy really your
preferred scoring algorithm?
Possible influential factors
Age of the product, ordered in the last 24h
In stock?
Special offer
Provision
No shipping costs
Rating (product, seller)
Returns
….
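One way to picture these factors is as multipliers on top of the text relevance score. The sketch below is purely illustrative: the field names and boost values are made up, and in Elasticsearch this kind of logic would typically live in a scoring query (for example `function_score`) rather than in application code.

```python
def business_score(text_score, product):
    """Combine text relevance with business signals (all weights are hypothetical)."""
    score = text_score
    if product.get("in_stock"):
        score *= 1.5   # prefer products that can ship now
    if product.get("special_offer"):
        score *= 1.2   # promote current offers
    if product.get("free_shipping"):
        score *= 1.1   # small bonus for no shipping costs
    score *= 1 + product.get("rating", 0) / 10  # rating of 0..5 adds up to 50%
    return score


# Product A matches the text better, product B is the better business result.
candidates = [
    ("A", 1.0, {"in_stock": False, "rating": 4}),
    ("B", 0.8, {"in_stock": True, "special_offer": True, "rating": 4}),
]

scored = [(name, business_score(text, product)) for name, text, product in candidates]
ranked = [name for name, _ in sorted(scored, key=lambda item: -item[1])]
```

Here B overtakes A despite the weaker text match, which is exactly the non-neutral ranking the slide is asking about.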
55
ecosystem
56
Ecosystem
• Plugins
• Clients for many languages
• Kibana
• Logstash
• Hadoop integration
• Marvel
58
spoiler alert!
59
what is data?
60
Whatever provides value for
your business.
61
Domain data vs. application data
Internal
• Domain data: orders, products
• Application data: log files, metrics
External
• Social media streams, email
62
63
Logstash
• Managing events and logs
• Collect data
• Parse data
• Enrich data
• Store data (search and visualising)
64
Why collect and centralise data?
• Access log files without system access
• Shell scripting: Too limited or slow
• Use unique IDs for errors and aggregate them across
your stack
• Reporting (everyone can create their own report)
• Bonus points: Unify your data to make it easily
searchable
65
Unify dates
• apache [19/Feb/2015:19:00:00 +0000]
• unix timestamp 1424372400
• log4j [2015-02-19 19:00:00,000]
• postfix.log Feb 19 19:00:00
• ISO 8601 2015-02-19T19:00:00+02:00
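Unifying these formats comes down to parsing each one and emitting a single representation. The sketch below uses only Python's standard library and is not Logstash configuration. The helper names are hypothetical, and it assumes UTC for the log4j and postfix samples (they carry no time zone) and a caller-supplied year for postfix (its format has none).

```python
from datetime import datetime, timezone


def parse_apache(text):
    # e.g. "[19/Feb/2015:19:00:00 +0000]"
    return datetime.strptime(text, "[%d/%b/%Y:%H:%M:%S %z]")


def parse_unix(text):
    # e.g. "1424372400" (seconds since the epoch, always UTC)
    return datetime.fromtimestamp(int(text), tz=timezone.utc)


def parse_log4j(text, tz=timezone.utc):
    # e.g. "[2015-02-19 19:00:00,000]" (no zone in the log line: tz is an assumption)
    return datetime.strptime(text, "[%Y-%m-%d %H:%M:%S,%f]").replace(tzinfo=tz)


def parse_postfix(text, year, tz=timezone.utc):
    # e.g. "Feb 19 19:00:00" (no year and no zone: both are assumptions)
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S").replace(tzinfo=tz)


def parse_iso8601(text):
    # e.g. "2015-02-19T19:00:00+02:00"
    return datetime.fromisoformat(text)


def unify(moment):
    """Normalise any timezone-aware datetime to ISO 8601 in UTC."""
    return moment.astimezone(timezone.utc).isoformat()


unified = {
    "apache": unify(parse_apache("[19/Feb/2015:19:00:00 +0000]")),
    "unix": unify(parse_unix("1424372400")),
    "log4j": unify(parse_log4j("[2015-02-19 19:00:00,000]")),
    "postfix": unify(parse_postfix("Feb 19 19:00:00", year=2015)),
    "iso8601": unify(parse_iso8601("2015-02-19T19:00:00+02:00")),
}
```

Under these assumptions the first four samples normalise to the same instant, while the ISO 8601 sample carries a `+02:00` offset and therefore lands two hours earlier in UTC.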
66
Logstash
• Managing events and logs
• Collect data (Input)
• Parse data (Filter)
• Enrich data (Filter)
• Store data for search and visualisation (Output)
67
kibana
68
Kibana
72
Thank You!
73
Feedback
☺ ! ☹
Sponsors of XXVIII DevClub.lv