This article consists of 4 Steps to Add search features to your application using Elasticsearch.
This is the full program:
part 1 : the concepts
part 2 : start small
part 3 : attaching indexation to events
part 4 : search (define a query grammar, parse query, build Elasticsearch query, search, build response)
Original Title
Add Search Features to Your Application - Try Elastic
Add search features to your application, try Elasticsearch
part 1 : the concepts
by louis gueye

Hi all,

The ultimate goal of a search engine is to provide fast, reliable, easy-to-use, scalable search features to a piece of software. But before diving into complex technical considerations, we should ask ourselves why we should bother with a search engine.

1 Why should my team bother with a new complex component?

Adding a new component requires new dev/integration/ops skills. The learning curve might be steep. The configuration in testing mode can really be a nightmare to set up. So why introduce such risk and complexity in a project?

One day or another, anyone familiar with databases has added some "contains" semantics to search features. You end up writing queries like this:

    select *
    from table0 t0
    left join table1 t1 on =
    left join table2 t2 on =
    left join table3 t3 on =
    where t0.title like '%term%'
    or t1.description like '%term%'
    and t3.created > '2011-01-12'
    and t3.status not in ('archived', 'suspended', 'canceled')
The above query will perform slower and slower as your amount of data grows, because the time-consuming parts of the query don't use the optimized path: indices. Moreover, as the requirements evolve, building your query will be more of a nightmare than a pleasure.

As a rule of thumb, whenever the time spent waiting for results in a complex search is not acceptable, you're left with 2 choices:
- optimize your request: make sure you use the most optimized path,
- use a search engine: it is highly optimized for reading and searching because it indexes almost everything (not true for a database, which puts the emphasis on relations and structure).

2 How does it work?

The principle of a full-text search engine is based on indexing documents: first index documents, then search in those documents.

A document is a succession of words stored in sections/paragraphs. An analogy with a database could be: a table for a document, a field for a section. Words are tokens.

Indexing is the process of analysing a document and storing the result of that analysis for later retrieval. Analysing is the process of extracting tokens from a field, counting occurrences (which are valuable for relevance) and associating them with a unique path in the document.

Not all tokens are relevant to search: some are so common that they are ignored. Indexers use analyzers that can ignore such tokens. Not all fields are analyzed either. For instance, a unique reference like an ISBN should not be analyzed. All these settings can be configured in a mapping.

You can search in the same type of document or in all types of documents. The latter use case, though less intuitive, can be a great time saver when it comes to building cross-cutting information like statistics.

Keep in mind: first write a document definition, set up your engine with that definition, index documents (tokenize, store), then search.

3 Which tool does the job?

So far so good, I understand the concepts but don't know which tool does the job.
Before choosing a tool, to avoid getting lost in a world we're not familiar with, let's write down requirements:
- the tool should integrate seamlessly with either Java or HTTP (because HTTP is a great interface),
- the tool should be easy to install: a Debian package would be awesome,
- the tool should be easy to configure: declarative settings would be much appreciated,
- the tool should provide comprehensive documentation that allows one to get familiar with the concepts first, then the practice,
- the tool should provide a comprehensive integration/acceptance test suite that will serve as a learning tool,
- the obvious ones: fast at runtime and the lowest possible memory footprint.

While you have Whoosh in Python and Zend Lucene in PHP, you have Solr, Elasticsearch and Hibernate Search in Java. These three all rely on Lucene, are written in Java, and 2 of them (Elasticsearch and Solr) offer an HTTP interface to index and search.

Lucene is a very advanced and mature project. The amount of work around it is huge. But Lucene mainly focuses on the very technical details of parsing and analysing text. It focuses on providing a fast searcher, a reliable indexer and low-level features like custom analysers, synonyms and all the plumbing/noise that keeps one from focusing on the business search requirements.

The other projects take advantage of that core and offer higher-level features around it like remoting (REST/HTTP), declarative configuration, scaling (clusters, etc.), etc.

I went for Elasticsearch because it offers in-memory nodes, which are valuable when testing in embedded mode. In addition, REST is the preferred way to interact with Elasticsearch. I really like the idea because I believe that HTTP is a hell of an interface. Moreover the REST API is really simpler than the .

I can't offer a comparison; I can only explain why I was attracted by Elasticsearch.

I think we're good for the concepts.

This article is the first in a series of 4: "Add search features to your application, try Elasticsearch".
This is the full program:
- part 1 : the concepts
- part 2 : start small
- part 3 : attaching indexation to events
- part 4 : search (define a query grammar, parse query, build Elasticsearch query, search, build response)
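The indexing principle described in section 2 of this part (analyse a field into tokens, ignore the very common ones, count occurrences, remember where each token lives) can be sketched in a few lines of plain Java. This is only a toy illustration under stated assumptions: the class name `TinyInvertedIndex`, the stop-word list and the sample adverts are all made up, and this is not how Lucene or Elasticsearch actually store their data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only: not Lucene's or Elasticsearch's real data structures.
class TinyInvertedIndex {

    // Tokens assumed to be too common to be worth indexing (made-up list).
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "in", "of", "is");

    // token -> (document id -> number of occurrences of the token in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Analyse: lowercase, split on anything that is not an ASCII letter or digit, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Index: for every token, record which document contains it and how often.
    void index(String documentId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(documentId, 1, Integer::sum);
        }
    }

    // Search: a single map lookup, the documents themselves are never scanned.
    Map<String, Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Map.of());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.index("advert-1", "Bike in excellent condition, the condition is like new");
        index.index("advert-2", "A sofa in good condition");
        System.out.println(index.search("condition")); // prints {advert-1=2, advert-2=1}
        System.out.println(index.search("the"));       // prints {} because "the" is a stop word
    }
}
```

The point of the sketch is the shape of the lookup: the work is done once at indexing time, so searching is a dictionary access instead of a scan with a `like '%term%'` predicate.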
Add search features to your application, try Elasticsearch part 2 : start small
Jan 21, by louis gueye

Hi,

This article is part of a series which aims to describe how one could integrate Elasticsearch. The previous post discussed the concepts: why use a search engine?

No two people learn the same way. I usually need to understand the theory and I need to start small. So I usually start with books. If there is no book, I look for a blog that explains the philosophy. Finally I look for pieces of code (official documentation, blogs, tutorials, github, etc.).

Starting small makes me confident and allows me to increase complexity gradually.

Depending on what you want to know about Elasticsearch, you should read different sections of the guide:
- SETUP section : describes how to install and run Elasticsearch (run as a service).
- API : this is the REST API, which seems more complete than the others. It describes how to inter-operate with nodes : search, index, check cluster status.
- Query DSL : the query API is quite rich. You get explanations about the syntax and semantics of queries and filters.
- Mapping : mapping configures Elasticsearch when indexing/searching on a particular type of document. Mapping is an important part which deserves special care.
- Modules : presents the technical architecture with low-level services like discovery or http.
- Index modules : low-level configuration on indices like sharding and logging.
- River : the river concept is the ability to feed your index from another datasource (pull data every X ms).
- Java and groovy API : if your software already runs in a JVM you can benefit from that and control Elasticsearch via this API.

To avoid getting lost in the documentation, let's focus on simple goals.
We'll implement them progressively:
- create node/client in test environment
- create node/client in non test environment
- integrate with Spring
- create/delete/exists an index on a node
- wait until node status is ok before operating on it
- create/update/delete/find data on an index
- create a mapping

1 Admin operations

Operations on indices are admin operations. You can find them in the API section under Indices.

* Create node/client in test environment

A node is a process (a member) belonging to a cluster (a group). A builder is responsible for joining, detaching and configuring the node. When creating a node, sensible defaults are already configured. I didn't dive into the discovery process and I won't. Creating a node will automatically create its encapsulating cluster. Creating a node is as simple as:

    // settings
    private Settings defaultSettings = ImmutableSettings.settingsBuilder()
        .put("", "test-cluster-" + NetworkUtils.getLocalAddress().getHostName())
        .build();
    // create node
    final Node node = NodeBuilder.nodeBuilder().local(true).data(true).settings(defaultSettings).build();
    // start node
    node.start();

The above code will create a node instance. The node doesn't use the transport layer (tcp). So no rmi, no http, no network services. Everything happens in the JVM.

To operate on a node you must acquire a client from your node. Every single operation depends on it:

    Client client = node.client();

An invaluable resource on how to set up nodes and clients in a test env is the AbstractNodesTests class.

* Create node in non test environment

In a non test env, just install Elasticsearch as described in the documentation SETUP section. This installation uses the transport layer (tcp).

There isn't an official Debian package, but Nicolas Huray and Damien Hardy contributed to the project and wrote one which will be integrated into the 0.19 branch. This branch moves from a gradle build system to maven. It will use the jdeb-maven-plugin to build the Debian package.
It will then be available for download on the Elasticsearch site.

Once installed, you should have an Elasticsearch instance up and running with a discovery service listening on port 54328, an HTTP service listening on 9200 and an inter-node communication port 9300.

The default cluster name is elasticsearch, but we do not use it, to make sure tests run in isolation.

For more on node and cluster configuration, feel free to read this page in the official documentation.

* Integrate with Spring

You can integrate with Spring by creating a FactoryBean which is responsible for the Node/Client construction. Don't forget to destroy them as they really are memory consuming (beware of PermGen space). This post, though a bit complex for my needs, was helpful. If interested in that specific part you can take a look at LocalNodeClientFactoryBean.

* Create an index on a node

Once your node is up you can create indices. The main property of an index is its name, which acts like an id. A node can't have 2 indices with the same name. The index name can't contain special chars like ., / etc. Keep it simple.

    Client client = node.client();

Then create an index named adverts, intended to store adverts:

    client.admin().indices().prepareCreate("adverts")
        .execute().actionGet();

Depending on your organization you can choose to create one index per software, one index per stored type, or whatever setting suits you. You just have to maintain the index names.

* Remove an index from a node

As soon as you have the name it is straightforward. You can test for existence before removing.
    if (client.admin().indices().prepareExists("adverts")
            .execute().actionGet().exists()) {
        client.admin().indices().prepareDelete("adverts")
            .execute().actionGet();
    }

* Wait for cluster health

    client.admin().cluster()
        .prepareHealth("adverts").setWaitForYellowStatus()
        .execute().actionGet();

2 Data operations

* Create / Update

    client.prepareIndex("adverts", "advert", "1286743")
        .setRefresh(true)
        .setSource(advertToJsonByteArrayConverter.convert(advert))
        .execute().actionGet();

The above code will index data (the source) whose type is advert under the adverts index. It will also commit (refresh) the index modifications.

The source can be of many types, ranging from a fieldName/value map to a byte array. The byte array is the preferred way, so I created converters from/to byte[]/Advert.

AdvertToJsonByteArrayConverter (relies on the Spring Converter interface):

    ...
    /**
     * @see org.springframework.core.convert.converter.Converter#convert(java.lang.Object)
     */
    @Override
    public byte[] convert(final Advert source) {
    ...

Updating means re-indexing, so it's the exact same operation.

* Delete data

When we are done with an object and don't want it to appear in search results, we might want to delete it from the index:

    client.prepareDelete("adverts", "advert", "4586321")
        .setRefresh(true).execute().actionGet();

That will delete, then refresh immediately after.

* Find data

The search API is very rich, so you have to understand search semantics. If you're familiar with Lucene then everything will seem obvious to you. If you're not, you'll have to get familiar with the basics.

There are 2 main types of search : exact match and full text.

The exact match operates on a field as a whole. The field is considered a single term (even if it contains spaces). It is not analyzed, so querying field=condition will return nothing if the field equals "excellent condition". Exact match suits certain fields very well (id, reference, date, status, etc.) but not all. Exact match fields can be sorted.

The full text search operates on tokens. The analyzer removes stop words, splits the field into tokens and groups them. The most relevant result is (roughly) the one that contains the highest number of term occurrences. You obviously can't apply a lexical sort on those fields. They are sorted by score.

Below, an exact match example (will match adverts with the provided id):

    private SearchResponse findById(final Long id) {
        return client.prepareSearch("adverts").setTypes("advert")
            .setQuery(QueryBuilders.boolQuery()
                .must(QueryBuilders.termQuery("_id", id))).execute().actionGet();
    }

Below, an example of full text search on a single field (will match adverts whose description field contains the term condition at least once):

    client.prepareSearch("adverts").setTypes("advert")
        .setQuery(QueryBuilders.boolQuery()
            .must(QueryBuilders.queryString("condition")
                .defaultField("description"))).execute().actionGet();

* Create a mapping

The searchable nature of a field is an important design decision and can be configured in the mapping.
It defines, for an indexed type, the indexed fields and, for each field, some interesting properties like its analyzed nature (analyzed|not_analyzed), its type (long, string, date), etc. Elasticsearch provides a default mapping: string fields are analyzed, other ones are not.

I really recommend spending some time on that section. One doesn't necessarily have to design the perfect mapping the first time (it requires some experience), but the decisions taken on that part will impact the search results.

Below, an example of mapping:

    {
        "advert" : {
            "properties" : {
                "id" : {
                    "type" : "long",
                    "index" : "not_analyzed"
                },
                "name" : {
                    "type" : "string"
                },
                "description" : {
                    "type" : "string"
                },
                "email" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                },
                "phoneNumber" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                },
                "reference" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                },
                "address" : {
                    "dynamic" : "true",
                    "properties" : {
                        "streetAddress" : {
                            "type" : "string"
                        },
                        "postalCode" : {
                            "type" : "string"
                        },
                        "city" : {
                            "type" : "string"
                        },
                        "countryCode" : {
                            "type" : "string"
                        }
                    }
                }
            }
        }
    }

I gathered those CRUD operations in 2 integration tests : ElasticSearchDataOperationsTestIT and ElasticSearchAdminOperationsTestIT.

Now that we're familiar with Elasticsearch basic operations we can move on and consider improving the code. You'll agree that handling the indexing task manually is an option, but not the most elegant and reliable one.

In the next post we'll discuss the different solutions to automatically index our data.
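To make the exact match / full text distinction from the "Find data" section concrete, here is a minimal sketch in plain Java. It only mimics the two matching semantics on an in-memory string; the class and method names are hypothetical, and real analyzers do much more than lowercasing and splitting on whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: mimics the two matching semantics, no Elasticsearch involved.
class MatchSemantics {

    // "not_analyzed" behaviour: the whole field value is one single term, compared as-is.
    static boolean exactMatch(String fieldValue, String queried) {
        return fieldValue.equals(queried);
    }

    // "analyzed" behaviour (simplified): the field is split into lowercase tokens
    // and the query matches as soon as one token equals the queried term.
    static boolean fullTextMatch(String fieldValue, String queried) {
        List<String> tokens = Arrays.asList(fieldValue.toLowerCase(Locale.ROOT).split("\\s+"));
        return tokens.contains(queried.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String description = "excellent condition";
        // The field as a whole is not equal to "condition": no hit.
        System.out.println(exactMatch(description, "condition"));    // prints false
        // One of the tokens is "condition": hit.
        System.out.println(fullTextMatch(description, "condition")); // prints true
        // A reference is typically matched as a whole.
        System.out.println(exactMatch("REF-TTTT111gg4", "REF-TTTT111gg4")); // prints true
    }
}
```

This is why fields such as email, phoneNumber and reference are declared not_analyzed in the mapping above, while name and description are left analyzed.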
Add search features to your application, try Elasticsearch part 3 : attaching indexation to events
Jan 22, by louis gueye

Now that we are able to index, we should think about when we should trigger indexing tasks.

A simple answer would be : whenever some indexed data has changed. Changed means a change in cardinality (add/remove) or a change in existing data.

Either we invoke indexing tasks whenever we code an action that changes data, or we use an event model which listens to precise events.

1 JPA event model

If you use JPA as a persistence mechanism you can take advantage of its elegant event mechanism. You can register an entity's listeners either at class level or at method level via annotations.

One can annotate an entity method as a listener to an event. The method will then be executed when the event defined by the annotation occurs.

If this solution seems too intrusive or too specific, one can externalize this behaviour in a class and annotate the entity.
    import javax.persistence.Entity;
    import javax.persistence.EntityListeners;
    import javax.persistence.Id;
    import javax.persistence.PostLoad;
    import javax.persistence.PostPersist;
    import javax.persistence.PostUpdate;
    import javax.persistence.Transient;

    @Entity
    @EntityListeners({EmployeeDebugListener.class, NameValidator.class})
    public class Employee {
        @Id private int id;
        private String name;
        @Transient private long syncTime;

        // called by the persistence provider after the entity is inserted, updated or loaded
        @PostPersist
        @PostUpdate
        @PostLoad
        private void resetSyncTime() {
            syncTime = System.currentTimeMillis();
            System.out.println("Employee.resetSyncTime called on employee id: " + getId());
        }

        public long getCachedAge() {
            return System.currentTimeMillis() - syncTime;
        }

        public int getId() {
            return id;
        }

        public void setId(int id) {
            this.id = id;
        }

        public String toString() {
            return "Employee id: " + getId();
        }
    }

That model, although very elegant, doesn't suit you if you use Spring, because the persistence provider can't use a bean instance. It creates its own listener instances, which totally goes against dependency injection.

This post is rather complete material about the solution based on JPA.

2 Hibernate event model

When using Hibernate, without JPA, with Spring, you can register instances, not only classes. Your application can listen to post-insert/post-update/post-delete events. This solution is my favorite one if your application writes little and reads much.

You can specify your listeners by setting the eventListeners property of the LocalSessionFactoryBean. It's a map which associates an event key with an array of listener instances.

    <bean name="sessionFactory" class="org.springframework.orm.hibernate3.LocalSessionFactoryBean">
        <property name="dataSource" ref="dataSource"/>
        <property name="mappingLocations" value="classpath:hibernate/mapping/*.xml"/>
        <property name="hibernateProperties">
            <props>
                <prop key="hibernate.dialect">${hibernate.dialect}</prop>
                <prop key="">${}</prop>
                <prop key="hibernate.show_sql">${hibernate.show_sql}</prop>
                <prop key="hibernate.connection.useUnicode">true</prop>
                <prop key="hibernate.connection.characterEncoding">UTF-8</prop>
            </props>
        </property>
        <property name="eventListeners">
            <map>
                <entry key="post-insert">
                    <ref bean="PostCommitInsertEventListener"/>
                </entry>
                <entry key="post-update">
                    <ref bean="PostCommitUpdateEventListener"/>
                </entry>
                <entry key="post-delete">
                    <ref bean="PostCommitDeleteEventListener"/>
                </entry>
            </map>
        </property>
    </bean>
    public class PostCommitDeleteEventListener implements PostDeleteEventListener {

        public static final String BEAN_ID = "PostCommitDeleteEventListener";

        @Autowired
        private SearchEngine searchEngine;

        @Override
        public void onPostDelete(PostDeleteEvent event) {
            if (event == null) return;
            Object eventEntity = event.getEntity();
            // only adverts are indexed, ignore every other entity type
            if (!(eventEntity instanceof Advert)) return;
            Advert advert = (Advert) eventEntity;
            Long id = advert.getId();
            searchEngine.removeFromIndex(id);
        }
    }
    ...

This is one of the least intrusive solutions. It also ensures that even if you add a new business method that updates the database state, changes will automatically be reflected in the index. No need to manually call index tasks.

3 Spring event model

When you're stuck with JPA you can use the Spring event model. You use ApplicationEventPublisher to publish CRUD events, then implement ApplicationListener to react to the events. Parameterized types (generics) ensure your code will react to one type only, which can be quite convenient: not reacting to Job events but reacting to Advert events.

That solution is not very resistant to change, because if you forget to trigger an event nothing will happen. It is equivalent to manually calling index tasks, but it is the only one available when using JPA.

Example of ApplicationEventPublisher call:

    ...
    @Autowired
    private ApplicationEventPublisher eventPublisher;

    /**
     * @see
     */
    @Override
    @Transactional(propagation = Propagation.REQUIRED)
    public void deleteAdvert(final Long advertId) {
        Preconditions.checkArgument(advertId != null,
            "Illegal call to deleteAdvert, advert identifier is required");
        this.baseDao.delete(Advert.class, advertId);
        this.eventPublisher.publishEvent(new PostDeleteAdvertEvent(new Advert(advertId)));
    }
    ...

Example of event listener:

    ...
    /**
     * @author
     */
    @Component
    public class PostDeleteAdvertEventListener implements ApplicationListener<PostDeleteAdvertEvent> {

        @Autowired
        private SearchEngine searchEngine;

        /**
         * @see org.springframework.context.ApplicationListener#onApplicationEvent(org.springframework.context.ApplicationEvent)
         */
        @Override
        public void onApplicationEvent(PostDeleteAdvertEvent event) {
            if (event == null) return;
            final Advert entity = event.getSource();
            if (entity == null || entity.getId() == null) return;
            this.searchEngine.removeFromIndex(Advert.class, entity.getId());
        }
    }
    ...

4 Elasticsearch river

An Elasticsearch River is a mechanism which pulls data from a datasource (couchdb, twitter, wikipedia, rabbitmq, rss) on a regular basis (every 500 ms for example) and updates the index based on what changed since the last refresh.

The idea is really nice but there are too few plugins yet. Elasticsearch provides only 4 river plugins, but contributions are more than welcome :).

So far we've got familiar with search engine concepts, then we had a first contact with Elasticsearch by writing CRUD tests. We just discussed several solutions to trigger indexing. Now that we know how to index data, we can finally focus on the search business, which is what the next post will try to present.

The source code hasn't moved, it is still on github. Feel free to explore it.
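The idea behind the Spring event model of section 3 (publish an event after a data change, let a listener typed on that event update the index) can be sketched without any framework. The following is a minimal stand-in, not Spring's implementation: `TypedEventBus`, `AdvertDeleted` and `JobDeleted` are hypothetical names, and a plain `Set` of ids plays the role of the search index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical stand-in for a publish/listen mechanism; this is not Spring code.
class TypedEventBus {

    // A listener together with the only event type it wants to receive.
    private static final class Registration<E> {
        final Class<E> type;
        final Consumer<E> listener;

        Registration(Class<E> type, Consumer<E> listener) {
            this.type = type;
            this.listener = listener;
        }

        void deliverIfMatching(Object event) {
            if (type.isInstance(event)) {
                listener.accept(type.cast(event));
            }
        }
    }

    private final List<Registration<?>> registrations = new ArrayList<>();

    <E> void subscribe(Class<E> type, Consumer<E> listener) {
        registrations.add(new Registration<>(type, listener));
    }

    // Every registration is offered the event; only matching types react.
    void publish(Object event) {
        for (Registration<?> registration : registrations) {
            registration.deliverIfMatching(event);
        }
    }

    // Made-up events for the example.
    static final class AdvertDeleted {
        final long advertId;
        AdvertDeleted(long advertId) { this.advertId = advertId; }
    }

    static final class JobDeleted {
        final long jobId;
        JobDeleted(long jobId) { this.jobId = jobId; }
    }

    public static void main(String[] args) {
        Set<Long> indexedAdvertIds = new HashSet<>(List.of(1L, 2L)); // stands in for the index
        TypedEventBus bus = new TypedEventBus();
        bus.subscribe(AdvertDeleted.class, event -> indexedAdvertIds.remove(event.advertId));

        bus.publish(new JobDeleted(1L));    // ignored: the listener only reacts to adverts
        bus.publish(new AdvertDeleted(2L)); // advert 2 leaves the "index"
        System.out.println(indexedAdvertIds); // prints [1]
    }
}
```

The sketch also shows the weakness mentioned above: if the code that deletes an advert forgets to call `publish`, nothing removes it from the index.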
Add search features to your application, try Elasticsearch part 4 : search
Jan 23, by louis gueye

Elasticsearch relies on the Lucene engine. The good news is that Lucene is really fast and powerful. Yet, it's not a good idea to expose such power to the user. Elasticsearch acts as a first filter but remains quite complete.

When you don't master an API, a good practice is to keep control over what you expose to the user. But this comes at a cost, you'll have to:
- implement a query language
- implement a language parser
- implement a query translator (translate into something understandable by Elasticsearch)
- run the search
- translate Elasticsearch results into a custom structure

The task seems daunting but no worry: we're going to take a look at each step.

1 The query language

Once you've delimited the scope, it's simpler. I imagined something like:

    http://domain/search/adverts?query=reference:REF-TTTT111gg4!description~condition legal!created lt 2009&from=2&itemsPerPage=10&sort=created+desc
    query := (Clause!)* ;
    Clause := (Field Operator Value)* | (Value)+ ;
    Field := ? fieldname without space ? ;
    Operator := (:|~|lt|gt|lte|gte) ;
    Value := ? anything form-url-encoded ? ;

- The query param is optional: if not specified, the default search should return all elements.
- The from param is optional: if not specified, the 1st page is assumed.
- The itemsPerPage param is optional: if not specified, a page will contain 10 results.
- The sort param is optional: if not specified, the result will be sorted by id desc.

Well, even so, it is not trivial. For the purpose of the poc I simplified my requirements:
- I did not use an EBNF-based parser generator like ANTLR : parsers deserve their own post.
- I did not implement all the operators.

Below, the piece of code used to split clauses:

    List extractSearchClauses(final String queryString) {
        if (StringUtils.isEmpty(queryString)) return null;
        final List clauses = Arrays.asList(queryString.split(CLAUSES_SEPARATOR));
        final Collection cleanClauses = Collections2.filter(clauses, new Predicate() {
    }

2 Translate to Elasticsearch language

First let's establish a few rules:
- an empty clauses list means returning all the elements (/adverts?),
- a single clause that contains no field and no operator means a full text search on all searchable fields (/adverts?query=condition+legal),
- multiple clauses mean a boolean AND query between clauses (/adverts?query=reference:REF-TTTT111gg4!description~condition legal).

Elasticsearch comes with a rich search API which encapsulates the query building in a collection of QueryBuilders.

Below, an example of search instructions:

    ...
    ((BoolQueryBuilder) queryBuilder).must(queryString(clause.getValue())
        .defaultField(clause.getField()));
    ...

3 Running search

The SearchRequestBuilder is an abstraction that encapsulates the well-known search domain, which specifies:
- pagination properties (items per page, page number),
- sort specification (fields, sort direction),
- query (built by chaining QueryBuilders).

Once you've configured your SearchRequestBuilder you can run the actual search:

    ...
    final SearchResponse searchResponse = searchRequestBuilder.execute().actionGet();
    ...

4 Transfer results to a custom structure

Ideally, we should return a search result that contains total hits and pagination results (previous, next, first, last). Those are the only pieces of information needed by the user.

But remember : the index stores a JSON byte array (not mandatory, but I chose it because I build RESTful services), not an object. We have to re-build our object from its JSON representation. Again, writing a Converter really helps.

I did not implement pagination as it's another entire concern : building a RESTful search response that respects HATEOAS principles. I'll blog on that later.

Example of Converter invocation:

    ...
    final SearchResult result = this.searchResponseToSearchResultConverter.convert(searchResponse);
    return result;
    ...

And below, the Converter source (I could have used a transform function):

    ...
    public SearchResult convert(final SearchResponse source) {
        final SearchResult result = new SearchResult();
        final SearchHits hits = source.getHits();
        result.setTotalHits(hits.getTotalHits());
        // rebuild each advert from the JSON byte array stored in the index
        for (final SearchHit searchHit : hits.getHits()) {
            final Advert advert = jsonByteArrayToAdvertConverter.convert(searchHit.source());
            result.addItem(advert);
        }
        return result;
    }
    ...

This post closes a series of 4 on a first contact with Elasticsearch. We discussed the concepts, but before designing anything we wanted to get familiar with our new tool. Once more comfortable with Elasticsearch, we started serious work: attaching indexing tasks to application events first, then building a simple search endpoint that uses Elasticsearch under the hood.

I can't say that I have totally adopted the tool because there still is a lot to validate:
- searches : I don't know all the specifics/semantics/differences between all the pre-defined QueryBuilders.
- facets : how do they work in Elasticsearch ? I always heard it is insanely fast with high volumes. I want to see it with my own eyes.
- JPA was disappointing (not an Elasticsearch problem) : maybe I could use CDI.
- I still have to figure out how to cleanly set up different client instantiation modes : memory and transport clients. Using Spring profiles is a solution but I'm not a big fan of profiles.
- I wish I could test a mysql river. I'd like to compare the river to the events mechanism.

I try not to be too enthusiastic, but I have to say it's a real pleasure once you're past the first pitfalls, mostly related to:
- node/client management: like jdbc connections, you're responsible for opening/closing your resources, otherwise you may have unexpected side effects,
- mapping design: analyzed and not_analyzed properties have a huge impact on your search,
- blindness: in-memory testing is for experienced users who already know the API.

I would suggest a real time-saver tool : Elasticsearch Head.
This tool helped us understand how data was organized/stored, what data was currently in the index, if it was correctly deleted, etc. The price to pay: it only works with the transport configuration, not in-memory.

Anyway I hope you enjoyed the reading. If so, feel free to share. If not, let me know why (I might have some inaccurate information), as long as we learn something.

The full source is on github.

Run the following command to launch the JBehave search stories:

    mvn clean verify -Psearch
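As a complement to section 1 of this part, here is one possible way to turn the query parameter into clauses, following the grammar given there (clauses separated by !, operators : ~ lt gt lte gte). It is a sketch under assumptions, not the code of the project: the `ClauseParser` and `Clause` names are hypothetical, the query string is assumed to be already URL-decoded, and unlike the extractSearchClauses snippet above it returns an empty list rather than null for an empty query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a clause parser for the query grammar of this post.
class ClauseParser {

    static final String CLAUSES_SEPARATOR = "!";

    // field immediately followed by ':' or '~', then the value
    private static final Pattern SYMBOL_CLAUSE = Pattern.compile("^([^\\s:~]+)(:|~)(.+)$");
    // field, whitespace, a word operator, whitespace, then the value
    private static final Pattern WORD_CLAUSE = Pattern.compile("^(\\S+)\\s+(lte|gte|lt|gt)\\s+(.+)$");

    static final class Clause {
        final String field;    // null for a value-only clause
        final String operator; // null for a value-only clause
        final String value;

        Clause(String field, String operator, String value) {
            this.field = field;
            this.operator = operator;
            this.value = value;
        }

        @Override
        public String toString() {
            return field == null ? "[" + value + "]" : "[" + field + " " + operator + " " + value + "]";
        }
    }

    static List<Clause> parse(String query) {
        List<Clause> clauses = new ArrayList<>();
        if (query == null || query.trim().isEmpty()) {
            return clauses; // empty list: the caller returns all elements
        }
        for (String raw : query.split(Pattern.quote(CLAUSES_SEPARATOR))) {
            String text = raw.trim();
            if (text.isEmpty()) {
                continue; // tolerate "a!!b" or a trailing separator
            }
            Matcher symbol = SYMBOL_CLAUSE.matcher(text);
            Matcher word = WORD_CLAUSE.matcher(text);
            if (symbol.matches()) {
                clauses.add(new Clause(symbol.group(1), symbol.group(2), symbol.group(3).trim()));
            } else if (word.matches()) {
                clauses.add(new Clause(word.group(1), word.group(2), word.group(3).trim()));
            } else {
                clauses.add(new Clause(null, null, text)); // full text on all searchable fields
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("reference:REF-TTTT111gg4!description~condition legal!created lt 2009"));
        System.out.println(parse("condition legal"));
    }
}
```

A known limitation of this regex-based approach: a free-text value that itself contains a ':' or '~' (for instance "10:30 meeting") would be read as a field clause, which is one more reason a real grammar-based parser deserves its own post.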