Channel: Imifos' Lucubratory » Java

The ElasticSearch Java API


When you start diving into the ElasticSearch Java API, you will undoubtedly notice that it is something of a mystery. There is the official documentation, and then there is everything that is missing from its pages.

ElasticSearch is an amazing product, but using it from Java requires patience: you will be searching through blogs, Stack Overflow posts and even the source code itself. Diving into the source code is actually an excellent way of getting to understand the Java API. The source code (implementations and tests) is a great place to find examples and use cases. However, to quote Simon Brown on this: “The code doesn’t tell the whole story”.

In this article, I’ll start with some code snippets, a kind of “How to” collection. Most of this information comes from somewhere on the internet (thank you fellow bloggers and SO problem solvers :) or from the source code.

Let’s start with the Maven pom file that I used to compile these examples. The “exec-maven-plugin” lets you start the application via Maven (if you want to do this).

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>pro.carl</groupId>
    <artifactId>estest</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>estest</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch</artifactId>
            <version>1.3.2</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.1.3</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <configuration>
                    <mainClass>pro.carl.estest.App</mainClass>
                </configuration>
            </plugin>
        </plugins>
    </build>

</project>

The first thing to do in a Java ElasticSearch client application is to create a “client” (API access) object. There are two types of “client” implementations: the Node Client and the Transport Client. This is well described in the official documentation, so I’ll not discuss this here.

ElasticSearch uses the Builder pattern for everything. To build the object required to establish a connection to the ElasticSearch cluster, we use the “NodeBuilder”. Again, the source files are the source for a great deal of information.

Node node = null;
Client client = null;
try {

    // Instantiates an ElasticSearch cluster node in the
    // current VM. The behavior of this node is defined by the
    // settings we specify. "client(true)" indicates that we
    // are going to be a pure client node, which means it will
    // hold no index data; other optimizations are applied
    // by different modules.
    // The node tries to find a master node to connect to and
    // has a default timeout of 30 seconds.
    node = NodeBuilder.nodeBuilder().client(true).node();

    // This is actually the object that allows us to execute
    // commands against our local node and by this the entire
    // cluster.
    client = node.client();

    // Check that the cluster status is GREEN and do your work here...
 }
 finally {
    if (node != null) node.close();
 }

Normally, the node configuration is loaded from the “elasticsearch.yml” file in your classpath, but everything can be set up programmatically as well.

Setting the cluster name:

node = NodeBuilder.nodeBuilder().
                   client(true).
                   clusterName("mycluster").
                   node();

If you have a look into the source code, you will see that some of these methods are just adding values to the internal settings properties file – no magic here. Consequently, we can configure our cluster in the same way.

Settings settings = ImmutableSettings.settingsBuilder().
                           put("cluster.name", "myclustername").
                           put("node.data", false).
                           put("node.name", "mynodename").
                           build();

node = NodeBuilder.nodeBuilder().
                           settings(settings).
                           node();
client = node.client();
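
To illustrate the “no magic” point, here is a minimal JDK-only sketch of a builder whose convenience methods do nothing more than write entries into a settings map. The class and its methods are hypothetical stand-ins that merely mimic the shape of “NodeBuilder”; the key “node.client” is my assumption of what sits behind “client(true)”, so check the real implementation in the ElasticSearch sources.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for NodeBuilder: every fluent method only
// records a key/value pair in a settings map.
class SettingsBuilderSketch {

    private final Map<String, String> settings = new LinkedHashMap<String, String>();

    static SettingsBuilderSketch builder() {
        return new SettingsBuilderSketch();
    }

    // Convenience method, equivalent to put("cluster.name", name).
    SettingsBuilderSketch clusterName(String name) {
        return put("cluster.name", name);
    }

    // Convenience method; "node.client" is an assumed key name.
    SettingsBuilderSketch client(boolean client) {
        return put("node.client", Boolean.toString(client));
    }

    // Generic access to any setting.
    SettingsBuilderSketch put(String key, String value) {
        settings.put(key, value);
        return this;
    }

    // Returns a read-only copy of everything collected so far.
    Map<String, String> build() {
        return Collections.unmodifiableMap(new LinkedHashMap<String, String>(settings));
    }

    public static void main(String[] args) {
        System.out.println(builder().client(true).clusterName("mycluster").build());
    }
}
```

Calling the convenience methods or calling “put” with the same keys yields the same map, which is why both configuration styles are interchangeable.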

If you are curious to see what you are getting back, hidden behind the “Node” interface, again let’s look into the source code. Note all the code around the thread pool handling…

Following this, you need to check the cluster status before doing the real work. The cluster status has 3 possible values: RED, YELLOW and GREEN. Consult the documentation to get more information.

ClusterHealthResponse hr=null;
try {
    hr=client.admin().cluster().
                 prepareHealth().
                 setWaitForGreenStatus().
                 setTimeout(TimeValue.timeValueMillis(250)).
                 execute().
                 actionGet();
}
catch(MasterNotDiscoveredException e) {
    // No cluster status since we don't have a cluster
}

if (hr!=null) {
    System.out.println("Data nodes found:"
                   +hr.getNumberOfDataNodes());
    System.out.println("Timeout? :"
                   +hr.isTimedOut());
    System.out.println("Status:"
                   +hr.getStatus().name());
}

One additional comment on the cluster status: if you request one replica at index creation but only have a single data node running, your status will never reach GREEN; it will remain YELLOW because the requested replica cannot be allocated. This typically happens during development, where a single data node is running in a corner, and can be ignored there. In a production environment, on the other hand, it indicates that one of the nodes is not working as expected and no other data node was able to take over!
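
The replica arithmetic behind this can be sketched in a few lines of plain Java. This is a deliberately simplified model of my own (the class and method names are hypothetical, and it ignores relocation, allocation rules and everything else the real cluster does). It only encodes the fact that a replica is never placed on the node that holds its primary.

```java
class ClusterHealthSketch {

    enum Status { RED, YELLOW, GREEN }

    // Simplified model: a replica is never placed on the node that holds
    // its primary, so N replicas need at least N + 1 data nodes.
    static Status expectedStatus(int dataNodes, int numberOfReplicas) {
        if (dataNodes < 1) {
            return Status.RED;     // not even the primary shards can be assigned
        }
        if (dataNodes < numberOfReplicas + 1) {
            return Status.YELLOW;  // primaries assigned, some replicas are not
        }
        return Status.GREEN;       // every shard copy has a node of its own
    }

    public static void main(String[] args) {
        System.out.println("1 node, 1 replica:  " + expectedStatus(1, 1));
        System.out.println("2 nodes, 1 replica: " + expectedStatus(2, 1));
    }
}
```

On a single development node, setting “number_of_replicas” to 0 is the usual way to let the status reach GREEN.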

Related source code: ClusterHealthResponse, and the documentation on Cluster Health.

An important note concerning “actionGet()”: the name of this method is not related to a potential GET REST call to the cluster. In fact, the “execute()” method returns a “ListenableActionFuture” object, a handle on the operation, which is executed asynchronously in a separate thread. To wait for the result and turn the operation into a blocking call, the “actionGet()” method must be invoked on this Future.
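
If the execute-then-wait split is unfamiliar, the following JDK-only analogy may help. It does not use the ElasticSearch API at all: the “execute” and “actionGet” methods below are hypothetical helpers built on “java.util.concurrent.Future”, and they only imitate the behaviour described above (start the work on another thread, then block until the result is there).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BlockingGetSketch {

    // Starts the work on another thread and returns immediately,
    // comparable to what "execute()" does.
    static Future<String> execute(ExecutorService pool) {
        return pool.submit(new Callable<String>() {
            @Override
            public String call() throws Exception {
                Thread.sleep(100); // simulate a round trip to the cluster
                return "GREEN";
            }
        });
    }

    // Blocks until the result is available and converts the checked
    // exceptions of Future.get() into an unchecked one.
    static <T> T actionGet(Future<T> future) {
        try {
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("operation failed", e.getCause());
        }
    }

    static String demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = execute(pool); // returns at once
            return actionGet(future);              // waits for the result
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("Status: " + demo());
    }
}
```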

Before we continue with handling indices, types and mappings, you should get acquainted with these concepts, if you have not done so already: Basic ElasticSearch Concepts.

The next step would be creating an index:

CreateIndexRequestBuilder cirb = client.admin().
                                    indices().
                                    prepareCreate("myindexname");

HashMap<String,Object> settings = new HashMap<>();

// The number of shards is an important choice, and it is
// well covered in the official documentation.
// The default value is 5, but we keep all data in 1 single shard
settings.put("number_of_shards", 1);

// Number of replicas (copies) of each shard kept within the cluster
settings.put("number_of_replicas", 1);

cirb.setSettings(settings);

CreateIndexResponse createIndexResponse=null;
try {
    createIndexResponse = cirb.execute().actionGet();
}
catch(IndexAlreadyExistsException e) {
    // Index already exists
    return;
}

if (createIndexResponse!=null && 
     createIndexResponse.isAcknowledged()) {
    // Index created
    return;
}
else {
    // Index creation failed
    return;
}

Once the index is created, we add a mapping associated with an index type. ElasticSearch can work in a schema-less way, but it’s generally a good practice to use a well defined schema.

ElasticSearch helps with this by allowing you to set the “dynamic” schema mode to “strict”. As a consequence, ElasticSearch throws an exception when you try to populate the index with data that does not respect the schema, i.e. the mapping. Strict mode is off by default, which means that ElasticSearch adds fields to the mapping on the fly when they appear in the data set. See the official documentation for more info on the dynamic mapping settings.
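
The difference between the two modes can be modelled in plain Java. This is only a toy of my own making (hypothetical class, a mapping reduced to a field-name-to-type map), not how ElasticSearch implements it, but it shows the behaviour you should expect.

```java
import java.util.HashMap;
import java.util.Map;

class DynamicMappingSketch {

    // Field name -> type, e.g. "country_code" -> "string".
    private final Map<String, String> mapping = new HashMap<String, String>();
    private final boolean strict;

    DynamicMappingSketch(boolean strict) {
        this.strict = strict;
    }

    // Declares a field up front, as an explicit mapping would.
    void define(String field, String type) {
        mapping.put(field, type);
    }

    // Accepts one document field. Unknown fields are rejected in strict
    // mode and silently added to the mapping otherwise.
    void index(String field, String guessedType) {
        if (mapping.containsKey(field)) {
            return;
        }
        if (strict) {
            throw new IllegalArgumentException(
                    "field [" + field + "] is not part of the mapping");
        }
        mapping.put(field, guessedType);
    }

    boolean hasField(String field) {
        return mapping.containsKey(field);
    }
}
```

With “new DynamicMappingSketch(false)” an unknown field simply appears in the mapping; with “new DynamicMappingSketch(true)” the same call fails and the mapping stays untouched.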

Creating a “new” mapping over an existing one merges the two, and an error is thrown when incompatible types are detected. To avoid trouble, it’s good practice to check whether a mapping exists before creating a new (different) one. After merging mappings, the index will likely have to be rebuilt to populate the new fields or to update the existing ones in case the analyser rules have changed.
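
As a rough mental model of that merge, assume a mapping is just a map from field name to type. The sketch below (hypothetical names, plain JDK, certainly much simpler than the real merge logic) adds new fields and refuses to change the type of an existing one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MappingMergeSketch {

    // Merges "update" into a copy of "existing" (field name -> type).
    // New fields are added; a field that changes its type is a conflict.
    static Map<String, String> merge(Map<String, String> existing,
                                     Map<String, String> update) {
        Map<String, String> merged = new LinkedHashMap<String, String>(existing);
        for (Map.Entry<String, String> e : update.entrySet()) {
            String current = merged.get(e.getKey());
            if (current != null && !current.equals(e.getValue())) {
                throw new IllegalStateException("field [" + e.getKey()
                        + "] of type [" + current
                        + "] cannot be changed to [" + e.getValue() + "]");
            }
            merged.put(e.getKey(), e.getValue());
        }
        return merged;
    }
}
```

Adding a “names” field of type “string” to a mapping that only knows “id” succeeds; trying to turn “id” from “long” into “string” throws.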

There are multiple possibilities to create a type/mapping. We can use a classic JSON file or the ElasticSearch builder.

XContentBuilder builder = XContentFactory.jsonBuilder().
          startObject().
             startObject("mytypename"). // must match the type given to setType()
                field("dynamic", "strict").
                startObject("_id").
                     field("path", "id").
                endObject().
                startObject("_all").
                     field("enabled", "true").
                endObject().
                startObject("properties").
                     startObject("id").
                         field("type", "long").
                         field("store", "yes").
                         field("index", "not_analyzed").
                     endObject().
                     startObject("country_code").
                         field("type", "string").
                         field("store", "yes").
                         field("index", "not_analyzed").
                     endObject().
                     startObject("names").
                         field("type", "string"). // (*)
                         field("store", "yes").
                         field("index", "analyzed").
                     endObject().
                     startObject("postal_codes").
                         field("type", "string").
                         field("store", "yes"). 
                         field("index", "analyzed").
                     endObject().
                endObject().
            endObject().
        endObject();

PutMappingResponse response=client.admin().
                           indices().
                           preparePutMapping("myindexname").
                           setType("mytypename").
                           setSource(builder).
                           execute().
                           actionGet();

if (response.isAcknowledged()) {
    // Type and Mapping created!
}
else {
    // Failed to create type and mapping
}

(*) A good thing to know about the “type” in mappings is that ElasticSearch treats them as arrays of the specified type. This means that the field “names” of type “string” can actually contain ["name1", "name2", "name3"] – or just “name”.

Note the “field(“dynamic”, “strict”)” part of the mapping (see above).

It’s also possible to have more complex structures with nested objects. In this example, the names are nested inside the postal codes.

XContentBuilder builder = XContentFactory.jsonBuilder().
    startObject().
       startObject("mytypename"). // must match the type given to setType()
          field("dynamic", "strict").
          startObject("_id").
             field("path", "id").
          endObject().
          startObject("_all").
             field("enabled", "true").
          endObject().
          startObject("properties").
              startObject("id").
                 field("type", "long").
                 field("store", "yes").
                 field("index", "not_analyzed").
              endObject().
              startObject("country_code").
                 field("type", "string").
                 field("store", "yes").
                 field("index", "not_analyzed").
              endObject().
              startObject("postal_codes").
                 field("type", "nested").
                 startObject("properties").
                     startObject("code").
                         field("type", "string").
                         field("store", "yes").
                         field("index", "analyzed").
                     endObject().
                     startObject("names").
                         field("type", "nested").
                         startObject("properties").
                             startObject("name").
                                 field("type", "string").
                                 field("store", "yes").
                                 field("index", "analyzed").
                             endObject().
                         endObject().
                     endObject().
                 endObject().
              endObject().
         endObject().
    endObject();

I would suggest applying the KISS rule in this case ;) The “XContentBuilder” also has a very handy “toString()” method.

You can of course use plain old JSON as input. This comes in very handy when the schema is read from application settings, which lets you play with different analysers, for example, without redeploying the entire application. Re-indexing, however, is always required.

String mappingString =
    "{\"dynamic\":\"strict\"," +
    "\"_id\":{\"path\":\"id\"}," +
    "\"properties\":{" +
        "\"country_code\":{\"type\":\"string\",\"index\":\"not_analyzed\",\"store\":true}," +
        "\"id\":{\"type\":\"long\",\"store\":true}," +
        "\"names\":{\"type\":\"string\",\"store\":true}," +
        "\"postal_codes\":{\"type\":\"string\",\"store\":true}" +
    "}}";

PutMappingResponse response=client.admin().
                              indices().
                              preparePutMapping("myindexname").
                              setType("mytypename").
                              setSource(mappingString).
                              execute().
                              actionGet();

Now, let’s see different operations that can be executed on the index:

Read the mapping of an index and type:

IndexMetaData imd = null;
try {
    ClusterState cs = client.admin().
                         cluster().
                         prepareState().
                         setIndices("myindexname").
                         execute().
                         actionGet().
                         getState();
   
    imd = cs.getMetaData().index("myindexname");
}
catch (IndexMissingException e) {
    // If there is no index, there is no mapping either
}

// imd stays null when the index is missing, so guard against it
MappingMetaData mdd = (imd == null) ? null : imd.mapping(type);

if (mdd == null) {
    // No mapping found
}
else {
    System.out.println("Mapping as JSON string:" + mdd.source());
}

This verifies if a type exists on an index. It does however NOT verify if the type has a mapping defined.

client.admin().indices().
               prepareTypesExists("myindexname").
               setTypes("mytypename").
               execute().
               actionGet().
               isExists();

The next command verifies if an index exists. It does not verify if there is data in the index.

client.admin().indices().
               prepareExists("myindexname").
               execute().
               actionGet().
               isExists();

Now, we verify if a mapping exists on an index and type:

IndexMetaData imd = null;
try {
    ClusterState cs = client.admin().
                             cluster().
                             prepareState().
                             setIndices("myindexname").
                             execute().
                             actionGet().
                             getState();

    imd = cs.getMetaData().index("myindexname");
}
catch (IndexMissingException e) {
   // If there is no index, there is no mapping either
   return false;
}

// imd stays null when the index is missing, so guard against it
MappingMetaData mdd = (imd == null) ? null : imd.mapping(type);

if (mdd != null)
    return true;

return false;

Deleting a mapping is simple:

client.admin().indices().
               prepareDeleteMapping("myindexname").
               setType(type).
               execute().
               actionGet();

And deleting an index too:

DeleteIndexResponse rep = null;
try {
    rep = client.admin().
                 indices().
                 prepareDelete("myindexname").
                 execute().
                 actionGet();
}
catch (IndexMissingException e) {
    // Index not found
    return;
}

if (rep.isAcknowledged()) {
    // Index deleted
}
else {
    // Failed to delete index
}

An alias is a logical name that can be assigned to an index and then used instead of the index name to execute commands against that index. Aliases are very helpful when you need to seamlessly rebuild an index.

To add an alias, do…

try {
    client.admin().
           indices().
           prepareAliases().
           addAlias("myindexname", "myaliasname").
           execute().
           actionGet();
}
catch(IndexMissingException e) {
    // Index not found
}

Switching an alias over from one index to another index, in one single atomic operation, can be done via…

client.admin().indices().
               prepareAliases().
               addAlias("mynewindex", "myalias").
               removeAlias("myoldindex", "myalias").
               execute().
               actionGet();
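
Why the single atomic operation matters for a seamless rebuild can be shown with a small JDK-only model. Everything here is hypothetical and simplified (a real alias can point to several indices, and this has nothing to do with how the cluster implements it): readers resolve the alias on every query, and the switch happens in one step, so they see either the old or the new index, never nothing.

```java
class AliasSwitchSketch {

    // The index name the alias currently points to.
    private volatile String target;

    AliasSwitchSketch(String initialIndex) {
        this.target = initialIndex;
    }

    // What a client effectively does when it queries through the alias.
    String resolve() {
        return target;
    }

    // Moves the alias from oldIndex to newIndex in one step. Returns
    // false (and changes nothing) if the alias points somewhere else.
    synchronized boolean switchOver(String oldIndex, String newIndex) {
        if (!target.equals(oldIndex)) {
            return false;
        }
        target = newIndex;
        return true;
    }

    public static void main(String[] args) {
        AliasSwitchSketch alias = new AliasSwitchSketch("myoldindex");
        // ... rebuild the data into "mynewindex" in the background ...
        alias.switchOver("myoldindex", "mynewindex");
        System.out.println("Alias now resolves to " + alias.resolve());
    }
}
```

Had the removal and the addition been two separate calls, a query arriving in between would have found no index behind the alias.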

And to obtain the index for a given alias, do:

ImmutableOpenMap<String, AliasMetaData> iom=
               client.admin().
                      cluster().
                      state(new ClusterStateRequest()).
                      actionGet().
                      getState().
                      getMetaData().
                      aliases().
                      get("myalias");

if (iom==null) return; // alias not found

Iterator<ObjectObjectCursor<String, AliasMetaData>> 
                                      i=iom.iterator();

while(i.hasNext()) {
    ObjectObjectCursor<String, AliasMetaData> ooc=i.next();
    System.out.println("Index="+ooc.key+"/Alias="+
                          ooc.value.getAlias());
}

There are many ways to fill an index and the basic API for bulk filling is explained in the official documentation. I’ll cover more complex cases in another article.

When it comes to querying ElasticSearch indices, we face an entirely new science :) The basic commands are explained in the official documentation, but the ways to obtain what you want from ElasticSearch are legion.

Covering the possibilities of querying indices goes far beyond the scope of this article. After all, ElasticSearch provides an access point to the underlying (hard working) Lucene framework, a full text indexing and searching system that is now in its 15th year of existence.

Another aspect that I wanted to point out is that reading the source code of an open-source project, even one as complex and big as ElasticSearch, is nothing magic. It not only helps you use the product, it also gives deep insights that the documentation often does not provide. Moreover, and that’s a very important note, reading and understanding the source code brings you one step closer to contributing to the project!

Et zou…

