
Controlling Relevancy


In this article by Bharvi Dixit, author of the book Elasticsearch Essentials, we look at why getting a search engine to behave can be very hard. Whether you are a newbie or have years of experience with Elasticsearch or Solr, you have almost certainly struggled with low-quality search results in your application. The default Lucene scoring often does not come close to meeting your requirements, and delivering relevant search results is a constant struggle.

We will be covering the following topics:



  • Introducing relevant search
  • The Elasticsearch out-of-the-box tools
  • Controlling relevancy with custom scoring

Introducing relevant search


Relevancy is the root of a search engine's value proposition and can be defined as the art of ranking content for a user's search based on how much that content satisfies the needs of the user or the business.

In an application, it does not matter how beautiful your user interface looks or how many features you provide; search relevancy cannot be ignored. So, despite the seemingly mystical behavior of search engines, you have to find a way to return relevant results. Relevancy matters all the more because a user does not care about the entire set of documents you have: the user enters keywords, selects filters, and focuses on a very small slice of data, the relevant results. If your search engine fails to deliver according to expectations, the user may be annoyed, which can be a loss for your business.

A search engine like Elasticsearch comes with built-in intelligence. You enter a keyword and, within the blink of an eye, it returns the results that it thinks are relevant according to that intelligence. However, Elasticsearch does not have built-in intelligence specific to your application domain. Relevancy is not defined by a search engine; rather, it is defined by your users, their business needs, and the domain. Take Google or Twitter as examples: they have put in years of engineering effort, yet they still occasionally fail to provide relevant results.

Further, the challenges of search differ with the domain: search on an e-commerce platform is about driving sales and bringing positive customer outcomes, whereas in fields such as medicine it can be a matter of life and death. The lives of search engineers become more complicated because they often lack the domain-specific knowledge needed to understand the semantics of user queries.

However, despite all these challenges, the implementation of search relevancy is up to you, and it depends on what information you can extract from your users, their queries, and the content they see. We continuously take feedback from users, create funnels, and enable logging to capture users' search behavior so that we can improve our algorithms and provide relevant results.

The Elasticsearch out-of-the-box tools


Elasticsearch primarily works with two models of information retrieval: the Boolean model and the Vector Space model. In addition to these, there are other scoring algorithms available in Elasticsearch as well, such as Okapi BM25, Divergence from Randomness (DFR), and Information Based (IB). Working with these three alternative algorithms requires extensive mathematical knowledge and some extra configuration in Elasticsearch.

The Boolean model uses the AND, OR, and NOT conditions in a query to find all the matching documents. This Boolean model can be further combined with the Lucene scoring formula, TF/IDF, to rank documents.
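For instance, a bool query expressing "knows java AND NOT python" might look like the following sketch (the field name is purely illustrative at this point):

{
  "query": {
    "bool": {
      "must":     [ { "term": { "skills": "java" } } ],
      "must_not": [ { "term": { "skills": "python" } } ]
    }
  }
}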

The Vector Space model works differently from the Boolean model, as it represents both queries and documents as vectors. In the vector space model, each number in the vector is the weight of a term that is calculated using TF/IDF.

Queries and documents are compared using cosine similarity, in which the angle between the two vectors determines how similar they are, which ultimately determines the relevancy of the documents.
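In its simplest form, the cosine similarity between a query vector q and a document vector d is the dot product of the two vectors divided by the product of their lengths:

cosine(q, d) = (q . d) / (|q| x |d|)

The closer this value is to 1, the smaller the angle between the vectors and the more relevant the document is considered to be for the query.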

An example: why defaults are not enough


Let's build an index with sample documents to better understand the examples.

First, create an index with the name profiles:

curl -XPUT 'localhost:9200/profiles'


Then, put the mapping with the document type as candidate:

curl -XPUT 'localhost:9200/profiles/candidate/_mapping' -d '{
  "properties": {
    "geo_code": {
      "type": "geo_point",
      "lat_lon": true
    }
  }
}'


Please note that in the preceding mapping, we define the mapping only for the geo_code field. The rest of the fields will be mapped dynamically.

Now, you can create a data.json file with the following content in it:

{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 1 }}

{ "name" : "Sam", "geo_code" : "12.9545163,77.3500487", "total_experience":5, "skills":["java","python"] }

{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 2 }}

{ "name" : "Robert", "geo_code" : "28.6619678,77.225706", "total_experience":2, "skills":["java"] }

{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 3 }}

{ "name" : "Lavleen", "geo_code" : "28.6619678,77.225706", "total_experience":4, "skills":["java","Elasticsearch"] }

{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 4 }}

{ "name" : "Bharvi", "geo_code" : "28.6619678,77.225706", "total_experience":3, "skills":["java","lucene"] }

{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 5 }}

{ "name" : "Nips", "geo_code" : "12.9545163,77.3500487", "total_experience":7, "skills":["grails","python"] }

{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 6 }}

{ "name" : "Shikha", "geo_code" : "28.4250666,76.8493508", "total_experience":10, "skills":["c","java"] }

 If you are indexing skills that contain spaces or special characters, such as c++, c#, or core java, you need to map the skills field as not_analyzed in advance to get exact term matching, as shown in the sketch below.
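A minimal sketch of such a mapping (it must be applied before any documents are indexed, since the mapping of an existing field cannot be changed):

curl -XPUT 'localhost:9200/profiles/candidate/_mapping' -d '{
  "properties": {
    "skills": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}'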


Once the file is created, execute the following command to put the data inside the index we have just created:

curl -XPOST 'localhost:9200/_bulk' --data-binary @data.json


If you look carefully at the example, the documents contain data about candidates who might be looking for jobs. To hire a candidate, a recruiter can have the following criteria:

  • The candidate should know Java
  • The candidate should have between 3 and 5 years of experience
  • The candidate should be within a distance of 100 kilometers from the recruiter's office


You can construct a simple bool query combining a term query on the skills field with geo_distance and range filters on the geo_code and total_experience fields respectively; one possible form is sketched below. However, does this give a relevant set of results? The answer would be NO.
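A rough sketch of such a query (assuming Elasticsearch 2.x query syntax and the recruiter's office at coordinates taken from our sample data):

GET profiles/candidate/_search
{
  "query": {
    "bool": {
      "must": [
        { "term":  { "skills": "java" } },
        { "range": { "total_experience": { "gte": 3, "lte": 5 } } }
      ],
      "filter": {
        "geo_distance": {
          "distance": "100km",
          "geo_code": { "lat": 28.66, "lon": 77.22 }
        }
      }
    }
  }
}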

The problem is that if you restrict the range of experience and distance, you might even get zero results or miss a suitable candidate. For example, you can set a distance range of 0 to 100 kilometers, but your perfect candidate might be at a distance of 101 kilometers. At the same time, if you define a very wide range, you might get a huge number of non-relevant results.

The other problem is that if you search for candidates who know Java, there is a chance that a person who knows only Java and no other programming language will be at the top, while a person who knows other languages apart from Java will be at the bottom. This happens because, when ranking documents with TF/IDF, the lengths of the fields are taken into account: the shorter the field, the more relevant the document is considered to be.

Elasticsearch is not intelligent enough to understand the semantic meaning of your queries, but for these scenarios it gives you full power to redefine how scoring and document ranking are done.

Controlling relevancy with custom scoring


In most cases, the default scoring algorithms of Elasticsearch are good enough to return the most relevant results. However, some cases require more control over the calculation of the score. This is especially needed when implementing domain-specific logic, such as finding relevant candidates for a job, where you need to implement a very specific scoring formula. Elasticsearch provides the function_score query to take control of all these things.

Here we cover the code examples only in Java, because the Python client lets you pass the query JSON directly inside the body parameter of its search function. Python programmers can therefore use the example queries as they are; no extra module is required to execute them.

function_score query


The function_score query allows you to take complete control of how the score is calculated for a particular query.

Syntax of a function_score query:

{
  "query": {
    "function_score": {
      "query": {},
      "boost": "boost for the whole query",
      "functions": [
        {}
      ],
      "max_boost": number,
      "score_mode": "(multiply|max|...)",
      "boost_mode": "(multiply|replace|...)",
      "min_score": number
    }
  }
}


The function_score query has two parts: the first is the base query that finds the overall pool of results you want. The second part is the list of functions, which are used to adjust the scoring. These functions can be applied to each document that matches the main query in order to alter or completely replace the original query _score.

In a function_score query, each function is composed of an optional filter that tells Elasticsearch which records should have their scores adjusted (defaults to "all records") and a description of how to adjust the score.


The other parameters that can be used with a function_score query are as follows:

  • boost: An optional parameter that defines the boost for the entire query.
  • max_boost: The maximum boost that will be applied by a function score.
  • boost_mode: An optional parameter, which defaults to multiply. It defines how the combined result of the score functions will influence the final score together with the subquery score. This can be replace (only the function score is used, the query score is ignored), max (the maximum of the query score and the function score), min (the minimum of the query score and the function score), sum (the query score and the function score are added), avg, or multiply (the query score and the function score are multiplied).
  • score_mode: This parameter specifies how the results of individual score functions will be aggregated. The possible values can be first (the first function that has a matching filter is applied), avg, max, sum, min, and multiply.
  • min_score: The minimum score to be used.

Excluding Non-Relevant Documents with min_score

To exclude documents that do not meet a certain score threshold, the min_score parameter can be set to the desired score threshold.
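For example, the following sketch (it uses the field_value_factor function, which is covered later in this article) ranks Java candidates purely by their experience and drops anyone whose resulting score is below 3:

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": { "term": { "skills": "java" } },
      "functions": [
        { "field_value_factor": { "field": "total_experience" } }
      ],
      "boost_mode": "replace",
      "min_score": 3
    }
  }
}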


The following are the built-in functions that are available to be used with the function_score query:

  • weight
  • field_value_factor
  • script_score
  • The decay functions—linear, exp, and gauss


Let's see them one by one and then you will learn how to combine them in a single query.

weight


A weight function allows you to apply a simple boost to each document without the boost being normalized: a weight of 2 results in 2 * _score. For example:

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": {
        "term": {
          "skills": {
            "value": "java"
          }
        }
      },
      "functions": [
        {
          "filter": {
            "term": {
              "skills": "python"
            }
          },
          "weight": 2
        }
      ],
      "boost_mode": "replace"
    }
  }
}


The preceding query will match all the candidates who know Java, but will give a higher score to the candidates who also know Python. Please note that boost_mode is set to replace, so for the documents that match our filter clause, the _score calculated by the query is overridden by the weight function. In the query output, the candidates who know both Java and Python will be at the top with a _score of 2.

Java example


The previous query can be implemented in Java in the following way:

First, you need to import the following classes into your code:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.FunctionScoreQueryBuilder;
import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders;

Then the following code snippets can be used to implement the query:

FunctionScoreQueryBuilder functionQuery = new FunctionScoreQueryBuilder(QueryBuilders.termQuery("skills", "java"))
    .add(QueryBuilders.termQuery("skills", "python"), ScoreFunctionBuilders.weightFactorFunction(2))
    .boostMode("replace");

SearchResponse response = client.prepareSearch().setIndices(indexName)
    .setTypes(docType).setQuery(functionQuery)
    .execute().actionGet();

field_value_factor


It uses the value of a field in the document to alter the _score:

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": {
        "term": {
          "skills": {
            "value": "java"
          }
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "total_experience"
          }
        }
      ],
      "boost_mode": "multiply"
    }
  }
}


The preceding query finds all the candidates with java in their skills, but influences the total score depending on the total experience of the candidate. So, the more experience a candidate has, the higher the ranking they will get. Please note that boost_mode is set to multiply, which yields the following formula for the final score:

_score = _score * doc['total_experience'].value


However, there are two issues with the preceding approach: first, documents with a total_experience value of 0 will have their final score reset to 0. Second, the Lucene _score usually falls between 0 and 10, so a candidate with more than 10 years of experience will completely swamp the effect of the full-text search score.

To deal with these problems, apart from the field parameter, the field_value_factor function provides you with the following extra parameters (see the sketch after this list):

  • factor: This is an optional factor to multiply the field value with. This defaults to 1.
  • modifier: This is a mathematical modifier to apply to the field value. It can be none, log, log1p, log2p, ln, ln1p, ln2p, square, sqrt, or reciprocal. It defaults to none.
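As an illustration, the following sketch uses sqrt as the modifier and a factor of 1.2, so the boost grows as sqrt(1.2 * total_experience) instead of linearly, which keeps very experienced candidates from swamping the full-text score (the parameter values here are illustrative, not prescriptive):

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": { "term": { "skills": "java" } },
      "functions": [
        {
          "field_value_factor": {
            "field": "total_experience",
            "factor": 1.2,
            "modifier": "sqrt"
          }
        }
      ],
      "boost_mode": "multiply"
    }
  }
}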

Java example


The earlier field_value_factor query (without the extra parameters) can be implemented in Java in the following way:

First, you need to import the following classes into your code:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.*;

Then the following code snippets can be used to implement the query:

FunctionScoreQueryBuilder functionQuery = new FunctionScoreQueryBuilder(QueryBuilders.termQuery("skills", "java"))
    .add(new FieldValueFactorFunctionBuilder("total_experience"))
    .boostMode("multiply");

SearchResponse response = client.prepareSearch().setIndices("profiles")
    .setTypes("candidate").setQuery(functionQuery)
    .execute().actionGet();

script_score


script_score is the most powerful function available in Elasticsearch. It uses a custom script to take complete control of the scoring logic. You can write a custom script to implement whatever logic you need, from very simple to very complex. Scripts are also cached to allow faster execution of repeated queries. Let's see an example:

{
  "script_score": {
    "script": "doc['total_experience'].value"
  }
}


Note the special syntax used to access field values inside the script parameter. This is how field values are accessed using the Groovy scripting language.

Scripting is, by default, disabled in Elasticsearch, so to use script score functions, first you need to add this line in your elasticsearch.yml file: script.inline: on


To see some of the power of this function, look at the following example:

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": {
        "term": {
          "skills": {
            "value": "java"
          }
        }
      },
      "functions": [
        {
          "script_score": {
            "params": {
              "skill_array_provided": [
                "java",
                "python"
              ]
            },
            "script": "final_score=0; skill_array = doc['skills'].toArray(); counter=0; while(counter<skill_array.size()){for(skill in skill_array_provided){if(skill_array[counter]==skill){final_score = final_score+doc['total_experience'].value};};counter=counter+1;};return final_score"
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}


Let's understand the preceding query:

  • params is the placeholder where you can pass the parameters to your function, similar to how you use parameters inside a method signature in other languages. Inside the script parameter, you write your complete logic.
  • This script iterates through each document that has Java mentioned in the skills, and for each document, it fetches all the skills and stores them inside the skill_array variable. Finally, each skill that we have passed inside the params section is compared with the skills inside skill_array. If this matches, the value of the final_score variable is incremented with the value of the total_experience field of that document. The score calculated by the script score will be used to rank the documents because boost_mode is set to replace the original _score value.
  • Do not try to work with analyzed fields while writing scripts; you might get weird results. This is because, had our skills field contained a value such as "core java", you could not have got an exact match for it inside the script section. So, fields with space-separated values need to be mapped as not_analyzed (or indexed with the keyword analyzer) in advance.

To write these script functions, you need to have some command of Groovy scripting. However, if you find it complex, you can write your scripts in other languages, such as Python, using the language plugins of Elasticsearch. More on this can be found here: https://github.com/elastic/elasticsearch-lang-python

For fast performance, use Groovy or Java functions. Python and JavaScript code requires marshalling and unmarshalling of values, which hurts performance because of higher CPU and memory usage.

Java example


The previous query can be implemented in Java in the following way:

First, you need to import the following classes into your code:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.*;
import org.elasticsearch.script.Script;

Then, the following code snippets can be used to implement the query:

String script = "final_score=0; skill_array = doc['skills'].toArray(); "
        + "counter=0; while(counter<skill_array.size())"
        + "{for(skill in skill_array_provided)"
        + "{if(skill_array[counter]==skill)"
        + "{final_score = final_score+doc['total_experience'].value};};"
        + "counter=counter+1;};return final_score";

ArrayList<String> skills = new ArrayList<String>();
skills.add("java");
skills.add("python");

Map<String, Object> params = new HashMap<String, Object>();
params.put("skill_array_provided", skills);

FunctionScoreQueryBuilder functionQuery = new FunctionScoreQueryBuilder(QueryBuilders.termQuery("skills", "java"))
    .add(new ScriptScoreFunctionBuilder(new Script(script, ScriptType.INLINE, "groovy", params)))
    .boostMode("replace");

SearchResponse response = client.prepareSearch().setIndices(indexName)
    .setTypes(docType).setQuery(functionQuery)
    .execute().actionGet();


As you can see, the script logic is a simple string that is used to instantiate the Script class constructor inside ScriptScoreFunctionBuilder.

Decay functions - linear, exp, gauss


We have seen the problems of restricting the range of experience and distance, which could result in zero results or missing suitable candidates. Maybe a recruiter would like to hire a candidate from a different province because of a strong profile. So, instead of restricting results completely with range filters, we can incorporate sliding-scale values such as geo_location or dates into the _score to prefer documents near a latitude/longitude point or documents published recently.

The function_score query provides this sliding scale with the help of three decay functions: linear, exp (that is, exponential), and gauss (that is, Gaussian). All three functions take the same parameters, which are required to control the shape of the decay curve: origin, scale, decay, and offset.

The point of origin is used to calculate distance. For date fields, the default is the current timestamp. The scale parameter defines the distance from the origin at which the computed score will be equal to the decay parameter.

The origin and scale parameters can be thought of as your min and max, defining a bounding box within which the curve is drawn. If we wanted to give more of a boost to the documents published in the past 10 days, it would be best to define the origin as the current timestamp and the scale as 10d.

The offset specifies that the decay function will only be computed for documents whose distance is greater than the defined offset. The default is 0.

Finally, the decay option alters how severely the document is demoted based on its position. The default decay value is 0.5.
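Putting these parameters together, a gauss decay for the "published in the past 10 days" scenario might look like the following sketch (publish_date is a hypothetical date field used only for illustration; our profiles index does not contain one):

"functions": [
  {
    "gauss": {
      "publish_date": {
        "origin": "now",
        "scale": "10d",
        "offset": "2d",
        "decay": 0.5
      }
    }
  }
]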

All three decay functions work only on numeric, date, and geo-point fields.

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "exp": {
            "geo_code": {
              "origin": {
                "lat": 28.66,
                "lon": 77.22
              },
              "scale": "100km"
            }
          }
        }
      ],
      "boost_mode": "multiply"
    }
  }
}


In the preceding query, we have used the exponential decay function, which tells Elasticsearch to start decaying the score calculation after a distance of 100 km from the given origin. So, the candidates who are at a distance of more than 100 km from the given origin will be ranked lower, but not discarded. These candidates can still get a higher rank if we combine other function score queries, such as weight or field_value_factor, with the decay function and combine the results of all the functions together.

Java example


The preceding query can be implemented in Java in the following way:

First, you need to import the following classes into your code:

import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.*;

Then, the following code snippets can be used to implement the query:

Map<String, Object> origin = new HashMap<String, Object>();
String scale = "100km";
origin.put("lat", "28.66");
origin.put("lon", "77.22");

FunctionScoreQueryBuilder functionQuery = new FunctionScoreQueryBuilder()
    .add(new ExponentialDecayFunctionBuilder("geo_code", origin, scale)).boostMode("multiply");
//For the linear decay function, use the syntax below
//.add(new LinearDecayFunctionBuilder("geo_code", origin, scale)).boostMode("multiply");
//For the Gauss decay function, use the syntax below
//.add(new GaussDecayFunctionBuilder("geo_code", origin, scale)).boostMode("multiply");

SearchResponse response = client.prepareSearch().setIndices(indexName)
    .setTypes(docType).setQuery(functionQuery)
    .execute().actionGet();


In the preceding example, we have used the exp decay function, but the commented lines show how the other decay functions can be used.

Finally, remember that Elasticsearch lets you use multiple functions in a single function_score query to calculate a score that combines the results of each function, as sketched below.
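For instance, a single sketch that prefers Java candidates who also know Python, boosts by experience, and decays with distance from the recruiter's office might combine the functions we have seen like this (score_mode sums the individual function scores before they are multiplied into the query score; the values are illustrative):

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": { "term": { "skills": "java" } },
      "functions": [
        { "filter": { "term": { "skills": "python" } }, "weight": 2 },
        { "field_value_factor": { "field": "total_experience", "modifier": "sqrt" } },
        { "gauss": { "geo_code": { "origin": { "lat": 28.66, "lon": 77.22 }, "scale": "100km" } } }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}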

Summary


Overall, we covered one of the most important aspects of search engines: relevancy. We discussed the powerful scoring capabilities available in Elasticsearch and walked through practical examples showing how you can control the scoring process according to your needs. Despite the relevancy challenges faced while working with search engines, out-of-the-box features such as function scores and custom scoring allow us to tackle these challenges with ease.
