Elasticsearch Query Optimization for improving search relevance - Part II

In Part I, we learnt how to draft a series of search queries for various scenarios starting with a simpler one leading to much more complex ones. In this article, our focus will shift toward the explain API and how it can help us in writing efficient search queries and also in understanding the internals of the Elasticsearch scoring algorithm.

By the end of this article, one will be able to

  1. Understand the key metrics referred to in the explain API output.

  2. Understand how scoring works in ES search queries

  3. Use explain API for writing simpler to complex search queries.

  4. Refine search query based on the explain metrics.

Recap:

In Part I, we created a book catalogue that allowed users to search for relevant books based on certain criteria.


An example of book records looks like the following:


{
  "title": "Designing Hard Software",
  "isbn": "133046192",
  "pageCount": 350,
  "publishedDate": "1997-02-01T00:00:00.000-0800",
  "shortDescription": "\"This book is well written ... The author does not fear to be controversial. In doing so, he writes a coherent book.\" --Dr. Frank J. van der Linden, Phillips Research Laboratories",
  "longDescription": "Have you ever heard, \"I can't define a good design but I know one when I see it\"  Designing Hard Software discusses ways to develop software system designs that have the same tangibility and visibility as designs for hard objects like buildings or computer hardware. It emphasizes steps called \"essential tasks\" which result in software specifications that show how each requirement, including robustness and extensibility, will be satisfied. All software developers and managers seeking to develop \"hard\" software will benefit from these ideas.    There are six essential tasks necessary for a good design:    User (run-time) requirements  Development sponsor (build-time) requirements  Domain information  Behavior identification and allocation  Behavior description  Software system architecture  Designing Hard Software goes beyond the standard software development methodologies such as those by Booch, Rumbaugh, Yourdon, and others, by providing techniques for a complete system architecture as well as explicit measures of the goodness of design. So, \"you define a good design.\"",
  "status": "PUBLISH",
  "authors": [
    "Douglas W. Bennett"
  ],
  "categories": [
    "Object-Oriented Programming",
    "S"
  ]
}

Note: All records used for this example are available here


Books have been indexed using the below mapping:

PUT books
{
  "mappings": {
    "properties": {
      "categories": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "longDescription": {
        "type": "text"
      },
      "shortDescription": {
        "type": "text"
      }
    }
  }
}

Note: In the interest of focus, we have created mappings only for the fields for which search will be performed.


Overview of Elasticsearch scoring algorithm:

Elasticsearch used the TF-IDF as their default similarity algorithm and have shifted to BM25 (Best Matching) ever since the introduction of Lucene 6.

Given a query Q with keywords, q1…..qn, the BM25 similarity score is defined as (not for the faint hearted) :


Where:

  • f(qᵢ,D) : term frequency of qᵢ in document D

  • |fieldLen| : length of document in words

  • k1 : helps determine term frequency saturation parameter and has a default value of 1.2. It limits how much a single query term can affect the score of a given document

  • b : length normalization parameter. Higher the value of b, the effects of the document length compared to average length is amplified. Defaults to 0.75

  • avgFieldLen : average document length in the text collection

  • IDF : inverse document frequency defined as :


  • f(qᵢ) : number of documents containing qᵢ

  • docCount : total number of documents in the collection

Tf (term frequency) is calculated by combining

freq / (freq + k1 * (1 — b + b * fieldLen / avgFieldLen))

For more information on the algorithm and more on BM25


Let’s the search begin:

In continuation of the previous article, we will be finding the word Open Source in the books index. The search will be focusing on the three fields: categories, longDescription, shortDescription.

GET books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Source",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories.keyword"
            ],
            "type": "phrase"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "shortDescription": {},
      "longDescription": {},
      "categories.keyword": {}
    }
  }
}

We get 36 matched documents with the max_score of 8.118633.


Sample Response:

{
  "took": 746,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 36,
      "relation": "eq"
    },
    "max_score": 8.118633,
    "hits": [
      {
        "_index": "books",
        "_type": "_doc",
        "_id": "TjEgUYABjdRWGf7Bqv0J",
        "_score": 8.118633,
        "_source": {
          "title": "Subversion in Action",
          "isbn": "1932394478",
          "pageCount": 356,
          "publishedDate": "2004-12-01T00:00:00.000-0800",
          "thumbnailUrl": "https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.book-thumb-images/machols.jpg",
          "shortDescription": "Learn all about this new open source version control application and why it is replacing CVS as the standard. Examples demonstrate how to customize features to deal with day-to-day problems.",
          "longDescription": "A new-generation version control tool, Subversion is replacing..",
          "status": "PUBLISH",
          "authors": [
            "Jeffrey Machols"
          ],
          "categories": [
            "Java"
          ]
        },
        "highlight": {
          "longDescription": [
            "A new-generation version control tool, Subversion is replacing the current <em>open</em> <em>source</em> standard, CVS."
          ],
          "shortDescription": [
            "Learn all about this new <em>open</em> <em>source</em> version control application and why it is replacing CVS as the standard"
          ]
        }
      }
     ...
    ]
  }
}

Let us deep dive further and try to understand the ranking of the results. By adding the “explain”: true option to the search query, Elasticsearch will provide detailed information on the scoring computed for the search result.

GET books/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Source",
            "fields": []
          }
        },
        ...
      ]
    }
  }
}

Below is the output for the search result with the explanation. Don’t be alarmed by the response because we will be breaking it down for better understanding in the upcoming sections.

"_explanation": {
  "value": 8.118633,
  "description": "max of:",
  "details": [
    {
      "value": 3.591908,
      "description": """weight(longDescription:"open source" in 152) [PerFieldSimilarity], result of:""",
      "details": [
        {
          "value": 3.591908,
          "description": "score(freq=1.0), computed as boost * idf * tf from:",
          "details": [
            {
              "value": 2.2,
              "description": "boost",
              "details": []
            },
            {
              "value": 3.1677296,
              "description": "idf, sum of:",
              "details": [
                {
                  "value": 1.7381384,
                  "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details": [
                    {
                      "value": 41,
                      "description": "n, number of documents containing term",
                      "details": []
                    },
                    {
                      "value": 235,
                      "description": "N, total number of documents with field",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 1.4295912,
                  "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details": [
                    {
                      "value": 56,
                      "description": "n, number of documents containing term",
                      "details": []
                    },
                    {
                      "value": 235,
                      "description": "N, total number of documents with field",
                      "details": []
                    }
                  ]
                }
              ]
            },
            {
              "value": 0.51541185,
              "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "phraseFreq=1.0",
                  "details": []
                },
                {
                  "value": 1.2,
                  "description": "k1, term saturation parameter",
                  "details": []
                },
                {
                  "value": 0.75,
                  "description": "b, length normalization parameter",
                  "details": []
                },
                {
                  "value": 136.0,
                  "description": "dl, length of field (approximate)",
                  "details": []
                },
                {
                  "value": 191.19148,
                  "description": "avgdl, average length of field",
                  "details": []
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "value": 8.118633,
      "description": """weight(shortDescription:"open source" in 152) [PerFieldSimilarity], result of:""",
      "details": [
        {
          "value": 8.118633,
          "description": "score(freq=1.0), computed as boost * idf * tf from:",
          "details": [
            {
              "value": 2.2,
              "description": "boost",
              "details": []
            },
            {
              "value": 6.77204,
              "description": "idf, sum of:",
              "details": [
                {
                  "scoring": "omitted for brevity"
                }
              ]
            },
            {
              "value": 0.54493,
              "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details": [
                {
                  "scoring": "omitted for brevity"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}


Understanding the Scoring from Search Results:

In this section, we will try to understand how the scoring actually works under the hood.

To begin with, consider the first document with isbn as 1932394478 from the above search result. If we notice in the explanation, the word Open Source has been found in the fields longDescription and shortDescription

As we know,

   tf = freq / (freq + k₁ * (1 — b + b * fieldLen / avgFieldLen))

In our case, the values for doc with isbn : 1932394478, the PerFieldSimilarity of the word phrase Open Source in longDescription is as follows:

1. freq = 1 since Open Source occurs only once in this document.

2. k₁ = 1.2; default value

3. b = 0.75; default value

4. dl : 32; since there are six words in the longDescription field in this document

5. avgdl : 53.82192 ; this is the average length of the longDescription field in shard [books][0]

Plugging the values into the formula; we arrive at tf = 0.51541185

With tf, calculated next step is to calculate the idf (Inverse Document Frequency) using the formulae:

              idf = log(1 + (N — n + 0.5) / (n + 0.5))

For the word Open, N = 235 (in shard [books][0]) and n = 41; we arrive at idf = 1.7381384

For the word Source, N = 235 (in shard [books][0]) and n = 56; we arrive at idf = 1.4295912

Summing the scores gives the idf as 3.1677296

The default boost score is set as 2.2 and comes from the value (k1 + 1) as specified in the BM25 formula.

So, the final score in our case for this search phrase in longDescription is:

    tf * idf * (k1+1) = 0.51541185 * 3.1677296 * 2.2 = 3.591908

Similarly, the scores are calculated for each and every field given in the search param and the scores for shortDescription is:

       tf * idf * (k1+1) = 0.54493 * 6.77204 * 2.2 = 8.118633

Since we are using the type phrase the result takes the max from every field and returns the highest scored document.


Boosting Our Search:

Sometimes we would want to give more preference to certain fields over the others. For such scenarios, we need to make use of the boosting functionality and try to influence the search result.

Now let us apply boosting to the categories field and see how it impacts the search results.

GET books/_search
{
 "query": {
   "bool": {
     "should": [
       {
         "multi_match": {
           "query": "Open Source",
           "fields": [
             "shortDescription",
             "longDescription",
             "categories.keyword^50"
           ],
           "type": "phrase"
         }
       }
     ]
   }
 },
 "highlight": {
   "fields": {
     "shortDescription": {},
     "longDescription": {},
     "categories.keyword": {}
   }
 }
}

Sample Response

{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 36,
      "relation": "eq"
    },
    "max_score": 250.57788,
    "hits": [
      {
        "_index": "books",
        "_type": "_doc",
        "_id": "tjEgUYABjdRWGf7BqvwI",
        "_score": 250.57788,
        "_source": {
          "title": "Unlocking Android",
          "isbn": "1933988673",
          "pageCount": 416,
          "publishedDate": "2009-04-01T00:00:00.000-0700",          
      ...
    ]
  }
} 

If you notice the final score would be score * additional boost = 250.57788 for the top most document. The result also indicates that the top most document from the previous search query is different as the boosting has impacted the search results.

This above scenario is likely to be used where users might want to give more weightage to selective fields for better search results.


Extending it further

So far we have been returning search results based on matching phrases but what if we want to return partial matching results as well.

GET books/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Source",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories.keyword^50"
            ],
            "type": "phrase"
          }
        },
        {
          "multi_match": {
            "query": "Open Source",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories^40"
            ],
            "type": "most_fields"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "shortDescription": {},
      "longDescription": {},
      "categories": {},
      "categories.keyword": {}
    }
  }
}

Sample Response

{
  "took": 27,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 61,
      "relation": "eq"
    },
    "max_score": 530.05334,
    "hits": [
      {
        "_shard": "[books][0]",
        "_node": "PiILupiCSs21etzGXXfN0g",
        "_index": "books",
        "_type": "_doc",
        "_id": "ETEgUYABjdRWGf7Bqv0J",
        "_score": 530.05334,
        "_source": {
          "title": "Lucene in Action, Second Edition",
          "isbn": "1933988177",
          ...
    ]
  }
}

The max_score for the topmost document would be 530.05334 for the document with isbn 1933988177.

The results would vary depending on the type selected and it is important to understand every type before starting to write complex search queries.

For example, if we had chosen the type as best_fields instead of most_fields the document with isbn 1933988673 would have returned.

This is because best_fields makes use of the max of the search results whereas most_fields makes use of the max of the search results.

For more information on different support types:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#multi-match-types

Summary:

To summarise, writing search queries is no mean feat but if we can understand how the scoring works we would be able to get our desired results.

The ideal rule of thumb when it comes to writing search queries are:

Step 1 — Identify the Requirement: The initial phase is the requirement phase where we identify what the expected search results would be.

Step 2 — Planning: Once we have established the requirement, the next phase is planning. In this phase, we try to identify the fields for which search query can or will be performed.

Step 3 — Boosting: Sometimes certain fields need to be given more preference over others. We can make use of the boost function to perform the same. Choosing boost value needs to be done carefully. Sometimes over-boosting can lead to undesired results.

Step 4 — Writing the Query: In this phase, we would be writing the actual query taking into consideration taken from Step1 to Step 3.

Step 5 — Explain API: The most important phase is where we will make use of the explain API to understand the result and in what way we can tweak or make necessary changes.

Step 6 — Repeat: This step is required if you do not get the required result. We go back to Step2 and follow the same process again till we arrive at our expected result.

So, for better understanding, you can up pick a sample dataset and follow the above rule of thumb to check if you can meet the required desired ranking and search results.

Hope the two-part series has helped make you understand how to write search queries along with the Elasticsearch internal scoring algorithm.

To reiterate, please do consider the following while developing your search queries:

  1. Keywords field types are not intended for partial matches, they can hurt the search results. It is recommended to use keyword field types only when a single word is to be searched so that an exact word match can be performed.

  2. Keywords fields are supposed to be unique word limits and overutilization of keyword fields can lead to improper search results.

  3. It is important to only boost fields when there is an absolute necessity. Over-boosting of fields can lead to undesired results.

  4. The use of field types in match queries should be based on the need and requirement.

  5. Sometimes it is better to filter first and then perform a search operation.

  6. Function score can be used for decaying the score and relevance of older documents

References:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html