
Elasticsearch Query Optimization for improving search relevance - Part II

In Part I, we learned how to write a series of search queries for various scenarios, starting with simple ones and building up to more complex ones. In this article, the focus shifts to the explain API: how it helps us write efficient search queries and understand the internals of the Elasticsearch scoring algorithm.

By the end of this article, you will be able to:

  1. Understand the key metrics reported in the explain API output.

  2. Understand how scoring works in Elasticsearch search queries.

  3. Use the explain API while writing search queries, from simple to complex.

  4. Refine a search query based on the explain metrics.

Recap:

In Part I, we created a book catalogue that allowed users to search for relevant books based on certain criteria.


An example book record looks like this:


{
  "title": "Designing Hard Software",
  "isbn": "133046192",
  "pageCount": 350,
  "publishedDate": "1997-02-01T00:00:00.000-0800",
  "shortDescription": "\"This book is well written ... The author does not fear to be controversial. In doing so, he writes a coherent book.\" --Dr. Frank J. van der Linden, Phillips Research Laboratories",
  "longDescription": "Have you ever heard, \"I can't define a good design but I know one when I see it\"  Designing Hard Software discusses ways to develop software system designs that have the same tangibility and visibility as designs for hard objects like buildings or computer hardware. It emphasizes steps called \"essential tasks\" which result in software specifications that show how each requirement, including robustness and extensibility, will be satisfied. All software developers and managers seeking to develop \"hard\" software will benefit from these ideas.    There are six essential tasks necessary for a good design:    User (run-time) requirements  Development sponsor (build-time) requirements  Domain information  Behavior identification and allocation  Behavior description  Software system architecture  Designing Hard Software goes beyond the standard software development methodologies such as those by Booch, Rumbaugh, Yourdon, and others, by providing techniques for a complete system architecture as well as explicit measures of the goodness of design. So, \"you define a good design.\"",
  "status": "PUBLISH",
  "authors": [
    "Douglas W. Bennett"
  ],
  "categories": [
    "Object-Oriented Programming",
    "S"
  ]
}

Note: All records used for this example are available here


Books have been indexed using the below mapping:

PUT books
{
  "mappings": {
    "properties": {
      "categories": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "longDescription": {
        "type": "text"
      },
      "shortDescription": {
        "type": "text"
      }
    }
  }
}

Note: In the interest of focus, we have created mappings only for the fields for which search will be performed.


Overview of Elasticsearch scoring algorithm:

Elasticsearch originally used TF-IDF as its default similarity algorithm and switched to BM25 (Best Matching 25) as the default with the move to Lucene 6.

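Given a query Q with keywords q1, …, qn, the BM25 similarity score of a document D is defined as (not for the faint-hearted):

score(D, Q) = Σᵢ IDF(qᵢ) * f(qᵢ,D) * (k1 + 1) / (f(qᵢ,D) + k1 * (1 - b + b * fieldLen / avgFieldLen))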

Where:

  • f(qᵢ,D) : term frequency of qᵢ in document D

  • fieldLen : length of the field in document D, in terms

  • k1 : term frequency saturation parameter, with a default value of 1.2. It limits how much a single query term can affect the score of a given document

  • b : length normalization parameter, with a default value of 0.75. The higher the value of b, the more the length of the field relative to the average length affects the score

  • avgFieldLen : average length of the field across the documents in the collection

  • IDF : inverse document frequency, defined as (ln is the natural logarithm):

      IDF(qᵢ) = ln(1 + (docCount - f(qᵢ) + 0.5) / (f(qᵢ) + 0.5))

  • f(qᵢ) : number of documents containing qᵢ

  • docCount : total number of documents in the collection (more precisely, the documents that have a value for the field)

The tf (term frequency) component is calculated as:

freq / (freq + k1 * (1 - b + b * fieldLen / avgFieldLen))

For more information on the algorithm and more on BM25
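
To make the formula concrete, here is a minimal Python sketch of the idf and tf components. The function names (bm25_idf, bm25_tf, bm25_score) are illustrative and not part of any Elasticsearch client. The input numbers are taken from the longDescription explanation shown later in this article. Treating the 2.2 boost reported there as k1 + 1 is an assumption of this sketch that happens to match the default k1 of 1.2.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf = ln(1 + (N - n + 0.5) / (n + 0.5)).

    n: number of documents containing the term
    N: total number of documents with the field
    """
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, dl: float, avgdl: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_score(freq: float, doc_freqs: list, N: int,
               dl: float, avgdl: float,
               k1: float = 1.2, b: float = 0.75) -> float:
    """score = boost * idf * tf, with idf summed over the phrase terms."""
    boost = k1 + 1  # assumption: matches the 2.2 boost in the explain output
    idf = sum(bm25_idf(n, N) for n in doc_freqs)
    return boost * idf * bm25_tf(freq, dl, avgdl, k1, b)


# Values from the longDescription explanation: "open" appears in 41 docs,
# "source" in 56 docs, out of 235 docs that have the field.
score = bm25_score(freq=1.0, doc_freqs=[41, 56], N=235,
                   dl=136.0, avgdl=191.19148)
print(round(score, 4))
```

The result should agree with the 3.591908 reported by Elasticsearch to within floating-point rounding. Changing dl or k1 in the call is a quick way to see how length normalization and term saturation move the score.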


Let the search begin:

Continuing from the previous article, we will search for the phrase "Open Source" in the books index. The search covers three fields: categories, longDescription, and shortDescription.

GET books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Source",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories.keyword"
            ],
            "type": "phrase"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "shortDescription": {},
      "longDescription": {},
      "categories.keyword": {}
    }
  }
}

We get 36 matching documents with a max_score of 8.118633.


Sample Response:

{
  "took": 746,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 36,
      "relation": "eq"
    },
    "max_score": 8.118633,
    "hits": [
      {
        "_index": "books",
        "_type": "_doc",
        "_id": "TjEgUYABjdRWGf7Bqv0J",
        "_score": 8.118633,
        "_source": {
          "title": "Subversion in Action",
          "isbn": "1932394478",
          "pageCount": 356,
          "publishedDate": "2004-12-01T00:00:00.000-0800",
          "thumbnailUrl": "https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.book-thumb-images/machols.jpg",
          "shortDescription": "Learn all about this new open source version control application and why it is replacing CVS as the standard. Examples demonstrate how to customize features to deal with day-to-day problems.",
          "longDescription": "A new-generation version control tool, Subversion is replacing..",
          "status": "PUBLISH",
          "authors": [
            "Jeffrey Machols"
          ],
          "categories": [
            "Java"
          ]
        },
        "highlight": {
          "longDescription": [
            "A new-generation version control tool, Subversion is replacing the current <em>open</em> <em>source</em> standard, CVS."
          ],
          "shortDescription": [
            "Learn all about this new <em>open</em> <em>source</em> version control application and why it is replacing CVS as the standard"
          ]
        }
      }
     ...
    ]
  }
}
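
When comparing rankings, it can help to reduce each hit to its title, score, and the fields that produced a highlight. The following Python sketch does that for a response parsed into a dict. The summarize_hits helper and the trimmed sample_response are illustrative only; they assume the response shape shown above.

```python
def summarize_hits(response: dict) -> list:
    """Return (title, score, highlighted field names) for every hit."""
    summary = []
    for hit in response["hits"]["hits"]:
        # A hit only has a "highlight" key when at least one field matched.
        highlighted = sorted(hit.get("highlight", {}))
        summary.append((hit["_source"]["title"], hit["_score"], highlighted))
    return summary


# Trimmed copy of the sample response, keeping only the keys used here.
sample_response = {
    "hits": {
        "max_score": 8.118633,
        "hits": [
            {
                "_score": 8.118633,
                "_source": {"title": "Subversion in Action"},
                "highlight": {
                    "longDescription": ["..."],
                    "shortDescription": ["..."],
                },
            }
        ],
    }
}

for title, score, fields in summarize_hits(sample_response):
    print(title, score, fields)
```

Here the top hit matched the phrase in both longDescription and shortDescription, which is worth keeping in mind when reading how its score was computed.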

Let us dive deeper and try to understand how the results are ranked. When we add the "explain": true option to the search query, Elasticsearch returns detailed information on how the score of each hit was computed.

GET books/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Source",
            "fields": []
          }
        },
        ...
      ]
    }
  }
}
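
The _explanation object returned for each hit is a tree of nodes, each with a value, a description, and a list of details. Because the raw JSON is verbose, a small helper that flattens the tree into indented lines can make it easier to read. The sketch below is one way to do that in Python. The summarize_explanation name and the trimmed sample tree are illustrative, not part of any Elasticsearch client.

```python
def summarize_explanation(node: dict, depth: int = 0) -> list:
    """Flatten an _explanation tree into indented 'value  description' lines."""
    indent = "  " * depth
    lines = [f"{indent}{node.get('value')}  {node.get('description', '')}"]
    # Each entry in "details" is itself an explanation node.
    for child in node.get("details", []):
        lines.extend(summarize_explanation(child, depth + 1))
    return lines


# Trimmed sample: top-level node plus one child per matching field.
sample = {
    "value": 8.118633,
    "description": "max of:",
    "details": [
        {
            "value": 3.591908,
            "description": 'weight(longDescription:"open source" in 152)',
            "details": [],
        },
        {
            "value": 8.118633,
            "description": 'weight(shortDescription:"open source" in 152)',
            "details": [],
        },
    ],
}

for line in summarize_explanation(sample):
    print(line)
```

Read this way, the top-level "max of:" node shows that the final score of the hit is the highest of the per-field scores, here the one from shortDescription.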

Below is the explanation returned for the top search result. Don’t be alarmed by its size; we will break it down in the upcoming sections.

"_explanation": {
  "value": 8.118633,
  "description": "max of:",
  "details": [
    {
      "value": 3.591908,
      "description": """weight(longDescription:"open source" in 152) [PerFieldSimilarity], result of:""",
      "details": [
        {
          "value": 3.591908,
          "description": "score(freq=1.0), computed as boost * idf * tf from:",
          "details": [
            {
              "value": 2.2,
              "description": "boost",
              "details": []
            },
            {
              "value": 3.1677296,
              "description": "idf, sum of:",
              "details": [
                {
                  "value": 1.7381384,
                  "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details": [
                    {
                      "value": 41,
                      "description": "n, number of documents containing term",
                      "details": []
                    },
                    {
                      "value": 235,
                      "description": "N, total number of documents with field",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 1.4295912,
                  "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details": [
                    {
                      "value": 56,
                      "description": "n, number of documents containing term",
                      "details": []
                    },
                    {
                      "value": 235,
                      "description": "N, total number of documents with field",
                      "details": []
                    }
                  ]
                }
              ]
            },
            {
              "value": 0.51541185,
              "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "phraseFreq=1.0",
                  "details": []
                },
                {
                  "value": 1.2,
                  "description": "k1, term saturation parameter",
                  "details": []
                },
                {
                  "value": 0.75,
                  "description": "b, length normalization parameter",
                  "details": []
                },
                {
                  "value": 136.0,
                  "description": "dl, length of field (approximate)",
                  "details": []
                },
                {
                  "value": 191.19148,
                  "description": "avgdl, average length of field",
                  "details": []
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "value": 8.118633,
      "description": """weight(shortDescription:"open source" in 152) [PerFieldSimilarity], result of:""",
      "details": [
        {
          "value": 8.118633,
          "description": "score(freq=1.0), computed as boost * idf * tf from:",
          "details": [
            {
              "value": 2.2,
              "description": "boost",
              "details": []
            },
            {
              "value": 6.77204,
              "description": "idf, sum of:",
              "details": [
                {
                  "scoring": "omitted for brevity"
                }
              ]
            },
            {
              "value": 0.54493,
              "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details": [
                {
                  "scoring": "omitted