Elasticsearch, as the name suggests, is primarily used for its fast search capabilities. But have you ever wondered how scoring works in the background and why certain documents score higher than others? In this two-part series, we first look at how to write search queries, from simple to complex, and then explain the calculations that happen under the hood using the explain API.
Search Results Ranking Basics:
Elasticsearch used TF-IDF as its default similarity algorithm and switched to BM25 (Best Matching 25) with the move to Lucene 6.
Put simply, the score assigned to a search result is a function of the criteria below:
Term Frequency (TF): Number of times the search term appears in the field being searched.
Field length: Length of the field containing the search term vs the average length of that field across all documents in the index. A match in a shorter field scores higher.
Inverse Document Frequency (IDF): Number of documents that contain a value for the search field vs number of documents that contain the search term in that field. In simpler terms, how rare the search term is across the documents. The rarer the search term, the higher the score.
Elasticsearch sums the scores computed for each search term.
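To make the three criteria concrete, here is a minimal Python sketch of the classic BM25 formula with a Lucene-style IDF. It is an illustration only, not Elasticsearch's actual implementation: the function names are made up for this example, the exact constants differ between Lucene versions, and the numbers in the usage lines are invented. The defaults k1 = 1.2 and b = 0.75 are the Elasticsearch defaults.

```python
import math


def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency: the rarer the term, the higher the value."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_term_score(freq, field_len, avg_field_len, total_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Score contribution of a single search term for a single document."""
    idf = bm25_idf(total_docs, docs_with_term)
    # Fields longer than average are penalised, shorter ones are rewarded.
    length_norm = 1 - b + b * field_len / avg_field_len
    # Term frequency saturates: each extra occurrence adds less than the previous one.
    tf = freq * (k1 + 1) / (freq + k1 * length_norm)
    return idf * tf


def bm25_score(term_stats, field_len, avg_field_len, total_docs):
    """Sum of the per-term scores; term_stats is a list of (freq, docs_with_term)."""
    return sum(
        bm25_term_score(freq, field_len, avg_field_len, total_docs, docs_with_term)
        for freq, docs_with_term in term_stats
    )


# Invented numbers: a two-term query against an index of 1000 documents.
# The first term is common (500 docs), the second is rare (10 docs).
print(bm25_score([(1, 500), (2, 10)], field_len=80, avg_field_len=100, total_docs=1000))
```

With these assumptions, a rare term contributes more than a common one, and the same match in a shorter field scores higher than in a longer field.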
Learning with a scenario:
Let us assume we have a books catalogue (books index) where the category, a short description and a long description are primarily used for finding relevant books. We start by writing a simple search query to find the relevant books in the catalogue. A document in the index looks like this:
{
  "title": "Unlocking Android",
  "isbn": "1933988673",
  "pageCount": 416,
  "categories": [
    "Open Source",
    "Mobile"
  ],
  "shortDescription": "Unlocking Android: A Developer's Guide provides concise, hands-on instruction for the Android operating system and development tools. This book teaches important architectural concepts in a straightforward writing style and builds on this with practical and useful examples throughout.",
  "longDescription": "Android is an open source mobile phone platform based on the Linux operating system and developed by the Open Handset Alliance, a consortium of over 30 hardware, software and telecom companies that focus on open standards for mobile devices. Led by search giant, Google, Android is designed to deliver a better and more open and cost effective mobile experience. Unlocking Android: A Developer's Guide provides concise, hands-on instruction for the Android operating system and development tools. This book teaches important architectural concepts in a straightforward writing style and builds on this with practical and useful examples throughout. Based on his mobile development experience and his deep knowledge of the arcane Android technical documentation, the author conveys the know-how you need to develop practical applications that build upon or replace any of Androids features, however small. Unlocking Android: A Developer's Guide prepares the reader to embrace the platform in easy-to-understand language and builds on this foundation with re-usable Java code examples. It is ideal for corporate and hobbyists alike who have an interest, or a mandate, to deliver software functionality for cell phones. WHAT'S INSIDE: * Android's place in the market * Using the Eclipse environment for Android development * The Intents - how and why they are used * Application classes: o Activity o Service o IntentReceiver * User interface design * Using the ContentProvider to manage data * Persisting data with the SQLite database * Networking examples * Telephony applications * Notification methods * OpenGL, animation & multimedia * Sample Applications ",
  "status": "PUBLISH",
  "authors": [
    "W. Frank Ableson",
    "Charlie Collins",
    "Robi Sen"
  ]
}
Note: A step-by-step guide to reproducing the queries is provided at the end of the article.
Let’s begin our search journey:
Let us start by trying to find the phrase “Open Search” in the books index.
Scenario 1: Simple search:
Search for the word “Open Search” in the field named title.
GET books/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Open Search"
      }
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}
The search returns 100+ hits for “Open Search”. The results might seem relevant at first glance, but none of them really is, since none of them talks about Lucene (a well-known open source search engine).
In reality, the search phrase can also appear in the short description and long description fields, and the query did not consider them at all. In the upcoming sections, we will take a look at how different fields influence the search results.
Scenario 2: Adding more fields to the query:
Extending Scenario 1, let us now search for the same phrase “Open Search” in multiple fields: categories, longDescription and shortDescription.
GET books/_search
{
  "explain": false,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Search",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories.keyword"
            ],
            "type": "phrase"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "shortDescription": {},
      "longDescription": {},
      "categories.keyword": {}
    }
  }
}
You will notice a considerable change in the results: a total of 36 documents match, with a max_score of 8.118633. Surprisingly, the document that was on top in Scenario 1 is now placed somewhere in the middle.
Note: The type “phrase” runs a match_phrase query on each field and uses the score of the best-matching field as the score of the document.
Scenario 3: Boosting our Search:
In certain use cases, we may want to influence the results by giving more preference to certain fields. To achieve this, we make use of the boost functionality. In simple terms, boosting gives more weight to selected fields for better search results.
GET books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Search",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories.keyword^50"
            ],
            "type": "phrase"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "shortDescription": {},
      "longDescription": {},
      "categories.keyword": {}
    }
  }
}
Note: A field is boosted by appending the caret symbol (^) followed by the boost value to the field name, for example categories.keyword^50.
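When boosts are tuned often, it can help to generate the field list instead of editing strings by hand. The following Python sketch builds the multi_match clause from a dictionary of boosts. The helper name multi_match_with_boosts is hypothetical and only produces the request body; it does not talk to Elasticsearch.

```python
import json


def multi_match_with_boosts(query, field_boosts, match_type="phrase"):
    """Build a multi_match clause; a boost of 1 means the field is left unboosted."""
    fields = [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in field_boosts.items()
    ]
    return {"multi_match": {"query": query, "fields": fields, "type": match_type}}


# Same fields and boost as in Scenario 3.
clause = multi_match_with_boosts(
    "Open Search",
    {"shortDescription": 1, "longDescription": 1, "categories.keyword": 50},
)
print(json.dumps(clause, indent=2))
```

The printed clause can be pasted into the should array of the bool query shown above.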
Scenario 4: Extending it further:
Now let us say we also want to return partial matches for the word “Open Search” in addition to the exact match.
GET books/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Search",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories.keyword^50"
            ],
            "type": "phrase"
          }
        },
        {
          "multi_match": {
            "query": "Open Search",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories^40"
            ],
            "type": "most_fields"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "shortDescription": {},
      "longDescription": {},
      "categories": {},
      "categories.keyword": {}
    }
  }
}
The topmost document is now the one with ISBN 1933988177 (Lucene in Action, Second Edition), with a max_score of 530.05334.
The results vary depending on the type selected, so it is important to understand each type before writing complex search queries. For example, if we had chosen “best_fields” instead of “most_fields”, the document with ISBN 1933988673 would have been returned on top. This is because “best_fields” uses the score of the single best-matching field, whereas “most_fields” adds up the scores of all matching fields. (See the multi_match documentation for the other supported types.)
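The difference between the two types can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the real scoring code: the per-field scores are invented, and the function names are hypothetical. The tie_breaker parameter mirrors the multi_match option of the same name, which adds a fraction of the other fields' scores to the best one.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Score of the best-matching field, plus tie_breaker times the other fields."""
    scores = list(field_scores.values())
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)


def most_fields_score(field_scores):
    """Scores of all matching fields added together."""
    return sum(field_scores.values())


# Invented per-field scores for two documents.
doc_a = {"shortDescription": 5.0}
doc_b = {"shortDescription": 3.0, "longDescription": 3.0}

# best_fields prefers doc_a (one strong field), most_fields prefers doc_b
# (two weaker fields that add up).
print(best_fields_score(doc_a), best_fields_score(doc_b))
print(most_fields_score(doc_a), most_fields_score(doc_b))
```

The same pair of documents ranks in a different order depending on the type, which is the effect described above.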
Summary:
To summarize, the deeper we go into tuning Elasticsearch queries and boosting fields, the more we notice the importance of each parameter. It is important to consider the following while developing your search queries:
Keyword fields are not intended for partial matches, and using them that way can hurt the search results. Use keyword fields only when the value is to be matched exactly as a whole.
Keyword fields are meant to hold a limited set of distinct values, and overusing them can lead to poor search results.
Boost fields only when it is absolutely necessary. Over-boosting fields can lead to undesired results.
The type used in multi_match queries should be chosen based on the requirement.
Sometimes it is better to filter first and then perform the search operation.
A function score can be used to decay the score and relevance of older documents.
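As an illustration of the "filter first" point, the sketch below builds a bool query whose filter clauses narrow down the candidate documents without contributing to the score, while the multi_match clause in must does the scoring. The helper name filtered_search is hypothetical, and status.keyword is an assumption: it presumes the status field from the sample document was dynamically mapped with a keyword sub-field.

```python
import json


def filtered_search(query, fields, filters):
    """Build a bool query: exact-value filters (not scored) plus a scored multi_match."""
    return {
        "query": {
            "bool": {
                # Filter clauses only include or exclude documents.
                "filter": [{"term": {field: value}} for field, value in filters.items()],
                # The must clause decides the relevance score.
                "must": [{"multi_match": {"query": query, "fields": fields}}],
            }
        }
    }


# Assumption: "status.keyword" exists through dynamic mapping of the status field.
body = filtered_search(
    "Open Search",
    ["shortDescription", "longDescription"],
    {"status.keyword": "PUBLISH"},
)
print(json.dumps(body, indent=2))
```

The printed body can be sent with GET books/_search in the same way as the queries in the scenarios.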
In the follow-up article, we will be focusing on explaining what is happening under the hood by using the explain API. Stay tuned...
Exercise:
Step 1: Download the sample books dataset from the following link.
Step 2: Create an index called books with the mappings given below. For the sake of this article, we have created mappings only for the fields against which the search is performed.
PUT books
{
  "mappings": {
    "properties": {
      "categories": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "longDescription": {
        "type": "text"
      },
      "shortDescription": {
        "type": "text"
      }
    }
  }
}
Note: It is important to create an explicit mapping for better performance and less overhead.
Step 3: Upload the JSON into Elasticsearch using the bulk API and start your search query journey.
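The bulk API expects newline-delimited JSON: one action line followed by one document line per book, with a newline at the end of the body. Here is a minimal Python sketch that converts a list of book documents into such a body. It assumes the dataset has already been loaded into a list of dictionaries; the helper name to_bulk_body is hypothetical.

```python
import json


def to_bulk_body(docs, index="books"):
    """Turn a list of documents into a newline-delimited body for the bulk API."""
    lines = []
    for doc in docs:
        # Action line: index the next document into the given index.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Two shortened example documents.
books = [
    {"title": "Unlocking Android", "isbn": "1933988673"},
    {"title": "Lucene in Action, Second Edition", "isbn": "1933988177"},
]
print(to_bulk_body(books), end="")
```

The resulting text can be sent as the body of POST _bulk with the Content-Type header set to application/x-ndjson.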