Introduction

The built-in scoring mechanism in Elasticsearch and Solr can seem mysterious to beginners and experienced practitioners alike. Instead of delving into the mathematical definitions of TF-IDF and BM25, this article will help you develop an intuitive understanding of these metrics by walking you through a series of simple examples. Each example consists of a query and a list of several indexed documents. As you read along, try to guess which document comes up on top for each query. In each case, we will examine why that particular document gets the highest score and we’ll extract the general principle behind this behavior. A set of six examples will be followed by an extra credit section focusing on more advanced topics. Along with illustrating all of the key behaviors of BM25, our examples will touch on some of the gotchas around scoring in cluster scenarios, where shards and replicas come into play. This article aims to teach you, in a short time and without any math, everything you’ll ever need to know about scoring. Having a solid understanding of scoring will prepare you to better diagnose relevance problems and improve relevance in real-world applications.

Query 1: dog

Let’s say I search for “dog” and there are only three documents in my index, as shown below. Which one of these documents is going to come up on top?

Doc 1: "dog"

Doc 2: "dog dog"

Doc 3: "dog dog dog"

If you’re not quite sure, that’s good, because I haven’t given you enough context to know the answer. All the queries in this article were tested in Elasticsearch 7.4, where BM25 is the default scoring algorithm and its parameters are set to k1=1.2 and b=0.75. (Please ignore that if it’s meaningless to you.) In most of the examples, we’ll assume the documents have a single text field called “title” that uses the standard analyzer:

"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "standard"
}
}
}

For the most part, we’ll be doing simple match queries against the title field. Recall that the default boolean operator in Elasticsearch is OR. Here’s what our “dog” query looks like:

GET test/_search
{
  "query": {
    "match": {
      "title": "dog"
    }
  }
}

With those details out of the way, are you ready to tell me which one of those three documents (“dog,” “dog dog,” and “dog dog dog”) is going to get the highest score?

Here are the results:

ID | Title       | Score
1  | dog         | 0.167
2  | dog dog     | 0.183
3  | dog dog dog | 0.189

Doc 3 gets the highest score because it contains the highest number of tokens that match the query term. Another way to say this is that Doc 3 has the highest term frequency for the term in question. Of course, if we were using the keyword analyzer, only Doc 1 would have matched the query, but we’re using the standard analyzer, which breaks the title into multiple tokens. The moral of this story is that, from a scoring perspective at least,

High term frequency is good.

Before we move on, take a moment to compare Doc 1 and Doc 3. Notice that although Doc 3 has three times the term frequency for “dog” as Doc 1, its score isn’t three times as high. So while higher term frequency gives a higher score, its impact is not multiplicative.
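To see this saturation numerically, here’s a minimal Python sketch of BM25’s term-frequency component, using the article’s k1=1.2 but ignoring IDF and length normalization. The simplifications are mine; this is an illustration, not the exact Elasticsearch computation.

```python
# Sketch of BM25's term-frequency saturation (IDF and length
# normalization omitted for simplicity). k1=1.2 is the default.
def tf_component(tf, k1=1.2):
    # Increases with tf, but is bounded above by k1 + 1 = 2.2.
    return tf * (k1 + 1) / (tf + k1)

print(tf_component(1))  # 1.0
print(tf_component(3))  # ~1.57: higher, but well under three times 1.0
```

Tripling the term frequency raises the score, but nowhere near threefold, which matches the gap between Doc 1 and Doc 3 above.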

Query 2: dog dog cat

Now I’m searching for “dog dog cat” and there are only two documents in my index:

Doc 1: "cat"
Doc 2: "dog"

Which one of these is going to come up on top? Or are they going to be tied?

In fact, “dog” is the winner here:

ID | Title | Score
1  | cat   | 0.6
2  | dog   | 1.3

Why does “dog” get twice the score of “cat?” The lesson here is that

Scores for each query term are summed.

Our query has two instances of “dog.” The score for the whole query is the sum of the scores for each term. Each instance of “dog” in our query matches the “dog” in Doc 2 and contributes roughly 0.65 to the score, for a total of about 1.3. Using the standard analyzer, query terms aren’t deduplicated, so each instance of “dog” is treated separately. Doc 1 doesn’t get a similar advantage because our query contains only one instance of “cat.”
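As a tiny illustration of the summing behavior (the 0.65 per-occurrence score here is hypothetical, not a real Elasticsearch value):

```python
# The whole-query score is the sum over query-term occurrences;
# duplicate query terms are not deduplicated.
query_terms = ["dog", "dog", "cat"]      # our analyzed query
doc2_scores = {"dog": 0.65, "cat": 0.0}  # hypothetical per-term scores for Doc 2

total = sum(doc2_scores[t] for t in query_terms)
print(total)  # 1.3
```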

Query 3: dog dog cat

Now I’m executing the same query as before, “dog dog cat,” but my index is different. This time I have lots of “dog” documents:

Doc 1: "dog"
Doc 2: "dog"
Doc 3: "dog"
Doc 4: "dog"
Doc 5: "dog"
Doc 6: "dog"
Doc 7: "cat"

What’s going to happen now? Do the “dog” documents still win over “cat” because my query mentions “dog” twice? Here are the results:

ID | Title | Score
1  | dog   | 0.4
2  | dog   | 0.4
3  | dog   | 0.4
4  | dog   | 0.4
5  | dog   | 0.4
6  | dog   | 0.4
7  | cat   | 1.5

The results are different this time because the terms have different document frequencies than before. A term’s document frequency is the number of documents in the index that contain the term. From a scoring perspective, low document frequency is good and high document frequency is bad. In this example, “cat” is a rare term in the index — it has low document frequency — so matches on that term help the score more than matches on “dog,” which is a common term. The lesson here is that:

Matches for rarer terms are better.
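If you’re curious how this plays out numerically, here’s a rough sketch of the BM25 inverse document frequency as Lucene computes it; treat it as an approximation of one factor in the score, not the full computation.

```python
import math

# BM25 IDF (as in Lucene): rarer terms get larger values.
# N = total docs in the index; df = number of docs containing the term.
def idf(df, N):
    return math.log(1 + (N - df + 0.5) / (df + 0.5))

idf_cat = idf(1, 7)  # "cat" appears in 1 of 7 docs: large IDF
idf_dog = idf(6, 7)  # "dog" appears in 6 of 7 docs: small IDF
```

Even with two “dog” occurrences in the query, the rare-term advantage of “cat” dominates.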

If I want to tell the search engine that a common term is particularly important to me in a certain scenario, I can boost the term. If I had executed my query with a boost of 7 on “dog,” the dog documents would come up above the cat document. Here’s how I’d set that up:

GET test/_search
{
  "query": {
    "query_string": {
      "query": "dog^7 cat",
      "fields": ["title"]
    }
  }
}

Query 4: dog cat

In this example I’m searching for “dog cat,” and I’ve got three documents in my index: one with a lot of dogs, one with a lot of cats, and one with a single instance of dog and cat each, plus a lot of mats.

Doc 1: "dog dog dog dog dog dog dog"
Doc 2: "cat cat cat cat cat cat cat"
Doc 3: "dog cat mat mat mat mat mat"

Which document comes up on top this time? Notice that in Doc 1 and Doc 2, every single term matches one of the query terms, whereas in Doc 3 there are five terms that don’t match anything. So the results might be a little surprising:

ID | Title                       | Score
1  | dog dog dog dog dog dog dog | 0.88
2  | cat cat cat cat cat cat cat | 0.88
3  | dog cat mat mat mat mat mat | 0.94

Document 3 gets the highest score because it has matches for both of the query terms, “dog” and “cat.” While Documents 1 and 2 have higher term frequency for “dog” and “cat” respectively, they each contain only one of the terms. The lesson is that

Matching more query terms is good.

Query 5: dog

Now I’ll search for “dog” and there are only two documents in my index:

Doc 1: "dog cat zebra"
Doc 2: "dog cat"

Both of these documents match my query, and both have some terms that don’t match. Which document does best? Here are the results:

ID | Title         | Score
1  | dog cat zebra | 0.16
2  | dog cat       | 0.19

In this case, Document 2 does better because it is shorter. The thinking is that when a term occurs in a shorter document, we can be more confident that the term is significant to the document (or that the document is about the term). When a term occurs in a longer document, we have less confidence that this occurrence is meaningful. The lesson here is that

Matches in shorter documents are better.
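Here’s a hedged Python sketch of how BM25’s length normalization produces this effect: same term frequency, but the shorter document gets the higher score. The formula is simplified (IDF omitted) and the numbers are illustrative.

```python
# Sketch of BM25 length normalization with k1=1.2 and b=0.75.
# doc_len is the document's length in tokens; avg_len is the field average.
def tf_norm(tf, doc_len, avg_len, k1=1.2, b=0.75):
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return tf * (k1 + 1) / (tf + norm)

avg_len = 2.5                    # mean of 3 ("dog cat zebra") and 2 ("dog cat")
longer = tf_norm(1, 3, avg_len)  # Doc 1
shorter = tf_norm(1, 2, avg_len) # Doc 2 scores higher
```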

Query 6: orange dog

Now let’s consider a scenario that’s a little more complicated than the previous ones. For the first time, our documents will have two fields, “color” and “type.”

Doc 1: {"color": "brown", "type": "dog"}
Doc 2: {"color": "brown", "type": "dog"}
Doc 3: {"color": "brown", "type": "cat"}
Doc 4: {"color": "orange", "type": "cat"}

We’re searching for “orange dog” but as you can see, there are no orange dogs in the index. There are two brown dogs, a brown cat, and an orange cat. Which one is going to come up on top?

I should mention that we’re searching across both fields using a multi_match like this:

GET test/_search
{
  "query": {
    "multi_match": {
      "query": "orange dog",
      "fields": ["type", "color"],
      "type": "most_fields"
    }
  }
}

Here are the results:

ID | Color  | Type | Score
1  | brown  | dog  | 0.6
2  | brown  | dog  | 0.6
3  | brown  | cat  | N/A
4  | orange | cat  | 1.2

This example hints at some of the unexpected behavior that can arise when we search across multiple fields. The search engine doesn’t know which field is most important to us. If someone is searching for an orange dog, we might guess they’re more interested in seeing dogs than seeing arbitrary things that happen to be orange. (“Orange dog” would be a very strange query to enter if you meant “show me anything that’s orange.”) In this case, however, the color field is taking priority because “orange” is a rare term within that field (there are 3 browns and only 1 orange). Within the type field, “dog” and “cat” have the same frequency. The orange cat comes up on top because a match for the rare term “orange” is treated as more valuable than a match for “dog.”

If we want to give more weight to the “type” field we can boost it like this:

GET test/_search
{
  "query": {
    "multi_match": {
      "query": "orange dog",
      "fields": ["type^2", "color"],
      "type": "most_fields"
    }
  }
}

With the boost applied, the “brown dog” documents now score 1.3 and come up above the “orange cat.”

The lesson here is that searching across multiple fields can be tricky because the per-field scores are added together without concern for which fields are more important. To rectify this,

We can use boosting to express field priorities.

Query 7: dog

Now we’re moving into advanced territory, although our query looks simpler than anything we’ve seen before. We’re searching for “dog” and our index has three identical “dog” documents:

Doc 1: "dog"
Doc 2: "dog"
Doc 3: "dog"

Which one is going to come up on top? You might guess that these three documents, being identical, should get the same score. So let’s take a moment to look at how ties are handled. When two documents have the same score, they’ll be sorted by their internal Lucene doc id. This internal id is different from the value in the document’s _id field, and it can differ even for the same document across replicas of a shard. If you really want ties to be broken in the same way regardless of which replica you hit, you can add a sort to your query, where you sort first by _score and then by a designated tiebreaker like _id or date.
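Such a query might look like the following sketch (depending on your Elasticsearch version, sorting on _id may require extra setup, so treat this as illustrative):

GET test/_search
{
  "query": { "match": { "title": "dog" } },
  "sort": ["_score", { "_id": "asc" }]
}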

But this point about tiebreaking is only an aside. When I actually ran this query, the documents didn’t come back with identical scores!


ID | Title | Score
1  | dog   | 0.28
2  | dog   | 0.18
3  | dog   | 0.18

How is it possible that identical documents would get different scores? The lesson in this example is that

Term statistics are measured per shard.

Though I didn’t state it explicitly at the beginning of the post, all our previous examples were using one shard. In the current example, however, I set up my index with two shards. Document 1 landed on Shard 1 while Documents 2 and 3 landed on Shard 2. Document 1 got a higher score because within Shard 1, “dog” is a rarer term — it only occurs once. Within Shard 2, “dog” is more common — it occurs twice. Here’s how I set up the example:

PUT /test 
{ "settings": { "number_of_shards": 2 } }

PUT /test/_doc/1?routing=0
{ "title" : "dog" }

PUT /test/_doc/2?routing=1
{ "title" : "dog" }

PUT /test/_doc/3?routing=1
{ "title" : "dog" }

If you’re working with multiple shards and you want scores to be consistent regardless of which shard a document lives in, you can do a distributed frequency search by adding the following parameter to your query: search_type=dfs_query_then_fetch. This tells Elasticsearch to retrieve term statistics from all the shards and combine them before computing the scores.
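For example:

GET test/_search?search_type=dfs_query_then_fetch
{
  "query": { "match": { "title": "dog" } }
}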

But it’s also important to know that

Replicas of a shard may have different term statistics.

This is a consequence of how deletion is handled in Lucene. Documents that are marked for deletion but not yet physically removed (when their segments are merged) still contribute to term statistics. When a document is deleted, all the replicas will immediately “know” about the deletion, but they might not carry it out physically at the same time, so they might end up with different term statistics. To reduce the impact of this, you can specify a user or session ID in the shard copy preference parameter. This encourages Elasticsearch to route requests from the same user to the same replicas, so that, for example, a user will not notice scoring discrepancies when issuing the same query multiple times.

It’s also important to know that, from a scoring perspective,

Document updates behave like insertions, until segments are merged.

When you update a document in Lucene, a new version of it is written to disk and the old version is marked for deletion. But the old version continues to contribute to term statistics until it is physically deleted.

In the example below, I create a “dog cat” document and then I update its contents to be “dog zebra.” Immediately after the update, if I query for “dog” and look at the explain output, Elasticsearch tells me there are two documents containing the term “dog.” The number goes down to one after I do a _forcemerge. The moral: if you’re doing relevancy tuning in Elasticsearch, looking closely at scores, and also updating documents at the same time, be sure to run a _forcemerge after your updates, or else rebuild the index entirely.

PUT test/_doc/1
{ "title": "dog cat" }

GET test/_search?format=yaml
{ "query" : { "match" : { "title": "dog" } },
"explain": true }

PUT test/_doc/1?refresh
{ "title": "dog zebra" }

GET test/_search?format=yaml
{ "query" : { "match" : { "title": "dog" } },
"explain": true }

POST test/_forcemerge

GET test/_search?format=yaml
{ "query" : { "match" : { "title": "dog" } },
"explain": true }

Query 8: dog cat

Now let’s take another look at Query 4, where we searched for “dog cat” and we found that a document containing both terms did better than documents with lots of instances of one term or the other. In Query 4, we were searching the title field, the only field available. Here, we’ve got two fields, pet1 and pet2:

Doc 1: {"pet1": "dog", "pet2": "dog"}
Doc 2: {"pet1": "dog", "pet2": "cat"}

We’ll do a multi_match including those two fields like this:

GET test/_search
{
  "query": {
    "multi_match": {
      "query": "dog cat",
      "fields": ["pet1", "pet2"],
      "type": "most_fields"
    }
  }
}

Since Document 2 matches both of our query terms we’d certainly hope that it does better than Document 1. That’s the lesson we took away from Query 4, right? Well, here are the results:

ID | Pet 1 | Pet 2 | Score
1  | dog   | dog   | 0.87
2  | dog   | cat   | 0.87

You can see the documents are tied. It might look like “cat” is a rare term that should help Document 2 rise to the top, but within the Pet 2 field, “cat” and “dog” have the same document frequencies, and the scoring is done on a per-field basis. It also looks like Document 2 should get an advantage for matching more of the query terms, but again, the scoring is done on a per-field basis: when we compute the score in the Pet 1 field, both documents do the same; when we compute the score in the Pet 2 field, both documents again do the same.

Does this contradict what we learned from Query 4? Not quite, but it warrants a refinement of the earlier lesson:

Matching more query terms within the same field is good. But there’s no advantage when the matches happen across fields.

If you’re not happy with this situation, there are some things you can do. You can combine the contents of Pet 1 and Pet 2 into a single field. You can also switch from the most_fields to the cross_fields query type to simulate a single field. (Just be aware that cross_fields has some other consequences on scoring, changing the procedure from field-centric to term-centric. We won’t go into details here.)
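Here’s a sketch of what the cross_fields variant of our query would look like:

GET test/_search
{
  "query": {
    "multi_match": {
      "query": "dog cat",
      "fields": ["pet1", "pet2"],
      "type": "cross_fields"
    }
  }
}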

Query 9: orange dog

Now let’s revisit the lesson from Query 5, where we saw that matches in shorter fields are better. We’re going to search for “orange dog.” We have a “dog” document with a description mentioning that the dog is brown. And we have a “cat” document with a description mentioning that the cat is sometimes orange. Notice that the dog document has a longer description than the cat document, and both descriptions are longer than the contents of the type field.

Doc 1: {"type": "dog", "description": "A sweet and loving pet that is always eager to play. Brown coat. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non nibh sagittis, mollis ex a, scelerisque nisl. Ut vitae pellentesque magna, ut tristique nisi. Maecenas ut urna a elit posuere scelerisque. Suspendisse vel urna turpis. Mauris viverra fermentum ullamcorper. Duis ac lacus nibh. Nulla auctor lacus in purus vulputate, maximus ultricies augue scelerisque."}

Doc 2: {"type": "cat", "description": "Puzzlingly grumpy. Occasionally turns orange."}

We’ll do a multi_match like this:

GET test/_search
{
  "query": {
    "multi_match": {
      "query": "orange dog",
      "fields": ["type", "description"],
      "type": "most_fields"
    }
  }
}

What’s going to take precedence here: the match for “dog” in the type field, which is a really short field, or the match for “orange” in the description field, which is significantly longer? If matches in shorter fields are better, shouldn’t “dog” win here? In fact, the results look like this:

ID | Type | Description                | Score
1  | dog  | A sweet… brown…            | 0.69
2  | cat  | Puzzlingly grumpy… orange. | 1.06

The match for “orange” in the Description field is sending Document 2 to the top even though that match occurs in a longer field body than the match for “dog” in Document 1. Does this contradict what we learned about matches in short and long fields from Query 5? No, but it points to something we hadn’t mentioned. The lesson is that:

“Shortness” is relative to the field’s average.

Within the type field, “dog” and “cat” don’t get an advantage for being short because, in fact, they’re both of average length for that field. On the other hand, the description for the cat is shorter than average for the description field overall, so it gets a benefit for being “short.”

Query 10: abcd efghijklmnopqrstuvwxyz

Here’s an easy one to close out our set of examples. I’ve divided the alphabet into two query terms. One of them is a short term, “abcd,” and the other is a long term, “efghijklmnopqrstuvwxyz.” I’m going to search for both terms together: “abcd efghijklmnopqrstuvwxyz.” My index has one match for each term:

Doc 1: "abcd"
Doc 2: "efghijklmnopqrstuvwxyz"

Which document is going to do the best? The results look like this:

ID | Title                  | Score
1  | abcd                   | 0.69
2  | efghijklmnopqrstuvwxyz | 0.69

Why are the documents tied if it’s true that matches in shorter fields are better? The lesson is that

Term length is not significant.

When we talk about a short or long field, we’re talking about how many terms the field contains, not how many characters. In this example, the titles for both documents are of length 1.

That’s it for now. Hopefully these examples have helped build your intuitions about how scoring works in Elasticsearch and Solr. Thanks for following along! If you’d like to go further and understand how the BM25 scoring function actually achieves the behaviors we’ve seen here, check out our companion article on Understanding TF-IDF and BM25.

KMW solves hard search problems. Contact us to learn more.