
Lexeme searches prefer forms over lemmas
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

None of the displayed lexemes are an exact match to the search term. The two lexemes which are an exact match are only shown if you click "more".

This is misleading, because it makes it seem like we don't have a lexeme for that word yet.

What should have happened instead?:

The two lexemes which are an exact match should be the first results.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Screenshot:

asse.png (525×629 px, 18 KB)

The two exact matches are not the first results on Special:Search either: https://www.wikidata.org/w/index.php?ns146=1&search=asse

Event Timeline

Let's spend at most 1 day investigating the issue. Once we understand the problem, we'll re-discuss whether we want to fix it.

EBernhardson subscribed.

The UI for adding statements is using wbsearchentities (explain). Target results are L1191921 and L1144955.
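For anyone who wants to reproduce the lookup outside the UI, here is a minimal sketch that only builds a wbsearchentities request URL for the search term from this ticket. The `language` value of "en" and the `limit` of 7 are assumptions for illustration; the ticket does not say which values the statement UI actually sends.

```python
from urllib.parse import urlencode

# Wikidata's action API endpoint.
API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def build_lexeme_search_url(term, language="en", limit=7):
    """Build a wbsearchentities URL restricted to lexemes.

    language and limit are illustrative defaults, not values taken
    from the ticket.
    """
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "lexeme",
        "limit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = build_lexeme_search_url("asse")
print(url)
```

Fetching that URL and inspecting the order of the `search` array in the response should show where the two target lexemes land.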

The scoring method for wbsearchentities can be summarized as bucketing results into 3 groups based on how well they match, and then sorting by popularity (statement count and incoming link count) within each bucket. Of all the docs that make the best possible match (near_match on lemma or near_match on lexeme_forms.representation), the two target documents have the lowest popularity, with zero incoming links and a single statement each. Reviewing a few of the documents that were not targeted but ranked higher, they also match lexeme_forms.representation. In a more traditional search context using term frequencies, the fact that the target lexemes have a single statement each would push them up in the ranking, but because wbsearchentities isn't giving them individual scores, that doesn't happen here.
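The bucket-then-popularity behaviour described above can be sketched roughly as follows. This is a toy model, not the actual CirrusSearch scoring: the bucket numbering, the popularity formula (a plain sum of statements and incoming links), and every document other than the two targets are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    id: str
    bucket: int          # 0 = best match bucket (hypothetical numbering)
    statements: int
    incoming_links: int


def popularity(doc):
    # Assumed stand-in for the real popularity signal.
    return doc.statements + doc.incoming_links


def rank(docs):
    # Better bucket first, then higher popularity within the bucket.
    return sorted(docs, key=lambda d: (d.bucket, -popularity(d)))


docs = [
    Doc("L1191921", bucket=0, statements=1, incoming_links=0),  # target
    Doc("L1144955", bucket=0, statements=1, incoming_links=0),  # target
    Doc("hypothetical-form-match-A", bucket=0, statements=6, incoming_links=3),
    Doc("hypothetical-form-match-B", bucket=0, statements=4, incoming_links=1),
    Doc("hypothetical-weaker-match", bucket=1, statements=50, incoming_links=20),
]

ranked_ids = [d.id for d in rank(docs)]
print(ranked_ids)
```

In this toy ranking the two targets sit at the bottom of the best bucket: every doc in that bucket counts as an equally good match, so only popularity separates them.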

One thing we could do is be less strict on the bucketing. In a quick test, setting a dismax tie breaker of 0.02 boosts these target documents to the top of the ranking. The tie breaker is not directly configurable: it was set in the initial commit for WikibaseLexemeCirrusSearch and never changed. It does at least read from our profile service, so it shouldn't be too hard to add a custom profile parameter that controls the dismax tie breaker and set it to something that works a bit better. What value is appropriate is hard to say. At 0.01 these docs get a boost up into the top 7, but not all the way to the top. Essentially, what pushes these docs to the top of the ranking with the tie breaker is that they match both the lemma and the lexeme_forms.representation field, where the other docs match only one of the two fields.
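To illustrate why a non-zero tie breaker favours docs that match both fields, here is a small sketch of the usual dis_max combination: the best clause score plus the tie breaker times the sum of the other matching clause scores. The per-field scores of 1.0 are made up, and the sketch leaves out the popularity component entirely, so it will not reproduce the top-7 versus top-1 difference observed in the quick test.

```python
def dis_max(clause_scores, tie_breaker=0.0):
    """Combine per-clause scores the way a dis_max query does."""
    matched = [s for s in clause_scores if s > 0]
    if not matched:
        return 0.0
    best = max(matched)
    # Non-best matching clauses contribute only a tie_breaker-sized share.
    return best + tie_breaker * (sum(matched) - best)


# Hypothetical clause scores: 1.0 for each field that matches.
both_fields = [1.0, 1.0]   # matches lemma and lexeme_forms.representation
one_field = [1.0]          # matches only one of the two fields

for tb in (0.0, 0.01, 0.02):
    print(tb, dis_max(both_fields, tb), dis_max(one_field, tb))
```

With a tie breaker of 0 both kinds of doc get the same match score, so popularity alone decides the order. Any positive tie breaker gives the two-field match a small edge, and a larger value makes that edge harder for popularity to overcome.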

Change 973882 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Introduce a tie-breaker for entity search

https://gerrit.wikimedia.org/r/973882

Another possibility might be to increase the importance of a lemma match over a representation match. Currently they are treated exactly the same. I'm not yet entirely sure what the distinction between these values is.
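A rough sketch of that idea, assuming each field match carries its own boost and the doc takes the highest boost among the fields it matches. The boost values 2.0 and 1.0 are purely hypothetical and are not taken from any search profile.

```python
# Hypothetical per-field boosts; currently both fields are treated the same.
FIELD_BOOST = {
    "lemma": 2.0,
    "lexeme_forms.representation": 1.0,
}


def match_score(matched_fields, boosts=FIELD_BOOST):
    # A doc scores as its best-boosted matching field, or 0.0 with no match.
    return max((boosts[f] for f in matched_fields), default=0.0)


lemma_hit = match_score({"lemma"})
form_hit = match_score({"lexeme_forms.representation"})
print(lemma_hit, form_hit)
```

Under this weighting a lemma match outranks a form-only match regardless of how popular the form-only doc is, provided match quality is compared before popularity.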

Change 973882 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Rank lemma's over forms

https://gerrit.wikimedia.org/r/973882

The example in the ticket looks to work as expected now.