Lexeme searches prefer forms over lemmas
Closed, Resolved | Public | BUG REPORT
Assigned To

Authored By

	Nikki
	Oct 13 2023, 3:38 PM

Description

Steps to replicate the issue (include links if applicable):

Start adding a statement on https://www.wikidata.org/wiki/Lexeme:L123
Select the property Sandbox-Lexeme
Enter "asse"

What happens?:

None of the displayed lexemes are an exact match to the search term. The two lexemes which are an exact match are only shown if you click "more".

This is misleading, because it makes it seem like we don't have a lexeme for that word yet.

What should have happened instead?:

The two lexemes which are an exact match should be the first results.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Screenshot:

The two exact matches are not the first results on Special:Search either: https://www.wikidata.org/w/index.php?ns146=1&search=asse

Details

	Subject	Repo	Branch	Lines +/-
	Rank lemma's over forms	mediawiki/extensions/WikibaseLexemeCirrusSearch	master	+29 -27

Related Objects

Mentioned In: rEWLCbce24497d5f1: Rank lemma's over forms

Event Timeline

Nikki created this task. Oct 13 2023, 3:38 PM

Gehel moved this task from needs triage to Current work on the Discovery-Search board. Oct 16 2023, 3:18 PM

Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

Let's spend at most 1 day investigating the issue; we'll re-discuss whether we want to fix it once we understand the problem.

EBernhardson moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board. Oct 23 2023, 3:30 PM

The UI for adding statements is using wbsearchentities (explain). Target results are L1191921 and L1144955.

The method of scoring for wbsearchentities could be summarized as bucketing results into 3 groups based on how well they match, and then sorting by popularity (statement count and incoming link count) within those buckets. Of all the docs that make the best possible match (near_match on lemma or near_match on lexeme_forms.representation), the two target documents have the lowest popularity, with zero incoming links and a single statement each. Reviewing a few of the documents that were not targeted but ranked higher, they also match lexeme_forms.representation. In a more traditional search context using term frequencies, the fact that the target lexemes have a single statement each would push them up in the ranking, but because wbsearchentities isn't giving them individual scores, that doesn't happen here.
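
As a rough illustration of the ranking described above, the following Python sketch buckets documents by match quality and then sorts by popularity within each bucket. The bucket numbering, the way popularity is combined (a plain sum), and the non-target documents with their counts are assumptions made for this example; they are not taken from the wbsearchentities implementation.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    bucket: int          # 0 = best match tier (hypothetical numbering)
    statements: int
    incoming_links: int


def popularity(doc: Doc) -> int:
    # Assumption: a plain sum; the real weighting is not given in this task.
    return doc.statements + doc.incoming_links


def rank(docs: list[Doc]) -> list[Doc]:
    # Better bucket first, then higher popularity within the same bucket.
    return sorted(docs, key=lambda d: (d.bucket, -popularity(d)))


docs = [
    Doc("L1191921", 0, statements=1, incoming_links=0),        # target
    Doc("L1144955", 0, statements=1, incoming_links=0),        # target
    Doc("form-match-a", 0, statements=5, incoming_links=3),    # made-up doc
    Doc("form-match-b", 0, statements=2, incoming_links=1),    # made-up doc
    Doc("weaker-match", 1, statements=50, incoming_links=40),  # made-up doc
]

ranked_ids = [d.doc_id for d in rank(docs)]
print(ranked_ids)
```

In this sketch the two targets land at the bottom of the best bucket: every document in that bucket counts as an equally good match, so only popularity separates them.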

One thing we could do is be less strict on the bucketing. In a quick test, setting a dismax tie breaker of 0.02 gives these target documents a boost up to the top of the ranking. The tie breaker is not directly configurable; it was set in the initial commit for WikibaseLexemeCirrusSearch and never changed. This does read from our profile service at least, so it shouldn't be too hard to add a custom profile parameter to control the dismax tie breaker and set it to something that works a bit better. What value is appropriate is hard to say: at 0.01 these docs get a boost up into the top 7, but not all the way to the top. Essentially, what ends up pushing these docs to the top of the ranking with the tie breaker is that they match both the lemma and lexeme_forms.representation fields, where the other docs only match one of the two fields.
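
To make the tie-breaker effect concrete, here is a minimal Python sketch of how a dis_max query combines clause scores: the best matching clause, plus the tie breaker times the sum of the other matching clauses. The per-field scores of 1.0 are invented for illustration, and the popularity component of the real ranking is left out.

```python
def dis_max(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the remaining matching scores."""
    matched = [s for s in clause_scores if s > 0]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)


# Scores are listed as [lemma, lexeme_forms.representation]; values are made up.
both_fields = [1.0, 1.0]   # e.g. a target doc matching lemma and a form
one_field = [0.0, 1.0]     # a doc matching only a form representation

strict = (dis_max(both_fields), dis_max(one_field))
relaxed = (dis_max(both_fields, 0.02), dis_max(one_field, 0.02))
print(strict, relaxed)
```

With a tie breaker of 0 the two documents tie, so only popularity can separate them. With 0.02 the document matching both fields scores slightly higher before popularity is considered.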

Change 973882 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Introduce a tie-breaker for entity search

https://gerrit.wikimedia.org/r/973882

gerritbot added a project: Patch-For-Review. Nov 13 2023, 11:08 PM

Another possibility might be to increase the importance of a lemma match over a representation match. Currently they are treated exactly the same. I'm not entirely sure yet what the distinction is between these values.
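
A minimal sketch of that idea, assuming hypothetical per-field weights (the values 2.0 and 1.0 are invented, and this is not necessarily how the eventual patch implements it): a document's score is the weight of its best matching field, so a lemma match outranks a form-only match.

```python
# Hypothetical weights; only the relative order (lemma > form) matters here.
FIELD_WEIGHTS = {
    "lemma": 2.0,
    "lexeme_forms.representation": 1.0,
}


def weighted_score(matched_fields, weights=FIELD_WEIGHTS):
    # Score by the highest-weighted field that matched; 0.0 if nothing matched.
    return max((weights[f] for f in matched_fields), default=0.0)


lemma_match = weighted_score(["lemma"])
form_match = weighted_score(["lexeme_forms.representation"])
print(lemma_match, form_match)
```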

Change 973882 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Rank lemma's over forms

https://gerrit.wikimedia.org/r/973882

ReleaseTaggerBot added a project: MW-1.42-notes (1.42.0-wmf.7; 2023-11-28). Nov 16 2023, 8:01 AM

Maintenance_bot removed a project: Patch-For-Review. Nov 16 2023, 8:10 AM

EBernhardson mentioned this in rEWLCbce24497d5f1: Rank lemma's over forms. Nov 16 2023, 8:13 AM

EBernhardson moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board. Nov 16 2023, 6:05 PM