
on-wiki search is failing to find relatively newer titles on enwiki
Closed, Resolved · Public · 2 Estimated Story Points · BUG REPORT

Description

New articles not appearing in search bar autocomplete

When typing a title into the search box on enwiki, relatively new articles do not appear in the list of suggestions. For example, https://en.wikipedia.org/wiki/Statue_of_Queen_Victoria,_Hove was created 2022-12-23. When I begin typing the page title, the article is not suggested until I have typed the complete title exactly (e.g., typing "Statue of Queen Victoria, Hov" shows nothing).

Reported by multiple users at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_202#Articles_created_not_appearing_in_search_results

Steps to reproduce:

  1. Create a page with a new title
  2. (Optional, but done to rule out problems) Mark the page as patrolled
  3. Wait some time (weeks, in this example)
  4. Try to search for the title on wiki

What is being reported:

  1. The search autocomplete will not complete the title
  2. When searching by a substring of the title (in this case, just leaving off some characters from the end), the page is not included in the search results

What is expected:
Both the autocomplete and the search results should include the page. The search documentation at https://www.mediawiki.org/wiki/Help:CirrusSearch#How_frequently_is_the_search_index_updated? suggests indexing occurs in "near real time".

Event Timeline

Restricted Application added a subscriber: Aklapper.
Xaosflux renamed this task from New articles not appearing in search bar autocomplete to on-wiki search is failing to find relatively newer titles. Jan 17 2023, 7:29 PM
Xaosflux updated the task description.

For what it is worth, external search providers such as Google find the article just fine.

Xaosflux renamed this task from on-wiki search is failing to find relatively newer titles to on-wiki search is failing to find relatively newer titles on enwiki. Jan 17 2023, 7:33 PM

I was not able to reproduce this problem on dewiki.

We need to update the documentation. But at least:

  • "near realtime" means usually a few minutes, but up to 30 minutes is still considered normal conditions. This applies to the full text search. The title suggest index (used for the auto complete in the search box) is updated daily.
  • We have a process that reindexes all pages over a period of 8 weeks. So if an update is lost, it will eventually be re-applied.
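The timing rules above can be turned into a rough expectation check. The following is only a sketch: the three thresholds are taken directly from the bullets (30 minutes, daily, 8 weeks), while the function name and the returned keys are hypothetical and not part of CirrusSearch.

```python
from datetime import datetime, timedelta

# Thresholds as stated in the comment above; names are illustrative only.
FULLTEXT_MAX_LAG = timedelta(minutes=30)   # "near realtime" upper bound
TITLESUGGEST_MAX_LAG = timedelta(days=1)   # title suggest index is rebuilt daily
FULL_REINDEX_PERIOD = timedelta(weeks=8)   # all pages are reindexed over 8 weeks


def expected_visibility(created: datetime, now: datetime) -> dict:
    """Return which search features a page of this age should normally appear in."""
    age = now - created
    return {
        # Full-text search should have the page once the normal lag has passed.
        "fulltext": age > FULLTEXT_MAX_LAG,
        # Autocomplete should have it after the next daily build.
        "titlesuggest": age > TITLESUGGEST_MAX_LAG,
        # Only after a full reindex cycle is a lost update certain to be re-applied.
        "lost_update_reapplied": age > FULL_REINDEX_PERIOD,
    }


# The page from the report: created 2022-12-23, still missing on 2023-01-17.
print(expected_visibility(datetime(2022, 12, 23), datetime(2023, 1, 17)))
```

For the reported page, this says it should already be visible in both full-text search and autocomplete, and that the 8-week reindex window had not yet elapsed.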

It is unlikely that we'll be able to investigate the indexing failures. We are working on a new version of our update pipeline (T317045) that should give us better visibility into search failures.

Gehel set the point value for this task to 2. Feb 13 2023, 4:37 PM

I tried to clarify the section of docs about updates to include the distinction between full-text and title completion search indexes: https://www.mediawiki.org/w/index.php?title=Help:CirrusSearch&diff=prev&oldid=5897819

To check that titles are making it into the primary search index, I wrote a quick Python script (P47281) and ran it against the last 7 days' worth of new pages according to the recent changes API. It found 12572 pages that were created and should exist in the enwiki search index. Of these, 12 were not found in the search index. A manual check shows them all to be redirects to redirects, which we don't index. This looks to be generally working, although there could certainly be edge cases that are not handled.
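The shape of such a check might look roughly like the sketch below. This is not the contents of P47281, which is not reproduced here. It assumes the standard MediaWiki recent changes API parameters (list=recentchanges, rctype=new) and leaves the actual index lookup to a caller-supplied function, is_indexed, which is hypothetical.

```python
from datetime import datetime
from typing import Callable, Iterable


def new_pages_params(since: datetime) -> dict:
    """Build query parameters listing pages created since the given UTC time."""
    return {
        "action": "query",
        "format": "json",
        "list": "recentchanges",
        "rctype": "new",              # page creations only
        "rcprop": "title|timestamp",
        "rcdir": "newer",             # oldest first, so rcstart is the lower bound
        "rcstart": since.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "rclimit": "max",
    }


def find_unindexed(titles: Iterable[str], is_indexed: Callable[[str], bool]) -> list:
    """Return the titles that the (caller-supplied) index lookup cannot find."""
    return [title for title in titles if not is_indexed(title)]


# Usage with a stand-in index; a real check would query the search index instead.
fake_index = {"Statue of Queen Victoria, Hove"}
created = ["Statue of Queen Victoria, Hove", "Some double redirect"]
print(find_unindexed(created, lambda t: t in fake_index))
```

Any titles returned would still need a manual look, since some (such as redirects to redirects) are not indexed by design.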

Poking at the logs for the script that builds the daily autocomplete indices, it looks like we may be missing errors that happen there. The logs show that the enwiki completion index failed its daily build from Dec 9 2022 through Jan 20 2023. This was not identified by any of our monitoring; we should correct that so these errors bubble up sooner and get fixed immediately.

We can use an elasticsearch query to find the oldest dated completion indices. This query will give us the 5 titlesuggest indices with the oldest batch_id (~= indexing timestamp) when issued against the :9243 cluster:

POST /omega:*_titlesuggest,psi:*_titlesuggest,*_titlesuggest/_search
{
    "size": 0,
    "aggs": {
        "by_index": {
            "terms": {
                "field": "_index",
                "size": 5,
                "order": {"max_batch_id": "asc"}
            },
            "aggs": {
                "max_batch_id": {"max":{"field": "batch_id"}}
            }
        }
    }
}

Not sure yet where it goes. We could integrate this into prometheus-wmf-elasticsearch-exporter and then add an alertmanager rule; I suppose that would be the most straightforward. We could record the age (now - batch_id) of the oldest batch per cluster and alert if the age exceeds a couple of days. It feels like a little overkill to run this query every minute, though.
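As a rough illustration of the age calculation, the sketch below takes the JSON response of the aggregation query and reports indices whose newest batch is older than a threshold. It is not the exporter code from the patches that follow. It assumes batch_id is a Unix timestamp in seconds (the comment above only says it approximates the indexing timestamp), and the two-day threshold and function name are illustrative.

```python
import time
from typing import Optional

MAX_AGE_SECONDS = 2 * 24 * 3600  # "a couple of days"; illustrative threshold


def stale_indices(response: dict, now: Optional[float] = None,
                  max_age: float = MAX_AGE_SECONDS) -> list:
    """Return (index_name, age_seconds) pairs for indices older than max_age.

    Assumes batch_id is a Unix timestamp in seconds.
    """
    now = time.time() if now is None else now
    stale = []
    for bucket in response["aggregations"]["by_index"]["buckets"]:
        batch_id = bucket["max_batch_id"]["value"]
        if batch_id is None:
            # No document in this index carries a batch_id; skipped here.
            continue
        age = now - batch_id  # age = now - batch_id, as described above
        if age > max_age:
            stale.append((bucket["key"], age))
    # Oldest first, so the worst offender leads the list.
    return sorted(stale, key=lambda pair: pair[1], reverse=True)
```

Since the title suggest indices are only rebuilt daily, a check like this could run far less often than once a minute without losing much.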

Change 911940 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] search: Report age of titlesuggest indices to prometheus

https://gerrit.wikimedia.org/r/911940

Change 911945 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/alerts@master] search: Add alert based on age of titlesuggest indices

https://gerrit.wikimedia.org/r/911945

Change 911940 merged by Ryan Kemper:

[operations/puppet@production] search: Report age of titlesuggest indices to prom

https://gerrit.wikimedia.org/r/911940

Change 913959 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] search: Fix collection of *_titlesuggest metric on small clusters

https://gerrit.wikimedia.org/r/913959

Change 913959 merged by Ryan Kemper:

[operations/puppet@production] search: Fix collection of *_titlesuggest metric on small clusters

https://gerrit.wikimedia.org/r/913959

Change 911945 merged by jenkins-bot:

[operations/alerts@master] search: Add alert based on age of titlesuggest indices

https://gerrit.wikimedia.org/r/911945