
Huge number of duplicate rows in wb_terms
Closed, ResolvedPublic

Description

There's apparently a huge number of duplicate rows in wb_terms, for example:

mysql:wikiadmin@db1045 [wikidatawiki]> SELECT * FROM wb_terms WHERE term_entity_type = 'item' AND term_entity_id = 2807 AND term_language = 'es' AND term_type = 'description'\G 
*************************** 1. row ***************************
     term_row_id: 709061768
  term_entity_id: 2807
term_entity_type: item
   term_language: es
       term_type: description
       term_text: ciudad, capital de España
 term_search_key: ciudad, capital de españa
     term_weight: 0.254
*************************** 2. row ***************************
     term_row_id: 709061771
  term_entity_id: 2807
term_entity_type: item
   term_language: es
       term_type: description
       term_text: ciudad, capital de España
 term_search_key: ciudad, capital de españa
     term_weight: 0.254
2 rows in set (0.01 sec)

I found the above example merely by looking up a single term; testing other combinations also quickly brought up duplicates, so I assume there are a lot of these.

Apparently we have over 129M rows which appear in the table at least twice (that means at least ~65M duplicate rows)!

Event Timeline

Number of rows which occur at least twice (excludes aliases, for performance reasons):

mysql:wikiadmin@db1070 [wikidatawiki]> SELECT COUNT(*) FROM wb_terms AS t1 WHERE term_type != 'alias' AND EXISTS(SELECT 1 FROM wb_terms AS t2 USE INDEX(wb_terms_entity_id) WHERE t1.term_entity_type = t2.term_entity_type AND t1.term_entity_id = t2.term_entity_id AND t1.term_type = t2.term_type AND t1.term_language = t2.term_language AND t1.term_search_key = t2.term_search_key AND t1.term_row_id != t2.term_row_id);
+-----------+
| COUNT(*)  |
+-----------+
| 129337897 |
+-----------+
1 row in set (21 hours 13 min 33.19 sec)

Probably at least 10% of the table consists of duplicates (at least 129M/2 ≈ 65M rows out of ~613M).

mysql:wikiadmin@db1070 [wikidatawiki]> SELECT COUNT(*) FROM wb_terms;
+-----------+
| COUNT(*)  |
+-----------+
| 613038548 |
+-----------+
1 row in set (2 min 41.65 sec)
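(For illustration only: the surplus rows could also be counted directly by grouping on the columns that identify a term. This is a sketch assuming the wb_terms columns shown above, not the query that was actually run, and it would be similarly slow on a table this size.)

-- Count how many rows are redundant copies, i.e. the sum of (group size - 1)
-- over all groups of rows sharing the same identifying columns.
SELECT SUM(cnt - 1) AS surplus_rows
FROM (
    SELECT COUNT(*) AS cnt
    FROM wb_terms
    WHERE term_type != 'alias'
    GROUP BY term_entity_type, term_entity_id, term_type,
             term_language, term_search_key
    HAVING COUNT(*) > 1
) AS dup_groups;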

List of duplicates (4,295 entries) from test Wikidata: P5325

https://test.wikidata.org/w/index.php?title=Property:P29941&action=history is one of them and was just created today, so there must be an issue currently in the code that is causing this.

The edit linked above was done by the Selenium test user, which appears to use wbeditentity:

https://github.com/wmde/WikidataApiGem/blob/master/lib/mediawiki_api/wikidata/wikidata_client.rb

(I can't reproduce the issue locally yet, but I need to investigate further.)

WMDE-leszek moved this task from Proposed to Doing on the Wikidata-Former-Sprint-Board board.
WMDE-leszek moved this task from Doing to Backlog on the Wikidata-Former-Sprint-Board board.

This is a bit off-topic for T163551, but with the latest schema changes wb_terms has become the largest table on any of our wikis (with the exception of revision on enwiki and image on commons), and I think it will get bigger once the new column (I assume) gets populated with actual data. The data size itself is not a problem, but monolithic tables get progressively harder to manage (to the point where schema changes become really difficult) and progressively slower. Anything done to reduce the size of the individual table may help, starting with reducing the huge number of indexes.

Note that something as simple as setting up a separate service/DB shard to track changes is not out of the question, as I suggested some time ago.

@jcrespo We are working with @Smalyshev on switching entity search to using Elastic, see T125500. Once this is done, all rows with term_type = "alias" can be deleted, and we should have a lot less read load on this table. That should help, at least a little.

To clarify: reads on a slave are not a big concern for MySQL. Of course, if you end up with better latency, that is cool (and I normally ping because it means there is an inefficiency that could be solved), but in the grand scheme of things reads are easy to scale ("just buy more replicas"); writes are more complicated. Large tables are usually a concern because reads and writes get amplified, not because of the size itself. Also, s8 is scheduled to be set up in Q1/Q2 FY2017-2018, containing only Wikidata.

Change 354550 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Prevent saving duplicate Terms in TermSqlIndex::insertTerms

https://gerrit.wikimedia.org/r/354550

My suggestion is, once the fix for not introducing new duplicates (https://gerrit.wikimedia.org/r/354550) is in, to list all duplicate row ids in a text file (generated via a query similar to the one I already mentioned). Then we can delete these in batches using a maintenance script reading from the file of term_row_id to delete.
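For illustration, a sketch of what such a listing query and the batched deletion could look like, assuming the wb_terms columns shown above and keeping the lowest term_row_id of each duplicate group; the actual maintenance script may well work differently:

-- List the row ids of redundant copies: every row that has a twin with a
-- smaller term_row_id in the same duplicate group (aliases excluded, as above).
SELECT t1.term_row_id
FROM wb_terms AS t1
JOIN wb_terms AS t2
  ON  t1.term_entity_type = t2.term_entity_type
  AND t1.term_entity_id   = t2.term_entity_id
  AND t1.term_type        = t2.term_type
  AND t1.term_language    = t2.term_language
  AND t1.term_search_key  = t2.term_search_key
  AND t1.term_row_id      > t2.term_row_id
WHERE t1.term_type != 'alias';

-- The maintenance script would then delete the collected ids in small chunks,
-- waiting for replication to catch up between chunks, e.g.:
DELETE FROM wb_terms
WHERE term_row_id IN (709061771 /* , ... next ids read from the file */);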

Change 354550 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Prevent saving duplicate Terms in TermSqlIndex::insertTerms

https://gerrit.wikimedia.org/r/354550

I can confirm that with this patch merged, when I edit terms that were duplicated, the new ones don't get duplicated. So if we wait a while, most of the duplicates will probably get fixed automagically.

When editing a page in VisualEditor, I sometimes notice that the description (from Wikidata's pageterms API) is duplicated. It results in the following rendering:

Screen Shot 2017-09-16 at 20.51.29.png (442×1 px, 144 KB)

The joining by a single comma also looks odd, but that is possibly a bug in VisualEditor (depending on whether this value is meant to be an array or a string; if it is meant to be a string, then the silent conversion to a string is to be expected).

Underlying query: https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&prop=info%7Cpageprops%7Cpageimages%7Cpageterms&pithumbsize=80&pilimit=50&wbptterms=description&ppprop=disambiguation&titles=Momo%20(food)&continue=

{
    "query": {
        "pages": {
            "1351345": {
                "pageid": 1351345,
                "ns": 0,
                "title": "Momo (food)",

                "terms": {
                    "description": [
                        "Nepalese and Indian cuisine originated in Nepal",
                        "Nepalese and Indian cuisine originated in Nepal"
                    ]
                },

                "contentmodel": "wikitext",
                "pagelanguage": "en",
                "pagelanguagehtmlcode": "en",
                "pagelanguagedir": "ltr",
                "touched": "2017-09-12T13:36:49Z",
                "lastrevid": 800270184,
                "length": 8406,
                "thumbnail": {
                    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a1/Momo_nepal.jpg/80px-Momo_nepal.jpg",
                    "width": 80,
                    "height": 58
                },
                "pageimage": "Momo_nepal.jpg"
            }
        }
    },
    "batchcomplete": ""
}

https://www.wikidata.org/wiki/Q134092

That's definitely not how it should be. There should only ever be one description for a given item and language. Is this also caused by the duplicate rows?

That's definitely not how it should be. There should only ever be one description for a given item and language. Is this also caused by the duplicate rows?

I'm not sure. I was going to file a new task, but found this existing one instead. I suppose duplicate values being returned via the API could also be caused by a different bug, but it looks like this bug would cause it as well. And while a potential fix was provided for new submissions, I don't see any record of existing items having been fixed.

There are two problems right now. 1- We have duplicate rows:

mysql:wikiadmin@db1087 [wikidatawiki]> select * from wb_terms where term_full_entity_id = 'Q134092' and term_language = 'en';
+-------------+----------------+---------------------+------------------+---------------+-------------+-------------------------------------------------+-------------------------------------------------+-------------+
| term_row_id | term_entity_id | term_full_entity_id | term_entity_type | term_language | term_type   | term_text                                       | term_search_key                                 | term_weight |
+-------------+----------------+---------------------+------------------+---------------+-------------+-------------------------------------------------+-------------------------------------------------+-------------+
|   617069321 |         134092 | Q134092             | item             | en            | description | Nepalese and Indian cuisine originated in Nepal | nepalese and indian cuisine originated in nepal |       0.034 |
|   617069322 |         134092 | Q134092             | item             | en            | description | Nepalese and Indian cuisine originated in Nepal | nepalese and indian cuisine originated in nepal |       0.034 |
|   159448432 |         134092 | Q134092             | item             | en            | label       | Momo                                            | momo                                            |           0 |
+-------------+----------------+---------------------+------------------+---------------+-------------+-------------------------------------------------+-------------------------------------------------+-------------+
3 rows in set (0.00 sec)

2- The API surfaces all of them. I think both parts need to be fixed, but the second one probably requires another phab card.

We identify and delete the duplicate rows (not trivial, but not difficult either), then we add a UNIQUE constraint over that combination of columns (sketched below) so that this never happens again.
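A sketch of the kind of constraint meant here, using the column set from the queries above (the index name is made up; whether such a constraint can actually be applied to this table, and how aliases fit in, is discussed a few comments further down):

-- Sketch only: enforce one row per entity/type/language/search key.
ALTER TABLE wb_terms
    ADD UNIQUE INDEX wb_terms_unique_term (
        term_entity_type, term_entity_id, term_type,
        term_language, term_search_key
    );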

The maintenance script that we have can do it; it just takes some time. Is that okay with you?

Not sure if by "you" you mean me, but if it is safe, yes. We may have to defragment the table later to reclaim disk space, but that can be done later and is not a blocker.

BTW, we may not actually be able to use such a constraint on those tables, as it may only apply to some of the values, but there are some code changes (locking) that could be made when editing unique properties to avoid duplication. Maybe those should be deployed first?

I meant you as you :) I will make the patch when the current one (populating term_full_entity_id) is done.

Change 381433 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: Maintenance script to clean up duplicates in wb_terms

https://gerrit.wikimedia.org/r/381433

Change 381433 merged by Marostegui:
[operations/puppet@production] mediawiki: Maintenance script to clean up duplicates in wb_terms

https://gerrit.wikimedia.org/r/381433

With the deduplication script now past Q11M, the number of duplicate rows has not vanished, but it is close to zero:

mysql:[email protected] [wikidatawiki]> SELECT COUNT(*) FROM wb_terms AS t1 WHERE term_type != 'alias' AND EXISTS(SELECT 1 FROM wb_terms AS t2 USE INDEX(wb_terms_entity_id) WHERE t1.term_entity_id < 1000000 and t2.term_entity_id < 1000000 and t1.term_entity_type = t2.term_entity_type AND t1.term_entity_id = t2.term_entity_id AND t1.term_type = t2.term_type AND t1.term_language = t2.term_language AND t1.term_search_key = t2.term_search_key AND t1.term_row_id != t2.term_row_id);
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    3108283
Current database: wikidatawiki

+----------+
| COUNT(*) |
+----------+
|     1309 |
+----------+
1 row in set (6 hours 20 min 4.14 sec)

Note that this only counts entity IDs below 1M; the real number once the script is done should be around 40 times bigger than that, but compared to the size of the table, it's nothing.

Do we know where the dupes come from?

Do we know where the dupes come from?

Well, prior to ab66ee5348ede2ad0686f5f9e733970c79a9847a we only checked on a replica whether a term already exists before writing it. Combined with many edits in a row, this can easily lead to duplicates.
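In other words, the old check-then-insert pattern raced against replication lag. A minimal sketch of the idea behind the fix, expressed in SQL for illustration only (the actual change is PHP in TermSqlIndex::insertTerms; the values below are taken from the Q2807 example in the description):

-- The race: the existence check ran against a lagging replica, so a row
-- written moments earlier on the master was not visible yet and got
-- inserted a second time.
--
-- The duplicate-safe pattern is to do the check and the write on the same
-- (master) connection, e.g. as a conditional insert:
INSERT INTO wb_terms
    (term_entity_id, term_entity_type, term_language, term_type,
     term_text, term_search_key, term_weight)
SELECT 2807, 'item', 'es', 'description',
       'ciudad, capital de España', 'ciudad, capital de españa', 0.254
FROM dual
WHERE NOT EXISTS (
    SELECT 1 FROM wb_terms
    WHERE term_entity_type = 'item'
      AND term_entity_id   = 2807
      AND term_language    = 'es'
      AND term_type        = 'description'
      AND term_text        = 'ciudad, capital de España'
);
-- Note: without a unique key or explicit locking this is still not fully
-- race-free under concurrent writes, which is why the UNIQUE constraint is
-- discussed above.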

hoo removed hoo as the assignee of this task.Oct 25 2017, 3:54 PM
hoo removed a project: Patch-For-Review.

Not sure this should even be in the current sprint.

It should; there is a maintenance script running (via cron) to de-duplicate the rows, and it's 90% done.

Change 386588 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: Disable rebuildTermSqlIndex

https://gerrit.wikimedia.org/r/386588

Change 386588 merged by Marostegui:
[operations/puppet@production] mediawiki: Disable rebuildTermSqlIndex

https://gerrit.wikimedia.org/r/386588