Page MenuHomePhabricator

CirrusSearch SQL query for locating pages for reindex performs poorly
Closed, ResolvedPublic

Description

Query:

SELECT /* BatchRowIterator::next www-data@mediaw... */  page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model  FROM `page`   WHERE page_namespace = '0' AND (( page_id > '38263463' ))  ORDER BY page_id ASC LIMIT 100;

Explain:

+------+-------------+-------+-------+--------------------+---------+---------+------+------+-------------+
| id   | select_type | table | type  | possible_keys      | key     | key_len | ref  | rows | Extra       |
+------+-------------+-------+-------+--------------------+---------+---------+------+------+-------------+
|    1 | SIMPLE      | page  | index | PRIMARY,name_title | PRIMARY | 4       | NULL |  200 | Using where |
+------+-------------+-------+-------+--------------------+---------+---------+------+------+-------------+

Running the query though (against enwiki) i get

100 rows in set (16.48 sec)

The explain claims it's using the primary index, but if i adjust the query:

SELECT /* BatchRowIterator::next www-data@mediaw... */  page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model  FROM `page` FORCE INDEX (PRIMARY)  WHERE page_namespace = '0' AND (( page_id > '38263463' ))  ORDER BY page_id ASC LIMIT 100;

Then the execution runs as expected:

100 rows in set (0.00 sec)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Please, please test the query on every single host (you have the list on db-eqiad.php and db-codfw.php) before proclaiming victory, or you will hit bugs like T147113. Index hints should be the exception, not the rule. Let me take a look at this to try to see why and if there is an alternative.

Change 316693 had a related patch set uploaded (by Smalyshev):
Allow forceSearchIndex to ask for specific index to be used

https://gerrit.wikimedia.org/r/316693

Change 316693 merged by jenkins-bot:
Allow forceSearchIndex to ask for specific index to be used

https://gerrit.wikimedia.org/r/316693

Deskana subscribed.

The above is a hack that unblocks @Smalyshev's work temporarily, but we are awaiting a proper solution to this.

Hi @jcrespo and @Marostegui - can we get a status on this issue, please? Thanks!

@debt It is on "next", which means it will be one of the tasks we will work immediately after we finish what we are doing now. There is a big but, which is the reason why it is taking so long, which is in case there is something else creating outages, it takes priority, and recently we had really bad occurrences such as T149076 or T149077, so this got delayed.

Technically this is not blocked on me, I only asked to get it blocked until there is an alternative to the FORCE INDEX (PRIMARY) which is a really bad idea and should only be the last option. I would strongly suggest if you can work on your own on it and see if you have an alternative solution even before I can give it a look. If that would not be possible, work can also be done on the alternative to:

Please, please test the query on every single host,

to avoid a worse incident, which anybody with server access can do, not only DBAs. :-)

I'm not really sure what to do here either though, to the best of my knowledge this should generate a sane query plan, and explain claims its generating a sane query plan, but then when actually running the query everything mucks up.

I ran the query against all database's (host+wikidb paris) in eqiad: P4303
Almost everything ran in a ms or two, which could just as well be the round trip time in the network. There were a few that took longer, but nothing insane. A summary of all wikis where at least one db server took > 100ms:

bnwikisource:
    db1075: 128.55ms LONG QUERY
    db1038: 402.28ms LONG QUERY
    db1015: 405.71ms LONG QUERY
    db1077: 132.56ms LONG QUERY
    db1078: 143.13ms LONG QUERY

enwikibooks:
    db1075: 1.48ms
    db1038: 1.55ms
    db1015: 2.12ms
    db1077: 203.09ms LONG QUERY
    db1078: 3.06ms

enwikinews:
    db1075: 66.2ms
    db1038: 181.91ms LONG QUERY
    db1015: 122.88ms LONG QUERY
    db1077: 64.63ms
    db1078: 70.03ms

tawikisource:
    db1075: 122.91ms LONG QUERY
    db1038: 382.9ms LONG QUERY
    db1015: 383.46ms LONG QUERY
    db1077: 125.17ms LONG QUERY
    db1078: 140.6ms LONG QUERY

commonswiki:
    db1040: 219.87ms LONG QUERY
    db1053: 194.55ms LONG QUERY
    db1056: 122.95ms LONG QUERY
    db1059: 178.52ms LONG QUERY
    db1064: 103.95ms LONG QUERY
    db1068: 92.9ms
    db1081: 72.23ms
    db1084: 83.72ms
    db1091: 81.73ms

Same thing for codfw, run from wasat.codfw.wmnet to not have the roundtrip latency: P4304

bnwikisource:
    db2018: 388.01ms LONG QUERY
    db2036: 182.58ms LONG QUERY
    db2043: 169.2ms LONG QUERY
    db2050: 170.5ms LONG QUERY
    db2057: 369.58ms LONG QUERY

enwikinews:
    db2018: 178.85ms LONG QUERY
    db2036: 92.32ms
    db2043: 89.31ms
    db2050: 90.22ms
    db2057: 186.44ms LONG QUERY

knwiki:
    db2018: 80.35ms
    db2036: 88.49ms
    db2043: 105.05ms LONG QUERY
    db2050: 117.86ms LONG QUERY
    db2057: 106.27ms LONG QUERY

kvwiki:
    db2018: 107.69ms LONG QUERY
    db2036: 20.12ms
    db2043: 31.61ms
    db2050: 194.04ms LONG QUERY
    db2057: 23.75ms

lawikisource:
    db2018: 80.98ms
    db2036: 123.39ms LONG QUERY
    db2043: 119.38ms LONG QUERY
    db2050: 126.14ms LONG QUERY
    db2057: 113.74ms LONG QUERY

mediawikiwiki:
    db2018: 53.97ms
    db2036: 109.18ms LONG QUERY
    db2043: 74.7ms
    db2050: 94.63ms
    db2057: 70.08ms

suwiktionary:
    db2018: 16.59ms
    db2036: 24.88ms
    db2043: 24.43ms
    db2050: 178.83ms LONG QUERY
    db2057: 22.41ms

tawikisource:
    db2018: 1332.79ms LONG QUERY
    db2036: 1427.14ms LONG QUERY
    db2043: 956.6ms LONG QUERY
    db2050: 1046.43ms LONG QUERY
    db2057: 1157.34ms LONG QUERY

zh_classicalwiki:
    db2018: 229.89ms LONG QUERY
    db2036: 146.79ms LONG QUERY
    db2043: 153.58ms LONG QUERY
    db2050: 174.81ms LONG QUERY
    db2057: 159.43ms LONG QUERY

commonswiki:
    db2019: 970.21ms LONG QUERY
    db2037: 429.47ms LONG QUERY
    db2044: 729.13ms LONG QUERY
    db2051: 641.11ms LONG QUERY
    db2058: 796.19ms LONG QUERY
    db2065: 462.69ms LONG QUERY

metawiki:
    db2029: 317.9ms LONG QUERY
    db2040: 338.8ms LONG QUERY
    db2047: 272.21ms LONG QUERY
    db2054: 294.04ms LONG QUERY
    db2061: 187.79ms LONG QUERY
    db2068: 194.64ms LONG QUERY

viwiki:
    db2029: 32.29ms
    db2040: 49.93ms
    db2047: 54.05ms
    db2054: 70.46ms
    db2061: 102.88ms LONG QUERY
    db2068: 45.8ms

FWIW, while EXPLAIN shows the sane plan, SHOW EXPLAIN on the actual running query doesn't:

mysql:wikiadmin@db1080 [enwiki]> show explain for 8917347527;
+------+-------------+-------+------+--------------------+------------+---------+-------+----------+----------------------------------------------------+
| id   | select_type | table | type | possible_keys      | key        | key_len | ref   | rows     | Extra                                              |
+------+-------------+-------+------+--------------------+------------+---------+-------+----------+----------------------------------------------------+
|    1 | SIMPLE      | page  | ref  | PRIMARY,name_title | name_title | 4       | const | 22233685 | Using index condition; Using where; Using filesort |
+------+-------------+-------+------+--------------------+------------+---------+-------+----------+----------------------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql:wikiadmin@db1080 [enwiki]> show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: SELECT   page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model  FROM `page`   WHERE page_namespace = '0' AND (( page_id > '38263463' ))  ORDER BY page_id ASC LIMIT 100
1 row in set (0.01 sec)

I don't know what might cause the explain and the actual select to use different plans.

debt triaged this task as Medium priority.Oct 27 2016, 8:33 PM

There isn't a big rush here, this query is incredibly rare. It's part of a reindexing maintenance script that is run by a human, not a machine. Additionally that script requires a particular set of flags to trigger this query, and this is the first time we've used that set of flags in the last couple years (only when we change something about the document structure in elasticsearch for a single namespace). It seems worth fixing at some point, but I wouldn't be surprised if we don't need this query again for at least another year.

There isn't a big rush here, this query is incredibly rare. It's part of a reindexing maintenance script that is run by a human, not a machine

I would then not lose time on optimizing it, FORCE the usage of the PRIMARY index (FORCE INDEX(PRIMARY)) but please make sure it is run on the vslow group so it doesn't affect webrequests if something changes in the future. Maintenance should run on non-web request hosts.

Change 320969 had a related patch set uploaded (by DCausse):
Use vslow with forceSearchIndex db queries

https://gerrit.wikimedia.org/r/320969

@jcrespo thanks for the investigation, I agree with you and Erik this tool is rarely used so I don't think it's worth spending too much time on this query.
Should we close this ticket?

Change 320969 merged by jenkins-bot:
Use vslow with forceSearchIndex db queries

https://gerrit.wikimedia.org/r/320969

Deskana claimed this task.