Background
MediaWiki's link database tables are among the largest tables of any WMF production database. Optimizing the way we store and retreive this would generally see benefits such as:
- lower diskspace per database: slower growth, less hardware in the long run (co-locate more wikis together), and ensures each individual replica host can continue to fit and perform well with a full core database of a given wiki.
- faster backup and recovery,
- lower replication lag (smaller queries, smaller binlogs).
This is a continuation in spirit of T222224: RFC: Normalize MediaWiki link tables and T300222: Implement normalizing MediaWiki link tables, where a similar optimization and major win was made to the templatelinks table (with upcoming work for pagelinks, imagelinks and categorylinks).
externalinks
This table has grown a lot in production. It's one of the biggest tables for Commons, at 200GB, and will cause more issues in the future.
Unlike other links table, this table can't be normalized using the linktarget feature (T299417). We currently store each external link upto 8 times as part of a single row (and its indexes).
For example, this is a typical row today:
*************************** 4. row *************************** el_id: 9 el_from: 379228 el_to: http://www.iau.org/public_press/themes/naming/#minorplanets el_index: http://org.iau.www./public_press/themes/naming/#minorplanets el_index_60: http://org.iau.www./public_press/themes/naming/#minorplanets
One external link is three ways in three columns of a given row, and those columns in turn are used in five different indexes.
Option A: Create new table
Create a new table holding the domai names (call it, externaldomain?) and it would hold the domain and its inverse.
Something like:
ed_id: 1234 ed_domain: http://www.iau.org ed_index: http://org.iau.www.
The row externallinks would be reduced to
el_id: 9 el_from: 379228 el_to_domain: 1234 # Key to externaldomain.ed_id el_to_path: public_press/themes/naming/#minorplanets
Option B:
Avoid creating a new table by storing the domain inline, using only the index variant. We go from 8 to 3 copies (1 in the columns, plus two large indexes).
el_id: 9 el_from: 379228 el_to_domain_index: http://org.iau.www. el_to_path: /public_press/themes/naming/#minorplanets
Change the indexes:
- INDEX el_from (el_from, el_to(40)), - INDEX el_from_index_60 (el_from, el_index_60, el_id), + INDEX el_page_domain_id (el_from, el_to_domain_index, el_id), - INDEX el_to (el_to(60), el_from), + INDEX el_domain_page (el_to_domain_index, el_from), - INDEX el_index (el_index(60)), - INDEX el_index_60 (el_index_60, el_id),
Option B-2 (chosen)
Per T312666#8146447, even further minimised the index. This reduces from 8 to 2 copies (1 in the columns, and 1 large index).
el_id: 9 el_from: 379228 el_to_domain_index: http://org.iau.www. el_to_path: /public_press/themes/naming/#minorplanets
Change the indexes:
- INDEX el_from (el_from, el_to(40)), - INDEX el_from_index_60 (el_from, el_index_60, el_id), + INDEX el_from (el_from, el_id), - INDEX el_to (el_to(60), el_from), + INDEX el_to (el_to_domain_index, el_from), - INDEX el_index (el_index(60)), - INDEX el_index_60 (el_index_60, el_id),
Prior art: T59176: ApiQueryExtLinksUsage::run query has crazy limit