Wikidata:SPARQL query service/WDQS backend update/Blazegraph failure playbook

By mid-October 2022, Wikidata reached 100 million items. It remains unclear how far that is from the maximum of what its current infrastructure can handle gracefully.

Executive Summary

This document is a playbook for a catastrophic failure of the Wikidata Query Service (WDQS) caused by the Blazegraph graph backend maxing out, which was outlined as the predominant risk in the WDQS August 2021 scaling update. How much time we have before such a failure is difficult to predict, but the probability of it occurring within the next 5 years is very high if no action is taken. While we are working to avoid this scenario by planning a migration off Blazegraph, and by exploring other potential solutions such as graph splitting and federation, we feel it is crucial to have this playbook given that probability.

The goal here is to provide transparency into the discrete steps that the Wikimedia Foundation (WMF) will take in order to maintain a minimum level of WDQS and Wikidata functionality in the case of catastrophic failure.

The following outlines the order in which, and the conditions under which, we will delete subgraphs from Blazegraph to maintain minimum functionality while we work on long-term scaling solutions to return WDQS and Wikidata to full functionality.

Playbook

  1. In the event of catastrophic Blazegraph failure, delete the scholarly articles subgraph from Blazegraph
    Provides ~3.75 years to implement scaling solutions
  2. Then if more than 90 additional days are needed, delete the astronomical objects subgraph from Blazegraph
    Provides ~4.4 years total time to implement scaling solutions
  3. If again more than 90 additional days are needed, delete the Wikimedia categories subgraph from Blazegraph
    Provides ~4.8 years total time to implement scaling solutions

Details

If Blazegraph maxes out, data must be deleted from it in order to maintain minimum functionality. Over the past several months, we have investigated which data could be deleted, approximately how large each candidate dataset is, and approximately how many queries rely on each candidate dataset. Importantly, this data will not be deleted from Wikidata, only from Blazegraph: the deletion will affect only users' ability to run queries, or build tools, that depend on the deleted data.

In case of catastrophic failure, we would like to delete the largest dataset possible (size translates directly into how much time we gain to implement a longer-term solution) while affecting the fewest possible queries. We refer to this trade-off as the days/queries ratio. The higher the ratio, the better a dataset is as a candidate for deletion; this is the primary metric for ranking deletion candidates and is how the table below is sorted.
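The ranking described above can be sketched in a few lines. This is an illustrative calculation, not the analyst's actual pipeline; the figures are the approximate values from Table 1 for a handful of candidates.

```python
# Rank deletion candidates by the days/queries ratio described above.
# Values are approximate figures from Table 1:
# name: (days of headroom gained, % of monthly queries affected)
candidates = {
    "scholarly article": (1370, 0.7),
    "astronomical object": (238, 0.2),
    "Wikimedia category": (157, 0.6),
    "human": (200, 31),
}

def days_queries_ratio(days, queries_pct):
    """Higher is better: more days of headroom per % of queries broken."""
    return days / queries_pct

ranked = sorted(candidates.items(),
                key=lambda kv: days_queries_ratio(*kv[1]),
                reverse=True)

for name, (days, pct) in ranked:
    print(f"{name}: {days_queries_ratio(days, pct):.2f}")
```

Run on these four candidates, the ranking reproduces the ordering of the table: scholarly articles lead by a wide margin, while a large-but-heavily-queried class like "human" falls to the bottom.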

Table 1: Data deletion candidates
Name of subgraph / data type | % of entities | % of triples | # of days until Blazegraph regrows to same size | % of queries affected (monthly) | days/queries ratio
scholarly article | 40 | 50 | 1370 | 0.7 | 1957.14
astronomical object | 9 | 9 | 238 | 0.2 | 1190.00
Wikimedia category | 5 | 6 | 157 | 0.6 | 261.67
Wikimedia template | 0.9 | 0.9 | 23 | 0.1 | 230.00
lexemes / lexicographical entities | 8 | - | 10 | 0.09 | 111.11
gene | 1.3 | 0.9 | 25 | 0.3 | 83.33
description* | - | 20 | 518 | 12 | 43.17
chemical compound | 1.3 | 0.7 | 19 | 0.6 | 31.67
Wikimedia disambiguation page | 1.5 | 1.4 | 37 | 1.7 | 21.76
family name | 0.5 | 1.4 | 40 | 2.5 | 16.00
Wikimedia list article | 0.4 | 0.3 | 7 | 0.6 | 11.67
external id* | - | 9 | 239 | 30 | 7.97
human | 10 | 7 | 200 | 31 | 6.45
film | 0.3 | 0.4 | 10 | 2 | 5.00
taxon | 3.4 | 3 | 77 | 25 | 3.08
label* | - | 4 | 104 | 48 | 2.17
name* (sitelinks) | - | 0.6 | 16 | 8 | 2.00
business | 0.2 | 0.1 | 3 | 1.8 | 1.67
altLabel* (aliases) | - | 0.8 | 21 | 16 | 1.31
language | 0.011 | 0.01 | 0.30 | 0.80 | 0.38


* Starred candidates are deletable data that are NOT subgraphs but vertical slices of data, and as a result do not comprise a percentage of entities; more details are available here. Details on how subgraphs were identified, along with a more detailed table, are available here; an additional table with subgraph query analysis is available here.

In addition to the days/queries ratio, we must also consider how much actual time is necessary to implement a long-term solution. Current estimates are that it will take a dedicated team at least ~2 years to migrate WDQS away from Blazegraph and its inherent constraints. This means that in the event of catastrophic failure, we will likely need to delete enough candidates to cover this period. The "# of days" metric is derived from the size of a candidate and Wikidata's current average rate of growth, giving an estimate of how many days of headroom deleting that candidate provides.
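The derivation of the "# of days" column can be sketched as follows: the triples freed by deleting a candidate, divided by the average daily growth of the graph, estimates how long until Blazegraph is back at its pre-deletion size. The total size and growth constants below are illustrative placeholders, not official figures; the real analysis is linked from the table note.

```python
# Sketch of the days-of-headroom estimate. Both constants are assumptions
# chosen for illustration only; they are NOT the figures used by the WDQS team.
TOTAL_TRIPLES = 13_000_000_000      # assumed total graph size (illustrative)
DAILY_TRIPLE_GROWTH = 4_700_000     # assumed average triples added per day (illustrative)

def days_of_headroom(pct_of_triples):
    """Estimated days until the graph regrows to its pre-deletion size."""
    freed = TOTAL_TRIPLES * pct_of_triples / 100
    return freed / DAILY_TRIPLE_GROWTH

# e.g. headroom from deleting a subgraph holding 50% of triples:
print(round(days_of_headroom(50)))
```

Under these assumed constants, a subgraph holding 50% of triples buys on the order of 1,400 days, which is in the neighborhood of the Table 1 figure for scholarly articles; the published numbers come from a more detailed analysis.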

With these factors in mind, deleting the scholarly articles subgraph is the best strategy in the case of catastrophic failure: it has by far the highest days/queries ratio, and it gives a dedicated team ~3.75 years to implement a longer-term solution before Blazegraph reaches the same problematic size again, nearly double the current estimate of the time needed for backend migration. At the same time, less than 1% of queries are estimated to be affected by deleting scholarly articles.

In the event that deleting only scholarly articles is insufficient (e.g., if migration takes much longer than expected, or if Blazegraph suddenly grows much more quickly than we estimated), the next optimal candidate for deletion is the astronomical objects subgraph, followed by the Wikimedia categories subgraph. Combined, these two subgraphs would provide ~1 year of time in addition to the ~3.75 years provided by deleting scholarly articles (totaling ~4.8 years), with the deletion of all three subgraphs affecting less than 2% of queries.
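The cumulative timeline of the playbook's deletion sequence follows directly from the Table 1 figures:

```python
# Cumulative headroom and query impact of the playbook's deletion sequence,
# using the "# of days" and "% of queries affected" columns from Table 1.
sequence = [
    ("scholarly article", 1370, 0.7),
    ("astronomical object", 238, 0.2),
    ("Wikimedia category", 157, 0.6),
]

total_days = 0
total_queries_pct = 0.0
for name, days, queries_pct in sequence:
    total_days += days
    total_queries_pct += queries_pct
    print(f"after deleting {name}: ~{total_days / 365:.2f} years, "
          f"~{total_queries_pct:.1f}% of queries affected")
```

This reproduces the playbook's running totals: ~3.75 years after step 1, ~4.4 years after step 2, and ~4.8 years after step 3, with a combined query impact of ~1.5%.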

Because WMF work operates on quarterly cycles, decisions to move forward with deleting the next candidate subgraph will be made 90 days in advance. Not only does this align with the cadence of planning (i.e., can the remaining scaling work be completed within the last quarter of the window provided by deleting a subgraph?), but it also gives WDQS users enough time to prepare for disruptive changes and/or suggest alternatives.

We recognize that in this amount of time, many things can change. This playbook can be updated to reflect new developments and/or ideas for preserving WDQS functionality in the case of catastrophic failure.

FAQ

  1. Are you deleting this data from Wikidata?
    No. This data will remain in Wikidata but will not be present in Blazegraph, which means it will not be queryable via WDQS.
  2. Are you making those subgraphs forever unavailable for queries?
    No. Because this data continues to be in Wikidata, the deleted subgraphs can/will be rebuilt when we are able to adequately support them.
  3. I disagree with/don’t trust your numbers above. Can you show me where they came from?
    Our WDQS analyst has done excellent work providing the estimates for this table; her work and process are publicly available on the subpages of her Wikitech profile. It is also important to remember that, due to the complexity of Wikidata's knowledge graph, the numbers we are working with are estimates. The actual values may be somewhat higher or lower than presented, but it is unlikely they are off significantly enough to change the order of magnitude.
  4. Does this mean that scholarly articles should just be separated from the rest of Wikidata altogether?
    Not necessarily. In terms of long-term desires and interests for Wikidata, this decision should come from the community, rather than being prescribed by WMF. This playbook only outlines a scenario of catastrophic failure on a technical level where something must be deleted to preserve the functionality of what remains. Outside of the technical hardware and software limitations, the community should decide how they want Wikidata to work.
  5. Are you splitting the scholarly subgraph out? Can I access it through a different endpoint and/or federate it?
    No. This playbook only commits to deleting these subgraphs from Blazegraph. Making them available as a separate graph that can be queried and/or federated may be part of longer-term strategies for how WDQS and Wikidata should operate, but this would require further investigation into technical feasibility, costs and benefits, as well as coordination with the community that the results are desirable.
  6. My queries are among the ~1% that use the scholarly articles subgraph, and this would be a massive disruption to me. Can we delete something else instead?
    Maybe. We recognize that every part of Wikidata is a result/source of somebody’s time and effort, and that there is no deletion solution that does not adversely affect somebody. We have framed our approach as deleting relatively expensive subgraphs (quantified by the days/queries ratio) in a way that allows enough time to implement longer-term solutions. You are free to make another proposal for data deletion, which will be most seriously considered if you can quantify the pros/cons within this framework. We recognize that this full level of analysis may not be possible, as we are unable to share the full query data due to privacy reasons, but providing as much information as you can will help your proposal’s consideration.
  7. What are you doing in the long-term to solve scaling issues?
    Our primary objective is currently to identify a new graph backend to replace Blazegraph, which is the primary source of technical limitations for scaling WDQS. It is likely that there is no perfect alternative, and we are still working to balance user priorities with technical needs in a way that requires the fewest compromises to functionality for our users.