Wikidata:Property proposal/tabular case data


tabular case data


Originally proposed at Wikidata:Property proposal/Natural science

Description: tabular data on Wikimedia Commons of confirmed cases, recoveries, deaths, etc. due to a medical event; corresponds to P8011, P1603, P8049, P8010, P1120
Data type: Tabular data
Template parameter: |datapage= in w:Template:Medical cases chart
Domain: disease outbreak (Q3241045)
Allowed values: c:Category:Tabular data of medical cases
Example 1: 2020 COVID-19 pandemic in Santa Clara County (Q92341065) → COVID-19 cases in Santa Clara County, California.tab
Example 2: COVID-19 pandemic in Denmark (Q86597685) → COVID-19 hospitalizations in Denmark.tab
Example 3: COVID-19 pandemic in California (Q87455852) → COVID-19/Cases/CA.tab
Planned use: Copy statements to tabular data for use with w:Template:Medical cases chart
Expected completeness: always incomplete (Q21873886)
See also: Wikidata:WikiProject COVID-19/Data models/Outbreaks, Wikidata:Property proposal/Historical Population

Motivation


Items like COVID-19 pandemic in the United States (Q83873577) have gained multiple claims per day regarding the number of COVID-19 cases, recoveries, and deaths. The pandemic isn't expected to abate in the short term, so these items will only get increasingly unwieldy. Tabular data is a more scalable and flexible way to store this data when it's already appropriately licensed. For example, w:Template:2019–20 coronavirus pandemic data/United States/California/Santa Clara County medical cases chart uses |datapage= in w:Template:Medical cases chart to load and parse c:Data:COVID-19 cases in Santa Clara County, California.tab. w:Template:Medical cases chart does require a few more parameters to point it to the correct columns in the table. Perhaps we could standardize the column IDs corresponding to number of medical tests (P8011), number of cases (P1603), number of hospitalized cases (P8049), number of recoveries (P8010), and number of deaths (P1120) or create string-typed qualifiers for them. – Minh Nguyễn 💬 02:00, 29 April 2020 (UTC)
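As a rough illustration of the column-ID idea, here is a minimal Python sketch that reads a document in the general shape of a Commons tabular data page (a "schema" with named fields plus "data" rows) and groups its columns by the Wikidata property they would correspond to. The column IDs ("tests", "cases", and so on), the sample values, and the helper name rows_by_property are all assumptions made for this example; nothing here reflects a schema that has actually been agreed on.

```python
import json

# Hypothetical mapping from table column IDs to the Wikidata properties named
# in the proposal. The column IDs are assumptions; real tables may differ.
COLUMN_TO_PROPERTY = {
    "tests": "P8011",
    "cases": "P1603",
    "hospitalized": "P8049",
    "recoveries": "P8010",
    "deaths": "P1120",
}

# Made-up sample in the general shape of a tabular data page (not real data).
SAMPLE_TAB = """
{
  "license": "CC0-1.0",
  "description": {"en": "Illustrative sample, not real data"},
  "schema": {"fields": [
    {"name": "date", "type": "string", "title": {"en": "Date"}},
    {"name": "cases", "type": "number", "title": {"en": "Cases"}},
    {"name": "deaths", "type": "number", "title": {"en": "Deaths"}}
  ]},
  "data": [
    ["2020-04-27", 10, 1],
    ["2020-04-28", 12, 1]
  ]
}
"""


def rows_by_property(tab, mapping=COLUMN_TO_PROPERTY, date_column="date"):
    """Return {property ID: {date: value}} for every mapped column in the table."""
    names = [field["name"] for field in tab["schema"]["fields"]]
    date_idx = names.index(date_column)
    result = {}
    for idx, name in enumerate(names):
        prop = mapping.get(name)
        if prop is None:
            continue  # columns without a known property (e.g. the date) are skipped
        result[prop] = {row[date_idx]: row[idx] for row in tab["data"]}
    return result


series = rows_by_property(json.loads(SAMPLE_TAB))
print(series["P1603"])
```

With standardized column IDs the mapping could be a fixed constant like the one above; with string-typed qualifiers it would instead be read from the item's statements.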

Discussion

  • Tend to Support, but this needs to be actioned first before such a property can be of any use. (Yuri and Stas will be missed.) Mahir256 (talk) 17:20, 29 April 2020 (UTC)
    I hadn't thought of WDQS. A lack of support in WDQS would be a strong argument against deleting current claims and would give me pause about deleting historical claims. On the other hand, the property would be readily usable by Scribunto modules – easier to use than claims, in fact. – Minh Nguyễn 💬 04:54, 30 April 2020 (UTC)
  • There is just a risk that the pandemic will be over by the time this is available, actively usable, and reliably maintainable by several members of WMF staff. Let's not break a working approach. --- Jura 17:30, 29 April 2020 (UTC)
  • Comment As Jura said above, the approach is currently working (even if not perfect), so we should not break it. There are some wiki communities that are using data stored in properties (e.g. fr:Modèle:Infobox_Épidémie), so a full migration to tabular data would not be desirable. I would Support it if, instead of moving statements, this were done as a parallel deployment. The huge number of values for number of deaths (P1120) and number of cases (P1603) that are currently used to track this data is really less than ideal. We could keep legacy information in the tabular format and update only current information on the specialized properties. TiagoLubiana (talk) 22:48, 29 April 2020 (UTC)

    Is there any way to find out all the templates that are using the existing historical claims this way? I'm pretty confident we could migrate chart-like usage over to tabular data, based on the system I wrote for Santa Clara County, California, described above. My original idea was only to remove historical claims, keeping current data for infobox-style templates. But as an incurable inclusionist, I have no problem with keeping the existing historical claims and deploying tabular data in parallel.

    To elaborate on my motivation above, I've been maintaining Wikimedia's pandemic coverage of Santa Clara County, one of the original hotspots in the U.S. The county has been retroactively attributing new cases or removing cases from as many as 40 days every day. The state of California has switched to this methodology, which lends itself to growth charts with more accurate curves that aren't as affected by the delay in setting up widespread testing. It would be very time-consuming and error-prone for a human to update scores of statements every day, especially if they're presented in arbitrary order as on Wikidata. If someone has written a bot to synchronize statements with an external data source, we could point it to tabular data on Commons to keep things synchronized.

     – Minh Nguyễn 💬 04:30, 30 April 2020 (UTC)

  • @Mxn: Not sure about templates, but at least 3 different dashboards use historical data on Wikidata: one by User:Csisc, one by User:Gnoeee, and one by User:Egon_Willighagen. Other people are involved in these projects as well. If there is a way to query the tabular data, then I think it would be simple to migrate. I haven't seen any templates on Wikipedia that care about historical data; usually, they want just the most recent number. Historical data is actually a problem in many cases, because it is not always simple to select the most recent item (at least not on ptwiki). TiagoLubiana (talk) 01:08, 6 May 2020 (UTC)
  • Comment A second comment is that instead of standardizing column IDs, the property could be shaped to use number of medical tests (P8011), number of cases (P1603), and so on as qualifiers, and add data just about the specifics. That also gives extra flexibility for adding other qualifiers that might be specific to each kind of info. TiagoLubiana (talk) 22:48, 29 April 2020 (UTC)
    I think it'd be fine to keep using properties like number of medical tests (P8011) in current claims. Despite floating the idea of standardizing column IDs, I think it'd be problematic because there's no way for constraints to automatically flag nonstandard tabular data schemas. Instead, I'd prefer to see a parallel set of properties that name the ID being used by the table. It wouldn't be foolproof, but at least Wikidata's expectations of Commons would be more discoverable. – Minh Nguyễn 💬 04:59, 30 April 2020 (UTC)
  • Weak oppose Statistics for the same country/region are reported differently by various organisations and with a different determination method or standard (P459). Storing the data as statements instead of as tabular data also provides a means to have complex qualifiers such as determination method or standard (P459), statement disputed by (P1310), etc., which are needed to record the changing way in which statistics are reported during a pandemic. I don't think this type of data is well suited to a tabular format, because of the lack of qualifiers and references and the inability to form complex queries across the data using SPARQL. The closest fit would be to create a table for each single reporting organisation with a single determination method or standard (P459), and link one or more of these tables to each Wikidata item. Dhx1 (talk) 12:16, 2 May 2020 (UTC)
    @Dhx1: I'm currently focused on reports by official local public health departments, but you're right that there needs to be some nuance when considering all the ways the case count can be reported. Ideally, each table would be kept as simple and homogeneous as possible, and there can be multiple tables with their own qualified claims as you describe. In the case of Santa Clara County, I rewrote the page when the county supplemented the single cumulative total in its daily report with daily totals by date of collection. But if there's enough interest and involvement, we could maintain the two tables separately. Technically, the case count can now be reported along two dimensions (cases collected on a given date on one axis, date reported on another axis), which allows a 3D graph to visualize the relationship between testing and confirmed infections. That data can be stored in tabular data, whereas number of cases (P1603) claims qualified by point in time (P585) are only capable of indicating the most recent statistics for a given day. It's another reason why we should supplement the existing Wikidata properties with a "tabular case data" property in spite of the SPARQL limitations. – Minh Nguyễn 💬 18:36, 2 May 2020 (UTC)
  • Support This proposed Wikidata property is similar in context to what was fast-tracked in Schema.org recently, although this proposal is more generic for any "tabular data for medical cases", whereas Schema.org's was for disease spread data specifically: https://schema.org/diseaseSpreadStatistics which can be a datatype URL or Dataset (or others) which itself has lots of exposed metadata via Properties, https://schema.org/Dataset – The preceding unsigned comment was added by Thadguidry (talk • contribs) at 16:00, May 4, 2020 (UTC).
  • Support Very much needed. This should be an instance of tabular data, generally. And we should set up tabular data here on WD the way it is set up on Commons, to avoid an extra cross-wiki step (also because it's not clear to me that Commonsists feel tabular data unrelated to a commons file is appropriate there.) Sj (talk) 23:25, 5 May 2020 (UTC)
@Mxn, Mahir256, Jura1, TiagoLubiana, Csisc: @Dhx1, Thadguidry, Sj: tabular case data (P8204) has been created. Pamputt (talk) 21:16, 7 May 2020 (UTC)
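
The discussion above mentions counts that are retroactively revised and reported along two axes (date of collection and date reported). The following Python sketch, using made-up rows and hypothetical helper names, shows one way such a two-axis table could be collapsed to a single latest value per collection date (roughly what one claim per day can hold) while still letting the revision history be recovered from the full table. It is an illustration under those assumptions, not a description of any existing bot or template.

```python
from collections import defaultdict

# Made-up rows: (date reported, date of collection, cases attributed to that date).
REPORTS = [
    ("2020-04-28", "2020-04-26", 5),
    ("2020-04-28", "2020-04-27", 3),
    ("2020-04-29", "2020-04-26", 6),  # an earlier day revised upward
    ("2020-04-29", "2020-04-27", 3),
    ("2020-04-29", "2020-04-28", 4),
]


def latest_by_collection_date(rows):
    """Keep only the most recently reported count for each collection date."""
    latest = {}
    # ISO dates sort chronologically as strings, so later reports overwrite earlier ones.
    for _reported, collected, cases in sorted(rows):
        latest[collected] = cases
    return latest


def revisions(rows):
    """Return the report history of every collection date whose count changed."""
    history = defaultdict(list)
    for reported, collected, cases in sorted(rows):
        history[collected].append((reported, cases))
    return {
        day: reports
        for day, reports in history.items()
        if len({cases for _reported, cases in reports}) > 1
    }


latest = latest_by_collection_date(REPORTS)
revised = revisions(REPORTS)
print(latest)
print(revised)
```

The collapsed view loses the information about when each count was reported, which is the part that only the full table retains.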