Wikidata:WikiProject Dutch National Thesaurus for Author Names

Project aims

This project aims to

Publish the full NTA on Wikidata, making the data of approx. 2,751,666 authors available there.
Conversely, add Wikidata Q-numbers to all 2.7M persons in the NTA.
Set up periodic synchronisation between the NTA and Wikidata.

The main challenge of this project is its size: because of the large number of authors, manually adding authors to Wikidata and 'working with Excels and OpenRefine' are not scalable options. The workflow, including progress tracking and the data QA process, needs to be automated as much as possible (bots etc.).

About the NTA

The Dutch National Thesaurus for Author Names (Q104787839) stores names and other personal data of some 2.7 million authors (many of them Dutch), so that authors with the same name can be told apart. It is maintained by the KB National Library of the Netherlands (Q1526131).

The NTA is available as linked open data and described at http://data.bibliotheken.nl/doc/dataset/persons. All data is available under the CC0 public domain license.

Example items: Harry Mulisch - Willem Frederik Hermans - Cees Nooteboom - Etty Hillesum - find other named persons

Data can be retrieved via the SPARQL endpoint at http://data.bibliotheken.nl/sparql, for example using this basic query to retrieve 1000 persons.
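The linked query itself is not reproduced on this page. As a minimal sketch of what such a request could look like in Python: the code below builds a query for a given number of persons and wraps it in an HTTP request for the endpoint named above. It assumes that persons are typed as schema:Person and carry a schema:name, and that the endpoint accepts the standard `query` parameter and returns SPARQL JSON results. None of this is confirmed by the page, and the function names are illustrative only.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://data.bibliotheken.nl/sparql"  # endpoint named in the text


def build_person_query(limit=1000):
    """Return a SPARQL query selecting `limit` persons with their names.

    Assumption: persons are typed schema:Person and have a schema:name.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "PREFIX schema: <http://schema.org/>\n"
        "SELECT ?person ?name WHERE {\n"
        "  ?person a schema:Person ;\n"
        "          schema:name ?name .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request(query, endpoint=ENDPOINT):
    """Wrap the query in a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def fetch_persons(limit=1000):
    """Run the query and return a list of (person URI, name) tuples."""
    request = build_request(build_person_query(limit))
    with urllib.request.urlopen(request, timeout=60) as response:
        data = json.load(response)
    return [
        (row["person"]["value"], row["name"]["value"])
        for row in data["results"]["bindings"]
    ]
```

Calling `fetch_persons(1000)` would perform the actual network request; the two builder functions can be inspected without touching the endpoint.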

Rationale, why do this project?

1) Wikidata is a very important international hub for libraries worldwide. It is the place to discover database and thesaurus identifiers from other parties, and to publish your own. This enables others to find and discover your data and thesauri. Wikidata is the international hub and discovery point for thousands of databases.

2) The NTA is the biggest international authority for Dutch author names, currently describing over 2.7M authors. At the moment, approx. 512,597 of those 2.7M individuals are already known in Wikidata. Thanks to Wikidata's function as a central hub, some 400K records from the Library of Congress (LoC) have been linked to the NTA, according to slide 16 of this presentation (see the bottom left diagram, 400K = 32% of 1.2M). Without Wikidata this NTA-LoC interlinking would probably have been a lot more difficult. This applies not only to the LoC: via Wikidata, databases and thesauri of other libraries worldwide can also be connected to the NTA.

3) As KB, we have the public task of making our data as easily findable and reusable as possible. Given the international interest of the LoC and other parties in our NTA, making this thesaurus as internationally visible as possible (via Wikidata) is something the KB must pursue. This will especially help foreign reusers who are not familiar with our institutional linked open datasets. Conversely, how many thesauri of Swedish, Brazilian or New Zealand libraries can you name yourself without consulting Wikidata?

Combining 1, 2 and 3, it makes perfect sense to

start describing all 2.7M persons in the NTA in Wikidata as well. In other words, to extend the current 512K person descriptions in Wikidata to all 2.7M persons described in the NTA. An NTA identifier and a backlink to data.bibliotheken.nl will of course be provided.
and the other way around: to add Wikidata Q-identifiers to all 2.7M NTA records in data.bibliotheken.nl

In summary: By using Wikidata as a connecting hub for our NTA, the 2.7M Dutch authors will become more findable, visible and reusable for institutions & databases worldwide.

Project approach

To give the data about the 2.7M authors in the NTA more (international) visibility, reusability, enrichment and external connections, publishing the persons and their data from the NTA on Wikidata is a logical and important step. This involves both adding NTA identifiers (Nationale Thesaurus voor Auteursnamen ID (P1006)) to Wikidata items and copying data such as name variations and dates and places of birth and death from the NTA into Wikidata.

The steps that need to be taken are detailed on this page

Relevant pages/discussion/knowledge/persons - to sort out

Datadumps

NTA: http://data.bibliotheken.nl/files/nta_20211011.rdf.gz (11 October 2021)
VIAF: e.g. http://viaf.org/viaf/data/viaf-20220103-links.txt.gz (via http://viaf.org/viaf/data/)
Wikidata: https://dumps.wikimedia.org/wikidatawiki/ (which file to choose?) Is https://wdumps.toolforge.org/ of interest?
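One way the VIAF links dump could be used to pair NTA identifiers with Wikidata items is sketched below. It rests on an assumption about the file layout that should be checked against the actual dump: one link per line in the form `<VIAF cluster URI><TAB><SOURCE>|<local id>`, with `NTA` as the source code for the NTA and `WKP` for Wikidata. The function names are illustrative and not taken from any existing script.

```python
import gzip
from collections import defaultdict

WANTED_SOURCES = ("NTA", "WKP")  # assumed source codes for NTA and Wikidata


def parse_links(lines):
    """Group NTA and Wikidata identifiers by VIAF cluster URI.

    Assumed line layout: '<viaf uri>\t<SOURCE>|<local id>'.
    Lines that do not match this layout are skipped.
    """
    clusters = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if "\t" not in line:
            continue
        viaf_uri, target = line.split("\t", 1)
        source, separator, local_id = target.partition("|")
        if separator and source in WANTED_SOURCES:
            # If a cluster lists a source more than once, the last one wins.
            clusters[viaf_uri][source] = local_id
    return clusters


def nta_to_wikidata(lines):
    """Return {NTA id: Wikidata QID} for clusters that contain both."""
    pairs = {}
    for ids in parse_links(lines).values():
        if "NTA" in ids and "WKP" in ids:
            pairs[ids["NTA"]] = ids["WKP"]
    return pairs


def read_dump(path):
    """Yield text lines from a gzipped links file such as viaf-20220103-links.txt.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle
```

With a downloaded dump, `nta_to_wikidata(read_dump("viaf-20220103-links.txt.gz"))` would produce the mapping. The sketch keeps every NTA and Wikidata link in memory, which is acceptable for experimenting but may need a streaming approach at full scale.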

Datamodel for authors on Wikidata

https://www.wikidata.org/wiki/EntitySchema:E42 (schema for authors)

Scripts & bots

https://github.com/multichill/toollabs/blob/master/bot/wikidata/nta_from_viaf.py (sort out...) In simple mode it runs on a SPARQL query, but it can also run on the dump at http://viaf.org/viaf/data/
Running it on PAWS is OK, but how to run it locally still needs to be sorted out (ask MaartenD + see this page)
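To illustrate the kind of SPARQL query a bot could start from, here is a sketch that looks for Wikidata items with a VIAF ID (P214) but no NTA identifier (P1006). It is not taken from nta_from_viaf.py; the query shape, the function names and the User-Agent string are assumptions made for illustration only.

```python
import json
import urllib.parse
import urllib.request

WDQS = "https://query.wikidata.org/sparql"
# Hypothetical User-Agent; replace with a string identifying your own tool.
USER_AGENT = "nta-sketch/0.1 (example only)"


def build_missing_nta_query(limit=100):
    """Query for items that have a VIAF ID (P214) but no NTA ID (P1006)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT ?item ?viaf WHERE {\n"
        "  ?item wdt:P214 ?viaf .\n"
        "  FILTER NOT EXISTS { ?item wdt:P1006 ?nta }\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def extract_candidates(results):
    """Turn SPARQL JSON results into a list of (QID, VIAF id) tuples."""
    candidates = []
    for row in results["results"]["bindings"]:
        qid = row["item"]["value"].rsplit("/", 1)[-1]  # strip the entity URI prefix
        candidates.append((qid, row["viaf"]["value"]))
    return candidates


def fetch_candidates(limit=100):
    """Run the query against the Wikidata Query Service."""
    url = WDQS + "?" + urllib.parse.urlencode({"query": build_missing_nta_query(limit)})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": USER_AGENT,
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_candidates(json.load(response))
```

Each candidate's VIAF ID could then be looked up in VIAF to see whether its cluster contains an NTA record. That lookup and the actual edit to Wikidata are deliberately left out of this sketch.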

viaf stuff - sort out
