Wikidata:Requests for permissions/Bot/PangolinBot 1
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved--Ymblanter (talk) 21:18, 26 July 2022 (UTC)[reply]
PangolinBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: PangolinMexico (talk • contribs • logs)
Task/s: Automatically adds author information to Wikidata scholarly articles (items where instance of (P31) = scholarly article (Q13442814)) that have missing author information. Currently works for articles with the following references: PubMed publication ID (P698), PMC publication ID (P932), Dimensions Publication ID (P6179), ADS bibcode (P819). Part of Outreachy Round 24.
Code: https://github.com/outreachy-wasian/wikidata-authorBot
Function details:
- Finds Wikidata items with P31 = Q13442814 that have missing author information and have a reference in a compatible database. This includes items that:
- Have no author (P50) and no author name string (P2093) items.
- Have P50 or P2093 items, but are missing:
- object named as (P1932) (ONLY for P50 items)
- author given names (P9687)
- author last names (P9688)
- Adds author name strings items to the article if missing
- Adds missing author information to existing author/name string items. Successfully matches against acronyms and aliases.
The program works as follows:
- Finds a given number of articles with missing author information via a SPARQL query.
- Finds author information by calling an academic database's API, with a compatible reference guaranteed to be found in the article's properties
- Returns a tuple of author first and last names, found by parsing an RIS/mbib/etc citation
- Uses tuple to search through existing author items (if they exist) and adds all missing author information + updated reference
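As a rough illustration of the first step above, a query of the following shape could select scholarly articles that carry a supported identifier but have neither author (P50) nor author name string (P2093) statements. This is only a sketch, not the query the bot actually runs: the function name `build_missing_author_query` and its parameters are hypothetical, and it covers only the "no author information at all" case, not items with missing qualifiers.

```python
def build_missing_author_query(id_property: str = "P698", limit: int = 10) -> str:
    """Build a SPARQL query for scholarly articles lacking P50 and P2093.

    Hypothetical helper for illustration; id_property is the external
    identifier the article must have (P698 = PubMed publication ID).
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Doubled braces are literal braces inside an f-string.
    return f"""
SELECT ?item ?externalId WHERE {{
  ?item wdt:P31 wd:Q13442814 ;
        wdt:{id_property} ?externalId .
  FILTER NOT EXISTS {{ ?item wdt:P50 ?author . }}
  FILTER NOT EXISTS {{ ?item wdt:P2093 ?nameString . }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    print(build_missing_author_query("P932", 5))
```

The resulting string could then be sent to the Wikidata Query Service by whatever client the bot uses; that part is left out here because it depends on the bot's own code.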
See test run here and example program output here
Note: There was a small problem in my SPARQL query that led to a few articles with complete author information being found during test edits before 11:47 AM, 11th of July. This has been fixed and is no longer an issue. This can be proved by noting that all edits after 11:47 are on articles with incomplete author information.
--PangolinMexico (talk) 12:52, 11 July 2022 (UTC)[reply]
- Support. PangolinMexico is an Outreachy intern, being mentored by User:Mike Peel and me. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 19:45, 11 July 2022 (UTC)[reply]
- Is there any way to compress all these edits down? You are making dozens of edits in a row which makes the edit history hard to follow. Also, how are you matching data for the author (P50) claims? By matching on the label? What if there are two people with the same name? What if people change their name over time? What if there are two authors with the same/similar name? Is this bot going to add statements or just references/qualifiers? BrokenSegue (talk) 01:36, 12 July 2022 (UTC)[reply]
- I don't think that there's a way to compress the edits down - mainly because pywikibot doesn't have an aggregate addQualifiers method, so qualifiers can only be added one at a time. Would be happy to be proven wrong!
- You can see the data matching technique for author name matching here: https://github.com/outreachy-wasian/wikidata-authorBot/blob/main/AuthorBot/NameChecker.py
- It checks over labels, aliases, and performs 'manual matching': checking that at the very least the first initial is the same, the last name is the same, and that if there is a middle name initial, they are the same. This means that for articles with the same author name, their middle initials are being checked.
- This bot adds both author name string statements and references/qualifiers. PangolinMexico (talk) 08:43, 12 July 2022 (UTC)[reply]
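The 'manual matching' rule described in this reply can be sketched roughly as follows. This is a minimal illustration and not the code in NameChecker.py: the helper names `split_name` and `manual_match` are hypothetical, names are assumed to be written as "Given [Middle] Last", and the middle-initial rule is read as "if both names have a middle initial, they must agree".

```python
from typing import Optional, Tuple


def split_name(full_name: str) -> Tuple[str, Optional[str], str]:
    """Return (first initial, middle initial or None, last name), lowercased."""
    parts = full_name.replace(".", " ").split()
    if len(parts) < 2:
        raise ValueError(f"need at least a given and a last name: {full_name!r}")
    first_initial = parts[0][0].lower()
    middle_initial = parts[1][0].lower() if len(parts) > 2 else None
    return first_initial, middle_initial, parts[-1].lower()


def manual_match(candidate: str, target: str) -> bool:
    """True if first initials and last names agree and middle initials do not conflict."""
    c_first, c_mid, c_last = split_name(candidate)
    t_first, t_mid, t_last = split_name(target)
    if c_first != t_first or c_last != t_last:
        return False
    # Middle initials are only compared when both names carry one.
    if c_mid and t_mid and c_mid != t_mid:
        return False
    return True


if __name__ == "__main__":
    print(manual_match("Jane Q. Doe", "J. Q. Doe"))
    print(manual_match("Jane Q. Doe", "J. R. Doe"))
```

Under this reading a name with no middle initial still matches one that has a middle initial, which is lenient by design and may need tightening where several similar names appear in one author list.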
- ok, generally I support this then. Though we need to be very careful about weird edge cases like ArthurPSmith mentioned. BrokenSegue (talk) 00:53, 14 July 2022 (UTC)[reply]
- @PangolinMexico: Your name matching algorithm looks ok to me, however I'm not clear on what happens if (A) no author names match, or (B) more than one author name matches. Unfortunately I've spent a lot of time cleaning up complicated cases like that from other bots that have tried to do this in the past. ArthurPSmith (talk) 20:48, 13 July 2022 (UTC)[reply]
- @ArthurPSmith @BrokenSegue I completely agree about being careful about weird edge cases, and dealing with them well (and at the bare minimum having them be well documented) is a priority of mine.
- When no authors match, the bot adds an author name string property with all of the aforementioned qualifiers.
- When there are multiple matches, the first match is added - but that author cannot be matched in the future, so if two authors have the same name / a matchable name, they won't match with the same author. In the case where their name is completely identical and written identically in the citation, this works great, as they will be written the same way in the citation and thus have completely accurate information. In cases where their names aren't identical, I've only seen them be varied by middle initial - which the program checks. I think it's very possible some problematic cases will arise after more running, but due to my perfectionism and passion for this I'm very keen to identify these and sort them out/at the very least manually fix them.
- Having run the bot, the only times author names have been added accidentally (which has only happened once in the time I've run it) have been in cases that I have now fixed (essentially by making sure the bot was checking for transliterations). I think the best way to run this type of bot is to manually check all added names and all manual finds (finds that weren't added by a matched label or alias) after each run, as these are the most likely to be problematic.
- This is already easy to do with the bot's current output, but I am also happy to add and print a list of manual checks and manual name additions at the end to more clearly be able to see what the bot did. PangolinMexico (talk) 08:14, 14 July 2022 (UTC)[reply]
- I have added the list of manual checks and name additions. Did a 5 article check of this working. See here: https://github.com/outreachy-wasian/wikidata-authorBot/blob/main/AuthorBotTestRunWithLists.txt PangolinMexico (talk) 11:28, 14 July 2022 (UTC)[reply]
- @PangolinMexico: Unfortunately "When there are multiple matches, the first match is added" is not sufficient. I've been working on some of the ATLAS collaboration papers recently, where there can be 3 dozen or more cases of names like "Z. Li" which can have up to 4 identical names in the author list. Each of the "Z. Li" entries in the author list has a different affiliation, so matching the first one doesn't necessarily match up the correct author entry with the correct position in the author list. There are also names like "M. Kolb" and "M. Golblisch-Korb" where the second name matches the first, but that's an incorrect match. Or a lot of cases like "C. Chen" vs "C. Q. Chen" etc. Unless you have other evidence of the position in the author list or can relate affiliations, I think where you see multiple matches the best choice for a bot is just to leave that name alone, not to make an edit there. ArthurPSmith (talk) 15:15, 14 July 2022 (UTC)[reply]
- @ArthurPSmith Out of curiosity, do you have any links to Wikidata items that contain identical names in the author list? I'd like to see them for analysis. I agree that these are problematic to deal with due to their different positioning in the author list. Thanks for bringing that to my attention - I will make sure articles of this sort are left alone and not edited. The other thing that could be done in these cases is just adding the author first name and last name property but not adding a series ordinal.
- On the other cases: Cases like C. Chen and C. Q. Chen can be differentiated by specifically looking out for a middle initial in a match and rejecting it if not there - although this might lead to some problems down the line by rejecting differences in spelling where there is a legitimate match. Maybe notice if both are there and differentiate then?
- On cases like M. Kolb and M. Golblisch-Korb: I think I set the last name checker to work like this because of how it can easily identify Latin American names: recognising when a last name was slashed and finding the name anyway. But you're right that in cases where there are two authors like this, it would lead to problems. Maybe only accept cases like this when there are perfect matches?
- I think multiple matches are only likely to be risky if there are either:
- * Multiple exact or alias matches (meaning a match found in the item label or alias label)
- * Multiple manual matches (matches found using my matching criteria)
- I think cases where there is an exact match and a manual match should prioritise the exact match. This deals with most cases like "M. Kolb" and "M. Golblisch-Korb", where the second name would only match the first if no exact match was found.
- In any case, I'll set up multiple-match cases to just move on. I'll update this post when that's done. PangolinMexico (talk) 09:32, 15 July 2022 (UTC)[reply]
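The policy discussed in this comment (prefer an exact or alias match, fall back to a manual match, and make no edit when a tier is ambiguous) could look roughly like the sketch below. All names here (`resolve_author`, `is_exact`, `is_manual`, and the placeholder keys such as "a1") are hypothetical, and the two predicates are deliberately simplified stand-ins for the bot's real label, alias and manual checks.

```python
from typing import Dict, Optional


def normalise(name: str) -> str:
    """Lowercase, drop full stops and collapse whitespace."""
    return " ".join(name.replace(".", " ").lower().split())


def is_exact(a: str, b: str) -> bool:
    """Simplified stand-in for a label/alias match."""
    return normalise(a) == normalise(b)


def is_manual(a: str, b: str) -> bool:
    """Simplified stand-in for a manual match: same first initial and last name."""
    pa, pb = normalise(a).split(), normalise(b).split()
    if not pa or not pb:
        return False
    return pa[0][0] == pb[0][0] and pa[-1] == pb[-1]


def resolve_author(name: str, candidates: Dict[str, str]) -> Optional[str]:
    """Return the key of the single best candidate, or None to skip the edit.

    candidates maps a placeholder statement key to the name it carries.
    """
    exact = [key for key, label in candidates.items() if is_exact(name, label)]
    if len(exact) == 1:
        return exact[0]      # one exact match wins over any manual match
    if len(exact) > 1:
        return None          # several identical names: leave alone
    manual = [key for key, label in candidates.items() if is_manual(name, label)]
    return manual[0] if len(manual) == 1 else None


if __name__ == "__main__":
    print(resolve_author("Z. Li", {"a1": "Z. Li", "a2": "Z. Li"}))
    print(resolve_author("C. Chen", {"a1": "C. Chen", "a2": "C. Q. Chen"}))
```

Returning None rather than picking the first hit means an ambiguous name is simply skipped, which matches the "just move on" behaviour proposed above.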
- Example here: A search for an unexpected asymmetry in the production of e+μ− and e−μ+ pairs in proton–proton collisions recorded by the ATLAS detector at s = 13 TeV (Q112176130). See the four "Z. Li" authors in particular - these have been resolved by the work I've been doing on this one, but there are lots of similar papers out there that haven't been done yet. ArthurPSmith (talk) 14:40, 15 July 2022 (UTC)[reply]
- @ArthurPSmith I've updated my bot to skip multiple matches, as well as add some extra checks for further checking. Specifically, if there is a series ordinal match but no immediate author match, I've implemented a more lenient match that checks for last name/first name inversions and some slight differences. I've also implemented checking 'object stated as' matches for articles that have P1932 added but not P9687 or P9688 added. You can see this update on the bot's github.
- Let me know if this is suitable, and if this means you approve the bot. Thanks so much! PangolinMexico (talk) 12:29, 21 July 2022 (UTC)[reply]
- Support I've looked through the edits this bot has done and it all seems fine to me right now. Thanks for working on this. ArthurPSmith (talk) 13:01, 21 July 2022 (UTC)[reply]