User talk:CellosaurusBot
MeSH descriptor ID (P486) edits
Please note that the 20 November 2020 Cellosaurus edits adding MeSH descriptor ID (P486) statements to COS-1 (Q27556079) and COS-7 (Q27556092) were not exact matches, and caused database constraint violations that were logged at Wikidata:Database reports/Constraint violations/P486. I have removed those statements. Charles Matthews (talk) 05:56, 2 December 2020 (UTC)
- @Charles Matthews: Thank you for the note. It seems that https://meshb.nlm.nih.gov/record/ui?ui=D019556 refers to multiple concepts at the same time (Entry Term(s) COS-1 Cells COS-7 Cells). The robot has added it again, as it is part of the curation in the Cellosaurus database, so violations are, in this case, a side effect of the greater expressiveness on Wikidata and Cellosaurus. TiagoLubiana (talk) 23:47, 1 March 2021 (UTC)
@TiagoLubiana: According to Wikidata:Bots, there is a duty to "Monitor constraint violation reports for possible errors generated or propagated by your bot". The onus here is on the bot operator. The edits are incorrect, and it is wrong for a bot to re-add such edits once they have been corrected. Charles Matthews (talk) 06:25, 2 March 2021 (UTC)
- @Charles Matthews: I understand your point. In this specific case, I'd argue that the constraint violation is an exception: Wikidata has 3 concepts where MeSH has only one. Adding just one of the 3 would fulfill the constraint, but it would be misleading. Nevertheless, I really do not want to cause any trouble. What do you suggest I do in that case? Should I set the bot to never re-add those 3? Thank you again, best, TiagoLubiana (talk) 12:15, 2 March 2021 (UTC)
@TiagoLubiana: No reason to make an exception. MeSH is a major search system, meaning there are many potential stakeholders in keeping the MeSH system here clean.
Have a look at the first column in the table at Wikidata:ScienceSource_project/MeSH and cleanup dashboard. The MeSH D-numbers, which play an important part in many places, used to have around a thousand constraint violations. I have worked to reduce these - I'm a stakeholder in having a 1-1 mapping of MeSH D-numbers into Wikidata items, so that topical information on PubMed can be translated accurately. There are a number of reasons that such violations occur, human error being one.
But a major reason the number in the past was so high, and growing, was bots adding imperfect matches. This really is unacceptable here. Look at the difference, 719 violations in August 2018, 2829 in August 2019, in the history of Wikidata:Database reports/Constraint violations/P486. That is not human activity, it is irresponsible bot editing.
This is just one example of what has been going on. There is a theory that "xref" information belongs here: but it doesn't if it is included as statements for properties that have constraints set to ensure uniqueness.
To put it in a more technical way: some such information is included in Simple Knowledge Organization System (Q2288360). OK, that means that it can be included in WDQS queries if people need it. SKOS has that function in Semantic Web thinking. It is a bad idea to mirror such information here, if it interferes with basic data modelling that allows for reuse.
I have spent hundreds of hours, literally, fixing up main subject (P921) so it can be used for the WD:SS metadata bot. I don't suppose you want to know the details, but I think some of the older ideas held by bot operators need to change. Charles Matthews (talk) 14:59, 2 March 2021 (UTC)
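For readers following the constraint argument, here is a minimal Python sketch of what a uniqueness check on MeSH descriptor ID (P486) amounts to. It assumes the item/descriptor pairs have already been exported into a list of tuples (for example from a WDQS query); the function name `find_shared_descriptors` is hypothetical, and the third pair is a made-up placeholder. It is not how the constraint reports on Wikidata are actually generated.

```python
from collections import defaultdict


def find_shared_descriptors(pairs):
    """Return {mesh_id: sorted QIDs} for every MeSH ID used on more than one item."""
    items_by_descriptor = defaultdict(set)
    for qid, mesh_id in pairs:
        items_by_descriptor[mesh_id].add(qid)
    # Keep only descriptors that break the 1-1 mapping.
    return {
        mesh_id: sorted(qids)
        for mesh_id, qids in items_by_descriptor.items()
        if len(qids) > 1
    }


pairs = [
    ("Q27556079", "D019556"),  # COS-1, as discussed in this thread
    ("Q27556092", "D019556"),  # COS-7, as discussed in this thread
    ("Q_PLACEHOLDER", "D_PLACEHOLDER"),  # made-up pair with a unique ID
]
violations = find_shared_descriptors(pairs)
print(violations)
```

A bot could run a check of this kind on its own planned edits before writing, and skip any descriptor that would end up on more than one item.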
- @Charles Matthews: I agree, 1:1 mappings are of great value. That does, though, create a need for class creation on Wikidata whenever we have something like this, as other reliable databases do not have 1:1 mappings to MeSH. It would be nice if we could have guidelines like: in case of 2 or more Wikidata items matching a single MeSH descriptor, a superclass should be created. Then we can point to the guideline whenever something like this happens.
- An issue is that the "wrong" mappings are valid statements: they are backed by a reliable source. Maybe the bot should add them with a deprecated rank? If so, that could be a guideline too.
- And I would love to know the details, I am also interested in that topic. I met pmrust in Cambridge a couple of years ago; ContentMine has been doing amazing work. Best, TiagoLubiana (talk) 16:42, 2 March 2021 (UTC)
@TiagoLubiana: OK, first, a short introduction to the NCBI2wikidata bot. It does not edit Wikidata directly: it produces QuickStatements code, which I run later. This is proving useful now, because I can run the NCBI2wikidata output with various scripts, and NCBI2wikidata is really a custom tool for the ScienceSource project. But it is becoming a more generic tool.
So what NCBI2wikidata does for main subject (P921) statements is to translate the major (starred) MeSH terms from a PubMed page into Q-numbers. Correct translation from D-numbers to Q-numbers depends on accurate, 1-1 matching of the MeSH D-numbers into Wikidata.
During 2018-9 I was working on the ScienceSource project with that tool, and because the MeSH data was in a bad state, there were nearly 200 topics that could be translated incorrectly. I have a good list (not quite complete, I suppose) at Wikidata:ScienceSource project/Focus list, main subject MeSH errors, and have cleaned up over 80% of the topics now. A typical problem is "lung cancer" where it should be "lung neoplasm" - MeSH really doesn't talk about cancers in that way.
So, ScienceSource was a research project, and had some typical problems.
The underlying technique is more important now. What NCBI2wikidata does is to use the PubMed API to collect topical information (and other data). The project was only interested in review articles with CC license. I still work only on those - one day there will be about 100K items here for such articles, which in some sense are the most important for Wikipedia. The tool, which is on github, would be easy for a developer to change.
The method of translating from MeSH to Q-numbers can be used in other places - for example, clinical trials have MeSH subjects. Scaling up is clearly possible. The statements found in this way are of high quality, where text-mining titles, for example, is not great.
Second point: I know the arguments about "verifiable" statements that are wrong, and I just think this is a bad direction. Everyone knows that there is some nonsense on the Web, and my question is, why should we include it, especially in science? I have had some private discussion which suggests I'm not the only person who thinks this.
Charles Matthews (talk) 17:31, 2 March 2021 (UTC)
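The translation step described in the comment above can be sketched roughly as follows. This is an illustration in Python, not the NCBI2wikidata code itself (which is written in Go); the function name, the input shapes and all Q-numbers are hypothetical placeholders. It assumes the major (starred) MeSH descriptors for an article are already known, and that the D-number to Q-number table contains only unambiguous 1-1 matches. The output uses the tab-separated QuickStatements V1 layout of item, property, value.

```python
def main_subject_statements(article_qid, mesh_terms, descriptor_to_qid):
    """Build QuickStatements V1 lines adding main subject (P921) to one article.

    mesh_terms: list of (descriptor_id, is_major) tuples for the article.
    descriptor_to_qid: dict holding only 1-1 D-number to Q-number matches.
    """
    lines = []
    for descriptor_id, is_major in mesh_terms:
        if not is_major:
            continue  # only major (starred) terms become main subjects
        topic_qid = descriptor_to_qid.get(descriptor_id)
        if topic_qid is None:
            continue  # no unambiguous match, so emit nothing rather than guess
        lines.append(f"{article_qid}\tP921\t{topic_qid}")
    return lines


# Placeholder data: the Q-numbers are not real items.
descriptor_to_qid = {"D008175": "Q_LUNG_NEOPLASM"}
mesh_terms = [
    ("D008175", True),                # major term with a 1-1 match
    ("D_PLACEHOLDER_MINOR", False),   # not starred, ignored
    ("D_PLACEHOLDER_UNMAPPED", True), # starred but unmapped, skipped
]
statements = main_subject_statements("Q_ARTICLE", mesh_terms, descriptor_to_qid)
print("\n".join(statements))
```

Skipping unmapped descriptors is the design choice that matters here: a missing statement can be added later, while a wrong one has to be found and removed.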
- @Charles Matthews: Wow, thank you for the explanation! Really interesting work. Another MeSH-to-Q use that might be worthwhile is converting text annotations (e.g., from ScispaCy). I agree, there is lots of nonsense on the Web and (unfortunately) in published research. There are good points on both sides; deprecated statements are an opportunity to say 'why' something is nonsense, so one day we do not have to include it.
My personal opinion is that it exposes conflicts, so maybe it helps beginners to figure out that not everything is as straightforward as it seems. I can just keep them out for now, but perhaps a "deprecated" rank will keep people from adding it again in the future? I don't know. I have added a ticket on GitHub to deal with it: https://github.com/calipho-sib/cellosaurus-wikidata-bot/issues/2
By the way, can you help me find NCBI2wikidata on GitHub? Thanks! TiagoLubiana (talk) 18:30, 2 March 2021 (UTC)
@TiagoLubiana: https://github.com/ContentMine/NCBI2wikidata for the tool: the developer Michael Dales in Cambridge UK has not otherwise been involved with Wikimedia, and it is written in golang. I had to learn Linux to use it, but if you are really interested, I might be able to save you some time.
On the issue we were discussing here: if someone says "lung cancer = lung neoplasm" that is just wrong, and reading the scope note for D008175 is enough to see why. If in an "xref" sense someone says "if you are searching for lung cancer papers and you think of "lung neoplasms" as the only index term you need, you will not find everything", then that makes sense for that application. But it is not really about MeSH. The fact that some sites list D008175 next to an entry "lung cancer" still doesn't mean MeSH descriptor ID (P486) statements should notice that.
Anyway, nice to talk to you. Charles Matthews (talk) 19:25, 2 March 2021 (UTC)
Editing logged out?
There is Special:Contributions/177.92.116.98. Is that actually activity by this bot, but logged out? —MisterSynergy (talk) 23:16, 8 October 2021 (UTC)
@MisterSynergy: Yes, it is! Thank you for the heads up, no idea why that happened. I'm going to keep an eye on it. Best, CellosaurusBot (talk) 19:47, 9 October 2021 (UTC)
@MisterSynergy: It is from my IP, too. Something went wrong in the credentials and it did not throw an error, but edited from IP. Best, CellosaurusBot (talk) 19:50, 9 October 2021 (UTC)
- Okay thanks, no problems then. I have unblocked the IP address, so be aware that your bot can technically edit again even if logged out.
On a side note: please do not use your bot account for comments on this page or anywhere else for manual edits. You can simply use your main (operator) account instead. —MisterSynergy (talk) 19:55, 9 October 2021 (UTC)
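One possible guard against a bot silently editing while logged out is the `assert` parameter of the MediaWiki Action API, which makes a request fail when the session is not logged in (`user`) or lacks the bot right (`bot`). The sketch below is only an illustration in Python: the helper names and the example request are hypothetical, nothing is sent over the network, and the error codes checked (`assertuserfailed`, `assertbotfailed`) should be confirmed against the API documentation before relying on them.

```python
ASSERT_FAILURE_CODES = {"assertuserfailed", "assertbotfailed"}


def with_login_assertion(params, kind="bot"):
    """Return a copy of the API parameters with a login assertion added."""
    if kind not in ("user", "bot"):
        raise ValueError("kind must be 'user' or 'bot'")
    guarded = dict(params)  # leave the caller's dict untouched
    guarded["assert"] = kind
    return guarded


def is_login_assertion_failure(response):
    """True if a decoded JSON response reports a failed login assertion."""
    return response.get("error", {}).get("code") in ASSERT_FAILURE_CODES


# Hypothetical edit request; the title is a placeholder.
request = with_login_assertion(
    {"action": "edit", "title": "Placeholder page", "format": "json"}
)
print(request["assert"])

# Shape of a response a bot could treat as "stop, credentials are broken".
failed = {"error": {"code": "assertbotfailed"}}
print(is_login_assertion_failure(failed))
```

With a check like this in place, a credentials problem would stop the run instead of producing edits from an IP address.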