Wikidata:Requests for permissions/Bot/MewBot
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Not done @Superyetkin: This request seems to be abandoned, please reopen it if that is not the case. Thanks. Mike Peel (talk) 20:19, 21 July 2020 (UTC)[reply]
Rua (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Rua (talk • contribs • logs)
Task/s: Importing lexemes from en.Wiktionary in specific languages
Code:
Function details: The bot will be used to parse entries from English Wiktionary using pywikibot and mwparserfromhell, and then either create lexemes on Wikidata or add information to existing lexemes. Care is taken not to duplicate information: the script checks whether the lexeme exists and already has the desired properties, and only adds what is missing. In case of doubt (e.g. multiple matching lexemes already exist) it skips the edit. I made some test edits using my own user account; they can be seen from [1] to [2]. Today I did a few on the MewBot account.
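The decision logic described above can be sketched as follows. This is a minimal illustration and not the bot's actual code: the `Lexeme` class, the `plan_import` function and the choice to match lexemes on lemma, language and lexical category are all assumptions made for the example, and the real script would read and write through pywikibot rather than work on in-memory objects.

```python
from dataclasses import dataclass, field


@dataclass
class Lexeme:
    """Hypothetical in-memory stand-in for a lexeme (not a pywikibot class)."""
    lemma: str
    language: str
    category: str
    # Maps a property ID to the set of values stated for it.
    claims: dict = field(default_factory=dict)


def plan_import(candidate, existing):
    """Return ("create" | "update" | "skip", payload) for one parsed entry."""
    key = (candidate.lemma, candidate.language, candidate.category)
    matches = [lex for lex in existing
               if (lex.lemma, lex.language, lex.category) == key]

    if len(matches) > 1:
        # Case of doubt: several lexemes match, so make no edit.
        return ("skip", None)
    if not matches:
        # Nothing matches, so a new lexeme would be created.
        return ("create", candidate)

    # Exactly one match: keep only the values it does not already have.
    current = matches[0].claims
    missing = {}
    for prop, values in candidate.claims.items():
        new_values = set(values) - set(current.get(prop, ()))
        if new_values:
            missing[prop] = new_values

    if not missing:
        # Everything is already present, so nothing is duplicated.
        return ("skip", None)
    return ("update", missing)
```

Returning a plan instead of editing directly keeps the duplicate check separate from the write step, so the planned edits can be reviewed before anything is saved.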
Individual imports will be proposed to the lexicographical data project first, as the project leaders have asked that imports be handled carefully at first. The current proposal is for Proto-Samic and Proto-Uralic lexemes, seen at Wikidata talk:Lexicographical data#Requesting permission for bot import: Proto-Uralic and Proto-Samic lexemes. Once the project leaders give the ok for all imports, permission will no longer be needed for individual imports. Planned future imports are for Dutch and the modern Sami languages. --—Rua (mew) 09:37, 22 September 2018 (UTC)[reply]
- I am ready to approve this request in a couple of days, provided that no objections will be raised meanwhile. Lymantria (talk) 05:27, 25 September 2018 (UTC)[reply]
- I just noticed that Wikidata:Bots says I need to indicate where the bot copied the data from. How do I indicate that the data came from Wiktionary? —Rua (mew) 10:51, 25 September 2018 (UTC)[reply]
- Could you run your bot on a few entries so that it can be evaluated? Thanks in advance. Pamputt (talk) 10:59, 26 September 2018 (UTC)[reply]
- I did, already. Do I need to do more? —Rua (mew) 11:02, 26 September 2018 (UTC)[reply]
- Oppose Ah, sorry, I did not check before asking. For all reconstructed forms, I think a reference is mandatory. As these "words" do not exist, they come from specialists' work and have to be sourced. Two linguists may reconstruct different forms. That said, I am not sure about the copyright status of reconstructed forms. They probably belong to the public domain as scientific work, but it would be better to be sure. Pamputt (talk) 21:42, 26 September 2018 (UTC)[reply]
- Not all reconstructions on Wiktionary can be sourced to some external work. Some were reconstructed by Wiktionary editors. This is because not all reconstructed forms are available in external works, and we have to fill the gaps ourselves. The bot adds links to Álgu and Uralonet if one exists. —Rua (mew) 22:26, 26 September 2018 (UTC)[reply]
- I strongly disagree with importing reconstructed forms that do not come from scientific works. One needs criteria to accept such forms, and an academic paper is a good one. Otherwise, anyone can guess their own form. So if you run your bot, please import only "validated" forms. Pamputt (talk) 14:18, 27 September 2018 (UTC)[reply]
- I agree with that. Only sourced reconstructed forms should be imported. Unsui (talk) 15:50, 27 September 2018 (UTC)[reply]
- Wiktionary's goal is to be an alternative for existing dictionaries, including etymological dictionaries, not to be dependent on them. The criterion used by Wiktionary is that reconstructions follow established sound laws. Some reconstructions from linguistic sources don't pass that criterion. It fits with the general policy in Wiktionary of not blindly copying from dictionaries but making sure that forms make sense. Reconstructions that are questionable, whether from an external source or not, can be discussed and deleted if found to be invalid. If you have doubts about any of the reconstructions in Wiktionary, you should discuss it there.
- That said, what should be done if words in different languages come from a common source, but there is no source that gives a reconstruction? Can lemmas be empty? —Rua (mew) 15:54, 27 September 2018 (UTC)[reply]
- Here are some cases where Wiktionary has had to correct errors and omissions in sources. I provide a link to Wiktionary, and a link to Álgu, which gives its source.
- wikt:Reconstruction:Proto-Samic/čeapēttē [3] South, Ume and Pite Sami all have -o- in the second syllable, which does not agree with the reconstruction but requires čeapōttē.
- wikt:Reconstruction:Proto-Samic/tuovlē [4] South Sami requires *tuovlā instead.
- wikt:Reconstruction:Proto-Samic/tieppē [5] South, Ume and Pite Sami all require final *ā.
- wikt:Reconstruction:Proto-Samic/čiekčë [6] North Sami requires final *ā.
- wikt:Reconstruction:Proto-Samic/lāvkōtēk [7] Skolt Sami requires final *ë.
- wikt:Reconstruction:Proto-Samic/civnë [8] The Northern Sami word is an adjective, and requires *čivnëk.
- wikt:Reconstruction:Proto-Samic/kijttētēk [9] Ter Sami requires final *ō.
- ...and many more. So you see if we have to rely on sources, we become vulnerable to errors, whereas we can correct those errors on Wiktionary, making it more reliable. If Wikidata can't apply the same level of scientific rigour then that is rather worrying. —Rua (mew) 16:42, 27 September 2018 (UTC)[reply]
- "Wiktionary's goal is to be an alternative for existing dictionaries, including etymological dictionaries, not to be dependent on them."
- This may be the case on the English Wiktionary, but on the French Wiktionary original work on etymology is not allowed; all etymological information has to be sourced. Yet Wikidata has to define its own criteria, and nothing has been decided so far about reconstructed forms. About your question "what do we do when a source gives wrong information", I would say that in this case we set a deprecated rank. Pamputt (talk) 19:05, 27 September 2018 (UTC)[reply]
- You say, for example, "North Sami requires final *ā". OK, but why not *ö? Because linguists have defined laws for this language. It is always linguists' work. Hence, it is possible to add a reference. Otherwise anything may be created as a reconstructed form. Unsui (talk) 07:16, 28 September 2018 (UTC)[reply]
- That's nonsense. It still has to stand up to scrutiny. —Rua (mew) 10:02, 28 September 2018 (UTC)[reply]
- For how many new ones is this? --- Jura 11:11, 26 September 2018 (UTC)[reply]
- Oppose for now. It's unclear how many would be imported and we need to solve the original research question first. --- Jura 08:03, 27 September 2018 (UTC)[reply]
- Can you elaborate? I don't see what the problem is. —Rua (mew) 10:07, 27 September 2018 (UTC)[reply]
- Apparently, you don't know how many you plan to import. --- Jura 10:12, 27 September 2018 (UTC)[reply]
- I gave a link to the categories in the other discussion. —Rua (mew) 10:20, 27 September 2018 (UTC)[reply]
- Can you make a reliable statement? Categories tend to evolve and change subcategories. --- Jura 10:22, 27 September 2018 (UTC)[reply]
- wikt:Category:Proto-Samic lemmas currently contains 1303 entries. —Rua (mew) 10:25, 27 September 2018 (UTC)[reply]
- I've made a post regarding the import and the conflict in Wiktionary vs Wikidata's policies: wikt:WT:Beer parlour/2018/September#What is Wiktionary's stance on reconstructions missing from sources?. —Rua (mew) 17:36, 27 September 2018 (UTC)[reply]
- Is there any news on this? —Rua (mew) 10:08, 17 October 2018 (UTC)[reply]
- @Jura1:, are you fine now with the approval of this bot?--Ymblanter (talk) 13:01, 21 October 2018 (UTC)[reply]
- I will try to write something tomorrow. --- Jura 18:21, 21 October 2018 (UTC)[reply]
- First: sorry for the delay. The question of what to do with lexemes reconstructed at Wiktionary remains open. In general, we would only import information from other WMF sites when we know or can assume that it can be referenced to other quality sources. This isn't the case here. One could argue that Wiktionary is an independent dictionary website and should be considered a reference on its own. Whether or not this is the case depends on how Wikidata and the various Wiktionaries will work going forward. The more closely Wiktionary and Wikidata work together, the less we can consider it as such. --- Jura 04:14, 25 October 2018 (UTC)[reply]
- The majority of the Proto-Samic entries on Wiktionary does have an Álgu lexeme ID (P5903). Proto-Uralic entries mostly have Uralonet ID (P5902), but the lemma is not always identical to the form given on Uralonet, for which User:Tropylium is mostly responsible as the primary Uralic expert on Wiktionary. Would it be acceptable to import only those entries that have one of these IDs?
- If so, that leaves the question of what to do with the remainder. It would be a shame if these can't be included in Wikidata, and would mean that Wiktionary is always more complete than Wikidata can be. Words that have an etymology on Wiktionary would have none on Wikidata, because of the Proto-Samic ancestral form being missing. —Rua (mew) 18:43, 30 October 2018 (UTC)[reply]
- @Rua: yes, importing lexemes that have an Álgu lexeme ID (P5903) or Uralonet ID (P5902) is fine with me. However, lexemes whose lemma is not identical to the form given on Uralonet should not be imported, because they are not verifiable. They have to be similar to what the source says. Pamputt (talk) 21:58, 30 October 2018 (UTC)[reply]
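The restriction proposed above could be expressed as a filter along these lines. This is only a sketch under stated assumptions: the entry layout (a dict with `lemma` and `ids` keys), the `source_forms` lookup table and the function names are hypothetical, and "similar to what the source says" is read here as an exact match after stripping the leading reconstruction asterisk and normalising Unicode, which is one possible interpretation and not an agreed rule.

```python
import unicodedata

# Property IDs named in the discussion: Álgu lexeme ID and Uralonet ID.
SOURCE_ID_PROPERTIES = ("P5903", "P5902")


def normalize(form):
    """Drop the leading reconstruction asterisk and normalise to NFC."""
    return unicodedata.normalize("NFC", form.strip().lstrip("*"))


def is_importable(entry, source_forms):
    """Accept an entry only if a source ID exists and the forms agree.

    entry: hypothetical dict such as
        {"lemma": "...", "ids": {"P5903": "..."}}
    source_forms: hypothetical lookup from (property, source ID) to the
        form printed by that source.
    """
    for prop in SOURCE_ID_PROPERTIES:
        source_id = entry.get("ids", {}).get(prop)
        if source_id is None:
            continue  # no ID for this source
        source_form = source_forms.get((prop, source_id))
        if source_form is None:
            continue  # ID present, but the source form is unknown
        if normalize(source_form) == normalize(entry["lemma"]):
            return True
    return False
```

Entries rejected by such a filter would be the "remainder" raised earlier in the thread, so logging them separately would make it easy to see how many lexemes the restriction excludes.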
- Now pinging @Pamputt: as well.--Ymblanter (talk) 20:02, 21 October 2018 (UTC)[reply]
- I have not changed my opinion, because this bot wants to import reconstructed forms without any academic references. If the bot uses academic work as a source, it is fine with me; if not, I oppose (and the discussion shows that we are in this case). Pamputt (talk) 20:08, 21 October 2018 (UTC)[reply]