Wikidata talk:Lexicographical data

Lexicographical data

Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.


On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2024/12.

What part to integrate in a lemma?

Hi y'all,

On Telegram, Mahir256 asked whether helping hand (L310405) and a helping hand (L1368364) are two separate lexemes. The answer is probably "no", but I figured I should ask here for more points of view.

Behind this case there is also a more general question: which parts should be included in the lexeme?

In most cases, it's obvious and sources are clear (like rain cats and dogs (L1138151), where the verb "rain" is an integral part of the lexeme and is always present).

But in other cases (including this one), some sources include the article and some don't (the same goes for other lexical categories: should we include the verb, the preposition, etc.?).

I had a look at Google Ngram for this case and it seems that the article is indeed almost always used before the phrase. Can you think of other ways to help decide?

@Jsamwrites, Simplificationalizer:

Cheers, VIGNERON (talk) 16:36, 2 October 2024 (UTC)[reply]

I would consider them to be the same Lexeme fwiw.

To decide what to include I think the deciding factor should be what is most useful for reusers. @Denny maybe has some input on this from the Abstract Wikipedia side? LydiaPintscher (talk) 17:11, 2 October 2024 (UTC)[reply]

Thank you for pinging, @LydiaPintscher. I do not have an opinion as a reuser. I am pretty confident that the Abstract Wikipedia use case, as a reuser, will be able to deal with both ways.

As a volunteer I would say to only have one of the two, I guess, because the additional semantics and lexical knowledge seem entirely compositional from one to the other, so having two lexical entries seems superfluous. --Denny (talk) 13:16, 7 October 2024 (UTC)[reply]

@VIGNERON Thanks for asking this relevant question. Interestingly, dictionaries handle it differently. While the New Oxford American Dictionary has helping hand without "a" (Reference: helping hand), the Merriam-Webster online dictionary has entries for both "helping hand" and "a helping hand". However, the examples Merriam-Webster gives for "helping hand" include "a". Looking forward to the community decision. John Samuel (talk) 18:06, 2 October 2024 (UTC)[reply]

I tend not to use "noun" for phrases, just for 1-word strings. So I consider a helping hand (L1368364) a better version of these 2 same lexemes. --Infovarius (talk) 19:41, 3 October 2024 (UTC)[reply]

Hyphenation character

After making the Flying Dehyphenator game at https://ordia.toolforge.org/flying-dehyphenator/ I see that a lot of languages do not use the hyphenation character to indicate hyphenation points. Instead, an interpunct, a dash or "." is used. We have had a bit of discussion on the property talk page at Property talk:P5279. The question is whether the use of interpunct, dash and "." should be regarded as an error or whether people want to have it that way. Can I "correct" the character by changing it to the hyphenation character in languages I do not know, or should I refrain from that? Note that there are hyphenation statistics in Synia: https://synia.toolforge.org/#hyphenation — Finn Årup Nielsen (fnielsen) (talk) 11:27, 7 October 2024 (UTC)[reply]
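To make the separator question concrete, here is a minimal Python sketch of how such values could be checked. It assumes that the "hyphenation character" meant here is U+2027 (HYPHENATION POINT); the list of suspect separators and the function names are hypothetical and only for illustration. It flags candidates rather than rewriting them, because a hyphen or full stop can be a genuine part of a word.

```python
# Assumption: the expected separator for hyphenation values is U+2027.
HYPHENATION_POINT = "\u2027"

# Hypothetical list of characters seen in place of the hyphenation point.
SUSPECT_SEPARATORS = {
    "\u00b7": "interpunct (middle dot)",
    "-": "hyphen-minus",
    "\u2010": "hyphen",
    ".": "full stop",
}


def find_suspect_separators(value):
    """Return (index, description) for every suspect separator in value."""
    return [
        (index, SUSPECT_SEPARATORS[char])
        for index, char in enumerate(value)
        if char in SUSPECT_SEPARATORS
    ]


def normalize_hyphenation(value, replace=("\u00b7",)):
    """Replace only the characters listed in `replace` with U+2027.

    By default only the interpunct is replaced; hyphens and full stops
    are left alone because they may belong to the word itself.
    """
    return "".join(
        HYPHENATION_POINT if char in replace else char for char in value
    )
```

A report built from `find_suspect_separators` could be shown to speakers of the language before anything is changed, which avoids "correcting" values in languages the editor does not know.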

Sorry for the off-topic question: what can I do in your game? I didn't get it. --Infovarius (talk) 10:09, 9 October 2024 (UTC)[reply]

You can append hyphenation parts ("syllables"). A right one gives you two points, a wrong one minus one point. Danish (Q9035) and Portuguese (Q5146) are probably the best languages to play. You can cheat and see the current hyphenation start parts for Portuguese (Q5146) in Synia (Q121294613) here: https://synia.toolforge.org/#language/Q5146/hyphenation (there are not that many). Finn Årup Nielsen (fnielsen) (talk) 12:48, 15 October 2024 (UTC)[reply]

Unwanted or deprecated lexeme sense?

Pedestrian crossing in Swedish has two lexemes, of which one (Övergångsställe, L54968) is the preferred, "correct" word to use, while the other (Skyddsväg, L706581) is a Finlandism and not to be used (but "accidentally" used by many in Finland).

Is there a P-property to mark Övergångsställe as the preferred of the two? Or to mark Skyddsväg as "unwanted" or "deprecated"? Thank you! Robert (talk) 12:18, 7 October 2024 (UTC)[reply]

@Robertsilen: Is there a source you can provide for the dispreference for skyddsväg? If there is, then for now you could add language style (P6191) desuetude (Q109986704) to the sense on that lexeme and add that source as a reference to that statement. Mahir256 (talk) 17:56, 7 October 2024 (UTC)[reply]

The template at the bottom of the Wikidata:Lexicographical_data page lists a number of values, e.g., obsolete form (Q54943392) and depreciative form (Q54948374). Rather than language style (P6191), I think they would be used with instance of (P31) or has characteristic (P1552). Finn Årup Nielsen (fnielsen) (talk) 13:44, 15 October 2024 (UTC)[reply]

I think I might correct myself. archaism (Q181970) I would use with language style (P6191). I am not really sure what the differences are between these items, e.g., obsolete word (Q12237354). Finn Årup Nielsen (fnielsen) (talk) 13:51, 15 October 2024 (UTC)[reply]

What is mo?

What is the mo language code supposed to represent, and in what capacity is it supposed to be used?

the official language (in Latin script) of Moldova (Q217), the way it was named until 2023 in the Constitution? (nowadays the Constitution simply says Romanian)
an official language (in Cyrillic script) of Moldavian Soviet Socialist Republic (Q170895) (dissolved 1991)?
an official language (in Cyrillic script) of Transnistria (Q907112)?
some form of the language spoken in the past in the historic region of Bessarabia (Q174994) / Principality of Moldavia (Q10957559)?
something else?

Knowing whether this language code is in Latin script or Cyrillic would be a good starting point, because I have seen it used in both, which is unacceptable. Thank you in advance. Gikü (talk) 20:01, 4 November 2024 (UTC)[reply]

Same question about ro-md – I've seen it in lexemes. Gikü (talk) 20:31, 4 November 2024 (UTC)[reply]

@Gikü: as you know, that's a complex question. As our codes should be understood as linguistic codes (following ISO 639), "mo" is in that case an obsolete and deprecated code that shouldn't be used (ro-MD is indeed the replacement; caveat: this obviously depends on how exactly you consider Moldovan). But as you point out, there is also a political component (Moldovenism) that makes things a bit more complicated. Since the situation is unclear and unsettled, I fear there is no good solution for now. Cheers, VIGNERON (talk) 16:13, 21 November 2024 (UTC)[reply]

Rwanda-Rundi

Pinging recent contributors of Kirundi lexemes on this topic – @Mwenekanse @Steve Uwimana @Ferdinand IF99 @Sun Best Ella

According to some sources, Kinyarwanda and Kirundi can be considered two standard varieties of a single Rwanda-Rundi (Q3217514) language. Interestingly, English Wiktionary treats them as a single language, and they are ordinarily very conservative about merging varieties of pluricentric languages. I would like to ask how contributors feel about merging the two for the purpose of Wikidata lexemes. We have the ability to represent multiple spelling standards on the same lexemes and forms, so we do not have to duplicate grammatical information, senses, and references. I set up an example at gucomeka (L1337138) for what a lexeme for both varieties could look like. عُثمان (talk) 16:39, 20 November 2024 (UTC)[reply]

Thank you @عُثمان for raising this about the Rwanda-Rundi languages. I was not aware that English Wiktionary treats them as a single language, but if so, I think that needs to be changed and they should be treated as two different languages, because they are the national languages of two different countries, even if they are close to each other, and in Wikimedia we have two different communities for the two countries. @Mwenekanse, @Steve Uwimana, @Sun Best Ella and I all belong to the Burundian community. Also, you can sometimes find words which have different meanings in Kinyarwanda and in Kirundi. Thank you for this; you are more experienced than us on Wikidata, so if there is another way to separate them, please suggest it and share your thoughts with us. Also @Ndahiro derrick, we need your thoughts on this, as you are in the Rwandan community. Ferdinand IF99 (talk) 18:15, 20 November 2024 (UTC)[reply]

@Ferdinand IF99 Thank you for the additional context. I was mistaken about the example I linked: since gucomeka / guhomeka have different pronunciations, they cannot be considered alternative spellings of one another.

It seems like in many cases there can be differences in tone or noun class between Rwanda and Rundi that may be simpler to represent if these varieties are treated individually. One thing I am not sure about is situations where Rundi word forms are used with Rwanda spellings - I set up an example at imvyino/imbyino (L1379479) with a quotation. Maybe these cases could be represented on lexemes like this while Rwanda and Rundi still have their separate lexemes. عُثمان (talk) 15:20, 22 November 2024 (UTC)[reply]

I cannot create a reference for a usage example.

Sorry if I'm asking in the wrong place. I am a newbie. I have created a usage example (P5831) for pr-ꜥꜣ/𓉐𓉻 (L7922). Then I wanted to add a reference: reference URL (P854) with the value https://oraec.github.io/corpus/oraec51-301.html . But then I get the following error message: "Could not save due to an error. The save has failed." What am I doing wrong? Esther82090 (talk) 21:37, 23 November 2024 (UTC)[reply]

Fetch lexicographical data for language script converter

Hi, I would like to know if there is any possibility to use lexemes from Wikidata as root word dictionaries for the upcoming language script converter (ms to ms-arab) that I am going to develop. I recently developed a dictionary-based language script converter that loads dictionary data from an external URL, and I found that the conversion is slow.

Let's say a main article on Malay Wikipedia has the Malay word "kuda" (ms). If "kuda" could be converted into "کودا" (ms-arab) through some kind of entry point or API that fetches the existing lexicographical data here (for example Lexeme:L480587), the conversion would hopefully be faster. It would be great too if the affix conversion could be coupled with ms-arab text loaded from Wikidata lexemes. Hakimi97 (talk) 10:36, 26 November 2024 (UTC)[reply]
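As a rough illustration of the idea, the Python sketch below reads the lemmas of a lexeme and builds an ms to ms-arab lookup table. It assumes that lexeme JSON is available from Special:EntityData and that a lexeme stores one lemma per spelling variant under a `lemmas` key. The function names, the User-Agent string and the sample record are hypothetical, and the sample is hand-written rather than copied from L480587.

```python
import json
from urllib.request import Request, urlopen

# Assumption: Special:EntityData serves lexeme JSON at this address.
ENTITY_DATA_URL = "https://www.wikidata.org/wiki/Special:EntityData/{lexeme_id}.json"


def fetch_lexeme(lexeme_id):
    """Download one lexeme as a dict (network access; not called below)."""
    request = Request(
        ENTITY_DATA_URL.format(lexeme_id=lexeme_id),
        headers={"User-Agent": "script-converter-sketch/0.1 (example)"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)["entities"][lexeme_id]


def lemma_pair(lexeme, source="ms", target="ms-arab"):
    """Return (source lemma, target lemma), or None if either is missing."""
    lemmas = lexeme.get("lemmas", {})
    if source in lemmas and target in lemmas:
        return lemmas[source]["value"], lemmas[target]["value"]
    return None


def build_dictionary(lexemes, source="ms", target="ms-arab"):
    """Build a source-to-target lookup table from lexeme records."""
    table = {}
    for lexeme in lexemes:
        pair = lemma_pair(lexeme, source, target)
        if pair is not None:
            table[pair[0]] = pair[1]
    return table


# Hand-written sample shaped like a lexeme record, for illustration only.
sample_lexeme = {
    "lemmas": {
        "ms": {"language": "ms", "value": "kuda"},
        "ms-arab": {"language": "ms-arab", "value": "کودا"},
    }
}
conversion_table = build_dictionary([sample_lexeme])
```

Fetching one lexeme per word at conversion time would probably be as slow as the external dictionary described above, so a table like `conversion_table` is better built once and cached by the converter.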
