Wikidata:Property proposal/CJKV variant character
CJKV variant character
Originally proposed at Wikidata:Property proposal/Generic
Abstract
The proposed property is designed to link equivalent Han characters used in different regions that are encoded under different codepoints in Unicode. It is inspired by the kZVariant property present in the Unihan database but attempts to improve on it by describing how the characters are related to one another.
Motivation
As of July 2018, up to 1016 items with the description "CJK (hanzi/kanji/hanja) character" have been created. Upon closer inspection, most of these characters belong to the set of Kyōiku kanji and are unique. However, two semantically related Han characters, 楽 (Q54552723) and 乐 (Q3594925), have been created with no indication of their relationship to one another. The former is a Japanese shinjitai character while the latter is a simplified Chinese character; both are related through their kyūjitai/traditional form "樂". A property is needed to indicate this relationship. KevinUp (talk) 01:02, 3 July 2018 (UTC)
Discussion
- Support Variant Han characters should be linked to each other. Okkn (talk) 18:16, 3 July 2018 (UTC)
- @Was a bee, Deryck Chan, NMaia, VIGNERON, Stevenliuyi: Any thoughts? --Okkn (talk) 18:29, 3 July 2018 (UTC)
- Support BTW, I noticed that the kyūjitai/traditional Chinese character “樂” (U+6A02) has the same form as hanja U+F914, U+F95C and U+F9BF. The three hanja characters are actually one character with different pronunciations (heteronym (Q875690)). Should we create one item or four different items in this case? If we create separate items, should we also use this property to link them? --Stevenliuyi (talk) 22:06, 3 July 2018 (UTC)
- If four separate items were created for 樂, they could be linked using this property. However, at this time, I don't think we have to separate them. --Okkn (talk) 05:40, 4 July 2018 (UTC)
- Hi. "樂" is a rare exception: a single character that was split into four codepoints because of its different pronunciations. The codepoints U+F914, U+F95C and U+F9BF belong to CJK Compatibility Ideographs (Q2493848). Unfortunately, it is not possible to create new items for them because compatibility forms are converted back to their main ideograph after being created. An example of how compatibility forms are dealt with can be seen here: 殺 (Q54879672) and 漢 (Q54872914). KevinUp (talk) 11:58, 4 July 2018 (UTC)
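The conversion described above matches Unicode canonical normalization, under which each CJK Compatibility Ideograph maps to its unified ideograph. The following is a minimal Python sketch, assuming (this is not confirmed in the discussion) that normalization is what converts the characters when an item is saved. The helper name `to_unified` is made up for illustration.

```python
import unicodedata

# The three compatibility codepoints for 樂 mentioned in the discussion.
COMPAT_FORMS = ["\uF914", "\uF95C", "\uF9BF"]
UNIFIED = "\u6A02"  # 樂, the main (unified) ideograph


def to_unified(char: str) -> str:
    """Return the NFC-normalized form of a single character.

    Assumption: the "conversion back to the main ideograph" is
    ordinary Unicode canonical normalization.
    """
    return unicodedata.normalize("NFC", char)


# Show which codepoint each compatibility form normalizes to.
for ch in COMPAT_FORMS:
    print(f"U+{ord(ch):04X} -> U+{ord(to_unified(ch)):04X}")
```

If this assumption holds, it would explain why the three hanja forms cannot be kept as separate labels: all of them collapse to the same codepoint as the traditional form.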
- Tentative Support, and I think the qualifier should be writing system (P282)? Deryck Chan (talk) 22:58, 3 July 2018 (UTC)
- That is reasonable. By doing so, we can explicitly restrict the qualifier value to writing system (Q8192). @KevinUp: But are all variants based on, or able to be represented by, writing system (P282)? --Okkn (talk) 05:40, 4 July 2018 (UTC)
- Deryck Chan: I think the scope of writing system (Q8192) is a bit too wide. The property being proposed is for Han characters which are part of CJKV character (Q53764732). KevinUp (talk) 11:58, 4 July 2018 (UTC)
- @KevinUp: The qualifier is something like applies to part (P518), not the main property you proposed here. What do you think of using writing system (Q8192) instead of applies to part (P518)? The scope of writing system (Q8192) is narrower than that of applies to part (P518). --Okkn (talk) 12:21, 4 July 2018 (UTC)
- Oh yes, definitely. Thanks for the explanation. I've changed the proposal to include writing system (Q8192) as the qualifier. KevinUp (talk) 12:53, 4 July 2018 (UTC)
- Support I support this. By the way, do all Han characters have counterparts in different languages? I don't know much about the diversity of Han characters, but if there are characters which exist only in a certain language, how about using, for example, "no equivalent" to show that fact? --Was a bee (talk) 22:07, 4 July 2018 (UTC)
- Was a bee: For Han characters, the same character is often shared between different languages in the East Asian cultural sphere, especially if it is a basic or frequently used character. However, many variant Chinese characters exist and some of these are not yet encoded in Unicode, so the label "no equivalent" might not be appropriate. Also, although some characters only exist in a certain language, it might be better to mark the script or region where they belong, e.g. kokuji (Q1185862) for Japanese, or HKSCS (Q1627000) for characters used in Hong Kong. KevinUp (talk) 23:20, 5 July 2018 (UTC)
- Support Jc86035 (talk) 11:42, 6 July 2018 (UTC)
@KevinUp, Stevenliuyi, Deryck Chan, Was a bee, Jc86035: Done: CJKV variant character (P5475) --Okkn (talk) 17:47, 18 July 2018 (UTC)
- Okkn: Thanks for creating the new property. The property has been implemented in all the examples given above. This property will be put to good use. KevinUp (talk) 12:00, 19 July 2018 (UTC)
- By the way, the relationship between orthodox Chinese character (Q13319256) and variant sinogram (Q837611) that are used in Chinese is much more complicated, with different conventions being used in Taiwan, Hong Kong and mainland China. Take a look at 擧 (Q55645395). Technically, 擧 is a variant form of 舉 and a "variant traditional form" of 举. There are 3 characters placed under the CJKV variant character (P5475) property in 擧 (Q55645395) but I have split them into five entries. What do you think? KevinUp (talk) 12:00, 19 July 2018 (UTC)
- @KevinUp: In my opinion, it is better not to split the links to the same item, especially if we are going to check the validity and completeness of the statements with SPARQL queries in the future.
As of now, there are no rigid rules for Han character items. However, before creating lots of items and statements, we should make a guideline that shows which properties, qualifiers and items should be used consistently in every Han character item. For example, I had used applies to part (P518) to specify in what language the statement is true, but it should be replaced with writing system (P282) or applies to jurisdiction (P1001). If you wouldn't mind, could you establish a new WikiProject Han character or something similar, and draft a guideline? To be honest, I’m not so familiar with Han characters other than Japanese kanji... Of course, I will help you do so.
By the way, in general, we can’t use instance of (P31) and part of (P361) as qualifiers. And we should try not to use ad hoc qualifiers as much as possible. Best regards, Okkn (talk) 14:37, 19 July 2018 (UTC)
- Okkn: The property created here works well for characters that can be found on this Wikipedia page: "Differences between Shinjitai and Simplified characters". However, it does not seem to work well for variant Chinese characters (unorthodox forms) that have to be linked with another character, the orthodox form that is preferred for general usage. Here are a few examples of 異体字 (itaiji, variant characters) from jinmeiyō kanji, with their corresponding orthodox forms from Jōyō kanji in brackets: 嶋(島), 盃(杯), 冨(富), 峯(峰), 埜(野). Perhaps you are familiar with some of the examples. Different regions have different guidelines for this. In Japan, for example, there is 異体字研究資料集成 (Q17193151). In Taiwan, Dictionary of Variant Chinese Characters (Q10427532) is used as a reference, while in mainland China First batch revision table of variant Chinese characters (Q15902917) and Table of General Standard Chinese Characters (Q14941454) are used as references. KevinUp (talk) 15:51, 19 July 2018 (UTC)
- Currently this property has been defined as "equivalent forms of Han characters used in different regions". I will come up with a new proposal for "variant-orthodox pairs of Han characters used in the same region". However, before doing that I need to become more familiar with the formatting and terminology used on Wikidata. So far the ones I have used most often are part of (P361), applies to part (P518) and instance of (P31), mostly as ad hoc qualifiers. At the moment I am not yet familiar with other qualifiers that are suitable for Han characters. If possible, can you show me some well-formatted Han character items that I can refer to as examples? So far I have come across 一 (Q4025820) and 雨 (Q3595028), which seem to be well formatted. Yes, I will create a WikiProject Han character so that we can work on improving these items on Wikidata. KevinUp (talk) 15:51, 19 July 2018 (UTC)
- @KevinUp: When I created this property, I changed the description to "used in different regions or writing systems", because we must link between shinjitai (Q1055887) and kyūjitai (Q1147857). But as you say, we must link, for example in Japanese, not only between kyūjitai and shinjitai, but also between "嶋" and "島". Should we separate those relations from this property? Or can we apply this property to those relations with some (new?) qualifiers or other solutions? I can't give an answer because I have no idea what kinds of variants we should represent... I have just edited 漢 (Q54872914) and 汉 (Q55646361), and I think they are now well formatted, although in the future CJK Unified Ideographs (Q994386) and CJK Compatibility Ideographs (Q2493848) should be linked using Wikidata:Property_proposal/Unicode_block. --Okkn (talk) 17:18, 19 July 2018 (UTC)
- @Okkn: WikiProject CJKV character has been created. Feel free to edit the properties and add participants to the project. As for the discussion on 異体字, I think it will be necessary to create a new property for it, as "嶋" and "島" are not equivalent forms, unlike kyūjitai and shinjitai characters, which are equivalent characters used before and after 1946. Semantically, "嶋" and "島" have the same meaning, i.e. "island", but in modern times "嶋" is restricted to personal names, whereas "島" is the orthodox form for the meaning "island". For this new property, it will be compulsory to state (1) whether the item itself is orthodox or variant, (2) the list of variant/orthodox forms that are related to the item, (3) the jurisdiction where this rule is applied, and (4) the source of reference. In some cases, the same character can be both orthodox and variant, e.g. 鎌 ("sickle" in Japanese, but considered a variant form for that meaning in modern Chinese). I'll need some time to figure out how to impose these conditions in my next proposal. KevinUp (talk) 13:30, 20 July 2018 (UTC)
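The kind of validity and completeness check mentioned earlier in this thread could look roughly like the sketch below. It assumes that CJKV variant character (P5475) links are meant to be symmetric, which this discussion does not state explicitly. The sample data, the function name `missing_inverse_links` and the constant `SYMMETRY_QUERY` are hypothetical; the query string is illustrative only, is not executed here, and relies on the `wdt:` prefix being predefined as it is on the Wikidata Query Service.

```python
# Made-up sample of P5475 statements as (subject, value) pairs.
# The item IDs come from this discussion; which links exist is invented.
P5475_LINKS = {
    ("Q54552723", "Q3594925"),   # 楽 -> 乐
    ("Q3594925", "Q54552723"),   # 乐 -> 楽
    ("Q54872914", "Q55646361"),  # 漢 -> 汉 (reverse link left out on purpose)
}


def missing_inverse_links(links):
    """Return the (subject, value) pairs that would make the links symmetric."""
    return sorted((b, a) for (a, b) in links if (b, a) not in links)


# A SPARQL query expressing the same check (assumed symmetry of P5475).
SYMMETRY_QUERY = """
SELECT ?item ?variant WHERE {
  ?item wdt:P5475 ?variant .
  FILTER NOT EXISTS { ?variant wdt:P5475 ?item . }
}
"""

for subject, value in missing_inverse_links(P5475_LINKS):
    print(f"missing: {subject} -> {value}")
```

Keeping one statement per linked item, rather than splitting the same link into several entries, keeps a check like this simple because each pair appears at most once in each direction.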