LDL 2018 Presentation


Towards the Representation of Etymological and Diachronic Data on the Semantic Web


Fahad Khan,
Istituto di Linguistica Computazionale «A. Zampolli»
Consiglio Nazionale delle Ricerche,
Pisa, Italy

6th Workshop on Linked Data in Linguistics


Miyazaki, Japan
Introduction
- In this talk I want to give a very brief overview of a number of issues relating to the publication
of etymological data as linked data.
- I will start with a discussion of what etymologies are, and the different modelling challenges that
they pose.
- After presenting a number of example etymologies I will describe a new proposed extension of
Ontolex-Lemon for dealing with etymological information.
- Due to time constraints I won’t be able to discuss diachrony (but see my paper in the
proceedings for a discussion on this issue).
Etymologies...what are they?
The word etymology comes from the Greek ἐτυμολογία (etumología) -- from ἔτυμον (étumon, “true
sense”) and -λογία (-logía, “study of”) -- and has at least two different (related) senses. It can refer to:

- A sub-discipline of historical linguistics that is concerned with the development of individual words (and
other lexemes) over time and which attempts to trace their origins as far back as the record supports
- A single such history of a word (or other lexeme). We will focus on this sense in what follows.

Etymologies in the second sense are commonly found in general purpose dictionaries as well as in
more specialist works. The issue of how to properly model this kind of data as linked data is therefore
of some relevance given the current trend towards migrating retrodigitized dictionaries into the RDF
format.
Previous Work
Previous work on explicitly representing etymologies in linked data includes proposed extensions of
lemon by (Chiarcos et al., 2016) and a linked data based etymological wordnet (De Melo, 2014). See
also (Moran and Bruemmer, 2013).

However, in the current work we have also been influenced by attempts in other computational lexicon
standards to represent ‘deep’ etymological information.

These include proposals for modelling etymologies in LMF (Salmon-Alt, 2006) and especially
Bowers and Romary’s proposals for a TEI-based encoding of etymologies in (Bowers and Romary,
2016).
An example etymological entry
GIRL, a female child, young woman. (E.) ME. gerle, girle, gyrle, formerly used of
either sex, and signifying either a boy or girl. In Chaucer, C.T. 3767 (A 3769) gerl
is a young woman; but in C.T. 666 (A 664), the pl. girles means young people of
both sexes. In Will. of Palerne, 816, and King Alisander, 2802, it means ‘young
women;’ in P. Plowman, B.i.33, it means ‘boys;’ cf. B. x. 175. Answering to an AS.
form *gyr-el-, Teut. *gur-wil-, a dimin. form from Teut. base *gur-. Cf. NFries. gor,
a girl; Pomeran. goer, a child; O. Low G. gor, a child; see Bremen Wörterbuch, ii.
528. Cf. Swiss gurre, gurrli, a depreciatory term for a girl; Sanders, G. Dict. i. 609,
641; also Norw. gorre, a small child (Aasen); Swed. dial. garra, guerre (the same).
Root uncertain. Der. girl-ish, girlish-ly, girl-ish-ness, girl-hood.
Annotation on the slide: the entry above is a description of the history and development of the word.
Another Example Entry
girl, whence girlish, derives from ME girle, varr gerle, gurle: o.o.o.: perh of C origin: cf
Ga and Ir caile, EIr cale, a girl; with Anglo-Ir girleen (dim -een), a (young) girl, cf
Ga-Ir cailin (dim -in), a girl. But far more prob, girl is of Gmc origin: Whitehall
postulates the OE etymon *gyrela or *gyrele and adduces Southern E dial girls,
primrose blossoms, and grlopp, a lout, and tentatively LG goere, a young person
(either sex). Ult, perh, related to L puer, puella, with basic idea '(young) growing
thing'.

Annotation on the slide: three different hypotheses for the origin of the same word.
Another Example Etymology
The English word friar has an interesting history…

Modern English friar < Middle English frere, friar < Old French frere 'brother', also 'member of
a religious order of brothers' < Latin frāter 'brother'

We have two kinds of links between the items in the etymology. The salmon pink links mark words
inherited from an earlier stage of a language or from a parent language (Latin to Old French, Middle
English to modern English), and the blue link marks a borrowing from one language into another
(Old French to Middle English).

NB. The ‘<’ symbol is commonly used in etymological sources to mean ‘is derived from’.
Etymons and Cognates
It would be useful to distinguish between the lexical entries in a lexicon and the words featured in
individual etymologies that do not belong to the languages covered by the lexicon.

E.g., an English language lexicon containing even a minimal amount of etymological information
might potentially contain thousands of French, Greek, and Latin words. An etymology might contain
form, sense, and semantic information for these words. If we encode these words as separate lexical
entries (without distinguishing them) we will end up with an English-French-Greek-Latin lexicon.

This is without even taking cognates into consideration!


Etymons and Cognates
We can use two concepts from the field of Etymology to distinguish between *regular* lexical entries
and those used in etymologies: etymons and cognates.

An etymon is a source word for another word. That is, it is a direct ancestor of a given word, whereas
a cognate has an ancestor in common with a word. For instance the English word obligation has as
etymons the Latin words obligare, ligare, and the reconstructed Proto-Indo-European root *leig-. It has
cognates such as ligament and league in English, obbligare in Italian, and obrigado in Portuguese --
but apparently not ありがとう (arigato), which is a false cognate.

We will make both concepts (etymon+cognate) classes in our proposed vocabulary for etymologies.
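
The difference between the two concepts can be sketched in a few lines of Python. This is only an illustration: the PARENT table is a hypothetical, heavily simplified "derives from" mapping (intermediate stages are collapsed), and the function names etymons and cognates are made up for this sketch rather than taken from the vocabulary.

```python
# Hypothetical, simplified "derives from" table: word -> its source word.
# Intermediate stages are collapsed; this is not a real etymological dataset.
PARENT = {
    "en:obligation": "la:obligare",
    "la:obligare": "la:ligare",
    "la:ligare": "pie:*leig-",
    "en:ligament": "la:ligare",
    "en:league": "la:ligare",
    "it:obbligare": "la:obligare",
    "pt:obrigado": "la:obligare",
}


def etymons(word):
    """Return the direct ancestors of `word`, nearest first."""
    chain = []
    while word in PARENT:
        word = PARENT[word]
        chain.append(word)
    return chain


def cognates(word, vocabulary):
    """Return words that share an ancestor with `word` without being
    one of its ancestors or one of its descendants."""
    own = set(etymons(word))
    return {
        other
        for other in vocabulary
        if other != word
        and other not in own                 # not an etymon of `word`
        and word not in etymons(other)       # not derived from `word`
        and own & set(etymons(other))        # shares at least one ancestor
    }


# The Japanese word has no recorded ancestor in the table (a false cognate).
VOCABULARY = set(PARENT) | {"ja:ありがとう"}

obligation_etymons = etymons("en:obligation")
obligation_cognates = cognates("en:obligation", VOCABULARY)
```

Here obligation_etymons lists obligare, ligare, and *leig- in that order, while ありがとう is left out of obligation_cognates because it shares no ancestor with obligation in the table.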
Etymologies Resemble Family Trees
Necessity of Including Uncertain Information
Time for Tempura*
Let’s introduce an example which we will model in detail. Since we’re in Japan
let’s look at the English (and Italian) word tempura which refers to a popular
Japanese dish of fried seafood prepared using a light, wheat-flour based batter.
The word comes from the Japanese ‘天麩羅’ (tenpura). But this word, like
the dish itself, was borrowed from the Portuguese. However, two different
etymologies have been proposed. Tenpura comes either from:

- tempora, a Latin word used by Portuguese missionaries to refer to Catholic feast days
on which red meat could not be consumed, or
- tempero, a Portuguese noun meaning ‘condiment’ or ‘seasoning’, or the Portuguese
verb temperar meaning ‘to season’.

*Thanks to Jack Bowers for this example.


How to Model ‘Tempura’
The tempura example raises a number of interesting issues that crop up again and again in the
modelling of etymologies. Etymologies regularly tend to feature diverging hypotheses for the origins
of a word (and therefore its possible cognates too) -- as we saw in the second girl example.

The example also shows that etymologies can potentially feature a variety of different languages,
scripts, and relevant kinds of linguistic information (we need to assume that there exist
classes/properties to model this).

It also shows the utility of modelling etymons/cognates as separate individuals (and not a ‘part of’
each etymology). How many times will we come across the Latin word tempus in etymologies for
English words for instance?
Etymons, Cognates, and Etymologies
Our proposed new vocabulary, lemonEty, an extension of ontolex-lemon, features the classes Etymon
and Cognate. These are both subclasses of Lexical Entry and therefore inherit the properties of that
class, while at the same time allowing a distinction to be made from ordinary lexical entries.

We decided to reify the etymological shifts between different words, because of the usefulness of
attaching different kinds of information to these shifts, and have therefore created a new class called
Etymological Link.

Individuals of this class can be subtyped as ‘Inheritance’ or ‘Borrowing’ (or other). We then define
individuals of the class Etymology as consisting of a series of individuals belonging to Etymological
Link.
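
As a rough illustration of this design, the classes can be mirrored as plain Python dataclasses. The class names follow the talk; everything else (the attribute names source, target and link_type, the LinkType enumeration, and the is_connected check) is an assumption made for this sketch and not part of the published vocabulary. The friar etymology from earlier serves as test data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class LinkType(Enum):
    INHERITANCE = "inheritance"
    BORROWING = "borrowing"
    OTHER = "other"


@dataclass(frozen=True)
class LexicalEntry:
    language: str  # ISO 639 language code
    form: str


@dataclass(frozen=True)
class Etymon(LexicalEntry):
    """A lexical entry that only appears as a source word in etymologies."""


@dataclass(frozen=True)
class EtyLink:
    """A reified etymological shift from one word to another."""
    source: LexicalEntry
    target: LexicalEntry
    link_type: LinkType = LinkType.OTHER


@dataclass
class Etymology:
    """A series of etymological links, oldest first."""
    links: List[EtyLink]

    def is_connected(self) -> bool:
        # Each link must start where the previous one ended.
        return all(a.target == b.source
                   for a, b in zip(self.links, self.links[1:]))

    def etymons(self) -> List[LexicalEntry]:
        return [link.source for link in self.links
                if isinstance(link.source, Etymon)]


latin = Etymon("la", "frāter")
old_french = Etymon("fro", "frere")
middle_english = Etymon("enm", "frere")
english = LexicalEntry("en", "friar")

friar = Etymology([
    EtyLink(latin, old_french, LinkType.INHERITANCE),
    EtyLink(old_french, middle_english, LinkType.BORROWING),
    EtyLink(middle_english, english, LinkType.INHERITANCE),
])
```

Because each EtyLink is an object in its own right, further information (a date, a confidence level, a bibliographic reference) could later be attached to it without changing the entries it connects.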
lemonEty - the core of the model
The diagram on the right represents the core new elements in the lemonEty vocabulary.

We have also defined the following properties:

- etySource and etyTarget: these two properties relate EtyLinks with their sources and targets
- etymon relates an Etymology with its Etymons
- etymology relates a Lexical Entry (or any other lexical element) with an etymology describing its
history (with the inverse relation isEtymologyOf).

The lemonEty vocabulary is available at:
http://lari-datasets.ilc.cnr.it/lemonEty
tempura
In the diagram below we have given part of the RDF encoding of the ‘tempura’ example. We have encoded one of the
etymologies for the word, that takes us from a Latin etymon all the way to the English lexical entry via Portuguese
and Japanese.

The full example can be found at:


http://lari-datasets.ilc.cnr.it/practiseEtyLex
http://lari-datasets.ilc.cnr.it/practiseEtyLex/queryForm.html (query interface)
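
For readers without access to the diagram, the following sketch generates subject-predicate-object triples for both tempura hypotheses using only the four properties named earlier (etySource, etyTarget, etymon, etymology). It is not the published encoding: the lemonEty: and ex: prefixes, all identifiers such as ex:tempura_en, and the helper etymology_triples are assumptions for illustration, and ex:pt_intermediate is a placeholder for the Portuguese stage, whose form is not specified here.

```python
def etymology_triples(ety_id, chain):
    """Build triples for one etymology.

    `chain` lists node identifiers from the oldest etymon to the
    lexical entry being explained (the last element).
    """
    entry = chain[-1]
    triples = [
        (ety_id, "rdf:type", "lemonEty:Etymology"),
        (entry, "lemonEty:etymology", ety_id),
    ]
    # Every node before the entry itself is an etymon of this etymology.
    for node in chain[:-1]:
        triples.append((node, "rdf:type", "lemonEty:Etymon"))
        triples.append((ety_id, "lemonEty:etymon", node))
    # One reified link per step in the chain.
    for i, (src, tgt) in enumerate(zip(chain, chain[1:]), start=1):
        link = f"{ety_id}_link{i}"
        triples.append((link, "rdf:type", "lemonEty:EtyLink"))
        triples.append((link, "lemonEty:etySource", src))
        triples.append((link, "lemonEty:etyTarget", tgt))
    return triples


# Hypothesis 1: Latin tempora, via Portuguese and Japanese, to English.
latin_hypothesis = etymology_triples(
    "ex:tempura_ety1",
    ["ex:tempora_la", "ex:pt_intermediate", "ex:tenpura_ja", "ex:tempura_en"],
)

# Hypothesis 2: Portuguese tempero, via Japanese, to English.
portuguese_hypothesis = etymology_triples(
    "ex:tempura_ety2",
    ["ex:tempero_pt", "ex:tenpura_ja", "ex:tempura_en"],
)


def etymons_of(triples):
    """Collect the objects of all lemonEty:etymon triples."""
    return {o for _, p, o in triples if p == "lemonEty:etymon"}


# Etymon individuals that both hypotheses reuse.
shared = etymons_of(latin_hypothesis) & etymons_of(portuguese_hypothesis)
```

The Japanese etymon ends up in shared: it is a single individual referenced by both etymologies, which is the benefit of modelling etymons as separate individuals rather than as parts of one etymology.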
Further Work
So far we have seen how to model various different features of etymological data in RDF using a
specialised vocabulary, lemonEty, that extends lemon.

An initial version of this vocabulary has been published, but it will need to be extended or modified
in order, e.g., to deal more explicitly with ‘etymologies’ as hypotheses and to represent levels of
confidence, attestations, and references to the secondary literature.

We have seen how etymologies are similar to family trees and these are fairly straightforward to
represent in RDF/OWL (there exists a popular OWL tutorial that uses a family tree as its guiding
example). Representing the dynamic/temporal aspect of etymologies is a little bit trickier...Luckily
I’m out of time!
Thank You!

どうもありがとう

Gramercy

Iċ þē þancie

Obrigado

Merci Beaucoup

Grātiās Vōbīs Agō

χάριν οἶδα σοι
