AN ACTIVITY BASED SPOKEN LANGUAGE CORPUS OF NEPALI
Jens Allwood (1), Bhim Narayan Regmi (2), Sagun Dhakhwa (2), and Ram Kisun Uranw (2)
(1) University of Gothenburg, Sweden
(2) Centre for Communication and Development Studies, Nepal
[email protected],
[email protected],
[email protected],
[email protected]
ABSTRACT
Language is used for communication and communication
facilitates social activities. If we want to capture this,
linguistic investigation has to be carried out within a wider
context. Examination of linguistic communication in a
wider context shows that it is multimodal. In order to study
naturalistic multimodal communication using a corpus, the
corpus should contain a combination of recordings,
documentation, and transcription of multimodal
communication from different social activities in naturalistic
settings, preserving unedited conversation. This paper
presents a brief account of the principles, methodology,
current status, and preliminary findings, based on an
incrementally growing and multimodal activity based
spoken language corpus of Nepali.
Index Terms— activity based, multimodal, spoken
language corpus, NSC, Nepali language
1. INTRODUCTION
The Nepali Spoken Language Corpus (NSC) is being
developed (with funding support from the Swedish Research
Council (VR) and SIDA) in collaboration between the
University of Gothenburg, Sweden, and the Centre for
Communication and Development Studies in Nepal. It
aims to continue and complete the work on a spoken
language corpus carried out under the NELRALEC project
[1] [2], and to analyze some central features of spoken
Nepali. The current goal is to collect a corpus of 500 K
words with pauses, silences and overlaps annotated, so that
features of spoken language such as pauses, silences,
overlaps, feedback, and own communication management
can be analyzed.
Nepali has 11 million native speakers in Nepal
according to the national census of 2001 [3]. There are two
other Nepali corpora: one spoken language corpus and
one written language corpus. Both corpora are
genre based [4] [5] [6] [7]. The corpus presented here is
social activity based and thus differs theoretically as well
as methodologically from the other spoken language corpus.
Taking the concept of “social activity related
communication” as a basis for the study of language and
communication is not new. The seeds were planted as early
as 1953, in the concept of “language game” suggested by
Wittgenstein, and the idea was later developed by Allwood
in 1976, 2000, and 2007 [8] [9] [10] [11]. The
conceptual goal of a social activity based language and
communication study is to describe, understand and explain
linguistic interaction, especially face-to-face, direct,
multimodal communication and the factors that condition
such interaction, see Allwood [12].
Though the methodology of such a study is open
and can include different methods, the main type of method
is, as Allwood (pp. 1-2) has noted, recording, registration
and analysis of authentic linguistic interaction in
circumstances that are as “non-arranged” and “naturalistic” as
possible, with a primary focus on face-to-face, direct, multimodal
communication. However, communication using different
kinds of communication technology, such as telephones or
computers, can also be taken into consideration [13]. This
approach has already been applied to the spoken language
corpus at Gothenburg University and is followed in the
Nepali corpus too [14].
2. KEY POINTS IN DEVELOPING NSC
Some of the key points around which the corpus has been
developed are presented in the following subsections.
2.1. Social activities
There are many activities in society where people
communicate using language, gestures, etc. The constitutive
features of these social activities affect both the verbal and
gestural communication in the activities. Thus, a corpus
based on social activities can be used for the study of many
different aspects of human communication such as language
and gestures, sounds and silences, individual contributions
and interactive patterns etc.
2.2. Interaction
Use of language in a social activity is often interactive,
though there are activities that are more monological, such
as lectures, TV and radio broadcasts or expressive uses of
language in artistic and ritual functions. In our corpus, the
focus is on interaction.
2.3. Naturalistic settings
Though we do not deny the importance of the data from
controlled settings/studio environments, e.g. in extracting
speech features for speech modeling, naturalistic settings
provide a wider perspective on human communication.
Since communication is not limited to features of phonemes
or prosody, but goes beyond to topic, emotions, social
status, activities, and combinations with gestural means of
communication, this naturally led us to considering
naturalistic settings as primary. The corpus also includes
some radio talk shows and TV talk shows produced in a
high quality studio environment and with controlled
behavior where the naturalness may be questioned.
However, a radio talk show and a TV talk show can also be
regarded as social activities, where the studio environment
and controlled behavior of the participants is natural to this
social activity, so it is also a naturalistic setting.
2.4. Unedited media
Human communication has many repetitions, incomplete
verbal units, long pauses and silences, overlaps, and
variations in pitch, loudness, length, etc., which can only be
captured in unedited audio and audio-video recordings.
Thus, most of the recordings in this corpus have not been
edited, but we have not been able to prevent editing in the
radio talk shows and TV talk shows.
2.5. Contributions
The basic units of interactive talk are contributions. They
are often vocal-verbal or gestural and can be characterized
as a participant’s communication from the point where it
begins to the point where another participant makes her/his
contribution or interrupts. Contributions include utterances,
silences, and gestures, and are part of a broader concept of
communication where verbal and gestural elements can
function together. In this corpus, contributions are the
starting point for an analysis either of relations between
contributions or of elements within contributions.
2.6. Multimodality
Communication is realized through various related
simultaneous modes, such as speaking-hearing, gesturing-seeing,
writing-reading, etc.; we could call these the audio,
audio-visual, visual and written modes. This corpus aims at
multimodality, where at least speech is transcribed in the
case of audio data, and speech, gesture, and visual form of
the setting are annotated in the case of video data.
2.7. Transcription
Transcription includes the rendering in written form of
contributions containing utterances, pauses, silences,
overlaps and gestures. If used this way, the term
“transcription” has an overlapping meaning with the term
“annotation”, but the two terms can also be separated so that
“transcription” means written recording of speech and
“annotation” is slightly wider, including transcription of the
linguistic units and information about the linguistic
functions, gestures, context, etc.
2.8. Metadata
Every transcription and recording is accompanied by
metadata. The metadata cover information about the
speakers and the recorded activity, including social and
geographical information about the speakers, and the
participants, setting, necessary artifacts and duration of the
recorded activity. See sections 4.5 and 4.6 below.
2.9. Ethical issues
We have obtained prior informed consent from all
participants to be recorded in the Nepali spoken language
corpus.
3. SOCIAL ACTIVITIES IN NSC
We started with a list of 18 social activities based on
the Swedish Spoken Language Corpus at Gothenburg
University. As we aimed to carry out comparative studies
between Swedish and Nepali spoken language, we did not
modify the list, but instead left some of the social activities
empty and added some new activities, more typical of
Nepal, to the list. Currently, there are 24 activities in the list,
but it may become longer as new activities are identified.
The list is as follows: (1) Shopping, (2) Discussion, (3)
Court proceedings, (4) Task oriented formal meeting, (5)
Dinner conversation, (6) Family gathering, (7) Conversation
while working (weaving, farming, etc.), (8) Quarrel, (9)
Hotel, (10) Academic seminar, (11) Radio talk show, (12)
TV talk show, (13) Interview, (14) Hospital, (15) Classroom
interaction, (16) Story telling, (17) Phone, (18) Market
place, (19) Task oriented informal meeting, (20) Honor,
(21) Fortune telling, (22) Formal discussion, (23) Thesis
defense, (24) Elicitation.
Currently there are 220 recordings with a total
duration of 61:35:33 (hh:mm:ss) in the NSC (Table 1). Of these, 133
files (with a duration of 40:58:15) have been transcribed
and contain 386,314 words. The words in the 87
untranscribed files (with a duration of 20:37:18) have not
been counted. However, we estimate that once they are
counted, the total number will exceed the target of 500
K words.
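The estimate can be made explicit with a short calculation. The following is a minimal sketch, not part of the corpus tooling: it assumes that the untranscribed recordings have the same average speech rate as the transcribed ones, which need not hold, since rates vary between activities. The function and constant names are illustrative only.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' duration string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures reported in the text; the uniform-rate assumption is ours.
TRANSCRIBED_WORDS = 386_314
TRANSCRIBED_DURATION = "40:58:15"
UNTRANSCRIBED_DURATION = "20:37:18"
TARGET_WORDS = 500_000


def estimate_total_words() -> float:
    """Extrapolate the total word count, assuming a uniform speech rate."""
    words_per_second = TRANSCRIBED_WORDS / to_seconds(TRANSCRIBED_DURATION)
    untranscribed = words_per_second * to_seconds(UNTRANSCRIBED_DURATION)
    return TRANSCRIBED_WORDS + untranscribed


if __name__ == "__main__":
    total = estimate_total_words()
    print(f"Estimated total: {total:,.0f} words")
    print("Exceeds target:", total > TARGET_WORDS)
```

Under this assumption the extrapolated total is roughly 580,000 words, which is consistent with the expectation that the corpus will exceed the 500 K target.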
Table 1. Current situation of NSC
Activity title                      Activity code  Duration recorded  No of activities
1. Shopping                         1              01:17:01           6
2. Discussion                       2              06:46:20           29
3. Court proceedings
4. Task oriented formal meeting     4              01:31:46           2
5. Dinner conversation              5              00:58:01           3
6. Family gathering
7. Conversation while working       7              02:42:31           17
8. Quarrel
9. Hotel                            9              00:07:49           1
10. Academic seminar
11. Radio talk show                 11             09:17:17           20
12. TV talk show                    12             01:25:16           2
13. Interview                       13             04:14:43           7
14. Hospital                        14             03:24:12           33
15. Classroom interaction           15             00:41:14           3
16. Story telling
17. Phone                           17             02:11:37           14
18. Market place                    18             00:23:34           2
19. Task oriented informal meeting  19             02:52:34           9
20. Honour                          20             02:14:54           7
21. Fortune telling                 21             02:32:03           6
22. Formal discussion               22             05:11:43           13
23. Thesis Defence                  23             01:24:10           5
24. Elicitation                     24             12:18:48           41
Total                                              61:35:33           220
Among the empty activities listed above, activity
type (3) Court proceedings cannot be recorded in
Nepal because there is no legal provision permitting the recording
of such activities. The activities (6) Family gathering, (8)
Quarrel, and (10) Academic seminar have not yet been
recorded. Activity (15) Classroom interaction has 3
recordings, which have recently been transcribed but are
still to be annotated for pauses, silences and overlaps, and
activity (16) Story telling has just been recorded but not yet
transcribed. The NSC is thus an incrementally growing corpus
that aims to cover as many social activities as
possible.
3.1. A brief description of the social activities
A brief description of the social activities in NSC is
presented below. There are 133 files that have been
annotated for pauses, silences and overlaps, but, as
already mentioned, 87 files are yet to be coded for these
features. The words in the 133 transcribed files have been
counted, and the duration and number of
words are presented in Table 2.
3.1.1. Shopping
Shopping is the activity of visiting shops to buy goods. In
this activity a customer visits a shop, asks a shopkeeper
about the quality, size, price, manufacturer and durability of
the goods, bargains at the end and buys if he is satisfied.
3.1.2. Discussion
Discussion is the activity of talking about a topic. In this
corpus the term is used for open talk between two or more
people about any topic. Most of the casual conversations are
grouped under this activity.
3.1.3. Court Proceedings
A court has well established procedures for filing cases,
petitions, summons, advocating, and judging and finally
ordering implementation of the law. “Court proceedings”
refers to such procedures. Unfortunately, because of the
lack of legal provision for providing us with permission to
record, we have not had the opportunity to record any
activity in this category.
3.1.4. Task oriented formal meetings
This is the kind of meeting where the topic of discussion is
pre-defined and the procedure is formal. A task oriented
formal meeting either leads to a decision on a certain task or
ends at some point on the way to a decision.
3.1.5. Dinner conversation
Dinner conversation is talking while having dinner.
However, we have also grouped lunch conversations under this
activity, which could perhaps be better represented by the
term “meal conversation”.
3.1.6. Family gatherings
Here we have in mind a gathering of family members where
there may be close family members or close relatives. We
have not had the opportunity to record this activity. In many
cases, such gatherings are related to private matters such as
property, relationship between family members, etc., where
recording might be sensitive.
3.1.7. Conversation while working
This activity type involves talking and working
simultaneously. Talk itself is a kind of social work, but here
work is taken in a narrower sense to mean
accomplishment of a physical task such as weaving,
weeding, sewing, etc.
3.1.8. Quarrel
This is fighting with words, different from debate or
discussion in that it mostly involves the emotion of anger. A
quarrel mostly contains words or structures that are
normally considered offensive, abusive, and hurtful, and
sometimes it results in physical fighting as well. It cannot easily be
arranged, and since it usually happens suddenly, it is
difficult to get this activity recorded. Even if we get an
opportunity to do so, there is a risk of physical attack by the
persons involved in the quarrel. We do not have any
recordings of this activity so far.
3.1.9. Hotel
This activity is related to talk between hotel personnel and
owner or guests concerning facilities, number of rooms,
rates, reservations, etc. There is only one recording of this
activity in NSC.
3.1.10. Academic Seminar
This is an academic activity where academicians meet and
talk on a topic in a formal way. Because of the many
participants and many speakers from different sides, it has
been difficult to record. There is no recording of this
activity yet in NSC.
3.1.11. Radio Talk Show
A radio talk show is a radio program where a journalist
invites one or more persons to discuss or express their views on a
given topic. During the conversation the journalist leads the
invitee(s) to a topic or intervenes with queries, feedback, etc.
It is a semi-controlled conversation recorded in high quality
studio settings, which is useful when we want to extract
speech features for language processing purposes.
3.1.12. TV Talk show
A TV talk show is similar to a radio talk show; in terms of
quality, the two differ only in the medium. However, there is a
vast difference in terms of modality: TV is audio-visual, while
radio is only audio.
3.1.13. Interview
An interview is talk about a person's experiences, ideas,
views, etc., e.g. involving a journalist or researcher and an
interviewee, in order to make the interviewee's thoughts, experiences,
life, etc. known to others.
3.1.14. Hospital
In this activity, hospital personnel (a doctor, health
worker or member of the administrative staff) talk to patients or the
patients' caretakers.
3.1.15. Classroom interactions
This involves talk between a teacher and students concentrated on
teaching-learning activities.
3.1.16. Story telling
This is the activity of telling a story, in which a story teller and
listeners take part. Story telling is a very well known
folklore activity, practiced in the villages of Nepal. It has a
special setting, style of conversation and special
communicative roles for the participants. Unfortunately, it is
disappearing gradually. We have a few recordings of this
activity that are yet to be transcribed.
Table 2. Duration and number of words in some of the
activities in NSC
Activity title and code                  Duration  No of activities  No of words
1. Shopping (1)                          01:05:12  4                 11893
2. Discussion (2)                        03:57:21  16                45257
3. Task oriented formal meeting (4)      00:29:53  1                 4495
4. Dinner conversation (5)               00:58:01  3                 9202
5. Conversation while working (7)        01:51:27  5                 16727
6. Hotel (9)                             00:07:49  1                 1346
7. Radio talk show (11)                  09:17:17  20                90735
8. TV talk show (12)                     01:25:16  2                 15978
9. Interview (13)                        04:14:43  7                 38811
10. Hospital (14)                        03:24:12  33                23230
11. Phone (17)                           02:11:37  14                22664
12. Market place (18)                    00:19:04  1                 2112
13. Task oriented informal meeting (19)  01:54:58  3                 17816
14. Honour (20)                          02:14:54  7                 16181
15. Fortune telling (21)                 02:32:03  6                 21789
16. Formal discussion (22)               03:30:18  5                 33020
17. Thesis Defence (23)                  01:24:10  5                 15058
Total                                    40:58:15  133               386314
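The figures in Table 2 can be used to compare the average speech rate across activities. The sketch below is an illustration only: it copies the durations and word counts from Table 2 and computes words per minute for each activity. The names `TABLE_2`, `to_minutes` and `words_per_minute` are introduced here for the example and are not part of any NSC tool.

```python
def to_minutes(hms: str) -> float:
    """Convert an 'hh:mm:ss' duration string to minutes."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 60 + minutes + seconds / 60


# (activity, duration, number of words), copied from Table 2.
TABLE_2 = [
    ("Shopping", "01:05:12", 11893),
    ("Discussion", "03:57:21", 45257),
    ("Task oriented formal meeting", "00:29:53", 4495),
    ("Dinner conversation", "00:58:01", 9202),
    ("Conversation while working", "01:51:27", 16727),
    ("Hotel", "00:07:49", 1346),
    ("Radio talk show", "09:17:17", 90735),
    ("TV talk show", "01:25:16", 15978),
    ("Interview", "04:14:43", 38811),
    ("Hospital", "03:24:12", 23230),
    ("Phone", "02:11:37", 22664),
    ("Market place", "00:19:04", 2112),
    ("Task oriented informal meeting", "01:54:58", 17816),
    ("Honour", "02:14:54", 16181),
    ("Fortune telling", "02:32:03", 21789),
    ("Formal discussion", "03:30:18", 33020),
    ("Thesis Defence", "01:24:10", 15058),
]


def words_per_minute(table=TABLE_2) -> dict:
    """Map each activity to its average number of words per minute."""
    return {name: words / to_minutes(duration) for name, duration, words in table}


if __name__ == "__main__":
    rates = words_per_minute()
    # Print activities from the slowest to the fastest average rate.
    for name, rate in sorted(rates.items(), key=lambda item: item[1]):
        print(f"{name:32s} {rate:6.1f} words/min")
```

With these figures the average rate ranges from roughly 111 words per minute (Market place) to roughly 191 words per minute (Discussion), which illustrates that duration alone is a poor predictor of the number of words in a recording.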
3.1.17. Phone
This is telephone communication. It is not face-to-face but
direct and has its own patterns, styles, and vocabulary.
3.1.18. Market place
This is an open market activity where many buyers and
sellers are involved in activities such as
promoting/advertising their goods, bargaining, choosing,
buying and selling, etc.
3.1.19. Task oriented informal meeting
This is a task oriented meeting without formal procedures.
3.1.20. Honor
This is an activity of felicitation where a person is awarded
a kind of recognition letter and praised, and other people
make speeches on his/her contribution to society, biography,
etc. Such meetings are growing in popularity in Nepal.
3.1.21. Fortune telling
This is an activity of talking about a person's past, present,
and future by a professional fortune-teller.
3.1.22. Formal discussion
This is a discussion carried out in a formal way.
3.1.23. Thesis defense
This is an academic activity where a student presents his
research findings orally and the experts make queries in
order to evaluate his work.
3.1.24. Elicitation
This activity involves question-answer during field research
where a researcher makes queries to a person who knows
about a topic and the person answers. In our corpus, it
differs from interviews in that it is specific to a certain topic
and involves a researcher, who tries to get specific
information and often has a wider perspective than is
common in interviews.
4. METHODOLOGICAL CONSIDERATIONS
The methodology followed in building the NSC is presented
below in brief.
4.1. Selection and environment of recording
As social activities are the basis of the corpus, natural
settings have been chosen for the recordings. Thus, most of
the recordings have been made in their actual environments.
For example, a recording of weeding takes place in a
nursery garden and includes the background noise produced
there, and a recording of a dinner conversation includes
noises produced with tools and utensils. There are also
studio recordings, since radio talk shows and TV talk shows
are also important social activities, which are normally
recorded in a studio environment.
4.2. Tools for recording
We have used a Sony Handycam and mp3 digital recorder,
and a Samsung mp3 digital recorder, in most cases.
The data from radio and TV, however, were recorded with
the broadcasters' own equipment, and we obtained them from their archives. The
data have been recorded at a sampling rate of 44 kHz.
4.3. Sample size of the recordings
There is no fixed digital size or size in terms of duration or
number of words for the recordings. Since the corpus is based on
naturalistic social activities, the size is determined by the
activity. For example, the social activity ‘hospital’
includes doctor-patient conversation, and the size of such a
recording is determined by the time given by the doctor to
that particular patient, as in the recording V001014023,
which is a 1 minute and 10 seconds long video recording
containing 221 words. Similarly, there is no predictable ratio
between duration and number of words. For example, the interview
recording A001013001, which has a duration of 00:31:17,
contains 4739 words, whereas another interview,
V001013003, with a longer duration of 00:39:22 (more than 8
minutes longer), contains 4378 words (361 fewer words).
4.4. Naming conventions
The NSC naming conventions have been established in the
following way: there is a letter, A or V, for audio and video
respectively, followed by nine digits grouped into three sets
of three for corpus label, activity label and activity number. For
example, V001002003 stands for the ‘third’ (003) ‘video
recording’ (V) of the social activity ‘discussion’ (002)
within the ‘Nepali spoken language corpus, a part of the
Nepali National Corpus developed in the NELRALEC
project’ (001). Likewise, A002024001 is the ‘first’ (001)
‘audio recording’ (A) of the social activity ‘elicitation’
(024) within the ‘Nepali spoken language corpus developed
under the Nepali and Lohorung Spoken Language project’
(002). This convention allows a maximum of 999 recordings
within each of 999 social activities, within each of 999
corpora.
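The naming convention lends itself to automatic parsing. The following minimal sketch, which is an illustration and not the tool used in the project, splits a recording name into its four parts. The names `RecordingId` and `parse_recording_id` are hypothetical, and the sketch assumes that a name consists of exactly one letter and nine ASCII digits.

```python
import re
from typing import NamedTuple


class RecordingId(NamedTuple):
    medium: str    # "audio" or "video"
    corpus: int    # corpus label, e.g. 1 or 2
    activity: int  # social activity code, e.g. 2 for discussion
    number: int    # running number of the recording within the activity


# One letter (A or V) followed by three groups of three digits.
_ID_PATTERN = re.compile(r"([AV])([0-9]{3})([0-9]{3})([0-9]{3})")
_MEDIA = {"A": "audio", "V": "video"}


def parse_recording_id(name: str) -> RecordingId:
    """Split a name such as 'V001002003' into its components."""
    match = _ID_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid NSC recording name: {name!r}")
    letter, corpus, activity, number = match.groups()
    return RecordingId(_MEDIA[letter], int(corpus), int(activity), int(number))


if __name__ == "__main__":
    print(parse_recording_id("V001002003"))
    print(parse_recording_id("A002024001"))
```

For the two examples given in the text, the sketch yields a video recording with corpus 1, activity 2 and number 3, and an audio recording with corpus 2, activity 24 and number 1.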
4.5. Speaker information
The following information (metadata) concerning
participants is noted while recording and maintained in the
corpus: (1) Name, (2) Age, (3) Gender, (4) Mother tongue,
(5) Second language, (6) Dialect (social and geographical),
(7) Education, (8) Place of primary education, (9)
Profession, (10) Address (including contact phone number
and email). This information can be helpful to carry out
sociolinguistic, dialectal and second language studies using
the corpus.
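The speaker metadata listed above can be thought of as one record per participant. The sketch below shows one possible representation and a simple query of the kind needed for second language studies. It is an assumption-laden illustration: the class `SpeakerInfo`, its field names and types, and the function `second_language_speakers` are hypothetical and do not describe how the metadata are actually stored in NSC.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeakerInfo:
    """One participant record with the ten fields listed in section 4.5."""
    name: str
    age: int
    gender: str
    mother_tongue: str
    second_language: Optional[str]  # None if no second language is reported
    dialect: str                    # social and geographical dialect
    education: str
    place_of_primary_education: str
    profession: str
    address: str                    # including contact phone number and email


def second_language_speakers(speakers: List[SpeakerInfo],
                             language: str = "Nepali") -> List[SpeakerInfo]:
    """Return the speakers who report `language` as their second language."""
    return [s for s in speakers if s.second_language == language]


if __name__ == "__main__":
    # Invented placeholder records, for illustration only.
    speakers = [
        SpeakerInfo("Speaker A", 34, "F", "Nepali", None, "eastern",
                    "secondary", "placeholder", "teacher", "placeholder"),
        SpeakerInfo("Speaker B", 52, "M", "Lohorung", "Nepali", "eastern",
                    "primary", "placeholder", "farmer", "placeholder"),
    ]
    for speaker in second_language_speakers(speakers):
        print(speaker.name, speaker.mother_tongue)
```

Keeping the fields explicit in this way makes it straightforward to select sub-corpora by mother tongue, dialect or education.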
4.6. Information on recorded activities
The following information (metadata) is maintained in the
corpus: (1) Transcription status, (2) Recorded activity ID,
(3) Recorded activity title, (4) Short name, (5) Recorded
activity date, (6) Tape, (7) Anonymity, (8) Access, (9)
Activity Type (at three levels), (10) Activity purpose, (11)
Activity roles, (12) Activity procedures, (13) Activity
environment, (14) Activity artifacts, (15) Duration, (16)
Participants, (17) Recorder, (18) Transcription name, (19)
Transcriber, (20) Transcriber’s ID, (21) Transcription date,
(22) Transcribed segments, (23) Transcription system, (24)
Checker, (25) Checking date, (26) Description, (27)
Comment, (28) Start and end times, and (29) Section.
4.7. Transcription and annotation
Nepali is written in Devanagari script, which has a long tradition
of writing. However, standardized writing has become
different from speech in the course of time. Thus, in order to
maintain closeness to spoken language, the speech in the
recordings has been transcribed phonemically in
Devanagari Unicode.
The format has been based on the Gothenburg
Transcription Standard (GTS), where spoken language
features such as pauses, silences, overlaps, unclear speech,
broken words etc. are annotated [15].
5. RESEARCH BASED ON NSC
Some preliminary research has been carried out based on
the Nepali spoken language corpus. There is a paper on the
intonation patterns of Nepali feedback units and another
paper on the relation between writing and speech and the
functions of the multifunctional Nepali word chaahin [16] [17].
These papers present only preliminary results, but show the
importance of research in the field of spoken language,
multimodal communication and corpus based studies and
the validity of such a corpus for studies of Nepali.
Although not yet formulated as research papers,
there are many interesting features of spoken language and
communication that can be found while working with the
corpus. To note just a few, spoken Nepali is different from
written language not only in pronunciation but also in terms
of vocabulary and grammatical structure. There is even a set
of words and a set of functions which are found only in the
spoken language. Another interesting feature of
interpersonal communication connected to overlap and
pause is that most of the overlaps follow pauses in the
conversation. However, these and other issues need to be
supported by quantitative analysis and appropriate
explanations drawing on the research to be carried out.
6. SUMMARY AND CONCLUSIONS
Summarizing the above discussion, the Nepali spoken
language corpus is being developed as a continuously
growing corpus, which is now ready for research on spoken
language features and other communicative features. The
notion of social activity is the basis of the corpus and the
corpus is multimodal, based on unedited recordings in
natural settings, containing detailed information about
participants and the recorded activity itself. We hope it will
provide new opportunities for research and a wider
perspective on language and communication for researchers.
7. REFERENCES
[1] http://bhashasanchar.org/ncorpus_spoken.php
[2] [7] Yadava, Y. P., A. Hardie, R. R. Lohani, B. N. Regmi, S.
Gurung, A. Gurung, T. McEnery, J. Allwood and P. Hall,
"Construction and annotation of a corpus of contemporary Nepali",
Corpora Vol. 3 (2), Edinburgh University Press, UK, 2008, pp.
213–225.
[3] Central Bureau of Statistics National Planning Commission
Secretariat His Majesty's Government of Nepal (CBS) and
UNFPA, "Population Census 2001: National Report", CBS and
UNFPA, Kathmandu, 2002.
[4] http://cqpweb.lancs.ac.uk/bandhu/index.php?thisQ=corpusMetadata&uT=y
[5] http://cqpweb.lancs.ac.uk/nncv2/index.php?thisQ=corpusMetadata&uT=y
[6] http://bhashasanchar.org/ncorpus_written.php
[8] Wittgenstein, L., Philosophical Investigations. Oxford,
Blackwell, 1953.
[9] Allwood, J., “Linguistic Communication as Action and
Cooperation”, Gothenburg Monographs in Linguistics 2,
University of Göteborg, Dept of Linguistics, 1976.
[10] Allwood, J., “An Activity Based Approach to Pragmatics”, In
Bunt, H., & Black, B. (Eds.) Abduction, Belief and Context in
Dialogue: Studies in Computational Pragmatics. Amsterdam, John
Benjamins, 2000, pp. 47-80.
[11] [12] [13] Allwood, J., “Activity Based Studies of Linguistic
Interaction”, Gothenburg Papers in Theoretical Linguistics 93,
Department of Linguistics, Göteborg University, 2007.
[14] Allwood, J., “The Swedish Spoken Language Corpus at
Göteborg University”, Proceedings Fonetik 99: The Swedish
Phonetics Conf. June 1999, Göteborg University, Sweden, 1999.
[15] Nivre, J., J. Allwood, L. Grönqvist, M. Gunnarsson, E.
Ahlsén, H. Vappula, J. Hagman, S. Larsson, S. Sofkova, and C.
Ottesjö, Göteborg Transcription Standard v6.4, Department of
Linguistics, Göteborg University, 2004.
[16] Allwood, J. and B. N. Regmi, “Intonation Patterns in Nepali
Feedback Units”, Proceedings of Oriental COCOSDA 2010, Paper
61, November 24-25, Kathmandu, 2010.
[17] Regmi, B. N., “ChaahiN: a case study in terms of variants,
frequency and function”, Proceedings of Oriental COCOSDA
2009, Poster 10, Aug. 10th-12th, Beijing, China, 2009.