The Partitur format at BAS

1998, Proceedings of the First …

Florian Schiel (1), Susanne Burger (2), Anja Geumann (2), Karl Weilhammer (2)
(1) Bavarian Archive for Speech Signals (BAS), Munich, Germany
(2) Department of Phonetics and Speech Communication, University of Munich, Germany

Abstract

Most spoken language resources are produced and disseminated together with symbolic information relating to the speech signal. This includes, for instance, orthographic transcripts as well as labelling and segmentation on the phonological, phonetic, prosodic and phrasal levels. Most of the known formats for these symbolic data are defined in a 'closed form' that is not flexible enough to allow simple and platform-independent processing and easy extensions. At the Bavarian Archive for Speech Signals (BAS) a new format has been developed and used over the last few years that shows some significant advantages over other existing formats. This paper describes the basic principles behind this format, briefly discusses its advantages and gives detailed definitions of the description levels used so far. Furthermore, we give some examples of easy processing of the format and of distributed work on the same data. In the future all corpora produced and disseminated by BAS will be distributed in the new BAS Partitur Format if they contain segmental information of any kind. The formerly used formats will be retained but not updated further.

General Overview

Most file formats containing segmental information on speech signals have the following disadvantages:
• They are not easy to extend (without rewriting software that uses the existing format).
• They are not easy to process with standard UNIX tools.
• They mix different description levels (which leads to technical and conceptual problems).
• They were defined as ad-hoc solutions for very specialised problems and cannot be re-used in a different setup.

Therefore a new open format based on the SAM Label Format was developed, which circumvents most of the problems mentioned above.
In this format all levels of description may be annotated independently but are time-aligned like the individual tiers of a score. Hence this format was called 'BAS Partitur Format' ('Partitur' is German for 'score'). In the future all BAS corpora will be distributed in the new BAS Partitur Format if they contain segmental information of any kind. The formerly used formats will be retained but not updated further. A first draft of the BAS Partitur Format was published in [Atmanspacher et al., 1995].

The BAS Partitur Format has the following features:
• SAM-compatible header structure
• Easy to extend and to process with simple UNIX commands
• Open format; extensions to the format can be implemented without alterations to the software that reads the older format
• Time-aligned independent description of a virtually unlimited number of different levels of the speech signal (see examples later in this paper)
• Symbolic links between the independent levels allow logical assignments aside from the physical time scale. These links are based on the word units of the utterance.

Definition (Version 1.2)

A Partitur file name has the same prefix as the corresponding signal file (8 bytes for ISO 9660 compatibility) but the extension .par. All contents are exclusively in 7-bit ASCII (to guarantee portability to all platforms). Each line starts with a three-byte label followed by a colon; this label defines the synopsis and semantics of the rest of the line. The following units of the line are separated by white space (blank, tab).

The Partitur file is structured into a header and a body (like SAM description files). The header stretches from the beginning of the file to the label LBD:; the body stretches from the label LBD: to the end of the file, where the last line has to be closed by a 'new line' or a 'CR + LF' (the final SAM label ELF: was omitted from the BAS Partitur Format since it prevents effective processing of the Partitur files). The header contains SAM-compatible lines of general information.
The following entries are compulsory:

    LHD:  Partitur file version
    REP:  Place of recording
    SNB:  Number of bytes per sample
    SAM:  Sampling frequency in Hz
    SBF:  Byte order (Intel 01, Motorola 10)
    SSB:  Bit resolution
    NCH:  Number of channels
    SPN:  Speaker ID
    LBD:  (closes the header; the body follows)

Example:

    LHD: Partitur 1.2
    REP: Muenchen
    SNB: 2
    SAM: 16000
    SBF: 01
    SSB: 16
    NCH: 1
    SPN: PS1
    LBD:

The following entries are optional; apart from these, other entries are tolerated as long as they do not conflict with compulsory and optional entries:

    FIL:  SAM file type
    TYP:  Type of SAM label file
    DBN:  Corpus name
    VOL:  Number of volume
    DIR:  Directory in volume
    SRC:  Name of speech file
    BEG:  Begin of labelling sequence
    END:  End of labelling sequence
    RED:  Date of recording
    RET:  Duration
    RCC:  Recording conditions
    CMT:  Comment
    SPI:  Speaker information
    PCF:  Name of protocol file
    PCN:  Protocol number
    EXP:  Name of segmenter
    SYS:  Labelling system
    DAT:  Date of labelling
    SPA:  SAM-PA version

The body starts after the label LBD: and stretches to the end of the file. It contains the different tiers of the BAS Partitur Format. Each tier is identified by a unique label. The order of tiers as well as the order of lines within a tier is not significant. In the following sections the five basic classes of tiers are defined.

Tiers with symbolic relation (class 1)

A line of this tier contains:
• the tier label
• a comma-separated list of integers (symbolic links)
• a string with the labelling information

The symbolic links refer to a reference tier which numbers the word units beginning with zero. The label string has an internal synopsis which is defined in the tier definition.

Example:

    TRL: 6,7 mit'm

In this example the word events 6 and 7 of an utterance are transliterated.

Tiers with time-consuming events (class 2)

A line of this tier contains:
• the tier label
• two integers denoting the begin and duration of the event
• a string containing the labelling information

The semantics of the integers is defined by the tier definition (possible units are samples, milliseconds, etc.).

Example:

    PHN: 13456 3450 aU

In this example a phonemic segment labelled /aU/ stretches from sample 13456 for the next 3450 samples.

Tiers with non time-consuming events (class 3)

A line of this tier contains:
• the tier label
• an integer denoting the time position of the event
• a string containing the labelling information

Example:

    PRO: 13456 TON: P*; FUN: PA

In this example the prosodic event labelled TON: P*; FUN: PA (GTobi, see [Grice et al., 1995]) takes place at sample 13456 of the utterance.

Tiers with time and symbolic relation, time-consuming (class 4)

• If the symbolic link in a tier is not (or not yet) known, the symbolic link is set to -1 (e.g. noises from other sources than the recorded speaker).
• The same symbolic relation may occur in different lines of a tier (for example if more than one event can be assigned to the same word of an utterance).

A line of this tier contains:
• the tier label
• two integers denoting the start and duration of the event
• a comma-separated list of integers (symbolic links)
• a string containing the labelling information

Example:

    SAP: 13456 3450 9 aU

In addition to the example above this tier not only gives the starting point and the duration of the phonemic segment but also a pointer to the word unit where it belongs (word 9).

Tiers with time and symbolic relation, not time-consuming (class 5)

A line of this tier contains:
• the tier label
• an integer denoting the time position of the event
• a comma-separated list of integers (symbolic links)
• a string containing the labelling information

Example:

    PRB: 13456 9 TON: P*; FUN: PA

Again, in this example the prosodic event is not only placed in time but also assigned to a word of the utterance (word 9).
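As an illustration of the five line layouts defined in this section, the following Python sketch splits a body line into its fields. It is not part of any BAS software: the names TIER_CLASS, FIELDS_BEFORE_LABEL, Event and parse_line are hypothetical, and the label-to-class table covers only the five example lines of this section (PHN and PRO are the labels used in those examples). A reader for real corpora would have to extend the table with every tier it wants to handle.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: label -> class table built only from the five example lines.
TIER_CLASS = {"TRL": 1, "PHN": 2, "PRO": 3, "SAP": 4, "PRB": 5}

# Number of white-space separated fields before the label string, per class.
FIELDS_BEFORE_LABEL = {1: 1, 2: 2, 3: 1, 4: 3, 5: 2}


@dataclass
class Event:
    tier: str
    label: str
    start: Optional[int] = None      # time position (classes 2 to 5)
    duration: Optional[int] = None   # classes 2 and 4 only
    links: List[int] = field(default_factory=list)  # classes 1, 4 and 5


def parse_links(text: str) -> List[int]:
    """Convert a comma-separated link list such as '6,7' or '-1'."""
    return [int(item) for item in text.split(",")]


def parse_line(line: str) -> Event:
    """Parse one body line of a known tier into an Event."""
    tier, colon, rest = line.partition(":")
    if not colon or tier not in TIER_CLASS:
        raise ValueError(f"unknown or malformed tier line: {line!r}")
    cls = TIER_CLASS[tier]
    n = FIELDS_BEFORE_LABEL[cls]
    # Limit the split so that blanks inside the label string survive.
    parts = rest.split(None, n)
    if len(parts) != n + 1:
        raise ValueError(f"too few fields for class {cls}: {line!r}")
    head, label = parts[:n], parts[n].rstrip()
    event = Event(tier=tier, label=label)
    if cls == 1:
        event.links = parse_links(head[0])
    else:
        event.start = int(head[0])
        if cls in (2, 4):
            event.duration = int(head[1])
        if cls in (4, 5):
            event.links = parse_links(head[-1])
    return event


for example in ("TRL: 6,7 mit'm", "SAP: 13456 3450 9 aU",
                "PRB: 13456 9 TON: P*; FUN: PA"):
    print(parse_line(example))
```

Limiting the number of splits is the one design point worth noting: label strings such as TON: P*; FUN: PA contain blanks themselves, so only the leading numeric fields may be separated at white space.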
Remarks

• If not otherwise noted, durational parameters are given in samples, counting from the beginning of the digitised utterance.
• An item may refer to more than one word in the utterance (suprasegmental events, assimilation at word boundaries, phrases, etc.).

Definition of Tiers

The following sections give an overview of the tiers currently defined in the BAS Partitur Format (version 1.2.2). Please keep in mind that this is an open list in the sense that new tiers can be defined whenever the need arises. Anyone who would like to work with speech resources from BAS and to define a new tier for a specific problem should contact BAS to have a new tier label assigned. In this way we can keep the documentation of the format consistent and avoid conflicts between identical labels. The third digit of the version number of the BAS Partitur Format is incremented by one whenever a new tier definition is added. In accordance with the basic principle of the format, this does not imply that any software has to be changed.

Canonical Pronunciation

• Tier label: KAN
• Class: 1
• Synopsis: (symbolic links) (transcript)

This tier is the reference tier for all other tiers that use symbolic links. It contains a list of the spoken words within the utterance, annotated in extended German SAM-PA (see [SAM, 1989] for a general definition of SAM-PA and [SAM, 1996] for a special description of the extended German SAM-PA as used in several German projects). Note that these forms are the phonologically expected citation forms, not the actually spoken forms. The whole utterance is segmented into word units, where everything counts as a word that is produced by the articulatory organs of the speaker and can be seen as speech. Following this definition, hesitations are words, whereas laughing, coughs, etc. are not. This separation is not always clear, but then the selection of word units is arbitrary as well.
The main point here is to have a unique reference tier for symbolic relations in other tiers. Another problem is the reduction of words that are annotated in the orthographic form, e.g. "mit'm". In these cases the reduction is expanded to its full form (in this example /mIt de:m/). The reason for this is that some of these reductions should later be automatically accessible.

Example:

    KAN: 0 j'a:
    KAN: 1 Qalzo:
    KAN: 2 QE:m
    KAN: 3 h'OYt@
    KAN: 4 Qo:d6
    KAN: 5 m'O6g@n

Orthography

• Tier label: ORT
• Class: 1
• Synopsis: (symbolic links) (lexical orthography)

The tier orthography contains the orthographic (lexical) strings corresponding to the units in the tier canonical form. Words are not capitalised at the beginning of an utterance or of a sentence within an utterance (except for nouns, of course). German 'Umlauts' and other letters not included in 7-bit ASCII are written in LaTeX notation. This tier is used for easy lexical access; therefore no additional markers except lexical words are allowed. There is no punctuation in this tier. Lexical words include items that are contained in the KAN tier (e.g. hesitations, repairs, word fragments, etc.). This tier can be used to access customised pronunciation dictionaries, to create unique word frequency lists, etc.

Example:

    ORT: 0 ja
    ORT: 1 also
    ORT: 2 <"ahm>
    ORT: 3 heute
    ORT: 4 oder
    ORT: 5 morgen

Verbmobil Transliteration - VM I

• Tier label: TRL
• Class: 1
• Synopsis: (symbolic links) (transliteration)

The tier transliteration VM I contains the orthographic transcript of the utterance according to the VM I conventions 3.0. The transliteration is segmented into the units of the tier canonical pronunciation. Therefore multiple references may occur (e.g. if a reduced form of two words is written as one unit in the transliteration). Although defined especially for the German Verbmobil I project, this format has been used in many other resources of spontaneous speech as well.
See [Kohler et al., 1994] (German only) or online [Burger, 1995] for a detailed description of the VM I transliteration format.

Example:

    TRL: 0 <A>
    TRL: 0 ja ,
    TRL: 1 also
    TRL: 2 <"ahm>
    TRL: 3 <:<#Klicken> heute:>
    TRL: 4 oder
    TRL: 5 morgen .

Verbmobil Transliteration - VM II

• Tier label: TR2
• Class: 1
• Synopsis: (symbolic links) (transliteration)

The tier transliteration VM II contains the orthographic transcript of the utterance according to the VM II conventions. A detailed definition of this format can be found in [Burger, 1997] (German only). In contrast to the VM I format, this updated definition has the advantage of being fully parsable. Furthermore, multi-party and multilingual dialogs may be transliterated with this format (compatible definitions exist for English and Japanese). To denote overlapping speech parts between different speakers in a dialog, a new tier SUP was defined (see below).

Superimposed Speech - VM II

• Tier label: SUP
• Class: 1
• Synopsis: (symbolic links) (transliteration)

This is a very specialised tier to denote overlapping speech in multi-party recordings. The synopsis of the turn marker and the transliteration is defined for the VM II transliteration format (see above). The speech annotated in this tier stems from a different speaker who actively superimposes his or her speech on the speech of this Partitur file. See [Burger, 1997] (German only) for a detailed description of superimposed speech in the VM II format.

Example:

    TR2: 0 ich
    TR2: 1 w"urde
    TR2: 2 vorschlagen ,
    TR2: 3 da"s
    TR2: 4 wir9@
    TR2: 5 dann9@
    TR2: 6 <:<#> hinfliegen:> ,
    TR2: 7 <:<#> ich:>
    TR2: 8 hab'
    TR2: 9 jetzt <!1 jetz'>
    TR2: 10 aber
    TR2: 11 <:<#Rascheln> grade:>
    TR2: 12 <:<#Rascheln> keine:>
    TR2: 13 Unterlagen
    TR2: 14 da . <#>
    SUP: 4,5 g002acn2_028_AAK.par @9ja

In this example the utterance of another speaker (AAK, utterance "ja") is superimposed on word units 4 and 5 of the Partitur file (utterance "wir dann").
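Because all class 1 tiers share the word numbering of the KAN tier, their lines can be collected per word unit with very little code. The Python sketch below does this for the first three word units of the KAN, ORT and TRL examples given in this paper. The function name labels_by_word and the variable SAMPLE are hypothetical and not part of any BAS software; lines without a label string are simply skipped, which is an assumption of this sketch rather than a rule of the format.

```python
from collections import defaultdict

# First three word units of the KAN, ORT and TRL examples in this paper.
SAMPLE = """\
KAN: 0 j'a:
KAN: 1 Qalzo:
KAN: 2 QE:m
ORT: 0 ja
ORT: 1 also
ORT: 2 <"ahm>
TRL: 0 <A>
TRL: 0 ja ,
TRL: 1 also
TRL: 2 <"ahm>
"""


def labels_by_word(text, tiers=("KAN", "ORT", "TRL")):
    """Return {word number: {tier label: [label strings]}} for class 1 tiers."""
    table = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        tier, colon, rest = line.partition(":")
        if not colon or tier not in tiers:
            continue                      # header line or another tier
        fields = rest.split(None, 1)      # link list, then the label string
        if len(fields) != 2:
            continue                      # no label string: skip the line
        links, label = fields
        for link in links.split(","):     # one line may point to several words
            table[int(link)][tier].append(label.strip())
    return table


words = labels_by_word(SAMPLE)
for number in sorted(words):
    print(number, dict(words[number]))
```

A line with several symbolic links, such as the SUP line of the example (links 4,5), is stored under each word unit it points to, and a word unit may collect several lines of the same tier, as word 0 does in the TRL tier.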
Broad Phonetic Segmentation - PhonDat

• Tier label: PHO
• Class: 4
• Synopsis: (integer) (integer) (list of symbolic links) (label string)

This tier contains a totally time-consuming segmentation into broad phonetic units (extended German SAM-PA), i.e. the segments cover the signal without gaps. The first number denotes the beginning of the segment in samples, counted from the beginning of the speech file; the second number denotes the duration of the segment in samples. The label string contains an additional relation to the canonical pronunciation (aside from the symbolic links to the tier canonical form). The '-' sign denotes differences from the expected canonical pronunciation on a segmental level: a leading '-' sign means that the following segment was inserted (e.g. /-a:/); a trailing '-' sign means that the segment was deleted (e.g. /a:-/); a '-' sign between segment labels means that the canonically expected segment was replaced (e.g. /a:-E:/). This tier also contains prosodic and phrasal labelling and segmentation. The full conventions of labelling and segmentation for German are briefly described in [Pompino, 1992] or online in [PHO, 1995].

Example:

    PHO: 6637 0 0 #c:
    PHO: 6637 916 0 ##%Q
    PHO: 7553 820 0 $I
    PHO: 8373 919 0 $C+
    PHO: 9292 870 1 ##m
    PHO: 10162 1424 1 $9
    PHO: 11586 724 1 $C
    PHO: 12310 966 1 $t
    PHO: 13276 689 1 $@+
    PHO: 13965 0 2 ##Q-
    PHO: 13965 0 2 $-q
    PHO: 13965 2024 2 $a:
    PHO: 15989 517 2 $b
    PHO: 16506 1066 2 $6+
    PHO: 17572 539 3 ##Q
    PHO: 18111 820 3 $'U
    PHO: 18931 1867 3 $n-N
    PHO: 20798 0 3 $#g
    PHO: 20798 1111 3 $@

Broad Phonetic Segmentation - Verbmobil

• Tier label: SAP
• Class: 4
• Synopsis: (integer) (integer) (list of symbolic links) (label string)

In contrast to the PHO tier this segmentation is not stringently time-consuming. That is, there might be pauses in the signal that are not labelled (which happens frequently in spontaneous speech). Furthermore, the conventions differ from those of the PHO tier in some points in order to simplify parsing and processing of the tier.
SAP is an exclusively phonemic tier; there is no other information encoded here.

Example:

    SAP: 2541 894 0 m
    SAP: 3435 1140 0 aI
    SAP: 4575 270 0 n
    SAP: 4845 510 1 n
    SAP: 5355 1326 1 a:
    SAP: 6681 795 1 m
    SAP: 7476 277 1 @
    SAP: 7753 0 2 Q-
    SAP: 7753 614 2 I
    SAP: 8367 1457 2 s
    SAP: 9824 0 2 t-
    SAP: 9824 656 3 t
    SAP: 10480 1796 3 s
    SAP: 12276 1953 3 E
    SAP: 14229 988 3 l
    SAP: 15217 535 3 t
    SAP: 15752 370 3 -H
    SAP: 16122 2097 3 h
    SAP: 18219 2608 3 o:
    SAP: 20827 1643 3 f
    SAP: 22470 4265 3 6q

A detailed description of the SAP labelling conventions can be found in [Geumann et al., 1997].

Automatic Broad Phonetic Segmentation by MAUS

• Tier label: MAU
• Class: 4
• Synopsis: (integer) (integer) (symbolic links) (label string)

This tier contains an automatically generated broad phonetic segmentation in units of German SAM-PA. The segmentation is done fully automatically by the MAUS system ([Kipp et al., 1996]). The segmentation is totally time-consuming and the labelling has no direct relation to the tier canonical form as done in the tier SAP. (However, there are symbolic links to the words.) The units are labelled in extended German SAM-PA as in the definition of the SAP tier (see appendix A). Additional labels are <nib> (non-speech event) and <p:> (pause). These labels always get the symbolic link -1 (no link).

Example:

    MAU: 0 676 -1 <p:>
    MAU: 677 7861 -1 <nib>
    MAU: 8539 450 0 g
    MAU: 8990 2436 0 u:
    MAU: 11427 1740 0 t
    MAU: 13168 958 1 d
    MAU: 14127 1298 1 a
    MAU: 15426 3820 1 n
    MAU: 19247 303 2 n
    MAU: 19551 1785 2 e:
    MAU: 21337 624 2 m
    MAU: 21962 636 2 n
    MAU: 22599 501 3 v

Word Segmentation

• Tier label: WOR
• Class: 4
• Synopsis: (integer) (integer) (symbolic links) (word label)

This tier contains a segmentation of the utterance into words or word equivalents. The segmentation need not be stringent. The label string may contain orthographic or pronunciation information (e.g. in SAM-PA).
A '-' at the end of the label string denotes a word that is missing with respect to the reference in the tier canonical form (of course a missing word has zero duration); a leading '-' denotes an inserted word; a '-' between two words (word1-word2) denotes a replacement. The symbolic links give the relation to the tier canonical form. Note that inserted words have a symbolic link to the previous word in the reference tier.

Example:

    WOR: 1245 13245 0 <"ahm>
    WOR: 14490 10787 1 guten
    WOR: 25277 5089 1 -<hm>    # insertion
    WOR: 30366 8786 2 Tag
    WOR: 39152 3089 3 ich

Dialog Act Segmentation

• Tier label: DAS
• Class: 1
• Synopsis: (symbolic links) (marker string)

This tier contains a segmentation into dialog acts according to the ongoing work of the 'Deutsches Forschungszentrum für Künstliche Intelligenz' (DFKI), Saarbrücken, Germany. Each marker covers a portion of the speech signal that is denoted by the symbolic links to the reference tier canonical form. A description of the format can be found in [Jekat et al., 1995] or online in [DAS, 1996].

Example:

    DAS: 0,1,2,3,4,5 @m(REJECT_DATE) @m(GIVE_REASON)
    DAS: 6,7,8,9 @(SUGGEST_SUPPORT_DATE)
    DAS: 10,11,12,13,14 @(REQUEST_SUGGEST_DATE)

Prosodic Segmentation - GTobi

• Tier label: PRB
• Class: 5
• Synopsis: (integer) (symbolic links) (marker string)

This tier contains the prosodic segmentation (done by hand) according to GTobi as defined by the Technical University of Braunschweig, Germany. A detailed description of the GTobi labelling format can be found in [Grice et al., 1995] or online in [PRB, 1996] (German only).

Example:

    PRB: 54212 5 TON: H*; FUN: NA
    PRB: 63269 7 TON: L+H*; FUN: EK
    PRB: 76371 8 BRE: B3; TON: L-L%
    PRB: 79967 8 TON: L*+H; FUN: PA

Easy Processing and Distributed Work

Since the BAS Partitur Format is strictly line-structured, allows only 7-bit ASCII, and attaches no semantic meaning to the order of lines within a file, it is very easy to use standard UNIX text processing tools like gawk, grep or sed to work with data stored in this format.
For example, the following lines of GAWK code analyse a stream of piped-in BAS Partitur files for a certain phoneme, accumulate the total length of its segments and summarise it as a mean value:

    # MAU lines have the fields: label, begin, duration, link, segment
    /^MAU:/ && $5 == "aU" { count++; totallength += $3 }
    END {
        if (count > 0) {    # avoid a division by zero if /aU/ never occurs
            print "Mean duration for /aU/:"
            print totallength / count
        }
    }

In the same manner single BAS Partitur tiers may be selected, updated or filtered using grep, and tiers can easily be transformed into formats suitable for different kinds of visualisation tools (for instance the public domain software package SFS by University College London).

The German Verbmobil project gives a good example of the benefits of using the BAS Partitur Format at different sites on the same data. For instance, the tiers DAS and PRB were defined by partners at DFKI, Saarbrücken, and at the University of Braunschweig respectively. Since such an extension does not require any basic software to be rewritten, these cooperations using the same physical data went very smoothly.

References

[Atmanspacher et al., 1995] S. Atmanspacher, S. Burger, Chr. Draxler, A. Kipp, Chr. Scheer, F. Schiel, M.-B. Wesenick (1995). Partiturformat für die Darstellung unterschiedlicher Repräsentationsebenen von gesprochener Sprache (Verbmobil Memo 90-95). University of Munich, September 1995.
[Burger, 1995] Susanne Burger (1995). Transliterationslexikon (Verbmobil-TechDok 36-95). University of Munich, October 1995. (Online version in English: http://www.phonetik.uni-muenchen.de/VMTraLexeng.html)
[Burger, 1997] Susanne Burger (1997). Transliteration spontansprachlicher Daten Lexikon der Transliterationskonventionen - Verbmobil II (Verbmobil-TechDok 56-97). University of Munich, April 1997. (Online version: http://www.phonetik.uni-muenchen.de/VMtrlex2d.html)
[Geumann et al., 1997] Anja Geumann, Daniela Oppermann, Felix Schaeffer (1997). The Conventions for Phonetic Transcription and Segmentation of German Used for the Munich Verbmobil Corpus (Verbmobil Memo 129-96). University of Munich, December 1997.
[Grice et al., 1995] Martine Grice, Ralf Benzmueller (1995). Transcription of German Intonation using ToBI tones; The Saarbruecken System. Paper presented at the Tutorial Workshop on Discourse and Dialogue Prosody, Stuttgart, February 1995; modified version also in Phonus 1, University of the Saarland, pp. 33-51.
[Jekat et al., 1995] Susanne Jekat, Alexandra Klein, Elisabeth Maier, Ilona Maleck, Marion Mast, Joachim Quantz (1995). Dialogue Acts in Verbmobil (Verbmobil-Report 65). Universität Hamburg, DFKI GmbH, Universität Erlangen, TU Berlin, April 1995.
[Kipp et al., 1996] A. Kipp, M.-B. Wesenick, F. Schiel (1996). Automatic Detection and Segmentation of Pronunciation Variants in German Speech Corpora. In: Proceedings of the ICSLP 1996, Philadelphia, pp. 106-109, October 1996.
[Kohler et al., 1994] Kohler, Lex, Pätzold, Scheffers, Simpson, Thon (1994). Handbuch zur Datenaufnahme und Transliteration in TP14 von VERBMOBIL - 3.0 (Verbmobil-TechDok 11-94). IPDS, University of Kiel, 1994.
[Pompino, 1992] B. Pompino-Marschall (1992). PhonDat - Verbundvorhaben zum Aufbau einer Sprachsignaldatenbank für gesprochenes Deutsch. FIPKM 30/1992, pp. 99-128.
[DAS, 1996] http://www.phonetik.uni-muenchen.de/Bas/BasDialogaktDok/vm-reportfor-partitur 1.html
[PHO, 1995] http://www.phonetik.uni-muenchen.de/Bas/BasFormatsPHOdeu.html
[PRB, 1996] http://www.phonetik.uni-muenchen.de/Bas/BasProsodie.html
[SAM, 1989] http://www.phon.ucl.ac.uk/home/sampa/home.htm
[SAM, 1996] http://www.phonetik.uni-muenchen.de/Bas/BasSAMPA