
LingPy Documentation (Version 2.6.1)

This is the documentation for LingPy 2.6.1, the most recent release of the LingPy library for quantitative tasks in historical linguistics. This release contains few new features; it instead aims to provide a very stable version of the algorithms.

LingPy Documentation
Release 2.6
Johann-Mattis List, Simon Greenhill, and Robert Forkel
2017-11-23

CONTENTS

1 Sequence Modelling
2 Dataset Handling
3 Data Export
4 Sequence Comparison
5 Language Comparison
6 Handling Phylogenetic Trees
7 Plotting Data
8 Evaluation
9 Reference
10 Download
Python Module Index

CHAPTER ONE
SEQUENCE MODELLING

1.1 Sound Classes (sound_classes)

This module provides functions and classes to deal with sound-class sequences. Sound classes go back to an approach by Dolgopolsky1964. The basic idea behind sound classes is to reduce the IPA space of possible phonetic segments, both to guarantee the comparability of sounds between languages and to give some assessment of the probability that sounds belonging to the same class occur in correspondence relations in genetically related languages. More recently, sound classes have been applied in a couple of approaches, including phonetic alignment (see List2012a) and automatic cognate detection (see Turchin2012, List2012b).

1.1.1 Functions

ipa2tokens(istring, **keywords)
    Tokenize IPA-encoded strings.
tokens2class(tokens, model[, stress, ...])
    Convert tokenized IPA strings into their respective class strings.
prosodic_string(string[, _output])
    Create a prosodic string of the sonority profile of a sequence.
prosodic_weights(prostring[, _transform])
    Calculate prosodic weights for each position of a sequence.
class2tokens(tokens, classes[, gap_char, local])
    Turn aligned sound-class sequences into aligned sequences of IPA tokens.
pid(almA, almB[, mode])
    Calculate the Percentage Identity (PID) score for aligned sequence pairs.
get_all_ngrams(sequence[, sort])
    Return all possible n-grams of a given sequence.
sampa2uni(seq)
    Convert a sequence in IPA-SAMPA format to IPA-Unicode.

1.1.2 Classes

Model(model[, path])
    Class for the handling of sound-class models.

1.2 Generate Random Sequences (generate)

1.2.1 Classes

MCBasic(seqs)
    Basic class for creating Markov chains from sequence training data.
MCPhon(words[, tokens, prostrings, classes, ...])
    Class for the creation of phonetic sequences (pseudo-words).

1.3 Generate Orthography Profiles (profile)

1.3.1 Functions

simple_profile(wordlist[, ref, ...])
    Create an initial orthography profile using LingPy's clean_string procedure.
context_profile(wordlist[, ref, col, ...])
    Create an advanced orthography profile with context and doculect information.

1.4 Sound Class Models (Model)

class lingpy.data.model.Model(model, path=None)

Class for the handling of sound-class models.

Parameters

model : { "sca", "dolgo", "asjp", "art", "_color" }
    A string indicating the name of the model which shall be loaded. Select between:
    • sca: the SCA sound-class model (see List2012a),
    • dolgo: the DOLGO sound-class model (see Dolgopolsky1986),
    • asjp: the ASJP sound-class model (see Brown2008 and Brown2011),
    • art: the sound-class model which is used for the calculation of sonority profiles and prosodic strings (see List2012), and
    • _color: the sound-class model which is used for the coloring of sound tokens when creating HTML output.

See also: lingpy.data.derive.compile_model, lingpy.data.derive.compile_dvt

Notes

Models are loaded from binary files which can be found in the data/models/ folder of the LingPy package.
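For instance, a model can be instantiated directly and combined with the sound-class functions from section 1.1. The following is a minimal sketch (the IPA string is purely illustrative; tokens2class should also accept the model name as a plain string):

>>> from lingpy.data.model import Model
>>> from lingpy import ipa2tokens, tokens2class
>>> sca = Model('sca')
>>> tokens = ipa2tokens('tʰɔxtər')  # segment the IPA string into tokens
>>> tokens2class(tokens, sca)       # convert the tokens into SCA class symbols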
A model has two essential attributes:

• converter: a dictionary with IPA tokens as keys and sound-class characters as values, and
• scorer: a scoring dictionary with tuples of sound-class characters as keys and scores (integers or floats) as values.

Examples

When loading LingPy, the models sca, asjp, dolgo, and art are automatically loaded, and they are accessible via the rc() function for global settings:

>>> from lingpy import *
>>> rc('asjp')
<sca-model "asjp">

Define variables for the standard models for convenience:

>>> asjp = rc('asjp')
>>> sca = rc('sca')
>>> dolgo = rc('dolgo')
>>> art = rc('art')

Check how the letter "a" is converted in the various models:

>>> for m in [asjp, sca, dolgo, art]:
...     print('{0} > {1} ({2})'.format('a', m.converter['a'], m.name))
a > a (asjp)
a > A (sca)
a > V (dolgo)
a > 7 (art)

Retrieve basic information on a given model:

>>> print(sca)
Model:    sca
Info:     Extended sound class model based on Dolgopolsky (1986)
Source:   List (2012)
Compiler: Johann-Mattis List
Date:     2012-03

Attributes

converter : dict
    A dictionary with IPA tokens as keys and sound-class characters as values.
scorer : dict
    A scoring dictionary with tuples of sound-class characters as keys and similarity scores as values.
info : dict
    A dictionary storing the key-value pairs defined in the model's INFO file.
name : str
    The name of the model, which is identical with the name of the folder from which the model is loaded.

1.5 Predefined Datasets (data)

LingPy comes with many different kinds of predefined data. When loading the library, the following dictionary is automatically loaded and employed by all LingPy modules:

rcParams : dict
    As an alternative to individual global variables, this dictionary contains all these variables, plus additional ones. It is used for internal coding purposes and stores parameters that are globally set (if not defined otherwise by the user), such as:
    • specific debugging messages (warnings, messages, errors), and
    • default values, such as gop (gap opening penalty), scale (the scaling factor by which extended gaps are penalized), or figsize (the default size of figures if data is plotted using matplotlib).

These default values can be changed with help of the rc() function, which takes any keyword and any variable as input and adds or modifies the specific key of the rcParams dictionary. It also provides more complex functionality that changes whole sets of variables, such as the following statement:

>>> rc(schema="asjp")

which switches the variables asjp, dolgo, etc. to the ASCII-based transcription system of the ASJP project.

If you want to change the content of rcParams directly, you need to import the dictionary explicitly:

>>> from lingpy.settings import rcParams

However, changing the values in the dictionary arbitrarily can produce unexpected behavior, and we recommend using the regular rc() function for this purpose.

lingpy.settings.rc(rval=None, **keywords)

Function changes parameters globally set for LingPy sessions.

Parameters

rval : string (default=None)
    Use this keyword to specify a return value for the rc function.
schema : { "ipa", "asjp" }
    Change the basic schema for sequence comparison. When switching to asjp, sequences will be treated as sequences in ASJP code; otherwise, they will be treated as sequences written in basic IPA.
Notes

This function is the standard way to communicate with the rcParams dictionary, which is not imported by default. If you want to see which parameters there are, you can load the rcParams dictionary directly:

>>> from lingpy.settings import rcParams

However, be careful when changing the values: they might produce unexpected behavior.

Examples

Import LingPy:

>>> from lingpy import *

Switch from IPA transcriptions to ASJP transcriptions:

>>> rc(schema="asjp")

You can check which basic orthography is currently loaded:

>>> rc('basic_orthography')
'asjp'
>>> rc(schema='ipa')
>>> rc('basic_orthography')
'fuzzy'

1.5.1 Functions

rc([rval])
    Function changes parameters globally set for LingPy sessions.

1.6 Creating Sound-Class Models (derive)

1.6.1 Functions

compile_model(model[, path])
    Function compiles customized sound-class models.
compile_dvt([path])
    Function compiles diacritics, vowels, and tones.

CHAPTER TWO
DATASET HANDLING

2.1 Word Lists (wordlist)

Word lists represent the core of LingPy's data model, and a proper understanding of how we deal with word lists is important for automatic cognate detection, alignments, and borrowing detection. The basic class that handles word lists is the Wordlist class, which is also the base class of the LexStat class for automatic cognate detection and of the Alignments class for multiple alignment of cognate words.

2.1.1 Functions

get_wordlist(path[, delimiter, quotechar, ...])
    Load a wordlist from a normal CSV file.

2.1.2 Classes

Wordlist(filename[, row, col, conf])
    Basic class for the handling of multilingual word lists.

2.2 Cognate Detection (sanity)

2.2.1 Functions

mutual_coverage(wordlist[, concepts])
    Compute mutual coverage for all language pairs in your data.
mutual_coverage_check(wordlist, threshold[, ...])
    Check whether a given mutual coverage is fulfilled by the dataset.
mutual_coverage_subset(wordlist, threshold)
    Compute maximal mutual coverage for all languages in a wordlist.
synonymy(wordlist[, concepts, languages])
    Check the number of synonyms per language and concept.

CHAPTER THREE
DATA EXPORT

3.1 Converting Data to Strings (strings)

The strings module provides some general and some specific functions which allow converting data into strings that can then be imported by other software tools. You can import it by typing:

>>> from lingpy.convert.strings import *

Or by typing:

>>> from lingpy.convert import strings

Most of the functions are used internally, being triggered when writing, for example, data from a lingpy.basic.wordlist.Wordlist object to file. They can, however, also be used directly, and especially the lingpy.convert.strings.write_nexus function may prove useful to get a more flexible nexus output of wordlist data.

3.1.1 Functions

scorer2str(scorer)
    Convert a scoring function to a string.
msa2str(msa[, wordlist, comment, _arange, merge])
    Function converts an MSA object into a string.
matrix2dst(matrix[, taxa, stamp, filename, ...])
    Convert a matrix to DST format.
pap2nex(taxa, paps[, missing, filename, ...])
    Function converts a list of paps into nexus file format.
pap2csv(taxa, paps[, filename])
    Write paps created by the Wordlist class to a CSV file.
multistate2nex(taxa, matrix[, filename, missing])
    Convert the data in a given wordlist to NEXUS format for multistate analyses in PAUP.
write_nexus(wordlist[, mode, filename, ref, ...])
    Write a nexus file for phylogenetic analyses.
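For instance, a nexus file for an external phylogenetics package can be produced along the following lines. This is a hedged sketch: the input file name is hypothetical, and the mode value shown here is only one of the modes supported by write_nexus:

>>> from lingpy import Wordlist
>>> from lingpy.convert.strings import write_nexus
>>> wl = Wordlist('polynesian.tsv')  # hypothetical wordlist file
>>> nex = write_nexus(wl, mode='MRBAYES', filename='polynesian.nex')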
3.2 Converting Data to CLDF (cldf)

3.2.1 Functions

to_cldf(wordlist[, path, source_path, ref, ...])
    Convert a wordlist in LingPy to CLDF.
from_cldf(path[, to])
    Load data from CLDF into a LingPy Wordlist object or similar.

CHAPTER FOUR
SEQUENCE COMPARISON

4.1 Helper Functions for SCA Alignment (calign and misc)

The helper functions and classes below play an important role in all SCA alignment algorithms in LingPy (List2012b). They are implemented both in pure Python and in Cython (the latter only supported for Python 3), in order to allow for faster implementations of the core alignment functions. Instead of using these functions directly, we recommend using the more general functions in the pairwise and multiple modules of LingPy, which are based on the helper functions listed below.

4.1.1 Functions

globalign
    Carry out global alignment of two sequences.
secondary_globalign
    Carry out global alignment of two sequences with secondary sequence structures.
localign
    Carry out local alignment of two sequences.
secondary_localign
    Carry out local alignment of two sequences with sensitivity to secondary sequence structures.
semi_globalign
    Carry out semi-global alignment of two sequences.
secondary_semi_globalign
    Carry out semi-global alignment of two sequences with sensitivity to secondary sequence structures.
dialign
    Carry out dialign alignment of two sequences.
secondary_dialign
    Carry out dialign alignment of two sequences with sensitivity to secondary sequence structures.
align_pair
    Align a pair of sequences.
align_pairwise
    Align a list of sequences pairwise.
align_pairs
    Align multiple sequence pairs.
align_profile
    Align two profiles using the basic modes.
score_profile
    Basic function for the scoring of profiles.
swap_score_profile
    Basic function for the scoring of profiles which contain swapped sequences.
corrdist
    Create a correspondence distribution for a given language pair.

4.1.2 Classes

ScoreDict
    Class allows quick access to scoring functions using dictionary syntax.

4.2 Miscellaneous Helper Functions (malign)

The helper functions below are miscellaneous deep implementations of alignment and string-similarity algorithms. They are implemented both in pure Python and in Cython (the latter only supported for Python 3), in order to allow for faster implementations of the core alignment functions. Instead of using these functions directly, we recommend using the more general functions in the pairwise and multiple modules of LingPy, which are based on the helper functions listed below.

4.2.1 Functions

edit_dist
    Return the edit distance between two strings.
nw_align
    Align two sequences using the Needleman-Wunsch algorithm.
restricted_edit_dist
    Return the restricted edit distance between two strings.
structalign
    Carry out a structural alignment analysis using Dijkstra's algorithm.
sw_align
    Align two sequences using the Smith-Waterman algorithm.
we_align
    Align two sequences using the Waterman-Eggert algorithm.
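As recommended above, the high-level wrappers in the pairwise module are the more convenient entry point for these algorithms. A brief sketch (the nw_align output shown here is the same as in the globalign example of the reference chapter; the edit_dist result is not asserted):

>>> from lingpy.align.pairwise import nw_align, edit_dist
>>> nw_align('abab', 'baba')
(['a', 'b', 'a', 'b', '-'], ['-', 'b', 'a', 'b', 'a'], 1)
>>> edit_dist('abab', 'baba')  # plain Levenshtein distance between the two strings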
4.3 Helper Functions for Traditional Alignment (talign)

The helper functions and classes below play an important role in traditional alignment algorithms in LingPy which do not make use of sound classes. They are implemented both in pure Python and in Cython (the latter only supported for Python 3), in order to allow for faster implementations of the core alignment functions. Instead of using these functions directly, we recommend using the more general functions in the pairwise and multiple modules of LingPy, which are based on the helper functions listed below.

4.3.1 Functions

globalign
    Carry out global alignment of two sequences.
localign
    Carry out local alignment of two sequences.
semi_globalign
    Carry out semi-global alignment of two sequences.
dialign
    Carry out dialign alignment of two sequences.
align_pair
    Align a pair of sequences.
align_pairwise
    Align all sequences pairwise.
align_pairs
    Align multiple sequence pairs.
align_profile
    Align two profiles using the basic modes.
score_profile
    Basic function for the scoring of profiles.
swap_score_profile
    Basic function for the scoring of profiles in swapped sequences.

4.4 Pairwise Alignment (pairwise)

4.4.1 Functions

nw_align(seqA, seqB[, scorer, gap])
    Carry out the traditional Needleman-Wunsch algorithm.
sw_align(seqA, seqB[, scorer, gap])
    Carry out the traditional Smith-Waterman algorithm.
we_align(seqA, seqB[, scorer, gap])
    Carry out the traditional Waterman-Eggert algorithm.
edit_dist(seqA, seqB[, normalized, restriction])
    Return the edit distance between two strings.
SCA(infile, **keywords)
    Method returns alignment objects depending on input file or input data.

4.4.2 Classes

Pairwise(seqs[, seqB])
    Basic class for the handling of pairwise sequence alignments (PSA).
PSA(infile, **keywords)
    Basic class for dealing with the pairwise alignment of sequences.

4.5 Multiple Alignment (multiple)

4.5.1 Functions

mult_align(seqs[, gop, scale, tree_calc, ...])
    A short-cut method for multiple alignment analyses.
SCA(infile, **keywords)
    Method returns alignment objects depending on input file or input data.

4.5.2 Classes

Multiple(seqs, **keywords)
    Basic class for multiple sequence alignment analyses.
MSA(infile, **keywords)
    Basic class for carrying out multiple sequence alignment analyses.
Alignments(infile[, row, col, conf, ...])
    Class handles Wordlists for the purpose of alignment analyses.
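A quick multiple alignment can be obtained with the mult_align shortcut. A minimal sketch (the three strings are toy data, and the assumption here is that mult_align returns the aligned sequences as lists of tokens):

>>> from lingpy.align.multiple import mult_align
>>> alms = mult_align(['woldemort', 'waldemar', 'vladimir'])
>>> for alm in alms:
...     print('\t'.join(alm))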
CHAPTER FIVE
LANGUAGE COMPARISON

5.1 Cluster Algorithms (clustering and extra)

5.1.1 Functions

flat_cluster(method, threshold, matrix[, ...])
    Carry out a flat cluster analysis based on linkage algorithms.
flat_upgma(threshold, matrix[, taxa, revert])
    Carry out a flat cluster analysis based on the UPGMA algorithm (Sokal1958).
fuzzy(threshold, matrix, taxa[, method, revert])
    Create fuzzy clusters of a given distance matrix.
link_clustering(threshold, matrix, taxa[, ...])
    Carry out a link clustering analysis using the method by Ahn2010.
mcl(threshold, matrix, taxa[, max_steps, ...])
    Carry out a clustering using the MCL algorithm (Dongen2000).
neighbor(matrix, taxa[, distances])
    Function clusters data according to the Neighbor-Joining algorithm (Saitou1987).
upgma(matrix, taxa[, distances])
    Carry out a cluster analysis based on the UPGMA algorithm (Sokal1958).
infomap_clustering(threshold, matrix[, ...])
    Compute the Infomap clustering analysis of the data.
affinity_propagation(threshold, matrix, taxa)
    Compute affinity propagation from the matrix.
valid_cluster(sequence)
    Only allow sequences which have consecutive ordering of elements.
generate_all_clusters(numbers)
    Generate all possible clusters for a number of elements.
generate_random_cluster(numbers[, bias])
    Generate a random cluster for a number of elements.
order_cluster(clr)
    Order a cluster into the form of a valid cluster.
mutate_cluster(clr[, chance])
    Mutate a cluster.

5.2 Cognate Detection (LexStat)

class lingpy.compare.lexstat.LexStat(filename, **keywords)

Basic class for automatic cognate detection.

Parameters

filename : str
    The name of the file that shall be loaded.
model : Model
    The sound-class model that shall be used for the analysis. Defaults to the SCA sound-class model.
merge_vowels : bool (default=True)
    Indicate whether consecutive vowels should be merged into single tokens or kept apart as separate tokens.
transform : dict
    A dictionary that indicates how prosodic strings should be simplified (or generally transformed), using a simple key-value structure with the key referring to the original prosodic context and the value to the new value. Currently, prosodic strings (see prosodic_string()) offer 11 different prosodic contexts. Since not all of these are helpful in preliminary analyses for cognate detection, it is useful to merge some of these contexts into one. The default settings distinguish only 5 instead of the 11 available contexts, namely:
    • C for all consonants in prosodically ascending position,
    • c for all consonants in prosodically descending position,
    • V for all vowels,
    • T for all tones, and
    • _ for word-breaks.
    Make sure to also check the vowels keyword when initialising a LexStat object, since the symbols you use for vowels and tones should be identical with the ones you define in your transform dictionary.
vowels : str (default="VT_")
    For scoring-function creation using the get_scorer function, you have the possibility to use reduced scores for the matching of tones and vowels by modifying the vscale parameter, which is set to 0.5 by default. In order to make sure that vowels and tones are properly detected, make sure your prosodic string representation of vowels matches the one in this keyword. Thus, if you change the prosodic strings using the transform keyword, you also need to change the vowels string to make sure that vscale works as intended in the get_scorer function.
check : bool (default=False)
    If set to True, the input file will first be checked for errors before the calculation is carried out. Errors will be written to the error log (see the errors keyword), defaulting to errors.log. See also apply_checks.
apply_checks : bool (default=False)
    If set to True, any errors identified by check will be handled silently.
no_bscorer : bool (default=False)
    If set to True, this will suppress the creation of a language-specific scoring function (which may become quite large and is additional ballast if the lexstat method is not used after all). If you use the lexstat method, however, this needs to be set to False.
errors : str
    The name of the error log.
segments : str (default="tokens")
    The name of the column in your data which contains the segmented transcriptions, or in which the segmented transcriptions should be placed.
transcription : str (default="ipa")
    The name of the column in your data which contains the unsegmented transcriptions.
classes : str (default="classes")
    The name of the column in the data which contains the sound-class representation of the transcriptions, or in which this information shall be placed after automatic conversion.
numbers : str (default="numbers")
    The language-specific triples consisting of language id (numeric), sound-class string (one character only), and prosodic string (one character only). Usually, numbers are automatically created from the columns classes, prostrings, and langid, but you can also provide them in your data.
langid : str (default="langid")
    Name of the column that contains a numerical language identifier, needed to produce the language-specific character triples (numbers). Unless specified explicitly, this is automatically created.
prostrings : str (default="prostrings")
    Name of the column containing prosodic strings (see List2014d for more details) of the segmented transcriptions, with one character per segment. Prostrings add a contextual component to phonetic sequences. They are automatically created, but can likewise be submitted with the initial data.
weights : str (default="weights")
    The name of the column which stores the individual gap weights for each sequence. Gap weights are positive floats for each segment in a string, which modify the gap opening penalty during alignment.
tokenize : function (default=ipa2tokens)
    The function which should be used to tokenize the entries in the column storing the transcriptions in case no segmentation is provided by the user.
get_prostring : function (default=prosodic_string)
    The function which should be used to create prosodic strings from the segmented transcription data. If you want to completely ignore prosodic strings in LexStat calculations, you could just pass the following function:

    >>> lex = LexStat('inputfile.tsv', get_prostring=lambda x: ["x" for y in x])

Notes

Instantiating this class does not require a lot of parameters. However, the user may modify its behaviour by providing additional attributes in the input file.

Attributes

pairs : dict
    A dictionary with tuples of language names as keys and indices as values, pointing to unique combinations of words with the same meaning in all language pairs.
model : Model
    The sound-class model instance which serves to convert the phonetic data into sound classes.
chars : list
    A list of all unique language-specific character types in the instantiated LexStat object. The characters in this list consist of:
    • the language identifier (numeric, referenced as langid as a default, but customizable via the keyword langid),
    • the sound-class symbol for the respective IPA transcription value, and
    • the prosodic class value.
    All values are represented in the above order as one string, separated by a dot. Gaps are also included in this collection. They are traditionally represented as X for the sound class and - for the prosodic string.
rchars : list
    A list containing all unique character types across languages. In contrast to the chars attribute, the rchars (raw chars) do not contain the language identifier; thus, they only consist of two values, separated by a dot, namely the sound-class symbol and the prosodic class value.
scorer : dict
    A collection of ScoreDict objects, which are used to score the strings. LexStat distinguishes two different scoring functions:
    • rscorer: a raw scorer that is not language-specific and consists only of sound-class values and prosodic string values.
      This scorer is traditionally used to carry out the first alignment in order to calculate the language-specific scorer. It is directly accessible as an attribute of the LexStat class (rscorer). The characters which constitute the values in this scorer are accessible via the rchars attribute of each LexStat class.
    • bscorer: the language-specific scorer, created from the sound correspondences attested in the data by means of the get_scorer method (suppressed if no_bscorer is set to True).

Methods

align_pairs(idxA, idxB[, concept])
    Align all or some words of a given pair of languages.
cluster([method, cluster_method, threshold, ...])
    Function for flat clustering of words into cognate sets.
get_distances([method, mode, gop, scale, ...])
    Method calculates different distance estimates for language pairs.
get_random_distances([method, runs, mode, ...])
    Method calculates random scores for unrelated words in a dataset.
get_scorer(**keywords)
    Create a scoring function based on sound correspondences.
output(fileformat, **keywords)
    Write data to file.

Inherited Wordlist Methods

pickle([filename])
    Store the QLCParser instance in a pickle file.
get_entries(entry)
    Return all entries matching the given entry-type as a two-dimensional list.
add_entries(entry, source, function[, override])
    Add new entry-types to the word list by modifying given ones.
calculate(data[, taxa, concepts, ref])
    Function calculates specific data.
export(fileformat[, sections, entries, ...])
    Export the wordlist to specific file formats.
get_dict([col, row, entry])
    Function returns dictionaries of the cells matched by the indices.
get_etymdict([ref, entry, modify_ref])
    Return an etymological dictionary representation of the word list.
get_list([row, col, entry, flat])
    Function returns lists of rows and columns specified by their name.
get_paps([ref, entry, missing, modify_ref])
    Function returns a list of present-absent patterns of a given word list.
output(fileformat, **keywords)
    Write wordlist to file.
renumber(source[, target, override])
    Renumber a given set of string identifiers by replacing the ids by integers.
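A typical cognate detection workflow with this class is sketched below. This is a hedged illustration: the file name is hypothetical, and the threshold of 0.6 is merely a common starting point, not a recommendation made by this reference:

>>> from lingpy import LexStat
>>> lex = LexStat('wordlist.tsv')     # hypothetical input wordlist
>>> lex.get_scorer(runs=1000)         # create the language-specific scorer
>>> lex.cluster(method='lexstat', threshold=0.6, ref='cogid')
>>> lex.output('tsv', filename='wordlist-cognates')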
5.3 Partial Cognate Detection (Partial)

class lingpy.compare.partial.Partial(infile, **keywords)

Extended class for automatic detection of partial cognates.

Parameters

filename : str
    The name of the file that shall be loaded.
model, merge_vowels, transform, vowels, check, apply_checks, no_bscorer, errors
    These keywords are identical with the corresponding LexStat parameters; see section 5.2 for their full descriptions.

Notes

This method automatically infers partial cognate sets from data which was previously morphologically segmented.

Attributes

The attributes (pairs, model, chars, rchars, and scorer with its raw rscorer and language-specific bscorer) are likewise identical with those of the LexStat class; see section 5.2.
Methods

partial_cluster([method, threshold, scale, ...])
    Cluster the words into partial cognate sets.
add_cognate_ids(source, target[, idtype, ...])
    Compute normal cognate identifiers from partial cognate sets.

Inherited LexStat Methods

align_pairs, cluster, get_distances, get_random_distances, get_scorer, output
    See the corresponding LexStat methods in section 5.2.

Inherited Wordlist Methods

pickle, get_entries, add_entries, calculate, export, get_dict, get_etymdict, get_list, get_paps, output, renumber
    See the corresponding Wordlist methods in section 5.2.
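The partial cognate workflow parallels the LexStat workflow sketched in section 5.2. A hedged sketch (file and column names are illustrative; the input must already be morphologically segmented):

>>> from lingpy.compare.partial import Partial
>>> part = Partial('segmented.tsv')   # hypothetical, morpheme-segmented wordlist
>>> part.get_scorer(runs=1000)
>>> part.partial_cluster(method='lexstat', threshold=0.55, ref='partialids')
>>> part.add_cognate_ids('partialids', 'cogids', idtype='strict')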
5.4 Borrowing Detection (phylogeny)

class lingpy.compare.phylogeny.PhyBo(dataset, tree=None, paps='pap', ref='cogid', tree_calc='neighbor', output_dir=None, **keywords)

Basic class for calculations using the TreBor method.

Parameters

dataset : string
    Name of the dataset that shall be analyzed.
tree : { None, string }
    Name of the tree file.
paps : string (default="pap")
    Name of the column that stores the specific cognate IDs, consisting of an arbitrary integer key and a key for the concept.
ref : string (default="cogid")
    Name of the column that stores the general cognate ids (the reference of the analysis).
tree_calc : { "neighbor", "upgma" } (default="neighbor")
    Select the algorithm to be used for the tree calculation if no tree is passed with the file.
missing : int (default=-1)
    Specify how missing data should be handled. If set to -1, missing data can account for both presence and absence of a cognate set in the given language. If set to 0, missing data is treated as absence.
degree : int (default=100)
    The degree which is chosen for the projection of the tree layout.

Methods

analyze([runs, mixed, output_gml, tar, ...])
    Carry out a full analysis using various parameters.
get_AVSD(glm, **keywords)
    Function retrieves all paps for ancestor languages in a given tree.
get_CVSD()
    Calculate the Contemporary Vocabulary Size Distribution (CVSD).
get_GLS([mode, ratio, restriction, ...])
    Create gain-loss scenarios for all non-singleton paps in the data.
get_IVSD([output_gml, output_plot, tar, ...])
    Calculate VSD on the basis of each item.
get_MLN(glm[, threshold, method])
    Compute a Minimal Lateral Network for a given model.
get_MSN([glm, external_edges, deep_nodes])
    Plot the Minimal Spatial Network.
get_PDC(glm, **keywords)
    Calculate Patchily Distributed Cognates.
get_edge(glm, nodeA, nodeB[, entries, msn])
    Return the edge data for a given gain-loss model.
get_stats(glm[, subset, filename])
    Calculate basic statistics for a given gain-loss model.
plot_MLN([glm, fileformat, threshold, ...])
    Plot the MLN with help of Matplotlib.
plot_MSN([glm, fileformat, threshold, ...])
    Plot a minimal spatial network.
plot_concept_evolution(glm[, concept, ...])
    Plot the evolution of specific concepts along the reference tree.
plot_two_concepts(concept, cogA, cogB[, ...])
    Plot the evolution of two concepts in space.

Inherited Methods

pickle, get_entries, add_entries, calculate, export, get_dict, get_etymdict, get_list, get_paps, output, renumber
    See the corresponding Wordlist methods in section 5.2.
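A minimal PhyBo run is sketched below. This is a hedged illustration: the dataset name is hypothetical, and analyze() is shown with its default parameters only:

>>> from lingpy.compare.phylogeny import PhyBo
>>> phy = PhyBo('dataset', ref='cogid')  # hypothetical dataset with cognate ids
>>> phy.analyze()                        # infer gain-loss scenarios with defaults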
CHAPTER SIX
HANDLING PHYLOGENETIC TREES

6.1 Trees (Tree)

6.1.1 Functions

random_tree(taxa[, branch_lengths])
    Create a random tree from a list of taxa.

6.1.2 Classes

Tree(tree, **keywords)
    Basic class for the handling of phylogenetic trees.

CHAPTER SEVEN
PLOTTING DATA

7.1 Plotting Data and Results (plot)

The plot module provides some general and some specific functions for the plotting of data and results. This module is not imported by default, so you need to import it explicitly by typing:

>>> from lingpy.convert.plot import *

Or by typing:

>>> from lingpy.convert import plot

7.1.1 Functions

plot_gls(gls, treestring[, degree, fileformat])
    Plot a gain-loss scenario for a given reference tree.
plot_tree(treestring[, degree, fileformat, root])
    Plot a Newick tree to PDF or other graphical formats.
plot_concept_evolution(scenarios, tree[, ...])
    Plot the evolution according to the MLN method of all words for a given concept.
plot_heatmap(wordlist[, filename, ...])
    Create a heatmap representation of shared cognates for a given wordlist.

CHAPTER EIGHT
EVALUATION

8.1 Automatic Cognate Detection (acd)

This module provides functions that can be used to evaluate how well algorithms perform in the task of automatic cognate detection.

8.1.1 Functions

bcubes(wordlist[, gold, test, modify_ref, ...])
    Compute B-Cubed scores for test and reference datasets.
partial_bcubes(wordlist, gold, test[, pprint])
    Compute B-Cubed scores for test and reference datasets for partial cognate detection.
pairs(lex[, gold, test, modify_ref, pprint, ...])
    Compute pair scores for the evaluation of cognate detection algorithms.
diff(wordlist[, gold, test, modify_ref, ...])
    Write differences in classifications on an item basis to file.
npoint_ap(scores, cognates[, reverse])
    Calculate the n-point average precision.
random_cognates(wordlist[, ref, bias])
    Populate a wordlist with random cognates for each entry.
extreme_cognates(wordlist[, ref, bias])
    Return extreme cognates, either lumping all words together or splitting them all.

8.2 Automatic Linguistic Reconstruction (acd)

This module provides functions that can be used to evaluate how well algorithms perform in the task of automatic linguistic reconstruction.

8.2.1 Functions

mean_edit_distance(wordlist[, gold, test, ...])
    Function computes the edit distance between gold standard and test set.

8.3 Automatic Phonetic Alignment (apa)

This module provides functions that can be used to evaluate how well algorithms perform in the task of automatic phonetic alignment analyses.

8.3.1 Classes

EvalPSA(gold, test)
    Base class for the evaluation of automatic pairwise sequence analyses.
EvalMSA(gold, test)
    Base class for the evaluation of automatic multiple sequence analyses.
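As an illustration of the evaluation functions from section 8.1, B-Cubed scores for a LexStat analysis (for a LexStat object lex analyzed as sketched in section 5.2) can be computed as follows; the column names are assumptions about the dataset at hand:

>>> from lingpy.evaluate.acd import bcubes
>>> p, r, f = bcubes(lex, gold='cogid', test='lexstatid')  # precision, recall, F-score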
CHAPTER NINE
REFERENCE

9.1 Reference

9.1.1 lingpy package

Subpackages

lingpy.algorithm package

Subpackages

lingpy.algorithm.cython package

Submodules

lingpy.algorithm.cython.calign module

lingpy.algorithm.cython.calign.align_pair()

Align a pair of sequences.

Parameters

seqA, seqB : list
    The lists containing the sequences.
gopA, gopB : list
    The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
    The prosodic strings, which have the same length as seqA and seqB.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in seqA and seqB.
mode : { "global", "local", "overlap", "dialign" }
    Select one of the four basic modes for alignment analyses.
restricted_chars : str
    The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.
distance : int (default=0)
    Select whether you want to calculate the normalized distance or the similarity between two strings (following Downey2008 for normalization).

Returns

alignment : tuple
    The aligned sequences and the similarity or distance.

Notes

This is a utility function that allows calling any of the four classical alignment functions (lingpy.algorithm.cython.calign.globalign, lingpy.algorithm.cython.calign.semi_globalign, lingpy.algorithm.cython.calign.localign, lingpy.algorithm.cython.calign.dialign) and their secondary counterparts.

lingpy.algorithm.cython.calign.align_pairs()

Align multiple sequence pairs.

Parameters

seqs : list
    A two-dimensional list containing one pair of sequences each.
gops : list
    The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
pros : list
    The prosodic strings, which have the same length as the sequences.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in the sequences.
mode : { "global", "local", "overlap", "dialign" }
    Select one of the four basic modes for alignment analyses.
restricted_chars : str
    The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.
distance : int (default=0)
    Select whether you want to calculate the normalized distance or the similarity between two strings (following Downey2008 for normalization). If you set this value to 2, both distances and similarities will be returned.

Returns

alignments : list
    A list of tuples of size 3 or 4, containing the alignments and the similarity or the distance (or both, if distance is set to 2).

Notes

This function computes alignments of all pairs passed in the list of sequence pairs (a two-dimensional list with two sequences each) and is basically used in LingPy's module for cognate detection (lingpy.compare.lexstat.LexStat).

lingpy.algorithm.cython.calign.align_pairwise()

Align a list of sequences pairwise.

Parameters

seqs : list
    The list containing the sequences.
gops : list
    The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
pros : list
    The prosodic strings, which have the same length as the sequences.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in the sequences.
mode : { "global", "local", "overlap", "dialign" }
    Select one of the four basic modes for alignment analyses.
r : str
    The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.
Returns

alignments : list
    A list of tuples of size 4, containing the alignment, the similarity, and the distance for each sequence pair.

Notes

This function computes alignments of all possible pairs passed in the list of sequences and is basically used in LingPy's module for multiple alignment analyses (lingpy.align.multiple).

lingpy.algorithm.cython.calign.align_profile()

Align two profiles using the basic modes.

Parameters

profileA, profileB : list
    Two-dimensional lists for each of the profiles.
gopA, gopB : list
    The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
    The prosodic strings, which have the same length as profileA and profileB.
gop : int
    The general gap opening penalty which will be used to introduce a gap between the two profiles.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in the two profiles.
restricted_chars : str
    The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence. They need to be derived by computing a consensus string from all prosodic strings in the profile.
mode : { "global", "local", "overlap", "dialign" }
    Select one of the four basic modes for alignment analyses.
gap_weight : float
    This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

Returns

alignment : tuple
    The aligned profiles and the overall similarity of the profiles.

Notes

This function computes alignments of two profiles of multiple sequences (see Durbin2002 for details on profiles) and is basically used in LingPy's module for multiple alignment (lingpy.align.multiple).

lingpy.algorithm.cython.calign.corrdist()

Create a correspondence distribution for a given language pair.

Parameters

threshold : float
    The threshold of sequence distance which determines whether a sequence pair is included in or excluded from the calculation of the distribution.
seqs : list
    The sequences, passed as a two-dimensional list of sequence pairs.
gops : list
    The gap opening penalties, passed as individual lists of penalties for each sequence.
pros : list
    The list of prosodic strings for each sequence.
gop : int
    The general gap opening penalty which will be used to introduce a gap between the two profiles.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in the two profiles.
mode : { "global", "local", "overlap", "dialign" }
    Select one of the four basic modes for alignment analyses.
restricted_chars : str
    The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.
    They need to be derived by computing a consensus string from all prosodic strings in the profile.

Returns

results : tuple
    A dictionary containing the distribution, and the number of included sequences.

Notes

This function is the core of the LexStat method for computing distributions of aligned segment pairs.

lingpy.algorithm.cython.calign.dialign()

Carry out dialign alignment of two sequences.

Parameters

seqA, seqB : list
    The lists containing the sequences.
proA, proB : str
    The prosodic strings, which have the same length as seqA and seqB.
M, N : int
    The lengths of seqA and seqB.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in seqA and seqB.

Returns

alignment : tuple
    A tuple of the two alignments and the alignment score.

Notes

This is the function that is called to carry out local dialign alignment analyses (keyword "dialign") when using many of LingPy's classes for alignment analyses, like Pairwise, Multiple, or LexStat. Dialign (see Morgenstern1996) is an alignment algorithm that does not require gap penalties and generally works in a rather local fashion.

lingpy.algorithm.cython.calign.globalign()

Carry out global alignment of two sequences.

Parameters

seqA, seqB : list
    The lists containing the sequences.
gopA, gopB : list
    The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
    The prosodic strings, which have the same length as seqA and seqB.
M, N : int
    The lengths of seqA and seqB.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in seqA and seqB.

Returns

alignment : tuple
    A tuple of the two alignments and the alignment score.

Notes

This is the function that is called to carry out global alignment analyses when using many of LingPy's classes for alignment analyses, like Pairwise, Multiple, or LexStat. It differs from classical Needleman-Wunsch alignment (compare Needleman1970) in a couple of aspects. These include, among others, the use of a gap extension scale rather than a gap extension penalty (the scale consecutively reduces the gap penalty and thus lets gap penalties approach zero if gapped regions are large), the use of individual gap opening penalties for all positions of a sequence, and the use of prosodic strings and prosodic factors that raise scores when segments occur in the same prosodic environment.
If one sets certain of these parameters to zero or one and uses the same gap opening penalties, however, the function will behave like the traditional Needleman-Wunsch algorithm, and since it is implemented in Cython, it will work faster than a pure Python implementation of the alignment algorithm.

Examples

We show that the Needleman-Wunsch algorithm yields the same result as the globalign algorithm, provided we adjust the parameters:

>>> from lingpy.algorithm.cython.calign import globalign
>>> from lingpy.align.pairwise import nw_align
>>> nw_align('abab', 'baba')
(['a', 'b', 'a', 'b', '-'], ['-', 'b', 'a', 'b', 'a'], 1)
>>> globalign(list('abab'), list('baba'), 4 * [-1], 4 * [-1], 'aaaa', 'aaaa', 4, 4, 1, 0,
...     {("a", "b"): -1, ("b", "a"): -1, ("a", "a"): 1, ("b", "b"): 1})
(['a', 'b', 'a', 'b', '-'], ['-', 'b', 'a', 'b', 'a'], 1.0)

lingpy.algorithm.cython.calign.localign()

Carry out local alignment of two sequences.

Parameters

seqA, seqB : list
    The lists containing the sequences.
gopA, gopB : list
    The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
    The prosodic strings, which have the same length as seqA and seqB.
M, N : int
    The lengths of seqA and seqB.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in seqA and seqB.

Returns

alignment : tuple
    A tuple of the two alignments and the alignment score. The alignments are each a list of suffix, alignment, and prefix.

Notes

This is the function that is called to carry out local alignment analyses when using many of LingPy's classes for alignment analyses, like Pairwise, Multiple, or LexStat. Local alignment means that only the best-matching substring between two sequences is returned (compare Smith1981); this is also called the Smith-Waterman algorithm.

lingpy.algorithm.cython.calign.score_profile()

Basic function for the scoring of profiles.

Parameters

colA, colB : list
    The two columns of a profile.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in the two profiles.
gap_weight : float (default=0.0)
    This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

Returns

score : float
    The score for the profile.

Notes

This function handles how profiles are scored.

lingpy.algorithm.cython.calign.secondary_dialign()

Carry out dialign alignment of two sequences with sensitivity to secondary sequence structures.

Parameters

seqA, seqB : list
    The lists containing the sequences.
proA, proB : str
    The prosodic strings, which have the same length as seqA and seqB.
M, N : int
    The lengths of seqA and seqB.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, ScoreDict }
    The scoring function, which needs to provide scores for all segments in seqA and seqB.
r : str
    The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.

Returns

alignment : tuple
    A tuple of the two alignments and the alignment score.

Notes

This is the function that is called to carry out local dialign alignment analyses (keyword "dialign") when using many of LingPy's classes for alignment analyses and which is at the same time sensitive to secondary sequence structures (see the description of secondary alignment in List2014d for details), like Pairwise, Multiple, or LexStat. Dialign (see Morgenstern1996) is an alignment algorithm that does not require gap penalties and generally works in a rather local fashion.

lingpy.algorithm.cython.calign.secondary_globalign()

Carry out global alignment of two sequences with secondary sequence structures.

Parameters

seqA, seqB : list
    The lists containing the sequences.
gopA, gopB : list
    The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
    The prosodic strings, which have the same length as seqA and seqB.
M, N : int
    The lengths of seqA and seqB.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in seqA and seqB.
r : str
    The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.

Returns

alignment : tuple
    A tuple of the two alignments and the alignment score.

Notes

This is the function that is called to carry out global alignment analyses when using many of LingPy's classes for alignment analyses and which is at the same time sensitive to secondary sequence structures (see the description of secondary alignment in List2014d for details), like Pairwise, Multiple, or LexStat. It differs from classical Needleman-Wunsch alignment (compare Needleman1970) in a couple of aspects. These include, among others, the use of a gap extension scale rather than a gap extension penalty (the scale consecutively reduces the gap penalty and thus lets gap penalties approach zero if gapped regions are large), the use of individual gap opening penalties for all positions of a sequence, and the use of prosodic strings and prosodic factors that raise scores when segments occur in the same prosodic environment. If one sets certain of these parameters to zero or one and uses the same gap opening penalties, however, the function will behave like the traditional Needleman-Wunsch algorithm, and since it is implemented in Cython, it will work faster than a pure Python implementation of the alignment algorithm.
lingpy.algorithm.cython.calign.secondary_globalign()
Carry out global alignment of two sequences with sensitivity to secondary sequence structures.

Parameters
seqA, seqB : list
    The lists containing the sequences.
gopA, gopB : list
    The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
    The prosodic strings, which have the same length as seqA and seqB.
M, N : int
    The lengths of seqA and seqB.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in seqA and seqB.
r : str
    The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.

Returns
alignment : tuple
    A tuple of the two alignments and the alignment score.

Notes
This is the function that is called to carry out global alignment analyses by many of LingPy's alignment classes, such as Pairwise, Multiple, or LexStat, and it is at the same time sensitive to secondary sequence structures (see the description of secondary alignment in List2014d for details). It differs from classical Needleman-Wunsch alignment (compare Needleman1970) in a couple of aspects. These include, among others, the use of a gap extension scale rather than a gap extension penalty (the scale consecutively reduces the gap penalty and thus lets gap penalties approach zero if gapped regions are large), the use of individual gap opening penalties for all positions of a sequence, and the use of prosodic strings and prosodic factors that raise scores when segments occur in the same prosodic environment. If one sets certain of these parameters to zero or one and uses identical gap opening penalties, however, the function behaves like the traditional Needleman-Wunsch algorithm, and since it is implemented in Cython, it works faster than a pure Python implementation.

Examples
We compare globalign with secondary_globalign:

>>> from lingpy.algorithm.cython.calign import globalign, secondary_globalign
>>> globalign(list('abab'), list('baba'), 4 * [-1], 4 * [-1], 'aaaa', 'aaaa', 4, 4, 1, 0, {("a","b"): -1, ("b","a"): -1, ("a","a"): 1, ("b","b"): 1})
(['a', 'b', 'a', 'b', '-'], ['-', 'b', 'a', 'b', 'a'], 1.0)
>>> secondary_globalign(list('ab.ab'), list('ba.ba'), 5 * [-1], 5 * [-1], 'ab.ab', 'ba.ba', 5, 5, 1, 0, {("a","b"): -1, ("b","a"): -1, ("a","a"): 1, ("b","b"): 1, ("a","."): -1, ("b","."): -1, (".","."): 0, (".","b"): -1, (".","a"): -1}, '.')
(['a', 'b', '-', '.', 'a', 'b', '-'], ['-', 'b', 'a', '.', '-', 'b', 'a'], -2.0)

lingpy.algorithm.cython.calign.secondary_localign()
Carry out local alignment of two sequences with sensitivity to secondary sequence structures.

Parameters
seqA, seqB : list
    The lists containing the sequences.
gopA, gopB : list
    The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
    The prosodic strings, which have the same length as seqA and seqB.
M, N : int
    The lengths of seqA and seqB.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in seqA and seqB.
r : str
    The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.

Returns
alignment : tuple
    A tuple of the two alignments and the alignment score. The alignments are each a list of suffix, alignment, and prefix.

Notes
This is the function that is called to carry out local alignment analyses by many of LingPy's alignment classes, such as Pairwise, Multiple, or LexStat, and it is at the same time sensitive to secondary sequence structures (see the description of secondary alignment in List2014d for details). Local alignment means that only the best-matching substring between two sequences is returned (compare Smith1981); this is also called the Smith-Waterman algorithm.
lingpy.algorithm.cython.calign.secondary_semi_globalign()
Carry out semi-global alignment of two sequences with sensitivity to secondary sequence structures.

Parameters
seqA, seqB : list
    The lists containing the sequences.
gopA, gopB : list
    The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
    The prosodic strings, which have the same length as seqA and seqB.
M, N : int
    The lengths of seqA and seqB.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in seqA and seqB.
r : str
    The string containing restricted characters. Restricted characters occur, as a rule, in the prosodic strings, not in the normal sequence.

Returns
alignment : tuple
    A tuple of the two alignments and the alignment score.

Notes
This is the function that is called to carry out semi-global alignment analyses (keyword overlap) by many of LingPy's alignment classes, such as Pairwise, Multiple, or LexStat, and it is at the same time sensitive to secondary sequence structures (see the description of secondary alignment in List2014d for details). Semi-global alignment means that the suffixes or prefixes in one of the words are not penalized.

lingpy.algorithm.cython.calign.semi_globalign()
Carry out semi-global alignment of two sequences.

Parameters
seqA, seqB : list
    The lists containing the sequences.
gopA, gopB : list
    The gap opening penalties (individual for each sequence, therefore passed as a list of floats or integers).
proA, proB : str
    The prosodic strings, which have the same length as seqA and seqB.
M, N : int
    The lengths of seqA and seqB.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
factor : float
    The factor by which matches are increased when two segments occur in the same prosodic position of an alignment.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in seqA and seqB.

Returns
alignment : tuple
    A tuple of the two alignments and the alignment score.

Notes
This is the function that is called to carry out semi-global alignment analyses (keyword overlap) by many of LingPy's alignment classes, such as Pairwise, Multiple, or LexStat, and it is at the same time sensitive to secondary sequence structures (see the description of secondary alignment in List2014d for details). Semi-global alignment means that the suffixes or prefixes in one of the words are not penalized.

Examples
We compare globalign with semi_globalign:

>>> from lingpy.algorithm.cython.calign import globalign, semi_globalign
>>> globalign(list('abab'), list('baba'), 4 * [-1], 4 * [-1], 'aaaa', 'aaaa', 4, 4, 1, 0, {("a","b"): -1, ("b","a"): -1, ("a","a"): 1, ("b","b"): 1})
(['a', 'b', 'a', 'b', '-'], ['-', 'b', 'a', 'b', 'a'], 1.0)
>>> semi_globalign(list('abab'), list('baba'), 4 * [-1], 4 * [-1], 'aaaa', 'aaaa', 4, 4, 1, 0, {("a","b"): -1, ("b","a"): -1, ("a","a"): 1, ("b","b"): 1})
(['a', 'b', 'a', 'b', '-'], ['-', 'b', 'a', 'b', 'a'], 3.0)

lingpy.algorithm.cython.calign.swap_score_profile()
Basic function for the scoring of profiles which contain swapped sequences.

Parameters
colA, colB : list
    The two columns of a profile.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in the two profiles.
gap_weight : float (default=0.0)
    This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.
swap_penalty : int (default=-5)
    The swap penalty applied to swapped columns.

Returns
score : float
    The score for the profile.

Notes
This function handles how profiles with swapped segments are scored.
lingpy.algorithm.cython.cluster module

lingpy.algorithm.cython.cluster.flat_cluster()
Carry out a flat cluster analysis based on the UPGMA algorithm.

Parameters
method : str { upgma, single, complete }
    Select between upgma, single, and complete.
threshold : float
    The threshold which terminates the algorithm.
matrix : list or numpy.array
    A two-dimensional list containing the distances.
taxa : list (default=[])
    A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.

Returns
clusters : dict
    A dictionary with cluster IDs as keys and a list of the taxa corresponding to the respective ID as values.

Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
array([[ 0.  ,  0.5 ,  0.67,  0.8 ,  0.2 ],
       [ 0.5 ,  0.  ,  0.4 ,  0.7 ,  0.6 ],
       [ 0.67,  0.4 ,  0.  ,  0.8 ,  0.8 ],
       [ 0.8 ,  0.7 ,  0.8 ,  0.  ,  0.3 ],
       [ 0.2 ,  0.6 ,  0.8 ,  0.3 ,  0.  ]])
Carry out the flat cluster analysis.
>>> flat_cluster('upgma', 0.5, matrix, taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}

lingpy.algorithm.cython.cluster.flat_upgma()
Carry out a flat cluster analysis based on the UPGMA algorithm (Sokal1958).

Parameters
threshold : float
    The threshold which terminates the algorithm.
matrix : list or numpy.array
    A two-dimensional list containing the distances.
taxa : list (default=[])
    A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.

Returns
clusters : dict
    A dictionary with cluster IDs as keys and a list of the taxa corresponding to the respective ID as values.

Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
array([[ 0.  ,  0.5 ,  0.67,  0.8 ,  0.2 ],
       [ 0.5 ,  0.  ,  0.4 ,  0.7 ,  0.6 ],
       [ 0.67,  0.4 ,  0.  ,  0.8 ,  0.8 ],
       [ 0.8 ,  0.7 ,  0.8 ,  0.  ,  0.3 ],
       [ 0.2 ,  0.6 ,  0.8 ,  0.3 ,  0.  ]])
Carry out the flat cluster analysis.
>>> flat_upgma(0.5,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}

lingpy.algorithm.cython.cluster.neighbor()
Function clusters data according to the Neighbor-Joining algorithm (Saitou1987).

Parameters
matrix : list or numpy.array
    A two-dimensional list containing the distances.
taxa : list
    A list containing the names of all taxa corresponding to the distances in the matrix.
distances : bool
    If set to False, only the topology of the tree will be returned.

Returns
newick : str
    A string in Newick format which can be further used in biological software packages to view and plot the tree.

Examples
The function is automatically imported when importing lingpy.
>>> from lingpy import *
Create an arbitrary list of taxa.
>>> taxa = ['Norwegian','Swedish','Icelandic','Dutch','English']
Create an arbitrary matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
Carry out the cluster analysis.
>>> neighbor(matrix,taxa)
'(((Norwegian,(Swedish,Icelandic)),English),Dutch);'

lingpy.algorithm.cython.cluster.upgma()
Carry out a cluster analysis based on the UPGMA algorithm (Sokal1958).

Parameters
matrix : list or numpy.array
    A two-dimensional list containing the distances.
taxa : list
    A list containing the names of all taxa corresponding to the distances in the matrix.
distances : bool
    If set to False, only the topology of the tree will be returned.

Returns
newick : str
    A string in Newick format which can be further used in biological software packages to view and plot the tree.

Examples
The function is automatically imported when importing lingpy.
>>> from lingpy import *
Create an arbitrary list of taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
Carry out the cluster analysis.
>>> upgma(matrix,taxa,distances=False)
'((Swedish,Icelandic),(English,(German,Dutch)));'
lingpy.algorithm.cython.compilePYX module
Script handles the compilation of Cython files to C and also to C-extension modules.

lingpy.algorithm.cython.compilePYX.main()

lingpy.algorithm.cython.compilePYX.pyx2py(infile, debug=False)

lingpy.algorithm.cython.malign module
This module provides various alignment functions in an optimized version.

lingpy.algorithm.cython.malign.edit_dist()
Return the edit distance between two strings.

Parameters
seqA, seqB : list
    The sequences to be aligned, passed as lists.
normalized : bool
    Indicate whether you want the normalized or the unnormalized edit distance to be returned.

Returns
dist : { int, float }
    Either the normalized or the unnormalized edit distance.

lingpy.algorithm.cython.malign.nw_align()
Align two sequences using the Needleman-Wunsch algorithm.

Parameters
seqA, seqB : list
    The sequences to be aligned, passed as lists.
scorer : dict
    A dictionary containing tuples of two segments as keys and numbers as values.
gap : int
    The gap penalty.

Returns
alignment : tuple
    A tuple of the two aligned sequences and the similarity score.

Notes
This function is a very straightforward implementation of the Needleman-Wunsch algorithm (Needleman1970). We recommend using it if you want to test your own scoring dictionaries and profit from a fast implementation (as we use Cython, the implementation is indeed faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the NW algorithm without specifying a scoring dictionary, we recommend having a look at our wrapper function of the same name in the pairwise module.

lingpy.algorithm.cython.malign.restricted_edit_dist()
Return the restricted edit distance between two strings.

Parameters
seqA, seqB : list
    The two sequences, passed as lists.
resA, resB : str
    The restrictions, each passed as a string with the same length as the corresponding sequence. We note a restriction if the strings show different symbols in their restriction string. If the symbols are identical, it is modeled as a non-restriction.
normalized : bool
    Determine whether you want to return the normalized or the unnormalized edit distance.

Notes
Restrictions follow the definition of Heeringa2006: segments that are not allowed to match are given a penalty of ∞. We model restrictions as strings, for example consisting of the letters c and v. So the sequence woldemort could be modeled as cvccvcvcc, and when aligning it with the sequence walter and its restriction string cvccvc, the matching of segments at positions where the restriction strings differ would be heavily penalized, thus prohibiting an alignment of vowels and consonants (v and c).
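The worked example from the notes can be written as a small sketch (assuming the positional argument order seqA, seqB, resA, resB, normalized documented above; no output is asserted here):

>>> from lingpy.algorithm.cython.malign import restricted_edit_dist
>>> seqA, resA = list('woldemort'), 'cvccvcvcc'
>>> seqB, resB = list('walter'), 'cvccvc'
>>> # vowels (v) may only be matched with vowels, consonants (c) with consonants
>>> restricted_edit_dist(seqA, seqB, resA, resB, False)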
lingpy.algorithm.cython.malign.structalign()
Carry out a structural alignment analysis using Dijkstra's algorithm.

Parameters
seqA, seqB : str
    The input sequences.
restricted_chars : str (default= )
    The characters which are used to separate secondary from primary segments in the input sequences. Currently, the use of restricted chars may fail to yield an alignment.

Notes
Structural alignment is hereby understood as an alignment of two sequences whose alphabets differ. The algorithm returns all alignments with minimal edit distance. Edit distance in this context refers to the number of edit operations that are needed in order to convert one sequence into the other, with repeated edit operations being penalized only once.

lingpy.algorithm.cython.malign.sw_align()
Align two sequences using the Smith-Waterman algorithm.

Parameters
seqA, seqB : list
    The sequences to be aligned, passed as lists.
scorer : dict
    A dictionary containing tuples of two segments as keys and numbers as values.
gap : int
    The gap penalty.

Returns
alignment : tuple
    A tuple of the two aligned sequences and the similarity score.

Notes
This function is a very straightforward implementation of the Smith-Waterman algorithm (Smith1981). We recommend using it if you want to test your own scoring dictionaries and profit from a fast implementation (as we use Cython, the implementation is indeed faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the SW algorithm without specifying a scoring dictionary, we recommend having a look at our wrapper function of the same name in the pairwise module.
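A minimal sketch of a call, reusing the toy scorer from the calign examples above (the returned alignment covers only the best-matching substrings of the two sequences; the exact output format follows the description above, so no output is asserted):

>>> from lingpy.algorithm.cython.malign import sw_align
>>> scorer = {("a","b"): -1, ("b","a"): -1, ("a","a"): 1, ("b","b"): 1}
>>> sw_align(list('abab'), list('baba'), scorer, -1)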
lingpy.algorithm.cython.malign.we_align()
Align two sequences using the Waterman-Eggert algorithm.

Parameters
seqA, seqB : list
    The input sequences, passed as lists.
scorer : dict
    A dictionary containing tuples of two segments as keys and numbers as values.
gap : int
    The gap penalty.

Returns
alignments : list
    A list consisting of tuples. Each tuple gives the alignment of one of the subsequences of the input sequences and contains the aligned part of the first sequence, the aligned part of the second sequence, and the score of the alignment.

Notes
This function is a very straightforward implementation of the Waterman-Eggert algorithm (Waterman1987). We recommend using it if you want to test your own scoring dictionaries and profit from a fast implementation (as we use Cython, the implementation is indeed faster than pure Python implementations, as long as you use Python 3 and have Cython installed). If you want to test the WE algorithm without specifying a scoring dictionary, we recommend having a look at our wrapper function of the same name in the pairwise module.

lingpy.algorithm.cython.misc module

class lingpy.algorithm.cython.misc.ScoreDict
Bases: object
Class allows quick access to scoring functions using dictionary syntax.

Parameters
chars : list
    The list of all character tokens for the scoring dictionary.
matrix : list
    A two-dimensional scoring matrix.

Notes
Since this class has dictionary syntax, you can always also just create a dictionary in order to store your scoring functions. Scoring dictionaries should contain a tuple of the segments to be compared as a key, and a float or integer as a value, with negative values indicating dissimilarity and positive values similarity.

Examples
Initialize a ScoreDict object:
>>> from lingpy.algorithm.cython.misc import ScoreDict
>>> scorer = ScoreDict(['a', 'b'], [1, -1, -1, 1])
Retrieve scores:
>>> scorer['a', 'b']
-1
>>> scorer['a', 'a']
1
>>> scorer['a', 'X']
-22.5

lingpy.algorithm.cython.misc.squareform()
A simplified version of the scipy.spatial.distance.squareform() function.

Parameters
x : numpy.array or list
    The one-dimensional flat representation of a symmetric distance matrix.

Returns
matrix : numpy.array
    The two-dimensional redundant representation of a symmetric distance matrix.
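As a sketch of the expansion (the flat input lists the upper triangle row by row; whether a list or a numpy array is returned may differ between builds, so the result is only given as a comment):

>>> from lingpy.algorithm.cython.misc import squareform
>>> squareform([1, 2, 3])  # expected: [[0, 1, 2], [1, 0, 3], [2, 3, 0]]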
lingpy.algorithm.cython.misc.transpose()
Transpose a matrix along its two dimensions.

Parameters
matrix : list
    A two-dimensional list.

lingpy.algorithm.cython.talign module

lingpy.algorithm.cython.talign.align_pair()
Align a pair of sequences.

Parameters
seqA, seqB : list
    The sequences to be aligned, passed as lists.
gop : int
    The gap opening penalty.
scale : float
    The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
    The scoring dictionary containing scores for all possible segment combinations in the two sequences.
mode : { global, local, overlap, dialign }
    Select the mode for the alignment analysis (overlap refers to semi-global alignments).
distance : int (default=0)
    Select whether you want distances or similarities to be returned (0 indicates similarities, 1 indicates distances, 2 indicates both).

Returns
alignment : tuple
    The aligned sequences and the similarity or distance scores, or both.

Notes
This is a utility function that allows calling any of the four classical alignment functions (lingpy.algorithm.cython.talign.globalign, lingpy.algorithm.cython.talign.semi_globalign, lingpy.algorithm.cython.talign.localign, lingpy.algorithm.cython.talign.dialign) and their secondary counterparts.
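A minimal sketch of a call, reusing the toy scorer from the calign examples above (argument order as documented; the concrete score depends on the chosen gap parameters, so no output is asserted here):

>>> from lingpy.algorithm.cython.talign import align_pair
>>> scorer = {("a","b"): -1, ("b","a"): -1, ("a","a"): 1, ("b","b"): 1}
>>> align_pair(list('abab'), list('baba'), -1, 0.5, scorer, 'global', 0)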
lingpy.algorithm.cython.talign.align_pairs()
Align multiple sequence pairs.

Parameters
seqs : list
    The sequence pairs to be aligned, passed as lists.
gop : int
    The gap opening penalty.
scale : float
    The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
    The scoring dictionary containing scores for all possible segment combinations in the two sequences.
mode : { global, local, overlap, dialign }
    Select the mode for the alignment analysis (overlap refers to semi-global alignments).
distance : int (default=0)
    Indicate whether distances or similarities should be returned.

Returns
alignments : list
    A list of tuples containing the aligned sequences and the similarity or the distance scores.

Notes
This function aligns all pairs which are passed to it.

lingpy.algorithm.cython.talign.align_pairwise()
Align all sequences pairwise.

Parameters
seqs : list
    The sequences to be aligned, passed as lists.
gop : int
    The gap opening penalty.
scale : float
    The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
    The scoring dictionary containing scores for all possible segment combinations in the two sequences.
mode : { global, local, overlap, dialign }
    Select the mode for the alignment analysis (overlap refers to semi-global alignments).

Returns
alignments : list
    A list of tuples containing the aligned sequences, the similarity, and the distance scores.

Notes
This function aligns all possible pairs between the sequences you pass to it. It is important for multiple alignment, where it can be used to construct the guide tree.

lingpy.algorithm.cython.talign.align_profile()
Align two profiles using the basic modes.

Parameters
profileA, profileB : list
    A two-dimensional list for each of the profiles.
gop : int
    The gap opening penalty.
scale : float
    The gap extension scale by which consecutive gaps are reduced. LingPy uses a scale rather than a constant gap extension penalty.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in the two profiles.
mode : { global, overlap, dialign }
    Select one of the basic modes for alignment analyses.
gap_weight : float
    This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

Returns
alignment : tuple
    The aligned profiles and the overall similarity of the profiles.

Notes
This function computes alignments of two profiles of multiple sequences (see Durbin2002 for details on profiles) and is important for multiple alignment analyses.

lingpy.algorithm.cython.talign.dialign()
Carry out dialign alignment of two sequences.

Parameters
seqA, seqB : list
    The sequences to be aligned, passed as lists.
M, N : int
    The lengths of the two sequences.
scale : float
    The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
    The scoring dictionary containing scores for all possible segment combinations in the two sequences.

Returns
alignment : tuple
    The aligned sequences and the similarity score.

Notes
This algorithm carries out dialign alignment (Morgenstern1996).

lingpy.algorithm.cython.talign.globalign()
Carry out global alignment of two sequences.

Parameters
seqA, seqB : list
    The sequences to be aligned, passed as lists.
M, N : int
    The lengths of the two sequences.
gop : int
    The gap opening penalty.
scale : float
    The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
    The scoring dictionary containing scores for all possible segment combinations in the two sequences.

Returns
alignment : tuple
    The aligned sequences and the similarity score.

Notes
This algorithm carries out classical Needleman-Wunsch alignment (Needleman1970).

lingpy.algorithm.cython.talign.localign()
Carry out local alignment of two sequences.

Parameters
seqA, seqB : list
    The sequences to be aligned, passed as lists.
M, N : int
    The lengths of the two sequences.
gop : int
    The gap opening penalty.
scale : float
    The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
    The scoring dictionary containing scores for all possible segment combinations in the two sequences.

Returns
alignment : tuple
    The aligned sequences and the similarity score.

Notes
This algorithm carries out local alignment (Smith1981).

lingpy.algorithm.cython.talign.score_profile()
Basic function for the scoring of profiles.

Parameters
colA, colB : list
    The two columns of a profile.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in the two profiles.
gap_weight : float (default=0.0)
    This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.

Returns
score : float
    The score for the profile.

Notes
This function handles how profiles are scored.

lingpy.algorithm.cython.talign.semi_globalign()
Carry out semi-global alignment of two sequences.

Parameters
seqA, seqB : list
    The sequences to be aligned, passed as lists.
M, N : int
    The lengths of the two sequences.
gop : int
    The gap opening penalty.
scale : float
    The gap extension scale.
scorer : { dict, ~lingpy.algorithm.cython.misc.ScoreDict }
    The scoring dictionary containing scores for all possible segment combinations in the two sequences.

Returns
alignment : tuple
    The aligned sequences and the similarity score.

Notes
This algorithm carries out semi-global alignment (Durbin2002).

lingpy.algorithm.cython.talign.swap_score_profile()
Basic function for the scoring of profiles which contain swapped sequences.

Parameters
colA, colB : list
    The two columns of a profile.
scorer : { dict, lingpy.algorithm.cython.misc.ScoreDict }
    The scoring function, which needs to provide scores for all segments in the two profiles.
gap_weight : float (default=0.0)
    This handles the weight that is given to gaps in a column. If you set it to 0, for example, this means that all gaps will be ignored when determining the score for two columns in the profile.
swap_penalty : int (default=-5)
    The swap penalty applied to swapped columns.

Returns
score : float
    The score for the profile.

Notes
This function handles how profiles with swapped segments are scored.

Module contents
Package provides modules for time-consuming routines.

Submodules

lingpy.algorithm.cluster_util module
Various utility functions which are useful for algorithmic operations.

lingpy.algorithm.cluster_util.generate_all_clusters(numbers)
Generate all possible clusters for a number of elements.

Returns
clr : iterator
    An iterator that will yield the next of all possible clusters.

lingpy.algorithm.cluster_util.generate_random_cluster(numbers, bias=False)
Generate a random cluster for a number of elements.

Parameters
numbers : int
    Number of separate entities which should be clustered.
bias : str (default=False)
    When set to lumper, the function will tend to create larger groups; when set to splitter, it will tend to produce smaller groups.

Returns
cluster : list
    A list with consecutive ordering of clusters, starting from zero.

lingpy.algorithm.cluster_util.mutate_cluster(clr, chance=0.5)
Mutate a cluster.

Parameters
clr : cluster
    A list with ordered clusters.
chance : float (default=0.5)
    The mutation rate for each element in a cluster. If set to 0.5, this means that in 50% of the cases an element will be assigned to another cluster or a new cluster.

Returns
valid_cluster : list
    A newly clustered list in consecutive order.

lingpy.algorithm.cluster_util.order_cluster(clr)
Order a cluster into the form of a valid cluster.

Parameters
clr : list
    A list with clusters assigned by giving each element a specific cluster ID.

Returns
valid_cluster : list
    A list in which the IDs start from zero and increase consecutively with each new cluster introduced.
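A small sketch, grounded in the valid_cluster example below (the invalid sequence [1, 1, 2, 3] should be renumbered so that cluster IDs start from zero and increase consecutively):

>>> from lingpy.algorithm.cluster_util import order_cluster
>>> order_cluster([1, 1, 2, 3])  # expected: [0, 0, 1, 2]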
lingpy.algorithm.cluster_util.valid_cluster(sequence)
Check that a sequence shows consecutive ordering of its elements.

Parameters
sequence : list
    A cluster sequence in which elements should be consecutively ordered, starting from 0, and identical segments in the sequence receive the same number.

Returns
valid_cluster : bool
    True if the cluster is valid, and False if it is judged to be invalid.

Examples
We define a valid and an invalid cluster sequence:

>>> clrA = [0, 1, 2, 3]
>>> clrB = [1, 1, 2, 3]  # should be [0, 0, 1, 2]
>>> from lingpy.algorithm.cluster_util import valid_cluster
>>> valid_cluster(clrA)
True
>>> valid_cluster(clrB)
False

lingpy.algorithm.clustering module
Module provides general clustering functions for LingPy.

lingpy.algorithm.clustering.best_threshold(matrix, trange=(0.3, 0.7, 0.05))
Calculate the best threshold by maximizing partition density for a given range of thresholds.

Notes
This method makes use of the idea of partition density proposed in Ahn2010.

lingpy.algorithm.clustering.check_taxon_names(taxa)

lingpy.algorithm.clustering.find_threshold(matrix, thresholds=[i * 0.05 for i in range(1, 19)][::-1], logs=True)
Use a variant of the method by Apeltsin2011 in order to find an optimal threshold.

Parameters
matrix : list
    The distance matrix for which the threshold shall be determined.
thresholds : list (default=[i * 0.05 for i in range(1, 19)][::-1])
    The range of thresholds that shall be tested.
logs : { bool, builtins.function } (default=True)
    If set to True, the logarithm of the score beyond the threshold will be assigned as weight to the graph. If set to False, all weights will be set to 1. Use a custom function to define individual ways to calculate the weights.

Returns
threshold : { float, None }
    If a float is returned, this is the threshold identified by the method. If None is returned, no threshold could be identified.

Notes
This is a very simple method that may not work well depending on the dataset. We therefore recommend using it with great care.

lingpy.algorithm.clustering.flat_cluster(method, threshold, matrix, taxa=None, revert=False)
Carry out a flat cluster analysis based on linkage algorithms.

Parameters
method : { upgma, single, complete, ward }
    Select between upgma, single, and complete. You can also test ward, but there is no guarantee that this is the correct algorithm.
threshold : float
    The threshold which terminates the algorithm.
matrix : list
    A two-dimensional list containing the distances.
taxa : list (default=None)
    A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.

Returns
clusters : dict
    A dictionary with cluster IDs as keys and a list of the taxa corresponding to the respective ID as values.

See also
flat_cluster, flat_upgma, fuzzy, link_clustering, mcl

Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the flat cluster analysis.
>>> flat_cluster('upgma',0.6,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}
lingpy.algorithm.clustering.flat_upgma(threshold, matrix, taxa=None, revert=False)
Carry out a flat cluster analysis based on the UPGMA algorithm (Sokal1958).

Parameters
threshold : float
    The threshold which terminates the algorithm.
matrix : list
    A two-dimensional list containing the distances.
taxa : list (default=None)
    A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.

Returns
clusters : dict
    A dictionary with cluster IDs as keys and a list of the taxa corresponding to the respective ID as values.

See also
flat_cluster, flat_upgma, fuzzy, link_clustering, mcl

Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the flat cluster analysis.
>>> flat_upgma(0.6,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}

lingpy.algorithm.clustering.fuzzy(threshold, matrix, taxa, method='upgma', revert=False)
Create a fuzzy cluster of a given distance matrix.

Parameters
threshold : float
    The threshold that shall be used for the basic clustering of the data.
matrix : list
    A two-dimensional list containing the distances.
taxa : list
    A list containing the names of all taxa corresponding to the distances in the matrix.
method : { upgma, single, complete } (default=upgma)
    Select the method for the flat cluster analysis.
distances : bool
    If set to False, only the topology of the tree will be returned.
revert : bool (default=False)
    Specify whether a reverted dictionary should be returned.

Returns
cluster : dict
    A dictionary with cluster IDs as keys and a list as value, containing the taxa that are assigned to a given cluster ID.

See also
link_clustering

Notes
This is a very simple fuzzy clustering algorithm. It basically does nothing more than remove taxa successively from the matrix, flat-cluster the remaining taxa with the corresponding threshold, and then return a combined consensus cluster in which taxa may be assigned to multiple clusters.

Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the fuzzy flat cluster analysis.
>>> fuzzy(0.5,matrix,taxa)
{1: ['Swedish', 'Icelandic'], 2: ['Dutch', 'German'], 3: ['Dutch', 'English']}
lingpy.algorithm.clustering.link_clustering(threshold, matrix, taxa, link_threshold=False, revert=False, matrix_type='distances', fuzzy=True)
Carry out a link clustering analysis using the method by Ahn2010.

Parameters
threshold : { float, bool }
    The threshold that shall be used for the initial selection of links assigned to the data. If set to False, the weights from the matrix will be used directly.
matrix : list
    A two-dimensional list containing the distances.
taxa : list
    A list containing the names of all taxa corresponding to the distances in the matrix.
link_threshold : float (default=0.5)
    The threshold that shall be used for the internal clustering of the data.
matrix_type : { distances, similarities, weights } (default=distances)
    Specify the type of the matrix. If the matrix contains distance data, it will be adapted to similarity data. If it contains similarities, no adaptation is needed. If it contains weights, a weighted version of link clustering (see the supplementary material in Ahn2010 for details) will be carried out.

Returns
cluster : dict
    A dictionary with cluster IDs as keys and a list as value, containing the taxa that are assigned to a given cluster ID.

See also
fuzzy

Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the link-clustering analysis.
>>> link_clustering(0.5,matrix,taxa)
{1: ['Dutch', 'English', 'German'], 2: ['Icelandic', 'Swedish']}

lingpy.algorithm.clustering.matrix2groups(threshold, matrix, taxa, cluster_method='upgma')
Calculate a flat cluster of a distance matrix.

Parameters
threshold : float
    The threshold to be used for the calculation.
matrix : list
    The distance matrix to be used.
taxa : list
    A list of the taxa in the distance matrix.
cluster_method : { upgma, mcl, single, complete } (default=upgma)

Returns
groups : dict
    A dictionary with the taxa as keys and the group assignments as values.

Notes
This function is important for internal calculations within wordlist. It is not recommended for further use.

lingpy.algorithm.clustering.matrix2tree(matrix, taxa, tree_calc='neighbor', distances=True, filename='')
Calculate a tree of a given distance matrix.

Parameters
matrix : list
    The distance matrix to be used.
taxa : list
    A list of the taxa in the distance matrix.
tree_calc : str (default=neighbor)
    The method for tree calculation that shall be used. Select between:
    • neighbor: the Neighbor-Joining method (Saitou1987)
    • upgma: the UPGMA method (Sokal1958)
distances : bool (default=True)
    If set to True, distances will be included in the tree representation.
filename : str (default=)
    If a filename is specified, the data will be written to that file.

Returns
tree : ~lingpy.thirdparty.cogent.tree.PhyloNode
    A ~lingpy.thirdparty.cogent.tree.PhyloNode object for handling tree files.
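A minimal usage sketch, reusing the distance matrix from the examples above (it is assumed here that printing a PhyloNode yields its Newick representation):

>>> from lingpy.algorithm.clustering import matrix2tree
>>> from lingpy.algorithm import squareform
>>> taxa = ['German', 'Swedish', 'Icelandic', 'English', 'Dutch']
>>> matrix = squareform([0.5, 0.67, 0.8, 0.2, 0.4, 0.7, 0.6, 0.8, 0.8, 0.3])
>>> tree = matrix2tree(matrix, taxa, tree_calc='upgma', distances=False)
>>> print(tree)  # Newick string, as with upgma(matrix, taxa, distances=False)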
lingpy.algorithm.clustering.mcl(threshold, matrix, taxa, max_steps=1000, inflation=2, expansion=2, add_self_loops=True, revert=False, logs=True, matrix_type='distances')
Carry out a clustering using the MCL algorithm (Dongen2000).

Parameters
threshold : { float, bool }
    The threshold that shall be used for the initial selection of links assigned to the data. If set to False, the weights from the matrix will be used directly.
matrix : list
    A two-dimensional list containing the distances.
taxa : list
    A list containing the names of all taxa corresponding to the distances in the matrix.
max_steps : int (default=1000)
    The maximal number of iterations.
inflation : int (default=2)
    The inflation parameter of the MCL algorithm.
expansion : int (default=2)
    The expansion parameter of the MCL algorithm.
add_self_loops : { True, False, builtins.function } (default=True)
    Determine whether self-loops should be added, and if so, how they should be weighted. If a function for the calculation of self-loops is given, it will take the whole column of the matrix for each taxon as input.
logs : { bool, function } (default=True)
    If set to True, the logarithm of the score beyond the threshold will be assigned as weight to the graph. If set to False, all weights will be set to 1. Use a custom function to define individual ways to calculate the weights.
matrix_type : { distances, similarities }
    Specify the type of the matrix. If the matrix contains distance data, it will be adapted to similarity data. If it contains similarities, no adaptation is needed.

Examples
The function is automatically imported along with LingPy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the MCL cluster analysis.
>>> mcl(0.5,matrix,taxa)
{1: ['German', 'English', 'Dutch'], 2: ['Swedish', 'Icelandic']}

lingpy.algorithm.clustering.neighbor(matrix, taxa, distances=True)
Function clusters data according to the Neighbor-Joining algorithm (Saitou1987).

Parameters
matrix : list
    A two-dimensional list containing the distances.
taxa : list
    A list containing the names of all taxa corresponding to the distances in the matrix.
distances : bool (default=True)
    If set to False, only the topology of the tree will be returned.

Returns
newick : str
    A string in Newick format which can be further used in biological software packages to view and plot the tree.

See also
upgma

Examples
The function is automatically imported when importing lingpy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create an arbitrary list of taxa.
>>> taxa = ['Norwegian','Swedish','Icelandic','Dutch','English']
Create an arbitrary matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
Carry out the cluster analysis.
>>> neighbor(matrix,taxa)
'(((Norwegian,(Swedish,Icelandic)),English),Dutch);'

lingpy.algorithm.clustering.partition_density(matrix, t)
Calculate the partition density for a given threshold on a distance matrix.

Notes
See Ahn2012 for details on the calculation of partition density in a given network.

lingpy.algorithm.clustering.upgma(matrix, taxa, distances=True)
Carry out a cluster analysis based on the UPGMA algorithm (Sokal1958).

Parameters
matrix : list
    A two-dimensional list containing the distances.
taxa : list
    A list containing the names of all taxa corresponding to the distances in the matrix.
distances : bool (default=True)
    If set to False, only the topology of the tree will be returned.

Returns
newick : str
    A string in Newick format which can be further used in biological software packages to view and plot the tree.

See also
neighbor

Examples
The function is automatically imported when importing lingpy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create an arbitrary list of taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
Carry out the cluster analysis.
>>> upgma(matrix,taxa,distances=False)
'((Swedish,Icelandic),(English,(German,Dutch)));'

lingpy.algorithm.extra module
Adapting specific cluster algorithms from scikit-learn to LingPy.

lingpy.algorithm.extra.affinity_propagation(threshold, matrix, taxa, revert=False)
Compute affinity propagation from the matrix.

Parameters
threshold : float
    The threshold for clustering you want to use.
matrix : list
    The two-dimensional matrix, passed as list or array.
taxa : list
    The list of taxon names. If set to False, a fake list of taxon names will be created, giving a positive numerical ID in increasing order for each column in the matrix.
revert : bool
    If set to False, don't return taxon names but simply the language identifiers and their labels as a dictionary. Otherwise, return a dictionary with labels as keys and lists of taxon names as values.

Returns
clusters : dict
    Either a dictionary of taxon identifiers and labels, or a dictionary of labels and taxon names.

Notes
Affinity propagation is a clustering method originally proposed by Frey2007. Requires the scikit-learn package, downloadable from http://scikit-learn.org/.

lingpy.algorithm.extra.dbscan(threshold, matrix, taxa, revert=False, min_samples=1)
Compute a DBSCAN cluster analysis.

Parameters
threshold : float
    The threshold for clustering you want to use.
matrix : list
    The two-dimensional matrix, passed as list or array.
taxa : list
    The list of taxon names. If set to False, a fake list of taxon names will be created, giving a positive numerical ID in increasing order for each column in the matrix.
revert : bool
    If set to False, don't return taxon names but simply the language identifiers and their labels as a dictionary. Otherwise, return a dictionary with labels as keys and lists of taxon names as values.
min_samples : int (default=1)
    The minimal samples parameter of the DBSCAN method from the scikit-learn package.

Returns
clusters : dict
    Either a dictionary of taxon identifiers and labels, or a dictionary of labels and taxon names.

Notes
This method does not work as expected, probably since it normally requires distances between points as input. We list it only for completeness here, but urge users to be careful when using the code and to properly check our implementation in the source code. Requires the scikit-learn package, downloadable from http://scikit-learn.org/.

lingpy.algorithm.extra.infomap_clustering(threshold, matrix, taxa=False, revert=False)
Compute the Infomap clustering analysis of the data.

Parameters
threshold : float
    The threshold for clustering you want to use.
matrix : list
    The two-dimensional matrix, passed as list or array.
taxa : list
    The list of taxon names. If set to False, a fake list of taxon names will be created, giving a positive numerical ID in increasing order for each column in the matrix.
revert : bool
    If set to False, don't return taxon names but simply the language identifiers and their labels as a dictionary. Otherwise, return a dictionary with labels as keys and lists of taxon names as values.

Returns
clusters : dict
    Either a dictionary of taxon identifiers and labels, or a dictionary of labels and taxon names.

Notes
Infomap clustering is a community detection method originally proposed by Rosvall2008. Requires the igraph package, downloadable from http://igraph.org/.
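A minimal usage sketch, reusing the distance matrix from the clustering examples above (the concrete cluster labels depend on the community detection run, so no output is asserted here):

>>> from lingpy.algorithm.extra import infomap_clustering
>>> from lingpy.algorithm import squareform
>>> taxa = ['German', 'Swedish', 'Icelandic', 'English', 'Dutch']
>>> matrix = squareform([0.5, 0.67, 0.8, 0.2, 0.4, 0.7, 0.6, 0.8, 0.8, 0.3])
>>> infomap_clustering(0.5, matrix, taxa)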
Module contents
Package for specific algorithms and time-intensive routines.

lingpy.align package

Submodules

lingpy.align.multiple module
Module provides classes and functions for multiple alignment analyses.

class lingpy.align.multiple.Multiple(seqs, **keywords)
Bases: clldutils.misc.UnicodeMixin
Basic class for multiple sequence alignment analyses.

Parameters
seqs : list
    List of sequences that shall be aligned.

Notes
Depending on the structure of the sequences, further keywords can be specified that manage how the items get tokenized.

align(method, **kw)

get_local_peaks(threshold=2, gap_weight=0.0)
Return all peaks in a given alignment.

Parameters
threshold : { int, float } (default=2)
    The threshold to determine whether a given column is a peak or not.
gap_weight : float (default=0.0)
    The weight for gaps.

get_pairwise_alignments(**keywords)
Function creates a dictionary of all pairwise alignment scores.

Parameters
new_calc : bool (default=True)
    Specify whether the analysis should be repeated from the beginning, or whether the results of already conducted analyses should be reused.
model : string (default=sca)
    A string indicating the name of the Model object that shall be used for the analysis. Currently, three models are supported:
    • dolgo – a sound-class model based on Dolgopolsky1986,
    • sca – an extension of the dolgo sound-class model based on List2012b, and
    • asjp – an independent sound-class model which is based on the sound-class model of Brown2008 and the empirical data of Brown2011 (see the description in List2012).
mode : string (default=global)
    A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
    • global – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
    • dialign – global alignment analysis which seeks to maximize local similarities Morgenstern1996.
gop : int (default=-3)
    The gap opening penalty (GOP) used in the analysis.
gep_scale : float (default=0.6)
    The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.
factor : float (default=1)
    The factor by which the initial and the descending position shall be modified.
gap_weight : float (default=0)
    The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
restricted_chars : string (default=T)
    Define which characters of the prosodic string of a sequence reflect its secondary structure (cf. List2012b) and should therefore be aligned specifically. This defaults to T, since this is the character that represents tones in the prosodic strings of sequences.

get_peaks(gap_weight=0)
Calculate the profile score for each column of the alignment.

Parameters
gap_weight : float (default=0)
    The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

Returns
peaks : list
    A list containing the profile scores for each column of the given alignment.

get_pid(mode=1)
Return the Percentage Identity (PID) score of the calculated MSA.

Parameters
mode : { 1, 2, 3, 4, 5 } (default=1)
    Indicate which of the four possible PID scores described in Raghava2006 shall be calculated; the fifth possibility is added for linguistic purposes:
    1. identical positions / (aligned positions + internal gap positions),
    2. identical positions / aligned positions,
    3. identical positions / shortest sequence, or
    4. identical positions / shortest sequence (including internal gap positions),
    5. identical positions / (aligned positions + 2 * number of gaps).

Returns
score : float
    The PID score of the given alignment as a floating point number between 0 and 1.

See also
lingpy.sequence.sound_classes.pid
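A brief usage sketch (the concrete value depends on the computed alignment, so no output is asserted here):

>>> from lingpy.align.multiple import Multiple
>>> mult = Multiple(["woldemort", "waldemar"])
>>> mult.prog_align()
>>> mult.get_pid(mode=2)  # identical positions / aligned positions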
iterate_all_sequences(check='final', mode='global', gop=-3, scale=0.5, factor=0, gap_weight=1, restricted_chars='T_')
Iterative refinement based on a complete realignment of all sequences.

Parameters
check : { final, immediate } (default=final)
    Specify when to check for improved sum-of-pairs scores: after each iteration (immediate) or after all iterations have been carried out (final).
mode : { global, overlap, dialign } (default=global)
    A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
    • global – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
    • dialign – global alignment analysis which seeks to maximize local similarities Morgenstern1996,
    • overlap – semi-global alignment, where gaps introduced at the beginning and the end of a sequence do not score.
gop : int (default=-5)
    The gap opening penalty (GOP) used in the analysis.
gep_scale : float (default=0.5)
    The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1981.
factor : float (default=0.3)
    The factor by which the initial and the descending position shall be modified.
gap_weight : float (default=0)
    The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

See also
Multiple.iterate_clusters, Multiple.iterate_similar_gap_sites, Multiple.iterate_orphans

Notes
This method essentially follows the iterative method of Barton1987, with the exception that an MSA has already been calculated.

iterate_clusters(threshold, check='final', mode='global', gop=-3, scale=0.5, factor=0, gap_weight=1, restricted_chars='T_')
Iterative refinement based on a flat cluster analysis of the data.

Parameters
threshold : float
    The threshold for the flat cluster analysis.
check : string (default=final)
    Specify when to check for improved sum-of-pairs scores: after each iteration (immediate) or after all iterations have been carried out (final).
mode : { global, overlap, dialign } (default=global)
    A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
    • global – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
    • dialign – global alignment analysis which seeks to maximize local similarities Morgenstern1996,
    • overlap – semi-global alignment, where gaps introduced at the beginning and the end of a sequence do not score.
gop : int (default=-5)
    The gap opening penalty (GOP) used in the analysis.
gep_scale : float (default=0.6)
    The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1981.
factor : float (default=0.3)
    The factor by which the initial and the descending position shall be modified.
gap_weight : float (default=0)
    The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

See also
Multiple.iterate_similar_gap_sites, Multiple.iterate_all_sequences

Notes
This method uses the lingpy.algorithm.clustering.flat_upgma() function in order to retrieve a flat cluster of the data.
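A sketch of the intended workflow (refinement methods operate on an alignment that has already been computed, so a progressive alignment is carried out first; the threshold value is only illustrative):

>>> mult = Multiple(["woldemort", "waldemar", "wladimir"])
>>> mult.prog_align()
>>> mult.iterate_clusters(0.5)
>>> print(mult)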
iterate_orphans(check='final', mode='global', gop=-3, scale=0.5, factor=0, gap_weight=1.0, restricted_chars='T_')
Iterate over the most divergent sequences in the sample.

Parameters
check : string (default=final)
    Specify when to check for improved sum-of-pairs scores: after each iteration (immediate) or after all iterations have been carried out (final).
mode : { global, overlap, dialign } (default=global)
    A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
    • global – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
    • dialign – global alignment analysis which seeks to maximize local similarities Morgenstern1996,
    • overlap – semi-global alignment, where gaps introduced at the beginning and the end of a sequence do not score.
gop : int (default=-5)
    The gap opening penalty (GOP) used in the analysis.
gep_scale : float (default=0.6)
    The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1981.
factor : float (default=0.3)
    The factor by which the initial and the descending position shall be modified.
gap_weight : float (default=0)
    The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

See also
Multiple.iterate_clusters, Multiple.iterate_similar_gap_sites, Multiple.iterate_all_sequences

Notes
The most divergent sequences are those whose average distance to all other sequences is above the average distance of all sequence pairs.

iterate_similar_gap_sites(check='final', mode='global', gop=-3, scale=0.5, factor=0, gap_weight=1, restricted_chars='T_')
Iterative refinement based on the Similar Gap Sites heuristic.

Parameters
check : { final, immediate } (default=final)
    Specify when to check for improved sum-of-pairs scores: after each iteration (immediate) or after all iterations have been carried out (final).
mode : { global, overlap, dialign } (default=global)
    A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
    • global – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
    • dialign – global alignment analysis which seeks to maximize local similarities Morgenstern1996,
    • overlap – semi-global alignment, where gaps introduced at the beginning and the end of a sequence do not score.
gop : int (default=-5)
    The gap opening penalty (GOP) used in the analysis.
gep_scale : float (default=0.5)
    The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.
factor : float (default=0.3)
    The factor by which the initial and the descending position shall be modified.
gap_weight : float (default=1)
    The factor by which gaps in aligned columns contribute to the calculation of the column score. When, e.g., set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

See also
Multiple.iterate_clusters, Multiple.iterate_all_sequences, Multiple.iterate_orphans

Notes
This heuristic is fairly simple. The idea is to try to split a given MSA into partitions with identical gap sites.

lib_align(**keywords)
Carry out a library-based progressive alignment analysis of the sequences.

Parameters
model : { dolgo, sca, asjp } (default=sca)
    A string indicating the name of the Model object that shall be used for the analysis. Currently, three models are supported:
    • dolgo – a sound-class model based on Dolgopolsky1986,
    • sca – an extension of the dolgo sound-class model based on List2012b, and
    • asjp – an independent sound-class model which is based on the sound-class model of Brown2008 and the empirical data of Brown2011 (see the description in List2012).
mode : { global, dialign } (default=global)
    A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
    • global – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
    • dialign – global alignment analysis which seeks to maximize local similarities Morgenstern1996.
modes : list (default=[(global, -10, 0.6), (local, -1, 0.6)])
    Indicate the mode, the gap opening penalties (GOP), and the gap extension scale (GEP scale) of the pairwise alignment analyses which are used to create the library.
gop : int (default=-5)
    The gap opening penalty (GOP) used in the analysis.
gep_scale : float (default=0.6)
prog_align(**keywords)

Carry out a progressive alignment analysis of the input sequences.

Parameters

model : { dolgo, sca, asjp } (default='sca')
    A string indicating the name of the Model object that shall be used for the analysis. Currently, three models are supported:
    • dolgo – a sound-class model based on Dolgopolsky1986,
    • sca – an extension of the dolgo sound-class model based on List2012b, and
    • asjp – an independent sound-class model which is based on the sound-class model of Brown2008 and the empirical data of Brown2011 (see the description in List2012).
mode : { global, dialign } (default='global')
    A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
    • global – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
    • dialign – global alignment analysis which seeks to maximize local similarities Morgenstern1996.
gop : int (default=-2)
    The gap opening penalty (GOP) used in the analysis.
scale : float (default=0.5)
    The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.
factor : float (default=0.3)
    The factor by which the initial and the descending position shall be modified.
tree_calc : { neighbor, upgma } (default='upgma')
    The cluster algorithm which shall be used for the calculation of the guide tree. Select between neighbor, the Neighbor-Joining algorithm (Saitou1987), and upgma, the UPGMA algorithm (Sokal1958).
guide_tree : tree_matrix
    Use a custom guide tree instead of performing a cluster algorithm for constructing one based on the input similarities. The use of this option makes the tree_calc option irrelevant.
gap_weight : float (default=0.5)
    The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
restricted_chars : string (default='T')
    Define which characters of the prosodic string of a sequence reflect its secondary structure (cf. List2012b) and should therefore be aligned specifically. This defaults to T, since this is the character that represents tones in the prosodic strings of sequences.
sum_of_pairs(alm_matrix='self', mat=None, gap_weight=0.0, gop=-1)

Calculate the sum-of-pairs score for a given alignment analysis.

Parameters

alm_matrix : { self, other } (default='self')
    Indicate for which MSA the sum-of-pairs score shall be calculated.
mat : { None, list }
    If other is chosen as an option for alm_matrix, define for which matrix the sum-of-pairs score shall be calculated.
gap_weight : float (default=0)
    The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.

Returns

The sum-of-pairs score of the alignment.

swap_check(swap_penalty=-3, score_mode='classes')

Check for possibly swapped sites in the alignment.

Parameters

swap_penalty : { int, float } (default=-3)
    Specify the penalty for swaps in the alignment.
score_mode : { classes, library } (default='classes')
    Define the score mode of the calculation, which is either based on sound classes proper or on the specific scores derived from the library approach.

Returns

result : bool
    Returns True if a swap was identified, and False otherwise. The information regarding the position of the swap is stored in the attribute swap_index.

Notes

The method for swap detection is described in detail in List2012b.

Examples

Define a set of strings whose alignment contains a swap.

>>> from lingpy import *
>>> mult = Multiple(["woldemort", "waldemar", "wladimir"])

Align the data, using the progressive approach.

>>> mult.prog_align()

Check for swaps.

>>> mult.swap_check()
True

Print the alignment:

>>> print(mult)
w   o   l   d   e   m   o   r   t
w   a   l   d   e   m   a   r   -
v   l   a   d   i   m   i   r   -

lingpy.align.multiple.mult_align(seqs, gop=-1, scale=0.5, tree_calc='upgma', scoredict=False, pprint=False)

A short-cut method for multiple alignment analyses.

Parameters

seqs : list
    The input sequences.
gop : int (default=-1)
    The gap opening penalty.
scale : float (default=0.5)
    The scaling factor by which penalties for gap extensions are decreased.
tree_calc : { upgma, neighbor } (default='upgma')
    The algorithm which is used for the calculation of the guide tree.
pprint : bool (default=False)
    Indicate whether results shall be printed onto screen.

Returns

alignments : list
    A two-dimensional list in which alignments are represented as a list of tokens.

Examples

>>> m = mult_align(["woldemort", "waldemar", "vladimir"], pprint=True)
w   o   l   d   e   m   o   r   t
w   a   l   d   e   m   a   r   -
-   v   l   a   d   i   m   i   r
lingpy.align.pairwise module

Module provides classes and functions for pairwise alignment analyses.

class lingpy.align.pairwise.Pairwise(seqs, seqB=False, **keywords)

Bases: object

Basic class for the handling of pairwise sequence alignments (PSA).

Parameters

seqs : { str, list }
    Either the first string of a sequence pair that shall be aligned, or a list of sequence tuples.
seqB : string or bool (default=False)
    Define the second sequence that shall be aligned with the first sequence, if only two sequences shall be compared.

align(**keywords)

Align a pair of sequences or multiple sequence pairs.

Parameters

gop : int (default=-1)
    The gap opening penalty (GOP).
scale : float (default=0.5)
    The gap extension penalty (GEP), calculated with help of a scaling factor.
mode : { global, local, overlap, dialign }
    The alignment mode, see List2012a for details.
factor : float (default=0.3)
    The factor by which matches in identical prosodic position are increased.
restricted_chars : str (default='T_')
    The restricted characters that function as an indicator of syllable or morpheme breaks for secondary alignment, see List2012c for details.
distance : bool (default=False)
    If set to True, return the distance instead of the similarity score. Distance is calculated using the formula by Downey2008.
model : { None, ~lingpy.data.model.Model }
    Specify the sound class model that shall be used for the analysis. If no model is specified, the default model of List2012a will be used.
pprint : bool (default=False)
    If set to True, the alignments are printed to the screen.

lingpy.align.pairwise.edit_dist(seqA, seqB, normalized=False, restriction='')

Return the edit distance between two strings.

Parameters

seqA, seqB : str
    The strings that shall be compared.
normalized : bool (default=False)
    Specify whether the normalized edit distance shall be returned. If no restrictions are chosen, the edit distance is normalized by dividing by the length of the longer string. If restriction is set to cv (consonant-vowel), the edit distance is normalized by the length of the alignment.
restriction : { cv } (default='')
    Specify the restrictions to be used. Currently, only cv is supported. This prohibits matches of vowels with consonants.

Returns

dist : { int, float }
    The edit distance, which is a float if normalized is set to True, and an integer otherwise.

Notes

The edit distance was first formally defined by V. I. Levenshtein (Levenshtein1965). The first algorithm to compute the edit distance was proposed by Wagner and Fisher (Wagner1974).

Examples

Align two sequences:

>>> seqA = 'fat cat'
>>> seqB = 'catfat'
>>> edit_dist(seqA, seqB)
3
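As a small extension of the example above (not part of the original docstring), the normalized variant divides the three edit operations by the length of the longer string, i.e. 3/7, assuming the normalization works as described:

>>> edit_dist(seqA, seqB, normalized=True)
0.42857142857142855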
lingpy.align.pairwise.nw_align(seqA, seqB, scorer=False, gap=-1)

Carry out the traditional Needleman-Wunsch algorithm.

Parameters

seqA, seqB : { str, list, tuple }
    The input strings. These should be iterables, so you can use tuples, lists, or strings.
scorer : dict (default=False)
    If set to False, a scorer will automatically be calculated; otherwise, the scorer needs to be passed as a dictionary that covers all segment matches between the input strings (segment matches need to be passed as tuples of two segments, following the order of the input sequences). Note also that the scorer can well be asymmetric, so you could also use it for two completely different alphabets. All you need to make sure is that the tuples representing the segment matches follow the order of your input sequences.
gap : int (default=-1)
    The gap penalty.

Returns

alm : tuple
    A tuple consisting of the alignments of the first and the second sequence, and the alignment score.

Notes

The Needleman-Wunsch algorithm (see Needleman1970) returns a global alignment of two sequences.

Examples

Align two sequences with the default scorer:

>>> seqA, seqB = 'abab', 'baba'
>>> almA, almB, sim = nw_align(seqA, seqB)
>>> print(' '.join(almA) + '\n' + ' '.join(almB), '(sim={0})'.format(sim))
a b a b -
- b a b a (sim=1)

Nothing unexpected so far; you could reach the same result without the scorer. But now let's make a scorer that favors mismatches for our little two-letter alphabet:

>>> scorer = {('a','b'): 1, ('a','a'): -1, ('b','b'): -1, ('b','a'): 1}
>>> seqA, seqB = 'abab', 'baba'
>>> almA, almB, sim = nw_align(seqA, seqB, scorer=scorer)
>>> print(' '.join(almA) + '\n' + ' '.join(almB), '(sim={0})'.format(sim))
a b a b
b a b a (sim=4)

Now, let's analyse two strings which are completely different, but where we use the scorer to define mappings between the segments. We simply do this by using lower-case letters in one and upper-case letters in the other case, which will, of course, be treated as different symbols in Python:

>>> scorer = {('A','a'): 1, ('A','b'): -1, ('B','a'): -1, ('B','b'): 1}
>>> seqA, seqB = 'ABAB', 'aa'
>>> almA, almB, sim = nw_align(seqA, seqB, scorer=scorer)
>>> print(' '.join(almA) + '\n' + ' '.join(almB), '(sim={0})'.format(sim))
A B A B
a - a - (sim=0)

lingpy.align.pairwise.pw_align(seqA, seqB, gop=-1, scale=0.5, scorer=False, mode='global', distance=False, **keywords)

Align two sequences in various ways.

Parameters

seqA, seqB : { text_type, list, tuple }
    The input strings. These should be iterables, so you can use tuples, lists, or strings.
scorer : dict (default=False)
    If set to False, a scorer will automatically be calculated; otherwise, the scorer needs to be passed as a dictionary that covers all segment matches between the input strings.
gop : int (default=-1)
    The gap opening penalty.
scale : float (default=0.5)
    The gap extension scale. This scale is similar to the gap extension penalty, but in contrast to the traditional GEP, it scales the gap opening penalty.
mode : { global, local, dialign, overlap } (default='global')
    Select between one of the four different alignment modes regularly implemented in LingPy, see List2012a for details.
distance : bool (default=False)
    If set to True, return the distance score following the formula by Downey2008. Otherwise, return the basic similarity score.

Examples

Align two words using the dialign algorithm:

>>> seqA = 'fat cat'
>>> seqB = 'catfat'
>>> pw_align(seqA, seqB, mode='dialign')
(['f', 'a', 't', ' ', 'c', 'a', 't', '-', '-', '-'], ['-', '-', '-', '-', 'c', 'a', 't', 'f', 'a', 't'], 3.0)

lingpy.align.pairwise.structalign(seqA, seqB)

Experimental function for testing structural alignment algorithms.

lingpy.align.pairwise.sw_align(seqA, seqB, scorer=False, gap=-1)

Carry out the traditional Smith-Waterman algorithm.

Parameters

seqA, seqB : { str, list, tuple }
    The input strings. These should be iterables, so you can use tuples, lists, or strings.
scorer : dict (default=False)
    If set to False, a scorer will automatically be calculated; otherwise, the scorer needs to be passed as a dictionary that covers all segment matches between the input strings.
gap : int (default=-1)
    The gap penalty.

Returns

alm : tuple
    A tuple consisting of prefix, alignment, and suffix of the first and the second sequence, and the alignment score.

Notes

The Smith-Waterman algorithm (see Smith1981) returns a local alignment between two sequences. A local alignment is an alignment of those subsequences of the input sequences that yields the highest score.

Examples

Align two sequences:

>>> seqA = 'fat cat'
>>> seqB = 'catfat'
>>> sw_align(seqA, seqB)
(([], ['f', 'a', 't'], [' ', 'c', 'a', 't']), (['c', 'a', 't'], ['f', 'a', 't'], []), 3.0)

lingpy.align.pairwise.turchin(seqA, seqB, model='dolgo', **keywords)

Return cognate judgment based on the method by Turchin2010.

Parameters

seqA, seqB : { str, list, tuple }
    The input strings. These should be iterables, so you can use tuples, lists, or strings.
model : { asjp, sca, dolgo } (default='dolgo')
    A sound-class model instance or a string that denotes one of the standard sound class models used in LingPy.

Returns

cognacy : { 0, 1 }
    The cognacy assertion, which is either 0 (words are probably cognate) or 1 (words are not likely to be cognate).
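A minimal sketch (not part of the original docstring; the forms are invented, and the output assumes that the default dolgo model assigns t and d to the same sound class, so the first two consonant classes of both words match):

>>> from lingpy.align.pairwise import turchin
>>> turchin('hant', 'hand')
0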
lingpy.align.pairwise.we_align(seqA, seqB, scorer=False, gap=-1)

Carry out the traditional Waterman-Eggert algorithm.

Parameters

seqA, seqB : { str, list, tuple }
    The input strings. These should be iterables, so you can use tuples, lists, or strings.
scorer : dict (default=False)
    If set to False, a scorer will automatically be calculated; otherwise, the scorer needs to be passed as a dictionary that covers all segment matches between the input strings.
gap : int (default=-1)
    The gap penalty.

Returns

alms : list
    A list consisting of tuples. Each tuple gives the alignment of one of the subsequences of the input sequences. Each tuple contains the aligned part of the first sequence, the aligned part of the second sequence, and the score of the alignment.

Notes

The Waterman-Eggert algorithm (see Waterman1987) returns all local matches between two sequences.

Examples

Align two sequences:

>>> seqA = 'fat cat'
>>> seqB = 'catfat'
>>> we_align(seqA, seqB)
[(['f', 'a', 't'], ['f', 'a', 't'], 3.0), (['c', 'a', 't'], ['c', 'a', 't'], 3.0)]

lingpy.align.sca module

Basic module for pairwise and multiple sequence comparison. The module consists of four classes which deal with pairwise and multiple sequence comparison from the sequence and the alignment perspective. The sequence perspective deals with unaligned sequences; the alignment perspective deals with aligned sequences.

class lingpy.align.sca.Alignments(infile, row='concept', col='doculect', conf='', modify_ref=False, _interactive=True, split_on_tones=True, ref='cogid', **keywords)

Bases: lingpy.basic.wordlist.Wordlist

Class handles Wordlists for the purpose of alignment analyses.

Parameters

infile : str
    The name of the input file, which should conform to the basic format of the ~lingpy.basic.wordlist.Wordlist class and define a specific ID for cognate sets.
row : str (default='concept')
    A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.
col : str (default='doculect')
    A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.
conf : string (default='')
    A string defining the path to the configuration file.
ref : string (default='cogid')
    The name of the column that stores the cognate IDs.
modify_ref : function (default=False)
    Use a function to modify the reference. If your cognate identifiers are numerical, for example, and negative values are assigned as loans, but you want to suppress this behaviour, just set this keyword to abs, and all cognate IDs will be converted to their absolute value.
split_on_tones : bool (default=True)
    If set to True, this means that in the case of fuzzy alignment mode, the algorithm will attempt to split words into morphemes by tones if no explicit morpheme markers can be found.

Notes

This class inherits from Wordlist and additionally creates instances of the Multiple class for all cognate sets that are specified by the ref keyword.

Attributes

msa : dict
    A dictionary storing multiple alignments as dictionaries which can be directly opened and aligned with help of the ~lingpy.align.sca.SCA function. The alignment objects are referenced by a key which is identical with the reference (ref keyword) of the alignment, that is, the name of the column which contains the cognate identifiers.
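A minimal usage sketch (not part of the original docstring), assuming an input file like the harry_potter.csv used in the Wordlist examples below, with a cogid column:

>>> from lingpy import Alignments
>>> alms = Alignments('harry_potter.csv', ref='cogid')
>>> alms.align(method='progressive')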
add_alignments(ref=False, modify_ref=False, fuzzy=False, split_on_tones=True)

Function adds a new set of alignments to the data.

Parameters

ref : str (default=False)
    Use this to set the name of the column which contains the cognate sets.
fuzzy : bool (default=False)
    If set to True, force the algorithm to treat the cognate sets as fuzzy cognate sets, i.e., as multiple cognate sets which are assigned, in order, to a word (proper partial cognates).

align(**keywords)

Carry out a multiple alignment analysis of the data.

Parameters

method : { progressive, library } (default='progressive')
    Select the method to use for the analysis.
iteration : bool (default=False)
    Set to True in order to use iterative refinement methods.
swap_check : bool (default=False)
    Set to True in order to carry out a swap check.
model : { dolgo, sca, asjp }
    A string indicating the name of the Model object that shall be used for the analysis. Currently, three models are supported:
    • dolgo – a sound-class model based on Dolgopolsky1986,
    • sca – an extension of the dolgo sound-class model based on List2012b, and
    • asjp – an independent sound-class model which is based on the sound-class model of Brown2008 and the empirical data of Brown2011 (see the description in List2012).
mode : { global, dialign }
    A string indicating which kind of alignment analysis should be carried out during the progressive phase. Select between:
    • global – traditional global alignment analysis based on the Needleman-Wunsch algorithm Needleman1970,
    • dialign – global alignment analysis which seeks to maximize local similarities Morgenstern1996.
modes : list (default=[('global', -2, 0.5), ('local', -1, 0.5)])
    Indicate the mode, the gap opening penalties (GOP), and the gap extension scale (GEP scale) of the pairwise alignment analyses which are used to create the library.
gop : int (default=-5)
    The gap opening penalty (GOP) used in the analysis.
scale : float (default=0.6)
    The factor by which the penalty for the extension of gaps (gap extension penalty, GEP) shall be decreased. This approach is essentially inspired by the extension of the basic alignment algorithm for affine gap penalties Gotoh1982.
factor : float (default=1)
    The factor by which the initial and the descending position shall be modified.
tree_calc : { neighbor, upgma } (default='upgma')
    The cluster algorithm which shall be used for the calculation of the guide tree. Select between neighbor, the Neighbor-Joining algorithm (Saitou1987), and upgma, the UPGMA algorithm (Sokal1958).
gap_weight : float (default=0)
    The factor by which gaps in aligned columns contribute to the calculation of the column score. When set to 0, gaps will be ignored in the calculation. When set to 0.5, gaps will count half as much as other characters.
restricted_chars : string (default='T')
    Define which characters of the prosodic string of a sequence reflect its secondary structure (cf. List2012b) and should therefore be aligned specifically. This defaults to T, since this is the character that represents tones in the prosodic strings of sequences.
get_confidence(scorer, ref='lexstatid', gap_weight=0.25)

Function creates confidence scores for a given set of alignments.

Parameters

scorer : ScoreDict
    A ScoreDict object which gives similarity scores for all segments in the alignment.
ref : str (default='lexstatid')
    The reference entry type, referring to the cognate set to be used for the analysis.
gap_weight : float (default=0.25)
    Determine the weight assigned to matches containing gaps.

get_consensus(tree=False, gaps=False, classes=False, consensus='consensus', counterpart='ipa', weights=[], return_data=False, **keywords)

Calculate a consensus string of all MSAs in the wordlist.

Parameters

msa : { list, ~lingpy.align.multiple.Multiple }
    Either an MSA object or an MSA matrix.
tree : { str, ~lingpy.thirdparty.cogent.PhyloNode }
    A tree object or a Newick string along which the consensus shall be calculated.
gaps : bool (default=False)
    If set to True, return the gap positions in the consensus.
classes : bool (default=False)
    Specify whether sound classes shall be used to calculate the consensus.
model : ~lingpy.data.model.Model
    A sound class model according to which the IPA strings shall be converted to sound-class strings.
return_data : bool (default=False)
    Return the data instead of adding it in a column to the wordlist object.

get_msa(ref)

output(fileformat, **keywords)

Write wordlist to file.

Parameters

fileformat : { tsv, msa, tre, nwk, dst, taxa, starling, paps.nex, paps.csv, html }
    The format that is written to file. This corresponds to the file extension; thus tsv creates a file in tsv format, dst creates a file in Phylip distance format, etc. Specific output is created for the formats html and msa:
    • msa will create a folder containing all alignments of all cognate sets in msa format
    • html will create html output in which words are sorted according to meaning and cognate set, and all cognate words are aligned
filename : str
    Specify the name of the output file (defaults to a filename that indicates the creation date).
subset : bool (default=False)
    If set to True, return only a subset of the data. Which subset is specified in the keywords cols and rows.
cols : list
    If subset is set to True, specify the columns that shall be written to the csv file.
rows : dict
    If subset is set to True, use a dictionary consisting of keys that specify a column and values that give a Python statement in raw text, such as, e.g., "== hand". The content of the specified column will then be checked against the statement passed in the dictionary, and if it evaluates to True, the respective row will be written to file.
ref : str
    Name of the column that contains the cognate IDs if starling is chosen as an output format.
missing : { str, int } (default=0)
    If paps.nex or paps.csv is chosen as fileformat, this character will be inserted as an indicator of missing data.
tree_calc : { neighbor, upgma }
    If no tree has been calculated and tre or nwk is chosen as output format, the method that is used to calculate the tree.
threshold : float (default=0.6)
    The threshold that is used to carry out a flat cluster analysis if groups or cluster is chosen as output format.
style : str (default='id')
    If msa is chosen as output format, this will write the alignments for each msa file in a specific format in which the first column contains a direct reference to the word via its ID in the wordlist.
ignore : { list, all }
    Modifies the output format in tsv output and allows to ignore certain blocks in extended tsv, like msa, taxa, json, etc., which should be passed as a list. If you choose all as a plain string and not a list, this will ignore all additional blocks and output only plain tsv.
prettify : bool (default=True)
    Inserts comment characters between concepts in the tsv file output format, which makes it easier to see blocks of words denoting the same concept. Switching this off will output the file in plain tsv.
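Continuing the sketch from above (not part of the original docstring; filenames are illustrative), the alignments could be written to disk as follows:

>>> alms.output('msa')
>>> alms.output('html', filename='alignments')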
reduce_alignments(alignment=False, ref=False)

Function reduces alignments which contain columns that are marked to be ignored by the user.

Notes

This function changes the data only internally: all alignments are checked as to whether they contain data that should be ignored. If this is the case, the alignments are reduced and stored in a specific item of the alignment string. If the method doesn't find any instances for reduction, it still makes copies of the alignments in order to guarantee that the alignments we want to work with are at the same place in the dictionary.

class lingpy.align.sca.MSA(infile, **keywords)

Bases: lingpy.align.multiple.Multiple

Basic class for carrying out multiple sequence alignment analyses.

Parameters

infile : file
    A file in msq format or msa format.
merge_vowels : bool (default=True)
    Indicate whether neighboring vowels should be merged into diphthongs, or whether they should be kept separate during the analysis.
comment : char (default='#')
    The comment character which, inserted in the beginning of a line, prevents that line from being read.
normalize : bool (default=True)
    Normalize the alignment, that is, add gap characters for all sequences which are shorter than the longest sequence, and delete all columns from the alignment in which only gaps occur.

Notes

There are two possible input formats for this class: the MSQ format and the MSA format (see msa_formats for details). This class directly inherits all methods of the Multiple class.

Examples

Get the path to a file from the testset.

>>> from lingpy import *
>>> path = rc("test_path") + 'harry.msq'

Load the file into the Multiple class.

>>> mult = Multiple(path)

Carry out a progressive alignment analysis of the sequences.

>>> mult.prog_align()

Print the result to the screen:

>>> print(mult)
w   o   l   d   e   m   o   r   t
w   a   l   d   e   m   a   r   -
v   l   a   d   i   m   i   r   -

ipa2cls(**keywords)

Retrieve sound-class strings from aligned IPA sequences.

Parameters

model : str (default='sca')
    The sound-class model according to which the sequences shall be converted.

Notes

This function is only useful when an msa file with already conducted alignment analyses was loaded.

output(fileformat='msa', filename=None, sorted_seqs=False, unique_seqs=False, **keywords)

Write data to file.

Parameters

fileformat : { psa, msa, msq, html }
    Indicate which data should be written to file. Select between:
    • psa – output of all pairwise alignments in psa format,
    • msa – output of the multiple alignment in msa format,
    • msq – output of the multiple sequences in msq format, or
    • html – output of the multiple alignment in html format.
filename : str
    Select a specific name for the outfile; otherwise, the name of the infile will be taken by default.
sorted_seqs : bool
    Indicate whether the sequences should be sorted or not (applies only to msa and msq output).
unique_seqs : bool
    Indicate whether only unique sequences should be written to file or not.
class lingpy.align.sca.PSA(infile, **keywords)

Bases: lingpy.align.pairwise.Pairwise

Basic class for dealing with the pairwise alignment of sequences.

Parameters

infile : file
    A file in psq format.
merge_vowels : bool (default=True)
    Indicate whether neighboring vowels should be merged into diphthongs, or whether they should be kept separate during the analysis.
comment : char (default='#')
    The comment character which, inserted in the beginning of a line, prevents that line from being read.

Notes

In order to read in data from text files, two different file formats can be used along with this class: the PSQ format and the PSA format (see psa_formats for details). This class inherits the methods of the Pairwise class.

Attributes

taxa : list
    A list of tuples containing the taxa of all sequence pairs.
seqs : list
    A list of tuples containing all sequence pairs.
tokens : list
    A list of tuples containing all sequence pairs in a tokenized form.

output(fileformat='psa', filename=None, **keywords)

Write the results of the analyses to a text file.

Parameters

fileformat : { psa, psq }
    Indicate which data should be written to file. Select between:
    • psa – output of all pairwise alignments in psa format,
    • psq – output of the multiple sequences in psq format.
filename : str
    Select a specific name for the outfile; otherwise, the name of the infile will be taken by default.

lingpy.align.sca.SCA(infile, **keywords)

Method returns alignment objects depending on input file or input data.

Notes

This method checks for the type of an alignment object and returns an alignment object of the respective type.

lingpy.align.sca.get_consensus(msa, gaps=False, taxa=False, classes=False, **keywords)

Calculate a consensus string of a given MSA.

Parameters

msa : { list, ~lingpy.align.multiple.Multiple }
    Either an MSA object or an MSA matrix.
gaps : bool (default=False)
    If set to True, return the gap positions in the consensus.
taxa : { list, bool } (default=False)
    If tree is chosen as a parameter, specify the taxa in order of the aligned strings.
classes : bool (default=False)
    Specify whether sound classes shall be used to calculate the consensus.
model : ~lingpy.data.model.Model
    A sound class model according to which the IPA strings shall be converted to sound-class strings.
local : { bool, peaks, gaps } (default=False)
    Specify whether local pre-processing should be applied to the data. If set to peaks, the average alignment score of each column is taken as reference to remove low-scoring columns from the alignment. If set to gaps, the columns with the highest proportion of gaps will be excluded.

Returns

cons : str
    A consensus string of the given MSA.

Module contents

Package provides basic modules for alignment analyses.

lingpy.basic package

Submodules

lingpy.basic.ops module

Module provides basic operations on Wordlist objects.

lingpy.basic.ops.calculate_data(wordlist, data, taxa='taxa', concepts='concepts', ref='cogid', **keywords)

Manipulate a wordlist object by adding different kinds of data.

Parameters

data : str
    The type of data that shall be calculated. Currently supports:
    • tree: calculate a reference tree based on shared cognates
    • dst: get distances between taxa based on shared cognates
    • cluster: cluster the taxa into groups using different methods

lingpy.basic.ops.clean_taxnames(wordlist, column='doculect', f=<function <lambda>>)

Function cleans taxon names for use in Newick files.

lingpy.basic.ops.coverage(wordlist)

Determine the average coverage of a wordlist.
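A brief sketch of these helpers in action (not part of the original docstrings), using the KSL test set that also appears in the Wordlist examples below:

>>> from lingpy import Wordlist
>>> from lingpy.basic.ops import calculate_data, coverage
>>> from lingpy.tests.util import test_data
>>> wl = Wordlist(test_data('KSL.qlc'))
>>> calculate_data(wl, 'tree')
>>> cov = coverage(wl)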
lingpy.basic.ops.get_score(wl, ref, mode, taxA, taxB, concepts_attr='concepts', ignore_missing=False)

lingpy.basic.ops.iter_rows(wordlist, *values)

Function generates a list of the specified values in a wordlist.

Parameters

wordlist : ~lingpy.basic.wordlist.Wordlist
    A wordlist object or one of the daughter classes of wordlists.
value : str
    A value as defined in the header of the wordlist.

Returns

list : list
    A generator object that yields lists containing the key of each row in the wordlist and the corresponding cells, as specified in the headers.

lingpy.basic.ops.renumber(wordlist, source, target='', override=False)

Create numerical identifiers from string identifiers.

lingpy.basic.ops.triple2tsv(triples_or_fname, output='table')

Function reads a triple file and converts it to a tabular data structure.

lingpy.basic.ops.tsv2triple(wordlist, outfile=None)

Function converts a wordlist to a triple data structure.

Notes

The basic values of which the triples consist are:
• ID (the ID in the TSV file)
• COLUMN (the column in the TSV file)
• VALUE (the entry in the TSV file)

lingpy.basic.ops.wl2dict(wordlist, sections, entries, exclude=None)

Convert a wordlist to a complex dictionary with headings as keys.

lingpy.basic.ops.wl2dst(wl, taxa='taxa', concepts='concepts', ref='cogid', refB='', mode='swadesh', ignore_missing=False, **keywords)

Function converts a wordlist to a distance matrix.

lingpy.basic.ops.wl2multistate(wordlist, ref, missing)

Function converts a wordlist to multistate format (compatible with PAUP).

lingpy.basic.ops.wl2qlc(header, data, filename='', formatter='concept', **keywords)

Write the basic data of a wordlist to file.

lingpy.basic.parser module

Basic parser for text files in QLC format.

class lingpy.basic.parser.QLCParser(filename, conf='')

Bases: object

Basic class for the handling of text files in QLC format.

add_entries(entry, source, function, override=False, **keywords)

Add new entry types to the word list by modifying given ones.

Parameters

entry : string
    A string specifying the name of the new entry type to be added to the word list.
source : string
    A string specifying the basic entry type that shall be modified. If multiple entry types shall be used to create a new entry, they should be passed in a simple string separated by a comma.
function : function
    A function which is used to convert the source into the target value.
keywords : dict
    A dictionary of keywords that are passed as parameters to the function.

Notes

This method can be used to add new entry types to the data by converting given ones. There are many possibilities for adding new entries, but the most basic procedure is to use an existing entry type and to modify it with help of a function.

pickle(filename=None)

Store the QLCParser instance in a pickle file.

Notes

The function stores a binary file called FILENAME.pkl, with FILENAME corresponding to the name of the original file, in the user cache dir for LingPy on your system. To restore the instance from the pickle, call unpickle().

static unpickle(filename)

class lingpy.basic.parser.QLCParserWithRowsAndCols(filename, row, col, conf)

Bases: lingpy.basic.parser.QLCParser

get_entries(entry)

Return all entries matching the given entry type as a two-dimensional list.

Parameters

entry : string
    The entry type of the data that shall be returned in tabular format.
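As a sketch of how add_entries can be used (not part of the original docstring), one might derive sound-class strings from an existing tokens column, continuing the KSL wordlist loaded above and using tokens2class from the sequence module:

>>> from lingpy.sequence.sound_classes import tokens2class
>>> wl.add_entries('classes', 'tokens', lambda x: tokens2class(x, 'sca'))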
lingpy.basic.tree module

Basic module for the handling of language trees.

class lingpy.basic.tree.Tree(tree, **keywords)

Bases: lingpy.thirdparty.cogent.tree.PhyloNode

Basic class for the handling of phylogenetic trees.

Parameters

tree : { str, file, list }
    A string or a file containing trees in Newick format. As an alternative, you can also simply pass a list containing taxon names. In that case, a random tree will be created from the list of taxa.
branch_lengths : bool (default=False)
    When set to True, and a list of taxa is passed instead of a Newick string or a file containing a Newick string, a random tree with random branch lengths will be created, with the branch lengths being in the order of the total number of internal branches.

getDistanceToRoot(node)

Return the distance from the given node to the root.

Parameters

node : str
    The name of a given node in the tree.

Returns

distance : int
    The distance of the given node to the root of the tree.

get_distance(other, distance='grf', debug=False)

Function returns the Robinson-Foulds distance between the two trees.

Parameters

other : lingpy.basic.tree.Tree
    A tree object. It should have the same number of taxa as the initial tree.
distance : { grf, rf, branch, symmetric } (default='grf')
    The distance which shall be calculated. Select between:
    • grf: the generalized Robinson-Foulds distance
    • rf: the Robinson-Foulds distance
    • symmetric: the symmetric difference between all partitions of the trees

lingpy.basic.tree.random_tree(taxa, branch_lengths=False)

Create a random tree from a list of taxa.

Parameters

taxa : list
    The list containing the names of the taxa from which the tree will be created.
branch_lengths : bool (default=False)
    When set to True, a random tree with random branch lengths will be created, with the branch lengths being in the order of the total number of internal branches.

Returns

tree_string : str
    A string representation of the random tree in Newick format.
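A short sketch (not part of the original docstrings; the Newick string and taxon names are invented, and the output assumes that the distance to the root counts the number of branches between node and root):

>>> from lingpy.basic.tree import Tree, random_tree
>>> nwk = random_tree(['English', 'German', 'Russian'])
>>> tree = Tree('((English,German),Russian);')
>>> tree.getDistanceToRoot('English')
2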
lingpy.basic.wordlist module

This module provides a basic class for the handling of word lists.

class lingpy.basic.wordlist.Wordlist(filename, row='concept', col='doculect', conf=None)

Bases: lingpy.basic.parser.QLCParserWithRowsAndCols

Basic class for the handling of multilingual word lists.

Parameters

filename : { string, dict }
    The input file that contains the data. Alternatively, a dictionary with consecutive integers as keys and lists as values, with the key 0 specifying the header.
row : str (default='concept')
    A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.
col : str (default='doculect')
    A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.
conf : string (default='')
    A string defining the path to the configuration file (more information in the notes).

Notes

A word list is created from a dictionary containing the data. The idea is a three-dimensional representation of (linguistic) data. The first dimension is called col (column, usually language), the second one is called row (row, usually concept), and the third is called entry. In contrast to the first two dimensions, which have to consist of unique items, the third contains flexible values, such as ipa (phonetic sequence), cogid (identifier for cognate sets), or tokens (tokenized representation of phonetic sequences).

The LingPy website offers some tutorials for word lists which we recommend reading in case you are looking for more information.

A couple of methods are provided along with the word list class in order to access the multi-dimensional input data. The main idea is to provide an easy way to access two-dimensional slices of the data by specifying which entry type should be returned. Thus, if a word list consists not only of simple orthographical entries but also of IPA-encoded phonetic transcriptions, both the orthographical source and the IPA transcriptions can be easily accessed as two separate two-dimensional lists.

add_entries(entry, source, function, override=False, **keywords)

Add new entry types to the word list by modifying given ones.

Parameters

entry : string
    A string specifying the name of the new entry type to be added to the word list.
source : string
    A string specifying the basic entry type that shall be modified. If multiple entry types shall be used to create a new entry, they should be passed in a simple string separated by a comma.
function : function
    A function which is used to convert the source into the target value.
keywords : dict
    A dictionary of keywords that are passed as parameters to the function.

Notes

This method can be used to add new entry types to the data by converting given ones. There are many possibilities for adding new entries, but the most basic procedure is to use an existing entry type and to modify it with help of a function.

calculate(data, taxa='taxa', concepts='concepts', ref='cogid', **keywords)

Function calculates specific data.

Parameters

data : str
    The type of data that shall be calculated. Currently supports:
    • tree: calculate a reference tree based on shared cognates
    • dst: get distances between taxa based on shared cognates
    • cluster: cluster the taxa into groups using different methods

coverage(stats='absolute')

Function determines the coverage of a wordlist.

export(fileformat, sections=None, entries=None, entry_sep='', item_sep='', template='', **keywords)

Export the wordlist to specific file formats.

Notes

The difference between export and output is that the latter mostly serves internal purposes and formats, while the former serves for the publication of data, using specific, nested statements to create, for example, HTML or LaTeX files from the wordlist data.
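Continuing the KSL wordlist loaded in the ops sketch above (not part of the original docstring), reference trees and distance data can be derived directly from the wordlist:

>>> wl.calculate('tree', ref='cogid')
>>> wl.calculate('dst', ref='cogid')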
get_dict(col='', row='', entry='', **keywords)

Function returns dictionaries of the cells matched by the indices.

Parameters

col : string (default='')
    The column index evaluated by the method. It should contain one of the values in the columns of the Wordlist instance, usually a taxon (language) name.
row : string (default='')
    The row index evaluated by the method. It should contain one of the values in the rows of the Wordlist instance, usually a concept.
entry : string (default='')
    The index for the entry evaluated by the method. It can be used to specify the datatype of the rows or columns selected. As a default, the indices of the entries are returned.

Returns

entries : dict
    A dictionary of keys and values specifying the selected part of the data. Typically, this can be a dictionary of a given language with keys for the concept and values as specified in the entry keyword.

See also: Wordlist.get_list, Wordlist.add_entries

Notes

The col and row keywords in the function are all aliased according to the description in the wordlist.rc file. Thus, instead of using these attributes, the aliases can also be taken. For selecting a language, one may type something like:

>>> Wordlist.get_dict(language='LANGUAGE')

and for the selection of a concept, one may type something like:

>>> Wordlist.get_dict(concept='CONCEPT')

See the examples below for details.

Examples

Load the harry_potter.csv file:

>>> wl = Wordlist('harry_potter.csv')

Select all IPA entries for the language German:

>>> wl.get_dict(language='German', entry='ipa')
{'Harry': ['haralt'], 'hand': ['hant'], 'leg': ['bain']}

Select all words (orthographical representation) for the concept Harry:

>>> wl.get_dict(concept="Harry", entry="words")
{'English': ['hæri'], 'German': ['haralt'], 'Russian': ['gari'], 'Ukrainian': ['gari']}

Note that the values of the dictionary that is returned are always lists, since it is possible that the original file contains synonyms (multiple words corresponding to the same concept).

get_distances(**kw)

get_etymdict(ref='cogid', entry='', modify_ref=False)

Return an etymological dictionary representation of the word list.

Parameters

ref : string (default='cogid')
    The reference entry which is used to store the cognate IDs.
entry : string (default='')
    The entry type which shall be selected.
modify_ref : function (default=False)
    Use a function to modify the reference. If your cognate identifiers are numerical, for example, and negative values are assigned as loans, but you want to suppress this behaviour, just set this keyword to abs, and all cognate IDs will be converted to their absolute value.

Returns

etym_dict : dict
    An etymological dictionary representation of the data.

Notes

In contrast to the word-list representation of the data, an etymological dictionary representation sorts the counterparts according to the cognate sets of which they are reflexes. If more than one cognate ID is assigned to an entry, for example in cases of fuzzy cognate IDs or partial cognate IDs, the etymological dictionary will return one cognate set for each of the IDs.
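A minimal sketch (not part of the original docstring), assuming the wordlist contains a cogid column as in the KSL example above; each key of the resulting dictionary is a cognate ID:

>>> etd = wl.get_etymdict(ref='cogid')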
get_list(row='', col='', entry='', flat=False, **keywords)

Function returns lists of rows and columns specified by their name.

Parameters

row : string (default='')
    The row name whose entries are selected from the data.
col : string (default='')
    The column name whose entries are selected from the data.
entry : string (default='')
    The entry type which is selected from the data.
flat : bool (default=False)
    Specify whether the returned list should be one- or two-dimensional, or whether it should contain gaps or not.

Returns

data : list
    A list representing the selected part of the data.

See also: Wordlist.get_dict, Wordlist.add_entries

Notes

The col and row keywords in the function are all aliased according to the description in the wordlist.rc file. Thus, instead of using these attributes, the aliases can also be taken. For selecting a language, one may type something like:

>>> Wordlist.get_list(language='LANGUAGE')

and for the selection of a concept, one may type something like:

>>> Wordlist.get_list(concept='CONCEPT')

See the examples below for details.

Examples

Load the harry_potter.csv file:

>>> wl = Wordlist('harry_potter.csv')

Select all IPA entries for the language German:

>>> wl.get_list(language='German', entry='ipa')
['bain', 'hant', 'haralt']

Note that this function returns 0 for missing values (concepts that don't have a word in the given language). If one wants to avoid this, the flat keyword should be set to True.

Select all words (orthographical representation) for the concept Harry:

>>> wl.get_list(concept="Harry", entry="words")
[['hæri', 'haralt', 'gari', 'hari']]

Note that the values of the list that is returned are always two-dimensional lists, since it is possible that the original file contains synonyms (multiple words corresponding to the same concept). If one wants to have a flat representation of the entries, the flat keyword should be set to True:

>>> wl.get_list(concept="Harry", entry="words", flat=True)
['hæri', 'haralt', 'gari', 'hari']

get_paps(ref='cogid', entry='concept', missing=0, modify_ref=False)

Function returns a list of present-absent patterns of a given word list.

Parameters

ref : string (default='cogid')
    The reference entry which is used to store the cognate IDs.
entry : string (default='concept')
    The field which is used to check for missing data.
missing : { string, int } (default=0)
    The marker for missing items.

get_tree(**kw)

iter_rows(*entries)

Iterate over the columns in a wordlist.

Parameters

entries : list
    The name of the columns which shall be iterated.

Returns

iterator : iterator
    An iterator yielding lists in which the first entry is the ID of the wordlist row and the following entries are the content of the columns as specified.

Examples

Load a wordlist from LingPy's test data:

>>> from lingpy.tests.util import test_data
>>> from lingpy import Wordlist
>>> wl = Wordlist(test_data("KSL.qlc"))
>>> list(wl.iter_rows('ipa'))[:10]
[[1, 'iθ'], [2, 'l'], [3, 'tut'], [4, 'al'], [5, 'apa.u'], [6, 'ayo'], [7, 'bytyn'], [8, 'e'], [9, 'and'], [10, 'e']]

So as you can see, the function returns the key of the wordlist as well as the specified entry.

output(fileformat, **keywords)

Write wordlist to file.

Parameters

fileformat : { tsv, tre, nwk, dst, taxa, starling, paps.nex, paps.csv }
    The format that is written to file. This corresponds to the file extension; thus tsv creates a file in extended tsv format, dst creates a file in Phylip distance format, etc.
filename : str
    Specify the name of the output file (defaults to a filename that indicates the creation date).
subset : bool (default=False)
    If set to True, return only a subset of the data. Which subset is specified in the keywords cols and rows.
cols : list
    If subset is set to True, specify the columns that shall be written to the csv file.
rows : dict
    If subset is set to True, use a dictionary consisting of keys that specify a column and values that give a Python statement in raw text, such as, e.g., "== hand". The content of the specified column will then be checked against the statement passed in the dictionary, and if it evaluates to True, the respective row will be written to file.
ref : str
    Name of the column that contains the cognate IDs if starling is chosen as an output format.
missing : { str, int } (default=0)
    If paps.nex or paps.csv is chosen as fileformat, this character will be inserted as an indicator of missing data.
tree_calc : { neighbor, upgma }
    If no tree has been calculated and tre or nwk is chosen as output format, the method that is used to calculate the tree.
threshold : float (default=0.6)
    The threshold that is used to carry out a flat cluster analysis if groups or cluster is chosen as output format.
ignore : { list, all } (default=all)
    Modifies the output format in tsv output and allows to ignore certain blocks in extended tsv, like msa, taxa, json, etc., which should be passed as a list. If you choose all as a plain string and not a list, this will ignore all additional blocks and output only plain tsv.
prettify : bool (default=False)
    Inserts comment characters between concepts in the tsv file output format, which makes it easier to see blocks of words denoting the same concept. Switching this off will output the file in plain tsv.
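A quick sketch of typical output calls (not part of the original docstring; filenames are illustrative):

>>> wl.output('tsv', filename='KSL_new', prettify=False)
>>> wl.output('dst', filename='KSL')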
renumber(source, target='', override=False)

Renumber a given set of string identifiers by replacing the IDs by integers.

Parameters

source : str
    The source column to be manipulated.
target : str (default='')
    The name of the target column. If no name is chosen, the target column will be created by adding "id" to the name of the source column.
override : bool (default=False)
    Force overwriting the data if the target column already exists.

Notes

In addition to a new column, a further entry is added to the _meta attribute of the wordlist by which the newly coined IDs can be retrieved from the former string attributes. This attribute is called source2target and can be accessed either via the _meta dictionary or directly as an attribute of the wordlist.

lingpy.basic.wordlist.from_cldf(path, to=<class 'lingpy.basic.wordlist.Wordlist'>)

Load data from CLDF into a LingPy Wordlist object or similar.

Parameters

path : str
    The path to the metadata file of your CLDF dataset.
to : ~lingpy.basic.wordlist.Wordlist
    A ~lingpy.basic.wordlist.Wordlist object or one of its descendants (LexStat, Alignments).

lingpy.basic.wordlist.get_wordlist(path, delimiter=',', quotechar='"', normalization_form='NFC', **keywords)

Load a wordlist from a normal CSV file.

Parameters

path : str
    The path to your CSV file.
delimiter : str
    The delimiter in the CSV file.
quotechar : str
    The quote character in your data.
row : str (default='concept')
    A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.
col : str (default='doculect')
    A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.
conf : string (default='')
    A string defining the path to the configuration file.

Notes

This function returns a Wordlist object. In contrast to the normal way to load a wordlist from a tab-separated file, this allows loading a wordlist from any normal CSV file, with your own specified delimiters and quote characters. If the first cell in the first row of your CSV file is not named ID, the integer identifiers which are required by LingPy will be created automatically.
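A minimal sketch (not part of the original docstring; my_data.csv is a hypothetical file name):

>>> from lingpy.basic.wordlist import get_wordlist
>>> wl2 = get_wordlist('my_data.csv', delimiter=',', quotechar='"')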
Module contents

This module provides basic classes for the handling of linguistic data. The basic idea is to provide classes that allow the user to handle basic linguistic datatypes (spreadsheets, wordlists) in a consistent way.

lingpy.compare package

Submodules

lingpy.compare.lexstat module

class lingpy.compare.lexstat.LexStat(filename, **keywords)

Bases: lingpy.basic.wordlist.Wordlist

Basic class for automatic cognate detection.

Parameters

filename : str
    The name of the file that shall be loaded.
model : Model
    The sound-class model that shall be used for the analysis. Defaults to the SCA sound-class model.
merge_vowels : bool (default=True)
    Indicate whether consecutive vowels should be merged into single tokens or kept apart as separate tokens.
transform : dict
    A dictionary that indicates how prosodic strings should be simplified (or generally transformed), using a simple key-value structure with the key referring to the original prosodic context and the value to the new value. Currently, prosodic strings (see prosodic_string()) offer 11 different prosodic contexts. Since not all of these are helpful in preliminary analyses for cognate detection, it is useful to merge some of them. The default settings distinguish only 5 instead of the 11 available contexts, namely:
    • C for all consonants in prosodically ascending position,
    • c for all consonants in prosodically descending position,
    • V for all vowels,
    • T for all tones, and
    • _ for word breaks.
    Make sure to also check the vowels keyword when initialising a LexStat object, since the symbols you use for vowels and tones should be identical with the ones you define in your transform dictionary.
vowels : str (default='VT_')
    For scoring-function creation using the get_scorer function, you have the possibility to use reduced scores for the matching of tones and vowels by modifying the vscale parameter, which is set to 0.5 as a default. In order to make sure that vowels and tones are properly detected, make sure your prosodic string representation of vowels matches the one in this keyword. Thus, if you change the prosodic strings using the transform keyword, you also need to change the vowel string, to make sure that vscale works as intended in the get_scorer function.
check : bool (default=False)
    If set to True, the input file will first be checked for errors before the calculation is carried out. Errors will be written to the file errors, defaulting to errors.log. See also apply_checks.
apply_checks : bool (default=False)
    If set to True, any errors identified by check will be handled silently.
no_bscorer : bool (default=False)
    If set to True, this will suppress the creation of a language-specific scoring function (which may become quite large and is additional ballast if the method lexstat is not used after all). If you use the lexstat method, however, this needs to be set to False.
errors : str
    The name of the error log.
segments : str (default='tokens')
    The name of the column in your data which contains the segmented transcriptions, or in which the segmented transcriptions should be placed.
transcription : str (default='ipa')
    The name of the column in your data which contains the unsegmented transcriptions.
classes : str (default='classes')
    The name of the column in the data which contains the sound-class representation of the transcriptions, or in which this information shall be placed after automatic conversion.
numbers : str (default='numbers')
    The language-specific triples consisting of language id (numeric), sound-class string (one character only), and prosodic string (one character only). Usually, numbers are automatically created from the columns classes, prostrings, and langid, but you can also provide them in your data.
langid : str (default='langid')
    Name of the column that contains a numerical language identifier, needed to produce the language-specific character triples (numbers). Unless specified explicitly, this is created automatically.
prostrings : str (default='prostrings')
    Name of the column containing prosodic strings (see List2014d for more details) of the segmented transcriptions, containing one character per prosodic string. Prosodic strings add a contextual component to phonetic sequences. They are automatically created, but can likewise be submitted with the initial data.
weights : str (default='weights')
    The name of the column which stores the individual gap weights for each sequence. Gap weights are positive floats for each segment in a string, which modify the gap opening penalty during alignment.
tokenize : function (default=ipa2tokens)
    The function which should be used to tokenize the entries in the column storing the transcriptions in case no segmentation is provided by the user.
get_prostring : function (default=prosodic_string)
    The function which should be used to create prosodic strings from the segmented transcription data. If you want to completely ignore prosodic strings in LexStat calculations, you could just pass the following function:

    >>> lex = LexStat('inputfile.tsv', get_prostring=lambda x: ["x" for y in x])

Notes

Instantiating this class does not require a lot of parameters. However, the user may modify its behaviour by providing additional attributes in the input file.

Attributes

pairs : dict
    A dictionary with tuples of language names as key and indices as value, pointing to unique combinations of words with the same meaning in all language pairs.
model : Model
    The sound-class model instance which serves to convert the phonetic data into sound classes.
chars : list
    A list of all unique language-specific character types in the instantiated LexStat object. The characters in this list consist of:
    • the language identifier (numeric, referenced as langid as a default, but customizable via the keyword langid),
    • the sound-class symbol for the respective IPA transcription value, and
    • the prosodic class value.
    All values are represented in the above order as one string, separated by a dot. Gaps are also included in this collection. They are traditionally represented as X for the sound class and - for the prosodic string.
rchars : list
    A list containing all unique character types across languages. In contrast to the chars attribute, the rchars (raw chars) do not contain the language identifier; thus they only consist of two values, separated by a dot, namely the sound-class symbol and the prosodic class value.
scorer : dict
    A collection of ScoreDict objects which are used to score the strings. LexStat distinguishes two different scoring functions:
    • rscorer: a raw scorer that is not language-specific and consists only of sound-class values and prosodic string values. This scorer is traditionally used to carry out the first alignment in order to calculate the language-specific scorer. It is directly accessible as an attribute of the LexStat class (rscorer). The characters which constitute the values in this scorer are accessible via the rchars attribute.
    • bscorer: the language-specific scorer based on the language-specific character triples, which is required by the lexstat method (cf. the no_bscorer keyword above).
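A minimal instantiation sketch (not part of the original docstring), using the KSL test set from the Wordlist examples above:

>>> from lingpy.compare.lexstat import LexStat
>>> from lingpy.tests.util import test_data
>>> lex = LexStat(test_data('KSL.qlc'))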
cluster(method='sca', cluster_method='upgma', threshold=0.3, scale=0.5, factor=0.3, restricted_chars='_T', mode='overlap', gop=-2, restriction='', ref='', external_function=None, **keywords)
Function for flat clustering of words into cognate sets.

Parameters

method : {sca, lexstat, edit-dist, turchin} (default=sca)
Select the method that shall be used for the calculation.

cluster_method : {upgma, single, complete, mcl} (default=upgma)
Select the cluster method. upgma (Sokal1958) refers to average linkage clustering, mcl refers to the Markov Clustering Algorithm (Dongen2000).

threshold : float (default=0.3)
Select the threshold for the cluster approach. If set to False, an automatic threshold will be calculated by computing the average distance of unrelated sequences (use with care).

scale : float (default=0.5)
Select the scale for the gap extension penalty.

factor : float (default=0.3)
Select the factor for extra scores for identical prosodic segments.

restricted_chars : str (default=T_)
Select the restricted chars (boundary markers) in the prosodic strings in order to enable secondary alignment.

mode : {global, local, overlap, dialign} (default=overlap)
Select the mode for the alignment analysis.

verbose : bool (default=False)
Define whether verbose output should be used or not.

gop : int (default=-2)
If sca is selected as a method, define the gap opening penalty.

restriction : {cv} (default=)
Specify the restriction for calculations using the edit-distance. Currently, only cv is supported. If edit-dist is selected as method and restriction is set to cv, consonant-vowel matches will be prohibited in the calculations, and the edit distance will be normalized by the length of the alignment rather than the length of the longest sequence, as described in Heeringa2006.

inflation : {int, float} (default=2)
Specify the inflation parameter for the use of the MCL algorithm.

expansion : int (default=2)
Specify the expansion parameter for the use of the MCL algorithm.
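The parameter-free methods can be applied without computing a scorer first; for instance (the ref column name is again illustrative):

>>> lex.cluster(method='turchin', ref='turchinid')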
get_distances(method='sca', mode='overlap', gop=-2, scale=0.5, factor=0.3, restricted_chars='T\_', aggregate=True)
Method calculates different distance estimates for language pairs.

Parameters

method : {sca, lexstat, edit-dist, turchin} (default=sca)
Select the method that shall be used for the calculation.

runs : int (default=100)
Select the number of random alignments for each language pair.

mode : {global, local, overlap, dialign} (default=overlap)
Select the mode for the alignment analysis.

gop : int (default=-2)
If sca is selected as a method, define the gap opening penalty.

scale : float (default=0.5)
Select the scale for the gap extension penalty.

factor : float (default=0.3)
Select the factor for extra scores for identical prosodic segments.

restricted_chars : str (default=T_)
Select the restricted chars (boundary markers) in the prosodic strings in order to enable secondary alignment.

aggregate : bool (default=True)
Return aggregated distances in form of a distance matrix for all taxa in the data.

Returns

D : numpy.array
An array with all distances calculated for each sequence pair.

get_frequencies(ftype='sounds', ref='tokens', aggregated=False)
Computes the frequencies of a given wordlist.

Parameters

ftype : str (default=sounds)
The type of frequency which shall be calculated. Select between sounds (type-token frequencies of sounds), wordlength (average word length per taxon or in aggregated form), and diversity for the diversity index (which requires that you have carried out cognate judgments; make sure to set the ref keyword to the column in which your cognates are).

ref : str (default=tokens)
The reference column, with the column for tokens as a default. Make sure to modify this keyword in case you want to check for the diversity.

aggregated : bool (default=False)
Determine whether frequencies should be calculated in an aggregated way, for all languages, or on a language-per-language basis.

Returns

freqs : {dict, float}
Depending on the datatype you chose, this returns either a dictionary containing the frequencies or a float indicating the ratio.

get_random_distances(method='lexstat', runs=100, mode='overlap', gop=-2, scale=0.5, factor=0.3, restricted_chars='T\_')
Method calculates random scores for unrelated words in a dataset.

Parameters

method : {sca, lexstat, edit-dist, turchin} (default=sca)
Select the method that shall be used for the calculation.

runs : int (default=100)
Select the number of random alignments for each language pair.

mode : {global, local, overlap, dialign} (default=overlap)
Select the mode for the alignment analysis.

gop : int (default=-2)
If sca is selected as a method, define the gap opening penalty.

scale : float (default=0.5)
Select the scale for the gap extension penalty.

factor : float (default=0.3)
Select the factor for extra scores for identical prosodic segments.

restricted_chars : str (default=T_)
Select the restricted chars (boundary markers) in the prosodic strings in order to enable secondary alignment.

Returns

D : numpy.array
An array with all distances calculated for each sequence pair.
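A brief sketch of both methods in use, continuing the LexStat object from above:

>>> D = lex.get_distances(method='sca', aggregate=True)
>>> wlen = lex.get_frequencies(ftype='wordlength', aggregated=True)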
get_scorer(**keywords)
Create a scoring function based on sound correspondences.

Parameters

method : str (default=shuffle)
Select between markov, for automatically generated random strings, and shuffle, for random strings taken directly from the data.

ratio : tuple (default=3,2)
Define the ratio between derived and original score for sound-matches.

vscale : float (default=0.5)
Define a scaling factor for vowels, in order to decrease their score in the calculations.

runs : int (default=1000)
Choose the number of random runs that shall be made in order to derive the random distribution.

threshold : float (default=0.7)
The threshold which is used to select those words that are compared in order to derive the attested distribution.

modes : list (default=[("global",-2,0.5),("local",-1,0.5)])
The modes which are used in order to derive the distributions from pairwise alignments.

factor : float (default=0.3)
The scaling factor for sound segments with identical prosodic environment.

force : bool (default=False)
Force recalculation of an existing distribution.

preprocessing : bool (default=False)
Select whether an SCA analysis shall be used to derive a preliminary set of cognates from which the attested distribution shall be derived.

rands : int (default=1000)
If method is set to markov, this parameter defines the number of strings to produce for the calculation of the random distribution.

limit : int (default=10000)
If method is set to markov, this parameter defines the limit above which no more search for unique strings will be carried out.

cluster_method : {upgma, single, complete} (default=upgma)
Select the method to be used for the calculation of cognates in the preprocessing phase, if preprocessing is set to True.

gop : int (default=-2)
If preprocessing is selected, define the gap opening penalty for the preprocessing calculation of cognates.

unattested : {int, float} (default=-5)
If a pair of sounds is not attested in the data, but expected by the alignment algorithm that computes the expected distribution, the score would be minus infinity. In order to smooth this behaviour and to reduce the strictness, we set a default negative value which does not necessarily need to be very high, since it may well be that we miss a potentially good pairing in the first runs of alignment analyses. Use this keyword to adjust this parameter.

unexpected : {int, float} (default=0.000001)
If a pair is encountered in a given alignment but not expected according to the randomized alignments, the score would not be calculable, since we would have to divide by zero. For this reason, we set a very small constant by which the score is divided in this case. Note that this constant is only relevant in those cases where the shuffling procedure was not carried out long enough.

get_subset(sublist, ref='concept')
Function creates a specific subset of all word pairs.

Parameters

sublist : list
A list which contains those items which should be considered for the subset creation, for example, a list of concepts.

ref : string (default=concept)
The reference point to compare the given sublist.

Notes

This function can be used to consider only a smaller part of word pairs when creating a scorer. Normally, all words are compared, but defining a subset allows to compare only those belonging to a specific concept list (Swadesh list).
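For example, one might restrict scorer creation to a subset of concepts before recomputing the scorer (the concept labels are illustrative):

>>> lex.get_subset(['hand', 'foot', 'eye'])
>>> lex.get_scorer(runs=1000, force=True)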
missing : {str, int} (default=0)
If paps.nex or paps.csv is chosen as fileformat, this character will be inserted as an indicator of missing data.

tree_calc : {neighbor, upgma}
If no tree has been calculated and tre or nwk is chosen as output format, the method that is used to calculate the tree.

threshold : float (default=0.6)
The threshold that is used to carry out a flat cluster analysis if groups or cluster is chosen as output format.

ignore : {list, all}
Modifies the output format in tsv output and allows to ignore certain blocks in extended tsv, like msa, taxa, json, etc., which should be passed as a list. If you choose all as a plain string and not a list, this will ignore all additional blocks and output only plain tsv.

prettify : bool (default=True)
Inserts comment characters between concepts in the tsv file output format, which makes it easier to see blocks of words denoting the same concept. Switching this off will output the file in plain tsv.

lingpy.compare.lexstat.char_from_charstring(cstring)

lingpy.compare.lexstat.get_score_dict(chars, model)

lingpy.compare.partial module

Module provides a class for partial cognate detection, expanding the LexStat class.

class lingpy.compare.partial.Partial(infile, **keywords)
Bases: lingpy.compare.lexstat.LexStat
Extended class for automatic detection of partial cognates.

Parameters

filename : str
The name of the file that shall be loaded.

model : Model
The sound-class model that shall be used for the analysis. Defaults to the SCA sound-class model.

merge_vowels : bool (default=True)
Indicate whether consecutive vowels should be merged into single tokens or kept apart as separate tokens.

transform : dict
A dictionary that indicates how prosodic strings should be simplified (or generally transformed), using a simple key-value structure with the key referring to the original prosodic context and the value to the new value. Currently, prosodic strings (see prosodic_string()) offer 11 different prosodic contexts. Since not all of these are helpful in preliminary analyses for cognate detection, it is useful to merge some of these contexts into one. The default settings distinguish only 5 instead of the 11 available contexts, namely:
• C for all consonants in prosodically ascending position,
• c for all consonants in prosodically descending position,
• V for all vowels,
• T for all tones, and
• _ for word-breaks.
Make sure to also check the vowels keyword when initialising a LexStat object, since the symbols you use for vowels and tones should be identical with the ones you define in your transform dictionary.

vowels : str (default=VT_)
For scoring function creation using the get_scorer function, you have the possibility to use reduced scores for the matching of tones and vowels by modifying the vscale parameter, which is set to 0.5 as a default. In order to make sure that vowels and tones are properly detected, make sure your prosodic string representation of vowels matches the one in this keyword. Thus, if you change the prosodic strings using the transform keyword, you also need to change the vowels string, to make sure that vscale works as intended in the get_scorer function.

check : bool (default=False)
If set to True, the input file will first be checked for errors before the calculation is carried out. Errors will be written to the file specified by the errors keyword, defaulting to errors.log. See also apply_checks.

apply_checks : bool (default=False)
If set to True, any errors identified by check will be handled silently.

no_bscorer : bool (default=False)
If set to True, this will suppress the creation of a language-specific scoring function (which may become quite large and is additional ballast if the method lexstat is not used after all). If you use the lexstat method, however, this needs to be set to False.

errors : str
The name of the error log.

Notes

This method automatically infers partial cognate sets from data which was previously morphologically segmented.

Attributes

pairs : dict
A dictionary with tuples of language names as key and indices as value, pointing to unique combinations of words with the same meaning in all language pairs.

model : Model
The sound class model instance which serves to convert the phonetic data into sound classes.

chars : list
A list of all unique language-specific character types in the instantiated LexStat object. The characters in this list consist of
• the language identifier (numeric, referenced as langid as a default, but customizable via the keyword langid),
• the sound class symbol for the respective IPA transcription value, and
• the prosodic class value.
All values are represented in the above order as one string, separated by a dot. Gaps are also included in this collection. They are traditionally represented as X for the sound class and - for the prosodic string.

rchars : list
A list containing all unique character types across languages. In contrast to the chars attribute, the rchars (raw chars) do not contain the language identifier, thus they only consist of two values, separated by a dot, namely the sound class symbol and the prosodic class value.

scorer : dict
A collection of ScoreDict objects, which are used to score the strings. LexStat distinguishes two different scoring functions:
• rscorer: A raw scorer that is not language-specific and consists only of sound class values and prosodic string values. This scorer is traditionally used to carry out the first alignment in order to calculate the language-specific scorer. It is directly accessible as an attribute of the LexStat class (rscorer). The characters which constitute the values in this scorer are accessible via the rchars attribute of each LexStat class.
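A minimal sketch of instantiating the class (the filename is hypothetical; the input is expected to contain morphologically segmented transcriptions):

>>> from lingpy.compare.partial import Partial
>>> part = Partial('partial-wordlist.tsv')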
add_cognate_ids(source, target, idtype='strict', override=False)
Compute normal cognate identifiers from partial cognate sets.

Parameters

source : str
Name of the source column in your wordlist file.

target : str
Name of the target column in your wordlist file.

idtype : str (default=strict)
Select between strict and loose.

override : bool (default=False)
Specify whether you want to override existing columns.

Notes

While the computation of strict cognate IDs from partial cognate IDs is straightforward and just judges those words as cognate which are identical in all their parts, the computation of loose cognate IDs constructs a network between all words, draws lines between all words that share a common morpheme, and judges all connected components in this network as cognate.

partial_cluster(method='sca', threshold=0.45, scale=0.5, factor=0.3, restricted_chars='_T', mode='overlap', cluster_method='infomap', gop=-1, restriction='', ref='', external_function=None, split_on_tones=True, **keywords)
Cluster the words into partial cognate sets.

Function for flat clustering of words into cognate sets.

Parameters

method : {sca, lexstat, edit-dist, turchin} (default=sca)
Select the method that shall be used for the calculation.

cluster_method : {upgma, single, complete, mcl} (default=upgma)
Select the cluster method. upgma (Sokal1958) refers to average linkage clustering, mcl refers to the Markov Clustering Algorithm (Dongen2000).

threshold : float (default=0.3)
Select the threshold for the cluster approach. If set to False, an automatic threshold will be calculated by computing the average distance of unrelated sequences (use with care).

scale : float (default=0.5)
Select the scale for the gap extension penalty.

factor : float (default=0.3)
Select the factor for extra scores for identical prosodic segments.

restricted_chars : str (default=T_)
Select the restricted chars (boundary markers) in the prosodic strings in order to enable secondary alignment.

mode : {global, local, overlap, dialign} (default=overlap)
Select the mode for the alignment analysis.

verbose : bool (default=False)
Define whether verbose output should be used or not.

gop : int (default=-2)
If sca is selected as a method, define the gap opening penalty.

restriction : {cv} (default=)
Specify the restriction for calculations using the edit-distance. Currently, only cv is supported. If edit-dist is selected as method and restriction is set to cv, consonant-vowel matches will be prohibited in the calculations, and the edit distance will be normalized by the length of the alignment rather than the length of the longest sequence, as described in Heeringa2006.

inflation : {int, float} (default=2)
Specify the inflation parameter for the use of the MCL algorithm.

expansion : int (default=2)
Specify the expansion parameter for the use of the MCL algorithm.
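Continuing the sketch, partial cognates can be clustered and then merged into regular cognate IDs (the column names are illustrative):

>>> part.partial_cluster(method='sca', threshold=0.45, ref='partialids')
>>> part.add_cognate_ids('partialids', 'cogids', idtype='strict')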
lingpy.compare.phylogeny module

Phylogeny-based detection of borrowings in lexicostatistical wordlists.

class lingpy.compare.phylogeny.PhyBo(dataset, tree=None, paps='pap', ref='cogid', tree_calc='neighbor', output_dir=None, **keywords)
Bases: lingpy.basic.wordlist.Wordlist
Basic class for calculations using the TreBor method.

Parameters

dataset : string
Name of the dataset that shall be analyzed.

tree : {None, string}
Name of the tree file.

paps : string (default=pap)
Name of the column that stores the specific cognate IDs, consisting of an arbitrary integer key and a key for the concept.

ref : string (default=cogid)
Name of the column that stores the general cognate IDs (the reference of the analysis).

tree_calc : {neighbor, upgma} (default=neighbor)
Select the algorithm to be used for the tree calculation if no tree is passed with the file.

missing : int (default=-1)
Specify how missing data should be handled. If set to -1, missing data can account for both presence or absence of a cognate set in the given language. If set to 0, missing data is treated as absence.

degree : int (default=100)
The degree which is chosen for the projection of the tree layout.

analyze(runs='default', mixed=False, output_gml=False, tar=False, full_analysis=True, plot_dists=False, output_plot=False, plot_mln=False, plot_msn=False, **keywords)
Carry out a full analysis using various parameters.

Parameters

runs : {str, list} (default=default)
Define a couple of different models to be analyzed. Select between:
• default: weighted analysis, using parsimony and weights for gains and losses,
• topdown: use the traditional approach by Nelson-Sathi2011, and
• restriction: use the restriction approach.
You can also define your own mix of models.

usetex : bool (default=True)
Specify whether you want to use LaTeX to render plots.

mixed : bool (default=False)
If set to True, calculate a mixed model by selecting the best model for each item separately.

output_gml : bool (default=False)
Set to True in order to output every gain-loss scenario in GML format.

full_analysis : bool (default=True)
Specifies whether a full analysis is carried out or not.

plot_mln : bool (default=True)
Select or unselect output plot for the MLN.

plot_msn : bool (default=False)
Select or unselect output plot for the MSN.
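A minimal sketch of a borrowing-detection run (the dataset filename is hypothetical and must refer to a wordlist with cognate IDs and, ideally, a reference tree):

>>> from lingpy.compare.phylogeny import PhyBo
>>> phy = PhyBo('wordlist.qlc', ref='cogid')
>>> phy.analyze(runs='default')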
get_ACS(glm, **keywords)
Compute the ancestral character states (ACS) for all internal nodes.

get_AVSD(glm, **keywords)
Function retrieves all paps for ancestor languages in a given tree.

get_CVSD()
Calculate the Contemporary Vocabulary Size Distribution (CVSD).

get_GLS(mode='weighted', ratio=(1, 1), restriction=3, output_gml=False, output_plot=False, tar=False, **keywords)
Create gain-loss scenarios for all non-singleton paps in the data.

Parameters

mode : string (default=weighted)
Select between weighted, restriction, and topdown. The three modes refer to the following frameworks:
• weighted refers to the weighted parsimony framework described in List2014b and List2014a. Weights are specified with help of a ratio for the scoring of gain and loss events. The ratio can be defined with help of the ratio keyword.
• restriction refers to a simple method in which only a specific amount of gain events is allowed. The maximally allowed number of gain events can be defined with help of the restriction keyword.
• topdown refers to the top-down method outlined in Dagan2007 and first applied to linguistic data in Nelson-Sathi2011. This method also defines a maximal number of gain events, but in contrast to the restriction approach, it starts from the top of the tree and stops if the maximal number of restrictions has been reached. The maximally allowed number of gain events can, again, be specified with help of the restriction keyword.

ratio : tuple (default=(1,1))
If weighted mode is selected, define the ratio between the weights for gains and losses.

restriction : int (default=3)
If restriction is selected as mode, define the maximal number of gains.

output_gml : bool (default=False)
If set to True, the decisions for each GLS are stored in a separate file in GML format.

tar : bool (default=False)
If set to True, the GML files will be added to a compressed tar file.

gpl : int (default=1)
Specifies the maximal number of gains per lineage. This parameter specifies how cases should be handled in which a character is first gained, then lost, and then gained again. By setting this parameter to 1 (the default setting), such cases are prohibited, since only one gain per lineage is allowed.

missing_data : int (default=0)
Currently, we offer two ways to handle missing data. The first case just treats missing data in the same way in which the absence of a character is handled and can be evoked by setting this parameter to 0. The second case will treat missing data as either absent or present characters, based on how well each option coincides with the overall evolutionary scenario. This behaviour can be evoked by setting this parameter to -1.

push_gains : bool (default=True)
In bottom-up calculations, there will often be multiple scenarios of which only one is selected by the method. In order to define consistent criteria for scenario selection, we follow Mirkin2003 in allowing to force the algorithm to prefer those scenarios in which gains are pushed to the leaves. This behaviour is handled by this parameter. Setting it to True will force the algorithm to push gain events to the leaves of the tree. Setting it to False will force it to prefer those scenarios where the gains are closer to the root.
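For instance, gain-loss scenarios penalizing gains twice as much as losses could be computed as follows:

>>> phy.get_GLS(mode='weighted', ratio=(2, 1))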
get_IVSD(output_gml=False, output_plot=False, tar=True, leading_model=False, mixed_threshold=0.0, evaluation='mwu', **keywords)
Calculate VSD on the basis of each item.

get_MLN(glm, threshold=1, method='mr')
Compute a Minimal Lateral Network for a given model.

Parameters

glm : str
The dictionary key for the gain-loss model.

threshold : int (default=1)
The threshold used to exclude edges.

method : str (default=mr)
Select the method for MLN calculation. Choose between:
• mr: majority-rule, multiple links are resolved by selecting those which occur most frequently,
• td: tree-distance, multiple links are resolved by selecting those which are closest on the tree, and
• bc: betweenness-centrality, multiple links are resolved by selecting those which have the highest betweenness centrality.

get_MSN(glm='', external_edges=False, deep_nodes=False, **keywords)
Plot the Minimal Spatial Network.

Parameters

glm : str (default=)
A string that encodes which model should be plotted.

filename : str
The name of the file to which the plot shall be written.

fileformat : str
The output format of the plot.

threshold : int (default=1)
The threshold for the minimal amount of shared links that shall be plotted.

usetex : bool (default=True)
Specify whether LaTeX shall be used for the plot.

get_PDC(glm, **keywords)
Calculate Patchily Distributed Cognates.

get_edge(glm, nodeA, nodeB, entries='', msn=False)
Return the edge data for a given gain-loss model.

get_stats(glm, subset='', filename='')
Calculate basic statistics for a given gain-loss model.

plot_ACS(glm, **keywords)
Plot a tree in which the node size correlates with the size of the ancestral node.

plot_GLS(glm, **keywords)
Plot the inferred scenarios for a given model.

plot_MLN(glm='', fileformat='pdf', threshold=1, usetex=False, taxon_labels='taxon_short_labels', alphat=False, alpha=0.75, **keywords)
Plot the MLN with help of Matplotlib.

Parameters

glm : str (default=)
Identifier for the gain-loss model that is plotted. Defaults to the model that had the best scores in terms of probability.

filename : str (default=)
If no filename is selected, the filename is identical with the dataset.

fileformat : {svg, png, jpg, pdf} (default=pdf)
Select the format of the output plot.

threshold : int (default=1)
Select the threshold for drawing lateral edges.

usetex : bool (default=True)
Specify whether you want to use LaTeX to render plots.

colormap : {None, matplotlib.cm}
A matplotlib colormap instance. If set to None, this defaults to jet.

taxon_labels : str (default=taxon.short_labels)
Specify the taxon labels that should be included in the plot.

plot_MLN_3d(glm='', filename='', fileformat='pdf', threshold=1, usetex=True, colormap=None, taxon_labels='taxon_short_labels', alphat=False, alpha=0.75, **keywords)
Plot the MLN with help of Matplotlib in 3d.

Parameters

glm : str (default=)
Identifier for the gain-loss model that is plotted. Defaults to the model that had the best scores in terms of probability.

filename : str (default=)
If no filename is selected, the filename is identical with the dataset.

fileformat : {svg, png, jpg, pdf} (default=pdf)
Select the format of the output plot.

threshold : int (default=1)
Select the threshold for drawing lateral edges.

usetex : bool (default=True)
Specify whether you want to use LaTeX to render plots.

colormap : {None, matplotlib.cm}
A matplotlib colormap instance. If set to None, this defaults to jet.

taxon_labels : str (default=taxon.short_labels)
Specify the taxon labels that should be included in the plot.
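A sketch of computing and plotting an MLN; the model identifier is hypothetical, since PhyBo stores gain-loss models under keys derived from the analysis settings:

>>> phy.get_MLN('w-2-1', threshold=1)
>>> phy.plot_MLN('w-2-1', fileformat='pdf')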
plot_MSN(glm='', fileformat='pdf', threshold=1, usetex=False, alphat=False, alpha=0.75, only=[], **keywords)
Plot a minimal spatial network.

plot_concept_evolution(glm, concept='', fileformat='png', **keywords)
Plot the evolution of specific concepts along the reference tree.

plot_two_concepts(concept, cogA, cogB, labels={1: '1', 2: '2', 3: '3', 4: '4'}, tcolor={1: 'white', 2: 'black', 3: '0.5', 4: '0.1'}, filename='pdf', fileformat='pdf', usetex=True)
Plot the evolution of two concepts in space.

Notes

This function may be useful to contrast patterns of different words in geographic space.

lingpy.compare.phylogeny.TreBor
alias of PhyBo

lingpy.compare.phylogeny.get_gls(paps, taxa, tree, gpl=1, weights=(1, 1), push_gains=True, missing_data=0)
Calculate a gain-loss scenario.

Parameters

paps : list
A list containing the presence-absence patterns for all leaves of the reference tree. Presence is indicated by 1, and absence by 0. Missing characters are indicated by -1.

taxa : list
The list of taxa (leaves of the tree).

tree : str
A tree in Newick format. Taxon names should (of course) be identical with the names in the list of taxa.

gpl : int
Gains per lineage. Specify the maximal amount of gains per lineage. One lineage is hereby defined as one path in the tree. If set to 0, only one gain per lineage is allowed; if set to 1, one additional gain is allowed, and so on. Use with care, since this will lead to larger computation costs (more possibilities have to be taken care of) and can also be quite unrealistic.

weights : tuple (default=(1,1))
Specify the weights for gains and losses. Setting this parameter to (2,1) will penalize gain events with 2 and loss events with 1.

push_gains : bool (default=True)
Determine whether, of a set of equally parsimonious patterns, those should be retained that show gains closer to the leaves of the tree or not.

missing_data : int (default=0)
Determine how missing data should be represented. If set to 0 (default), missing data will be treated in the same way as absent character states. If you want missing data to be accounted for in the algorithm, set this parameter to -1.

Notes

This is an enhanced version of the older approach to parsimony-based gain-loss mapping. The algorithm is much faster than the previous one, and the code is also written much more clearly. In most tests I ran so far, it also outperformed other approaches by finding more parsimonious solutions.

lingpy.compare.sanity module

Module provides basic checks for wordlists.

lingpy.compare.sanity.mutual_coverage(wordlist, concepts='concept')
Compute mutual coverage for all language pairs in your data.

Parameters

wordlist : ~lingpy.basic.wordlist.Wordlist
Your Wordlist object (or a descendant class).

concepts : str (default=concept)
The column which stores your concepts.

Returns

coverage : dict
A dictionary of dictionaries whose value is the number of items two languages share.

See also: mutual_coverage_check, mutual_coverage_subset

Examples

Compute coverage for the KSL.qlc dataset:

>>> from lingpy.compare.sanity import mutual_coverage
>>> from lingpy import *
>>> from lingpy.tests.util import test_data
>>> wl = Wordlist(test_data('KSL.qlc'))
>>> cov = mutual_coverage(wl)
>>> cov['English']['German']
200
lingpy.compare.sanity.mutual_coverage_check(wordlist, threshold, concepts='concept')
Check whether a given mutual coverage is fulfilled by the dataset.

Parameters

wordlist : ~lingpy.basic.wordlist.Wordlist
Your Wordlist object (or a descendant class).

concepts : str (default=concept)
The column which stores your concepts.

threshold : int
The threshold which should be checked.

Returns

c : bool
True if the coverage is fulfilled for all language pairs, False otherwise.

See also: mutual_coverage, mutual_coverage_subset

Examples

Compute the minimal mutual coverage for the KSL dataset:

>>> from lingpy.compare.sanity import mutual_coverage_check
>>> from lingpy import *
>>> from lingpy.tests.util import test_data
>>> wl = Wordlist(test_data('KSL.qlc'))
>>> for i in range(wl.height, 1, -1):
...     if mutual_coverage_check(wl, i):
...         print('mutual coverage is {0}'.format(i))
...         break
mutual coverage is 200

lingpy.compare.sanity.mutual_coverage_subset(wordlist, threshold, concepts='concept')
Compute maximal mutual coverage for all languages in a wordlist.

Parameters

wordlist : ~lingpy.basic.wordlist.Wordlist
Your Wordlist object (or a descendant class).

concepts : str (default=concept)
The column which stores your concepts.

threshold : int
The threshold which should be checked.

Returns

coverage : tuple
A tuple consisting of the number of languages for which the coverage could be found as well as a list of all pairings in which this coverage is possible. The list itself contains the mutual coverage inside each pair and the list of languages.

See also: mutual_coverage, mutual_coverage_check

Examples

Compute all sets of languages with coverage of 200 for the KSL dataset:

>>> from lingpy.compare.sanity import mutual_coverage_subset
>>> from lingpy import *
>>> from lingpy.tests.util import test_data
>>> wl = Wordlist(test_data('KSL.qlc'))
>>> number_of_languages, pairs = mutual_coverage_subset(wl, 200)
>>> for number_of_items, languages in pairs:
...     print(number_of_items, ','.join(languages))
200 Albanian,English,French,German,Hawaiian,Navajo,Turkish

lingpy.compare.sanity.synonymy(wordlist, concepts='concept', languages='doculect')
Check the number of synonyms per language and concept.

Parameters

wordlist : ~lingpy.basic.wordlist.Wordlist
Your Wordlist object (or a descendant class).

concepts : str (default=concept)
The column which stores your concepts.

languages : str (default=doculect)
The column which stores your language names.

Returns

synonyms : dict
A dictionary with language and concept as key and the number of synonyms as value.

Examples

Calculate synonymy in the KSL.qlc dataset:

>>> from lingpy.compare.sanity import synonymy
>>> from lingpy import *
>>> from lingpy.tests.util import test_data
>>> wl = Wordlist(test_data('KSL.qlc'))
>>> syns = synonymy(wl)
>>> for a, b in syns.items():
...     if b > 1:
...         print(a[0], a[1], b)

There is no case where synonymy exceeds 1 word per concept per language, since Kessler2001 paid particular attention to avoiding synonyms.
lingpy.compare.strings module

Module provides various string similarity metrics.

lingpy.compare.strings.bidist1(a, b, normalized=True)
Computes bigram-based distance.

Notes
The binary version. Checks whether two bigrams are equal or not.

lingpy.compare.strings.bidist2(a, b, normalized=True)
Computes bigram-based distance.

Notes
The comprehensive version of the bigram distance.

lingpy.compare.strings.bidist3(a, b, normalized=True)
Computes bigram-based distance.

Notes
Computes the positional version of the bigrams. Assigns a partial distance between two bigrams based on the positional similarity of bigrams.

lingpy.compare.strings.bisim1(a, b, normalized=True)
Computes the binary version of bigram similarity.

lingpy.compare.strings.bisim2(a, b, normalized=True)
Computes bigram similarity, the comprehensive version.

Notes
Computes the number of common 1-grams between two n-grams.

lingpy.compare.strings.bisim3(a, b, normalized=True)
Computes bi-sim, the positional version.

Notes
The partial similarity between two bigrams is defined as the number of matching 1-grams at each position.

lingpy.compare.strings.dice(a, b, normalized=True)
Computes the Dice measure that measures the number of common bigrams.

lingpy.compare.strings.ident(a, b)
Computes the identity between two strings: returns 1 if the strings are identical, and 0 otherwise.

lingpy.compare.strings.jcd(a, b, normalized=True)
Computes the bigram-based Jaccard Index.

lingpy.compare.strings.jcdn(a, b, normalized=True)
Computes the bigram- and trigram-based Jaccard Index.

lingpy.compare.strings.lcs(a, b, normalized=True)
Computes the longest common subsequence between two strings.

lingpy.compare.strings.ldn(a, b, normalized=True)
Basic Levenshtein distance without the swap operation (all operations have equal costs).

See also: lingpy.align.pairwise.edit_dist, lingpy.compare.strings.ldn_swap

lingpy.compare.strings.ldn_swap(a, b, normalized=True)
Basic Levenshtein distance with the swap operation included (identifies metathesis).

lingpy.compare.strings.prefix(a, b, normalized=True)
Computes the longest common prefix between two strings.

lingpy.compare.strings.tridist1(a, b, normalized=True)
Computes trigram-based distance.

Notes
The binary version. Checks whether two trigrams are equal or not.

lingpy.compare.strings.tridist2(a, b, normalized=True)
Computes trigram-based distance.

Notes
The comprehensive version of the trigram distance.

lingpy.compare.strings.tridist3(a, b, normalized=True)
Computes trigram-based distance.

Notes
Computes the positional version of the trigrams. Assigns a partial distance between two trigrams based on the positional similarity of trigrams.

lingpy.compare.strings.trigram(a, b, normalized=True)
Computes the number of common trigrams between two strings.

lingpy.compare.strings.trisim1(a, b, normalized=True)
Computes the binary version of trigram similarity.

lingpy.compare.strings.trisim2(a, b, normalized=True)
Computes tri-sim, the comprehensive version.

Notes
Simply computes the number of common 1-grams between two n-grams instead of calling LCS, as is done in the Kondrak2005 paper. Note that the LCS for a trigram can be computed in O(n) time if we assume that list lookup is in constant time.

lingpy.compare.strings.trisim3(a, b, normalized=True)
Computes tri-sim, the positional version.

Notes
Simply computes the number of matching 1-grams in each position.

lingpy.compare.strings.xdice(a, b, normalized=True)
Computes the skip-1-character version of Dice.

lingpy.compare.strings.xxdice(a, b, normalized=True)
Returns the XXDice between two strings.

Notes
Taken from Brew1996.
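These metrics can be called directly on plain strings; a short sketch (the strings are arbitrary, and the returned values depend on the normalization, so no outputs are shown):

>>> from lingpy.compare.strings import ldn, ldn_swap, dice
>>> d = ldn('harry', 'gary')         # normalized Levenshtein distance
>>> ds = ldn_swap('laden', 'alden')  # variant that detects metathesis
>>> s = dice('harry', 'gary')        # bigram-based Dice measure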
lingpy.compare.util module

Module contents

Basic module for language comparison.

lingpy.convert package

Submodules

lingpy.convert.cldf module

Basic functions for the conversion from LingPy to CLDF and vice versa.

lingpy.convert.cldf.from_cldf(path, to=<class 'lingpy.basic.wordlist.Wordlist'>)
Load data from CLDF into a LingPy Wordlist object or similar.

Parameters

path : str
The path to the metadata file of your CLDF dataset.

to : ~lingpy.basic.wordlist.Wordlist
A ~lingpy.basic.wordlist.Wordlist object or one of its descendants (LexStat, Alignment).

lingpy.convert.cldf.to_cldf(wordlist, path='cldf', source_path=None, ref='cogid', segments='tokens', form='ipa', note='note', form_in_source='value', source=None, alignment=None)
Convert a wordlist in LingPy to CLDF.

Parameters

wordlist : ~lingpy.basic.wordlist.Wordlist
A regular Wordlist object (or similar).

path : str (default=cldf)
The name of the directory to which the files will be written.

source_path : str (default=None)
If available, specify the path of your BibTeX file with the sources.

ref : str (default=cogid)
The column in which the cognate sets are stored.

segments : str (default=tokens)
The column in which the segmented phonetic strings are stored.

form : str (default=ipa)
The column in which the unsegmented phonetic strings are stored.

note : str (default=note)
The column in which you store your comments.

form_in_source : str (default=value)
The column in which you store the original form in the source.

source : str (default=None)
The column in which you store your source information.

alignment : str (default=alignment)
The column in which you store the alignments.
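A round-trip sketch (the metadata path is hypothetical and must point to a CLDF Wordlist metadata file):

>>> from lingpy.convert.cldf import from_cldf, to_cldf
>>> wl = from_cldf('cldf/Wordlist-metadata.json')
>>> to_cldf(wl, path='cldf-new', ref='cogid')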
lingpy.convert.graph module

Conversion routines for the GML format.

lingpy.convert.graph.gls2gml(gls, graph, tree, filename='')
Create a GML representation of a given gain-loss scenario (GLS).

Parameters

gls : list
A list of tuples, indicating the origins of characters along a tree.

graph : networkx.graph
A graph that serves as a template for the plotting of the GLS.

tree : cogent.tree.PhyloNode
A tree object.

lingpy.convert.graph.igraph2networkx(graph)

lingpy.convert.graph.networkx2igraph(graph)
Helper function converts a networkx graph to an igraph graph object.

lingpy.convert.graph.nwk2gml(treefile, filename='')
Function converts a tree in Newick format to a network in GML format.

Parameters

treefile : str
Either a string defining the path to a file containing the tree in Newick format, or the tree string itself.

filename : str (default=lingpy)
The name of the output GML file. If filename is set to None, the function returns a Graph.

Returns

graph : networkx.Graph

lingpy.convert.graph.radial_layout(treestring, change=<function <lambda>>, degree=100, filename='', start=0, root='root')
Function calculates a simple radial tree layout.

Parameters

treefile : str
Either a string defining the path to a file containing the tree in Newick format, or the tree string itself.

filename : str (default=None)
The name of the output file (GML format). If set to None, no output will be written to file.

change : function (default=lambda x: 2 * x**2)
The function used to modify the radius in the polar projection of the tree.

Returns

graph : networkx.Graph
A graph representation of the tree with coordinates specified in the graphics attribute of the nodes.

Notes

This function creates a radial tree layout from a given tree specified in Newick format.

lingpy.convert.html module

Basic functions for HTML plots.

lingpy.convert.html.alm2html(infile, title='', shorttitle='', filename='', colored=False, main_template='', table_template='', dataset='', confidence=False, **keywords)
Convert files in alm-format into colored html-format.

Parameters

title : str
Define the title of the output file. If no title is provided, the default title LexStat - Automatic Cognate Judgments will be used.

shorttitle : str
Define the shorttitle of the html-page. If no title is provided, the default title LexStat will be used.

See also: lingpy.convert.html.msa2html, lingpy.convert.html.msa2tex

Notes

The coloring of sound segments with respect to the sound class they belong to is based on the definitions given in the color Model. It can easily be changed and adapted.

lingpy.convert.html.colorRange(number, brightness=300)
Function returns different colors for the given range.

Notes

Idea taken from http://stackoverflow.com/questions/876853/generating-color-ranges-in-python .

lingpy.convert.html.msa2html(msa, shorttitle='', filename='', template='', **keywords)
Convert files in msa-format into colored html-format.

Parameters

msa : dict
A dictionary object that contains all the information of an MSA object.

shorttitle : str
Define the shorttitle of the html-page. If no title is provided, the default title SCA will be used.

filename : str (default=)
Define the name of the output file. If no name is defined, the name of the input file will be taken as a default.

template : str (default=)
The path to the template file. If no name is defined, the basic template will be used. The basic template currently used can be found under lingpy/data/templates/msa2html.html.

See also: lingpy.convert.html.alm2html

Notes

The coloring of sound segments with respect to the sound class they belong to is based on the definitions given in the color Model. It can easily be changed and adapted.

Examples

Load the library.

>>> from lingpy import *

Load an msq-file from the test-sets.

>>> msa = MSA('harry.msq')

Align the data progressively and carry out a check for swapped sites.

>>> msa.prog_align()
>>> msa.swap_check()
>>> print(msa)
w    o    l    d    e    m    o    r    t
w    a    l    d    e    m    a    r    -
v    l    a    d    i    m    i    r    -

Save the data to the file harry.msa.

>>> msa.output('msa', filename='harry')

Save the msa-object as html.

>>> msa.output('html', filename='harry')

lingpy.convert.html.msa2tex(infile, template='', filename='', **keywords)
Convert an MSA to a tabular representation which can easily be used in LaTeX documents.

lingpy.convert.html.psa2html(infile, **kw)
Function converts a PSA file into colored html-format.

lingpy.convert.html.string2html(taxon, string, swaps=[], tax_len=None)
Function converts an (aligned) string into colored html-format. @deprecated

lingpy.convert.html.tokens2html(string, swaps=[], tax_len=None)
Function converts an (aligned) string into colored html-format.

Notes

This function is currently not used by any other program, so it might be useful to just deprecate it. @deprecated

lingpy.convert.plot module

Module provides functions for the transformation of text data into a visually appealing format.
lingpy.convert.plot.plot_concept_evolution(scenarios, tree, fileformat='pdf', degree=90, **keywords)
Plot the evolution according to the MLN method of all words for a given concept.

Parameters

tree : str
A tree representation in Newick format.

fileformat : str (default=pdf)
A valid fileformat according to Matplotlib.

degree : int (default=90)
The degree by which the tree is drawn. 360 yields a circular tree, 180 yields a tree filling half of the space of a circle.

lingpy.convert.plot.plot_gls(gls, treestring, degree=90, fileformat='pdf', **keywords)
Plot a gain-loss scenario for a given reference tree.

lingpy.convert.plot.plot_heatmap(wordlist, filename='heatmap', fileformat='pdf', ref='cogid', normalized=False, refB='', **keywords)
Create a heatmap representation of shared cognates for a given wordlist.

Parameters

wordlist : lingpy.basic.wordlist.Wordlist
A Wordlist object containing cognate IDs.

filename : str (default=heatmap)
Name of the file to which the heatmap will be written.

fileformat : str (default=pdf)
A regular matplotlib fileformat (pdf, png, pgf, svg).

ref : str (default=cogid)
The name of the column that contains the cognate identifiers.

normalized : {bool, str} (default=False)
If set to False, don't normalize the data. Otherwise, select the normalization method, choosing between:
• jaccard for the Jaccard distance (see Batagelj1995 for details), and
• swadesh for the traditional lexicostatistical calculation of shared cognate percentages.

cmap : matplotlib.cm (default=matplotlib.cm.jet)
The color scheme to be used for the heatmap.

steps : int (default=5)
The number of steps in which names of taxa will be written to the axes.

xrotation : int (default=45)
The rotation of the taxon names on the x-axis.

colorbar : bool (default=True)
Specify whether a colorbar should be added to the plot.

figsize : tuple (default=(10,10))
Specify the size of the figure.

tree : str (default=)
A tree passed for the taxa in Newick format. If no tree is specified, the method looks for a tree object in the Wordlist.

Notes

This function plots shared cognate percentages.

lingpy.convert.plot.plot_tree(treestring, degree=90, fileformat='pdf', root='root', **keywords)
Plot a Newick tree to PDF or other graphical formats.

Parameters

treestring : str
A string in Newick format.

degree : int
Determine the degree of the tree (this determines how circular the tree will be).

fileformat : str (default=pdf)
Select the fileformat to which the tree shall be written.

filename : str
Determine the name of the file to which the data shall be written. Defaults to a timestamp.

figsize : tuple (default=(10,10))
Determine the size of the figure.

lingpy.convert.strings module

Basic functions for the conversion of Python-internal data into strings.

lingpy.convert.strings.matrix2dst(matrix, taxa=None, stamp='', filename='', taxlen=10, comment='#')
Convert a matrix to dst-format.

Parameters

taxa : {None, list}
List of taxon names corresponding to the distances. Make sure that you only use alphanumeric characters and the underscore for the taxon names, and especially avoid the usage of brackets, since this will confuse many phylogenetic programs.

stamp : str (default=)
Convenience stamp passed as a comment that can be used to indicate how the matrix was created.

filename : str
If you specify a filename, the data will be written to file.

taxlen : int (default=10)
Indicate how long the taxon names are allowed to be. The Phylip package only allows taxon names consisting of maximally 10 characters. Other packages, however, allow more. If Phylip compatibility is not important for you and you just want to allow for taxon names as long as possible, set this value to 0.

comment : str (default=#)
The comment character to be used when adding additional information in the stamp.

Returns

output : {str or file}
Depending on your settings, this function returns a string in DST (=Phylip) format, or a file containing the string.
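A small sketch of writing a distance matrix in Phylip format (the distance values are invented for illustration):

>>> from lingpy.convert.strings import matrix2dst
>>> matrix = [[0.0, 0.5], [0.5, 0.0]]
>>> dst = matrix2dst(matrix, taxa=['English', 'German'])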
lingpy.convert.strings.msa2str(msa, wordlist=False, comment='#', _arange='{stamp}{comment}\n{meta}{comment}\n{body}', merge=False)
Function converts an MSA object into a string.

lingpy.convert.strings.multistate2nex(taxa, matrix, filename='', missing='?')
Convert the data in a given wordlist to NEXUS format for multistate analyses in PAUP.

Parameters

taxa : list
The list of taxa that shall be written to file.

matrix : list
The multi-state matrix with the first dimension indicating the taxa, and the second their states.

filename : str (default=)
If not specified, the filename of the Wordlist will be taken; otherwise, it specifies the name of the file to which the data will be written.

lingpy.convert.strings.pap2csv(taxa, paps, filename='')
Write paps created by the Wordlist class to a csv-file.

lingpy.convert.strings.pap2nex(taxa, paps, missing=0, filename='', datatype='STANDARD')
Function converts a list of paps into the nexus file format.

Parameters

taxa : list
List of taxa.

paps : {list, dict}
A two-dimensional list with the first dimension being identical to the number of taxa and the second dimension being identical to the number of paps. If a dictionary is passed, each key represents a given pap. The following two structures will thus be treated identically:

>>> paps = [[1,0],[1,0],[1,0]] # two languages, three paps
>>> paps = {1:[1,0], 2:[1,0], 3:[1,0]} # two languages, three paps

missing : {str, int} (default=0)
Indicate how missing characters are represented in the original data.

lingpy.convert.strings.scorer2str(scorer)
Convert a scoring function to a string.

lingpy.convert.strings.write_nexus(wordlist, mode='mrbayes', filename='mrbayes.nex', ref='cogid', missing='?', gap='-', custom=None, custom_name='lingpy', commands=None, commands_name='mrbayes')
Write a nexus file for phylogenetic analyses.

Parameters

wordlist : lingpy.basic.wordlist.Wordlist
A Wordlist object containing cognate IDs.

mode : str (default=mrbayes)
The name of the output nexus style. Valid values are:
• MRBAYES: a MrBayes-formatted nexus file,
• BEAST: a BEAST-formatted nexus file, and
• BEASTWORDS: a BEAST-formatted nexus file for word-partitioned analyses.

filename : str (default=None)
Name of the file to which the nexus file will be written. If set to None, this function will not write the nexus content to a file, but simply return the content as a string.

ref : str (default=cogid)
Column in which you store the cognate sets in your data.

gap : str (default=-)
The symbol for gaps (not relevant for linguistic analyses).

missing : str (default=?)
The symbol for missing characters.

custom : list (default=None)
This information allows to add custom information to the nexus file, like, for example, the structure of the characters, their original concept, or their type, and it will be written into a custom block in the nexus file. The name of the custom block can be specified with help of the custom_name keyword. The content is a list of strings which will be written line by line into the custom block.

custom_name : str (default=lingpy)
The name of the custom block which will be written to the file.

commands : list (default=None)
If specified, will write an additional block containing commands for phylogenetic software. The commands are passed as a list containing strings. The name of the block is given by the keyword commands_name.

commands_name : str (default=mrbayes)
Determines how the block will be called to which the commands will be written.

Returns

nexus : str
A string containing the nexus file output.
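A short sketch, reusing the wordlist wl loaded in the examples above (the output filename is illustrative):

>>> from lingpy.convert.strings import write_nexus
>>> nex = write_nexus(wl, mode='BEAST', filename='beast.nex', ref='cogid')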
lingpy.convert.tree module

Functions for tree calculations and working with trees.

lingpy.convert.tree.nwk2tree_matrix(newick)
Convert a newick file to a tree matrix.

Notes

This is an additional function that can be used for plots with help of matplotlib's functions. The tree_matrix is compatible with the matrices that scipy's linkage functions create.

Module contents

Package provides different methods for file conversion.

lingpy.data package

Subpackages

lingpy.data.ipa package

Submodules

lingpy.data.ipa.sampa module

The regular expression used in the sampa2unicode converter is taken from an algorithm for the conversion of XSAMPA to IPA (Unicode) by Peter Kleiweg <http://www.let.rug.nl/~kleiweg/L04/devel/python/xsampa.html>.

@author: Peter Kleiweg
@date: 2007/07/19

Module contents

Submodules

lingpy.data.derive module

Module for the derivation of sound class models.

The module provides functions for the customized compilation of sound-class models. All models are defined in simple text files. In order to guarantee their quick access when loading the library, the models are compiled and stored in binary files.

lingpy.data.derive.compile_dvt(path='')
Function compiles diacritics, vowels, and tones.

See also: lingpy.data.model.Model, lingpy.data.derive.compile_model

Notes

Diacritics, vowels, and tones are defined in the data/models/dv/ directory of the LingPy package and automatically loaded when loading the LingPy library. The values are defined as the constants rcParams['vowels'], rcParams['diacritics'], and rcParams['tones']. Their core purpose is to guide the tokenization of IPA strings (cf. ipa2tokens()). In order to change the variables, one simply has to change the text files diacritics, tones, and vowels in the data/models/dv directory. The structure of these files is fairly simple: each line contains a vowel or a diacritic character, whereas diacritics are preceded by a dash.

lingpy.data.derive.compile_model(model, path=None)
Function compiles customized sound-class models.

Parameters

model : str
A string indicating the name of the model which shall be created.

path : str
A string indicating the path where the model folder is stored.

See also: lingpy.data.model.Model, compile_dvt

Notes

A model is defined by a folder placed in the data/models directory of the LingPy package. The name of the folder reflects the name of the model. It contains three files: the file converter, the file INFO, and the optional file scorer. The format requirements for these files are as follows:

INFO

The INFO file serves as a reference for a given sound-class model. It can contain arbitrary information (and can also be empty). If one wants to define specific characteristics, like the source, the compiler, the date, or a description of a given model, this can be done by employing a key-value structure in which the key is preceded by an @ and followed by a colon, and the value is written right next to the key in the same line, e.g.:

@source: Dolgopolsky (1986)

This information will then be read from the INFO file and rendered when printing the model to screen with help of the print() function.
converter

The converter file contains all sound classes which are matched with their respective sound values. Each line is reserved for one class, preceded by the key (preferably an ASCII letter) representing the class, e.g.:

B : β, f, pf, ...
E : æ, e, è, é, ê, ...
D : θ, ð, þ, ...
G : x, χ, ...
...

matrix

A scoring matrix indicating the alignment scores of all sound-class characters defined by the model. The scoring is structured as a simple tab-delimited text file. The first cell contains the character names, the following cells contain the scores in redundant form (with both triangles being filled):

B	10.0	-10.0	5.0	...
E	-10.0	5.0	-10.0	...
F	5.0	-10.0	10.0	...
...

scorer

The scorer file (which is optional) contains the graph of class transitions which is used for the calculation of the scoring dictionary. Each class is listed in a separate line, followed by the symbols v, c, or t (indicating whether the class represents vowels, consonants, or tones), and by the classes it is directly connected to. The strength of this connection is indicated by digits (the smaller the value, the shorter the path between the classes):

A : v, E:1, O:1
C : c, S:2
B : c, W:2
E : v, A:1, I:1
D : c, S:2
...

The information in such a file is automatically converted into a scoring dictionary (see List2012b for details).

Based on the information provided by the files, a dictionary for the conversion of IPA characters to sound classes and a scoring dictionary are created and stored as a binary. The model can be loaded with help of the Model class and used in the various classes and functions provided by the library.

lingpy.data.model module

Module for handling sequence models.

class lingpy.data.model.Model(model, path=None)
Bases: object
Class for the handling of sound-class models.

Parameters

model : {sca, dolgo, asjp, art, _color}
A string indicating the name of the model which shall be loaded. Select between:
• sca - the SCA sound-class model (see List2012a),
• dolgo - the DOLGO sound-class model (see Dolgopolsky1986),
• asjp - the ASJP sound-class model (see Brown2008 and Brown2011),
• art - the sound-class model which is used for the calculation of sonority profiles and prosodic strings (see List2012), and
• _color - the sound-class model which is used for the coloring of sound tokens when creating html-output.

See also: lingpy.data.derive.compile_model, lingpy.data.derive.compile_dvt

Notes

Models are loaded from binary files which can be found in the data/models/ folder of the LingPy package. A model has two essential attributes:
• converter – a dictionary with IPA tokens as keys and sound-class characters as values, and
• scorer – a scoring dictionary with tuples of sound-class characters as keys and scores (integers or floats) as values.
lingpy.data.model module
Module for handling sequence models.
class lingpy.data.model.Model(model, path=None)
Bases: object
Class for the handling of sound-class models.
Parameters
model : { 'sca', 'dolgo', 'asjp', 'art', '_color' }
A string indicating the name of the model which shall be loaded. Select between:
• sca - the SCA sound-class model (see List2012a),
• dolgo - the DOLGO sound-class model (see Dolgopolsky1986),
• asjp - the ASJP sound-class model (see Brown2008 and Brown2011),
• art - the sound-class model which is used for the calculation of sonority profiles and prosodic strings (see List2012), and
• _color - the sound-class model which is used for the coloring of sound tokens when creating HTML output.
See also
lingpy.data.derive.compile_model, lingpy.data.derive.compile_dvt
Notes
Models are loaded from binary files which can be found in the data/models/ folder of the LingPy package. A model has two essential attributes:
• converter – a dictionary with IPA tokens as keys and sound-class characters as values, and
• scorer – a scoring dictionary with tuples of sound-class characters as keys and scores (integers or floats) as values.
Examples
When loading LingPy, the models sca, asjp, dolgo, and art are automatically loaded, and they are accessible via the rc() function for global settings:
>>> from lingpy import *
>>> rc('asjp')
<sca-model "asjp">
Define variables for the standard models for convenience:
>>> asjp = rc('asjp')
>>> sca = rc('sca')
>>> dolgo = rc('dolgo')
>>> art = rc('art')
Check how the letter a is converted in the various models:
>>> for m in [asjp, sca, dolgo, art]:
...     print('{0} > {1} ({2})'.format('a', m.converter['a'], m.name))
a > a (asjp)
a > A (sca)
a > V (dolgo)
a > 7 (art)
Retrieve basic information of a given model:
>>> print(sca)
Model:    sca
Info:     Extended sound class model based on Dolgopolsky (1986)
Source:   List (2012)
Compiler: Johann-Mattis List
Date:     2012-03
Attributes
converter : dict
A dictionary with IPA tokens as keys and sound-class characters as values.
scorer : dict
A scoring dictionary with tuples of sound-class characters as keys and similarity scores as values.
info : dict
A dictionary storing the key-value pairs defined in the INFO file.
name : str
The name of the model, which is identical with the name of the folder from which the model is loaded.
lingpy.data.model.load_dvt(path='')
Function loads the default characters for IPA diacritics and IPA vowels of LingPy.
Module contents
LingPy comes along with many different kinds of predefined data. When loading the library, the following dictionary is automatically loaded and employed by all LingPy modules:
rcParams : dict
As an alternative to all global variables, this dictionary contains all these variables, and additional ones. This dictionary is used for internal coding purposes and stores parameters that are globally set (if not defined otherwise by the user), such as
• specific debugging messages (warnings, messages, errors), and
• default values, such as gop (gap opening penalty), scale (the scaling factor by which extended gaps are penalized), or figsize (the default size of figures if data is plotted using matplotlib).
These default values can be changed with help of the rc function, which takes any keyword and any variable as input and adds or modifies the specific key of the rcParams dictionary, but also provides more complex functions that change whole sets of variables, such as the following statement:
>>> rc(schema="asjp")
which switches the variables asjp, dolgo, etc. to the ASCII-based transcription system of the ASJP project.
If you want to change the content of rcParams directly, you need to import the dictionary explicitly:
>>> from lingpy.settings import rcParams
However, changing the values in the dictionary randomly can produce unexpected behavior, and we recommend to use the regular rc function for this purpose.
lingpy.settings.rc(rval=None, **keywords)
Function changes parameters globally set for LingPy sessions.
Parameters
rval : string (default=None)
Use this keyword to specify a return value for the rc function.
schema : {"ipa", "asjp"}
Change the basic schema for sequence comparison. When switching to "asjp", sequences will be treated as sequences in ASJP code; otherwise, they will be treated as sequences written in basic IPA.
Notes
This function is the standard way to communicate with the rcParams dictionary, which is not imported as a default. If you want to see which parameters there are, you can load the rcParams dictionary directly:
>>> from lingpy.settings import rcParams
However, be careful when changing the values. They might produce some unexpected behavior.
Examples
Import LingPy:
>>> from lingpy import *
Switch from IPA transcriptions to ASJP transcriptions:
>>> rc(schema="asjp")
You can check which basic orthography is currently loaded:
>>> rc('basic_orthography')
'asjp'
>>> rc(schema='ipa')
>>> rc('basic_orthography')
'fuzzy'

lingpy.evaluate package
Submodules
lingpy.evaluate.acd module
Evaluation methods for automatic cognate detection.
lingpy.evaluate.acd.bcubes(wordlist, gold='cogid', test='lexstatid', modify_ref=False, pprint=True, per_concept=False)
Compute B-Cubed scores for test and reference datasets.
Parameters
wordlist : lingpy.basic.wordlist.Wordlist
A lingpy.basic.wordlist.Wordlist class or a daughter class (like the LexStat class used for the computation). It should have two columns indicating cognate IDs.
gold : str (default="cogid")
The name of the column containing the gold standard cognate assignments.
test : str (default="lexstatid")
The name of the column containing the automatically computed cognate assignments.
modify_ref : function (default=False)
Use a function to modify the reference. If your cognate identifiers are numerical, for example, and negative values are assigned as loans, but you want to suppress this behaviour, just set this keyword to abs, and all cognate IDs will be converted to their absolute value.
pprint : bool (default=True)
Print out the results.
per_concept : bool (default=False)
Compute B-Cubed scores per concept and not for the whole data in one piece.
Returns
t : tuple
A tuple consisting of the precision, the recall, and the harmonic mean (F-score).
See also
diff, pairs
Notes
B-Cubed scores were first described by Bagga1998 as part of an algorithm. Later on, Amigo2009 showed that they can also be used to compare cluster decisions. Hauer2011 first applied the B-Cubed scores to the task of automatic cognate detection.
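A sketch of the typical evaluation call, assuming a hypothetical wordlist file harvested.tsv that contains both a gold-standard cogid column and an automatically computed lexstatid column:
>>> from lingpy import *
>>> from lingpy.evaluate.acd import bcubes
>>> wl = Wordlist('harvested.tsv')
>>> precision, recall, fscore = bcubes(wl, gold='cogid', test='lexstatid', pprint=False)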
lingpy.evaluate.acd.diff(wordlist, gold='cogid', test='lexstatid', modify_ref=False, pprint=True, filename='', tofile=True, transcription='ipa', concepts=False)
Write differences in classifications on an item basis to file.
Parameters
wordlist : lingpy.compare.lexstat.LexStat
The LexStat class used for the computation. It should have two columns indicating cognate IDs.
gold : str (default="cogid")
The name of the column containing the gold standard cognate assignments.
test : str (default="lexstatid")
The name of the column containing the automatically computed cognate assignments.
modify_ref : function (default=False)
Use a function to modify the reference. If your cognate identifiers are numerical, for example, and negative values are assigned as loans, but you want to suppress this behaviour, just set this keyword to abs, and all cognate IDs will be converted to their absolute value.
pprint : bool (default=True)
Print out the results.
filename : str (default='')
Name of the output file. If not specified, it is identical with the name of the LexStat object, but with the extension diff.
tofile : bool (default=True)
If set to False, no data will be written to file; instead, the data will be returned.
transcription : str (default="ipa")
The column in which the transcriptions are located (should be a string, not a segmentized version, for convenience of writing to file).
Returns
t : tuple
A nested tuple consisting of two further tuples. The first contains precision, recall, and harmonic mean (F-score); the second contains the same values for the pair scores.
See also
bcubes, pairs
Notes
If the tofile option is chosen, the results are written to a specific file with the extension diff. This file contains all cognate sets in which there are differences between gold standard and test sets. It also gives detailed information regarding false positives, false negatives, and the words involved in these wrong decisions.
lingpy.evaluate.acd.extreme_cognates(wordlist, ref='extremeid', bias='lumper')
Return extreme cognates: either lump all words together or split them.
Parameters
wordlist : ~lingpy.basic.wordlist.Wordlist
A ~lingpy.basic.wordlist.Wordlist object.
ref : str (default="extremeid")
The name of the column in your wordlist to which the new IDs should be written.
bias : str (default="lumper")
If set to "lumper", all words with a certain meaning will be given the same cognate set ID; if set to "splitter", each will be given a separate ID.
lingpy.evaluate.acd.npoint_ap(scores, cognates, reverse=False)
Calculate the n-point average precision.
Parameters
scores : list
The scores of your algorithm for pairwise string comparison.
cognates : list
The cognate codings of the word pairs you compared. 1 indicates that the pair is cognate, 0 indicates that it is not cognate.
reverse : bool (default=False)
The order of your ranking mechanism. If your algorithm yields high scores for words which are probably cognate, and low scores for non-cognate words, you should set this keyword to True.
Notes
This follows the description in Kondrak2002. The n-point average precision is useful to compare the discriminative force of different algorithms for string similarity, or to train the parameters of a given algorithm.
Examples
>>> scores = [1, 2, 3, 4, 5]
>>> cognates = [1, 1, 1, 0, 0]
>>> from lingpy.evaluate.acd import npoint_ap
>>> npoint_ap(scores, cognates)
1.0
lingpy.evaluate.acd.pairs(lex, gold='cogid', test='lexstatid', modify_ref=False, pprint=True, _return_string=False)
Compute pair scores for the evaluation of cognate detection algorithms.
Parameters
lex : lingpy.compare.lexstat.LexStat
The LexStat class used for the computation. It should have two columns indicating cognate IDs.
gold : str (default="cogid")
The name of the column containing the gold standard cognate assignments.
test : str (default="lexstatid")
The name of the column containing the automatically computed cognate assignments.
modify_ref : function (default=False)
Use a function to modify the reference. If your cognate identifiers are numerical, for example, and negative values are assigned as loans, but you want to suppress this behaviour, just set this keyword to abs, and all cognate IDs will be converted to their absolute value.
pprint : bool (default=True)
Print out the results.
Returns
t : tuple
A tuple consisting of the precision, the recall, and the harmonic mean (F-score).
See also
diff, bcubes
Notes
Pair scores can be computed in different ways, with often different results. This variant follows the description by Bouchard-Cote2013.
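The extreme_cognates function is mainly useful for computing baselines. A sketch (the file name is hypothetical) that scores a "lumper" baseline against the gold standard:
>>> from lingpy import *
>>> from lingpy.evaluate.acd import extreme_cognates, bcubes
>>> wl = Wordlist('harvested.tsv')
>>> extreme_cognates(wl, ref='extremeid', bias='lumper')
>>> p, r, f = bcubes(wl, gold='cogid', test='extremeid', pprint=False)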
lingpy.evaluate.acd.partial_bcubes(wordlist, gold, test, pprint=True)
Compute B-Cubed scores for test and reference datasets for partial cognate detection.
Parameters
wordlist : Wordlist
A Wordlist, or one of its daughter classes (like, e.g., the Partial class used for the computation of partial cognates). It should have two columns indicating cognate IDs.
gold : str (default="cogid")
The name of the column containing the gold standard cognate assignments.
test : str (default="lexstatid")
The name of the column containing the automatically computed cognate assignments.
pprint : bool (default=True)
Print out the results.
Returns
t : tuple
A tuple consisting of the precision, the recall, and the harmonic mean (F-score).
See also
bcubes, diff, pairs
Notes
B-Cubed scores were first described by Bagga1998 as part of an algorithm. Later on, Amigo2009 showed that they can also be used to compare cluster decisions. Hauer2011 first applied the B-Cubed scores to the task of automatic cognate detection.
lingpy.evaluate.acd.random_cognates(wordlist, ref='randomid', bias=False)
Populate a wordlist with random cognates for each entry.
Parameters
ref : str (default="randomid")
Cognate set identifier for the newly created random cognate sets.
bias : str (default=False)
When set to "lumper", this will tend to create fewer cognate sets and larger clusters; when set to "splitter", it will tend to create smaller clusters.

lingpy.evaluate.alr module
Module provides methods for the evaluation of automatic linguistic reconstruction analyses.
lingpy.evaluate.alr.mean_edit_distance(wordlist, gold='proto', test='consensus', ref='cogid', tokens=True, classes=False, **keywords)
Function computes the edit distance between gold standard and test set.
Parameters
wordlist : ~lingpy.basic.wordlist.Wordlist
The wordlist object containing the data for a given analysis.
gold : str (default="proto")
The name of the column containing the gold-standard solutions.
test : str (default="consensus")
The name of the column containing the test solutions.
stress : str (default=rcParams['stress'])
A string containing the stress symbols used in the sound-class conversion. Defaults to the stress as defined in ~lingpy.settings.rcParams.
diacritics : str (default=rcParams['diacritics'])
A string containing diacritic symbols used in the sound-class conversion. Defaults to the diacritic symbols defined in ~lingpy.settings.rcParams.
cldf : bool (default=False)
If set to True, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h2 in Indo-European). Following the CLDF specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the CLTS initiative), in cases of insecurity of pronunciation, users can adopt a `source/target` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the EDICTOR tool.
Returns
dist : float
The mean edit distance between gold and test reconstructions.
Notes
This function has an alias (med). Calling it will produce the same results.
lingpy.evaluate.alr.med(wordlist, **keywords)
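A sketch of a reconstruction evaluation, assuming a hypothetical wordlist file reconstructions.tsv with proto, consensus, and cogid columns:
>>> from lingpy import *
>>> from lingpy.evaluate.alr import mean_edit_distance
>>> wl = Wordlist('reconstructions.tsv')
>>> dist = mean_edit_distance(wl, gold='proto', test='consensus', ref='cogid')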
lingpy.evaluate.apa module
Basic module for the comparison of automatic phonetic alignments.
class lingpy.evaluate.apa.Eval(gold, test)
Bases: object
Base class for evaluation objects.
class lingpy.evaluate.apa.EvalMSA(gold, test)
Bases: lingpy.evaluate.apa.Eval
Base class for the evaluation of automatic multiple sequence analyses.
Parameters
gold, test : MSA
The Multiple objects which shall be compared. The first object should be the gold standard and the second object should be the test set.
Notes
Most of the scores which can be calculated with help of this class are standard evaluation scores in evolutionary biology. For a close description of how these scores are calculated, see, for example, Thompson1999, List2012, and Rosenberg2009b.
c_score(mode=1)
Calculate the column (C) score.
Parameters
mode : { 1, 2, 3, 4 }
Indicate which mode to compute. Select between:
1. divide the number of common columns in reference and test alignment by the total number of columns in the test alignment (the traditional C score described in Thompson1999, also known as precision score in applications of information retrieval),
2. divide the number of common columns in reference and test alignment by the total number of columns in the reference alignment (also known as recall score in applications of information retrieval),
3. divide the number of common columns in reference and test alignment by the average number of columns in reference and test alignment, or
4. combine the scores of mode 1 and mode 2 by computing their F-score, using the formula 2 * (p * r) / (p + r), where p is the precision (mode 1) and r is the recall (mode 2).
Returns
score : float
The C score for reference and test alignments.
Notes
The different modes correspond to precision (mode 1), recall (mode 2), an averaged score (mode 3), and the F-score (mode 4).
cc_scores
Calculate the c-scores.
check_swaps()
Check for possibly identical swapped sites.
Returns
swap : { -2, -1, 0, 1, 2 }
Information regarding the identity of swap decisions is coded by integers:
1 – indicates that swaps are detected in both gold standard and test set, whereas a negative value indicates that the positions are not identical,
2 – indicates that swap decisions are not identical in gold standard and test set, whereas a negative value indicates that there is a false positive in the test set, and
0 – indicates that there are no swaps in the gold standard and the test set.
jc_score()
Calculate the Jaccard (JC) score.
Returns
score : float
The JC score.
See also
lingpy.test.evaluate.EvalPSA.jc_score
Notes
The Jaccard score (see List2012) is calculated by dividing the size of the intersection of residue pairs in reference and test alignment by the size of the union of residue pairs in reference and test alignment.
r_score()
Compute the row (R) score.
Returns
score : float
The R score.
Notes
The R score is the number of identical rows (sequences) in reference and test alignment divided by the total number of rows.
sp_score(mode=1)
Calculate the sum-of-pairs (SP) score.
Parameters
mode : { 1, 2, 3 }
Indicate which mode to compute. Select between:
1. divide the number of common residue pairs in reference and test alignment by the total number of residue pairs in the test alignment (the traditional SP score described in Thompson1999, also known as precision score in applications of information retrieval),
2. divide the number of common residue pairs in reference and test alignment by the total number of residue pairs in the reference alignment (also known as recall score in applications of information retrieval), or
3. divide the number of common residue pairs in reference and test alignment by the average number of residue pairs in reference and test alignment.
Returns
score : float
The SP score for gold standard and test alignments.
Notes
The SP score (see Thompson1999) is calculated by dividing the number of identical residue pairs in reference and test alignment by the total number of residue pairs in the reference alignment.
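A sketch of a typical comparison, assuming two MSA files (hypothetical names) that contain a gold-standard and a test alignment of the same sequences:
>>> from lingpy import *
>>> from lingpy.evaluate.apa import EvalMSA
>>> evl = EvalMSA(MSA('gold.msa'), MSA('test.msa'))
>>> evl.c_score(mode=4)   # F-score variant of the column score
>>> evl.sp_score(mode=1)  # precision variant of the sum-of-pairs score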
class lingpy.evaluate.apa.EvalPSA(gold, test)
Bases: lingpy.evaluate.apa.Eval
Base class for the evaluation of automatic pairwise sequence analyses.
Parameters
gold, test : lingpy.align.sca.PSA
The Pairwise objects which shall be compared. The first object should be the gold standard and the second object should be the test set.
Notes
Most of the scores which can be calculated with help of this class are standard evaluation scores in evolutionary biology. For a close description of how these scores are calculated, see, for example, Thompson1999, List2012, and Rosenberg2009b.
c_score()
Calculate the column (C) score.
Returns
score : float
The C score for reference and test alignments.
Notes
The C score, as it is described in Thompson1999, is calculated by dividing the number of columns which are identical in the gold standard and the test alignment by the total number of columns in the test alignment.
diff(**keywords)
Write all differences between two sets to a file.
Parameters
filename : str (default="eval_psa_diff")
The name of the output file.
jc_score()
Calculate the Jaccard (JC) score.
Returns
score : float
The JC score.
Notes
The Jaccard score (see List2012) is calculated by dividing the size of the intersection of residue pairs in reference and test alignment by the size of the union of residue pairs in reference and test alignment.
pairwise_column_scores
Compute the different column scores for pairwise alignments. The method returns the precision, the recall score, and the F-score, following the proposal of Bergsma and Kondrak (2007), and the column score proposed by Thompson et al. (1999).
r_score(mode=1)
Compute the percentage of identical rows (PIR) score.
Parameters
mode : { 1, 2 }
Select between mode 1, where all sequences are compared with each other, and mode 2, where only whole alignments are compared.
Returns
score : float
The PIR score.
Notes
The PIR score is the number of identical rows (sequences) in reference and test alignment divided by the total number of rows.
sp_score()
Calculate the sum-of-pairs (SP) score.
Returns
score : float
The SP score for reference and test alignments.
Notes
The SP score (see Thompson1999) is calculated by dividing the number of identical residue pairs in reference and test alignment by the total number of residue pairs in the reference alignment.
Module contents
Basic module for the evaluation of algorithms.

lingpy.meaning package
Submodules
lingpy.meaning.colexification module
Module offers methods to handle colexification patterns in wordlist objects.
lingpy.meaning.colexification.colexification_network(wordlist, entry='ipa', concept='concept', output='', filename='network', bipartite=False, **keywords)
Calculate a colexification network from a given wordlist object.
Parameters
wordlist : ~lingpy.basic.wordlist.Wordlist
The wordlist object containing the data.
entry : str (default="ipa")
The reference point for the language entry. We use "ipa" as a default.
concept : str (default="concept")
The reference point for the name of the row containing the concepts. We use "concept" as a default.
output : str (default='')
If output is set to "gml", the resulting network will be written to a text file in GML format.
Returns
G : networkx.Graph
A networkx.Graph object.
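A sketch of the basic usage, assuming a hypothetical wordlist file polynesian.tsv with ipa and concept columns; in a colexification network, concepts become nodes, and edges connect concepts that are expressed by the same form:
>>> from lingpy import *
>>> from lingpy.meaning.colexification import colexification_network
>>> wl = Wordlist('polynesian.tsv')
>>> G = colexification_network(wl, entry='ipa', concept='concept')
>>> len(G.nodes())  # inspect the resulting networkx graph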
lingpy.meaning.colexification.compare_colexifications(wordlist, entry='ipa', concept='concept')
Compare colexification patterns for a given wordlist.
lingpy.meaning.colexification.evaluate_colexifications(G, weight='wordWeight', outfile=None)
Function calculates the most frequent colexifications in a wordlist.
Module contents

lingpy.read package
Submodules
lingpy.read.csv module
Module provides functions for reading csv-files.
lingpy.read.csv.csv2dict(filename, fileformat=None, dtype=None, comment='#', sep='\t', strip_lines=True, header=False)
Very simple function to get quick access to CSV-files.
Parameters
filename : str
Name of the input file.
fileformat : {None, str}
If not specified, the file <filename> will be loaded. Otherwise, the fileformat is interpreted as the specific extension of the input file.
dtype : {None, list}
If not specified, all data will be loaded as strings. Otherwise, a list specifying the data types for each line should be provided.
comment : string (default="#")
A comment character at the beginning of a line forces this line to be ignored.
sep : string (default="\t")
Specify the separator for the CSV-file.
strip_lines : bool (default=True)
Specify whether empty cells in the input file should be preserved. If set to False, each line will be stripped first, and all whitespace will be cleaned. Otherwise, each line will be separated using the specified separator, and no stripping of whitespace will be carried out.
header : bool (default=False)
Indicate whether the data comes along with a header.
Returns
d : dict
A dictionary-representation of the CSV file, with the first row being used as key and the rest of the rows as values.
lingpy.read.csv.csv2list(filename, fileformat='', dtype=None, comment='#', sep='\t', strip_lines=True, header=False)
Very simple function to get quick (and somewhat naive) access to CSV-files.
Parameters
filename : str
Name of the input file.
fileformat : {None, str}
If not specified, the file <filename> will be loaded. Otherwise, the fileformat is interpreted as the specific extension of the input file.
dtype : {None, list}
If not specified, all data will be loaded as strings. Otherwise, a list specifying the data types for each line should be provided.
comment : string (default="#")
A comment character at the beginning of a line forces this line to be ignored.
sep : string (default="\t")
Specify the separator for the CSV-file.
strip_lines : bool (default=True)
Specify whether empty cells in the input file should be preserved. If set to False, each line will be stripped first, and all whitespace will be cleaned. Otherwise, each line will be separated using the specified separator, and no stripping of whitespace will be carried out.
header : bool (default=False)
Indicate whether the data comes along with a header.
Returns
l : list
A list-representation of the CSV file.
lingpy.read.csv.csv2multidict(filename, comment='#', sep='\t')
Function reads a csv-file into a multi-dimensional dictionary structure.
lingpy.read.csv.read_asjp(infile, family='Indo-European', classification='hh', max_synonyms=2, min_population=<function <lambda>>, merge_vowels=True, evaluate=False)
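A sketch of the two most common calls, assuming a hypothetical tab-separated file data.tsv whose first row is a header:
>>> from lingpy.read.csv import csv2list, csv2dict
>>> rows = csv2list('data.tsv', header=True)   # list of rows
>>> table = csv2dict('data.tsv', header=True)  # keyed representation (see above)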
lingpy.read.phylip module
Module provides functions to read in various formats from the Phylip package.
lingpy.read.phylip.read_dst(filename, taxlen=10, comment='#')
Function reads files in Phylip dst-format.
Parameters
filename : string
Name of the file, which should have the extension dst.
taxlen : int (default=10)
Indicate how long the taxon names are allowed to be in the file from which you want to read. The Phylip package only allows taxon names consisting of maximally 10 characters (this is the default). Other packages, however, allow more. If Phylip compatibility is not important for you and you just want to allow for taxon names as long as possible, set this value to 0 and make sure to use tabstops as separators between values in your matrix file.
comment : str (default="#")
The comment character to be used if your file contains additional information which should be ignored.
Returns
data : tuple
A tuple consisting of a list of taxa and a matrix.
lingpy.read.phylip.read_scorer(infile)
Read a scoring function in a file into a ScoreDict object.
Parameters
infile : str
The path to the input file that shall be read as a scoring dictionary. The matrix format is a simple csv-file in which the scoring matrix is displayed, with negative values indicating high differences between sound segments (or sound classes) and positive values indicating high similarity. The matrix should be symmetric, columns should be separated by tabstops, and the first column should provide the alphabet for which the scoring function is defined.
Returns
scoredict : ~lingpy.algorithm.misc.ScoreDict
A ScoreDict instance which can be directly passed to LingPy's alignment functions.
lingpy.read.qlc module
lingpy.read.qlc.normalize_alignment(alignment)
Function normalizes an alignment. Normalization here means that columns consisting only of gaps will be deleted, and all sequences will be stretched to equal length by adding additional gap characters at the end of shorter sequences.
lingpy.read.qlc.read_msa(infile, comment='#', ids=False, header=True, normalize=True, **keywords)
Simple function to load an MSA object.
Parameters
infile : str
The name of the input file.
comment : str (default="#")
The comment character. If a line starts with this character, it will be ignored.
ids : bool (default=False)
Indicate whether the MSA file contains unique IDs for all sequences or not.
Returns
d : dict
A dictionary in which keys correspond to specific parts of a multiple alignment. This dictionary can be directly passed to alignment functions, such as lingpy.sca.MSA.
lingpy.read.qlc.read_qlc(infile, comment='#')
Simple function that loads qlc-format into a dictionary.
Parameters
infile : str
The name of the input file.
comment : str (default="#")
The comment character. If a line starts with this character, it will be ignored.
Returns
d : dict
A dictionary with integer keys corresponding to the order of the lines of the input file. The header is given 0 as a specific key.
lingpy.read.qlc.reduce_alignment(alignment)
Function reduces a given alignment.
Notes
Reduction here means that the output alignment consists only of those parts which have not been marked to be ignored by the user (parts in brackets). It requires that all data is properly coded. If reduction fails, this will throw a warning, and all brackets are simply removed in the output alignment.
lingpy.read.starling module
Basic parser for Starling data.
lingpy.read.starling.star2qlc(filename, clean_taxnames=False, debug=False)
Converts a file directly output from Starling to LingPy-QLC format.
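A sketch of reading a distance matrix (the file name is hypothetical); the returned order, a list of taxa plus the matrix, follows the Returns description above:
>>> from lingpy.read.phylip import read_dst
>>> taxa, matrix = read_dst('distances.dst', taxlen=0)  # tab-separated values
>>> len(taxa)  # should match the number of rows in the matrix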
Module contents
lingpy.sequence package
Submodules
lingpy.sequence.generate module
Module provides simple basic classes for sequence generation using Markov models.
class lingpy.sequence.generate.MCBasic(seqs)
Bases: object
Basic class for creating Markov chains from sequence training data.
Parameters
seqs : list
A list of sequences. Sequences are assumed to be tokenized, i.e. they should be passed either as lists or as tuples.
walk()
Create a random sequence from the distribution.
class lingpy.sequence.generate.MCPhon(words, tokens=False, prostrings=[], classes=False, class_model=<sca-model "sca">, **keywords)
Bases: lingpy.sequence.generate.MCBasic
Class for the creation of phonetic sequences (pseudo words).
Parameters
words : list
List of phonetic sequences. This list can contain tokenized sequences (lists or tuples), or simple untokenized IPA strings.
tokens : bool (default=False)
If set to True, no tokenization of input sequences is carried out.
prostrings : list (default=[])
List containing the prosodic profiles of the input sequences. If the list is empty, the profiles are generated automatically.
evaluate_string(string, tokens=False, **keywords)
get_string(new=True, tokens=False)
Generate a string from the Markov chain created from the training data.
Parameters
new : bool (default=True)
Determine whether the string created should be different from the training data or not.
tokens : bool (default=False)
If set to True, the full list of tokens that was internally used to represent the sequences as a Markov chain is returned.
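A minimal sketch of generating pseudo words; the training words here are a hypothetical toy set, and a realistic application would train on a few hundred IPA strings:
>>> from lingpy.sequence.generate import MCPhon
>>> words = ['hando', 'fusi', 'kopfa', 'armo']  # toy IPA-like training data
>>> mc = MCPhon(words)
>>> pseudo = mc.get_string()  # a new word drawn from the Markov chain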
lingpy.sequence.profile module
Module provides methods for the handling of orthography profiles.
lingpy.sequence.profile.context_profile(wordlist, ref='ipa', col='doculect', semi_diacritics='hsw', merge_vowels=False, brackets=None, splitters='/,;~', merge_geminates=True, clts=False, bad_word='<???>', bad_sound='<?>', unknown_sound='!{0}', examples=2)
Create an advanced Orthography Profile with context and doculect information.
Parameters
wordlist : ~lingpy.basic.wordlist.Wordlist
A wordlist from which you want to derive an initial orthography profile.
ref : str (default="ipa")
The name of the reference column in which the words are stored.
col : str (default="doculect")
Indicate in which column the information on the language variety is stored.
semi_diacritics : str
Indicate characters which can occur both as diacritics (second part in a sound) and alone.
merge_vowels : bool (default=False)
Indicate whether consecutive vowels should be merged.
brackets : dict
A dictionary with opening brackets as keys and closing brackets as values. Defaults to a pre-defined set of frequently occurring brackets.
splitters : str
The characters which force the automatic splitting of an entry.
clts : dict (default=None)
A dictionary-like object that converts a given source sound into a potential target sound, using the get()-method of the dictionary. Normally, we think of a CLTS instance here (that is, a cross-linguistic transcription system as defined in the pyclts package).
bad_word : str (default="<???>")
Indicate how words that could not be parsed should be handled. Note that both bad_word and bad_sound are format strings, so you can add formatting information here.
bad_sound : str (default="<?>")
Indicate how sounds that could not be converted to a sound class should be handled. Note that both bad_word and bad_sound are format strings, so you can add formatting information here.
unknown_sound : str (default="!{0}")
If clts is passed, use this string to indicate that sounds are classified as unknown sound in the CLTS framework.
examples : int (default=2)
Indicate the number of examples that should be printed out.
Returns
profile : generator
A generator of tuples indicating the segment, its frequency, the conversion to sound classes in the Dolgopolsky sound-class model, and the unicode codepoints.
lingpy.sequence.profile.simple_profile(wordlist, ref='ipa', semi_diacritics='hsw', merge_vowels=False, brackets=None, splitters='/,;~', merge_geminates=True, bad_word='<???>', bad_sound='<?>', clts=None, unknown_sound='!{0}')
Create an initial Orthography Profile using LingPy's clean_string procedure.
Parameters
wordlist : ~lingpy.basic.wordlist.Wordlist
A wordlist from which you want to derive an initial orthography profile.
ref : str (default="ipa")
The name of the reference column in which the words are stored.
semi_diacritics : str
Indicate characters which can occur both as diacritics (second part in a sound) and alone.
merge_vowels : bool (default=False)
Indicate whether consecutive vowels should be merged.
brackets : dict
A dictionary with opening brackets as keys and closing brackets as values. Defaults to a pre-defined set of frequently occurring brackets.
splitters : str
The characters which force the automatic splitting of an entry.
clts : dict (default=None)
A dictionary-like object that converts a given source sound into a potential target sound, using the get()-method of the dictionary. Normally, we think of a CLTS instance here (that is, a cross-linguistic transcription system as defined in the pyclts package).
bad_word : str (default="<???>")
Indicate how words that could not be parsed should be handled. Note that both bad_word and bad_sound are format strings, so you can add formatting information here.
bad_sound : str (default="<?>")
Indicate how sounds that could not be converted to a sound class should be handled. Note that both bad_word and bad_sound are format strings, so you can add formatting information here.
unknown_sound : str (default="!{0}")
If clts is passed, use this string to indicate that sounds are classified as unknown sound in the CLTS framework.
Returns
profile : generator
A generator of tuples indicating the segment, its frequency, the conversion to sound classes in the Dolgopolsky sound-class model, and the unicode codepoints.
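A sketch of deriving and inspecting a profile, assuming the hypothetical wordlist file harvested.tsv with words in its ipa column:
>>> from lingpy import *
>>> from lingpy.sequence.profile import simple_profile
>>> wl = Wordlist('harvested.tsv')
>>> for line in simple_profile(wl, ref='ipa'):
...     print(line)  # segment, frequency, sound classes, codepoints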
lingpy.sequence.sound_classes module
Module provides various methods for the handling of sound classes.
lingpy.sequence.sound_classes.asjp2tokens(seq, merge_vowels=True)
lingpy.sequence.sound_classes.bigrams(sequence)
Convert a given sequence into a sequence of bigrams.
lingpy.sequence.sound_classes.check_tokens(tokens, **keywords)
Function checks whether tokens are given in a consistent input format.
lingpy.sequence.sound_classes.class2tokens(tokens, classes, gap_char='-', local=False)
Turn aligned sound-class sequences into aligned sequences of IPA tokens.
Parameters
tokens : list
The list of tokens corresponding to the unaligned IPA string.
classes : string or list
The aligned class string.
gap_char : string (default="-")
The character which indicates gaps in the output string.
local : bool (default=False)
If set to True, a local alignment with prefix and suffix can be converted.
Returns
alignment : list
A list of tokens with gaps at the positions where they occurred in the alignment of the class string.
See also
ipa2tokens, tokens2class
Examples
>>> from lingpy import *
>>> tokens = ipa2tokens('tsɔyɡə')
>>> aligned_sequence = 'CU-KE'
>>> print(', '.join(class2tokens(tokens, aligned_sequence)))
ts, ɔy, -, ɡ, ə
lingpy.sequence.sound_classes.clean_string(sequence, semi_diacritics='hsw', merge_vowels=False, segmentized=False, rules=None, ignore_brackets=True, brackets=None, split_entries=True, splitters='/,;~', preparse=None, merge_geminates=True, normalization_form='NFC')
Function exhaustively checks how well a sequence is understood by LingPy.
Parameters
semi_diacritics : str
Indicate characters which can occur both as diacritics (second part in a sound) and alone.
merge_vowels : bool (default=False)
Indicate whether consecutive vowels should be merged.
segmentized : bool (default=False)
Indicate whether the input string is already segmentized or not. If set to True, items in brackets can no longer be ignored.
rules : dict
Replacement rules to be applied to a segmentized string.
ignore_brackets : bool
If set to True, ignore all content within a given bracket.
brackets : dict
A dictionary with opening brackets as keys and closing brackets as values. Defaults to a pre-defined set of frequently occurring brackets.
split_entries : bool (default=True)
Indicate whether multiple entries (with a comma etc.) should be split into separate entries.
splitters : str
The characters which force the automatic splitting of an entry.
preparse : list
List of tuples, giving simple replacement patterns (source and target), which are applied before every processing starts.
Returns
cleaned_strings : list
A list of cleaned strings which are segmented by space characters. If splitters are encountered, indicating that the entry contains two variants, the list will contain one entry for each variant. If there are no splitters, the list has size one.
lingpy.sequence.sound_classes.codepoint(s)
Return unicode codepoint(s) for a character set.
lingpy.sequence.sound_classes.fourgrams(sequence)
Convert a given sequence into a sequence of fourgrams.
lingpy.sequence.sound_classes.get_all_ngrams(sequence, sort=False)
Function returns all possible n-grams of a given sequence.
Parameters
sequence : list or str
The sequence that shall be converted into its ngram-representation.
Returns
out : list
A list of all ngrams of the input word, sorted in decreasing order of length.
Examples
>>> get_all_ngrams('abcde')
['abcde', 'bcde', 'abcd', 'cde', 'abc', 'bcd', 'ab', 'de', 'cd', 'bc', 'a', 'e', 'b', 'd', 'c']
lingpy.sequence.sound_classes.get_n_ngrams(sequence, ngram=4)
Convert a given sequence into a sequence of ngrams.
lingpy.sequence.sound_classes.ipa2tokens(istring, **keywords)
Tokenize IPA-encoded strings.
Parameters
istring : str
The input sequence that shall be tokenized.
diacritics : {str, None} (default=None)
A string containing all diacritics which shall be considered in the respective analysis. When set to None, the default diacritic string will be used.
vowels : {str, None} (default=None)
A string containing all vowel symbols which shall be considered in the respective analysis. When set to None, the default vowel string will be used.
tones : {str, None} (default=None)
A string indicating all tone letter symbols which shall be considered in the respective analysis. When set to None, the default tone string will be used.
combiners : str
A string with characters that are used to combine two separate characters (compare affricates such as ts).
breaks : str (default="-.")
A string containing the characters that indicate that a new token starts right after them. These can be used to indicate that two consecutive vowels should not be treated as diphthongs, or for diacritics that are put before the following letter.
merge_vowels : bool (default=False)
Indicate whether vowels should be merged into diphthongs, or whether each vowel symbol should be considered separately.
merge_geminates : bool (default=False)
Indicate whether identical symbols should be merged into one token, or rather be kept separate.
expand_nasals : bool (default=False)
semi_diacritics : str (default='')
Indicate which symbols shall be treated as semi-diacritics, that is, as symbols which can occur on their own, but which eventually, when preceded by a consonant, will form clusters with it. If you want to disable this feature, just set the keyword to an empty string.
clean_string : bool (default=False)
Conduct a rough string-cleaning strategy by which all items between brackets are removed along with the brackets.
Returns
tokens : list
A list of IPA tokens.
See also
tokens2class, class2tokens
Examples
>>> from lingpy import *
>>> myseq = 'tsɔyɡə'
>>> ipa2tokens(myseq)
['ts', 'ɔy', 'ɡ', 'ə']
lingpy.sequence.sound_classes.ono_parse(word, output='', **keywords)
Carry out a rough onset-nucleus-offset parse of a word in IPA.
Notes
The method is an approximation and not supposed to be without flaws. It is, however, rather helpful in most instances. It defines a so far simple model in which 7 different contexts for each word are distinguished:
• #: onset cluster in a word's initial syllable
• C: onset cluster in a word's non-initial syllable
• V: nucleus vowel in a word's initial syllable
• v: nucleus vowel in a word's non-initial and non-final syllable
• >: nucleus vowel in a word's final syllable
• c: offset cluster in a word's non-final syllable
• $: offset cluster in a word's final syllable
lingpy.sequence.sound_classes.pgrams(sequence, **keywords)
Convert a given sequence into bigrams consisting of prosodic string symbols and the tokens of the original sequence.
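The effect of merge_vowels can be seen directly with the example sequence used above (a sketch; the second output assumes that the diphthong is simply split into its two vowel symbols):
>>> from lingpy import *
>>> ipa2tokens('tsɔyɡə', merge_vowels=True)
['ts', 'ɔy', 'ɡ', 'ə']
>>> ipa2tokens('tsɔyɡə', merge_vowels=False)
['ts', 'ɔ', 'y', 'ɡ', 'ə']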
lingpy.sequence.sound_classes.pid(almA, almB, mode=2)
Calculate the Percentage Identity (PID) score for aligned sequence pairs.
Parameters
almA, almB : string or list
The aligned sequences, which can be either a string or a list.
mode : { 1, 2, 3, 4, 5 }
Indicate which of the four possible PID scores described in Raghava2006 should be calculated; the fifth possibility is added for linguistic purposes:
1. identical positions / (aligned positions + internal gap positions),
2. identical positions / aligned positions,
3. identical positions / shortest sequence,
4. identical positions / shortest sequence (including internal gap positions), or
5. identical positions / (aligned positions + 2 * number of gaps).
Returns
score : float
The PID score of the given alignment as a floating point number between 0 and 1.
See also
lingpy.compare.Multiple.get_pid
Notes
The PID score is a common measure for the diversity of a given alignment. The implementation employed by LingPy follows the description of Raghava2006, where four different variants of PID scores are distinguished. Essentially, the PID score is based on the comparison of identical residue pairs with the total number of residue pairs in a given alignment.
Examples
Load an alignment from the test suite:
>>> from lingpy import *
>>> pairs = PSA(get_file('test.psa'))
Extract the alignments of the first aligned sequence pair:
>>> almA, almB, score = pairs.alignments[0]
Calculate the PID score of the alignment:
>>> pid(almA, almB)
0.44444444444444442
lingpy.sequence.sound_classes.prosodic_string(string, _output=True, **keywords)
Create a prosodic string of the sonority profile of a sequence.
Parameters
seq : list
A list of integers indicating the sonority of the tokens of the underlying sequence.
stress : str (default=rcParams['stress'])
A string containing the stress symbols used in the analysis. Defaults to the stress as defined in ~lingpy.settings.rcParams.
diacritics : str (default=rcParams['diacritics'])
A string containing diacritic symbols used in the analysis. Defaults to the diacritic symbols defined in ~lingpy.settings.rcParams.
cldf : bool (default=False)
If set to True, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h2 in Indo-European). Following the CLDF specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the CLTS initiative), in cases of insecurity of pronunciation, users can adopt a `source/target` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the EDICTOR tool.
Returns
prostring : string
A prosodic string corresponding to the sonority profile of the underlying sequence.
Notes
A prosodic string is a sequence of specific characters which indicate their respective prosodic context (see List2012 or List2012a for a detailed description). In contrast to the previous model, the current implementation allows for a more fine-grained distinction between different prosodic segments. The current scheme distinguishes the following prosodic positions:
• A: sequence-initial consonant
• B: syllable-initial, non-sequence-initial consonant in a context of ascending sonority
• C: non-syllable-initial, non-initial consonant in ascending sonority context
• L: non-syllable-final consonant in descending environment
• M: syllable-final consonant in descending environment
• N: word-final consonant
• X: first vowel in a word
• Y: non-final vowel in a word
• Z: vowel occurring in the last position of a word
• T: tone
• _: word break
Examples
>>> prosodic_string(ipa2tokens('tsɔyɡə'))
'AXBZ'
lingpy.sequence.sound_classes.prosodic_weights(prostring, _transform={})
Calculate prosodic weights for each position of a sequence.
Parameters
prostring : string
A prosodic string as it is returned by prosodic_string().
_transform : dict
A dictionary that determines how prosodic strings should be transformed into prosodic weights. Use this dictionary to adjust the prosodic strings to your own user-defined prosodic weight schema.
Returns
weights : list
A list of floats reflecting the modification of the weight for each position.
See also
prosodic_string
Notes
Prosodic weights are specific scaling factors which decrease or increase the gap score of a given segment in alignment analyses (see List2012 or List2012a for a detailed description).
Examples
>>> from lingpy import *
>>> prostring = '#vC>'
>>> prosodic_weights(prostring)
[2.0, 1.3, 1.5, 0.7]
lingpy.sequence.sound_classes.sampa2uni(seq)
Convert a sequence in IPA-SAMPA format to IPA-Unicode.
Notes
This function is based on code taken from Peter Kleiweg (http://www.let.rug.nl/~kleiweg/L04/devel/python/xsampa.html).
lingpy.sequence.sound_classes.syllabify(seq, output='flat', **keywords)
Carry out a simple syllabification of a sequence, using sonority as a proxy.
Parameters
output : {"flat", "breakpoints", "nested"} (default="flat")
Define how to output the syllabification. Select between:
• flat: a syllable separator is introduced to mark the syllable boundaries,
• breakpoints: a tuple consisting of indices that slice the original sequence into syllables is returned, or
• nested: a nested list reflecting the syllable structure is returned.
sep : str
Select your preferred syllable separator.
Returns
syllable : list
Either a flat list containing a morpheme separator, or a nested list reflecting the syllable structure, or a list of tuples containing the indices indicating where the input sequence should be sliced in order to split it into syllables.
Notes
When analyzing the sequence, we start a new syllable in all cases where we reach a deepest point in the sonority hierarchy of the sonority profile of the sequence. When passing an aligned string to this function, the gaps will be ignored when computing boundaries, but later on re-introduced, if the alignment is passed in segmented form.
lingpy.sequence.sound_classes.token2class(token, model, stress=None, diacritics=None, cldf=None)
Convert a single token into a sound-class.
Parameters
token : str
A token (phonetic segment).
model : Model
A Model object.
stress : str (default=rcParams['stress'])
A string containing the stress symbols used in the analysis. Defaults to the stress as defined in ~lingpy.settings.rcParams.
diacritics : str (default=rcParams['diacritics'])
A string containing diacritic symbols used in the analysis. Defaults to the diacritic symbols defined in ~lingpy.settings.rcParams.
cldf : bool (default=False)
If set to True, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h2 in Indo-European). Following the CLDF specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the CLTS initiative), in cases of insecurity of pronunciation, users can adopt a `source/target` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the EDICTOR tool.
Returns
sound_class : str
A sound-class representation of the phonetic segment. If the segment cannot be resolved, the respective string will be rendered as "0" (zero).
See also
ipa2tokens, class2tokens, tokens2class
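A sketch of the three output modes of syllabify, reusing the tokenized example from above (the concrete separator symbol depends on your rcParams settings):
>>> from lingpy.sequence.sound_classes import ipa2tokens, syllabify
>>> tokens = ipa2tokens('tsɔyɡə')
>>> syllabify(tokens)                        # flat: separator inserted between syllables
>>> syllabify(tokens, output='breakpoints')  # indices for slicing the sequence
>>> syllabify(tokens, output='nested')       # one sub-list per syllable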
lingpy.sequence.sound_classes.tokens2class(tokens, model, stress=None, diacritics=None, cldf=False)
Convert tokenized IPA strings into their respective class strings.
Parameters
tokens : list
A list of tokens as they are returned from ipa2tokens().
model : Model
A Model object.
stress : str (default=rcParams['stress'])
A string containing the stress symbols used in the analysis. Defaults to the stress as defined in ~lingpy.settings.rcParams.
diacritics : str (default=rcParams['diacritics'])
A string containing diacritic symbols used in the analysis. Defaults to the diacritic symbols defined in ~lingpy.settings.rcParams.
cldf : bool (default=False)
If set to True, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h2 in Indo-European). Following the CLDF specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the CLTS initiative), in cases of insecurity of pronunciation, users can adopt a `source/target` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the EDICTOR tool.
Returns
classes : list
A sound-class representation of the tokenized IPA string in form of a list. If sound classes cannot be resolved, the respective string will be rendered as "0" (zero).
See also
ipa2tokens, class2tokens, token2class
Notes
The function ~lingpy.sequence.sound_classes.token2class returns a "0" (zero) if the sound is not recognized by LingPy's sound class models. While an unknown sound in a longer sequence is no problem for alignment algorithms, we get some unwanted and often even unforeseeable behavior if the sequence is completely unknown. For this reason, this function raises a ValueError if a resulting sequence only contains unknown sounds.
Examples
>>> from lingpy import *
>>> tokens = ipa2tokens('tsɔyɡə')
>>> classes = tokens2class(tokens, 'sca')
>>> print(classes)
CUKE
lingpy.sequence.sound_classes.tokens2morphemes(tokens, **keywords)
Split a string into morphemes if it contains separators.
Parameters
sep : str
Select your morpheme separator.
word_sep : str (default="_")
Select your word separator.
Returns
morphemes : list
A nested list of the original segments split into morphemes.
Notes
Function splits a list of tokens into subsequent lists of morphemes if the list contains morpheme separators. If no separators are found, but tone markers, it will still split the string according to the tones. If you want to avoid this behavior, set the keyword split_on_tones to False.
lingpy.sequence.sound_classes.trigrams(sequence)
Convert a given sequence into a sequence of trigrams.
lingpy.sequence.tiers module
Module provides tools to handle transcriptions as multi-tiered sequences.
lingpy.sequence.tiers.cvcv(sequence, **keywords)
Create a CV-template representation out of a sound sequence.
lingpy.sequence.tiers.get_stress(sound)
lingpy.sequence.tiers.is_consonant(sound)
lingpy.sequence.tiers.is_sound(sound, what)
Check whether a sound is of a given type.
lingpy.sequence.tiers.is_stressed(sound)
Quick check for stress.
lingpy.sequence.tiers.is_tone(sound)
lingpy.sequence.tiers.is_vowel(sound)
lingpy.sequence.tiers.remove_stress(sound)
lingpy.sequence.tiers.sound_type(sound)
Shortcut to determine the basic sound type (C, V, or T).
Module contents
Module provides methods and functions for dealing with linguistic sequences.
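As a closing sketch for this module, tokens2morphemes with an explicit separator (the token list and the choice of '+' as separator are illustrative):
>>> from lingpy.sequence.sound_classes import tokens2morphemes
>>> tokens = ['t', 'a', '+', 'k', 'o']
>>> tokens2morphemes(tokens, sep='+')  # -> [['t', 'a'], ['k', 'o']]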
lingpy.tests package
Subpackages
lingpy.tests.algorithm package
Submodules
lingpy.tests.algorithm.test_cluster_util module
class lingpy.tests.algorithm.test_cluster_util.Tests(methodName='runTest')
Bases: unittest.case.TestCase
Methods: test_generate_all_clusters(), test_generate_random_clusters(), test_mutate_cluster(), test_order_cluster(), test_valid_cluster()
lingpy.tests.algorithm.test_clustering module
class lingpy.tests.algorithm.test_clustering.Tests(methodName='runTest')
Bases: clldutils.testing.WithTempDir
Methods: setUp(), test_best_threshold(), test_check_taxa(), test_check_taxon_names(), test_find_threshold(), test_flat_cluster(), test_fuzzy(), test_link_clustering(), test_matrix2groups(), test_matrix2tree(), test_neighbor(), test_partition_density(), test_upgma()
lingpy.tests.algorithm.test_cython module
class lingpy.tests.algorithm.test_cython.Tests
Bases: object
Methods: setUp(), test__calign(), test__malign(), test__talign(), test_corrdist()
lingpy.tests.algorithm.test_extra module
class lingpy.tests.algorithm.test_extra.Cluster(*args, **kw)
Bases: mock.mock.MagicMock
class AffinityPropagation(*args, **kw)
Bases: object
fit_predict(arg)
dbscan(*args, **kw)
class lingpy.tests.algorithm.test_extra.Igraph(*args, **kw)
Bases: mock.mock.MagicMock
class Graph(vs=[])
Bases: object
add_edge(a, b)
add_vertex(vertex)
community_infomap(*args, **kw)
class lingpy.tests.algorithm.test_extra.Tests
Bases: object
Methods: setUp(), test_affinity_propagation(), test_clustering(), test_dbscan(), test_infomap_clustering()
class lingpy.tests.algorithm.test_extra.components(nodes)
Bases: object
subgraphs()
Module contents
lingpy.tests.align package
Submodules
lingpy.tests.align.test_multiple module
Testing multiple module.
class lingpy.tests.align.test_multiple.Tests(methodName='runTest')
Bases: unittest.case.TestCase
Methods: setUp(), test___get__(), test_get_local_peaks(), test_get_pairwise_alignments(), test_get_peaks(), test_get_pid(), test_iterate_all_sequences(), test_iterate_clusters(), test_iterate_orphans(), test_iterate_similar_gap_sites(), test_lib_align(), test_mult_align(), test_prog_align(), test_sum_of_pairs(), test_swap_check()
lingpy.tests.align.test_pairwise module
class lingpy.tests.align.test_pairwise.TestPairwise(methodName='runTest')
Bases: unittest.case.TestCase
Methods: setUp(), test_align(), test_basics()
lingpy.tests.align.test_pairwise.test_editdist()
lingpy.tests.align.test_pairwise.test_nw_align()
lingpy.tests.align.test_pairwise.test_pw_align()
lingpy.tests.align.test_pairwise.test_structalign()
lingpy.tests.align.test_pairwise.test_sw_align()
lingpy.tests.align.test_pairwise.test_turchin()
lingpy.tests.align.test_pairwise.test_we_align()
lingpy.tests.align.test_sca module
Test the SCA module.
class lingpy.tests.align.test_sca.TestAlignments(methodName='runTest')
Bases: clldutils.testing.WithTempDir
Methods: setUp(), test_align(), test_get_confidence(), test_get_consensus(), test_ipa2tokens(), test_output()
class lingpy.tests.align.test_sca.TestMSA(methodName='runTest')
Bases: clldutils.testing.WithTempDir
Methods: test_output()
class lingpy.tests.align.test_sca.TestPSA(methodName='runTest')
Bases: clldutils.testing.WithTempDir
Methods: test_output()
lingpy.tests.align.test_sca.test_get_consensus()
Module contents
lingpy.tests.basic package
Submodules
lingpy.tests.basic.test_ops module
Test wordlist module.
class lingpy.tests.basic.test_ops.TestOps(methodName='runTest')
Bases: clldutils.testing.WithTempDir
Methods: setUp(), test_calculate_data(), test_clean_taxnames(), test_coverage(), test_iter_rows(), test_renumber(), test_tsv2triple(), test_wl2dict(), test_wl2dst(), test_wl2multistate(), test_wl2qlc()
lingpy.tests.basic.test_parser module
class lingpy.tests.basic.test_parser.TestParser(methodName='runTest')
Bases: unittest.case.TestCase
Methods: setUp(), test_add_entries(), test_cache(), test_get_entries(), test_getattr(), test_getitem(), test_init(), test_len()
lingpy.tests.basic.test_tree module
class lingpy.tests.basic.test_tree.TestTree(methodName='runTest')
Bases: unittest.case.TestCase
Methods: setUp(), test_getDistanceToRoot(), test_get_LCA(), test_get_distance(), test_get_distance_unknown() (tests failure with an unknown distance), test_init_from_file(), test_init_from_list()
lingpy.tests.basic.test_tree.test_random_tree()
lingpy.tests.basic.test_tree.test_star_tree()
lingpy.tests.basic.test_wordlist module
Test wordlist module.
class lingpy.tests.basic.test_wordlist.TestWordlist(methodName='runTest')
Bases: clldutils.testing.WithTempDir
Methods: setUp(), test___len__(), test_calculate(), test_coverage(), test_export(), test_get_dict(), test_get_entries(), test_get_etymdict(), test_get_list(), test_get_paps(), test_get_wordlist(), test_output(), test_renumber()
Module contents
lingpy.tests.compare package
Submodules
lingpy.tests.compare.test__phylogeny module
class lingpy.tests.compare.test__phylogeny.Graph(*args, **kw)
Bases: mock.mock.MagicMock
nodes(**kw)
class lingpy.tests.compare.test__phylogeny.Nx(*args, **kw)
Bases: mock.mock.MagicMock
Graph(*args, **kw)
generate_gml(*args)
class lingpy.tests.compare.test__phylogeny.Plt(*args, **kw)
Bases: mock.mock.MagicMock
Polygon(*args, **kw)
fill(*args, **kw)
gca(*args, **kw)
plot(*args, **kw)
text(*args, **kw)
class lingpy.tests.compare.test__phylogeny.SPS(*args, **kw)
Bases: mock.mock.MagicMock
mstats = <MagicMock>
class lingpy.tests.compare.test__phylogeny.TestUtils(methodName='runTest')
Bases: clldutils.testing.WithTempDir
Methods: setUp(), test_utils()
lingpy.tests.compare.test__phylogeny.test_convex_hull()
lingpy.tests.compare.test__phylogeny.test_get_convex_hull()
lingpy.tests.compare.test__phylogeny.test_get_polygon_from_nodes()
lingpy.tests.compare.test__phylogeny.test_seg_intersect()
lingpy.tests.compare.test__phylogeny.test_settings()
lingpy.tests.compare.test_lexstat module
class lingpy.tests.compare.test_lexstat.TestLexStat(methodName='runTest')
Bases: clldutils.testing.WithTempDir
Methods: setUp(), test__get_matrices(), test_align_pairs(), test_cluster(), test_correctness(), test_get_distances(), test_get_frequencies(), test_get_scorer(), test_get_subset(), test_getitem(), test_init(), test_init2(), test_init3(), test_output()
lingpy.tests.compare.test_lexstat.test_char_from_charstring()
lingpy.tests.compare.test_lexstat.test_get_score_dict()
lingpy.tests.compare.test_partial module
class lingpy.tests.compare.test_partial.Tests(methodName='runTest')
Bases: clldutils.testing.WithTempDir
Methods: setUp(), test__get_slices(), test_add_cognate_ids(), test_get_partial_matrices(), test_partial_cluster()
lingpy.tests.compare.test_phylogeny module
Test the TreBor borrowing detection algorithm.
class lingpy.tests.compare.test_phylogeny.Bmp(*args, **kw)
    Bases: mock.mock.MagicMock
    Basemap(*args, **kw)

class lingpy.tests.compare.test_phylogeny.Graph(*args, **kw)
    Bases: mock.mock.MagicMock
    nodes(**kw)

class lingpy.tests.compare.test_phylogeny.Nx(*args, **kw)
    Bases: mock.mock.MagicMock
    Graph(*args, **kw)
    generate_gml(*args)

class lingpy.tests.compare.test_phylogeny.Plt(*args, **kw)
    Bases: mock.mock.MagicMock
    plot(*args, **kw)

class lingpy.tests.compare.test_phylogeny.Sp(*args, **kw)
    Bases: mock.mock.MagicMock
    stats = <MagicMock>

class lingpy.tests.compare.test_phylogeny.TestPhyBo(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    setUp()
    test_get_GLS()
    test_plot()

lingpy.tests.compare.test_sanity module

class lingpy.tests.compare.test_sanity.Tests
    Bases: object
    setUp()
    test__get_concepts()
    test__mutual_coverage()
    test_mutual_coverage()
    test_mutual_coverage_check()
    test_mutual_coverage_subset()
    test_synonymy()

lingpy.tests.compare.test_strings module

class lingpy.tests.compare.test_strings.TestStrings(methodName='runTest')
    Bases: unittest.case.TestCase
    setUp()
    test_bidist1()
    test_bidist2()
    test_bidist3()
    test_bisim1()
    test_bisim2()
    test_bisim3()
    test_dice()
    test_ident()
    test_jcd()
    test_jcdn()
    test_lcs()
    test_ldn()
    test_ldn_swap()
    test_prefix()
    test_tridist1()
    test_tridist2()
    test_tridist3()
    test_trigram()
    test_trisim1()
    test_trisim2()
    test_trisim3()
    test_xdice()
    test_xxdice()

Module contents

lingpy.tests.convert package

Submodules

lingpy.tests.convert.test_cldf module

lingpy.tests.convert.test_cldf.test_from_cldf()

lingpy.tests.convert.test_graph module

lingpy.tests.convert.test_graph.test_igraph2networkx()
lingpy.tests.convert.test_graph.test_networkx2igraph()

lingpy.tests.convert.test_html module

class lingpy.tests.convert.test_html.Tests(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    test_alm2html()
    test_color_range()
    test_msa2html()
    test_psa2html()
    test_strings_and_tokens2html()

lingpy.tests.convert.test_plot module

class lingpy.tests.convert.test_plot.Plt(*args, **kw)
    Bases: mock.mock.MagicMock
    plot(*args, **kw)

class lingpy.tests.convert.test_plot.Sch(*args, **kw)
    Bases: mock.mock.MagicMock
    dendrogram(*args, **kw)

class lingpy.tests.convert.test_plot.TestPlot(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    setUp()
    test_plots()

lingpy.tests.convert.test_strings module

Test conversions involving strings.

class lingpy.tests.convert.test_strings.TestWriteNexus(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    Tests for write_nexus.
    assertRegexWorkaround(a, b)
    setUp()
    test_beast()
    test_beastwords()
    test_error_on_unknown_mode()
    test_error_on_unknown_ref()
    test_mrbayes()

class lingpy.tests.convert.test_strings.Tests(methodName='runTest')
    Bases: unittest.case.TestCase
    test_matrix2dst()
    test_msa2str()
    test_pap2csv()
    test_pap2nex()
    test_scorer2str()
        Test conversion of scorers to strings.

lingpy.tests.convert.test_tree module

class lingpy.tests.convert.test_tree.TestTree(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    setUp()
    test__nwk_format()
    test_nwk2tree_matrix()

Module contents

lingpy.tests.data package

Submodules

lingpy.tests.data.test_derive module

class lingpy.tests.data.test_derive.TestDerive(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    setUp()
    test_compile_dvt()
    test_compile_model()
lingpy.tests.data.test_sound_class_models module

class lingpy.tests.data.test_sound_class_models.Tests
    Bases: object
    failures = defaultdict(<class 'list'>, {})
    model = 'asjp'
    models = ['sca', 'dolgo', 'art', 'color', 'asjp']
    segment = 'c'
    segments = {'j', 'í', '45', '0', '33', '12', "'", 'p', 't', '14', ...}
    values = ['ð', 'ts', '45', '0', '14', 'd', '33', '12', '31', ...]

(The segments and values fixtures are truncated here; several entries are IPA strings that cannot be reproduced reliably in this rendering.)

Module contents

lingpy.tests.evaluate package

Submodules

lingpy.tests.evaluate.test_acd module

class lingpy.tests.evaluate.test_acd.Tests(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    setUp()
    test_bcubes()
    test_diff()
    test_extreme_cognates()
    test_pairs()
    test_partial_bcubes()
    test_random_cognates()

lingpy.tests.evaluate.test_acd.test_npoint_ap()

lingpy.tests.evaluate.test_alr module

class lingpy.tests.evaluate.test_alr.Tests(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    setUp()
    test_med()

lingpy.tests.evaluate.test_apa module

class lingpy.tests.evaluate.test_apa.Tests(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    test_EvalMSA()
    test_EvalPSA()

Module contents

lingpy.tests.meaning package

Submodules

lingpy.tests.meaning.test_colexification module

Tests for colexification module.

class lingpy.tests.meaning.test_colexification.TestColexifications(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    setUp()
    test__get_colexifications()
    test__get_colexifications_by_taxa()
    test__get_statistics()
    test__make_graph()
    test__make_matrix()
    test_colexification_network()
    test_compare_colexifications()
    test_evaluate_colexifications()

Module contents

lingpy.tests.read package

Submodules

lingpy.tests.read.test_csv module

Tests for the read.csv module.

lingpy.tests.read.test_csv.test_csv2dict()
lingpy.tests.read.test_csv.test_csv2list()
lingpy.tests.read.test_csv.test_csv2multidict()
lingpy.tests.read.test_csv.test_read_asjp()

lingpy.tests.read.test_phylip module

Basic tests for the Phylip module.

lingpy.tests.read.test_phylip.test_read_dst()
lingpy.tests.read.test_phylip.test_read_scorer()

lingpy.tests.read.test_qlc module

lingpy.tests.read.test_qlc.test_normalize_alignment()
lingpy.tests.read.test_qlc.test_read_msa()
lingpy.tests.read.test_qlc.test_read_qlc()
lingpy.tests.read.test_qlc.test_reduce_msa()

lingpy.tests.read.test_starling module

class lingpy.tests.read.test_starling.Tests(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    test_star2qlc()

Module contents

lingpy.tests.sequence package

Submodules

lingpy.tests.sequence.test_generate module

class lingpy.tests.sequence.test_generate.Tests(methodName='runTest')
    Bases: unittest.case.TestCase
    setUp()
    test_evaluate_string()
    test_get_string()

lingpy.tests.sequence.test_profile module

lingpy.tests.sequence.test_profile.test_context_profile()
lingpy.tests.sequence.test_profile.test_simple_profile()
lingpy.tests.sequence.test_sound_classes module

lingpy.tests.sequence.test_sound_classes.test_bigrams()
lingpy.tests.sequence.test_sound_classes.test_check_tokens()
lingpy.tests.sequence.test_sound_classes.test_class2tokens()
lingpy.tests.sequence.test_sound_classes.test_clean_string()
lingpy.tests.sequence.test_sound_classes.test_codepoint()
lingpy.tests.sequence.test_sound_classes.test_fourgrams()
lingpy.tests.sequence.test_sound_classes.test_get_all_ngrams()
lingpy.tests.sequence.test_sound_classes.test_get_n_ngrams()
lingpy.tests.sequence.test_sound_classes.test_ipa2tokens()
lingpy.tests.sequence.test_sound_classes.test_onoparse()
lingpy.tests.sequence.test_sound_classes.test_pgrams()
lingpy.tests.sequence.test_sound_classes.test_pid()
lingpy.tests.sequence.test_sound_classes.test_prosodic_string()
lingpy.tests.sequence.test_sound_classes.test_prosodic_weights()
lingpy.tests.sequence.test_sound_classes.test_sampa2uni()
lingpy.tests.sequence.test_sound_classes.test_syllabify()
lingpy.tests.sequence.test_sound_classes.test_token2class()
lingpy.tests.sequence.test_sound_classes.test_tokens2class()
lingpy.tests.sequence.test_sound_classes.test_tokens2morphemes()
lingpy.tests.sequence.test_sound_classes.test_trigrams()

Module contents

lingpy.tests.thirdparty package

Submodules

lingpy.tests.thirdparty.test_cogent module

Test thirdparty modules.

class lingpy.tests.thirdparty.test_cogent.PhyloNodeTests(methodName='runTest')
    Bases: unittest.case.TestCase
    test_PhyloNode()

class lingpy.tests.thirdparty.test_cogent.TreeTests(methodName='runTest')
    Bases: unittest.case.TestCase
    test_Tree()
    test_more_trees()

lingpy.tests.thirdparty.test_cogent.test_LoadTree()

lingpy.tests.thirdparty.test_linkcomm module

class lingpy.tests.thirdparty.test_linkcomm.TestHLC(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    setUp()
    test_hlc()

lingpy.tests.thirdparty.test_linkcomm.test_dc()
lingpy.tests.thirdparty.test_linkcomm.test_swap()

Module contents

Submodules

lingpy.tests.test_cache module

class lingpy.tests.test_cache.TestCache(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    test_cache()

lingpy.tests.test_cli module

class lingpy.tests.test_cli.Tests(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    run_cli(*args)
    test_alignments()
    test_lexstat()
    test_multiple()
    test_ortho_profile()
    test_pairwise()
    test_profile()
    test_settings()
    test_wordlist()

lingpy.tests.test_cli.capture(*args)

lingpy.tests.test_config module

class lingpy.tests.test_config.ConfigTest(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    setUp()
    test_default()
    test_existing_config()
    test_new_config()

lingpy.tests.test_log module

class lingpy.tests.test_log.LogTest(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    tearDown()
    test_Logging_context_manager()
    test_convenience()
    test_default_config()
    test_new_config()

lingpy.tests.test_util module

class lingpy.tests.test_util.Test(methodName='runTest')
    Bases: clldutils.testing.WithTempDir
    test_TextFile()
    test_write_text_file()

class lingpy.tests.test_util.TestCombinations(methodName='runTest')
    Bases: unittest.case.TestCase
    test_combinations2()

class lingpy.tests.test_util.TestJoin(methodName='runTest')
    Bases: unittest.case.TestCase
    test_dotjoin()
    test_join()

lingpy.tests.test_util.test_as_string()

lingpy.tests.util module

Utilities used in lingpy tests.

lingpy.tests.util.get_log()
    A mock object for lingpy.log to test whether log messages have been emitted.
    Returns: Mock instance.

lingpy.tests.util.test_data(*comps)
    Access test data files.
    Parameters: comps – Path components of the data file path relative to the test_data dir.
    Returns: Absolute path to the specified test data file.

Module contents
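A short sketch of how these two helpers can be combined in a test. This is a minimal illustration, not part of the original reference; the file name KSL.qlc is an assumed example and not guaranteed to be part of the shipped test data:

    >>> from lingpy.tests.util import test_data, get_log
    >>> path = test_data('KSL.qlc')      # absolute path below the test_data directory (hypothetical file)
    >>> log = get_log()                  # Mock stand-in for lingpy.log
    >>> log.warn.assert_not_called()     # standard mock assertion: no warning was emitted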
lingpy.thirdparty package

Subpackages

lingpy.thirdparty.cogent package

Submodules

lingpy.thirdparty.cogent.newick module

Newick format with all features as per the specs at
http://evolution.genetics.washington.edu/phylip/newick_doc.html and
http://evolution.genetics.washington.edu/phylip/newicktree.html, i.e.:

- unquoted labels with underscore munging
- quoted labels
- inner node labels
- lengths
- [ ] comments (discarded)
- unlabeled tips

In addition, double quotes can be used, and spaces and quote marks are OK inside unquoted labels.

exception lingpy.thirdparty.cogent.newick.TreeParseError
    Bases: ValueError

lingpy.thirdparty.cogent.newick.parse_string(text, constructor, **kw)
    Parses a Newick-format string, using the specified constructor for the tree. Calls constructor(children, name, attributes).
    Note: underscore_unmunge, if True, replaces underscores with spaces in the data that is read in. This is part of the Newick format, but it is often useful to suppress this behavior.

lingpy.thirdparty.cogent.tree module

lingpy.thirdparty.cogent.tree.LoadTree(filename=None, treestring=None, tip_names=None, underscore_unmunge=False)
    Constructor for a tree. Use only one of the following arguments:
    - filename: a file containing a newick- or xml-formatted tree.
    - treestring: a newick- or xml-formatted tree string.
    - tip_names: a list of tip names.

    Notes
    Underscore unmunging is turned off by default, although it is part of the Newick format. Set underscore_unmunge to True to replace underscores with spaces in all names read.
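A minimal sketch of loading a tree from a Newick string via the signature above; the tree string itself is an arbitrary example, and the getNewick and getTipNames methods used here are documented below for PhyloNode and TreeNode:

    >>> from lingpy.thirdparty.cogent.tree import LoadTree
    >>> tree = LoadTree(treestring='((a:1,b:1)ab:2,(c:1,d:1)cd:2)root;')
    >>> tree.getTipNames()
    ['a', 'b', 'c', 'd']
    >>> newick = tree.getNewick(with_distances=True)   # serialize back to a Newick string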
class lingpy.thirdparty.cogent.tree.PhyloNode(*args, **kwargs)
    Bases: lingpy.thirdparty.cogent.tree.TreeNode

    Length

    balanced()
        Tree rooted here with no neighbour having > 50% of the edges.
        Notes
        Using a balanced tree can substantially improve performance of the likelihood calculations. Note that the resulting tree has a different orientation, with the effect that specifying clades or stems for model parameterisation should be done using the outgroup_name argument.

    bifurcating(constructor=None)

    compareByPartitions(other, debug=False)

    distance(other)
        Returns the branch length between self and other.

    getDistances(endpoints=None)
        The distance matrix as a dictionary.
        Usage: Grabs the branch lengths (evolutionary distances) as a complete matrix (i.e. both a,b and b,a).

    getNewick(with_distances=False, semicolon=True, escape_name=True)

    prune()
        Reconstructs the correct tree after nodes have been removed. Internal nodes with only one child will be removed, and new connections and branch lengths will be made to reflect the change.

    rootAtMidpoint()
        Returns a new tree rooted at the midpoint of the two tips farthest apart. This function does not preserve the internal node naming or structure, but does keep tip-to-tip distances correct. Uses unrootedDeepcopy().

    rootedAt(edge_name)
        Return a new tree rooted at the provided node.
        Usage: This can be useful for drawing unrooted trees with an orientation that reflects knowledge of the true root location.

    rootedWithTip(outgroup_name)
        A new tree with the named tip as one of the root's children.

    sameTopology(other)
        Tests whether two trees have the same topology.

    scaleBranchLengths(max_length=100, ultrametric=False)
        Scales branch lengths in place to integers for ASCII output.
        Warning: the tree might not be exactly the length you specify. Set ultrametric=True if you want all the root-tip distances to end up precisely the same.

    setTipDistances()
        Sets the distance from each node to the most distant tip.

    tipToTipDistances(endpoints=None, default_length=1)
        Returns the distance matrix between all pairs of tips, and a tip order.
        Warning: .__start and .__stop are added to self and its descendants. tip_order contains the actual node objects, not their names (may be confusing in some cases).

    totalDescendingBranchLength()
        Returns the total descending branch length from self.

    unrooted()
        A tree with at least 3 children at the root.

    unrootedDeepcopy(constructor=None, parent=None)
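Continuing the LoadTree sketch above, the distance and rerooting methods can be combined as follows. This is a sketch only; the tuple-keyed layout of the distance dictionary follows the "complete matrix" description of getDistances:

    >>> tree = LoadTree(treestring='((a:1,b:2)ab:1,(c:1,d:1)cd:2)root;')
    >>> dists = tree.getDistances()          # {('a', 'b'): 3.0, ('b', 'a'): 3.0, ...}
    >>> dists[('a', 'b')]
    3.0
    >>> rooted = tree.rootedWithTip('c')     # 'c' becomes a child of the new root
    >>> mid = tree.rootAtMidpoint()          # tip-to-tip distances are preserved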
class lingpy.thirdparty.cogent.tree.TreeBuilder(mutable=False, constructor=<class 'lingpy.thirdparty.cogent.tree.PhyloNode'>)
    Bases: object

    createEdge(children, name, params, nameLoaded=True)
        Callback for the newick parser.

    edgeFromEdge(edge, children, params=None)
        Callback for tree-to-tree transforms like getSubTree.

exception lingpy.thirdparty.cogent.tree.TreeError
    Bases: Exception

class lingpy.thirdparty.cogent.tree.TreeNode(Name=None, Children=None, Parent=None, Params=None, NameLoaded=True, **kwargs)
    Bases: object
    Store information about a tree node. Mutable.
    Parameters: Name: label for the node, assumed to be unique. Children: list of the node's children. Params: dict containing arbitrary parameters for the node. NameLoaded: ?

    Parent
        Accessor for the parent. If using an algorithm that accesses Parent a lot, it will be much faster to access self._parent directly, but don't do it if mutating self._parent! (Or, if you must, remember to clean up the refs.)

    ancestors()
        Returns all ancestors back to the root. Dynamically calculated.

    append(i)
        Appends i to self.Children, in place, cleaning up refs.

    asciiArt(show_internal=True, compact=False, labels=False)
        Returns a string containing an ASCII drawing of the tree.
        Parameters
        show_internal: bool : include internal edge names.
        compact: bool : use exactly one line per tip.
        labels: {bool, list} : specify specific labels for all nodes in the tree.
        Notes
        The labels keyword was added to the function by JML.

    childGroups()
        Returns a list containing lists of children sharing a state. In other words, returns runs of tip and non-tip children.

    compareByNames(other)
        Equality test for trees by name.

    compareBySubsets(other, exclude_absent_taxa=False)
        Returns the fraction of overlapping subsets where self and other differ. Other is expected to be a tree object compatible with PhyloNode.
        Notes
        Names present in only one of the two trees will count as mismatches: if you don't want this behavior, strip out the non-matching tips first.

    compareName(other)
        Compares TreeNode by name.

    copy(memo=None, _nil=[], constructor='ignored')
        Returns a copy of self using an iterative approach.

    copyRecursive(memo=None, _nil=[], constructor='ignored')
        Returns a copy of self's structure, including a shallow copy of attrs. constructor is ignored; it is required only to support old tree unit tests.

    copyTopology(constructor=None)
        Copies only the topology and labels of a tree, not any extra data. Useful when you want another copy of the tree with the same structure and labels, but want to e.g. assign different branch lengths and environments. Does not use deepcopy from the copy module, so _much_ faster than the copy() method.

    deepcopy(memo=None, _nil=[], constructor='ignored')
        Returns a copy of self using an iterative approach.

    descendantArray(tip_list=None)
        Returns a numpy array with nodes in rows and descendants in columns. A value of 1 indicates that the descendant is a descendant of that node; a value of 0 indicates that it is not. Also returns a list of nodes in the same order as they are listed in the array. tip_list is a list of the names of the tips that will be considered, in the order they will appear as columns in the final array. Internal nodes will appear as rows in preorder traversal order.

    extend(items)
        Extends self.Children by items, in place, cleaning up refs.

    getConnectingEdges(name1, name2)
        Returns a list of edges connecting two nodes; includes self and other in the list.

    getConnectingNode(name1, name2)
        Finds the last common ancestor of the two named edges.

    getDistances(endpoints=None)
        The distance matrix as a dictionary.
        Usage: Grabs the branch lengths (evolutionary distances) as a complete matrix (i.e. both a,b and b,a).

    getEdgeNames(tip1name, tip2name, getclade, getstem, outgroup_name=None)
        Return the list of stem and/or sub-tree (clade) edge name(s). This is done by finding the common intersection, and then getting the list of names. If the clade traverses the root, use the outgroup_name argument to ensure a valid specification.
        Arguments:
        - tip1name, tip2name: the names of the two tip edges
        - getstem: whether the name of the clade stem edge is returned.
        - getclade: whether the names of the edges within the clade are returned.
        - outgroup_name: if provided, the calculation is done on a version of the tree re-rooted relative to the provided tip.
        Usage: The returned list can be used to specify subtrees for special parameterisation. For instance, say you want to allow the primates to have a different value of a particular parameter. In this case, provide the results of this method to the parameter controller method setParamRule() along with the parameter name etc.

    getEdgeVector()
        Collect the list of edges in postfix order.

    getMaxTipTipDistance()
        Returns the max tip-to-tip distance between any pair of tips.
        Returns (dist, tip_names, internal_node).

    getNewick(with_distances=False, semicolon=True, escape_name=True)
        Return the newick string for this tree.
        Arguments:
        - with_distances: whether branch lengths are included.
        - semicolon: end the tree string with a semicolon.
        - escape_name: if any of the characters [](),:;_ exist in a node's name, wrap the name in single quotes.
        NOTE: This method returns the Newick representation of this node and its descendants. It is a modification of an implementation by Zongzhi Liu.
    getNewickRecursive(with_distances=False, semicolon=True, escape_name=True)
        Return the newick string for this edge.
        Arguments:
        - with_distances: whether branch lengths are included.
        - semicolon: end the tree string with a semicolon.
        - escape_name: if any of the characters [](),:;_ exist in a node's name, wrap the name in single quotes.

    getNodeMatchingName(name)

    getNodeNames(includeself=True, tipsonly=False)
        Return a list of edges from this edge; may or may not include self. This node (or first connection) will be the first, and then they will be listed in the natural traverse order.

    getNodesDict()
        Returns a dict keyed by node name, where each value is the node. Will raise a TreeError if non-unique names are encountered.

    getParamValue(param, edge)
        Returns the parameter value for the named edge.

    getSubTree(name_list, ignore_missing=False, keep_root=False)
        A new instance of a sub tree that contains all the otus that are listed in name_list.
        ignore_missing: if False, getSubTree will raise a ValueError if name_list contains names that aren't nodes in the tree.
        keep_root: if False, the root of the subtree will be the last common ancestor of all nodes kept in the subtree; the root-to-tip distance is then (possibly) different from the original tree. If True, the root-to-tip distance remains constant, but the root may have only one child node.

    getTipNames(includeself=False)
        Return the list of the names of all tips contained by this edge.

    get_LCA(*nodes)
        Find the lowest common ancestor of a given number of nodes.
        Notes
        This function is supposed to yield the same output as lowestCommonAncestor does. It was added in order to overcome certain problems in the original function, resulting from attributes added to a PhyloNode object that make its use at times unreliable. Furthermore, it works with an arbitrary list of nodes (including tips and internal nodes).

    indexInParent()
        Returns the index of self in parent.

    insert(index, i)
        Inserts an item at the specified position in self.Children.

    isRoot()
        Returns True if the current node is a root, i.e. has no parent.

    isTip()
        Returns True if the current node is a tip, i.e. has no children.

    isroot()
        Returns True if root of a tree, i.e. no parent.

    istip()
        Returns True if tip, i.e. no children.

    iterNontips(include_self=False)
        Iterates over nontips descended from self; [] if none. include_self, if True (default is False), will include the current node in the iteration if it is a nontip.

    iterTips(include_self=False)
        Iterates over tips descended from self; [] if self is a tip.

    lastCommonAncestor(other)
        Finds the last common ancestor of self and other, or None. Always tests by identity.

    lca(other)
        Finds the last common ancestor of self and other, or None. Always tests by identity.

    levelorder(include_self=True)
        Performs levelorder iteration over the tree.

    lowestCommonAncestor(tipnames)
        Lowest common ancestor for a list of tipnames. This should be around O(H sqrt(n)), where H is height and n is the number of tips passed in.

    makeTreeArray(dec_list=None)
        Makes an array with nodes in rows and descendants in columns. A value of 1 indicates that the descendant is a descendant of that node; a value of 0 indicates that it is not. Also returns a list of nodes in the same order as they are listed in the array.
    maxTipTipDistance()
        Returns the max distance between any pair of tips. Also returns the names of the two tips as a tuple.

    nameUnnamedNodes()
        Sets the Data property of unnamed nodes to an arbitrary value. Internal nodes are often unnamed, so this function assigns a value for referencing.

    nonTipChildren()
        Returns direct children of self that have descendants.

    nontips(include_self=False)
        Returns nontips descended from self.

    pop(index=-1)
        Returns and deletes the child of self at index (default: -1).

    postorder(include_self=True)
        Performs postorder iteration over the tree. This is somewhat inelegant compared to saving the node and its index on the stack, but is 30% faster in the average case and 3x faster in the worst case (for a comb tree). Zongzhi Liu's slower but more compact version is:

        def postorder_zongzhi(self):
            stack = [[self, 0]]
            while stack:
                curr, child_idx = stack[-1]
                if child_idx < len(curr.Children):
                    stack[-1][1] += 1
                    stack.append([curr.Children[child_idx], 0])
                else:
                    yield stack.pop()[0]

    pre_and_postorder(include_self=True)
        Performs iteration over the tree, visiting each node before and after its descendants.

    preorder(include_self=True)
        Performs preorder iteration over the tree.

    prune()
        Reconstructs the correct topology after nodes have been removed. Internal nodes with only one child will be removed, and new connections will be made to reflect the change.

    reassignNames(mapping, nodes=None)
        Reassigns node names based on a mapping dict.
        mapping : dict, old_name -> new_name
        nodes : specific nodes for renaming (such as just tips, etc.)

    remove(target)
        Removes a node by name instead of identity. Returns True if the node was present, False otherwise.

    removeNode(target)
        Removes a node by identity instead of value. Returns True if the node was present, False otherwise.

    root()
        Returns the root of the tree self is in. Dynamically calculated.

    sameShape(other)
        Ignores lengths and order, so trees should be sorted first.

    separation(other)
        Returns the number of edges separating self and other.

    setMaxTipTipDistance()
        Propagate tip distance information up the tree. This method was originally implemented by Julia Goodrich with the intent of being able to determine max tip-to-tip distances between nodes on large trees efficiently. The code has been modified to track the specific tips the distance is between.

    setParamValue(param, edge, value)
        Sets the value for param at the named edge.

    siblings()
        Returns all nodes that are children of the same parent as self.
        Notes
        Excludes self from the list. Dynamically calculated.

    sorted(sort_order=[])
        An equivalent tree sorted into a standard order. If sort_order is not specified, alphabetical order is used. At each node, starting from the root, the algorithm will try to put the descendant which contains the lowest-scoring tip on the left.

    subset()
        Returns the set of names that descend from the specified node.

    subsets()
        Returns all sets of names that come from the specified node and its kids.

    tipChildren()
        Returns direct children of self that are tips.

    tips(include_self=False)
        Returns tips descended from self; [] if self is a tip.

    traverse(self_before=True, self_after=False, include_self=True)
        Returns an iterator over descendants. Iterative: safe for large trees.
        Notes
        self_before includes each node before its descendants if True. self_after includes each node after its descendants if True. include_self includes the initial node if True.
        self_before and self_after are independent. If neither is True, only terminal nodes will be returned. Note that if self is terminal, it will only be included once even if self_before and self_after are both True.
        This is a depth-first traversal. Since the trees are not binary, preorder and postorder traversals are possible, but inorder traversals would depend on the data in the tree and are not handled here.

    traverse_recursive(self_before=True, self_after=False, include_self=True)
        Returns an iterator over descendants. IMPORTANT: read the notes below.
        Notes
        traverse_recursive is slower than traverse, and can lead to stack errors. However, you _must_ use traverse_recursive if you plan to modify the tree topology as you walk over it (e.g. in post-order), because the iterative methods use their own stack that is not updated if you alter the tree.
        self_before includes each node before its descendants if True. self_after includes each node after its descendants if True. include_self includes the initial node if True.
        self_before and self_after are independent. If neither is True, only terminal nodes will be returned. Note that if self is terminal, it will only be included once even if self_before and self_after are both True.
        This is a depth-first traversal. Since the trees are not binary, preorder and postorder traversals are possible, but inorder traversals would depend on the data in the tree and are not handled here.

    writeToFile(filename, with_distances=True, format=None)
        Save the tree to filename.
        Arguments:
        - filename: self-evident.
        - with_distances: whether branch lengths are included in the string.
        - format: default is newick; xml is the alternative. The argument overrides the filename suffix. All attributes are saved in the xml format.
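A sketch of the traversal methods above on a small labelled tree. Node labels from the Newick string are exposed through the Name attribute; the outputs shown are what the documented traversal orders imply:

    >>> tree = LoadTree(treestring='((a,b)ab,(c,d)cd)root;')
    >>> [node.Name for node in tree.postorder()]       # children before their parents
    ['a', 'b', 'ab', 'c', 'd', 'cd', 'root']
    >>> [node.Name for node in tree.traverse()]        # default: each node before its descendants
    ['root', 'ab', 'a', 'b', 'cd', 'c', 'd']
    >>> print(tree.asciiArt(show_internal=True))       # quick visual check of the topology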
lingpy.thirdparty.cogent.tree.cmp(a, b)

lingpy.thirdparty.cogent.tree.comb(items, n=None)
    Yields each successive combination of n items.
    items: a slicable sequence. n: number of items in each combination.
    This version is from Raymond Hettinger, 2006/03/23.

Module contents

Simple py3 port of PyCogent's (http://pycogent.sourceforge.net) Tree classes.

lingpy.thirdparty.linkcomm package

Submodules

lingpy.thirdparty.linkcomm.link_clustering module

Changes 2010-08-27:
- all three output files now contain the same community id numbers
- comm2nodes and comm2edges both present the cid as the first entry of each line; previously only comm2nodes did this
- implemented the weighted version, added the -w switch
- expanded the help string to explain input and outputs

lingpy.thirdparty.linkcomm.link_clustering.Dc(m, n)
    Partition density.

class lingpy.thirdparty.linkcomm.link_clustering.HLC(adj, edges)
    Bases: object
    initialize_edges()
    merge_comms(edge1, edge2)
    single_linkage(threshold=None, w=None)

lingpy.thirdparty.linkcomm.link_clustering.similarities_unweighted(adj)
    Get all the edge similarities. The input dict maps nodes to sets of neighbors. The output is a list of decorated edge pairs, (1 - sim, eij, eik), ordered by similarity.

lingpy.thirdparty.linkcomm.link_clustering.similarities_weighted(adj, ij2wij)
    Same as similarities_unweighted, but using the Tanimoto coefficient. adj is a dict mapping nodes to sets of neighbors; ij2wij is a dict mapping an edge tuple (ni, nj) to the weight wij of that edge.

lingpy.thirdparty.linkcomm.link_clustering.swap(a, b)

Module contents

Module provides a simple py3 port for link community analyses, following the algorithm by James Bagrow and Yong-Yeol Ahn.
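A sketch of the documented input format for the similarity helper. The adjacency dict below is an arbitrary toy graph; the exact representation of the two edges inside each returned tuple is an implementation detail:

    >>> from lingpy.thirdparty.linkcomm.link_clustering import similarities_unweighted
    >>> adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
    >>> sims = similarities_unweighted(adj)    # [(1 - sim, edge_ij, edge_ik), ...], ordered by similarity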
Module contents

Submodules

lingpy.cache module

Implements the lingpy cache. Some operations in lingpy may be time-consuming, so we provide a mechanism to cache the results of these operations.

lingpy.cache.dump(data, filename)
lingpy.cache.load(filename)
lingpy.cache.path(filename)
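A sketch of the cache API based on the signatures above. The key 'demo' is an arbitrary name, and whether an extension is appended to the on-disk file is an implementation detail:

    >>> from lingpy import cache
    >>> cache.dump({'threshold': 0.6}, 'demo')   # persist a picklable object in the cache dir
    >>> cache.load('demo')
    {'threshold': 0.6}
    >>> cache.path('demo')                       # file-system location backing the cached entry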
lingpy.cli module

class lingpy.cli.Command
    Bases: object
    Base class for subcommands of the lingpy command line interface.
    help = None
    output(args, content)
    classmethod subparser(parser)
        Hook to define subcommand arguments.

class lingpy.cli.CommandMeta(name, bases, dct)
    Bases: type
    A metaclass which keeps track of subclasses, if they have all-lowercase names.

lingpy.cli.add_align_method_option(p)
lingpy.cli.add_cognate_identifier_option(p, default)
lingpy.cli.add_format_option(p, default, choices)
lingpy.cli.add_method_option(p, default, choices, spec='')
lingpy.cli.add_mode_option(p, choices)
lingpy.cli.add_option(parser, name_, default_, help_, short_opt=None, **kw)
lingpy.cli.add_shared_args(p)
lingpy.cli.add_strings_option(p, n)
lingpy.cli.add_tree_calc_option(p)

class lingpy.cli.alignments
    Bases: lingpy.cli.Command
    Carry out an alignment analysis of a wordlist file with readily detected cognates.
    classmethod subparser(p)

lingpy.cli.get_parser()

class lingpy.cli.help
    Bases: lingpy.cli.Command
    Show help for commands.
    classmethod subparser(parser)

class lingpy.cli.lexstat
    Bases: lingpy.cli.Command
    classmethod subparser(p)

lingpy.cli.main(*args)
    LingPy command line interface.

class lingpy.cli.multiple
    Bases: lingpy.cli.Command
    Multiple alignment console interface for LingPy.
    classmethod subparser(p)

class lingpy.cli.pairwise
    Bases: lingpy.cli.Command
    Run pairwise analyses from the command line in LingPy.
    Notes
    Currently, the following options are supported:
    - run normal analyses without sound-class strings
    - run sound-class based analyses
    Furthermore, input and output are handled as follows:
    - define user input using LingPy's psa formats
    - define user output (stdout, file)
    classmethod subparser(p)

class lingpy.cli.profile
    Bases: lingpy.cli.Command
    classmethod subparser(p)

class lingpy.cli.settings
    Bases: lingpy.cli.Command
    classmethod subparser(p)

class lingpy.cli.wordlist
    Bases: lingpy.cli.Command
    Load a wordlist and carry out simple checks.
    classmethod subparser(p)
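The interface can also be driven from Python through main(). The sketch below assumes that the help subcommand can be invoked without further arguments, which is not guaranteed by the reference above:

    >>> from lingpy.cli import main
    >>> main('help')    # assumed invocation; lists the available subcommands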
lingpy.compat module

Functionality to provide compatibility across the supported Python versions.

lingpy.config module

Configuration management for lingpy. Various aspects of lingpy can be configured and customized by the user. This is done with configuration files in the user's config dir.
See also: https://pypi.python.org/pypi/appdirs/

class lingpy.config.Config(name, default=None, **kw)
    Bases: configparser.RawConfigParser

lingpy.log module

Logging utilities.

class lingpy.log.CustomFilter(name='')
    Bases: logging.Filter
    filter(record)

class lingpy.log.Logging(level=10, logger=None)
    Bases: object
    A context manager to execute a block of code at a specific logging level.

lingpy.log.debug(msg, **kw)
lingpy.log.deprecated(old, new)
lingpy.log.error(msg, **kw)
lingpy.log.file_written(fname, logger=None)
lingpy.log.get_level()

lingpy.log.get_logger(config_dir=None, force_default_config=False, test=False)
    Get a logger configured according to the lingpy log config file.
    Note: If no logging configuration file exists, it will be created.
    Parameters
    - config_dir – Directory in which to look for/create the log config file.
    - force_default_config – Configure the logger using the default config.
    - test – Force reconfiguration of the logger.
    Returns: A logger.

lingpy.log.info(msg, **kw)
lingpy.log.missing_module(name, logger=None)
lingpy.log.warn(msg)
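A sketch of the logging helpers: Logging temporarily changes the threshold for the enclosed block, while the module-level functions emit messages at the respective levels:

    >>> import logging
    >>> from lingpy.log import Logging, debug, info
    >>> with Logging(level=logging.DEBUG):
    ...     debug('visible only while the context manager is active')
    >>> info('back at the default logging level')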
lingpy.settings module

Module handles all global parameters used in a LingPy session.

lingpy.settings.rc(rval=None, **keywords)
    Function changes parameters globally set for LingPy sessions.
    Parameters
    rval : string (default=None)
        Use this keyword to specify a return value for the rc-function.
    schema : {ipa, asjp}
        Change the basic schema for sequence comparison. When switching to asjp, sequences will be treated as sequences in ASJP code; otherwise, they will be treated as sequences written in basic IPA.
    Notes
    This function is the standard way to communicate with the rcParams dictionary, which is not imported by default. If you want to see which parameters there are, you can load the rcParams dictionary directly:
    >>> from lingpy.settings import rcParams
    However, be careful when changing the values. They might produce some unexpected behavior.
    Examples
    Import LingPy:
    >>> from lingpy import *
    Switch from IPA transcriptions to ASJP transcriptions:
    >>> rc(schema="asjp")
    You can check which basic orthography is currently loaded:
    >>> rc('basic_orthography')
    'asjp'
    >>> rc(schema='ipa')
    >>> rc('basic_orthography')
    'fuzzy'

lingpy.util module

class lingpy.util.TemporaryPath(suffix='')
    Bases: object

class lingpy.util.TextFile(path, log=True)
    Bases: object

lingpy.util.as_string(obj, pprint=False)

lingpy.util.charstring(id_, char='X', cls='-')

lingpy.util.combinations2(iterable)
    Convenience shortcut.

lingpy.util.identity(x)

lingpy.util.join(sep, *args, **kw)
    Convenience shortcut. Strings to be joined do not have to be passed as a list or tuple.
    Notes
    An implicit conversion of objects to strings is performed as well.

lingpy.util.lines_to_text(lines)

lingpy.util.lingpy_path(*comps)

lingpy.util.multicombinations2(iterable)
    Convenience shortcut. For the name, see the Wikipedia article on Combination: https://en.wikipedia.org/wiki/Combination#Number_of_combinations_with_repetition

lingpy.util.nexus_slug(s)
    Converts a string to a nexus-safe representation (i.e. removes many unicode characters and some punctuation characters).
    Parameters
    s : str
        A string to convert to a nexus-safe format.
    Returns
    s : str
        A string containing a nexus-safe label.

lingpy.util.read_config_file(path, **kw)
    Read the lines of a file, ignoring commented and empty lines.

lingpy.util.read_text_file(path, normalize=None, lines=False)
    Read a text file encoded in utf-8.
    Parameters
    path : { Path, str }
        File-system path of the file.
    normalize : { None, NFC, NFD }
        If not None, a valid unicode normalization mode must be passed.
    lines : bool (default=False)
        Flag signalling whether to return a list of lines (without the line-separation character).
    Returns
    file_content : { list, str }
        File content as a unicode object, or as a list of lines of unicode objects.
    Notes
    The whole file is read into memory.

lingpy.util.setdefaults(d, **kw)
    Shortcut for a common idiom: setting multiple default values at once.
    Parameters
    d : dict
        Dictionary to be updated.
    kw : dict
        Dictionary with default values.

lingpy.util.write_text_file(path, content, normalize=None, log=True)
    Write a text file encoded in utf-8.
    Parameters
    path : str
        File-system path of the file.
    content : str
        The text content to be written.
    normalize : { None, NFC, NFD } (default=None)
        If not None, a valid unicode normalization mode must be passed.
    log : bool (default=True)
        Indicate whether you want to log the result of the file-writing process.
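A sketch combining the file helpers and setdefaults documented above ('demo.txt' is an arbitrary path):

    >>> from lingpy.util import write_text_file, read_text_file, setdefaults
    >>> write_text_file('demo.txt', 'foo\nbar', normalize='NFC', log=False)
    >>> read_text_file('demo.txt', lines=True)
    ['foo', 'bar']
    >>> opts = {'mode': 'global'}
    >>> setdefaults(opts, mode='local', scale=0.5)   # fills in only the missing keys
    >>> sorted(opts.items())
    [('mode', 'global'), ('scale', 0.5)]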
Module contents

LingPy package for quantitative tasks in historical linguistics. Documentation is available in the docstrings; online documentation is available at http://lingpy.org.

Subpackages

algorithm   Basic Algorithms for Sequence Comparison
align       Specific Algorithms for Alignment Analyses
basic       Basic Classes for Language Comparison
compare     Basic Modules for Language Comparison
convert     Functions for Format Conversion
data        Data Handling
evaluate    Basic Classes and Functions for Algorithm Evaluation
read        Basic Functions for Data Input
sequence    Basic Functions for Sequence Modeling
thirdparty  Temporary Forks of Third-Party Modules

CHAPTER TEN

DOWNLOAD

10.1 Download

10.1.1 Current Version

The current stable release of LingPy is version 2.6. This (and older versions) can be downloaded from:
- PyPI: https://pypi.python.org/pypi/lingpy

LingPy is under continuous development. You can always download the most recent version from our Git repository:
- Git repository: https://github.com/lingpy/lingpy

10.1.2 Older Versions

Older versions of LingPy that work only with Python 2, including source code and documentation, are still available for download, but they will no longer be modified.
- LingPy-1.0 (Python 2): http://pypi.python.org/pypi/lingpy/1.0
- Documentation (PDF): http://lingpy.org/download/lingpy_doc.pdf
- Documentation (HTML): http://lingpy.org/download/lingpy-1.0-doc.zip

PYTHON MODULE INDEX

a
lingpy.algorithm, lingpy.algorithm.cluster_util, lingpy.algorithm.clustering, lingpy.algorithm.cython, lingpy.algorithm.cython.calign, lingpy.algorithm.cython.cluster, lingpy.algorithm.cython.compilePYX, lingpy.algorithm.cython.malign, lingpy.algorithm.cython.misc, lingpy.algorithm.cython.talign, lingpy.algorithm.extra, lingpy.align, lingpy.align.multiple, lingpy.align.pairwise, lingpy.align.sca

b
lingpy.basic, lingpy.basic.ops, lingpy.basic.parser, lingpy.basic.tree, lingpy.basic.wordlist

c
lingpy.cache, lingpy.cli, lingpy.compare, lingpy.compare.lexstat, lingpy.compare.partial, lingpy.compare.phylogeny, lingpy.compare.sanity, lingpy.compare.strings, lingpy.compare.util, lingpy.compat, lingpy.config, lingpy.convert, lingpy.convert.cldf, lingpy.convert.graph, lingpy.convert.html, lingpy.convert.plot, lingpy.convert.strings, lingpy.convert.tree

d
lingpy.data, lingpy.data.derive, lingpy.data.ipa, lingpy.data.ipa.sampa, lingpy.data.model

e
lingpy.evaluate, lingpy.evaluate.acd, lingpy.evaluate.alr, lingpy.evaluate.apa

l
lingpy, lingpy.log

m
lingpy.meaning, lingpy.meaning.colexification

r
lingpy.read, lingpy.read.csv, lingpy.read.phylip, lingpy.read.qlc, lingpy.read.starling

s
lingpy.sequence, lingpy.sequence.generate, lingpy.sequence.profile, lingpy.sequence.sound_classes, lingpy.sequence.tiers, lingpy.settings

t
lingpy.tests, lingpy.tests.algorithm, lingpy.tests.algorithm.test_cluster_util, lingpy.tests.algorithm.test_clustering, lingpy.tests.algorithm.test_cython, lingpy.tests.algorithm.test_extra, lingpy.tests.align, lingpy.tests.align.test_multiple, lingpy.tests.align.test_pairwise, lingpy.tests.align.test_sca, lingpy.tests.basic, lingpy.tests.basic.test_ops, lingpy.tests.basic.test_parser, lingpy.tests.basic.test_tree, lingpy.tests.basic.test_wordlist, lingpy.tests.compare, lingpy.tests.compare.test__phylogeny, lingpy.tests.compare.test_lexstat, lingpy.tests.compare.test_partial, lingpy.tests.compare.test_phylogeny, lingpy.tests.compare.test_sanity, lingpy.tests.compare.test_strings, lingpy.tests.convert, lingpy.tests.convert.test_cldf, lingpy.tests.convert.test_graph, lingpy.tests.convert.test_html, lingpy.tests.convert.test_plot, lingpy.tests.convert.test_strings, lingpy.tests.convert.test_tree, lingpy.tests.data, lingpy.tests.data.test_derive, lingpy.tests.data.test_sound_class_models, lingpy.tests.evaluate, lingpy.tests.evaluate.test_acd, lingpy.tests.evaluate.test_alr, lingpy.tests.evaluate.test_apa, lingpy.tests.meaning, lingpy.tests.meaning.test_colexification, lingpy.tests.read, lingpy.tests.read.test_csv, lingpy.tests.read.test_phylip, lingpy.tests.read.test_qlc, lingpy.tests.read.test_starling, lingpy.tests.sequence, lingpy.tests.sequence.test_generate, lingpy.tests.sequence.test_profile, lingpy.tests.sequence.test_sound_classes, lingpy.tests.test_cache, lingpy.tests.test_cli, lingpy.tests.test_config, lingpy.tests.test_log, lingpy.tests.test_util, lingpy.tests.thirdparty, lingpy.tests.thirdparty.test_cogent, lingpy.tests.thirdparty.test_linkcomm, lingpy.tests.util, lingpy.thirdparty, lingpy.thirdparty.cogent, lingpy.thirdparty.cogent.newick, lingpy.thirdparty.cogent.tree, lingpy.thirdparty.linkcomm, lingpy.thirdparty.linkcomm.link_clustering

u
lingpy.util