Based on least squares considerations, Brito and Moreira Freitas [Brito, M., Moreira Freitas, A.C., 2003. Limiting behaviour of a geometric-type estimator for tail indices. Insurance: Math. Econ. 33, 211-226] proposed a geometric-type estimator for estimating an exponential tail coefficient. We consider here the tail bootstrap method introduced by Bacro and Brito [Bacro, J.N., Brito, M., 1998. A tail bootstrap procedure for estimating the tail Pareto index. J. Stat. Plan. Infer. 71, 245-260] and show that this procedure works for this estimator. Moreover, we extend the application given in Brito and Moreira Freitas [Brito, M., Moreira Freitas, A.C., 2003. Limiting behaviour of a geometric-type estimator for tail indices. Insurance: Math. Econ. 33, 211-226], by showing that the results obtained may be applied to the related problem of estimating the adjustment coefficient in the Sparre Andersen model, under the standard conditions.
In this paper we propose a new data structure for the efficient extraction of structured motifs from DNA sequences. A structured motif is defined as a collection of highly conserved motifs with pre-specified sizes and spacings between them. The new data structure, called box-link, stores the information on how to jump over the spacings which separate each motif in a structured motif. A factor tree, a variation of a suffix tree, endowed with box-links provides the means for the efficient extraction of structured motifs.
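To make the notion of a structured motif concrete, the following is a minimal sketch of a naive baseline, not the box-link or factor-tree method described in the abstract. It assumes exact matching and only two boxes; the function name `structured_motifs` and the parameters `k1`, `gap`, `k2` and `quorum` are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def structured_motifs(sequences, k1, gap, k2, quorum):
    """Naive enumeration of exact two-box structured motifs.

    A candidate is a pair (box1, box2) where box1 has length k1, box2 has
    length k2, and exactly `gap` positions separate them. The function
    returns every pair found in at least `quorum` distinct sequences,
    mapped to the number of sequences that contain it.
    """
    support = defaultdict(set)  # (box1, box2) -> indices of sequences containing it
    span = k1 + gap + k2
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - span + 1):
            box1 = seq[i:i + k1]
            box2 = seq[i + k1 + gap:i + span]
            support[(box1, box2)].add(idx)
    return {motif: len(ids) for motif, ids in support.items() if len(ids) >= quorum}


if __name__ == "__main__":
    seqs = ["ACGTTTGCA", "TTACGAAAGCA", "ACGCCTGCA"]
    print(structured_motifs(seqs, k1=3, gap=3, k2=3, quorum=3))
```

This brute-force scan re-reads the spacer at every position; the abstract's point is that box-links avoid that repeated work by recording how to jump over the spacings.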
In this work we propose a parallel algorithm for the efficient extraction of binding-site consensus from genomic sequences. This algorithm, based on an existing approach, extracts structured motifs, which consist of an ordered collection of p ≥ 1 boxes with sizes and spacings between them specified by given parameters. The contents of the boxes, which represent the extracted motifs, are unknown at the start of the process and are found by the algorithm using a suffix tree as the fundamental data structure. By partitioning the structured motif search space we divide the most demanding part of the algorithm among a number of processors that can be loosely coupled. In this way we obtain, under conditions that are easily met, a speedup that is linear in the number of available processing units. This speedup is verified by both theoretical and experimental analysis, also presented in this paper.
A total of 553 Y-chromosomes were analyzed from mainland Portugal and the North Atlantic Archipelagos of Açores and Madeira, in order to characterize the genetic composition of their male gene pool. A large majority (78-83% of each population) of the male lineages could be classified as belonging to three basic Y chromosomal haplogroups, R1b, J, and E3b. While R1b, accounting for more than half of the lineages in any of the Portuguese subpopulations, is a characteristic marker of many different West European populations, haplogroups J and E3b consist of lineages that are typical of the circum-Mediterranean region or even East Africa. The highly diverse haplogroup E3b in Portuguese likely combines sub-clades of distinct origins. The present composition of the Y chromosomes in Portugal in this haplogroup likely reflects a pre-Arab component shared with North African populations or testifies, at least in part, to the influence of Sephardic Jews. In contrast to the marginally low sub-Saharan African Y chromosome component in Portuguese, such lineages have been detected at a moderately high frequency in our previous survey of mtDNA from the same samples, indicating the presence of sex-related gene flow, most likely mediated by the Atlantic slave trade.
The Y-chromosome haplogroup composition of the population of the Cabo Verde Archipelago was profiled by using 32 single-nucleotide polymorphism markers and compared with potential source populations from Iberia, west Africa, and the Middle East. According to the traditional view, the major proportion of the founding population of Cabo Verde was of west African ancestry with the addition of a minor fraction of male colonizers from Europe. Unexpectedly, more than half of the paternal lineages (53.5%) of Cabo Verdeans clustered in haplogroups I, J, K, and R1, which are characteristic of populations of Europe and the Middle East, while being absent in the probable west African source population of Guiné-Bissau. Moreover, a high frequency of J* lineages in Cabo Verdeans relates them more closely to populations of the Middle East and probably provides the first genetic evidence of the legacy of the Jews. In addition, the considerable proportion (20.5%) of E3b(xM81) lineages indicates a possible gene flow from the Middle East or northeast Africa, which, at least partly, could be ascribed to the Sephardic Jews. In contrast to the predominance of west African mitochondrial DNA haplotypes in their maternal gene pool, the major west African Y-chromosome lineage E3a was observed only at a frequency of 15.9%. Overall, these results indicate that gene flow from multiple sources and various sex-specific patterns have been important in the formation of the genomic diversity in the Cabo Verde islands.
Exact power estimation, at logic level, is only possible if all the input correlations are taken into account. Recently, a probabilistic approach that uses a simple but powerful formalism for power estimation taking into account all the input correlations has been proposed. With this probabilistic approach it is possible to compute exactly the power dissipation of combinational modules using input statistics that would require extremely large traces if simulation-based methods were to be used. However, the applicability of the method is limited to very small circuits. This paper describes a circuit partitioning technique that speeds up that method. By using partitioning techniques we avoid the computation of global BDD representations for node functions, thereby extending considerably the range of applicability of the algorithm. Moreover, the partitioning maintains the full set of correlations and, therefore, does not induce any loss of accuracy.
Bioinformatics/computer Applications in The Biosciences, 2006
Motivation: The ability to identify complex motifs, i.e. non-contiguous nucleotide sequences, is a key feature of modern motif finders. Addressing this problem is extremely important, not only because these motifs can accurately model biological phenomena but because their extraction is highly dependent upon the appropriate selection of numerous search parameters. Currently available combinatorial algorithms have proved to be highly efficient in exhaustively enumerating motifs (including complex motifs) that fulfill certain extraction criteria. However, one major problem with these methods is the large number of parameters that need to be specified. Results: We propose a new algorithm, MUSA (Motif finding using an UnSupervised Approach), that can be used either to autonomously find over-represented complex motifs or to estimate search parameters for modern motif finders. This method relies on a biclustering algorithm that operates on a matrix of co-occurrences of small motifs. The performance of this method is independent of the composite structure of the motifs being sought, making few assumptions about their characteristics. The MUSA algorithm was applied to two datasets involving the bacterium Pseudomonas putida KT2440. The first one was composed of 70 σ54-dependent promoter sequences and the second dataset included 54 promoter sequences of up-regulated genes in response to phenol, as suggested by quantitative proteomics. The results obtained indicate that this approach is very effective at identifying complex motifs of biological significance. Availability: The MUSA algorithm is available upon request from the authors, and will be made available via a Web-based interface.
The Yeast search for transcriptional regulators and consensus tracking (YEASTRACT) information system (www.yeastract.com) was developed to support the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Last updated in September 2007, this database contains over 30 990 regulatory associations between Transcription Factors (TFs) and target genes and includes 284 specific DNA binding sites for 108 characterized TFs. Computational tools are also provided to facilitate the exploitation of the gathered data when solving a number of biological questions, in particular the ones that involve the analysis of global gene expression results. In this new release, YEASTRACT includes DISCOVERER, a set of computational tools that can be used to identify complex motifs over-represented in the promoter regions of co-regulated genes. The motifs identified are then clustered in families, represented by a position weight matrix and are automatically compared with the known transcription factor binding sites described in YEASTRACT. Additionally, in this new release, it is possible to generate graphic depictions of transcriptional regulatory networks for documented or potential regulatory associations between TFs and target genes. The visual display of these networks of interactions is instrumental in functional studies. Tutorials are available on the system to exemplify the use of all the available tools.
We propose a model and an algorithm to perform exact power estimation taking into account all temporal and spatial correlations of the input signals. The proposed methodology is able to accurately model temporal and spatial correlations at the logic level, with the input signal correlations being specified at the word level using a simple but effective formulation.
Motivation: Models of the dynamics of cellular interaction networks have become increasingly large in recent years. Formal verification based on model checking provides a powerful technology to keep up with this increase in scale and complexity. The application of model-checking approaches is hampered, however, by the difficulty for non-expert users to formulate appropriate questions in temporal logic. Results: In order to deal with this problem, we propose the use of patterns, that is, high-level query templates that capture recurring biological questions and can be automatically translated into temporal logic. The applicability of the developed set of patterns has been investigated by the analysis of an extended model of the network of global regulators controlling the carbon starvation response in Escherichia coli. Availability: GNA and the model of the carbon starvation response network are available at http://www-helix.inrialpes.fr/gna
Although many algorithms for power estimation have been proposed to date, no comprehensive results have been presented on the actual complexity of power estimation problems.
Background: Motif finding algorithms have developed in their ability to use computationally efficient methods to detect patterns in biological sequences. However, the posterior classification of the output still suffers from some limitations, which makes it difficult to assess the biological significance of the motifs found. Previous work has highlighted the existence of positional bias of motifs in the DNA sequences, which might not only indicate that the pattern is important, but also provide hints of the positions where these patterns occur preferentially. Results: We propose to integrate position uniformity tests and over-representation tests to improve the accuracy of the classification of motifs. Using artificial data, we have compared three different statistical tests (Chi-Square, Kolmogorov-Smirnov and a Chi-Square bootstrap) to assess whether a given motif occurs uniformly in the promoter region of a gene. Using the test that performed best on this dataset, we proceeded to study the positional distribution of several well-known cis-regulatory elements in the promoter sequences of different organisms (S. cerevisiae, H. sapiens, D. melanogaster, E. coli and several dicotyledonous plants). The results show that position conservation is relevant for the transcriptional machinery. Conclusion: We conclude that many biologically relevant motifs appear heterogeneously distributed in the promoter region of genes, and therefore that non-uniformity is a good indicator of biological relevance and can be used to complement the commonly used over-representation tests. In this article we present the results obtained for the S. cerevisiae data sets.
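As an illustration of the position uniformity idea, here is a minimal sketch of a Pearson chi-square statistic over equal-width bins of a promoter region. It is an assumption-laden example rather than the procedure used in the article: the function name `chi_square_uniformity`, the binning scheme, and the 0-based position convention are all hypothetical choices made for this sketch.

```python
def chi_square_uniformity(positions, region_length, n_bins):
    """Pearson chi-square statistic for motif start positions.

    Assumes 0-based positions with 0 <= p < region_length and splits the
    region into n_bins bins of (nearly) equal width. Under the null
    hypothesis of uniform placement, each bin expects
    len(positions) / n_bins occurrences.
    """
    counts = [0] * n_bins
    for p in positions:
        # Integer arithmetic maps a position to its bin; clamp as a safeguard.
        counts[min(p * n_bins // region_length, n_bins - 1)] += 1
    expected = len(positions) / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)


if __name__ == "__main__":
    spread = list(range(0, 100, 5))  # evenly spread occurrences
    clustered = [10] * 20            # all occurrences at one position
    print(chi_square_uniformity(spread, 100, 4))
    print(chi_square_uniformity(clustered, 100, 4))
```

The statistic is compared against a chi-square distribution with `n_bins - 1` degrees of freedom; for 4 bins, the 5% critical value is about 7.815, so a large value such as the one produced by the clustered example would reject uniformity.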
In this work we describe an approach that implicitly formulates and solves the Chapman-Kolmogorov equations that describe the state probabilities associated with the stationary behavior of sequential circuits. Unlike previous approaches that assumed uncorrelated input signals, we model the more general case where the sequential circuit is driven by a sequence of inputs described by a discrete time Markov chain. This Markov chain is described implicitly using a formalism that allows for a compact description of chains with an exponentially high number of states. Using this approach, we present an application in power estimation of sequential circuits that takes into account all the temporal and spatial correlations between the primary inputs and the internal signals. We present results showing that, in some cases, it is possible to solve exactly the Chapman-Kolmogorov equations for systems with more than […] equations.
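For readers unfamiliar with the stationary behavior mentioned above, the sketch below computes the stationary distribution of a small discrete-time Markov chain by power iteration on an explicit transition matrix. This is only an explicit toy version under stated assumptions; the abstract's contribution is an implicit formulation that avoids building such a matrix. The function name `stationary_distribution` and its parameters are hypothetical, and the chain is assumed to be irreducible and aperiodic so that the iteration converges.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate pi such that pi = pi * P for a row-stochastic matrix P.

    P[i][j] is the probability of moving from state i to state j.
    Starts from the uniform distribution and repeatedly applies P until
    successive iterates differ by less than `tol` in every component.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi  # best estimate if the tolerance was not reached


if __name__ == "__main__":
    P = [[0.9, 0.1],
         [0.5, 0.5]]
    print(stationary_distribution(P))  # close to [5/6, 1/6]
```

For the two-state example, balance gives 0.1 * pi[0] = 0.5 * pi[1], so the exact answer is (5/6, 1/6). An explicit matrix like this grows quadratically with the number of states, which is why a compact implicit description matters when the state count is exponential.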
We present the YEAst Search for Transcriptional Regulators And Consensus Tracking (YEASTRACT; www.yeastract.com) database, a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. This database is a repos-
In this paper we propose a new algorithm for identifying cis-regulatory modules in genomic sequences. In particular, the algorithm extracts structured motifs, defined as a collection of highly conserved regions with pre-specified sizes and spacings between them. This type of motif is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The proposed algorithm uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the dataset sequences. The complexity analysis shows a time and space gain over previous algorithms that is exponential in the spacings between binding sites. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than two orders of magnitude. The application of the method to biological datasets shows its ability to extract relevant consensus sequences.
A total of 553 Y-chromosomes were analyzed from mainland Portugal and the North Atlantic Archipel... more A total of 553 Y-chromosomes were analyzed from mainland Portugal and the North Atlantic Archipelagos of Açores and Madeira, in order to characterize the genetic composition of their male gene pool. A large majority (78-83% of each population) of the male lineages could be classified as belonging to three basic Y chromosomal haplogroups, R1b, J, and E3b. While R1b, accounting for more than half of the lineages in any of the Portuguese subpopulations, is a characteristic marker of many different West European populations, haplogroups J and E3b consist of lineages that are typical of the circum-Mediterranean region or even East Africa. The highly diverse haplogroup E3b in Portuguese likely combines sub-clades of distinct origins. The present composition of the Y chromosomes in Portugal in this haplogroup likely reflects a pre-Arab component shared with North African populations or testifies, at least in part, to the influence of Sephardic Jews. In contrast to the marginally low sub-Saharan African Y chromosome component in Portuguese, such lineages have been detected at a moderately high frequency in our previous survey of mtDNA from the same samples, indicating the presence of sex-related gene flow, most likely mediated by the Atlantic slave trade.
Based on least square considerations, Brito and Moreira Freitas [Brito, M., Moreira Freitas, A.C.... more Based on least square considerations, Brito and Moreira Freitas [Brito, M., Moreira Freitas, A.C., 2003. Limiting behaviour of a geometric-type estimator for tail indices. Insurance: Math. Econ. 33, 211-226] proposed a geometric-type estimator for estimating an exponential tail coefficient. We consider here the tail bootstrap method introduced by Bacro and Brito [Bacro, J.N., Brito, M., 1998. A tail bootstrap procedure for estimating the tail Pareto index. J. Stat. Plan. Infer. 71, 245-260] and show that this procedure works for this estimator. Moreover, we extend the application given in Brito and Moreira Freitas [Brito, M., Moreira Freitas, A.C., 2003. Limiting behaviour of a geometric-type estimator for tail indices. Insurance: Math. Econ. 33, 211-226], by showing that the results obtained may be applied to the related problem of estimating the adjustment coefficient in the Sparre Andersen model, under the standard conditions.
In this paper we propose a new data structure for the efficient extraction of structured motifs f... more In this paper we propose a new data structure for the efficient extraction of structured motifs from DNA sequences. A structured motif is defined as a collection of highly conserved motifs with pre-specified sizes and spacings between them. The new data structure, called box-link, stores the information on how to jump over the spacings which separate each motif in a structured motif. A factor tree, a variation of a suffix tree, endowed with box-links provide the means for the efficient extraction of structured motifs.
In this work we propose a parallel algorithm for the efficient extraction of binding-site consens... more In this work we propose a parallel algorithm for the efficient extraction of binding-site consensus from genomic sequences. This algorithm, based on an existing approach, extracts structured motifs, that consist of an ordered collection of p ≥ 1 boxes with sizes and spacings between them specified by given parameters. The contents of the boxes, which represent the extracted motifs, are unknown at the start of the process and are found by the algorithm using a suffix tree as the fundamental data structure. By partitioning the structured motif searching space we divide the most demanding part of the algorithm by a number of processors that can be loosely coupled. In this way we obtain, under conditions that are easily met, a speedup that is linear on the number of available processing units. This speedup is verified by both theoretical and experimental analysis, also presented in this paper.
A total of 553 Y-chromosomes were analyzed from mainland Portugal and the North Atlantic Archipel... more A total of 553 Y-chromosomes were analyzed from mainland Portugal and the North Atlantic Archipelagos of Açores and Madeira, in order to characterize the genetic composition of their male gene pool. A large majority (78-83% of each population) of the male lineages could be classified as belonging to three basic Y chromosomal haplogroups, R1b, J, and E3b. While R1b, accounting for more than half of the lineages in any of the Portuguese subpopulations, is a characteristic marker of many different West European populations, haplogroups J and E3b consist of lineages that are typical of the circum-Mediterranean region or even East Africa. The highly diverse haplogroup E3b in Portuguese likely combines sub-clades of distinct origins. The present composition of the Y chromosomes in Portugal in this haplogroup likely reflects a pre-Arab component shared with North African populations or testifies, at least in part, to the influence of Sephardic Jews. In contrast to the marginally low sub-Saharan African Y chromosome component in Portuguese, such lineages have been detected at a moderately high frequency in our previous survey of mtDNA from the same samples, indicating the presence of sex-related gene flow, most likely mediated by the Atlantic slave trade.
The Y-chromosome haplogroup composition of the population of the Cabo Verde Archipelago was profi... more The Y-chromosome haplogroup composition of the population of the Cabo Verde Archipelago was profiled by using 32 single-nucleotide polymorphism markers and compared with potential source populations from Iberia, west Africa, and the Middle East. According to the traditional view, the major proportion of the founding population of Cabo Verde was of west African ancestry with the addition of a minor fraction of male colonizers from Europe. Unexpectedly, more than half of the paternal lineages (53.5%) of Cabo Verdeans clustered in haplogroups I, J, K, and R1, which are characteristic of populations of Europe and the Middle East, while being absent in the probable west African source population of Guiné-Bissau. Moreover, a high frequency of J* lineages in Cabo Verdeans relates them more closely to populations of the Middle East and probably provides the first genetic evidence of the legacy of the Jews. In addition, the considerable proportion (20.5%) of E3b(xM81) lineages indicates a possible gene flow from the Middle East or northeast Africa, which, at least partly, could be ascribed to the Sephardic Jews. In contrast to the predominance of west African mitochondrial DNA haplotypes in their maternal gene pool, the major west African Y-chromosome lineage E3a was observed only at a frequency of 15.9%. Overall, these results indicate that gene flow from multiple sources and various sex-specific patterns have been important in the formation of the genomic diversity in the Cabo Verde islands.
Exact power estimation, at logic level, is only possible if all the input correlations are taken ... more Exact power estimation, at logic level, is only possible if all the input correlations are taken into account. Recently, a probabilistic approach that uses a simple but powerful formalism for power estimation taking into account all the input correlations has been proposed. With this probabilistic approach it is possible to compute exactly the power dissipation of combinational modules using input statistics that would require extremely large traces if simulation based methods were to be used. However, the applicability of the method is limited to very small circuits. This paper describes a circuit partitioning technique that speeds up that method. By using partitioning techniques we avoid the computation of global BDD representations for node functions, thereby extending considerably the range of applicability of the algorithm. Moreover, the partitioning maintains the full set of correlations and, therefore, does not induce any loss of accuracy.
Bioinformatics/computer Applications in The Biosciences, 2006
Motivation: The ability to identify complex motifs, i.e. non-contiguous nucleotide sequences, is ... more Motivation: The ability to identify complex motifs, i.e. non-contiguous nucleotide sequences, is a key feature of modern motif finders. Addressing this problem is extremely important, not only because these motifs can accurately model biological phenomena but because its extraction is highly dependent upon the appropriate selection of numerous search parameters. Currently available combinatorial algorithms have proved to be highly efficient in exhaustively enumerating motifs (including complex motifs), which fulfill certain extraction criteria. However, one major problem with these methods is the large number of parameters that need to be specified. Results: We propose a new algorithm, MUSA (Motif finding using an UnSupervised Approach), that can be used either to autonomously find over-represented complex motifs or to estimate search parameters for modern motif finders. This method relies on a biclustering algorithm that operates on a matrix of co-occurrences of small motifs. The performance of this method is independent of the composite structure of the motifs being sought, making few assumptions about their characteristics. The MUSA algorithm was applied to two datasets involving the bacterium Pseudomonas putida KT2440. The first one was composed of 70 s 54 -dependent promoter sequences and the second dataset included 54 promoter sequences of up-regulated genes in response to phenol, as suggested by quantitative proteomics. The results obtained indicate that this approach is very effective at identifying complex motifs of biological significance. Availability: The MUSA algorithm is available upon request from the authors, and will be made available via a Web based interface.
The Yeast search for transcriptional regulators and consensus tracking (YEASTRACT) information sy... more The Yeast search for transcriptional regulators and consensus tracking (YEASTRACT) information system (www.yeastract.com) was developed to support the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Last updated in September 2007, this database contains over 30 990 regulatory associations between Transcription Factors (TFs) and target genes and includes 284 specific DNA binding sites for 108 characterized TFs. Computational tools are also provided to facilitate the exploitation of the gathered data when solving a number of biological questions, in particular the ones that involve the analysis of global gene expression results. In this new release, YEASTRACT includes DISCOVERER, a set of computational tools that can be used to identify complex motifs over-represented in the promoter regions of co-regulated genes. The motifs identified are then clustered in families, represented by a position weight matrix and are automatically compared with the known transcription factor binding sites described in YEASTRACT. Additionally, in this new release, it is possible to generate graphic depictions of transcriptional regulatory networks for documented or potential regulatory associations between TFs and target genes. The visual display of these networks of interactions is instrumental in functional studies. Tutorials are available on the system to exemplify the use of all the available tools.
We propose a model and an algorithm to perform exact power estimation taking into account all tem... more We propose a model and an algorithm to perform exact power estimation taking into account all temporal and spatial correlations of the input signals. The proposed methodology is able to accurately model temporal and spatial correlations at the logic level, with the input signal correlations being specified at the word level using a simple but effective formulation.
Motivation: Models of the dynamics of cellular interaction networks have become increasingly larg... more Motivation: Models of the dynamics of cellular interaction networks have become increasingly larger in recent years. Formal verification based on model checking provides a powerful technology to keep up with this increase in scale and complexity. The application of modelchecking approaches is hampered, however, by the difficulty for nonexpert users to formulate appropriate questions in temporal logic. Results: In order to deal with this problem, we propose the use of patterns, that is, high-level query templates that capture recurring biological questions and can be automatically translated into temporal logic. The applicability of the developed set of patterns has been investigated by the analysis of an extended model of the network of global regulators controlling the carbon starvation response in Escherichia coli. Availability: GNA and the model of the carbon starvation response network are available at http://www-helix.inrialpes.fr/gna Contact: [email protected]
Although many algorithms for power estimation have been proposed to date, no comprehensive result... more Although many algorithms for power estimation have been proposed to date, no comprehensive results have been presented on the actual complexity of power estimation problems.
Background Motif finding algorithms have developed in their ability to use computationally effici... more Background Motif finding algorithms have developed in their ability to use computationally efficient methods to detect patterns in biological sequences. However the posterior classification of the output still suffers from some limitations, which makes it difficult to assess the biological significance of the motifs found. Previous work has highlighted the existence of positional bias of motifs in the DNA sequences, which might indicate not only that the pattern is important, but also provide hints of the positions where these patterns occur preferentially. Results We propose to integrate position uniformity tests and over-representation tests to improve the accuracy of the classification of motifs. Using artificial data, we have compared three different statistical tests (Chi-Square, Kolmogorov-Smirnov and a Chi-Square bootstrap) to assess whether a given motif occurs uniformly in the promoter region of a gene. Using the test that performed better in this dataset, we proceeded to study the positional distribution of several well known cis-regulatory elements, in the promoter sequences of different organisms (S. cerevisiae, H. sapiens, D. melanogaster, E. coli and several Dicotyledons plants). The results show that position conservation is relevant for the transcriptional machinery. Conclusion We conclude that many biologically relevant motifs appear heterogeneously distributed in the promoter region of genes, and therefore, that non-uniformity is a good indicator of biological relevance and can be used to complement over-representation tests commonly used. In this article we present the results obtained for the S. cerevisiae data sets.
In this work we describe an approach that implicitly formulates and solves the Chapman-Kolmogorov equations that describe the state probabilities associated with the stationary behavior of sequential circuits. Unlike previous approaches that assumed uncorrelated input signals, we model the more general case where the sequential circuit is driven by a sequence of inputs described by a discrete-time Markov chain. This Markov chain is described implicitly using a formalism that allows for a compact description of chains with an exponentially high number of states. Using this approach, we present an application in power estimation of sequential circuits that takes into account all the temporal and spatial correlations between the primary inputs and the internal signals. We present results showing that, in some cases, it is possible to solve exactly the Chapman-Kolmogorov equations for systems with more than […] equations.
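For a discrete-time Markov chain with transition matrix P, the stationary state probabilities satisfy π = πP. The sketch below solves this explicitly by power iteration on a tiny, hypothetical two-state chain. It only illustrates the equations being solved; the implicit formulation described in the abstract, which avoids enumerating states, is not reproduced here, and the function name `stationary_distribution` and the example matrix are assumptions.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Approximate pi with pi = pi * P by repeated multiplication.

    P is a row-stochastic matrix given as a list of lists. Convergence
    is assumed (for example, an irreducible aperiodic chain).
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")

# Hypothetical two-state chain: balance requires pi[0] * 0.1 == pi[1] * 0.5,
# so the stationary distribution is (5/6, 1/6).
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary_distribution(P)
```

An explicit matrix like this grows with the number of states, which is why an implicit, compact representation matters for circuits with very large state spaces.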
We present the YEAst Search for Transcriptional Regulators And Consensus Tracking (YEASTRACT; www.yeastract.com) database, a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. This database is a repos-
In this paper we propose a new algorithm for identifying cis-regulatory modules in genomic sequences. In particular, the algorithm extracts structured motifs, defined as a collection of highly conserved regions with pre-specified sizes and spacings between them. This type of motif is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The proposed algorithm uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the dataset sequences. The complexity analysis shows a time and space gain over previous algorithms that is exponential in the spacings between binding sites. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than two orders of magnitude. The application of the method to biological datasets shows its ability to extract relevant consensus sequences.
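To make the notion of a structured motif concrete, the following sketch enumerates two-box motifs with fixed box sizes and a fixed spacing, keeping those that occur in at least a given number of sequences. This is a naive baseline with exact matching, written only to illustrate the problem definition; it is not the box-link algorithm, and the function name `structured_motifs`, its parameters, and the example sequences are hypothetical.

```python
from collections import defaultdict

def structured_motifs(sequences, k1, gap, k2, quorum):
    """Find (box1, box2) pairs where box1 has length k1, box2 has length k2,
    and exactly `gap` characters separate them.

    Returns a dict mapping each pair to the number of distinct sequences
    containing it, restricted to pairs found in at least `quorum` sequences.
    """
    support = defaultdict(set)
    span = k1 + gap + k2
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - span + 1):
            box1 = seq[start:start + k1]
            box2 = seq[start + k1 + gap:start + span]
            support[(box1, box2)].add(idx)
    return {m: len(s) for m, s in support.items() if len(s) >= quorum}

# Hypothetical sequences: the first two share TTGA, then 3 arbitrary
# bases, then TATA.
seqs = ["AATTGACCCTATAG", "GTTGACGGTATACC", "CCCCCCCCCCCCCC"]
found = structured_motifs(seqs, k1=4, gap=3, k2=4, quorum=2)
```

This baseline rescans every window of every sequence for each choice of sizes and spacing, which is the cost that an index-based approach such as the one described above aims to reduce.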
A total of 553 Y-chromosomes were analyzed from mainland Portugal and the North Atlantic Archipelagos of Açores and Madeira, in order to characterize the genetic composition of their male gene pool. A large majority (78-83% of each population) of the male lineages could be classified as belonging to three basic Y-chromosomal haplogroups, R1b, J, and E3b. While R1b, accounting for more than half of the lineages in any of the Portuguese subpopulations, is a characteristic marker of many different West European populations, haplogroups J and E3b consist of lineages that are typical of the circum-Mediterranean region or even East Africa. The highly diverse haplogroup E3b in the Portuguese likely combines sub-clades of distinct origins. The present composition of the Y chromosomes in Portugal in this haplogroup likely reflects a pre-Arab component shared with North African populations or testifies, at least in part, to the influence of Sephardic Jews. In contrast to the marginally low sub-Saharan African Y-chromosome component in the Portuguese, such lineages have been detected at a moderately high frequency in our previous survey of mtDNA from the same samples, indicating the presence of sex-related gene flow, most likely mediated by the Atlantic slave trade.
Papers by Ana Freitas