
VCTKEC DWARAHAT

DNA SEQUENCE COMPRESSION


USING COMPLEMENTARY PALINDROMES FOR A BETTER COMPRESSION
RATIO TO REVEAL THE TRUE CHARACTERISTICS OF THE DNA

Ganesh Bathyal (070109)
Himanshu Joshi (070115)
Rachit Phartiyal (070136)
UNDER THE SUPERVISION OF:
Mr. R. K. Bharti (Asst. Professor, CSE Deptt.)
INTRODUCTION
DNA
Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the
development and functioning of all known living organisms with the exception of some viruses. The main
role of DNA molecules is the long-term storage of information. DNA is often compared to a set
of blueprints, like a recipe or a code, since it contains the instructions needed to construct other
components of cells, such as proteins and RNA molecules. The DNA segments that carry this genetic
information are called genes, but other DNA sequences have structural purposes, or are involved in
regulating the use of this genetic information. Chemically, DNA consists of two long polymers of simple
units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds.
These two strands run in opposite directions to each other and are therefore anti-parallel. Attached to
each sugar is one of four types of molecules called bases. It is the sequence of these four bases along the
backbone that encodes information. This information is read using the genetic code, which specifies the
sequence of the amino acids within proteins. The code is read by copying stretches of DNA into the related
nucleic acid RNA, in a process called transcription.

Within cells, DNA is organized into long structures called chromosomes. These chromosomes are
duplicated before cells divide, in a process called DNA replication.

DNA COMPRESSION NEED


With more and more complete genomes of prokaryotes and eukaryotes becoming available, and the
completion of the Human Genome Project on the horizon, fundamental questions arise regarding the
characteristics of these sequences. In this paper, we study one such basic question: the
compressibility of DNA sequences.
Life represents order. It is not chaotic or random. Thus, we expect the DNA sequences that encode
Life to be nonrandom; in other words, they should be very compressible. There is also strong
biological evidence supporting this claim: it is well known that DNA sequences, especially in higher
eukaryotes, contain many (approximate) tandem repeats; it is also well known that many essential
genes (like rRNAs) have many copies; it is believed that there are only about a thousand
basic protein folding patterns; it also has been conjectured that genes duplicate themselves sometimes
for evolutionary or simply for “selfish” purposes. All these give more concrete support that the DNA
sequences should be reasonably compressible. However, such regularities are often blurred by random
mutation, translocation, cross-over, and reversal events, as well as sequencing errors.
It is well recognized that the compression of DNA sequences is a very difficult task.
Since DNA sequences consist of only four nucleotide bases {a, c, g, t}, 2 bits are enough to store each base.
However, if one applies standard compression software such as the Unix "compress" and "compact" or
the MS-DOS archive programs "pkzip" and "arj", they all expand the file to more than 2 bits per
base, as shown in Table 1, even though they are all universal compression algorithms. These programs
are designed for text compression, while the regularities in DNA sequences are much subtler.
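The trivial 2-bits-per-base baseline discussed above can be sketched directly. This is an illustrative implementation, not part of any tool cited in the text; all names are ours:

```python
# Sketch: store a DNA string at 2 bits per base, 4 bases per byte.
BASE_TO_BITS = {"a": 0b00, "c": 0b01, "g": 0b10, "t": 0b11}
BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

def pack(seq: str) -> bytes:
    """Pack 4 bases into each byte; the sequence length must be kept separately."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # left-align a final partial group so unpacking stays uniform
        byte <<= 2 * (4 - len(group))
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n])
```

Any compressor that, like the universal tools above, emits more than 2 bits per base on DNA is doing worse than this trivial packing; the paper's goal is to beat it.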
It is our purpose to study such subtleties in DNA sequences. One may treat compressibility study as
the ultimate generalization of simpler (and fruitful) biological studies such as the G-C content of
various species. We are acquiring more and more complete genomes, from the 5-megabase E. coli
to the 97-megabase C. elegans; the 3 billion bases of H. sapiens will also be available soon. More
sophisticated studies of these sequences will give us a deeper understanding of their nature.
Different regions of a genome, different genes, and different species may have different
compression ratios. Such differences may imply, for example, different mutation rates in different genes.
EXISTING COMPRESSION ALGORITHMS

The need for compression arises because the GenBank database already holds approximately
44,575,745,176 bases in 40,604,319 sequence records, and we need algorithms which compress
significantly better than the trivial way of using 2 bits for each character of a four-letter
alphabet. For the four-letter DNA alphabet (A, C, G, T), an average description length of two bits
per base is the largest length needed for a faithful representation. Algorithms specifically designed
for DNA sequences have not managed to achieve an average description length below 1.6 bits per base.
The growing volume of genome sequence data makes DNA databases two or three times bigger annually,
so it becomes very hard to download and maintain the data on a local system.
In modern molecular biology, the genome is the entirety of an organism's hereditary
information. It is encoded either in DNA or, for many types of virus, in RNA (Ridley, M. (2006) Genome).
The genome includes both the genes and the non-coding sequences of the DNA. Algorithms for
compressing DNA sequences, such as GenCompress, Biocompress and Cfact, are available. Their
compression rate is about 1.74 bits per base, i.e., about 78% compression. Hence we present a new
compression algorithm named "GenBit Compress Tool", whose compression rate is below 1.2 bits per
base (best case), 1.727 bits per base (average case) and 2.238 bits per base (worst case), even for
larger genomes (nearly 200,000 characters). Efficient compression may also reveal some biological
functions and aid in phylogenetic tree reconstruction. Each day several thousand nucleotides are
sequenced in the labs, and for subsequent analysis these have to be stored; until now, however, no
compression tools designed specifically for DNA data have come into routine use.
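The quoted figures relate as follows, assuming (the text does not state it explicitly) that the percentage is measured against plain 8-bit character storage:

```python
# Space saved by a DNA compressor relative to raw text storage.
# Assumption: "compression rate" in the text means savings vs. 8-bit ASCII.
def compression_rate(bits_per_base: float, raw_bits: int = 8) -> float:
    """Percentage of space saved relative to raw_bits per symbol."""
    return (1 - bits_per_base / raw_bits) * 100

rate = compression_rate(1.74)  # ≈ 78.25, matching the ~78% quoted above
```

Note that by this measure even the trivial 2-bit packing already "saves" 75%, which is why DNA compressors are instead judged by whether they get below 2 bits per base.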
Most of the compression methods used today, including DNA compression methods, fall into two
categories. The first is the statistical approach, which compresses data by replacing more frequent
symbols with shorter codes. The second is the dictionary-based scheme, which compresses data by
replacing long sequences with short pointers to matching sequences in a dictionary.
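The statistical family can be illustrated with a minimal Huffman coder over DNA symbols; this sketch is ours and stands in for the heavier arithmetic-coding and CTW methods the text discusses:

```python
# Sketch of the statistical approach: frequent symbols get shorter codes.
import heapq
from collections import Counter

def huffman_code(seq: str) -> dict:
    """Build a prefix-free code mapping each symbol to a bit string."""
    # heap entries: (frequency, unique tiebreak index, partial code table)
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(Counter(seq).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
        i += 1
    return heap[0][2]

seq = "aaaaaaacgtcgtaaa"          # skewed toward 'a'
code = huffman_code(seq)
bits = sum(len(code[s]) for s in seq)
# 'a' dominates, so it gets a 1-bit code and bits < 2 * len(seq)
```

On real DNA the four bases are often close to equally frequent, which is exactly why the text notes that Huffman coding compresses DNA poorly: with a near-uniform distribution it cannot beat 2 bits per base.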
Among statistical methods, arithmetic coding and CTW are known to compress DNA data well [4],
while Huffman coding is known to be not very efficient for it. The Burrows-Wheeler transform (BWT,
also called block-sorting compression) (Michael Burrows and David Wheeler) is an algorithm used in
data compression techniques such as bzip2. The LZ77 and LZ78 algorithms are the classic
dictionary-based schemes; in GS Compress, an LZ77 scheme with reverse complements is introduced as
a dictionary-based scheme. E. Rivals et al. [5] proposed another compression algorithm, Cfact, which
searches for the longest exact matching repeat in an entire sequence using a suffix tree data
structure. Sadeh has proposed lossy data compression schemes based on approximate string matching
and proved some asymptotic properties with respect to stationary sources. In spite of their good
compression ratios, arithmetic coding and CTW have disadvantages such as low decompression speed.
OUR TARGET:
We focus our work on the compression of DNA sequences by exploiting exact matches and
complementary palindromes.
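A complementary (reverse-complement) palindrome is a segment that equals the reverse complement of itself, such as the EcoRI site "gaattc". A brute-force detector for the kind of repeat this project exploits might look as follows (our sketch; the project's actual algorithm is not specified here):

```python
# Find complementary palindromes: windows equal to their reverse complement.
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def complementary_palindromes(seq: str, min_len: int = 4):
    """Yield (start, length) of every complementary palindrome in seq.

    Only even lengths are checked: an odd-length window can never equal
    its reverse complement, since no base is its own complement.
    """
    for length in range(min_len, len(seq) + 1, 2):
        for start in range(len(seq) - length + 1):
            window = seq[start:start + length]
            if window == reverse_complement(window):
                yield start, length

hits = list(complementary_palindromes("ttgaattcgg"))
# finds "gaattc" at position 2 and its inner core "aatt" at position 3
```

Once such palindromes are located, a compressor can encode the second half of each as a back-reference to the reverse complement of the first, instead of storing its bases literally.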

Current research is ongoing in:
