
International Conference on Computing, Communication and Automation (ICCCA2017)

Lossless Data Compression Techniques and Their Performance

Komal Sharma
Department of Computer Science
Amity University
Noida, Uttar Pradesh
[email protected]

Kunal Gupta
Department of Computer Science
Amity University
Noida, Uttar Pradesh
[email protected]

Abstract— This paper presents a study of various lossless data compression techniques and compares their performance and efficiency in terms of time and space. Data compression is a broad field of computer science concerned with reducing the size of data, either by removing redundant characters or by re-encoding the bits of the data. We have used only lossless data compression techniques to compress files of various media formats (txt, doc, bmp, png and wav). Huffman coding generates a variable length code for each symbol, LZW is a dictionary based technique, and Shannon-Fano coding generates binary codes based on the frequency of each symbol. We calculated the mean compression ratio, compression factor and compression time for the various media files; the resulting analysis of these compression techniques helps in selecting the most appropriate method for compressing any given file format in the real world.

Keywords— Lossless data compression; Huffman Algorithm; Lempel-Ziv Coding; Shannon-Fano coding

I. INTRODUCTION

Data compression is the art of producing information in a reduced form. It encodes original files into compressed ones using bit codes, saving both time and space. Generally, data compression techniques find and eliminate redundant data using different methodologies. There are broadly two kinds of data compression:

• Lossless Compression: As the name suggests, lossless compression shrinks the original data without any permanent loss of information, so the original can be reconstructed exactly. It is generally used to transmit digital data. ZIP and GIF are formats which use this kind of technique.

• Lossy Compression: This kind of compression permanently loses some information in the compression process. It is favored when the lost information can easily be predicted or regenerated. MP3 and JPG are audio and image formats which use this kind of technique.

In this paper we discuss and analyse three of the most common lossless data compression techniques: Huffman coding, LZW and Shannon-Fano coding. As a test bed we have taken several file formats: plain text (txt), document (doc), image (bmp, png) and audio (wav). The techniques are compared on the following efficiency factors:

• Compression Ratio: the ratio of the compressed file size to the original file size; it measures to what extent a file has been compressed.

• Compression Factor: the inverse of the compression ratio, i.e. the original file size divided by the compressed file size.

• Compression Time: the time a compressor takes to compress a file.

• Saving Percentage: the percentage by which a file shrinks in size, i.e. (original size − compressed size) / original size × 100.

We have examined and analyzed the above techniques by implementing their algorithms in the C programming language, chosen for its easy and selective use.

II. DATA COMPRESSION ALGORITHMS

Nowadays data transmission over the network has become very time consuming because of the large size of the data being transmitted. To address this problem there are many

ISBN: 978-1-5090-6471-7/17/$31.00 ©2017 IEEE



lossless data compression algorithms, but we have taken only three of them: the algorithms we study are the Huffman algorithm, LZW and the Shannon-Fano algorithm. These algorithms are used to compress large data so that the data can be transmitted efficiently over the network in a given time. The algorithms are discussed below.

A. Huffman Algorithm

Huffman coding is a lossless data compression technique, operating here on ASCII characters, developed by David Huffman in 1951. It is an entropy encoding algorithm: it converts the fixed length codes of the characters into variable length codes built from non-repeating prefix codes, such that no two characters are assigned the same code.

The Huffman algorithm procedure is as follows:

• Parsing Input File: The input file is parsed via a file reader to obtain the distinct ASCII characters present in the file along with their frequencies.

• Arranging the Character List: The character list obtained from the parsed input is arranged in descending order of frequency.

• Constructing the Huffman Tree: Create an isolated tree node for each distinct character, with its frequency as its data. Take the two lowest-frequency nodes and merge them into one common node X whose frequency is the sum of those of the individual nodes. Replace the two nodes in the list with the common node X and readjust the list using insertion sort. Perform this selection and merging repeatedly until only one node is left.

• Assigning Codes: Starting from the single parent node formed in the previous step, traverse the tree recursively, maintaining a character array for storing and assigning bit codes. When traversing right, concatenate 1 to the array; when traversing left, concatenate 0. If a leaf node is encountered, assign the array value to that node.

• Generating Compressed File: Create a new file and write the corresponding variable length codes in place of the characters of the original file.

B. Lempel-Ziv Algorithm

A universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv and Terry Welch in 1984. It is a dictionary based dynamic compression algorithm which reads character substrings and writes their dictionary reference codes to the encoded output. It removes redundancy in the output file by replacing repeated substrings found in the input stream with dictionary references. It is generally favored for compressing image formats such as GIF and BMP.

The LZW algorithm procedure is as follows:

• Parse the input file using a file reader stream and obtain the distinct ASCII characters in it.

• Arrange the ASCII characters in ascending order of their respective ASCII values and assign reference codes to the character list starting from 0.

• Initialize an empty string S and an empty character ch (used to read the file character by character).

• Implement a loop:

    while (not End of File (EOF))
    {
        ch = read character from file;
        if (dictionary contains S+ch)
        {
            S = S+ch;
        }
        else
        {
            write reference code of S to output stream;
            add S+ch to dictionary with the next reference code;
            S = ch;
        }
    }
    write reference code of S to output stream;

• Flush the output stream to an output file.

C. Shannon-Fano Algorithm

Shannon-Fano is a lossless data compression algorithm named after Claude Shannon and Robert Fano. It is also an entropy encoding algorithm, assigning a variable length code to each character from the input stream and writing it to a compressed output stream. Unlike Huffman, it follows a top-down tree construction and code assignment approach with symbol-by-symbol encoding.

The Shannon-Fano algorithm procedure is as follows:

• Parsing Input File: The input file is parsed via a file reader to obtain the distinct ASCII characters (c1, c2 ... ci) present in the file along with their frequencies.

• Arranging the Character List: The character list obtained from the parsed input is arranged in descending order of frequency (f1, f2 ... fi), and the total frequency (F) is calculated.


• Selecting Partition: Initially, consider a parent node containing all distinct characters to partition. From the left, keep adding the fi values symbol by symbol until the running sum first exceeds F/2, and divide the parent node at that point into left and right child nodes. Assign 0 to the left nodal path and 1 to the right nodal path.

• Recursively perform the above step (where the total frequency is that of the respective head node) until all characters are isolated in leaf nodes. When a leaf node is encountered, or no further partition is possible, assign the path code to the character.

• Generating Compressed File: Create a new file and write the corresponding variable length codes in place of the characters of the original file.

III. RELATED WORK

For the last few years much research has been carried out using lossless data compression techniques in every field of computer science and electronics communication. In 1949 and 1951 Shannon-Fano and Huffman respectively developed their algorithms for data compression, and they are still used in many research areas for compressing various media and document files.

The research in [1] compared the Huffman and LZW algorithms for compressing audio, image and text files, using mean compression ratio and compression time as the criteria. The Huffman algorithm was found ideal for compressing plain text files, as it gives a lower mean compression ratio and optimal compression time; for image files the LZW algorithm is preferable to Huffman, as it takes considerably less compression time; for compressing audio files Huffman is better.

The research in [4] studied the use of Huffman coding, minimum-variance Huffman coding and LZW with arithmetic coding in wireless sensor networks, to save energy and time during transmission and processing of data. Minimum-variance Huffman coding compresses more than traditional Huffman coding, and LZW with integrated arithmetic coding gives better compression results than traditional LZW coding in wireless networks.

In [3] the data is preprocessed before being compressed with the deflate algorithm. The work proposes a preprocessing algorithm that increases the average compression ratio of the deflate algorithm by 3.39 %. BPE (Byte Pair Encoding) is the proposed preprocessing algorithm, applied to the files of the Canterbury, Artificial, Miscellaneous, Calgary and Large Text corpuses before compression with deflate. The paper concludes by comparing the plain deflate algorithm with the integrated deflate-BPE algorithm. The method is fast, low cost and compatible with most hardware implementations, as the input file is already parsed into simple tokens.

In [2] a new prediction algorithm is proposed which predicts whether or not a file will compress well with the LZW technique. It also predicts the compression ratio, which helps storage systems decide whether a file should be compressed at all (helping in auto-accommodation of files). This method reduces the compression time by 17.79 %.

For energy saving in wireless sensor networks, a new lossless compression algorithm (S-LEC) is proposed in [5] and compared with the existing LEC and S-LZW algorithms. The method was applied to wireless data sets such as volcano data and humidity data; S-LEC proved more robust and more efficient than both LEC and S-LZW.

IV. EXPERIMENT AND RESULTS

The above mentioned lossless data compression algorithms (Huffman, LZW and Shannon-Fano) have been coded and implemented in the C programming language; the source files compile under both GCC and the Visual C compiler and can be viewed using Notepad or any related text reading software. The C language was chosen for its fast compilation compared with other programming languages, so that large data files can be compressed quickly.

Fig 1. Compression Source Directory


Fig 2. Test Files Directory

The above source files, along with their respective executables, have been saved in a single directory. Test files of the various formats are also kept in this directory. In CMD we must set the current directory path, using the CD command, to this source directory.

Fig 3. Compression Coding Source File (Huffman)

Fig 4. Compression Coding Source File (LZW)

For each algorithm's C source file an executable application (namely huffenc.exe, lzw.exe and shan-f.exe) has been built which can be run directly in a command prompt window (as shown in Fig 5), saving the hassle of repeated compilation and running. Running any executable prompts for an input file, where we must provide a sample test file, and for an output (encoded) filename; it then displays the input file size and the encoded file size along with the compression statistics.

Fig 5. Command Prompt Window


V. COMPARISON AND ANALYSIS

Table 1. Test Bed

Fig 6. Compression Ratio

In the compression ratio analysis above, Huffman and Shannon-Fano show equivalent statistics for text (0.605 & 0.619) and WAV files (0.876 & 0.881), whereas for images (BMP, PNG & GIF) LZW compresses most efficiently (0.878).

Fig 7. Compression Time

In the compression time analysis above, the LZW algorithm compresses the fastest on all test samples, and Shannon-Fano encodes with only a small compression time difference w.r.t. LZW for the txt and audio file formats, whereas Huffman takes the longest to compress the tested files.

Fig 8. Compression Factor

In the compression factor analysis above, Huffman and Shannon-Fano show equivalent statistics for text (1.654 & 1.617) and WAV files (1.153 & 1.145), whereas for images (BMP, PNG & GIF) LZW encodes most efficiently (1.235).

VI. CONCLUSION

From the results and analysis of the above test bed (Table 1) we conclude that for compressing plain text files (TXT) the Huffman compression algorithm is best suited, as it shows the lowest compression ratio and takes optimal compression time. The LZW compression algorithm works well for image file formats (BMP, PNG, GIF), giving the best compression ratio and taking the least compression time.


The Shannon-Fano compression algorithm works optimally for hybrid text file formats (DOC, PDF, DOCX etc.) and audio file formats (WAV etc.), as it achieves a favorable compression ratio without costing much compression time.

VII. REFERENCES

[1] Rhen Anjerome Bedruz and Ana Riza F. Quiros, "Comparison of Huffman and Lempel-Ziv Algorithms for Audio, Image and Text Compression", IEEE Philippine Section, 9-12 December 2015.

[2] Alireza Yazdanpanah and Mahmoud Reza Hashemi, "A New Compression Ratio Prediction Algorithm for Hardware Implementation of LZW Data Compression", IEEE Computer Society, 2010.

[3] Alireza Yazdanpanah and Mahmoud Reza Hashemi, "A Simple Lossless Preprocessing Algorithm for Hardware Implementation of DEFLATE Data", IEEE.

[4] S. Renugadevi and P.S. Nithya Darsini, "Huffman and Lempel-Ziv Based Data Compression Algorithm for Wireless Sensor Networks", IEEE (PRIME), February 21-22, 2013.

[5] Yao Liang and Yimei Li, "An Efficient and Robust Data Compression Algorithm in Wireless Sensor Networks", IEEE Communications Letters, Vol. 18, No. 3, March 2014.

[6] K.A. Ramya and M. Pushpa, "Comparative Study on Different Lossless Data Compression Methods", IJSEAS, Vol. 2, Issue 1, January 2016.

[7] Kashfia Sailunaz, Mohammed Rokibul Alam Kotwal and Mohammed Nurul Huda, "Data Compression Considering Text Files", International Journal of Computer Applications, Vol. 90, No. 11, March 2014.

[8] P. Yellamma and Dr. Narasimham Challa, "Performance Analysis of Different Data Compression Techniques on Text Files", IJERT, Vol. 1, Issue 8, October 2012.

[9] Anmol Jyot Mann, "Analysis and Comparison of Algorithms for Lossless Data Compression", IJICT, Volume 3 (2013), pp. 139-146.

[10] Shruti Porwal, Yashi Chaudhary, Jitendra Joshi and Manish Jain, "Data Compression Methodologies for Lossless Data and Comparison between Algorithms", IJESIT, Volume 2, Issue 2, March 2013.

[11] Arup Kumar Bhattacharjee, Tanumon Bej and Saheb Agarwal, "Comparison Study of Lossless Data Compression Algorithms for Text Data", IOSR-JCE, Volume 11, Issue 6 (May-June 2013), pp. 15-19.
