Heterogeneous Log File Analyzer System Using Hadoop MapReduce Framework
Abstract
There are various applications which maintain huge databases. All of these databases maintain log files that keep records of database changes, which can include tracking various user events. Apache Hadoop can be used for log processing at scale. Log files have become a standard part of large applications and are essential in operating systems, computer networks and distributed systems. Log files are often the only way to identify and locate an error in software, because log file analysis is not affected by any time-based issues known as the probe effect. This is in contrast to the analysis of a running program, where the analytical process can interfere with time-critical or resource-critical conditions within the analyzed program. Log files are often very large and can have a complex structure. Although the process of generating log files is quite simple and straightforward, log file analysis can be a tremendous task that requires enormous computational resources, long time and sophisticated procedures. This often leads to a common situation in which log files are continuously generated and occupy valuable space on storage devices, but nobody uses them or exploits the information they contain. The overall goal of this project is to design a generic log analyzer using the Hadoop MapReduce framework. This generic log analyzer can analyze different kinds of log files, such as email logs, web logs, firewall logs, server logs and call data logs.
Keywords: Hadoop, MapReduce framework, Log files, Log analyzer, Heterogeneous database
1. INTRODUCTION
Current software applications often produce (or can be configured to produce) auxiliary text files known as log files. Such files are used during various stages of software development, mainly for debugging and profiling purposes. Use of log files helps testing by making debugging easier: it allows one to follow the logic of the program, at a high level, without having to run it in debug mode. Nowadays, log files are commonly used at customer installations for the purpose of permanent software monitoring and/or fine-tuning. Log files have become a standard part of large applications and are essential in operating systems, computer networks and distributed systems. Log files are often the only way to identify and locate an error in software, because log file analysis is not affected by any time-based issues known as the probe effect. This is in contrast to the analysis of a running program, where the analytical process can interfere with time-critical or resource-critical conditions within the analyzed program. Log files are often very large and can have a complex structure. Although the process of generating log files is quite simple and straightforward, log file analysis can be a tremendous task that requires enormous computational resources, long time and sophisticated procedures. This often leads to a common situation in which log files are continuously generated and occupy valuable space on storage devices, but nobody uses them or exploits the information they contain. The overall goal of this project is to design a generic log analyzer using the Hadoop MapReduce framework. This generic log analyzer can analyze different kinds of log files, such as email logs, web logs, firewall logs, server logs and call data logs. There are various applications (known as log file analyzers or log file visualization tools) that can digest a log file of a specific vendor or structure and produce easily human-readable summary reports. Such tools are undoubtedly useful, but their usage is limited only to log files of a certain structure. Although such products have configuration options, they can answer only built-in questions and create built-in reports. There is therefore a case for research in the field of log file analysis and for designing an open, very flexible, modular tool that would be capable of analyzing almost any log file and answering any question, including very complex ones. Such an analyzer should be programmable, extendable, efficient (because of the volume of log files) and easy to use for end users. It should not be limited to analyzing only log files of a specific structure or type, and the type of question should not be restricted either.
2. LITERATURE SURVEY
In the past decades, surprisingly little attention was paid to the problem of getting useful information out of log files. There seem to be two main streams of research. The first one concentrates on validating program runs by checking the conformity of log files to a state machine: records in a log file are interpreted as transitions of a given state machine, and if some illegal transition occurs, then there is certainly a problem, either in the software under test, in the state machine specification, or in the testing software itself (a minimal sketch of this idea follows below). The second branch of research is represented by articles that describe various ways of producing statistical output. The uses of log files found in the literature thus range from validation of program runs to the production of various statistical reports.
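As an illustration of the first stream of research, a log file can be checked against a state machine roughly as in the following minimal sketch; the states, allowed transitions and log line format used here are purely hypothetical and are not taken from the cited work.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Checks that the sequence of events in a log file is a legal run of a given state machine. */
public class LogStateMachineChecker {
    public static void main(String[] args) throws Exception {
        // Hypothetical state machine: each event may only follow certain states.
        Map<String, Set<String>> allowedNext = new HashMap<String, Set<String>>();
        allowedNext.put("START",  new HashSet<String>(Arrays.asList("LOGIN")));
        allowedNext.put("LOGIN",  new HashSet<String>(Arrays.asList("QUERY", "LOGOUT")));
        allowedNext.put("QUERY",  new HashSet<String>(Arrays.asList("QUERY", "LOGOUT")));
        allowedNext.put("LOGOUT", new HashSet<String>(Arrays.asList("LOGIN")));

        String state = "START";
        int lineNo = 0;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            lineNo++;
            // Assumed log format: "<timestamp> <event> ..."; the event is the second field.
            String[] fields = line.split(" ");
            if (fields.length < 2) continue;
            String event = fields[1];
            Set<String> allowed = allowedNext.containsKey(state)
                    ? allowedNext.get(state) : Collections.<String>emptySet();
            if (!allowed.contains(event)) {
                System.out.println("Illegal transition " + state + " -> " + event
                        + " at line " + lineNo);
            }
            state = event;
        }
        in.close();
    }
}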
Figure 1: Virtual Database System - General structure

4) Wrapper: A wrapper provides the connectivity to the physical data source and a way to natively issue commands and gather results. A wrapper can be an RDBMS data source, a web service, a text file, a connection to a mainframe, etc.

5) Metadata: Metadata is data that describes a specific item of content and where it is located. Metadata captures important information about the enterprise environment, data and business logic in order to accelerate development, drive integration procedures and improve integration efficiency. All technical, operational and business metadata is captured in real time in a single open repository, which ensures that the metadata is always up to date, accurate, complete and available.

Figure 1 shows the extraction of data from heterogeneous data sources using a VDB. The most common interface to a VDB is that of a relational database management system, effected through methods such as Open Database Connectivity (ODBC), the Structured Query Language (SQL) and the relational database model. However, the engine can also be implemented with an eXtensible Markup Language (XML) interface. A VDB can be accessed through JDBC-SQL, SOAP (web services), SOAP-SQL or XQuery; a minimal JDBC access sketch is given after this paragraph. This project uses XML for maintaining the metadata. XML metadata containing all process, map and schema designs, integrated with a single, powerful integration engine, allows tremendous flexibility and scalability.
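As a sketch of how such a virtual database could be queried over JDBC-SQL, the snippet below assumes a hypothetical JDBC connection URL for the VDB engine and a virtual table named firewall_log defined in the XML metadata and mapped by a wrapper onto a physical log source; both names are illustrative only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VdbJdbcExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical VDB connection URL; a real deployment would use the
        // driver class and URL supplied by the virtual database engine in use.
        String url = "jdbc:vdb://localhost:1234/logsources";

        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement();
             // "firewall_log" is an assumed virtual table defined in the XML metadata.
             ResultSet rs = st.executeQuery(
                     "SELECT src_ip, COUNT(*) AS hits FROM firewall_log GROUP BY src_ip")) {
            while (rs.next()) {
                System.out.println(rs.getString("src_ip") + "\t" + rs.getLong("hits"));
            }
        }
    }
}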
Figure 2: HDFS Architecture

Map-Reduce

Hadoop Map-Reduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. A Map-Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
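For a simple log analysis, such as counting requests per client IP address in a web server log, the map and reduce functions could look like the following sketch. The assumption that the client IP is the first space-separated field of each log line is ours; a real analyzer would plug in a parser for each supported log format.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Emits (client IP, 1) for every line of a space-separated web server log. */
public class IpCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text ip = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(" ");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            ip.set(fields[0]);            // assumed: first field is the client IP
            context.write(ip, ONE);
        }
    }
}

/** Sums the counts produced by the mappers for each IP address. */
class IpCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}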
Figure 3: Map-Reduce Data Flow

Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks. Typically the compute nodes and the storage nodes are the same, that is, the Map-Reduce framework and the distributed file system run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster. The Map-Reduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks; the slaves execute the tasks as directed by the master. Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration (a minimal driver is sketched below). The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.
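A minimal job driver corresponding to this description might look as follows; it wires the IpCountMapper and IpCountReducer from the earlier sketch into a job configuration and submits it, taking the HDFS input and output paths from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogAnalyzerDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "ip count");           // job name is arbitrary

        job.setJarByClass(LogAnalyzerDriver.class);
        job.setMapperClass(IpCountMapper.class);
        job.setCombinerClass(IpCountReducer.class);    // local aggregation before the shuffle
        job.setReducerClass(IpCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS, supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}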
5. ADVANTAGES
The main concept of data integration is to combine data from different resources and to provide users with a unified view of these data. Existing work mainly focuses on homogeneous data resources stored in key-value form. This system therefore uses the MapReduce technique for the integration of heterogeneous data resources such as databases or file systems; a sketch of this approach is given below.
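One way such integration can be expressed with MapReduce is to attach a different mapper to each heterogeneous source and have all mappers emit records in a common key-value form; the sketch below uses Hadoop's MultipleInputs with two hypothetical sources (a web log and a firewall log) whose field layouts are assumptions, not part of the system described above.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HeterogeneousLogDriver {

    /** Hypothetical web log: the client IP is assumed to be the first field. */
    static class WebLogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(" ");
            if (f.length > 0) ctx.write(new Text(f[0]), new IntWritable(1));
        }
    }

    /** Hypothetical firewall log: the source IP is assumed to be the third field. */
    static class FirewallLogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(" ");
            if (f.length > 2) ctx.write(new Text(f[2]), new IntWritable(1));
        }
    }

    /** Aggregates the unified (IP, count) records coming from all sources. */
    static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "heterogeneous log integration");
        job.setJarByClass(HeterogeneousLogDriver.class);

        // Each source format gets its own mapper; both normalize to (IP, 1) records.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, WebLogMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, FirewallLogMapper.class);

        // A single reducer aggregates the unified key-value records from all sources.
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}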
6. CONCLUSION
Many systems based on these models have been developed to deal with various real problems of data integration, such as collecting information from different web sites, integrating spatially related environmental data, and aggregating information from multiple enterprise systems. However, all of those systems work on only a single type of log file, so, with the help of the MapReduce framework, this system will be able to analyze many types of log files. Due to the use of the Hadoop framework, the efficiency of log analysis is improved. If a new standard log file format is introduced, it will be easy to extend our project to analyze that log file as well. Our project can also be implemented on Windows so that novice users find it easy to use.
References
[1] J. H. Andrews, "Theory and Practice of Log File Analysis," Technical Report, University of Western Ontario.
[2] J. Valdman, "Log File Analysis," Technical Report.
[3] T. White, Hadoop: The Definitive Guide, O'Reilly.
[4] http://hadoop.apache.org
[5] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
[6] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
[7] L. Taylor, "Read Your Firewall Logs," 10 July 2001. URL: http://techupdate.zdnet.com/techupdate/stories/main/0,14179,2782699,00.html (accessed 29 Feb. 2002).
[8] B. J. Jansen, "The Methodology of Search Log Analysis," Pennsylvania State University, USA.