2017 IEEE International Congress on Big Data (BigData Congress), 2017
The ubiquity of Big Data has greatly influenced the direction and development of storage technologies. To meet the needs of storing and analyzing Big Data, researchers and administrators have turned to parallel and distributed storage and compute architectures in both industry and science. While the problems of securely and consistently storing and accessing data in large parallel and distributed file systems have been addressed in both research and production systems, indexing and searching large volumes of unstructured data and metadata have largely been overlooked. According to the International Data Corporation, more than 90% of the data in the digital universe is unstructured, emphasizing the importance of developing efficient solutions for querying distributed data. This paper proposes a novel indexing solution, called FusionDex, that significantly improves the performance of querying across distributed file systems. FusionDex leverages state-of-the-art, open-source indexing modules as its building blocks to deliver an integrated system that enables efficient user-specified queries over distributed and unstructured data. FusionDex has been evaluated on a cluster of 64 nodes, and results show that it outperforms existing tools such as Hadoop Grep and Cloudera Search, in some cases by orders of magnitude.
In today’s world, the scientific community is moving towards distributed systems, which play an important role in achieving good performance and scalability. Task scheduling and execution over large-scale distributed systems play an important role in achieving good performance and high system utilization [15]. Most of today’s state-of-the-art job execution systems are centralized architectures, which have inherent limitations such as scalability issues at extreme scales and single points of failure. On the other hand, distributed job management systems are complex and employ non-trivial load balancing algorithms to maintain good utilization. Thus, we propose a distributed task execution framework that provides load balancing inherently, using a distributed message-passing interface that is essentially a distributed queue. CloudKon+ is a distributed task execution framework that can support distributed HPC [12] and MTC [15] scheduling, running millions of tasks on multiple nod...
ACKNOWLEDGEMENT
I want to thank God for making this possible, for seeing me through to this stage of my life, and for the opportunity to successfully complete my thesis in graduate school. I want to thank my advisor, Dr. Ioan Raicu, for the opportunity to be a member of his lab and to work under him as my thesis advisor. I want to thank my committee members, Dr. Boris Glavic and Dr. Kevin Jin, for taking the time to serve on my committee. I also want to thank my lab member, Shiva Kumar, who contributed to the success of my thesis, as well as other students who contributed to various parts of early iterations of this work, namely
Papers by Itua Ijagbone