7

Has anyone tried using GlusterFS or Ceph as the backend for Hadoop? I am not talking about just using a plugin to stitch things together. Is the performance better than HDFS itself? Is it OK for production use?

Also, is it really a good idea to merge object storage and Hadoop HDFS storage into a single storage system, or is it better to keep them separate?

2 Answers

8

I have tried Ceph as a "drop-in" HDFS replacement in Hadoop 2.7 and, after solving many integration issues, found it two to three times slower than HDFS with the default replication factor in the TeraSort benchmark. I don't know the reason for this. Other people tried a different approach with a similar result:

http://www.snia.org/sites/default/files/SDC15_presentations/cloud_files/YuanZhou_big_data_analytics_on_object_store_r3.pdf

Is it a good idea to combine object and HDFS storage? I think the question is not framed correctly. Both HDFS (via Ozone and FUSE) and Ceph can be used as object storage and as regular POSIX filesystems. Ceph has an edge in that it offers block storage as well, while for HDFS this is currently under discussion: https://issues.apache.org/jira/browse/HDFS-11118

If the question is "can I expose my storage as a POSIX FS, an object store, and a block store at the same time?", then the answer is: if your design satisfies your requirements for scalability and high availability, it could actually be a great idea.

8

I have used GlusterFS before. It has some nice features, but in the end I chose HDFS as the distributed file system for Hadoop.

The nice thing about GlusterFS is that it does not require a master-client architecture. Every node in the cluster is equal, so there is no single point of failure. Another thing I find interesting is the glusterfs-client module (http://www.jamescoyle.net/how-to/439-mount-a-glusterfs-volume): when you want to store a file in GlusterFS, you don't need to work with the GlusterFS APIs. You just copy the file to the volume mounted through glusterfs-client and the job is done.
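
To illustrate, here is a minimal sketch of that workflow. It assumes the volume is already mounted somewhere like `/mnt/glusterfs` (a hypothetical mount point), and the function name `store_on_mounted_volume` is made up for this example. It uses only the Python standard library, since a mounted volume is written to like an ordinary directory:

```python
import shutil
from pathlib import Path


def store_on_mounted_volume(local_file: str, mount_point: str) -> Path:
    """Copy local_file into mount_point and return the destination path.

    mount_point is assumed to be a directory where a GlusterFS volume
    has already been mounted (for example via glusterfs-client).
    """
    mount = Path(mount_point)
    if not mount.is_dir():
        # Fail early instead of silently writing to a missing directory.
        raise FileNotFoundError(f"mount point not found: {mount}")

    destination = mount / Path(local_file).name
    # A plain file copy; no GlusterFS-specific API is involved.
    shutil.copy(local_file, destination)
    return destination
```

A call such as `store_on_mounted_volume("report.csv", "/mnt/glusterfs")` would then place the file on the volume, provided the mount exists and is writable.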

But I find that GlusterFS is hard to integrate with Hadoop ecosystem components such as Spark, MapReduce, etc., whereas HDFS is supported by almost every component in the Hadoop ecosystem. I think GlusterFS is a good choice for building a file storage cluster that is independent of Hadoop.
