
On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS

2011, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11)

Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson (CMU); Seung Woo Son, Samuel J. Lang, Robert B. Ross (ANL)

Overview

Internet services [e.g. Hadoop and the Hadoop distributed file system (HDFS)]:
• Distributed file systems purpose-built for anticipated workloads
• Use triplication for reliability
• Use file layout information to collocate computation and data

High performance computing (HPC) [e.g. PVFS]:
• Equally large-scale applications
• Parallel file systems supporting concurrent reads and writes
• Typically support the POSIX and VFS interfaces

Experiment Setup

• OpenCloud cluster: 51 nodes (8-core 2.8 GHz, 16GB DRAM, 4 SATA disks with 1 used in experiments, 10 GE)
• Benchmarks: a data-set of 50 million 100-byte records (50GB); workloads: write, read, grep (for a rare pattern), and sort
• Applications:
  – Sampling (B. Fu): read a 71GB astronomy data-set
  – FoF (B. Fu): cluster and join astronomical objects
  – Twitter (B. Meeder): reformat 24GB of data into 56GB

PVFS Plug-in Under the Hadoop Stack

Java applications run unmodified on the Hadoop/MapReduce framework, which reaches storage through the file system extensions API (org.apache.hadoop.fs.FileSystem). HDFS plugs in through libhdfs; PVFS plugs in through a non-intrusive shim layer that uses libpvfs to talk to the PVFS servers, each of which stores data on its local file system.

PVFS shim responsibilities:
• Readahead buffer: reads from PVFS in 4MB requests (see the sketch after this list)
• File layout: file layout exposed as extended attributes
• Replication: triplicates data within one PVFS file
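To illustrate the readahead bullet above, here is a minimal sketch, not the authors' shim code: small reads issued by Hadoop are served from a 4MB buffer that is refilled with one large PVFS request. RawPvfsReader is a hypothetical wrapper around libpvfs, assumed here only so the buffering logic is self-contained.

import java.io.IOException;

// Hypothetical wrapper around libpvfs: reads up to len bytes at a file offset.
interface RawPvfsReader {
    int pread(long offset, byte[] dst, int dstOff, int len) throws IOException;
}

class ReadaheadBuffer {
    private static final int BUF_SIZE = 4 << 20;  // 4MB, as in the shim
    private final RawPvfsReader pvfs;
    private final byte[] buf = new byte[BUF_SIZE];
    private long bufStart = -1;                   // file offset of buf[0]
    private int bufLen = 0;                       // valid bytes in buf

    ReadaheadBuffer(RawPvfsReader pvfs) { this.pvfs = pvfs; }

    // Serve a small read; on a miss, refill with one 4MB-aligned PVFS request.
    int read(long pos, byte[] dst, int dstOff, int len) throws IOException {
        if (bufStart < 0 || pos < bufStart || pos >= bufStart + bufLen) {
            long alignedStart = (pos / BUF_SIZE) * BUF_SIZE;
            bufLen = pvfs.pread(alignedStart, buf, 0, BUF_SIZE);
            if (bufLen <= 0) { bufStart = -1; return -1; }  // end of file
            bufStart = alignedStart;
        }
        int avail = (int) (bufStart + bufLen - pos);
        if (avail <= 0) return -1;                // read past end of data
        int n = Math.min(len, avail);
        System.arraycopy(buf, (int) (pos - bufStart), dst, dstOff, n);
        return n;
    }
}

The point of the aggregation is that many small, sequential Hadoop reads collapse into one large PVFS request, which is what closes most of the gap with HDFS on the grep benchmark below.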
HDFS/PVFS Data Layout Schemes

• HDFS Random: 1 copy on the writer's disks, 2 copies on random servers
• PVFS Round-robin: 3 copies striped across servers within the file
• PVFS Hybrid: 1 copy on the writer's disks, 2 copies striped

[Figure: block placement maps for the HDFS Random, PVFS Round-robin, and PVFS Hybrid layouts, showing which of servers 0–3 holds each 64MB block (Blk 0–4) of a file written by server 0, at offsets 0M–256M.]

Experiment Results

[Figure: aggregate write throughput (MB/s) vs. number of clients, for N clients writing N 1GB files to TmpFS and to disk, comparing HDFS Random, PVFS Round-Robin, PVFS Hybrid, and PVFS Hybrid (4 streams).]

• HDFS is surprisingly tied to disk performance, even without any explicit sync

[Figure: grep benchmark completion time (seconds) for HDFS, vanilla PVFS, PVFS with a readahead buffer, and PVFS with a readahead buffer and layout information.]

• By using both the readahead buffer and file layout information, PVFS performance is comparable to HDFS

[Figure: completion time (seconds) for the write, read, grep, and sort benchmarks and for the sampling, FoF, and Twitter applications, comparing HDFS Random, PVFS Round-Robin, and PVFS Hybrid.]

• PVFS performance is comparable to HDFS for both Hadoop benchmarks and scientific applications

Acknowledgements: Robert Chansler, Tsz Wo Sze, Nathan Roberts, Bin Fu, and Brendan Meeder

Conclusion

• With a few modifications in a non-intrusive shim layer, PVFS matches HDFS performance for Hadoop applications
• File layout information is essential for Hadoop to collocate computation and data
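To make that last point concrete, here is a minimal sketch, again not the authors' shim, of how a striped layout can be surfaced through Hadoop's FileSystem.getFileBlockLocations() contract: for an assumed round-robin stripe of fixed-size units over a list of servers, compute a BlockLocation per stripe unit so Hadoop can schedule map tasks on the server holding the data. BlockLocation is the real Hadoop class; StripeLayout and its placement arithmetic are assumptions for illustration.

import org.apache.hadoop.fs.BlockLocation;

class StripeLayout {
    // Map the byte range [start, start+len) of a striped file to the
    // servers that hold each stripe unit, round-robin over `servers`
    // (assumes len > 0 and stripeUnit = the file's stripe size, e.g. 64MB).
    static BlockLocation[] blockLocations(String[] servers, long stripeUnit,
                                          long start, long len) {
        int first = (int) (start / stripeUnit);
        int last = (int) ((start + len - 1) / stripeUnit);
        BlockLocation[] locs = new BlockLocation[last - first + 1];
        for (int b = first; b <= last; b++) {
            String host = servers[b % servers.length];  // round-robin placement
            locs[b - first] = new BlockLocation(
                new String[] { host },                  // names
                new String[] { host },                  // hosts
                (long) b * stripeUnit, stripeUnit);     // offset, length
        }
        return locs;
    }
}

With locations like these returned from the shim, Hadoop's scheduler treats a PVFS stripe unit exactly like an HDFS block replica and can place computation next to the data.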