Apache Ozone: A High Performance Object Store for analytics workloads

Rakesh Radhakrishnan, Mukul Kumar Singh

English Session 2021-08-07 14:50 GMT+8 (ROOM : A) #bigdata

Apache Ozone is a robust, distributed key-value object store for Hadoop with a layered architecture and strong consistency. It provides object store semantics (like Amazon S3) and can handle billions of objects. Apache Ozone recently implemented fast, atomic rename and delete operations with O(1) complexity. This optimization lowers job latency, which in turn lowers total cost of ownership (TCO) for analytics workloads. Most big data analytics tools, such as Apache Hive and Apache Spark, write their output to a temporary location and rename that location at the end of the job to make it publicly visible. In object stores such as Amazon S3, rename is not a native operation; it is implemented as a costly copy followed by a delete, so the rename can often take longer than the analytics process itself. Job committers therefore need atomic rename, both for performance and for consistent listing operations. This talk is a deep dive into the Apache Ozone architecture, describing the atomic rename and delete implementation that greatly boosts analytics job performance. We will walk through benchmark results that show a consistent performance gain across various analytics workloads. Finally, we will discuss a future roadmap that leverages this new design to achieve efficient lock management for namespace operations by avoiding global locks.
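
The cost difference described above can be illustrated with a small sketch. It is not Ozone's implementation; the abstract does not specify the mechanism. It assumes, purely for illustration, that a flat store keys every object by its full path, while a hierarchical store keys each entry by (parent directory ID, name). The class names `FlatKeyStore` and `DirectoryTreeStore` are hypothetical.

```python
class FlatKeyStore:
    """Flat namespace: every object is stored under its full path."""

    def __init__(self):
        self.objects = {}  # full key -> data

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst):
        """Rename a 'directory' by copying and then deleting each key under it.

        Returns the number of keys touched, which grows with directory size.
        """
        moved = [k for k in self.objects if k.startswith(src + "/")]
        for k in moved:
            self.objects[dst + k[len(src):]] = self.objects[k]  # copy
        for k in moved:
            del self.objects[k]  # delete
        return len(moved)


class DirectoryTreeStore:
    """Hierarchical namespace (assumed layout): entries keyed by (parent_id, name)."""

    ROOT = 0

    def __init__(self):
        self.entries = {}  # (parent_id, name) -> ("dir", dir_id) or ("file", data)
        self.next_id = 1

    def mkdir(self, parent_id, name):
        dir_id = self.next_id
        self.next_id += 1
        self.entries[(parent_id, name)] = ("dir", dir_id)
        return dir_id

    def put(self, parent_id, name, data):
        self.entries[(parent_id, name)] = ("file", data)

    def rename(self, parent_id, old_name, new_name):
        """Rename one entry. Children refer to the directory by ID, so they
        are untouched. Returns the number of entries touched (always 1)."""
        self.entries[(parent_id, new_name)] = self.entries.pop((parent_id, old_name))
        return 1


if __name__ == "__main__":
    flat = FlatKeyStore()
    tree = DirectoryTreeStore()
    tmp_id = tree.mkdir(DirectoryTreeStore.ROOT, "tmp")
    for i in range(1000):
        flat.put(f"tmp/part-{i}", b"x")
        tree.put(tmp_id, f"part-{i}", b"x")

    print("flat store keys touched:", flat.rename_prefix("tmp", "out"))
    print("tree store entries touched:",
          tree.rename(DirectoryTreeStore.ROOT, "tmp", "out"))
```

In the flat layout the work grows with the number of objects under the renamed path, whereas in the assumed hierarchical layout a single entry changes, which is also what makes the operation easy to perform atomically.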

Speakers:

Rakesh Radhakrishnan: Rakesh Radhakrishnan is a committer and PMC member of the Apache Hadoop, Apache ZooKeeper, and Apache BookKeeper projects, focusing primarily on open source big data technologies. Rakesh currently works at Cloudera and actively contributes to the Apache Ozone project. He has more than 14 years of experience in the design and development of large-scale distributed software platforms. Prior to joining Cloudera, he worked as a Big Data Software Engineer at Intel Corporation.

Mukul Kumar Singh: Mukul currently works at Cloudera, where he leads the Storage team working on both Apache Ozone and Apache HDFS. He has worked on storage systems and file systems for 12 years in various roles: open source contributor, Apache PMC member, researcher, and software developer. He previously worked at NetApp and Nimble Storage, on the WAFL and CASL file systems respectively. He graduated from Carnegie Mellon University, where his thesis was on a file system for shingled magnetic recording (SMR) disks.