Introduction to Hadoop

Hadoop is an Apache project that provides a framework for running applications that process large amounts of data (hundreds of terabytes) on large clusters (thousands of nodes) of commodity hardware. The framework transparently provides applications with both reliability and data motion. Hadoop implements a distributed file system, similar to the Google File System (GFS), along with the MapReduce programming model. This presentation covers the motivation and approach behind Hadoop, an overview of its components and architecture, and a survey of the tools built on top of Hadoop, such as Pig and Hive.
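To make the MapReduce model concrete, here is a minimal local sketch of its three phases (map, shuffle, reduce) applied to word counting, the canonical example. This is plain Python illustrating the programming model only; it is not Hadoop's actual Java API, and the function names are chosen for illustration.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the list of values for each key.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(shuffle(map_phase("the quick fox the fox")))
print(counts)  # {'the': 2, 'quick': 1, 'fox': 2}
```

In a real Hadoop job, the map and reduce functions run in parallel across the cluster, and the framework handles the shuffle, fault tolerance, and data locality transparently.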