Thursday, 11 April 2013

Hadoop Subprojects

Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the other subprojects provide complementary services or build on the core to add higher-level abstractions. The Hadoop subprojects include:

Core
A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).

Avro
A data serialization system for efficient, cross-language RPC and persistent data storage. (At the time of this writing, Avro had been created only as a new subproject, and no other Hadoop subprojects were using it yet.)

MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines.
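
The processing model can be illustrated with a small single-machine sketch. This is not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the word-count task is chosen only for illustration. On a real cluster the framework would spread the map and reduce calls across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's
    grouping step between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop runs MapReduce", "MapReduce runs on HDFS"]))
```

Because each map call sees only one line and each reduce call sees only one key, the calls are independent of each other, which is what makes the model suitable for distribution.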

HDFS
A distributed filesystem that runs on large clusters of commodity machines.

Pig
A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
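
The difference between a point query and a batch-style pass over the data can be sketched with a toy in-memory table. This is an illustrative model only, not the HBase client API: the class `ToyTable` and its methods are hypothetical, and the `"family:qualifier"` column naming and the sorted, half-open scan range are assumptions made for the example.

```python
class ToyTable:
    """A toy table mapping row key -> {column name: value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        """Store one cell, creating the row if it does not exist yet."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        """Point query (random read): fetch one cell by row key and column."""
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start=None, stop=None):
        """Batch-style access: yield rows in sorted key order,
        from start (inclusive) up to stop (exclusive)."""
        for key in sorted(self._rows):
            if start is not None and key < start:
                continue
            if stop is not None and key >= stop:
                break
            yield key, dict(self._rows[key])


if __name__ == "__main__":
    table = ToyTable()
    table.put("user2", "info:name", "Bob")
    table.put("user1", "info:name", "Alice")
    print(table.get("user1", "info:name"))
    print(list(table.scan()))
```

A MapReduce job would typically consume something like `scan`, while an interactive lookup would use something like `get`.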

ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
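
To give a feel for what a distributed lock primitive offers, here is a single-process simulation. It does not use ZooKeeper or any of its client libraries: the class `ToyLockService` and its methods are hypothetical, and ordering requests by an increasing sequence number is an assumption about one way such a lock could be arranged, not a description taken from the text above.

```python
import itertools


class ToyLockService:
    """Simulates a coordination service that grants a lock to whichever
    client holds the lowest outstanding sequence number."""

    def __init__(self):
        self._counter = itertools.count()
        self._waiting = {}  # sequence number -> client name

    def request(self, client):
        """Register a lock request and return its sequence number."""
        seq = next(self._counter)
        self._waiting[seq] = client
        return seq

    def holder(self):
        """Return the client that currently holds the lock, or None."""
        if not self._waiting:
            return None
        return self._waiting[min(self._waiting)]

    def release(self, seq):
        """Drop a request, passing the lock to the next client in line."""
        self._waiting.pop(seq, None)


if __name__ == "__main__":
    service = ToyLockService()
    first = service.request("worker-a")
    service.request("worker-b")
    print(service.holder())
    service.release(first)
    print(service.holder())
```

In a real deployment the service itself would be replicated across machines so that the lock survives the failure of any single server, which is the "highly available" part of the description.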

Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates into MapReduce jobs, for querying the data.

Chukwa
A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports. (At the time of this writing, Chukwa had only recently graduated from a “contrib” module in Core to its own subproject.)
