hadoop Tutorial => Hadoop overview and HDFS

Example

Hadoop is an open-source software framework for storage and large-scale processing of data-sets in a distributed computing environment. It is sponsored by Apache Software Foundation. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. History

Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
It was originally developed to support distribution for the search engine project.

Major modules of hadoop

Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data. Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters. Hadoop File System Basic Features

Highly fault-tolerant. High throughput. Suitable for applications with large data sets. Can be built out of commodity hardware. Namenode and Datanodes

Master/slave architecture. HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients. The DataNodes manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks and set of blocks are stored in DataNodes. DataNodes: serves read, write requests, performs block creation, deletion, and replication upon instruction from Namenode.

HDFS is designed to store very large files across machines in a large cluster. Each file is a sequence of blocks. All blocks in the file except the last are of the same size. Blocks are replicated for fault tolerance. The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster. BlockReport contains all the blocks on a Datanode. Hadoop Shell Commands

hadoop fs –ls Path

Cat

hadoop fs -cat PathOfFileToView