The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Version | Release Date
---|---
3.0.0-alpha1 | 2016-08-30
2.7.3 | 2016-01-25
2.6.4 | 2016-02-11
2.7.2 | 2016-01-25
2.6.3 | 2015-12-17
2.6.2 | 2015-10-28
2.7.1 | 2015-07-06
Hadoop shell command reference: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
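For example, a few commonly used FileSystem shell commands (the paths and user name below are only illustrative):

hdfs dfs -ls /                                   # list the HDFS root directory
hdfs dfs -mkdir -p /user/hduser001               # create a directory in HDFS
hdfs dfs -put localfile.txt /user/hduser001/     # copy a local file into HDFS
hdfs dfs -cat /user/hduser001/localfile.txt      # print a file stored in HDFS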
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser001
su - hduser001
ssh-keygen -t rsa -P ""
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
Note: If you get the error [bash: .ssh/authorized_keys: No such file or directory] while writing the authorized key, the ~/.ssh directory does not exist yet; create it first as shown below.
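A minimal fix, assuming the error is caused by the missing ~/.ssh directory:

mkdir -p ~/.ssh                                  # create the directory if it does not exist
chmod 700 ~/.ssh                                 # SSH requires restrictive permissions on ~/.ssh
cat .ssh/id_rsa.pub >> .ssh/authorized_keys      # retry writing the authorized key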
sudo adduser hduser001 sudo
sudo add-apt-repository ppa:hadoop-ubuntu/stable
sudo apt-get install hadoop
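After the package installs, you can confirm that the Hadoop binaries are on the PATH (the version printed depends on the package):

hadoop version    # prints the installed Hadoop version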
Pseudo-Distributed Cluster Setup Procedure
Prerequisites
Install JDK 1.7 and set the JAVA_HOME environment variable.
Create a new user named "hadoop".
useradd hadoop
Set up password-less SSH login to the hadoop user's own account:
su - hadoop
ssh-keygen
<< Press ENTER for all prompts >>
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
Verify by running ssh localhost (it should log in without prompting for a password).
Disable IPv6 by adding the following lines to /etc/sysctl.conf:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Verify with cat /proc/sys/net/ipv6/conf/all/disable_ipv6
(it should return 1)
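The settings take effect after a reboot; to apply them immediately, reload the configuration:

sudo sysctl -p    # re-read /etc/sysctl.conf and apply the settings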
Installation and Configuration:
Download the required version of Hadoop from the Apache archives using the wget command.
cd /opt/hadoop/
wget http://addresstoarchive/hadoop-2.x.x/xxxxx.gz
tar -xvf hadoop-2.x.x.gz
mv hadoop-2.x.x hadoop      # rename the extracted directory
(or)
ln -s hadoop-2.x.x hadoop   # or create a symlink to it instead
chown -R hadoop:hadoop hadoop
Update .bashrc or .kshrc (depending on your shell) with the environment variables below:
export HADOOP_PREFIX=/opt/hadoop/hadoop
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export JAVA_HOME=/java/home/path
export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin:$JAVA_HOME/bin
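After updating the file, reload it in the current shell and confirm the Hadoop binaries are resolvable (use ~/.kshrc instead if that is your shell):

source ~/.bashrc    # reload the environment variables
hadoop version      # should print the version of the Hadoop you just unpacked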
In the $HADOOP_CONF_DIR directory ($HADOOP_PREFIX/etc/hadoop), edit the files below.
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:8020</value>
</property>
</configuration>
mapred-site.xml
Create mapred-site.xml from its template:
cp mapred-site.xml.template mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hdfs/datanode</value>
</property>
</configuration>
Create the parent folder to store the Hadoop data:
mkdir -p /home/hadoop/hdfs
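Optionally, pre-create the NameNode and DataNode directories configured in hdfs-site.xml (the format and daemon start-up steps would otherwise create them on demand):

mkdir -p /home/hadoop/hdfs/namenode /home/hadoop/hdfs/datanode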
Format NameNode (cleans up the directory and creates necessary meta files)
hdfs namenode -format
Start all services:
start-dfs.sh && start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
Alternatively, use start-all.sh (deprecated).
Check all running Java processes:
jps
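In a healthy pseudo-distributed setup, jps should list roughly the following daemons (process IDs will differ):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
JobHistoryServer
Jps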
NameNode Web Interface: http://localhost:50070/
ResourceManager Web Interface: http://localhost:8088/
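To smoke-test the cluster, you can run a few HDFS commands (the paths below are only examples):

hdfs dfs -mkdir -p /user/hadoop          # create a home directory in HDFS
hdfs dfs -put /etc/hosts /user/hadoop    # copy a local file into HDFS
hdfs dfs -ls /user/hadoop                # list it back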
To stop the daemons (services):
stop-dfs.sh && stop-yarn.sh
mr-jobhistory-daemon.sh stop historyserver
Alternatively, use stop-all.sh (deprecated).