Lecture notes: first steps in Hadoop

In Lecture 6 of our Big Data in 30 hours class, we talk about Hadoop. The purpose of this memo is to summarize the terms and ideas presented.

About Hadoop

    1. Hadoop, by the Apache Software Foundation, is software used to run other software in parallel. It is a distributed batch processing system that comes together with a distributed filesystem. It scales well over commodity hardware and is fault-tolerant. Hence Hadoop’s popularity: it can manage a cluster of cheap computers and harness their CPU cycles so they work together like a supercomputer.
    2. It is easy to get confused by the numerous brands in the Hadoop ecosystem, but if you focus on the basics, it becomes quite simple. Most importantly, Hadoop’s two core packages are MapReduce and HDFS.
    3. HDFS is the distributed file system, and it is widely used. The architecture follows a simple concept: every worker node in your cluster runs a DataNode, with one central NameNode keeping the filesystem metadata and orchestrating the DataNodes (see the example commands after this list).
    4. MapReduce, later augmented with YARN, is a batch scheduling system that runs parallel tasks processing data from HDFS. The architecture, again, is simple: one central ResourceManager manages the NodeManagers (one per worker node). Note: these days, besides MapReduce, there are other ways to process data in an HDFS cluster, such as Hive or Apache Spark.
    5. The basic scenario? A client uploads data files to HDFS and sends a job request to the JobTracker (that is the classic MapReduce terminology; under YARN, the ResourceManager and a per-job ApplicationMaster play this role). The JobTracker splits the job into tasks and schedules each onto one of the TaskTrackers. The TaskTrackers perform their part of the job and store the results back in HDFS. When the job completes, the client is notified that the result can be downloaded.
    6. Other important tools in the ecosystem which you may look at later: YARN is the resource manager and scheduler that MapReduce (and other engines) run on. Hive lets you query data in Hadoop in an SQL-like fashion. Sqoop moves data between relational databases and HDFS, much like an ETL tool. Pig is a data workflow language. HBase is a NoSQL data store similar to BigTable.
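To get a feel for HDFS (point 3 above), the hdfs dfs command exposes familiar file operations. A minimal sketch, assuming Hadoop is on your PATH; the /user/student path and data.csv file are just illustrative:

hdfs dfs -mkdir -p /user/student          # create a directory in HDFS
hdfs dfs -put data.csv /user/student/     # upload a local file
hdfs dfs -ls /user/student                # list the directory
hdfs dfs -get /user/student/data.csv .    # download the file back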

Setting up Hadoop

Hadoop can be set up in one of three modes: Local mode (everything runs in a single JVM), Pseudo-distributed mode (still running on one machine, but with all the bells and whistles normally found in a full installation) and Fully Distributed mode (on a cluster). Use Fully Distributed if you have access to a compute cluster. Use Pseudo-distributed for learning in the absence of such a cluster.

To set up Hadoop in Pseudo-distributed mode on your laptop, use Docker and pull a Hadoop image from Docker Hub. I tested this image with Hadoop 2.7.0 (credits to sequenceiq) and it works well. Here is all you need to do:
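A minimal sketch, assuming the sequenceiq/hadoop-docker image tagged 2.7.0 (the /etc/bootstrap.sh script ships with that image):

docker pull sequenceiq/hadoop-docker:2.7.0
docker run -it sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash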

 

Otherwise, to install Hadoop 3 on a single node manually, you may follow these instructions by Mark Litwintschik.

In our lab we have set up a Fully Distributed Hadoop 3.1.1 installation on 8 nodes.

Manage the active services
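One way to check that the cluster services are up (a sketch, assuming the Hadoop binaries are on your PATH) is to ask the NameNode for a cluster report:

hdfs dfsadmin -report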

Or simply:
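jps

(jps is a standard JDK tool that lists running Java processes; on a healthy pseudo-distributed node you should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager among them.)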

If services are missing, (re)start them. Some useful commands are:
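For example (a sketch, assuming a standard install with $HADOOP_HOME set; the single-daemon syntax shown is the Hadoop 3 form):

$HADOOP_HOME/sbin/start-dfs.sh     # start NameNode, DataNodes and SecondaryNameNode
$HADOOP_HOME/sbin/start-yarn.sh    # start ResourceManager and NodeManagers

hdfs --daemon start datanode       # or (re)start a single daemon, e.g. a DataNode
yarn --daemon start nodemanager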

Run a sample MapReduce job
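The examples jar shipped with Hadoop makes an easy smoke test. A sketch using the pi estimator (adjust the jar path and version to match your install):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 100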

Access online management tools

First, run your standalone install with the following ports published:

docker run -it --publish 50070:50070 --publish 8088:8088 sequenceiq/hadoop-docker /etc/bootstrap.sh -bash

Access the HDFS management console at localhost:50070

Access the MapReduce (YARN) management console at localhost:8088

Understand the cluster config

The cluster configuration defines which nodes are the workers and which node is the master.
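The relevant files live under $HADOOP_HOME/etc/hadoop. A sketch of the key entries (the hostnames and port are illustrative):

etc/hadoop/workers (called "slaves" in Hadoop 2) — one worker hostname per line:
node1
node2

etc/hadoop/core-site.xml — points every node at the NameNode:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master-node:9000</value>
</property>

etc/hadoop/yarn-site.xml — points the NodeManagers at the ResourceManager:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master-node</value>
</property>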
