Lecture Notes: Hadoop HDFS orientation

In Lecture 6 of the Big Data in 30 hours class we cover HDFS. The purpose of this memo is to give participants a quick reference to the material covered.

HDFS user interface

HDFS is a distributed file system. Its interface provides a file system abstraction similar to Linux, with commands such as ls and mkdir that are run through hdfs dfs (for example, hdfs dfs -ls /).
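
As a minimal sketch of these commands, the snippet below creates a directory, uploads a file and lists the result. The directory /user/demo/lecture6 and the file notes.txt are hypothetical names chosen for illustration, and the guard only exists so the snippet does nothing on a machine without a Hadoop client.

```shell
# Hypothetical names, used only for illustration.
HDFS_DIR=/user/demo/lecture6
LOCAL_FILE=notes.txt

# Run the commands only if a Hadoop client is on the PATH.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p "$HDFS_DIR"            # create the directory and any missing parents
  hdfs dfs -put "$LOCAL_FILE" "$HDFS_DIR/"  # copy a local file into HDFS
  hdfs dfs -ls "$HDFS_DIR"                  # list the directory contents
else
  echo "hdfs client not found; commands not run" >&2
fi
```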

To copy a file out of HDFS to the local file system, use dfs -get.
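
A short sketch of the download direction, assuming a file already exists at the hypothetical HDFS path below; the local target name is likewise an assumption.

```shell
# Hypothetical source and destination, used only for illustration.
HDFS_FILE=/user/demo/lecture6/notes.txt
LOCAL_COPY=./notes_copy.txt

if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -get "$HDFS_FILE" "$LOCAL_COPY"  # copy from HDFS to the local file system
  ls -l "$LOCAL_COPY"                       # confirm the local copy exists
else
  echo "hdfs client not found; commands not run" >&2
fi
```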

HDFS under the hood

Under the hood, however, HDFS performs distributed file management and replicates files across nodes. Use the dfsadmin tool (hdfs dfsadmin -report) for an overview of the replicated data.
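
A sketch of the report call. The report can be long on a cluster with many nodes, so this version shows only the first lines; the limit of 20 is an arbitrary assumption, and the command may require HDFS superuser rights depending on the cluster setup.

```shell
# Arbitrary number of lines to show, chosen only for illustration.
REPORT_LINES=20

if command -v hdfs >/dev/null 2>&1; then
  # Print cluster-wide capacity and usage, followed by per-DataNode details.
  hdfs dfsadmin -report | head -n "$REPORT_LINES"
else
  echo "hdfs client not found; commands not run" >&2
fi
```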

The replication factor, default 3

The replication factor tells you how many copies of a file's blocks HDFS keeps. To set the replication factor for one particular file, use dfs -setrep. For instance, a file that is currently replicated twice can be changed so that it is replicated three times. The replication factor of each file is shown in the second column of the dfs -ls listing.
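
The change described above can be sketched as follows. The target path is a hypothetical example, and the -w flag, which makes the command wait until the new replication level is reached, is optional.

```shell
# Hypothetical file and target replication, used only for illustration.
TARGET=/user/demo/lecture6/notes.txt
NEW_REPLICATION=3

if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -ls "$TARGET"                             # column 2 shows the current replication factor
  hdfs dfs -setrep -w "$NEW_REPLICATION" "$TARGET"   # set the factor and wait for replication to finish
  hdfs dfs -ls "$TARGET"                             # column 2 should now show the new factor
else
  echo "hdfs client not found; commands not run" >&2
fi
```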

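Similarly, we can change the replication factor for all files currently present in the system by applying dfs -setrep to the root directory; when given a directory, the command changes every file underneath it.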
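
A sketch of the system-wide change, assuming a target factor of 3. The -w flag is left out here on purpose, since waiting for an entire file system to re-replicate can take a very long time.

```shell
# Assumed target replication for every existing file.
NEW_REPLICATION=3
ROOT_PATH=/

if command -v hdfs >/dev/null 2>&1; then
  # Applied to a directory, setrep changes all files in the tree below it.
  hdfs dfs -setrep "$NEW_REPLICATION" "$ROOT_PATH"
else
  echo "hdfs client not found; commands not run" >&2
fi
```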

To set the default replication factor for future files, set the dfs.replication property in the config file (hdfs-site.xml).
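
A minimal config fragment for this setting might look as follows. The value 2 is only an example; the property goes inside the configuration element of hdfs-site.xml and affects files created after the change, while existing files keep their current factor.

```xml
<!-- hdfs-site.xml: default replication factor for newly created files -->
<property>
  <name>dfs.replication</name>
  <!-- example value; the stock default is 3 -->
  <value>2</value>
</property>
```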

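The block size, default 64 MB

The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later. To change it, set the dfs.blocksize property (dfs.block.size in older releases) in the same config file.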
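
As a sketch, the fragment below sets the block size to 128 MB, expressed in bytes (128 × 1024 × 1024 = 134217728). The value is an example only, and the new size applies to files written after the change.

```xml
<!-- hdfs-site.xml: block size for newly written files, in bytes -->
<property>
  <name>dfs.blocksize</name>
  <!-- example value: 128 MB -->
  <value>134217728</value>
</property>
```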

Check system health

The hdfs fsck command helps determine the health of the file system, reporting problems such as missing, corrupt or under-replicated blocks.
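
A sketch of a health check over the whole file system. The path / is an assumption; any HDFS path can be checked instead, and the extra flags can be dropped for a shorter summary.

```shell
# Path to check; the root is assumed here to cover the whole file system.
FSCK_PATH=/

if command -v hdfs >/dev/null 2>&1; then
  # -files lists the checked files, -blocks adds their block reports,
  # -locations adds the DataNodes holding each block.
  hdfs fsck "$FSCK_PATH" -files -blocks -locations
else
  echo "hdfs client not found; commands not run" >&2
fi
```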

HDFS becomes very useful as the default storage platform for distributed processing frameworks such as MapReduce or Apache Spark. To read more:

HDFS architecture guide

HDFS commands reference
