In Lecture 6 of the Big Data in 30 hours class we cover HDFS. The purpose of this memo is to provide participants with a quick reference to the material covered.

HDFS user interface

HDFS is a distributed file system. Its interface provides a filesystem abstraction similar to Linux, with commands like ls, mkdir, etc. Example:

hadoop@ubuntu:~$ hadoop fs -ls
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-11-17 21:42 wordcount
hadoop@ubuntu:~$ hadoop fs -ls wordcount
Found 3 items
-rw-r--r--   2 hadoop supergroup         23 2018-11-17 21:41 wordcount/file1
-rw-r--r--   2 hadoop supergroup         25 2018-11-17 21:42 wordcount/file2
drwxr-xr-x   - hadoop supergroup          0 2018-11-17 21:37 wordcount/input
hadoop@ubuntu:~$ hadoop fs -mkdir -p /user/hadoop/pp/input

To get a file out of HDFS:

hadoop@ubuntu:~$ hadoop fs -cat wordcount/file2
Hello Hadoop Bye Hadoop
hadoop@ubuntu:~$ hadoop fs -get wordcount/file2 localfile2
hadoop@ubuntu:~$ cat localfile2
Hello Hadoop Bye Hadoop
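
To copy a file the other way, from local disk into HDFS, use -put. A minimal sketch, reusing localfile2 and the directory created with -mkdir above:

hadoop@ubuntu:~$ hadoop fs -put localfile2 /user/hadoop/pp/input
hadoop@ubuntu:~$ hadoop fs -ls /user/hadoop/pp/input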

HDFS under the hood

Under the hood, however, HDFS performs distributed file management and replicates files among the nodes. Use the dfsadmin tool for an overview of the replicated data:

hadoop@ubuntu:~$ hdfs dfsadmin -report
Configured Capacity: 359335223296 (334.66 GB)
Present Capacity: 274261630976 (255.43 GB)
DFS Remaining: 274261311488 (255.43 GB)
DFS Used: 319488 (312 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Pending deletion blocks: 0
[..]

The replication factor (default: 3)

The replication factor tells you how many copies of a file HDFS keeps. To set the replication factor for one particular file, use hdfs dfs -setrep. In the example below, wordcount/file2 was replicated twice, and we change it so it will be replicated three times. Note: the replication factor can be seen in the second column of the hdfs dfs -ls listing.

hadoop@ubuntu:~$ hdfs dfs -ls wordcount
Found 3 items
-rw-r--r--   2 hadoop supergroup         23 2018-11-17 21:41 wordcount/file1
-rw-r--r--   2 hadoop supergroup         25 2018-11-17 21:42 wordcount/file2
drwxr-xr-x   - hadoop supergroup          0 2018-11-17 21:37 wordcount/input
hadoop@ubuntu:~$ hdfs dfs -setrep 3 wordcount/file2
Replication 3 set: wordcount/file2
hadoop@ubuntu:~$ hdfs dfs -ls wordcount
Found 3 items
-rw-r--r--   2 hadoop supergroup         23 2018-11-17 21:41 wordcount/file1
-rw-r--r--   3 hadoop supergroup         25 2018-11-17 21:42 wordcount/file2
drwxr-xr-x   - hadoop supergroup          0 2018-11-17 21:37 wordcount/input

Similarly, we can change the replication factor for all files currently in the system; the -w flag makes the command wait until replication completes:

hadoop@ubuntu:~$ hdfs dfs -setrep -w 3 /
Replication 3 set: /user/hadoop/output/_SUCCESS
Replication 3 set: /user/hadoop/output/part-r-00000
Replication 3 set: /user/hadoop/wordcount/file1
Replication 3 set: /user/hadoop/wordcount/file2
Waiting for /user/hadoop/output/_SUCCESS ... done
Waiting for /user/hadoop/output/part-r-00000 ... done
Waiting for /user/hadoop/wordcount/file1 .... done
Waiting for /user/hadoop/wordcount/file2 ... done

To set the default replication factor for future files, change hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
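
Note that dfs.replication only applies to files created after the change; files already in HDFS keep their replication factor until you change it with -setrep. To check which default is currently in effect, the standard getconf tool can be used:

hadoop@ubuntu:~$ hdfs getconf -confKey dfs.replication
2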

The block size (default: 128 MB)

In Hadoop 2.x and later the default block size is 128 MB (older versions used 64 MB). Similarly, to change it, add to the config; the value is given in bytes, so 134217728 below means 128 MB:

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
  <description>Block size</description>
</property>
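
To verify the block size of an existing file, hdfs dfs -stat can print it; %o is the block-size format field, and the value reported is the block size the file was written with:

hadoop@ubuntu:~$ hdfs dfs -stat "%o" wordcount/file2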

Check system health

The hdfs fsck command is helpful in determining the health of the filesystem.

hadoop@ubuntu:~$ hdfs fsck /
Connecting to namenode via XXX
FSCK started by hadoop (auth:SIMPLE) from XXXX for path / at Tue Nov 20 15:24:21 UTC 2018
Status: HEALTHY
Number of data-nodes:  7
Number of racks:               1
Total dirs:                    15
Total symlinks:                0
Replicated Blocks:
Total size:    48 B
Total files:   4
Total blocks (validated):      2 (avg. block size 24 B)
Minimally replicated blocks:   2 (100.0 %)
Over-replicated blocks:        0 (0.0 %)
Under-replicated blocks:       0 (0.0 %)
Mis-replicated blocks:         0 (0.0 %)
Default replication factor:    2
Average block replication:     3.0
Missing blocks:                0
Corrupt blocks:                0
Missing replicas:              0 (0.0 %)
Erasure Coded Block Groups:
Total size:    0 B
Total files:   0
Total block groups (validated):        0
Minimally erasure-coded block groups:  0
Over-erasure-coded block groups:       0
Under-erasure-coded block groups:      0
Unsatisfactory placement block groups: 0
Average block group size:      0.0
Missing block groups:          0
Corrupt block groups:          0
Missing internal blocks:       0
FSCK ended at Tue Nov 20 15:24:21 UTC 2018 in 3 milliseconds
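
For more detail, fsck can also list individual files, their blocks, and the datanodes holding each replica:

hadoop@ubuntu:~$ hdfs fsck / -files -blocks -locations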

HDFS becomes really useful as the default storage platform for distributed processing frameworks such as MapReduce or Apache Spark. For instance, the classic wordcount job that ships with Hadoop reads its input from, and writes its output to, HDFS paths like the ones used above. A sketch (the examples jar location shown is typical, but depends on your installation):
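
hadoop@ubuntu:~$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount wordcount/input wordcount/output

To read more: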

HDFS architecture guide

HDFS commands reference

Lecture Notes: Hadoop HDFS orientation
