In Lecture 6 of the Big Data in 30 hours class, we cover HDFS. The purpose of this memo is to provide participants with a quick reference to the material covered.
HDFS user interface
HDFS is a distributed file system. Its interface provides a filesystem abstraction similar to Linux, with commands like ls, mkdir, etc. Example:
hadoop@ubuntu:~$ hadoop fs -ls
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-11-17 21:42 wordcount
hadoop@ubuntu:~$ hadoop fs -ls wordcount
Found 3 items
-rw-r--r--   2 hadoop supergroup         23 2018-11-17 21:41 wordcount/file1
-rw-r--r--   2 hadoop supergroup         25 2018-11-17 21:42 wordcount/file2
drwxr-xr-x   - hadoop supergroup          0 2018-11-17 21:37 wordcount/input
hadoop@ubuntu:~$ hadoop fs -mkdir -p /user/hadoop/pp/input
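Files are typically loaded into HDFS with -put (or its alias -copyFromLocal). A minimal sketch; the local and target file names below are illustrative, not part of the session above:

hadoop@ubuntu:~$ echo "Hello World Bye World" > localfile1   # illustrative local file
hadoop@ubuntu:~$ hadoop fs -put localfile1 wordcount/file3   # copy it into HDFS
hadoop@ubuntu:~$ hadoop fs -ls wordcount                     # the new file appears in the listing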
To get a file out of HDFS:
hadoop@ubuntu:~$ hadoop fs -cat wordcount/file2
Hello Hadoop Bye Hadoop
hadoop@ubuntu:~$ hadoop fs -get wordcount/file2 localfile2
hadoop@ubuntu:~$ cat localfile2
Hello Hadoop Bye Hadoop
HDFS under the hood
However, under the hood, HDFS performs distributed file management and replicates files among the nodes. Use the dfsadmin tool for an overview of the replicated data.
hadoop@ubuntu:~$ hdfs dfsadmin -report
Configured Capacity: 359335223296 (334.66 GB)
Present Capacity: 274261630976 (255.43 GB)
DFS Remaining: 274261311488 (255.43 GB)
DFS Used: 319488 (312 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Pending deletion blocks: 0
[..]
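To see where the replicas of one particular file physically live, fsck can list its blocks and the DataNodes holding each replica. A sketch; the path below is illustrative:

hadoop@ubuntu:~$ # path is illustrative; prints each block of the file and its replica locations
hadoop@ubuntu:~$ hdfs fsck /user/hadoop/wordcount/file2 -files -blocks -locations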
The replication factor, default 3
The replication factor tells you how many copies of a file HDFS keeps. To set the replication factor for one particular file, use dfs -setrep. In the example below, file2 was replicated twice, and we change it so it will be replicated three times. Note: the replication factor is visible in column 2 of the dfs -ls listing.
hadoop@ubuntu:~$ hdfs dfs -ls wordcount
Found 3 items
-rw-r--r--   2 hadoop supergroup         23 2018-11-17 21:41 wordcount/file1
-rw-r--r--   2 hadoop supergroup         25 2018-11-17 21:42 wordcount/file2
drwxr-xr-x   - hadoop supergroup          0 2018-11-17 21:37 wordcount/input
hadoop@ubuntu:~$ hdfs dfs -setrep 3 wordcount/file2
Replication 3 set: wordcount/file2
hadoop@ubuntu:~$ hdfs dfs -ls wordcount
Found 3 items
-rw-r--r--   2 hadoop supergroup         23 2018-11-17 21:41 wordcount/file1
-rw-r--r--   3 hadoop supergroup         25 2018-11-17 21:42 wordcount/file2
drwxr-xr-x   - hadoop supergroup          0 2018-11-17 21:37 wordcount/input
To set the replication factor recursively for a whole tree, and wait until re-replication completes, use -setrep -w:

hadoop@ubuntu:~$ hdfs dfs -setrep -w 3 /
Replication 3 set: /user/hadoop/output/_SUCCESS
Replication 3 set: /user/hadoop/output/part-r-00000
Replication 3 set: /user/hadoop/wordcount/file1
Replication 3 set: /user/hadoop/wordcount/file2
Waiting for /user/hadoop/output/_SUCCESS ... done
Waiting for /user/hadoop/output/part-r-00000 ... done
Waiting for /user/hadoop/wordcount/file1 .... done
Waiting for /user/hadoop/wordcount/file2 ... done
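A quick way to confirm the effective replication factor of a single file is dfs -stat, whose %r format prints the replication of a path. A sketch; the file name is illustrative:

hadoop@ubuntu:~$ hdfs dfs -stat %r wordcount/file2   # prints the replication factor, e.g. 3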
To set the default replication factor for future files, change the config file (hdfs-site.xml):
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/usr/local/hadoop/data/nameNode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/usr/local/hadoop/data/dataNode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>
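Note that the new default applies only to files written after the change; existing files keep their current replication factor. The HDFS daemons also need to be restarted to pick up the edited config. A sketch, assuming the stock sbin scripts are on the PATH:

hadoop@ubuntu:~$ stop-dfs.sh    # assumes $HADOOP_HOME/sbin is on the PATH
hadoop@ubuntu:~$ start-dfs.sh   # daemons re-read hdfs-site.xml on startup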
The block size, default 128 MB

Similarly, to change the default block size for newly written files, add to the config (the value is in bytes; 134217728 = 128 MB):
<property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <description>Block size</description>
</property>
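The block size is a per-file attribute fixed at write time, so it can also be overridden for a single file without touching the config. A sketch; the file names are illustrative, and 268435456 bytes = 256 MB:

hadoop@ubuntu:~$ # write one file with a non-default block size (names and value illustrative)
hadoop@ubuntu:~$ hdfs dfs -D dfs.blocksize=268435456 -put bigfile.csv /user/hadoop/bigfile.csv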
Check system health
The hdfs fsck command is helpful for checking the health of the file system.
hadoop@ubuntu:~$ hdfs fsck /
Connecting to namenode via XXX
FSCK started by hadoop (auth:SIMPLE) from XXXX for path / at Tue Nov 20 15:24:21 UTC 2018

Status: HEALTHY
 Number of data-nodes:  7
 Number of racks:       1
 Total dirs:            15
 Total symlinks:        0

Replicated Blocks:
 Total size:    48 B
 Total files:   4
 Total blocks (validated):      2 (avg. block size 24 B)
 Minimally replicated blocks:   2 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     3.0
 Missing blocks:                0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)

Erasure Coded Block Groups:
 Total size:    0 B
 Total files:   0
 Total block groups (validated):        0
 Minimally erasure-coded block groups:  0
 Over-erasure-coded block groups:       0
 Under-erasure-coded block groups:      0
 Unsatisfactory placement block groups: 0
 Average block group size:      0.0
 Missing block groups:          0
 Corrupt block groups:          0
 Missing internal blocks:       0
FSCK ended at Tue Nov 20 15:24:21 UTC 2018 in 3 milliseconds
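Should fsck report missing or corrupt blocks, it can also enumerate the affected files directly. A sketch; the path below is illustrative:

hadoop@ubuntu:~$ # list files with corrupt/missing blocks under a path (path illustrative)
hadoop@ubuntu:~$ hdfs fsck /user/hadoop -list-corruptfileblocks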
HDFS is particularly useful as the default storage platform for distributed processing frameworks such as MapReduce or Apache Spark. To read more: