Hadoop Distributed File System (HDFS)

HDFS is the Hadoop Distributed File System. The HDFS modules are split by partition: currently, there is one HDFS cluster within the Compute partition and another within the GPU partition.

Note: Permission to use HDFS is not granted automatically. You will need to request access before you can use HDFS.


HDFS Module

In this example, we will perform simple file operations using hadoop fs.

To list the contents of your directory within HDFS, first load the required HDFS module.

Command:

module load hdfs/hdfs-rc

Output:

## Assuming all went well, you will have no output.
## You can test to see if the module is loaded via
## `module list`

Note: currently, the hdfs/hdfs-rc module automatically loads java/openjdk/java-1.7.0-openjdk. If you require a different version of Java, unload the Java module with module unload java, then module load the version of Java you want.
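For example, a swap to a different Java version might look like the following sketch. The exact module name to load is an assumption here; check what is actually installed with module avail java.

```shell
# Unload the Java version that hdfs/hdfs-rc auto-loaded
module unload java

# See which Java versions are installed on the cluster
module avail java

# Load the version you want (name below is an assumption;
# use one listed by `module avail java`)
module load java/openjdk/java-1.8.0-openjdk
```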

Now, you can list the contents of HDFS.

Command:

hadoop fs -ls /

Output:

[user@lewis4-r710-login-node223 ~]$ hadoop fs -ls /
Found 4 items
drwxr-xr-x   - hdfs-resource hdfs-resource          0 2017-07-14 13:22 /group
drwxr-xr-x   - hdfs-resource hdfs-resource          0 2017-06-16 11:31 /shared
drwxr-xr-x   - hdfs-resource hdfs-resource          0 2018-02-13 14:45 /testing
drwx-wx-wx   - user        hdfs-resource          0 2017-06-29 11:23 /tmp
[user@lewis4-r710-login-node223 ~]$

To write a file into HDFS, use hadoop fs -put.

In this example, $USER is your username on the cluster and $SOME_LOCAL_FILE is the full path to a file on the local file system to be placed into HDFS. In the command and output examples, they are replaced with user and awesomefile.json respectively.
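In generic form, before substituting those values, the command looks like this (the /group/rc/$USER destination follows the directory layout used in this example):

```shell
# $SOME_LOCAL_FILE: full path to a local file
# $USER: your username on the cluster
hadoop fs -put $SOME_LOCAL_FILE /group/rc/$USER/
```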

Command:

hadoop fs -put /home/user/awesomefile.json /group/rc/user/

Output:

## None, assuming all went well.

We can now list the file we just placed in HDFS using hadoop fs -ls /group/rc/user.

Command:

hadoop fs -ls /group/rc/user

Output:

[user@lewis4-r710-login-node223 ~]$ hadoop fs -ls /group/rc/user
Found 1 items
-rw-r--r--   3 user user         25 2018-02-13 15:28 /group/rc/user/awesomefile.json

To delete a file in HDFS, use hadoop fs -rm.

Command:

hadoop fs -rm /group/rc/user/awesomefile.json

Output:

[user@lewis4-r710-login-node223 ~]$ hadoop fs -rm /group/rc/user/awesomefile.json
18/02/13 15:36:11 INFO fs.TrashPolicyDefault: Namenode trash configuration: ...
Deleted /group/rc/user/awesomefile.json
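The INFO line above shows that -rm moves deleted files into an HDFS trash directory rather than removing them immediately. If you want to bypass the trash and reclaim the space right away, hadoop fs -rm accepts a -skipTrash flag; a sketch, reusing the same example file:

```shell
# Permanently delete the file, bypassing the HDFS .Trash directory
hadoop fs -rm -skipTrash /group/rc/user/awesomefile.json
```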

MRI-HDFS Modules

You can view information about the HDFS modules via module help:

Example for Compute Partition:

[example@c12-rc4-head ~]$ module help mri/mri-hdfs

----------- Module Specific Help for 'mri/mri-hdfs' ---------------

The mri-hdfs module loads the required modules and sets the needed
environmental variables to access HDFS on the Compute Partition
Use this module within the Compute Partition only.

#------------------------------------------------------------------
# HDFS INFO
#------------------------------------------------------------------
       Location : hdfs://r630-node66:9090/
      WebUI URL : http://r630-node66:50070/
#------------------------------------------------------------------

Example for GPU Partition:

[example@c12-rc4-head ~]$ module help mri/mri-hdfs-gpu

----------- Module Specific Help for 'mri/mri-hdfs-gpu' -----------

The mri-hdfs-gpu module loads the required modules and sets the needed
environmental variables to access HDFS on the GPU Partition
Use this module within the GPU Partition only.

#------------------------------------------------------------------
# HDFS INFO
#------------------------------------------------------------------
       Location : hdfs://r730-node74:9090/
      WebUI URL : http://r730-node74:50070/
#------------------------------------------------------------------

Example Usage

All examples use the default amount of resources and assume a clean environment with no modules loaded. To clear any loaded modules, run module purge.
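For example, getting to that clean state and verifying it might look like:

```shell
# Remove all currently loaded modules
module purge

# Confirm that nothing is loaded
module list
```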

In the following example, we are using srun to submit an interactive job that simply lists the contents of a directory within the HDFS cluster:

[example@c12-rc4-head ~]$ module load mri/mri-hdfs
[example@c12-rc4-head ~]$ srun -N 1 -p Compute hadoop fs -ls /
Found 3 items
drwxrwxr-x   - sspark idas           0 2016-02-26 10:12 /idas
drwxr-xr-x   - example users          0 2016-02-21 14:10 /example
drwx-wx-wx   - example users          0 2016-02-21 14:10 /tmp

In this example, we are placing a file from our home directory into a folder on the HDFS cluster:

[example@c12-rc4-head ~]$ module load mri/mri-hdfs
[example@c12-rc4-head ~]$ srun -N 1 -p Compute hadoop fs -put littlelog.csv /example/
[example@c12-rc4-head ~]$ srun -N 1 -p Compute hadoop fs -ls /example
Found 1 items
-rw-r--r--   3 example users       1399 2016-02-26 16:54 /example/littlelog.csv
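To copy a file back out of HDFS onto the local file system, the complementary hadoop fs -get command works the same way. A sketch, reusing the file placed above:

```shell
# Copy littlelog.csv from HDFS into the current local directory
srun -N 1 -p Compute hadoop fs -get /example/littlelog.csv .
```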

In the final srun example, we access the HDFS cluster and the file placed in the previous example using a spark-shell (output trimmed to save space):

[example@c12-rc4-head ~]$ module load spark/spark-1.6.0-bin-hadoop2.6
[example@c12-rc4-head ~]$ srun -N 1 -p Compute --pty spark-shell

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/


scala> val file = sc.textFile("hdfs://r630-node66:9090/example/littlelog.csv")
file: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:27

scala> file.toArray.foreach(println)
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
20120315 01:17:06,99.122.210.248,http://www.acme.com/SH55126545/VD55170364,{7AAB8415-E803-3C5D-7100-E362D7F67CA7},homestead,fl,usa
20120315 01:34:46,69.76.12.213,http://www.acme.com/SH55126545/VD55177927,{8D0E437E-9249-4DDA-BC4F-C1E5409E3A3B},coeur d alene,id,usa
20120315 17:23:53,67.240.15.94,http://www.acme.com/SH55126545/VD55166807,{E3FEBA62-CABA-11D4-820E-00A0C9E58E2D},queensbury,ny,usa
20120315 17:05:00,67.240.15.94,http://www.acme.com/SH55126545/VD55149415,{E3FEBA62-CABA-11D4-820E-00A0C9E58E2D},queensbury,ny,usa
20120315 01:27:53,98.234.107.75,http://www.acme.com/SH55126545/VD55179433,{49E0D2EE-1D57-48C5-A27D-7660C78CB55C},sunnyvale,ca,usa
20120315 02:09:38,75.85.165.38,http://www.acme.com/SH55126545/VD55179433,{F6F8B460-4204-4C26-A32C-B93826EDCB99},san diego,ca,usa

scala> exit
[example@c12-rc4-head ~]$