CUDA
CUDA® is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). With millions of CUDA-enabled GPUs sold to date, software developers, scientists and researchers are using GPU-accelerated computing for broad-ranging applications.
- Software URL: https://developer.nvidia.com/cuda-zone
- Documentation: http://docs.nvidia.com/cuda/index.html
Usage
There are several partitions on Lewis that have GPU nodes. We can see their status
by filtering the output of the sinfo
command like so:
Command:
sinfo|grep -i gpu
Output:
[user@lewis4-r710-login-node223 ~]$ sinfo | grep -i gpu
r730-gpu3 up 2-00:00:00 10 idle lewis4-r730-gpu3-node[426,428-435,476]
gpu3 up 2-00:00:00 10 idle lewis4-r730-gpu3-node[426,428-435,476]
Gpu up 2-00:00:00 10 idle lewis4-r730-gpu3-node[426,428-435,476]
To see which GPU models are available on each node, you can use sinfo
with a format string (%n prints the node hostname and %G its generic resources):
Command:
sinfo -p Gpu -o %n,%G
Output:
[user@lewis4-r710-login-node223 ~]$ sinfo -p Gpu -o %n,%G
HOSTNAMES,GRES
lewis4-r730-gpu3-node426,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node428,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node429,gpu:Tesla K40m:1
lewis4-r730-gpu3-node430,gpu:Tesla K40m:1
lewis4-r730-gpu3-node431,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node432,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node433,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node434,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node435,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node476,gpu:Tesla K20Xm:1
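If you would rather tally the GPU models than read the list by eye, the GRES column is easy to parse. Here is a quick sketch in Python (the sample text is the sinfo output shown above; the helper name is our own):

```python
from collections import Counter

# Sample output of `sinfo -p Gpu -o %n,%G` (header row plus node lines).
sinfo_output = """HOSTNAMES,GRES
lewis4-r730-gpu3-node426,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node428,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node429,gpu:Tesla K40m:1
lewis4-r730-gpu3-node430,gpu:Tesla K40m:1
lewis4-r730-gpu3-node431,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node432,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node433,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node434,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node435,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node476,gpu:Tesla K20Xm:1"""

def count_gpu_models(output):
    """Return a Counter mapping GPU model -> number of nodes."""
    models = Counter()
    for line in output.splitlines()[1:]:      # skip the HOSTNAMES,GRES header
        node, gres = line.split(",", 1)       # gres looks like "gpu:Tesla K20Xm:1"
        model = gres.split(":")[1]            # the middle field is the model name
        models[model] += 1
    return models

print(count_gpu_models(sinfo_output))
```

For the listing above this reports 8 Tesla K20Xm nodes and 2 Tesla K40m nodes.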
To get started with CUDA, we first need to request a GPU node
from the cluster. We will use the Gpu
partition in this example.
Command:
srun -p Gpu -N1 -n20 -t 0-02:00 --mem=100G --gres gpu:1 --pty /bin/bash
Output:
[user@lewis4-r710-login-node223 ~]$ srun -p Gpu -N1 -n20 -t 0-02:00 --mem=100G --gres gpu:1 --pty /bin/bash
[user@lewis4-r730-gpu3-node428 ~]$
Notice how the prompt changed? We are now working on a node with a GPU.
Let's find out more about our GPU with the nvidia-smi
command:
Command:
nvidia-smi
Output:
[user@lewis4-r730-gpu3-node427 training]$ nvidia-smi
Fri Feb 10 16:03:18 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48 Driver Version: 367.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20Xm Off | 0000:03:00.0 Off | 0 |
| N/A 25C P0 62W / 235W | 0MiB / 5699MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Now we will load the CUDA module and run a simple test program.
Command:
module load cuda/cuda-7.5
(No output unless there is an error)
We can double check that the module loaded properly with these commands:
module list
which nvcc
Output:
[user@lewis4-r730-gpu3-node427 training]$ module list
Currently Loaded Modulefiles:
1) cuda/cuda-7.5
[user@lewis4-r730-gpu3-node427 training]$ which nvcc
/usr/local/cuda-7.5/bin/nvcc
With our module loaded we can now try building and running a simple example.
Using your favorite text editor, create a file and paste in the example
code below (the full listing appears under Contents of Example Files). Save it with the name: hello_cuda.cu
Now we are ready to compile the code with nvcc:
nvcc hello_cuda.cu -o hello_cuda
(No output unless there is an error)
We can check that the compiled binary is there with ls:
[user@lewis4-r730-gpu3-node427 cuda]$ nvcc hello_cuda.cu -o hello_cuda
[user@lewis4-r730-gpu3-node427 cuda]$ ls
hello_cuda hello_cuda.cu
Now it is time to execute our code:
./hello_cuda
Output:
[user@lewis4-r730-gpu3-node427 cuda]$ ./hello_cuda
Hello World!
NOTE
If your output is
Hello Hello
the kernel did not run on a GPU: the buffer copied back from the device is unchanged, so the program prints the original "Hello " twice.
Success! The last step is to leave our srun
session and use sbatch
to launch our example. Type exit
into your prompt and notice that we are
taken back to the login node:
exit
Output:
[user@lewis4-r730-gpu3-node427 cuda]$ exit
exit
[user@lewis4-r710-login-node223 training]$
Create another file called cuda_sbatch.sh
and paste in the code below (the full listing appears under Contents of Example Files).
Now we can execute our CUDA example using sbatch:
sbatch cuda_sbatch.sh
Output:
[user@lewis4-r710-login-node223 cuda]$ sbatch cuda_sbatch.sh
Submitted batch job 530035
To see the output, we look for the result file that matches our job ID (in this
example it is 530035) and use the cat
command:
[user@lewis4-r710-login-node223 cuda]$ ls
cuda_sbatch.sh hello_cuda hello_cuda.cu results_cuda-530035.out
[user@lewis4-r710-login-node223 cuda]$ cat results_cuda-530035.out
### Starting at: Fri Feb 10 16:45:05 CST 2017 ###
Currently Loaded Modulefiles:
1) cuda/cuda-7.5
### Starting at: Fri Feb 10 16:45:05 CST 2017
First core reporting from node:
lewis4-r730-gpu3-node427
Currently working in directory:
/home/user/training/cuda
Files in this folder:
total 854
-rw-rw-r--. 1 user group 1342 Feb 10 16:44 cuda_sbatch.sh
-rwxrwxr-x. 1 user group 530368 Feb 10 16:30 hello_cuda
-rw-rw-r--. 1 user group 954 Feb 10 11:30 hello_cuda.cu
-rw-rw-r--. 1 user group 285 Feb 10 16:45 results_cuda-530035.out
Hello World!
### Ending at: Fri Feb 10 16:45:07 CST 2017 ###
[user@lewis4-r710-login-node223 cuda]$
Notice that we get the same result, but now we don't have to be logged in directly to a GPU node.
Contents of Example Files
hello_cuda.cu:
// This is the REAL "hello world" for CUDA!
// It takes the string "Hello ", prints it, then passes it to CUDA with an array
// of offsets. Then the offsets are added in parallel to produce the string "World!"
// By Ingemar Ragnemalm 2010

#include <stdio.h>
#include <stdlib.h> // for EXIT_SUCCESS

const int N = 7;
const int blocksize = 7;

// Each thread adds its offset to one character of the string.
__global__
void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}

int main()
{
    char a[N] = "Hello ";
    int b[N] = {15, 10, 6, 0, -11, 1, 0};

    char *ad;
    int *bd;
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);

    printf("%s", a);

    cudaMalloc( (void**)&ad, csize );
    cudaMalloc( (void**)&bd, isize );
    cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );
    hello<<<dimGrid, dimBlock>>>(ad, bd);
    cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
    cudaFree( ad );
    cudaFree( bd );

    printf("%s\n", a);
    return EXIT_SUCCESS;
}
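To see why those particular offsets spell "World!", you can reproduce the kernel's per-character arithmetic on the host. This quick Python sketch mirrors what the seven GPU threads do; it is only an illustration of the arithmetic, not part of the CUDA example:

```python
# Host-side re-creation of the hello kernel: each "thread" i adds
# offset b[i] to character a[i].
a = list(b"Hello \x00")            # 7 bytes, matching char a[N] = "Hello ";
b = [15, 10, 6, 0, -11, 1, 0]      # the offsets from hello_cuda.cu

# 'H'+15='W', 'e'+10='o', 'l'+6='r', 'l'+0='l', 'o'-11='d', ' '+1='!'
out = bytes((c + off) % 256 for c, off in zip(a, b))
print(out.rstrip(b"\x00").decode())   # prints: World!
```

This also explains the failure mode noted earlier: if the kernel never runs, the buffer copied back still holds "Hello ", so the program prints it twice.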
cuda_sbatch.sh:
#!/bin/bash
#-------------------------------------------------------------------------------
# SBATCH CONFIG
#-------------------------------------------------------------------------------
## resources
#SBATCH -p gpu3 # partition (which set of nodes to run on)
#SBATCH -N1 # nodes
#SBATCH -n20 # tasks (cores)
#SBATCH --mem=100G # total RAM
#SBATCH -t 0-01:00 # time (days-hours:minutes)
#SBATCH --qos=normal # qos level
#SBATCH --exclusive # reserve entire node
#
## labels and outputs
#SBATCH -J hello_cuda # job name - shows up in sacct and squeue
#SBATCH -o results_cuda-%j.out # filename for the output from this job (%j = job#)
#SBATCH -A general-gpu # investor account
#
## notifications
#SBATCH --mail-user=username@missouri.edu # email address for notifications
#SBATCH --mail-type=END,FAIL # which type of notifications to send
#
#-------------------------------------------------------------------------------
echo "### Starting at: $(date) ###"
# load modules then display what we have
module load cuda/cuda-7.5
module list
# Serial operations - only runs on the first core
echo "### Starting at: $(date)"
echo "First core reporting from node:"
hostname
echo "Currently working in directory:"
pwd
echo "Files in this folder:"
ls -l
# Execute the hello_cuda binary:
./hello_cuda
echo "### Ending at: $(date) ###"