CUDA® is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). With millions of CUDA-enabled GPUs sold to date, software developers, scientists and researchers are using GPU-accelerated computing for broad-ranging applications.


Several partitions on Lewis have GPU nodes. We can see their status by filtering the output of the sinfo command like so:


sinfo|grep -i gpu


[user@lewis4-r710-login-node223 ~]$ sinfo | grep -i gpu
r730-gpu3        up 2-00:00:00      10   idle lewis4-r730-gpu3-node[426,428-435,476]
gpu3             up 2-00:00:00      10   idle lewis4-r730-gpu3-node[426,428-435,476]
Gpu              up 2-00:00:00      10   idle lewis4-r730-gpu3-node[426,428-435,476]

To get more information on what GPUs are available, you can use sinfo as follows:


sinfo -p Gpu -o %n,%G


[user@lewis4-r710-login-node223 ~]$ sinfo -p Gpu -o %n,%G
lewis4-r730-gpu3-node426,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node428,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node429,gpu:Tesla K40m:1
lewis4-r730-gpu3-node430,gpu:Tesla K40m:1
lewis4-r730-gpu3-node431,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node432,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node433,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node434,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node435,gpu:Tesla K20Xm:1
lewis4-r730-gpu3-node476,gpu:Tesla K20Xm:1

To get started with CUDA we will first need to request one of the GPU nodes from the cluster. We will use the Gpu partition in this example.


srun -p Gpu -N1 -n20 -t 0-02:00 --mem=100G --gres gpu:1 --pty /bin/bash


[user@lewis4-r710-login-node223 ~]$ srun -p Gpu -N1 -n20 -t 0-02:00 --mem=100G --gres gpu:1 --pty /bin/bash
[user@lewis4-r730-gpu3-node428 ~]$

Notice how the prompt changed? We are now working on a node with a GPU. Let's find out more about our GPU with the nvidia-smi command:



[user@lewis4-r730-gpu3-node427 training]$ nvidia-smi
Fri Feb 10 16:03:18 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         Off  | 0000:03:00.0     Off |                    0 |
| N/A   25C    P0    62W / 235W |      0MiB /  5699MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Now we will load the CUDA module and run a simple test program.


module load cuda/cuda-7.5

(No output unless there is an error)

We can double check that the module loaded properly with these commands:

module list
which nvcc


[user@lewis4-r730-gpu3-node427 training]$ module list
Currently Loaded Modulefiles:
  1) cuda/cuda-7.5
[user@lewis4-r730-gpu3-node427 training]$ which nvcc

With our module loaded we can now try building and running a simple example. Using your favorite text editor, create a file, paste in the example code from the Contents of Example Files section below, and save it as hello_cuda.cu. Now we are ready to compile the code with nvcc:

nvcc hello_cuda.cu -o hello_cuda

(No output unless there is an error)

We can check that the compiled binary is there with ls:

[user@lewis4-r730-gpu3-node427 cuda]$ nvcc hello_cuda.cu -o hello_cuda
[user@lewis4-r730-gpu3-node427 cuda]$ ls
hello_cuda  hello_cuda.cu

Now it is time to execute our code:



[user@lewis4-r730-gpu3-node427 cuda]$ ./hello_cuda
Hello World!


If your output is "Hello Hello " instead of "Hello World!", the kernel never ran on the GPU: the array came back unchanged, so the same string was printed twice.

Success! Now the last step is to leave our srun session and use sbatch to launch our example. Type exit into your prompt and notice that we are taken back to the login node:



[user@lewis4-r730-gpu3-node427 cuda]$ exit
[user@lewis4-r710-login-node223 training]$

Create another file (here called hello_cuda.sh; its contents are also in the Contents of Example Files section below) and paste in the batch script. Now we can execute our CUDA example using sbatch:



[user@lewis4-r710-login-node223 cuda]$ sbatch hello_cuda.sh
Submitted batch job 530035

To see the output we look for the result file that matches our job id (in this example it is 530035) and use the cat command:

[user@lewis4-r710-login-node223 cuda]$ ls
hello_cuda  results_cuda-530035.out
[user@lewis4-r710-login-node223 cuda]$ cat results_cuda-530035.out
### Starting at: Fri Feb 10 16:45:05 CST 2017 ###
Currently Loaded Modulefiles:
  1) cuda/cuda-7.5
### Starting at: Fri Feb 10 16:45:05 CST 2017
First core reporting from node:
Currently working in directory:
Files in this folder:
total 854
-rw-rw-r--. 1 user group   1342 Feb 10 16:44
-rwxrwxr-x. 1 user group 530368 Feb 10 16:30 hello_cuda
-rw-rw-r--. 1 user group    954 Feb 10 11:30
-rw-rw-r--. 1 user group    285 Feb 10 16:45 results_cuda-530035.out
Hello World!
### Ending at: Fri Feb 10 16:45:07 CST 2017 ###
[user@lewis4-r710-login-node223 cuda]$

Notice that we get the same result, but now we don't have to be logged in directly to a GPU node.

Contents of Example Files

hello_cuda.cu:

// This is the REAL "hello world" for CUDA!
// It takes the string "Hello ", prints it, then passes it to CUDA with an array
// of offsets. Then the offsets are added in parallel to produce the string "World!"
// By Ingemar Ragnemalm 2010

#include <stdio.h>

const int N = 7;
const int blocksize = 7;

__global__
void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}

int main()
{
    char a[N] = "Hello ";
    int b[N] = {15, 10, 6, 0, -11, 1, 0};

    char *ad;
    int *bd;
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);

    printf("%s", a);

    cudaMalloc( (void**)&ad, csize );
    cudaMalloc( (void**)&bd, isize );
    cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );
    hello<<<dimGrid, dimBlock>>>(ad, bd);
    cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
    cudaFree( ad );
    cudaFree( bd );

    printf("%s\n", a);
    return 0;
}

hello_cuda.sh:

#!/bin/bash
## resources
#SBATCH -p gpu3  # partition (which set of nodes to run on)
#SBATCH -N1  # nodes
#SBATCH -n20  # tasks (cores)
#SBATCH --mem=100G  # total RAM
#SBATCH -t 0-01:00  # time (days-hours:minutes)
#SBATCH --qos=normal  # qos level
#SBATCH --exclusive  # reserve entire node
## labels and outputs
#SBATCH -J hello_cuda  # job name - shows up in sacct and squeue
#SBATCH -o results_cuda-%j.out  # filename for the output from this job (%j = job#)
#SBATCH -A general-gpu  # investor account
## notifications
#SBATCH --mail-user=your_email@example.com  # email address for notifications
#SBATCH --mail-type=END,FAIL  # which type of notifications to send

echo "### Starting at: $(date) ###"

# load modules then display what we have
module load cuda/cuda-7.5
module list

# Serial operations - only run on the first core
echo "### Starting at: $(date)"
echo "First core reporting from node: $(hostname)"

echo "Currently working in directory: $(pwd)"

echo "Files in this folder:"
ls -l

# Execute the hello_cuda binary:
./hello_cuda

echo "### Ending at: $(date) ###"