R

R is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. R is free, command-driven software that is widely used in statistics and is a popular choice among statisticians for research. In addition to the base software (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and so on), over 6,000 packages are available for download. New methods and packages often appear in R before they reach other statistical software such as SAS or SPSS. Most importantly, R is open source, so you can write your own software and create your own packages.

Software URL: https://www.r-project.org/
Documentation: https://cran.r-project.org/manuals.html

Usage

In this example we will be using the following script named cars_summary.R:

cars_summary.R:

## Cars example dataset from the R Documentation
#  Modified:
#    19 October 2018
#  Author:
#    Jacob Gotberg

## load the package called 'datasets' which contains sample data
library("datasets")

## examine the 'cars' object
# cars is a data frame with 50 rows
# column 1 is speed in mph
# column 2 is stopping distance in ft
cars

# print out a summary of the data
summary(cars)

## make a plot of this data and save it as a jpeg
# start the jpeg and give it a filename
jpeg('stop_dist_by_speed.jpg')

# build the plot and provide x/y labels
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", las = 1)

# add a LOWESS smoother to the data and plot it
lines(lowess(cars$speed, cars$dist, f = 2/3, iter = 3), col = "red")

# add a title
title(main = "Cars dataset: Stopping Distance vs Speed")

# save the jpeg
dev.off()

## make a second jpeg with summary data
# start the jpeg and give it a filename
jpeg('summary.jpg')

# build the summary
summary(fm1 <- lm(log(dist) ~ log(speed), data = cars))

# customize the plot settings
opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0),
            mar = c(4.1, 4.1, 2.1, 1.1))

# build the diagnostic plots based on custom settings
plot(fm1)

# restore plot settings to defaults
par(opar)

# save the jpeg
dev.off()

After you create the script (or upload it), we need to create an SBATCH script. This file has three distinct parts:

SLURM Configuration (always at the top of the script)
Module commands to load R software
The command to run the Rscript

For this example the SBATCH script is called R_sbatch.sh:

#!/bin/bash
#--------------------------------------------------------------------------------
#  SBATCH CONFIG
#--------------------------------------------------------------------------------
#SBATCH --job-name=cars_summary        # name for the job
#SBATCH --cpus-per-task=1              # number of cores
#SBATCH --mem=4G                       # total memory
#SBATCH --time 0-04:00                 # time limit in the form days-hours:minutes
#SBATCH --mail-user=username@mu.edu    # email address for notifications
#SBATCH --mail-type=FAIL,END           # email types
#SBATCH --partition General            # max of 1 node and 4 hours; use `Lewis` for larger jobs
#--------------------------------------------------------------------------------

echo "### Starting at: $(date) ###"

## Module Commands
# use 'module avail r/' to find the latest version
module load r
module list

## Run the R script
SCRIPT='cars_summary.R'
Rscript "${SCRIPT}"

echo "### Ending at: $(date) ###"

The section at the top labeled 'SBATCH CONFIG' tells SLURM what resources you need for the job. The rest of the script is a standard bash script that sets up your modules and runs the script. Once you have this example working, you will update the line that says SCRIPT='cars_summary.R' with the name of your own R script.

To start the job we will use the sbatch command with the name of our SBATCH script like so:

sbatch R_sbatch.sh

Output:

[user@lewis4-r710-login-node223 R]$ sbatch R_sbatch.sh
Submitted batch job 7759657

The SLURM system will assign a worker node and complete your job as soon as resources are available. The output will be automatically saved to a file called slurm-7759657.out. Notice that the job was given a unique id after we submitted the job and that the output file has that same number included (in this case 7759657). We can check on the progress of our job with the sacct command. See the Slurm pages for more info. Once the job is completed you can view the output with the less command:

ls
less slurm-7759657.out

Output:

[user@lewis4-r710-login-node223 R]$ ls
cars_summary.R  R_sbatch.sh  slurm-7759657.out  stop_dist_by_speed.jpg  summary.jpg
[user@lewis4-r710-login-node223 R]$ less slurm-7759657.out
### Starting at: Fri Oct 19 14:29:07 CDT 2018 ###
Autoloading glib/glib-2.56.0-python-2.7.14-tk
...
... lots of output here ...
...
     speed           dist
 Min.   : 4.0   Min.   :  2.00
 1st Qu.:12.0   1st Qu.: 26.00
 Median :15.0   Median : 36.00
 Mean   :15.4   Mean   : 42.98
 3rd Qu.:19.0   3rd Qu.: 56.00
 Max.   :25.0   Max.   :120.00
null device
          1

Call:
lm(formula = log(dist) ~ log(speed), data = cars)

Residuals:
     Min       1Q   Median       3Q      Max
-1.00215 -0.24578 -0.02898  0.20717  0.88289

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.7297     0.3758  -1.941   0.0581 .
log(speed)    1.6024     0.1395  11.484 2.26e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4053 on 48 degrees of freedom
Multiple R-squared:  0.7331,    Adjusted R-squared:  0.7276
F-statistic: 131.9 on 1 and 48 DF,  p-value: 2.259e-15

null device
          1
### Ending at: Fri Oct 19 14:29:08 CDT 2018 ###

Scaling up from one core

R code runs serially unless you explicitly use a parallel package. Assigning more than one core is a waste of resources unless you are using parallel code. Review:

and/or attend a training session to learn more.
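
As a minimal sketch of what explicit parallel code can look like, the example below uses the parallel package that ships with R to spread a bootstrap of the cars regression across the cores SLURM assigned to the job. It assumes you raised --cpus-per-task in the SBATCH script, which SLURM exposes as the SLURM_CPUS_PER_TASK environment variable. The function boot_slope and the choice of 200 resamples are illustrative and not part of the original example. mclapply relies on forking, so this approach suits Linux nodes but not Windows.

```r
library(parallel)

# Cores granted by SLURM through --cpus-per-task; fall back to 1 outside a job
n_cores <- suppressWarnings(
  as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
)
if (is.na(n_cores) || n_cores < 1L) n_cores <- 1L

# Illustrative task: refit the log-log model on one bootstrap resample of 'cars'
# and return the slope for log(speed)
boot_slope <- function(i) {
  idx <- sample(nrow(cars), replace = TRUE)
  fit <- lm(log(dist) ~ log(speed), data = cars[idx, ])
  unname(coef(fit)[2])
}

# Use a random number generator designed for parallel streams
RNGkind("L'Ecuyer-CMRG")
set.seed(2018)

# Run the 200 resamples across the available cores
slopes <- unlist(mclapply(seq_len(200), boot_slope, mc.cores = n_cores))

# Rough 95% bootstrap interval and median for the slope
print(quantile(slopes, c(0.025, 0.5, 0.975)))
```

With --cpus-per-task=1 this runs exactly like a serial lapply, so the same script can be tested on one core before you request more.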

Interactive Usage

Once you have your R code ready, you shouldn't need to run R interactively except to debug or develop features. Using the cluster interactively has serious downsides, and interactive sessions should not be used for everyday production research. If you do need to use R interactively, RCSS recommends using RStudio or the method below:

srun -p Interactive --qos interactive --mem 4G --pty /bin/bash
module load r
R

Once you run those commands you will be presented with the R prompt.

Install packages in R

To install R packages, you can use the following methods:

Install R packages from Anaconda (recommended)

You can create a virtual environment with Anaconda that includes the latest version of R and any packages you want. Please review Anaconda to learn how to set up and create a virtual environment. After setting up the Recommended Configuration, use the following to create an environment that includes the latest version of R:

srun -p Interactive --qos interactive --mem 16G --pty bash
module load miniconda3
conda create -n r-env -c conda-forge r-base

## <it will install pkgs and create the env>

## activate the env:
source activate r-env

Type R in the terminal to open the R that you just installed inside r-env. Note that when you log in to Lewis again, you need to request resources, load miniconda3, and activate r-env before you can use your R environment.

To install R packages, search for the package's name on https://anaconda.org/ to find the conda install command that adds it to the activated environment. For instance:

source activate r-env
conda install -c r r-lme4

This activates r-env and installs "lme4" in your virtual environment. Now you can type R to open R and load the library:

R version 4.0.2 (2020-06-22) -- "Taking Off Again"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(lme4)
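
To confirm that the R you opened is the one inside r-env, and that a package resolves from that environment rather than from somewhere else, a quick check along these lines can help. This is only a sketch: package_location is a hypothetical helper name, and the paths printed depend on where your environment lives.

```r
# Hypothetical helper: directory a package is installed in,
# or NA if the package cannot be found
package_location <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NA_character_)
  }
  find.package(pkg)
}

# Where this R installation lives and which libraries it searches
cat("R home:", R.home(), "\n")
cat("Library paths:", .libPaths(), sep = "\n  ")
cat("\n")

# Where lme4 would be loaded from (prints NA if it is not installed)
cat("lme4 location:", package_location("lme4"), "\n")
```

If the environment is active, the paths reported would normally point inside the r-env directory of your conda installation.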

Install R packages from CRAN

You can also install packages from within R. Note that with this method you can only use the R that is already installed on the cluster. Installing multiple packages this way may raise package dependency conflicts.

To install packages from CRAN, first request resources, then load and open R:

srun -p Interactive --qos interactive --mem 4G --pty /bin/bash
module load r
R

Then use the following R command to install packages:

install.packages("package-name", repos = "http://cran.us.r-project.org")

When prompted, press "y" to install packages into a personal library in your home directory (~/R/x86_64-pc-linux-gnu-library/XX).
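
If you would rather not answer the prompt each time, one option is to pass the library location to install.packages yourself. The sketch below assumes the personal library path that R reports in the R_LIBS_USER environment variable; install_if_missing is a hypothetical helper name, and the repository URL is the one used above.

```r
# Hypothetical helper: install a package into the personal library only when
# it is not already available, then report whether it can be loaded
install_if_missing <- function(pkg,
                               lib = path.expand(Sys.getenv("R_LIBS_USER")),
                               repos = "http://cran.us.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Create the personal library if needed and install into it
    dir.create(lib, recursive = TRUE, showWarnings = FALSE)
    install.packages(pkg, lib = lib, repos = repos)
    # Make sure the current session searches the personal library
    .libPaths(c(lib, .libPaths()))
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}
```

Calling install_if_missing("package-name") at the R prompt installs the package only if it is missing and returns TRUE when the package can be loaded afterwards.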

RStudio

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

Prerequisites: in order to display the RStudio GUI you will need to have a terminal with X11 forwarding enabled. Review Graphical User Interfaces (GUIs) for X11 forwarding.

Using RStudio requires you to request an interactive session using SLURM and then load the RStudio module. The following commands will get you started with RStudio on Lewis for 2 hours:

srun -p Interactive --qos interactive --mem 4G --pty /bin/bash
module load rstudio
rstudio

See Also