Introduction to R language

Design of Experiments and Statistical Analysis of Experimental Data, 2024

Paolo Bosetti

University of Trento, Department of Industrial Engineering

Introduction to R language

Statistical analysis requires the use of specific software

Today the two most used software/languages in this field are Python and R, followed by Matlab

We will use R because it is specific for statistics, graphics-oriented and open source

RStudio environment

  • Installation: first R, then RStudio
  • RStudio works on folders or (better) projects (.Rproj)
  • A project also contains settings that are specific and common to files in the folder
  • An RStudio session can operate on a single project
  • Multiple sessions can be opened at the same time
  • RStudio is a very powerful and complex environment, also suitable for compiling technical reports, articles, books and presentations (like this one)

The R language

  • R is a high-level, declarative, interpreted language with C-like syntax
  • R is both a language and an interpreter
  • R is a dynamically typed language
  • R is used in both script mode and interactive mode
  • R began as an open source GNU version of S, a proprietary language for statistical analysis
  • RStudio is a proprietary (but free) IDE for R

Assignments

Every language uses variables to store values and objects through an assignment operation:

a <- 1
# but also
b = 2
# however arrow notation is preferred,
# because it also works like this:
3 -> c
# to display the value of a variable:
c
# in one fell swoop, assignment and display:
(d <- "string")

Executing a command directly provides a result:

12*12
[1] 144

Types, or native classes

  • R has 6+1 native types or classes
    • character: "a", "string", 'my text'
    • numeric: 1, 3.1415
    • integer: 1L
    • logical: TRUE, FALSE (or T and F)
    • complex: 1+4i
    • function: a function
    • (raw: bit sequence)
  • Each instance is intrinsically a vector
  • A scalar is simply a vector of length 1

Special values

  • The following special values are defined:
    • NA: missing value
    • NULL: nothing
    • Inf: Infinite
    • NaN: Not a Number (example 0/0)

Coercion

  • When mixing different types, e.g. into a vector, R transforms them into a common type:
c(1L, 7, "2")
[1] "1" "7" "2"
c(T, 0)
[1] 1 0
as.numeric(c("a", "1"))
[1] NA  1
as.character(c(1, 1.7))
[1] "1"   "1.7"

Vectors

# They are constructed with the c() operator/function:
v1 <- c(10, 2, 7.5, 3)
# or with a sequence:
v2 <- 1:10
# also with specified pitch:
v3 <- seq(1, 10, 0.5)
# Functions are called with parentheses,
# separating arguments with ,

Note the output for v3 in this case:

 [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0
[12]  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0

The first element of the first row is the [1] element of the vector, while the first element of the second row is the [12] element. In all, the vector v3 has 19 elements

Vectors

Variables are natively vectors. Scalars are just vectors of dimension 1:

a <- 10
length(a)
[1] 1
length(v3)
[1] 19

Functions therefore always act on vectors:

a * 2
[1] 20
v3 + 2
 [1]  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
[12]  8.5  9.0  9.5 10.0 10.5 11.0 11.5 12.0

Introspection

  • Useful functions for inspecting objects:
    • mode(): storage mode
    • class(): class (high level, same as mode() for basic types)
    • typeof(): type (low level)
    • length(): vector length
    • attributes(): metadata

Matrices

  • They are constructed with the matrix() function
(m1 <- matrix(1:10, 2, 5))
  • the array() function constructs n-dimensional arrays
  • A matrix is a vector with dim attribute:
attr(m1, "dim")
v <- 1:4
attr(v, "dim") <- c(2,2) # is equivalent to dim(m) <- c(2,2)
v

Factors

  • An additional (non-base) but very common class is factor
  • Represents categorical variables (ordered or unordered)
(vf <- factor(LETTERS[1:5], levels=LETTERS[c(2, 1, 3, 5, 4)], ordered=T))
[1] A B C D E
Levels: B < A < C < E < D
class(vf)
[1] "ordered" "factor" 
typeof(vf)
[1] "integer"
vf[1] < vf[3]
[1] TRUE

Strings

A string can be thought of as an array of characters with a length greater than 1.

The most common string manipulation functions are cat(), paste(), and paste0(). The first is used to print the string as it is:

cat("Hello!")
Hello!

The two functions paste() and paste0() are used to join two or more strings, the first inserting a space in between, the second without space:

paste("Hello,", "World!")
[1] "Hello, World!"
paste0("Hello,", "World!")
[1] "Hello,World!"

Indexing

  • R’s indexing syntax is very flexible and powerful
  • as for vectors, square brackets [r,c] are used, the base is 1
  • if an index is missing, it means “all rows|columns”
v3[3]
[1] 2
m1[1,1]
[1] 1
m1[2,]
[1]  2  4  6  8 10
m1[,]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

Indexing

  • An index can also be a vector of positions or a vector of Boolean values
v1[c(2,4,1)] # extracts only elements 2, 4, and 1
[1]  2  3 10
v2[v2 %% 2 == 0] # extract elements divisible by 2
[1]  2  4  6  8 10

The second case works thanks to the modulus operator:

v2 %% 2 == 0 # modulus operator (remainder)
 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
[10]  TRUE

Functions

  • Functions are first class objects, that is, they are valid types
  • can be assigned to variables and passed to other functions
my_fun <- function(x) x^2
my_fun(1:5)
[1]  1  4  9 16 25
your_fun <- my_fun
your_fun(6)
[1] 36
my_apply <- function(x, f) f(x)
my_apply(10, my_fun)
[1] 100
  • If the definition requires multiple lines, a block is used between {}
  • Each function always returns the last evaluated expression
  • Or explicitly via return()

Arrow functions (replacement functions)

  • We’ve seen things like dim(v) <- c(2,3): how do you declare them?
`pwr<-` <- function(obj, value) obj ** value
a <- 2
pwr(a) <- 10
a
[1] 1024
  • The last argument must be called value and represents the right side of the assignment!

Flow control

R supports typical flow control instructions

  • for conditional statements:
    • if(cond) expr
    • if(cond) true.expr else false.expr
    • iffelse(cond, true.expr, false.expr)
  • and for cycles:
    • for(var in seq) expr
    • while(cond) expr
    • repeat expr
    • break
    • next

Function arguments

  • Topics can be indicated by position or name
  • Named topics can appear in any order
  • Arguments may have a default, in which case they are optional
f <- function(x, y, n=10, test=F) {
   ifelse(test, 0, x^y + n)
}
f(2, 10)
[1] 1034
f(test=F, y=10, x=2)
[1] 1034
f(test=T)
[1] 0

Difference between <- and =

  • The = operator as an assignment is valid only at the top-level
  • The <- operator is valid everywhere, even as a function argument:
system.time(m <- mean(1:1E6))
   user  system elapsed 
  0.007   0.000   0.007 
m
[1] 500000.5

Dataframes

  • In R, dataframes are used rather than matrices
  • These are tables organized by columns, internally homogeneous but potentially of different types
df <- data.frame(A=1:10, B=letters[1:10])
head(df)
  A B
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f

Dataframes

  • A dataframe can be indexed as an array (two indices)
  • Or by selecting a column with the notation $
df[2,2]
[1] "b"
df$B[2]
[1] "b"

Also in assignment:

df$C <- LETTERS[1:10]
head(df, 3)
  A B C
1 1 a A
2 2 b B
3 3 c C

Lists

A list is a sequence of key-value pairs, that is, a sequence of values identified by a name, or key.

Unlike vectors (which are always homogeneous) they can contain heterogeneous values.

(l <- list(A="one", B="two", C=1:4))
$A
[1] "one"

$B
[1] "two"

$C
[1] 1 2 3 4

A list can be indexed in three ways:

  • with the $ operator: extracts a single element by name
  • with the [] operator: extract elements by position and obtain a list
  • with the [[]] operator: extracts a single element per position

Commonly used algorithms

  • Sorting: sort, rev, order
  • Sampling: sample, expand.grid
  • Aggregation: by, aggregate
  • Contingency tables: table

Sorting vectors

To sort a vector you use the sort function:

v <- runif(5, 1, 10)
sort(v)
[1] 3.389578 4.349115 6.155680 9.070275 9.173870
rev(sort(v))
[1] 9.173870 9.070275 6.155680 4.349115 3.389578
sort(v, decreasing = T)
[1] 9.173870 9.070275 6.155680 4.349115 3.389578

Sorting dataframes

To reorder a data frame, the ordered indices are extracted:

df <- data.frame(A=1:5, B=runif(5))
df[order(df$B),]
  A         B
1 1 0.2016819
5 5 0.6291140
4 4 0.6607978
2 2 0.8983897
3 3 0.9446753

The order function returns the indices of a vector ordered according to the values:

order(df$B)
[1] 1 5 4 2 3

where the first is the index of the smallest value of df$B and the last is the index of the largest

Sampling

Sampling a set of data (a vector) means extracting a subset (called sample) of values randomly. This is done with the sample function:

sample(1:10) # without reinsertion
 [1]  2  3  1  5  7 10  6  4  9  8
sample(1:10, replace = T) # with reinsertion
 [1]  9  5  5  9  9  5  5  2 10  9

The sample size can be equal (case above) or smaller than the initial set:

sample(1:10, size = 5)
[1] 1 4 3 6 2
sample(10) # random integer generation without repetition
 [1] 10  6  7  4  8  9  2  1  3  5

Grids

A grid is a matrix that contains all (ordered) combinations between \(n\) vectors of possibly different sizes. In R it is represented as a data frame and constructed with the expand.grid function:

(df <- expand.grid(A=1:2, B=c("-", "+"), D=c("a", "b", "c")))
   A B D
1  1 - a
2  2 - a
3  1 + a
4  2 + a
5  1 - b
6  2 - b
7  1 + b
8  2 + b
9  1 - c
10 2 - c
11 1 + c
12 2 + c

Aggregation

Aggregation means grouping rows having common elements in a data frame and applying a given function to each group. It is useful for example for calculating sub-totals.

In R it can be done using the by function or the aggregate function (changes the output type):

by(df$A, INDICES = df$B, FUN=sum)
df$B: -
[1] 9
--------------------------------------------- 
df$B: +
[1] 9
aggregate(A~B, data = df, FUN = sum)
  B A
1 - 9
2 + 9

Contingency tables

A contingency table counts the occurrences between a pair of columns in a data frame:

head(airquality, n = 3)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
with(airquality, table(OzHi = Ozone > 80, Month,
                        useNA = "ifany"))
       Month
OzHi     5  6  7  8  9
  FALSE 25  9 20 19 27
  TRUE   1  0  6  7  2
  <NA>   5 21  5  5  1

NOTE: with() is to save you from typing airquality$Ozone and airquality$Month

Contingency tables

  • Also useful is tapply(), which operates on a table similarly to aggregate functions:
round(with(airquality,
            tapply(Ozone, Month, mean, na.rm=T)), 1)
   5    6    7    8    9 
23.6 29.4 59.1 60.0 31.4 

With aggregate() you would do:

aggregate(Ozone~Month, data=airquality, FUN=mean, ra.rm=T)
  Month    Ozone
1     5 23.61538
2     6 29.44444
3     7 59.11538
4     8 59.96154
5     9 31.44828

Input/output to file

Since statistics generally deals with large quantities of data, it is essential to be able to import and export data in generic formats.

Data is generally presented in tabular form (by rows and columns)

The simplest and most common formats are:

  • Flat File: an ASCII text file containing row and column values; columns can be separated
    • fixed length
    • using separator characters
  • CSV (Comma-Separated Values): a special version of FF where column fields are separated by commas

Input from file

A flat file with fields separated by spaces can look like this:

# Data collected on 8/10/2023
x y z
1.2 3.7 2.7
2.1 2.5 3.9
3.8 2.2 6.8

Such a file can be imported as a data frame like this:

df <- read.table("data_file.txt", header=T, sep=" ", comment.char="#")

The read.table() function has numerous options that allow you to handle all possible cases in which files contain fields separated by specific characters (spaces or other)

Input from file

A flat file with fixed width fields can look like this:

# Data collected on 8/10/2023
x y z
1.2 3.7 2.7
2.1 2.5 3.9
3.8 2.2 6.8

Such a file can be imported as a data frame like this:

df <- read.fwf("data_file.txt", widths=5, header=T, skip=1)

The skip=1 parameter requires skipping the first line (comment)

Input from CSV file

CSV files are special FFs where the field separator is the comma. In these cases we use the read.csv() function which works similarly to read.table() but does not require specifying the separator.

A CSV looks like this:

# Data collected on 8/10/2023
x,y,z
1.2,3.7,2.7
2.1,2.5,3.9
3.8,2.2,6.8

Software that uses Latin languages (Italian, Spanish, Portuguese and French) adopts the comma as decimal separator. Consequently, when these software (e.g. MS Excel) generate CSVs they use the semicolon as a field separator.

In this case from R it is necessary to use the read.csv2() function, which takes the comma as decimal separator and the semicolon as field separator.

Output to file

The opposite of importing a file to a data frame is exporting a data frame to a file.

This operation is performed with the opposite functions to the previous ones:

  • write.table()
  • write.fwf()
  • write.csv() and write.csv2()

All these functions have two mandatory arguments: the data frame to save and the destination file:

write.csv(df, "data.csv")

Other optional arguments are used to customize the result.

Tidyverse

Along with RStudio, a new wave of R libraries has emerged that radically changes the approach. They go by the collective name of tidyverse

  • ggplot2: plots
  • purrr: functional programming
  • dplyr: data manipulation
  • stringr: string manipulation

Tidyverse

Along with RStudio, a new wave of R libraries has emerged that radically changes the approach. They go by the collective name of tidyverse

  • tibble: Improved data frames
  • readr: import data
  • tidyr: data preparation
  • lubridate: date manipulation

Tidyverse

The tidyverse approach has some common characteristics:

  • data in tidy format (one observation per row, one variable, or observing, per column)
  • composition of graph functions with + (ggplot(...) + geom_line()), each function is a layer
  • prefix notation with %>% (a %>% str() instead of str(a))

It is useful to consult the cheat sheets: https://posit.co/resources/cheatsheets/

Infix notation

# I create the histogram of a sample of 10 elements from 100 random numbers
# infix:
hist(sample(rnorm(100), 10))

Hardly readable, the first step of the algorithm is the internal one

Infix notation, sequenced

# I create the histogram of a sample of 10 elements from 100 random numbers
# infix:
hist(sample(rnorm(100), 10))

# sequenced infix:
s <- rnorm(100)
c <- sample(s, 10)
hist(c)

More readable, the algorithm is more obvious, but requires the creation of intermediate variables

Prefix notation

# I create the histogram of a sample of 10 elements from 100 random numbers
# infix:
hist(sample(rnorm(100), 10))

# sequenced infix:
s <- rnorm(100)
c <- sample(s, 10)
hist(c)

# prefix with pipe:
rnorm(100) %>% sample(10) %>% hist()

Much more readable, the sequential algorithm is obvious, no intermediate variables are needed

Prefix notation

# I create the histogram of a sample of 10 elements from 100 random numbers
# infix:
hist(sample(rnorm(100), 10))

# sequenced infix:
s <- rnorm(100)
c <- sample(s, 10)
hist(c)

# prefix with pipe:
rnorm(100) %>% sample(10) %>% hist()

# even on multiple lines:
rnorm(100) %>%
1   sample(10) %>%
2   hist
1
the lines following the first must be indented
2
only when using pipe, if there are no arguments the parentheses are optional