Design of Experiments and Statistical Analysis of Experimental Data, 2024
University of Trento, Department of Industrial Engineering
Statistical analysis requires dedicated software.
Today the two most widely used languages in this field are Python and R, followed by Matlab.
We will use R because it is designed for statistics, graphics-oriented and open source.
Every language uses variables to store values and objects through an assignment operation:
a <- 1
# but also
b = 2
# however arrow notation is preferred,
# because it also works like this:
3 -> c
# to display the value of a variable:
c
# in one fell swoop, assignment and display:
(d <- "string")
Executing a command directly provides a result:
Basic values:
"a", "string", 'my text': character
1, 3.1415: numeric
1L: integer
TRUE, FALSE (or T and F): logical
1+4i: complex
Special values:
NA: missing value
NULL: nothing
Inf: infinite
NaN: Not a Number (example: 0/0)
# Vectors are constructed with the c() operator/function:
v1 <- c(10, 2, 7.5, 3)
# or with a sequence:
v2 <- 1:10
# also with a specified step:
v3 <- seq(1, 10, 0.5)
# Functions are called with parentheses,
# separating arguments with ,
Note the output for v3 in this case:
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
[12] 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0
The first element of the first row is the [1] element of the vector, while the first element of the second row is the [12] element. In all, the vector v3 has 19 elements.
Variables are natively vectors. Scalars are just vectors of length 1:
Functions therefore always act on vectors:
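As a minimal sketch of the two statements above (the variable names are illustrative only), a single number is a vector of length 1 and arithmetic or mathematical functions operate element by element:

```r
# A "scalar" is simply a vector with one element
x <- 5
length(x)      # 1
is.vector(x)   # TRUE

# Functions and operators act on every element of a vector
v <- c(1, 4, 9, 16)
roots <- sqrt(v)       # 1 2 3 4
scaled <- v * 2 + 1    # 3 9 19 33
```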
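mode(): storage mode
class(): class (high level; for most basic types it matches mode())
typeof(): type (low level)
length(): vector length
attributes(): metadata
Matrices are constructed with the matrix() function
The array() function constructs n-dimensional arrays
Alternatively, a vector becomes a matrix by setting its dim attribute
Categorical data are stored as factor objects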
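The inspection functions and constructors listed above can be tried on small objects. This is a sketch with made-up values, not the original slide code:

```r
# Inspecting an integer vector
n <- 1:6
class(n)        # "integer"
mode(n)         # "numeric"
typeof(n)       # "integer"
length(n)       # 6
attributes(n)   # NULL: a plain vector carries no metadata

# A matrix built with matrix() is filled column by column
m <- matrix(1:6, nrow = 2)

# An n-dimensional array (here 2 x 3 x 4)
a <- array(1:24, dim = c(2, 3, 4))

# Setting the dim attribute turns the vector n into a 2 x 3 matrix
dim(n) <- c(2, 3)
attributes(n)   # now contains $dim

# A factor stores categorical data; levels are sorted alphabetically
f <- factor(c("low", "high", "low"))
levels(f)       # "high" "low"
```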
Conceptually, a string is an array of characters. In R, however, a string is a single element of a character vector: length("abc") is 1, while nchar("abc") is 3.
The most common string manipulation functions are cat(), paste(), and paste0(). The first is used to print the string as it is.
The two functions paste() and paste0() are used to join two or more strings: the first inserts a space in between, the second joins them without a space.
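A short sketch of the three functions, using illustrative strings that are not taken from the original slides:

```r
s <- "Hello"

# cat() prints the string as it is (no quotes, no [1] index)
cat(s, "\n")

# paste() joins with a space by default, paste0() with no separator
title <- paste("Design", "of", "Experiments")   # "Design of Experiments"
fname <- paste0("file", 1, ".csv")              # "file1.csv"

# The separator of paste() can be changed with sep
dashed <- paste("a", "b", sep = "-")            # "a-b"

# One string is one element, made of several characters
length(s)   # 1
nchar(s)    # 5
```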
Elements are indexed with square brackets ([r,c] for row and column); the base is 1:
[1] 2 3 10
[1] 2 4 6 8 10
The second case works thanks to the modulus operator %%.
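The original code for these outputs is not shown here. A sketch that would produce them, assuming the vector is 1:10, is:

```r
v <- 1:10

# Indexing by position (1-based)
by_pos <- v[c(2, 3, 10)]     # 2 3 10

# Indexing by a logical condition: v %% 2 is the remainder of v / 2
evens <- v[v %% 2 == 0]      # 2 4 6 8 10

# Matrices use [row, column]; an empty index selects everything
m <- matrix(1:6, nrow = 2)
m[2, 3]    # 6
m[1, ]     # first row: 1 3 5
```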
[1] 1 4 9 16 25
[1] 36
[1] 100
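Functions are defined with the function keyword: the body is enclosed in {} and the result is returned with return()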
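A minimal sketch of a function definition follows. The names square and `second<-` are hypothetical, chosen only for illustration. The second definition is an assignment-style function of the same kind as dim(v) <- c(2,3).

```r
# An ordinary function: body in {}, result given by return()
square <- function(x) {
  return(x^2)
}
square(1:5)   # 1 4 9 16 25
square(6)     # 36

# An assignment-style (replacement) function: its name ends in <-
# and its last argument must be called `value`
`second<-` <- function(x, value) {
  x[2] <- value
  x
}
v <- c(1, 2, 3)
second(v) <- 10
v             # 1 10 3
```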
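Assignment functions such as dim(v) <- c(2,3): how do you declare them?
The function name ends in <- (for example `dim<-`), and its last argument must be called value and represents the right side of the assignment!
R supports typical flow control instructions: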
if(cond) expr
if(cond) true.expr else false.expr
ifelse(cond, true.expr, false.expr)
for(var in seq) expr
while(cond) expr
repeat expr
break
next
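The constructs listed above can be sketched as follows. The values are arbitrary and only serve to show the syntax:

```r
x <- 7

# if / else
if (x %% 2 == 0) parity <- "even" else parity <- "odd"

# ifelse() is the vectorized form: one result per element
labels <- ifelse(c(1, 2, 3, 4) %% 2 == 0, "even", "odd")

# for loop; next skips to the following iteration
total <- 0
for (i in 1:5) {
  if (i == 3) next
  total <- total + i
}
# total is 1 + 2 + 4 + 5 = 12

# while loop
n <- 0
while (n < 3) n <- n + 1

# repeat loops forever until break is called
k <- 0
repeat {
  k <- k + 1
  if (k >= 4) break
}
```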
<- and =
The = operator as an assignment is valid only at the top level
The <- operator is valid everywhere, even as a function argument
Data frames are used rather than matrices
Columns are accessed with the $ operator
Also in assignment:
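A sketch of both points, with made-up data: first the difference between = and <- inside a function call, then the $ operator used to read and to assign a data frame column.

```r
# Inside a call, = names an argument: no variable is created here
m1 <- mean(x = c(1, 2, 3))

# Inside a call, <- assigns y and passes its value to the function
m2 <- mean(y <- c(4, 5, 6))
y            # 4 5 6

# A data frame: columns can have different types
df <- data.frame(A = 1:3, B = c("a", "b", "c"))

# $ extracts a column ...
df$A         # 1 2 3

# ... and also works on the left side of an assignment
df$C <- df$A * 2
```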
A list is a sequence of key-value pairs, that is, a sequence of values identified by a name, or key.
Unlike vectors (which are always homogeneous) they can contain heterogeneous values.
A list can be indexed in three ways:
the $ operator: extracts a single element by name
the [] operator: extracts elements by position and returns a list
the [[]] operator: extracts a single element by position
sort, rev, order
sample, expand.grid
by, aggregate
table
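Before moving on to these functions, the three list indexing operators described above can be sketched as follows (the list content is invented for illustration):

```r
# A list with heterogeneous, named elements
l <- list(name = "test", values = c(1, 2, 3), ok = TRUE)

# $ extracts one element by name
l$name          # "test"

# [] extracts by position and always returns a list
sub <- l[1]
class(sub)      # "list"

# [[]] extracts the element itself
vals <- l[[2]]  # the numeric vector 1 2 3
class(vals)     # "numeric"
```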
To sort a vector you use the sort function:
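A minimal sketch with an arbitrary vector, also showing rev(), which reverses the order of the elements without sorting them:

```r
v <- c(10, 2, 7.5, 3)

asc <- sort(v)                       # 2 3 7.5 10
desc <- sort(v, decreasing = TRUE)   # 10 7.5 3 2
reversed <- rev(v)                   # 3 7.5 2 10
```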
To reorder a data frame, the ordered indices are extracted:
  A         B
1 1 0.2016819
5 5 0.6291140
4 4 0.6607978
2 2 0.8983897
3 3 0.9446753
The order function returns the indices of a vector ordered according to the values, where the first is the index of the smallest value of df$B and the last is the index of the largest.
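A sketch of the procedure, assuming a data frame df with a column A of row numbers and a column B of random values. The seed is arbitrary, so the numbers will differ from the output shown above:

```r
set.seed(1)   # arbitrary seed, only for reproducibility
df <- data.frame(A = 1:5, B = runif(5))

# order() returns the row indices that sort df$B in increasing order
idx <- order(df$B)

# Using the indices as row selector reorders the whole data frame
sorted <- df[idx, ]
sorted
```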
Sampling a set of data (a vector) means randomly extracting a subset of values (called a sample). This is done with the sample function:
[1] 2 3 1 5 7 10 6 4 9 8
[1] 9 5 5 9 9 5 5 2 10 9
The sample size can be equal (case above) or smaller than the initial set:
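The three cases can be sketched as follows. The calls are assumptions consistent with the outputs above (a permutation, then a sample with repeated values), and the seed is arbitrary:

```r
set.seed(123)   # arbitrary seed, only for reproducibility
v <- 1:10

# Same size, without replacement: a random permutation
p <- sample(v)

# Same size, with replacement: values can repeat
r <- sample(v, replace = TRUE)

# Smaller sample: 4 distinct values out of 10
s <- sample(v, 4)
```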
A grid is a table that contains all (ordered) combinations of the elements of \(n\) vectors of possibly different sizes. In R it is represented as a data frame and constructed with the expand.grid function:
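A small sketch with three illustrative vectors of sizes 3, 2 and 2, giving 3 × 2 × 2 = 12 combinations:

```r
g <- expand.grid(x = 1:3, y = c("a", "b"), z = c(TRUE, FALSE))

nrow(g)   # 12: one row per combination
head(g)   # the first factor (x) varies fastest
```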
Aggregation means grouping rows having common elements in a data frame and applying a given function to each group. It is useful for example for calculating sub-totals.
In R it can be done using the by function or the aggregate function (the two differ in the output type):
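As a sketch, sub-totals per group can be computed in both ways on a small invented data frame. The column names group and value are assumptions:

```r
df <- data.frame(
  group = c("a", "b", "a", "b", "a"),
  value = c(1, 2, 3, 4, 5)
)

# by() splits the data frame by group and applies the function to each
# piece; the result is an object of class "by"
res_by <- by(df, df$group, function(d) sum(d$value))

# aggregate() does the same with a formula and returns a data frame
res_ag <- aggregate(value ~ group, data = df, FUN = sum)

class(res_by)   # "by"
class(res_ag)   # "data.frame"
```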
A contingency table counts the occurrences between a pair of columns in a data frame:
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
       Month
OzHi     5  6  7  8  9
  FALSE 25  9 20 19 27
  TRUE   1  0  6  7  2
  <NA>   5 21  5  5  1
NOTE: with() saves you from typing airquality$Ozone and airquality$Month
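The code that generated the table is not shown here. The output is consistent with the following call on the built-in airquality dataset, where the threshold of 80 for OzHi is an assumption:

```r
# First rows of the built-in dataset
head(airquality, 3)

# Count days with high ozone (assumed threshold: > 80) for each month,
# keeping missing ozone values as a separate row
tab <- with(airquality,
            table(OzHi = Ozone > 80, Month, useNA = "ifany"))
tab
```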
A related function is tapply(), which operates on a table similarly to aggregate functions.
With aggregate() you would do:
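As an illustration of both approaches, the mean temperature per month in airquality can be computed as follows. The choice of Temp and mean is an assumption made for this sketch:

```r
# tapply(): values, grouping variable, function; returns a named array
mean_tapply <- with(airquality, tapply(Temp, Month, mean))

# aggregate(): same computation, returned as a data frame
mean_aggr <- aggregate(Temp ~ Month, data = airquality, FUN = mean)

mean_tapply
mean_aggr
```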
Since statistics generally deals with large quantities of data, it is essential to be able to import and export data in generic formats.
Data is generally presented in tabular form (by rows and columns).
The simplest and most common formats are flat files (with space-separated or fixed-width fields) and CSV files.
A flat file with fields separated by spaces can look like this:
# Data collected on 8/10/2023
x y z
1.2 3.7 2.7
2.1 2.5 3.9
3.8 2.2 6.8
Such a file can be imported as a data frame like this:
The read.table() function has numerous options to handle files whose fields are separated by specific characters (spaces or others).
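A self-contained sketch of the import: the sample file is first written to a temporary path (an assumption made so the example can run anywhere), then read back. By default read.table() treats lines starting with # as comments.

```r
# Write the sample file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c(
  "# Data collected on 8/10/2023",
  "x y z",
  "1.2 3.7 2.7",
  "2.1 2.5 3.9",
  "3.8 2.2 6.8"
), path)

# header = TRUE takes the column names from the first non-comment line
df <- read.table(path, header = TRUE)
str(df)
```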
A flat file with fixed width fields can look like this:
# Data collected on 8/10/2023
x y z
1.2 3.7 2.7
2.1 2.5 3.9
3.8 2.2 6.8
Such a file can be imported as a data frame like this:
The skip=1 parameter skips the first line (the comment).
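A sketch using read.fwf(), with assumed field widths of 4, 4 and 3 characters. It differs from the text in one respect: since read.fwf() expects header names separated by sep (a tab by default) rather than in fixed-width form, this example uses skip = 2 to skip both the comment and the header, and supplies the names through col.names.

```r
# Write a fixed-width sample file (widths 4, 4, 3 are an assumption)
path <- tempfile(fileext = ".txt")
writeLines(c(
  "# Data collected on 8/10/2023",
  "x   y   z  ",
  "1.2 3.7 2.7",
  "2.1 2.5 3.9",
  "3.8 2.2 6.8"
), path)

# Skip comment and header lines, then name the columns explicitly
df <- read.fwf(path, widths = c(4, 4, 3), skip = 2,
               col.names = c("x", "y", "z"))
df
```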
CSV files are special flat files where the field separator is the comma. In these cases we use the read.csv() function, which works like read.table() but does not require specifying the separator.
A CSV looks like this:
# Data collected on 8/10/2023
x,y,z
1.2,3.7,2.7
2.1,2.5,3.9
3.8,2.2,6.8
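A sketch of the import of such a file, written to a temporary path for the sake of a runnable example. Unlike read.table(), read.csv() does not treat # as a comment character by default, so it is set explicitly here:

```r
path <- tempfile(fileext = ".csv")
writeLines(c(
  "# Data collected on 8/10/2023",
  "x,y,z",
  "1.2,3.7,2.7",
  "2.1,2.5,3.9",
  "3.8,2.2,6.8"
), path)

# header = TRUE and sep = "," are the defaults of read.csv()
df <- read.csv(path, comment.char = "#")
df
```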
Software localized for Romance languages (Italian, Spanish, Portuguese and French) adopts the comma as decimal separator. Consequently, when such software (e.g. MS Excel) generates CSV files, it uses the semicolon as field separator.
In this case it is necessary to use the read.csv2() function in R, which takes the comma as decimal separator and the semicolon as field separator.
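A minimal sketch with an invented semicolon-separated file, written to a temporary path:

```r
path <- tempfile(fileext = ".csv")
writeLines(c(
  "x;y;z",
  "1,2;3,7;2,7",
  "2,1;2,5;3,9"
), path)

# read.csv2() uses sep = ";" and dec = "," by default
df <- read.csv2(path)
df$x   # 1.2 2.1, read as numbers
```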
The opposite of importing a file to a data frame is exporting a data frame to a file.
This operation is performed with the counterparts of the previous functions:
write.table()
write.fwf() (provided by the gdata package, not by base R)
write.csv() and write.csv2()
All these functions take two main arguments: the data frame to save and the destination file:
Other optional arguments are used to customize the result.
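A sketch of a round trip with write.csv(), using a temporary file and an invented data frame. The optional argument row.names = FALSE is one example of customization: it omits the column of row names.

```r
df <- data.frame(x = c(1.2, 2.1, 3.8), y = c(3.7, 2.5, 2.2))
path <- tempfile(fileext = ".csv")

# Data frame first, destination file second
write.csv(df, path, row.names = FALSE)

# Reading the file back gives the same data
back <- read.csv(path)
all.equal(df, back)
```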
Along with RStudio, a new wave of R libraries has emerged that radically changes the approach. They go by the collective name of tidyverse:
ggplot2: plots
purrr: functional programming
dplyr: data manipulation
stringr: string manipulation
tibble: improved data frames
readr: data import
tidyr: data preparation
lubridate: date manipulation
The tidyverse approach has some common characteristics:
Plots are composed with the + operator (ggplot(...) + geom_line()): each function is a layer
Function calls are chained with the pipe operator %>% (a %>% str() instead of str(a))
It is useful to consult the cheat sheets: https://posit.co/resources/cheatsheets/
# I create the histogram of a sample of 10 elements from 100 random numbers
# nested calls:
hist(sample(rnorm(100), 10))
Hardly readable: the first step of the algorithm is the innermost call
# step by step:
s <- rnorm(100)
c <- sample(s, 10)
hist(c)
More readable: the algorithm is more obvious, but it requires intermediate variables
# with the pipe operator (%>% comes from magrittr and is loaded with the tidyverse):
rnorm(100) %>% sample(10) %>% hist()
Much more readable: the sequential algorithm is obvious and no intermediate variables are needed
# even on multiple lines:
rnorm(100) %>%
  sample(10) %>%
  hist()
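For comparison, the same chain can be written without loading any package, assuming R version 4.1 or later, which provides the native pipe |>. This is an alternative to %>% and is not part of the original slides. Here plot = FALSE is used so that hist() returns the histogram object instead of drawing it.

```r
set.seed(42)   # arbitrary seed, only for reproducibility

# The left-hand side becomes the first argument of the next call
h <- rnorm(100) |>
  sample(10) |>
  hist(plot = FALSE)

sum(h$counts)   # 10: every sampled value falls into one bin
```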
paolo.bosetti@unitn.it — https://paolobosetti.quarto.pub/DESANED