Design of Experiments and Statistical Analysis of Experimental Data, 2024
Dpt. of Industrial Engineering, University of Trento
We load some data and make a quick plot
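The original data-loading code is not reproduced here; as a hypothetical stand-in, the sketch below generates 25 noisy observations of a nonlinear relationship (the size matches the sample count reported by `caret` later on) in a data frame named `data`:

```r
library(tidyverse)
library(modelr)

# Hypothetical stand-in for the dataset used in the original notes:
# 25 noisy observations of a nonlinear relationship between x and y
set.seed(1)
data <- tibble(
  x = seq(0, 10, length.out = 25),
  y = 5 + 3 * x - 2 * x^2 + 0.5 * x^3 + rnorm(25, sd = 10)
)

# quick scatter plot of the raw data
ggplot(data, aes(x, y)) + geom_point()
```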
The regression of a linear model is performed with the `lm()` function. It takes two arguments: a formula and a data frame. The formula is expressed in the *formula notation*, which is a map from an analytical regression model, such as \(y_i = a + bx_i + cx_i^2 + \varepsilon_i\), to a formula object such as `y~x + I(x^2)`.
To build a formula from a model you typically drop the coefficients and the error term, and protect any arithmetic operation on the predictors (such as powers) with `I()`. For example:

- `y~x1 + x2` corresponds to \(y_i = a + bx_{1,i} + cx_{2,i} + \varepsilon_i\)
- `y~x1 + x2 + x1:x2` corresponds to \(y_i = a + bx_{1,i} + cx_{2,i} + dx_{1,i}x_{2,i} + \varepsilon_i\), and can be abbreviated as `y~x1*x2`
So let’s build a linear model of degree 1 and plot the data with the regression line:
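A minimal sketch, assuming the `data` frame defined above:

```r
# degree-1 (straight line) fit
fit1 <- lm(y ~ x, data = data)

# scatter plot with the regression line overlaid
ggplot(data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```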
It is important to look at the residuals: they show a rather strong pattern, meaning that the linear relationship is underfitting the data.
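A residuals plot makes the pattern visible; a sketch using `modelr::add_residuals()` on the `fit1` model above:

```r
data %>%
  add_residuals(fit1) %>%   # adds a `resid` column to the data frame
  ggplot(aes(x, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
```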
Thus we need to increase the degree of the fitting polynomial. But by how much?
- The degree of the fitting polynomial is a *hyper-parameter*
- Regression parameters are the coefficients of the polynomial, typically calculated by minimizing the root mean square of the residuals
- The degree of the polynomial defines the number of regression parameters, and that is why it is named a hyper-parameter
- Identifying the best hyper-parameter(s) is the aim of validation and cross-validation strategies
In our case we want to compare polynomial fits up to degree 12. We use `modelr::fit_with()` to automate the building of many models together. The function needs two arguments: a fitting function (here `lm()`) and `formulas()`, which in turn takes a list of formulas to be used.

Let's see how it works. We first build a list of arguments:
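A possible sketch (the names `deg` and `args` match those used below; the exact construction in the original source may differ):

```r
deg <- 2:12  # degrees beyond the linear model; deg starts from 2!
args <- c(
  list(~y, ~x),  # the response, then the degree-1 right-hand side
  map(deg, function(d) as.formula(paste0("~poly(x, ", d, ")")))
)
```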
Now `args` is a list of formulas like `~y`, `~x`, `~poly(x, 2)`, `~poly(x, 3)`, etc.
Now the `formulas()` function wants \(n\) separate arguments, each being a formula, while we have a single list of formulas. We can solve this problem with the `do.call()` function, which calls a given function passing each element of a list as a separate argument:
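For example (a sketch, assuming the `data` frame and the `args` list built above):

```r
# fit one lm() model per formula in args
fits <- fit_with(data, lm, do.call(formulas, args))
```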
Now `fits` is a list of regression results, with polynomials from degree 1 to 12.
The quality of a regression can be verified with different metrics; the root mean square error (RMSE) and the mean absolute error (MAE) are the most commonly used.
Let's see how the RMSE and the \(R^2\) metrics change when the polynomial degree increases. To do that we build a table with three columns: the polynomial degree, the \(R^2\), and the RMSE. We extract these data from the list of linear models created above, `fits`. For each fitted linear model (an entry in `fits`), the \(R^2\) and RMSE can be extracted with the functions `rsquare()` and `rmse()`, respectively. We use `map_dbl()` to map these functions over the list of fitted models. The resulting table is then used to make a plot:
```r
tibble(
  degree = c(1, deg), # deg starts from 2!
  rsquare = fits %>% map_dbl(~rsquare(., data)),
  rmse = fits %>% map_dbl(~rmse(., data))
) %>%
  ggplot(aes(x = degree)) +
  geom_line(aes(y = rmse), color = "blue") +
  geom_line(aes(y = rsquare * 4), color = "red") +
  scale_y_continuous(sec.axis = sec_axis(
    \(x) scales::rescale(x, from = c(0, 4), to = c(0, 1)),
    breaks = seq(0, 1, 0.1),
    name = "R squared"
  ))
```
The \(R^2\) increases pretty quickly and saturates after degree 3. The RMSE decreases sharply and monotonically. It's hard to figure out the point where overfitting starts.
To avoid overfitting we want to stop increasing the degree when the model starts losing generality. We monitor generality by training on a subset of the data and estimating the quality metric on a different subset, not used for training. This is called cross-validation.
To solve the problem we use k-fold cross-validation. It is a validation strategy where we split the dataset into \(k\) subsets, or *folds*, with roughly the same number of observations. Then:

1. for each fold \(i = 1, \dots, k\), the model is trained on the remaining \(k-1\) folds and the quality metric (e.g. the RMSE) is evaluated on fold \(i\);
2. the \(k\) values of the metric are averaged, giving an estimate of the model performance on data not used for training.
In R, we use the `caret` library to simplify this process. The `caret::train()` function performs the folding for a given model: it takes as arguments the model formula, the regression function (in our case `lm()`), the dataset, and a list of parameters that can be created with the supporting `trainControl()` function.
The `trainControl()` function is used to define the details of the cross-validation strategy to use. In our case we use the repeated k-fold cross-validation, named `"repeatedcv"`, which repeats a k-fold a given number of times. In fact, the folds are defined by randomly sampling the initial dataset, so that the resulting RMSE (or any other metric) is also a random variable. Repeating the k-fold 100 times makes the whole process more robust:
```r
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 100)
model <- train(y ~ poly(x, 8), data = data, method = "lm", trControl = ctrl)
model
```
```
Linear Regression 

25 samples
 1 predictor

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 100 times) 
Summary of sample sizes: 20, 20, 21, 20, 19, 20, ... 
Resampling results:

  RMSE     Rsquared   MAE    
  10.3213  0.9241026  6.25129

Tuning parameter 'intercept' was held constant at a value of TRUE
```
The `model` object contains a field named `model$results` that is a table with all the available performance metrics:
| intercept | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| TRUE | 10.3213 | 0.9241026 | 6.25129 | 23.37541 | 0.1183862 | 12.14192 |
Now we want to repeat the k-fold validation over the list of formulas corresponding to the set of polynomials with degrees from 1 to 12. We use again the `map()` function:
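A sketch of this step, assuming the `ctrl` object defined above (the original code may differ in details; here we map over the degrees rather than over a prebuilt formula list):

```r
fit_quality <- tibble(degree = 1:12) %>%
  mutate(results = map(degree, function(d) {
    train(y ~ poly(x, d), data = data, method = "lm", trControl = ctrl)$results
  })) %>%
  unnest(results)
```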
Note the `unnest()` function at the end: the model field `$results` is actually a table, so without that function in `fit_quality` we would get a column `results` that contains a list of tables. The `unnest()` function flattens this list of tables in place.
Now we can finally make a plot of the metrics as a function of the polynomial degree:
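For instance, a sketch using `tidyr::pivot_longer()` to put all metrics in one column (the metric names follow `caret`'s output):

```r
fit_quality %>%
  select(degree, RMSE, Rsquared, MAE) %>%
  pivot_longer(-degree, names_to = "metric", values_to = "value") %>%
  ggplot(aes(x = degree, y = value)) +
  geom_line() +
  facet_wrap(~metric, scales = "free_y") +  # metrics have different scales
  scale_x_continuous(breaks = 1:12)
```

Faceting with free vertical scales avoids squashing \(R^2\) (bounded by 1) against the much larger RMSE and MAE values.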
From the previous plot we observe that the minimum of each metric occurs at degree 3:
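This can be checked numerically, e.g. with `dplyr::slice_min()`:

```r
# degree with the smallest cross-validated RMSE
fit_quality %>% slice_min(RMSE, n = 1)
```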
So we can finally accept the model \(y_i=a + bx_i + cx_i^2 + dx_i^3 + \varepsilon_i\) (a degree 3 polynomial in \(x_i\)):
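A sketch of the final fit:

```r
# degree-3 polynomial fit; raw = TRUE returns the coefficients a, b, c, d
# of the model as written above (poly() defaults to orthogonal polynomials)
final_fit <- lm(y ~ poly(x, 3, raw = TRUE), data = data)
summary(final_fit)
```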
paolo.bosetti@unitn.it — https://paolobosetti.quarto.pub/DESANED