Design of Experiments and Statistical Analysis of Experimental Data, 2024
Università di Trento, Dipartimento di Ingegneria Industriale
\(\renewcommand{\hat}[1]{\widehat{#1}}\) \(\renewcommand{\tilde}[1]{\widetilde{#1}}\) \(\renewcommand{\theta}{\vartheta}\)
Each model of a physical system or process depends on parameters. These parameters must be estimated by fitting the model to experimental observations (measurements)
The operation of fitting a model is carried out using regression
A model of a physical system can be expressed as: \[ y=f(x_1, x_2, \dots, x_n, c_1, c_2, \dots, c_m) \] where the \(x_i\) are the physical (random) variables, called regressors or predictors, the \(c_i\) are the (constant) parameters of the model, and \(y\) is the response or dependent variable
If \(n=1\), i.e. there is a single predictor and a single dependent variable, we speak of simple regression
For example, \(y=a+bx+cx^2\) is a simple model, linear in the parameters \(a,~b,~c\) (although of second degree in the predictor \(x\))
Regressing the model means making measurements of \(y\) for different values of \(x\) and determining the values of the parameters \(a,~b,~c\) that minimize the distance between the model and the experimental observations
We will consider three types of regression: linear regression, nonlinear (least squares) regression, and generalized linear regression
Linear regression is valid for every model that is linear in the parameters (while the predictors may appear with a degree other than 1)
The statistical model of a process with \(n\) parameters to be regressed is defined as follows: \[ y_i=f(x_{1i}, x_{2i}, \dots, x_{ni}) = \hat{y}_i + \varepsilon_{i},~~~i=1,2,\dots,N \]
where \(i\) is the observation index (\(N\geq n+1\) observations in total), \(\hat{y}_i\) is the regressed value, or prediction, at observation \(i\), and the \(\varepsilon_i\) are the residuals
If \(f(\cdot)\) is an analytic function of a single predictor and is linear in the coefficients (e.g. a polynomial of degree \(n-1\)), then we can express the model as \(\mathbf{A}\mathbf{k}=\mathbf{y}\), where \(\mathbf{k}\) is the vector of the \(n\) coefficients, \(\mathbf{y}\) is the vector of the \(N\) observations, and
\[ \mathbf A = \begin{bmatrix} x_1^{n-1} & x_1^{n-2} & \dots & x_1 & 1 \\ x_2^{n-1} & x_2^{n-2} & \dots & x_2 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ x_N^{n-1} & x_N^{n-2} & \dots & x_N & 1 \\ \end{bmatrix} \]
The linear equation can be solved with the pseudo-inverse method: \[ \begin{align} \mathbf A^T \mathbf A\cdot \mathbf{k} &= \mathbf A^T \cdot \mathbf{y} \\ (\mathbf A^T \mathbf A)^{-1} \mathbf A^T \mathbf A\cdot \mathbf{k} &= (\mathbf A^T \mathbf A)^{-1} \mathbf A ^T \cdot \mathbf{y} \\ \mathbf{k} &= (\mathbf A^T \mathbf A)^{-1} \mathbf A^T \cdot \mathbf{y} \end{align} \] This relationship makes clear what is meant by linear regression: it has nothing to do with the degree of the regressed function (which is not required to be linear in the predictors!), but only with the fact that the equation representing the model is linear in the parameters
NOTE:
Let the model to be regressed be of the type \(y_i=(ax_i + b) + \varepsilon_i\); then it can be represented as:
\[ \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \\ \end{bmatrix} \cdot \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \]
The figure shows the observations as points with coordinates \((x_i, y_i)\), the regressed model as a red line (linear model in the coefficients \(a,~b\) and first degree of the predictor \(x\)) and the residuals \(\varepsilon_i\) as blue segments representing the difference between the \(y_i\) and the corresponding regressed value \(\hat{y_i}\)
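As a minimal sketch of the pseudo-inverse solution (assuming Python with NumPy and made-up observations, with "true" values \(a=2.5\), \(b=1.0\) chosen only for illustration):

```python
import numpy as np

# Hypothetical observations (illustrative values only): "true" a = 2.5, b = 1.0
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 20)
y = 2.5 * x + 1.0 + rng.normal(0, 1.0, x.size)

# Design matrix A for the model y_i = a*x_i + b
A = np.column_stack([x, np.ones_like(x)])

# Pseudo-inverse solution k = (A^T A)^-1 A^T y
k = np.linalg.inv(A.T @ A) @ A.T @ y
a, b = k

# Numerically preferable equivalent (least-squares solver)
k_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ k          # regressed values (predictions)
eps = y - y_hat        # residuals
print(a, b)
```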
If the model is not linear in its parameters, it is still possible to regress it with the least squares method
That is, we look for the set of parameter values that minimizes the distance between the model and the observations. The minimization is achieved by defining a merit function, which measures the distance between model and observations as a function of the parameters: \[
\Phi(c_1, c_2,\dots,c_m)=\sum_{i=1}^N \left(y_i - f(x_{1i},x_{2i},\dots,x_{ni}, c_1, c_2 ,\dots,c_m) \right)^2
\]
If the \(f(\cdot)\) is analytic and differentiable, then we can minimize \(\Phi(\cdot)\) by differentiation, i.e. by solving the system of \(m\) equations \[ \frac{\partial\Phi}{\partial c_i}(c_1,c_2,\dots,c_m) = 0 \]
If the system cannot be solved analytically, or \(f(\cdot)\) is not differentiable, the minimum of \(\Phi(\cdot)\) can still be found numerically (e.g. with the Newton-Raphson method or a derivative-free optimizer)
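A sketch of such a numerical least-squares fit for a model that is not linear in its parameters (a hypothetical exponential decay; SciPy's `curve_fit` is used here as one possible minimizer of \(\Phi\)):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical model, nonlinear in the parameters c1 and c2
def f(x, c1, c2):
    return c1 * np.exp(-c2 * x)

# Made-up observations around the "true" values c1 = 3.0, c2 = 0.8
rng = np.random.default_rng(2)
x = np.linspace(0, 5, 30)
y = f(x, 3.0, 0.8) + rng.normal(0, 0.05, x.size)

# curve_fit numerically minimizes Phi = sum_i (y_i - f(x_i, c1, c2))^2
(c1, c2), cov = curve_fit(f, x, y, p0=[1.0, 1.0])
print(c1, c2)
```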
To a given set of observations it is possible to fit an infinite number of models
It is possible to define figures of merit and verification methods that allow one to evaluate the quality of a regression and, therefore, to identify the model that best fits the observations
The figure of merit most commonly used to evaluate a regression is the coefficient of determination \(R^2\)
It is defined as \(R^2 = 1 - \frac{SS_\mathrm{res}}{SS_\mathrm{tot}}\), where \(SS_\mathrm{res} = \sum_i \varepsilon_i^2 = \sum_i (y_i - \hat{y}_i)^2\) and \(SS_\mathrm{tot} = \sum_i(y_i - \bar y)^2\)
If the regressed values coincide with the observed values (\(y_i=\hat{y}_i\) for every \(i\)), then the residuals are all zero and \(R^2 = 1\)
The quality of the regression decreases as \(R^2\) decreases
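As a quick check of the definition, \(R^2\) can be computed directly from the residuals; a minimal sketch (reusing the hypothetical `y` and `y_hat` arrays of the earlier snippets):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)          # sum of squared residuals
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1 - ss_res / ss_tot

# e.g. r_squared(y, A @ k) for the linear example above
```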
There is underfitting when the model has a lower degree than the apparent behavior of the observations suggests
In addition to a low \(R^2\), it can be revealed by studying the distribution of the residuals: with underfitting the residuals may be non-normal and, above all, show trends, or patterns
A pattern is a regular trend of the residuals as a function of the regressors
From the number of maxima and minima present in the pattern it is possible to estimate how many degrees are missing
If the degree of the model is excessive, the model tends to chase individual points
The value of \(R^2\) increases, reaching 1 when the degree equals the number of observations minus 1
However, the model loses generality and is no longer able to correctly predict new values acquired at a later time (the red crosses in the figure)
Overfitting has particularly dramatic effects in the case of extrapolation, i.e. when evaluating the model outside the range over which it was regressed
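A sketch of the effect (data, degrees, and noise level are illustrative assumptions): fitting polynomials of increasing degree to the same made-up quadratic data shows \(R^2\) approaching 1 as the degree approaches \(N-1\), even though the high-degree model merely chases the individual points:

```python
import numpy as np

# Made-up observations generated by a quadratic "truth" plus noise (N = 10)
rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 10)
y = 1 + 2 * x + 3 * x**2 + rng.normal(0, 0.2, x.size)

for degree in (1, 2, 9):                     # underfit, adequate, overfit (N - 1 = 9)
    coeffs = np.polyfit(x, y, degree)        # linear regression on polynomial terms
    y_hat = np.polyval(coeffs, x)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    print(degree, 1 - ss_res / ss_tot)       # R^2 grows with the degree
```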
The prediction band is a band, symmetric with respect to the regression, within which the observations (present and future) have an assigned probability of falling
In general, for a sufficiently large number of observations (\(>50\)) the 95% prediction band contains 95% of the observations
The confidence band is a band, symmetric with respect to the regression, within which the expected value of the model has an assigned probability of falling
It is always narrower than the prediction band
It is the regression analogue of the confidence interval of a t-test: just as that is the interval within which the value corresponding to the null hypothesis has an assigned probability of falling, here we can assume that the “true” model falls within the confidence band with a given probability
It is obtained by calculating the confidence intervals on the regression parameters, then calculating—for each value of the predictor—the maximum and minimum value of the regression corresponding to the extreme values of the parameters in their confidence intervals
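One possible way to obtain both bands in practice (a sketch assuming Python with statsmodels and made-up straight-line data; the column names are those of statsmodels' prediction summary):

```python
import numpy as np
import statsmodels.api as sm

# Made-up observations of a straight-line process
rng = np.random.default_rng(4)
x = np.linspace(0, 10, 30)
y = 2.5 * x + 1.0 + rng.normal(0, 1.5, x.size)

X = sm.add_constant(x)                 # design matrix [1, x]
fit = sm.OLS(y, X).fit()               # ordinary least squares regression

# 95% bands evaluated at the observed predictor values
bands = fit.get_prediction(X).summary_frame(alpha=0.05)
conf_band = bands[["mean_ci_lower", "mean_ci_upper"]]   # confidence band (narrower)
pred_band = bands[["obs_ci_lower", "obs_ci_upper"]]     # prediction band (wider)
```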
Classic linear regression assumes the hypothesis of normality of residuals
When this hypothesis is not true, but the model is still linear in the parameters, generalized linear regression can be used
In the case of linear regression: \[ \begin{align} y_i &= f(\mathbf{x}_i, \mathbf{k}) + \varepsilon_i = \eta_i + \varepsilon_i \\ \varepsilon_i &\sim \mathcal{N}(0, \sigma^2) \end{align} \] In the case of generalized linear regression:
\[ \begin{align} y_i &= \eta_i + \varepsilon_i \\ \varepsilon_i &\sim D(p_1,p_2,\dots,p_k) \end{align} \] where \(D\) is a generic distribution with \(k\) parameters belonging to the exponential family (normal, binomial, gamma, inverse normal, Poisson, quasi-normal, quasi-binomial and quasi-Poisson)
The problem can be solved by introducing a link function that rescales the residuals by projecting them onto a normal distribution
The link function \(g(\cdot)\) is such that:
\[ \begin{align} y_i &= \eta_i + g(\varepsilon_i) \\ \varepsilon_i &\sim D(p_1,p_2,\dots,p_k);~g(\varepsilon_i)\sim \mathcal{N}(0, \sigma^2) \end{align} \] The link functions for the most common distributions are:
Distribution | Link function |
---|---|
Normal | \(g(x)=x\) |
Binomial | \(g(x)=\mathrm{logit}(x)\) |
Poisson | \(g(x)=\log(x)\) |
Gamma | \(g(x)=1/x\) |
In particular, the logit link is \(\mathrm{logit}(x)=\ln\frac{x}{1-x}\); its inverse is the logistic function, which in terms of the predictor takes the form \(\frac{1}{1+e^{-p(x-x_0)}}\) and is the curve actually fitted in logistic regression
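As a sketch of how a generalized linear regression is set up in practice (assuming Python with statsmodels and made-up count data; the Poisson family uses the log link of the table by default):

```python
import numpy as np
import statsmodels.api as sm

# Made-up count data: the response is Poisson-distributed, so residuals are not normal
rng = np.random.default_rng(5)
x = np.linspace(0, 2, 50)
y = rng.poisson(np.exp(0.5 + 1.2 * x))          # counts with a log-linear mean

X = sm.add_constant(x)
glm = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # log link by default
print(glm.params)                                # estimates close to 0.5 and 1.2
```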
The typical case of logistic regression is the binomial event classifier
We consider a process that, depending on one or more predictors, can provide a result that is valid for only one of two alternatives (success/failure, broken/intact, true/false, 1/0). We want to identify the threshold of the predictors that switches the outcome
A linear regression is not suitable for this situation: the residuals are clearly not normal, and the slope of the regression depends strongly on how many points are collected in the “safe” zones
The regressed logistic function provides the parameter \(x_0\), which identifies the predictor value separating equal amounts of false positives and false negatives
Furthermore, it is possible to identify the appropriate threshold to obtain a predetermined probability of false positives (or false negatives)
This is the simplest type of machine learning: a binomial classifier
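A sketch of such a binomial classifier (assuming Python with statsmodels and made-up pass/fail data; the threshold names `x0` and `x_p05` are illustrative):

```python
import numpy as np
import statsmodels.api as sm

# Made-up binary outcomes: the probability of "success" grows with x
rng = np.random.default_rng(6)
x = np.linspace(0, 10, 200)
p_true = 1 / (1 + np.exp(-1.5 * (x - 5.0)))     # logistic with p = 1.5, x0 = 5
y = rng.binomial(1, p_true)                     # observed 0/1 outcomes

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()   # logit link by default

b0, b1 = fit.params         # fitted model: logit(P) = b0 + b1*x
x0 = -b0 / b1               # threshold where P = 0.5 (the x_0 above)
# Threshold giving a chosen probability, e.g. P = 0.05, of a positive outcome:
x_p05 = (np.log(0.05 / 0.95) - b0) / b1
print(x0, x_p05)
```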
The concept of confidence band is essential in the graphical presentation of data from multiple series
Suppose we have a process whose output value \(v\) depends on a variable \(x\)
Suppose that a process parameter \(S\) can affect the output value. For example:
Suppose we repeat the measurement of the output value 8 times for each combination of \(x\) and \(S\), obtaining the results in the figure: which differences are significant?
Without a reference model that expresses \(v=f(x, S)\) it makes no sense to perform a regression
However, I can report, for each treatment, the mean of the repetitions together with its confidence interval
Areas where the bands overlap are statistically indistinguishable
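A minimal sketch of this model-free presentation (assuming Python with NumPy/SciPy and made-up repetitions; treatment labels are illustrative): for each treatment, compute the mean of the 8 repetitions and a t-based 95% confidence interval, to be drawn as error bars:

```python
import numpy as np
from scipy import stats

# Made-up data: 8 repetitions for each treatment (combination of x and S)
rng = np.random.default_rng(7)
treatments = {
    "x=1, S=A": rng.normal(10.0, 1.0, 8),
    "x=1, S=B": rng.normal(10.5, 1.0, 8),
    "x=2, S=A": rng.normal(12.0, 1.0, 8),
}

for name, v in treatments.items():
    n = v.size
    mean = v.mean()
    sem = v.std(ddof=1) / np.sqrt(n)           # standard error of the mean
    t = stats.t.ppf(0.975, df=n - 1)           # two-sided 95% t quantile
    print(f"{name}: {mean:.2f} ± {t * sem:.2f}")   # half-width of the error bar
```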
Only if I have a model \(v=f(x, S)\) can I perform a regression
Also in this case, the regression must be accompanied by confidence bands
Again, areas where the bands overlap are statistically indistinguishable
paolo.bosetti@unitn.it — https://paolobosetti.quarto.pub/DESANED