Regression

Design of Experiments and Statistical Analysis of Experimental Data, 2024

Paolo Bosetti

Università di Trento, Dipartimento di Ingegneria Industriale

Regression

\(\renewcommand{\hat}[1]{\widehat{#1}}\) \(\renewcommand{\tilde}[1]{\widetilde{#1}}\) \(\renewcommand{\theta}{\vartheta}\)

Each model of a physical system or process depends on parameters. These parameters must be calculated by adapting the model to experimental observations (measurements)

The operation of fitting a model is carried out using regression

Basics

  • A model of a physical system can be expressed as: \[ y=f(x_1, x_2, \dots, x_n, c_1, c_2, \dots, c_m) \] where the \(x_i\) are the physical (random) variables, called regressors or predictors, the \(c_j\) are the (constant) parameters of the model, and \(y\) is the response or dependent variable

  • If \(n=1\), i.e. there is a single predictor and a single dependent variable, we speak of simple regression

  • For example, \(y=a+bx+cx^2\) is a simple model, linear in the parameters \(a,~b,~c\) (even though it is of second degree in the predictor \(x\))

  • Regressing the model means measuring \(y\) for different values of \(x\) and determining the values of the parameters \(a,~b,~c\) that minimize the distance between the model and the experimental observations

Types of regression

We will consider three types of regression:

  • Linear regression: the model is a linear combination of the parameters
  • Generalized linear regression: the model is still linear in the parameters, but the residuals are not necessarily normal
  • Least squares regression: the parameters are combined in a non-linear way

Linear Regression

Valid for every model that is linear in the parameters (while predictors can appear with a degree other than 1)

Basics

The statistical model of a process to be regressed, with \(n\) predictors, is defined as follows: \[ y_i=f(x_{1i}, x_{2i}, \dots, x_{ni}) = \hat{y_i} + \varepsilon_{i},~~~i=1,2,\dots,N \]

where \(i\) is the observation index (\(N\geq n+1\) in total), \(\hat{y_i}\) is the regressed value, or prediction, at observation \(i\), and the \(\varepsilon_{i}\) are the residuals

  • The regressed value corresponds to the deterministic component
  • The residuals are the random component
  • The hypothesis of normality of the residuals is assumed, i.e. that \(\varepsilon_{i} \sim\mathcal{N}(0, \sigma^2)\)

If \(f(\cdot)\) is an analytical function of a single predictor and linear in the coefficients (e.g. a polynomial of degree \(n\)), then we can express the model as \(\mathbf{A}\mathbf{k}=\mathbf{y}\), where

  • \(\mathbf{y}\) is the vector of \(y_i,~i=1\dots N\)
  • \(\mathbf{k}\) is the vector of parameters \(c_j,~j=1\dots n+1\)
  • \(\mathbf{A}\) is an \(N\times (n+1)\) matrix composed as follows:

\[ \mathbf A = \begin{bmatrix} x_1^{n} & x_1^{n-1} & \dots & x_1 & 1 \\ x_2^{n} & x_2^{n-1} & \dots & x_2 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ x_N^{n} & x_N^{n-1} & \dots & x_N & 1 \\ \end{bmatrix} \]

Basics

The linear equation can be solved with the pseudo-inverse method: \[ \begin{align} \mathbf A^T \mathbf A\cdot \mathbf{k} &= \mathbf A^T \cdot \mathbf{y} \\ (\mathbf A^T \mathbf A)^{-1} \mathbf A^T \mathbf A\cdot \mathbf{k} &= (\mathbf A^T \mathbf A)^{-1} \mathbf A ^T \cdot \mathbf{y} \\ \mathbf{k} &= (\mathbf A^T \mathbf A)^{-1} \mathbf A^T \cdot \mathbf{y} \end{align} \] This relationship makes clear what is meant by linear regression: it has nothing to do with the degree of the regressed function (which is not required to be of first degree in the predictors!), but only with the fact that the equation representing the model is linear in the parameters

NOTE:

  • from the previous matrix equation it is clear that the regression can be performed only if \(N\geq n+1\), i.e. if the number of observations is at least equal to the number of parameters
  • even in the case of a \(f(x_1,x_2,\dots,x_n)\), if it is linear in the coefficients it is always possible to express it as \(\mathbf{A}\mathbf{k}=\mathbf{y}\) and therefore solve it with the pseudo-inverse method

Example

Let the model to be regressed be of the type \(y_i=(ax_i + b) + \varepsilon_i\); then it can be represented as:

\[ \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \\ \end{bmatrix} \cdot \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \]

The figure shows the observations as points with coordinates \((x_i, y_i)\), the regressed model as a red line (a model linear in the coefficients \(a,~b\) and of first degree in the predictor \(x\)), and the residuals \(\varepsilon_i\) as blue segments representing the difference between the \(y_i\) and the corresponding regressed values \(\hat{y_i}\)
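The pseudo-inverse solution can be sketched in a few lines of code. The following Python snippet (the library choice and the synthetic data are assumptions for illustration, not part of the original material) builds the matrix \(\mathbf{A}\) for this \(y_i = ax_i + b\) model and solves for \(a\) and \(b\), both with the explicit normal-equation formula and with a library least-squares routine:

```python
import numpy as np

# Synthetic, hypothetical observations: y = 2x + 1 plus normal noise
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)

# N x 2 matrix A = [x 1] for the model y = a*x + b
A = np.column_stack([x, np.ones_like(x)])

# Pseudo-inverse solution k = (A^T A)^-1 A^T y
k = np.linalg.inv(A.T @ A) @ A.T @ y

# Same result, numerically more robust
k_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

a, b = k
residuals = y - A @ k            # the epsilon_i of the statistical model
print(f"a = {a:.3f}, b = {b:.3f}")
```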

Least Squares Regression

If the model is not linear in its parameters, it is still possible to regress it with the least squares method

Least Squares Regression

That is, we look for the set of parameter values that minimizes the distance between the model and the observations. This minimization can be achieved by defining an index of merit which represents the distance between the model and the observations as a function of the parameters: \[ \Phi(c_1, c_2,\dots,c_m)=\sum_{i=1}^N \left(y_i - f(x_{1i},x_{2i},\dots,x_{ni}, c_1, c_2,\dots,c_m) \right)^2 \]

  • If \(f(\cdot)\) is analytic and differentiable, then we can minimize \(\Phi(\cdot)\) by differentiation, i.e. by solving the system of \(m\) equations \[ \frac{\partial\Phi}{\partial c_j}(c_1,c_2,\dots,c_m) = 0,~~~j=1,2,\dots,m \]

  • If \(f(\cdot)\) is not differentiable, the minimum of \(\Phi(\cdot)\) can still be calculated numerically (e.g. Newton-Raphson method)
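As a sketch of the numerical route, the snippet below uses SciPy's curve_fit, which minimizes \(\Phi\) iteratively; the saturating-exponential model and the synthetic data are hypothetical, chosen only because the parameters enter non-linearly:

```python
import numpy as np
from scipy.optimize import curve_fit

# A model that is non-linear in its parameters a, b (hypothetical choice)
def f(x, a, b):
    return a * (1.0 - np.exp(-b * x))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
y = f(x, 3.0, 1.2) + rng.normal(0.0, 0.1, size=x.size)

# curve_fit numerically minimizes Phi = sum (y_i - f(x_i, a, b))^2
popt, pcov = curve_fit(f, x, y, p0=[1.0, 1.0])
print("a, b =", popt)                         # estimates close to 3 and 1.2
print("std. errors =", np.sqrt(np.diag(pcov)))
```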

Regression quality

To a given set of observations it is possible to fit an infinite number of models

It is possible to define some parameters of merit and some verification methods that allow you to evaluate the quality of a regression and, therefore, identify the model that best fits the observations

Coefficient of Determination

  • The coefficient of merit most used to evaluate a regression is the coefficient of determination \(R^2\)

  • It is defined as \(R^2 = 1 - \frac{SS_\mathrm{res}}{SS_\mathrm{tot}}\), where \(SS_\mathrm{res} = \sum \varepsilon_i^2\) and \(SS_\mathrm{tot} = \sum(y_i - \bar y)^2\)

  • If the regressed values correspond to the observed values \(y_i=\hat{y_i}\), then the residuals are all zero and \(R^2 = 1\)

  • The quality of the regression decreases as \(R^2\) decreases
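The definition translates directly into code; a minimal sketch (synthetic, hypothetical data):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)          # sum of squared residuals
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 25)
y = 1.5 * x + 2.0 + rng.normal(0.0, 1.0, size=x.size)

a, b = np.polyfit(x, y, 1)                     # first-degree fit
print(r_squared(y, a * x + b))                 # close to 1 for a good fit
```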

Underfitting

There is underfitting when the model has a degree lower than that suggested by the apparent behavior of the observations

It can be detected, besides by a low \(R^2\), by studying the distribution of the residuals: if there is underfitting the residuals can be non-normal and, above all, show trends, or patterns

A pattern is a regular trend of the residuals as a function of the regressors

From the number of maxima and minima present in the pattern it is possible to estimate how many degrees are missing
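A quick way to see such a pattern is to fit a model of too low a degree and inspect the residuals; in this sketch (hypothetical quadratic data) the residuals of a first-degree fit correlate strongly with \(x^2\), i.e. they form a pattern with a single extremum, suggesting one missing degree:

```python
import numpy as np

# Quadratic process fitted with a first-degree model: underfitting
rng = np.random.default_rng(3)
x = np.linspace(-3.0, 3.0, 40)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0.0, 0.5, size=x.size)

coef = np.polyfit(x, y, 1)           # degree too low for the data
res = y - np.polyval(coef, x)        # residuals

# The residuals follow a parabola-like pattern (one extremum):
# one degree is missing from the model
print("corr(res, x^2) =", np.corrcoef(res, x**2)[0, 1])
```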

Overfitting

If the degree of the model is excessive, the model tends to chase individual points

The value of \(R^2\) increases, reaching 1 when the degree equals the number of observations minus 1

However, the model loses generality and is no longer able to correctly predict new values acquired at a later time (the red crosses in the figure)

Overfitting has particularly dramatic effects in the case of extrapolation, i.e. when evaluating the model outside the range over which it has been regressed
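The effect can be reproduced with a short sketch (synthetic data; NumPy's polyfit is an assumed tool): a degree \(N-1\) polynomial passes through all \(N\) points, so \(R^2=1\), but its extrapolation just outside the fitted range is wildly off compared with a first-degree fit:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0.0, 1.0, 10)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.2, size=x.size)

good = np.polyfit(x, y, 1)             # degree matching the process
over = np.polyfit(x, y, x.size - 1)    # degree N-1: interpolates every point
                                       # (NumPy warns that it is poorly conditioned)

# Extrapolation just outside the regression range
x_new = 1.2
print("degree 1:", np.polyval(good, x_new))    # close to 1 + 2*1.2
print("degree 9:", np.polyval(over, x_new))    # typically far off
```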

Prediction Bands

It is a band, symmetric about the regression curve, within which the observations (present and future) have an assigned probability of falling

In general, for a sufficiently large number of observations (\(>50\)), the 95% prediction band contains approximately 95% of the observations

Confidence Bands

It is a band, symmetric about the regression curve, within which the expected value of the model has an assigned probability of falling

It is always narrower than the prediction band

It is the multi-dimensional equivalent of the confidence interval of a T-test: just as that is the interval within which the value corresponding to the null hypothesis has an assigned probability of falling, here we can assume that the “true” model falls within the confidence band with a given probability

It is obtained by calculating the confidence intervals on the regression parameters and then calculating, for each value of the predictor, the maximum and minimum value of the regression corresponding to the extreme values of the parameters within their confidence intervals
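Both bands can be obtained with standard tools; the sketch below uses statsmodels on synthetic data (an illustrative choice, not the course's prescribed tool), whose get_prediction method returns the confidence interval of the mean and the prediction interval for new observations:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 60)
y = 1.0 + 0.8 * x + rng.normal(0.0, 1.0, size=x.size)

X = sm.add_constant(x)                 # columns [1, x]
res = sm.OLS(y, X).fit()

pred = res.get_prediction(X).summary_frame(alpha=0.05)
# mean_ci_lower / mean_ci_upper : 95% confidence band (expected value of the model)
# obs_ci_lower  / obs_ci_upper  : 95% prediction band (where observations fall)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]].head())
```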

Generalized Linear Regression

Classic linear regression assumes the hypothesis of normality of residuals

When this hypothesis is not true, but the model is still linear in the parameters, generalized linear regression can be used

Basics

In the case of linear regression: \[ \begin{align} y_i &= f(\mathbf{x}_i, \mathbf{k}) + \varepsilon_i = \eta_i + \varepsilon_i \\ \varepsilon_i &\sim \mathcal{N}(0, \sigma^2) \end{align} \] In the case of generalized linear regression:

\[ \begin{align} y_i &= \eta_i + \varepsilon_i \\ \varepsilon_i &\sim D(p_1,p_2,\dots,p_k) \end{align} \] where \(D\) is a generic distribution with \(k\) parameters belonging to the exponential family (normal, binomial, gamma, inverse Gaussian, Poisson, quasi-normal, quasi-binomial and quasi-Poisson)

The problem can be solved by introducing a link function that rescales the residuals by projecting them onto a normal distribution

Basics

The link function \(g(\cdot)\) is such that:

\[ \begin{align} y_i &= \eta_i + g(\varepsilon_i) \\ \varepsilon_i &\sim D(p_1,p_2,\dots,p_k);~g(\varepsilon_i)\sim \mathcal{N}(0, \sigma^2) \end{align} \] The link functions for the most common distributions are:

Distribution Link function
Normal \(g(x)=x\)
Binomial \(g(x)=\mathrm{logit}(x)\)
Poisson \(g(x)=\log(x)\)
Gamma \(g(x)=1/x\)

In particular, it holds: \(\mathrm{logit}(x)=\log\frac{x}{1-x}\); its inverse is the logistic function, which in the regressed form is written \(\frac{1}{1+e^{-p(x-x_0)}}\)
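As a sketch of a generalized linear regression with a non-normal response, the snippet below fits count data with a Poisson family and its log link using statsmodels (the data and the library choice are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Count data: the residuals cannot be normal, so we use a GLM
# with a Poisson family and its canonical log link
rng = np.random.default_rng(11)
x = np.linspace(0.0, 3.0, 80)
mu = np.exp(0.5 + 0.9 * x)             # true mean on the log-link scale
y = rng.poisson(mu)

X = sm.add_constant(x)
res = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(res.params)                      # estimates close to [0.5, 0.9]
```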

Logistic Regression

The typical case of logistic regression is the binomial event classifier

We consider a process that, depending on one or more predictors, can give a result that is only one of two alternatives (success/failure, broken/intact, true/false, 1/0). We want to identify the threshold of the predictors at which the outcome switches

A linear regression is not suitable for this situation: it is clear that the residuals are not normal and that the slope of the regression depends strongly on how many points are collected in the “safe” zones

Logistic Regression

The regressed logistic function provides the parameter \(x_0\), which identifies the threshold value that separates equal numbers of false positives and false negatives

Furthermore, it is possible to identify the appropriate threshold to obtain a predetermined probability of false positives (or false negatives)

This is the simplest type of machine learning: a binomial classifier
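A minimal sketch of such a classifier (synthetic binary data; statsmodels is an assumed tool): the fitted coefficients give the 50% threshold \(x_0 = -b_0/b_1\), and any other probability threshold follows from inverting the logit:

```python
import numpy as np
import statsmodels.api as sm

# Binary outcomes switching from 0 to 1 around x0 = 2 (hypothetical data)
rng = np.random.default_rng(2)
x = np.linspace(0.0, 4.0, 100)
p_true = 1.0 / (1.0 + np.exp(-3.0 * (x - 2.0)))
y = rng.binomial(1, p_true)

X = sm.add_constant(x)
res = sm.GLM(y, X, family=sm.families.Binomial()).fit()
b0, b1 = res.params

x0 = -b0 / b1                          # 50% threshold (balanced errors)
print("x0 =", x0)

# Threshold giving e.g. a 5% probability of a positive outcome
p = 0.05
x_thr = (np.log(p / (1.0 - p)) - b0) / b1
print("x(5%) =", x_thr)
```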

Presentation of data

The concept of confidence band is essential in the graphical presentation of data from multiple series

Graphic comparison of multiple series

Suppose we have a process whose value depends on a variable \(x\)

Suppose that a process parameter \(S\) can affect the output value. For example:

  • the value is the hardness of a metal, \(x\) is the temperature, the parameter \(S\) is the quantity of an alloying element
  • the value is the productivity of a plant, \(x\) is a quantitative process parameter, the parameter \(S\) is the type of machine used

Suppose we repeat the measurement of the output value 8 times for each combination of \(x\) and \(S\), obtaining the results in the figure: which differences are significant?

Without a reference model

Without a reference model that expresses \(v=f(x, S)\) it makes no sense to perform a regression

However, I can report, for each treatment

  • the average value, joining the series with a broken line for the sole purpose of visually grouping the data
  • the limits of the confidence interval for each series and for each treatment
  • or join these limits with a band representing the confidence on the mean

Areas where the bands overlap are statistically indistinguishable
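A sketch of how such per-treatment confidence intervals can be computed (the data layout and values are hypothetical, with 8 repetitions per treatment as in the example above):

```python
import numpy as np
from scipy import stats

# 8 repetitions for each combination of x and S (hypothetical values)
rng = np.random.default_rng(9)
data = {("x=1", "S=A"): rng.normal(10.0, 1.0, 8),
        ("x=1", "S=B"): rng.normal(10.5, 1.0, 8),
        ("x=2", "S=A"): rng.normal(12.0, 1.0, 8),
        ("x=2", "S=B"): rng.normal(14.0, 1.0, 8)}

for key, v in data.items():
    m = v.mean()
    sem = v.std(ddof=1) / np.sqrt(v.size)      # standard error of the mean
    lo, hi = stats.t.interval(0.95, df=v.size - 1, loc=m, scale=sem)
    print(key, f"mean = {m:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Treatments whose intervals (or bands) overlap cannot be declared different at that confidence level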

With a model

Only if I have a model \(v=f(x, S)\) can I perform a regression

Also in this case, the regression must be accompanied by confidence bands

Again, areas where the bands overlap are statistically indistinguishable