Design of Experiments and Statistical Analysis of Experimental Data, 2024
University of Trento, Department of Industrial Engineering
Descriptive Statistics is used to describe the behavior of random variables
A stochastic variable is a variable that takes on random values at each observation, i.e. such that it is not possible to predict the exact value of the next observation, not even knowing the previous observations
Measurement is the process that leads to the objective evaluation of a measurand; the result of a measurement is called a measure (or measurement result)
Stochastic variables are of particular interest for engineering and industry in general, given that each measurement produces, as a result, a value that has a random content and can therefore be represented as a stochastic variable
In turn, the random contribution to a measurement is called uncertainty
Given that every production activity is inextricably linked to measurements, it is therefore clear how fundamental it is to treat random contributions to measurements in a coherent and robust manner
In other words, the stochastic component of a measurement is affected by a scale effect: the larger the ratio between the measured average value and the typical variability of the instrument (i.e. its precision), the less appreciable the random effect will be
It is therefore essential to define precisely and effectively the intuitive concepts of variability and average value introduced above
In statistics, a population is a set of values, objects or events of interest for some analysis or experiment.
To study the height of residents in the city of Trento, the population of interest is the entire population of Trento.
To study the mechanical behavior of the aluminum alloy produced by a certain plant, the population of interest can be the entire quantity of alloy produced by a certain batch of raw material
But to also study the effects of variability in raw materials (between one batch and another), environmental conditions, etc., it would be more appropriate to define a larger set as a population
Therefore, the definition of the population of interest is often arbitrary
So:
the definition of the population depends on the objective of the analysis
the size of a population is generally very large and potentially not limited
consequently it is often impractical to consider the entire population
we then work on subsets randomly extracted from the population, called samples
being randomly drawn, samples approximate the population: the larger they are, the better the approximation
By observing a population we can identify a central value and a variability
The central value is called expected value and the variability is called variance
Expected value:
for discrete r.v.: \(\mu = E(x) := \sum_{i = 1}^N x_i p(x_i)\)
for continuous r.v.: \(\mu = E(x) := \int_{-\infty}^{+\infty} x f(x)~\mathrm{d}x\)
Variance:
for discrete r.v.: \(\sigma^2 = V(x) := \sum_{i = 1}^N (x_i - \mu)^2 p(x_i)\)
for continuous r.v.: \(\sigma^2 = V(x) := \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)~\mathrm{d}x\)
where \(E()\) and \(V()\) are the expected value and variance operators, respectively; \(x_i\) (and \(x\)) is the generic observation of the r.v., and \(p(x_i)\) and \(f(x)\) are the probability and probability density of finding a given value
The properties of a population are indicated with Greek letters: \(\mu\) and \(\sigma^2\)
NOTE: This results in \(\sigma^2 = E\left[(x-\mu)^2\right]\)
Probability (or frequency): for a discrete r.v., corresponds to the ratio between the number of observations of a given value and the total number of observations
Probability density: for a continuous r.v., the probability of finding exactly a given value is zero, so we refer to the probability of finding a value within a given interval. The probability density is the derivative of this probability with respect to \(x\) (i.e. the derivative of the cumulative distribution)
Also, note that the probabilities must sum to 1 and the probability density must integrate to 1, respectively: \[ \begin{array}{l} \sum_i p(x_i) = 1 \\ \int_{-\infty}^\infty f(x)~\mathrm{d}x = 1 \end{array} \]
This is because obviously the probability of finding any value must be certain, i.e. 1
The expected value and variance operators have the following properties:
\[ \begin{eqnarray} E(c)&=&c\\ E(x)&=&\mu\\ E(cx)&=&cE(x)=c\mu\\ V(c)&=&0\\ V(x)&=&\sigma^2\\ V(cx)&=&c^2V(x)=c^2\sigma^2\\ E(x+y)&=&E(x)+E(y)=\mu_x+\mu_y\\ \mathrm{Cov}(x,y)&=&E[(x-\mu_x)(y-\mu_y)]\\ V(x+y)&=&V(x)+V(y)+2\,\mathrm{Cov}(x,y)\\ V(x-y)&=&V(x)+V(y)-2\,\mathrm{Cov}(x,y) \end{eqnarray} \]
where \(c\) indicates a constant
The covariance operator is an index of how interdependent two stochastic variables are
More useful than covariance (which is not limited) is correlation which has the advantage of being within the range \([-1,1]\):
\[ \mathrm{Corr}(x, y) = \frac{E[(x-\mu_x)(y-\mu_y)]}{\sigma_x\sigma_y} = \frac{\mathrm{Cov}(x,y) }{\sigma_x\sigma_y} \]
Covariance and correlation are also referred to as \(\sigma_{xy}^2\) and \(\rho_{xy}\), respectively.
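As a minimal R sketch (data and variable names are illustrative), covariance and correlation can be computed directly:

```r
# Two correlated stochastic variables (simulated, names are illustrative)
set.seed(1)
x <- rnorm(1000, mean = 10, sd = 2)
y <- 3 * x + rnorm(1000, sd = 4)     # y depends linearly on x, plus noise

cov(x, y)                    # sample covariance, unbounded
cor(x, y)                    # Pearson correlation, always in [-1, 1]
cov(x, y) / (sd(x) * sd(y))  # same as cor(x, y), by definition
```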
A population may be too large to be analyzed directly
Then we analyze the subsets obtained by sampling, i.e. random extraction
Randomly drawing a large enough sample does not alter the distribution properties of the population
From a population of \(N\) elements it is possible to extract a number of different samples of size \(n\) described by the binomial coefficient: \[\binom{N}{n}=\frac{N! }{(N-n)!n!},~N>n\]
For each property of the population it is possible to define an estimator, or statistic, built on the sample
Sample mean and variance are defined
\[ \begin{eqnarray} \bar x &=& \frac{1}{n}\sum_{i=1}^n x_i\\ S^2 &=& \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1} \end{eqnarray} \]
A particular value taken by an estimator is called estimate
Instead of the variance \(S^2\) the standard deviation \(S\) is often used (same units of measurement)
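A minimal R sketch of these estimators (the data are simulated for illustration):

```r
# Sample estimators on a random sample (illustrative data)
set.seed(2)
x <- rnorm(20, mean = 5, sd = 1.5)

mean(x)   # sample mean, x-bar
var(x)    # sample variance S^2 (divides by n - 1)
sd(x)     # sample standard deviation S, same units as x
```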
Since each sample is drawn randomly, each estimate is a random variable
The larger the sample, the closer the estimate is to the corresponding property of the population (convergence in probability: the estimator is consistent)
Expected value of the average:
\[ \begin{eqnarray} \mathrm E(\bar x) &=& \mathrm E(\frac{x_1+x_2+\dots+x_n}{n}) = \frac{1}{n}\left[\mathrm E (x_1+x_2+\dots+x_n) \right]\\ &=& \frac{1}{n}\left[\mathrm E(x_1)+\mathrm E(x_2)+\dots+\mathrm E(x_n) \right] = \frac{1}{n} n\mathrm E(x) \\ \mathrm E(\bar x)&=& \mu \end{eqnarray} \]
Variance of the mean:
\[ \begin{eqnarray} \mathrm V(\bar x) &=& \mathrm V(\frac{x_1+x_2+\dots+x_n}{n}) = \frac{1}{n^2}\left[\mathrm V(x_1+ x_2+\dots+x_n) \right]\\ &=& \frac{1}{n^2}\left[\mathrm V(x_1)+\mathrm V(x_2)+\dots+\mathrm V(x_n) \right] = \frac{n\mathrm V( x)}{n^2} = \frac{\mathrm V(x)}{n} \\ \mathrm V(\bar x) &=& \frac{\sigma^2}{n} \end{eqnarray} \]
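These two results can be checked numerically; the following R sketch (with arbitrary values of \(\mu\), \(\sigma\) and \(n\)) simulates many sample means:

```r
# Check E(x-bar) = mu and V(x-bar) = sigma^2 / n by simulation
# (mu, sigma and n are arbitrary choices for the sketch)
set.seed(3)
mu <- 10; sigma <- 2; n <- 25

xbar <- replicate(10000, mean(rnorm(n, mu, sigma)))

mean(xbar)      # close to mu
var(xbar)       # close to sigma^2 / n = 0.16
sigma^2 / n
```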
The degrees of freedom (DoF) of a statistic are the number of independent elements that appear in its definition. From the definition of the variance it follows that
\[ \sigma^2=E\left(\frac{\sum(x_i - \bar x)^2}{n-1}\right)=E\left(\frac{SS}{\nu}\right) \]
That is, the variance is the expected value of the Sum of Squares divided by its number of degrees of freedom \(\nu\), i.e. of independent elements.
That the latter are \(n-1\) is demonstrated by the following relation: \[ \sum_{i=1}^n(x_i-\bar x) = \sum_{i=1}^n x_i - n\bar x = 0 \] therefore not all the \(n\) elements in the definition of \(SS\) can be independent, since the value of any one of them can be obtained from the remaining \(n-1\) through the definition of \(\bar x\)
Expected value and variance are not the only two properties of a population.
Two populations can have the same expected value and variance parameters but have different shapes. The shape of a population is called distribution
Discrete distributions
Continuous distributions
A Bernoulli process is a series of \(n\) events with outcomes \(z_1, z_2, \dots, z_n\) such that: each event has only two possible outcomes (success or failure); the probability of success is the same for every event; the events are independent of each other
The binomial distribution describes the probability of obtaining \(x\in[0, n]\) successes in the \(n\) trials of a Bernoulli process
The geometric distribution is the probability distribution of obtaining the first success in a Bernoulli process after \(x \in \mathbb{N}^+\) failures
We say that \(x\sim\mathrm{Norm}(\mu,\sigma^2)\) or \(x\sim\mathcal{N}(\mu,\sigma^2)\) when the PDF is: \[ f(x) =\frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{1}{2}\left[\frac{x-\mu}{\sigma}\right ]^2} \]
Moments: coincide with the two parameters \(\mu\) and \(\sigma^2\)
If \(x\sim\mathcal{N}(\mu, \sigma^2)\) then \[ \frac{x-\mu}{\sigma}\sim\mathcal{N}(0,1) \] and the distribution \(\mathcal{N}(0,1)\) is called standard normal.
If \(x_1, x_2, \dots,x_n\) are \(n\) independent and identically distributed (IID) random variables with \(E(x_i)=\mu\) and \(V(x_i)=\sigma^2~ ~\forall i=1,2,\dots,n\) (both finite), and \(y=x_1+x_2+\dots+x_n\), then: \[ z_n=\frac{y-n\mu}{\sqrt{n\sigma^2}} \] approximates a distribution \(\mathcal{N}(0,1)\), in the sense that if \(F_n(z)\) is the distribution function of \(z_n\) and \(\Phi(z)\) is the distribution function of \(\mathcal{N}(0,1)\), then: \[ \lim_{n\rightarrow+\infty}\frac{F_n(z)}{\Phi(z)}=1 \]
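The following R sketch illustrates the theorem by summing uniform variables (an arbitrary choice for the example):

```r
# Central limit theorem: sums of IID non-normal variables tend to the normal
# (here x_i ~ Uniform(0, 1), with mu = 0.5 and sigma^2 = 1/12)
set.seed(4)
n <- 50
y <- replicate(5000, sum(runif(n)))        # y = x_1 + ... + x_n
z <- (y - n * 0.5) / sqrt(n * (1 / 12))    # standardized sum z_n

hist(z, breaks = 30, freq = FALSE, main = "z_n vs. N(0,1)")
curve(dnorm(x), add = TRUE, lwd = 2)       # standard normal PDF overlay
```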
Considering the quadratic sum of a sample of \(k\) elements \(y_i\), each coming from a distribution \(\mathcal{N}(\mu, \sigma^2)\), we have: \[ \frac{y_i-\mu}{\sigma}\sim \mathcal{N}(0,1)~~\forall i=1,2,\dots,k \] so that \(\sum_{i=1}^k(y_i-\mu)^2/\sigma^2\sim\chi^2_{k}\); replacing \(\mu\) with the sample mean \(\bar y\) removes one degree of freedom, and therefore: \[ \frac{\mathit{SS}}{\sigma^2}=\frac{\sum_{i=1}^k(y_i-\bar y)^2}{\sigma^2} \sim \chi^2_{k-1} \]
Let there be two r.v. \[ z\sim\mathcal{N}(0,1),~x\sim\mathcal{X}^2_k \] then their combination \[ t=\frac{z}{\sqrt{x/k}} \] is distributed as a Student’s T with \(k\) degrees of freedom
It is said that \(x\sim\mathrm{T}_k\) or \(x\sim\mathcal{T}_k\) when the PDF is: \[ f(x)=\frac{\Gamma\left((k+1)/2\right)}{\sqrt{k\pi}\,\Gamma(k/2)}\left(\frac{x^2}{k}+1\right)^{-(k+1)/2} \]
Moments: \[ \mu = 0,~~~ \sigma^2=k/(k-2) \]
The standard normal \(\mathcal{N}(0,1)\) is the limiting case of Student’s T: \[ \lim_{k\rightarrow+\infty}t_k=\mathcal{N}(0,1) \] The convergence is very rapid: already for \(k>30\) the difference between the two distribution functions becomes negligible
Let there be two r.v. \[ x_u\sim\mathcal{X}^2_u,~~~x_v\sim\mathcal{X}^2_v \] and \(x\) is defined as: \[x=\frac{x_u/u}{x_v/v}\] then \(x\) is a r.v. distributed as an F of Snedecor
It is said that \(x\sim\mathrm{F}_{u,v}\) or \(x\sim\mathcal{F}_{u,v}\) when the PDF is: \[ f(x)=\frac{\Gamma\left(\frac{u+v}{2}\right)\left(\frac{u}{v}\right)^{u/2}x^{( u/2)-1}}{\Gamma\left( \frac{u}{2} \right)\Gamma\left( \frac{v}{2} \right) \left(\frac{u}{ v}x+1\right)^{(u+v)/2}} \]
Moments: \[ \mu = \frac{v}{v-2},~~~\sigma^2=\frac{2v^2(u+v-2)}{u(v-2)^2(v-4) } \]
A distribution is described by three interrelated functions: the probability density function (PDF), the cumulative distribution function (CDF), and the quantile function (the inverse of the CDF)
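In R, these three functions are available for every common distribution through the d/p/q prefixes (e.g. dnorm, pnorm, qnorm for the normal, and analogously dt/pt/qt, df/pf/qf, dchisq/pchisq/qchisq):

```r
# PDF, CDF and quantile function of the standard normal in R
dnorm(1.96)            # PDF f(x)
pnorm(1.96)            # CDF F(x) = P(X <= x), about 0.975
qnorm(0.975)           # quantile function (inverse CDF): about 1.96
pnorm(qnorm(0.975))    # CDF and quantile function are inverse of each other
```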
Inferential statistics studies the operations of inference, i.e. obtaining information about the population starting from a sample
A hypothesis test can have two types of errors:
The purpose of inferential statistics is to associate a probability with these errors
Null hypothesis | true | false |
---|---|---|
accepted | OK | Type II error |
rejected | Type I error | OK |
The probability of a Type I Error is \(\alpha\), the probability of a Type II Error is \(\beta\).
The value \(1-\beta\) is the power \(P\) of the test
Obviously the answer is probabilistic: we can only associate an error probability \(\alpha\) with the hypothesis test \[ \begin{eqnarray} H_0:~&\mu_1 = \mu_2 \\ H_1:~&\mu_1 \neq \mu_2 \\ \end{eqnarray} \]
For the two samples \(y_{1,i}\) and \(y_{2,i}\) it is possible to define the variable: \[ t_0 = \frac{\bar{y}_2 - \bar{y}_1}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \] where \(S_p^2\) is called pooled variance and is: \[ S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2} \] The \(t_0\) is the ratio between a normal r.v. in the numerator and (the square root of) a scaled \(\chi^2\) r.v. in the denominator; consequently it is itself a r.v., distributed as a Student’s T.
The number of degrees of freedom is the same as \(\chi^2\) and is \(n_1+n_2-2\)
\(t_0\) is called test statistic and it holds that \(t_0\sim t_{n_1+n_2-2}\)
It is clear that cases like C1 vs. C3 in the previous graph will have a higher value of \(|t_0|\) than cases like C1 vs. C2 (relationship between distance and variance)
But given a pair of samples from the same population the probability of high values of \(|t_0|\) is very low
Since we know the distribution of \(t_0\), we can therefore calculate the probability of finding a given value under the assumption that \(H_0\) is valid
In other words, the probability of rejecting \(H_0\) when it is true is equal to the probability of finding a value equal to or greater than \(t_0\) in the Student distribution
In reality, the sign of \(t_0\) is arbitrary, so we need to check the probability of finding a value outside the interval \([-|t_0|, |t_0|]\)
Ultimately, the probability of a type I error in the Student test is: \[ p\mathrm{-value} = 2\mathrm{CDF}^+_t(|t_0|,n_1+n_2-2) \] where \(\mathrm{CDF}^{+}_{t}\) is the cumulative distribution, upper tail, of a Student’s T with \(n_1+n_2-2\) DoF, and where the factor 2 takes into account the last point on the previous page
This Type I error probability, computed for any statistical test, is called the p-value
Often in statistical tests \(H_0\) is accepted or rejected on the basis of a Type I error probability threshold, denoted by \(\alpha\)
The threshold value is calculated using the quantile function and \(H_0\) is rejected when: \[ |t_0| \geq t_{0,\mathrm{max}} = t_{\alpha/2, n_1+n_2-2} \] where \(t_{\alpha/2, n_1+n_2-2}\) is precisely the quantile function, upper tail, of the Student’s T evaluated for the probability \(\alpha/2\) and for \(n_1+n_2-2\) DoF.
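As a sketch, the whole procedure can be reproduced in R on simulated data (names and values are illustrative) and compared with the built-in t.test():

```r
# Two-sample Student test "by hand" vs. t.test() (illustrative data)
set.seed(5)
y1 <- rnorm(10, mean = 20, sd = 2)
y2 <- rnorm(12, mean = 22, sd = 2)
n1 <- length(y1); n2 <- length(y2)

Sp2 <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2)  # pooled variance
t0  <- (mean(y2) - mean(y1)) / sqrt(Sp2 * (1 / n1 + 1 / n2))      # test statistic

p_value <- 2 * pt(abs(t0), df = n1 + n2 - 2, lower.tail = FALSE)  # two-sided p-value
t_max   <- qt(0.05 / 2, df = n1 + n2 - 2, lower.tail = FALSE)     # threshold for alpha = 5%

c(t0 = t0, p_value = p_value, t_max = t_max)
t.test(y2, y1, var.equal = TRUE)   # same t0 and p-value
```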
In the absence of calculators, the Student Test was performed by predetermining \(\alpha\) and deciding whether to reject \(H_0\) based on the quantile table:
dof | 0.1 | 0.05 | 0.025 | 0.01 | 0.005 | 0.0025 | 0.001 |
---|---|---|---|---|---|---|---|
1 | 3.078 | 6.314 | 12.706 | 31.821 | 63.657 | 127.321 | 318.309 |
2 | 1.886 | 2.920 | 4.303 | 6.965 | 9.925 | 14.089 | 22.327 |
3 | 1.638 | 2.353 | 3.182 | 4.541 | 5.841 | 7.453 | 10.215 |
4 | 1.533 | 2.132 | 2.776 | 3.747 | 4.604 | 5.598 | 7.173 |
5 | 1.476 | 2.015 | 2.571 | 3.365 | 4.032 | 4.773 | 5.893 |
6 | 1.440 | 1.943 | 2.447 | 3.143 | 3.707 | 4.317 | 5.208 |
7 | 1.415 | 1.895 | 2.365 | 2.998 | 3.499 | 4.029 | 4.785 |
8 | 1.397 | 1.860 | 2.306 | 2.896 | 3.355 | 3.833 | 4.501 |
9 | 1.383 | 1.833 | 2.262 | 2.821 | 3.250 | 3.690 | 4.297 |
10 | 1.372 | 1.812 | 2.228 | 2.764 | 3.169 | 3.581 | 4.144 |
For example, for a sample with 8 DoF and a two-tailed test with \(\alpha=10\%\) (hence the column \(\alpha/2=0.05\)), the critical value is \(t_{0,\mathrm{max}}=1.860\): any computed \(t_0\) larger than this value in absolute value leads to the rejection of \(H_0\) with an error probability of less than 10%
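The same quantiles can be obtained in R with qt(), for example:

```r
# Upper-tail Student quantiles, matching the table above
qt(0.05,  df = 8,  lower.tail = FALSE)   # 1.860 (row 8, column 0.05)
qt(0.025, df = 10, lower.tail = FALSE)   # 2.228 (row 10, column 0.025)
```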
The Student test seen above is valid for the most generic case. There are variations of the test for the following conditions, which can combine to give 4 different tests: one-sample vs. two-sample tests; one-sided (\(H_1:\mu_1>\mu_2\) or \(\mu_1<\mu_2\)) vs. two-sided (\(H_1:\mu_1\neq\mu_2\)) alternative hypotheses
Furthermore, in the case of two-sample tests it is possible to assume that the samples are homoscedastic or not
Note: The one-sided test, at the same value of \(t_0\), has a lower rejection threshold of \(H_0\), and is therefore more powerful
The two-sample test seen above assumes that the two samples are homoscedastic. If they are not, the definition of \(t_0\) must be revised as follows (Welch’s test): \[ t_0 = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}};~~~\nu=\frac{(S_1^2/n_1 + S_2^2/n_2)^2}{\frac{(S_1^2/n_1)^2}{n_1-1}+\frac{(S_2^2/n_2)^2}{n_2-1}} \] The homoscedasticity hypothesis must be verified beforehand with a variance test: \[ \begin{eqnarray} H_0 :~& \sigma^2_1 = \sigma^2_2 \\ H_1 :~& \sigma^2_1 \neq \sigma^2_2 \end{eqnarray},~~~F_0=\frac{S_1^2}{S_2^2}\sim\mathcal{F}_{n_1-1, n_2-1} \] That is, if the p-value associated with \(F_0\) is small, the samples are assumed not to be homoscedastic and the Welch test is performed; otherwise the Student test applies
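In R, this sequence corresponds to var.test() followed by t.test() with the appropriate var.equal option (the data below are simulated for illustration):

```r
# Check homoscedasticity, then choose Student or Welch (illustrative data)
set.seed(6)
y1 <- rnorm(15, mean = 10, sd = 1)
y2 <- rnorm(15, mean = 11, sd = 3)

var.test(y1, y2)                   # F test on sigma_1^2 = sigma_2^2
t.test(y1, y2, var.equal = TRUE)   # Student test (assumes homoscedasticity)
t.test(y1, y2)                     # Welch test (R default: var.equal = FALSE)
```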
In the case of two-sample Student tests, when they have the same size and are collected two by two in very similar conditions, it is advisable to pair them and carry out the paired Student test
Each measurement can be expressed as: \[ y_{ij}=\mu_i+\beta_j+\varepsilon_{ij};\hspace{9pt} \left\{ \begin{array}{l}i=1,2\\j=1,2,\dots ,n\end{array} \right. \] If we define \(d_j=y_{1j} - y_{2j}\), remembering the properties of the operator \(E(\cdot)\), it results in \(\mu_d = \mu_1 - \mu_2\). So we can reformulate a pair of equivalent hypotheses: \[ \begin{eqnarray} H_0 :~& \mu_d = 0 \\ H_1 :~& \mu_d \neq 0 \end{eqnarray} \] So the paired test is a one-sample test, with the advantage that random effects between pairs of observations do not affect the test result
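A minimal R sketch of a paired test on simulated data (names are illustrative):

```r
# Paired Student test (illustrative data: same items measured before/after a treatment)
set.seed(7)
before <- rnorm(10, mean = 100, sd = 5)
after  <- before + rnorm(10, mean = 2, sd = 1)

t.test(after, before, paired = TRUE)   # paired two-sample test
t.test(after - before)                 # equivalent one-sample test on the differences d_j
```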
Given unknown parameter \(\vartheta\), we want to define two statistics \(L\) and \(U\), such that the probability: \(P(L\leq\vartheta\leq U)= 1-\alpha\). In this case the interval \([L,U]\) is called the confidence interval for the parameter \(\vartheta\).
Let’s consider a one-sample T-test. We know that: \[ t_0=\frac{\bar x - \mu}{S/\sqrt{n}}\sim t_{n-1} \] If \(c\) is \(t_{\alpha/2, n-1}\), then, by definition: \(P(-c\leq t_0\leq c) = 1-\alpha\)
Substituting \(t_0\) we get: \[ P(\bar x - cS/\sqrt{n} \leq \mu \leq \bar x + cS/\sqrt{n}) = 1-\alpha \]
In practice, it means that the expected value of the population (which is unknown!) has the probability \(1-\alpha\) of falling in the interval \([\bar x - cS/\sqrt{n}, \bar x + cS /\sqrt{n}]\); that probability is called confidence
If the \(\mu_0\) of the corresponding test lies outside the confidence interval, then we can reject \(H_0\) with an error less than \(\alpha\).
For a two-sample test it results: \[ P\left( -t_{\alpha/2,n_1+n_2-2}\leqslant\frac{(\bar y_1-\bar y_2)-(\mu_1-\mu_2)}{S_p\sqrt{\frac{1 }{n_1}+\frac{1}{n_2}}}\leqslant t_{\alpha/2,n_1+n_2-2} \right)=1-\alpha \] So the confidence interval \([L,U]\) is defined by: \[ \left[(\bar y_1-\bar y_2)-t_{\alpha/2,n_1+n_2-2}S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}},~ (\bar y_1-\bar y_2)+t_{\alpha/2,n_1+n_2-2}S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}~\right] \]
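A sketch of the one-sample confidence interval in R (illustrative data); t.test() reports the same interval directly:

```r
# Confidence interval for mu at confidence level 1 - alpha (illustrative data)
set.seed(8)
x <- rnorm(20, mean = 12, sd = 3)
alpha <- 0.05
crit  <- qt(alpha / 2, df = length(x) - 1, lower.tail = FALSE)

mean(x) + c(-1, 1) * crit * sd(x) / sqrt(length(x))   # [L, U]
t.test(x, conf.level = 1 - alpha)$conf.int            # same interval from t.test()
```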
The collection of data relating to a sample may be affected by errors:
These errors obviously have a high probability of not agreeing with the rest of the data, i.e. of being too far from the mean in relation to the typical variance of the process.
These errors are called outliers and it is good practice to identify and eliminate them immediately after completing data collection
The most common graphical method for identifying outliers is the box-plot
The Chauvenet criterion is not a real test but a criterion based on which the point furthest from the mean (only one!) may be eliminated
Given \(\left<x_1, x_2,\dots,x_n\right>\) with normal distribution, the maximum absolute difference is calculated \(s_0=\underset{i=1,\dots,n}{\max} \left(\frac{|x_i - \bar x|}{S_x}\right)\)
Due to the normality of the \(x_i\), the standardized deviation \(|x_i-\bar x|/S_x\) is approximately distributed as a standard normal. We can then calculate the probability of a value greater than or equal to \(s_0\) from the upper-tail distribution function: \[ P_s=\mathrm{CDF}_{\mathcal{N}}^+(s_0) \] Over \(n\) normal observations we expect \(nP_s\) values larger than \(s_0\). So, if \(nP_s < 0.5\), the suspicious point is unlikely to belong to the population and we can discard it as an outlier
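A sketch of this criterion in R, following the formulation above (the sample is simulated and the last point is deliberately anomalous):

```r
# Criterion sketch as formulated above (simulated sample with one anomalous point)
set.seed(9)
x <- c(rnorm(20), 4.5)

s  <- abs(x - mean(x)) / sd(x)          # standardized absolute deviations
s0 <- max(s)                            # the most suspicious point
Ps <- pnorm(s0, lower.tail = FALSE)     # upper-tail probability P_s

n <- length(x)
if (n * Ps < 0.5) x <- x[-which.max(s)] # discard the outlier (only one!)
```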
The Grubbs test is a real statistical test, based on a pair of hypotheses (\(H_1\) leads to the rejection of the suspected outlier), on a test statistic and on the calculation of a p-value.
In short: \[ \begin{eqnarray} G_0 &=& \frac{\underset{i=1,\dots,n}{\max}|y_i-\bar y|}{S_y}\\ G_0 &>& \frac{n-1}{\sqrt{n}}\sqrt{\frac{t^2_{\alpha/(2n),n-2}}{n-2+t^2_{\alpha/(2n),n-2}}} \Rightarrow \neg H_0 \end{eqnarray} \]
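A sketch of the Grubbs statistic and critical value in R, following the formulas above (the helper function names are illustrative):

```r
# Grubbs test statistic and critical value, following the formulas above
grubbs_G0 <- function(y) max(abs(y - mean(y))) / sd(y)

grubbs_crit <- function(n, alpha = 0.05) {
  t2 <- qt(alpha / (2 * n), df = n - 2, lower.tail = FALSE)^2
  (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2))
}

set.seed(10)
y <- c(rnorm(20), 5)                    # illustrative sample with a suspicious point
grubbs_G0(y) > grubbs_crit(length(y))   # TRUE -> reject H0, the point is an outlier
```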
The tests seen so far always assume that the samples being operated on derive from a normal distribution, despite the unknown parameters
Before carrying out any test, therefore, it is necessary to verify the hypothesis of normality
As with the outliers analysis, also in this case it is possible to use both graphical methods and statistical tests
In general, statistical tests are always preferable, because they provide an objective, quantitative criterion (a p-value), whereas graphical methods rely on a subjective visual assessment
For the number of bars, or bins, use the Sturges formula: \(K = \lceil\log_2 n \rceil + 1\); alternatively, the Scott formula gives the bin width: \(h = 3.49\, s / \sqrt[3]{n}\)
i | x | f | q |
---|---|---|---|
1 | -1.540 | 0.061 | -1.547 |
2 | -0.929 | 0.159 | -1.000 |
3 | -0.326 | 0.256 | -0.655 |
4 | -0.295 | 0.354 | -0.375 |
5 | -0.006 | 0.451 | -0.123 |
6 | 0.415 | 0.549 | 0.123 |
7 | 1.263 | 0.646 | 0.375 |
8 | 1.272 | 0.744 | 0.655 |
9 | 1.330 | 0.841 | 1.000 |
10 | 2.405 | 0.939 | 1.547 |
Column x contains the sorted observations; column f is the sample cumulative distribution, corrected with Blom’s formula: \(f_i=(i-3/8)/(n+1-3/4)\); column q contains the standard normal quantiles corresponding to f.
The diagonal on the graph passes through the two points of the first and third quartiles
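A sketch in R reproducing the construction above (Blom's correction, then the normal quantiles) and drawing the Q-Q plot:

```r
# Normal Q-Q plot with Blom's correction (illustrative data, n = 10)
set.seed(11)
x <- sort(rnorm(10))
n <- length(x)
i <- seq_along(x)

f <- (i - 3/8) / (n + 1 - 3/4)   # Blom's formula (ppoints(n) does the same for n <= 10)
q <- qnorm(f)                    # standard normal quantiles of f

qqnorm(x); qqline(x)             # base-R Q-Q plot; qqline passes through Q1 and Q3
```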
The \(\chi^2\) (chi-square) test is a statistical test of the null hypothesis that a given sample comes from a given (chosen!) distribution
We group \(\left<x_1, x_2, \dots, x_n\right>\) into \(k\) classes, such that each class has at least 4–5 elements. The width of each interval can be different
Let \(O_i,~i=1,2,\dots,k\) be the number of observations in each class, and \(E_i\) be the number of expected observations in each class under the hypothesized distribution
The larger the differences between \(O_i\) and \(E_i\), the stronger the evidence in favor of \(H_1\). The test statistic is then defined as:
\[ X_0^2 = \sum_{i=1}^k \frac{(O_i-E_i)^2}{E_i} \sim \chi_{k-p-1}^2 \] where \(p\) is the number of parameters of the hypothesized distribution (2 for the normal)
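A sketch of the test in R for the normality hypothesis, with classes built from sample quantiles (an arbitrary choice for the example) and \(p=2\) estimated parameters:

```r
# Chi-square goodness-of-fit sketch: does the sample come from a normal? (illustrative data)
set.seed(12)
x <- rnorm(200, mean = 10, sd = 2)
k <- 8                                       # number of classes

breaks <- quantile(x, probs = seq(0, 1, length.out = k + 1))
breaks[1] <- -Inf; breaks[k + 1] <- Inf      # open the outer classes
O <- table(cut(x, breaks))                   # observed counts O_i
p <- diff(pnorm(breaks, mean(x), sd(x)))     # class probabilities under the fitted normal
E <- length(x) * p                           # expected counts E_i

X0 <- sum((O - E)^2 / E)                     # test statistic
pchisq(X0, df = k - 2 - 1, lower.tail = FALSE)   # p-value, p = 2 estimated parameters
```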
Among the tests of normality, the Shapiro-Wilk test is the one with the greatest power for a given level of significance
The test statistic follows a distribution that has no analytical (closed) form and is known only numerically
For large enough samples, the test is so powerful that it highlights even small deviations from normality, due for example to outliers. In these cases it is appropriate to accompany it with a Quantile-Quantile plot to discriminate these effects
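Assuming the test described here is the Shapiro-Wilk test, in R it is available as shapiro.test():

```r
# Shapiro-Wilk normality test on a normal and a clearly non-normal sample
set.seed(13)
shapiro.test(rnorm(50))   # typically a large p-value: no evidence against normality
shapiro.test(rexp(50))    # typically a very small p-value: normality rejected
```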
It is clear that \(\rho_{s_1s_2}\) will be close to 0, while \(\rho_{s_1s_3}\) will be close to 1
There is a Pearson correlation test which provides a p-value associated with the alternative hypothesis that the two samples are correlated.
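A minimal R sketch with three simulated samples standing in for \(s_1\), \(s_2\) and \(s_3\) above:

```r
# Pearson correlation test: H1 is that the two samples are correlated
set.seed(14)
s1 <- rnorm(30)
s2 <- rnorm(30)                   # independent of s1
s3 <- s1 + rnorm(30, sd = 0.2)    # strongly dependent on s1

cor.test(s1, s2)   # correlation near 0, typically not significant
cor.test(s1, s3)   # correlation near 1, very small p-value
```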
The Student Test considers a single factor with one or two levels.
What happens if we have more than two levels or more than one factor?
Example: tensile strength of a yarn as a function of the percentage of cotton fibers (quantitative factor)
% Cotton | #1 | #2 | #3 | #4 | #5 |
---|---|---|---|---|---|
15 | 7 | 7 | 15 | 11 | 9 |
20 | 12 | 17 | 12 | 18 | 18 |
25 | 14 | 18 | 18 | 19 | 19 |
30 | 19 | 25 | 22 | 19 | 23 |
35 | 7 | 10 | 11 | 15 | 11 |
The question is: do the treatments produce statistically significant changes in yield (strength)?
Each individual yield value can be expressed as (means model): \[ y_{ij}=\mu_i+\varepsilon_{ij},~~\left\{\begin{array}{rcl} i &=& 1, 2, \dots, a \\ j &=& 1, 2, \dots, n \end{array}\right. \] where \(i\) indexes the treatments, \(j\) the repetitions, \(\varepsilon_{ij}\) are the residuals, i.e. the purely stochastic and zero-mean component of the random variable \(y_{ij}\), while \(\mu_i\) is the deterministic component
We can separate \(\mu_i = \mu + \tau_i\), where \(\tau_i\) is the contribution of the treatment to the overall mean \(\mu\) (effects model): \[ y_{ij}=\mu + \tau_i+\varepsilon_{ij},~~\left\{\begin{array}{rcl} i &=& 1, 2, \dots, a \\ j &=& 1, 2, \dots, n \end{array}\right. \]
\[ \begin{eqnarray} H_0&:&~\mu_1=\mu_2=\dots =\mu_a \\ H_1&:&~\mu_i\neq\mu_j~~~\textrm{for at least one pair }(i, j) \end{eqnarray} \]
\[ \begin{eqnarray} H_0&:&~\tau_1=\tau_2=\dots =\tau_a = 0 \\ H_1&:&~\tau_i\neq0~~~\textrm{for at least one }i \end{eqnarray} \]
Definitions:
\[ \begin{eqnarray} y_{i.}&=&\sum_{j=1}^n y_{ij},~\bar y_{i.}=y_{i.}/n,~i=1, 2, \dots, a \\ y_{..}&=&\sum_{i=1}^a\sum_{j=1}^n y_{ij},~\bar y_{..}=y_{..}/N,~N=na \end{eqnarray} \]
Considering that: \[ SS_T=\sum_{i=1}^a\sum_{j=1}^n (y_{ij}-\bar y_{..})^2 = \sum_{i=1}^a\sum_{j =1}^n [(\bar y_{i.}-\bar y_{..})+(y_{ij}-\bar y_{i.})]^2 \]
that is to say:
\[ SS_T= n\sum_{i=1}^a(\bar y_{i.}-\bar y_{..})^2 + \sum_{i=1}^a\sum_{j=1}^n (y_{ij}-\bar y_{i.})^2 + 2\sum_{i=1}^a\sum_{j=1}^n(\bar y_{i.}-\bar y_{..})(y_{ij}-\bar y_{i.}) \]
and since \(\sum_{j=1}^n (y_{ij}-\bar y_{i.})=y_{i.}-n\bar y_{i.}=0\), it follows that the \(SS_T\) can be decomposed as:
\[ SS_T=\sum_{i=1}^a\sum_{j=1}^n (y_{ij}-\bar y_{..})^2=n\sum_{i=1}^a(\bar y_{i.}-\bar y_{..})^2 + \sum_{i=1}^a\sum_{j=1}^n (y_{ij}-\bar y_{i.})^2=SS_{tr}+SS_E \]
We have decomposed \(SS_T\) (which is a close relative of the variance) into two sums of squares: \(SS_{tr}\), due to the differences between treatments, and \(SS_E\), due to the residual variability within treatments
Remembering what was said in the definition of the distribution \(\mathcal{X}^2\):
\[ \frac{\mathit{SS}}{\sigma^2}=\frac{\sum_{i=1}^n(y_i-\bar y)^2}{\sigma^2} \sim \mathcal{X}^2_{n-1} \]
it follows that:
\[ \sum_{i=1}^{a}\sum_{j=1}^{n}\frac{(y_{ij}-\bar y_{..})^2}{\sigma^2} = \frac{SS_T}{\sigma^2} \sim \chi^2_{N-1},~~~\frac{SS_E}{\sigma^2} \sim \chi^2_{N-a},~~~\frac{SS_{tr}}{\sigma^2} \sim \chi^2_{a-1} \]
\[ F_0 = \frac{SS_{tr}/(a-1)}{SS_E/(N-a)}=\frac{MS_{tr}}{MS_E} \sim F_{a-1,N-a} \] where the \(MS_\cdot\) are called mean squares
Based on the last equation we can calculate the p-value associated with the hypothesis \(H_1\): for small values we can state that at least one treatment has a statistically significant effect
The definition of \(F_0\) assumes that \(MS_{tr}\) and \(MS_E\) are independent. Since they are the result of the decomposition of \(SS_T\), this is not obvious. However, the following theorem (Cochran’s theorem) holds:
Let \(Z_i\sim \mathcal{N}(0,1),~i=1,2,\dots,\nu\) be independent random variables, with \(\sum_{i=1}^\nu Z_i^2=Q_1+Q_2+\dots+Q_s\), where \(s\leqslant \nu\) and \(Q_i\) has \(\nu_i\) degrees of freedom (\(i=1, 2,\dots,s\)). Then \(Q_1,Q_2,\dots,Q_s\) are independent random variables distributed as \(\chi^2\) with \(\nu_1,\nu_2,\dots,\nu_s\) degrees of freedom, if and only if \[ \nu=\nu_1+\nu_2+\dots+\nu_s \] Since \((N-a)+(a-1)=(N-1)\), it follows that \(SS_{tr}/\sigma^2\) and \(SS_E/\sigma^2\) are independent random variables distributed as \(\chi^2_{a-1}\) and \(\chi^2_{N-a}\), respectively.
Returning to the example of the resistance of mixed yarns, we can apply the variance decomposition obtaining:
 | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
Cotton | 4 | 475.76 | 118.94 | 14.75682 | 9.1e-06 |
Residuals | 20 | 161.20 | 8.06 | NA | NA |
The table above is called ANOVA table
The very low p-value allows us to reject the null hypothesis and state that at least one treatment has statistically significant effects
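A sketch in R reproducing this table from the yarn data above with aov() (object names are illustrative):

```r
# One-way ANOVA on the cotton/strength data of the example above
cotton   <- factor(rep(c(15, 20, 25, 30, 35), each = 5))
strength <- c( 7,  7, 15, 11,  9,
              12, 17, 12, 18, 18,
              14, 18, 18, 19, 19,
              19, 25, 22, 19, 23,
               7, 10, 11, 15, 11)

fit <- aov(strength ~ cotton)
summary(fit)   # reproduces the ANOVA table above (F = 14.76, p = 9.1e-06)
```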
However, we do not know how many and which treatments are significant. To study this aspect we use the Tukey Test
The test evaluates simultaneously all the following pairs of hypotheses: \[ \left. \begin{eqnarray} H_0&:&~\mu_i=\mu_j \\ H_1&:&~\mu_i\neq \mu_j \end{eqnarray} \right\} ~~\forall i\neq j \] For each pair \((i,j),~i\neq j\) we have the test statistic: \[ q_{0,ij}=\frac{|\bar y_{i.}-\bar y_{j.}|}{S_{p,ij}\sqrt{2/n}}\sim\mathcal{Q} _{a,k} \] where \(n\) is the size of the samples, the same for all, \(a\) is the number of treatments and \(k\) is the number of degrees of freedom of \(MS_E\), i.e. \(N-a=an-a\).
For each pair we then calculate the p-value from the CDF of the studentized range distribution, \(\mathcal{Q}\), and calculate the confidence interval: \[ \bar y_{i.}-\bar y_{j.}-q_{\alpha,a,N-a}\frac{S_{p,ij}}{\sqrt{n}} \leqslant \mu_i-\mu_j \leqslant \bar y_{i.}-\bar y_{j.}+q_{\alpha,a,N-a}\frac{S_{p,ij}}{\sqrt{n}} \]
Generally, the result of Tukey’s test is reported in a plot
Couples for which the confidence interval crosses 0 have no significant differences, and vice versa
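Continuing the previous sketch (it reuses the fit object), Tukey's test and the corresponding plot are available in R as TukeyHSD():

```r
# Tukey test on the ANOVA model fitted in the previous sketch (object `fit`)
tuk <- TukeyHSD(fit, conf.level = 0.95)
tuk          # pairwise differences, confidence intervals and adjusted p-values
plot(tuk)    # pairs whose interval crosses 0 are not significantly different
```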
Note: carrying out as many separate Student tests would increase the overall error probability, which would result from the combination of the error probabilities of the individual tests
paolo.bosetti@unitn.it — https://paolobosetti.quarto.pub/DESANED