Estimation
1 Probabilistic models
1.1 All models are wrong, some are useful
Box and Draper (1987), p424 (emphasis added):
…Essentially, all models are wrong, but some are useful. However, the approximate nature of the model must always be borne in mind.
see also Dunn and Smyth (2018), §1.8
1.2 Statistical analysis of scientific models
When we perform statistical analyses, we use data to help us choose between models - specifically, to determine which models best explain that data.
However, physical processes do not produce data on their own. Data is only produced when scientists implement an observation process (i.e., a scientific study), which is distinct from the underlying physical process. In some cases, the observation process and the physical process interact with each other. This phenomenon is called the “observer effect”.
In order to learn about the physical processes we are ultimately interested in, we often need to account for the observation process that produced the data we are analyzing. In particular, if some of the planned observations in the study design were not completed, we will likely need to account for the incompleteness of the resulting data set in our analysis. If we are not sure why some observations are incomplete, we may need to model the observation process in addition to the physical process we were originally interested in. For example, if some participants in a study dropped out part-way through the study, we may need to investigate why those participants dropped out while other participants completed the study.
These kinds of missing data issues are outside of the scope of this course; see Van Buuren (2018) for more details.
2 Estimands, estimates, and estimators
2.1 Estimands
In statistical contexts, most estimands are parameters of probabilistic models, or functions of model parameters.
Model parameters and other estimands are often symbolized using lower-case Greek letters: \(\alpha, \beta, \gamma, \delta\), etc.
2.2 Estimates
An estimate is a specific numerical value (or vector of values) that we compute by applying an estimator to an observed data set.
2.3 Estimators
An estimator is a function of the data that we use to produce estimates of an estimand.
When estimators are applied to random variables, the estimators are also random variables.
Estimators are often symbolized by placing a ^ (“hat”) symbol on top of the corresponding estimand; for example, \(\hat\theta\).
Usually, their dependence on the data is implicit:
\[\hat\theta \stackrel{\text{def}}{=} \hat\theta(x_1, \ldots, x_n)\]
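To make this notation concrete, here is a minimal sketch in Python (not part of the original notes): the sample mean, treated as an estimator \(\hat\theta(x_1, \ldots, x_n)\) of a population mean, applied to one observed data set to produce an estimate. The data values are made up for illustration.

```python
import numpy as np

def theta_hat(x):
    """Estimator: a function that maps a data set x = (x_1, ..., x_n)
    to a single number. Here we use the sample mean."""
    return np.mean(x)

# One observed data set (made-up values, for illustration only)
x_obs = np.array([4.2, 5.1, 3.8, 4.9, 5.4])

# Applying the estimator to the observed data yields an estimate:
# a plain number, not a random variable.
estimate = theta_hat(x_obs)
print(f"estimate: {estimate:.3f}")
```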
2.4 Contrasting estimands, estimates, and estimators
It’s helpful to keep in mind the mathematical type of each estimation concept:
- estimands are numbers (or vectors of numbers)
- estimates are also numbers (or vectors)
- estimators are functions of random variables, so they are also random variables (illustrated by the simulation sketch below)
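The last point in the list can be illustrated by simulation. The sketch below is my own illustration, with made-up settings (normally distributed observations and estimand \(\theta = 10\)); it applies the same estimator to several independently drawn samples, showing that the realized value of the estimator varies from sample to sample while the estimand stays fixed.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

theta = 10.0   # estimand: a fixed (usually unknown) number
n = 25         # sample size per simulated study
n_sims = 5     # number of repeated studies to simulate

for sim in range(n_sims):
    # Each simulated study produces a new random sample...
    x = rng.normal(loc=theta, scale=2.0, size=n)
    # ...and therefore a new realized value of the estimator.
    print(f"study {sim + 1}: estimate = {np.mean(x):.3f}")

print(f"estimand (fixed): {theta}")
```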
3 Accuracy of estimators
3.1 Accuracy
To determine which estimator is best, we first need to define what “best” means. Accuracy is usually the most important criterion; ease of computation is usually a secondary concern.
3.2 Estimation error
The estimation error of an estimator is the difference between the estimator and the estimand it targets:
\[\varepsilon{\left(\hat\theta\right)} \stackrel{\text{def}}{=} \hat\theta - \theta\]
3.3 Residuals
See Linear-model residual definitions and terminology for residual definitions and for the relationship between residuals, model deviations, and estimation error.
Some frequently-used measures of accuracy include:
3.4 Mean squared error
The mean squared error (MSE) of an estimator is the expected value of its squared estimation error:
\[\text{MSE}{\left(\hat\theta\right)} \stackrel{\text{def}}{=} \text{E}{\left[{\left(\varepsilon{\left(\hat\theta\right)}\right)}^2\right]} = \text{E}{\left[{\left(\hat\theta - \theta\right)}^2\right]}\]
3.5 Mean absolute error
The mean absolute error (MAE) of an estimator is the expected value of the absolute estimation error:
\[\text{MAE}{\left(\hat\theta\right)} \stackrel{\text{def}}{=} \text{E}{\left[{\left|\varepsilon{\left(\hat\theta\right)}\right|}\right]} = \text{E}{\left[{\left|\hat\theta - \theta\right|}\right]}\]
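The expectations in these definitions can be approximated by Monte Carlo simulation. The sketch below is my own illustration, with made-up settings (normal data, sample mean as the estimator); it averages the squared and absolute estimation errors over many simulated studies.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

theta = 10.0        # true estimand (known here because we simulate the data)
n = 25              # sample size per simulated study
n_sims = 100_000    # number of simulated studies

# One estimate (the sample mean) per simulated study
estimates = rng.normal(loc=theta, scale=2.0, size=(n_sims, n)).mean(axis=1)

errors = estimates - theta            # estimation errors
mse = np.mean(errors ** 2)            # Monte Carlo approximation of MSE
mae = np.mean(np.abs(errors))         # Monte Carlo approximation of MAE

print(f"MSE ~ {mse:.4f}")
print(f"MAE ~ {mae:.4f}")
```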
3.6 Bias
The bias of an estimator is the expected value of its estimation error: \(\text{Bias}{\left(\hat\theta\right)} \stackrel{\text{def}}{=} \text{E}{\left[\varepsilon{\left(\hat\theta\right)}\right]}\). The bias can also be written as \(\text{E}{\left[\hat\theta\right]} - \theta\):
Proof. \[ \begin{aligned} \text{Bias}{\left(\hat\theta\right)} &\stackrel{\text{def}}{=}\text{E}{\left[\varepsilon{\left(\hat\theta\right)}\right]}\\ &= \text{E}{\left[\hat\theta- \theta\right]}\\ &=\text{E}{\left[\hat\theta\right]} - \text{E}{\left[\theta\right]}\\ &=\text{E}{\left[\hat\theta\right]} - \theta \end{aligned} \]
The third equality is by the linearity of expectation, and the last equality holds because the estimand \(\theta\) is a constant, so \(\text{E}{\left[\theta\right]} = \theta\).
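Bias can also be approximated by simulation. As an illustration that goes slightly beyond these notes (made-up settings), the sketch below compares two estimators of a normal variance \(\sigma^2\): the plug-in estimator that divides by \(n\), which is biased downward, and the usual sample variance that divides by \(n - 1\), which is unbiased.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

sigma2 = 4.0       # true estimand: the population variance
n = 10             # small sample size, so the bias is visible
n_sims = 200_000   # number of simulated studies

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_sims, n))

var_plugin = samples.var(axis=1, ddof=0)    # divides by n
var_unbiased = samples.var(axis=1, ddof=1)  # divides by n - 1

# Bias ~ average estimation error across simulated studies
print(f"bias of /n estimator     ~ {np.mean(var_plugin - sigma2):+.4f}")
print(f"bias of /(n-1) estimator ~ {np.mean(var_unbiased - sigma2):+.4f}")
print(f"theoretical bias of /n   = {-sigma2 / n:+.4f}")
```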
The MSE of an estimator can be decomposed into its squared bias plus its variance (Theorem 2):
\[\text{MSE}{\left(\hat\theta\right)} = {\left(\text{Bias}{\left(\hat\theta\right)}\right)}^2 + \text{Var}{\left(\hat\theta\right)}\]
Proof. Let’s start by expanding each term of the right-hand side:
\[ \begin{aligned} {\left(\text{Bias}{\left(\hat\theta\right)}\right)}^2 &={\left(\text{E}{\left[\hat\theta\right]} - \theta\right)}^2\\ &={\left(\text{E}{\left[\hat\theta\right]}\right)}^2 - 2\text{E}{\left[\hat\theta\right]}\theta+\theta^2\\ \end{aligned} \]
and the variance term can be rewritten using the computational formula for variance:
\[\text{Var}{\left(\hat\theta\right)} = \text{E}{\left[\hat\theta^2\right]} - {\left(\text{E}{\left[\hat\theta\right]}\right)}^2\]
Now, add them together and simplify:
\[ \begin{aligned} {\left(\text{Bias}{\left(\hat\theta\right)}\right)}^2 + \text{Var}{\left(\hat\theta\right)} &={\left(\text{E}{\left[\hat\theta\right]}\right)}^2 - 2\text{E}{\left[\hat\theta\right]}\theta+\theta^2 + \text{E}{\left[\hat\theta^2\right]} - {\left(\text{E}{\left[\hat\theta\right]}\right)}^2\\ &=\text{E}{\left[\hat\theta^2\right]} - 2\text{E}{\left[\hat\theta\right]}\theta+\theta^2\\ \end{aligned} \]
Now let’s expand the left-hand side to reach the same expression:
\[ \begin{aligned} \text{MSE}{\left(\hat\theta\right)} &= \text{E}{\left[{\left(\varepsilon{\left(\hat\theta\right)}\right)}^2\right]}\\ &= \text{E}{\left[(\hat\theta- \theta)^2\right]}\\ &= \text{E}{\left[\hat\theta^2 - 2\hat\theta\theta+ \theta^2\right]}\\ &=\text{E}{\left[\hat\theta^2\right]} - \text{E}{\left[2\hat\theta\theta\right]}+\text{E}{\left[\theta^2\right]}\\ &=\text{E}{\left[\hat\theta^2\right]} - 2\text{E}{\left[\hat\theta\right]}\theta+\theta^2\\ \end{aligned} \]
\(\text{MSE}{\left(\hat\theta\right)}\) and \({\left(\text{Bias}{\left(\hat\theta\right)}\right)}^2 + \text{Var}{\left(\hat\theta\right)}\) both equal \(\text{E}{\left[\hat\theta^2\right]} - 2\text{E}{\left[\hat\theta\right]}\theta+\theta^2\). Equality is transitive, so \(\text{MSE}{\left(\hat\theta\right)}\) and \({\left(\text{Bias}{\left(\hat\theta\right)}\right)}^2 + \text{Var}{\left(\hat\theta\right)}\) are equal to each other:
\[\text{MSE}{\left(\hat\theta\right)} = {\left(\text{Bias}{\left(\hat\theta\right)}\right)}^2 + \text{Var}{\left(\hat\theta\right)}\]
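This decomposition can be checked numerically. A minimal sketch (my own illustration, reusing the biased \(/n\) variance estimator from the bias example above, with made-up simulation settings) compares the Monte Carlo MSE with the sum of the squared Monte Carlo bias and the Monte Carlo variance.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

sigma2 = 4.0       # true estimand: the population variance
n = 10
n_sims = 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_sims, n))
theta_hats = samples.var(axis=1, ddof=0)   # biased /n variance estimator

errors = theta_hats - sigma2
mse = np.mean(errors ** 2)                 # E[(theta_hat - theta)^2]
bias = np.mean(errors)                     # E[theta_hat] - theta
var = np.var(theta_hats)                   # Var(theta_hat)

print(f"MSE          ~ {mse:.4f}")
print(f"Bias^2 + Var ~ {bias ** 2 + var:.4f}")
```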
3.6.1 Unbiased estimators
An estimator is unbiased if its bias is zero. For unbiased estimators, the MSE is equal to the variance:
\[\text{MSE}{\left(\hat\theta\right)} = \text{Var}{\left(\hat\theta\right)}\]
Proof. If \(\hat\theta\) is unbiased, then:
\[ \begin{aligned} \text{Bias}{\left(\hat\theta\right)} &= 0\\ \text{E}{\left[\hat\theta\right]} - \theta &= 0\\ \text{E}{\left[\hat\theta\right]} &= \theta \end{aligned} \]
\[ \begin{aligned} \text{MSE}{\left(\hat\theta\right)} &\stackrel{\text{def}}{=}\text{E}{\left[{\left(\varepsilon{\left(\hat\theta\right)}\right)}^2\right]}\\ &= \text{E}{\left[{\left(\hat\theta- \theta\right)}^2\right]}\\ &= \text{E}{\left[{\left(\hat\theta- \text{E}{\left[\hat\theta\right]}\right)}^2\right]}\\ &\stackrel{\text{def}}{=}\text{Var}{\left(\hat\theta\right)} \end{aligned} \]
(Alternative proof of Equation 4) We could have started from Theorem 2 instead:
\[ \begin{aligned} \text{MSE}{\left(\hat\theta\right)} &= {\left(\text{Bias}{\left(\hat\theta\right)}\right)}^2 + \text{Var}{\left(\hat\theta\right)}\\ &= {\left(0\right)}^2 + \text{Var}{\left(\hat\theta\right)}\\ &= 0 + \text{Var}{\left(\hat\theta\right)}\\ &= \text{Var}{\left(\hat\theta\right)}\\ \end{aligned} \]
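For an unbiased estimator, the MSE and the variance should therefore coincide. The short sketch below (my own illustration, with made-up settings, using the sample mean of normal data, which is unbiased for the population mean) checks this by simulation.

```python
import numpy as np

rng = np.random.default_rng(seed=5)

mu = 10.0           # true estimand (population mean)
n = 25
n_sims = 200_000

# Sample mean for each simulated study: an unbiased estimator of mu
estimates = rng.normal(loc=mu, scale=2.0, size=(n_sims, n)).mean(axis=1)

print(f"MSE ~ {np.mean((estimates - mu) ** 2):.5f}")
print(f"Var ~ {np.var(estimates):.5f}")
```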
3.7 Standard error
“Standard error” is a confusing concept in a few ways. First of all, despite its name, it isn’t defined as a characteristic of the estimation error, \(\varepsilon{\left(\hat\theta\right)}\)! The standard error of an estimator is defined as the standard deviation of the estimator itself:
\[\text{SE}{\left(\hat\theta\right)} \stackrel{\text{def}}{=} \text{SD}{\left(\hat\theta\right)}\]
Moreover, since it is just a synonym for the standard deviation of the estimator, it seems like a redundant concept. However, standard errors help us construct p-values and confidence intervals, so they come up a lot - often enough to give them their own name.
We can relate the standard error to the estimation error, using the fact that subtracting a constant (here, the estimand \(\theta\)) does not change a variance:
\[ \begin{aligned} \text{Var}{\left(\hat\theta\right)} &= \text{Var}{\left(\hat\theta- \theta\right)}\\ &= \text{Var}{\left(\varepsilon{\left(\hat\theta\right)}\right)}\\ \end{aligned} \] So the variance of the estimator is equal to the variance of the estimation error, and the standard error is equal to the standard deviation of the estimation error:
\[\text{SE}{\left(\hat\theta\right)} = \text{SD}{\left(\varepsilon{\left(\hat\theta\right)}\right)}\]
We can also relate the squared standard error to the MSE and the bias, by rearranging the result from Theorem 2:
\[{\left(\text{SE}{\left(\hat\theta\right)}\right)}^2 = \text{MSE}{\left(\hat\theta\right)} - {\left(\text{Bias}{\left(\hat\theta\right)}\right)}^2\]
Proof. \[ \begin{aligned} \text{MSE}{\left(\hat\theta\right)} &= {\left(\text{Bias}{\left(\hat\theta\right)}\right)}^2 + \text{Var}{\left(\hat\theta\right)}\\ \therefore\text{Var}{\left(\hat\theta\right)} &= \text{MSE}{\left(\hat\theta\right)} - {\left(\text{Bias}{\left(\hat\theta\right)}\right)}^2\\ \therefore{\left(\text{SE}{\left(\hat\theta\right)}\right)}^2 &= \text{MSE}{\left(\hat\theta\right)} - {\left(\text{Bias}{\left(\hat\theta\right)}\right)}^2\\ \end{aligned} \]
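As a final check (my own illustration, with made-up settings, using the sample mean of normal data), the sketch below compares the Monte Carlo standard error of the estimator, the standard deviation of its estimation error, and the rearranged identity \({\left(\text{SE}\right)}^2 = \text{MSE} - {\left(\text{Bias}\right)}^2\). The last line uses the standard result (not derived in these notes) that the standard error of the sample mean is \(\sigma/\sqrt{n}\).

```python
import numpy as np

rng = np.random.default_rng(seed=6)

mu, sigma = 10.0, 2.0   # true mean (the estimand) and SD of the observations
n = 25
n_sims = 200_000

estimates = rng.normal(loc=mu, scale=sigma, size=(n_sims, n)).mean(axis=1)
errors = estimates - mu

se = np.std(estimates)        # SE(theta_hat) = SD(theta_hat)
sd_error = np.std(errors)     # SD of the estimation error
mse = np.mean(errors ** 2)
bias = np.mean(errors)

print(f"SE(theta_hat)            ~ {se:.5f}")
print(f"SD(estimation error)     ~ {sd_error:.5f}")
print(f"sqrt(MSE - Bias^2)       ~ {np.sqrt(mse - bias ** 2):.5f}")
print(f"sigma / sqrt(n) (theory) = {sigma / np.sqrt(n):.5f}")
```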