Some Applied Statistics Notes about Outliers
From JHU AMS class EN.553.613 ASDA 2023 Fall.
Notice Outliers:
- Plot residuals ($e_i = y_i - \hat{y}_i$) vs. $X$ or $\hat{y}$
- box plot
- dot plot
- stem plot
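Flag case $i$ as an outlier if $|e_i^*| > 4$, where
$e_i^* = \frac{e_i - \bar{e}}{\sqrt{MSE}} = \frac{e_i}{\sqrt{MSE}}$ is the semistudentized residual (with an intercept in the model, $\bar{e} = 0$),
$MSE = \frac{SSE}{n-p} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-p}$,
and $p$ is the number of estimated regression parameters (including the intercept).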
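
As a rough illustration of the $|e_i^*| > 4$ screen, the sketch below fits a simple linear regression (one predictor plus intercept, so $p = 2$) in plain Python and computes the semistudentized residuals. The function name `semistudentized_residuals` and the toy data are assumptions made for this example; they are not from the course material.

```python
from math import sqrt

def semistudentized_residuals(x, y):
    """Return e_i / sqrt(MSE) for a least-squares line fitted to (x, y)."""
    n, p = len(x), 2                      # p = 2: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # residuals e_i
    mse = sum(ei ** 2 for ei in e) / (n - p)            # MSE = SSE / (n - p)
    return [ei / sqrt(mse) for ei in e]

# Toy data (hypothetical): a near-perfect line with one shifted response.
x = list(range(1, 31))
y = [2 + 0.5 * xi + (0.1 if i % 2 else -0.1) for i, xi in enumerate(x)]
y[14] += 5                                # plant an outlier at index 14

e_star = semistudentized_residuals(x, y)
flagged = [i for i, v in enumerate(e_star) if abs(v) > 4]
print(flagged)
```

Because $e_i^2 \le SSE$, the semistudentized residual satisfies $|e_i^*| \le \sqrt{n-p}$, so the cutoff of 4 can only trigger when $n - p > 16$; that is why the toy data uses 30 cases.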
We have
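$\mathrm{var}(e_i) = \sigma^2(1 - h_{ii})$,
$\mathrm{cov}(e_i, e_j) = -\sigma^2 h_{ij}$ for $i \neq j$,
$h_{ij} = X_i(X^TX)^{-1}X_j^T$,
$h_{ii}$ (leverage) $= X_i(X^TX)^{-1}X_i^T$,
where $X_i$ is the $i$-th row of the design matrix $X$ and $\sigma^2$ is the error variance.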
By the rule of thumb, case $i$ is flagged as an outlier with respect to $X$ when $h_{ii} > \frac{2p}{n}$. A leverage $h_{ii} > 0.5$ is considered high, and $0.2 < h_{ii} \le 0.5$ moderate.
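
The $\frac{2p}{n}$ rule can be sketched in a few lines of Python. This example assumes a single predictor with an intercept ($p = 2$), in which case $X_i = (1, x_i)$ and the matrix formula for the leverage reduces to $h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_k (x_k - \bar{x})^2}$. The helper name `leverages` and the data are hypothetical.

```python
def leverages(x):
    """Leverages h_ii for a simple linear regression with intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

x = [1, 2, 3, 4, 5, 6, 7, 20]    # toy predictor values; 20 sits far from the rest
n, p = len(x), 2                 # p = 2: intercept and slope
h = leverages(x)
cutoff = 2 * p / n               # rule-of-thumb threshold 2p/n
high_leverage = [i for i, hi in enumerate(h) if hi > cutoff]
print(high_leverage)
```

Here only the last case ($x = 20$, $h_{ii} \approx 0.90$) exceeds the cutoff $\frac{2p}{n} = 0.5$. The leverages always sum to $p$, which is a quick sanity check on any implementation.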
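The externally studentized residual (ESR), $t_i$, detects outliers with respect to $Y$; outliers with respect to $X$ are detected through the leverage $h_{ii}$.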
Test for Outliers:
Bonferroni Correction:
Case $i$ is declared an outlier if $|t_i| = \left|\frac{d_i}{s(d_i)}\right| > t\left(1-\frac{\alpha}{2n};\, n-p-1\right)$,
where $d_i = \frac{e_i}{1-h_{ii}}$ is the deleted residual.
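
A minimal sketch of the Bonferroni test follows, again assuming one predictor with an intercept ($p = 2$). It relies on the standard identities $s^2(d_i) = \frac{MSE_{(i)}}{1-h_{ii}}$ and $(n-p-1)\,MSE_{(i)} = SSE - \frac{e_i^2}{1-h_{ii}}$, where $MSE_{(i)}$ is the mean squared error of the fit without case $i$, so no refitting is needed. To stay within the standard library, the $t$ quantile is approximated by numerical integration and bisection; this stands in for a library routine such as `scipy.stats.t.ppf`. All function names and the data are hypothetical.

```python
from math import exp, lgamma, pi, sqrt

def t_quantile(prob, df, steps=2000):
    """Approximate t quantile for prob >= 0.5 (Simpson's rule + bisection)."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)

    def pdf(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    def cdf(t):  # P(T <= t) for t >= 0, composite Simpson on [0, t]
        h = t / steps
        s = pdf(0.0) + pdf(t)
        s += sum((4 if k % 2 else 2) * pdf(k * h) for k in range(1, steps))
        return 0.5 + s * h / 3

    lo, hi = 0.0, 50.0  # assumption: the requested quantile lies below 50
    for _ in range(60):
        mid = (lo + hi) / 2
        if cdf(mid) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def studentized_deleted_residuals(x, y):
    """t_i = d_i / s(d_i) for a simple linear regression with intercept."""
    n, p = len(x), 2
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    sse = sum(ei ** 2 for ei in e)
    t = []
    for ei, hi in zip(e, h):
        d = ei / (1 - hi)                                   # deleted residual d_i
        mse_del = (sse - ei ** 2 / (1 - hi)) / (n - p - 1)  # MSE without case i
        t.append(d / sqrt(mse_del / (1 - hi)))              # d_i / s(d_i)
    return t

# Toy data (hypothetical): a near-perfect line with one shifted response.
x = list(range(1, 31))
y = [2 + 0.5 * xi + (0.1 if i % 2 else -0.1) for i, xi in enumerate(x)]
y[14] += 5

n, p, alpha = len(x), 2, 0.05
t = studentized_deleted_residuals(x, y)
t_crit = t_quantile(1 - alpha / (2 * n), n - p - 1)   # Bonferroni critical value
flagged = [i for i, ti in enumerate(t) if abs(ti) > t_crit]
print(flagged)
```

Dividing $\alpha$ by $2n$ accounts for running $n$ two-sided tests at once, so the family-wise significance level stays at or below $\alpha$.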




