A model form, often simply called a model, specifies a set of probability distributions of an outcome variable conditioned on some predictors. Specifically, a (parametric) model is a group of conditional probability distributions indexed by its parameters.

In comparison, an estimator specifies the optimal model parameters, under some criterion, as a function of a sample.

A fitted model specifies a single probability distribution of the outcome variable conditioned on the predictors: the element of the model form selected by the estimator given the observed sample.

Stages of regression (bottom-up view):

1. No predictor: sample mean (null/univariate model);
2. One predictor: simple linear regression or other uni-parametric models; local regression, cubic smoothing splines;
3. Multiple predictors: linear regression; linear and cubic splines;
4. Omitted covariates: check residual scatter plot for uncaptured pattern;
5. The modeling procedure goes on until the model is valid (adequate) and prediction error estimates shrink below a level you consider negligible.

## Model Form

A regression model is a group of conditional probability models whose expectation function approximates the regression function, parametric or not.

$$Y \mid \mathbf{X} \sim P(\mathbf{X}, \theta)$$

### Linear Regression Model

The linear regression model (LM) gives a linear approximation of the regression function, with slope parameters and an intercept. With a single predictor, it reduces to simple linear regression.

$$Y \mid \mathbf{X} \sim \text{Normal}(\mathbf{X} \cdot \beta, \sigma^2)$$
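As a rough illustration of the linear model above, the following sketch simulates data from it and recovers the parameters by least squares. The sample size, `beta_true`, and `sigma` are arbitrary values assumed for the example, not quantities from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" parameters, chosen only for illustration.
n = 500
beta_true = np.array([1.0, 2.0, -0.5])   # intercept and two slopes
sigma = 0.3

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# Least-squares estimate of beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual-based estimate of sigma, using n - k degrees of freedom.
residuals = y - X @ beta_hat
sigma_hat = np.sqrt(residuals @ residuals / (n - X.shape[1]))

print(beta_hat, sigma_hat)
```

Here the model form is the whole family indexed by $(\beta, \sigma)$, least squares is the estimator, and the pair `(beta_hat, sigma_hat)` identifies the fitted model.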

### Generalized Linear Model

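A generalized linear model (GLM) is a linear model composed with a link function $g$ [@Nelder1972]: the outcome follows an exponential-family distribution, and the link maps its conditional mean to the linear predictor:

$$g(\mathbb{E}[Y \mid \mathbf{X}]) = \mathbf{X} \cdot \beta$$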

• Box-Cox transformation: $(X^\lambda - 1) / \lambda$, if $\lambda \ne 0$; logarithm $\ln(X)$, if $\lambda = 0$;
• linear, cubic and cubic smoothing splines;
• binary outcomes (K=2): logistic regression (logit), probit regression;
• count data: Poisson regression (log-linear);
• time-series seasonal: sinusoid;
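The Box-Cox item in the list above can be sketched as a small function. This is a minimal hand-written version for illustration; the function name `box_cox` and the sample values are assumptions of this example.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of strictly positive data x with parameter lam."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(x)
    return (x**lam - 1.0) / lam

x = np.array([0.5, 1.0, 2.0, 4.0])
print(box_cox(x, 1.0))    # lam = 1 is a shift: x - 1
print(box_cox(x, 0.0))    # lam = 0 is the natural logarithm
print(box_cox(x, 1e-6))   # a small lam approaches the logarithm
```

The last line shows why the logarithm is the natural definition at $\lambda = 0$: it is the limit of $(x^\lambda - 1)/\lambda$ as $\lambda \to 0$.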

### Other Models

Nonlinear models: step function (piecewise null model);

Hierarchical/multilevel model: adding interactions to main effects.

The hierarchy principle:

If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant, because interactions are hard to interpret in a model without main effects.

Graphical models (Bayesian networks): Bayesian inference leads to Bayesian networks.

## Estimator

An estimator is defined by a cost function (optimization criterion) on the training data: least squares (LS), ridge, lasso.

Least Squares Estimators
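A minimal sketch of the least-squares estimator, with the closed-form ridge estimator alongside for comparison. The function names, the simulated data, and the penalty value `lam=10.0` are assumptions made for illustration; lasso is omitted because it has no closed form.

```python
import numpy as np

def least_squares(X, y):
    """Minimize ||y - X b||^2."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def ridge(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 (closed form).

    For simplicity every coefficient is penalized; in practice the
    intercept is usually left unpenalized.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=100)

b_ls = least_squares(X, y)
b_ridge = ridge(X, y, lam=10.0)

# The ridge penalty shrinks the coefficient vector toward zero.
print(np.linalg.norm(b_ls), np.linalg.norm(b_ridge))
```

With `lam=0` the ridge criterion coincides with least squares, so the two functions return the same coefficients up to rounding.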

## Model Assessment

There are two separate concerns regarding a fitted model: validity and prediction error. A fitted model is valid (or adequate) if the residuals behave according to a univariate model. The "gold standard" measure for prediction error of a fitted model is its mean square error (MSE) or root MSE (RMSE) on new observations.

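R-squared $R^2$, or coefficient of determination, is the proportion of sample variance explained by a fitted model; in other words, it is the ratio of the explained (regression) sum of squares to the total sum of squares (TSS). For a least-squares fit with an intercept, this equals $1 - \text{RSS}/\text{TSS}$, where RSS denotes the residual sum of squares.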

$$R^2 = \frac{\sum_i (\hat{y}_i-\bar{y})^2}{\sum_i (y_i-\bar{y})^2}$$

Properties:

1. For a least-squares fit with an intercept, R-squared lies within [0,1] on the training sample.
2. R-squared never decreases as the number of predictors increases.
3. With a single predictor in an LM, the square root of R-squared equals the absolute value of the sample correlation coefficient.

In a linear model with a single regressor and a constant term, the coefficient of determination $R^2$ is the square of the correlation between the regressor and the dependent variable,

$$R^2 = \left( \frac{ \widehat{\mathrm{cov}(X,Y)} }{\hat{\sigma}_X \hat{\sigma}_Y} \right)^2 = \frac{\left( \frac{1}{n} \sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) \right)^2} { \left( \frac{1}{n}\sum\limits_{i=1}^n (x_i-\bar{x})^2 \right) \left( \frac{1}{n}\sum\limits_{i=1}^n (y_i-\bar{y})^2 \right) }$$
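The single-regressor identity above can be checked numerically. The sketch below, on simulated data with arbitrary assumed coefficients, computes R-squared three ways: as explained over total sum of squares, as one minus residual over total sum of squares, and as the squared sample correlation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1.5 + 0.8 * x + rng.normal(size=n)

# Least-squares fit with an intercept.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

total_ss = np.sum((y - y.mean()) ** 2)
explained_ss = np.sum((y_hat - y.mean()) ** 2)
residual_ss = np.sum((y - y_hat) ** 2)

r2_explained = explained_ss / total_ss
r2_residual = 1.0 - residual_ss / total_ss
r2_corr = np.corrcoef(x, y)[0, 1] ** 2

print(r2_explained, r2_residual, r2_corr)
```

The three values agree up to rounding only because the fit is least squares with an intercept; for other estimators or on new data they can differ.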

If the number of regressors is more than one, $R^2$ can be seen as the square of a coefficient of multiple correlation.

$$R^2 = \widehat{\mathrm{cov}(Y,\mathbf{X})} \left( \hat{\sigma}_Y^2 \widehat{\mathrm{cov}( \mathbf{X},\mathbf{X})} \right) ^{-1} \widehat{\mathrm{cov}(\mathbf{X},Y)}$$

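$$= (Y_0^{T} X_0) [(Y_0^{T} Y_0) (X_0^{T} X_0)]^{-1} (X_0^{T} Y_0)$$

where $Y_0$ and $X_0$ denote the outcome vector and the predictor matrix with their column means subtracted.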

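The F-statistic is the ratio of explained variance to unexplained variance, with each sum of squares divided by its degrees of freedom; in analysis of variance, it is the ratio of between-group variance to within-group variance.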
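As a sketch of the regression case, the overall F-statistic for a least-squares fit with an intercept and $p$ predictors divides the explained sum of squares by $p$ and the residual sum of squares by $n - p - 1$. The helper name and the simulated data below are assumptions of this example.

```python
import numpy as np

def overall_f_statistic(y, y_hat, n_predictors):
    """F = (explained SS / p) / (residual SS / (n - p - 1))."""
    n = len(y)
    explained_ss = np.sum((y_hat - np.mean(y)) ** 2)
    residual_ss = np.sum((y - y_hat) ** 2)
    return (explained_ss / n_predictors) / (residual_ss / (n - n_predictors - 1))

rng = np.random.default_rng(3)
n, p = 120, 3
Z = rng.normal(size=(n, p))
y = 2.0 + Z @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), Z])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

f_stat = overall_f_statistic(y, y_hat, p)

# The same statistic written in terms of R-squared.
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
f_from_r2 = (r2 / p) / ((1.0 - r2) / (n - p - 1))
print(f_stat, f_from_r2)
```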

## Model Selection

Model selection is the process of selecting the proper level of flexibility/complexity for a model form. Given any data set, LS minimizes the residual sum of squares (RSS) within the default full model (p-dimensional LM). However, optimal training error does not imply optimal prediction error: a model flexible enough to fit the noise in the training sample is said to overfit.

Model selection methods:

• Subset selection: best subset (recommended for p<20), forward selection (suitable for p>n), and backward selection.
• Shrinkage methods (or regularization): ridge ($L^2$ penalty, suitable for a dense "true" model), lasso ($L^1$ penalty, has the "variable selection" property, suitable for a sparse "true" model).
• Selection by derived basis/directions: principal component regression (PCR).

Types of LM model selection:

• Group by subset size, $m$: RSS vs adjusted error estimators (adjusted R-squared, AIC, BIC) or CV;
• Group by derived basis size, $M$: RSS vs CV;
• Group by regularization parameter, $\lambda$: regularized RSS vs CV;

One-standard-error rule of prediction error estimates (MSE): Choose the simplest fitted model (lowest $m$ or $M$, highest $\lambda$) with prediction MSE within one standard error of the smallest prediction MSE.
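The one-standard-error rule can be sketched as follows, assuming the candidate models are already ordered from simplest to most complex and that a CV error and its standard error are available for each. The numbers are hypothetical, made up for the example.

```python
import numpy as np

def one_standard_error_choice(cv_mean, cv_se):
    """Index of the simplest model within one SE of the best CV error.

    Assumes candidates are ordered from simplest to most complex.
    """
    cv_mean = np.asarray(cv_mean, dtype=float)
    cv_se = np.asarray(cv_se, dtype=float)
    best = np.argmin(cv_mean)
    threshold = cv_mean[best] + cv_se[best]
    # argmax on a boolean array returns the first True entry,
    # i.e. the simplest candidate under the threshold.
    return int(np.argmax(cv_mean <= threshold))

# Hypothetical CV results for subset sizes m = 1..6.
cv_mean = [4.0, 2.5, 2.1, 2.0, 2.05, 2.2]
cv_se = [0.30, 0.20, 0.15, 0.15, 0.15, 0.20]

chosen = one_standard_error_choice(cv_mean, cv_se)
print("chosen subset size m =", chosen + 1)
```

In this example the smallest CV error is at $m = 4$, but $m = 3$ is within one standard error of it and is therefore chosen. When candidates are indexed by $\lambda$, they would need to be sorted from highest to lowest $\lambda$ first.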

### Model Selection Criteria

Estimates of prediction error (MSE or RMSE):

1. Adjusted estimates:
• Adjusted R-squared: does not need an error variance estimate, applies to $p>n$;
• Akaike information criterion (AIC): for LM, equivalent to Mallows' $C_p$;
• Bayesian information criterion (BIC);
2. Direct estimates: cross-validation (CV) prediction error;

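Adjusted R-squared increases with one more predictor if and only if the absolute value of the t-ratio of the new predictor is greater than 1.

$$\bar{R}^2 = 1 - \frac{\text{RSS}/(n-k)}{\text{TSS}/(n-1)}$$

Here RSS is the residual sum of squares, $n$ is the sample size, and $k$ is the number of estimated coefficients, including the intercept.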
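A small worked instance of the formula above, as a sketch. It assumes that RSS is the residual sum of squares and that $k$ counts the estimated coefficients including the intercept; the input numbers are illustrative only.

```python
def adjusted_r_squared(rss, tss, n, k):
    """Adjusted R-squared from residual and total sums of squares.

    n is the sample size; k is the number of estimated coefficients,
    assumed here to include the intercept.
    """
    if n <= k:
        raise ValueError("need n > k")
    return 1.0 - (rss / (n - k)) / (tss / (n - 1))

# Illustrative numbers, not taken from any data set.
print(adjusted_r_squared(rss=20.0, tss=100.0, n=21, k=5))   # 0.75
print(1.0 - 20.0 / 100.0)                                    # plain R-squared: 0.8
```

The adjustment lowers the plain value of 0.8 to 0.75 because four slope coefficients were spent on a sample of only 21 observations.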

## Resampling methods

Cross-validation (CV) estimates prediction error; the bootstrap estimates estimator variation. Cross-validation is preferred to the single-split validation-set approach.
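As a sketch of how cross-validation estimates prediction error, the following compares the K-fold CV mean squared error of an intercept-only model with that of a simple linear regression. The helper `k_fold_cv_mse`, the choice of five folds, and the simulated data are assumptions of this example.

```python
import numpy as np

def k_fold_cv_mse(X, y, n_folds=5, seed=0):
    """K-fold cross-validated MSE of a least-squares fit."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, n_folds)
    fold_mse = []
    for test_idx in folds:
        # Fit on everything except the held-out fold.
        train_idx = np.setdiff1d(indices, test_idx)
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        errors = y[test_idx] - X[test_idx] @ beta
        fold_mse.append(np.mean(errors ** 2))
    fold_mse = np.array(fold_mse)
    # Mean CV error and a rough standard error across folds.
    return fold_mse.mean(), fold_mse.std(ddof=1) / np.sqrt(n_folds)

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)       # noise variance 1

X_null = np.ones((n, 1))                      # intercept only
X_line = np.column_stack([np.ones(n), x])     # intercept and slope

mse_null, se_null = k_fold_cv_mse(X_null, y)
mse_line, se_line = k_fold_cv_mse(X_line, y)
print(mse_null, mse_line)
```

The CV error of the linear model should land near the noise variance of 1, while the null model's error also absorbs the variance explained by `x`.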

The bootstrap is repeated resampling with replacement to estimate estimator variation (or any other population information).
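A minimal sketch of the bootstrap, estimating the standard error of the sample mean and comparing it with the textbook formula $s/\sqrt{n}$. The helper name, the number of replicates, and the simulated sample are assumptions of this example.

```python
import numpy as np

def bootstrap_se(sample, statistic, n_boot=2000, seed=0):
    """Bootstrap standard error of a statistic."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Draw n observations with replacement from the sample.
        resample = sample[rng.integers(0, n, size=n)]
        stats[b] = statistic(resample)
    return stats.std(ddof=1)

rng = np.random.default_rng(5)
sample = rng.normal(loc=10.0, scale=2.0, size=400)

se_boot = bootstrap_se(sample, np.mean)
se_formula = sample.std(ddof=1) / np.sqrt(len(sample))
print(se_boot, se_formula)
```

The same function applies unchanged to statistics without a simple formula, for example `bootstrap_se(sample, np.median)`.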

## Regression Diagnostics

Regression diagnostics are procedures assessing the validity (adequacy) of a regression model, mostly for linear models. A regression diagnostic may take the form of a graphical approach, informal quantitative results, or a formal statistical hypothesis test: each provides guidance for further stages of a regression analysis.

### Univariate analysis

Every regression should begin and end with verifying a univariate model (not necessarily Gaussian) against real data, known as univariate analysis.

Graphical analysis: (4-plot)

1. Run sequence plot: shifts in location or scale.
2. Lag plot: auto-correlation.
3. Histogram: distribution.
4. Normal probability plot: fit of normal distribution.

Samples can be alternatively seen as time-series, and a time-series is essentially a vector sequentially measured at constant frequency. This fact is exploited in some of the univariate analysis techniques to detect the peculiarities of time-series data: trend, seasonality, autocorrelation. If any of these are detected in univariate analysis, follow-up analysis is needed.

### On Residuals

Testing heteroskedasticity (of residuals):

1. regress squared residual on fitted value.
2. regress squared residual on a quadratic function of fitted value.
3. Formal tests:
• Welch F-test (preferred), Brown and Forsythe test.
• homogeneity of variances: Levene test, Bartlett's Test (for normal distribution).
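Items 1 and 2 above can be sketched as an auxiliary regression of squared residuals on fitted values. The statistic $n R^2$ from that regression resembles the studentized Breusch-Pagan test (a name not used in the text above); under homoskedasticity it is approximately chi-squared with as many degrees of freedom as there are non-constant auxiliary regressors. The function names and simulated data are assumptions of this example.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of a least-squares fit of y on X (X includes a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def aux_regression_statistic(X, y, quadratic=False):
    """n * R^2 from regressing squared residuals on fitted values."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid_sq = (y - fitted) ** 2
    cols = [np.ones(len(y)), fitted]
    if quadratic:
        cols.append(fitted ** 2)   # item 2: add a quadratic term
    return len(y) * r_squared(np.column_stack(cols), resid_sq)

rng = np.random.default_rng(6)
n = 1000
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])

y_homo = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)
y_hetero = 1.0 + 2.0 * x + rng.normal(scale=x, size=n)   # noise grows with x

stat_homo = aux_regression_statistic(X, y_homo)
stat_hetero = aux_regression_statistic(X, y_hetero)
print(stat_homo, stat_hetero)
```

For the one-regressor version, a value well above 3.84 (the 5% critical value of a chi-squared distribution with one degree of freedom) suggests heteroskedasticity.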

Testing correlation of residuals:

1. serial correlation:
• lag plot: scatter plot of a sequence against its lagged values, typically lag 1.
• autocorrelation plots (ACF, PACF): autocorrelation coefficient at various time lags, with confidence band as reference lines.
• estimate an autoregressive model at low lags (e.g. AR(1)) for the residuals.
• runs test: standardized number of runs.
• unit-root tests;
2. intraclass correlation: large one-way ANOVA.
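For the serial-correlation checks above, a minimal sketch of the sample autocorrelation coefficient, applied to white noise and to a simulated AR(1) sequence. The coefficient 0.8, the series length, and the function name are assumptions of this example.

```python
import numpy as np

def autocorrelation(series, lag=1):
    """Sample autocorrelation coefficient at a positive integer lag."""
    series = np.asarray(series, dtype=float)
    centered = series - series.mean()
    return np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2)

rng = np.random.default_rng(7)
n = 2000
white = rng.normal(size=n)

# AR(1) sequence with coefficient phi, driven by the same innovations.
phi = 0.8
ar1 = np.empty(n)
ar1[0] = white[0]
for t in range(1, n):
    ar1[t] = phi * ar1[t - 1] + white[t]

band = 1.96 / np.sqrt(n)    # approximate 95% band under no autocorrelation
print(autocorrelation(white), autocorrelation(ar1), band)
```

Residuals that behave like `white` should stay mostly inside the band at every lag, while residuals that behave like `ar1` show a lag-1 coefficient near `phi`, far outside it.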

Testing (conditional) normality of residuals:

1. Graphical techniques:
• For location and scale families of distributions: probability plot (QQ plot of data against a simulated sample; such as normal probability plot);
• For finding the shape parameter: probability plot correlation coefficient (PPCC) plot;
2. Goodness-of-fit tests for distributional adequacy:
• General distributions: chi-squared goodness-of-fit test, Kolmogorov-Smirnov (K-S) test, Anderson-Darling test;
• Normality: Shapiro-Wilk test, Shapiro-Francia test;

### On Predictors

Graphical residual analysis: If the model form is adequate, residual scatter plots should appear to be a random field over all potential predictors. It is impractical to measure every independent quantity, but all available attributes should be checked, and ambient variables should be measured. If a scatter plot of residuals versus a variable shows systematic structure, the model form should be adjusted in that predictor, or the variable should be included as a predictor if it is not one already.

Lack-of-fit statistics/tests for model adequacy: (Testing model adequacy requires replicate measurements, e.g. validation and cross validation.)

• Adequacy of existing predictors (misspecified or missing terms)
• F-test for the ratio of "mean square for lack-of-fit" (on fitted model) to "mean square of pure error" (on replicated observations): $\hat{\sigma}_m^2 / \hat{\sigma}_r^2$
• t-test for inclusion of a single explanatory variable (statistical significance away from zero)
• F-test for inclusion of a group of variables
• Dropping predictors
• t-test of parameter significance

Multicollinearity:

• unusually high standard errors of regression coefficients.
• unusually high R-squared when you regress one explanatory variable on the others

Change of model structure between groups of observations

Comparing model structures

### On Subgroups of Observations

Outliers: observations that deviate markedly from other observations in the sample.

• Graphical techniques: normal probability plot, box plot, histogram;
• Z-score $z_i = (y_i - \bar{y}) / s$; modified Z-score $M_i = 0.6745 (y_i - \tilde{y}) / \text{MAD}$, where $\tilde{y}$ is the median and MAD is the median absolute deviation.
• Formal outlier tests (for normally distributed data): Grubbs' test, Tietjen-Moore test, generalized extreme Studentized deviate (ESD) test.
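The two scores in the list above can be computed as follows. The data are made up for illustration, and the cutoff $|M_i| > 3.5$ is a commonly used convention assumed here, not a value given in the text.

```python
import numpy as np

def z_scores(y):
    """Classical Z-scores using the sample mean and standard deviation."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.std(ddof=1)

def modified_z_scores(y):
    """Modified Z-scores using the median and median absolute deviation."""
    y = np.asarray(y, dtype=float)
    median = np.median(y)
    mad = np.median(np.abs(y - median))
    return 0.6745 * (y - median) / mad

y = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 50.0])
z = z_scores(y)
m = modified_z_scores(y)

print(np.round(z, 2))
print(np.round(m, 2))
print("flagged by |M| > 3.5:", np.where(np.abs(m) > 3.5)[0])
```

In this small sample the extreme value inflates the standard deviation enough that its classical Z-score stays below 3, while the modified Z-score, built on the median and MAD, flags it clearly.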

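Influential observations: observations that have a relatively large effect on the regression model's predictions. High-leverage points, whose predictor values are far from the rest of the sample, are the usual candidates, although a high-leverage point is influential only if its outcome also departs from the pattern of the other observations.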
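Leverage can be sketched as the diagonal of the hat matrix $H = X (X^T X)^{-1} X^T$ of a linear model. The function name and the toy predictor values below are assumptions of this example, and the rule of thumb mentioned afterwards is a common convention rather than something stated in the text.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X^T X)^{-1} X^T.

    Uses the reduced QR decomposition X = Q R, for which H = Q Q^T
    when X has full column rank.
    """
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])   # last point is far out in x
X = np.column_stack([np.ones(len(x)), x])
h = leverages(X)

print(np.round(h, 3))
print("sum of leverages:", h.sum())   # number of columns, up to rounding
```

A common rule of thumb flags points whose leverage exceeds twice the average, $2k/n$ for $k$ columns and $n$ observations; here only the last point does. Leverage depends on the predictors alone, so whether that point actually changes the fit still depends on its outcome value.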

References: [@ISL2015], [@ESL2013]