Module 3: overview of selected data analysis methods for high-dimensional data¶

Principal Component Analysis¶

Principal Component Analysis (PCA) is a linear unsupervised method of dimension reduction.

This means that:

  • PCA is based on application of linear transformations,
  • it takes into account data parameters only, ignoring a priori the labels, if exist,
  • it allows to reduce the number of data features to a smaller one (the new features are linear combinations of the old ones).

Let $X_1,\ldots,X_p$ be the set of feature variables. Given that data, PCA finds a new set of variables $Y_1,\ldots,Y_p$ such that:

  • each $Y_i$ is a linear combination of the features, that is $Y_i=\phi_{i,1}X_1+\ldots +\phi_{i,p}X_p$,
  • $Y_1$ has the maximum variance among all such linear combinations,
  • $Y_2$ is uncorrelated to $Y_1$ and has the maximum variance,
  • in general: $Y_i$ is uncorrelated to $Y_1,\ldots,Y_{i-1}$ and has the maximal variance.

The coefficients $\phi_{i,j}$ are called loadings of the $i$-th principal component (that is $Y_{i}$).

In practice, though, we are not given random variables $X_i$ but rather the data representing it. In that way, each of $X_i$ is a finite column vector, say of length $m$, and the whole data can be arranged into a matrix $$ X=\left([X_{i,j}]_{1\leq i\leq p, 1\leq j\leq m}\right)^T. $$

Comments concerning PCA¶

  • in principle, it is an unsupervised learning method, i.e. we only care about the features, not the labels (one can apply it though in the supervised case as well - we discuss it later)

  • PCA is sensitive to scaling in the data (by desing)

  • We need to standarize the input: center and scale the data

  • PCA creates a new basis, which is orthogonal by design:

source: https://towardsdatascience.com/principal-component-analysis-pca-explained-visually-with-zero-math-1cbf392b9e7d

  • To get loadings $\phi$'s and components' variances $\lambda$'s we can:

    • Perform eigenvalue decomposition of $X^TX$

    or

    • Perform SVD decomposition of $X$: $$ X=U\Sigma V^T, $$ where $U$ is a $n\times n$ matrix called the score matrix and $V$ is a $p\times p$ matrix called the loading matrix
  • The cutoff parameter can be decided using cross validation.

  • PCA can be a preprocessing step for other methods, e.g. linear regression

    • "Good representation is all you need"

Example: Iris dataset¶

In [ ]:
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target
In [ ]:
X
In [ ]:
y

Scale the data¶

In [ ]:
# In general, it's a good idea to scale the data prior to PCA.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)    

X

Perform PCA¶

In [ ]:
from sklearn.decomposition import PCA

pca = PCA()
x_new = pca.fit_transform(X)

x_new

Loadings of PCA¶

In [ ]:
import numpy as np
import pandas as pd

# loadings = columns of the following dataframe
# vectors ploted on a biplot - rows of this dataframe
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

loading_matrix = pd.DataFrame(loadings[:,:2], columns=['PC1', 'PC2'], index=iris.feature_names)

loading_matrix
In [ ]:
pca.components_
In [ ]:
np.transpose(pca.components_[1:3, :])

Biplots¶

In [ ]:
import matplotlib.pyplot as plt

def biplot(score,coeff,labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    plt.scatter(xs * scalex,ys * scaley, c = y)
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("first considered component")
    plt.ylabel("second considered component")
    plt.grid()

# code source: https://stackoverflow.com/questions/39216897/plot-pca-loadings-and-loading-in-biplot-in-sklearn-like-rs-autoplot
In [ ]:
biplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
In [ ]:
biplot(x_new[:,1:3],np.transpose(pca.components_[1:3, :]))

Proportion variance explained¶

In [ ]:
pca.explained_variance_
In [ ]:
pca.explained_variance_ratio_
In [ ]:
import seaborn as sns # install via "pip install seaborn" in terminal if the command is not working
In [ ]:
pve_df = pd.DataFrame({'component': [1,2,3,4], 'PVE': pca.explained_variance_ratio_})

cumulative_pve = [0]*(len(pca.explained_variance_ratio_)+1)
cumulative_pve[0] = 0
cumulative_pve[1] = pca.explained_variance_ratio_[0]
for i in range(1,len(pca.explained_variance_ratio_)+1):
    cumulative_pve[i] = cumulative_pve[i-1]+pca.explained_variance_ratio_[i-1]
    
cumulative_pve_df = pd.DataFrame({'component': [0,1,2,3,4], 'cumulative PVE': cumulative_pve})

fig, axs = plt.subplots(ncols=2)

sns.pointplot(x='component', y='PVE', data=pve_df, ax=axs[0]);
sns.pointplot(x='component', y='cumulative PVE', data=cumulative_pve_df, ax=axs[1]);

Principal Component Regression (PCR) - PCA as a prestep for supervised learning - diabetes dataset¶

Load the dataset¶

In [ ]:
# Provide scikit
import sklearn
from sklearn import datasets
In [ ]:
diabetes = datasets.load_diabetes()
In [ ]:
diabetes.feature_names
In [ ]:
diabetes.data
In [ ]:
diabetes.target
In [ ]:
X, y = diabetes.data, diabetes.target
In [ ]:
X

Perform PCA¶

In [ ]:
# In general, it's a good idea to scale the data prior to PCA.
from sklearn.preprocessing import StandardScaler
In [ ]:
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)   
In [ ]:
X
In [ ]:
from sklearn.decomposition import PCA
In [ ]:
pca = PCA()
X_pca = pca.fit_transform(X)
In [ ]:
X_pca
In [ ]:
pca.explained_variance_ratio_
In [ ]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
In [ ]:
pve_df = pd.DataFrame({'component': [1,2,3,4,5,6,7,8,9,10], 'PVE': pca.explained_variance_ratio_})

cumulative_pve = [0]*(len(pca.explained_variance_ratio_)+1)
cumulative_pve[0] = 0
cumulative_pve[1] = pca.explained_variance_ratio_[0]
for i in range(1,len(pca.explained_variance_ratio_)+1):
    cumulative_pve[i] = cumulative_pve[i-1]+pca.explained_variance_ratio_[i-1]
    
cumulative_pve_df = pd.DataFrame({'component': [0,1,2,3,4,5,6,7,8,9,10], 'cumulative PVE': cumulative_pve})

fig, axs = plt.subplots(ncols=2)

sns.pointplot(x='component', y='PVE', data=pve_df, ax=axs[0]);
sns.pointplot(x='component', y='cumulative PVE', data=cumulative_pve_df, ax=axs[1]);

Select the principal components with highest variance¶

In [ ]:
X_pca_selected = X_pca[:,[0,1]]
In [ ]:
X_pca_selected

Split the data into train and test (we want to compare the results)¶

In [ ]:
# We import the submodule allowing to split the data into train and test parts
from sklearn.model_selection import train_test_split
In [ ]:
data_idies = range(len(X))
In [ ]:
data_idies
In [ ]:
idies_train, idies_test = train_test_split(data_idies, test_size=0.35)
In [ ]:
y_train, y_test = y[idies_train], y[idies_test]
In [ ]:
X_train, X_test = X[idies_train], X[idies_test]
In [ ]:
X_pca_train, X_pca_test = X_pca_selected[idies_train], X_pca_selected[idies_test]

Create regression models¶

In [ ]:
# we must load the linear regression submodule
from sklearn.linear_model import LinearRegression
In [ ]:
def regression_model(x, y):
    model = LinearRegression()
    model.fit(x, y)
    return model
In [ ]:
original_model = regression_model(X_train, y_train)
In [ ]:
pca_model = regression_model(X_pca_train, y_train)

Assess the accuracy of the models¶

Residual sum of squares and its estimator:¶

$$ RSS = \sum_{i=1}^n(y_i-(y\_{\text{predict})_i})^2,\\ RSE=\sqrt{\frac{1}{n-2}RSS}, $$ where $n$ is the number of test samples taken for the assessment.

$R^2$ - the "normalized" RSS¶

$$ TSS = \sum_i (y_i - \bar{y})^2 $$

$$ R^2 = \frac{TSS- RSS}{TSS}$$

In [ ]:
import numpy as np
In [ ]:
def RSS_RSE(model, x, y):
    y_predict = model.predict(x)
    n = len(x)
    RSS = sum((y-y_predict)**2)
    RSE = np.sqrt(1/(n-2)*RSS)
    return RSS, RSE
In [ ]:
def R2(model, x, y):
    y_mean = np.mean(y)
    TSS = sum((y-y_mean)**2)
    RSS, RSE = RSS_RSE(model, x, y)
    return (TSS-RSS)/TSS
In [ ]:
rss_rse_original_train = R2(original_model, X_train, y_train)
rss_rse_original_test = R2(original_model, X_test, y_test)
rss_rse_pca_train = R2(pca_model, X_pca_train, y_train)
rss_rse_pca_test = R2(pca_model, X_pca_test, y_test)
In [ ]:
print("r2_original_train:", rss_rse_original_train)
print("r2_original_test:", rss_rse_original_test)
print("r2_pca_train:", rss_rse_pca_train)
print("r2_pca_test:", rss_rse_pca_test)
In [ ]:
def R2_k_components(k):
    X_pca_selected = X_pca[:,[i for i in range(k)]]
    X_pca_train, X_pca_test = X_pca_selected[idies_train], X_pca_selected[idies_test]
    pca_model = regression_model(X_pca_train, y_train)
    r2_train = R2(pca_model, X_pca_train, y_train)
    r2_test = R2(pca_model, X_pca_test, y_test)
    return r2_train, r2_test
In [ ]:
r2_train_pca, r2_test_pca = [0]*10, [0]*10
for k in range(1,10):
    r2_train_pca[k], r2_test_pca[k] = R2_k_components(k)
In [ ]:
r2_train_pca
In [ ]:
r2_test_pca
In [ ]:
train_df = pd.DataFrame({'component': [1,2,3,4,5,6,7,8,9,10], 'pca train': r2_train_pca})
test_df = pd.DataFrame({'component': [1,2,3,4,5,6,7,8,9,10], 'pca test': r2_test_pca})

fig, axs = plt.subplots(ncols=2)

sns.pointplot(x='component', y='pca train', data=train_df, ax=axs[0]);
sns.pointplot(x='component', y='pca test', data=test_df, ax=axs[1]);

Feature selection¶

Best subset selection (brute force)¶

  1. Let $\mathcal M_0$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.

  2. For $k=1,\ldots,p$:

(a) Fit all $p_k$ models that contain exactly k predictors.

(b) Pick the best among these $p_k$ models, and call it $\mathcal M_k$. Here best is defined as having the smallest RSS, or equivalently largest $R^2$.

  1. Select a single best model from among $\mathcal M_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, Mallow’s $C_{p}$, Akaike information critetion (AIC), Bayesian information criterion (BIC), or adjusted $R^2$

Stepwise selection¶

  • Exhastive search requires to checks all $2^p$ model

  • Best subset selection may also suffer from statistical problems when p is large: larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data.

  • Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates.

  • For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.

Forward Stepwise Selection (Greedy)¶

  1. Let $\mathcal M_0$ denote the null model, which contains no predictors.

  2. For $k = 0, \ldots, p-1$:

(a) Consider all $p − k$ models that augment the predictors in $\mathcal M_k$ with one additional predictor.

(b) Choose the best among these p − k models, and call it $\mathcal M_{k+1}$. Here best is defined as having smallest RSS or highest R2.

  1. Select a single best model from among $\mathcal M_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, Cp (AIC), BIC, or adjusted $R^2$
  • Computational advantage over best subset selection is clear.

  • It is not guaranteed to find the best possible model out of all $2^p$ models containing subsets of the p predictors.

  • Forward selection works in the case $n<p$. Then you perform the iteration up to $n-1$.

Backward Stepwise Selection (Greedy)¶

  1. Let $\mathcal M_p$ denote the full model, which contains all $p$ predictors.
  2. For $k=p,p-1,\ldots, 1$:

(a) Consider all $k$ models that contain all but one of the predictors in $\mathcal M_k$, for a total of $k − 1$ predictors.

(b) Choose the best among these k models, and call it $\mathcal M_{k−1}$. Here best is defined as having smallest RSS or highest R2.

  1. Select a single best model from among $\mathcal M_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, Cp (AIC), BIC, or adjusted $R^2$.
  • Like forward stepwise selection, the backward selection approach searches through only $1 + p(p + 1)/2$ models, and so can be applied in settings where p is too large to apply best subset selection

  • Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors.

  • Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.

Choosing the Optimal Model¶

We would like to select the best model with respect to the test error. The two common approaches are:

  • indirect estimatation of the test error by making an adjustment to the training error to account for the bias due to overfitting,

  • direct estimation of the test error, using either validation cross-validation

Information Criteria¶

  • Below $d$ is the total # of parameters used and $\hat{\sigma}^2$ is an estimate of the variance of the error ($\hat{\sigma}^2$ is an estimate for $\epsilon$, where the regression relationship is given by $ Y = f(X_1, \ldots, X_p) + \epsilon $ - typically $\hat{\sigma}^2$ is estimated using the full model containing all predictors).

  • All the criteria add a penalty for the size of the model.

    • we know that increasing the number of parametres decreases RSS
    • the penalty acts in the oposite direction
  • Mallow’s Cp:

$$ C_p = \frac{1}{n} (RSS + 2d\hat{\sigma}^2 ) $$

  • Akaike Criterion:

$$ AIC = \frac{1}{n\hat{\sigma}^2} (RSS + 2d\hat{\sigma}^2 ) $$

  • Bayes Criterion:

$$ BIC = \frac{1}{n}(RSS + log(n)d\hat{\sigma}^2 ) $$

  • $R^2$

$$ R^2 = 1 - \frac{RSS}{TSS} $$

  • Adjusted $R^2$

$$ \text{Adjusted } R^2 = 1 − \frac{RSS/(n − d-1)}{TSS/(n − 1)} $$

Example: credit data¶

In [ ]:
# Run it if not yet downloaded only

import urllib.request
import zipfile
import os
from pathlib import Path

url = "https://www.statlearning.com/s/ALL-CSV-FILES-2nd-Edition-corrected.zip"
zip_path = Path("ALL-CSV-FILES-2nd-Edition-corrected.zip")
extract_dir = Path("ALL-CSV-FILES-2nd-Edition-corrected")

# 1) download only if not present
if not zip_path.exists():
    print("Downloading zip...")
    urllib.request.urlretrieve(url, zip_path)
    print("Saved to", zip_path)
else:
    print("Zip already exists:", zip_path)

# 2) unzip
if not extract_dir.exists():
    print("Extracting...")
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(extract_dir)
    print("Extracted to", extract_dir)
else:
    print("Already extracted to:", extract_dir)

# 3) list extracted files
print("\nFiles:")
for p in sorted(extract_dir.glob("**/*")):
    if p.is_file():
        print(p)
In [ ]:
import pandas as pd
In [ ]:
credit_df = pd.read_csv('ALL-CSV-FILES-2nd-Edition-corrected/ALL CSV FILES - 2nd Edition/Credit.csv')
In [ ]:
credit_df
In [ ]:
import statsmodels.formula.api as smf # install via "pip install statsmodels" in terminal if the command is not working
import numpy as np
from itertools import combinations
In [ ]:
# region is categorical with 3 possible values
# so the number of predictors is (10-1) + 2 = 11
n_predictors_with_dummy = 11
n = len(credit_df)
cp, bic, adj_r2 = [], [], []

# estimate σ^2
columns = credit_df.columns[:-1]
formula = f'{credit_df.columns[-1]} ~ {" + ".join(columns)}'
object_to_fit = smf.ols(formula, credit_df)
model = object_to_fit.fit()
sigma2 = np.sum(model.resid**2)/(n-len(columns)-1)

# Best subset selection (brute force)
# Step 2.
for d in range(1, n_predictors_with_dummy):
    r2_max = 0
    for selection in combinations(range(len(columns)), d):
        selection = list(selection)
        formula = f'{credit_df.columns[-1]} ~ {" + ".join(columns[selection])}'
        p_k = smf.ols(formula, credit_df).fit()
        rss = p_k.ssr
        tss = p_k.centered_tss
        if 1-rss/tss > r2_max:
            r2_max = 1-rss/tss
            M_k = p_k
    
    rss = M_k.ssr
    tss = M_k.centered_tss
    cp.append(1/n * (rss + 2 * d * sigma2))
    bic.append(1/n * (rss + np.log(n) * d * sigma2))
    adj_r2.append(1 - rss / (n - d - 1) * (n - 1) / tss)
In [ ]:
import seaborn as sns
In [ ]:
plot_df = pd.concat([
  pd.DataFrame({'val': cp, 'type': 'cp'}).reset_index(),
  pd.DataFrame({'val': bic, 'type': 'bic'}).reset_index(),
])
plot_df['index'] += 1
plot_df = plot_df.reset_index(drop=True)
sns.pointplot(x='index', y='val', hue='type', data=plot_df);
In [ ]:
bic
In [ ]:
r2_df = pd.DataFrame({'val': adj_r2, 'type': 'adj_r2'}).reset_index()
sns.pointplot(x='index', y='val', data=r2_df);
In [ ]:
adj_r2

Shrinkage¶

  • Ridge regression and Lasso

  • The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.

  • As an alternative, we can fit a model containing all $p$ predictors using a technique that regularizes (shrinks) the coefficient estimates

  • It may not be immediately obvious why such a constraint should improve the fit, but they do

  • Different kinds of regularizations are very common in the modern Machine Learning and substantially improve training for big models

  • It might happen that tuning the key hyperparameters is painful

Ridge regression¶

  • The new, regularized loss function of ridge regression is of the following form:

$$ RSS + \lambda ||\beta||_2^2 = RSS + \lambda\sum_{j=1}^p\beta_j^2 $$

  • $\lambda$ is a hyperparameter than needs to be tuned

  • Ridge regression cares about balancing between two objectives:

    • Mninimiaze RSS, at the least suqares
    • Keepnig the magnitude of parameters small (close to zero)
    • The tuning parameter $\lambda$ controls the relative impact of these two losses
  • Selecting a good value for $\lambda$ is critical (there are people in the industry that specialize in this kind of work)

  • Here we will use cross-validation. Often you just use some kind of grid search.

  • We do not want to shrink the intercept, as it corresponds to the mean of the data (when x's are 0)

Lasso¶

  • The new, regularized loss funtion of lasso regression is of the following form:

$$ RSS + \lambda ||\beta||_1 = RSS + \lambda\sum_{j=1}^p|\beta_j| $$

  • As with ridge regression, the lasso shrinks the coefficient estimates towards zero.

  • The image is known as lasso path

  • $l_1$ loss forces some coefficient exactly 0

    • Depends on the value of $\lambda$.
  • Hence lasso performs variable selection.

  • We say that lasso yields sparse models.

Ridge vs lasso¶

  • Neither ridge regression nor the lasso will universally dominate the other.

  • Lasso performs better in sparse case (small number of nonzero coeffs).

  • Ridge does well if there is a lot of predictors correlated with the outcome.

  • The number of predictors that is related to the response is never known a priori for real data sets.

Programming tasks¶

Task 1: Assess accuracy of regression models distilled by PCA¶

Compare the accuracy of the the models on predictors being subsequent subsets of pripcipal components with highest variances. Measure the accuracy by means of $R^2$. For each model the accuracy shall be assessed by usage of cross-validation with TODO PAREMETERS.

Input:

  • This dataset TODO.

Output:

  • a plot presenting $R^2$ for the model defined by the first $k$ principal components. The $X$ axis correspond to $k$, and the $Y$ axis to $R^2$.

Sources used for preparation of this notebook¶

  • a substantial part of this notebook constitutes the notes of dr Łukasz Kuciński, https://sites.google.com/view/lukaszkucinski, from Institute of Mathematics of the Polish Academy of Sciences; his notes were from the course on modern methods of data analysis

  • G. James, D. Witten, T. Hastie, R. Tibshirani, J. Taylor, An Introduction to Statistical Learning with Applications in Python (ISLP), 2023 and its associated slides - some of the descriptions here were taken from these slides, e.g. chatper 6 slides