Chapter 8: Dimensionality Reduction
Exercises:
P1. What are the main motivations for reducing a dataset’s dimensionality? What are the main drawbacks?
A1: Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.
This process not only streamlines computational tasks but also aids in visualizing data trends, mitigating the risk of the curse of dimensionality, and improving the generalization performance of machine learning models.
(Reference: Dimensionality Reduction Techniques — PCA, LCA and SVD)
Main drawbacks:
Dimensionality reduction loses some information, which can degrade the performance of subsequent training algorithms.
It can be computationally intensive.
Transformed features are often hard to interpret, which makes the resulting model less explainable.
P2. What is the curse of dimensionality?
A2: The curse of dimensionality refers to a set of problems that arise when working with high-dimensional data. The dimension of a dataset corresponds to the number of attributes or features it contains; a dataset with a large number of attributes, generally on the order of a hundred or more, is referred to as high-dimensional. Some of the difficulties that come with high-dimensional data appear when analyzing or visualizing the data to identify patterns, and others appear while training machine learning models. (Reference: The Curse of Dimensionality – Dimension Reduction)
As we add more dimensions to our dataset, the volume of the space increases exponentially. This means that the data becomes sparse. Not only do all these features make the training extremely slow, but they can also make it much harder to find a good solution.
It causes several problems: data sparsity, increased computation, overfitting, distances losing meaning, performance degradation, and visualization challenges. (Reference: The Curse of Dimensionality in Machine Learning: Challenges, Impacts, and Solutions)
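To make the "distances lose meaning" point concrete, here is a minimal sketch (an illustration added for this write-up, using random placeholder data): it samples points in the unit hypercube in different numbers of dimensions and shows that the spread of pairwise distances relative to their mean shrinks as the dimensionality grows, so every point ends up roughly equally far from every other point.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)

def relative_distance_spread(n_dims, n_points=500):
    # Sample points uniformly in the unit hypercube and compute all pairwise distances
    X = rng.random((n_points, n_dims))
    dists = pdist(X)
    # Ratio of spread to mean: small values mean all distances look alike
    return dists.std() / dists.mean()

for d in (2, 10, 100, 1000):
    print(f"{d:>4} dimensions: std/mean of pairwise distances = {relative_distance_spread(d):.3f}")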
P3. Once a dataset’s dimensionality has been reduced, is it possible to reverse the operation? If so, how? If not, why?
A3: The operation can be approximately reversed for some algorithms. With PCA, for example, you can call the inverse_transform() method to decompress the reduced data back to the original number of dimensions. The reconstruction is only an approximation, however: the variance discarded during the reduction is lost for good, and some algorithms (such as t-SNE) provide no inverse transformation at all. Code implementation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1)
X, y = mnist['data'], mnist['target']
X = X.to_numpy()

# Take a 5,000-instance training sample
X_train, _, y_train, _ = train_test_split(X, y, train_size=5000, random_state=42)

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Fit PCA and find the number of components that preserves 95% of the variance
pca = PCA()
pca.fit(X_scaled)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

# Reduce the data, then reverse the operation with inverse_transform()
pca = PCA(n_components=d)
X_reduced = pca.fit_transform(X_scaled)
X_recovered = pca.inverse_transform(X_reduced)

# Function to plot the original and reconstructed images
def plot_digits(instances, images_per_row=5, title=None):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size, size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap='gray')
    plt.axis('off')
    if title:
        plt.title(title)

# Select 25 random digits to display
n_images = 25
idxs = np.random.choice(X_train.shape[0], n_images, replace=False)
plt.figure(figsize=(10, 5))
plt.subplot(121)
plot_digits(X_train[idxs], title='Original')
plt.subplot(122)
plot_digits(scaler.inverse_transform(X_recovered[idxs]), title='Compressed')
plt.show()
Fig. 1. Visual comparison of the original digits and their reconstructions after PCA compression.
P4. Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?
A4: Yes, Principal Component Analysis (PCA) can be used to reduce the dimensionality of a highly nonlinear dataset, and it will at least get rid of useless dimensions. However, because PCA only finds linear projections, it may not work well when the data is nonlinearly correlated (for example, a Swiss roll), and it can lose necessary information when applied to such a dataset; in those cases a nonlinear method such as Kernel PCA is usually a better choice.
Fig. 2. Swiss roll reduced to 2D using kPCA with various kernels.
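For reference, a figure like Fig. 2 can be reproduced with a short sketch along the following lines, assuming scikit-learn's make_swiss_roll and KernelPCA (the kernel parameters here are illustrative choices, not tuned values from the original):
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

# Generate a Swiss roll and project it to 2D with kernel PCA using different kernels
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
kernels = [("linear", {}), ("rbf", {"gamma": 0.04}), ("sigmoid", {"gamma": 0.001, "coef0": 1})]

plt.figure(figsize=(12, 4))
for i, (kernel, params) in enumerate(kernels, start=1):
    kpca = KernelPCA(n_components=2, kernel=kernel, **params)
    X_reduced = kpca.fit_transform(X)
    plt.subplot(1, 3, i)
    plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap='viridis', s=10)
    plt.title(f"kPCA ({kernel} kernel)")
plt.show()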
P5. Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?
A5: It depends on the dataset.
1. Almost perfectly aligned points: PCA can reduce the dataset to a single dimension while still preserving 95% of the variance.
2. Perfectly random points scattered across all 1,000 dimensions: roughly 950 dimensions (95% of 1,000) will be needed.
So the answer can be anywhere from 1 to about 950 dimensions; plotting the cumulative explained variance is the practical way to find out (see the sketch below).
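A minimal sketch of how to compute the required number of dimensions from the cumulative explained variance, assuming an arbitrary 1,000-dimensional data matrix X (random data here, purely as a placeholder):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(2000, 1000)            # placeholder 1,000-dimensional dataset

pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1          # smallest number of components reaching 95%
print("Dimensions needed for 95% variance:", d)

# Equivalently, PCA can pick d automatically when given a float ratio
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)
print("Reduced shape:", X_reduced.shape)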
P6. In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?
A6:
Vanilla PCA: use when the dataset fits in memory.
Incremental PCA: use for large datasets that don't fit in memory, or for online (on-the-fly) tasks.
Randomized PCA: use when you want to considerably reduce dimensionality and the dataset fits in memory; it is much faster than full SVD.
Kernel PCA: use for nonlinear datasets (a short usage sketch of each variant follows below).
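A minimal usage sketch of the four variants in scikit-learn (the placeholder data, component counts, number of batches, and kernel parameters are illustrative assumptions):
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA

X = np.random.rand(10_000, 200)   # placeholder dataset

# Vanilla PCA: full SVD, dataset must fit in memory
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)

# Incremental PCA: feed the data in mini-batches (works out-of-core with partial_fit)
inc_pca = IncrementalPCA(n_components=50)
for batch in np.array_split(X, 10):
    inc_pca.partial_fit(batch)
X_inc = inc_pca.transform(X)

# Randomized PCA: approximate but much faster when n_components << original dimensions
rnd_pca = PCA(n_components=50, svd_solver="randomized", random_state=42)
X_rnd = rnd_pca.fit_transform(X)

# Kernel PCA: nonlinear projection (rbf kernel with an illustrative gamma)
kpca = KernelPCA(n_components=50, kernel="rbf", gamma=0.04)
X_kpca = kpca.fit_transform(X[:2000])   # kernel PCA scales poorly, so use a subset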
P7. How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?
A7: 1. One way to measure how well a dimensionality reduction algorithm preserves the information in the original data is to calculate the reconstruction error. The lower the reconstruction error, the better the algorithm is at retaining the essential features of the data (see the sketch after this list).
2. Another way to evaluate a dimensionality reduction algorithm's performance is to look at the data compression ratio. The higher the compression ratio, the more efficient the algorithm is at reducing the dimensionality of the data.
3. A third way is to examine the quality of the resulting data visualization, using methods such as scatter plots, heat maps, or silhouette plots.
4. A fourth way is to check the downstream classification accuracy: the higher the accuracy, the better the algorithm preserves the discriminative power of the data.
5. A fifth way is to consider the algorithm's complexity and scalability, measured for example by time complexity, space complexity, or iteration complexity.
6. A sixth way is to assess the algorithm's suitability and robustness, using methods such as cross-validation, sensitivity analysis, or parameter tuning.
Reference: How can you evaluate a dimensionality reduction algorithm's performance effectively?
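A minimal sketch of the first two measures (reconstruction error and compression ratio), using PCA on a placeholder data matrix:
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 300)                    # placeholder dataset
pca = PCA(n_components=0.95)                     # keep 95% of the variance
X_reduced = pca.fit_transform(X)
X_recovered = pca.inverse_transform(X_reduced)

# Reconstruction error: mean squared distance between original and recovered points
reconstruction_error = np.mean(np.sum((X - X_recovered) ** 2, axis=1))
# Compression ratio: reduced width relative to the original number of features
compression_ratio = X_reduced.shape[1] / X.shape[1]

print(f"Reconstruction error: {reconstruction_error:.4f}")
print(f"Compression ratio: {compression_ratio:.2%}")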
P8. Does it make any sense to chain two different dimensionality reduction algorithms?
A8: Chaining two different dimensionality reduction algorithms can make sense. A common pattern, sketched below, is to use a fast algorithm such as PCA first to quickly get rid of a large number of useless dimensions, and then apply a slower algorithm such as LLE to the result; this can give a result of similar quality to running the slow algorithm alone in a fraction of the time. However, each stage discards some information, so the results require careful analysis. Reference: Does it make any sense to chain two different dimensionality reduction algorithms?
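A minimal sketch of such a chain, assuming a scikit-learn Pipeline that runs PCA first and then LLE (the placeholder data and parameter values are illustrative):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X = rng.random((2000, 784))   # placeholder high-dimensional dataset (e.g. flattened images)

# PCA quickly removes most of the useless dimensions, then the slower LLE finishes the job
pca_lle = Pipeline([
    ("pca", PCA(n_components=0.95, random_state=42)),
    ("lle", LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)),
])
X_reduced = pca_lle.fit_transform(X)
print("Reduced shape:", X_reduced.shape)   # (2000, 2)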
P9. Load the MNIST dataset (introduced in Chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training and the remaining 10,000 for testing). Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%. Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next, evaluate the classifier on the test set. How does it compare to the previous classifier?
A9: Code implementation:
import time
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1)
X, y = mnist['data'], mnist['target']
X = X.to_numpy()

# Split the dataset: first 60,000 instances for training, remaining 10,000 for testing
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Time the training of a Random Forest classifier on the full data
# (a small number of trees is used here to keep the runtime short)
start_time = time.time()
rf_classifier = RandomForestClassifier(n_estimators=2, random_state=42)
rf_classifier.fit(X_scaled, y_train)
elapsed_time = time.time() - start_time
print("Time: {:.2f} seconds".format(elapsed_time))
print("Test Accuracy: {:.2f}%".format(rf_classifier.score(X_test_scaled, y_test) * 100))

# Apply PCA, keeping enough components to explain 95% of the variance
pca = PCA()
pca.fit(X_scaled)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

pca = PCA(n_components=d)
X_reduced = pca.fit_transform(X_scaled)
X_test_reduced = pca.transform(X_test_scaled)

# Time the training of a Random Forest classifier on the PCA-reduced data
start_time_pca = time.time()
rf_classifier_pca = RandomForestClassifier(n_estimators=2, random_state=42)
rf_classifier_pca.fit(X_reduced, y_train)
elapsed_time_pca = time.time() - start_time_pca
print("Time after PCA: {:.2f} seconds".format(elapsed_time_pca))
print("Test Accuracy after PCA: {:.2f}%".format(rf_classifier_pca.score(X_test_reduced, y_test) * 100))
Result:
Fig. 3. Training time and test accuracy of the Random Forest classifier on the original data and the PCA-reduced data.
P10. Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image’s target class. Alternatively, you can replace each dot in the scatterplot with the corresponding instance’s class (a digit from 0 to 9), or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits. Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.
A10: Code implementation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, LocallyLinearEmbedding, MDS

# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1)
X, y = mnist['data'], mnist['target'].astype(int)
X = X.to_numpy()

# Work with a 6,000-instance sample to keep the manifold methods tractable
X_sample, _, y_sample, _ = train_test_split(X, y, train_size=6000, random_state=42)

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sample)

plt.figure(figsize=(16, 4))

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.subplot(141)
scatter_pca = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_sample, cmap='tab10', alpha=0.6, s=10)
plt.colorbar(scatter_pca, ticks=range(10))
plt.title('PCA')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')

# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
plt.subplot(142)
scatter_tsne = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample, cmap='tab10', alpha=0.6, s=10)
plt.colorbar(scatter_tsne, ticks=range(10))
plt.title('t-SNE')

# LLE
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_lle = lle.fit_transform(X_scaled)
plt.subplot(143)
scatter_lle = plt.scatter(X_lle[:, 0], X_lle[:, 1], c=y_sample, cmap='tab10', alpha=0.6, s=10)
plt.colorbar(scatter_lle, ticks=range(10))
plt.title('LLE')

# MDS (slow on 6,000 instances; use a smaller subsample if it takes too long)
mds = MDS(n_components=2, random_state=42)
X_mds = mds.fit_transform(X_scaled)
plt.subplot(144)
scatter_mds = plt.scatter(X_mds[:, 0], X_mds[:, 1], c=y_sample, cmap='tab10', alpha=0.6, s=10)
plt.colorbar(scatter_mds, ticks=range(10))
plt.title('MDS')

plt.show()
Results:
Fig. 4. Comparison results for different dimensionality reduction methods.