Hands-On Machine Learning: Exercises

Exercises & Answers

Chapter 6: Decision Trees

Exercises:

P1. What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with one million instances?

A1: The depth of a decision tree is related to the number of splits needed to separate the data points into pure subsets (where all instances belong to the same class). For a decision tree trained without restrictions, the depth can be estimated based on the number of instances in the training set.

If we assume that the tree is balanced, the depth d of the tree can be approximated by:

d ≈ log2(N)

where N is the number of instances in the training set.

For N = 1,000,000:

d ≈ log2(1,000,000) ≈ 20

So, the approximate depth of a decision tree trained without restrictions on a dataset with one million instances would be around 20. In practice the tree is rarely perfectly balanced, and since an unrestricted tree keeps splitting until every leaf is pure, its actual depth is usually somewhat greater than this estimate.
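The log2 estimate above can be checked with a few lines of plain Python:

```python
import math

# Depth of a balanced binary tree with roughly one leaf per instance: d ≈ log2(N)
n_instances = 1_000_000
depth = math.log2(n_instances)
print(round(depth, 2))  # → 19.93, i.e. about 20 levels
```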

P2. Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?

A2: A node's Gini impurity is generally lower than its parent's, but not always. The CART algorithm chooses, at each split, the feature and threshold that minimize the *weighted average* of the children's impurities, not each child's impurity individually. So one child can end up with higher impurity than its parent, as long as the other child's decrease more than compensates for it.
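To make the impurity drop concrete, here is a minimal sketch (the class counts are hand-picked for illustration, not taken from any real dataset) that computes a parent node's Gini impurity and the weighted impurity of its children after a good split:

```python
def gini(counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Parent node: 50 instances of class A, 50 of class B
parent = gini([50, 50])  # 0.5, maximally impure for two classes

# A good split sends most of each class to its own child
left = gini([45, 5])     # mostly class A
right = gini([5, 45])    # mostly class B
weighted_children = 0.5 * left + 0.5 * right

print(parent, round(weighted_children, 3))  # → 0.5 0.18
```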

P3. If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?
A3: Yes, if a Decision Tree is overfitting the training set, it is a good idea to try decreasing the `max_depth` parameter.

Why Decreasing max_depth Helps:

  • Overfitting occurs when the model becomes too complex, capturing noise or details specific to the training data rather than general patterns. This complexity can lead to poor performance on unseen data.
  • Decision Tree Depth: The depth of a decision tree is a measure of its complexity. A deeper tree can create more specific splits, which can result in fitting to the training data very closely (overfitting).
  • Reducing max_depth limits the tree's complexity by capping the number of levels (splits) it can make. This encourages the model to capture only the most significant patterns, helping to generalize better to new data.
P4. If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?

A4: If a decision tree is underfitting the training set, scaling the input features is unlikely to help because decision trees are invariant to the scale of the features. Instead, you should focus on making the tree more complex by adjusting parameters like `max_depth`, `min_samples_split`, and `min_samples_leaf`.
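A quick sketch demonstrating scale invariance (illustrative: standardizing is a strictly increasing transform of each feature, so the sort order of feature values, and therefore the candidate splits, is unchanged):

```python
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Train identical trees on raw and standardized features
tree_raw = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_scaled, y)

# Scaling changes the split thresholds' numeric values, but not which
# instances end up on each side, so the predictions are identical
print((tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all())  # → True
```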

P5. If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?
A5: Training time for a Decision Tree typically scales more than linearly with the size of the training set, due to the nature of the tree-building algorithm, which involves evaluating many possible splits and their corresponding impurities. However, for simplicity, let’s assume that the time complexity for training a Decision Tree is approximately \(O(N \log N)\), where \(N\) is the number of instances.

Given:

  • Training time for 1 million instances = 1 hour.
  • New training set size = 10 million instances.

Estimating the Training Time:

  • The time complexity is \(O(N \log N)\), so the training time \(T\) for the original set is proportional to \(N_1 \log N_1\), and for the new set it is proportional to \(N_2 \log N_2\), where:
    • \(N_1 = 10^6\) (1 million)
    • \(N_2 = 10^7\) (10 million)

Calculate the ratio:

\[\text{Time Ratio} = \frac{T_2}{T_1} = \frac{N_2 \log N_2}{N_1 \log N_1} = \frac{10^7 \log(10^7)}{10^6 \log(10^6)} = 10 \times \frac{\log(10^7)}{\log(10^6)}\]

Since \(\log(10^7)\) and \(\log(10^6)\) are logarithms of powers of 10:

\(\log(10^7) = 7 \log(10)\)

\(\log(10^6) = 6 \log(10)\)

So:

\[\text{Time Ratio} = 10 \times \frac{7}{6} \approx 11.67\]

Conclusion:

  • The time to train the Decision Tree on 10 million instances would be roughly 11.67 times the original time.
  • If the original time is 1 hour, the new training time would be approximately 11.67 hours.

Thus, it would take roughly 11.67 hours to train the Decision Tree on 10 million instances.
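The arithmetic above as a small check (the logarithm base cancels in the ratio, so natural log works fine):

```python
import math

t1_hours = 1.0
n1, n2 = 1_000_000, 10_000_000

# O(N log N) scaling: T2/T1 = (N2 log N2) / (N1 log N1) = 10 * 7/6
ratio = (n2 * math.log(n2)) / (n1 * math.log(n1))
print(round(t1_hours * ratio, 2))  # → 11.67 hours
```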

P6. If your training set contains 100,000 instances, will setting presort=True speed up training?
A6: Setting `presort=True` in a Decision Tree's training process can actually slow down training rather than speed it up, especially with a large training set like 100,000 instances.

Explanation:

  • Presort Option: The `presort=True` option in older Scikit-Learn versions tells the algorithm to sort the data before searching for the best splits. For small datasets (roughly a few thousand instances or fewer), this presorting can make finding the optimal splits faster.

  • Large Datasets: For large datasets, like one with 100,000 instances, the overhead of presorting becomes significant. Sorting has a time complexity of \(O(N \log N)\), where \(N\) is the number of instances, and the presorted data must be processed again at every node, so the cost grows quickly with dataset size and outweighs any benefit.

Conclusion:

  • Small Datasets: For small datasets, 'presort=True' might speed up training because it reduces the time needed to find the best splits.
  • Large Datasets: For large datasets, like one with 100,000 instances, setting 'presort=True' is likely to slow down the training process due to the additional computational cost of sorting.

For large datasets, it is usually better to leave `presort=False` (the default setting) to avoid this overhead. Note also that `presort` was deprecated in Scikit-Learn 0.22 and removed in 0.24, so recent versions no longer accept this parameter at all.

P7. Train and fine-tune a Decision Tree for the moons dataset by following these steps:

a. Use make_moons(n_samples=10000, noise=0.4) to generate a moons dataset.

b. Use train_test_split() to split the dataset into a training set and a test set.

c. Use grid search with cross-validation (with the help of the GridSearchCV class) to find good hyperparameter values for a DecisionTreeClassifier. Hint: try various values for max_leaf_nodes.

d. Train it on the full training set using these hyperparameters, and measure your model’s performance on the test set. You should get roughly 85% to 87% accuracy.

A7: To train and fine-tune a `DecisionTreeClassifier` on the moons dataset using scikit-learn, follow the steps below:

Step-by-Step Implementation:

  1. Generate the Moons Dataset:

    • Use make_moons to generate a synthetic dataset with 10,000 samples and some noise.
  2. Split the Dataset:

    • Use train_test_split to split the data into training and test sets.
  3. Hyperparameter Tuning with Grid Search:

    • Use GridSearchCV to find the best hyperparameters for the DecisionTreeClassifier, specifically tuning max_leaf_nodes.
  4. Train the Model on the Full Training Set:

    • Using the best parameters found in the grid search, train the model on the full training set.
  5. Evaluate the Model:

    • Measure the model's accuracy on the test set.

Full Implementation:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step a: Generate the moons dataset
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

# Step b: Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step c: Use GridSearchCV to find good hyperparameters
param_grid = {
    'max_leaf_nodes': list(range(2, 100)),
    'min_samples_split': [2, 3, 4],
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best hyperparameters found by GridSearchCV
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")

# Step d: With refit=True (the default), GridSearchCV has already retrained
# the best estimator on the full training set, so no extra fit() is needed
best_tree_clf = grid_search.best_estimator_

# Measure performance on the test set
y_pred = best_tree_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test set accuracy: {accuracy:.2f}")

Result: the tuned tree should reach roughly 85% to 87% accuracy on the test set.

P8. Grow a forest by following these steps:

a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn’s ShuffleSplit class for this.

b. Train one Decision Tree on each subset, using the best hyperparameter values found in the previous exercise. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy.

c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy’s mode() function for this). This approach gives you majority-vote predictions over the test set.

d. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a Random Forest classifier!

A8: To build a Random Forest classifier by following the steps you've outlined, we'll proceed as follows:

Step-by-Step Implementation:

  1. Generate 1,000 Subsets of the Training Set:

    • Use ShuffleSplit to create 1,000 different subsets of the training set, each containing 100 instances.
  2. Train a Decision Tree on Each Subset:

    • For each subset, train a DecisionTreeClassifier using the best hyperparameters found in the previous exercise.
  3. Make Predictions with Each Tree:

    • Use each of the trained decision trees to make predictions on the test set.
  4. Majority Voting:

    • Use majority voting to combine the predictions from the 1,000 trees and determine the final prediction for each instance in the test set.
  5. Evaluate the Final Model:

    • Measure the accuracy of the final predictions on the test set.

Full Implementation:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from scipy.stats import mode

# Step a: Generate the moons dataset and split it
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step b: Initialize the best hyperparameters found earlier
best_params = {'max_leaf_nodes': 17, 'min_samples_split': 2}  # Assume these were the best from the previous exercise

# Create a ShuffleSplit object to generate 1,000 subsets of the training set
n_trees = 1000
n_instances = 100
shuffle_split = ShuffleSplit(n_splits=n_trees, train_size=n_instances, random_state=42)

# Train one Decision Tree on each subset and store the predictions
predictions = np.empty((n_trees, len(X_test)), dtype=np.uint8)

for i, (train_indices, _) in enumerate(shuffle_split.split(X_train)):
    X_subset = X_train[train_indices]
    y_subset = y_train[train_indices]
    tree_clf = DecisionTreeClassifier(**best_params, random_state=42)
    tree_clf.fit(X_subset, y_subset)
    y_pred = tree_clf.predict(X_test)
    predictions[i] = y_pred

# Step c: Use majority voting to combine predictions
# (scipy.stats.mode returns the most frequent value along the given axis)
y_majority_votes, _ = mode(predictions, axis=0)
y_majority_votes = y_majority_votes.reshape(-1)

# Step d: Evaluate the final model
accuracy = accuracy_score(y_test, y_majority_votes)
print(f"Test set accuracy with majority voting: {accuracy:.2f}")

Result: the majority-vote ensemble should score about 0.5 to 1.5 percentage points higher than the single tree from the previous exercise.