The tree induction algorithm is a machine learning procedure for building decision trees. Decision trees are models that make predictions from a set of input features. They work by recursively splitting the input space into smaller regions, based on the values of the input features, until a prediction can be made for each region.
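To make the idea of recursive splitting concrete, here is a minimal sketch of the kind of decision function a small trained tree encodes. The feature names, thresholds, and class labels are hypothetical and chosen only for illustration, not taken from a real fitted tree.

def predict_species(petal_length, petal_width):
    # Each comparison is an internal node of the tree; each return is a leaf's class label.
    if petal_length < 2.5:       # first split of the input space
        return "setosa"
    elif petal_width < 1.8:      # the remaining region is split again
        return "versicolor"
    else:
        return "virginica"

print(predict_species(1.4, 0.2))  # -> setosa

The main steps for building such a tree automatically are as follows.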
Prepare the training data: The first step is to select the input features and the output variable, and to clean and preprocess the data as necessary.
Choose a root node: The root node is the top node of the decision tree. The feature (and threshold) used to split at the root is chosen with a splitting criterion such as information gain, gain ratio, or Gini impurity; a short sketch of these measures follows below.
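As an illustration, here is a small, self-contained sketch of how these splitting measures can be computed for one candidate binary split. It assumes only NumPy; the helper names and the toy labels are our own, not part of any library API.

import numpy as np

def entropy(y):
    # Shannon entropy of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    # Gini impurity of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y, left, right):
    # Reduction in entropy obtained by splitting y into left/right subsets
    w_left = len(left) / len(y)
    w_right = len(right) / len(y)
    return entropy(y) - (w_left * entropy(left) + w_right * entropy(right))

def gain_ratio(y, left, right):
    # Information gain normalised by the split information of the split itself
    split_info = entropy(np.array([0] * len(left) + [1] * len(right)))
    return information_gain(y, left, right) / split_info if split_info > 0 else 0.0

# Hypothetical labels before and after splitting on some feature
y = np.array([0, 0, 0, 1, 1, 1])
left, right = y[:3], y[3:]
print(information_gain(y, left, right), gain_ratio(y, left, right), gini(y))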
Split the data: The next step is to split the data based on the selected feature and the threshold value. This creates two or more subsets of the data, each corresponding to a branch of the decision tree.
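The split itself is just a partition of the rows by the chosen feature and threshold. A minimal sketch, assuming NumPy arrays and a hypothetical feature index and threshold:

import numpy as np

def split(X, y, feature, threshold):
    # Boolean mask selecting samples whose feature value falls below the threshold
    mask = X[:, feature] < threshold
    return (X[mask], y[mask]), (X[~mask], y[~mask])

# Hypothetical toy data: 4 samples, 2 features
X = np.array([[1.0, 3.0], [2.0, 1.0], [4.0, 2.0], [5.0, 4.0]])
y = np.array([0, 0, 1, 1])
(left_X, left_y), (right_X, right_y) = split(X, y, feature=0, threshold=3.0)
print(left_y, right_y)  # -> [0 0] [1 1]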
Repeat the process for each subset: The algorithm then repeats the process recursively for each subset, selecting a feature and a threshold value for each node, and splitting the data accordingly.
Stopping criteria: The algorithm continues to split the data until a stopping criterion is met. This criterion could be a maximum tree depth, a minimum number of samples in a node, or a minimum improvement in the impurity measure.
Assign classes: Once the splitting process is complete, the algorithm assigns a class label to each leaf node. This label is determined by the majority class of the samples in the node.
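Putting the recursion, the stopping criteria, and the leaf labelling together, here is a compact sketch of the induction loop. It is deliberately simplified (binary splits, Gini impurity, thresholds taken from observed feature values) and is not how scikit-learn implements the algorithm internally.

import numpy as np
from collections import Counter

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    # Stopping criteria: maximum depth reached, too few samples, or a pure node
    if depth >= max_depth or len(y) < min_samples or len(np.unique(y)) == 1:
        return {"leaf": Counter(y.tolist()).most_common(1)[0][0]}  # majority class label
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            mask = X[:, feature] < threshold
            if mask.sum() == 0 or (~mask).sum() == 0:
                continue
            # Weighted impurity of the two child nodes
            impurity = (mask.sum() * gini(y[mask]) + (~mask).sum() * gini(y[~mask])) / len(y)
            if best is None or impurity < best[0]:
                best = (impurity, feature, threshold, mask)
    if best is None:
        return {"leaf": Counter(y.tolist()).most_common(1)[0][0]}
    _, feature, threshold, mask = best
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples),
    }

# Example: fit on a tiny toy dataset
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
print(build_tree(X, y))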
Prune the tree: Finally, the tree may be pruned to remove branches that do not contribute significantly to the accuracy of the model.
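Scikit-learn supports this step through cost-complexity pruning. The sketch below shows the usual pattern: compute the sequence of pruning strengths, refit one tree per value, and keep the one that does best on a held-out split. The train/validation variables are part of the sketch, not fixed API names.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Compute the sequence of effective alphas for cost-complexity pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha and keep the one that scores best on the validation split
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)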
Test the model: The trained decision tree model can then be tested on a separate validation dataset to evaluate its performance.
Use the model: The trained decision tree can then be used to make predictions on new data: each sample is passed down the tree according to the values of its input features and is assigned the class label of the leaf node it reaches. The example below walks through these steps on the iris dataset using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
# Load the iris dataset
iris = load_iris()
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)
# Create a decision tree classifier
clf = DecisionTreeClassifier(criterion='entropy')
# Train the classifier on the training set
clf.fit(X_train, y_train)
# Get the impurity-based feature importances (with criterion='entropy', these
# reflect how much each feature reduced the entropy across the tree; note that
# scikit-learn does not implement the gain ratio criterion)
importances = clf.feature_importances_
# Print the importance of each feature
for i in range(len(importances)):
    print(f"Feature {i}: {importances[i]}")
# Predict the test set labels using the trained classifier
y_pred = clf.predict(X_test)
# Calculate the accuracy of the classifier on the test set
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
Feature 0: 0.03320819089108454
Feature 1: 0.0
Feature 2: 0.08185982220866377
Feature 3: 0.8849319869002518
Accuracy: 1.0
In this code, we first load the iris dataset using the load_iris() function from scikit-learn. We then split the dataset into training and test sets using the train_test_split() function. We create a decision tree classifier with DecisionTreeClassifier() and set the criterion to 'entropy', so that splits are chosen by information gain (scikit-learn does not offer a gain ratio criterion). We train the classifier on the training set using the fit() method. Next, we read the impurity-based importance of each feature from the classifier's feature_importances_ attribute, which reports how much each feature contributed to reducing the entropy across the tree. Finally, we predict the labels of the test set with predict() and measure the accuracy of those predictions with accuracy_score().
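Continuing the example above, the same trained classifier can also be applied to a single, previously unseen sample; the measurements below are hypothetical.

# Predict the class of one new flower: [sepal length, sepal width, petal length, petal width]
new_sample = [[5.0, 3.4, 1.5, 0.2]]
print(iris.target_names[clf.predict(new_sample)[0]])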