
Building a Decision Tree Classifier from Scratch in Python

Anshuman Jha

Table of Contents
1. Introduction
2. Algorithm's Logic
3. Fundamental Concepts
4. The Structure of a Decision Tree Classifier
5. Implementation in Python
   a. Importing Required Libraries
   b. Entropy and Information Gain
   c. The Decision Tree Node Class
   d. The Decision Tree Classifier Class
   e. Testing the Implementation
6. Conclusion


1. Introduction


A Decision Tree is a popular supervised machine learning algorithm used for both classification and regression tasks. It recursively splits the data into subsets based on the values of the input features. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

In this guide, we will explore the fundamental concepts and steps involved in building a Decision Tree
Classifier from scratch in Python. We will not rely on high-level libraries like Scikit-Learn. Instead, we will
implement the algorithm step-by-step to understand its inner workings.

2. Algorithm's Logic
The decision tree algorithm can be summarized in the following steps (a schematic sketch follows the list):

1. Select the Best Feature: Choose the feature that best splits the data using a metric like Gini impurity or
entropy.
2. Split the Dataset: Divide the dataset into subsets based on the selected feature.
3. Create Decision Nodes: Create nodes representing the decision rules for splitting.
4. Repeat Recursively: Repeat the process for each subset until a stopping criterion is met (e.g., all
instances belong to one class or maximum tree depth is reached).
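As an illustrative sketch only, the recursion might look like this; is_pure, best_feature, split_on, majority_class, Leaf, and Node are placeholder names, and the concrete implementation appears in Section 5:

def build_tree(X, y, depth=0):
    # Stop when the node is pure or a depth limit is reached
    if is_pure(y) or depth >= MAX_DEPTH:
        return Leaf(majority_class(y))           # leaf node: final class
    feature = best_feature(X, y)                 # 1. select the best feature
    subsets = split_on(X, y, feature)            # 2. split the dataset
    children = [build_tree(X_s, y_s, depth + 1)  # 4. repeat recursively
                for X_s, y_s in subsets]
    return Node(feature, children)               # 3. create a decision node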

Gini Impurity and Entropy

These are measures used to evaluate the quality of a split; a small numerical illustration follows the definitions:

• Gini Impurity: Measures the likelihood of an incorrect classification of a new instance if it was
randomly labelled according to the distribution of labels in the dataset.
• Entropy: Measures the uncertainty or randomness in the data.
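Entropy is implemented in Section 5. Gini impurity does not appear in the code below, so here is a minimal sketch of it (assuming labels arrive as a plain list or 1-D array; gini_impurity is a name introduced here for illustration):

from collections import Counter

def gini_impurity(y):
    # Gini(S) = 1 - sum_i p_i^2, where p_i is the proportion of class i
    counts = Counter(y)
    n = len(y)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

print(gini_impurity([0, 0, 1, 1]))  # 0.5: maximally mixed for two classes
print(gini_impurity([0, 0, 0, 0]))  # 0.0: a pure node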


3. Fundamental Concepts
Before diving into the implementation, let's briefly discuss some key concepts:
I. Node
A single point of decision in the tree, which consists of a condition on a feature.
II. Root Node
The topmost node in the tree, where the first split is made.
III. Leaf Node
The terminal nodes that provide the final classification.
IV. Entropy
A measure of impurity or randomness in the dataset. It is used to determine the quality of a split. The formula for entropy is:

    Entropy(S) = - Σ_i p_i · log2(p_i)

where p_i is the proportion of class i in the dataset S. For example, an evenly mixed two-class set has an entropy of 1 bit, while a pure set has an entropy of 0.


V. Information Gain
The reduction in entropy achieved by partitioning the data based on a feature A. It helps in selecting the best feature for splitting:

    Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where S_v is the subset of S for which feature A has value v. For example, a feature that splits S = {0, 0, 1, 1} perfectly into {0, 0} and {1, 1} yields a gain of 1 - 0 = 1 bit.


4. The Structure of a Decision Tree Classifier


[Figure: step-by-step structure of the decision tree classifier, showing the flow of data and logic and highlighting key functions, classes, and decision points.]

5. Implementation in Python
Let's implement a simple Decision Tree in Python to classify data. We'll use NumPy for numerical computations and collections.Counter for counting class labels.

Step 1: Importing Required Libraries


We will start by importing necessary libraries like NumPy for numerical operations.

import numpy as np
from collections import Counter

Step 2: Define Functions for Entropy and Information Gain

Entropy Calculation

The entropy function calculates the entropy of a given set of class labels. It counts the occurrences of each class,
computes the probabilities, and sums up the contributions of each class to the entropy using the formula.

def entropy(y):
    # Calculate the entropy of label array y
    class_counts = Counter(y)
    total_instances = len(y)
    entropy_value = 0.0

    for count in class_counts.values():
        probability = count / total_instances
        entropy_value -= probability * np.log2(probability)

    return entropy_value
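As a quick sanity check (illustrative values):

print(entropy(np.array([0, 0, 1, 1])))  # 1.0: maximally mixed
print(entropy(np.array([0, 0, 0, 0])))  # 0.0: a pure set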

Information Gain Calculation

The information_gain function computes the reduction in entropy achieved by splitting the dataset based on a given
feature. It calculates the total entropy and subtracts the weighted entropy of the subsets created by the split.

def information_gain(X, y, feature_index):
    # Calculate the information gain of a feature
    total_entropy = entropy(y)

    # Get the unique values of the feature
    unique_values = np.unique(X[:, feature_index])
    weighted_entropy = 0.0

    for value in unique_values:
        subset_indices = X[:, feature_index] == value
        subset_y = y[subset_indices]
        weighted_entropy += (len(subset_y) / len(y)) * entropy(subset_y)

    return total_entropy - weighted_entropy
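For example, on a tiny dataset where the first feature determines the label exactly (X_demo and y_demo are illustrative values):

X_demo = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y_demo = np.array([0, 0, 1, 1])

print(information_gain(X_demo, y_demo, 0))  # 1.0: feature 0 separates the classes perfectly
print(information_gain(X_demo, y_demo, 1))  # 0.0: feature 1 carries no information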


Step 3: Define the Decision Tree Node Class


The DecisionTreeNode class represents a node in the decision tree. It can be an internal node with a feature index
and threshold or a leaf node with a class value.

class DecisionTreeNode:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        # Initialize the node; internal nodes carry a split rule, leaf nodes carry a class value
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value
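For instance, a leaf that always predicts class 1, and an internal node that routes instances by comparing feature 0 against the value 0, could be built by hand like this (illustrative only; the classifier below constructs nodes automatically):

leaf = DecisionTreeNode(value=1)
node = DecisionTreeNode(feature_index=0, threshold=0,
                        left=leaf, right=DecisionTreeNode(value=0))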

Step 4: Define the Decision Tree Classifier Class
The DecisionTreeClassifier class handles the tree's growth and predictions. The fit method grows the tree
recursively using the _grow_tree method, which finds the best split and creates child nodes. The predict method
traverses the tree to predict the class label for each instance.

class DecisionTreeClassifier:
    def __init__(self, max_depth=None):
        # Initialize the classifier
        self.max_depth = max_depth
        self.root = None

    def fit(self, X, y):
        # Fit the classifier to the data
        self.root = self._grow_tree(X, y, depth=0)

    def _grow_tree(self, X, y, depth):
        # Grow the tree recursively
        num_samples, num_features = X.shape
        num_classes = len(set(y))

        # Stopping criteria: depth limit reached (if one is set),
        # pure node, or too few samples to split
        if ((self.max_depth is not None and depth >= self.max_depth)
                or num_classes == 1 or num_samples <= 1):
            leaf_value = self._most_common_class(y)
            return DecisionTreeNode(value=leaf_value)

        # Find the best split
        best_gain = -1
        split_index = None
        split_threshold = None

        for feature_index in range(num_features):
            gain = information_gain(X, y, feature_index)
            if gain > best_gain:
                best_gain = gain
                split_index = feature_index
                split_threshold = np.unique(X[:, feature_index])[0]  # Simplified threshold

        # Split on equality with the chosen value
        left_indices = X[:, split_index] == split_threshold
        right_indices = ~left_indices

        # If the split would leave one side empty, stop and create a leaf
        if not np.any(left_indices) or not np.any(right_indices):
            return DecisionTreeNode(value=self._most_common_class(y))

        # Grow the children recursively
        left_child = self._grow_tree(X[left_indices], y[left_indices], depth + 1)
        right_child = self._grow_tree(X[right_indices], y[right_indices], depth + 1)

        return DecisionTreeNode(feature_index=split_index, threshold=split_threshold,
                                left=left_child, right=right_child)

    def _most_common_class(self, y):
        # Find the most common class in the label array
        counter = Counter(y)
        most_common = counter.most_common(1)[0][0]
        return most_common

    def predict(self, X):
        # Predict the class labels for the input data
        return [self._predict(inputs) for inputs in X]

    def _predict(self, inputs):
        # Predict the class label for a single instance by walking the tree
        node = self.root
        while node.value is None:
            if inputs[node.feature_index] == node.threshold:
                node = node.left
            else:
                node = node.right
        return node.value


Step 5: Testing the Implementation


Let's test our implementation on a small XOR-style dataset. The classifier is trained and predictions are made to verify the correctness of the model.

# Sample dataset
X = np.array([
[0, 0],
[0, 1],
[1, 0],
[1, 1]
])

y = np.array([0, 1, 1, 0])

# Initialize and fit the classifier


tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# Make predictions
predictions = tree.predict(X)
print(f"Predictions: {predictions}")  # Should recover the training labels: [0, 1, 1, 0]

6. Conclusion
In this tutorial, we built a decision tree classifier from scratch in Python, covering the algorithm's logic and recursive tree construction. We implemented the decision tree algorithm step-by-step and provided sample code with detailed explanations. This implementation serves as a foundational understanding of how decision trees work and can be a stepping stone to using more advanced machine learning libraries.


Constructive comments and feedback are welcome.
