Machine Learning [102045609] 12102080501071

Practical 5
AIM: Write a program to demonstrate the working of the decision tree based ID3
algorithm. Use an appropriate data set for building the decision tree and apply this
knowledge to classify a new sample.
THEORY:

• What is a Decision Tree:


- A Decision Tree is a popular machine learning algorithm used for both
classification and regression tasks. It is a tree-like structure that represents a
series of decisions and their possible outcomes. Each internal node of the tree
corresponds to a feature or attribute, each branch represents a decision based on
that attribute, and each leaf node represents the final outcome or class label.
Decision Trees are interpretable and easy to understand, making them useful for
both analysis and prediction.
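
As a quick illustration of classifying a new sample with a decision tree (separate from the ID3 program in the CODE section), the following minimal sketch uses scikit-learn's DecisionTreeClassifier. The feature values and labels here are made up for the example, and scikit-learn's learner is an optimised CART-style algorithm rather than ID3 itself.

from sklearn.tree import DecisionTreeClassifier

# Toy data: two numeric features per sample, with made-up values.
X = [[25, 70], [30, 90], [18, 60], [28, 85]]
y = ["Yes", "No", "Yes", "No"]

clf = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits, as in ID3
clf.fit(X, y)
print(clf.predict([[26, 75]]))  # classify a new sample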

• What is the ID3 Algorithm:


- The ID3 (Iterative Dichotomiser 3) algorithm is one of the earliest and most
widely used algorithms to create Decision Trees from a given dataset. It uses the
concept of entropy and information gain to select the best attribute for splitting
the data at each node. Entropy measures the uncertainty or randomness in the
data, and information gain quantifies the reduction in uncertainty achieved by
splitting the data on a particular attribute. The ID3 algorithm recursively splits the
dataset based on the attributes with the highest information gain until a stopping
criterion is met, resulting in a Decision Tree that can be used for classification
tasks.

• Steps to Create a Decision Tree using the ID3 Algorithm:


- Step 1: Data Preprocessing:
Clean and preprocess the data. Handle missing values and convert categorical
variables into numerical representations if needed.

- Step 2: Selecting the Root Node:


Calculate the entropy of the target variable (class labels) based on the dataset.
The formula for entropy is:
Entropy(S) = -Σ (p_i * log2(p_i))
where p_i is the proportion of instances belonging to class i.

- Step 3: Calculating Information Gain:


For each attribute in the dataset, calculate the information gain when the dataset
is split on that attribute. The formula for information gain is:
Information Gain(S, A) = Entropy(S) - Σ ((|S_v| / |S|) * Entropy(S_v))

where S_v is the subset of instances for each possible value of attribute A, and
|S_v| is the number of instances in that subset. A short worked sketch of these
calculations follows Step 6.

- Step 4: Selecting the Best Attribute:


Choose the attribute with the highest information gain as the decision node for
the tree.

- Step 5: Splitting the Dataset:


Split the dataset based on the values of the selected attribute.

- Step 6: Repeat the Process:


Recursively repeat steps 2 to 5 for each subset until a stopping criterion is met
(e.g., the tree depth reaches a maximum limit or all instances in a subset belong
to the same class).
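
Before the full program, here is the short worked sketch referred to in Step 3. It uses plain Python and the class distribution of the standard play-tennis data (9 "Yes" and 5 "No" examples, with assumed per-value counts for the Outlook attribute) to show how entropy and information gain are computed and why the attribute with the highest gain becomes the decision node.

import math

def entropy(class_counts):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(entropy([9, 5]))   # overall entropy of the 14 examples, about 0.940

# Assumed (Yes, No) counts for each value of Outlook in the play-tennis data:
# Sunny = (2, 3), Overcast = (4, 0), Rain = (3, 2).
subsets = [(2, 3), (4, 0), (3, 2)]
total = sum(sum(s) for s in subsets)
gain = entropy([9, 5]) - sum((sum(s) / total) * entropy(s) for s in subsets)
print(gain)              # about 0.246; the attribute with the highest gain is chosen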

CODE:

import numpy as np
import math
import csv

def read_data(filename):
    # Read the CSV file: the header row gives the attribute names and the
    # remaining rows are the training examples.
    with open(filename, 'r') as csvfile:
        datareader = csv.reader(csvfile, delimiter=',')
        headers = next(datareader)
        metadata = []
        traindata = []
        for name in headers:
            metadata.append(name)
        for row in datareader:
            traindata.append(row)

    return (metadata, traindata)

class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []
        self.answer = ""

    def __str__(self):
        return self.attribute


def subtables(data, col, delete):
    # Split the data into sub-tables, one for each distinct value of the
    # attribute in column `col`; optionally drop that column afterwards.
    dict = {}
    items = np.unique(data[:, col])
    count = np.zeros((items.shape[0], 1), dtype=np.int32)

    for x in range(items.shape[0]):
        for y in range(data.shape[0]):
            if data[y, col] == items[x]:
                count[x] += 1

    for x in range(items.shape[0]):
        # "U32" keeps the values as ordinary strings (the "|S32" variant of
        # this program produces byte strings in the printed tree).
        dict[items[x]] = np.empty((int(count[x]), data.shape[1]), dtype="U32")
        pos = 0
        for y in range(data.shape[0]):
            if data[y, col] == items[x]:
                dict[items[x]][pos] = data[y]
                pos += 1
        if delete:
            dict[items[x]] = np.delete(dict[items[x]], col, 1)

    return items, dict


def entropy(S):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    items = np.unique(S)

    if items.size == 1:
        return 0

    counts = np.zeros((items.shape[0], 1))
    sums = 0

    for x in range(items.shape[0]):
        counts[x] = sum(S == items[x]) / (S.size * 1.0)

    for count in counts:
        p = float(count)
        sums += -1 * p * math.log(p, 2)

    return sums

def gain_ratio(data, col):
    # Note: this computes the gain ratio (information gain divided by the
    # intrinsic value of the split, as in C4.5) rather than plain
    # information gain; attribute selection proceeds the same way.
    items, dict = subtables(data, col, delete=False)
    total_size = data.shape[0]
    entropies = np.zeros((items.shape[0], 1))
    intrinsic = np.zeros((items.shape[0], 1))

    for x in range(items.shape[0]):
        ratio = dict[items[x]].shape[0] / (total_size * 1.0)
        entropies[x] = ratio * entropy(dict[items[x]][:, -1])
        intrinsic[x] = ratio * math.log(ratio, 2)

    total_entropy = entropy(data[:, -1])
    iv = -1 * sum(intrinsic)

    for x in range(entropies.shape[0]):
        total_entropy -= entropies[x]

    return total_entropy / iv


def create_node(data, metadata):
    # If every example has the same class label, return a leaf node.
    if (np.unique(data[:, -1])).shape[0] == 1:
        node = Node("")
        node.answer = np.unique(data[:, -1])[0]
        return node

    # Score every remaining attribute and split on the best one.
    gains = np.zeros((data.shape[1] - 1, 1))

    for col in range(data.shape[1] - 1):
        gains[col] = gain_ratio(data, col)

    split = np.argmax(gains)

    node = Node(metadata[split])
    metadata = np.delete(metadata, split, 0)

    items, dict = subtables(data, split, delete=True)

    # Recursively build a subtree for each value of the chosen attribute.
    for x in range(items.shape[0]):
        child = create_node(dict[items[x]], metadata)
        node.children.append((items[x], child))

    return node

def empty(size):
    s = ""
    for x in range(size):
        s += " "
    return s

def print_tree(node, level):
    if node.answer != "":
        print(empty(level), node.answer)
        return
    print(empty(level), node.attribute)
    for value, n in node.children:
        print(empty(level + 1), value)
        print_tree(n, level + 2)

metadata, traindata = read_data("tennisdata.csv")


data = np.array(traindata)
node = create_node(data, metadata)
print_tree(node, 0)
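
The AIM also asks for classifying a new sample, but the program above only prints the tree. The following is a minimal sketch of how the learned tree could be used for prediction, assuming the Node structure built by create_node above; the attribute names and values in new_sample are hypothetical and must match the column headers in tennisdata.csv.

def classify(node, sample):
    # A leaf node stores the class label in `answer`.
    if node.answer != "":
        return node.answer
    # Otherwise follow the branch matching the sample's value for this attribute.
    value = sample[node.attribute]
    for branch_value, child in node.children:
        if branch_value == value:
            return classify(child, sample)
    return None  # value never seen for this attribute during training

# Hypothetical new sample (keys must match the column names in tennisdata.csv).
new_sample = {"Outlook": "Sunny", "Temperature": "Cool",
              "Humidity": "Normal", "Windy": "False"}
print("Predicted class:", classify(node, new_sample))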


OUTPUT:
