Machine Learning [102045609] 12102080501071

Practical 5
AIM: Write a program to demonstrate the working of the decision tree based ID3
algorithm. Use an appropriate data set for building the decision tree and apply this
knowledge to classify a new sample.
THEORY:

• What is a Decision Tree:


- A Decision Tree is a popular machine learning algorithm used for both
classification and regression tasks. It is a tree-like structure that represents a
series of decisions and their possible outcomes. Each internal node of the tree
corresponds to a feature or attribute, each branch represents a decision based on
that attribute, and each leaf node represents the final outcome or class label.
Decision Trees are interpretable and easy to understand, making them useful for
both analysis and prediction.
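
As a quick illustration of classifying a new sample with a decision tree (separate from the ID3 program in the CODE section), the following minimal sketch uses scikit-learn's DecisionTreeClassifier. The feature values and labels here are made up for the example, and scikit-learn's learner is an optimised CART-style algorithm rather than ID3 itself.

from sklearn.tree import DecisionTreeClassifier

# Toy data: two numeric features per sample, with made-up values.
X = [[25, 70], [30, 90], [18, 60], [28, 85]]
y = ["Yes", "No", "Yes", "No"]

clf = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits, as in ID3
clf.fit(X, y)
print(clf.predict([[26, 75]]))  # classify a new sample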

• What is the ID3 Algorithm:


- The ID3 (Iterative Dichotomiser 3) algorithm is one of the earliest and most
widely used algorithms to create Decision Trees from a given dataset. It uses the
concept of entropy and information gain to select the best attribute for splitting
the data at each node. Entropy measures the uncertainty or randomness in the
data, and information gain quantifies the reduction in uncertainty achieved by
splitting the data on a particular attribute. The ID3 algorithm recursively splits the
dataset based on the attributes with the highest information gain until a stopping
criterion is met, resulting in a Decision Tree that can be used for classification
tasks.

• Steps to Create a Decision Tree using the ID3 Algorithm:


- Step 1: Data Preprocessing:
Clean and preprocess the data. Handle missing values and convert categorical
variables into numerical representations if needed.

- Step 2: Selecting the Root Node:


Calculate the entropy of the target variable (class labels) based on the dataset.
The formula for entropy is:
Entropy(S) = -Σ (p_i * log2(p_i))
where p_i is the proportion of instances belonging to class i.

- Step 3: Calculating Information Gain:


For each attribute in the dataset, calculate the information gain when the dataset
is split on that attribute. The formula for information gain is:
Information Gain(S, A) = Entropy(S) - Σ ((|S_v| / |S|) * Entropy(S_v))

where S_v is the subset of instances for each possible value of attribute A, and
|S_v| is the number of instances in that subset. A short worked sketch of these
calculations follows Step 6.

- Step 4: Selecting the Best Attribute:


Choose the attribute with the highest information gain as the decision node for
the tree.

- Step 5: Splitting the Dataset:


Split the dataset based on the values of the selected attribute.

- Step 6: Repeat the Process:


Recursively repeat steps 2 to 5 for each subset until a stopping criterion is met
(e.g., the tree depth reaches a maximum limit or all instances in a subset belong
to the same class).
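
Before the full program, here is the short worked sketch referred to in Step 3. It uses plain Python and the class distribution of the standard play-tennis data (9 "Yes" and 5 "No" examples, with assumed per-value counts for the Outlook attribute) to show how entropy and information gain are computed and why the attribute with the highest gain becomes the decision node.

import math

def entropy(class_counts):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(entropy([9, 5]))   # overall entropy of the 14 examples, about 0.940

# Assumed (Yes, No) counts for each value of Outlook in the play-tennis data:
# Sunny = (2, 3), Overcast = (4, 0), Rain = (3, 2).
subsets = [(2, 3), (4, 0), (3, 2)]
total = sum(sum(s) for s in subsets)
gain = entropy([9, 5]) - sum((sum(s) / total) * entropy(s) for s in subsets)
print(gain)              # about 0.246; the attribute with the highest gain is chosen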

CODE:

import numpy as np
import math
import csv

def read_data(filename):
    # Read the CSV file: the header row gives the attribute names and the
    # remaining rows are the training examples.
    with open(filename, 'r') as csvfile:
        datareader = csv.reader(csvfile, delimiter=',')
        headers = next(datareader)
        metadata = []
        traindata = []
        for name in headers:
            metadata.append(name)
        for row in datareader:
            traindata.append(row)

    return (metadata, traindata)

class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []
        self.answer = ""

    def __str__(self):
        return self.attribute


def subtables(data, col, delete):
    # Split the data into sub-tables, one for each distinct value of the
    # attribute in column `col`; optionally drop that column afterwards.
    dict = {}
    items = np.unique(data[:, col])
    count = np.zeros((items.shape[0], 1), dtype=np.int32)

    for x in range(items.shape[0]):
        for y in range(data.shape[0]):
            if data[y, col] == items[x]:
                count[x] += 1

    for x in range(items.shape[0]):
        # "U32" keeps the values as ordinary strings (the "|S32" variant of
        # this program produces byte strings in the printed tree).
        dict[items[x]] = np.empty((int(count[x]), data.shape[1]), dtype="U32")
        pos = 0
        for y in range(data.shape[0]):
            if data[y, col] == items[x]:
                dict[items[x]][pos] = data[y]
                pos += 1
        if delete:
            dict[items[x]] = np.delete(dict[items[x]], col, 1)

    return items, dict


def entropy(S):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    items = np.unique(S)

    if items.size == 1:
        return 0

    counts = np.zeros((items.shape[0], 1))
    sums = 0

    for x in range(items.shape[0]):
        counts[x] = sum(S == items[x]) / (S.size * 1.0)

    for count in counts:
        p = float(count)
        sums += -1 * p * math.log(p, 2)

    return sums

def gain_ratio(data, col):
    # Note: this computes the gain ratio (information gain divided by the
    # intrinsic value of the split, as in C4.5) rather than plain
    # information gain; attribute selection proceeds the same way.
    items, dict = subtables(data, col, delete=False)
    total_size = data.shape[0]
    entropies = np.zeros((items.shape[0], 1))
    intrinsic = np.zeros((items.shape[0], 1))

    for x in range(items.shape[0]):
        ratio = dict[items[x]].shape[0] / (total_size * 1.0)
        entropies[x] = ratio * entropy(dict[items[x]][:, -1])
        intrinsic[x] = ratio * math.log(ratio, 2)

    total_entropy = entropy(data[:, -1])
    iv = -1 * sum(intrinsic)

    for x in range(entropies.shape[0]):
        total_entropy -= entropies[x]

    return total_entropy / iv


def create_node(data, metadata):
    # If every example has the same class label, return a leaf node.
    if (np.unique(data[:, -1])).shape[0] == 1:
        node = Node("")
        node.answer = np.unique(data[:, -1])[0]
        return node

    # Score every remaining attribute and split on the best one.
    gains = np.zeros((data.shape[1] - 1, 1))

    for col in range(data.shape[1] - 1):
        gains[col] = gain_ratio(data, col)

    split = np.argmax(gains)

    node = Node(metadata[split])
    metadata = np.delete(metadata, split, 0)

    items, dict = subtables(data, split, delete=True)

    # Recursively build a subtree for each value of the chosen attribute.
    for x in range(items.shape[0]):
        child = create_node(dict[items[x]], metadata)
        node.children.append((items[x], child))

    return node

def empty(size):
    s = ""
    for x in range(size):
        s += " "
    return s

def print_tree(node, level):
    if node.answer != "":
        print(empty(level), node.answer)
        return
    print(empty(level), node.attribute)
    for value, n in node.children:
        print(empty(level + 1), value)
        print_tree(n, level + 2)

metadata, traindata = read_data("tennisdata.csv")


data = np.array(traindata)
node = create_node(data, metadata)
print_tree(node, 0)
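
The AIM also asks for classifying a new sample, but the program above only prints the tree. The following is a minimal sketch of how the learned tree could be used for prediction, assuming the Node structure built by create_node above; the attribute names and values in new_sample are hypothetical and must match the column headers in tennisdata.csv.

def classify(node, sample):
    # A leaf node stores the class label in `answer`.
    if node.answer != "":
        return node.answer
    # Otherwise follow the branch matching the sample's value for this attribute.
    value = sample[node.attribute]
    for branch_value, child in node.children:
        if branch_value == value:
            return classify(child, sample)
    return None  # value never seen for this attribute during training

# Hypothetical new sample (keys must match the column names in tennisdata.csv).
new_sample = {"Outlook": "Sunny", "Temperature": "Cool",
              "Humidity": "Normal", "Windy": "False"}
print("Predicted class:", classify(node, new_sample))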


OUTPUT:
