Machine Learning Lab: Delhi Technological University
EXPERIMENT: 3
import numpy as np
import pandas as pd

eps = np.finfo(float).eps  # small constant that keeps log2(0) out of the entropy sums

def find_entropy_attribute(df, attribute):
    Class = df.columns[-1]  # target variable name, kept generic
    target_variables = df[Class].unique()  # all class labels, e.g. 'Yes' and 'No'
    variables = df[attribute].unique()  # values of the attribute, e.g. 'Hot', 'Cold' in Temperature
    entropy2 = 0
    for variable in variables:
        entropy = 0
        den = len(df[df[attribute] == variable])  # rows taking this attribute value
        for target_variable in target_variables:
            num = len(df[(df[attribute] == variable) & (df[Class] == target_variable)])
            fraction = num / (den + eps)
            entropy += -fraction * np.log2(fraction + eps)  # eps guards log2(0) when num is 0
        fraction2 = den / len(df)
        entropy2 += fraction2 * entropy  # weight each value's entropy by its frequency
    return abs(entropy2)
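find_winner below also calls a find_entropy helper that is not part of this listing. A minimal sketch, assuming the standard Shannon entropy of the target column:

def find_entropy(df):
    Class = df.columns[-1]  # target variable name
    entropy = 0
    for value in df[Class].unique():
        fraction = df[Class].value_counts()[value] / len(df[Class])
        entropy += -fraction * np.log2(fraction)  # fraction is never 0 for observed values
    return entropy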
def find_winner(df):
    # Information gain of each candidate attribute: entropy of the whole
    # dataset minus the weighted entropy after splitting on the attribute.
    IG = []
    for key in df.columns[:-1]:
        IG.append(find_entropy(df) - find_entropy_attribute(df, key))
    return df.columns[:-1][np.argmax(IG)]  # attribute with the highest information gain
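buildTree below depends on a get_subtable helper that is also missing from this listing. A minimal sketch, assuming it simply selects the rows where the chosen attribute takes one value:

def get_subtable(df, node, value):
    # Rows where the split attribute equals the given value, reindexed
    # so the recursive calls below work on a clean frame.
    return df[df[node] == value].reset_index(drop=True)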
def buildTree(df, tree=None):
    Class = df.columns[-1]  # target variable name, kept generic
    node = find_winner(df)  # attribute with maximum information gain becomes the split node
    # Get the distinct values of that attribute, e.g. Salary is the node and Low, Med, High are values
    attValue = np.unique(df[node])
    if tree is None:  # first call: create the root dictionary
        tree = {}
        tree[node] = {}
    for value in attValue:
        subtable = get_subtable(df, node, value)
        clValue, counts = np.unique(subtable[Class], return_counts=True)
        if len(counts) == 1:  # pure subset: store the class label as a leaf
            tree[node][value] = clValue[0]
        else:
            tree[node][value] = buildTree(subtable)  # calling the function recursively
    return tree
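A short usage sketch on a toy play-tennis style table; the column names and rows below are illustrative stand-ins, not the experiment's actual dataset:

data = {
    'Outlook':     ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Overcast', 'Sunny'],
    'Temperature': ['Hot',   'Hot',   'Hot',      'Mild',  'Cool',  'Cool',     'Mild'],
    'Play':        ['No',    'No',    'Yes',      'Yes',   'No',    'Yes',      'Yes'],
}
df = pd.DataFrame(data)
tree = buildTree(df)
print(tree)  # nested dictionary: inner keys are attributes and their values, leaves are class labels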
OUTPUT
DISCUSSION
A decision tree is a map of the possible outcomes of a series of related choices. It allows an individual or organization to weigh possible actions against one another based on their costs, probabilities, and benefits. Decision trees can be used either to drive informal discussion or to map out an algorithm that mathematically predicts the best choice.
Using decision trees in machine learning has several advantages:
The cost of using the tree to make a prediction is logarithmic in the number of data points used to train it.
Trees work with both categorical and numerical data.
Trees can model problems with multiple outputs.
But they also have a few disadvantages:
When dealing with categorical data with many levels, information gain is biased in favor of the attributes with the most levels, as the sketch after this list illustrates.
Calculations can become complex when dealing with uncertainty and many linked outcomes.
Conjunctions between nodes are limited to AND, whereas decision graphs also allow nodes linked by OR.
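That bias is easy to demonstrate with the functions above: a row-identifier column that is unique per row drives the post-split entropy to zero, so it always wins the information-gain comparison. The RowID column below is purely illustrative:

df_biased = df.copy()
df_biased.insert(0, 'RowID', range(len(df_biased)))  # one distinct value per row
print(find_winner(df_biased))  # prints 'RowID': every single-row split is trivially pure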