MUTHAYAMMAL ENGINEERING COLLEGE
(Approved by AICTE, New Delhi, Accredited by NAAC & Affiliated to Anna University)
Rasipuram - 637 408, Namakkal Dist., Tamil Nadu

MKC - MUST KNOW CONCEPTS
CSE, 2020-2021
SUBJECT: 16CSE14 / MACHINE LEARNING TECHNIQUES
Format: S.No. | Term (Symbol) | Concept / Definition / Meaning / Units / Equation / Expression

UNIT I - INTRODUCTION TO SUPERVISED LEARNING

1. Machine Learning: An application of AI concerned with programming systems to automatically learn and improve with experience, without being explicitly programmed. Eg: robots.
2. Types of machine learning: Supervised learning, unsupervised learning & reinforcement learning.
3. Supervised Learning: Learns from training data and predicts the output for new inputs.
4. Unsupervised Learning: Predicts output from hidden patterns in the data, without any externally supplied training labels.
5. Reinforcement Learning: The learner is a decision-making agent that takes actions in an environment and receives a reward (or penalty) for its actions while trying to solve a problem.
6. Types of supervised learning: Classification & regression.
7. Classification: A supervised learning technique for categorizing data into a desired and distinct number of classes. Example: male and female.
8. Regression: A problem in which the output variable is a real or continuous value, such as "salary" or "weight".
9. Examples of classification: Pattern recognition, optical character recognition, face recognition, medical diagnosis, speech recognition & biometrics, etc.
10. Noise: Machine learning techniques often have to deal with noisy data, which may affect the accuracy of the resulting models. Effectively dealing with noise is therefore a key aspect of supervised learning when building reliable models from data.
11. Multi-class classification in supervised learning: A classification task that involves more than two classes.
12. Model Selection: The process of selecting one final machine learning model from among a collection of candidate models for a training dataset. It can be applied across different types of models (e.g. logistic regression, SVM, KNN, etc.).
13. Generalization: A model's ability to adapt properly to new, previously unseen data drawn from the same distribution as the data used to create the model.
14. Bayesian machine learning: We can think of machine learning as learning models of data. The Bayesian framework states that you start by enumerating all reasonable models of the data and assigning your prior belief P(M) to each of them.
15. Decision trees in machine learning: A non-parametric supervised learning method used for both classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
16. Bayesian decision theory: A decision theory informed by Bayesian probability. It is a statistical system that tries to quantify the trade-off between various decisions, making use of probabilities and costs.
17. Overfitting: A model that fits the training data too closely, capturing its noise, and therefore produces inaccurate output on new data.
18. Underfitting: A model trained on too little data, or too simple a model, that cannot be used to generalize to new data.
19. Posterior probability: posterior = prior × likelihood / evidence. (A short numeric sketch follows this list.)
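A minimal Python sketch of the rule in entry 19, assuming made-up priors and likelihoods for a two-class spam/ham example (the numbers are illustrative, not from the course material):

# posterior = prior * likelihood / evidence, for two classes
priors = {"spam": 0.4, "ham": 0.6}        # P(class)
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(x | class) for one observation x

# evidence P(x) = sum over classes of P(x | class) * P(class)
evidence = sum(likelihoods[c] * priors[c] for c in priors)

for c in priors:
    posterior = likelihoods[c] * priors[c] / evidence
    print(c, round(posterior, 3))   # spam 0.842, ham 0.158

Note that the posteriors sum to 1 by construction, since the evidence normalizes the products.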
20. Bias: The difference between a model's predicted output and the actual value.
21. Variance: Measures how far the values of a variable spread around their mean; in learning, how much a model's predictions change across different training sets.
22. Covariance: Measures the linear relationship between two variables (attributes), i.e. the degree to which they vary together.
23. Ill-posed problem: A problem where the data by itself is not sufficient to find a unique solution.
24. Generalization: A model trained on the training set that predicts the right output for new instances is said to generalize.
25. Validation set: A held-out set used to test the generalization ability of a model.

UNIT II - PARAMETRIC AND SEMI-PARAMETRIC METHODS

26. Parametric classification: Parametric methods, like discriminant analysis classification, fit a parametric model to the training data and interpolate to classify test data. Nonparametric methods, like classification and regression trees, use other means to determine classifications.
27. Parametric regression: For example, polynomial regression consists of performing multiple regression with polynomial terms of the variables in order to find the polynomial coefficients (parameters). These types of regression are known as parametric regression since they are based on models that require the estimation of a finite number of parameters.
28. Model complexity: Can be characterized by many things and is a bit subjective. In machine learning it often refers to the number of features or terms included in a given predictive model, as well as whether the chosen model is linear, nonlinear, and so on.
29. Parametric models: Models that are well-defined in a finite-dimensional parameter space.
30. Non-parametric models: Models whose parameters can span an infinite space. A semi-parametric model has one component that is finite-dimensional (i.e. easy to research and understand) and another that is infinite-dimensional.
31. Model selection: The second step of the machine learning process, following variable selection and data cleansing. Selecting the right machine learning model is a critical step, as a model which does not appropriately fit the data will yield inaccurate results.
32. Parameter estimates: Also called coefficients; the change in the response associated with a one-unit change of a predictor, all other predictors being held constant.
33. Multivariate Regression: A method used to measure the degree to which more than one independent variable (predictors) and more than one dependent variable (responses) are linearly related.
34. Binary classification: The task of classifying the elements of a set into two groups on the basis of a classification rule.
35. Clustering: The task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other and dissimilar to the data points in other groups.
36. Types of clustering: Hierarchical clustering, K-means clustering.
37. K-means algorithm: If k is given, the K-means algorithm can be executed in the following steps: partition the objects into k non-empty subsets; ...; compute the distances from each point and allot each point to the cluster whose centroid is nearest. (A minimal sketch follows this list.)
38. Hierarchical Clustering: Minimum-distance clustering is also called single-linkage hierarchical clustering or nearest-neighbor clustering.
39. Maximum likelihood estimation: A method that determines values for the parameters of a model.
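A minimal NumPy sketch of the K-means steps in entry 37, assuming Euclidean distance and that no cluster becomes empty; the two-blob data in the usage lines is illustrative:

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Assign each point to its nearest centroid, then recompute
    centroids as cluster means; repeat until assignments stabilize.
    Assumes no cluster becomes empty along the way."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distances from every point to every centroid, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Usage on two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids.round(1))   # roughly the two blob centers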
40. Bernoulli density: Describes a single trial of a Bernoulli experiment. A closed form of the probability density function of the Bernoulli distribution is P(x) = p^x (1 - p)^(1 - x), x ∈ {0, 1}.
41. Prior distribution: Summarizes what you believe about the parameters before the data are observed. Combined with the likelihood function, which tells you what information is contained in your observed data (the "new evidence"), it yields the posterior.
42. Posterior distribution: Summarizes what you know after the data has been observed.
43. Independent variables: Also referred to as features; the inputs to a process that is being analyzed.
44. Dependent variables: The outputs of the process.
45. Least squares method: A statistical procedure to find the best fit for a set of data points by minimizing the sum of the offsets or residuals of the points from the fitted curve.
46. Least squares regression: Used to predict the behavior of dependent variables.
47. Polynomial Regression: A form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial.
48. Relative Squared Error: The error relative to what it would have been if a simple predictor had been used; specifically, this simple predictor is just the average of the actual values.
49. Cross-validation: A resampling procedure used to evaluate machine learning models on a limited data sample.
50. Regularization: The process of adding information in order to solve an ill-posed problem or to prevent overfitting.

UNIT III - ARTIFICIAL NEURAL NETWORKS

51. Artificial neuron: A mathematical function conceived as a model of biological neurons in a neural network. Usually each input is separately weighted, and the sum is passed through a non-linear function known as an activation function or transfer function.
52. Neural network learning: An artificial neural network learning algorithm (neural network, or just neural net) is a computational learning system that uses a network of functions to understand and translate a data input of one form into a desired output, usually in another form.
53. Perceptron: An algorithm used for supervised learning of binary classifiers. Binary classifiers decide whether an input, usually represented by a vector, belongs to a specific class.
54. Perceptron Learning Rule: States that the algorithm will automatically learn the optimal weight coefficients. (A minimal training sketch follows this list.)
55. Gradient descent: A first-order iterative optimization algorithm for finding a local minimum of a differentiable function.
56. Delta rule: In machine learning and neural network environments, a specific type of backpropagation that helps refine connectionist ML/AI networks, making connections between inputs and outputs with layers of artificial neurons.
57. Multilayer networks: Solve the classification problem for non-linear sets by employing hidden layers, whose neurons are not directly connected to the output.
58. Backpropagation algorithm: Looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent.
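A minimal sketch of the perceptron learning rule (entries 53-54), assuming a step activation and 0/1 labels; the AND-gate data is an illustrative, linearly separable toy problem:

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """When a sample is misclassified, nudge the weights toward
    (or away from) it: w <- w + lr * (y - pred) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = int(w @ xi + b > 0)   # step activation
            err = yi - pred              # -1, 0, or +1
            w += lr * err * xi
            b += lr * err
    return w, b

# Usage: learn the AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([int(w @ xi + b > 0) for xi in X])   # expect [0, 0, 0, 1]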
59. Gradient descent: An optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model. (A one-variable sketch follows this list.)
60. Multilayer networks: Solve the classification problem for non-linear sets by employing hidden layers, whose neurons are not directly connected to the output.
61. Multilayer perceptron: A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). An MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and non-linear activation distinguish an MLP from a linear perceptron; it can distinguish data that is not linearly separable.
62. Activation function: Also helps the perceptron to learn when it is part of a multilayer perceptron (MLP). Certain properties of the activation function, especially its non-linear nature, make it possible to train complex neural networks.
63. Representation of Neural Networks: The connections between the different neurons are represented by the edges connecting two nodes in the graph representation of the artificial neural network. They are called weights and are typically denoted w_ij. The weights of a neural network are a particular case of the parameters of any parametric model.
64. Threshold unit: A linear threshold unit (LTU) is a simple artificial neuron whose output is its thresholded total net input. That is, an LTU with threshold T calculates the weighted sum of its inputs, then outputs 0 if this sum is less than T and 1 if the sum is greater than T.
65. Need for Backpropagation: Backpropagation simplifies the network structure by removing weighted links that have a minimal effect on the trained network. It is especially useful for deep neural networks working on error-prone projects, such as image or speech recognition.
66. Difference between Cost and Loss function: The terms almost refer to the same meaning. The loss function is a value calculated at every instance, so for a single training cycle the loss is calculated numerous times; the cost function is calculated as an average of the loss functions and is only calculated once.
67. Error-Correction Learning: Used with supervised learning, it is the technique of comparing the system output to the desired output value and using that error to direct the training.
68. Difference between neuron and Perceptron: A neuron is (in cytology) a cell of the nervous system which conducts nerve impulses, consisting of an axon and several dendrites; neurons are connected by synapses. A perceptron is an element, analogous to a neuron, of an artificial neural network consisting of one or more layers of artificial neurons.
69. Perceptron algorithms: Can be categorized into single-layer and multi-layer perceptrons. The single-layer type organizes neurons in a single layer, while the multi-layer type arranges neurons in multiple layers.
70. Problem in Neural Network: If you accept that most classes of problems can be reduced to functions, this implies that a neural network can, in principle, be trained to solve them.
71. Advantages of Neural Network: Neural networks have the ability to learn by themselves and produce output that is not limited to the input provided to them. The input is stored in its own networks instead of a database, hence the loss of data does not affect its working.
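A one-variable sketch of gradient descent (entry 59), assuming the illustrative function f(x) = (x - 3)^2 with gradient 2(x - 3) and a hand-picked learning rate:

def gradient_descent(grad, x0, lr=0.1, steps=50):
    """Repeatedly step in the direction of the negative gradient
    to approach a local minimum of a differentiable function."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# f(x) = (x - 3)^2 has its minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))   # close to 3.0

Each step here is x <- x - lr * f'(x); with too large a learning rate the iterates can diverge, which is why lr is a tuning choice.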
72. Applications of Neural Network: Neural networks can be used to recognize handwritten characters. Image compression: neural networks can receive and process vast amounts of information at once, making them useful in image compression.
73. Preventing overfitting in a neural network: Early stopping (a form of regularization while training a model with an iterative method, such as gradient descent); data augmentation; regularization; dropouts.
74. Types of Neural Network: Feedforward neural network (artificial neuron), radial basis function neural network, Kohonen self-organizing neural network, recurrent neural network (RNN) / long short-term memory, convolutional neural network, modular neural network.
75. Recurrent neural networks: RNNs are the state-of-the-art algorithm for sequential data and are used by Apple's Siri and Google's voice search. It is the first algorithm that remembers its input, due to an internal memory, which makes it well suited to machine learning problems that involve sequential data.

UNIT IV - INSTANCE BASED LEARNING

76. Instance-based learning: Refers to a family of techniques for classification and regression which produce a class label/prediction based on the similarity of the query to its nearest neighbor(s) in the training set.
77. Why instance-based learning is sometimes referred to as lazy learning: Instance-based learning includes nearest neighbor, locally weighted regression and case-based reasoning methods. Instance-based methods are called lazy learning methods because they delay processing until a new instance must be classified.
78. Lazy learner technique: A lazy learner simply stores the training data, and only when it sees a test tuple does it start generalizing, classifying the tuple based on its similarity to the stored training tuples.
79. Lazy algorithm: An algorithm that generalizes the data only after a query is made. The best example of this is KNN.
80. Why the KNN algorithm is used: KNN is one of the simplest classification algorithms and one of the most used learning algorithms. KNN is a non-parametric, lazy learning algorithm.
81. Why KNN is a lazy learner: K-NN is a lazy learner because it doesn't learn a discriminative function from the training data but "memorizes" the training dataset instead.
82. What does K mean in kNN: 'k' is a parameter that refers to the number of nearest neighbours to include in the majority voting process.
83. Is K-means supervised or unsupervised: The 'K' in K-means clustering has nothing to do with the 'K' in the KNN algorithm. K-means clustering is an unsupervised learning algorithm used for clustering, whereas KNN is a supervised learning algorithm used for classification.
84. Nearest Neighbour rule: Selects the class for x with the assumption that if x' and x were overlapping (at the same point), they would share the same class.
85. Disadvantage of KNN: The main disadvantage of the KNN algorithm is that it is a lazy learner, i.e. it does not learn anything from the training data and simply uses the training data itself for classification.
86. Advantages of KNN: Very simple implementation. Robust with regard to the search space; for instance, classes don't have to be linearly separable. The classifier can be updated online at very little cost as new instances with known classes are presented.
87. K Nearest Neighbor algorithm in machine learning: The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. It is easy to implement and understand, but has the major drawback of becoming significantly slower as the size of the data in use grows. (A minimal classifier sketch follows this list.)
88. Locally weighted regression: LWR attempts to fit the training data only in a region around the location of a query example. LWR is a type of lazy learning, so the processing of training data is often postponed until the target value of a query example needs to be predicted.
89. Weighted kNN: In weighted kNN, the nearest k points are given a weight using a function called the kernel function.
90. Remarks on locally weighted regression: There is a broad range of methods for distance-weighting the training examples and for locally approximating target functions.
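A minimal sketch of KNN classification (entries 80-87), assuming Euclidean distance and majority voting; the toy clusters in the usage lines are illustrative:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among the k training points
    closest to it (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Usage: two toy clusters with labels 0 and 1
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # expect 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # expect 1

Note that all the work happens at query time (the "lazy" behaviour of entries 79 and 81): training is just storing the data.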
91. Radial basis functions: Means to approximate multivariable (also called multivariate) functions by linear combinations of terms based on a single univariate function (the radial basis function). This is radialised so that it can be used in more than one dimension.
92. Case-based reasoning: CBR is a paradigm of artificial intelligence and cognitive science that models the reasoning process as primarily memory based. Case-based reasoners solve new problems by retrieving stored 'cases' describing similar prior problem-solving episodes and adapting their solutions to fit new needs.
93. Lazy Learning: A lazy learning algorithm is simply an algorithm that generalizes the data after a query is made. The best example of this is KNN.
94. Eager Learning: In artificial intelligence, eager learning is a learning method in which the system tries to construct a general, input-independent target function during training, as opposed to lazy learning, where generalization beyond the training data is delayed until a query is made to the system.
95. Euclidean distance: The Euclidean distance between two points in either the plane or 3-dimensional space measures the length of a segment connecting the two points. It is the most obvious way of representing the distance between two points.
96. Why do we use Euclidean distance: Euclidean distance gives the distance from each cell in the raster to the closest source.
97. Why Euclidean distance is used in KNN: Usually, the Euclidean distance is used as the distance metric. The algorithm then assigns the point to the class most common among its k nearest neighbours (where k is an integer).
98. Radial basis function: An RBF is a function that changes with distance from a location. For example, suppose the radial basis function is simply the distance from each location, so it forms an inverted cone over each location.
99. Gaussian radial basis function: A radial basis function (RBF) is a real-valued function whose value depends only on the distance between the input and some fixed point: either the origin, so that φ(x) = φ(‖x‖), or some other fixed point c, called a center, so that φ(x) = φ(‖x − c‖). (A short sketch follows this list.)
100. Can KNN be used for prediction: The KNN algorithm can be used for both classification and regression problems. It uses 'feature similarity' to predict the values of any new data points; the average of the neighbours' values is taken as the final prediction.

UNIT V - ADVANCED LEARNING

101. Bayesian network: A probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG).
102. Directed Acyclic Graph (DAG): In computer science and mathematics, a DAG is a graph that is directed and without cycles connecting the other edges.
103. Causal graph: A causal graph depicts whatever assumptions you are making about the relationships between the variables.
104. Conditional Independence: Two events A and B are conditionally independent given an event C with P(C) > 0 if P(A ∩ B | C) = P(A | C) P(B | C). Recall that, from the definition of conditional probability, P(A | B) = P(A ∩ B) / P(B), if P(B) > 0.
105. Diagnostic Inference: Diagnostic or bottom-up inference.
106. Probabilistic Database: An uncertain database in which the possible worlds have associated probabilities.
107. Hidden Variables: Confounding, in statistics, is an extraneous variable in a statistical model that correlates (directly or inversely) with both the dependent variable and the independent variable.
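A short sketch of a Gaussian RBF (entry 99), assuming the common form φ(x) = exp(-γ‖x − c‖²); the width parameter γ and the sample points are illustrative choices:

import numpy as np

def gaussian_rbf(x, c, gamma=1.0):
    """Value depends only on the distance between input x and
    center c; larger gamma gives a narrower bump."""
    return np.exp(-gamma * np.linalg.norm(x - c) ** 2)

c = np.array([0.0, 0.0])
print(gaussian_rbf(np.array([0.0, 0.0]), c))  # 1.0 at the center
print(gaussian_rbf(np.array([1.0, 1.0]), c))  # exp(-2) ≈ 0.135, decays with distance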
108. Direct Influence: Direct influence means that we can take specific steps to try to get the thing done.
109. Multinomial Variable: Multinomial logistic regression is used to predict a nominal dependent variable given one or more independent variables.
110. Generative Model: A powerful way of learning any kind of data distribution using unsupervised learning; it has achieved tremendous success in just a few years.
111. Phylogenetic Tree: A diagram that represents evolutionary relationships among organisms. Phylogenetic trees are hypotheses, not definitive facts.
112. Hidden Markov Model (HMM): HMMs have proven to be one of the most widely used tools for learning probabilistic models of time series data.
113. Kalman Filter: A Kalman filter can be applied to take in the GPS data from a car; GPS devices, however, are not always entirely accurate.
114. Bayes' ball: An efficient algorithm for computing d-separation by passing simple messages between nodes of the graph.
115. Junction Trees: The junction tree algorithm (also known as 'clique tree') is a method used in machine learning to perform marginalization in general graphs.
116. Markov random field: A graphical model of a joint probability distribution.
117. Maximal Clique: A clique that cannot be extended by including one more adjacent vertex, meaning it is not a subset of a larger clique.
118. Factor Graph: A type of probabilistic graphical model.
119. Sum-Product Algorithm: Operates in a factor graph and attempts to compute various marginal functions associated with the global function.
120. Max-Product Algorithm: A standard belief propagation algorithm on factor graph models.
121. Decision Node: A node in an activity at which the flow branches into several optional flows.
122. Sensor Fusion: Where the data from different sensors are integrated to extract more information for a specific application.
123. Random Subspace: The random subspace method for constructing decision forests.
124. Error-Correcting Output Codes: ECOC is an ensemble method designed for multi-class classification problems.
125. Bayesian network: A probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG).

GATE QUESTIONS

126. Multiple Expert: Multiple expert classification methods rely on a large training dataset in order to be properly utilized.
127. Ensemble: An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions.
128. Linear Opinion Pools: An important question when eliciting opinions from experts is how to aggregate the reported opinions.
129. Hamming distance: A metric for comparing two binary data strings.
130. Bagging: Used when the goal is to reduce the variance of a decision tree classifier.
131. Boosting: The term refers to a family of algorithms which convert weak learners to strong learners.
132. AdaBoost: An ensemble learning method (also known as "meta-learning") initially created to increase the efficiency of binary classifiers.
133. Decision Stump: A machine learning model consisting of a one-level decision tree. (A minimal sketch follows this list.)
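A minimal sketch of a decision stump (entry 133), the kind of weak learner typically combined by boosting methods such as AdaBoost (entry 132); it assumes ±1 labels and uses a brute-force search over thresholds, with illustrative toy data:

import numpy as np

def fit_stump(X, y):
    """One-level decision tree: find the (feature, threshold,
    polarity) triple that minimizes 0-1 error. Labels are -1/+1."""
    best = (0, 0.0, 1, np.inf)   # feature, threshold, polarity, error
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, f] <= t, -1, 1)
                err = np.mean(pred != y)
                if err < best[3]:
                    best = (f, t, pol, err)
    return best

def stump_predict(stump, X):
    f, t, pol, _ = stump
    return pol * np.where(X[:, f] <= t, -1, 1)

# Usage: 1-D data separable at x = 2.5
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([-1, -1, 1, 1])
stump = fit_stump(X, y)
print(stump_predict(stump, X))   # expect [-1 -1  1  1]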
134. Mixture of experts: A machine learning technique where multiple experts (learners) are used to divide the problem space into homogeneous regions.
135. Dynamic Classifier Selection: Dynamic classifier selection based on multiple classifier ensembles, using accuracy and diversity; it measures accuracy and diversity.
136. Stacked Generalization: A scheme for minimizing the generalization error rate of one or more generalizers.
137. Cascading: A multistage method.
138. Spoofing: The act of disguising a communication from an unknown source as being from a known, trusted source.
139. Multiple kernel learning: MKL algorithms aim to find the best convex combination of a set of kernels to form the best classifier.
140. k-armed bandit: In the classical k-armed bandit problem, there are k alternative arms, each with a stochastic reward whose probability distribution is initially unknown.
141. Markov decision process: An MDP is a discrete-time stochastic control process.
142. Finite-horizon: A stopping rule problem has a finite horizon if there is a known upper bound on the number of stages at which one may stop.
143. Infinite-horizon: Infinite horizon problems are further characterized by the fact that the number of stages N is infinite.
144. Optimal Policy: A policy where you are always choosing the action that maximizes the "return"/"utility" of the current state.
145. Bellman's equation: The Bellman equations are the most important equations in reinforcement learning; an optimal policy and value function must satisfy them.
146. Value iteration: A method of computing an optimal MDP policy and its value. Value iteration starts at the "end" and then works backward, refining an estimate of either Q* or V*. (A small sketch follows this list.)
147. Policy Iteration: You randomly select a policy and find the value function corresponding to it, then find a new policy based on the previous value function, and so on; this leads to an optimal policy.
148. Temporal difference: TD learning is an approach to learning how to predict a quantity that depends on future values of a given signal.
149. Greedy search: A greedy search algorithm uses a heuristic for making locally optimal choices at each stage with the hope of finding a global optimum.
150. Q-learning: An off-policy reinforcement learning algorithm that seeks to find the best action to take given the current state.
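A small sketch of value iteration (entry 146) on a made-up two-state, two-action MDP; the transition table, rewards, and discount factor below are illustrative assumptions, not from the course material:

import numpy as np

# P[s][a] lists (probability, next_state, reward) triples
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9
V = np.zeros(2)
for _ in range(100):                 # repeated Bellman backups
    V = np.array([
        max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s])
        for s in P
    ])
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
print(np.round(V, 2), policy)   # V ≈ [19, 20]; both states prefer action 1

Each iteration applies the Bellman optimality update V(s) <- max_a Σ p(s'|s,a) [r + γ V(s')], which contracts toward V*; the greedy policy with respect to V* is then an optimal policy (entry 144).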
Faculty Team Prepared (Signatures):
1. Dr. G. Kavitha, Prof & Head
2. Dr. N. Naveen Kumar, ASP/CSE