Assignment 6: Comparison of All 4 Methods

Decision Tree (C4.5)

C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. Authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date".

advantages:

•  Builds models that can be easily interpreted
•  Easy to implement
•  Handles both categorical and continuous values
•  Deals with noisy data

disadvantages:

•  Small variations in the data can lead to different decision trees (especially when the variables are close to each other in value)
•  Does not work very well on a small training set
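To make the splitting step concrete, here is a minimal sketch (not Quinlan's full implementation) of the gain-ratio criterion C4.5 uses to choose which attribute to split on, for categorical attributes only:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """C4.5-style gain ratio for splitting on one categorical attribute.

    `values` holds the attribute value of each example, `labels` its class.
    """
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # information gain = entropy before the split - weighted entropy after
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    # split info penalises attributes with many distinct values
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0
```

An attribute that separates the classes perfectly scores 1.0, while one whose values are unrelated to the class scores 0.0; C4.5 splits on the highest-scoring attribute and recurses.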

Naive Bayes

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers.

In the statistics and computer science literature, naive Bayes models are known under a variety of names, including simple Bayes and independence Bayes. All these names reference the use of Bayes' theorem in the classifier's decision rule, but naive Bayes is not (necessarily) a Bayesian method.

advantages:

  •  Easy to implement
  •  Requires a small amount of training data to estimate parameters
  •  Good results obtained in most cases

disadvantages:

  •  Makes a very strong independence assumption
  •  Performs poorly when dependencies exist among variables
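The closed-form training mentioned above is just counting. The sketch below, a simplified categorical naive Bayes with Laplace smoothing (the smoothing denominator here is a rough choice, not the textbook one), shows both training and the decision rule:

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y):
    """Fit a categorical naive Bayes model by counting.

    X is a list of feature tuples, y the matching class labels.
    Returns class priors and per-(class, feature) value counts.
    """
    priors = Counter(y)
    counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for xs, label in zip(X, y):
        for i, v in enumerate(xs):
            counts[(label, i)][v] += 1
    return priors, counts

def predict_nb(priors, counts, xs):
    """Pick the class maximising log P(class) + sum_i log P(x_i | class)."""
    n = sum(priors.values())
    best, best_score = None, float("-inf")
    for label, c in priors.items():
        score = math.log(c / n)
        for i, v in enumerate(xs):
            vc = counts[(label, i)]
            # Laplace smoothing: add 1 so unseen values get nonzero probability
            score += math.log((vc[v] + 1) / (c + len(vc) + 1))
        if score > best_score:
            best, best_score = label, score
    return best
```

Note that the multiplication of per-feature probabilities is exactly where the independence assumption enters, which is why correlated features hurt this model.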
 
K-Nearest Neighbor
 
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression.

advantages:
  •  Robust to noisy training data
  •  Effective if the training data is large

disadvantages:

  •  Need to determine the value of the parameter k
  •  Distance-based learning is not clear-cut: it is not obvious which distance metric to use, or which attributes (all of them or only a subset) produce the best results
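Since k-NN has no training phase beyond storing the data, the whole classifier fits in a few lines. A minimal sketch for numeric features, assuming Euclidean distance and majority voting (one choice among the many metrics noted above):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train_X holds numeric feature vectors, train_y their class labels.
    Uses Euclidean distance via math.dist.
    """
    # sort all training points by distance to the query
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    # majority vote among the k closest labels
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]
```

Choosing k is the parameter problem from the disadvantage list: an odd k avoids ties in two-class problems, and larger k smooths out noise at the cost of blurring class boundaries.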





