Volunteers


Name      | ML Category
Jahanvi   | Supervised
Akanksha  | Unsupervised
Kanak Raj | Reinforcement


Supervised

  • Supervised learning algorithms make predictions based on a set of labeled examples.
  • Classification: When the data are used to predict a categorical variable, supervised learning is also called classification. This is the case when assigning a label or indicator, such as dog or cat, to an image. When there are only two labels, this is called binary classification. When there are more than two categories, the problem is called multi-class classification.
  • Regression: When predicting continuous values, the problem becomes a regression problem.
  • Forecasting: This is the process of making predictions based on past and present data. It is most commonly used to analyze trends. A common example is estimating next year's sales based on the sales of the current and previous years.


Algorithms



Name | Comments on Applicability | Reference
LOGISTIC REGRESSION
  1. Logistic regression is similar to linear regression, but is used when the dependent variable is not numeric but categorical (e.g., a "yes/no" response).
  2. Although it is called regression, it performs classification: the regression output is used to assign the dependent variable to one of the classes.
  3. It is used to predict a binary output.
  4. The logistic function is applied to the regression output to obtain the probability of belonging to each class.
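A minimal sketch of these points, assuming scikit-learn is available (the synthetic dataset and parameters are illustrative, not from the original notes):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data: two classes, four features.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The logistic function turns the linear regression output into
# class-membership probabilities; the prediction is the likelier class.
proba = clf.predict_proba(X[:1])   # probabilities for each of the two classes
pred = clf.predict(X[:1])          # the class with the higher probability
```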

KNN

  1. It is used with data points that are separated into several classes to predict the classification of a new sample point.
  2. It classifies new cases based on a similarity measure (i.e., a distance function).
  3. k-NN works well with a small number of input variables (p), but struggles when the number of inputs is very large.
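As a quick illustration of distance-based classification (scikit-learn and its bundled iris dataset; not part of the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features, 3 classes

# Each new point is assigned the majority class of its 5 nearest
# neighbours under Euclidean distance (the similarity measure).
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
acc = knn.score(X, y)
```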

SUPPORT VECTOR MACHINE
  1. Support vector machines (SVMs) are used for both regression and classification.
  2. They are based on the concept of decision planes that define decision boundaries. A decision plane (hyperplane) separates sets of objects that have different class memberships.
  3. The hyperplane is learned by transforming the problem using linear algebra.
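A hedged sketch of the hyperplane idea on a linearly separable toy dataset (scikit-learn; the blob data is purely illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs, so a single hyperplane can split them.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

svm = SVC(kernel="linear").fit(X, y)

# The learned decision plane is w . x + b = 0.
w = svm.coef_[0]        # normal vector of the hyperplane
b = svm.intercept_[0]   # offset
```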



Kernel SVM
  1. A kernel function is supplied to the SVM algorithm; it transforms the data into a representation in a higher dimension in which it becomes separable.
  2. The kernel trick uses the kernel function to map data into a higher-dimensional feature space, making linear separation for classification possible.

RBF Kernel
  1. The RBF kernel SVM decision region is linear in the transformed feature space, even though it appears non-linear in the original input space.

So, the rule of thumb is: use linear SVMs for linear problems, and non-linear kernels such as the RBF kernel for non-linear problems.
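That rule of thumb can be sketched on a classic non-linear dataset (scikit-learn's two-moons generator; illustrative only):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable in the input space.
X, y = make_moons(n_samples=300, noise=0.1, random_state=0)

# A linear SVM cannot fit the curved boundary; the RBF kernel can,
# because its boundary is linear in the implicit higher-dimensional space.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
```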


NAIVE BAYES
  1. The naive Bayes classifier is based on Bayes’ theorem with independence assumptions between predictors (i.e., it assumes the presence of a feature in a class is unrelated to any other feature). Even if these features depend on each other, or on the existence of the other features, all of these properties are treated as contributing independently. Hence the name naive Bayes.
  2. Building on naive Bayes, Gaussian naive Bayes performs classification assuming the data follow a Gaussian (normal) distribution.
  3. A naive Bayes model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets.
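A small sketch of Gaussian naive Bayes (scikit-learn; iris is used just as convenient sample data):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Fitting only estimates a per-class mean and variance for each feature
# (no iterative optimisation), which is why training is so cheap.
gnb = GaussianNB().fit(X, y)
acc = gnb.score(X, y)
```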

DECISION TREE CLASSIFICATION
  1. It breaks a dataset down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed.
  2. Entropy and information gain are used to construct the tree.
  3. The main disadvantage of a decision tree model is overfitting: the tree grows deeper to fit the training set, which reduces test accuracy.
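The overfitting point can be illustrated with scikit-learn (depth limit of 3 is an arbitrary illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained tree keeps splitting until the training set is fit
# perfectly: exactly the overfitting behaviour described above.
deep = DecisionTreeClassifier(random_state=0).fit(X, y)

# Limiting the depth is one common remedy; criterion="entropy"
# selects splits by information gain.
shallow = DecisionTreeClassifier(max_depth=3, criterion="entropy",
                                 random_state=0).fit(X, y)
```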

RANDOM FOREST CLASSIFICATION
  1. The random forest classifier is an ensemble algorithm based on bagging, i.e., bootstrap aggregation.
  2. Ensemble methods combine more than one algorithm of the same or different kind for classifying objects (e.g., an ensemble of SVM, naive Bayes, and decision tree models).
  3. The general idea is that a combination of learning models improves the overall result.
  4. Random forests reduce overfitting by building each tree on a random subset of the data.
  5. The main reason is that the forest averages the predictions of all the trees, which cancels out the individual trees' errors.
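A hedged sketch of bagging in practice (scikit-learn; 100 trees is the illustrative default):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# 100 trees, each grown on a bootstrap sample of the data with random
# feature subsets at each split; the forest averages their votes.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
acc = rf.score(X, y)
```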

GRADIENT BOOSTING CLASSIFICATION
  1. Boosting is a way to combine (ensemble) weak learners, primarily to reduce prediction bias. Instead of creating a pool of independent predictors, as in bagging, boosting produces a cascade of them, where each learner's output is the input for the following learner.
  2. Gradient boosting builds trees sequentially instead of in parallel. Each decision tree predicts the residual error of the previous trees, thereby "boosting" (improving) on that error by following the gradient of the loss.
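The sequential tree-building described above, sketched with scikit-learn (50 trees and the 0.1 learning rate are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Trees are added one at a time; each new tree is fit to the residual
# error of the current ensemble (a gradient step on the loss).
gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                random_state=0).fit(X, y)
acc = gb.score(X, y)
```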


Unsupervised

  1. Clustering - hierarchical clustering, k-means, mixture models, DBSCAN, and the OPTICS algorithm
  2. Anomaly Detection - Local Outlier Factor and Isolation Forest
  3. Dimensionality Reduction - principal component analysis, independent component analysis, non-negative matrix factorization, singular value decomposition

Algorithms



Name | Comments on Applicability | Reference
Hierarchical Clustering
  1. With N data points, N-1 merge steps are performed, producing a hierarchy of candidate clusterings to choose from.
  2. Expensive and slow: an n×n distance matrix must be computed.
  3. Cannot work on very large datasets.
  4. Results are reproducible.
  5. Does not work well with hyper-spherical clusters.
  6. Can provide insight into the way the data points are clustered.
  7. Can use various linkage methods (apart from centroid linkage).
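A small sketch with SciPy's hierarchical-clustering routines (the two-blob data and Ward linkage are illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated Gaussian blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])

# Agglomerative clustering performs exactly N-1 merges;
# each row of Z records one merge step.
Z = linkage(X, method="ward")

# Cut the hierarchy to obtain a flat 2-cluster solution.
labels = fcluster(Z, t=2, criterion="maxclust")
```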

k-means
  1. Requires a pre-specified number of clusters.
  2. Less computationally intensive.
  3. Suited to large datasets.
  4. The starting centroids can be chosen at random, which can lead to a different result each time the algorithm runs.
  5. K-means assumes hyper-spherical (circular) clusters.
  6. K-means simply divides the data into mutually exclusive subsets without giving much insight into the process of division.
  7. K-means uses the mean (k-medians uses the median) to compute the centroid representing a cluster.
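A hedged sketch of those points with scikit-learn (synthetic two-blob data; `random_state` pins the otherwise random initialisation):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

# k must be chosen up front; fixing random_state makes the random
# initialisation (and hence the result) reproducible across runs.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_   # each cluster is represented by its mean
```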

Gaussian Mixture Models
  1. Requires a pre-specified number of clusters.
  2. GMMs are somewhat more flexible: with a covariance matrix, the boundaries can be made elliptical (as opposed to k-means, which makes circular boundaries).
  3. GMMs are also probabilistic: by assigning probabilities to data points, we can express how strongly we believe a given data point belongs to a specific cluster.
  4. GMMs usually tend to be slower than k-means because more iterations are needed to reach convergence. They can also converge quickly to a local optimum that is not very good; to avoid this, GMMs are usually initialized with k-means.
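The soft (probabilistic) assignment can be sketched with scikit-learn (same illustrative two-blob data as above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignment: each row gives the probability of the point
# belonging to each of the two Gaussian components.
proba = gmm.predict_proba(X)
```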

DBSCAN
  1. No pre-specified number of clusters.
  2. Computationally somewhat intensive.
  3. Cannot efficiently handle very large datasets.
  4. Suitable for non-compact, mixed-up, arbitrarily shaped clusters.
  5. Uses density-based clustering, so it does not work well when the density of the data points varies.
  6. Not affected by noise or outliers.
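A sketch of density-based clustering on arbitrarily shaped clusters (scikit-learn; the `eps` and `min_samples` values are illustrative, not tuned):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-compact clusters where k-means struggles.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps and min_samples define the density threshold;
# no cluster count is specified in advance.
db = DBSCAN(eps=0.25, min_samples=5).fit(X)

# The label -1 marks points treated as noise.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```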


DIMENSIONALITY REDUCTION ALGORITHMS | APPLICABILITY
Linear Discriminant Analysis

It is used to find a linear combination of features that characterizes or separates two or more classes of objects or events.

LDA is a supervised method: it uses class labels.

LDA is also sometimes used for clustering, and it often outperforms logistic regression when its assumptions (roughly Gaussian features within each class) hold.

Principal Component Analysis

It performs a linear mapping of the data from a higher-dimensional space to a lower-dimensional space in such a manner that the variance of the data in the low-dimensional representation is maximized.

PCA is an unsupervised method: it ignores class labels.
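The supervised/unsupervised contrast between the two methods can be sketched with scikit-learn (iris is used only as convenient labeled sample data):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features, 3 classes

# PCA ignores the labels: directions are chosen to maximise variance.
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)

# LDA uses the labels: directions are chosen to separate the classes
# (at most n_classes - 1 = 2 components here).
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
```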



                                
