Machine Learning, a branch of Artificial Intelligence, “learns” without explicitly instructed and programmed to do so. Although it has been around for quite some time, it is gaining new momentum nowadays. This is the era of Big Data. Mining patterns from Big Data and using this information to make smart business decisions is a herculean and crucial task today. Using complex computational statistics in Machine Learning to do just this, is yielding brilliant results every day. Everyone has heard the term Machine Learning. It is omnipresent these days. However, there seems to be some confusion as to what it is. To understand that, visit our blog.
Regarding research, Machine Learning is probably one of the most important areas of computing today. Therefore, several algorithms have surfaced in Machine Learning, to help the computer learn as best as it can. This blog focuses on the most popular Machine Learning Algorithms used today, where you should use them, on what kind of data, and when.
The most popular Machine Learning Algorithms are –
Linear Regression is the most popular Machine Learning Algorithm, and the most used one today. It works on continuous variables to make predictions. Linear Regression attempts to form a relationship between independent and dependent variables and to form a regression line, i.e., a “best fit” line, used to make future predictions. The purpose of this regression line is to minimise the distance between the data points and the regression line to make an equation in the form of
Y = a*X + b
I.e., the equation of a straight line where, Y is the dependent variable, X is the independent variable, a is the slope, and b is the intercept. The basis of the model relies on the least squares estimation approach, among other lesser used approaches.
There are two types of Linear Regressions
– Simple Linear Regression – Where there is only one independent variable.
– Multiple Linear Regression – Where there is more than one independent variable.
Logistic Regression is a Machine Learning algorithm where the dependent variable is categorical. It estimates only discrete values, like 0 or 1, yes or no. Estimating probabilities using a Logistic Function determines the relationship between the dependent variable and the independent variables. The curve plotted between the variables makes an S-shaped curve as opposed to a straight line in linear regression. Logistic Regression is used when the outcome to be predicted binary – i.e., 0 or 1. Otherwise, other methods like Linear Regression are chosen. A logit function predicts the outcome variable Y when the outcome is to be categorical. The log of the probability that Y equals the independent variable. The equation becomes
ln(p/(1-p)) = B0 + B1X1 + B2X2 + .. + BkXk
Decision Tree is one of the most popular supervised Machine Learning algorithms. It is also the easiest to understand. Decision trees mainly perform classification, but sometimes, it forms regression trees as well. They work for categorical and continuous input and output variables. A decision tree splits the data into groups that are homogeneous but heterogeneous to each other.
So, how does the tree know when to methods?
The tree makes use of a variety of algorithms to make this decision. But the goal of the algorithm is this – Splits are performed on all criteria, and the split that results in the most homogeneous subgroup is selected. The broad algorithms used are Gini Index and Information Gain.
E.g.- If the credit worthiness of a customer is to be determined using Decision trees, many parameters are considered, as the income of the customer, his household size, and age. Now, a split is done based on each of these parameters. To determine which split provides the most homogeneous data, Gini Index or Information Gain is used.
Random Forest is an ensemble learning approach that uses Decision Trees. It is an immensely popular Machine Learning algorithm that does one better than Decision Trees. Instead of a single tree, construction of multiple trees takes place. To classify an object based on a parameter, each of the trees gives a classification. Finally, the chosen one has classification with the maximum. Random Forest works excellently for classification but not as well for regression. However, it is usually the go to the algorithm for most Data Scientists when dealing with a dataset that they’re unsure how to classify.
Artificial Neural Network
Artificial Neural Network is a supervised Machine Learning algorithm that is making giant leaps in the field of Artificial Intelligence. ANN attempts to replicate the working of the brain to help the machine to “learn’ based on past data, much like our brain learns from past experiences. ANN requires a huge training data, and it takes the time to train itself. However, it performs very accurate predictions. It makes use of a concept of “hidden layers” with neurons at each layer and connections between all the neurons.
The exact working of ANN is quite a mystery, though, much like the human brain. It is something like a “black box”. ANN makes use of only two hidden layers. However, a sub-branch of ANN called Deep Learning has been making the rounds. It makes use of many more hidden layers to make bear perfect predictions. The use of more than two hidden layers is possible today because of the advancements in computation power.
Support Vector Machine
Support Vector Machine, a supervised Machine Learning algorithm, used for classification and regression problems, but usually more for classification. It makes for a very effective tool, used mainly when classifying an item into one of two categories.
What is a Support Vector?
Every individual data point’s coordinate in the dataset is a Support Vector. The Support Vectors after plotting, form a classification.
What is a Support Vector Machine?
It is the hyperplane that divides or “classifies” the data points the best. For e.g., if you wanted to classify whether a given person was an Indian or an American, the parameters you’d choose are maybe the colour of their skin and their height. Based on the training data, we know that Americans tend to have fairer skin and are taller. Therefore, we now need to plot a line or find the best hyperplane that classifies the points in such a way that maximum distance exists between the closest data point from both the categories.
K-means clustering is an unsupervised Machine Learning algorithm that deals with clustering of data. Using training data, the model finds the best structures and forms clusters.
How does the algorithm work?
The algorithm takes two inputs:
- The number of clusters required
- The training data X.
Since there are no labels as it is unsupervised learning, there is no Y. Initially, the location of each of the K-clusters is the centroid. A new data point from the dataset is put into the cluster whose centroid it is the closest too. To determine the “nearness”, the cluster with mean with the least sum of squares is used. As new elements are added to the cluster, the cluster centroid keeps changing. The new centroid becomes the average of the locations of all the data points currently in the cluster. These two steps of assigning a data point to the cluster and then updating the cluster centroid are done iteratively.
Towards the end, you will notice that the cluster centroid doesn’t change since we’ve accurately created clusters that are homogeneous and heterogeneous with other clusters.
K-Nearest Neighbour is a simple Machine Learning algorithm that works on classification and regression problems, but more commonly on classification problems. A new element is classified into one of the classes based on a vote from its “neighbours”. A parameter “k” which signifies the number of neighbours used. The selection of k is critical, as with a small value of k, the result is not accurate and results in a lot of noise. But selecting an enormous value of k is computationally infeasible, and defeats the purpose of this algorithm. There is a loose rule followed while choosing k, which is to assign it the square root of the number of samples in the dataset.
Naive Bayes classifier
Naive Bayes classifier is a widely used Machine Learning algorithm which uses the Bayes Theorem with the base assumption that every feature is independent i.e., no feature depends on any other feature. Bayes Theorem states that the probability of an event can be calculated based on conditions that might affect the event.
Bayes Theorem states that –
P(c|x) = P(x|c) . P(c) / P(x)
P(c|x) is the probability that event of class c, given the predictor x
P(x|c) is the likelihood of x, given c
P(c) is the probability of the class
P(x) is the probability of the predictor
The posterior probability for each class is calculated, and classification is done based on the result of the calculation. The observation is classified into the class with the highest probability.
Ensemble Learning is a machine learning which uses not one but many models to make a prediction. The underlying idea for this is that collective opinion of many is more likely to be accurate than that of one. The outcome of each of the models is combined, and a prediction is made. The outcome can either be combined using average or the outcome occurring the most, or weighted averages. Ensemble Learning attempts to find a trade-off between variance and bias. The three most common methods of Ensemble Learning are Bagging, Boosting and Stacking.
Although there are many other Machine Learning algorithms, these are the most popular ones. If you’re a newbie to Machine Learning, these would be a good starting point to learn.
Which are the top Machine Learning algorithms do you think every Data Scientist should be having in their toolbox? We would love to know which are your favorite ones.