R is a programming language for statistical computing and visualisation that runs on several platforms. It is an open-source language with ready-to-run binaries available for different platforms, and its source code can be downloaded and compiled for others. R can be downloaded from CRAN (the Comprehensive R Archive Network) or http://www.r-project.org/
R provides an environment that is easy to pick up and quick to adapt, letting users visualise and manipulate data, run statistical tests, perform calculations, and apply machine learning algorithms. Apart from this, R has a few other benefits:
- Effective data handling and storage facilities
- Access to rich statistical learning packages developed by top researchers
- Graphical support for data analysis and visualisation
- Most importantly, it's free
A. Getting Started
You can open R directly and start working in it, but it is advisable to first download RStudio, a free and open-source integrated development environment (IDE) for R.
Once in R, check the current working directory with getwd() and change it using setwd().
B. Installing and loading packages
Not many machine learning algorithms are built into R, so most of the time you need to install a package and then load it into your workspace:
> install.packages("<package name>")
> library("<package name>")
R also has great documentation. If you ever get stuck on a command or function, just type ?<function name> in the R console and a help page will appear showing the details of that function.
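For example, here is the install-then-load pattern using rpart, one of the packages used later in this post (the install step is commented out since it only needs to run once per machine):

```r
# Install the package from CRAN (run once, then comment out):
# install.packages("rpart")

# Load the package into the current session
library(rpart)

# Confirm that it is now attached
"package:rpart" %in% search()  # TRUE
```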
Basic Functions in R
In this section, we will cover how to read data, perform basic operations and visualise data.
I. Get your data and read it!
To load the data you can either search it on the web, or look into your local disk, or use built-in datasets.
Reading data using URL
> url <- "<enter a URL address with CSV data>"
> data <- read.csv(url, header = TRUE)
Reading from your local drive
> data <- read.csv("<location of your file>")
Using a built-in dataset. You can look at one of R's bundled datasets directly by simply typing its name into the R console (e.g., iris):
For reading data tables from text files, we use read.table(). Want to see how it works? Go ahead and call for help: ?read.table
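As a self-contained sketch, using a temporary file in place of a real URL or local path, here are read.csv() and the more general read.table() reading the same data:

```r
# Write the built-in iris dataset to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)

# Read it back with read.csv (header = TRUE is the default)
data <- read.csv(tmp, header = TRUE)

# read.table is the general workhorse; read.csv is a thin wrapper around it
data2 <- read.table(tmp, header = TRUE, sep = ",")

nrow(data)  # 150 rows, matching the original iris data
```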
II. Know your data
You can't simply jump into building a model right after reading the data; it is very important to analyse the data first. One of the first steps in data exploration is inspecting your data. There are myriad ways to do that, but I'll mention only a few of the most useful ones here.
There are also basic statistical functions, for computing the mean mean(), variance var(), standard deviation sd(), and so on, to help you evaluate the dataset.
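For instance, on the built-in iris dataset the usual inspection and summary functions look like this:

```r
# First few rows, column structure, and per-column summary
head(iris)
str(iris)
summary(iris)

# Basic statistics on a single column
mean(iris$Sepal.Length)  # arithmetic mean
var(iris$Sepal.Length)   # variance
sd(iris$Sepal.Length)    # standard deviation
```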
One of the best things about R is its great graphical capabilities, with lots of rich libraries that let us show off our visualisation skills. Let's find out how:
We will use the iris dataset for our example; try out these commands to see the visualisation.
> plot(iris$Petal.Length, iris$Petal.Width, main="Iris Data")
> plot(iris$Petal.Length, iris$Petal.Width, pch=21,
       bg=c("green","red","blue")[unclass(iris$Species)], main="Iris Data")
Now, let's play around with ggplot2 (load it first with library(ggplot2)):
> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
    geom_point(aes(color = Species))
You will have to explore ggplot2 further to understand the underlying notion behind the grammar of graphics. Also try ggvis, which is another awesome visualisation package to work with.
The CRAN repository has more than 10,000 active packages. Let’s discuss a few of the important machine learning packages and learn how to implement them.
Linear Regression
It is one of the most basic statistical learning approaches. lm() is the function we use to fit the model; since it is built in, you don't need to install any package for it.
Let's name the dependent variable y and the independent variables x1 and x2. We want to find the coefficients of a linear regression and generate a summary for it.
> lm_model <- lm(y ~ x1 + x2, data=as.data.frame(cbind(y,x1,x2)))
> summary(lm_model)
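A quick, self-contained sketch with simulated data (the variable names and true coefficients are illustrative) shows that lm() recovers the coefficients we generate the data with:

```r
set.seed(42)
# Simulate two predictors and a response with known coefficients 3, 2, -1
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 3 + 2 * x1 - 1 * x2 + rnorm(100, sd = 0.1)

# Fit the linear model and inspect it
lm_model <- lm(y ~ x1 + x2, data = as.data.frame(cbind(y, x1, x2)))
summary(lm_model)

coef(lm_model)  # estimates close to the true values 3, 2, -1
```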
Logistic Regression
Its syntax is very similar to that of linear regression; here, too, you don't need to install any package, since glm() is built in.
> glm_mod <- glm(y ~ x1 + x2, family=binomial(link="logit"), data=as.data.frame(cbind(y,x1,x2)))
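As a runnable sketch with simulated 0-1 outcomes (again with illustrative variable names), we can fit the model and check its in-sample accuracy:

```r
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
# Generate 0/1 outcomes whose log-odds depend on x1 and x2
p <- plogis(1 + 2 * x1 - x2)
y <- rbinom(200, size = 1, prob = p)

glm_mod <- glm(y ~ x1 + x2, family = binomial(link = "logit"),
               data = as.data.frame(cbind(y, x1, x2)))

# Predicted probabilities for the training data
probs <- predict(glm_mod, type = "response")
mean((probs > 0.5) == y)  # in-sample accuracy
```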
K-Nearest Neighbour Classification
K-nearest neighbours is a simple algorithm that stores all the available cases and classifies new ones based on the stored, labelled instances, i.e., by a similarity measure (e.g., distance functions).
In order to build a knn classifier, you need to divide the dataset into training and test sets and then supply a few arguments. knn() comes from the class package:
> library(class)
> knn_model <- knn(train=X_train, test=X_test,
                   cl=as.factor(labels), k=K)
Here cl holds the training labels and k is the number of neighbours to consult.
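A minimal end-to-end sketch on iris, assuming the class package (shipped with the standard R distribution) is available:

```r
library(class)
set.seed(7)

# Split iris into training (100 rows) and test (50 rows) sets
idx     <- sample(1:nrow(iris), 100)
X_train <- iris[idx,  1:4]
X_test  <- iris[-idx, 1:4]
labels  <- iris$Species[idx]

# Classify each test flower by its 3 nearest training neighbours
knn_model <- knn(train = X_train, test = X_test,
                 cl = labels, k = 3)

mean(knn_model == iris$Species[-idx])  # accuracy on held-out flowers
```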
Decision Trees
It is one of the simplest and most frequently used classification algorithms. Let's see the syntax, using the same formula we used in linear regression. rpart() comes from the rpart package:
> library(rpart)
> tree_model <- rpart(y ~ x1 + x2,
                      data=as.data.frame(cbind(y,x1,x2)), method="class")
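Here is the same call as a runnable sketch on iris, predicting the species from all four measurements:

```r
library(rpart)

# Grow a classification tree; "." means use all remaining columns
tree_model <- rpart(Species ~ ., data = iris, method = "class")

# Predicted class for each observation
preds <- predict(tree_model, iris, type = "class")
mean(preds == iris$Species)  # training accuracy
```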
K-Means Clustering
K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. The kmeans() function is built in, so no extra package is needed.
If you have a matrix Z, and n is the number of clusters, then the syntax goes like this:
> kmeans_model <- kmeans(x=Z, centers=n)
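As a concrete sketch, clustering the four numeric columns of iris into three groups:

```r
set.seed(3)

# Cluster the four numeric iris columns into 3 groups
Z <- as.matrix(iris[, 1:4])
kmeans_model <- kmeans(x = Z, centers = 3)

# Cluster sizes and total within-cluster sum of squares
kmeans_model$size
kmeans_model$tot.withinss
```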
Naive Bayes
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with the assumption that features are independent. naiveBayes() comes from the e1071 package:
> library(e1071)
> nB_model <- naiveBayes(y ~ x1 + x2,
                         data=as.data.frame(cbind(y,x1,x2)))
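A runnable sketch on iris, assuming the e1071 package is installed (install.packages("e1071") if not):

```r
library(e1071)

# Fit a naive Bayes classifier to iris
nB_model <- naiveBayes(Species ~ ., data = iris)

# Predict and check training accuracy
preds <- predict(nB_model, iris)
mean(preds == iris$Species)
```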
SVM (Support Vector Machines)
SVM is a supervised machine learning algorithm which can be used for both classification and regression problems.
Let Z be the matrix of features, labels be a vector of 0-1 class labels, and C the regularisation parameter. svm() also comes from the e1071 package:
> library(e1071)
> svm_model <- svm(x=Z, y=as.factor(labels),
                   kernel="radial", cost=C)
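A self-contained sketch, again assuming e1071 is installed, treating "virginica vs not" as a 0-1 classification problem on iris:

```r
library(e1071)

# Features and 0-1 labels (1 = virginica, 0 = otherwise)
Z      <- as.matrix(iris[, 1:4])
labels <- as.integer(iris$Species == "virginica")

# Radial-kernel SVM with cost parameter C = 1
svm_model <- svm(x = Z, y = as.factor(labels),
                 kernel = "radial", cost = 1)

preds <- predict(svm_model, Z)
acc <- mean(as.character(preds) == as.character(labels))
acc  # training accuracy
```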
There are many parameters to tune, such as the kernel type and the value of the cost parameter C; to choose them well you will have to dig deeper into the algorithm and check the documentation for svm().
Apriori
The Apriori algorithm mines association rules, uncovering how items in a dataset are associated with each other.
Here our dataset must be a binary incidence matrix, and we also have to set the support and confidence parameters for the algorithm. apriori() comes from the arules package:
> library(arules)
> assoc_rules <- apriori(as.matrix(dataset),
                         parameter = list(supp = 0.8, conf = 0.9))
Random Forest
Random Forest is one of the most powerful ensemble techniques in machine learning, for both classification and regression problems. The syntax is very similar to that of the linear regression model. Let's say you have data with seven variables, where Y is the response variable and X1, X2, X3, X4, X5 and X6 are the independent variables. randomForest() comes from the randomForest package:
> library(randomForest)
> fit <- randomForest(Y ~ X1+X2+X3+X4+X5+X6,
                      data=<your data frame>)
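The call above uses placeholder names; as a runnable sketch (assuming the randomForest package is installed), here is the same idea on the built-in iris dataset, standing in for the seven-variable setup described above:

```r
library(randomForest)
set.seed(5)

# Fit an ensemble of 100 trees predicting species from the other columns
fit <- randomForest(Species ~ ., data = iris, ntree = 100)

# Out-of-bag predictions and variable importance
mean(predict(fit) == iris$Species)  # out-of-bag accuracy
importance(fit)
```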
I hope this post was easy to follow. Its sole purpose is to acquaint you with R as a tool for getting started with the basic machine learning algorithms; this article alone won't do justice to these awesome algorithms, so you will have to dig deeper into each of them. In addition, you should learn how to generate predictions and evaluate your results, and, most importantly, practise a lot by taking up different kinds of problems.