What does being a Data Scientist entail?
The simple answer is – Statistics, Programming, and some domain knowledge. These are the three areas you need to work on to become a Data Scientist. Seems easy enough on the surface, doesn’t it? It is, to an extent. But the Statistics you need isn’t what we learnt in school. It isn’t about memorising simple mean, median, and mode formulae (although a smooth implementation of even these simple concepts can go a long way and provide great insights through visualization!), but about studying the subject in detail.
Normal Distribution, Random Variables, z-scores, correlation and covariance, Linear Regression, Probability, and Bayes’ Theorem are just some of the statistical concepts you will come across on a daily basis as a Data Scientist.
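To make a few of these terms concrete, here is a minimal sketch that uses only Python’s standard library. The data (hours studied and exam scores for five students) and the helper function names are made up for illustration, and the population versions of the formulas are assumed throughout.

```python
from statistics import mean, pstdev

# Hypothetical sample data: hours studied and exam scores for five students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 68.0, 74.0]


def z_scores(values):
    """How many population standard deviations each value lies from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def covariance(xs, ys):
    """Population covariance: the average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to lie between -1 and 1."""
    return covariance(xs, ys) / (pstdev(xs) * pstdev(ys))


def linear_fit(xs, ys):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    slope = covariance(xs, ys) / covariance(xs, xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept


slope, intercept = linear_fit(hours, scores)
print("z-scores of hours:", z_scores(hours))
print("covariance:", covariance(hours, scores))
print("correlation:", correlation(hours, scores))
print("fitted line: score =", slope, "* hours +", intercept)
```

In day-to-day work you would more likely reach for a library than write these by hand, but spelling the formulas out once shows how closely the concepts are related: correlation is rescaled covariance, and the regression slope is covariance divided by variance.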
Python, R and SAS are the programming languages a Data Scientist is most likely to use every day. No, a Data Scientist does not need to know each of these languages extensively. As is often said, don’t be a jack of all trades; be the master of one. Since these languages offer broadly the same functionality, being an expert in one is much better than being intermediate in all. There is a constant debate in the Data Science community about which language is better – Python or R. In fact, a quick Google search will turn up countless posts on this controversial topic! There is no simple answer. It depends on which one a programmer is more comfortable with. Personally, I’d recommend Python, as I believe that R has a much steeper learning curve than Python.
Below is the Python code to add two numbers –
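# Read two whole numbers from the user
n1 = int(input("Enter a number: "))
n2 = int(input("Enter the second number: "))
# Use "total" rather than "sum" so the built-in sum() is not shadowed
total = n1 + n2
print("Sum is " + str(total))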
Isn’t this simple? Doesn’t it read like regular English? Python also provides excellent, easy-to-understand scientific packages, such as scikit-learn and statsmodels, for Machine Learning and for dealing with large data sets. In my experience, the equivalent code in R looks more cluttered and is harder to follow for someone who is new to programming. This is just my two cents, though. If you’re new to the field, you should play around with both languages and decide which you find easier.
Having Domain Knowledge means, in basic terms, using the insights derived from the previous two skills to make smart business decisions and optimize operations. This is perhaps the most crucial step, even though it might not seem like much. Unless it is carried out effectively, the insights cannot be accurately converted into smart business decisions.
In addition to this, a Data Scientist must be able to tell a story with the data, representing the insights visually in such a way that a layman can understand and process them. Essentially, a Data Scientist should be a good storyteller. “People hear statistics, but they feel stories.” It’s as simple as this: the insights gained are worthless unless they can be translated into ways to increase profit margins, i.e., into smart business decisions. Data Visualization tools often used by Data Scientists include Tableau, the scientific Python packages (such as matplotlib), and Excel (Excel is very powerful for data visualization; it cannot be used on “Big Data”, but it is an excellent solution for small data). To put it whimsically, as the economist Ronald Coase said, “If you torture the data long enough, it will confess.”
Can just anyone become a Data Scientist?
There is quite a lot of confusion around who can become a Data Scientist. So let me give you a few interesting examples:
- Karthik Rajagopalan of AT&T has degrees in Mechanical, Industrial and Electrical Engineering, and a PhD in Solid State Physics. He is now a Data Scientist!
- Shankar Iyer, a Data Scientist at Quora, says, “Our data science team at Quora has people with diverse backgrounds, including physics, economics, and chemical engineering. There are indeed some team members with degrees in statistics and machine learning, but it’s not a requirement.”
- Luis Tandalla, a student at the University of New Orleans, took a couple of free Data Science courses on Coursera. A few months later, he scored his first victory in a Kaggle competition hosted by the Hewlett Foundation, where he had to devise a model for accurately grading short-answer questions on exams. This, from a student who had no idea what Machine Learning was before he signed up for the courses online!
What background does a Data Scientist require?
What do all these examples tell you? That absolutely anybody can become a Data Scientist? While it is inspiring to think so, and the statement is true to an extent, it has its limits. A Data Scientist is someone who has some background in Mathematics and Statistics, some Programming skill, a creative mind to ask the right questions and to use insights for business decisions, and an absolute love for data: someone who can weave the insights into a story using all the tools at their disposal.
Programming and Statistics can be learnt, though how steep the learning curve is depends on an individual’s background and commitment. The love for data, however, is something that cannot be acquired. It’s something that is built into you. If this revolution of mounds and mounds of data changing the world astounds, fascinates, drives and motivates you, then you’ll probably be the best Data Scientist out there soon!
Are Statistics and programming definite prerequisites?
There are several cases in the Data Science community of people who have become successful Data Scientists by being proficient only in Excel, Tableau, or another visualization tool, or by having strong business acumen. While I’m sure this brought a smile to many of your faces, I must warn you that it isn’t the rule. All things considered, someone with programming and statistics knowledge is much more likely to be a successful Data Scientist than someone who is proficient only in Excel. So, my take on the subject is this: if you’re planning to get into the field, start getting familiar with programming and statistics. Programming might be a term that scares a lot of you, but rest assured that in Data Science, in-depth knowledge of complex languages like Java is unnecessary. Start out with R or Python, both of which are comparatively easy to pick up.
As far as Statistics is concerned, again, you do not need to become an expert in the field. Start by taking basic stats courses tailored to Data Science that will teach you just enough.
What you should do right away to become a Data Scientist!
I believe that the best way to learn anything is by throwing yourself headfirst into it and playing with it. Learning Data Science concepts will be for nothing unless you start implementing your newfound knowledge in practical applications. Start by taking any of the plethora of Machine Learning courses offered online. But do not wait to finish the entire course before practising: keep revisiting the examples and familiarising yourself with the theory as you go. Go to Kaggle or any of the other sources that give away free data sets, and get working!
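As a rough idea of what a first “get working” session might look like, here is a small sketch using only Python’s standard library. The in-memory CSV stands in for a file you would download yourself; its column names and values are invented for illustration and do not come from any real data set.

```python
import csv
import io
from statistics import mean, median

# Stand-in for a downloaded CSV file; the columns and values are made up.
raw_csv = io.StringIO(
    "city,price,rooms\n"
    "Pune,120,2\n"
    "Pune,150,3\n"
    "Delhi,200,3\n"
    "Delhi,,2\n"
)

# Each row becomes a dict keyed by the header names.
rows = list(csv.DictReader(raw_csv))

# Real data sets usually have gaps, so skip rows with a missing price.
prices = [float(row["price"]) for row in rows if row["price"]]

summary = {
    "rows": len(rows),
    "rows_with_price": len(prices),
    "mean_price": mean(prices),
    "median_price": median(prices),
}
print(summary)
```

With a real file you would replace the `io.StringIO` object with `open("your_file.csv", newline="")`. Even a summary this small raises useful questions, such as why a price is missing and whether the mean or the median describes the data better.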
Of the skills mentioned above, which one do you plan to learn on your way to becoming a Data Scientist?