What is Data Science?
In this story, I will be sharing what I summarized and understood from a chapter of a LinkedIn course (Data Science Foundations: Fundamentals, by Barton Poulson), in addition to other things I researched and studied on my own.
“Data Scientist: The Sexiest Job of the 21st Century” — Thomas H. Davenport and D. J. Patil — Harvard Business Review, October 2012.
Thomas H. Davenport and D. J. Patil argued that Data Scientists have rare qualities that put them in high demand.
These rare qualities of data scientists are:
- They are able to find order, meaning, and value in unstructured data.
- They are able to predict outcomes.
- They are able to automate processes, for example generating individualized recommendations for shoppers.
- They can use data science to uncover hidden insights that can’t be found in other ways.
How to define Data Science?
According to Drew Conway, the Senior Vice President of Two Sigma, the term “data science” is something of a misnomer, mostly because of the utter lack of agreement on what a curriculum on this subject would look like.
To simplify the definition of data science and the skills data scientists need, he created his famous Venn diagram of three overlapping skill areas.
Hacking Skills (Computer Programming): These are important because you will face novel data sources, challenging data formats, and streaming data.
Being able to manipulate text files at the command line, understanding vectorized operations, and thinking algorithmically are the hacking skills that make for a successful data hacker.
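As a rough illustration of the kind of text manipulation described above, here is a minimal Python sketch. The sample log lines and the helper name `count_status_codes` are hypothetical, invented only for this example; they do not come from the course.

```python
from collections import Counter

# Hypothetical sample data: a few lines of a made-up web server log.
SAMPLE_LOG = """\
2021-03-01 GET /home 200
2021-03-01 GET /cart 404
2021-03-02 POST /checkout 200
2021-03-02 GET /home 500
2021-03-02 GET /home 200
"""


def count_status_codes(text):
    """Count how often each status code (the last field on a line) appears."""
    codes = (line.split()[-1] for line in text.splitlines() if line.strip())
    return Counter(codes)


counts = count_status_codes(SAMPLE_LOG)

# Print the codes from most to least frequent.
for code, n in counts.most_common():
    print(code, n)
```

On a real project the same count could just as well be produced at the command line, for instance with a pipeline along the lines of `awk '{print $NF}' | sort | uniq -c`. The point is the habit of breaking a messy text source into small, repeatable steps.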
Math & Statistics Knowledge: Once you have acquired and cleaned the data, the next step is to actually extract insight from it. In order to do this, you need to apply appropriate math and statistics methods.
Math & Statistics Knowledge allows you to judge the fit between your questions and your data, and to choose the procedure that answers your question based on your data. In addition, it helps you know what to do when procedures fail or give impossible results.
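To make the point about impossible results concrete, here is a small Python sketch. The data are made up for illustration, and `fit_line` is a hypothetical helper rather than anything from the course. It fits an ordinary least-squares line to a yes/no outcome and shows that the line can predict a "probability" below 0 or above 1, a sign that the procedure does not fit the question.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: hours studied and whether the exam was passed (1) or not (0).
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]

intercept, slope = fit_line(hours, passed)

# Predicted "probability of passing" for a few values of hours studied.
for h in (0, 3.5, 10):
    print(h, round(intercept + slope * h, 2))
```

With this toy data the straight line gives a negative value for 0 hours and a value above 1 for 10 hours, neither of which can be a probability. Statistical knowledge is what tells you that a method designed for yes/no outcomes, such as logistic regression, is a better match here.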
Substantive Expertise: Each domain or topic area has its own goals, methods, and constraints. You will need to know the ones that apply to your project’s domain and how to implement them.
Data Science Pathway
Think of data science projects like walking down a pathway, where each step gets you closer to the goal that you have in mind.
Before starting any data science project, you first need to plan out several things that are important to success.
- Define goals. What is (are) the goal(s) of the project? What are you aiming to achieve by this project?
- Organize Resources. What are the needed tools for this project?
- Coordinate People. What are the jobs of each member and what must be done first?
- Schedule Project. When must each task be finished?
The next step is wrangling or preparing the data.
- Get data. Find and get the data you are going to use in the project.
- Clean data. Get the data ready to use and make it fit the paradigm of the tools you are working with.
- Explore data. After preparing the data, visualize it and compute numerical summaries to understand what your data looks like.
- Refine data. Based on the exploration step, data may need to be refined or re-categorized.
After wrangling comes modeling.
- Create model. Create the statistical model (for example, linear regression, a decision tree, or an artificial neural network).
- Validate model. How well will your model work on new data, given what it learned from the training data set?
- Evaluate model. How well does it fit the data? How usable is it going to be?
- Refine model. Based on the previous three steps, you may need to try processing the data a different way, adjust your parameters, or include additional variables in your model.
The last step is applying the model.
- Present model. Show decision-makers, invested parties, and your client what you learned and concluded from the data, so they know what you found.
- Deploy model. Put your model to work.
- Revisit model. See how well it is performing, especially when there is new data or a new context.
- Archive model. It is important to document everything. For example, record where the data came from and how you processed it, and comment the code used to analyze it.
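The wrangling and modeling steps above can be sketched end to end in a few lines of Python. This is only a toy illustration under stated assumptions: the raw records are invented, the function names (`clean`, `fit_line`, `mean_abs_error`) are hypothetical, and a one-variable linear regression stands in for whatever model a real project would use.

```python
import statistics

# Get data: hypothetical raw records (advertising spend, sales) read as text.
RAW = [("10", "25"), ("20", "41"), ("", "30"), ("30", "62"),
       ("40", "79"), ("50", "n/a"), ("60", "121"), ("70", "138")]


def clean(rows):
    """Clean data: keep only rows where both fields parse as numbers."""
    cleaned = []
    for x, y in rows:
        try:
            cleaned.append((float(x), float(y)))
        except ValueError:
            continue  # drop rows with missing or malformed values
    return cleaned


def fit_line(points):
    """Create model: least-squares line y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def mean_abs_error(model, points):
    """Average absolute gap between the model's predictions and the data."""
    intercept, slope = model
    return statistics.mean(abs(y - (intercept + slope * x)) for x, y in points)


data = clean(RAW)

# Explore data: a quick numerical summary of the outcome variable.
print("rows:", len(data), "mean sales:", statistics.mean(y for _, y in data))

# Hold back the last two rows so the model is checked on data it never saw.
train, test = data[:-2], data[-2:]
model = fit_line(train)

# Evaluate: how well does the model fit the data it was trained on?
print("training error:", round(mean_abs_error(model, train), 2))
# Validate: how well does it do on the held-back rows?
print("held-out error:", round(mean_abs_error(model, test), 2))
```

Holding back the last two rows keeps the sketch short; a real project would more likely split the data at random and use far more of it. If the held-out error is much larger than the training error, that is a cue for the "Refine model" step.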
Roles and teams in data science
There are so many skills and elements involved in a data science project that you’re going to need people from all sorts of backgrounds, with different techniques, to contribute to the overall success of the project.
– Data Engineers:
They are the developers and architects of the system. They focus on the hardware and software that make data science possible. They are responsible for providing the foundation for all the other analyses, finding trends in data, and developing algorithms to make data useful.
– Machine Learning Specialists:
They have an extensive background in computer science, mathematics, deep learning, and artificial intelligence. They deeply understand how the algorithms work on data to produce the desired results.
– Researchers (topical researchers):
They focus on domain-specific research. They connect with data scientists to try to find answers to some of the big-picture questions.
– Analysts:
They do the day-to-day tasks that are necessary to run the project efficiently. They visualize data and create reports that feed into business intelligence. Their role is to help in the decision-making process, to illustrate the project’s performance, and to find better ways to reach the project’s goals.
– Managers:
They manage the entire project, have the big picture, and frame business-relevant questions and solutions. They also need to keep other team members on track, since they know what needs to be accomplished. They need to speak data so they can understand how data relates to the question they are trying to answer.
– Data Science Unicorn:
Very rare to find.
This is a full-stack scientist who can do it all and do it at absolute peak performance. They are also a leader in data science, technology, and business.