A Data Scientist is a person who uses his or her skills to extract value from data. Data Science is one of the trending fields nowadays: new technologies appear constantly, and the rate at which data is generated is increasing exponentially, so there is ample opportunity to showcase your talent. But how do you start, and where do you go? These questions arise in the mind of every learner who wants to become a data scientist. Here are the resources and the path by which one can learn data science.
Skills required in Data Science
Be a Learner
Data science is a broad and fuzzy field, which makes it hard to learn. So, one must have basic knowledge about data and a constant urge to learn more about it. Data is like a pool: the deeper you go, the more you discover. You need something that helps you find the linkages between topics like statistics, linear algebra, and neural networks, and something that keeps you from struggling with the question "what do I learn next?". From there you can learn about neural networks, image recognition, and other cutting-edge techniques, which is important.
Data scientists constantly need to present the results of their analysis to others. Skill at doing this can be the difference between a good and a great data scientist. So keep yourself updated, keep reading research papers, and revisit the sections that you don't understand. These are the signs of a good learner.
Communication skills are important because they let you express your views effectively and show how well you understand a topic, both theoretically and practically. Another part is knowing how to organize your results clearly. The final piece is being able to walk others through your analysis.
- Read and write blogs: this gives you a better and deeper understanding of the sub-topics within data science.
- Try to teach enthusiastic people about data science concepts. It’s amazing how much teaching can help you understand concepts.
- Speak and take part at meetups.
- Use GitHub to host all your analysis and skills. Basically, GitHub gives you the platform to share your knowledge and show your talent to others.
- Get active in online communities such as Quora, DataTau, and http://datascience.community/
Mathematics and Statistics
For data scientists, a strong mathematics background, particularly in statistics and analysis, is strongly recommended. This goes naturally along with an equally strong academic foundation in computing. A data scientist should be at least as good at statistics as at coding, since statistical inference underpins much of the theory behind data analysis. You will need a solid foundation in statistics and probability, which serves as a stepping stone into the world of data science.
Knowledge of descriptive statistics, probability theory, algebra, and calculus helps you get started quickly with basic data analysis. The challenge is often not in knowing the math behind the analysis but in interpreting the results, which drive the further course of action. Usually, data transformation yields descriptive summaries that capture the essence of the data.
The following are the topics which must be learned with utmost clarity:
- Linear Algebra: Start from basic vector spaces and go up to Singular Value decomposition.
- Matrix Theory: Learn to find the inverse, transpose, multiplication of matrices, determinants, Eigenvalues, and vectors.
- Calculus: Integrals, differentiation, differential equations.
- Numerical Analysis: Numerical methods to find the solution of a Differential equation and Integrals.
- Statistics: Distributions, Different kinds of Charts, Mean – Mode – Median (Different methods of finding each and relation between them).
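As a quick illustration of the descriptive-statistics topics above, Python's built-in `statistics` module can compute the mean, median, and mode directly. The sample data here is hypothetical, used only to show the relation between the three measures:

```python
import statistics

# Hypothetical sample: daily visits to a website over nine days
data = [23, 29, 20, 32, 23, 21, 33, 25, 23]

mean = statistics.mean(data)      # arithmetic average of all values
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value
stdev = statistics.stdev(data)    # sample standard deviation

print(mean, median, mode)  # → 25.444444444444443 23 23
```

Note that the mean is pulled above the median here by the two large values (32 and 33), a simple example of how skewed data separates the three measures.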
Business Skills
Business skills are an advanced asset for going further in data science. The main reason for having them is to understand your environment and the tactics needed to deal with problems and maximize profit. Some of the key points are elaborated below:
- Analytic Problem-Solving: Approaching high-level challenges with an eye on what is important; employing the right approach/methods to make the maximum use of time and human resources.
- Effective Communication.
- Intellectual Curiosity: Exploring new territories and finding creative and unusual ways to solve problems.
- Industry Knowledge: Understanding the way your chosen industry functions and how data are collected, analyzed and utilized.
Tools, Techniques, and Technology
To become a data scientist, one should know the tools and technologies in use. Not all of them are mentioned here, but the following list gives a basic idea of the tools used to deal with data, each with a short description:
- Apache Spark – Apache Spark is an open-source cluster-computing framework for data analytics. Spark fits into the Hadoop open-source ecosystem, building on top of the Hadoop Distributed File System (HDFS).
- Solr – Used to build a highly scalable data analytics engine that lets customers engage in lightning-fast, real-time knowledge discovery.
- S3 – Amazon S3 is an online file storage web service offered by Amazon Web Services. Amazon S3 provides storage through web services interfaces.
- Hadoop – Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.
- MapReduce–Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- Corona – Corona is a scheduling framework that separates cluster resource management from job coordination. Corona introduces a cluster manager whose only purpose is to track the nodes in the cluster and the amount of free resources.
- HBase – HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
- Zookeeper – Apache Zookeeper is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service, and naming registry for large distributed systems.
- Hive – Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
- Mahout – Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering, and classification.
- Lucene – Lucene is a set of search-related and NLP tools, but its core feature is a search index and retrieval system. It takes data from a store such as HBase and indexes it for fast retrieval from a search query. Solr uses Lucene under the hood to provide a convenient REST API for indexing and searching data.
- NLTK – The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing(NLP) for the Python programming language.
- Freebase – Freebase is a large collaborative knowledge base consisting of metadata composed mainly by its community members. It is an online collection of structured data harvested from many sources.
- Sqoop – Sqoop is a command-line tool for transferring data between relational databases and Hadoop. It's what you might use to snapshot and copy your database tables to a Hive warehouse every night.
- Hue – Hue is a web-based GUI for a subset of the above tools. It aggregates the most common Apache Hadoop components into a single interface and focuses on user experience.
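To get a feel for the MapReduce programming model that Hadoop implements, here is a minimal, single-machine sketch of the classic word-count job in plain Python. The real framework distributes the map and reduce phases across thousands of nodes; this toy version only illustrates the shape of the computation:

```python
from collections import defaultdict

def map_phase(document):
    # Like a Hadoop mapper: emit a (word, 1) pair for every word
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    # Like a Hadoop reducer: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data is big", "data science uses big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
word_counts = reduce_phase(pairs)
print(word_counts)  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```

The key idea is that mappers and reducers are independent functions over key-value pairs, which is what lets Hadoop run them in parallel and fault-tolerantly across a cluster.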
There are many free online courses, such as MOOCs (Massive Open Online Courses) and Vskills Govt. certification courses. Some of these are given below; they may be helpful to some extent. Caution: some of them may not be free.
Introduction to CS Course
Notes: An introductory computer science course that provides instruction in coding.
- Udacity – intro to CS course,
- Coursera – Computer Science 101
Code in at least one object-oriented programming language: C++, Java, or Python
Beginner Online Resources
- Coursera – Learn to Program: The Fundamentals,
- MIT Intro to Programming in Java,
- Google’s Python Class,
- Vskills – Python Developer
- Python Open Source E-Book
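If Python is your first object-oriented language, a small class like the one below covers the core ideas of attributes, methods, and instantiation. The `Dataset` class is a hypothetical toy example, not taken from any of the courses above:

```python
class Dataset:
    """A toy container illustrating basic object-oriented Python."""

    def __init__(self, name, values):
        self.name = name      # instance attribute: a label for the data
        self.values = values  # instance attribute: a list of numbers

    def size(self):
        # Instance method: how many observations the dataset holds
        return len(self.values)

    def total(self):
        # Instance method: sum of all observations
        return sum(self.values)

# Instantiate the class and call its methods
sales = Dataset("monthly_sales", [120, 95, 143])
print(sales.name, sales.size(), sales.total())  # monthly_sales 3 358
```

The same concepts (classes, constructors, instance methods) carry over directly to C++ and Java, which the courses above also recommend.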
Learn other Programming Languages
You should add these to your repertoire.
Develop logical reasoning and knowledge of discrete math
- MIT Mathematics for Computer Science,
- Coursera – Introduction to Logic,
- Coursera – Linear and Discrete Optimization,
- Coursera – Probabilistic Graphical Models,
Develop Strong understanding of Algorithms and Data Structures
Notes: Learn about fundamental data types (stack, queues, and bags), sorting algorithms (quicksort, mergesort, heapsort) and data structures (binary search trees, red-black trees, hash tables), Big O.
- MIT Introduction to Algorithms,
- Coursera – Introduction to Algorithms Part 1 & Part 2,
- Wikipedia – List of Algorithms,
- Wikipedia – List of Data Structures,
- Book: The Algorithm Design Manual
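As a warm-up for the sorting algorithms mentioned in the notes above, here is a straightforward Python mergesort, a divide-and-conquer sort with O(n log n) worst-case running time (Big O notation in practice):

```python
def mergesort(items):
    # Base case: a list of zero or one elements is already sorted
    if len(items) <= 1:
        return items
    # Divide: split the list in half and sort each half recursively
    mid = len(items) // 2
    left = mergesort(items[:mid])
    right = mergesort(items[mid:])
    # Conquer: merge the two sorted halves into one sorted list
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # append any leftover elements
    merged.extend(right[j:])
    return merged

print(mergesort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```

Tracing why the merge step is linear, and why the recursion depth is logarithmic, is a good exercise for internalizing the Big O analysis the courses above teach.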
Learn Artificial Intelligence Online Resources
- Stanford University – Introduction to Robotics, Natural Language Processing, Machine Learning
- Vskills – Data mining and warehousing online course
Learning Big Data
- Big Data University
- Vskills Big Data Courses
All these will help you to lead a successful career in Data Science.
Content Written by Akhil Joshi - Vskills Intern SRM University