Data science is a byproduct of the digital age. Although statistics has been around for hundreds of years, the earliest mentions of data science did not surface until 1964.
Today, our mobile devices generate more data than ever before, posing new challenges for storage and analysis. According to Forbes, 2.5 quintillion bytes of data are created every day. Gaining marketing insight from this ever-growing body of data in a timely manner is becoming increasingly difficult. This is where data science comes in.
It is not limited to just one industry or area of study: data science has already proven valuable within healthcare, energy, economics, criminal justice, and marketing.
Learning Python is easy, but you should be able to write efficient scripts and leverage the wide range of libraries and packages that Python has to offer. This programming language is a building block for applications such as manipulating data, building machine learning models, and many more.
Pandas is one of the most important Python libraries to know. It is a package for data manipulation and analysis. Whether you are cleaning, exploring, or manipulating data as a data scientist, this package is very useful.
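As a rough illustration of those three activities, here is a minimal sketch. It assumes pandas is installed, and the sales records, column names, and variable names are all made up for the example.

```python
import pandas as pd

# Hypothetical sales records: one exact duplicate row and one missing value.
raw = pd.DataFrame(
    {
        "region": ["north", "south", "south", "north", "south"],
        "units": [10, 5, 5, None, 8],
    }
)

# Cleaning: drop exact duplicate rows, then fill the missing count with 0.
clean = raw.drop_duplicates().fillna({"units": 0})

# Exploring: summary statistics (count, mean, min, max, ...) for one column.
summary = clean["units"].describe()

# Manipulating: total units sold per region.
totals = clean.groupby("region")["units"].sum()
print(totals)
```

Whether a missing value should really be treated as 0 depends on the data; the fill here only keeps the sketch short.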
SQL is used to extract data from a database, manipulate that data, and create data pipelines. It is important for the pre-analysis or pre-modeling stage of the data lifecycle.
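To make this concrete, here is a small sketch that runs SQL from Python using the standard-library sqlite3 module and an in-memory database. The orders table, its columns, and the customer_totals table are hypothetical names chosen for the example.

```python
import sqlite3

# An in-memory database, so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 14.5)],
)

# A tiny pipeline step: aggregate raw orders into a summary table.
conn.execute(
    "CREATE TABLE customer_totals AS "
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
)

# Extract the prepared data for analysis or modeling.
rows = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
print(rows)
conn.close()
```

The same SELECT and GROUP BY pattern carries over to other relational databases, although connection details and some syntax differ between systems.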
Data Visualization & Storytelling
Data visualization refers to data that is presented visually. It can be in the form of graphs, but it can also be presented in unconventional ways.
Data storytelling takes data visualization to the next level: it refers to “how” you communicate your insights. It is like a picture book. A good picture book has good visuals, but it also has an engaging and powerful narrative that connects the visuals.
Git, a version control system, is extremely important for several reasons, including:
- It allows you to revert to older versions of code.
- It allows you to work in parallel with several other data scientists and programmers.
- It allows you to use the same codebase as others even if you are working on an entirely different project.
Docker is a containerization platform that allows you to deploy and run applications, such as machine learning models.
Airflow is a workflow management tool that allows you to automate workflows. Specifically, it allows you to create automated workflows for data pipelines and machine learning pipelines.
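The core idea behind such a tool is that a workflow is a set of tasks with dependencies, run in an order that respects those dependencies. The sketch below shows that idea only; it does not use Airflow or its API. It relies on the standard-library graphlib module (Python 3.9 or later), and the task names and data are hypothetical.

```python
from graphlib import TopologicalSorter

# Shared store that tasks use to pass data along (illustrative only).
results = {}

def extract():
    results["raw"] = [3, 1, 2]

def transform():
    results["clean"] = sorted(results["raw"])

def load():
    results["loaded"] = len(results["clean"])

tasks = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks that must finish before it runs.
dependencies = {"transform": {"extract"}, "load": {"transform"}}

# Resolve a valid execution order, then run the tasks one by one.
order = list(TopologicalSorter(dependencies).static_order())
for name in order:
    tasks[name]()

print(order)
```

A real workflow manager adds scheduling, retries, and monitoring on top of this ordering step.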
Data scientists use computer science methodologies to write complex algorithms and build computational systems that perform statistical analysis of large unstructured data sets, all to reach a business goal.