What_is_Data_Science
Data Science: A Powerful Combination of Various Disciplines
Data science combines computer science, mathematics and statistics,
and domain expertise. These disciplines are crucial for data scientists
to understand, collect, clean, analyse, and visualize data.
• Computer science skills are necessary for programming and
utilizing big data technologies, enabling data scientists to write
code for data tasks and deploy machine learning models.
• Math and statistics knowledge is vital for applying complex
algorithms to identify patterns, make predictions, and draw
conclusions from data.
• Domain expertise is essential for understanding specific industries
or problems, allowing data scientists to use data effectively. For
instance, a data scientist in healthcare needs a good
understanding of medical terminology.
The overlap of these disciplines represents the skills and knowledge
a successful data scientist needs. A strong foundation in all three
is necessary to use data effectively for solving real-world problems.
What are Datasets?
Datasets are collections of data, typically organized in a structured
manner for analysis or research purposes. These collections can
include various types of data, such as text, numbers, images, or other
forms of information. Datasets serve as the foundation for data-driven
tasks in fields like data science, machine learning, and statistics. Here
are some key points about datasets:
1. Structure:
• Tabular Data: Many datasets are organized in tabular form,
similar to a spreadsheet, with rows and columns. Each row
represents an individual observation, while columns represent
different features or attributes.
• Multi-modal Data: Datasets can include a variety of data types,
such as text, images, audio, time-series, and more.
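To make the row-and-column structure concrete, here is a minimal sketch that parses a tiny table with Python's standard csv module. The column names and values are hypothetical, invented purely for illustration; they do not come from any real dataset.

```python
import csv
import io

# Hypothetical tabular data: the header line names the columns (features),
# and each following line is one row (observation).
RAW_TABLE = """city,temperature_c,humidity_pct
Springfield,21.5,60
Shelbyville,18.0,72
Ogdenville,25.2,45
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per observation."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(RAW_TABLE)
columns = list(rows[0].keys())

print(len(rows), "observations")
print("features:", columns)
```

Note that csv.DictReader returns every value as a string, so numeric columns such as temperature_c would still need converting before any calculation.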
2. Types of Datasets:
• Public Datasets: These are datasets that are openly available to
the public and are often used for research, analysis, and
educational purposes. Examples include datasets provided by
government agencies, research institutions, and online repositories.
• Private Datasets: Some datasets are proprietary or restricted in
access due to privacy, security, or commercial reasons.
Example Dataset
Let's create a small example dataset to illustrate the concept. In this
case, we'll consider a simple dataset of students and their exam scores:
In this small dataset:
• Student ID: Unique identifier for each student.
• Name: The name of the student.
• Age: The age of the student.
• Exam Score: The score achieved by the student in a particular exam.
This dataset is easy to understand and work with. Its small size makes
it suitable for illustration, and it includes both categorical (name)
and numerical (age, exam score) variables. You might use such a dataset
to perform basic statistical analyses, visualize trends, or even build
simple models, depending on your analytical goals.
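As a rough sketch of the basic statistical analyses mentioned above, the snippet below stores a student dataset as a list of records and summarizes its numerical columns. Every name, age, and score is a hypothetical value made up for this example, and the summarize helper is likewise an illustrative assumption rather than part of any library.

```python
from statistics import mean

# Hypothetical records: all names, ages, and scores are invented.
students = [
    {"student_id": 1, "name": "Asha", "age": 20, "exam_score": 85},
    {"student_id": 2, "name": "Ben", "age": 22, "exam_score": 72},
    {"student_id": 3, "name": "Chen", "age": 21, "exam_score": 90},
    {"student_id": 4, "name": "Dina", "age": 20, "exam_score": 78},
]


def summarize(records, field):
    """Return the minimum, maximum, and mean of one numerical column."""
    values = [record[field] for record in records]
    return {"min": min(values), "max": max(values), "mean": mean(values)}


score_summary = summarize(students, "exam_score")
age_summary = summarize(students, "age")

# The record with the highest exam score.
top_student = max(students, key=lambda record: record["exam_score"])["name"]

print("Exam score summary:", score_summary)
print("Age summary:", age_summary)
print("Top student:", top_student)
```

For larger datasets a dedicated library would usually be a better fit, but plain lists and dictionaries are enough to show the idea.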
Here are some of the main free sources for datasets:
1. Kaggle Datasets
(kaggle.com/datasets/goldenoakresearch/us-household-income-stats-geo-locations)
2. UCI Machine Learning Repository
(archive.ics.uci.edu/dataset/53/iris)
3. Google Dataset Search
(datasetsearch.research.google.com/)
4. Open Government Data Platform India
(data.gov.in/)
5. GitHub: Awesome Public Datasets
(github.com/awesomedata/awesome-public-datasets)
Responsibilities of Data Scientists
2. Data Cleaning: They preprocess and clean the data to ensure its
quality, removing inconsistencies and errors.
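A minimal sketch of what such a cleaning step might look like, using student records with an exam score field. The records, the clean_records helper, and the assumption that valid scores lie between 0 and 100 are all hypothetical choices made for illustration.

```python
def clean_records(records):
    """Return cleaned copies of the records.

    Rows are dropped when the score is missing, when the score falls
    outside the assumed valid range of 0 to 100, or when the student_id
    has already been seen. Names are trimmed and title-cased.
    """
    seen_ids = set()
    cleaned = []
    for record in records:
        score = record.get("exam_score")
        if score is None or not (0 <= score <= 100):
            continue  # missing or out-of-range score
        if record["student_id"] in seen_ids:
            continue  # duplicate entry for the same student
        seen_ids.add(record["student_id"])
        cleaned.append({**record, "name": record["name"].strip().title()})
    return cleaned


# Hypothetical raw data containing typical problems.
raw_records = [
    {"student_id": 1, "name": "  asha ", "exam_score": 85},   # stray whitespace
    {"student_id": 2, "name": "Ben", "exam_score": None},     # missing score
    {"student_id": 3, "name": "chen", "exam_score": 190},     # impossible score
    {"student_id": 1, "name": "Asha", "exam_score": 85},      # duplicate
    {"student_id": 4, "name": "DINA", "exam_score": 78},      # inconsistent case
]

cleaned_records = clean_records(raw_records)
for record in cleaned_records:
    print(record)
```

Dropping bad rows is only one possible policy; depending on the analysis, a data scientist might instead correct or impute the problematic values.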