Data Science (UNIT 1)
UNIT I:
Introduction to Data Science: data science and big data, facets of data, the
data science process, the ecosystem, the data science process (six steps),
machine learning
Example:
DEPARTMENT OF COMPUTER SCIENCE-RASC-R.SUGANYA/AP Page 1
23PCSCC34: DATA SCIENCE & ANALYTICS UNIT 1
Suppose we want to travel from station A to station B by car. We need to make
some decisions: which route will get us to the destination fastest, which route
has no traffic jams, and which route is the most cost-effective. All these
decision factors act as input data, and analyzing them gives us an appropriate
answer. This analysis of data is called data analysis, which is a part of data
science.
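The route decision above can be sketched in a few lines of Python. This is only an illustration: the three routes, their numbers, and the weights are hypothetical and chosen for the example, not taken from any real data set.

```python
# Hypothetical routes from station A to station B; all numbers are made up.
routes = [
    {"name": "highway", "minutes": 40, "traffic_jam": False, "cost": 12.0},
    {"name": "city", "minutes": 55, "traffic_jam": True, "cost": 6.0},
    {"name": "ring_road", "minutes": 45, "traffic_jam": False, "cost": 8.0},
]


def best_route(routes, minute_weight=1.0, cost_weight=2.0):
    """Pick the jam-free route with the lowest weighted score.

    The weights are assumptions that express how much we care about
    travel time versus cost; a lower score is better.
    """
    jam_free = [r for r in routes if not r["traffic_jam"]]
    candidates = jam_free or routes  # fall back if every route is jammed
    return min(
        candidates,
        key=lambda r: minute_weight * r["minutes"] + cost_weight * r["cost"],
    )


choice = best_route(routes)
print(choice["name"])
```

Changing the weights changes the answer, which is the point of the example: the same input data can support different decisions depending on what matters most.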
Some years ago, there was less data, and it was mostly available in a structured
form that could easily be stored in Excel sheets and processed using BI tools.
But in today's world, data is becoming vast: approximately 2.5 quintillion
bytes of data are generated every day, which has led to a data explosion. It was
estimated that by 2020, about 1.7 MB of data would be created every second for
every person on earth. Every company requires data to work, grow, and improve
its business.
Handling such a huge amount of data is a challenging task for every
organization. To handle, process, and analyze it, we need complex, powerful,
and efficient algorithms and technology, and that technology came into
existence as data science. The following are some of the main reasons for using
data science technology:
o With the help of data science technology, we can convert massive amounts
of raw and unstructured data into meaningful insights.
o Data science technology is being adopted by many companies, whether big
brands or startups. Google, Amazon, Netflix, etc., which handle huge
amounts of data, use data science algorithms to improve the customer
experience.
o Data science is helping to automate transportation, for example in creating
self-driving cars, which are the future of transportation.
o Data science can help with different kinds of prediction, such as surveys,
elections, flight ticket confirmation, etc.
Data science Jobs:
As per various surveys, data scientist is becoming the most in-demand job of
the 21st century due to the increasing demand for data science. Some people have
also called it "the hottest job title of the 21st century". Data scientists are
experts who use various statistical tools and machine learning algorithms to
understand and analyze data.
The average salary for a data scientist ranges from approximately $95,000 to
$165,000 per annum, and according to various studies, about 11.5 million jobs
will be created by the year 2026.
Types of Data Science Job
If you learn data science, you get the opportunity to take up various exciting
job roles in this domain. The main job roles are given below:
1. Data Scientist
2. Data Analyst
3. Machine learning expert
4. Data engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
8. Business Intelligence Manager
Some of the key job titles in data science are explained below.
1. Data Analyst:
A data analyst is an individual who mines huge amounts of data, models the
data, and looks for patterns, relationships, trends, and so on. In the end, the
analyst produces visualizations and reports that support decision making and
problem solving.
Skills required: To become a data analyst, you need a good background
in mathematics, business intelligence, and data mining, and basic knowledge
of statistics. You should also be familiar with computer languages and tools
such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark, etc.
2. Machine Learning Expert:
A machine learning expert works with the various machine learning
algorithms used in data science, such as regression, clustering, classification,
decision trees, random forests, etc.
Skills required: Programming languages such as Python, C++, R, and Java,
along with Hadoop. You should also have an understanding of various algorithms,
analytical problem-solving skills, probability, and statistics.
3. Data Engineer:
A data engineer works with massive amounts of data and is responsible for
building and maintaining the data architecture of a data science project. The
data engineer also creates the data set processes used in modeling, mining,
acquisition, and verification.
Skills required: A data engineer must have in-depth knowledge of SQL, MongoDB,
Cassandra, HBase, Apache Spark, Hive, and MapReduce, along with knowledge of
languages such as Python, C/C++, Java, Perl, etc.
4. Data Scientist:
A data scientist is a professional who works with enormous amounts of data to
come up with compelling business insights by applying various tools,
techniques, methodologies, algorithms, etc.
Skills required: To become a data scientist, one should have skills in
technical languages and tools such as R, SAS, SQL, Python, Hive, Pig, Apache
Spark, and MATLAB. Data scientists must also have an understanding of
statistics, mathematics, and visualization, along with communication skills.
Prerequisite for Data Science
Non-Technical Prerequisite:
o Curiosity: To learn data science, one must be curious. When you are
curious and ask many questions, you can understand the business
problem more easily.
o Critical Thinking: A data scientist also needs critical thinking in order to
find multiple new ways to solve a problem efficiently.
o Communication skills: Communication skills are very important for a data
scientist because, after solving a business problem, you need to communicate
the solution to the team.
Technical Prerequisite:
Big data technology is used to store, analyze and organize huge volumes of
structured as well as unstructured data. Big data can be described mainly by
the 5 V's, as follows:
o Volume
o Variety
o Velocity
o Value
o Veracity
Skills required for Big Data
o BI stands for business intelligence, which is also used for the analysis of
business information. Below are some differences between BI and data
science:
Criterion     Business intelligence            Data science
Data source   Deals with structured data,      Deals with structured and
              e.g., a data warehouse.          unstructured data, e.g., weblogs,
                                               feedback, etc.
Facets of Data
• Very large amounts of data are generated in big data and data science. This
data is of various types, and the main categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
• Structured data is arranged in rows and columns. This makes it easy for
applications to retrieve and process the data. A database management system is
used to store structured data.
• The term structured data refers to data that is identifiable because it is
organized in a structure. The most common form of structured data is a database
in which specific information is stored in columns and rows.
• Structured data is also searchable by data type within content. It is
understood by computers and is also efficiently organized for human readers.
• An Excel table is an example of structured data.
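To make the rows-and-columns idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as the database management system. The table name, column names, and records are hypothetical and exist only for this example.

```python
import sqlite3

# An in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Every record has the same typed columns: this is what makes it "structured".
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT, age INTEGER)"
)
rows = [
    (1, "Asha", "Chennai", 31),   # made-up sample records
    (2, "Ravi", "Salem", 45),
    (3, "Meena", "Chennai", 27),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)


def names_in_city(conn, city):
    """Return customer names for one city, sorted alphabetically."""
    cur = conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", (city,)
    )
    return [name for (name,) in cur.fetchall()]


print(names_in_city(conn, "Chennai"))
```

Because every row follows the same column layout, retrieving the required information is a single query; no such shortcut exists for free-form text.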
Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and
columns are not used for unstructured data, so it is difficult to retrieve the
required information. Unstructured data has no identifiable structure.
• Unstructured data can take the form of text (documents, email messages,
customer feedback), audio, video or images. Email is an example of unstructured
data.
• Even today, in most organizations more than 80% of the data is in
unstructured form. It carries a lot of information, but extracting information
from these varied sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
Natural Language
Machine-Generated Data
• Machine data contains a definitive record of all the activity and behavior of
customers, users, transactions, applications, servers, networks, factory
machinery and so on.
• It includes configuration data, data from APIs and message queues, change
events, the output of diagnostic commands, call detail records, sensor data
from remote equipment and more.
• Examples of machine data are web server logs, call detail records, network event
logs and telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions
generate machine data. Machine data is generated continuously by every
processor-based system, as well as by many consumer-oriented systems.
• It can be either structured or unstructured. In recent years, the volume of
machine data has surged. The expansion of mobile devices, virtual servers and
desktops, as well as cloud-based services and RFID technologies, is making IT
infrastructures more complex.
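As an illustration of working with machine data, the sketch below parses web server log lines into structured records. It assumes the logs follow the widely used Common Log Format; the sample lines (including the documentation-range IP addresses) are invented for this example, and a real server may use a different layout.

```python
import re
from collections import Counter

# Pattern for a Common Log Format line (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Made-up sample lines; the last one is deliberately malformed.
sample_lines = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.11 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 153',
    '192.0.2.10 - - [10/Oct/2023:13:56:01 +0000] "POST /login HTTP/1.1" 200 512',
    'this line is not a valid log entry',
]


def parse_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


def status_counts(lines):
    """Count HTTP status codes, skipping lines that cannot be parsed."""
    records = (parse_line(line) for line in lines)
    return Counter(r["status"] for r in records if r is not None)


print(status_counts(sample_lines))
```

Once each line is a record with named fields, the log can be treated like structured data and summarized or loaded into a table.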
Audio, Video and Images
• Audio, image and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
• The terms audio and video commonly refer to time-based media storage
formats for sound/music and moving-picture information. Digital audio and video
recording formats, also referred to as audio and video codecs, can be
uncompressed, lossless compressed or lossy compressed, depending on the desired
quality and use case.
• It is important to remark that multimedia data is one of the most important
sources of information and knowledge; the integration, transformation and
Streaming Data
Data science and big data are rapidly growing fields that offer a wide
range of benefits and uses across various industries. Some of the benefits
and uses of data science and big data are:
1. Improved decision-making: Data science and big data help
organizations make better decisions by analyzing and interpreting
large amounts of data. Data scientists can identify patterns, trends,
and insights that can be used to make informed decisions.
2. Increased efficiency: Data science and big data can help
Data science process:
1. Define the problem: The first step in the data science process is to define
the problem that you want to solve. This involves identifying the business
or research question that you want to answer and determining what data
you need to collect.
2. Collect and clean the data: Once you have identified the data that you
need, you will need to collect and clean the data to ensure that it is
accurate and complete. This involves checking for errors, missing values,
and inconsistencies.
3. Explore and visualize the data: After you have collected and cleaned the
data, the next step is to explore and visualize the data. This involves
creating summary statistics, visualizations, and other descriptive analyses
to better understand the data.
4. Prepare the data: Once you have explored the data, you will need to
prepare the data for analysis. This involves transforming and
manipulating the data, creating new variables, and selecting relevant
features.
5. Build the model: With the data prepared, the next step is to build a model
that can answer the business or research question that you identified in
step one. This involves selecting an appropriate algorithm, training the
model, and evaluating its performance.
6. Evaluate the model: Once you have built the model, you will need to
evaluate its performance to ensure that it is accurate and effective. This
involves using metrics such as accuracy, precision, recall, and F1 score to
assess the model's performance.
7. Deploy the model: After you have evaluated the model, the final step is to
deploy the model in a production environment. This involves integrating
the model into an application or workflow and ensuring that it can handle
real-world data and user inputs.
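The evaluation metrics named in step 6 can be computed directly from true and predicted labels. The following is a minimal sketch for binary classification; the function name and the sample labels are hypothetical and used only for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision answers "of the items predicted positive, how many really were?", while recall answers "of the truly positive items, how many were found?". F1 is their harmonic mean.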
The big data ecosystem and data science are closely related, as the former
provides the infrastructure and tools that enable the latter.
The big data ecosystem refers to the set of technologies, platforms, and
frameworks that are used to store, process, and analyze large volumes of
data.
Some of the key components of the big data ecosystem include:
1. Storage: Big data storage systems such as Hadoop Distributed File
System (HDFS), Apache Cassandra, and Amazon S3 are designed to
store and manage large volumes of data across multiple nodes.
2. Processing: Big data processing frameworks such as Apache Spark,
Apache Flink, and Apache Storm are used to process and analyze large
volumes of data in parallel across distributed computing clusters.
3. Querying: Big data querying systems such as Apache Hive, Apache Pig,
and Apache Drill are used to extract and transform data stored in big data
storage systems.
4. Visualization: Big data visualization tools such as Tableau, D3.js, and
Apache Zeppelin are used to create interactive visualizations and
dashboards that enable data scientists and business analysts to explore
and understand data.
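The processing component above relies on splitting data into chunks, processing the chunks in parallel, and combining the partial results. The sketch below imitates that idea on a single machine with Python threads. It is not how Spark, Flink or Storm are implemented; the function names and sample text are assumptions made for this illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(chunk):
    """'Map' step: count the words in one chunk of text."""
    return Counter(chunk.lower().split())


def word_count(chunks, workers=2):
    """Process chunks in parallel, then merge ('reduce') the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return reduce(lambda a, b: a + b, partial_counts, Counter())


# Made-up chunks standing in for blocks of a large file stored across nodes.
chunks = ["big data needs big storage", "data science uses data"]
totals = word_count(chunks)
print(totals["data"], totals["big"])
```

In a real cluster the chunks would live on different nodes and the merge step would move data over the network, but the split, process, and combine pattern is the same.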
The data science process can be summarized into a series of steps that are
typically followed in order to extract insights and knowledge from data. These
steps are as follows:
and spreadsheets, as well as external sources, such as public data sets and
web scraping.
3. Data preparation: Once the data has been collected, it needs to be
cleaned, preprocessed, and transformed into a format that can be used for
analysis. This may involve tasks such as data cleaning, data wrangling,
and data normalization.
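Two of the preparation tasks mentioned above, data cleaning and data normalization, can be sketched as follows. This assumes missing values are represented as None and uses min-max scaling, which is only one of several normalization methods; the function name and sample values are hypothetical.

```python
def clean_and_normalize(values):
    """Drop missing entries (None) and rescale the rest to the range [0, 1]."""
    present = [float(v) for v in values if v is not None]  # data cleaning
    if not present:
        return []
    lo, hi = min(present), max(present)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in present]
    return [(v - lo) / (hi - lo) for v in present]  # min-max normalization


raw = [10, None, 20, 15, None, 30]  # made-up column with missing values
print(clean_and_normalize(raw))
```

Dropping missing values is the simplest cleaning strategy; depending on the problem, filling them with a mean or median may be more appropriate.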
In this step, you will have to develop ideas that can help
identify hidden patterns and insights. You will have to look for
interesting patterns in the data, such as why sales of a
particular product or service have gone up or down, and
examine such patterns thoroughly. This is one
of the most crucial steps in the data science process.
Machine learning allows computers to learn from past experience on their
own. It uses statistical methods to improve performance and predict outputs
without being explicitly programmed.
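As a small illustration of learning from past data, the sketch below fits a straight line to example points using ordinary least squares and then predicts a new value. Simple linear regression is chosen here only as one of the simplest learning methods; the data (hours studied versus score) is invented for the example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Past experience: made-up observations of hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

slope, intercept = fit_line(hours, scores)


def predict(x):
    """Predict a score for an unseen number of hours."""
    return slope * x + intercept


print(predict(6))
```

Nobody wrote the rule "score = 2 * hours + 50" into the program; the slope and intercept were estimated from the examples, which is what "without being explicitly programmed" means.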
The use of machine learning in data science can be understood through the
development process, or life cycle, of data science. The steps that occur in the
data science lifecycle are as follows:
The below table describes the basic differences between Data Science and ML:
Data Science                             Machine Learning
---------------------------------------  ---------------------------------------
A broad term that covers the various     Used in the data modeling step of the
steps needed to create a model for a     overall data science process.
given problem and deploy the model.

Can work with raw, structured, and       Mostly requires structured data to
unstructured data.                       work on.

Data scientists spend a lot of time      ML engineers spend a lot of time
handling the data, cleansing the         managing the complexities that arise
data, and understanding its patterns.    when implementing algorithms and the
                                         mathematical concepts behind them.