Data Science (UNIT 1)

23PCSCC34: DATA SCIENCE & ANALYTICS UNIT 1

UNIT I:
Introduction to Data Science: data science and big data - facets of data - data
science process - ecosystem - the data science process: six steps - machine
learning

What is Data Science?


Data science is the deep study of massive amounts of data. It involves
extracting meaningful insights from raw, structured, and unstructured data,
processed using the scientific method, different technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate data
so that you can find something new and meaningful.
Data science uses powerful hardware, programming systems, and efficient
algorithms to solve data-related problems. It is the future of artificial
intelligence.
In short, we can say that data science is all about:
o Asking the correct questions and analyzing the raw data.
o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and finding the final result.

Example:
DEPARTMENT OF COMPUTER SCIENCE-RASC-R.SUGANYA/AP Page 1

Suppose we want to travel from station A to station B by car. We need to
make some decisions, such as which route will be the fastest way to reach the
destination, which route will have no traffic jam, and which will be cost-
effective. All these decision factors act as input data, and we get an
appropriate answer from these decisions; this analysis of data is called data
analysis, which is a part of data science.

Need for Data Science:

A few years ago, data was scarce and mostly available in a structured form, which
could be easily stored in Excel sheets and processed using BI tools.
But in today's world data has become vast: approximately 2.5 quintillion
bytes of data are generated every day, which has led to a data explosion. Researchers
estimated that by 2020, 1.7 MB of data would be created every single second by
every person on earth. Every company requires data to work,
grow, and improve its business.
Handling such a huge amount of data is a challenging task for every
organization. To handle, process, and analyze it, we require
complex, powerful, and efficient algorithms and technology, and that technology
is data science. Following are some main reasons for using
data science technology:
o With the help of data science technology, we can convert massive
amounts of raw and unstructured data into meaningful insights.
o Data science technology is being adopted by various companies, whether big
brands or startups. Google, Amazon, Netflix, etc., which handle huge
amounts of data, use data science algorithms for a better customer
experience.
o Data science is being used to automate transportation, such as creating
self-driving cars, which are the future of transportation.
o Data science can help in different predictions, such as surveys,
elections, flight ticket confirmation, etc.
Data science jobs:
As per various surveys, data scientist is becoming the most in-demand job of
the 21st century due to the increasing demand for data science. Some people also
call it "the hottest job title of the 21st century". Data scientists are the experts
who use various statistical tools and machine learning algorithms to understand
and analyze data.
The average salary for a data scientist is approximately $95,000 to
$165,000 per annum, and as per different studies, about 11.5 million jobs
will be created by the year 2026.
Types of Data Science Job

If you learn data science, you get the opportunity to find various exciting
job roles in this domain. The main job roles are given below:
1. Data Scientist
2. Data Analyst
3. Machine learning expert
4. Data engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
8. Business Intelligence Manager
Below is an explanation of some important data science job titles.
1. Data Analyst:
A data analyst is an individual who mines huge amounts of data,
models the data, and looks for patterns, relationships, trends, and so on. At the end of
the day, he comes up with visualizations and reports for analyzing the data for
decision-making and problem-solving.
Skills required: To become a data analyst, you need a good background
in mathematics, business intelligence, data mining, and basic knowledge
of statistics. You should also be familiar with computer languages and tools
such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark, etc.
2. Machine Learning Expert:
The machine learning expert is the one who works with various machine learning
algorithms used in data science such as regression, clustering, classification,
decision tree, random forest, etc.
Skills required: computer programming languages such as Python, C++, R, and Java,
and frameworks such as Hadoop. You should also have an understanding of various
algorithms, problem-solving and analytical skills, probability, and statistics.

3. Data Engineer:
A data engineer works with massive amounts of data and is responsible for building
and maintaining the data architecture of a data science project. Data engineers
also create the dataset processes used in modeling, mining, acquisition,
and verification.
Skills required: A data engineer must have in-depth knowledge of SQL, MongoDB,
Cassandra, HBase, Apache Spark, Hive, and MapReduce, with knowledge
of languages such as Python, C/C++, Java, Perl, etc.
4. Data Scientist:
A data scientist is a professional who works with an enormous amount of data to
come up with compelling business insights through the deployment of various
tools, techniques, methodologies, algorithms, etc.
Skills required: To become a data scientist, one should have technical skills in
languages such as R, SAS, SQL, Python, Hive, Pig, Apache Spark, and MATLAB. Data
scientists must also have an understanding of statistics, mathematics, and
visualization, along with communication skills.
Prerequisite for Data Science
Non-Technical Prerequisite:
o Curiosity: To learn data science, one must have curiosity. When you are
curious and ask various questions, you can understand the business
problem easily.
o Critical Thinking: It is also required for a data scientist so that you can find
multiple new ways to solve the problem with efficiency.
o Communication skills: Communication skills are most important for a data
scientist because, after solving a business problem, you need to communicate
it to the team.
Technical Prerequisite:

o Machine learning: To understand data science, one needs to understand the
concept of machine learning. Data science uses machine learning algorithms
to solve various problems.
o Mathematical modeling: Mathematical modeling is required to make fast
mathematical calculations and predictions from the available data.
o Statistics: Basic understanding of statistics is required, such as mean,
median, or standard deviation. It is needed to extract knowledge and obtain
better results from the data.
o Computer programming: For data science, knowledge of at least one
programming language is required. R, Python, and Spark are some programming
languages commonly used for data science.
o Databases: An in-depth understanding of databases such as SQL is essential
for data science, in order to get the data and to work with it.
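The basic statistics listed above (mean, median, standard deviation) can be computed with Python's standard library alone; a minimal sketch on made-up sales figures:

```python
import statistics

# Monthly sales figures (illustrative sample data)
sales = [120, 135, 150, 110, 145, 160, 130]

mean = statistics.mean(sales)      # average value
median = statistics.median(sales)  # middle value when sorted
stdev = statistics.stdev(sales)    # sample standard deviation

print(f"mean={mean:.2f}, median={median}, stdev={stdev:.2f}")
```

These three summary numbers are often the first step in understanding any new dataset.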
DATA SCIENCE COMPONENTS:

The main components of Data Science are given below:


1. Statistics: Statistics is one of the most important components of data science.
Statistics is a way to collect and analyze numerical data in large amounts and
find meaningful insights from it.
2. Domain Expertise: In data science, domain expertise binds data science
together. Domain expertise means specialized knowledge or skills of a particular
area. In data science, there are various areas for which we need domain experts.
3. Data engineering: Data engineering is a part of data science, which involves
acquiring, storing, retrieving, and transforming data. Data engineering also
includes adding metadata (data about data) to the data.
4. Visualization: Data visualization means representing data in a visual
context so that people can easily understand its significance. Data
visualization makes it easy to grasp huge amounts of data through visuals.
5. Advanced computing: Advanced computing does the heavy lifting of data science.
It involves designing, writing, debugging, and maintaining the
source code of computer programs.


6. Mathematics: Mathematics is a critical part of data science. Mathematics
involves the study of quantity, structure, space, and change. For a data
scientist, good knowledge of mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. Machine
learning is all about training a machine so that it can act like a human
brain. In data science, we use various machine learning algorithms to solve
problems.
Tools for Data Science
Following are some tools required for data science:
o Data analysis tools: R, Python, SAS, Jupyter, RStudio,
MATLAB, Excel, RapidMiner.
o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS
Redshift
o Data Visualization tools: R, Jupyter, Tableau, Cognos.
o Machine learning tools: Spark, Mahout, Azure ML studio.
What is Big Data?
Big data is huge, voluminous data, information, or relevant
statistics acquired by large organizations that is difficult to process with
traditional tools. Big data can be structured, unstructured, or semi-structured.
Data is one of the key players in running any business, and it is increasing
exponentially with the passage of time. A decade ago, organizations could
deal only with gigabytes of data and suffered problems with data storage; but
after the emergence of big data, organizations are now capable of handling
petabytes and exabytes of data, and able to store huge volumes of data using
the cloud and big data frameworks such as Hadoop.


Big data is used to store, analyze, and organize huge volumes of structured as
well as unstructured datasets. Big data can be described mainly with the 5 V's:
o Volume
o Variety
o Velocity
o Value
o Veracity
Skills required for Big Data

o Strong knowledge of Machine Learning concepts


o Understanding of databases such as SQL, NoSQL, etc.
o In-depth knowledge of frameworks and programming languages such as Hadoop,
Java, Python, etc.
o Knowledge of Apache Kafka, Scala, and cloud computing
o Knowledge of data warehouses such as Hive.
Difference between BI and Data Science

BI stands for business intelligence, which is also used for data analysis of
business information. Below are some differences between BI and data science:

Data Source: Business intelligence deals with structured data, e.g., a data
warehouse; data science deals with structured and unstructured data, e.g.,
weblogs, feedback, etc.

Method: Business intelligence is analytical (historical data); data science is
scientific (it goes deeper to know the reason behind the data report).

Skills: Statistics and visualization are the two skills required for business
intelligence; statistics, visualization, and machine learning are the skills
required for data science.

Focus: Business intelligence focuses on both past and present data; data
science focuses on past data, present data, and also future predictions.

Facets of Data
• Very large amounts of data are generated in big data and data science. This
data is of various types, and the main categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images

Structured Data

• Structured data is arranged in a row-and-column format. This makes it easy for
applications to retrieve and process the data. A database management system is
used for storing structured data.
• The term structured data refers to data that is identifiable because it is
organized in a structure. The most common form of structured data or records is
a database, where specific information is stored based on a methodology of
columns and rows.
• Structured data is also searchable by data type within content. Structured
data is understood by computers and is also efficiently organized for human
readers.
• An Excel table is an example of structured data.
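As a sketch of how structured data is stored and searched in a database management system, the following uses SQLite from Python's standard library (the table, columns, and rows are illustrative):

```python
import sqlite3

# In-memory database with a row-and-column (structured) layout
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Asha", "Sales", 52000.0), ("Ravi", "IT", 61000.0), ("Mala", "IT", 58000.0)],
)

# Because the data is structured, it is searchable by column and data type
rows = conn.execute(
    "SELECT name, salary FROM employees WHERE dept = ? ORDER BY salary DESC",
    ("IT",),
).fetchall()
print(rows)  # [('Ravi', 61000.0), ('Mala', 58000.0)]
conn.close()
```

The same query would be impractical on unstructured data, which is exactly the contrast the next section draws.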

Unstructured Data

• Unstructured data is data that does not follow a specified format. Rows and
columns are not used for unstructured data, so it is difficult to retrieve
required information. Unstructured data has no identifiable structure.
• Unstructured data can be in the form of text (documents, email messages,
customer feedback), audio, video, or images. Email is an example of
unstructured data.
• Even today, in most organizations more than 80% of the data is in
unstructured form. It carries lots of information, but extracting information
from these various sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.


2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restrictions, or sequences for unstructured
data.
5. Since there is no structural binding for unstructured data, it is
unpredictable in nature.

Natural Language

• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words,
and sentences, then apply meaning and understanding to that information. This
helps machines understand language as humans do.
• Natural language processing is the driving force behind machine intelligence
in many modern real-world applications. The natural language processing
community has had success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis.
• For natural language processing to help machines understand human language,
it must go through speech recognition, natural language understanding, and
machine translation. It is an iterative process comprising several layers of
text analysis.
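A first layer of that text analysis, tokenization, can be sketched in a few lines of Python (a deliberately naive illustration, not a full NLP pipeline):

```python
import re
from collections import Counter

# Break raw text into lowercase word tokens, then count them.
text = "Natural language processing helps machines understand language as humans do."

tokens = re.findall(r"[a-z']+", text.lower())  # naive word tokenizer
counts = Counter(tokens)

print(tokens[:4])          # ['natural', 'language', 'processing', 'helps']
print(counts["language"])  # 2
```

Real systems layer entity recognition, parsing, and sentiment models on top of exactly this kind of token stream.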

Machine-Generated Data

• Machine-generated data is information that is created without human
interaction as a result of a computer process or application activity. This
means that data entered manually by an end user is not considered
machine-generated.


• Machine data contains a definitive record of all activity and behavior of our
customers, users, transactions, applications, servers, networks, factory machinery
and so on.
• It includes configuration data, data from APIs and message queues, change
events, the output of diagnostic commands, call detail records, sensor data
from remote equipment, and more.
• Examples of machine data are web server logs, call detail records, network event
logs and telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions
generate machine data. Machine data is generated continuously by every
processor-based system, as well as many consumer-oriented systems.
• It can be either structured or unstructured. In recent years, machine data
has surged. The expansion of mobile devices, virtual servers and desktops, as
well as cloud-based services and RFID technologies, is making IT
infrastructures more complex.
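As a sketch of working with machine data, the following parses hypothetical web server access-log lines (the log format and field names are assumptions modeled on common access logs):

```python
import re

# Hypothetical access-log lines, a typical form of machine-generated data
log_lines = [
    '192.168.1.10 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '192.168.1.11 - - [12/Mar/2024:10:15:33 +0000] "POST /login HTTP/1.1" 401 230',
]

# Extract IP address, HTTP method, path, and status code from each record
pattern = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3})')

parsed = []
for line in log_lines:
    m = pattern.match(line)
    if m:
        parsed.append(m.groups())  # (ip, method, path, status)

print(parsed)
```

Turning such semi-regular machine records into structured tuples is usually the first step before any analysis of logs or telemetry.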

Graph-based or Network Data

• Graphs are data structures that describe relationships and interactions
between entities in complex systems. In general, a graph contains a collection
of entities called nodes and another collection of interactions between pairs
of nodes called edges.
• Nodes represent entities, which can be of any object type that is relevant to
our problem domain. By connecting nodes with edges, we end up with a graph
(network) of nodes.
• A graph database stores nodes and relationships instead of tables or
documents. Data is stored just as we might sketch ideas on a whiteboard. Our
data is stored without restricting it to a predefined model, allowing a very
flexible way of thinking about and using it.
• Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
• Graph databases are capable of sophisticated fraud prevention. With graph
databases, we can use relationships to process financial and purchase transactions
in near-real time. With fast graph queries, we are able to detect that, for example, a
potential purchaser is using the same email address and credit card as included in a
known fraud case.
• Graph databases can also help users easily detect relationship patterns, such
as multiple people associated with one personal email address, or multiple
people sharing the same IP address but residing at different physical
addresses.
• Graph databases are a good choice for recommendation applications. With graph
databases, we can store in a graph relationships between information categories
such as customer interests, friends and purchase history. We can use a highly
available graph database to make product recommendations to a user based on
which products are purchased by others who follow the same sport and have
similar purchase history.
• Graph theory is probably the main method in social network analysis in the early
history of the social network concept. The approach is applied to social network
analysis in order to determine important features of the network such as the nodes
and links (for example influencers and the followers).
• Influencers on social network have been identified as users that have impact on
the activities or opinion of other users by way of followership or influence on
decision made by other users on the network as shown in Fig. 1.2.1.


• Graph theory has proved to be very effective on large-scale datasets such as
social network data. This is because it can bypass building an actual visual
representation of the data and run directly on data matrices.
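The influencer idea can be sketched with a plain Python adjacency structure (the tiny network and the "most followers" criterion are illustrative simplifications of real social network analysis):

```python
from collections import defaultdict

# A tiny follower network: each edge (a, b) means "a follows b"
edges = [("ann", "bob"), ("carl", "bob"), ("dia", "bob"),
         ("bob", "ann"), ("dia", "ann")]

followers = defaultdict(set)  # node -> set of its followers
for follower, followed in edges:
    followers[followed].add(follower)

# An "influencer" here is simply the node with the most followers
influencer = max(followers, key=lambda n: len(followers[n]))
print(influencer, len(followers[influencer]))  # bob 3
```

Real graph databases generalize this: nodes and edges are first-class, and queries traverse relationships instead of joining tables.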

Audio, Image and Video

• Audio, image and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
• The terms audio and video commonly refer to time-based media storage formats
for sound/music and moving-picture information. Audio and video digital
recordings, encoded by audio and video codecs, can be uncompressed, losslessly
compressed, or lossily compressed depending on the desired quality and use
case.
• It is important to note that multimedia data is one of the most important
sources of information and knowledge; the integration, transformation, and
indexing of multimedia data bring significant challenges in data management
and analysis. Many challenges have to be addressed, including big data, the
multidisciplinary nature of data science, and heterogeneity.
• Data Science is playing an important role to address these challenges in
multimedia data. Multimedia data usually contains various forms of media, such as
text, image, video, geographic coordinates and even pulse waveforms, which come
from multiple sources. Data Science can be a key instrument covering big data,
machine learning and data mining solutions to store, handle and analyze such
heterogeneous data.

Streaming Data

• Streaming data is data that is generated continuously by thousands of data
sources, which typically send data records simultaneously and in small sizes
(on the order of kilobytes).
• Streaming data includes a wide variety of data, such as log files generated
by customers using your mobile or web applications, ecommerce purchases,
in-game player activity, information from social networks, financial trading
floors or geospatial services, and telemetry from connected devices or
instrumentation in data centers.
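A minimal sketch of streaming-style processing in Python, where a generator stands in for the stream and records are aggregated one at a time rather than held in memory (the source names and fields are made up):

```python
import random

# Simulated stream: many small records arriving continuously.
def event_stream(n):
    random.seed(42)  # deterministic for the illustration
    for i in range(n):
        yield {"source": f"sensor-{i % 3}", "value": random.randint(0, 100)}

# Streaming processing: update running totals per record, never
# holding the whole stream in memory at once.
totals = {}
count = 0
for event in event_stream(1000):
    totals[event["source"]] = totals.get(event["source"], 0) + event["value"]
    count += 1

print(count, sorted(totals))  # 1000 ['sensor-0', 'sensor-1', 'sensor-2']
```

Frameworks such as Spark Streaming or Flink apply this same record-at-a-time pattern at scale, across distributed clusters.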


Difference between Structured and Unstructured Data

Benefits and uses of data science and big data:

 Data science and big data are rapidly growing fields that offer a wide
range of benefits and uses across various industries. Some of the benefits
and uses of data science and big data are:
1. Improved decision-making: Data science and big data help
organizations make better decisions by analyzing and interpreting
large amounts of data. Data scientists can identify patterns, trends,
and insights that can be used to make informed decisions.
2. Increased efficiency: Data science and big data can help
organizations automate tasks, streamline processes, and optimize
operations. This can result in significant time and cost savings.
3. Personalization: With data science and big data, organizations
can personalize their products and services to meet the specific
needs and preferences of individual customers. This can lead to
increased customer satisfaction and loyalty.
4. Predictive analytics: Data science and big data can be used to
build predictive models that can forecast future trends and
behavior. This can be useful for businesses that need to anticipate
customer needs, market trends, or supply chain disruptions.
5. Fraud detection: Data science and big data can be used to detect
fraud and other types of financial crimes. By analyzing patterns
in financial data, data scientists can identify suspicious behavior
and prevent fraud.
6. Healthcare: Data science and big data can be used to improve
patient outcomes by analyzing large amounts of medical data.
This can lead to better diagnosis, treatment, and prevention of
diseases.
7. Marketing: Data science and big data can be used to improve
marketing strategies by analyzing consumer behavior and
preferences. This can help businesses target their marketing
campaigns more effectively and generate more leads and sales.

Facets of data:

Data can be characterized by several facets, including:

1. Volume: Refers to the amount of data that is generated and collected.
With the increasing prevalence of sensors, mobile devices, and social
media, data volumes are growing exponentially.


2. Velocity: Refers to the speed at which data is generated and processed.
Real-time data processing has become critical for many applications, such
as fraud detection and predictive maintenance.
3. Variety: Refers to the diversity of data sources and formats. Data can
come from structured sources such as databases, semi-structured sources
such as XML, or unstructured sources such as social media posts or
emails.
4. Veracity: Refers to the quality and accuracy of the data. Data can be
affected by errors, biases, and inconsistencies, which can impact the
results of data analysis.
5. Value: Refers to the usefulness and relevance of the data. Data must
provide meaningful insights or solve real-world problems to create value
for organizations.
6. Variability: Refers to the fluctuations and changes that occur in data
over time. For example, data may have seasonal patterns or show
different trends depending on the region or market.
7. Visualization: Refers to the ability to represent data in a way that is easy
to understand and analyze. Data visualization tools can help analysts and
decision-makers identify patterns and trends quickly.
8. Validity: Refers to the extent to which data measures what it is intended
to measure. Valid data is essential for making informed decisions based
on accurate insights.


The data science process:

The data science process typically involves the following steps:

1. Define the problem: The first step in the data science process is to define
the problem that you want to solve. This involves identifying the business
or research question that you want to answer and determining what data
you need to collect.
2. Collect and clean the data: Once you have identified the data that you
need, you will need to collect and clean the data to ensure that it is
accurate and complete. This involves checking for errors, missing values,
and inconsistencies.
3. Explore and visualize the data: After you have collected and cleaned the
data, the next step is to explore and visualize the data. This involves
creating summary statistics, visualizations, and other descriptive analyses
to better understand the data.
4. Prepare the data: Once you have explored the data, you will need to
prepare the data for analysis. This involves transforming and
manipulating the data, creating new variables, and selecting relevant
features.
5. Build the model: With the data prepared, the next step is to build a model
that can answer the business or research question that you identified in
step one. This involves selecting an appropriate algorithm, training the
model, and evaluating its performance.
6. Evaluate the model: Once you have built the model, you will need to
evaluate its performance to ensure that it is accurate and effective. This
involves using metrics such as accuracy, precision, recall, and F1 score to
assess the model's performance.

7. Deploy the model: After you have evaluated the model, the final step is to
deploy the model in a production environment. This involves integrating
the model into an application or workflow and ensuring that it can handle
real-world data and user inputs.
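The model-building and evaluation steps can be sketched end to end in plain Python; the dataset and the simple threshold "model" below are illustrative stand-ins for a real algorithm:

```python
# Steps 2-4: a small, already-cleaned dataset (hours studied -> passed exam)
data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (7, 1), (2, 1), (5, 0)]
train, test = data[:6], data[6:]

# Step 5: "train" a trivial threshold model on the training split
threshold = sum(x for x, _ in train) / len(train)  # mean hours = 3.5

def predict(x):
    return 1 if x >= threshold else 0

# Step 6: evaluate with accuracy, precision, recall, and F1
tp = fp = fn = tn = 0
for x, y in test:
    p = predict(x)
    if p == 1 and y == 1: tp += 1
    elif p == 1 and y == 0: fp += 1
    elif p == 0 and y == 1: fn += 1
    else: tn += 1

accuracy = (tp + tn) / len(test)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(accuracy, precision, recall, f1)
```

In practice the threshold model would be replaced by a real algorithm (regression, decision tree, etc.), but the train/evaluate split and the four metrics are computed exactly this way.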

The big data ecosystem and data science:

 The big data ecosystem and data science are closely related, as the former
provides the infrastructure and tools that enable the latter.
 The big data ecosystem refers to the set of technologies, platforms, and
frameworks that are used to store, process, and analyze large volumes of
data.
 Some of the key components of the big data ecosystem include:
1. Storage: Big data storage systems such as Hadoop Distributed File
System (HDFS), Apache Cassandra, and Amazon S3 are designed to
store and manage large volumes of data across multiple nodes.
2. Processing: Big data processing frameworks such as Apache Spark,
Apache Flink, and Apache Storm are used to process and analyze large
volumes of data in parallel across distributed computing clusters.

3. Querying: Big data querying systems such as Apache Hive, Apache Pig,
and Apache Drill are used to extract and transform data stored in big data
storage systems.
4. Visualization: Big data visualization tools such as Tableau, D3.js, and
Apache Zeppelin are used to create interactive visualizations and
dashboards that enable data scientists and business analysts to explore
and understand data.


5. Machine learning: Big data machine learning platforms such as Apache
Mahout, TensorFlow, and Microsoft Azure Machine Learning are used to
build and deploy machine learning models at scale.

Overview of the data science process:

The data science process can be summarized into a series of steps that are
typically followed in order to extract insights and knowledge from data. These
steps are as follows:

1. Problem definition: In this step, the problem that needs to be solved is
clearly defined. This involves identifying the goals, scope, and objectives
of the project, as well as any constraints and assumptions that need to be
considered.
2. Data collection: This step involves gathering the necessary data from
various sources. This may include internal data sources, such as databases
and spreadsheets, as well as external sources, such as public data sets and
web scraping.
3. Data preparation: Once the data has been collected, it needs to be
cleaned, preprocessed, and transformed into a format that can be used for
analysis. This may involve tasks such as data cleaning, data wrangling,
and data normalization.

4. Data exploration and visualization: This step involves exploring and
visualizing the data to gain a better understanding of its properties and
characteristics. This may include tasks such as data visualization,
summary statistics, and correlation analysis.
5. Data modeling: In this step, mathematical and statistical models are
developed to analyze the data and make predictions. This may include
tasks such as regression analysis, classification, clustering, and time
series analysis.
6. Model evaluation: Once the models have been developed, they need to
be evaluated to determine their accuracy and effectiveness. This may
involve tasks such as cross-validation, model selection, and hypothesis
testing.
7. Deployment: Finally, the insights and knowledge gained from the data
analysis are deployed in the form of reports, dashboards, and other
visualizations that can be used to inform decision-making and drive
business value.
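The data preparation step above (cleaning, wrangling, normalization) can be sketched in plain Python on made-up records:

```python
# Raw records as they might arrive: missing values, duplicates, mixed scales
raw = [
    {"id": 1, "age": 25, "income": 40000},
    {"id": 2, "age": None, "income": 52000},  # missing value
    {"id": 1, "age": 25, "income": 40000},    # duplicate of id 1
    {"id": 3, "age": 40, "income": 70000},
]

# Drop duplicates (by id), keeping the first occurrence
seen, rows = set(), []
for r in raw:
    if r["id"] not in seen:
        seen.add(r["id"])
        rows.append(dict(r))

# Fill missing ages with the mean of the known ones
known = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age

# Min-max normalize income to the range [0, 1]
incomes = [r["income"] for r in rows]
lo, hi = min(incomes), max(incomes)
for r in rows:
    r["income_norm"] = (r["income"] - lo) / (hi - lo)

print([(r["id"], r["age"], round(r["income_norm"], 2)) for r in rows])
```

Libraries such as pandas wrap each of these operations in one call, but the underlying logic of deduplication, imputation, and scaling is the same.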
The six data science process steps are as follows:
1. Frame the problem
2. Collect the raw data needed for your problem
3. Process the data for analysis
4. Explore the data
5. Perform in-depth analysis
6. Communicate results of the analysis
As the data science process stages help in converting raw data
into monetary gains and overall profits, any data scientist
should be well aware of the process and its significance. Now,
let us discuss these data science steps in detail.

Steps in Data Science Process


A data science process can be more accurately understood
through data science online courses and certifications on data
science. But, here is a step-by-step guide to help you get
familiar with the process.

Step 1: Framing the Problem


Before solving a problem, the pragmatic thing to do is to
understand exactly what the problem is. Vague business
questions must first be translated into actionable data
questions. People will, more often than not, give ambiguous
descriptions of their issues, and in this first step you have to
learn to turn those inputs into actionable outputs.
A great way to work through this step is to ask questions like:
 Who are the customers?
 How can we identify them?
 What does the sales process look like right now?
 Why are they interested in our products?
 Which products are they interested in?
Numbers alone rarely become insights without context, so by
the end of this step you should have gathered as much
information as possible.

Step 2: Collecting the Raw Data for the Problem
After defining the problem, you will need to collect the requisite
data to derive insights and turn the business problem into a
probable solution. The process involves thinking through your
data and finding ways to collect and get the data you need. It
can include scanning your internal databases or purchasing
databases from external sources.
Many companies store the sales data they have in customer
relationship management (CRM) systems. The CRM data can be
easily analyzed by exporting it to more advanced tools using
data pipelines.
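As a sketch of what loading an exported CRM dataset might look like, the snippet below parses a small CSV export with Python's standard csv module. The column names and values are invented for illustration; a real export would come from a file or a data pipeline.

```python
import csv
import io

# Hypothetical CRM export; in practice this would be a file on disk
# or a feed from a data pipeline.
crm_export = io.StringIO(
    "customer_id,region,total_spend\n"
    "C001,North,1200\n"
    "C002,South,450\n"
    "C003,North,980\n"
)

# DictReader maps each row to a dict keyed by the header columns.
rows = list(csv.DictReader(crm_export))
print(f"loaded {len(rows)} customer records")
print(rows[0]["region"])
```

Once the records are in memory as structured rows, they can be handed to the processing step described next.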

Step 3: Processing the Data for Analysis


After the first and second data science process steps, when you
have all the data you need, you will have to process it before
going further and analyzing it. Data can be messy if it has not
been appropriately maintained, leading to errors that easily
corrupt the analysis. These issues can be values set to null
when they should be zero or the exact opposite, missing
values, duplicate values, and many more. You will have to go
through the data and check it for problems to get more
accurate insights.
The most common errors that you can encounter and should
look out for are:
1. Missing values
2. Corrupted values like invalid entries
3. Time zone differences
4. Date range errors like a recorded sale before the sales even
started
You should also look at aggregates across the rows and
columns of the file and check whether the values you obtain
make sense. If they do not, remove or replace the data that
does not make sense. Once the cleaning process is complete,
your data is ready for exploratory data analysis (EDA).
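A minimal sketch of this cleaning pass, assuming invented sales records that exhibit the error types listed above (missing values, duplicate rows, and a sale recorded before sales even started):

```python
from datetime import date

# Hypothetical raw sales records containing the common errors
# described above; the ids, amounts, and dates are invented.
SALES_START = date(2023, 1, 1)
raw = [
    {"id": 1, "amount": 100.0, "sold_on": date(2023, 3, 1)},
    {"id": 2, "amount": None,  "sold_on": date(2023, 3, 2)},   # missing value
    {"id": 1, "amount": 100.0, "sold_on": date(2023, 3, 1)},   # duplicate row
    {"id": 3, "amount": 80.0,  "sold_on": date(2022, 12, 1)},  # impossible date
]

seen, clean = set(), []
for rec in raw:
    if rec["amount"] is None:          # drop records with missing values
        continue
    if rec["sold_on"] < SALES_START:   # drop out-of-range dates
        continue
    if rec["id"] in seen:              # drop duplicate ids
        continue
    seen.add(rec["id"])
    clean.append(rec)

print(f"{len(clean)} clean records out of {len(raw)}")
```

In a real project the same checks would be applied with a data-frame library, but the logic is the same: filter out what cannot be trusted before analyzing.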

Step 4: Exploring the Data

In this step, you develop ideas that can help identify hidden
patterns and insights. You look for interesting patterns in the
data, such as why the sales of a particular product or service
have gone up or down, and examine such behavior more
thoroughly. This is one of the most crucial steps in the data
science process.
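A minimal, dependency-free sketch of such exploration: summary statistics plus a hand-computed Pearson correlation on invented advertising figures (the numbers are hypothetical):

```python
import statistics

# Hypothetical monthly figures: advertising spend vs. units sold.
ad_spend   = [10, 20, 30, 40, 50]
units_sold = [12, 25, 29, 43, 51]

print("mean spend:", statistics.mean(ad_spend))
print("stdev sold:", round(statistics.stdev(units_sold), 2))

def pearson(xs, ys):
    # Pearson correlation computed by hand: covariance divided by
    # the product of the two standard deviations.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(ad_spend, units_sold)
print(f"correlation: {r:.3f}")
```

A correlation near 1 here would suggest spend and sales move together, a pattern worth investigating more deeply in the next step.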

Step 5: Performing In-depth Analysis


This step will test your mathematical, statistical, and
technological knowledge. You must use all the data science
tools at your disposal to crunch the data and discover every
insight you can. You might have to build a predictive model
that compares your average customers with those who are
underperforming, and your analysis might reveal factors such
as age or social media activity as crucial predictors of who will
consume a service or product.
You might also find aspects that affect customer behavior,
such as some people preferring to be reached over the phone
rather than through social media. These findings can prove
helpful, because much of today's marketing happens on social
media and is aimed mainly at the young, yet how a product is
marketed hugely affects sales, and you may discover
overlooked demographics that are worth targeting after all.
Once you are done with this step, you can combine the
quantitative and qualitative findings and turn them into action.
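As a toy illustration of the kind of predictive model described here, the sketch below uses a nearest-centroid rule to compare a new customer with the "average" buyer and non-buyer. The features (age, weekly social-media hours) and all data points are invented:

```python
# Invented training data: (age, weekly social-media hours).
buyers     = [(22, 14), (25, 10), (30, 12)]
non_buyers = [(55, 2), (48, 1), (60, 3)]

def centroid(points):
    # The "average" customer of a group: the mean of each feature.
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def predict(customer):
    # Classify by whichever group centroid is closer (squared
    # Euclidean distance; no square root needed for comparison).
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    cb, cn = centroid(buyers), centroid(non_buyers)
    return "buyer" if dist2(customer, cb) < dist2(customer, cn) else "non-buyer"

print(predict((27, 11)))
print(predict((58, 2)))
```

Even this crude model makes the point of the paragraph above: the features you choose (age, channel preference, activity) drive what the model can predict.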

Step 6: Communicating Results of the Analysis
After all these data science steps, it is vital to convey your
insights and findings to the stakeholders, such as the sales
head, and make them understand their importance.
Communicating appropriately helps solve the problem you
were given: proper communication leads to action, while
poor communication leads to inaction.

What is Machine Learning?

Machine learning is a part of artificial intelligence and a subfield of data
science. It is a growing technology that enables machines to learn from past data
and perform a given task automatically. It can be defined as:

Machine learning allows computers to learn from past experience on their own;
it uses statistical methods to improve performance and predict outputs
without being explicitly programmed.

Popular applications of ML include email spam filtering, product
recommendations, online fraud detection, and more.
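As a toy illustration of the spam-filtering application, the sketch below scores a message by whether its words are more common in a handful of invented spam or ham examples. Real filters use far more sophisticated statistical models; this only shows the learning-from-examples idea.

```python
from collections import Counter

# Invented training messages for each class.
spam = ["win money now", "free money offer", "claim your free prize"]
ham  = ["meeting at noon", "see you at lunch", "project update attached"]

# Count how often each word appears in each class.
spam_words = Counter(w for msg in spam for w in msg.split())
ham_words  = Counter(w for msg in ham for w in msg.split())

def classify(message):
    score = 0
    for w in message.split():
        # +1 for words more frequent in spam, -1 for words more
        # frequent in ham; unseen words contribute 0.
        score += (spam_words[w] > ham_words[w]) - (ham_words[w] > spam_words[w])
    return "spam" if score > 0 else "ham"

print(classify("free money"))
print(classify("lunch meeting"))
```

The classifier was never told any rule explicitly; its behavior comes entirely from the example messages, which is the defining trait of machine learning.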

Skills Needed for the Machine Learning Engineer:

o Understanding and implementation of Machine Learning Algorithms.


o Natural Language Processing.
o Good Programming knowledge of Python or R.

o Knowledge of Statistics and probability concepts.


o Knowledge of data modeling and data evaluation.

Where is Machine Learning used in Data Science?

The use of machine learning in data science can be understood by the development
process or life cycle of Data Science. The different steps that occur in Data science
lifecycle are as follows:

1. Business Requirements: In this step, we try to understand the requirements
of the business problem we want to solve. Suppose we want to create a
recommendation system, and the business requirement is to increase sales.
2. Data Acquisition: In this step, the data is acquired to solve the given
problem. For the recommendation system, we can get the ratings provided
by the user for different products, comments, purchase history, etc.
3. Data Processing: In this step, the raw data acquired from the previous step
is transformed into a suitable format, so that it can be easily used by the
further steps.
4. Data Exploration: It is a step where we understand the patterns of the data,
and try to find out the useful insights from the data.
5. Modeling: The data modeling is a step where machine learning algorithms
are used. So, this step includes the whole machine learning process. The
machine learning process involves importing the data, data cleaning,
building a model, training the model, testing the model, and improving the
model's efficiency.
6. Deployment & Optimization: This is the last step where the model is
deployed on an actual project, and the performance of the model is checked.
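The lifecycle above, compressed into a runnable toy: the snippet walks through acquisition, processing, exploration, modeling, and a quick evaluation on an invented "will this user buy again?" dataset (a real project would evaluate on a held-out test set, as discussed earlier).

```python
# 2. Data acquisition: invented (past_purchases, bought_again) records.
records = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 1), (5, 1), (1, 0), (2, 0)]

# 3. Data processing: keep only valid, non-negative feature values.
records = [(x, y) for x, y in records if x >= 0]

# 4. Data exploration: how often do customers buy again overall?
base_rate = sum(y for _, y in records) / len(records)

# 5. Modeling: predict "buys again" when past purchases exceed a
#    threshold learned from the data (here, simply the feature mean).
threshold = sum(x for x, _ in records) / len(records)

def predict(x):
    return 1 if x > threshold else 0

# 6. Deployment check: accuracy of the model's predictions.
accuracy = sum(predict(x) == y for x, y in records) / len(records)
print(f"base rate {base_rate:.2f}, accuracy {accuracy:.2f}")
```

Comparing the model's accuracy against the base rate shows whether the learned threshold actually adds value over always guessing the majority outcome.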

Comparison Between Data Science and Machine Learning

The table below describes the basic differences between data science and ML:

| Data Science | Machine Learning |
| --- | --- |
| It deals with understanding and finding hidden patterns or useful insights in the data, which helps in making smarter business decisions. | It is a subfield of data science that enables machines to learn from past data and experience automatically. |
| It is used for discovering insights from the data. | It is used for making predictions and classifying results for new data points. |
| It is a broad term that covers the various steps needed to create a model for a given problem and deploy it. | It is used in the data modeling step of the data science process. |
| A data scientist needs skills with big data tools such as Hadoop, Hive, and Pig, plus statistics and programming in Python, R, or Scala. | A machine learning engineer needs computer science fundamentals, programming skills in Python or R, and knowledge of statistics and probability. |
| It can work with raw, structured, and unstructured data. | It mostly requires structured data to work on. |
| Data scientists spend much of their time handling and cleansing data and understanding its patterns. | ML engineers spend much of their time managing the complexities that arise when implementing algorithms and the mathematical concepts behind them. |
