
BCA Lecture I

The document provides an overview of data science, its processes, and its significance in deriving insights from big data. It discusses the types and characteristics of big data, applications across various industries, and the challenges faced in managing it. Additionally, it explains the relationship between data science, machine learning, and artificial intelligence, detailing the skills required for data scientists and the steps involved in the data science process.

Uploaded by namrata.valecha
Copyright © All Rights Reserved

Data Science:

Introduction &
Process Overview
Important Terms
• Data Science: Data science is about optimizing processes and resources. It produces insights and actionable, data-informed conclusions or predictions that you can use to understand and improve business, investments, health, and lifestyle.
• Big data: Big data is a term used to describe collections of data that are huge in volume and still growing exponentially over time.
• Example: Spotify, an on-demand music platform, uses big data analytics: it collects data from all its users around the globe, then uses the analyzed data to give informed music recommendations and suggestions to every individual user.
Types of Big data
1. Structured data: Any data that can be stored, accessed, and processed in a fixed format is termed structured data.
2. Unstructured data: Data whose form is unknown, which cannot be stored in an RDBMS, and which cannot be analyzed until it is transformed into a structured format, is called unstructured data. This type poses multiple processing challenges when deriving value from it. Text files and multimedia content such as images, audio, and video are examples of unstructured data.
3. Semi-structured data: Semi-structured data does not conform to a fixed relational schema, but it carries tags or markers that give it some organizational structure; XML and JSON files are common examples.
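The three types can be illustrated with a short Python sketch; the records, field names, and values below are invented for illustration:

```python
# Sketch: the three types of big data, shown with toy records.
import json

# Structured: fixed format, fits a relational table (rows x named columns).
structured_row = {"user_id": 101, "name": "Asha", "age": 23, "city": "Pune"}

# Semi-structured: self-describing tags but no rigid schema --
# note the nested field that a fixed table could not hold directly.
semi_structured = json.loads(
    '{"user_id": 102, "name": "Ravi", "contacts": {"email": "ravi@example.com"}}'
)

# Unstructured: free text (or images/audio/video) with no inherent schema;
# it must be transformed before a model or an SQL query can use it.
unstructured = "Ravi posted: loving the new playlist, great recommendations!"

print(structured_row["city"])                # direct column access
print(semi_structured["contacts"]["email"])  # path varies record to record
print(len(unstructured.split()))             # only crude features without NLP
```

The point of the sketch: the structured record supports direct, uniform access; the semi-structured one needs per-record navigation; the unstructured text yields nothing useful until it is transformed.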
Characteristics of Big data
1. Volume: The name "big data" itself refers to an enormous size. The size of the data plays a crucial role in determining the value that can be extracted from it.
2. Variety: Variety refers to the heterogeneous sources and nature of data, both structured and unstructured. Data arrives in the form of emails, photos, videos, monitoring-device output, PDFs, and audio. This variety of unstructured data poses certain issues for storing, mining, and analyzing the data.
3. Velocity: Velocity refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential of the data. Big data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices. The flow of data is massive and continuous.
Applications of Big data
1. Smarter healthcare: Making use of petabytes of patient data, an organization can extract meaningful information and build applications that predict a patient's deteriorating condition in advance.
2. Search quality: Every time we extract information from Google, we simultaneously generate data for it. Google stores this data and uses it to improve its search quality.
3. Manufacturing: Analyzing big data in the manufacturing industry can reduce component defects, improve product quality, increase efficiency, and save time and money.
4. Telecom: The telecom sector collects information, analyzes it, and provides solutions to different problems. By using big data applications, telecom companies have been able to significantly reduce data packet loss, which occurs when networks are overloaded, thus providing a seamless connection to their customers.
Challenges of Big data
1. Data quality: The problem here is the 4th V, i.e. variability. The data is often messy, inconsistent, and incomplete. Dirty data can cost companies approximately $600 billion.
2. Security: Since the data is huge in size, keeping it secure is another challenge. This includes user authentication, restricting access per user, recording data-access histories, proper use of data encryption, etc.
3. Storage: The more data an organization has, the more complex the problem of managing it becomes. This calls for a storage system that can easily scale up or down on demand.
Significance of Data Science
1. The principal purpose of data science is to find patterns within data. It uses various statistical techniques to analyse the data and draw insights from it.
2. A data scientist must scrutinize the data thoroughly in order to:
• Make predictions from the data.
• Assist companies in making smarter business decisions.
• Data science churns raw data into meaningful insights; this is why industries need data science.
Data Science, Machine Learning and Artificial
Intelligence

1. Data Science Produces Insights


• Data science is distinguished from the other two fields because its goal
is an especially human one: to gain insight and understanding.
• The main distinction is that in data science there is always a human in
the loop: someone is understanding the insight, seeing the figure, or
benefitting from the conclusion.
• This definition of data science thus emphasizes:
• Statistical inference
• Data visualization
• Experiment design
• Domain knowledge
• Communication
Data Science, Machine Learning and Artificial
Intelligence

2. Machine Learning Produces Predictions


• We can think of machine learning as the field of prediction: of
“Given instance X with particular features, predict Y about it.”
• These predictions could be about the future, but they also could
be about qualities that aren’t immediately obvious to a
computer.
• Almost all Kaggle competitions qualify as machine learning
problems: they offer some training data, and then see if
competitors can make accurate predictions about new
examples.
• There’s plenty of overlap between data science and machine
learning. For example, logistic regression can be used to draw
insights about relationships and to make predictions.
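As a sketch of that overlap, the snippet below fits a single logistic regression and uses it in both roles. It assumes scikit-learn and NumPy are installed; the features, labels, and data are synthetic and invented for illustration:

```python
# One logistic regression used two ways: insight (data science)
# and prediction (machine learning). Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two hypothetical features, e.g. hours_listened and account_age;
# label: did the user renew? Renewal is driven mostly by feature 0.
X = rng.normal(size=(200, 2))
y = (2.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Data-science use: inspect coefficients for *insight* --
# which feature relates most strongly to renewal?
print("coefficients:", model.coef_[0])

# Machine-learning use: *predict* the label for a new, unseen instance.
new_user = np.array([[1.5, -0.3]])
print("predicted class:", model.predict(new_user)[0])
```

The same fitted object answers both questions: the coefficients describe a relationship, while `predict` scores a new instance.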
Data Science, Machine Learning and Artificial
Intelligence

3. Artificial Intelligence Produces Actions


• Artificial intelligence is by far the oldest and the most widely recognized of
these three designations, and as a result it’s the most challenging to define.
• One common thread in definitions of AI is that an autonomous agent
executes or recommends actions. Some systems I think should be described
as AI include:

• Game-playing algorithms (Deep Blue, AlphaGo)


• Robotics and control theory (motion planning, walking a bipedal robot)
• Optimization (Google Maps choosing a route)
• Natural language processing (bots). By “bots” here I mean systems meant to interpret natural language and then respond in kind. This can be distinguished from text mining, where the goal is to extract insights (data science), and from text classification, where the goal is to categorize documents (machine learning).
• Reinforcement learning
Case Study: How Would the Three
Be Used Together?

• Suppose we were building a self-driving car, and were working on the specific problem of stopping at stop signs. We would need skills drawn from all three of these fields.
• ML: The car has to recognize a stop sign using its cameras. We construct a dataset of millions of photos of streetside objects, and train an algorithm to predict which of them contain stop signs.
• AI: Once our car can recognize stop signs, it needs to decide when
to take the action of applying the brakes. It’s dangerous to apply
them too early or too late, and we need it to handle varying road
conditions (for example, to recognize on a slippery road that it is not
slowing down quickly enough), which is a problem of control theory.
Case Study: How Would the Three
Be Used Together?

• Data science: In street tests we find that the car’s performance isn’t good enough, with some false negatives in which it drives right by a stop sign.
• After analyzing the street test data, we gain the insight that
the rate of false negatives depends on the time of day: it is
more likely to miss a stop sign before sunrise or after sunset.
• We realize that most of our training data included only
objects in full daylight, so we construct a better dataset
including nighttime images and go back to the machine
learning step.
Skill set needed for Data Scientist

• Domain Expertise
• Mathematics
• Computer Science
• Collaborative Skills
• Business Awareness
• Soft Skills
Data Science process
1. Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modelling
6. Presentation and automation
Setting the research goal
• First prepare a project charter.
• This charter contains information such as what you’re
going to research, how the company benefits from that,
what data and resources you need, a timetable, and
deliverables.
• The project charter contains the details about which data
you need and where you can find it.
Retrieving data

• The second step is to collect data.


• In this step you ensure that you can use the data in your
program, which means checking the existence of, quality,
and access to the data.
• Data can also be delivered by third-party companies and
takes many forms ranging from Excel spreadsheets to
different types of databases.
Data Preparation
• Data collection is an error-prone process; in this phase you enhance the quality of the data and prepare it for use in subsequent steps.
• This phase consists of three sub-phases:
• Data cleansing removes false values from a data source and inconsistencies across data sources.
• Data integration enriches data sources by combining information from multiple data sources.
• Data transformation ensures that the data is in a suitable format for use in your models.
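The three sub-phases above can be sketched with pandas (assumed installed); the tables, column names, and values are invented for illustration:

```python
# Sketch of the data-preparation sub-phases: cleansing, integration,
# transformation. All data here is made up.
import pandas as pd

# Two raw sources with quality problems (duplicate, missing, negative).
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3],
                      "amount": [100.0, None, 250.0, -50.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "region": ["north", "south", "east"]})

# 1. Cleansing: drop duplicate customers, remove impossible (negative)
#    amounts, and fill the remaining missing amount with the median.
sales = sales.drop_duplicates(subset="cust_id")
sales = sales[sales["amount"].isna() | (sales["amount"] >= 0)]
sales["amount"] = sales["amount"].fillna(sales["amount"].median())

# 2. Integration: enrich sales with region info from a second source.
merged = sales.merge(customers, on="cust_id", how="left")

# 3. Transformation: put the data in a model-friendly format
#    (one-hot encode the categorical region column).
prepared = pd.get_dummies(merged, columns=["region"])
print(prepared)
```

Each step is deliberately simple; in practice the cleansing rules come from the domain, not from a fixed recipe.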
Data Exploration
• Data exploration is concerned with building a deeper
understanding of your data.
• Here we try to understand how variables interact with
each other, the distribution of the data, and whether
there are outliers.
• To achieve this you mainly use descriptive statistics,
visual techniques, and simple modelling.
• This step often goes by the abbreviation EDA, for
Exploratory Data Analysis.
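A minimal EDA pass along these lines, assuming pandas is installed and using a small invented dataset:

```python
# Minimal EDA: distributions, variable interaction, and outliers.
import pandas as pd

df = pd.DataFrame({"age":    [23, 25, 31, 35, 29, 27, 95],   # 95 looks suspect
                   "income": [30, 32, 45, 52, 41, 38, 40]})

# Distribution of each variable: count, mean, std, quartiles.
print(df.describe())

# How do the variables interact? A simple Pearson correlation.
print(df["age"].corr(df["income"]))

# Flag outliers: values more than 2 standard deviations from the mean.
z = (df["age"] - df["age"].mean()) / df["age"].std()
outliers = df[z.abs() > 2]
print(outliers)
```

Descriptive statistics, a correlation check, and a z-score rule are the "simple modelling" end of EDA; visual techniques (histograms, scatter plots) would normally accompany them.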
Data Modelling
• In this phase you use models, domain knowledge, and
insights about the data you found in the previous steps to
answer the research question.
• You select a technique from the fields of statistics,
machine learning, operations research, and so on.
• Building a model is an iterative process that involves
selecting the variables for the model, executing the
model, and model diagnostics.
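The iterative select-execute-diagnose loop might be sketched as follows, assuming scikit-learn is installed; the two candidate techniques and the cross-validation diagnostic are illustrative choices, not the only ones:

```python
# Sketch of iterative modelling: try candidate techniques, run a
# diagnostic (cross-validated accuracy), and keep the best performer.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# A built-in toy dataset stands in for the research question's data.
X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation as a simple model diagnostic.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")

best = max(scores, key=scores.get)
print("selected model:", best)
```

In a real project the loop also revisits variable selection and may send you back to data preparation, as the stop-sign case study illustrated.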
Presentation and automation

• Finally, you present the results to your business.
• These results can take many forms, ranging from presentations to research reports.
• Business stakeholders may want to use the insights you gained in another project, or to enable an operational process to use the outcome of your model.