Chapter-4-Data-Science-and-Big-Data

Chapter 4 discusses the significance of data science and big data, highlighting their applications in various industries such as banking, advertising, and e-commerce. It outlines the data science project lifecycle, including problem definition, data collection, modeling, and evaluation, while also addressing the challenges faced in data science projects. The chapter emphasizes the importance of machine learning in predicting customer behavior and improving business outcomes, particularly in risk management and fraud detection.

Uploaded by

doaaomar123

Chapter 4: Data Science and Big Data

Introduction
• Most of the research and algorithms in data science have
resulted from decades of work by statisticians, mathematicians,
and scientists.
• The recent increase in the use of data science and machine
learning algorithms is due to two factors:
  • Data availability: the growing use of digital devices and
  services, together with the rise of sensor technology and
  connected objects, has led to the generation of more and
  more data. Moreover, cheaper storage is available to
  collect and store that data.
  • Computing power: high computing power has become available
  at a low cost to process big datasets and run these complex
  algorithms.
• Data science and big data are firmly embedded today in
many products and services that meet our daily needs.
• Some common examples of everyday applications of
data science and big data include:
  • Biometrics, such as face and fingerprint recognition, for
  unlocking phones.
  • Matching drivers and commuters to simplify booking rides
  using taxi or other ride apps.
  • Recommendations for watching videos and shopping on
  e-commerce websites.
  • Targeting, such as product advertisements based on search
  engine history.
• The following industry use cases illustrate in more detail
where data science and big data are used and have proven to
be successful:
• Promotional emails and digital ads: On a typical day, people
receive several promotional emails, SMS messages, or ads for
things such as hotel discounts, airline vouchers, and shopping
discounts. Sometimes these promotional messages are targeted at
specific customers. With the advent of the internet, companies have been
gathering and storing extensive data about their customers, such as
demographics, past purchases, browsing patterns, and affluence
level.

• They use this data to understand the characteristics of their customers and
build machine learning models, based on those characteristics, that estimate
the likelihood of a customer responding to a particular ad at a given
moment. Depending on this likelihood, customers are contacted with the
right offers, which in many cases are personalized.
• Recommendation engines: Every day, we come across various
recommendation engines, such as those for friends on Facebook, videos on
YouTube, products on Amazon, and movies on Netflix. These systems
show personalized offerings to people based on factors like their search
histories and preferences. It is widely known that Amazon uses
recommendation algorithms to personalize the shopping experience for
each customer.

• Customers see recommended items that are based on their previous
browsing behavior, product ratings, similar product purchases, and other
factors. Along similar lines, Netflix recommends personalized content
(movies and TV shows) based on a user’s viewing history and ratings, the
viewing history of similar users, and so forth. Furthermore, these
recommendations are diverse, include new releases, and adapt over time in
response to changing user preferences.
• Risk management: Banks and credit card companies have used credit scores to evaluate
loan or credit card applications for many years. A credit score typically indicates the risk
level of a customer, such as the likelihood that the customer will default. The scoring model
takes into consideration several parameters, such as payment history, length of credit
history, inquiries, and income, and predicts a customer’s likelihood of default by comparing
these characteristics with those of other customers who have defaulted in the past.

• Fraud detection: Fraud does not follow a constant pattern, so building a fraud detection
model is an evolving process: as more and more cases are investigated and flagged, fraud
can be identified more reliably. For credit card companies, investigating a large
number of transactions one by one is not possible. Therefore, fraud detection models are
used that can auto-approve legitimate transactions and raise alerts for any
transactions that appear fraudulent. Events such as a sudden big purchase after many
small purchases, purchases that do not fit the cardholder’s profile, and unusual
geographical locations can raise suspicion and cause the transaction or card to be
blocked automatically.
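The suspicious events listed above can be written as simple rules. The following is only a minimal sketch, not a real fraud detection model: the function name `is_suspicious`, the field names, and the tenfold `spike_factor` threshold are all assumptions made for illustration.

```python
from statistics import mean

def is_suspicious(amount, country, recent_amounts, home_country, spike_factor=10.0):
    """Flag a transaction using two hypothetical rules taken from the text."""
    # Rule 1: a sudden big purchase after many small purchases.
    if recent_amounts and amount > spike_factor * mean(recent_amounts):
        return True
    # Rule 2: an unusual geographical location for this cardholder.
    if country != home_country:
        return True
    return False

# Made-up purchase history for one cardholder.
history = [12.0, 8.5, 15.0, 9.9]
print(is_suspicious(14.0, "EG", history, home_country="EG"))   # False
print(is_suspicious(900.0, "EG", history, home_country="EG"))  # True
```

In practice such fixed rules would be combined with, or replaced by, a model trained on flagged cases, since the text stresses that fraud patterns keep changing.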
What is Data Science?
• Data science is an interdisciplinary field that uses
concepts from statistics, mathematics, computer science,
and domain expertise to extract meaningful insights from
data that can generate business value.
• A data scientist is a specialist across these domains
who extracts meaning from data by
performing data and statistical analysis; building and
applying machine learning models and algorithms; and
visualizing, summarizing, and communicating the results.
• The role of a business/data analyst involves using a lot of
domain expertise and data analysis to generate insights
from the data, whereas the role of a data scientist is
slightly broader and also includes machine learning
modeling and programming.
Machine Learning vs. Artificial Intelligence

• Artificial intelligence (AI) is a field of computer science focused on making
computers more intelligent so that they can imitate intelligent human
behavior. AI aims to enable computers to engage in human-like thought
processes, such as learning, reasoning, and self-correction.
• Machine learning enabled computers to acquire real-world knowledge by
identifying patterns in data and correcting themselves through this
learning process in order to make decisions.
• Machine learning algorithms had a limitation: their performance depended
on how the data was presented to them. Deep learning algorithms address
this problem by extracting the relevant information from the data themselves.
Deep learning is a particular type of machine learning that makes
computer learning more powerful, flexible, and abstract.

Data science is a broader term that covers the whole process of building a
machine learning model, including data collection, data processing, data
analysis, data visualization, modeling, making predictions, and so forth.

What is Big Data?

Big data is data that traditional database software tools cannot manage
and analyze because of its size and complexity.

Data points are created in our daily activities, such as:

• Social media activity (Twitter tweets and messages; Facebook likes, shares, and
messages).
• Search engine activity (Google searches, creating web pages).
• Video streaming (watching or uploading videos on YouTube or Netflix).
• Payments (using credit cards, internet banking).

Data Science and Big Data in Industry Practice

Data science project lifecycle:
Before starting to work on a data science task, the first step is to define the
objective of the business problem clearly.
This involves answering questions about different aspects of the problem, such
as:
• What is the business problem, and how can it be translated into a data science
problem?
• What data is required for the problem, and how should it be collected and prepared?
• What insights can be generated from the data, and which machine learning or deep
learning models can be useful for the problem?
• How should the model be built and evaluated, and how is success defined (KPIs,
metrics, and so on) for the solution?
• How can the impact of a machine learning model be measured in the real world?
Case Study
Case study: the need for data science in banks:
• A bank offered personal loans to 800,000 customers with savings or credit card
accounts over five years (2013–2018). While the program aimed to benefit
customers and increase bank revenue, it ultimately resulted in losses.
• The loan repayments were set up as monthly installments, determined by the loan
amount, tenure, and interest rate. If a customer’s account didn’t have enough
funds to cover an installment, the loan became “delinquent” (missed payment).
Approximately 10% of customers missed their payments, leading to
dissatisfaction due to hefty penalty fees.
• The bank's risk management system didn’t account for the likelihood of customers
missing payments, which negatively impacted both the loan program and overall
customer satisfaction. To address this issue, the bank’s management tasked the
data science team with identifying customers likely to miss payments in the next
three months. By doing so, the bank could proactively call these customers and
remind them of their payment, testing whether such reminders could reduce
missed payments.
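The case study says each installment is determined by the loan amount, tenure, and interest rate, but it does not give the formula the bank used. As an illustration only, the sketch below assumes the standard fixed-payment (annuity) formula; the function name `monthly_installment` and the sample loan are hypothetical.

```python
def monthly_installment(principal, annual_rate, months):
    """Fixed monthly payment under the standard annuity formula (an assumption)."""
    r = annual_rate / 12.0             # monthly interest rate
    if r == 0:
        return principal / months      # interest-free loan: equal slices
    return principal * r / (1.0 - (1.0 + r) ** -months)

# Hypothetical loan: 10,000 at 12% a year, repaid over 12 months.
payment = monthly_installment(10_000, 0.12, 12)
print(round(payment, 2))  # 888.49
```

A loan becomes delinquent in the sense used above when the account balance on the due date is lower than this payment.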
Objective: Identify the 30,000 customers most likely to miss loan payments in
the next three months to prioritize telephone reminders.

A few ways to approach this problem are:

Approach 1: Account Balance Threshold

• Method: Identify customers whose account balance falls below a specific amount,
such as 2–3 times their monthly installment.
• Pros: Simple and quick to implement.
• Cons: Doesn’t account for account balance fluctuations (e.g., customers waiting
for their salary but typically paying on time).
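Approach 1 amounts to a single comparison per customer. A minimal sketch, assuming each customer is a dictionary with hypothetical `id`, `balance`, and `installment` fields:

```python
def flag_low_balance(customers, multiple=2.0):
    """Return ids of customers whose balance is below `multiple` x installment."""
    return [c["id"] for c in customers
            if c["balance"] < multiple * c["installment"]]

# Made-up records for illustration.
customers = [
    {"id": "A", "balance": 500.0, "installment": 300.0},
    {"id": "B", "balance": 1200.0, "installment": 300.0},
]
print(flag_low_balance(customers))  # ['A']
```

The weakness noted in the cons is visible here: the check sees only today's balance, so a customer whose salary arrives tomorrow is flagged just like a customer in real difficulty.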
Approach 2: Rule-Based Analysis
• Method: Analyze historical data of customers who missed payments and create
rules based on common characteristics (e.g., customers aged 30–40 with an
income below $2,500 are more likely to miss payments).
• Pros: Provides insights into customer behavior.
• Cons: Rules may not account for all variations and might miss some patterns.
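A rule such as the example in Approach 2 can be checked against historical records before it is used. The sketch below is illustrative: the record fields (`age`, `income`, `missed`) and the data are invented, and `rule_hit_rate` is a hypothetical helper name.

```python
def age_income_rule(customer):
    """The example rule from the text: aged 30-40 with income below $2,500."""
    return 30 <= customer["age"] <= 40 and customer["income"] < 2500

def rule_hit_rate(history, rule):
    """Share of customers matching the rule who actually missed a payment."""
    matched = [c for c in history if rule(c)]
    if not matched:
        return None  # the rule matched nobody, so no rate can be computed
    return sum(c["missed"] for c in matched) / len(matched)

# Made-up history; missed = 1 means the customer missed a payment.
history = [
    {"age": 35, "income": 2000, "missed": 1},
    {"age": 38, "income": 2400, "missed": 0},
    {"age": 32, "income": 1800, "missed": 1},
    {"age": 50, "income": 2000, "missed": 1},
    {"age": 33, "income": 4000, "missed": 0},
]
print(rule_hit_rate(history, age_income_rule))
```

The fourth record shows the limitation named in the cons: a customer who missed a payment but falls outside the rule is never caught.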

Approach 3: Machine Learning

• Method: Use historical data to build a machine-learning model that predicts the
likelihood of missing payments for each customer.
• Pros: Generates a specific score for each customer and adapts to changing
patterns over time.
• Cons: Requires more data and resources for implementation.
• Steps to Implement a Machine Learning Approach:

Step 1: Data Collection and Preparation:

• Gather data about customers who missed payments vs. those who
didn’t since 2013.
• Collect relevant details such as:
• Account balance history.
• Demographics (e.g., age, income).
• Loan details (e.g., tenure, amount, interest rate).
• Clean and preprocess the data for accuracy and consistency.
Step 2: Modeling
• Supervised Learning: Use labeled data (e.g., delinquent = Yes/No) to
train predictive models.
• Unsupervised Learning: Explore patterns without predefined targets to
find useful relationships in the data.
• Apply algorithms such as regression or decision trees to generate
predictions for each customer.
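To make the supervised learning step concrete, here is a minimal sketch of logistic regression trained by gradient descent on a single feature. It is not the bank's model: the feature (account balance divided by the monthly installment), the learning rate, the number of epochs, and the data are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit w and b so that P(delinquent | x) = sigmoid(w * x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the average log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical labeled data: 1 = delinquent, 0 = paid on time.
balance_ratio = [0.5, 0.8, 1.0, 1.5, 3.0, 4.0, 5.0, 6.0]
delinquent = [1, 1, 1, 0, 0, 0, 0, 0]

w, b = train_logistic(balance_ratio, delinquent)
for ratio in (0.6, 5.5):
    # A lower balance ratio should receive a higher predicted risk.
    print(ratio, round(sigmoid(w * ratio + b), 3))
```

A real project would use many features and a tested library implementation; the point here is only that the model outputs a score between 0 and 1 for each customer.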
Step 3: Model Evaluation
• Evaluate how well the model predicts delinquency by testing it on new
data.
• Metrics to check:
• Accuracy: How often the model predicts correctly.
• Precision: How many of the predicted "delinquents" are actual delinquents?
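Both metrics can be computed directly from the actual and predicted labels. A minimal sketch with made-up labels, where 1 means "delinquent":

```python
def accuracy(actual, predicted):
    """Fraction of all predictions that match the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def precision(actual, predicted):
    """Fraction of predicted delinquents that are actual delinquents."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 1)
    pred_pos = sum(1 for p in predicted if p == 1)
    return true_pos / pred_pos if pred_pos else 0.0

actual    = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]
print(accuracy(actual, predicted))             # 0.75
print(round(precision(actual, predicted), 3))  # 0.667
```

Because only about 10% of customers in the case study miss payments, a model that always predicts "not delinquent" would already reach roughly 90% accuracy, which is why precision is worth checking alongside it.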
Step 4: Experimentation
• Test the model in real-life scenarios:
• Call some customers flagged as likely to miss payments and track
their response.
• Compare their payment behavior with those who weren’t called.
• Questions to answer:
• Does calling reduce missed payments?
• Are there customers who still miss payments despite reminders?
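The comparison between called and uncalled customers can be sketched as a difference in missed-payment rates. The outcome lists below are invented, and the group names `called` and `control` are assumptions; a real experiment would assign customers to groups at random and test whether the difference is statistically significant.

```python
def missed_rate(outcomes):
    """Share of customers in a group who missed a payment (1 = missed)."""
    return sum(outcomes) / len(outcomes)

# Made-up outcomes for flagged customers.
called  = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # received a reminder call
control = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]  # flagged but not called

reduction = missed_rate(control) - missed_rate(called)
print(round(missed_rate(called), 2))   # 0.2
print(round(missed_rate(control), 2))  # 0.4
print(round(reduction, 2))             # 0.2
```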
Exploratory Data Analysis (EDA)
There are several ways to explore the data in order to understand
it better:
• Tables: Summarize key statistics like average balance, income, and
payment history.
• Visualization: Use graphs (e.g., bar charts, histograms) to identify
trends.
• Correlations: Analyze relationships between variables (e.g., income
and delinquency).
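The summary statistics and correlations mentioned above can be sketched with the standard library. The income and delinquency values are invented, and the Pearson coefficient is one assumed choice of correlation measure; the source does not name one.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Made-up sample: monthly income and whether a payment was missed (1 = yes).
income = [1800, 2200, 2500, 3100, 4000, 5200]
missed = [1, 1, 0, 1, 0, 0]

print("average income:", round(mean(income), 1))
# A negative value means higher income goes with fewer missed payments here.
print("income vs. delinquency:", round(pearson(income, missed), 2))
```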
Final Recommendations:
• Implement a machine learning model for precise predictions.
• Use the model to generate a ranked list of customers with the highest
risk of missing payments.
• Conduct monthly calls targeting the top 30,000 high-risk customers.
• Regularly update the model with new data to improve predictions and
adapt to changing customer behaviors.
By integrating data science into its operations, the bank can reduce
missed payments, improve customer satisfaction, and optimize its loan
program. This approach also demonstrates the powerful role of data
science in solving real-world financial challenges.
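The ranked list in the recommendations above can be produced by sorting customers on their predicted risk score and keeping the top N. A minimal sketch, assuming the model's scores are available as a dictionary from customer id to score; the ids and scores are invented.

```python
import heapq

def top_n_risk(scores, n):
    """Return the n (customer_id, score) pairs with the highest risk scores."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])

# Made-up risk scores; in the case study n would be 30,000.
scores = {"C1": 0.12, "C2": 0.87, "C3": 0.45, "C4": 0.91, "C5": 0.30}
print(top_n_risk(scores, 2))  # [('C4', 0.91), ('C2', 0.87)]
```

`heapq.nlargest` avoids sorting all 800,000 customers when only the top 30,000 are needed, although a full sort would also be fast enough at this scale.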
Challenges from Data Science Projects

1- Data platform (legacy systems): different platforms may lead to
different results on the same dataset.
2- Data quality and data dictionaries: In most companies, raw data is
dirty (missing, inaccurate, duplicate, misleading, and non-integrated),
and data dictionaries are incomplete or absent.
3- Data privacy and lack of data access: In many projects, data is not
available, or not available on time, due to data privacy issues. To resolve
these issues, an upfront assessment of data privacy should be done during
the scoping phase, and appropriate measures should be taken to
address them. In some cases, a project may have to be discontinued
because of unresolved issues.
4- Ethical issues: Data science projects often involve working
with sensitive data such as race, gender, religion, national origin, and
medical history, and we should be careful to use only data that is
allowed by rules and regulations.
5- Lack of project sponsorship: Many companies do not invest
appropriately in data science projects.
6- Expectation management: It is difficult to manage expectations
about the impact of data science projects with management.
7- Focus on wrong problems: The lack of clear direction, unclear
problem statements, and unclear execution plans can cause data science
projects to fail.
Big Data Market Size Revenue
