
Chapter-4-Data-Science-and-Big-Data

Chapter 4 discusses the significance of data science and big data, highlighting their applications in various industries such as banking, advertising, and e-commerce. It outlines the data science project lifecycle, including problem definition, data collection, modeling, and evaluation, while also addressing the challenges faced in data science projects. The chapter emphasizes the importance of machine learning in predicting customer behavior and improving business outcomes, particularly in risk management and fraud detection.

Uploaded by

doaaomar123
Copyright
© All Rights Reserved

Chapter 4: Data Science and Big Data


Introduction
• Most of the research and algorithms in data science have resulted from decades of work by statisticians, mathematicians, and scientists.
• The recent increase in the use of data science and machine learning algorithms is due to:
  • Data availability: the growing use of digital devices and services, together with the rise of sensor technology and connected objects, has led to the generation of more and more data. Moreover, cheaper storage is available to collect and store this data.
  • Computing power: high computing power has become available at low cost to process big datasets and run these complex algorithms.
• Data science and big data are now firmly embedded in many of the products and services that meet our daily needs.
• Some common everyday applications of data science and big data include:
  • Biometrics, such as face and fingerprint recognition, for unlocking phones.
  • Matching drivers and commuters to simplify booking rides in taxi or other ride apps.
  • Recommendations for watching videos and shopping on e-commerce websites.
  • Targeting, such as product advertisements based on search engine history.
• The following industry use cases, where data science and big data have proven successful, illustrate this in more detail:
• Promotional emails and digital ads: On a typical day, people receive several promotional emails, SMS messages, or ads for things such as hotel discounts, airline vouchers, and shopping discounts. Sometimes these promotional messages are targeted at specific customers. Since the advent of the internet, companies have been gathering and storing extensive data about their customers, such as demographics, past purchases, browsing patterns, and affluence level.

  They use this data to understand the characteristics of their customers and build machine learning models based on those characteristics to estimate the likelihood that a customer will respond to a particular ad at a given moment. Depending on this likelihood, customers are contacted with the right offers, and in many cases with personalized offers.
• Recommendation engines: Every day, we come across various recommendation engines, such as those for friends on Facebook, videos on YouTube, products on Amazon, and movies on Netflix. These systems show personalized offerings to people based on factors like their search histories and preferences. It is widely known that Amazon uses recommendation algorithms to personalize the shopping experience for each customer.

  Customers see recommended items based on their previous browsing behavior, product ratings, similar product purchases, and other factors. Along similar lines, Netflix recommends personalized content (movies and TV shows) based on a user’s viewing history and ratings, the viewing history of similar users, and so forth. Furthermore, these recommendations are diverse, include new releases, and adapt over time in response to changing user preferences.
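The idea of recommending content based on "the viewing history of similar users" can be illustrated with a toy example. This is only a sketch under assumptions: the user names, titles, the Jaccard similarity measure, and the functions `jaccard` and `recommend` are hypothetical, and it does not describe how Netflix or Amazon actually build their recommendations.

```python
def jaccard(a, b):
    """Overlap between two sets of watched titles (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def recommend(user, histories, k=2):
    """Suggest up to k unseen titles taken from the most similar other user."""
    seen = histories[user]
    others = [u for u in histories if u != user]
    # Pick the other user whose viewing history overlaps most with this user's.
    nearest = max(others, key=lambda u: jaccard(seen, histories[u]))
    # Recommend what that user watched and this user has not seen yet.
    return sorted(histories[nearest] - seen)[:k]


# Made-up viewing histories for illustration only
histories = {
    "ana": {"Drama A", "Thriller B", "Comedy C"},
    "ben": {"Drama A", "Thriller B", "Sci-Fi D"},
    "cho": {"Documentary E"},
}
print(recommend("ana", histories))
```

Real recommendation engines combine many more signals (ratings, recency, diversity), but the core step of finding similar users and borrowing their choices is the same.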
• Risk management: Banks and credit card companies have used credit scores to evaluate loan and credit card applications for many years. A credit score typically indicates the risk level of a customer, such as the likelihood that the customer will default. The scoring model takes into consideration several parameters, such as payment history, length of credit history, inquiries, and income, and predicts a customer’s likelihood of default by comparing these characteristics with those of other customers who have defaulted in the past.

• Fraud detection: Fraud does not follow a constant pattern, so fraud detection is an evolving process: as more and more cases are investigated and flagged, further cases of fraud can be identified. Credit card companies cannot investigate a large number of transactions one by one. Therefore, they use fraud detection models that auto-approve legitimate transactions and raise alerts for transactions that appear fraudulent. Events such as a sudden big purchase after many small purchases, purchases that do not fit the cardholder’s profile, and unusual geographical locations can raise suspicion and cause the transaction or card to be blocked automatically.
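Two of the warning signs listed above (a sudden big purchase after small ones, and an unusual location) can be written as simple checks. This is a minimal sketch, not a real fraud system: the `Transaction` fields, the `spike_factor` threshold, and the function `is_suspicious` are hypothetical, and production models usually learn such patterns from data rather than hard-coding them.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Transaction:
    amount: float   # purchase amount in the card's currency
    country: str    # country where the purchase was made


def is_suspicious(history, new_tx, home_country, spike_factor=10.0):
    """Flag a transaction using two illustrative rules (thresholds are assumptions)."""
    # Rule 1: the purchase comes from a location other than the usual one.
    unusual_location = new_tx.country != home_country
    # Rule 2: the purchase is far larger than the average of recent purchases.
    recent_amounts = [tx.amount for tx in history]
    sudden_spike = bool(recent_amounts) and (
        new_tx.amount > spike_factor * mean(recent_amounts)
    )
    return unusual_location or sudden_spike


# Made-up purchase history of a cardholder based in "US"
history = [Transaction(12.0, "US"), Transaction(8.5, "US"), Transaction(15.0, "US")]
print(is_suspicious(history, Transaction(20.0, "US"), "US"))    # ordinary purchase
print(is_suspicious(history, Transaction(900.0, "US"), "US"))   # sudden big purchase
```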
What is Data Science?
• Data science is an interdisciplinary field that uses concepts from statistics, mathematics, computer science, and domain expertise to extract meaningful insights from data that can generate business value.
• A data scientist is a specialist in these domains who extracts meaning from data by performing data and statistical analysis; building and applying machine learning models and algorithms; and visualizing, summarizing, and communicating the results.
• The role of a business/data analyst relies heavily on domain expertise and data analysis to generate insights from the data, whereas the role of a data scientist is slightly broader and also includes machine learning modeling and programming.
Machine Learning vs. Artificial Intelligence


• Artificial intelligence (AI) is a field of computer science focused on making computers more intelligent so that they can imitate intelligent human behavior. AI enables computers to engage in human-like thought processes, such as learning, reasoning, and self-correction.
• The introduction of machine learning enabled computers to acquire real-world knowledge by identifying patterns in data and to correct themselves through this learning process in order to support decision making.
• Machine learning algorithms had a limitation, however: their performance depended on how the data was presented to them. Deep learning algorithms addressed this problem by extracting the relevant information from the data themselves. Deep learning is a particular type of machine learning that makes computer learning more powerful, flexible, and abstract.

Data science is a broader term that covers the whole process of building such machine learning models, including data collection, data processing, data analysis, data visualization, modeling, making predictions, and so forth.

What is Big Data?

Big data is data that traditional database software tools cannot manage and analyze because of its size and complexity.

Data points are created in our daily activities, such as:

• Social media activity (Twitter tweets and messages; Facebook likes, shares, and messages).
• Search engine activity (Google searches, creating web pages).
• Video streaming (watching or uploading videos on YouTube or Netflix).
• Payments (using credit cards, internet banking).

Data Science and Big Data in Industry Practice


Data science project lifecycle:
Before starting work on a data science task, the first step is to clearly define the objective of the business problem.
This involves answering questions about different aspects of the problem, such as:
• What is the business problem, and how can it be translated into a data science problem?
• What data is required for the problem, and how should it be collected and prepared?
• What insights can be generated from the data, and which machine learning or deep learning models can be useful for the problem?
• How should the model be built and evaluated, and how is success defined (KPIs, metrics, and so on) for the solution?
• How can the impact of a machine learning model be measured in the real world?
Case Study
Case study: the need for data science in banks:
• A bank offered personal loans to 800,000 customers with savings or credit card
accounts over five years (2013–2018). While the program aimed to benefit
customers and increase bank revenue, it ultimately resulted in losses.
• The loan repayments were set up as monthly installments, determined by the loan
amount, tenure, and interest rate. If a customer’s account didn’t have enough
funds to cover an installment, the loan became “delinquent” (missed payment).
Approximately 10% of customers missed their payments, leading to
dissatisfaction due to hefty penalty fees.
• The bank's risk management system didn’t account for the likelihood of customers
missing payments, which negatively impacted both the loan program and overall
customer satisfaction. To address this issue, the bank’s management tasked the
data science team with identifying customers likely to miss payments in the next
three months. By doing so, the bank could proactively call these customers and
remind them of their payment, testing whether such reminders could reduce
missed payments.
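The case study says only that the monthly installment is determined by the loan amount, tenure, and interest rate. One common way to compute it, assumed here rather than taken from the source, is the standard fixed-payment (amortization) formula. The function name `monthly_installment` and the example figures are hypothetical.

```python
def monthly_installment(principal, annual_rate, months):
    """Fixed monthly payment for a fully amortizing loan (assumed formula)."""
    r = annual_rate / 12  # monthly interest rate
    if r == 0:
        return principal / months  # no interest: spread the principal evenly
    growth = (1 + r) ** months
    # Standard annuity formula: P * r * (1 + r)^n / ((1 + r)^n - 1)
    return principal * r * growth / (growth - 1)


# Example: a 10,000 loan at 12% per year, repaid over 24 months
payment = monthly_installment(10_000, 0.12, 24)
print(round(payment, 2))
```

A customer becomes delinquent in the sense used above when the account balance on the due date is lower than this amount.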
Objective: Identify the 30,000 customers most likely to miss loan payments in
the next three months to prioritize telephone reminders.

A few ways to approach this problem are:

Approach 1: Account Balance Threshold


• Method: Identify customers whose account balance falls below a specific amount,
such as 2–3 times their monthly installment.
• Pros: Simple and quick to implement.
• Cons: Doesn’t account for account balance fluctuations (e.g., customers waiting
for their salary but typically paying on time).
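Approach 1 can be sketched in a few lines. The record layout (`id`, `balance`, `installment`), the function name, and the sample customers are assumptions made for illustration; the multiple of 2 follows the "2–3 times their monthly installment" example above.

```python
def below_balance_threshold(customers, multiple=2.0):
    """Return IDs of customers whose balance is below `multiple` x their installment."""
    return [
        c["id"]
        for c in customers
        if c["balance"] < multiple * c["installment"]
    ]


# Made-up customers for illustration only
customers = [
    {"id": "C1", "balance": 500.0, "installment": 300.0},
    {"id": "C2", "balance": 2_000.0, "installment": 300.0},
    {"id": "C3", "balance": 100.0, "installment": 50.0},
]
print(below_balance_threshold(customers))
```

Because the check looks at a single balance snapshot, a customer who is about to receive a salary would be flagged just like one who is genuinely at risk, which is the weakness noted under "Cons".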
Approach 2: Rule-Based Analysis
• Method: Analyze historical data of customers who missed payments and create
rules based on common characteristics (e.g., customers aged 30–40 with an
income below $2,500 are more likely to miss payments).
• Pros: Provides insights into customer behavior.
• Cons: Rules may not account for all variations and might miss some patterns.
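A rule such as "customers aged 30–40 are more likely to miss payments" would come from comparing miss rates across groups in the historical data. The sketch below shows one way to do that by age band; the records are invented and the function `miss_rate_by_age_band` is a hypothetical helper, not part of the case study.

```python
from collections import defaultdict


def miss_rate_by_age_band(records, band_width=10):
    """Share of customers who missed a payment, grouped by age band."""
    totals = defaultdict(int)
    missed = defaultdict(int)
    for r in records:
        band = (r["age"] // band_width) * band_width  # e.g. 34 -> 30
        totals[band] += 1
        missed[band] += int(r["missed"])
    return {band: missed[band] / totals[band] for band in sorted(totals)}


# Made-up historical records for illustration only
records = [
    {"age": 34, "missed": True},
    {"age": 38, "missed": True},
    {"age": 31, "missed": False},
    {"age": 52, "missed": False},
    {"age": 57, "missed": True},
    {"age": 55, "missed": False},
]
rates = miss_rate_by_age_band(records)
print(rates)
```

Bands with a clearly higher miss rate become candidate rules; the same grouping can be repeated for income or other characteristics.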

Approach 3: Machine Learning


• Method: Use historical data to build a machine-learning model that predicts the
likelihood of missing payments for each customer.
• Pros: Generates a specific score for each customer and adapts to changing
patterns over time.
• Cons: Requires more data and resources for implementation.
Steps to Implement a Machine Learning Approach:

Step 1: Data Collection and Preparation
• Gather data since 2013 about customers who missed payments and those who didn’t.
• Collect relevant details such as:
  • Account balance history.
  • Demographics (e.g., age, income).
  • Loan details (e.g., tenure, amount, interest rate).
• Clean and preprocess the data for accuracy and consistency.
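As an illustration of the cleaning step, the sketch below removes duplicate customer rows and fills missing incomes with the median. Both choices, along with the record layout and the function `clean_records`, are assumptions; the right cleaning rules depend on the bank's actual data.

```python
from statistics import median


def clean_records(records):
    """Drop duplicate customer IDs and fill missing incomes with the median."""
    seen, unique = set(), []
    for r in records:
        if r["id"] not in seen:       # keep the first record per customer
            seen.add(r["id"])
            unique.append(dict(r))    # copy so the input is not modified
    known = [r["income"] for r in unique if r["income"] is not None]
    fallback = median(known) if known else None
    for r in unique:
        if r["income"] is None:
            r["income"] = fallback
    return unique


# Made-up raw data for illustration only
raw = [
    {"id": "C1", "income": 2_000.0},
    {"id": "C1", "income": 2_000.0},   # duplicate row
    {"id": "C2", "income": None},      # missing value
    {"id": "C3", "income": 4_000.0},
]
cleaned = clean_records(raw)
print(cleaned)
```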
Step 2: Modeling
• Supervised Learning: Use labeled data (e.g., delinquent = Yes/No) to
train predictive models.
• Unsupervised Learning: Explore patterns without predefined targets to
find useful relationships in the data.
• Apply algorithms such as regression or decision trees to generate
predictions for each customer.
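To make the supervised case concrete, here is a small logistic regression (one of the regression algorithms that outputs a probability) trained with plain gradient descent. It is a teaching sketch under assumptions: the two features, the toy data, the learning rate, and the function names are all hypothetical, and a real project would normally use an established machine learning library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.1, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss with respect to z
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability that a customer will be delinquent."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy features: [balance / installment, income in thousands]; label 1 = delinquent
X = [[0.5, 1.5], [0.8, 2.0], [1.0, 1.8], [3.0, 4.0], [4.0, 3.5], [5.0, 5.0]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic_regression(X, y)
risky = predict_proba(w, b, [0.6, 1.6])
safe = predict_proba(w, b, [4.5, 4.5])
print(round(risky, 3), round(safe, 3))
```

The output of `predict_proba` is the per-customer score that the later steps evaluate and rank.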
Step 3: Model Evaluation
• Evaluate how well the model predicts delinquency by testing it on new data.
• Metrics to check:
  • Accuracy: How often does the model predict correctly?
  • Precision: How many of the predicted "delinquents" are actual delinquents?
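Both metrics can be computed directly from the actual and predicted labels. In this sketch 1 means "delinquent" and 0 means "not delinquent"; the label lists are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of all predictions that match the actual outcome."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true, y_pred):
    """Share of predicted delinquents (1) that are actual delinquents."""
    predicted_pos = sum(1 for p in y_pred if p == 1)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / predicted_pos if predicted_pos else 0.0


# Made-up labels: actual outcomes vs. model predictions on held-out data
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]
print(accuracy(y_true, y_pred), precision(y_true, y_pred))
```

Precision matters here because every customer predicted as delinquent triggers a phone call, so false alarms translate directly into wasted calls.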
Step 4: Experimentation
• Test the model in real-life scenarios:
  • Call some customers flagged as likely to miss payments and track their response.
  • Compare their payment behavior with that of customers who weren’t called.
• Questions to answer:
  • Does calling reduce missed payments?
  • Are there customers who still miss payments despite reminders?
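The comparison between called and uncalled customers amounts to comparing missed-payment rates in two groups. The sketch below assumes the flagged customers were split into the two groups at random, which the source does not state; the outcome lists and the function `missed_rate` are hypothetical.

```python
def missed_rate(outcomes):
    """Fraction of customers in a group who missed a payment (True = missed)."""
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for flagged customers, split into two groups
called = [False, False, True, False, False, True, False, False]
not_called = [True, False, True, True, False, True, False, True]

# A positive value means fewer missed payments among customers who were called.
reduction = missed_rate(not_called) - missed_rate(called)
print(missed_rate(called), missed_rate(not_called), reduction)
```

With real data the groups would be far larger, and a statistical test would be needed to judge whether the difference is more than chance.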
Exploratory Data Analysis (EDA)
There are several ways to perform an exploratory data analysis and better understand the data:
• Tables: Summarize key statistics like average balance, income, and payment history.
• Visualization: Use graphs (e.g., bar charts, histograms) to identify trends.
• Correlations: Analyze relationships between variables (e.g., income and delinquency).
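The table and correlation items can be illustrated with the standard library alone. The income and delinquency values below are invented, and `pearson` is a hand-written helper for the Pearson correlation coefficient, shown here only as one possible way to quantify a relationship.

```python
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up sample: monthly income and whether the customer missed a payment (1 = yes)
income = [1_800, 2_200, 2_500, 3_900, 4_500, 5_200]
missed = [1, 1, 0, 1, 0, 0]

summary = {"mean_income": mean(income), "median_income": median(income)}
r = pearson(income, missed)
print(summary, round(r, 2))
```

A negative coefficient in this toy sample would suggest that higher income goes along with fewer missed payments; it says nothing about cause.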
Final Recommendations:
• Implement a machine learning model for precise predictions.
• Use the model to generate a ranked list of customers with the highest risk of missing payments.
• Conduct monthly calls targeting the top 30,000 high-risk customers.
• Regularly update the model with new data to improve predictions and adapt to changing customer behaviors.
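Producing the ranked call list from model scores is a top-N selection. In this sketch the scores are invented and `top_risk_customers` is a hypothetical helper; in the case study `n` would be 30,000.

```python
import heapq


def top_risk_customers(scores, n):
    """Return the n customer IDs with the highest predicted risk, riskiest first."""
    return heapq.nlargest(n, scores, key=scores.get)


# Made-up risk scores (predicted probability of missing a payment)
scores = {"C1": 0.82, "C2": 0.15, "C3": 0.64, "C4": 0.91, "C5": 0.40}
call_list = top_risk_customers(scores, 3)
print(call_list)
```

`heapq.nlargest` avoids sorting the full customer base when only the top of the list is needed.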
By integrating data science into its operations, the bank can reduce
missed payments, improve customer satisfaction, and optimize its loan
program. This approach also demonstrates the powerful role of data
science in solving real-world financial challenges.
Challenges from Data Science Projects



1- Data platform (legacy systems): Different platforms may lead to different results on the same dataset.
2- Data quality and data dictionaries: In most companies, raw data is dirty (missing, inaccurate, duplicate, misleading, and non-integrated), and data dictionaries are incomplete or absent.
3- Data privacy and lack of data access: In many projects, data is not available, or not available on time, because of data privacy issues. To resolve these issues, an upfront assessment of data privacy should be done during the scoping phase, and appropriate measures should be taken to address them. In some cases, a project may have to be discontinued because of unresolved issues.
4- Ethical issues: Data science projects often involve working with sensitive data such as race, gender, religion, national origin, and medical history, and we should be careful to use only data that is allowed by rules and regulations.
5- Lack of project sponsorships: Many companies do not focus on
investing appropriately in data science projects.
6- Expectation management: It is difficult to manage expectations
about the impact of data science projects with management.
7- Focus on wrong problems: The lack of clear direction, unclear
problem statements, and unclear execution plans can cause data science
projects to fail.
Big Data Market Size Revenue
