
Soft Skills for Data Scientists
MODULE-2
2.1 Comparison between Statistician and Data Scientist
• Statistics as a field dates back to 1749 and rests on long-established theory.
• Data science emerged recently, driven by big data and computational advances.
• Both careers involve working with data but have key differences.

Data Handling
• Statisticians:
• Work with well-formatted numerical and categorical data.
• Datasets are small enough to fit in a PC's memory.
• Data Scientists:
• Handle large databases, text, images, videos, and real-time data.
• Work with streaming data and unstructured information.
Focus on Modeling
• Statisticians:
• Focus on statistical inference from small datasets.
• Develop models without much data cleaning.
• Data Scientists:
• Spend more time on data preprocessing.
• Modeling is often automated with open-source tools.
Deployment & Production
• Statisticians:
• Work mostly in research/academia.
• Bring data to models.
• Data Scientists:
• Work in industry, closer to real-time data systems.
• Bring models to data and deploy them into production.
2.2 Beyond Data and Analytics
Data science projects involve more than just data and analytics—they require
collaboration among different roles in a company.
A data scientist must communicate effectively, understand the business problem, and set
realistic expectations to ensure success.
A data science project may involve people with different roles, especially in a large
company:
1. Multiple roles are involved –
Business Leader – Defines the problem and expected value. (Example: A CEO wants to
reduce customer churn.)
IT & Data Owner – Provides data access and infrastructure. (Example: IT ensures the
database is accessible.)
Policy & Security Team – Ensures compliance with privacy laws. (Example: GDPR
compliance for user data.)
Engineering Team – Builds and maintains models. (Example: A machine learning model for
fraud detection.)
Project Manager – Keeps tasks on track. (Example: Ensuring the project meets deadlines.)
2. Effective communication is crucial –
Data scientists talk to all levels: executives, IT teams, engineers, and
front-line workers.
They must simplify technical concepts so others can understand.
(Example: Explaining AI-driven sales forecasting to a marketing team.)
3. Realistic expectations – Many projects fail due to overpromising or
poor planning.
Avoid overpromising results or setting unrealistic timelines.
Data scientists must ensure expectations are data-driven. (Example: A
project predicting customer behavior should use historical data, not
assumptions.)
4. Collaboration is key – Working with data owners, IT teams, and infrastructure
managers ensures smooth execution.
Work with data owners to get high-quality data.
5. Budget and resources matter – Cloud computing can scale projects, but costs
must be managed.
Understand the costs and limitations of computing resources. (Example: Choosing
between on-premise servers or cloud computing to analyze millions of
transactions.)
Hence, the Role of a Data Scientist
• A data scientist is not just an analyst but a project leader.
• They must balance business needs, data quality, realistic timelines, and
technical execution.
• Example: A company investing in AI for customer service must ensure:
• Data is clean and relevant.
• The model aligns with business needs.
• Infrastructure is cost-effective.
2.3 Three/Four Pillars of Knowledge
To become a great data scientist, you need a combination of technical
skills, business knowledge, and communication skills. These skills help in
analyzing data, making better business decisions, and effectively sharing
insights with others.
Key Areas of Data Science Skills:
• Domain Knowledge – Understanding the business side of data science.
• Math Skills – Essential for understanding machine learning algorithms.
• Computer Science – Programming, databases, and distributed
computing.
• Machine Learning – Applying algorithms to make predictions and automate
tasks.
• Communication Skills – Presenting data insights clearly to non-technical
people.
What Makes a Successful Data Scientist?
Data science is not just about coding or math—it requires multiple skills.
Four key skills: Business knowledge, math, programming, and communication.
1. Domain Knowledge – Why Business Understanding Matters
• Data scientists help businesses make profitable decisions.
• Without knowing the company's business model, a data scientist is less useful.
• Example: A data scientist at Amazon must understand how customers shop to
improve recommendations.
2. Math Skills – The Backbone of Machine Learning
• You can’t skip math in data science!
• Important topics:
• Linear Algebra, Calculus, & Optimization – Used in machine learning.
• Statistics & Probability – Helps in analyzing data trends.
• Example: Predicting house prices using regression requires statistics and
probability.
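A minimal sketch of the house-price regression example, on synthetic data; the column name, prices, and coefficients are illustrative only, and real work would use an actual housing dataset.

```python
# Hedged sketch: linear regression for house prices on simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3000, size=(200, 1))              # house size in sq ft
price = 50_000 + 120 * sqft[:, 0] + rng.normal(0, 20_000, 200)

model = LinearRegression().fit(sqft, price)
# The fitted slope estimates the price increase per extra square foot;
# statistics tells us how uncertain that estimate is.
print("price per sqft:", model.coef_[0])
print("predicted price for 1500 sqft:", model.predict([[1500]])[0])
```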
3. Computer Science – The Technical Side
• Programming is a must-have skill for data science.
• You need knowledge of:
• Programming languages – Python, R, SQL, Java.
• Databases – SQL (relational) & MongoDB (non-relational).
• Big Data Tools – Hadoop, Spark for large datasets.
• Example: A self-driving car uses Python & machine learning
algorithms to recognize objects.
Machine Learning – The Core of Data Science
• Machine Learning (ML) helps computers learn from data and make predictions.
• Two main types:
• Supervised Learning – Data with labels (e.g., email spam detection).
• Unsupervised Learning – No labels (e.g., customer segmentation).

• Example: Netflix uses ML to recommend shows based on your watch history.
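The two learning types can be contrasted in a few lines of scikit-learn. This is a minimal sketch on synthetic data; the datasets and model choices are illustrative, not part of the original examples.

```python
# Hedged sketch: supervised vs. unsupervised learning with scikit-learn.
from sklearn.datasets import make_classification, make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised learning: features X come with labels y (e.g., spam / not spam).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)
print("Predicted labels:", clf.predict(X[:3]))

# Unsupervised learning: no labels; the algorithm finds structure on its own
# (e.g., customer segments).
X_unlabeled, _ = make_blobs(n_samples=200, centers=3, random_state=0)
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_unlabeled)
print("Segment of first customer:", segments[0])
```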
Distributed Computing – Handling Big Data
• Big data cannot be processed on a single computer.
• Tools like Hadoop and Spark help process large datasets across
multiple computers.
• Example: Google processes millions of search queries per second
using distributed computing.
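As a minimal sketch of the idea, the PySpark snippet below distributes a simple aggregation across whatever cluster is available; the file name "transactions.csv" and its columns (customer_id, amount) are hypothetical.

```python
# Hedged sketch: the same code scales from a laptop to many machines,
# because Spark splits the data into partitions processed in parallel.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BigDataDemo").getOrCreate()

df = spark.read.csv("transactions.csv", header=True, inferSchema=True)
df.groupBy("customer_id").agg(F.sum("amount").alias("total_spent")).show(5)
spark.stop()
```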
4. Communication Skills – Presenting Insights Clearly
• A data scientist must explain findings to non-technical people.
• Ways to communicate:
• Reports – Writing summaries for managers.
• Presentations – Using graphs & charts (e.g., Tableau).
• Blog posts – Sharing insights with a wider audience.
• Example: A sales team understands a profitability report better with
charts rather than just numbers.
Becoming a Successful Data Scientist
• Data Science = Business + Math + Programming + Communication.
• Every data science project involves:
• Understanding the business problem.
• Using math & programming to analyze data.
• Applying machine learning models.
• Communicating insights clearly.
• Example: A data scientist at Uber must analyze traffic patterns, use
ML for fare prediction, and explain results to management &
engineers.
2.4 Data Science Project Cycle
2.4.1 Types of Data Science Projects
Data science projects use data and machine learning models to solve
business problems. They can be classified based on how data is used
and how models are applied.
Types of Data Science Projects:
• Offline Training & Offline Application – Model is trained and applied
offline (no real-time execution).
• Offline Training & Online Application – Model is trained offline but
used in real-time (e.g., recommendation systems).
• Online Training & Online Application – Model is trained and used in
real-time (e.g., stock market predictions).
What is a Data Science Project?
• Data science projects use data & machine learning to solve business
problems.
• Different projects require different data types and model
applications.
• We categorize them based on how models are trained and applied.
Understanding Data – Offline vs. Online
• Offline Data: Historical data stored in databases (e.g., customer past
purchases).
• Online Data: Real-time data that changes continuously (e.g., live stock
prices).
• Example: Amazon uses offline data for customer purchase history and
online data to track real-time website behavior.
Type 1 – Offline Training & Offline Application
• What it means:
• Model is trained once using historical data.
• Output is usually a report or insights (not real-time).
• Example:
• A company studies whether a new marketing strategy improves sales.
• Uses past data to analyze trends and predict success.
• Final output: A report with recommendations.
Type 2 – Offline Training & Online Application
• What it means:
• Model is trained offline but used in real-time for decision-making.
• Example:
• Personalized Ads on Social Media
• Facebook trains an ad recommendation model using past user behavior.
• When a user logs in, the model uses real-time actions to show relevant ads.
Type 3 – Online Training & Online Application
• What it means:
• Model is continuously trained & updated using real-time data.
• Used for highly dynamic environments where old data is irrelevant.
• Example:
• Stock Market Predictions
• A stock trading model analyzes live stock prices and updates itself instantly.
• Makes buy/sell decisions in milliseconds.
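A hedged sketch of online training using scikit-learn's SGDClassifier.partial_fit, which updates a model incrementally; the streaming batches here are simulated, and a production system would read them from a live feed instead.

```python
# Hedged sketch: each loop iteration stands for a new batch of live data.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])

for step in range(100):                      # each step = a fresh batch
    X_batch = rng.normal(size=(32, 4))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)  # incremental update

print("Up-to-date prediction:", model.predict(rng.normal(size=(1, 4))))
```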
Choosing the Right Data Science Project
• Offline Training & Offline Application → Good for one-time analysis
(e.g., business reports).
• Offline Training & Online Application → Good for scalable models
(e.g., ad recommendations).
• Online Training & Online Application → Needed for fast-changing
environments (e.g., fraud detection).
• The choice depends on business needs, data availability, and speed requirements.
2.4.2 Problem Formulation and Project Planning Stage
• A data-driven and fact-based planning stage is essential for a
successful data science project.
• Since data science is in high demand, leaders often initiate projects, but careful planning is needed to ensure success.
Key Steps in Planning:
• Understand the Business Problem – Identify pain points and goals.
• Align Teams – Ensure collaboration between business, technology,
and project management teams.
• Ask Critical Questions – Data availability, impact, security, and
timeline.
• Define Metrics & Resources – Set key performance indicators (KPIs),
allocate computation resources, and form the right team.
Why Planning is Essential?
• Poor planning leads to project failure due to unrealistic expectations.
• A structured approach ensures clear goals, smooth execution, and measurable
success.
Who is Involved in a Data Science Project?
• Business Team → Understands problems, business goals, and reporting needs.
• Technology Team → Manages data, machine learning models, and software
deployment.
• Project & Product Management → Ensures deadlines, milestones, and
coordination.
✅ Example:
A retail company wants to use AI to predict customer purchases.
• Business Team → Defines success as increased sales.
• Tech Team → Ensures data quality and model accuracy.
• Project Managers → Keep track of progress from idea to deployment.
Key Questions to Ask Before Starting
• Business Understanding
• What are the biggest pain points in the current operation?
• What impact will a data science project have?
• Data Availability & Quality
• What data sources are available (online or offline)?
• Is the data clean, complete, and reliable?
• Project Feasibility
• What computational resources are required?
• Are there any security or privacy concerns?
• Defining Success
• What are the key performance metrics (KPIs)?
• What are the milestones and timeline?
• ✅ Example:
A bank wants to use machine learning to detect fraud.
• Pain Point: Too many false alarms in fraud detection.
• Data: Transaction history, customer behavior.
• Computational Needs: High-speed processing for real-time detection.
• Success Metrics: Reduce false alarms by 30% in 6 months.
Structuring a Data Science Project
• Step 1: Define Key Metrics → Set measurable goals (e.g., increase sales by 20%).
• Step 2: Identify Data Sources → Determine internal/external data sources.
• Step 3: Allocate Resources → Assign data engineers, scientists, and developers.
• Step 4: Set a Timeline → Define milestones and deadlines.
• ✅ Example:
A logistics company wants to optimize delivery routes using AI.
• Key Metric: Reduce delivery time by 15%.
• Data: GPS tracking, traffic data.
• Resources: Cloud-based AI tools, data engineers.
• Timeline: 3-month development, 1-month testing, 6-month rollout.
The Role of Data Scientists in Planning
• Data scientists should lead project discussions to ensure feasibility.
• If data scientists don’t lead, projects may have unrealistic goals and
fail to meet deadlines.
• Collaboration with engineers, business teams, and project managers
is crucial.
• ✅ Example:
An e-commerce company wants to use AI for product
recommendations.
• Without data scientists → Unrealistic expectations (e.g., expecting
100% accuracy).
• With data scientists → Realistic goals (e.g., improving
recommendation click-through rate by 10%).
Why Planning is Key to Success
• Data science projects fail without proper planning.
• Collaboration across teams is necessary for success.
• Defining clear goals, data sources, resources, and timelines prevents
failure.
• ✅ Example:
A healthcare company wants AI to predict patient diseases.
• Without proper planning → Inaccurate predictions, compliance
issues.
• With proper planning → Reliable, ethical, and useful AI system.
2.4.3 Project Modeling Stage
• Even with well-defined strategies, milestones, and timelines, data science
projects are dynamic and may face uncertainties. Effective communication is
crucial to address new challenges and opportunities.
Key steps include:
• Data Preparation: Cleaning, wrangling, and exploring data before modeling.
• Problem Abstraction: Converting business problems into machine learning or
statistical problems.
• Iterative Approach: Business problems are rarely solved with a single model.
Instead, multiple techniques are used in stages.
• Collaboration: Continuous discussions between data scientists, business teams,
and engineers help refine models.
• A senior data scientist must lead the iterative process, ensuring both data and
models evolve to meet business goals effectively.
2.4.3.1 Data-Related Part
Data Cleaning, Preprocessing, and Feature Engineering
• Essential steps to create usable variables for ML models.
• Ensures the dataset is representative of the real-world scenario.
Key Considerations
• Representation: Data should approximate the deployment scenario.
• Bias & Assumptions: Clearly communicate limitations and quantify
the impact.
• Relevance: Sometimes, existing data isn’t sufficient—additional data
collection may be necessary.
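A minimal sketch of this data-related work in pandas, using a tiny hypothetical customer table; the column names, cleaning rules, and engineered features are illustrative only.

```python
# Hedged sketch: cleaning, preprocessing, and feature engineering in pandas.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, 40, np.nan, 300],              # a missing and an impossible age
    "store_exp": [120.5, -10.0, 80.0, 95.0],   # a negative expense to clean up
    "gender": ["male", "female", "female", "male"],
})

# Cleaning: treat impossible values as missing, then impute with the median.
df.loc[df["age"] > 100, "age"] = np.nan
df["age"] = df["age"].fillna(df["age"].median())
df.loc[df["store_exp"] < 0, "store_exp"] = 0.0

# Feature engineering: turn raw fields into model-ready variables.
df["log_store_exp"] = np.log1p(df["store_exp"])
df = pd.get_dummies(df, columns=["gender"], drop_first=True)
print(df.head())
```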
2.4.3.2 Model-Related Part
Types of Models
• Supervised Learning (e.g., classification, regression)
• Unsupervised Learning (e.g., clustering, dimensionality reduction)
• Causal Inference (analyzing cause-effect relationships)
Model Selection & Development
• Experimentation: Often requires combining multiple techniques.
• Training, Validation, Testing: Ensuring generalization and avoiding
overfitting.
• Occam’s Razor: Prefer simpler models when possible.
Benchmarking
• Compare against business rules, common-sense decisions, or standard
models (e.g., Random Forest).
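A minimal benchmarking sketch under these guidelines: a simple, explainable model is compared against a standard Random Forest baseline on held-out data. The data here is synthetic and the accuracy metric is a stand-in for whatever KPI the project uses.

```python
# Hedged sketch: benchmark a simple model against a Random Forest baseline.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

simple = LogisticRegression(max_iter=1000).fit(X_train, y_train)
benchmark = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Occam's razor: if the simple model matches the benchmark, prefer it.
print("Logistic regression:", accuracy_score(y_test, simple.predict(X_test)))
print("Random forest:     ", accuracy_score(y_test, benchmark.predict(X_test)))
```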
2.4.4 Model Implementation and Post-Production Stage
Model Implementation
• Offline projects result in reports with model results.
• Online projects require deploying models into a production environment.
• Transitioning from offline to online involves significant additional work.
• Cloud infrastructure simplifies deployment but still requires effort.
Steps Before Production:
• Shadow Mode
• A/B Testing
Shadow Mode (Proof of Concept - POC)
• The model and data pipeline run fully but do not impact decisions.
• Helps identify and fix issues:
• Timeouts
• Missing features
• Version conflicts (e.g., Python 2 vs. 3)
• Data type mismatches
• Frequent monitoring ensures the system works as expected.
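A minimal sketch of shadow mode: the candidate model scores every request and logs its output, but the existing rule still makes the real decision. The function names and request format are hypothetical.

```python
# Hedged sketch: run the new model in parallel without affecting decisions.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def current_rule(request: dict) -> bool:
    """Existing business rule that still drives the real decision."""
    return request["amount"] > 1000

def new_model_score(request: dict) -> float:
    """Placeholder for the candidate ML model's score."""
    return 0.5  # a real model prediction would go here

def handle(request: dict) -> bool:
    decision = current_rule(request)          # this is what takes effect
    try:
        shadow = new_model_score(request)     # model runs, output only logged
        log.info("shadow_score=%s decision=%s", shadow, decision)
    except Exception as exc:                  # surface timeouts, type errors, etc.
        log.warning("shadow failure: %s", exc)
    return decision

handle({"amount": 1500})
```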
A/B Testing
• Splits incoming data into two groups:
• Control Group: No ML model influence.
• Treatment Group: Uses the ML model.
• Compares pre-defined key metrics over time.
• Determines if the model delivers business value.
• Complex applications may involve:
• Multiple treatment groups
• Many A/B tests running in parallel
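A minimal sketch of the final comparison step, assuming conversion rate is the pre-defined key metric; the counts are made up, and the two-proportion z-test stands in for whatever statistical test the team actually uses.

```python
# Hedged sketch: compare a key metric between control and treatment groups.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

conversions = np.array([480, 540])   # control, treatment successes (made up)
visitors = np.array([10000, 10000])  # users assigned to each group

stat, p_value = proportions_ztest(conversions, visitors)
print(f"control rate={conversions[0]/visitors[0]:.3f}, "
      f"treatment rate={conversions[1]/visitors[1]:.3f}, p={p_value:.4f}")
# A small p-value suggests the model's lift is real, not random noise.
```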
Full Production & Continuous Monitoring
• If A/B testing is successful → Model moves to full production.
• Challenges in production:
• Challenges in production:
• Business needs evolve.
• Data availability can change.
• Model performance can degrade.
• Monitoring System
• Detects feature changes.
• Notifies when performance drops below threshold.
• Supports fine-tuning (e.g., re-training with new data).
• Model Retirement
• Every model eventually becomes obsolete.
• A retirement plan ensures a smooth transition.
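A minimal sketch of the monitoring idea described above: track a performance metric over time and alert when it drops below a threshold. The metric values and the threshold are illustrative.

```python
# Hedged sketch: alert when a tracked metric falls below a set threshold.
ALERT_THRESHOLD = 0.80  # minimum acceptable accuracy, chosen by the team

def check_model_health(daily_accuracy: list[float]) -> None:
    latest = daily_accuracy[-1]
    if latest < ALERT_THRESHOLD:
        # In production this would notify the on-call data scientist and
        # possibly trigger re-training with new data.
        print(f"ALERT: accuracy dropped to {latest:.2f}, re-training needed")
    else:
        print(f"OK: accuracy {latest:.2f}")

check_model_health([0.91, 0.88, 0.84, 0.76])  # gradual model decay
```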
2.5 Common Mistakes in Data Science
• Data science projects fail at different stages.
• Most discussions focus on technical mistakes (e.g., overfitting, outliers).
• However, systematic mistakes impact projects at a higher level.
• Understanding these pitfalls helps improve project success.
• Stages Where Mistakes Happen
• Problem Formulation
• Project Planning
• Modeling
• Implementation & Post-Production
2.5.1 Problem Formulation Stage
Mistake #1 – Solving the Wrong Problem
• Issue: Data science teams may not be included in business
discussions.
• Impact: Wasting resources on solving irrelevant problems.
• Solution: Align data science with business needs and available data.
Mistake #2 – Overpromising Business Value
• Issue: Unrealistic expectations from leadership.
• Impact: Disappointment when results don’t match expectations.
• Solution: Set clear, data-driven expectations.
2.5.2 Project Planning Stage
Mistake #3 – Too Optimistic About Timeline
• Issue: Underestimating time needed for data cleaning and
exploration.
• Impact: Missed deadlines and unrealistic project plans.
• Solution: Allocate 60-80% of the project time to data preprocessing.
Mistake #4 – Too Optimistic About Data Availability & Quality
• Issue: Assuming “big data” always means useful data.
• Impact: Delays due to poor data quality or missing data.
• Solution: Assess data quality early in the project.
2.5.3 Project Modeling Stage
Mistake #5 – Unrepresentative Data
• Issue: Models trained on biased or outdated data.
• Impact: Poor model generalization in real-world use.
• Solution: Ensure training data reflects production conditions.
Mistake #6 – Overfitting & Complex Models
• Issue: Prioritizing complex models over explainable ones.
• Impact: Reduced interpretability and reliability.
• Solution: Favor simpler models that generalize well.
Mistake #7 – Taking Too Long to Fail
• Issue: Reluctance to shut down failing projects.
• Impact: Wasted time and resources.
• Solution: Identify failing projects early and pivot or stop.
2.5.4 Model Implementation and Post-Production Stage
Mistake #8 – Missing A/B Testing
• Issue: Assuming offline model performance will match production.
• Impact: Unexpected failures in real-world use.
• Solution: Always run A/B tests and shadow testing before full
deployment.
Mistake #9 – Failing to Scale in Real-Time Applications
• Issue: Lack of infrastructure or engineering support.
• Impact: Models fail under real-time conditions.
• Solution: Work with engineers to ensure scalability.
Mistake #10 – No Online Monitoring
• Issue: No system for tracking model performance post-deployment.
• Impact: Model decay over time.
• Solution: Implement dashboards, alerts, and regular retraining.
Summary of Mistakes
✅ Solving the wrong problem
✅ Overpromising business value
✅ Too optimistic about timeline
✅ Too optimistic about data quality
✅ Unrepresentative data
✅ Overfitting & complex models
✅ Taking too long to fail
✅ Missing A/B testing
✅ Failing to scale in production
✅ No online monitoring
Introduction to the Data
3.1 Customer Data for a Clothing Company
The dataset describes various aspects of customer information for a clothing company,
including demographics, purchasing behavior, and product preferences.
Here's a breakdown of its components:
1. Demography
•age: age of the respondent
•gender: male/female
•house: 0/1 variable indicating if the customer owns a house or not
2. Sales in the past year
•store_exp: expense in store
•online_exp: expense online
•store_trans: number of in-store transactions
•online_trans: number of online transactions
3. Survey on product preference
Customers responded to statements on a scale from 1 (Strongly Disagree) to 5 (Strongly Agree):
• Q1: I like to buy clothes from different brands.
• Q2: I buy almost all my clothes from some of my favorite brands.
• Q3: I like to buy premium brands.
• Q4: Quality is the most important factor in my purchasing decision.
• Q5: Style is the most important factor in my purchasing decision.
• Q6: I prefer to buy clothes in-store.
• Q7: I prefer to buy clothes online.
• Q8: Price is important.
• Q9: I like to try different styles.
• Q10: I like to make decisions myself and don’t need too much of others’ suggestions.
4. Customer Segments: Each customer is categorized into one of four segments based on their
preferences:
• Price
• Conspicuous
• Quality
• Style
The simulation is broken into three parts:
1. Define data structure: variable names, variable distribution,
customer segment names, segment size
2. Variable distribution parameters: mean and variance
3. Iterate across segments and variables; simulate data according to the specific parameters assigned.
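A minimal sketch of the three-part simulation in Python; the segment sizes and the (mean, sd) parameters shown are placeholders, not the values used to generate the actual dataset.

```python
# Hedged sketch: define structure, set distribution parameters, then iterate.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2023)

# 1. Data structure: variables, segment names, segment sizes (illustrative).
segments = {"Price": 250, "Conspicuous": 200, "Quality": 200, "Style": 350}

# 2. Distribution parameters (mean, sd) per segment for one variable.
store_exp_params = {"Price": (300, 50), "Conspicuous": (1500, 300),
                    "Quality": (800, 100), "Style": (500, 80)}

# 3. Iterate across segments and simulate according to the parameters.
rows = []
for seg, n in segments.items():
    mean, sd = store_exp_params[seg]
    rows.append(pd.DataFrame({
        "segment": seg,
        "age": rng.integers(18, 70, size=n),
        "store_exp": rng.normal(mean, sd, size=n),
    }))
customers = pd.concat(rows, ignore_index=True)
print(customers.groupby("segment")["store_exp"].mean())
```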
3.2 Swine Disease Breakout Data
• The swine flu outbreak dataset is a simulated study designed to understand
how different survey responses relate to the chance of an outbreak on a
farm.
Purpose of the Dataset:
• To analyze risk factors for swine flu outbreaks.
• To see how different types of questions affect outbreak probability.
• To help develop prediction models for disease spread.
How the Data Was Created:
• 800 farms were surveyed.
• Each farm answered 120 survey questions (3 answer choices per question).
• The chance of an outbreak was calculated using a formula based on survey
answers.
How Outbreaks Are Simulated
• Farms either have an outbreak (1) or not (0).
• The probability of an outbreak for farm i follows the logistic model

  P(y_i = 1) = \frac{\exp\left(\beta_0 + \sum_{g=1}^{120} x_{i,g}^{T} \beta_g\right)}{1 + \exp\left(\beta_0 + \sum_{g=1}^{120} x_{i,g}^{T} \beta_g\right)}

where β0 is the intercept, x_i,g is a three-dimensional indicator vector for the answer to question g on farm i, and β_g is the parameter vector corresponding to the g-th predictor.

Three types of questions are considered regarding their effects on the outcome:
• The first forty survey questions are important questions such that the coefficients of the three answers are all different: β_g = (1, 0, -1)^T γ.
• The second forty survey questions are also important, but only one answer has a coefficient different from the other two: β_g = (1, 0, 0)^T γ.
• The last forty survey questions are unimportant questions such that all three answers have the same coefficient: β_g = (0, 0, 0)^T γ.
• The baseline coefficient β0 is set to -40γ/3 so that, on average, a farm has a 50% chance of an outbreak.
• The parameter γ controls the strength of the questions' effect on the outcome. In this simulation study, we consider γ = 0.1, 0.25, 0.5, 1, 2.
• For each value of γ, 20 datasets are simulated. The bigger γ is, the larger the corresponding parameters. The datasets with γ = 2 are provided.
• Types of Questions
• Highly Important Questions (First 40)
• All three answers have different effects on outbreak probability.
• Coefficients: (1,0,-1) × γ
• Moderately Important Questions (Next 40, Questions 41-80)
• Only one answer has an effect; the other two do not.
• Coefficients: (1,0,0) × γ
• Unimportant Questions (Last 40, Questions 81-120)
• All answers have no effect on the outbreak.
• Coefficients: (0,0,0) × γ
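A minimal sketch that simulates one such dataset directly from the formula and coefficient scheme above (800 farms, 120 questions, γ = 2); the random seed and the integer coding of the three answers are illustrative choices.

```python
# Hedged sketch: simulate outbreaks from the logistic model described above.
import numpy as np

rng = np.random.default_rng(1)
n_farms, n_questions, gamma = 800, 120, 2.0

# Coefficient vector (one value per answer) for each question type.
betas = ([np.array([1, 0, -1]) * gamma] * 40 +   # all answers differ
         [np.array([1, 0, 0]) * gamma] * 40 +    # one answer differs
         [np.array([0, 0, 0]) * gamma] * 40)     # no effect
beta0 = -40 * gamma / 3                          # centers outbreak rate at 50%

answers = rng.integers(0, 3, size=(n_farms, n_questions))  # 3 choices each
logit = beta0 + np.array(
    [sum(betas[g][answers[i, g]] for g in range(n_questions))
     for i in range(n_farms)])
prob = 1 / (1 + np.exp(-logit))
outbreak = rng.binomial(1, prob)                 # 1 = outbreak, 0 = none
print("Simulated outbreak rate:", outbreak.mean())
```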
Role of γ (Strength of Effect)
• γ controls how much the survey answers affect outbreak probability.
• Higher γ means a stronger effect.
• Tested values: γ = 0.1, 0.25, 0.5, 1, 2.
Data Simulation Process
• 20 datasets were simulated for each γ(gamma) value.
• Helps researchers test different outbreak scenarios.
• The dataset with γ=2 is provided for analysis.