module-2
2.1 Comparison between Statistician and Data Scientist
• Statistics as a field dates back to 1749, with a long-established theory.
• Data Science emerged recently due to big data and computational advancements.
• Both careers involve working with data but have key differences.
Data Handling
• Statisticians:
• Work with well-formatted numerical and categorical data.
• Datasets are small enough for PC memory.
• Data Scientists:
• Handle large databases, text, images, videos, real-time data.
• Work with streaming data and unstructured information.
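The contrast between data that fits in memory and data that arrives as a stream can be made concrete with a small sketch. The function below keeps a running mean without storing past values, which is one common way to process a stream. The name `running_mean` and the sample readings are illustrative assumptions, not part of the course material.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one value at a time."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: earlier values never need to be kept in memory.
        mean += (value - mean) / count
        yield mean


# Hypothetical readings arriving one at a time (made-up numbers).
readings = [10.0, 20.0, 30.0, 40.0]
means = list(running_mean(readings))
print(means)
```

A statistician's in-memory workflow would typically load all values first and call a single mean function; the streaming version trades that convenience for constant memory use.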
Focus on Modeling
• Statisticians:
• Focus on statistical inference from small datasets.
• Develop models without much data cleaning.
• Data Scientists:
• Spend more time on data preprocessing.
• Modeling is often automated with open-source tools.
Deployment & Production
• Statisticians:
• Work mostly in research/academia.
• Bring data to models.
• Data Scientists:
• Work in industry, closer to real-time data systems.
• Bring models to data and deploy them into production.
2.2 Beyond Data and Analytics
Data science projects involve more than just data and analytics—they require
collaboration among different roles in a company.
A data scientist must communicate effectively, understand the business problem, and set
realistic expectations to ensure success.
A data science project may involve people with different roles, especially in a large
company:
1.Multiple roles are involved –
Business Leader – Defines the problem and expected value. (Example: A CEO wants to
reduce customer churn.)
IT & Data Owner – Provides data access and infrastructure. (Example: IT ensures the
database is accessible.)
Policy & Security Team – Ensures compliance with privacy laws. (Example: GDPR
compliance for user data.)
Engineering Team – Builds and maintains models. (Example: A machine learning model for
fraud detection.)
Project Manager – Keeps tasks on track. (Example: Ensuring the project meets deadlines.)
2. Effective communication is crucial –
Data scientists talk to all levels: executives, IT teams, engineers, and
front-line workers.
They must simplify technical concepts so others can understand.
(Example: Explaining AI-driven sales forecasting to a marketing team.)
3. Realistic expectations – Many projects fail due to overpromising or
poor planning.
Avoid overpromising results or setting unrealistic timelines.
Data scientists must ensure expectations are data-driven. (Example: A
project predicting customer behavior should use historical data, not
assumptions.)
4.Collaboration is key – Working with data owners, IT teams, and infrastructure
managers ensures smooth execution.
Work with data owners to get high-quality data.
5.Budget and resources matter – Cloud computing can scale projects, but costs
must be managed.
Understand the costs and limitations of computing resources. (Example: Choosing
between on-premise servers or cloud computing to analyze millions of
transactions.)
Hence, the Role of a Data Scientist
• A data scientist is not just an analyst but a project leader.
• They must balance business needs, data quality, realistic timelines, and
technical execution.
• Example: A company investing in AI for customer service must ensure:
• Data is clean and relevant.
• The model aligns with business needs.
• Infrastructure is cost-effective.
2.3 Three/Four Pillars of Knowledge
To become a great data scientist, you need a combination of technical
skills, business knowledge, and communication skills. These skills help in
analyzing data, making better business decisions, and effectively sharing
insights with others.
Key Areas of Data Science Skills:
• Domain Knowledge – Understanding the business side of data science.
• Math Skills – Essential for understanding machine learning algorithms.
• Computer Science – Programming, databases, and distributed
computing.
• Machine Learning – Applying algorithms to make predictions and automate
tasks.
• Communication Skills – Presenting data insights clearly to non-technical
people.
What Makes a Successful Data Scientist?
Data science is not just about coding or math—it requires multiple skills.
Four key skills: Business knowledge, math, programming, and communication.
1.Domain Knowledge – Why Business Understanding Matters
• Data scientists help businesses make profitable decisions.
• Without knowing the company's business model, a data scientist is less useful.
• Example: A data scientist at Amazon must understand how customers shop to
improve recommendations.
2.Math Skills – The Backbone of Machine Learning
• You can’t skip math in data science!
• Important topics:
• Linear Algebra, Calculus, & Optimization – Used in machine learning.
• Statistics & Probability – Helps in analyzing data trends.
• Example: Predicting house prices using regression requires statistics and
probability.
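As a minimal sketch of the house-price example, the code below fits a straight line by ordinary least squares using only the standard library. The function name `fit_line` and the size and price figures are made up for illustration; real data would be noisy and would usually need more than one predictor.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy data: house size (square metres) and price (thousands).
sizes = [50, 80, 100, 120]
prices = [120, 180, 220, 260]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 90 + intercept  # predicted price for a 90 m^2 house
print(slope, intercept, predicted)
```

The slope formula is where statistics enters: it is the sample covariance of size and price divided by the sample variance of size.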
3.Computer Science – The Technical Side
• Programming is a must-have skill for data science.
• You need knowledge of:
• Programming languages – Python, R, SQL, Java.
• Databases – SQL (relational) & MongoDB (non-relational).
• Big Data Tools – Hadoop, Spark for large datasets.
• Example: A self-driving car uses Python & machine learning
algorithms to recognize objects.
4.Machine Learning – The Core of Data Science
• Machine Learning (ML) helps computers learn from data and make predictions.
• Two main types:
• Supervised Learning – Data with labels (e.g., email spam detection).
• Unsupervised Learning – No labels (e.g., customer segmentation).
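The difference between the two types can be sketched in a few lines. The supervised function below uses labels (a one-nearest-neighbour rule for spam detection), while the unsupervised function groups unlabelled values (a basic two-cluster k-means for customer segmentation). Both algorithms are chosen only for brevity, and the function names and data are illustrative assumptions.

```python
def nearest_neighbor_predict(train, query):
    """Supervised: return the label of the training point closest to query."""
    closest = min(train, key=lambda pair: abs(pair[0] - query))
    return closest[1]


def two_means(values, iterations=10):
    """Unsupervised: split 1-D values into two groups with a basic k-means loop."""
    low, high = min(values), max(values)  # initial centroids
    if low == high:
        raise ValueError("need at least two distinct values")
    for _ in range(iterations):
        # Assign each value to the nearer centroid, then recompute centroids.
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return group_low, group_high


# Supervised: made-up (suspicious word count, label) pairs for emails.
labeled_emails = [(0, "ham"), (1, "ham"), (7, "spam"), (9, "spam")]
label = nearest_neighbor_predict(labeled_emails, 6)

# Unsupervised: made-up monthly spend per customer, with no labels.
spend = [12, 15, 14, 90, 95, 100]
low_spenders, high_spenders = two_means(spend)
print(label, low_spenders, high_spenders)
```

Note that the supervised function cannot run without the labels in `labeled_emails`, while `two_means` discovers its two groups from the values alone.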
In the simulation model, β0 is the intercept, x(i,g) is a three-dimensional indicator vector for
the answer to question g, and βg is the parameter vector corresponding to the g-th predictor.
Three types of questions are considered, according to their effects on the outcome:
• The first forty survey questions are important questions: the coefficients of the
three answers to these questions are all different:
βg = (1, 0, −1) × γ, g = 1, …, 40
• The second forty survey questions are also important questions, but
only one answer has a coefficient that is different from the other two
answers:
βg = (1, 0, 0) × γ, g = 41, …, 80
• The last forty survey questions are unimportant questions:
all three answers have the same coefficients:
βg = (0, 0, 0) × γ, g = 81, …, 120
• The baseline coefficient β0 is set to −40γ/3 so that on average a farm has a 50%
chance of having an outbreak.
• The parameter γ controls the strength of the questions'
effect on the outcome. This simulation study considers the situations where
γ = 0.1, 0.25, 0.5, 1, 2.
• For each value of γ, 20 datasets are simulated. The bigger γ is, the larger the
corresponding parameters. We provide the datasets with γ = 2.
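The model equation that these definitions belong to is not reproduced in the passage above. One form consistent with the definitions is a logistic model, written out below as an assumption rather than as the source's stated formula. Under the further assumption that each of the three answers is equally likely, the same sketch shows why a baseline of −40γ/3 centres the outbreak probability at 50%.

```latex
\begin{align*}
\log\frac{p_i}{1-p_i} &= \beta_0 + \sum_{g=1}^{120} x_{i,g}^{\top}\beta_g
  && \text{(assumed logistic form)} \\
\mathbb{E}\!\left[x_{i,g}^{\top}\beta_g\right] &=
  \begin{cases}
    \tfrac{1}{3}(1 + 0 - 1)\,\gamma = 0, & g = 1,\dots,40 \\
    \tfrac{1}{3}(1 + 0 + 0)\,\gamma = \tfrac{\gamma}{3}, & g = 41,\dots,80 \\
    0, & g = 81,\dots,120
  \end{cases} \\
\mathbb{E}\!\left[\log\frac{p_i}{1-p_i}\right] &= \beta_0 + 40\cdot\frac{\gamma}{3} = 0
  \quad\Longrightarrow\quad \beta_0 = -\frac{40}{3}\,\gamma
\end{align*}
```

Here p_i denotes the outbreak probability of farm i. An expected log-odds of zero corresponds to a probability of 0.5 for a farm with average answers, which matches the stated 50% chance.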
• Types of Questions
• Highly Important Questions (First 40)
• All three answers have different effects on outbreak probability.
• Coefficients: (1,0,-1) × γ
• Moderately Important Questions (Next 40, Questions 41-80)
• Only one answer has an effect; the other two do not.
• Coefficients: (1,0,0) × γ
• Unimportant Questions (Last 40, Questions 81-120)
• All answers have no effect on the outbreak.
• Coefficients: (0,0,0) × γ
Role of γ (gamma): Strength of Effect
• γ is a control factor that determines how strongly survey answers affect outbreaks.
• Higher γ = stronger effect of survey answers on outbreaks.
• Simulations were run for γ = 0.1, 0.25, 0.5, 1, 2 to compare effect strengths.
Data Simulation Process
• 20 datasets were simulated for each γ(gamma) value.
• Helps researchers test different outbreak scenarios.
• The dataset with γ=2 is provided for analysis.
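The simulation process summarized above can be sketched as follows. This is not the original simulation code: it assumes that answers are drawn uniformly from the three choices and that outbreak probability follows a logistic model, neither of which is stated in the text. The function name `simulate_farms`, the default of 800 farms, and the seeds are illustrative choices.

```python
import math
import random

GAMMAS = [0.1, 0.25, 0.5, 1, 2]

# Per-answer coefficients before scaling by gamma, one tuple per question:
# questions 1-40, 41-80, and 81-120 as listed under "Types of Questions".
BASE_COEFS = [(1, 0, -1)] * 40 + [(1, 0, 0)] * 40 + [(0, 0, 0)] * 40


def simulate_farms(gamma, n_farms=800, seed=0):
    """Return (answers, outbreaks) for n_farms simulated farms.

    Assumptions: uniform random answers and a logistic link.
    n_farms=800 is an arbitrary illustrative default.
    """
    rng = random.Random(seed)
    beta0 = -40 * gamma / 3  # baseline coefficient
    answers, outbreaks = [], []
    for _ in range(n_farms):
        row = [rng.randrange(3) for _ in BASE_COEFS]  # answer index 0, 1 or 2
        log_odds = beta0 + gamma * sum(
            coef[a] for coef, a in zip(BASE_COEFS, row)
        )
        p = 1 / (1 + math.exp(-log_odds))
        answers.append(row)
        outbreaks.append(1 if rng.random() < p else 0)
    return answers, outbreaks


# One dataset at gamma = 2, the setting of the provided data.
answers, outbreaks = simulate_farms(gamma=2)
outbreak_rate = sum(outbreaks) / len(outbreaks)
print(f"share of farms with an outbreak: {outbreak_rate:.2f}")

# 20 datasets per gamma value (fewer farms here only to keep the sketch fast).
datasets = {
    gamma: [simulate_farms(gamma, n_farms=50, seed=rep) for rep in range(20)]
    for gamma in GAMMAS
}
```

With these assumptions the outbreak share should land near one half, in line with the purpose of the baseline coefficient.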