ITE Elective Lecture Materials: Data Collection and Descriptive Statistics

Data collection is essential for informed decision-making, enabling the evaluation of outcomes and forecasting trends. It involves gathering both primary and secondary data, with various methods and tools available, while adhering to ethical considerations like privacy and consent. Data preprocessing is crucial for transforming raw data into a clean format, enhancing the performance of machine learning models and ensuring accurate analysis.


Data Collection

is the process of collecting and evaluating information or data from multiple sources to find answers to research problems, evaluate outcomes, and forecast trends and probabilities.

Importance of Data in Today's World


• Data is the lifeblood of nearly every sector in our digital age.
• Data is all around us, from our phones to the supermarkets.
• Data collection is the process of gathering and measuring information on variables of interest.
• It enables us to answer questions, test hypotheses, and evaluate outcomes.
• We use data to understand patterns, make decisions, and predict future outcomes.

What is Data?
• Data is a set of values of qualitative or quantitative variables.
• It's the raw information from which statistics are derived, and it's the basis for all scientific
conclusions.
• Not all data is created equal.

Qualitative vs Quantitative Data


• Qualitative data is about qualities; it is descriptive and involves characteristics that can't be counted.
• Qualitative data is expressed in words and analyzed through interpretations and categorizations.
• Quantitative data deals with quantities and involves numbers and measurements.
• Quantitative data is expressed in numbers and graphs and analyzed through statistical methods.

Examples of Qualitative and Quantitative Data


• Posting product or service reviews online is an example of qualitative data.
• Wearing fitness trackers and recording the number of steps, heart rate, and distance covered produces quantitative data points.
• Filling in surveys can involve both types of data collection.

The Power of Data


• Data collection helps in making informed decisions in a variety of fields.
• It can be used to predict future trends, study behavior, and paint a clearer picture of the world around
us.
• Every piece of information, every number, and every subjective detail is a potential data point.
Importance of Data Collection
• Allows for informed decision-making
• Helps validate findings and ensures accuracy in the conclusions
• Critical in monitoring performance and making improvements
Data Collection Process
Step-1: Identify what information you need to collect.
Step-2: Choose your data collection method.
Step-3: Once you collect the data, analyze it.
Step-4: Present the findings.

Primary Data Collection


Involves gathering new data directly from the source. Examples include interviews, surveys, and observations.
• Interviews provide in-depth information but may not be feasible for large numbers
• Surveys are efficient and cost-effective but response rate and design can affect the data quality
• Observation can provide rich data but requires careful planning

Secondary Data Collection


Uses data already collected for other purposes.
Examples include public records, statistical databases, research articles, and online databases.
• Can be less time-consuming and less expensive than primary data collection
• May not be as specific or tailored to the research question
• Issues with accuracy and reliability may arise

Tools for Data Collection

Questionnaires


One of the most common tools for data collection.
Can be distributed in person, through mail, or electronically.
Flexible, cost-effective, and can collect data from a large number of participants simultaneously.

Observational Tools
• Include video and audio recording devices for capturing behaviors or events
• Software for tracking online behavior and conducting structured observations (checklists or rating scales)
• Can be used in a classroom or online setting

Ethics in Data Collection


Privacy
• Privacy refers to the rights of individuals to control information about themselves.
• Data collection must respect the privacy of participants by not collecting more data than necessary.
• Data must be collected in a manner that does not intrude unnecessarily into participants' lives.

Consent
• Participants have the right to know how their data will be used and to agree to this use.
• Consent should always be obtained before collecting data.
• Consent should be informed, meaning the individual fully understands what they're agreeing to.

Confidentiality
• Confidentiality relates to how data is stored and who has access to it.
• Data should be stored securely, and access should be restricted to those who need it for legitimate purposes.
• Participants need to trust that their information will be kept confidential.

Accuracy
• Accuracy refers to the truthfulness and correctness of the data.
• Data collectors must strive for accuracy to maintain the integrity of the results.
• This includes carefully designing data collection methods, thoroughly training data collectors, and checking data for errors.

Summary
• Data is crucial for decision-making and understanding the world
• Data collection is vital in research, business, and decision-making
• Primary data is collected directly from the source
• Secondary data is collected from existing data
• Questionnaires and software aid in the data collection process
• Privacy, consent, confidentiality, and accuracy are essential ethical considerations
• Follow best practices to ensure high-quality, reliable data
• Effective data collection is a cornerstone of informed decision-making

Introduction to Data Preprocessing


• Data preprocessing is the process of transforming raw data into a clean and usable format.
• It is a crucial step before applying machine learning models, ensuring optimal model performance.
• Poor preprocessing can lead to inaccurate models and misleading insights.

Why is Data Preprocessing Important?


Key reasons include:
• Improves data quality by addressing missing values, outliers, and inconsistencies.
• Enhances machine learning performance by ensuring cleaner data for algorithms.
• Prevents bias and errors during model training and prediction.
• Reduces computational complexity, saving time and resources.

The Data Preprocessing Pipeline


Steps involved in data preprocessing:
• Data Cleaning: Handling missing data, outliers, and duplicates.
• Data Transformation: Scaling features and encoding categorical variables.
• Data Reduction: Dimensionality reduction and feature selection.
• Data Integration: Merging datasets and resolving schema discrepancies.
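
As a minimal sketch (not part of the original slides), the following Python code walks through these four steps on a small hypothetical pandas DataFrame; the column names (age, city, income) and the 1.5 x IQR outlier cutoff are invented for illustration.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Hypothetical raw dataset (column names are illustrative only)
    df = pd.DataFrame({
        "age":    [25, 32, None, 45, 32],
        "city":   ["Cebu", "Manila", "Cebu", None, "Manila"],
        "income": [30000, 52000, 41000, 990000, 52000],
    })

    # 1. Data Cleaning: drop duplicates, fill missing values, remove outliers
    df = df.drop_duplicates()
    df["age"] = df["age"].fillna(df["age"].median())
    df["city"] = df["city"].fillna(df["city"].mode()[0])
    q1, q3 = df["income"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[(df["income"] >= q1 - 1.5 * iqr) & (df["income"] <= q3 + 1.5 * iqr)]

    # 2. Data Transformation: scale numeric features, encode the categorical one
    df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])
    df = pd.get_dummies(df, columns=["city"])

    # 3. Data Reduction: drop features that carry no information (a simple form of feature selection)
    df = df.drop(columns=[c for c in df.columns if df[c].nunique() <= 1])

    # 4. Data Integration would merge this frame with other sources, for example:
    # df = df.merge(other_df, on="customer_id", how="left")
    print(df.head())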

Data Preprocessing in Machine Learning


Why preprocessing is essential for machine learning models:
• Prepares data for algorithms by normalizing and encoding it.
• Reduces noise and irrelevant features, improving model accuracy.
• Addresses class imbalances, enhancing overall model performance.
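
A brief hypothetical illustration of two of these points with scikit-learn: features are normalized with StandardScaler, and class_weight="balanced" is one common way to counter class imbalance (the dataset below is randomly generated purely for the example).

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    # Hypothetical imbalanced dataset: 90 negative cases, 10 positive cases
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = np.array([0] * 90 + [1] * 10)

    # Normalize features so no single feature dominates the model
    X_scaled = StandardScaler().fit_transform(X)

    # class_weight="balanced" re-weights the classes to counter the imbalance
    model = LogisticRegression(class_weight="balanced").fit(X_scaled, y)
    print(model.score(X_scaled, y))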

Types of Data and Their Challenges


Types of Data:
• Structured Data: Data organized in a defined manner (e.g., databases, spreadsheets).
• Unstructured Data: Data without a predefined format (e.g., text, images, videos).
• Semi-structured Data: Data with some organizational properties, but not fully structured (e.g., JSON,
XML).

Challenges with Structured Data:


Common issues include:
• Missing Values: Incomplete records lead to inaccurate analysis.
• Outliers: Extreme values can distort statistical models.
• Duplicates: Multiple occurrences of the same record can bias results.
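
A small illustrative check for these three issues with pandas (the table and the 1.5 x IQR outlier rule are assumptions made for the example):

    import pandas as pd

    # Hypothetical table showing a missing value, a duplicate row, and an outlier
    df = pd.DataFrame({
        "student": ["Ana", "Ben", "Ben", "Cara", "Dan"],
        "score":   [85, 90, 90, None, 400],
    })

    print(df.isna().sum())         # missing values per column
    print(df.duplicated().sum())   # number of fully duplicated rows

    # Flag outliers with the 1.5 x IQR rule
    q1, q3 = df["score"].quantile([0.25, 0.75])
    iqr = q3 - q1
    print(df[(df["score"] < q1 - 1.5 * iqr) | (df["score"] > q3 + 1.5 * iqr)])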

Challenges with Unstructured Data:


Key challenges include:
• Lack of Standardization: Difficult to organize and analyze.
• Complexity: Requires advanced techniques like natural language processing (NLP) for text and
computer vision for images.
• Storage: Unstructured data often requires large storage and complex management.

Challenges with Semi-structured Data:


Common issues include:
• Inconsistent Formats: Data may vary between different sources.
• Parsing Difficulty: Requires specialized methods to extract useful information.

Addressing Data Challenges


Methods to overcome data challenges:
• Data Cleaning: Handling missing values, duplicates, and outliers.
• Preprocessing Techniques: Using advanced tools for unstructured and semi-structured data.
• Domain Expertise: Leveraging knowledge of the data to better manage inconsistencies.

Descriptive Statistics
Descriptive statistics is a branch of statistics that focuses on summarizing, organizing, and presenting data in
a meaningful way to provide insights into the characteristics of a dataset. It is the initial and fundamental step
in data analysis, enabling researchers, analysts, and decision-makers to understand and interpret the
information within the data.

Common Techniques and Concepts in Descriptive Statistics


Measures of Central Tendency:
• Mean: The arithmetic average of a set of values, calculated by summing all the values and dividing by
the number of data points.
• Median: The middle value in a dataset when arranged in ascending or descending order, splitting the
data into two equal halves.
• Mode: The most frequently occurring value in a dataset.
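
A quick worked example of the three measures using Python's built-in statistics module (the numbers are made up for illustration):

    from statistics import mean, median, mode

    data = [4, 7, 7, 9, 12, 15, 7]

    print(mean(data))    # (4 + 7 + 7 + 9 + 12 + 15 + 7) / 7 = 61 / 7, about 8.71
    print(median(data))  # sorted: 4, 7, 7, 7, 9, 12, 15 -> middle value is 7
    print(mode(data))    # 7 occurs most often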

Measures of Dispersion:
• Range: The difference between the maximum and minimum values in a dataset.
• Variance: A measure of how much individual data points deviate from the mean, quantifying the
spread of the data.
• Standard Deviation: The square root of the variance, providing the average distance between each
data point and the mean.
• Interquartile Range (IQR): The range between the first quartile (25th percentile) and the third quartile
(75th percentile), showing the spread of the middle 50% of the data.
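
Continuing the same illustrative dataset, the dispersion measures can be computed with NumPy (ddof=1 gives the sample versions of variance and standard deviation):

    import numpy as np

    data = np.array([4, 7, 7, 9, 12, 15, 7])

    print(data.max() - data.min())   # range = 15 - 4 = 11
    print(data.var(ddof=1))          # sample variance
    print(data.std(ddof=1))          # sample standard deviation
    q1, q3 = np.percentile(data, [25, 75])
    print(q3 - q1)                   # interquartile range (IQR)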

Measures of Distribution Shape:


• Skewness: Measures the asymmetry of the data distribution. Positive skew indicates a longer right
tail, while negative skew suggests a longer left tail.
• Kurtosis: Measures the "tailedness" of the data distribution, assessing whether the data has heavier
or lighter tails than a normal distribution.
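
A short illustrative computation with SciPy; note that scipy.stats.kurtosis reports excess kurtosis by default, so 0 corresponds to normal-like tails:

    from scipy.stats import skew, kurtosis

    data = [4, 7, 7, 9, 12, 15, 7]

    print(skew(data))      # positive here: the distribution has a longer right tail
    print(kurtosis(data))  # excess kurtosis relative to a normal distribution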

Frequency Distribution:
• A table or chart displaying the frequency or count of each unique value in a dataset, helping to identify
patterns and common values.
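
For example, a frequency table for a small, hypothetical set of category labels can be produced with pandas:

    import pandas as pd

    colors = pd.Series(["red", "blue", "red", "green", "blue", "red"])
    print(colors.value_counts())   # frequency of each unique value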

Graphical Representation:
• Graphs and charts like histograms, bar charts, box plots, and scatter plots are used to visually
summarize and display data.
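
As an illustrative sketch with matplotlib (the data values are invented), a histogram and a box plot of the same small sample:

    import matplotlib.pyplot as plt

    data = [4, 7, 7, 9, 12, 15, 7, 8, 10, 11]

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(data, bins=5)   # histogram: shape of the distribution
    ax2.boxplot(data)        # box plot: median, quartiles, potential outliers
    plt.show()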

Percentiles:
• Percentiles divide a dataset into 100 equal parts. For example, the median is the 50th percentile, and
quartiles (25th and 75th percentiles) divide the data into four parts.
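
For instance, with NumPy (same illustrative data as above):

    import numpy as np

    data = [4, 7, 7, 9, 12, 15, 7, 8, 10, 11]

    print(np.percentile(data, 50))        # 50th percentile = median
    print(np.percentile(data, [25, 75]))  # 25th and 75th percentiles (quartiles)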

Summary Statistics:
• Summary tables provide an overview of key statistics like minimum, maximum, mean, median, and
standard deviation.
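
In pandas, a single describe() call produces such a summary table for an illustrative series:

    import pandas as pd

    scores = pd.Series([4, 7, 7, 9, 12, 15, 7, 8, 10, 11])
    print(scores.describe())   # count, mean, std, min, quartiles, and max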

Fundamental Concepts in Descriptive Statistics


Data:
• Refers to information, facts, or values collected for analysis. Data can take various forms, including
numbers, text, and images, and is often organized into datasets with individual data points or
observations.
Variable:
• A characteristic that can vary among individuals or objects in a dataset. Variables can be independent
(predictors) or dependent (outcomes) and are typically represented in columns of a dataset.

Observation:
• A single unit or measurement in a dataset. Each row in a dataset represents an observation, while
each column represents a variable.

Variation:
• Refers to the differences or variability in data values or observations within a dataset, which helps
identify patterns and uncertainty.

Random Variables:
• Variables whose values are determined by chance. They can be:
o Discrete: Take distinct, countable values (e.g., the number of heads in a series of coin flips).
o Continuous: Can take any value within a range (e.g., height).
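
A small simulation sketch with NumPy illustrating the two kinds (the parameters are arbitrary):

    import numpy as np

    rng = np.random.default_rng(42)

    heads = rng.binomial(n=10, p=0.5, size=5)       # discrete: heads in 10 coin flips
    heights = rng.normal(loc=170, scale=8, size=5)  # continuous: heights in cm
    print(heads)
    print(heights)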

Uncertain Variables:
• A broader term encompassing both random variables and other variables affected by uncertainty due
to incomplete information or variability.

Common Types of Data


• Structured Data: Organized and follows a predefined format, often stored in databases or tables
(e.g., spreadsheets).
• Unstructured Data: Lacks a specific format, including text, images, videos, and more.
• Semi-Structured Data: Has some structure but does not conform to a strict schema (e.g., JSON,
XML).
• Numerical Data: Includes discrete (distinct values like count) and continuous (any value within a
range like temperature).
• Categorical Data: Represents categories or labels, such as gender or product categories.
• Binary Data: Consists of two possible values, typically 0 and 1 (e.g., yes/no).
• Time Series Data: Collected at regular intervals over time, such as stock prices or weather data.

Population Data vs. Sample Data


• Population Data: Refers to data collected from all individuals or objects of interest.
• Sample Data: A subset of the population, used to make inferences about the larger group.
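
This distinction shows up directly in the formulas: the population variance divides by N, while the sample variance divides by n - 1. A short NumPy illustration (the data are made up):

    import numpy as np

    population = np.array([2, 4, 4, 4, 5, 5, 7, 9])
    sample = population[:4]          # a subset used to make inferences about the whole

    print(population.var(ddof=0))    # population variance: divide by N
    print(sample.var(ddof=1))        # sample variance: divide by n - 1 (unbiased estimate)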

Modifying Data in Excel


• Sorting Data: Arranging data in a specific order (ascending or descending) to identify patterns,
trends, or outliers.
• Filtering Data: Displaying a subset of data based on criteria, hiding irrelevant rows temporarily.
• Conditional Formatting: Highlighting data that meets specific conditions, making key values easier
to spot.
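
The slides describe these operations in the Excel interface; as an aside, the same sorting and filtering ideas look like this in pandas (the DataFrame and its columns are hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        "product": ["A", "B", "C", "D"],
        "sales":   [120, 450, 90, 300],
    })

    # Sorting: arrange rows by sales in descending order
    print(df.sort_values("sales", ascending=False))

    # Filtering: show only the rows that meet a criterion
    print(df[df["sales"] > 100])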

Importance of Sorting and Filtering Data


Data Organization:
• Sorting helps arrange data logically, making it easier to analyze, while filtering focuses on relevant
data.
Data Exploration:
• Sorting reveals insights like highest or lowest values, while filtering allows deeper investigation
of subsets.
Identifying Outliers:
• Sorting highlights extreme values, while filtering helps isolate and examine outliers.
Data Validation and Quality Control:
• Sorting and filtering are useful for spotting duplicates, inconsistencies, and validating data
points.
Data Cleaning and Preprocessing:
• Sorting and filtering assist in identifying and addressing missing or anomalous data.

Conditional Formatting in Excel


Conditional formatting is a powerful feature in Excel that allows you to automatically format cells based on
specific conditions or criteria. This feature helps visualize and highlight data, making it easier to identify
patterns, trends, and exceptions in your spreadsheet. Conditional formatting can be applied to various types
of data, including text, numbers, and dates.
Importance of Conditional Formatting in Data Analysis
Conditional formatting is an essential tool in data analysis for several reasons. It enhances data visualization,
highlights patterns and outliers, and simplifies the interpretation of complex datasets. Key benefits include:
1. Visualizing Data Trends and Patterns:
• Conditional formatting applies different formats such as colors, data bars, or icons to data, helping
you quickly identify trends, relationships, and patterns that might not be obvious from the raw
dataset.
2. Highlighting Data Insights:
• You can emphasize important data points or outliers by setting specific conditions. This makes it
easier to spot critical insights and anomalies, allowing for more focused data analysis.
3. Data Comparison and Ranking:
• Conditional formatting is useful for visually comparing data within a range and ranking them. This can
help in identifying the highest or lowest values, tracking progress, and comparing performance
metrics across various categories.
4. Data Validation and Quality Control:
• In data analysis, ensuring accuracy and quality is vital. Conditional formatting can highlight data
points that don’t meet validation criteria, making it easier to identify and correct errors or
inconsistencies within the dataset.
5. Data Exploration and Hypothesis Testing:
• During exploratory data analysis, you can use conditional formatting to experiment with different
conditions or test various hypotheses. This flexibility allows for quick exploration of potential
relationships within the data.
6. Quick Data Summarization:
• Conditional formatting helps in summarizing data visually. For instance, color scales can be used to
represent the magnitude of values, effectively turning data into easily interpretable heatmaps or
visual summaries.
7. Enhancing Data Reporting:
• Conditional formatting can improve the visual appeal of your reports or presentations. By making data
more engaging and understandable, it helps others grasp key takeaways with minimal effort.
8. Customizing Data Views:
• Conditional formatting allows you to customize how data is displayed based on your specific goals.
You can apply formatting rules to highlight cells meeting certain criteria, or format data based on
specific date ranges, tailoring the view to your needs.
9. Reducing Data Overload:
• In large datasets, conditional formatting helps reduce data overload by focusing your attention on the
most relevant information. This makes complex datasets more manageable and actionable.
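
For readers automating spreadsheets from Python, a minimal sketch with the openpyxl library shows a highlight rule and a color scale similar to the ideas above (the file name, cell range, and colors are arbitrary choices for the example):

    from openpyxl import Workbook
    from openpyxl.styles import PatternFill
    from openpyxl.formatting.rule import CellIsRule, ColorScaleRule

    wb = Workbook()
    ws = wb.active
    for score in [55, 72, 88, 91, 64]:   # hypothetical scores placed in column A
        ws.append([score])

    # Highlight cells above 80 with a green fill
    green = PatternFill(start_color="FFC6EFCE", end_color="FFC6EFCE", fill_type="solid")
    ws.conditional_formatting.add(
        "A1:A5", CellIsRule(operator="greaterThan", formula=["80"], fill=green)
    )

    # Color scale: low values shaded red, high values shaded green (a simple heatmap)
    ws.conditional_formatting.add(
        "A1:A5",
        ColorScaleRule(start_type="min", start_color="FFFFC7CE",
                       end_type="max", end_color="FFC6EFCE"),
    )
    wb.save("conditional_formatting_demo.xlsx")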
