CRASH COURSE DATA SCIENCE
(BEGINNER LEVEL)
DATA COLLECTION
1) Data collection is the process of gathering relevant information from various sources to analyze and derive insights.

2) In data science, the quality of collected data directly impacts the accuracy of the resulting analysis and models.

3) A well-defined sampling strategy ensures that collected data is representative of the larger population.

4) Surveys, interviews, and questionnaires are common methods for collecting primary data directly from individuals.

5) Web scraping involves extracting information from websites and is often used to collect data from online sources (see the sketch after this list).

6) Sensor networks and Internet of Things (IoT) devices contribute to the collection of real-time data in various applications.

7) Secondary data refers to data collected by someone else for a different purpose but can still be useful for analysis.

8) The bias present in collected data can lead to skewed insights and inaccurate conclusions.

9) Data curation involves organizing, cleaning, and preparing collected data for analysis.

10) The process of data collection should follow ethical guidelines to ensure privacy and respect for individuals' rights.
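
A minimal web-scraping sketch in Python, assuming a hypothetical page and HTML markup (the requests and beautifulsoup4 packages are real, but the URL and the CSS class below are invented for illustration):

import requests
from bs4 import BeautifulSoup

url = "https://example.com/quotes"  # hypothetical page, for illustration only
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early if the request failed

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <p class="quote"> element (assumed markup)
quotes = [p.get_text(strip=True) for p in soup.find_all("p", class_="quote")]
print(quotes)
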
DESCRIPTIVE STATISTICS
1) Descriptive statistics summarize and describe the main features of a dataset (a short Python sketch after this list illustrates the common measures).

2) Descriptive statistics can be used to summarize both categorical and numerical variables.

3) Range is a measure of dispersion that represents the difference between the maximum and minimum values in a dataset.

4) The range is not a measure of central tendency; the median is the measure that represents the middle value in a dataset.

5) The interquartile range (IQR) is a measure of spread that represents the range between the first quartile (Q1) and the third quartile (Q3).

6) The mode is the value that occurs most frequently in a dataset.

7) The median is less affected by outliers than the mean.

8) The median is less influenced by extreme values in the dataset, making it a more robust measure of central tendency compared to the mean.

9) Standard deviation measures the average distance of values from the mean.

10) Standard deviation quantifies the dispersion or spread of data by measuring the average distance between each data point and the mean.

11) Variance is not the square root of the standard deviation; rather, the standard deviation is the square root of the variance.

12) Variance is the square of the standard deviation.

13) Skewness is a measure of the symmetry of a distribution.

14) Skewness indicates the extent to which a distribution is skewed or asymmetrical.

15) Correlation measures the strength and direction of the linear relationship between two numerical variables.
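
A minimal Python sketch of the measures above, using the standard library's statistics module on a small made-up sample:

import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

print(statistics.mean(data))      # mean: average of the values
print(statistics.median(data))    # median: middle value, robust to outliers
print(statistics.mode(data))      # mode: most frequent value (here 4)
print(max(data) - min(data))      # range: maximum minus minimum

q1, q2, q3 = statistics.quantiles(data, n=4)
print(q3 - q1)                    # interquartile range (IQR): Q3 - Q1

sd = statistics.stdev(data)       # sample standard deviation
print(sd, sd ** 2)                # variance is the square of the standard deviation

# Pearson correlation between two variables (requires Python 3.10+)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
print(statistics.correlation(x, y))
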
EXPLORATORY DATA ANALYSIS
1) Exploratory data analysis involves summarizing and visualizing data to gain insights and understand patterns.

2) Exploratory data analysis is typically performed after data cleaning and preprocessing to ensure the data is in a suitable format for analysis.

3) Exploratory data analysis includes identifying outliers (extreme values) and missing values in the dataset, which can impact the validity of the analysis.

4) Descriptive statistics, such as mean, median, and standard deviation, are commonly calculated during exploratory data analysis to summarize the central tendency and dispersion of the data.

5) Exploratory data analysis is a flexible and iterative process.

6) Exploratory data analysis can help detect relationships and correlations between variables, which can provide valuable insights into the dataset.

7) The primary goal of exploratory data analysis is to gain an understanding of the data rather than formal hypothesis testing and statistical inference.

8) Exploratory data analysis can reveal potential data quality issues, such as inconsistent or erroneous values, and identify data anomalies that require further investigation.

9) Graphical techniques, such as histograms, scatter plots, and box plots, are commonly used in exploratory data analysis to visualize the distribution, relationships, and outliers in the data (see the sketch after this list).

10) Exploratory data analysis is an ongoing process that often continues as new questions arise during analysis.
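
A minimal sketch of the graphical techniques in point 9, using pandas and matplotlib on a small made-up DataFrame (the column names and values are illustrative assumptions):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "age": [23, 25, 31, 35, 35, 41, 52, None],
    "income": [28000, 32000, 41000, 49000, 50000, 61000, 90000, 45000],
})

print(df.describe())    # mean, std, and quartiles for numeric columns
print(df.isna().sum())  # count of missing values per column

df["age"].hist()        # histogram: distribution of a single variable
plt.show()

df.plot.scatter(x="age", y="income")  # scatter plot: relationship between two variables
plt.show()

df.boxplot(column="income")  # box plot: spread and potential outliers
plt.show()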


DATA VISUALISATIONS

1) Data visualisation is the presentation of data in a graphical or pictorial format.

2) Bar charts, line charts, and pie charts are some of the common types of visualisation charts.

3) A line chart is a data visualisation technique suitable for displaying trends over time.

4) A heat map is used to represent the distribution of values with colours.

5) A tree map is used to show hierarchical data using nested rectangles.

6) A box plot is used to show the distribution of data.

7) A choropleth map is used to represent geographic data with colour variations.

8) The points on a scatter plot show the relationship between two variables.

9) In a bar chart, the y-axis shows the dependent variable while the x-axis shows the independent variable.

10) Python is among the most commonly used programming languages for creating interactive data visualisations (a short example follows this list).
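
A short matplotlib example of the bar and line charts from points 2, 3, and 9, with made-up categories and values (interactive libraries such as Plotly follow the same idea):

import matplotlib.pyplot as plt

categories = ["A", "B", "C"]        # made-up data
sales = [120, 95, 150]
months = [1, 2, 3, 4, 5]
revenue = [10, 12, 9, 14, 16]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(categories, sales)   # bar chart: compare categories
ax1.set_xlabel("category")   # independent variable on the x-axis
ax1.set_ylabel("sales")      # dependent variable on the y-axis

ax2.plot(months, revenue)    # line chart: trend over time
ax2.set_xlabel("month")
ax2.set_ylabel("revenue")

plt.tight_layout()
plt.show()
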
DATA CLEANING
1) Imputation techniques are used to fill in missing values.

2) Outlier detection is used to identify and handle unusual data points.

3) Standardization is used to bring all variables to a common scale.

4) Deduplication is used to identify and handle duplicate records.

5) Regular expressions are used for pattern matching and extraction.

6) One-hot encoding is used for handling categorical variables.

7) Scaling is used to re-scale numerical variables.

8) Trimming is used to remove unnecessary white spaces.

9) Mean imputation: replacing missing values with the mean of the variable.

10) Forward filling: filling missing values with the value before them.

11) Interpolation: estimating missing values based on the adjacent values.

12) Deleting rows: removing rows with missing values (a pandas sketch of several of these techniques follows this list).
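
A minimal pandas sketch of several of the techniques above (trimming, deduplication, mean imputation, forward filling, one-hot encoding); the DataFrame and its column names are illustrative assumptions:

import pandas as pd

df = pd.DataFrame({
    "name": ["  Ana ", "Bo", "Bo", "Cy"],
    "city": ["Paris", "Lahore", "Lahore", None],
    "score": [10.0, None, None, 14.0],
})

df["name"] = df["name"].str.strip()                   # trimming white space
df = df.drop_duplicates()                             # deduplication
df["score"] = df["score"].fillna(df["score"].mean())  # mean imputation
# Interpolation alternative: df["score"] = df["score"].interpolate()
df["city"] = df["city"].ffill()                       # forward filling
df = pd.get_dummies(df, columns=["city"])             # one-hot encoding

print(df)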


MACHINE LEARNING
1) The two main categories of machine learning models are supervised and unsupervised.

2) Labeled data in supervised learning provides correct answers for training the model to learn relationships between input features and output labels.

3) Precision is the ratio of correctly predicted positive observations to the total predicted positives, while recall is the ratio of correctly predicted positive observations to the total actual positives.

4) Accuracy might not be suitable for imbalanced datasets because it can be dominated by the majority class and may not reflect the true model performance.

5) Cross-validation assesses a machine learning model's performance by dividing the dataset into subsets, training/evaluating the model on different combinations, and providing insights into its generalization capability (a scikit-learn sketch follows this list).
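
A minimal scikit-learn sketch of points 2, 3, and 5, trained on a synthetic dataset (an illustrative assumption, not a benchmark):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Labeled data: X holds the input features, y holds the output labels
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

print("precision:", precision_score(y_test, pred))  # TP / (TP + FP)
print("recall:", recall_score(y_test, pred))        # TP / (TP + FN)

# 5-fold cross-validation: train/evaluate on different subset combinations
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("cv accuracy per fold:", scores)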
