UNIT 1_PPT
Generally, the dataset is split in an 80:20 ratio (80% for training, 20% for testing).
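A minimal sketch of such an 80:20 split using scikit-learn's train_test_split; the arrays X and y below are dummy placeholders for the real feature matrix and labels.

# A minimal sketch of an 80:20 train/test split using scikit-learn.
# X (features) and y (labels) are dummy placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (dummy data)
y = np.array([0, 1] * 5)           # dummy labels

# test_size=0.2 keeps 80% of the rows for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)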
7. Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the
independent variables of the dataset within a specific range. In feature scaling, we put our variables on the same
scale and in the same range so that no single variable dominates the others.
There are two ways to perform feature scaling in machine learning: standardization and normalization (min-max scaling).
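A short sketch of both approaches using scikit-learn; the small array below is made-up example data.

# A short sketch of the two feature-scaling approaches using scikit-learn.
# The small array below is made-up example data.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: rescale each column to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max scaling): rescale each column into the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_norm)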
Chapter 1.5 Exploratory Data Analysis (EDA)
Exploratory Data Analysis is an approach in analyzing data sets to
summarize their main characteristics, often using statistical graphics
and other data visualization methods.
EDA assists data science professionals in various ways:
1. Getting a better understanding of data
2. Identifying various data patterns
3. Getting a better understanding of the problem statement
Types of exploratory data analysis
Univariate non-graphical
The data being analyzed consists of just one variable, and it is summarized with descriptive statistics rather than plots.
Univariate graphical
Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore
required.
Common types of univariate graphics include:
Stem-and-leaf plots, which show all data values and the shape of the distribution.
Histograms, a bar plot in which each bar represents the frequency (count) or proportion
(count/total count) of cases for a range of values.
Box plots, which graphically depict the five-number summary of minimum, first quartile, median,
third quartile, and maximum.
Multivariate non-graphical
Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques
generally show the relationship between two or more variables of the data through cross-tabulation
or statistics.
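A small sketch of multivariate non-graphical EDA via cross-tabulation with pandas; the tiny DataFrame below is made-up example data.

# A small sketch of multivariate non-graphical EDA using cross-tabulation.
# The tiny DataFrame below is made-up example data.
import pandas as pd

df = pd.DataFrame({
    "segment": ["Consumer", "Corporate", "Consumer", "Corporate", "Consumer"],
    "region":  ["East", "East", "West", "West", "East"],
})

# Cross-tabulation: counts of cases for each combination of two variables
print(pd.crosstab(df["segment"], df["region"]))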
Multivariate graphical
Multivariate graphical EDA uses graphics to display relationships between two or more sets of data.
Common types of multivariate graphics include:
Scatter plot, which is used to plot data points on a horizontal and a
vertical axis to show how much one variable is affected by another.
Multivariate chart, which is a graphical representation of the
relationships between factors and a response.
Run chart, which is a line graph of data plotted over time.
Bubble chart, which is a data visualization that displays multiple
circles (bubbles) in a two-dimensional plot.
Heat map, which is a graphical representation of data where values
are depicted by color.
Exploratory Data Analysis Tools
Some of the most common data science tools used to create an EDA include:
Python: An interpreted, object-oriented programming language with
dynamic semantics. Its high-level, built-in data structures, combined with
dynamic typing and dynamic binding, make it very attractive for rapid
application development, as well as for use as a scripting or glue language
to connect existing components together. Python and EDA can be used
together to identify missing values in a data set, which is important so you
can decide how to handle missing values for machine learning.
https://www.kaggle.com/code/imoore/intro-to-exploratory-data-analysis-eda-in-python
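As a sketch of the point above, pandas is commonly used to spot missing values during EDA in Python; the file name sample.csv is a placeholder.

# A minimal sketch of identifying missing values with pandas during EDA.
# The file name "sample.csv" is a placeholder.
import pandas as pd

df = pd.read_csv("sample.csv")

print(df.shape)            # number of rows and columns
df.info()                  # column types and non-null counts
print(df.isnull().sum())   # number of missing values per column

# Typical follow-up: drop or impute missing values, e.g.
# df = df.dropna()  or  df["col"] = df["col"].fillna(df["col"].mean())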
R: An open-source programming language and free software environment
for statistical computing and graphics supported by the R Foundation for
Statistical Computing. The R language is widely used among statisticians in
data science in developing statistical observations and data analysis.
Line plot: a type of plot which displays information as a series of data
points called "markers" connected by straight lines.
In Matplotlib, we can make a line plot using the plt.plot() function.
Scatter plot: This type of plot shows all individual data points.
plt.scatter() function
Histogram: an accurate representation of the distribution of numeric
data.
First, we divide the entire range of values into a series of intervals, and
second, we count how many values fall into each interval. The intervals
are also called bins.
To make a histogram with Matplotlib, we can use the plt.hist() function.
Box plot, also called the box-and-whisker plot: a way to show the
distribution of values based on the five-number summary: minimum,
first quartile, median, third quartile, and maximum.
Bar chart: represents categorical data with rectangular bars. Each bar
has a height that corresponds to the value it represents.
To make a bar chart with Matplotlib, we'll need the plt.bar() function.
Pie chart: a circular plot, divided into slices to show numerical
proportion. plt.pie() function
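A compact sketch of the Matplotlib functions listed above, drawn on one figure with small made-up arrays.

# A compact sketch of the Matplotlib plotting functions listed above,
# using small made-up arrays purely for illustration.
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
y = x ** 2
data = np.random.randn(1000)

fig, axes = plt.subplots(2, 3, figsize=(12, 6))

axes[0, 0].plot(x, y)                       # line plot
axes[0, 1].scatter(x, y)                    # scatter plot
axes[0, 2].hist(data, bins=20)              # histogram (values grouped into bins)
axes[1, 0].boxplot(data)                    # box plot (five-number summary)
axes[1, 1].bar(["A", "B", "C"], [5, 3, 8])  # bar chart for categorical data
axes[1, 2].pie([30, 45, 25], labels=["X", "Y", "Z"])  # pie chart of proportions

plt.tight_layout()
plt.show()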
Bar and Column Charts: one of the simplest graphs to understand how our quantitative field is
performing across various categories. It is used for comparison.
Scatter Plot and Bubble Chart: scatter and bubble plots help us to understand how variables are
spread across the range considered. They can be used to identify patterns, the presence of
outliers, and the relationship between two variables. For example, in a scatter plot of discount
against profit we may see that profits decrease as the discount increases.
Heatmaps: the most preferred chart when we want to check whether there is any correlation
between variables.
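A short sketch of a correlation heatmap built with pandas and Matplotlib; the discount/profit/sales columns are made-up example data.

# A short sketch of a correlation heatmap using pandas and Matplotlib.
# The discount/profit/sales columns are made-up example data.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "discount": [0.0, 0.1, 0.2, 0.3, 0.4],
    "profit":   [50, 42, 30, 15, 2],
    "sales":    [100, 110, 120, 130, 140],
})

corr = df.corr()                   # pairwise correlation matrix
plt.imshow(corr, cmap="coolwarm")  # encode correlation values as colors
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.title("Correlation heatmap")
plt.show()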
Chapter 1.6 Open Data
Open Data means the kind of data which is open for anyone and
everyone for access, modification, reuse, and sharing.
Why Is Open Data Important? Open data is important because the
world has grown increasingly data-driven. But if there are restrictions
on the access and use of data, the idea of data-driven business and
governance will not materialize. Therefore, open data has its own
unique place.
A list of 15 awesome Open Data sources:
1. World Bank Open Data: with data regarding what’s happening in different countries across the world, World Bank Open
Data is a vital source of Open Data.
2. WHO (World Health Organization) Open Data repository: WHO’s Open Data repository is how WHO keeps track of
health-specific statistics of its 194 Member States. The repository keeps the data systematically organized.
3. Google Public Data Explorer Launched in 2010, Google Public Data Explorer can help you explore vast amounts of
public-interest datasets.
4. Registry of Open Data on AWS (RODA) This is a repository containing public datasets. It is data which is available from
AWS resources. As far as RODA is concerned, you can discover and share the data which is publicly available.
5. European Union Open Data Portal: you can access whatever open data EU institutions, agencies, and other
organizations publish on a single platform, namely the European Union Open Data Portal.
6. FiveThirtyEight It is a great site for data-driven journalism and story-telling.
7. U.S. Census Bureau: the U.S. Census Bureau is the biggest statistical agency of the federal government. It stores and
provides reliable facts and data regarding the people, places, and economy of America.
8. Data.gov: Data.gov is the treasure-house of the US government’s open data. It was only recently that the decision was
made to make all government data available for free. When it was launched, there were only 47 datasets; there are now
about 180,000 datasets.
9. DBpedia As you know, Wikipedia is a great source of information. DBpedia aims at getting structured content from the
valuable information that Wikipedia created.
10. freeCodeCamp Open Data: it is an open-source community. It matters because it enables you to learn to code, build pro
bono projects for nonprofits, and get a job as a developer.
11. Yelp Open Datasets: the Yelp dataset is basically a subset of Yelp's own
businesses, reviews, and user data for use in personal, educational, and academic pursuits.
There are 5,996,996 reviews, 188,593 businesses, 280,991 pictures and 10 metropolitan
areas included in Yelp Open Datasets.
12. UNICEF Dataset Since UNICEF concerns itself with a wide variety of critical issues, it has
compiled relevant data on education, child labor, child disability, child mortality, maternal
mortality, water and sanitation, low birth-weight, antenatal care, pneumonia, malaria,
iodine deficiency disorder, female genital mutilation/cutting, and adolescents.
13. Kaggle Kaggle is great because it promotes the use of different dataset publication
formats. However, the better part is that it strongly recommends that the dataset publishers
share their data in an accessible, non-proprietary format.
14. LODUM It is the Open Data initiative of the University of Münster. Under this initiative, it
is made possible for anyone to access any public information about the university in
machine-readable formats. You can easily access and reuse it as per your needs.
15. UCI Machine Learning Repository It serves as a comprehensive repository of databases,
domain theories, and data generators that are used by the machine learning community for
the empirical analysis of machine learning algorithms.
Chapter 1.7 Data APIs
Data APIs A Data API provides API access to data stored in a Data management system. APIs provide
granular, per record access to datasets and their component data files.
Limitations of APIs: whilst Data APIs are in many ways more flexible than direct download, they have
disadvantages:
1. APIs are much more costly and complex to create and maintain than direct download.
2. API queries are slow and limited in size because they run in real time in memory. Thus, for bulk
access, e.g. of the entire dataset, direct download is much faster and more efficient (downloading a
1 GB CSV directly is easy and takes seconds, but attempting to do so via the API may crash the
server and be very slow).
Why Data APIs?
1. Data (pre)viewing: reliably and richly (e.g. with querying, mapping etc). This makes the data much
more accessible to non-technical users.
2. Visualization and analytics: rich visualization and analytics may need a data API (because they need
to easily query and aggregate parts of the dataset).
3. Rich Data Exploration: when exploring the data you will want to move through a dataset quickly,
pulling only parts of the data and drilling down further as needed.
4. (Thin) Client applications: with a data API, third-party users of the portal can build apps on top of the
portal data easily and quickly (and without having to host the data themselves).
Domain Model: the functionality associated with Data APIs can be divided into the following areas:
1. Descriptor: metadata describing and specifying the API, e.g. general metadata such as name, title,
description, schema, and permissions.
2. Manager: for creating and editing APIs.
• API: for creating and editing Data API descriptors (which triggers creation of storage and service endpoint).
• UI: for doing this manually.
3. Service (read): web API for accessing structured data (i.e. per record) with querying etc. When
we simply say "Data API" this is usually what we are talking about (see the sketch after this list).
• Custom API & complex functions: e.g. aggregations, joins.
• Tracking & analytics: rate-limiting etc.
• Write API: usually secondary because of its limited performance vs bulk loading.
• Bulk export of query results, especially large ones (or even export of the whole dataset in the case
where the data is stored directly in the DataStore rather than the FileStore). This is an increasingly
important feature; it may be a lower priority, but if required it is a substantive feature to implement.
4. Data Loader: bulk loading data into the system that powers the Data API.
• Bulk load: bulk import of individual data files.
• May include some ETL => this takes us more into data factory territory.
5. Storage (Structured): the underlying structured store for the data (and its layout), for example
Postgres and its table structure. This could be considered a separate component that the Data
API uses, or as part of the Data API – in some cases the store and API are completely wrapped
together, e.g. Elasticsearch is both a store and a rich Web API.
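As an illustration of the read Service above, the sketch below queries a CKAN-style datastore_search endpoint with the requests library; the portal URL and resource_id are placeholder assumptions, not values from this text.

# A sketch of calling a read Data API (CKAN-style datastore_search endpoint)
# with the requests library. The portal URL and resource_id are placeholders.
import requests

BASE_URL = "https://demo.ckan.org/api/3/action/datastore_search"  # placeholder portal

params = {
    "resource_id": "example-resource-id",  # placeholder dataset resource
    "limit": 5,                            # granular, per-record access
    "q": "health",                         # optional full-text query
}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()

# CKAN wraps query results under result -> records
records = response.json()["result"]["records"]
for record in records:
    print(record)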
Chapter 1.8 Web Scraping
Web Scraping Web scraping is an automatic method to obtain large
amounts of data from websites. Most of this data is unstructured data
in an HTML format which is then converted into structured data in a
spreadsheet or a database so that it can be used in various applications.
Web scraping requires two parts, namely the crawler and the scraper.
The crawler is an automated program (a bot or spider) that browses the web
to search for the particular data required by following links across
the internet. The scraper, on the other hand, is a specific tool created to
extract data from the website.
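A minimal scraping sketch using the requests and BeautifulSoup libraries; the URL and the HTML structure it targets are assumptions for illustration only, and real scraping should respect a site's robots.txt and terms of use.

# A minimal web-scraping sketch using requests and BeautifulSoup.
# The URL and the HTML structure it targets are placeholder assumptions.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"   # placeholder page
response = requests.get(url, timeout=30)
response.raise_for_status()

# Parse the unstructured HTML into a navigable tree
soup = BeautifulSoup(response.text, "html.parser")

# Extract structured data, e.g. every link's text and target
rows = []
for link in soup.find_all("a"):
    rows.append({"text": link.get_text(strip=True), "href": link.get("href")})

# The structured result could then be saved to a spreadsheet or database,
# e.g. with pandas: pd.DataFrame(rows).to_csv("links.csv", index=False)
print(rows[:5])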