
SCSB1231 DATA AND INFORMATION SCIENCE

UNIT 1 DATA ACQUISITION


Chapter 1.1 Data Acquisition
Chapter 1.2 Sources of acquiring data
Chapter 1.3 Internal Systems and External Systems
Chapter 1.4 Data Preprocessing
Chapter 1.5 Exploratory Data Analysis (EDA)
Chapter 1.6 Open Data
Chapter 1.7 Data APIs
Chapter 1.8 Web Scraping
UNIT 1 DATA ACQUISITION
Data Acquisition – Sources of acquiring the data – Internal Systems and External Systems, Web APIs, Data Preprocessing – Exploratory Data Analysis (EDA) – Basic tools (plots, graphs and summary statistics) of EDA, Open Data Sources, Data APIs, Web Scraping – Relational Database access (queries) to process/access data
Chapter 1.1 Data Acquisition
Outline
Introduction
Data versus Information
Types of Data
Importance of Data Acquisition
Steps
Data Acquisition Methods
Data Acquisition in Machine Learning
Data Acquisition Process
Chapter 1.2 Sources of acquiring data
Outline
Sources of acquiring data
Data Acquisition Techniques and Tools
Data Collection Sources
i. Primary Data
ii. Secondary Data
 Internal
 External
Chapter 1.3 Internal Systems and External Systems
Outline
Internal Systems
External Systems
Chapter 1.4 Data Preprocessing
Outline
Purpose of data preprocessing
Tasks in Data Preprocessing
Data preprocessing in Machine Learning: A practical approach
Chapter 1.5 Exploratory Data Analysis (EDA)
Outline
Intro
Types of exploratory data analysis
Exploratory Data Analysis Tools
Chapter 1.6 Open Data
Outline
Intro
Why Is Open Data Important?
List of 15 awesome Open Data sources
Chapter 1.7 Data APIs
Outline
Intro
Limitations of APIs
Why Data APIs?
Domain Model
Chapter 1.8 Web Scraping
Outline
Intro
How Web Scrapers Work
Different Types of Web Scrapers
Why is Python a popular programming language for Web Scraping?
What is Web Scraping used for?
Relational Database access (queries) to process/access data
Queries help you find and work with your data
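A minimal sketch of running relational database queries from Python, using the built-in sqlite3 module and an illustrative students table (the table and values are made up for demonstration):

import sqlite3

# In-memory SQLite database with an illustrative table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, marks REAL)")
cur.executemany(
    "INSERT INTO students (name, marks) VALUES (?, ?)",
    [("Asha", 88.5), ("Ravi", 74.0), ("Mya", 91.2)],
)
conn.commit()

# A query finds and works with the data: filter, sort and aggregate.
cur.execute("SELECT name, marks FROM students WHERE marks > ? ORDER BY marks DESC", (80,))
print(cur.fetchall())          # rows with marks above 80, highest first

cur.execute("SELECT AVG(marks) FROM students")
print(cur.fetchone())          # average marks across all rows
conn.close()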
Introduction
Data
• Raw facts or numbers collected to be examined or analysed in order to make decisions.
• Data should be represented in a formalized manner suitable for communication, interpretation and processing.
Information
• The result of analysing data.
Data versus Information
Data are the building blocks of information. Likewise, pieces of
information are the building blocks of records.
Information: Data that has been given value through analysis,
interpretation, or compilation in a meaningful form.
Types of Data
• Structured – data which is organized and formatted in a specific way, forming a well-defined schema or structure.
• Unstructured – data in an unorganized form whose context is specific or varying, e.g., e-mail.
• Natural language – a special type of unstructured data; it is challenging to process because it requires knowledge of specific data science techniques and linguistics.
• Machine-generated – data that is automatically created by a computer, process, application, or other machine without human intervention.
• Graph-based – data that focuses on the relationship or adjacency of objects.
• Audio, video, and images – data captured and recognized through sound, pictures and videos.
• Streaming – data that flows into the system when an event happens, instead of being loaded into a data store in a batch.
Data file formats
• Tabular (e.g., .csv, .tsv, .xlsx)
• Non-tabular (e.g., .txt, .rtf, .xml)
• Image (e.g., .png, .jpg, .tif)
• Agnostic (e.g., .dat)
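A minimal sketch of loading some of these file formats in Python; the filenames are placeholders, and it assumes pandas is installed (plus openpyxl for Excel and Pillow for images):

import pandas as pd

# Tabular formats
df_csv = pd.read_csv("data.csv")             # comma-separated values
df_tsv = pd.read_csv("data.tsv", sep="\t")   # tab-separated values
df_xlsx = pd.read_excel("data.xlsx")         # Excel workbook (needs openpyxl)

# Non-tabular formats
with open("notes.txt", encoding="utf-8") as f:
    text = f.read()                          # plain text
df_xml = pd.read_xml("records.xml")          # XML (pandas >= 1.3)

# Image formats are typically loaded with an imaging library such as Pillow
from PIL import Image
img = Image.open("figure.png")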
Data Acquisition
• The process of gathering various data from different relevant sources is referred to as Data Acquisition.
• In other words, it is the process of gathering real-world data and converting it into a digital format that can be analyzed by a computer.
Importance of Data Acquisition
• In business, it enables analysis and the formulation of corresponding strategies.
• It is easier to detect any discrepancy and solve it faster.
• It decreases human error and improves data security.
• It is cost-efficient.
• It helps in building recommendation systems.
Data acquisition comprises two steps – data harvest and data ingestion.
Data Harvest: the process by which a source generates data; it is concerned with what data is acquired.
Data Ingestion: focuses on bringing the produced data into a given system.
Data ingestion consists of three stages – discover, connect and synchronize.
Data Acquisition Methods
Data can be obtained from many different sources, such as websites, apps, IoT protocols or even physical notes, and new data sources appear virtually every day.
Data Acquisition in Machine Learning
“ Data acquisition is the procedure of obtaining and compiling data
from diverse sources in order to test and train machine learning
models.”
• Collection and Integration of the data: The data is extracted from various sources, and multiple datasets may need to be combined based upon the requirements.
• Formatting: Prepare or organize the datasets as per the analysis requirements.
• Labeling: After gathering the data, it needs to be labeled (named).
Data Acquisition Process
The process of data acquisition involves searching for the datasets that
can be used to train the Machine Learning models.
The main segments are :
1. Data Discovery: sharing and searching for new datasets available on the web and incorporating that data.
2. Data Augmentation: in the context of data acquisition, enriching the existing data by adding more external data.
3. Data Generation: datasets can also be generated manually or automatically.
Chapter 1.2 Sources of acquiring data
Data collection is the process of acquiring, collecting, extracting, and
storing the voluminous amount of data which may be in the structured
or unstructured form like text, video, audio, XML files, records, or other
image files used in later stages of data analysis.
There are four methods of acquiring data:
1. Collecting New Data;
2. Converting/Transforming Legacy Data;
3. Sharing/Exchanging Data;
4. Purchasing Data.
Data Acquisition Techniques and Tools
The major tools and techniques for data acquisition are:
1. Data Warehouses and ETL
2. Data Lakes and ELT
3. Cloud Data Warehouse providers
1. Data Warehouses and ETL
The first option to acquire data is via a data warehouse.
Data warehousing is the process of constructing and using a data warehouse to
offer meaningful business insights. A data warehouse is a centralized repository,
which is constructed by combining data from various heterogeneous sources. It is
typically constructed to store structured records having tabular formats.
Data acquisition is performed by two kinds of ETL (Extract, Transform and Load) applications:
a) Code-based ETL: these ETL applications are developed using programming languages such as SQL and PL/SQL (a combination of SQL and procedural programming features). Examples: Base SAS, SAS/ACCESS.
b) Graphical User Interface (GUI)-based ETL: these ETL applications are developed using a graphical user interface and point-and-click techniques. Examples: DataStage, Data Manager, Ab Initio, Informatica, ODI (Oracle Data Integrator), Data Services, and SSIS (SQL Server Integration Services).
2. Data Lakes and ELT
A data lake is a storage repository having the capacity to store large
amounts of data, including structured, semi-structured, and
unstructured data. It can store images, videos, audio, sound records,
and PDF files. It helps for faster ingestion of new data.
Note: unlike data warehouses, data lakes store everything, are more flexible, and follow the Extract, Load, and Transform (ELT) approach. The data is loaded first and is not transformed until the transformation is required.
3. Cloud Data Warehouse providers
A cloud data warehouse is another service that collects, organizes, and
stores data. Unlike the traditional data warehouse, cloud data
warehouses are quicker and cheaper to set up as no physical hardware
needs to be procured.
Data Collection Sources.
1. Primary Data
• The first technique of data collection is primary data collection, which involves collecting original data directly from the source or through direct interaction with the respondents. This method allows researchers to obtain firsthand information tailored to their research objectives. There are various techniques for primary data collection, including:
 Surveys and Questionnaires: Researchers design structured questionnaires or surveys to collect data from
individuals or groups. These can be conducted through face-to-face interviews, telephone calls, mail, or
online platforms.
 Interviews: Interviews involve direct interaction between the researcher and the respondent. They can be
conducted in person, over the phone, or through video conferencing. Interviews can be structured (with
predefined questions), semi-structured (allowing flexibility), or unstructured (more conversational).
 Observations: Researchers observe and record behaviors, actions, or events in their natural setting. This
method is useful for gathering data on human behavior, interactions, or phenomena without direct
intervention.
 Experiments: Experimental studies involve manipulating variables to observe their impact on the outcome. Researchers control the conditions and collect data to draw conclusions about cause-and-effect relationships.
 Focus Groups: Focus groups bring together a small group of individuals who discuss specific topics in a
moderated setting. This method helps in understanding the opinions, perceptions, and experiences shared
by the participants.
2. Secondary Data Collection
• The next technique of data collection is secondary data collection, which involves using existing data collected by someone else for a purpose different from the original intent. Researchers analyze and interpret this data to extract relevant information. Secondary data can be obtained from various sources, including:
Published Sources: Researchers refer to books, academic journals, magazines, newspapers,
government reports, and other published materials that contain relevant data.
Online Databases: Numerous online databases provide access to a wide range of secondary
data, such as research articles, statistical information, economic data, and social surveys.
Government and Institutional Records: Government agencies, research institutions, and
organizations often maintain databases or records that can be used for research purposes.
Publicly Available Data: Data shared by individuals, organizations, or communities on public
platforms, websites, or social media can be accessed and utilized for research.
Past Research Studies: Previous research studies and their findings can serve as valuable
secondary data sources. Researchers can review and analyze the data to gain insights or build
upon existing knowledge.
Chapter 1.3 Internal Systems and External Systems
Internal Systems :
 Internal data is generated and used within a company or
organization.
This data is usually produced by the company's operations, such as
sales, customer service, or production.
Internal data occurs in various formats, such as spreadsheets,
databases, or customer relationship management (CRM) systems.
External Systems :
Data collected from external sources, including customers, partners,
competitors, and industry reports.
This data can be purchased from third-party providers or gathered
from publicly available sources including market research reports,
social media data, government data etc.,
Web APIs
An Application Programming Interface (API) is a set of defined rules that enable different applications to communicate with each other.
REST API: REST stands for REpresentational State Transfer. It is a web architecture with a set of constraints applied to web service applications. REST APIs provide data in the form of resources, which can be related to objects.
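A minimal sketch of consuming a REST API from Python with the requests library; the endpoint URL, parameters and token here are hypothetical placeholders to be replaced with those of a real API:

import requests

# Hypothetical REST endpoint returning JSON resources.
url = "https://api.example.com/v1/cities"
params = {"country": "IN", "limit": 10}
headers = {"Authorization": "Bearer YOUR_API_KEY"}   # many APIs require a key/token

response = requests.get(url, params=params, headers=headers, timeout=10)
response.raise_for_status()      # raise an error for 4xx/5xx responses

records = response.json()        # REST APIs typically return JSON
for record in records:
    print(record)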
Some of the popular APIs in ML and Data Science
Google Map API
Amazon Machine Learning API
Facebook API
Twitter API
IBM Watson API
US Census Bureau API
Quandl API
Chapter 1.4 Data Pre processing
Data preprocessing is an important process of data mining. In this
process, raw data is converted into an understandable format and
made ready for further analysis.
Purpose of data preprocessing:
Get data overview
Identify missing data
Identify outliers or anomalous data
Remove Inconsistencies
Tasks in Data Preprocessing
1. Data Cleaning
Data cleaning helps us remove inaccurate, incomplete and incorrect data from the dataset. Typical tasks (a short pandas sketch follows this list):
Handling missing values
Noisy data
Binning
Regression
Clustering
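A minimal cleaning sketch with pandas on a small made-up dataset, showing missing-value handling and binning (the column names and values are assumptions for illustration):

import numpy as np
import pandas as pd

# Assumed example dataset with missing and noisy values.
df = pd.DataFrame({
    "Age":    [25, np.nan, 47, 51, 62],
    "Salary": [48000, 54000, np.nan, 61000, 1_000_000],   # last value is an outlier
})

# Handling missing values: drop rows or fill with a statistic such as the mean.
df_drop = df.dropna()                                # delete rows containing missing data
df_fill = df.fillna(df.mean(numeric_only=True))      # replace NaN with the column mean

# Binning: smooth a noisy numeric column into discrete intervals.
df_fill["AgeBin"] = pd.cut(df_fill["Age"], bins=3, labels=["young", "middle", "senior"])

print(df_fill)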
2. Data Integration
The process of combining data from multiple sources (databases, spreadsheets, text files) into a single dataset.
Major problems during data integration (a merge sketch follows this list):
Schema integration
Entity identification
Detecting and resolving data value conflicts
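A minimal integration sketch with pandas, joining two assumed sources that describe the same customers under different schemas:

import pandas as pd

# Two assumed sources with differing key column names.
crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ravi", "Mya"]})
sales = pd.DataFrame({"cust_id": [1, 2, 4], "amount": [250.0, 90.5, 130.0]})

# Schema integration: align the key columns, then join into a single dataset.
merged = pd.merge(
    crm, sales,
    left_on="customer_id", right_on="cust_id",
    how="left",              # keep every CRM customer even without sales records
)
print(merged)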
3. Data Transformation
The change made to the format or the structure of the data is called data transformation.
Methods in data transformation (a small sketch follows this list):
Smoothing
Aggregation
Discretization
Normalization
Attribute Selection
Concept Hierarchy Generation
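A minimal transformation sketch with pandas showing aggregation, discretization and min-max normalization on an assumed transaction table:

import pandas as pd

# Assumed transaction-level data to be transformed.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "sales":  [120.0, 80.0, 200.0, 150.0, 50.0],
})

# Aggregation: summarise transactions up to region level.
by_region = df.groupby("region", as_index=False)["sales"].sum()

# Discretization: convert continuous sales values into ordered categories.
df["sales_band"] = pd.cut(df["sales"], bins=[0, 100, 175, float("inf")],
                          labels=["low", "medium", "high"])

# Normalization (min-max): rescale sales to the [0, 1] range.
df["sales_norm"] = (df["sales"] - df["sales"].min()) / (df["sales"].max() - df["sales"].min())

print(by_region)
print(df)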
4. Data Reduction
Data reduction decreases the volume of data while producing almost the same analytical results, which increases storage efficiency and makes analysis easier.
Steps of data reduction (a dimensionality-reduction sketch follows this list):
Data Compression
Numerosity Reduction
Dimensionality Reduction
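A minimal dimensionality-reduction sketch using PCA from scikit-learn on a randomly generated feature matrix (the data here is synthetic, purely for illustration):

import numpy as np
from sklearn.decomposition import PCA

# Synthetic numeric feature matrix: 100 samples with 5 correlated features.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])   # 5 columns, only ~2 informative

# Dimensionality reduction: project onto the 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)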
Data preprocessing in Machine Learning: A practical approach
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and crucial step when creating a machine learning model. Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be directly used by machine learning models. Data preprocessing is required to clean the data and make it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.
It involves the steps below:
1. Getting the dataset
2. Importing libraries
3. Importing datasets
4. Finding missing data
5. Encoding categorical data
6. Splitting the dataset into training and test sets
7. Feature scaling
1. Getting the dataset
To use the dataset in our code, we usually put it into a CSV file.
2. Importing libraries
In order to perform data preprocessing using Python, we need to import some predefined Python libraries.
NumPy – used for any type of mathematical operation:
import numpy as nm
Matplotlib – a Python 2D plotting library:
import matplotlib.pyplot as mpt
Pandas – used for importing and managing datasets:
import pandas as pd
3. Importing the Dataset
We can use the read_csv() function as below:
data_set = pd.read_csv('Dataset.csv')
Note: data_set is the variable that stores the dataset.
Extracting the Independent and Dependent Variables
Independent variable – a variable that does not depend upon any other variable. Example: input data.
Dependent variable – a variable that depends on another variable. Example: output data.
In this example, Country, Age and Salary are the independent variables and Purchased is the dependent variable.
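A minimal sketch of extracting the feature matrix x and the target vector y used in the following steps, assuming Dataset.csv has the columns Country, Age, Salary, Purchased in that order:

import pandas as pd

data_set = pd.read_csv('Dataset.csv')

x = data_set.iloc[:, :-1].values   # independent variables: all columns except the last
y = data_set.iloc[:, -1].values    # dependent variable: the last column (Purchased)

print(x[:3])
print(y[:3])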
4. Finding Missing Data
If our dataset contains some missing data, then it may create a huge
problem for our machine learning model. Hence it is necessary to
handle missing values present in the dataset.
Ways to handle missing data:
 By deleting the particular row
 By calculating the mean
Consider the following example, in which NaN denotes a missing value.
To handle missing values, we will use the Scikit-learn library, which contains various classes for building machine learning models. Here we use the Imputer class of the sklearn.preprocessing module (available in older versions of scikit-learn). Below is the code for it:
#handling missing data (replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
#Fitting the imputer object to the independent variables x
imputer = imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
After handling the missing data, the calculated mean value is placed in the missing position (strategy used: mean).
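In current scikit-learn releases the Imputer class has been replaced by SimpleImputer in sklearn.impute; a minimal equivalent sketch, assuming the same x array:

import numpy as np
from sklearn.impute import SimpleImputer

# Replace NaN values in the numeric columns (Age, Salary) with the column mean.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])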
5. Encoding Categorical Data
Categorical data is data which has categories; in our dataset there are two categorical variables, Country and Purchased. Since a machine learning model works entirely on mathematics and numbers, a categorical variable in the dataset may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.
For the Country variable:
First, we convert the country values into numeric codes using the LabelEncoder class from the preprocessing module.
#Categorical data
#for the Country variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
In the above code, we have imported the LabelEncoder class of the sklearn library, which encodes the variable into digits.
Dummy Variables:
Dummy variables are variables which take the values 0 or 1. The value 1 indicates the presence of that category in a particular column, and the remaining columns become 0. For dummy encoding, we use the OneHotEncoder class of the preprocessing module (see the sketch below).
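A minimal dummy-encoding sketch for the Country column, assuming the x array from the earlier steps with Country in column 0; ColumnTransformer is used as in current scikit-learn:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],   # one-hot encode column 0 (Country)
    remainder="passthrough",                            # keep the Age and Salary columns
)
x = ct.fit_transform(x)   # each country becomes its own 0/1 dummy column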
For the Purchased variable:
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
For the second categorical variable we only use the LabelEncoder object, not the OneHotEncoder class, because the Purchased variable has only two categories (yes or no), which are automatically encoded into 0 and 1.
6. Splitting the Dataset into the Training Set and Test Set
In machine learning data preprocessing, we divide our dataset into a training set and a test set. This is one of the crucial steps of data preprocessing: by doing this, we can reliably evaluate and improve the performance of our machine learning model.
Training set: a subset of the dataset used to train the machine learning model; the outputs are already known.
Test set: a subset of the dataset used to test the machine learning model; the model predicts the outputs for the test set.
The split between training data and test data is generally 80:20.
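A minimal sketch of an 80:20 split with scikit-learn's train_test_split, assuming x and y were prepared in the previous steps:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    test_size=0.2,     # 20% of the rows go to the test set
    random_state=0,    # fixed seed so the split is reproducible
)
print(x_train.shape, x_test.shape)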
7. Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset within a specific range. In feature scaling, we put our variables on the same scale so that no single variable dominates the others.
There are two common ways to perform feature scaling in machine learning: standardization and normalization (min-max scaling); see the sketch below.
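A minimal sketch of both approaches with scikit-learn, assuming x_train and x_test from the previous step contain only numeric columns; the scaler is fitted on the training set and applied to the test set:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Standardization: rescale features to zero mean and unit variance.
std_scaler = StandardScaler()
x_train_std = std_scaler.fit_transform(x_train)
x_test_std = std_scaler.transform(x_test)

# Normalization (min-max scaling): rescale features to the [0, 1] range.
mm_scaler = MinMaxScaler()
x_train_mm = mm_scaler.fit_transform(x_train)
x_test_mm = mm_scaler.transform(x_test)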
Chapter 1.5 Exploratory Data Analysis(EDA)
Exploratory Data Analysis is an approach in analyzing data sets to
summarize their main characteristics, often using statistical graphics
and other data visualization methods.
EDA assists Data science professionals in various ways:-
1.Getting a better understanding of data
2. Identifying various data patterns
3. Getting a better understanding of the problem statement
Types of exploratory data analysis
Univariate non-graphical
data being analyzed consists of just one variable.
Univariate graphical
Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore
required.
Common types of univariate graphics include:
 Stem-and-leaf plots, which show all data values and the shape of the distribution.
 Histograms, a bar plot in which each bar represents the frequency (count) or proportion
(count/total count) of cases for a range of values.
 Box plots, which graphically depict the five-number summary of minimum, first quartile, median,
third quartile, and maximum.
Multivariate nongraphical
Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques
generally show the relationship between two or more variables of the data through cross-tabulation
or statistics.
Multivariate graphical
Multivariate data uses graphics to display relationships between two or more sets of data.
Other common types of multivariate graphics include:
 Scatter plot, which is used to plot data points on a horizontal and a
vertical axis to show how much one variable is affected by another.
Multivariate chart, which is a graphical representation of the
relationships between factors and a response.
Run chart, which is a line graph of data plotted over time.
Bubble chart, which is a data visualization that displays multiple
circles (bubbles) in a two-dimensional plot.
Heat map, which is a graphical representation of data where values
are depicted by color
Exploratory Data Analysis Tools
Some of the most common data science tools used to create an EDA include:
Python: An interpreted, object-oriented programming language with
dynamic semantics. Its high-level, built-in data structures, combined with
dynamic typing and dynamic binding, make it very attractive for rapid
application development, as well as for use as a scripting or glue language
to connect existing components together. Python and EDA can be used
together to identify missing values in a data set, which is important so you
can decide how to handle missing values for machine learning.
https://www.kaggle.com/code/imoore/intro-to-exploratory-data-analysis-eda
-in-python
R: An open-source programming language and free software environment
for statistical computing and graphics supported by the R Foundation for
Statistical Computing. The R language is widely used among statisticians in
data science in developing statistical observations and data analysis.
Line plot: a type of plot which displays information as a series of data points called "markers" connected by straight lines. In Matplotlib, we can make a line plot using the plt.plot() function.
Scatter plot: this type of plot shows all individual data points; use the plt.scatter() function.
Histogram: an accurate representation of the distribution of numeric data. First, we divide the entire range of values into a series of intervals (also called bins), and second, we count how many values fall into each interval. To make a histogram with Matplotlib, we can use the plt.hist() function.
Box plot (also called the box-and-whisker plot): a way to show the distribution of values based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.
Bar chart: represents categorical data with rectangular bars; each bar has a height that corresponds to the value it represents. To make a bar chart with Matplotlib, we use the plt.bar() function.
Pie chart: a circular plot, divided into slices to show numerical proportion; use the plt.pie() function. A combined sketch of these plots follows.
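A minimal Matplotlib sketch illustrating these basic EDA plots on small made-up data (all values here are synthetic, purely for illustration):

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=200)    # made-up numeric sample
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 95, 140, 160]                        # made-up categorical totals

fig, axes = plt.subplots(2, 3, figsize=(12, 7))

axes[0, 0].plot(range(len(values)), values)        # line plot
axes[0, 1].scatter(values[:-1], values[1:])        # scatter plot
axes[0, 2].hist(values, bins=20)                   # histogram
axes[1, 0].boxplot(values)                         # box-and-whisker plot
axes[1, 1].bar(months, sales)                      # bar chart
axes[1, 2].pie(sales, labels=months)               # pie chart

plt.tight_layout()
plt.show()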
Bar and column charts: one of the simplest graphs for understanding how a quantitative field performs across various categories; used for comparison.
Scatter plot and bubble chart: scatter and bubble plots help us understand how variables are spread across the range considered. They can be used to identify patterns, the presence of outliers, and the relationship between two variables (for example, that profits decrease as the discount increases).
Heatmap: the most preferred chart when we want to check whether there is any correlation between variables (see the sketch below).
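A minimal correlation heatmap sketch with pandas and Matplotlib, using a small synthetic dataset (the columns and values are made up for illustration):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "discount": rng.uniform(0, 0.5, 100),
    "sales": rng.normal(1000, 100, 100),
})
df["profit"] = df["sales"] * (0.3 - df["discount"]) + rng.normal(0, 10, 100)

corr = df.corr()                          # pairwise correlation matrix

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)   # colour encodes correlation
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()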
Chapter 1.6 Open Data
Open Data means the kind of data which is open for anyone and
everyone for access, modification, reuse, and sharing.
Why Is Open Data Important?
Open data is important because the world has grown increasingly data-driven. If there are restrictions on the access and use of data, the idea of data-driven business and governance cannot materialize. Therefore, open data has its own unique place.
A list of 15 awesome Open Data sources:
1. World Bank Open Data – data about what's happening in different countries across the world; World Bank Open Data is a vital source of Open Data.
2. WHO (World Health Organization) Open Data Repository – how WHO keeps track of health-specific statistics of its 194 Member States; the repository keeps the data systematically organized.
3. Google Public Data Explorer – launched in 2010, it helps you explore vast amounts of public-interest datasets.
4. Registry of Open Data on AWS (RODA) – a repository containing public datasets available from AWS resources; through RODA you can discover and share data which is publicly available.
5. European Union Open Data Portal – a single platform for accessing whatever open data EU institutions, agencies and other organizations publish.
6. FiveThirtyEight – a great site for data-driven journalism and storytelling.
7. U.S. Census Bureau – the biggest statistical agency of the US federal government; it stores and provides reliable facts and data regarding the people, places, and economy of America.
8. Data.gov – the treasure-house of the US government's open data. It was only recently decided to make all government data available for free; when it was launched there were only 47 datasets, and there are now about 180,000.
9. DBpedia – Wikipedia is a great source of information; DBpedia aims at extracting structured content from the valuable information that Wikipedia creates.
10. freeCodeCamp Open Data – an open-source community that enables you to learn to code, build pro bono projects for nonprofits and get a job as a developer.
11. Yelp Open Datasets – a subset of Yelp's own businesses, reviews and user data for use in personal, educational and academic pursuits; it includes 5,996,996 reviews, 188,593 businesses, 280,991 pictures and 10 metropolitan areas.
12. UNICEF Datasets – since UNICEF concerns itself with a wide variety of critical issues, it has compiled relevant data on education, child labor, child disability, child mortality, maternal mortality, water and sanitation, low birth-weight, antenatal care, pneumonia, malaria, iodine deficiency disorder, female genital mutilation/cutting, and adolescents.
13. Kaggle – great because it promotes the use of different dataset publication formats, and it strongly recommends that dataset publishers share their data in an accessible, non-proprietary format.
14. LODUM – the Open Data initiative of the University of Münster. Under this initiative, anyone can access public information about the university in machine-readable formats and reuse it as needed.
15. UCI Machine Learning Repository – a comprehensive repository of databases, domain theories, and data generators used by the machine learning community for the empirical analysis of machine learning algorithms.
Chapter 1.7 Data APIs
A Data API provides API access to data stored in a data management system. APIs provide granular, per-record access to datasets and their component data files.
Limitations of APIs
Whilst Data APIs are in many ways more flexible than direct download, they have disadvantages:
1. APIs are much more costly and complex to create and maintain than direct download.
2. API queries are slow and limited in size because they run in real time in memory. Thus, for bulk access, e.g. of an entire dataset, direct download is much faster and more efficient (downloading a 1 GB CSV directly is easy and takes seconds, but attempting to do so via the API may be very slow and may even crash the server).
Why Data APIs?
1. Data (pre)viewing: reliably and richly (e.g. with querying, mapping etc.). This makes the data much more accessible to non-technical users.
2. Visualization and analytics: rich visualization and analytics may need a data API (because they need to easily query and aggregate parts of a dataset).
3. Rich data exploration: when exploring the data, you will want to move through a dataset quickly, pulling only parts of the data and drilling down further as needed.
4. (Thin) client applications: with a data API, third-party users of the portal can build apps on top of the portal data easily and quickly (and without having to host the data themselves).
Domain Model
The functionality associated with Data APIs can be divided into the following areas:
1. Descriptor: metadata describing and specifying the API, e.g. general metadata such as name, title, description, schema, and permissions.
2. Manager: for creating and editing APIs.
   • API: for creating and editing a Data API's descriptors (which triggers creation of storage and a service endpoint).
   • UI: for doing this manually.
3. Service (read): a web API for accessing structured data (i.e. per record) with querying etc. When we simply say "Data API", this is usually what we are talking about.
   • Custom APIs and complex functions: e.g. aggregations, joins.
   • Tracking and analytics: rate limiting etc.
   • Write API: usually secondary because of its limited performance versus bulk loading.
   • Bulk export of query results, especially large ones (or even export of the whole dataset in the case where the data is stored directly in the DataStore rather than the FileStore). This is an increasingly important feature; it is a lower priority, but if required it is a substantive feature to implement.
4. Data Loader: bulk loading data into the system that powers the data API.
   • Bulk load: bulk import of individual data files.
   • May include some ETL, which takes us more into data-factory territory.
5. Storage (structured): the underlying structured store for the data (and its layout), for example Postgres and its table structure. This could be considered a separate component that the Data API uses, or as part of the Data API – in some cases the store and the API are completely wrapped together, e.g. Elasticsearch is both a store and a rich web API.
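For illustration, a minimal sketch of per-record access through a Data API, here a CKAN-style DataStore datastore_search endpoint; the portal URL and resource_id are placeholders:

import requests

# Hypothetical CKAN-style data portal; replace the base URL and resource_id.
base_url = "https://demo.ckan.org/api/3/action/datastore_search"
params = {
    "resource_id": "RESOURCE_ID_PLACEHOLDER",   # identifies the dataset's data file
    "q": "chennai",                             # full-text filter over the records
    "limit": 5,                                 # per-record, paged access (not bulk)
}

response = requests.get(base_url, params=params, timeout=10)
response.raise_for_status()
result = response.json()["result"]

for record in result["records"]:
    print(record)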
Chapter 1.8 Web Scraping
Web scraping is an automatic method to obtain large amounts of data from websites. Most of this data is unstructured data in an HTML format, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.
Web scraping requires two parts, namely the crawler and the scraper. The crawler is an artificial intelligence algorithm that browses the web to search for the particular data required by following links across the internet. The scraper, on the other hand, is a specific tool created to extract data from a website.
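A minimal scraping sketch with requests and BeautifulSoup; the URL and the CSS selectors are illustrative placeholders, and a real site's terms of service and robots.txt should always be respected:

import requests
from bs4 import BeautifulSoup

# Hypothetical page to scrape; replace with a URL you are allowed to scrape.
url = "https://example.com/products"
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")

# Convert the unstructured HTML into structured records.
rows = []
for item in soup.select("div.product"):               # assumed CSS class of a product card
    name = item.select_one("h2").get_text(strip=True)
    price = item.select_one("span.price").get_text(strip=True)
    rows.append({"name": name, "price": price})

print(rows)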
