UNIT-I Data Science

UDITC04-DATA SCIENCE
Unit-I INTRODUCTION TO DATA SCIENCE

12/27/2024
Unit-I Asst.Professor,Dept of EEE
Content
• Concept of Data Science
• History
• Application areas
• Traits of Big Data
• Web scraping
• Analysis vs. reporting
Department of EEE, Academy of Maritime Education and Training, Deemed to be University, Chennai 12/27/2024
What is Data Science?
• Data Science is about data gathering, analysis, and decision-making.
• Data Science is about finding patterns in data through analysis and using them to make future predictions.
• By using Data Science, companies are able to make:
  ◦ Better decisions (should we choose A or B?)
  ◦ Predictive analysis (what will happen next?)
  ◦ Pattern discoveries (find patterns, or maybe hidden information, in the data)

Where is Data Science Needed?
• Data Science is used in many industries today, e.g. banking, consultancy, healthcare, and manufacturing.
• Examples of where Data Science is needed:
  ◦ To forecast next year's revenue for a company
  ◦ To analyze the health benefits of training
  ◦ To predict who will win an election
• Data Science can be applied in nearly every part of a business where data is available.
• Examples are:
  ◦ Consumer goods
  ◦ Stock markets
  ◦ Industry
  ◦ Politics
  ◦ Logistics companies
  ◦ E-commerce

How Does a Data Scientist Work?
A Data Scientist requires expertise in several areas:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases

A Data Scientist must find patterns within the data. Before the patterns can be found, the data must be organized in a standard format.
Here is how a Data Scientist works:
• Ask the right questions - To understand the business problem.
• Explore and collect data - From databases, web logs, customer feedback, etc.
• Extract the data - Transform the data to a standardized format.
• Clean the data - Remove erroneous values from the data.
• Find and replace missing values - Check for missing values and replace them with a suitable value (e.g. an average value).
• Normalize data - Scale the values to a practical range (e.g. 140 cm is smaller than 1.8 m, yet the number 140 is larger than 1.8, so scaling is important).
• Analyze data, find patterns, and make future predictions.
• Represent the result - Present the result with useful insights in a way the "company" can understand.

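The "find and replace missing values" and "normalize data" steps above can be sketched in a few lines of Python. This is only a minimal illustration: the height values, the use of `None` for a missing entry, and the helper names `fill_missing_with_mean` and `min_max_scale` are assumptions made for the example, not part of the original slides.

```python
from statistics import mean

# Hypothetical heights in centimetres; None marks a missing value.
heights_cm = [140.0, 180.0, None, 165.0, 175.0]


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    average = mean(known)
    return [average if v is None else v for v in values]


def min_max_scale(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


filled = fill_missing_with_mean(heights_cm)
scaled = min_max_scale(filled)
print(filled)
print(scaled)
```

Replacing a missing value with the average is only one possible choice; depending on the data, a median or a domain-specific default may be more suitable.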
What is Data?
• Data is a collection of information.
• One purpose of Data Science is to structure data, making it interpretable and easy to work with.
• Data can be categorized into two groups:
1. Structured data
2. Unstructured data

Unstructured Data
• Unstructured data is not organized. We must organize the data for analysis purposes.

Structured Data
• Structured data is organized and easier to work with.

How to Structure Data?
• We can use an array or a database table to structure or present data.
• Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
• In Python, such an array can be created as a list.

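A minimal sketch of creating the array shown above in Python. The variable name `average_pulse` is an assumption chosen for illustration; the slides do not say what the numbers measure.

```python
from array import array

# The same values as a plain Python list (the most common choice).
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# The standard-library array type stores the values compactly with one
# fixed type ("i" means signed integer).
average_pulse_array = array("i", average_pulse)

print(average_pulse[0], average_pulse[-1], len(average_pulse))
print(sum(average_pulse_array) / len(average_pulse_array))
```

A list is usually enough for small examples; numerical libraries provide their own array types for larger data.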
Database Table
• A database table is a table with structured data.
• The following table shows a database table with health data extracted from a sports watch:
• This dataset contains information about a typical training session, such as duration, average pulse, calorie burnage, etc.

Database Table Structure
• A row is a horizontal representation of data.
• A column is a vertical representation of data.

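To make rows and columns concrete, the sketch below builds a small in-memory database table with Python's built-in `sqlite3` module. The table name `training`, the column names, and all values are hypothetical; they only mimic the kind of sports-watch data described above.

```python
import sqlite3

# Hypothetical training sessions:
# (duration in minutes, average pulse, calorie burnage).
sessions = [
    (30, 80, 240),
    (45, 85, 250),
    (45, 90, 260),
    (60, 95, 270),
]

connection = sqlite3.connect(":memory:")  # temporary in-memory database
connection.execute(
    "CREATE TABLE training ("
    "duration INTEGER, average_pulse INTEGER, calorie_burnage INTEGER)"
)
connection.executemany("INSERT INTO training VALUES (?, ?, ?)", sessions)

# A row: all values recorded for one training session.
first_row = connection.execute(
    "SELECT * FROM training ORDER BY rowid LIMIT 1"
).fetchone()

# A column: the same attribute across every session.
pulse_column = [
    row[0]
    for row in connection.execute(
        "SELECT average_pulse FROM training ORDER BY rowid"
    )
]

print(first_row)
print(pulse_column)
connection.close()
```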
Need for Data Science:
• Today, data is becoming vast: approximately 2.5 quintillion bytes of data are generated every day, which has led to a data explosion. Researchers estimated that by 2020 about 1.7 MB of data would be created every second for every person on Earth. Every company requires data to work, grow, and improve its business.
• Handling such a huge amount of data is a challenging task for every organization. To handle, process, and analyze it, we require complex, powerful, and efficient algorithms and technology, and that technology came into existence as data science. Following are some main reasons for using data science technology:
• With the help of data science technology, we can convert massive amounts of raw and unstructured data into meaningful insights.
• Data science technology is being adopted by various companies, whether big brands or startups. Google, Amazon, Netflix, etc., which handle huge amounts of data, use data science algorithms for a better customer experience.
• Data science is helping to automate transportation, for example in creating self-driving cars, which are the future of transportation.
• Data science can help with different predictions, such as surveys, elections, flight ticket confirmation, etc.

Subsets of Data Science
• Data Science is a mixture of Mathematics and Statistics, Machine Learning, domain knowledge, IT, and software development.
• Math and Statistics form the core, as everything from Exploratory Data Analysis to model building requires dealing with numbers, vectors, probability, and so on.
• Machine Learning, a branch of Artificial Intelligence that includes Deep Learning, is the model-building subset of Data Science. Additionally, essential software development and IT skills are necessary to apply it in practice.
• Finally, business or domain knowledge goes a long way in determining the accuracy of the result, as different businesses use different data for prediction. Using the right data is of utmost importance for the credibility of our output.

Characteristics of Data Science
1. Business Understanding
• This is the most important characteristic: unless you understand the business, you cannot build a good model, even with good knowledge of machine learning algorithms or statistical skills. A data scientist needs to understand the business requirements and develop analytics accordingly, so domain knowledge of the business is also important.
2. Intuition
• Although the math involved is proven and foundational, a data scientist needs to pick the right model with the right accuracy, as not all models give the same results. A data scientist needs to sense when a model is ready for production deployment, and also needs the intuition to know when a production model has become stale and needs refactoring to respond to a changing business environment.
3. Curiosity
• Data Science is not a new field, but progress in it is very fast. New methods to solve familiar problems are being developed constantly, so curiosity to learn emerging technologies is very important for a data scientist.

Challenges of Data Science Technology
• A high variety of information and data is required for accurate analysis
• The available data science talent pool is not adequate
• Management does not provide financial support for a data science team
• Data is unavailable or difficult to access
• Business decision-makers do not use data science results effectively
• Explaining data science to others is difficult
• Privacy issues
• Lack of significant domain experts
• A very small organization cannot have a Data Science team

Careers in Data science
• Having learned about data, data analysis, and data science, and about their real-life applications, an important question comes up: what career options are available in Data Science? Many of us may have found the field interesting and may want to pursue a career in it. To help you make the right choice, let us look at the different careers in Data Science. Some common job titles for data scientists include:
1. Data Scientist
2. Business Intelligence Analyst
3. Data Mining Engineer
4. Data Architect
5. Senior Data Scientist

Data Scientist
• Data Scientists are data enthusiasts who gather and analyze large sets of structured and unstructured data. A data scientist's role combines computer science, statistics, and mathematics. They analyze, process, and model data and later interpret the results to create actionable plans for companies and organizations.
• Data Scientists are analytical experts who utilize their skills in both technology and social science to find trends and manage data. They use their industry knowledge and context-specific understanding to find solutions to business challenges.

Business Intelligence Analyst
• Business Intelligence Analysts use data to assess the market and find the latest business trends in the industry. This helps to develop a clearer picture of how a company should shape its strategy.

Data Engineer
• A Data Engineer examines not only the data of their own business but also that of third parties. In addition to mining data, a data engineer creates robust algorithms to help analyze the data further.

Data Architect
• Data Architects work closely with users, system designers, and developers to create a blueprint that data management systems use to centralize, integrate, and maintain the data sources.

Senior Data Scientist
• Senior Data Scientists anticipate the future needs of the business. Although they might not be involved in gathering data, they play a high-level role in analyzing it. Using their vast experience, they can design and create new standards for analyzing data. They can also create ways to use statistical data and develop tools to analyze the data further.


CONCEPT OF DATA SCIENCE

• Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine learning to analyze data and to extract knowledge and insights from it.
• Data science uses powerful hardware, programming systems, and efficient algorithms to solve data-related problems. It is often described as the future of artificial intelligence.
• In short, we can say that data science is all about:
  ◦ Asking the correct questions and analyzing the raw data.
  ◦ Modeling the data using various complex and efficient algorithms.
  ◦ Visualizing the data to get a better perspective.
  ◦ Understanding the data to make better decisions and find the final result.

Example:
• Suppose we want to travel from station A to station B by car. We need to make some decisions: which route will get us to the destination fastest, which route will have no traffic jams, and which will be the most cost-effective. All these decision factors act as input data, and analyzing them gives us an appropriate answer. This analysis of data is called data analysis, which is a part of data science.

Data Science Components:
1. Statistics: Statistics is one of the most important components of data science. It is a way to collect and analyze large amounts of numerical data and to find meaningful insights in it.
2. Domain expertise: Domain expertise binds data science together. It means specialized knowledge or skills in a particular area. In data science, there are various areas for which we need domain experts.
3. Data engineering: Data engineering is the part of data science that involves acquiring, storing, retrieving, and transforming data. It also includes adding metadata (data about data) to the data.
4. Visualization: Data visualization means representing data in a visual context so that people can easily understand its significance. It makes huge amounts of data easy to take in visually.
5. Advanced computing: Advanced computing does the heavy lifting of data science. It involves designing, writing, debugging, and maintaining the source code of computer programs.
6. Mathematics: Mathematics is a critical part of data science. It involves the study of quantity, structure, space, and change. For a data scientist, a good knowledge of mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. It is about training a machine so that it can act like a human brain. In data science, we use various machine learning algorithms to solve problems.

Data Science Lifecycle
Discovery
• Before you begin the project, it is important to understand the various specifications, requirements, priorities, and required budget. You must possess the ability to ask the right questions. Here, you assess whether you have the required resources in terms of people, technology, time, and data to support the project. In this phase, you also need to frame the business problem and formulate initial hypotheses (IH) to test.

Data preparation:
• In this phase, you require an analytical sandbox in which you can perform analytics for the entire duration of the project. You need to explore, preprocess, and condition data prior to modeling. Further, you will perform ETLT (extract, transform, load, and transform) to get data into the sandbox.
• You can use R for data cleaning, transformation, and visualization. This will help you to spot outliers and establish relationships between the variables. Once you have cleaned and prepared the data, it is time to do exploratory analytics on it.

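Spotting outliers during data preparation can be illustrated with a common rule of thumb: flag values that lie more than 1.5 times the interquartile range (IQR) outside the quartiles. The slide names R for this step; Python is used here only to keep the examples in one language. The readings, the helper name `iqr_outliers`, and the factor 1.5 are assumptions for the sketch.

```python
from statistics import quantiles

# Hypothetical average-pulse readings; 250 is an implausible value.
readings = [80, 85, 90, 95, 100, 105, 110, 115, 120, 250]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


outliers = iqr_outliers(readings)
cleaned = [v for v in readings if v not in outliers]
print(outliers)
print(cleaned)
```

Whether a flagged value is removed, corrected, or kept is a judgment that depends on the business problem; the rule only points at candidates.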
Model planning:
• Here, you will determine the methods and techniques to draw the relationships between variables. These relationships will set the base for the algorithms which you will implement in the next phase. You will apply Exploratory Data Analysis (EDA) using various statistical formulas and visualization tools.

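One simple statistical formula used in EDA to describe the relationship between two variables is the Pearson correlation coefficient. The sketch below computes it by hand; the `duration` and `calories` values and the function name `pearson_correlation` are hypothetical and serve only as an illustration.

```python
from math import sqrt

# Hypothetical training data: session duration (minutes) and calories burned.
duration = [30, 45, 45, 60, 60, 75]
calories = [240, 250, 260, 270, 280, 290]


def pearson_correlation(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(spread_x * spread_y)


r = pearson_correlation(duration, calories)
print(round(r, 3))
```

A value close to 1 suggests that the two variables rise together, close to -1 that one falls as the other rises, and close to 0 that there is no clear linear relationship.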
• R has a complete set of modeling capabilities and provides a good environment for building interpretive models.
• SQL Analysis Services can perform in-database analytics using common data mining functions and basic predictive models.
• SAS/ACCESS can be used to access data from Hadoop and is used for creating repeatable and reusable model flow diagrams.
Many tools are available in the market, and R is among the most commonly used. Now that you have gained insights into the nature of your data and decided on the algorithms to use, in the next stage you will apply the algorithm and build a model.

Model building:
• In this phase, you will develop datasets for training and testing purposes. You need to consider whether your existing tools will suffice for running the models or whether you will need a more robust environment (such as fast and parallel processing). You will analyze various learning techniques such as classification, association, and clustering to build the model.
• Model building can be carried out with dedicated tools such as those listed in the lifecycle summary below.

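Developing datasets for training and testing usually starts with a random split of the available records. The sketch below shows one way to do this with the standard library; the records, the 80/20 split, the fixed seed, and the local helper `train_test_split` are assumptions for illustration (the helper is written here and is not a library function).

```python
import random

# Hypothetical labelled records: (duration in minutes, calories burned).
records = [
    (30, 240), (45, 250), (45, 260), (60, 270), (60, 280),
    (75, 290), (75, 300), (90, 310), (90, 320), (105, 330),
]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


train, test = train_test_split(records)
print(len(train), len(test))
```

The model is then fitted on `train` only and evaluated on `test`, so that the evaluation uses data the model has not seen.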
Operationalize:
• In this phase, you deliver final reports, briefings, code, and technical documents. In addition, a pilot project is sometimes implemented in a real-time production environment. This gives you a clear picture of the performance and other related constraints on a small scale before full deployment.

Communicate results:
• Now it is important to evaluate whether you have achieved the goal that you planned in the first phase. So, in the last phase, you identify all the key findings, communicate them to the stakeholders, and determine whether the results of the project are a success or a failure based on the criteria developed in Phase 1.

The main phases of data science life cycle are
given below:
1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any data science project, you need to determine the basic requirements, priorities, and project budget. In this phase, we determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then frame the business problem at the level of a first hypothesis.
2. Data preparation: Data preparation is also known as data munging. In this phase, we need to perform the following tasks:
  ◦ Data cleaning
  ◦ Data reduction
  ◦ Data integration
  ◦ Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
3. Model planning: In this phase, we determine the various methods and techniques to establish the relationships between input variables. We apply Exploratory Data Analysis (EDA), using various statistical formulas and visualization tools, to understand the relationships between variables and to see what the data can tell us. Common tools used for model planning are:
  ◦ SQL Analysis Services
  ◦ R
  ◦ SAS
  ◦ Python
4. Model building: In this phase, the process of model building starts. We create datasets for training and testing purposes and apply different techniques, such as association, classification, and clustering, to build the model. Following are some common model-building tools:
  ◦ SAS Enterprise Miner
  ◦ WEKA
  ◦ SPSS Modeler
  ◦ MATLAB
5. Operationalize: In this phase, we deliver the final reports of the project, along with briefings, code, and technical documents. This phase provides a clear overview of complete project performance and other components on a small scale before full deployment.
6. Communicate results: In this phase, we check whether we have reached the goal that we set in the initial phase. We communicate the findings and the final result to the business team.


HISTORY

History of Data Science
• 1962 – Inception
a. The Future of Data Analysis – In 1962, John W. Tukey wrote “The Future of Data Analysis”, in which he first stressed the importance of data analysis as a science rather than as a part of mathematics.
• 1974
a. Concise Survey of Computer Methods – In 1974, Peter Naur published the “Concise Survey of Computer Methods”, which surveys the contemporary methods of data processing in various applications.
• 1974 – 1980
a. International Association for Statistical Computing – In 1977, the association was formed with the purpose of linking traditional statistical methodology with modern computer technology to extract useful information and knowledge from data.
• 1980 – 1990
a. Knowledge Discovery in Databases – In 1989, Gregory Piatetsky-Shapiro chaired the first Knowledge Discovery in Databases (KDD) workshop, which later went on to become the annual conference on knowledge discovery and data mining.
• 1990 – 2000
a. Database Marketing – In 1994, BusinessWeek published a cover story explaining how big organizations were using customer data to predict the likelihood of a customer buying a specific product, much as targeted ads work in modern social media campaigns.
b. International Federation of Classification Societies – In 1996, the term “Data Science” was used for the first time at a conference, held in Japan.
• 2000 – 2010
a. Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics – In 2001, William S. Cleveland published this action plan, which focused on the major areas of technical work in the field of statistics and promoted the term Data Science for the expanded field.
b. Statistical Modeling: The Two Cultures – In 2001, Leo Breiman wrote: “There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown”.
c. Data Science Journal – April 2002 saw the launch of a journal that focused on the management of data and databases in science and technology.
• 2010 – Present
a. Data, Data Everywhere – In February 2010, Kenneth Cukier wrote a special report for The Economist which said that a new kind of professional had emerged, the data scientist, who combines the skills of software programmer, statistician, and storyteller/artist to extract the nuggets of gold hidden under mountains of data.
b. What is Data Science? – In June 2010, Mike Loukides described data science as combining entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution.

Tools for Data Science
Following are some tools required for data science:
• Data analysis tools: R, Python, Statistics, SAS, Jupyter, RStudio, MATLAB, Excel, RapidMiner.
• Data warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
• Data visualization tools: R, Jupyter, Tableau, Cognos.
• Machine learning tools: Spark, Mahout, Azure ML Studio.

APPLICATIONS OF DATA SCIENCE

• Image recognition and speech recognition:
Data science is currently used for image and speech recognition. When you upload an image on Facebook, you start getting suggestions to tag your friends. This automatic tagging suggestion uses an image recognition algorithm, which is part of data science.
When you say something to "Ok Google", Siri, Cortana, etc., and the device responds to your voice, this is made possible by a speech recognition algorithm.
• Gaming world:
In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports, Sony, and Nintendo widely use data science to enhance the user experience.
• Internet search:
When we want to search for something on the internet, we use search engines such as Google, Yahoo, Bing, Ask, etc. All these search engines use data science technology to make the search experience better, so that you get a search result within a fraction of a second.
• Transport:
Transport industries are also using data science technology to create self-driving cars. Self-driving cars are expected to make it easier to reduce the number of road accidents.
• Healthcare:
In the healthcare sector, data science is providing many benefits. It is being used for tumor detection, drug discovery, medical image analysis, virtual medical bots, etc.
• Recommendation systems:
Many companies, such as Amazon, Netflix, Google Play, etc., use data science technology to create a better user experience with personalized recommendations. For example, when you search for something on Amazon and start getting suggestions for similar products, this is because of data science technology.
• Risk detection:
Finance industries have always had problems with fraud and the risk of losses, but with the help of data science these can be reduced.
Most finance companies are looking for data scientists to help them avoid risk and losses while increasing customer satisfaction.

TRAITS OF BIG DATA

• Big data is a collection of data from many different sources and is often described by five characteristics: volume, value, variety, velocity, and veracity.
• Big Data contains a large amount of data that cannot be processed by traditional data storage or processing units. It is used by many multinational companies to process the data and business of many organizations. The data flow would exceed 150 exabytes per day before replication.
• 5 V's of Big Data
  ◦ Volume
  ◦ Veracity
  ◦ Variety
  ◦ Value
  ◦ Velocity

Volume
• The name Big Data itself relates to an enormous size. Big Data is the vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
• Facebook, for example, can generate approximately a billion messages, record 4.5 billion clicks of the "Like" button, and receive more than 350 million new posts each day. Big data technologies can handle such large amounts of data.

Variety
• Big Data can be structured, unstructured, or semi-structured, and is collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms, such as PDFs, emails, audio, social media posts, photos, videos, etc.

The data is categorized as below:
• Structured data: Structured data has a defined schema with all the required columns and is in tabular form. It is stored in a relational database management system. OLTP (Online Transaction Processing) systems are built to work with structured data, which is stored in relations, i.e., tables.
• Semi-structured data: In semi-structured data, the schema is not strictly defined, e.g., JSON, XML, CSV, TSV, and email.
• Unstructured data: Unstructured files such as log files, audio files, and image files are included in unstructured data. Some organizations have a lot of data available but do not know how to derive value from it, since the data is raw.
• Quasi-structured data: Textual data with inconsistent formats that can be formatted with effort, time, and some tools.
  ◦ Example: web server logs, i.e., log files created and maintained by a server that contain a list of activities.

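The web server log example above can be made concrete with a short sketch that turns one quasi-structured log line into a structured record. The log line, the assumption that it follows the widely used Common Log Format, and the names `LOG_PATTERN` and `parse_log_line` are all illustrative choices, not part of the original slides.

```python
import re

# Hypothetical web server log line in the widely used Common Log Format.
LOG_LINE = (
    '192.0.2.10 - - [27/Dec/2024:10:15:32 +0530] '
    '"GET /index.html HTTP/1.1" 200 1043'
)

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Turn one log line into a dict of named fields, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])  # store the status code as a number
    return record


record = parse_log_line(LOG_LINE)
print(record["host"], record["path"], record["status"])
```

Once every line is parsed this way, the records can be loaded into a table and treated as structured data.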
Veracity
• Veracity means how reliable the data is. There are many ways to filter or translate the data. Veracity also concerns being able to handle and manage data efficiently. Big Data is also essential in business development.
• For example, Facebook posts with hashtags.

Value
• Value is an essential characteristic of big data. It is not just any data that we process or store; it is valuable and reliable data that we store, process, and analyze.

Velocity
59

 Velocity plays an important role compared to the other characteristics. Velocity is the speed at which data is created in real time. It covers the speed of incoming data sets, the rate of change, and activity bursts. A primary aspect of Big Data is to provide demanded data rapidly.
 Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.


WEB SCRAPING

What is web scraping?

 The word ‘scraping’ here implies getting something from the web. Two questions arise: what can we get from the web, and how do we get it?
 The answer to the first question is ‘data’. Data is indispensable for any programmer, and a basic requirement of almost every programming project is a large amount of useful data.
 The answer to the second question is a bit tricky, because there are many ways to get data. In general, we may get data from a database, a data file, or other sources. But what if we need a large amount of data that is available online? One way is to manually search (clicking away in a web browser) and save (copy-pasting into a spreadsheet or file) the required data. This method is tedious and time consuming. Another way to get such data is web scraping.


 Web scraping, also called web data mining or web harvesting, is the process of constructing an agent that can download, parse, extract, and organize useful information from the web automatically. In other words, instead of manually saving data from websites, web scraping software automatically loads and extracts data from multiple websites as per our requirement.
Origin of Web Scraping

 The origin of web scraping is screen scraping, which was used to integrate non-web-based applications or native Windows applications. Screen scraping was originally used before the wide use of the World Wide Web (WWW), but it could not scale up as the WWW expanded. This made it necessary to automate the approach of screen scraping, and the technique called ‘Web Scraping’ came into existence.
Web Crawling v/s Web Scraping

 The terms web crawling and web scraping are often used interchangeably, as the basic concept of both is to extract data. However, they are different from each other. We can understand the basic difference from their definitions.
 Web crawling is basically used to index the information on a page using bots, aka crawlers. It is also called indexing. On the other hand, web scraping is an automated way of extracting information using bots, aka scrapers. It is also called data extraction.
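The distinction can be sketched on a single page of HTML, using Python's built-in html.parser module. The page content, the class name "price", and the two helper classes are hypothetical and exist only for illustration: the crawler-style pass collects links to visit and index, while the scraper-style pass pulls out one specific kind of data.

```python
from html.parser import HTMLParser

# Made-up page content used instead of a live website.
PAGE = """
<html><body>
  <a href="/products/1">Kettle</a>
  <a href="/products/2">Toaster</a>
  <span class="price">1499</span>
  <span class="price">2299</span>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Crawler-style pass: discover URLs to visit and index next."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class PriceExtractor(HTMLParser):
    """Scraper-style pass: extract only the price values."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(int(data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False


crawler = LinkCollector()
crawler.feed(PAGE)
scraper = PriceExtractor()
scraper.feed(PAGE)

print("links to crawl:", crawler.links)
print("scraped prices:", scraper.prices)
```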
Difference between Web Crawling and Web Scraping

Uses of Web Scraping

 The uses of web scraping are as varied as the uses of the World Wide Web. Web scrapers can do many of the things a human can do, such as ordering food online, scanning an online shopping website for you, or buying tickets for a match the moment they become available. Some of the important uses of web scraping are discussed here:
 E-commerce Websites: Web scrapers can collect data, especially the price of a specific product, from various e-commerce websites for comparison.
 Content Aggregators: Web scraping is widely used by content aggregators, such as news aggregators and job aggregators, to provide updated data to their users.
 Marketing and Sales Campaigns: Web scrapers can be used to get data like emails, phone numbers, etc. for sales and marketing campaigns.
 Search Engine Optimization (SEO): Web scraping is widely used by SEO tools like SEMRush, Majestic, etc. to tell businesses how they rank for the search keywords that matter to them.
 Data for Machine Learning Projects: Data retrieval for machine learning projects often depends on web scraping.
 Data for Research: Researchers can collect useful data for their research work and save time through this automated process.

Components of a Web Scraper

 Web Crawler Module: A necessary component of a web scraper, the web crawler module navigates the target website by making HTTP or HTTPS requests to the URLs. The crawler downloads the unstructured data (HTML contents) and passes it to the extractor, the next module.
 Extractor: The extractor processes the fetched HTML content and extracts the data into a semi-structured format. It is also called a parser module and uses parsing techniques such as regular expressions, HTML parsing, DOM parsing, or artificial intelligence.
 Data Transformation and Cleaning Module: The data extracted above is not ready for use. It must pass through a cleaning module first. Methods like string manipulation or regular expressions can be used for this purpose. Note that extraction and transformation can also be performed in a single step.
 Storage Module: After extracting the data, we need to store it as per our requirement. The storage module outputs the data in a standard format that can be stored in a database or in JSON or CSV format.
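The four modules above can be sketched as four small functions using only Python's standard library. This is a minimal sketch, not a production scraper: the function names, the sample HTML, and its tag layout are assumptions made for illustration, and the extractor uses the regular-expression technique (one of the options named above). The fetch function shows how the crawler module would download a page, but it is not called here so that the example runs without network access.

```python
import csv
import io
import json
import re
from urllib.request import urlopen


def fetch(url):
    """Crawler module: download raw HTML over HTTP/HTTPS (not called below)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract(html):
    """Extractor: regular-expression parsing of (name, price) text pairs."""
    pattern = re.compile(
        r'<li class="item">\s*<b>(.*?)</b>\s*<i>(.*?)</i>\s*</li>', re.S
    )
    return pattern.findall(html)


def clean(pairs):
    """Cleaning module: trim whitespace and turn price text into a number."""
    cleaned = []
    for name, price in pairs:
        digits = re.sub(r"[^\d.]", "", price)
        if digits:
            cleaned.append({"name": name.strip(), "price": float(digits)})
    return cleaned


def store(records):
    """Storage module: serialize the records as JSON text and CSV text."""
    as_json = json.dumps(records)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(records)
    return as_json, buffer.getvalue()


# Made-up HTML standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li class="item"><b> Kettle </b><i>INR 1,499</i></li>
  <li class="item"><b>Toaster</b><i> INR 2,299 </i></li>
</ul>
"""

records = clean(extract(SAMPLE_HTML))
json_text, csv_text = store(records)
print(json_text)
print(csv_text)
```

Keeping each module as a separate function makes it easy to swap one technique for another, for example replacing the regular expression with an HTML parser, without touching cleaning or storage.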

Working of a Web Scraper

 A web scraper may be defined as software or a script used to download the contents of multiple web pages and extract data from them.

Why Python for Web Scraping?

 Python is a popular tool for implementing web scraping.
 The Python programming language is also used for other useful projects related to cyber security, penetration testing, and digital forensic applications. Using only core Python and its standard library, web scraping can be performed without any third-party tool. Python is gaining huge popularity, and the reasons that make it a good fit for web scraping projects are as below:
 Syntax Simplicity: Python has a simple, readable syntax compared with many other programming languages. This makes testing easier, and a developer can focus more on programming.


 Inbuilt Modules: Another reason for using Python for web scraping is the useful inbuilt as well as external libraries it possesses. We can perform many implementations related to web scraping by using Python as the base for programming.
 Open Source Programming Language: Python has huge support from the community because it is an open source programming language.
 Wide Range of Applications: Python can be used for various programming tasks ranging from small shell scripts to enterprise web applications.

DATA ANALYSIS

What is Data Analysis?

 Data analysis is defined as a process of cleaning, transforming, and modeling data to discover useful information for business decision-making. The purpose of data analysis is to extract useful information from data and to take decisions based on that analysis.
 A simple example of data analysis: whenever we take a decision in our day-to-day life, we think about what happened last time or what will happen if we choose that particular option. This is nothing but analyzing our past or future and making decisions based on it. For that, we gather memories of our past or dreams of our future. When an analyst does the same thing for business purposes, it is called Data Analysis.

Data Analysis Tools

 Data analysis tools make it easier for users to process and manipulate data, analyze the relationships and correlations between data sets, and identify patterns and trends for interpretation.

Types of Data Analysis: Techniques and Methods

 Several types of Data Analysis techniques exist, based on business and technology. However, the major Data Analysis methods are:
1. Text Analysis
2. Statistical Analysis
3. Diagnostic Analysis
4. Predictive Analysis
5. Prescriptive Analysis

Text Analysis
 Text Analysis is also referred to as Data Mining. It is a method of data analysis that discovers patterns in large data sets using databases or data mining tools. It is used to transform raw data into business information. Business Intelligence tools available in the market are used to take strategic business decisions. Overall, it offers a way to extract and examine data, derive patterns, and finally interpret the data.
Statistical Analysis
 Statistical Analysis shows “What happened?” by using past data in the form of dashboards. Statistical Analysis includes the collection, analysis, interpretation, presentation, and modeling of data. It analyses a set of data or a sample of data. There are two categories of this type of analysis: Descriptive Analysis and Inferential Analysis.
Descriptive Analysis
 Descriptive Analysis analyses complete data or a sample of summarized numerical data. It shows the mean and deviation for continuous data, and the percentage and frequency for categorical data.
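A descriptive summary of this kind can be sketched with Python's standard library. The ages and customer segments below are made-up sample values used only for illustration.

```python
import statistics
from collections import Counter

# Continuous data: summarized by mean and standard deviation.
ages = [23, 25, 31, 35, 41]
mean_age = statistics.mean(ages)
std_age = statistics.stdev(ages)  # sample standard deviation

# Categorical data: summarized by frequency and percentage.
segments = ["retail", "retail", "online", "retail", "online"]
counts = Counter(segments)
percentages = {name: 100 * n / len(segments) for name, n in counts.items()}

print("mean:", mean_age, "std:", round(std_age, 2))
print("frequency:", dict(counts))
print("percentage:", percentages)
```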


Inferential Analysis
 Inferential Analysis analyses a sample drawn from the complete data. In this type of analysis, you can reach different conclusions from the same data by selecting different samples.
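The effect of sample selection can be seen in a small simulation. This is a sketch under stated assumptions: the population is synthetic (normally distributed values with mean 50 and standard deviation 10), the sample size of 30 is arbitrary, and the random seed is fixed only so that the run is repeatable.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic "complete data": 10,000 values centred near 50.
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Two different samples drawn from the same data.
sample_a = rng.sample(population, 30)
sample_b = rng.sample(population, 30)

mean_a = statistics.mean(sample_a)
mean_b = statistics.mean(sample_b)

# The standard error indicates how far a sample mean may sit from the true mean.
se_a = statistics.stdev(sample_a) / len(sample_a) ** 0.5

print("sample A mean:", round(mean_a, 2), "+/-", round(se_a, 2))
print("sample B mean:", round(mean_b, 2))
print("population mean:", round(statistics.mean(population), 2))
```

The two sample means differ from each other and from the population mean, which is why inferential results are normally reported together with a measure of uncertainty.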

Diagnostic Analysis
 Diagnostic Analysis shows “Why did it happen?” by finding the cause from the insight found in Statistical Analysis. This analysis is useful for identifying behavior patterns in data. If a new problem arises in your business process, you can look into this analysis to find similar patterns for that problem, and there is a chance that similar prescriptions can be used for the new problem.


Predictive Analysis
 Predictive Analysis shows “What is likely to happen?” by using previous data. A simple example: if last year I bought two dresses based on my savings, and this year my salary doubles, then I can buy four dresses. Of course, it is not this easy, because you have to think about other circumstances: clothes prices may increase this year, or instead of dresses you may want to buy a new bike, or you may need to buy a house!
 So this analysis makes predictions about future outcomes based on current or past data. Forecasting is just an estimate. Its accuracy depends on how much detailed information you have and how deeply you dig into it.
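One simple way to predict from past data is to fit a straight-line trend and extend it. The sketch below uses ordinary least squares written out by hand; the yearly savings figures are hypothetical, and a straight line is only one of many possible models, so the result is an estimate and not a guarantee.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past data: savings (in thousands) per year.
years = [2020, 2021, 2022, 2023]
savings = [10.0, 12.0, 14.0, 16.0]

slope, intercept = fit_line(years, savings)
forecast_2024 = slope * 2024 + intercept

print("yearly growth:", slope)
print("forecast for 2024:", round(forecast_2024, 2))
```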
Prescriptive Analysis
 Prescriptive Analysis combines the insight from all the previous analyses to determine which action to take for a current problem or decision. Most data-driven companies use Prescriptive Analysis because predictive and descriptive analysis alone are not enough to improve data performance. Based on current situations and problems, they analyze the data and make decisions.

Data Analysis Process

 The Data Analysis Process is nothing but gathering information by using a proper application or tool that allows you to explore the data and find patterns in it. Based on that information and data, you can make decisions or draw final conclusions.
 Data Analysis consists of the following phases:
 Data Requirement Gathering
 Data Collection
 Data Cleaning
 Data Analysis
 Data Interpretation
 Data Visualization


DATA REPORTING

What is Data reporting?

 Data reporting helps you track what’s happening to your business and evaluate its performance. It’s the process of collecting, merging, and visualizing raw data from all available sources. Most often, it's presented in the form of tables, graphs, or charts. Also, you shouldn’t forget that:
 In most cases, data reports show only historical data, so you see an assessment of past actions that you’re no longer able to change.
 To adequately evaluate your brand’s ongoing performance, you need to understand the context (at least consider the industry and niche in which your company works).

How to write a data report

 By analyzing data, you can make informed decisions and test working
hypotheses. You can specify where your company’s resources go, what
progress has been made, and what your company should pay attention
to most.
 Want to know how to write a data analysis report? Let’s look at the steps you need to take to create a report that fits your company:
 Step 1. Determine the report’s purpose and which specific questions it
should answer. Different specialists need different reports, and, of course,
they need answers to different questions. And too many questions in one
dashboard can overload it.

 Step 2. Define metrics and data sources. Once you’ve decided on the
questions you want to answer and the specialists that will use the
dashboard, you should highlight critical metrics. Also, you should
determine what information is needed to build reports and what
sources should be connected to those reports.
 Step 3. Make sure data collection works correctly. You must be sure
of the quality of your data to make informed decisions. Make sure
that information is collected accurately and without errors. Also, note
that your attribution model should be tailored to your business needs.


 Pat yourself on the back if your report meets these criteria:
 Information is provided clearly and intelligibly (graphs, tables, and charts precisely answer specific questions).
 Changes in metrics can be monitored at a specified frequency (daily, weekly, monthly, in real time).
 Information can be filtered by specified parameters.
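The last two criteria, monitoring at a set frequency and filtering by parameters, can be sketched as a small summary function. The order records, the field names, and the monthly_revenue helper are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict
from datetime import date

# Made-up raw data that a report would summarize.
ORDERS = [
    {"day": date(2024, 11, 3), "channel": "web", "revenue": 120.0},
    {"day": date(2024, 11, 17), "channel": "store", "revenue": 80.0},
    {"day": date(2024, 12, 2), "channel": "web", "revenue": 200.0},
    {"day": date(2024, 12, 20), "channel": "web", "revenue": 50.0},
]


def monthly_revenue(orders, channel=None):
    """Total revenue per month, optionally filtered by sales channel."""
    totals = defaultdict(float)
    for order in orders:
        if channel is not None and order["channel"] != channel:
            continue
        totals[order["day"].strftime("%Y-%m")] += order["revenue"]
    return dict(sorted(totals.items()))


all_channels = monthly_revenue(ORDERS)
web_only = monthly_revenue(ORDERS, channel="web")

print("all channels:", all_channels)
print("web only:", web_only)
```

Changing the grouping key from month to day or week would give the same report at a different frequency.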

Data report examples and templates

 Below are examples of various reports

Five Key Differences Between Reporting and Analysis

 One of the key differences between reporting and analytics is that, while reporting involves organizing data into summaries, analysis involves inspecting, cleaning, transforming, and modeling that data to gain insights for a specific purpose.
 Knowing the difference between the two is essential to fully benefit from the potential of both without missing out on key features of either one. Some of the key differences include:


 1. Purpose: Reporting involves extracting data from different sources within an organization and monitoring it to gain an understanding of the performance of the various functions. By linking data from across functions, it helps create a cross-channel view that facilitates comparison and makes the data easy to understand. Analysis means interpreting data at a deeper level and providing recommendations on actions.
 2. The Specifics: Reporting involves activities such as building, consolidating, organizing, configuring, formatting, and summarizing. It requires clean, raw data, and reports may be generated periodically, such as daily, weekly, monthly, quarterly, and yearly. Analytics includes asking questions, examining, comparing, interpreting, and confirming. Enriching data with big data can help predict future trends as well.


 3. The Final Output: In the case of reporting, outputs such as canned reports,
dashboards, and alerts push information to users. Through analysis, analysts try to
extract answers using business queries and present them in the form of ad hoc
responses, insights, recommended actions, or a forecast. Understanding this key
difference can help businesses leverage analytics better.
 4. People: Reporting requires repetitive tasks that can be automated. It is often used by
functional business heads who monitor specific business metrics. Analytics requires
customization and therefore depends on data analysts and scientists. Also, it is used by
business leaders to make data-driven decisions.
 5. Value Proposition: This is like comparing apples to oranges. Reporting and analytics serve different purposes. By understanding those purposes and using each correctly, businesses can derive immense value from both.


Orbit for both Reporting and Analytics

 Orbit Reporting and Analytics is a single tool that can be used for both generating
different reports and running analytics to meet business objectives. It can work in
multi-cloud environments, extracting data from the cloud and on-prem systems
and presenting them in many ways as required by the user. It enables self-service,
allowing business users to generate their own reports without depending on the IT
team, in real-time. It complies with security and privacy requirements by allowing
access only to authorized users. It also allows users to generate reports in real-time
in Excel.
 It also facilitates analytics, enabling businesses to draw insights and convert them
into actions to predict future trends, identify areas of improvement across
functions, and meet the organizational goal of growth.


Data Requirement Gathering


 First of all, you have to think about why you want to do this data analysis. You need to find out the purpose or aim of the analysis and decide which type of data analysis you want to do. In this phase, you decide what to analyze and how to measure it: you have to understand why you are investigating and what measures you will use.
Data Collection
 After requirement gathering, you will have a clear idea of what you have to measure and what your findings should be. Now it’s time to collect your data based on the requirements. Once you collect your data, remember that it must be processed or organized for analysis. As you collect data from various sources, you must keep a log with the collection date and source of the data.
Data Cleaning
 Whatever data is collected may contain items that are not useful or are irrelevant to the aim of your analysis, hence it should be cleaned. The collected data may contain duplicate records, white spaces, or errors. The data should be cleaned and error free. This phase must be done before analysis, because the better the cleaning, the closer the output of your analysis will be to your expected outcome.
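The three problems named above (duplicate records, white spaces, and errors) can be handled in a few lines of plain Python. The records and the clean_records helper are made up for illustration; real projects often use a library such as pandas for the same job.

```python
# Made-up raw records with the three problems named above.
RAW = [
    {"customer": "  Asha ", "amount": "250"},   # stray white space
    {"customer": "Asha", "amount": "250"},      # duplicate once trimmed
    {"customer": "Ravi", "amount": "n/a"},      # error: not a number
    {"customer": "Meena", "amount": " 400 "},
]


def clean_records(raw):
    """Trim white space, reject unparsable amounts, and drop duplicates."""
    seen = set()
    cleaned, rejected = [], []
    for row in raw:
        name = row["customer"].strip()
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            rejected.append(row)  # keep bad rows aside for later review
            continue
        key = (name, amount)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"customer": name, "amount": amount})
    return cleaned, rejected


cleaned, rejected = clean_records(RAW)
print("cleaned:", cleaned)
print("rejected:", rejected)
```

Rejected rows are kept in a separate list instead of being silently dropped, so that the source of the errors can be checked later.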


Data Analysis
 Once the data is collected, cleaned, and processed, it is ready for Analysis. As you manipulate data, you may
find you have the exact information you need, or you might need to collect more data. During this phase,
you can use data analysis tools and software which will help you to understand, interpret, and derive
conclusions based on the requirements.
Data Interpretation
 After analyzing your data, it’s finally time to interpret your results. You can choose how to express or communicate your data analysis: simply in words, or with a table or chart. Then use the results of your data analysis process to decide your best course of action.
Data Visualization
 Data visualization is very common in day-to-day life; it often appears in the form of charts and graphs. In other words, data is shown graphically so that it is easier for the human brain to understand and process. Data visualization is often used to discover unknown facts and trends. By observing relationships and comparing datasets, you can find meaningful information.

Why Data Analysis?

 To grow your business, or even to grow in your life, sometimes all you need to do is analysis!
 If your business is not growing, then you have to look back, acknowledge your mistakes, and make a plan again without repeating those mistakes. And even if your business is growing, you have to look forward to making it grow more. All you need to do is analyze your business data and business processes.

 A web scraper works in the following steps:
 Step 1: Downloading Contents from Web Pages. In this step, a web scraper downloads the requested contents from multiple web pages.
 Step 2: Extracting Data. The data on websites is HTML and mostly unstructured. Hence, in this step, the web scraper parses the downloaded contents and extracts structured data from them.
 Step 3: Storing the Data. Here, the web scraper stores the extracted data in a format such as CSV or JSON, or in a database.
 Step 4: Analyzing the Data. After all these steps are successfully done, the web scraper analyzes the data thus obtained.

Use Cases of Data Science

 Let’s take a look at some use cases of Data Science.


 Amazon: Amazon uses a personalized recommendation system to improve customer
satisfaction. This is majorly dependent on predictive analytics. Amazon analyzes the
user’s purchase history to recommend more products.
 Spotify: Spotify utilizes Data Science to offer personalized music recommendations
to the users. In 2013, Spotify made predictions about the Grammy Award Winners
by analyzing what music its users listen to. Out of the 6 predictions, 4 came true.
 Uber: Uber utilizes big data to gain better insights and provide better service to the
users. With its huge database of drivers, it can suggest to users the most suitable one.
Uber charges the customers based on the time it takes to get to the destination. This
prediction is helped by various algorithms.

Advantages and Disadvantages of Data Science

 Advantages:
 It helps us to get insights from historical data with its powerful tools.
 It helps to optimize the business, hire the right persons and generate more revenue, as using data science
helps you make better future decisions for the business.
 Companies can develop and market their products better as they can better select their target customers.
 Data Science also helps consumers search for better goods, especially on e-commerce sites, based on data-driven recommendation systems.
 Disadvantages:
 The disadvantages generally arise when data science is used for customer profiling and infringes on customer privacy.
 Customers' information, such as transactions, purchases, and subscriptions, is visible to the parent companies.
 The information obtained using data science can be used against a certain group, individual, country, or community.


THANK YOU
