
Please read this disclaimer before proceeding:

This document is confidential and intended solely for the educational purpose of RMK Group of Educational Institutions. If you have received this document through email in error, please notify the system manager. This document contains proprietary information and is intended only for the respective group / learning community. If you are not the addressee, you should not disseminate, distribute, or copy it through e-mail. Please notify the sender immediately by e-mail if you have received this document by mistake and delete it from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing, or taking any action in reliance on the contents of this information is strictly prohibited.
22AI302
DATA SCIENCE USING
PYTHON
Department: AI & DS
Batch/Year: 2022-2026 / II Year
Created by: Ms. Divya D M / Asst. Professor

Date: 24.07.2023
1.Table of Contents

1. Contents
2. Course Objectives
3. Pre-Requisites
4. Syllabus
5. Course Outcomes
6. CO – PO/PSO Mapping
7. Lecture Plan
8. Activity Based Learning
9. Lecture Notes
10. Assignments
11. Part A Q & A
12. Part B Qs
13. Supportive Online Certification Courses
14. Real Time Applications in Day-to-Day Life and to Industry
15. Contents beyond the Syllabus
16. Assessment Schedule
17. Prescribed Text Books and Reference Books
18. Mini Project Suggestions


2.Course Objectives

➢ To learn the fundamentals of data science
➢ To experiment with and implement Python libraries for data science
➢ To apply and implement basic classification algorithms
➢ To apply clustering and outlier detection approaches
➢ To present and interpret data using visualization tools in Python
3.Pre-Requisites

Semester-III: Data Science using Python
Semester-II: Python Programming
Semester-I: C Programming
4.SYLLABUS
22AI302 DATA SCIENCE USING PYTHON    L T P C: 2 0 2 3

UNIT I INTRODUCTION 6+6

Data Science: Benefits and uses – facets of data - Data Science Process: Overview –
Defining research goals – Retrieving data – data preparation - Exploratory Data
analysis – build the model – presenting findings and building applications - Data
Mining - Data Warehousing – Basic statistical descriptions of Data.

List of Exercise/Experiments:

1. Download, install and explore the features of R/Python for data analytics
• Installing Anaconda
• Basic Operations in Jupyter Notebook
• Basic Data Handling

UNIT II PYTHON LIBRARIES FOR DATA SCIENCE 6+6


Introduction to Numpy - Multidimensional Ndarrays – Indexing – Properties –
Constants – Data Visualization: Ndarray Creation – Matplotlib - Introduction to
Pandas – Series – Dataframes – Visualizing the Data in Dataframes - Pandas Objects
– Data Indexing and Selection – Handling missing data – Hierarchical indexing –
Combining datasets – Aggregation and Grouping – Joins- Pivot Tables - String
operations – Working with time series – High performance Pandas.

List of Exercise/Experiments:

1. Working with Numpy arrays - Creation of numpy array using the tuple, Determine
the size, shape and dimension of the array, Manipulation with array Attributes,
Creation of Sub array, Perform the reshaping of the array along the row vector and
column vector, Create Two arrays and perform the concatenation among the arrays.

2. Working with Pandas data frames - Series, DataFrame, and Index; implement the data selection operations; data indexing operations like loc, iloc, and ix; handling missing data like None and NaN; manipulate null values (isnull(), notnull(), dropna(), fillna()).
3. Perform statistics operations on the data (sum, product, median, minimum and maximum, quantiles, argmin, argmax, etc.).
4. Using any data set, compute the mean, standard deviation, and percentiles.

UNIT III CLASSIFICATION 6+6


Basic Concepts – Decision Tree Induction – Bayes Classification Methods –
Rule-Based Classification – Model Evaluation and Selection
Bayesian Belief Networks – Classification by Back propagation – Support Vector
Machines – Associative Classification – K-Nearest-Neighbor Classifiers – Fuzzy Set
Approaches - Multiclass Classification - Semi-Supervised Classification.

List of Exercise/Experiments:
1. Apply Decision Tree algorithms on any data set.
2. Apply SVM on any data set
3. Implement K-Nearest-Neighbor Classifiers

UNIT IV CLUSTERING AND OUTLIER DETECTION 6+6

Cluster Analysis – Partitioning Methods – Evaluation of Clusters – Probabilistic


Model-Based Clustering – Outliers and Outlier Analysis – Outlier Detection Methods –
Statistical Approaches – Clustering and Classification-Based Approaches.

List of Exercise/Experiments:
1. Apply K-means algorithms for any data set.
2. Perform Outlier Analysis on any data set.

UNIT V DATA VISUALIZATION 6+6


Importing Matplotlib – Simple line plots – Simple scatter plots – visualizing errors –
density and contour plots – Histograms – legends – colors – subplots – text and
annotation – customization - three dimensional plotting - Geographic Data with
Basemap - Visualization with Seaborn.

List of Exercise/Experiments:
1. Basic plots using Matplotlib.
2. Implementation of Scatter Plot.
3. Construction of Histogram, bar plot, Subplots, Line Plots.

4. Implement three-dimensional plotting.


5. Visualize a dataset with Seaborn.
TOTAL:30+30=60 PERIODS

TEXTBOOKS:

1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016.
2. Ashwin Pajankar, Aditya Joshi, “Hands-on Machine Learning with Python: Implement Neural Network Solutions with Scikit-learn and PyTorch”, Apress, 2022.
3. Jake VanderPlas, “Python Data Science Handbook – Essential Tools for Working with Data”, O’Reilly, 2017.

REFERENCES:
1. Roger D. Peng, R Programming for Data Science, Lulu.com, 2016.
2. Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann, 2012.
3. Samir Madhavan, Mastering Python for Data Science, Packt Publishing, 2015.
4. Laura Igual, Santi Seguí, "Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications", 1st Edition, Springer, 2017.
5. Peter Bruce, Andrew Bruce, "Practical Statistics for Data Scientists: 50 Essential Concepts", 3rd Edition, O'Reilly, 2017.
6. Hector Guerrero, "Excel Data Analysis: Modelling and Simulation", 2nd Edition, Springer International Publishing, 2019.

NPTEL Courses:
a. Data Science for Engineers - https://onlinecourses.nptel.ac.in/noc23_cs17/preview
b. Python for Data Science - https://onlinecourses.nptel.ac.in/noc23_cs21/preview

LIST OF EQUIPMENTS:
Systems with Anaconda, Jupyter Notebook, Python, Pandas, NumPy, Matplotlib
5.COURSE OUTCOMES

At the end of this course, the students will be able to:

COURSE OUTCOMES HKL

CO1 Explain the fundamentals of data science. K2

CO2 Experiment python libraries for data science. K3

CO3 Apply and implement basic classification algorithms. K3

CO4 Implement clustering and outlier detection approaches. K4

CO5 Present and interpret data using visualization tools in Python. K3


6. CO – PO/PSO Mapping Matrix

CO   PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12   PSO1 PSO2 PSO3
1 3 2 2 1 1 1 1 1 2 3 3

2 3 3 3 3 1 1 1 1 2 3 3

3 3 3 3 3 3 3 3 3 2 3 3

4 3 3 3 3 3 3 3 3 2 3 3

5 3 3 3 3 3 3 3 3 2 3 3
7. Lecture Plan – Unit I: INTRODUCTION

Sl. No. | Topic | No. of Periods | Proposed Date | Actual Lecture Date | CO | Taxonomy Level | Mode of Delivery
1 | Data Science: Benefits and uses – facets of data | 2 | 07.08.2023 & 09.08.2023 | | CO1 | K2 | PPT / Chalk & Talk
2 | Data Science Process: Overview – Defining research goals – Retrieving data – data preparation | 1 | 10.08.2023 | | CO1 | K2 | PPT / Chalk & Talk
3 | Exploratory Data analysis | 1 | 12.08.2023 | | CO1 | K2 | PPT / Chalk & Talk
4 | Build the model – presenting findings and building applications | 1 | 16.08.2023 | | CO1 | K2 | PPT / Chalk & Talk
5 | Data Mining | 1 | 17.08.2023 | | CO1 | K2 | PPT / Chalk & Talk
6 | Data Warehousing | 1 | 19.08.2023 | | CO1 | K2 | PPT / Chalk & Talk
7 | Basic statistical descriptions of Data | 2 | 23.08.2023 & 24.08.2023 | | CO1 | K2 | PPT / Chalk & Talk
8. ACTIVITY BASED LEARNING

Activity name:

Building a Predictive Analytics Model for Real-Time Applications

Students will gain a better understanding of how predictive analysis is performed for various applications by following the steps below.

● Defining business objectives. The project starts with a well-defined business objective.
● Preparing data. You'll use historical data to train your model.
● Sampling your data.
● Building the model.
● Deploying the model.

Guidelines for the activity:

1) Students form groups (3 students per team).
2) Identify an organization.
3) Apply the data science process (follow the steps mentioned above).
4) Conduct a peer review (each team will be reviewed by all other teams and mentors).
UNIT-1
INTRODUCTION
9.LECTURE NOTES
1. Big Data

Big data is a collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques such as relational database management systems (RDBMS).

1.1 Characteristics of big data

The characteristics of big data are often referred to as the three Vs:

■ Volume—How much data is there?

■ Variety—How diverse are different types of data?

■ Velocity—At what speed is new data generated?

Often these characteristics are complemented with a fourth V, veracity: How


accurate is the data? These four properties make big data different from the data
found in traditional data management tools. Consequently, the challenges they bring
can be felt in almost every aspect: data capture, curation, storage, search, sharing,
transfer, and visualization. In addition, big data calls for specialized techniques to
extract the insights.

2. Data Science

As the amount of data continues to grow, the need to leverage it becomes more
important. Data science involves using methods to analyze massive amounts of data
and extract the knowledge it contains. Data science and big data evolved from
statistics and traditional data management but are now considered to be distinct
disciplines. Data science is an evolutionary extension of statistics capable of dealing
with the massive amounts of data produced today. It adds methods from computer
science to the repertoire of statistics.
2.1 Benefits and uses of data science

Data science and big data are used almost everywhere in both commercial and
noncommercial settings.

2.1.1 Commercial companies in almost every industry use data science and big data to gain insights into their customers, processes, staff, competition, and products.

● Many companies use data science to offer customers a better user


experience, as well as to cross-sell, up-sell, and personalize their offerings.

● A good example of this is Google AdSense, which collects data from


internet users so relevant commercial messages can be matched to the
person browsing the internet.

● MaxPoint is another example of real-time personalized advertising.

● Human resource professionals use people analytics and text mining to


screen candidates, monitor the mood of employees, and study informal
networks among coworkers.

● Financial institutions use data science to predict stock markets,


determine the risk of lending money, and learn how to attract new clients for
their services.

2.1.2 Governmental organizations are also aware of data’s value. Many


governmental organizations not only rely on internal data scientists to discover
valuable information, but also share their data with the public.

● This data is used to gain insights or build data-driven applications.

● Data.gov is the home of the US Government’s open data. A data


scientist in a governmental organization gets to work on diverse projects such
as detecting fraud and other criminal activity or optimizing project funding.
The American National Security Agency and the British Government
Communications Headquarters use data science and big data to monitor
millions of individuals.

● These organizations collected 5 billion data records from widespread


applications such as Google Maps, Angry Birds, email, and text messages,
among many other data sources. Then they applied data science techniques
to distill information.

2.1.3 Nongovernmental organizations (NGOs) use data to raise money and


defend their causes.

● The World Wildlife Fund (WWF) employs data scientists to increase the
effectiveness of their fundraising efforts.

● Many data scientists devote part of their time to helping NGOs,


because NGOs often lack the resources to collect data and employ data
scientists.

● DataKind is one such data scientist group that devotes its time to the
benefit of mankind.

2.1.4 Universities use data science in their research and also to enhance the study experience of their students. Massive open online courses (MOOCs) produce a lot of data, which allows universities to study how this type of learning can complement traditional classes. Examples of MOOCs are Coursera, Udacity, and edX.

3. Facets of data

In data science and big data, there are many different types of data. Each of them
tends to require different tools and techniques.

The main categories of data are these:

■ Structured
■ Unstructured

■ Natural language

■ Machine-generated

■ Graph-based

■ Audio, video, and images

■ Streaming

3.1 Structured data

Structured data is data that depends on a data model and resides in a fixed field
within a record.

It is easy to store structured data in tables within databases or Excel files, as shown in figure 1.1. SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases.
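As a minimal sketch of this idea (the in-memory SQLite table and its columns are hypothetical, created only for illustration), structured data can be queried with SQL and loaded into Python:

import sqlite3
import pandas as pd

# Build a small in-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Alice", 34, "India"), ("Bob", 29, "Brazil"), ("Carol", 41, "India")],
)

# SQL is the natural way to query structured, tabular data.
df = pd.read_sql_query("SELECT country, COUNT(*) AS n FROM people GROUP BY country", conn)
print(df)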

There can be structured data that is difficult to store in a traditional relational


database. Hierarchical data such as a family tree is one such example. More often,
data comes unstructured.

Figure 1.1 An Excel table is an example of structured data.


3.2 Unstructured data

Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is the regular email, as shown in figure 1.2. Although email contains structured elements such as the sender, title, and body text, it is a challenge to find the number of people who have written an email complaint about a specific employee, because so many ways exist to refer to a person. The thousands of different languages and dialects out there further complicate this.

Figure 1.2 Email is simultaneously an example of unstructured data and natural


language data.
3.3 Natural language

Natural language is a special type of unstructured data. It is challenging to process


because it requires knowledge of specific data science techniques and linguistics. A
human-written email as shown in figure 1.2 is also an example of natural language
data.

The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion, and sentiment analysis, but
models trained in one domain don’t generalize well to other domains. Even
state-of-the-art techniques aren’t able to decipher the meaning of every piece of text
as it is ambiguous by nature.

3.4 Machine-generated data

Machine-generated data is information that is automatically created by a computer,


process, application, or other machine without human intervention. Machine-generated data is becoming a major data resource and will
continue to do so. Wikibon has forecast that the market value of the industrial
Internet (a term coined by Frost & Sullivan to refer to the integration of complex
physical machinery with networked sensors and software) will be approximately
$540 billion in 2020. IDC (International Data Corporation) has estimated there will
be 26 times more connected things than people in 2020. This network is commonly
referred to as the internet of things.

The analysis of machine data relies on highly scalable tools, due to its high volume
and speed. Examples of machine data are web server logs, call detail records,
network event logs and telemetry as shown in figure 1.3.

The machine data shown in figure 1.3 would fit in a classic table-structured
database. This isn’t the best approach for highly interconnected or “networked” data,
where the relationships between entities have a valuable role to play.
Figure 1.3 Example of machine-generated data

3.5 Graph-based or network data

In graph theory, a graph is a mathematical structure to model pairwise relationships


between objects. Graph or network data is the data that focuses on the relationship
or adjacency of objects. The graph structures use nodes, edges, and properties to
represent and store graphical data. Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people. Examples of
graph-based data can be found on many social media websites as shown in the
figure 1.4.
Figure 1.4 Friends in a social network are an example of graph-based data.

Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL. Graph data poses its own challenges, but for a computer interpreting audio and image data, it can be even more difficult.

3.6 Audio, image and video

Audio, image and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers. In game analytics, high-speed cameras
at stadiums will capture ball and athlete movements to calculate in real time, for
example, the path taken by a defender relative to two baselines.

A company called DeepMind succeeded at creating an algorithm that’s capable of


learning how to play video games. This algorithm takes the video screen as input
and learns to interpret everything via a complex process of deep learning which
prompted Google to buy the company for their own Artificial Intelligence (AI)
development plans. The learning algorithm takes in data as it’s produced by the
computer game; it’s streaming data.
3.7 Streaming data

While streaming data can take almost any of the previous forms, it has an extra
property. The data flows into the system when an event happens instead of being
loaded into a data store in a batch. Although this isn’t really a different type of data,
we treat it here as such because we need to adapt our process to deal with this type
of information. Examples are the “What’s trending” on Twitter, live sporting or music
events, and the stock market.

4. Overview of data science process

The typical data science process consists of six steps, as shown in figure 4.1, which presents the main steps and actions of the process.

1. The first step of this process is setting a research goal. The main
purpose here is making sure all the stakeholders understand the what, how,
and why of the project. In every serious project this will result in a project
charter.

2. The second phase is data retrieval. The data is required for analysis, so
this step includes finding suitable data and getting access to the data from the
data owner. The result is data in its raw form, which probably needs
transformation before it becomes usable.

3. The third step is data preparation. This includes transforming the data
from a raw form into data that is directly usable in the models. To achieve
this, detect and correct different kinds of errors in the data, combine data
from different data sources and transform it. After completing this step, we
can progress to data visualization and modeling.

4. The fourth step is data exploration. The goal of this step is to gain a
deep understanding of the data, look for patterns, correlations, deviations
based on visual and descriptive techniques. The insights gained from this
phase will enable us to start modeling.

5. The fifth step is data modeling. It is now that you attempt to gain the
insights or make the predictions stated in your project charter.

6. The last step of the data science model is presenting the results and
automating the analysis, if needed. One goal of a project is to change a
process and/or make better decisions. The importance of this step is more
apparent in projects on a strategic and tactical level. Certain projects require
performing the business process over and over again, so automating the
project will save time.
In reality we won’t progress in a linear way from step 1 to step 6. Often we’ll regress and iterate between the different phases. Following these six steps will lead to a higher project success ratio and an increased impact of research results. This process
ensures that we have a well-defined research plan, a good understanding of the
business question and clear deliverables before even starting to look at the data. The
first steps of the process focus on getting high-quality data as input for the models.
This way the models will perform better later on.

Step 1: Defining research goals and creating a project charter

A project starts by understanding the what, the why, and the how of our project as
shown in the figure 2.2. What does the company expect you to do? And why does
management place such a value on your research? Answering these three questions (what, why, how) is the goal of the first phase.
The outcome should be a clear research goal, a good understanding of the context,
well-defined deliverables and a plan of action with a timetable. This information is
then best placed in a project charter. The length and formality differ between
projects and companies.

1.1 Spend time understanding the goals and context of your research

An essential outcome is the research goal that states the purpose of your
assignment in a clear and focused manner. Understanding the business goals and
context is critical for project success.

1.2 Create a project charter

Clients like to know upfront what they are paying for, so after getting a good
understanding of the business problem, try to get a formal agreement on the
deliverables. All this information is best collected in a project charter. For any
significant project this would be mandatory.

A project charter requires teamwork and the inputs should cover the following:

● A clear research goal

● The project mission and context

● How to perform analysis

● What resources are expected to be used

● Proof that it’s an achievable project or proof of concepts

● Deliverables and a measure of success

● A timeline

The client can use this information to make an estimation of the project costs and the data and people required for the project to become a success.
Step 2: Retrieving data

The next step in data science is to retrieve the required data as shown in figure 2.3.
Sometimes there is a need to go into the field and design a data collection process.
Many companies will have already collected and stored the data and what they don’t
have can often be bought from third parties. The organizations are also making
high-quality data freely available for public and commercial use.

Data can be stored in many forms, ranging from simple text files to tables in a
database. The objective is acquiring all the data needed. This may be difficult, and
even if you succeed, data needs polishing to be of any use to you.

2.1 Start with data stored within the company ( Internal data )

The first step is to assess the relevance and quality of the data that is readily
available within the company. Most companies have a program for maintaining key
data, so much of the cleaning work may already be done. This data can be stored in
official data repositories such as databases, data marts, data warehouses, and data
lakes maintained by a team of IT professionals. The primary goal of a database is
data storage, while a data warehouse is designed for reading and analyzing that
data.
A data mart is a subset of the data warehouse and geared toward serving a specific
business unit. While data warehouses and data marts are home to preprocessed
data, data lakes contain data in its natural or raw format. Getting access to data is
another difficult task. Organizations understand the value and sensitivity of data and
often have policies in place so everyone has access to what they need and nothing
more. These policies translate into physical and digital barriers called Chinese walls.
These “walls” are mandatory and well-regulated for customer data in most countries.
Getting access to the data may take time and involve company politics.

2.2 Don’t be afraid to shop around ( External data )

If data isn’t available inside your organization, look outside your organization’s walls.
Many companies specialize in collecting valuable information. Other companies
provide data so that you, in turn, can enrich their services and ecosystem. Such is
the case with Twitter, LinkedIn, and Facebook. Although certain companies consider data an increasingly valuable asset, more and more governments and organizations share their data for free with the world. This data can be of excellent quality, though that depends on the institution that creates and manages it. The information they share
covers a broad range of topics such as the number of accidents or amount of drug
abuse in a certain region and its demographics. This data is helpful when you want
to enrich proprietary data but also convenient when training your data science skills.
Table 2.1 shows only a small selection from the growing number of open-data
providers.

Table 2.1 A small selection of open-data providers

● Data.gov: the home of the US Government’s open data
● https://open-data.europa.eu/: the home of the European Commission’s open data
● Freebase.org: an open database that retrieves its information from sites like Wikipedia, MusicBrainz, and the SEC archive
● Data.worldbank.org: open data initiative from the World Bank
● Aiddata.org: open data for international development
● Open.fda.gov: open data from the US Food and Drug Administration

2.3 Do data quality checks to prevent problems

A good portion of the project time should be spent on doing data correction and
cleansing. The retrieval of data is the first time we’ll inspect the data in the data
science process. Most of the errors encountered during the data gathering phase are
easy to spot. The data is investigated during the import, data preparation and
exploratory phases. The difference is in the goal and the depth of the investigation.
During data retrieval, check to see if the data is equal to the data in the source
document and to see if we have the right data types.

Step 3: Cleansing, integrating and transforming data ( Data preparation )

The data received from the data retrieval phase is likely to be “a diamond in the
rough.” so the task is to prepare it for use in the modeling and reporting phase.

The model needs the data in specific format, so data transformation will always
come into play. It is good to correct data errors as early on in the process as
possible. Figure 2.4 shows the most common actions to take during the data
cleansing, integration, and transformation phase.
3.1 Cleansing data

Data cleansing is a subprocess of the data science process that focuses on removing
errors in the data so that the data becomes a true and consistent representation of
the processes it originates from. There are at least two types of errors. The first type
is the interpretation error, such as when you take the value in your data for granted,
like saying that a person’s age is greater than 300 years. The second type of error
points to inconsistencies between data sources or against your company’s
standardized values. The table below shows an overview of the types of errors that
can be detected with easy checks.

Errors pointing to false values within one dataset:

● Mistakes during data entry: manual overrules
● Redundant white space: use string functions
● Impossible values: manual overrules
● Missing values: remove observation or value
● Outliers: validate and, if erroneous, treat as missing value (remove or insert)

Errors pointing to inconsistencies between data sets:

● Deviations from a code book: match on keys or else use manual overrules
● Different units of measurement: recalculate
● Different levels of aggregation: bring to the same level of measurement by aggregation or extrapolation
General solution: Try to fix the problem early in the data acquisition chain or else fix
it in the program. Sometimes there is a need to use more advanced methods, such
as simple modeling, to find and identify data errors; diagnostic plots can be
especially insightful.

3.1.1 Data entry errors

Data collection and data entry are error-prone processes. They often require human
intervention, as humans make typos or lose their concentration for a second and
introduce an error into the chain. But data collected by machines or computers isn’t
free from errors either. Errors can arise from human sloppiness, whereas others are
due to machine or hardware failure. Examples of errors originating from machines
are transmission errors or bugs in the extract, transform, and load phase (ETL).

For small data sets we can check every value by hand. Detecting data errors when
the variables we study don’t have many classes can be done by tabulating the data
with counts. When we have a variable that can take only two values: “Good” and
“Bad”, we can create a frequency table and see if those are truly the only two values
present.

In table 2.3, the values “Godo” and “Bade” point out something went wrong in at
least 16 cases.
Value Count

Good 1598647

Bad 1354468

Godo 15

Bade 1

Table 2.3 Detecting outliers on simple variables with a frequency table

Most errors of this type are easy to fix with simple assignment statements and
if-then else rules:

if x == "Godo":
    x = "Good"
elif x == "Bade":
    x = "Bad"
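A small pandas sketch of the same idea (the column name and values are hypothetical): tabulate the values with value_counts() to spot rare misspelled categories, then correct them in one pass with replace():

import pandas as pd

# Hypothetical column of labels containing a few typos.
df = pd.DataFrame({"quality": ["Good", "Bad", "Godo", "Good", "Bade", "Bad"]})

# Frequency table: misspelled categories show up as rare extra values.
print(df["quality"].value_counts())

# Fix the detected typos.
df["quality"] = df["quality"].replace({"Godo": "Good", "Bade": "Bad"})
print(df["quality"].value_counts())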
3.1.2 Redundant whitespace

Whitespaces tend to be hard to detect but cause errors like other redundant
characters would. Fixing redundant whitespaces can be done by most programming
languages. They all provide string functions that will remove the leading and trailing
whitespaces. For instance, in Python you can use the strip() function to remove
leading and trailing spaces.
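A minimal sketch of stripping redundant whitespace, first with a plain Python string and then with a hypothetical pandas column:

import pandas as pd

# Plain Python: strip() removes leading and trailing whitespace.
print("  FR ".strip())            # "FR"

# pandas: the .str accessor applies the same fix to a whole column.
df = pd.DataFrame({"country": ["  FR", "US ", " IN "]})
df["country"] = df["country"].str.strip()
print(df["country"].tolist())     # ["FR", "US", "IN"]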

Fixing capital letter mismatches

Capital letter mismatches are common. Most programming languages make a


distinction between “Brazil” and “brazil”. In this case we can solve the problem by applying a function that returns both strings in lowercase, such as .lower() in Python. For example, "Brazil".lower() == "brazil".lower() should result in True.

3.1.3 Impossible values and sanity checks

Sanity checks are another valuable type of data check. Here we check the value
against physically or theoretically impossible values such as people taller than 3
meters or someone with an age of 299 years. Sanity checks can be directly
expressed with rules:

check = 0 <= age <= 120
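The same rule extends naturally to a whole column; a minimal pandas sketch, where the age column is hypothetical:

import pandas as pd

df = pd.DataFrame({"age": [25, 299, 42, -3, 61]})

# Sanity check: ages outside a physically plausible range are suspect.
valid = df["age"].between(0, 120)
print(df[~valid])   # the rows with ages 299 and -3 are flagged for inspection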

3.1.4 Outliers

An outlier is an observation that seems to be distant from other observations or,


more specifically, one observation that follows a different logic or generative process
than the other observations. The easiest way to find outliers is to use a plot or a
table with the minimum and maximum values. An example is shown in figure 2.6.
The plot on the top shows no outliers, whereas the plot on the bottom shows
possible outliers on the upper side when a normal distribution is expected. The
normal distribution or Gaussian distribution is the most common distribution.
Figure 2.6 Expected distribution (top) and a distribution with possible outliers (bottom)

● Distribution plots are helpful in detecting outliers and help in understanding the variable.
● It shows most cases occurring around the average of the distribution and the
occurrences decrease when further away from it.
● The high values in the bottom graph can point to outliers when assuming a
normal distribution.
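A minimal numeric sketch of flagging such outliers, assuming an approximately normal distribution and using the common three-standard-deviation rule of thumb (the threshold and the synthetic data are assumptions, not taken from the text):

import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=1000), [95.0, 120.0])

# Flag observations more than 3 standard deviations from the mean.
z = (values - values.mean()) / values.std()
outliers = values[np.abs(z) > 3]
print(outliers)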
3.1.5 Dealing with missing values

Missing values aren’t necessarily wrong, but we still need to handle them separately. Certain modeling techniques can’t handle missing values. Missing values might be an indicator that something went wrong in data collection or that an error happened in the ETL process. Common techniques are listed in the table below; which technique to use depends on the particular case. A short pandas sketch of a few of these techniques follows the table.

● Omit the values. Advantage: easy to perform. Disadvantage: you lose the information from an observation.

● Set value to null. Advantage: easy to perform. Disadvantage: not every modeling technique and/or implementation can handle null values.

● Impute a static value such as 0 or the mean. Advantages: easy to perform; you don't lose information from the other variables in the observation. Disadvantage: can lead to false estimations from a model.

● Impute a value from an estimated or theoretical distribution. Advantage: does not disturb the model as much. Disadvantages: harder to execute; you make data assumptions.

● Modeling the value (nondependent). Advantage: does not disturb the model too much. Disadvantages: can lead to too much confidence in the model; can artificially raise dependence among the variables; harder to execute; you make data assumptions.

3.1.6 Deviations from a code book

Detecting errors in larger data sets against a code book or against standardized
values can be done with the help of set operations. A code book is a description of
data, a form of metadata.

It contains things such as the number of variables per observation, the number of observations, and what each encoding within a variable means. For instance, “0” equals “negative” and “5” stands for “very positive”. A code book also tells the type of data we are looking at: is it hierarchical, graph-based, or something else?

3.2 Combining data from different data sources

The data comes from several different places and we need to integrate these
different sources. Data varies in size, type and structure, ranging from databases
and Excel files to text documents.

The different ways of combining the data

There are two operations to combine information from different data sets.

● The first operation is joining: enriching an observation from one table


with information from another table.

● The second operation is appending or stacking: adding the


observations of one table to those of another table.

When the data is combined, we have the option to create a new physical table or a
virtual table by creating a view. The advantage of a view is that it doesn’t consume
more disk space.

3.2.1 Joining tables

Joining tables allows us to combine the information of one observation found in one
table with the information that is found in another table. The focus is on enriching a
single observation.
Let’s say that the first table contains information about the purchases of a customer
and the other table contains information about the region where your customer lives.
Joining the tables allows us to combine the information so that we can use it for a
model, as shown in figure 2.7. To join tables, we use variables that represent the
same object in both tables, such as a date, a country name, or a Social Security
number. These common fields are known as keys. When these keys also uniquely
define the records in the table they are called primary keys.

Figure 2.7 Joining two tables on the Item and Region keys

One table may have buying behavior and the other table may have demographic
information on a person. In figure 2.7 both tables contain the client name, and this
makes it easy to enrich the client expenditures with the region of the client.
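A minimal pandas sketch of joining two tables on a shared key (the client names and columns are hypothetical):

import pandas as pd

purchases = pd.DataFrame({"client": ["John Doe", "Jane Roe"],
                          "spent": [120.0, 85.5]})
regions = pd.DataFrame({"client": ["John Doe", "Jane Roe"],
                        "region": ["North", "South"]})

# Enrich each purchase observation with the client's region.
enriched = purchases.merge(regions, on="client", how="left")
print(enriched)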

3.2.2 Appending tables

Appending or stacking tables is effectively adding observations from one table to


another table. Figure 2.8 shows an example of appending tables. One table contains
the observations from the month January and the second table contains observations
from the month February. The result of appending these tables is a larger one with
the observations from January as well as February.
Figure 2.8 Appending data from tables

Appending data from tables is a common operation but requires an equal structure
in the tables being appended.
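A minimal pandas sketch of appending (stacking) two tables with the same structure, such as monthly observations (the tables are hypothetical):

import pandas as pd

january = pd.DataFrame({"client": ["John Doe"], "spent": [120.0]})
february = pd.DataFrame({"client": ["Jane Roe"], "spent": [85.5]})

# Stack the observations of both months into one larger table.
yearly = pd.concat([january, february], ignore_index=True)
print(yearly)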

3.2.3 Using views to simulate data joins and appends

To avoid duplication of data, we virtually combine data with views. The problem with duplicating data is that more storage space is needed. For small amounts of data that may not cause problems, but if every table consists of terabytes of data, it becomes problematic to duplicate the data. For this reason, the concept of a view
was invented. A view behaves as if we are working on a table, but this table is
nothing but a virtual layer that combines the tables. Figure 2.9 shows how the sales
data from the different months is combined virtually into a yearly sales table instead
of duplicating the data. Views do come with a drawback, however. While a table join
is only performed once, the join that creates the view is recreated every time it’s
queried, using more processing power than a pre-calculated table would have.
Figure 2.9 A view helps to combine data without replication.
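Views are a database feature rather than a pandas one; a minimal SQLite sketch (the table and view names are hypothetical) shows how a view combines monthly tables without copying the rows:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_jan (item TEXT, amount REAL)")
conn.execute("CREATE TABLE sales_feb (item TEXT, amount REAL)")
conn.execute("INSERT INTO sales_jan VALUES ('pen', 10.0)")
conn.execute("INSERT INTO sales_feb VALUES ('book', 25.0)")

# The view is only a stored query; the UNION is re-run every time it is read.
conn.execute("""CREATE VIEW sales_year AS
                SELECT * FROM sales_jan
                UNION ALL
                SELECT * FROM sales_feb""")
print(conn.execute("SELECT * FROM sales_year").fetchall())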

3.2.4 Enriching aggregated measures

Data enrichment can also be done by adding calculated information to the table, such
as the total number of sales or what percentage of total stock has been sold in a
certain region as shown in the figure below. We now have an aggregated data set,
which in turn can be used to calculate the participation of each product within its
category.
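A minimal pandas sketch of enriching a table with an aggregated measure, here each product's share of sales within its category (the categories and figures are hypothetical):

import pandas as pd

df = pd.DataFrame({"category": ["A", "A", "B", "B"],
                   "product": ["p1", "p2", "p3", "p4"],
                   "sales": [10, 30, 20, 20]})

# Add the category total to every row, then compute each product's share.
df["category_sales"] = df.groupby("category")["sales"].transform("sum")
df["share_in_category"] = df["sales"] / df["category_sales"]
print(df)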
3.3 Transforming data

Certain models require their data to be in a certain shape. After cleansing and
integrating the data, the next task is to transform the data so it takes a suitable form
for data modeling.

Relationships between an input variable and an output variable aren’t always linear. Take, for instance, a relationship of the form y = a * e^(bx). Taking the log of the independent variables simplifies the estimation problem dramatically. Figure 2.11
shows how transforming the input variables greatly simplifies the estimation
problem.

Figure 2.11 Transforming x to log x makes the relationship between x and y linear
(right), compared with the non-log x (left).
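A minimal NumPy sketch of the idea behind figure 2.11, using synthetic data only for illustration: y is generated with a logarithmic dependence on x, so an ordinary least-squares line fits well once x is transformed to log x:

import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(1, 100, 200)
y = 2.0 + 3.0 * np.log(x) + rng.normal(scale=0.3, size=x.size)

# x versus y is curved, but log(x) versus y is (approximately) linear,
# so a straight-line fit works well after the transformation.
slope, intercept = np.polyfit(np.log(x), y, deg=1)
print(slope, intercept)   # close to 3.0 and 2.0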

3.3.1 Reducing the number of variables

Sometimes there are too many variables and we need to reduce the number because
they don’t add new information to the model. Having too many variables in the
model makes the model difficult to handle, and certain techniques don’t perform well
when we overload them with too many input variables. For instance, all the
techniques based on a Euclidean distance perform well only up to 10 variables.
3.3.2 Turning variables into dummies

Variables can be turned into dummy variables as shown in figure 2.13. Dummy
variables can only take two values: true(1) or false(0). They are used to indicate the
absence of a categorical effect that may explain the observation. In this case there
are separate columns for the classes stored in one variable and indicate it with 1 if
the class is present and 0 otherwise. An example is turning one column named Weekdays into the columns Monday through Sunday. We use an indicator to show if the observation was on a Monday; put 1 on Monday and 0 elsewhere. Turning variables into dummies is a technique that is used in modeling.

Figure 2.13 Turning variables into dummies

Turning variables into dummies is a data transformation that breaks a variable that
has multiple classes into multiple variables, each having only two possible values: 0
or 1.
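A minimal pandas sketch of turning a Weekdays-style column into dummy columns with get_dummies() (the column name is hypothetical):

import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})

# One column per class, holding 1 when the class is present and 0 otherwise.
dummies = pd.get_dummies(df["weekday"], prefix="is", dtype=int)
print(pd.concat([df, dummies], axis=1))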
Step 4: Exploratory data analysis

Information becomes much easier to grasp when shown in a picture; therefore, you mainly use graphical techniques to gain an understanding of your data and the interactions between variables. This phase is about exploring the data, discovering anomalies that were missed before, and fixing them. The visualization techniques used in this phase range from simple line graphs or histograms, as shown in figure 2.15, to more complex diagrams such as Sankey and network graphs. Sometimes it's useful to compose a composite graph from simple graphs to get even more insight into the data. Other times the graphs can be animated or made interactive to make analysis easier.

Figure 2.14 Step 4: Data exploration


Figure 2.15 From top to bottom, a bar chart, a line plot, and a distribution are
some of the graphs used in exploratory analysis.
These plots can be combined to provide even more insight, as shown in figure 2.16.
Overlaying several plots is common practice.

Figure 2.16 Drawing multiple plots together helps to understand the structure of the
data over multiple variables.

Figure 2.17 A Pareto diagram


A Pareto diagram is a combination of the values and a cumulative distribution. It’s
easy to see from this diagram that the first 50% of the countries contain slightly less
than 80% of the total amount.

Figure 2.18 Link and brush allows us to select observations in one plot and highlight
the same observations in the other plots.

Figure 2.18 shows another technique: brushing and linking. With brushing and
linking we combine and link different graphs and tables (or views) so changes in one
graph are automatically transferred to the other graphs. This interactive exploration
of data facilitates the discovery of new insights.

Figure 2.18 shows the average score per country for questions. Not only does this
indicate a high correlation between the answers, but it’s easy to see that when we
select several points on a subplot, the points will correspond to similar points on the
other graphs. In this case the selected points on the left graph correspond to points
on the middle and right graphs, although they correspond better in the middle and
right graphs.

Two other important graphs are the histogram shown in figure 2.19 and the boxplot
shown in figure 2.20. In a histogram a variable is cut into discrete categories and the
number of occurrences in each category are summed up and shown in the graph.
The boxplot, on the other hand, doesn’t show how many observations are
present but does offer an impression of the distribution within categories. It
can show the maximum, minimum, median, and other characterizing
measures at the same time.

Figure 2.19 Example histogram: the number of people in the age groups of 5-year
intervals
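A minimal matplotlib sketch of both plots, using synthetic ages purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
ages = rng.normal(loc=40, scale=12, size=500).clip(0, 90)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: cut the variable into 5-year bins and count occurrences per bin.
ax1.hist(ages, bins=range(0, 95, 5))
ax1.set_title("Histogram of ages")

# Boxplot: shows median, quartiles, and possible outliers, but not the counts.
ax2.boxplot(ages)
ax2.set_title("Boxplot of ages")
plt.show()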
These techniques are mainly visual, but in practice they’re certainly not limited to
visualization techniques. Tabulation, clustering, and other modeling techniques can
also be a part of exploratory analysis. Even building simple models can be a part of
this step.

Step 5: Build the models

With clean data in place and a good understanding of the content, we are ready to
build models with the goal of making better predictions, classifying objects, or
gaining an understanding of the system that we are modeling. This phase is much
more focused than the exploratory analysis step, because we know what we are
looking for and what we want the outcome to be. Figure 2.21 shows the components
of model building.

Figure 2.21 Step 5: Data modeling

Building a model is an iterative process. Most of the models consist of the following
main steps:
1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison
5.1 Model and variable selection

We need to select the variables we want to include in our model and a modeling
technique. Many modeling techniques are available, and we need to choose the right model for the problem. The model's performance and all the requirements to use the model have to be considered. The other factors are:

■ Must the model be moved to a production environment and, if so, would it be easy
to implement?

■ How difficult is the maintenance on the model: how long will it remain relevant if
left untouched?

■ Does the model need to be easy to explain?

5.2 Model execution

Once we have chosen a model we will need to implement it in code. The following
listing shows the execution of a linear prediction model.

Listing 2.1 Executing a linear prediction model on semi-random data
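The code of Listing 2.1 is not reproduced in this extract. A minimal sketch in the same spirit, fitting an ordinary least squares model with statsmodels on semi-random data (the variable names and coefficients here are illustrative assumptions, not the original listing):

import numpy as np
import statsmodels.api as sm

# Two semi-random predictor columns.
np.random.seed(0)
predictors = np.random.random(1000).reshape(500, 2)

# Target built from the predictors plus a bit of randomness,
# so a linear model should fit well.
target = predictors.dot(np.array([0.8, 1.1])) + np.random.random(500)

model = sm.OLS(target, predictors)
results = model.fit()
print(results.summary())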


Figure 2.22 Linear regression tries to fit a line while minimizing the distance to each
point

We created predictor values that are meant to predict how the target variables
behave. For a linear regression, a “linear relation” between each x (predictor) and
the y (target) variable is assumed, as shown in figure 2.22. We created the target
variable based on the predictor by adding a bit of randomness, this gives us a
well-fitting model. The results.summary() call outputs the table in figure 2.23. The exact outcome depends on the random variables we got. In that output:

R-squared and Adj. R-squared indicate model fit: higher is better, but too high is suspicious.

The p-value shows whether a predictor variable has a significant influence on the target: lower is better, and <0.05 is often considered “significant.”

The linear equation coefficients give the fitted model: y = 0.7658*x1 + 1.1252*x2.

Step 6: Presenting findings and building applications

After analysing the data and building a well-performing model, we have to present the findings, as shown in figure 6. The predictions of the models, or the insights produced, are of great value. For this reason, we may need to automate the models. We can also build an application that automatically updates reports, Excel spreadsheets, or PowerPoint presentations.

Figure 6: Presentation and automation


7. Data Mining

7.1 Why data mining?

We live in a world where vast amounts of data are collected daily. Analyzing such
data is an important need. Data mining provides tools to discover knowledge from
data. It can be viewed as a result of the natural evolution of information technology.
Data mining turns a large collection of data into knowledge.

● Terabytes or petabytes of data pour into computer networks, the World


Wide Web (WWW), and various data storage devices every day from
business, society, science and engineering, medicine, and almost every other
aspect of daily life. This growth of available data volume is a result of the
computerization of our society and the fast development of powerful data
collection and storage tools.

● Businesses worldwide generate gigantic data sets, including sales


transactions, stock trading records, product descriptions, sales promotions,
company profiles and performance, and customer feedback. For example,
large stores, such as Wal-Mart, handle hundreds of millions of transactions
per week at thousands of branches around the world.

● Scientific and engineering practices generate high orders of petabytes


of data in a continuous manner, from remote sensing, process measuring,
scientific experiments, system performance, engineering observations, and
environment surveillance. Global backbone telecommunication networks carry tens of petabytes of data traffic every day.

● The medical and health industry generates tremendous amounts of


data from medical records, patient monitoring, and medical imaging.

● Billions of Web searches supported by search engines process tens of


petabytes of data daily. Communities and social media have become
increasingly important data sources, producing digital pictures and videos,
blogs, Web communities and various kinds of social networks.
● The list of sources that generate huge amounts of data is endless. Powerful
and versatile tools are needed to automatically uncover valuable information
from the tremendous amounts of data and to transform such data into
organized knowledge. This necessity has led to the birth of data mining.

7.2 What is data mining?

Data mining is searching for knowledge (interesting patterns) in data. Data mining is
an essential step in the process of knowledge discovery. The knowledge discovery
process is shown in Figure 7 as an iterative sequence of the following steps:

1. Data cleaning: To remove noise and inconsistent data

2. Data integration: Where multiple data sources may be combined

3. Data selection: Where data relevant to the analysis task are retrieved from the
database

4. Data transformation: Where data are transformed and consolidated into form
appropriate for mining by performing summary or aggregation operations.

5. Data mining: An essential process where intelligent methods are applied to


extract data patterns.

6. Pattern evaluation: To identify the interesting patterns representing knowledge


based on interestingness measures.

7. Knowledge presentation: Where visualization and knowledge representation


techniques are used to present mined knowledge to users.
Figure 7.2: Data mining as a step in the process of knowledge discovery.

Steps 1 through 4 are different forms of data preprocessing, where data is prepared
for mining. The data mining step may interact with the user or a knowledge base.
The interesting patterns are presented to the user and may be stored as new
knowledge in the knowledge base.

Data mining is the process of discovering interesting patterns and knowledge from
large amounts of data. The data sources can include databases, data warehouses,
web, other information repositories or data that are streamed into the system
dynamically.
7.3 What kinds of data can be mined ?

The most basic forms of data for mining applications are database data, data
warehouse data and transactional data. Data mining can also be applied to other
forms of data such as data streams, ordered/sequence data, graph or networked
data, spatial data, text data, multimedia data and the WWW.

7.3.1 Database data

A database system, also called a database management system (DBMS), consists of


a collection of interrelated data known as a database and a set of software
programs to manage and access the data. The software programs provide
mechanisms for defining database structures and data storage, for specifying and
managing concurrent, shared, or distributed data access and for ensuring
consistency and security of the information stored despite system crashes or
attempts at unauthorized access.

Relational databases are one of the most commonly available and richest information
repositories and thus they are a major data form in the study of data mining.

7.3.2 Data warehouses

A data warehouse is a repository of information collected from multiple sources


stored under a unified schema and usually residing at a single site. Data warehouses
are constructed via a process of data cleaning, data integration, data transformation,
data loading, and periodic data refreshing.

For example, rather than storing the details of each sales transaction, the data
warehouse may store a summary of the transactions per item type for each store or
summarized to a higher level for each sales region.

A data warehouse is usually modeled by a multidimensional data structure, called a


data cube, in which each dimension corresponds to an attribute or a set of attributes
in the schema, and each cell stores the value of some aggregate measure such as
count or sum(sales amount). A data cube provides a multidimensional view of data
and allows the precomputation and fast access of summarized data.
Figure 7.3.2 Typical framework of a data warehouse for All Electronics.
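A pandas pivot table gives a small-scale flavour of the data cube idea described above; a minimal sketch (the store names, item types, and sales figures are hypothetical) aggregates sales per store and item type, much like one face of a data cube:

import pandas as pd

sales = pd.DataFrame({"store": ["Chennai", "Chennai", "Delhi", "Delhi"],
                      "item_type": ["phone", "laptop", "phone", "laptop"],
                      "amount": [200, 500, 150, 700]})

# Each cell holds an aggregate measure (sum of sales) for one store/item-type pair.
cube_slice = sales.pivot_table(index="store", columns="item_type",
                               values="amount", aggfunc="sum")
print(cube_slice)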

7.3.3 Transactional Data

Each record in a transactional database captures a transaction such as a customer’s


purchase, a flight booking or a user’s clicks on a webpage. A transaction typically
includes a unique transaction identity number (trans ID) and a list of the items
making up the transaction, such as the items purchased in the transaction. A
transactional database may have additional tables, which contain other information
related to the transactions, such as item description, information about the
salesperson or the branch, and so on.

7.3.4 Other kinds of data

There are many other kinds of data that have versatile forms and structures and
rather different semantic meanings. Such kinds of data can be seen in many
applications:

● Time-related or sequence data e.g historical records, stock exchange data and
time-series and biological sequence data.
● Data streams e.g video surveillance and sensor data, which are continuously
transmitted.
● Spatial data e.g maps.
● Engineering design data e.g the design of buildings, system components, or
integrated circuits.
● Hypertext and multimedia data including text, image, video, and audio data.

● Graph and networked data e.g social and information networks.

● The Web which is a huge, widely distributed information repository made


available by the Internet.

These applications bring about new challenges like how to handle data carrying
special structures such as sequences, trees, graphs, and networks and specific
semantics such as ordering, image, audio and video contents, and connectivity and
how to mine patterns that carry rich structures and semantics.

7.4 What kinds of patterns can be mined?

● Class/Concept Description: Characterization and Discrimination

● Mining Frequent Patterns, Associations, and Correlations

● Classification and Regression for Predictive Analysis

● Cluster Analysis

● Outlier Analysis

7.5 Technologies used in data mining

7.5.1 Statistics

Statistics studies the collection, analysis, interpretation or explanation and


presentation of data. Data mining has an inherent connection with statistics. A
statistical model is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their associated probability
distributions. Statistical models are widely used to model data and data classes. For
example, in data mining tasks like data characterization and classification, statistical
models of target classes can be built. Statistical methods can also be used to verify data mining results.
7.5.2 Machine Learning

For classification and clustering tasks, machine learning research often focuses on
the accuracy of the model. In addition to accuracy, data mining research places
strong emphasis on the efficiency and scalability of mining methods on large data
sets, as well as on ways to handle complex types of data and explore new,
alternative methods.

7.5.3 Database Systems and Data Warehouses

Database systems research focuses on the creation, maintenance, and use of


databases for organizations and end-users. Many data mining tasks need to handle
large data sets or even real-time, fast streaming data. Therefore, data mining can
make good use of scalable database technologies to achieve high efficiency and
scalability on large datasets. Data mining tasks can be used to extend the capability
of existing database systems to satisfy advanced user’s sophisticated data analysis
requirements.

A data warehouse integrates data originating from multiple sources and various
timeframes. It consolidates data in multidimensional space to form partially
materialized data cubes. The data cube model not only facilitates OLAP in
multidimensional databases but also promotes multidimensional data mining.

7.5.4 Information Retrieval

Information retrieval (IR) is the science of searching for documents or information in


documents. Documents can be text or multimedia, and may reside on the Web. A
text document which may involve one or multiple topics can be regarded as a
mixture of multiple topic models. By integrating information retrieval models and
data mining techniques, we can find the major topics in a collection of documents
and for each document in the collection, the major topics involved.
7.6 Which Kinds of Applications Are Targeted?

7.6.1 Business Intelligence

Business intelligence (BI) technologies provide historical, current, and predictive
views of business operations. Examples include reporting, online analytical
processing, business performance management, competitive intelligence,
benchmarking, and predictive analytics. Data mining is the core of business
intelligence. Online analytical processing tools in business intelligence rely on data
warehousing and multidimensional data mining. Classification and prediction
techniques are the core of predictive analytics in business intelligence, for which
there are many applications in analyzing markets, supplies, and sales.

7.6.2 Web Search Engines

A Web search engine is a specialized computer server that searches for information
on the Web. The search results of a user query are often returned as a list
sometimes called hits. The hits may consist of web pages, images, and other types
of files. Some search engines also search and return data available in public
databases or open directories.

Web search engines are essentially very large data mining applications. Various data
mining techniques are used in all aspects of search engines, ranging from crawling
(e.g., deciding which pages should be crawled and the crawling frequencies) and
indexing (e.g., selecting pages to be indexed and deciding to what extent the index
should be constructed) to searching (e.g., deciding how pages should be ranked,
which advertisements should be added, and how the search results can be
personalized or made “context aware”).

Search engines pose grand challenges to data mining. First, they have to handle a
huge and ever-growing amount of data. Second, Web search engines often have to
deal with online data; a related challenge is maintaining and incrementally updating a
model on fast-growing data streams. Third, Web search engines often have to deal
with queries that are asked only a very small number of times.
7.7 Major Issues in Data Mining

Major issues in data mining research are categorized into five groups: mining
methodology, user interaction, efficiency and scalability, diversity of data types, and
data mining and society.

7.7.1 Mining Methodology

Mining various and new kinds of knowledge: Data mining covers a wide spectrum of
data analysis and knowledge discovery tasks, from data characterization and
discrimination to association and correlation analysis, classification, regression,
clustering, outlier analysis, sequence analysis, and trend and evolution analysis.
These tasks may use the same database in different ways and require the
development of numerous data mining techniques.

Mining knowledge in multidimensional space: When searching for knowledge in
large data sets, we can explore the data in multidimensional space. That is, we can
search for interesting patterns among combinations of dimensions (attributes) at
varying levels of abstraction. Such mining is known as (exploratory) multidimensional
data mining. In many cases, data can be aggregated or viewed as a
multidimensional data cube. Mining knowledge in cube space can substantially
enhance the power and flexibility of data mining.

Data mining as an interdisciplinary effort: The power of data mining can be
substantially enhanced by integrating new methods from multiple disciplines.

Boosting the power of discovery in a networked environment: Most data objects
reside in a linked or interconnected environment, whether it be the Web, database
relations, files, or documents. Semantic links across multiple data objects can be
used to advantage in data mining.

Handling uncertainty, noise, or incompleteness of data: Data often contain noise,
errors, exceptions, or uncertainty, or are incomplete. Errors and noise may confuse
the data mining process, leading to the derivation of erroneous patterns.
Pattern evaluation and pattern- or constraint-guided mining: Not all the patterns
generated by data mining processes are interesting. What makes a pattern
interesting may vary from user to user. Therefore, techniques are needed to assess
the interestingness of discovered patterns based on subjective measures.

7.7.2 User Interaction

The user plays an important role in the data mining process.

Interactive mining: The data mining process should be highly interactive. Thus, it is
important to build flexible user interfaces and an exploratory mining environment,
facilitating the user’s interaction with the system.

Incorporation of background knowledge: Background knowledge, constraints, rules,
and other information regarding the domain under study should be incorporated into
the knowledge discovery process.

Ad hoc data mining and data mining query languages: Query languages (e.g., SQL)
have played an important role in flexible searching because they allow users to pose
ad hoc queries. Similarly, high-level data mining query languages or other high-level
flexible user interfaces will give users the freedom to define ad hoc data mining
tasks.

Presentation and visualization of data mining results: The data mining system
should present data mining results vividly and flexibly, so that the discovered
knowledge can be easily understood and directly usable by humans. This is
especially crucial if the data mining process is interactive. It requires the system to
adopt expressive knowledge representations, user-friendly interfaces, and
visualization techniques.

7.7.3 Efficiency and Scalability

Efficiency and scalability are always considered when comparing data mining
algorithms. As data amounts continue to multiply, these two factors are especially
critical.
Efficiency and scalability of data mining algorithms: Data mining algorithms must be
efficient and scalable in order to effectively extract information from huge amounts
of data in many data repositories or in dynamic data streams.

Parallel, distributed and incremental mining algorithms: The enormous size of many
data sets, the wide distribution of data and the computational complexity of some
data mining methods are factors that motivate the development of parallel and
distributed data-intensive mining algorithms.

Cloud computing and cluster computing, which use computers in a distributed and
collaborative way to tackle very large-scale computational tasks, are also active
research themes in parallel data mining.

7.7.4 Diversity of Database Types

The wide diversity of database types brings about challenges to data mining.

These include the following.

Handling complex types of data: Diverse applications generate a wide spectrum of
new data types, from structured data such as relational and data warehouse data to
semi-structured and unstructured data, from stable data repositories to dynamic data
streams, and from simple data objects to temporal data, biological sequences, sensor
data, spatial data, hypertext data, multimedia data, software program code, Web
data, and social network data.

Mining dynamic, networked, and global data repositories: Multiple sources of data
are connected by the internet and various kinds of networks, forming gigantic,
distributed, and heterogeneous global information systems and networks. The
discovery of knowledge from different sources of structured, semi-structured, or
unstructured yet interconnected data with diverse data semantics poses great
challenges to data mining.
7.7.5 Data Mining and Society

Social impacts of data mining: The improper disclosure or use of data and the
potential violation of individual privacy and data protection rights are areas of
concern that need to be addressed.

Privacy-preserving data mining: Data mining will help scientific discovery, business
management, economic recovery, and security protection (e.g., the real-time
discovery of intruders and cyber attacks).

Invisible data mining: More and more systems should have data mining functions
built within so that people can perform data mining or use data mining results simply
by mouse clicking, without any knowledge of data mining algorithms. Intelligent
search engines and Internet-based stores perform such invisible data mining by
incorporating data mining into their components to improve their functionality and
performance.

8. Basic Statistical Descriptions of Data

Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.

8.1 Measuring the Central Tendency: Mean, Median, and Mode

The attribute X, like salary, is recorded for a set of objects. Let x1, x2, ..., xN be the
set of N observed values or observations for X. These values may also be referred to
as the data set (for X). Plotting the observations for salary and seeing where most of
the values fall gives us an idea of the central tendency of the data. Measures of
central tendency include the mean, median, mode, and midrange. The most common
and effective numeric measure of the “center” of a set of data is the (arithmetic)
mean. Let x1, x2, ..., xN be a set of N values or observations, such as for some
numeric attribute X, like salary. The mean of this set of values is

x̄ = (x1 + x2 + ··· + xN)/N = (1/N) Σ xi

This corresponds to the built-in aggregate function average (avg() in SQL) provided
in relational database systems.
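As a quick illustrative sketch (not from the source text; the salary values, in thousands, are made up for demonstration), the mean, median, and mode can be computed in Python as follows:

# Illustrative sketch: central-tendency measures for a hypothetical salary sample.
import numpy as np
from collections import Counter

salary = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])  # made-up data, in $1000s

mean = salary.mean()                    # arithmetic mean: (1/N) * sum(x_i)
median = np.median(salary)              # middle value of the ordered observations
mode = Counter(salary.tolist()).most_common(1)[0][0]  # most frequent value
# Note: for multimodal data, Counter reports only one of the modes.

print(f"mean={mean:.2f}, median={median}, mode={mode}")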

8.2 Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard
Deviation and Interquartile Range

The measures such as range, quantiles, quartiles, percentiles, and the interquartile
range are used to assess the dispersion or spread of numeric data. The five-number
summary which can be displayed as a boxplot is useful in identifying outliers.
Variance and standard deviation also indicate the spread of a data distribution.

8.2.1 Range, Quartiles and Interquartile Range

Let x1,x2,...,xN be a set of observations for some numeric attribute, X. The range of
the set is the difference between the largest (max()) and smallest (min()) values.

Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-size consecutive sets. The 2-quantile is the data point dividing the
lower and upper halves of the data distribution; it corresponds to the median. The
4-quantiles are the three data points that split the data distribution into four equal
parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles.

The 100-quantiles are more commonly referred to as percentiles; they divide the
data distribution into 100 equal-size consecutive sets. The median, quartiles, and
percentiles are the most widely used forms of quantiles.

Figure 8.2.1 A plot of the data distribution for some attribute X. The quantiles
plotted are quartiles; the three quartiles divide the distribution into four equal-size
consecutive subsets, and the second quartile corresponds to the median.

The quartiles give an indication of a distribution’s center, spread, and shape. The first
quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the
data. The third quartile, denoted by Q3, is the 75th percentile. It cuts off the lowest
75% (or highest 25%) of the data. The second quartile is the 50th percentile; as the
median, it gives the center of the data distribution. The distance between the first
and third quartiles is a simple measure of spread that gives the range covered by the
middle half of the data. This distance is called the interquartile range (IQR) and is
defined as

IQR = Q3 − Q1.
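As an illustrative sketch (the observation values are hypothetical), NumPy's percentile function gives Q1, the median, and Q3 directly, from which the IQR follows:

# Illustrative sketch: quartiles and interquartile range with NumPy.
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])  # made-up observations

q1, q2, q3 = np.percentile(x, [25, 50, 75])  # 25th, 50th (median), and 75th percentiles
iqr = q3 - q1                                # IQR = Q3 - Q1

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")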

8.2.2 Five-Number Summary, Boxplots and Outliers

The five-number summary of a distribution consists of the median (Q2), the
quartiles Q1 and Q3, and the smallest and largest individual observations, written in
the order of Minimum, Q1, Median, Q3, Maximum. Boxplots are a popular way of
visualizing a distribution. A boxplot incorporates the five-number summary as
follows:

● Typically, the ends of the box are at the quartiles so that the box
length is the interquartile range.

● The median is marked by a line within the box.

● Two lines called whiskers outside the box extend to the smallest
(Minimum) and largest (Maximum) observations.
Figure 2.3 Boxplot for the unit price data for items sold at four branches of
AllElectronics during a given time period.
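The following sketch (made-up unit prices; assumes NumPy and Matplotlib are available) prints the five-number summary and draws the corresponding boxplot. Passing whis=(0, 100) makes the whiskers extend to the minimum and maximum, as described above.

# Illustrative sketch: five-number summary and boxplot for made-up unit prices.
import numpy as np
import matplotlib.pyplot as plt

unit_price = np.array([40, 45, 47, 52, 55, 58, 60, 63, 70, 72, 80, 95, 120])

minimum, q1, median, q3, maximum = np.percentile(unit_price, [0, 25, 50, 75, 100])
print("Five-number summary:", minimum, q1, median, q3, maximum)

# whis=(0, 100) extends the whiskers to the smallest and largest observations.
plt.boxplot(unit_price, whis=(0, 100))
plt.ylabel("Unit price ($)")
plt.title("Boxplot of unit price")
plt.show()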

8.2.3 Variance and Standard Deviation

Variance and standard deviation are measures of data dispersion. They indicate how
spread out a data distribution is. A low standard deviation means that the data
observations tend to be very close to the mean, while a high standard deviation
indicates that the data are spread out over a large range of values.

The variance of N observations, x1, x2, ..., xN, for a numeric attribute X is

σ² = (1/N) Σ (xi − x̄)²

where x̄ is the mean value of the observations, as defined above. The standard
deviation, σ, of the observations is the square root of the variance, σ².
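For example, a brief illustrative sketch (with hypothetical values) computes the variance directly from the formula and with NumPy; note that np.var and np.std divide by N by default, matching the definition above.

# Illustrative sketch: population variance and standard deviation.
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110], dtype=float)

mean = x.mean()
variance = ((x - mean) ** 2).mean()   # sigma^2 = (1/N) * sum((x_i - mean)^2)
std_dev = variance ** 0.5             # sigma is the square root of the variance

print(variance, std_dev)
print(np.var(x), np.std(x))           # NumPy defaults (ddof=0) give the same values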

8.3 Graphic Displays of Basic Statistical Descriptions of Data

Quantile plots, quantile–quantile plots, histograms, and scatter plots are graphs that
are helpful for the visual inspection of data, which is useful for data preprocessing.
The first three of these show univariate distributions (i.e., data for one attribute),
while scatter plots show bivariate distributions (i.e., involving two attributes).
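A small illustrative sketch (randomly generated, hypothetical data; assumes Matplotlib) showing a histogram of one attribute and a scatter plot of two related attributes:

# Illustrative sketch: a univariate histogram and a bivariate scatter plot.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
price = rng.normal(loc=60, scale=15, size=200)               # one numeric attribute
items_sold = 500 - 4 * price + rng.normal(0, 20, size=200)   # a second, related attribute

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.hist(price, bins=20)               # univariate distribution of price
ax1.set(xlabel="Unit price", ylabel="Count", title="Histogram")

ax2.scatter(price, items_sold, s=10)   # bivariate relationship: price vs. items sold
ax2.set(xlabel="Unit price", ylabel="Items sold", title="Scatter plot")

plt.tight_layout()
plt.show()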
Video Links: Unit – I

1. Data Science Process – https://www.youtube.com/watch?v=s4uF8UOJz9k
2. Data Science Life Cycle – https://www.youtube.com/watch?v=4Cp6PkBKqX4
3. Exploratory Data analysis – https://www.youtube.com/watch?v=-o3AxdVcUtQ
4. Facets of Data – https://www.youtube.com/watch?v=WVclIFyCCOo
5. Data Mining – https://www.youtube.com/watch?v=0Q7j7sv4rns
6. Data Warehousing – https://www.youtube.com/watch?v=jmwGNhUXn_o
7. Statistical description of Data – https://www.youtube.com/watch?v=CVdcr_MC2KU
10. ASSIGNMENT : UNIT – I

1. Real-time applications of Data Science. (CO1 , K2 )

2. What is the importance of Data Mining in Data Science? (CO1 , K2 )

3. Explain the Data Science design process with the help of any real-time application.
(CO1 , K3 )
11. PART A : Q & A : UNIT – I
1. What is Data Science ? ( CO1 , K1 )
● As the amount of data continues to grow, the need to leverage it becomes
more important. Data science involves using methods to analyze massive
amounts of data and extract the knowledge it contains.
● Data science is an evolutionary extension of statistics capable of dealing with
the massive amounts of data produced today. It adds methods from computer
science to the repertoire of statistics.
2. What is the role of data science in commercial companies? ( CO1 , K1 )
● Commercial companies in almost every industry use data science and big data
to gain insights into their customers, processes, staff, competition, and
products.
● Many companies use data science to offer customers a better user
experience, as well as to cross-sell, up-sell, and personalize their offerings.
● Human resource professionals use people analytics and text mining to screen
candidates, monitor the mood of employees, and study informal networks
among coworkers.

3. What is a project charter? ( CO1 , K1 )


● Clients like to know upfront what they are paying for, so after getting a good
understanding of the business problem, try to get a formal agreement on the
deliverables. All this information is collected in a project charter.
● The outcome should be a clear research goal, a good understanding of the
context, well-defined deliverables and a plan of action with a timetable. This
information is then placed in a project charter.

4. List the steps involved in the data cleansing process. ( CO1 , K2 )


● Errors from data entry
● Physically impossible values
● Missing values
● Outliers
● Spaces and typos
● Errors against codebook
5. What do you mean by outliers? ( CO1 , K2 )
● An outlier is an observation that seems to be distant from other observations
or, more specifically, one observation that follows a different logic or
generative process than the other observations.
● The easiest way to find outliers is to use a plot or a table with the minimum
and maximum values.
6. What are the two operations used to combine information from different datasets?
( CO1 , K1 )

● The first operation is joining: enriching an observation from one table with
information from another table.
● The second operation is appending or stacking: adding the observations of
one table to those of another table.
7. What do you mean by Exploratory data analysis? ( CO1 , K2 )
● Information becomes much easier to grasp when shown in a picture; therefore,
we mainly use graphical techniques to gain an understanding of the data and
the interactions between variables. This phase is about exploring the data,
discovering anomalies that were missed before, and fixing them.
● The visualization techniques used in this phase range from simple line graphs
or histograms to more complex diagrams such as Sankey and network graphs.
8. What is a Pareto diagram? ( CO1 , K1 )
We can combine simple graphs into a Pareto diagram, or 80-20 diagram. A Pareto
diagram is a combination of the values and a cumulative distribution.

9. What is the use of link and brush ? ( CO1 , K1 )


● Link and brush allows us to select observations in one plot and highlight the
same observations in the other plots.
● With brushing and linking we combine and link different graphs and tables (or
views) so changes in one graph are automatically transferred to the other
graphs. This interactive exploration of data facilitates the discovery of new
insights.
10. What are the steps involved in building a model? ( CO1 , K2 )
Building a model is an iterative process. Most of the models consist of the following
main steps:
1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison

11.What is data mining? ( CO1 , K1 )

● Data mining is searching for knowledge (interesting patterns) in data. Data
mining is an essential step in the process of knowledge discovery.
● Data mining provides tools to discover knowledge from data and it turns a
large collection of data into knowledge.

12.What is a data warehouse? ( CO1 , K1 )

● A data warehouse is a repository of information collected from multiple
sources stored under a unified schema and usually residing at a single site.
● Data warehouses are constructed via a process of data cleaning, data
integration, data transformation, data loading, and periodic data refreshing.

13. What do you mean by OLAP? ( CO1 , K2 )


● By providing multidimensional data views and the precomputation of
summarized data, data warehouse systems can provide inherent support for
OLAP.
● Online analytical processing operations make use of background knowledge
regarding the domain of the data being studied to allow the presentation of
data at different levels of abstraction.
● Such operations accommodate different user viewpoints. Examples of OLAP
operations include drill-down and roll-up, which allow the user to view the
data at differing degrees of summarization.
14. What kinds of patterns can be mined? ( CO1 , K2 )

● Class/Concept Description: Characterization and Discrimination


● Mining Frequent Patterns, Associations, and Correlations
● Classification and Regression for Predictive Analysis
● Cluster Analysis
● Outlier Analysis

15. What are the major issues in Data Mining? ( CO1 , K2 )

● mining methodology
● user interaction
● efficiency and scalability
● diversity of data types
● data mining and society

16. What is the need for basic statistical descriptions of data? ( CO1 , K1 )

Basic statistical descriptions can be used to identify properties of the data.


It highlights which data values should be treated as noise or outliers.

17. What do you mean by interquartile range ? ( CO1 , K2 )

The distance between the first and third quartiles is a simple measure of spread that
gives the range covered by the middle half of the data.
This distance is called the inter quartile range (IQR) and is defined as
IQR=Q3−Q1.
18. What is a boxplot and why do we use it? ( CO1 , K2 )

Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the
five-number summary as follows:
● Typically, the ends of the box are at the quartiles so that the box length is the
interquartile range.
● The median is marked by a line within the box.
● Two lines called whiskers outside the box extend to the smallest (Minimum)
and largest (Maximum) observations.

19. What do you mean by Five- number summary? ( CO1 , K1 )

The five-number summary of a distribution consists of the median (Q2), the quartiles
Q1 and Q3, and the smallest and largest individual observations, written in the order
of Minimum, Q1, Median, Q3, Maximum.

20. What do you mean by external data? ( CO1 , K1 )

Although certain companies consider data a highly valuable asset, more and more
governments and organizations share their data for free with the world. This data
can be of excellent quality; how good it is depends on the institution that creates
and manages it. The information they share covers a broad range of topics, such as
the demographics of a certain region.
12. PART B QUESTIONS : UNIT – I

1. What are the Benefits and uses of data science? ( CO1 , K1 )

2. What are the facets of data? ( CO1 , K1 )

3. Describe the overview of the data science process. ( CO1 , K2 )

4. Explain the steps involved in the knowledge discovery process. ( CO1 , K2 )

5. What kinds of data can be mined ? ( CO1 , K2 )

6. What are the technologies used in data mining ? ( CO1 , K2 )


13. SUPPORTIVE ONLINE CERTIFICATION COURSES

NPTEL : https://onlinecourses.nptel.ac.in/noc21_cs69/preview?

coursera : https://www.coursera.org/learn/python-data-analysis

Udemy : https://www.udemy.com/topic/data-science/

Mooc : https://mooc.es/course/introduction-to-data-science-in-python/

edx : https://learning.edx.org/course/course-v1:Microsoft+DAT208x+2T2016/home
14. REAL TIME APPLICATIONS

1. E-mail Spam Filtering:

The upsurge in the volume of unwanted emails, called spam, has created an intense
need for the development of more dependable and robust anti-spam filters. Machine
learning methods are now being used to successfully detect and filter spam emails.

Assume that we have a dataset of 30,000 emails, out of which some are classified as
spam and some are classified as not spam. The machine learning model is trained on
this dataset. Once the training process is complete, we can test it with a mail that
was not included in the training dataset. The model can then make a prediction on
this new input and correctly classify whether the email is spam or not.
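A toy sketch of this idea (the four example emails are made up and stand in for the 30,000-email dataset described above; the choice of scikit-learn's CountVectorizer and MultinomialNB is an assumption for illustration):

# Toy sketch: a tiny spam classifier with bag-of-words features and Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Win a free prize now",                 # spam
    "Lowest price guaranteed, buy now",     # spam
    "Meeting rescheduled to Monday",        # not spam
    "Please review the attached report",    # not spam
]
labels = [1, 1, 0, 0]                       # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)        # word-count feature matrix

model = MultinomialNB()
model.fit(X, labels)                        # train on the labelled emails

new_mail = vectorizer.transform(["Claim your free prize"])
print(model.predict(new_mail))              # expected output: [1], i.e. spam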

2. Auto complete:

Autocomplete, or word completion, is a feature in which an application predicts the
rest of a word a user is typing. In Android smartphones, this is called predictive text.
In graphical user interfaces, users can typically press the tab key to accept a
suggestion or the down arrow key to accept one of several.

As we type in “what is the wea..”, we already receive some predictions. These
predictive searches are also powered by AI. They usually rely on concepts such as
natural language processing, machine learning, and deep learning. A
sequence-to-sequence mechanism with attention can be used to achieve higher
accuracy and lower losses for these predictions.
3.Virtual Assistant:

A virtual assistant, also called an AI assistant or digital assistant, is an application
program that understands natural language voice commands and completes tasks
for the user. Virtual assistants powered with AI technologies are becoming extremely
popular and are taking the world by storm.

We have virtual assistants like Google AI, Siri, Alexa, Cortana, and many other
similar virtual assistants. With the help of these assistants, we can pass commands,
and using speech recognition, it tries to interpret what we are saying and
automates/performs a realistic task. Using these virtual assistants, we can make
calls, send messages or emails, or browse the web with just a simple voice
command. We can also converse with these virtual assistants, and hence they can
also act as chatbots.
15. CONTENTS BEYOND SYLLABUS : UNIT – I

VARIOUS DOMAINS OF BIG DATA ANALYTICS

Understand and Optimize Business Processes:

Big data is also increasingly used to optimize business processes. Retailers are able
to optimize their stock based on predictive models generated from social media data,
web search trends and weather forecasts. Another example is supply chain or
delivery route optimization using data from geographic positioning and radio
frequency identification sensors.

Improving Health:

The computing power of big data analytics enables us to find new cures and better
understand and predict disease patterns.

We can use all the data from smart watches and wearable devices to better
understand links between lifestyles and diseases.

Big data analytics also allows us to monitor and predict epidemics and disease
outbreaks, simply by listening to what people are saying (e.g., “Feeling rubbish today
- in bed with a cold”) or searching for on the Internet (e.g., “cures for flu”).

Improving Sports Performance:

Most elite sports have now embraced big data analytics. Many use video analytics to
track the performance of every player in a football or baseball game, sensor
technology is built into sports equipment such as basketballs or golf clubs, and
many elite sports teams track athletes outside of the sporting environment, using
smart technology to track nutrition and sleep, as well as social media conversations
to monitor emotional wellbeing.
Improving Security and Law Enforcement:

Security services use big data analytics to foil terrorist plots and detect cyber attacks.

Police forces use big data tools to catch criminals and even predict criminal activity,
and credit card companies use big data analytics to detect fraudulent transactions.

Improving and Optimizing Cities and Countries:

Big data is used to improve many aspects of our cities and countries. For example, it
allows cities to optimize traffic flows based on real-time traffic information as well as
social media and weather data. A number of cities are currently using big data
analytics with the aim of turning themselves into Smart Cities, where the transport
infrastructure and utility processes are all joined up, where a bus would wait for a
delayed train, and where traffic signals predict traffic volumes and operate to
minimize jams.
16. Assessment Schedule (Proposed Date & Actual Date)

Sl. No.   Assessment                      Proposed Date   Actual Date
1         FIRST INTERNAL ASSESSMENT       09.09.2023
2         SECOND INTERNAL ASSESSMENT      26.10.2023
3         MODEL EXAMINATION               15.11.2023
4         END SEMESTER EXAMINATION        05.12.2023


17. PRESCRIBED TEXT BOOKS & REFERENCE BOOKS

TEXTBOOKS:

1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data
Science”, Manning Publications, 2016.

2. Ashwin Pajankar, Aditya Joshi, Hands-on Machine Learning with Python:
Implement Neural Network Solutions with Scikit-learn and PyTorch, Apress, 2022.

3. Jake VanderPlas, “Python Data Science Handbook – Essential tools for
working with data”, O’Reilly, 2017.

REFERENCES:

1. Roger D. Peng, R Programming for Data Science, Lulu.com, 2016.

2. Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques",
3rd Edition, Morgan Kaufmann, 2012.

3. Samir Madhavan, Mastering Python for Data Science, Packt Publishing, 2015.

4. Laura Igual, Santi Seguí, "Introduction to Data Science: A Python Approach to
Concepts, Techniques and Applications", 1st Edition, Springer, 2017.

5. Peter Bruce, Andrew Bruce, "Practical Statistics for Data Scientists: 50 Essential
Concepts", 3rd Edition, O'Reilly, 2017.

6. Hector Guerrero, "Excel Data Analysis: Modelling and Simulation", Springer
International Publishing, 2nd Edition, 2019.

E-Book links:

1. https://drive.google.com/file/d/1HoGVyZqLTQj0aA4THA__D4jJ74czxEKH/view
?usp=sharing
2. https://drive.google.com/file/d/1vJfX5xipCHZOleWfM9aUeK8mwsal6Il1/view?u
sp=sharing
3. https://drive.google.com/file/d/1aU2UKdLxLdGpmI73S1bifK8JPiMXlpoS/view?
usp=sharing
18. MINI PROJECT SUGGESTION

a) Recommendation system
b) Credit Card Fraud Detection
c) Fake News Detection
d) Customer Segmentation
e) Sentiment Analysis
f) Recommender Systems
g) Emotion Recognition
h) Stock Market Prediction
i) Email classification
j) Tweets classification
k) Uber Data Analysis
l) Social Network Analysis
Thank you

