DSP U1
Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document contains
proprietary information and is intended only to the respective group / learning
community as intended. If you are not the addressee you should not disseminate,
distribute or copy through e-mail. Please notify the sender immediately by e-mail
if you have received this document by mistake and delete this document from
your system. If you are not the intended recipient you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents of this
information is strictly prohibited.
22AI302 – DATA SCIENCE USING PYTHON
Department: AI & DS
Batch/Year: 2022-2026 / II Year
Created by:
Ms. Divya D M / Asst. Professor
Date: 24.07.2023
1. Table of Contents
1. Contents
2. Course Objectives
3. Pre-Requisites
4. Syllabus
5. Course Outcomes
7. Lecture Plan
9. Lecture Notes
10. Assignments
12. Part B Questions
16. Assessment Schedule
3. PRE-REQUISITES
Semester-I: C Programming
Semester-II: Python Programming
Semester-III: 22AI302 – Data Science using Python
4. SYLLABUS
22AI302 DATA SCIENCE USING PYTHON          L T P C
                                           2 0 2 3
Data Science: Benefits and uses – facets of data - Data Science Process: Overview –
Defining research goals – Retrieving data – data preparation - Exploratory Data
analysis – build the model – presenting findings and building applications - Data
Mining - Data Warehousing – Basic statistical descriptions of Data.
List of Exercise/Experiments:
1. Download, install and explore the features of R/Python for data analytics
• Installing Anaconda
• Basic Operations in Jupyter Notebook
• Basic Data Handling
List of Exercise/Experiments:
1. Working with Numpy arrays - Creation of numpy array using the tuple, Determine
the size, shape and dimension of the array, Manipulation with array Attributes,
Creation of Sub array, Perform the reshaping of the array along the row vector and
column vector, Create Two arrays and perform the concatenation among the arrays.
2. Working with Pandas data frames - Series, DataFrame, and Index; implement data
selection operations; data indexing operations like loc, iloc, and ix; handling missing
data such as None and NaN; operations on null values (isnull(), notnull(), dropna(),
fillna()).
3. Perform statistics operations on the data (sum, product, median, minimum and
maximum, quantiles, argmin, argmax, etc.).
4. Using any data set, compute the mean, standard deviation, and percentiles.
List of Exercise/Experiments:
1. Apply Decision Tree algorithms on any data set.
2. Apply SVM on any data set
3. Implement K-Nearest-Neighbor Classifiers
List of Exercise/Experiments:
1. Apply K-means algorithms for any data set.
2. Perform Outlier Analysis on any data set.
List of Exercise/Experiments:
1. Basic plots using Matplotlib.
2. Implementation of Scatter Plot.
3. Construction of Histogram, bar plot, Subplots, Line Plots.
TEXTBOOKS:
REFERENCES:
1. Roger D. Peng, "R Programming for Data Science", Lulu.com, 2016.
2. Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques",
3rd Edition, Morgan Kaufmann, 2012.
3. Samir Madhavan, "Mastering Python for Data Science", Packt Publishing, 2015.
4. Laura Igual, Santi Seguí, "Introduction to Data Science: A Python Approach to
Concepts, Techniques and Applications", 1st Edition, Springer, 2017.
5. Peter Bruce, Andrew Bruce, "Practical Statistics for Data Scientists: 50 Essential
Concepts", 3rd Edition, O'Reilly, 2017.
6. Hector Guerrero, "Excel Data Analysis: Modelling and Simulation", Springer
International Publishing, 2nd Edition, 2019.
NPTEL Courses:
a. Data Science for Engineers - https://onlinecourses.nptel.ac.in/noc23_cs17/preview
b. Python for Data Science - https://onlinecourses.nptel.ac.in/noc23_cs21/preview
LIST OF EQUIPMENTS:
Systems with Anaconda, Jupyter Notebook, Python, Pandas, NumPy, Matplotlib
5. COURSE OUTCOMES
CO–PO Mapping (CO: mapping levels):
2: 3 3 3 3 1 1 1 1 2 3 3
3: 3 3 3 3 3 3 3 3 2 3 3
4: 3 3 3 3 3 3 3 3 2 3 3
5: 3 3 3 3 3 3 3 3 2 3 3
Lecture Plan – Unit 1 – INTRODUCTION

Sl. No. | Topic | Number of Periods | Proposed Date | Actual Lecture Date | CO | Taxonomy Level | Mode of Delivery
2 | Data Science Process: Overview – Defining research goals – Retrieving data – Data preparation | 1 | 10.08.2023 |  | CO1 | K2 | PPT / Chalk & Talk
3 | Exploratory Data Analysis | 1 | 12.08.2023 |  | CO1 | K2 | PPT / Chalk & Talk
4 | Build the model – Presenting findings and building applications | 1 | 16.08.2023 |  | CO1 | K2 | PPT / Chalk & Talk
5 | Data Mining | 1 | 17.08.2023 |  | CO1 | K2 | PPT / Chalk & Talk
6 | Data Warehousing | 1 | 19.08.2023 |  | CO1 | K2 | PPT / Chalk & Talk
Activity name:
Students will gain a better understanding of how predictive analysis is performed for
various applications by using the following steps.
Guidelines to do an activity:
Big data is a collection of data sets so large or complex that it becomes difficult to
process them using traditional data management techniques such as the RDBMS
(relational database management systems).
The characteristics of big data are often referred to as the three Vs: Volume (how
much data there is), Variety (how diverse the different types of data are), and
Velocity (the speed at which new data is generated).
2. Data Science
As the amount of data continues to grow, the need to leverage it becomes more
important. Data science involves using methods to analyze massive amounts of data
and extract the knowledge it contains. Data science and big data evolved from
statistics and traditional data management but are now considered to be distinct
disciplines. Data science is an evolutionary extension of statistics capable of dealing
with the massive amounts of data produced today. It adds methods from computer
science to the repertoire of statistics.
2.1 Benefits and uses of data science
Data science and big data are used almost everywhere in both commercial and
noncommercial settings.
2.1.1 Commercial companies in almost every industry use data science and big
data to gain insights into their customers, processes, staff, competition, and products.
Nongovernmental organizations also make use of data science:
● The World Wildlife Fund (WWF) employs data scientists to increase the
effectiveness of their fundraising efforts.
● DataKind is one such data scientist group that devotes its time to the
benefit of mankind.
2.1.4 Universities use data science in their research and also to enhance the study
experience of their students. Massive open online courses (MOOCs) produce a lot of
data, which allows universities to study how this type of learning can complement
traditional classes. Examples of MOOCs are Coursera, Udacity, and edX.
3. Facets of data
In data science and big data, there are many different types of data. Each of them
tends to require different tools and techniques.
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Streaming
Structured data is data that depends on a data model and resides in a fixed field
within a record.
It is easy to store structured data in tables within databases or Excel files, as shown
in figure 1.1. SQL, or Structured Query Language, is the preferred way to manage
and query data that resides in databases.
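For illustration (not part of the original text), the following minimal Python sketch
stores a few rows of structured data in an in-memory SQLite table and queries it with
SQL through pandas; the table name customers and its columns are hypothetical.

import sqlite3
import pandas as pd

# Create a small, in-memory relational table of structured data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, purchases REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Asha", "Chennai", 120.0), ("Ravi", "Madurai", 80.5)],
)

# SQL is the preferred way to query data that resides in databases.
df = pd.read_sql_query(
    "SELECT name, purchases FROM customers WHERE purchases > 100", conn
)
print(df)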
Unstructured data is data that isn’t easy to fit into a data model because the content
is context-specific or varying. One example of unstructured data is the regular email,
as shown in figure 1.2. Although email contains structured elements such as the
sender, title, and body text, it is a challenge to find the number of people who have
written an email complaint about a specific employee, because so many ways exist to
refer to a person. The thousands of different languages and dialects out there further
complicate this.
The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion, and sentiment analysis, but
models trained in one domain don’t generalize well to other domains. Even
state-of-the-art techniques aren’t able to decipher the meaning of every piece of text
as it is ambiguous by nature.
Machine-generated data is information that is automatically created by a computer,
process, application, or other machine without human intervention. The analysis of
machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event
logs, and telemetry, as shown in figure 1.3.
The machine data shown in figure 1.3 would fit in a classic table-structured
database. This isn’t the best approach for highly interconnected or “networked” data,
where the relationships between entities have a valuable role to play.
Figure 1.3 Example of machine-generated data
Graph-based data is data that focuses on the relationships or adjacency of objects.
Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL. Graph data poses its own challenges,
but for a computer interpreting audio and image data, it can be even more difficult.
Audio, image and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers. In game analytics, high-speed cameras
at stadiums capture ball and athlete movements to calculate in real time, for
example, the path taken by a defender relative to two baselines.
While streaming data can take almost any of the previous forms, it has an extra
property. The data flows into the system when an event happens instead of being
loaded into a data store in a batch. Although this isn’t really a different type of data,
we treat it here as such because we need to adapt our process to deal with this type
of information. Examples are the “What’s trending” on Twitter, live sporting or music
events, and the stock market.
The typical data science process consists of six steps, as shown in figure 4.1, which
summarizes the process and its main steps and actions.
1. The first step of this process is setting a research goal. The main
purpose here is making sure all the stakeholders understand the what, how,
and why of the project. In every serious project this will result in a project
charter.
2. The second phase is data retrieval. The data is required for analysis, so
this step includes finding suitable data and getting access to the data from the
data owner. The result is data in its raw form, which probably needs
transformation before it becomes usable.
3. The third step is data preparation. This includes transforming the data
from a raw form into data that is directly usable in the models. To achieve
this, detect and correct different kinds of errors in the data, combine data
from different data sources and transform it. After completing this step, we
can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a
deep understanding of the data, look for patterns, correlations, deviations
based on visual and descriptive techniques. The insights gained from this
phase will enable us to start modeling.
5. The fifth step is data modeling. It is now that you attempt to gain the
insights or make the predictions stated in your project charter.
6. The last step of the data science model is presenting the results and
automating the analysis, if needed. One goal of a project is to change a
process and/or make better decisions. The importance of this step is more
apparent in projects on a strategic and tactical level. Certain projects require
performing the business process over and over again, so automating the
project will save time.
In reality we won’t progress in a linear way from step 1 to step 6. Often we’ll regress
and iterate between the different phases. Following these six steps will lead to a
higher project success ratio and increased impact of research results. This process
ensures that we have a well-defined research plan, a good understanding of the
business question and clear deliverables before even starting to look at the data. The
first steps of the process focus on getting high-quality data as input for the models.
This way the models will perform better later on.
Step 1: Setting the research goal
A project starts by understanding the what, the why, and the how of our project, as
shown in figure 2.2. What does the company expect you to do? And why does
management place such a value on your research? Answering these three questions
(what, why, how) is the goal of the first phase.
The outcome should be a clear research goal, a good understanding of the context,
well-defined deliverables and a plan of action with a timetable. This information is
then best placed in a project charter. The length and formality differ between
projects and companies.
1.1 Spend time understanding the goals and context of your research
An essential outcome is the research goal that states the purpose of your
assignment in a clear and focused manner. Understanding the business goals and
context is critical for project success.
Clients like to know upfront what they are paying for, so after getting a good
understanding of the business problem, try to get a formal agreement on the
deliverables. All this information is best collected in a project charter. For any
significant project this would be mandatory.
A project charter requires teamwork and the inputs should cover the following:
Step 2: Retrieving data
The next step in data science is to retrieve the required data, as shown in figure 2.3.
Sometimes there is a need to go into the field and design a data collection process.
Many companies will have already collected and stored the data, and what they don’t
have can often be bought from third parties. Organizations are also making
high-quality data freely available for public and commercial use.
Data can be stored in many forms, ranging from simple text files to tables in a
database. The objective is acquiring all the data needed. This may be difficult, and
even if you succeed, data needs polishing to be of any use to you.
2.1 Start with data stored within the company ( Internal data )
The first step is to assess the relevance and quality of the data that is readily
available within the company. Most companies have a program for maintaining key
data, so much of the cleaning work may already be done. This data can be stored in
official data repositories such as databases, data marts, data warehouses, and data
lakes maintained by a team of IT professionals. The primary goal of a database is
data storage, while a data warehouse is designed for reading and analyzing that
data.
A data mart is a subset of the data warehouse and geared toward serving a specific
business unit. While data warehouses and data marts are home to preprocessed
data, data lakes contain data in its natural or raw format. Getting access to data is
another difficult task. Organizations understand the value and sensitivity of data and
often have policies in place so everyone has access to what they need and nothing
more. These policies translate into physical and digital barriers called Chinese walls.
These “walls” are mandatory and well-regulated for customer data in most countries.
Getting access to the data may take time and involve company politics.
If data isn’t available inside your organization, look outside your organization’s walls.
Many companies specialize in collecting valuable information. Other companies
provide data so that you, in turn, can enrich their services and ecosystem. Such is
the case with Twitter, LinkedIn, and Facebook. Although certain companies treat data
as an increasingly valuable asset, more and more governments and organizations
share their data for free with the world. This data can be of excellent quality; how
good it is depends on the institution that creates and manages it. The information they share
covers a broad range of topics such as the number of accidents or amount of drug
abuse in a certain region and its demographics. This data is helpful when you want
to enrich proprietary data but also convenient when training your data science skills.
Table 2.1 shows only a small selection from the growing number of open-data
providers.
Freebase.org – an open database that retrieves its information from sites like
Wikipedia, MusicBrainz, and the SEC archive.
A good portion of the project time should be spent on doing data correction and
cleansing. The retrieval of data is the first time we’ll inspect the data in the data
science process. Most of the errors encountered during the data gathering phase are
easy to spot. The data is investigated during the import, data preparation and
exploratory phases. The difference is in the goal and the depth of the investigation.
During data retrieval, check to see if the data is equal to the data in the source
document and to see if we have the right data types.
Step 3: Cleansing, integrating, and transforming data
The data received from the data retrieval phase is likely to be “a diamond in the
rough,” so the task is to prepare it for use in the modeling and reporting phase.
The model needs the data in a specific format, so data transformation will always
come into play. It is good to correct data errors as early in the process as possible.
Figure 2.4 shows the most common actions to take during the data cleansing,
integration, and transformation phase.
3.1 Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing
errors in the data so that the data becomes a true and consistent representation of
the processes it originates from. There are at least two types of errors. The first type
is the interpretation error, such as when you take the value in your data for granted,
like saying that a person’s age is greater than 300 years. The second type of error
points to inconsistencies between data sources or against your company’s
standardized values. The table below shows an overview of the types of errors that
can be detected with easy checks.
Deviations from a code book: match on keys, or else use manual overrules.
Data collection and data entry are error-prone processes. They often require human
intervention, as humans make typos or lose their concentration for a second and
introduce an error into the chain. But data collected by machines or computers isn’t
free from errors either. Errors can arise from human sloppiness, whereas others are
due to machine or hardware failure. Examples of errors originating from machines
are transmission errors or bugs in the extract, transform, and load phase (ETL).
For small data sets we can check every value by hand. Detecting data errors when
the variables we study don’t have many classes can be done by tabulating the data
with counts. When we have a variable that can take only two values: “Good” and
“Bad”, we can create a frequency table and see if those are truly the only two values
present.
In table 2.3, the values “Godo” and “Bade” point out something went wrong in at
least 16 cases.
Value | Count
Good | 1598647
Bad | 1354468
Godo | 15
Bade | 1
Most errors of this type are easy to fix with simple assignment statements and
if-then-else rules:

if x == "Godo":
    x = "Good"
elif x == "Bade":
    x = "Bad"
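When the values sit in a pandas DataFrame, the same correction can be applied to
the whole column at once. The sketch below is illustrative only; the column name
label is an assumption, not from the source.

import pandas as pd

df = pd.DataFrame({"label": ["Good", "Bad", "Godo", "Bade", "Good"]})
# Map the misspelled codes back to the allowed values in one vectorized step.
df["label"] = df["label"].replace({"Godo": "Good", "Bade": "Bad"})
print(df["label"].value_counts())   # only "Good" and "Bad" remain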
3.1.2 Redundant whitespace
Whitespaces tend to be hard to detect but cause errors like other redundant
characters would. Fixing redundant whitespaces can be done by most programming
languages. They all provide string functions that will remove the leading and trailing
whitespaces. For instance, in Python you can use the strip() function to remove
leading and trailing spaces.
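A short illustrative example (the values are made up) of stripping redundant
whitespace with Python's built-in string methods:

raw_values = ["  FR ", "FR", " US", "GB  "]
cleaned = [value.strip() for value in raw_values]   # remove leading and trailing spaces
print(cleaned)                                      # ['FR', 'FR', 'US', 'GB']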
Sanity checks are another valuable type of data check. Here we check the value
against physically or theoretically impossible values such as people taller than 3
meters or someone with an age of 299 years. Sanity checks can be directly
expressed with rules:
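The rules themselves are not reproduced in this extract; a minimal sketch of what
such rules can look like is given below (the variable names and limits are
illustrative assumptions):

age = 299
height_m = 1.75

age_is_sane = 0 <= age <= 120          # ages above 120 are physically implausible
height_is_sane = 0 < height_m <= 3.0   # people taller than 3 meters are implausible

if not (age_is_sane and height_is_sane):
    print("Sanity check failed: inspect or correct this record")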
3.1.4 Outliers
An outlier is an observation that seems to be distant from other observations or,
more specifically, an observation that follows a different logic or generative process
than the other observations. The easiest way to find outliers is to use a plot or a
table with the minimum and maximum values.
3.1.5 Dealing with missing values
Missing values aren’t necessarily wrong, but we still need to handle them separately.
Certain modeling techniques can’t handle missing values. They might be an indicator
that something went wrong in data collection or that an error happened in the ETL
process. Common techniques are listed in the table below and illustrated in the short
sketch that follows it. Which technique to use at a given time depends on the
particular case.
Technique | Advantage | Disadvantage
Omit the values | Easy to perform | You lose the information from an observation
Model the value (nondependent) | Does not disturb the model too much | Can lead to too much confidence in the model; harder to execute
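As a minimal pandas sketch of two of the techniques in the table (the small
DataFrame and its column names are assumptions for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"item": ["A", "B", "C"], "price": [10.0, np.nan, 12.0]})

# Omit observations that contain missing values.
dropped = df.dropna(subset=["price"])

# Or fill in (impute) a simple estimate such as the column mean.
filled = df.fillna({"price": df["price"].mean()})

print(dropped)
print(filled)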
Detecting errors in larger data sets against a code book or against standardized
values can be done with the help of set operations. A code book is a description of
data, a form of metadata.
It contains things such as the number of variables per observation, the number of
observations, and what each encoding within a variable means. For instance “0”
equals “negative”, “5” stands for “very positive”. A code book also tells the type of
data we are looking at: is it hierarchical, graph, something else.
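A minimal sketch of such a check with Python set operations (the code-book values
below are invented for illustration):

allowed_codes = {"0", "1", "2", "3", "4", "5"}   # valid encodings according to the code book
observed_codes = {"0", "2", "5", "7", "X"}       # distinct values actually found in the data

invalid = observed_codes - allowed_codes         # set difference: values outside the code book
print("Values not in the code book:", invalid)   # {'7', 'X'}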
3.2 Combining data from different data sources
The data comes from several different places, and we need to integrate these
different sources. Data varies in size, type, and structure, ranging from databases
and Excel files to text documents.
There are two operations to combine information from different data sets. The first
operation is joining: enriching an observation from one table with information from
another table. The second operation is appending or stacking: adding the
observations of one table to those of another table.
When the data is combined, we have the option to create a new physical table or a
virtual table by creating a view. The advantage of a view is that it doesn’t consume
more disk space.
Joining tables allows us to combine the information of one observation found in one
table with the information that is found in another table. The focus is on enriching a
single observation.
Let’s say that the first table contains information about the purchases of a customer
and the other table contains information about the region where your customer lives.
Joining the tables allows us to combine the information so that we can use it for a
model, as shown in figure 2.7. To join tables, we use variables that represent the
same object in both tables, such as a date, a country name, or a Social Security
number. These common fields are known as keys. When these keys also uniquely
define the records in the table they are called primary keys.
Figure 2.7 Joining two tables on the Item and Region keys
One table may have buying behavior and the other table may have demographic
information on a person. In figure 2.7 both tables contain the client name, and this
makes it easy to enrich the client expenditures with the region of the client.
Appending data from tables is a common operation but requires an equal structure
in the tables being appended.
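A minimal pandas sketch of both operations (the tables, keys, and values are
hypothetical):

import pandas as pd

purchases = pd.DataFrame({"client": ["Asha", "Ravi"], "spent": [120.0, 80.5]})
regions = pd.DataFrame({"client": ["Asha", "Ravi"], "region": ["South", "West"]})

# Joining: enrich each observation with columns from another table via a common key.
enriched = purchases.merge(regions, on="client", how="left")

# Appending (stacking): add the observations of one table below those of another.
january = pd.DataFrame({"client": ["Asha"], "spent": [40.0]})
february = pd.DataFrame({"client": ["Ravi"], "spent": [55.0]})
stacked = pd.concat([january, february], ignore_index=True)

print(enriched)
print(stacked)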
To avoid duplication of data, we can virtually combine data with views. The problem
with duplicating data is that more storage space is needed. For small data sets that
may not cause problems, but if every table consists of terabytes of data, then it
becomes problematic to duplicate the data. For this reason, the concept of a view
was invented. A view behaves as if we are working on a table, but this table is
nothing but a virtual layer that combines the tables. Figure 2.9 shows how the sales
data from the different months is combined virtually into a yearly sales table instead
of duplicating the data. Views do come with a drawback, however. While a table join
is only performed once, the join that creates the view is recreated every time it’s
queried, using more processing power than a pre-calculated table would have.
Figure 2.9 A view helps to combine data without replication.
Data enrichment can also be done by adding calculated information to the table, such
as the total number of sales or what percentage of total stock has been sold in a
certain region as shown in the figure below. We now have an aggregated data set,
which in turn can be used to calculate the participation of each product within its
category.
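A minimal sketch of this kind of enrichment with pandas (the sales data and column
names are made up):

import pandas as pd

sales = pd.DataFrame({
    "product_class": ["A", "A", "B", "B"],
    "product": ["A1", "A2", "B1", "B2"],
    "sales": [10, 30, 20, 40],
})

# Add the total sales of each class back onto every row of that class.
sales["class_total"] = sales.groupby("product_class")["sales"].transform("sum")
# Participation of each product within its category.
sales["share_in_class"] = sales["sales"] / sales["class_total"]
print(sales)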
3.3 Transforming data
Certain models require their data to be in a certain shape. After cleansing and
integrating the data, the next task is to transform the data so it takes a suitable form
for data modeling.
Relationships between an input variable and an output variable aren’t always linear.
Take, for instance, a relationship of the form y = ae^(bx). Taking the log of the
independent variables simplifies the estimation problem dramatically. Figure 2.11
shows how transforming the input variables greatly simplifies the estimation
problem.
Figure 2.11 Transforming x to log x makes the relationship between x and y linear
(right), compared with the non-log x (left).
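One common variant of this trick, shown as an illustrative sketch with synthetic
values: for the exponential form y = ae^(bx), taking the logarithm of y gives
log y = log a + b·x, which is linear in x and therefore easy to estimate with ordinary
least squares.

import numpy as np

a, b = 2.0, 0.5
x = np.linspace(1, 10, 50)
y = a * np.exp(b * x)

log_y = np.log(y)                              # log-transform: log y = log(a) + b*x
slope, intercept = np.polyfit(x, log_y, 1)     # fit a straight line to the transformed data
print(round(slope, 2), round(np.exp(intercept), 2))   # recovers b ≈ 0.5 and a ≈ 2.0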
Sometimes there are too many variables and we need to reduce the number because
they don’t add new information to the model. Having too many variables in the
model makes the model difficult to handle, and certain techniques don’t perform well
when we overload them with too many input variables. For instance, all the
techniques based on a Euclidean distance perform well only up to 10 variables.
3.3.2 Turning variables into dummies
Variables can be turned into dummy variables, as shown in figure 2.13. Dummy
variables can take only two values: true (1) or false (0). They are used to indicate
the absence or presence of a categorical effect that may explain the observation. In
this case separate columns are created for the classes stored in one variable, with a
1 if the class is present and 0 otherwise. An example is turning one column named
Weekdays into the columns Monday through Sunday: we use an indicator to show
whether the observation was on a Monday, putting 1 on Monday and 0 elsewhere.
Turning variables into dummies is a technique that is used in modeling.
Turning variables into dummies is a data transformation that breaks a variable that
has multiple classes into multiple variables, each having only two possible values: 0
or 1.
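A minimal pandas sketch of the weekday example (the data values are assumptions):

import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})
dummies = pd.get_dummies(df["weekday"])   # one 0/1 column per class
print(dummies)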
Step 4: Exploratory data analysis
Information becomes much easier to grasp when shown in a picture, so we mainly
use graphical techniques to gain an understanding of the data and the interactions
between variables. This phase is about exploring data, discovering anomalies missed
before, and fixing them. The visualization techniques used in this phase range from
simple line graphs or histograms, as shown in figure 2.15, to more complex diagrams
such as Sankey and network graphs. Sometimes it’s useful to compose a composite
graph from simple graphs to get even more insight into the data. Other times the
graphs can be animated or made interactive to make them easier to understand.
Figure 2.16 Drawing multiple plots together helps to understand the structure of the
data over multiple variables.
Figure 2.18 Link and brush allows us to select observations in one plot and highlight
the same observations in the other plots.
Figure 2.18 shows another technique: brushing and linking. With brushing and
linking we combine and link different graphs and tables (or views) so changes in one
graph are automatically transferred to the other graphs. This interactive exploration
of data facilitates the discovery of new insights.
Figure 2.18 shows the average score per country for questions. Not only does this
indicate a high correlation between the answers, but it’s easy to see that when we
select several points on a subplot, the points will correspond to similar points on the
other graphs. In this case the selected points on the left graph correspond to points
on the middle and right graphs, although they correspond better in the middle and
right graphs.
Two other important graphs are the histogram shown in figure 2.19 and the boxplot
shown in figure 2.20. In a histogram a variable is cut into discrete categories and the
number of occurrences in each category are summed up and shown in the graph.
The boxplot, on the other hand, doesn’t show how many observations are
present but does offer an impression of the distribution within categories. It
can show the maximum, minimum, median, and other characterizing
measures at the same time.
Figure 2.19 Example histogram: the number of people in the age groups of 5-year
intervals
These techniques are mainly visual, but in practice they’re certainly not limited to
visualization techniques. Tabulation, clustering, and other modeling techniques can
also be a part of exploratory analysis. Even building simple models can be a part of
this step.
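As an illustrative sketch (the ages are synthetic, not data from the source), the
following Matplotlib snippet draws the two plots just discussed: a histogram of ages
in 5-year bins and a boxplot of the same values.

import numpy as np
import matplotlib.pyplot as plt

ages = np.random.default_rng(0).normal(loc=35, scale=10, size=500).clip(0, 90)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(ages, bins=range(0, 95, 5))   # count observations per 5-year age group
ax1.set_title("Histogram of ages")
ax2.boxplot(ages)                      # median, quartiles, whiskers, and outliers
ax2.set_title("Boxplot of ages")
plt.tight_layout()
plt.show()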
Step 5: Build the models
With clean data in place and a good understanding of the content, we are ready to
build models with the goal of making better predictions, classifying objects, or
gaining an understanding of the system that we are modeling. This phase is much
more focused than the exploratory analysis step, because we know what we are
looking for and what we want the outcome to be. Figure 2.21 shows the components
of model building.
Building a model is an iterative process. Most of the models consist of the following
main steps:
1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison
5.1 Model and variable selection
We need to select the variables we want to include in our model and a modeling
technique. Many modeling techniques are available, and we need to choose the right
model for the problem. The model's performance and all the requirements to use the
model have to be considered. Other factors to consider are:
■ Must the model be moved to a production environment and, if so, would it be easy
to implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if
left untouched?
5.2 Model execution
Once we have chosen a model, we will need to implement it in code. The following
listing shows the execution of a linear prediction model.
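The original listing is not reproduced in this extract; the following is a minimal sketch
in the same spirit, using statsmodels to fit an ordinary least squares model on
synthetic predictors and a target built from them plus some randomness (all variable
names and values are illustrative).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
predictors = rng.random((500, 2))                              # two predictor variables
target = predictors @ np.array([0.4, 0.6]) + rng.random(500)   # linear relation plus noise

model = sm.OLS(target, sm.add_constant(predictors))   # ordinary least squares with intercept
results = model.fit()
print(results.summary())                              # R-squared, coefficients, p-values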
We created predictor values that are meant to predict how the target variables
behave. For a linear regression, a “linear relation” between each x (predictor) and
the y (target) variable is assumed, as shown in figure 2.22. We created the target
variable based on the predictor by adding a bit of randomness, this gives us a
well-fitting model. The results.summary() outputs the table in figure 2.23. The exact
outcome depends on the random variables we got.
R-squared and adjusted R-squared measure model fit: higher is better, but a value
that is too high is suspicious. The p-value shows whether a predictor variable has a
significant influence on the target: lower is better, and <0.05 is often considered
“significant.”
Step 6: Presenting findings and building applications
After analysing the data and building a well-performing model, we have to present
the findings, as shown in figure 6. The predictions of the models or the insights
produced are of great value. For this reason, we may need to automate the models.
We can also build an application that automatically updates reports, Excel
spreadsheets, or PowerPoint presentations.
We live in a world where vast amounts of data are collected daily. Analyzing such
data is an important need. Data mining provides tools to discover knowledge from
data. It can be viewed as a result of the natural evolution of information technology.
Data mining turns a large collection of data into knowledge.
Data mining is searching for knowledge (interesting patterns) in data. Data mining is
an essential step in the process of knowledge discovery. The knowledge discovery
process is shown in Figure 7 as an iterative sequence of the following steps:
1. Data cleaning: to remove noise and inconsistent data.
2. Data integration: where multiple data sources may be combined.
3. Data selection: where data relevant to the analysis task are retrieved from the
database.
4. Data transformation: where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations.
5. Data mining: an essential process where intelligent methods are applied to
extract data patterns.
6. Pattern evaluation: to identify the truly interesting patterns representing
knowledge based on interestingness measures.
7. Knowledge presentation: where visualization and knowledge representation
techniques are used to present mined knowledge to users.
Steps 1 through 4 are different forms of data preprocessing, where data is prepared
for mining. The data mining step may interact with the user or a knowledge base.
The interesting patterns are presented to the user and may be stored as new
knowledge in the knowledge base.
Data mining is the process of discovering interesting patterns and knowledge from
large amounts of data. The data sources can include databases, data warehouses,
web, other information repositories or data that are streamed into the system
dynamically.
7.3 What kinds of data can be mined ?
The most basic forms of data for mining applications are database data, data
warehouse data and transactional data. Data mining can also be applied to other
forms of data such as data streams, ordered/sequence data, graph or networked
data, spatial data, text data, multimedia data and the WWW.
Relational databases are one of the most commonly available and richest information
repositories and thus they are a major data form in the study of data mining.
For example, rather than storing the details of each sales transaction, the data
warehouse may store a summary of the transactions per item type for each store or
summarized to a higher level for each sales region.
There are many other kinds of data that have versatile forms and structures and
rather different semantic meanings. Such kinds of data can be seen in many
applications:
● Time-related or sequence data, e.g., historical records, stock exchange data, and
time-series and biological sequence data.
● Data streams, e.g., video surveillance and sensor data, which are continuously
transmitted.
● Spatial data, e.g., maps.
● Engineering design data, e.g., the design of buildings, system components, or
integrated circuits.
● Hypertext and multimedia data, including text, image, video, and audio data.
These applications bring about new challenges like how to handle data carrying
special structures such as sequences, trees, graphs, and networks and specific
semantics such as ordering, image, audio and video contents, and connectivity and
how to mine patterns that carry rich structures and semantics.
● Cluster Analysis
● Outlier Analysis
7.5.1 Statistics
For classification and clustering tasks, machine learning research often focuses on
the accuracy of the model. In addition to accuracy, data mining research places
strong emphasis on the efficiency and scalability of mining methods on large data
sets, as well as on ways to handle complex types of data and explore new,
alternative methods.
A data warehouse integrates data originating from multiple sources and various
timeframes. It consolidates data in multidimensional space to form partially
materialized data cubes. The data cube model not only facilitates OLAP in
multidimensional databases but also promotes multidimensional data mining.
A Web search engine is a specialized computer server that searches for information
on the Web. The search results of a user query are often returned as a list
sometimes called hits. The hits may consist of web pages, images, and other types
of files. Some search engines also search and return data available in public
databases or open directories.
Web search engines are essentially very large data mining applications. Various data
mining techniques are used in all aspects of search engines, ranging from crawling
(e.g., deciding which pages should be crawled and the crawling frequencies), to
indexing (e.g., selecting the pages to be indexed and deciding to which extent the
index should be constructed), to searching (e.g., deciding how pages should be
ranked, which advertisements should be added, and how the search results can be
personalized or made “context aware”).
Search engines pose grand challenges to data mining. First, they have to handle a
huge and ever-growing amount of data. Second, Web search engines often have to
deal with online data. Another challenge is maintaining and incrementally updating a
model on fast growing data streams. Third, Web search engines often have to deal
with queries that are asked only a very small number of times.
7.7 Major Issues in Data Mining
Major issues in data mining research are categorized into five groups: mining
methodology, user interaction, efficiency and scalability, diversity of data types, and
data mining and society.
Mining various and new kinds of knowledge: Data mining covers a wide spectrum of
data analysis and knowledge discovery tasks, from data characterization and
discrimination to association and correlation analysis, classification, regression,
clustering, outlier analysis, sequence analysis, and trend and evolution analysis.
These tasks may use the same database in different ways and require the
development of numerous data mining techniques.
Interactive mining: The data mining process should be highly interactive. Thus, it is
important to build flexible user interfaces and an exploratory mining environment,
facilitating the user’s interaction with the system.
Incorporation of background knowledge: Background knowledge, constraints, rules,
and other information regarding the domain under study should be incorporated into
the knowledge discovery process.
Ad hoc data mining and data mining query languages: Query languages (e.g., SQL)
have played an important role in flexible searching because they allow users to pose
ad hoc queries. Similarly, high-level data mining query languages or other high-level
flexible user interfaces will give users the freedom to define ad hoc data mining
tasks.
Presentation and visualization of data mining results: The data mining system
should present data mining results vividly and flexibly, so that the discovered
knowledge can be easily understood and directly usable by humans. This is
especially crucial if the data mining process is interactive. It requires the system to
adopt expressive knowledge representations, user-friendly interfaces, and
visualization techniques.
Efficiency and scalability are always considered when comparing data mining
algorithms. As data amounts continue to multiply, these two factors are especially
critical.
Efficiency and scalability of data mining algorithms: Data mining algorithms must be
efficient and scalable in order to effectively extract information from huge amounts
of data in many data repositories or in dynamic data streams.
Parallel, distributed and incremental mining algorithms: The enormous size of many
data sets, the wide distribution of data and the computational complexity of some
data mining methods are factors that motivate the development of parallel and
distributed data-intensive mining algorithms.
Cloud computing and cluster computing, which use computers in a distributed and
collaborative way to tackle very large-scale computational tasks, are also active
research themes in parallel data mining.
The wide diversity of database types brings about challenges to data mining.
These include Handling complex types of data: Diverse applications generate a wide
spectrum of new data types, from structured data such as relational and data
warehouse data to semi-structured and unstructured data, from stable data
repositories to dynamic data streams, from simple data objects to temporal data,
biological sequences, sensor data, spatial data, hypertext data, multimedia data,
software program code, Web data and social network data.
Mining dynamic, networked, and global data repositories: Multiple sources of data
are connected by the internet and various kinds of networks, forming gigantic,
distributed, and heterogeneous global information systems and networks. The
discovery of knowledge from different sources of structured, semi-structured, or
unstructured yet interconnected data with diverse data semantics poses great
challenges to data mining.
7.7.5 Data Mining and Society
Social impacts of data mining: The improper disclosure or use of data and the
potential violation of individual privacy and data protection rights are areas of
concern that need to be addressed. Privacy-preserving data mining: Data mining will
help scientific discovery, business management, economy recovery and security
protection e.g, the real-time discovery of intruders and cyber attacks.
Invisible data mining: More and more systems should have data mining functions
built within so that people can perform data mining or use data mining results simply
by mouse clicking, without any knowledge of data mining algorithms. Intelligent
search engines and Internet-based stores perform such invisible data mining by
incorporating data mining into their components to improve their functionality and
performance.
Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.
The attribute X, such as salary, is recorded for a set of objects. Let x1, x2, ..., xN be
the set of N observed values or observations for X. Here, these values may also be
referred to as the data set (for X). If we were to plot the observations for salary,
where would most of the values fall? This gives us an idea of the central tendency of
the data. Measures of central tendency include the mean, median, mode, and
midrange. The most common and effective numeric measure of the “center” of a set
of data is the (arithmetic) mean. Let x1, x2, ..., xN be a set of N values or
observations, such as for some numeric attribute X, like salary. The mean of this set
of values is
mean = (x1 + x2 + ... + xN) / N
This corresponds to the built-in aggregate function, average (avg() in SQL), provided
in relational database systems.
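A small illustrative computation of these central-tendency measures in Python (the
salary values are example data only):

import numpy as np
from statistics import multimode

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean   :", np.mean(salaries))       # (x1 + x2 + ... + xN) / N = 58.0
print("median :", np.median(salaries))     # middle value of the ordered data = 54.0
print("mode(s):", multimode(salaries))     # most frequent values = [52, 70]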
The measures such as range, quantiles, quartiles, percentiles, and the interquartile
range are used to assess the dispersion or spread of numeric data. The five-number
summary which can be displayed as a boxplot is useful in identifying outliers.
Variance and standard deviation also indicate the spread of a data distribution.
Let x1,x2,...,xN be a set of observations for some numeric attribute, X. The range of
the set is the difference between the largest (max()) and smallest (min()) values.
Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal size consecutive sets. The 2-quantile is the data point dividing the
lower and upper halves of the data distribution. It corresponds to the median. The
4-quantiles are the three data points that split the data distribution into four equal
parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles.
The 100-quantiles are more commonly referred to as percentiles, they divide the
data distribution into 100 equal-sized consecutive sets. The median, quartiles, and
percentiles are the most widely used forms of quantiles.
The quartiles give an indication of a distribution’s center, spread and shape.The first
quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the
data. The third quartile, denoted by Q3, is the 75th percentile. It cuts off the lowest
75% (or highest 25%) of the data. The second quartile is the 50th percentile. As the
median, it gives the center of the data distribution. The distance between the first
and third quartiles is a simple measure of spread that gives the range covered by the
middle half of the data. This distance is called the interquartile range (IQR) and is
defined as
IQR = Q3 − Q1.
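A minimal NumPy sketch of quartiles and the IQR for the illustrative salary values
used above:

import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, q2, q3 = np.percentile(salaries, [25, 50, 75])   # first quartile, median, third quartile
iqr = q3 - q1                                        # IQR = Q3 - Q1
print(q1, q2, q3, iqr)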
Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the
five-number summary as follows:
● Typically, the ends of the box are at the quartiles, so that the box
length is the interquartile range.
● Two lines called whiskers outside the box extend to the smallest
(Minimum) and largest (Maximum) observations.
Figure 2.3 Boxplot for the unit price data for items sold at four branches of
AllElectronics during a given time period.
Variance and standard deviation are measures of data dispersion. They indicate how
spread out a data distribution is. A low standard deviation means that the data
observations tend to be very close to the mean, while a high standard deviation
indicates that the data are spread out over a large range of values.
3. Explain the data Science design process with the help of any real time application.
(CO1 , K3 )
11. PART A : Q & A : UNIT – I
1. What is Data Science ? ( CO1 , K1 )
● As the amount of data continues to grow, the need to leverage it becomes
more important. Data science involves using methods to analyze massive
amounts of data and extract the knowledge it contains.
● Data science is an evolutionary extension of statistics capable of dealing with
the massive amounts of data produced today. It adds methods from computer
science to the repertoire of statistics.
2. What is the role of data science in commercial companies? ( CO1 , K1 )
● Commercial companies in almost every industry use data science and big data
to gain insights into their customers, processes, staff, competition, and
products.
● Many companies use data science to offer customers a better user
experience, as well as to cross-sell, up-sell, and personalize their offerings.
● Human resource professionals use people analytics and text mining to screen
candidates, monitor the mood of employees, and study informal networks
among coworkers.
6. What are the two operations used to combine information from different data
sets? ( CO1 , K1 )
● The first operation is joining: enriching an observation from one table with
information from another table.
● The second operation is appending or stacking: adding the observations of
one table to those of another table.
7. What do you mean by Exploratory data analysis? ( CO1 , K2 )
● Information becomes much easier to grasp when shown in a picture,
therefore we mainly use graphical techniques to gain an understanding of
data and the interactions between variables. This phase is about exploring
data, discovering anomalies missed before and to fix them.
● The visualization techniques used in this phase range from simple line graphs
or histograms to more complex diagrams such as Sankey and network graphs.
8. What is a pareto diagram ? ( CO1 , K1 )
We can combine simple graphs into a Pareto diagram, or 80-20 diagram. A Pareto
diagram is a combination of the values and a cumulative distribution.
● mining methodology
● user interaction
● efficiency and scalability
● diversity of data types
● data mining and society
16. What is the need for basic statistical descriptions of data? ( CO1 , K1 )
Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.
17. What is the interquartile range (IQR)? ( CO1 , K1 )
The distance between the first and third quartiles is a simple measure of spread that
gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as IQR = Q3 − Q1.
18. What is a boxplot and why do we use it? ( CO1 , K2 )
A boxplot is a popular way of visualizing a distribution. It incorporates the
five-number summary of a distribution, which consists of the median (Q2), the
quartiles Q1 and Q3, and the smallest and largest individual observations, written in
the order Minimum, Q1, Median, Q3, Maximum.
Although certain companies treat data as an increasingly valuable asset, more and
more governments and organizations share their data for free with the world.
This data can be of excellent quality; how good it is depends on the institution that
creates and manages it. The information they share covers a broad range of topics
in a certain region and its demographics.
12. PART B QUESTIONS : UNIT – I
NPTEL : https://onlinecourses.nptel.ac.in/noc21_cs69/preview?
coursera : https://www.coursera.org/learn/python-data-analysis
Udemy : https://www.udemy.com/topic/data-science/
Mooc : https://mooc.es/course/introduction-to-data-science-in-python/
edx : https://learning.edx.org/course/course-v1:Microsoft+DAT208x+2T2016/home
14. REAL TIME APPLICATIONS
1. Email spam filtering:
The upsurge in the volume of unwanted emails, called spam, has created an intense
need for the development of more dependable and robust anti-spam filters. Machine
learning methods are now being used to successfully detect and filter spam emails.
Assume that we have a dataset of 30,000 emails, out of which some are classified as
spam, and some are classified as not spam. The machine learning model will be
trained on the dataset. Once the training process is complete, we can test it with a
mail that was not included in our training dataset. The machine learning model can
make predictions on the following input and classify it correctly if the input email is
spam or not.
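A minimal, illustrative scikit-learn sketch of this idea (the tiny data set below is
invented; a real filter would be trained on thousands of labelled emails):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now", "lowest price guaranteed, click here",
    "meeting agenda for monday", "please review the attached report",
]
labels = [1, 1, 0, 0]                        # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(emails)  # bag-of-words features

classifier = MultinomialNB().fit(features, labels)

new_email = ["free prize, click now"]
prediction = classifier.predict(vectorizer.transform(new_email))
print("spam" if prediction[0] == 1 else "not spam")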
2. Auto complete:
We have virtual assistants like Google AI, Siri, Alexa, Cortana, and many other
similar virtual assistants. With the help of these assistants, we can pass commands,
and using speech recognition, it tries to interpret what we are saying and
automates/performs a realistic task. Using these virtual assistants, we can make
calls, send messages or emails, or browse the web with just a simple voice
command. We can also converse with these virtual assistants, and hence they can
also act as chatbots.
15. CONTENTS BEYOND SYLLABUS : UNIT – I
Optimizing Business Processes:
Big data is also increasingly used to optimize business processes. Retailers are able
to optimize their stock based on predictive models generated from social media data,
web search trends, and weather forecasts. Another example is supply chain or
delivery route optimization using data from geographic positioning and radio
frequency identification sensors.
Improving Health:
The computing power of big data analytics enables us to find new cures and better
understand and predict disease patterns.
We can use all the data from smart watches and wearable devices to better
understand links between lifestyles and diseases.
Big data analytics also allow us to monitor and predict epidemics and disease
outbreaks, simply by listening to what people are saying (e.g., “Feeling rubbish today –
in bed with a cold”) or what they are searching for on the Internet (e.g., “cures for flu”).
Improving Sports Performance:
Most elite sports have now embraced big data analytics. Many use video analytics to
track the performance of every player in a football or baseball game, sensor
technology is built into sports equipment such as basketballs or golf clubs, and
many elite sports teams track athletes outside of the sporting environment – using
smart technology to track nutrition and sleep, as well as social media conversations
to monitor emotional wellbeing.
Improving Security and Law Enforcement:
Security services use big data analytics to foil terrorist plots and detect cyber attacks.
Police forces use big data tools to catch criminals and even predict criminal activity,
and credit card companies use big data analytics to detect fraudulent transactions.
Improving and Optimizing Cities and Countries:
Big data is used to improve many aspects of our cities and countries. For example, it
allows cities to optimize traffic flows based on real-time traffic information as well as
social media and weather data. A number of cities are currently using big data
analytics with the aim of turning themselves into Smart Cities, where the transport
infrastructure and utility processes are all joined up, where a bus would wait for a
delayed train, and where traffic signals predict traffic volumes and operate to
minimize jams.
16.Assessment Schedule
(Proposed Date & Actual Date)
Sl. No. | ASSESSMENT | Proposed Date | Actual Date
1 | FIRST INTERNAL ASSESSMENT | 09.09.2023 |
E-Book links:
1. https://drive.google.com/file/d/1HoGVyZqLTQj0aA4THA__D4jJ74czxEKH/view
?usp=sharing
2. https://drive.google.com/file/d/1vJfX5xipCHZOleWfM9aUeK8mwsal6Il1/view?u
sp=sharing
3. https://drive.google.com/file/d/1aU2UKdLxLdGpmI73S1bifK8JPiMXlpoS/view?
usp=sharing
18. MINI PROJECT SUGGESTION
a) Recommendation system
b) Credit Card Fraud Detection
c) Fake News Detection
d) Customer Segmentation
e) Sentiment Analysis
f) Recommender Systems
g) Emotion Recognition
h) Stock Market Prediction
i) Email classification
j) Tweets classification
k) Uber Data Analysis
l) Social Network Analysis
Thank you
Disclaimer:
This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the
respective group / learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender immediately by e-mail if you
have received this document by mistake and delete this document from your system. If you are not the
intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance
on the contents of this information is strictly prohibited.