Unit 1_Data Science BCA

Uploaded by

Shashank G S
Copyright: © All Rights Reserved

Program Name: B.C.A                Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2                No. of Credits: 03
Contact Hours: 42 Hours            Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40     Summative Assessment Marks: 60
Course Outcomes (COs): After the successful completion of the course, the student will be able to:
CO1 Understand the concepts of data and pre-processing of data.
CO2 Know simple pattern recognition methods
CO3 Understand the basic concepts of Clustering and Classification
CO4 Know the recent trends in Data Science
Contents (42 Hrs)

Unit I: Data Mining: Introduction, Data Mining Definitions, Knowledge Discovery in Databases (KDD) Vs Data Mining, DBMS Vs Data Mining, DM techniques, Problems, Issues and Challenges in DM, DM applications. (8 Hrs)

Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and transformation, Data reduction, Discretization. (8 Hrs)

Mining Frequent Patterns: Basic Concept, Frequent Item Set Mining Methods: Apriori and Frequent Pattern Growth (FPGrowth) algorithms, Mining Association Rules. (8 Hrs)

Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction, Bayes Classification Methods, Rule-Based Classification, Lazy Learners (or Learning from your Neighbors), k Nearest Neighbor. Prediction: Accuracy, Precision and Recall. (10 Hrs)

Clustering: Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Evaluation of Clustering. (8 Hrs)

Unit 1
Topics:

Data Mining: Introduction, Data Mining Definitions, Knowledge Discovery in Databases (KDD) Vs Data Mining, DBMS Vs Data Mining, DM techniques, Problems, Issues and Challenges in DM, DM applications.

Data Mining:

Def 1: Data mining refers to extracting, or mining, knowledge from large amounts of data stored in databases, data warehouses, or other repositories, i.e. extracting a small amount of valuable information from a huge volume of data.

Def 2: It is the process of discovering interesting patterns and knowledge from large amounts of data.

Data archaeology, data dredging, and data/pattern analysis are other terms for data mining. Another popular term is Knowledge Discovery from Data (KDD).

Why is Data Mining important?

Huge amounts of data are generated, and there is a need to turn them into useful information and knowledge. This information and knowledge is used in applications such as market analysis (consumer buying patterns), fraud detection (fraudulent accounts, fraudulent credit card holders), scientific exploration (hidden facts in data), telecommunication, etc.

Steps in Knowledge Discovery from Data:

1. Data Cleaning: Noise and inconsistent data are removed.
2. Data Integration: Multiple data sources are combined.
3. Data Selection: Only the data relevant to the analysis task are retrieved from the database.
4. Data Transformation: Data are consolidated into a form appropriate for mining.
5. Data Mining: Intelligent methods are applied to extract data patterns.
6. Pattern Evaluation: The truly interesting patterns representing knowledge are identified, based on interestingness measures.
7. Knowledge Presentation: Visualization and knowledge representation techniques are used to present the mined knowledge to the user.

Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are
presented to the user and may be stored as new knowledge in the knowledge base.
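
The seven steps can be sketched as a small Python pipeline. This is only a toy illustration under stated assumptions: the two data sources, the field names, the function names, and the "average amount per item" pattern are all made up here and do not come from any particular data mining tool.

```python
# Hypothetical toy data: two "sources" of purchase records (all values are made up).
store_a = [
    {"customer": "c1", "item": "Milk ", "amount": "40"},
    {"customer": "c2", "item": "bread", "amount": None},   # missing value (noise)
    {"customer": "c1", "item": "bread", "amount": "25"},
]
store_b = [
    {"customer": "c3", "item": "MILK", "amount": "42"},
    {"customer": "c3", "item": "bread", "amount": "24"},
]

def clean(records):
    """Step 1: drop records with missing amounts and normalize item names."""
    return [
        {**r, "item": r["item"].strip().lower()}
        for r in records
        if r["amount"] is not None
    ]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """Step 3: keep only the fields relevant to the task."""
    return [{"item": r["item"], "amount": r["amount"]} for r in records]

def transform(records):
    """Step 4: convert amounts from text to numbers."""
    return [{"item": r["item"], "amount": float(r["amount"])} for r in records]

def mine(records):
    """Step 5: a trivial 'pattern', the average amount spent per item."""
    amounts = {}
    for r in records:
        amounts.setdefault(r["item"], []).append(r["amount"])
    return {item: sum(v) / len(v) for item, v in amounts.items()}

def evaluate(patterns, min_amount=30.0):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return {item: avg for item, avg in patterns.items() if avg >= min_amount}

prepared = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = mine(prepared)
interesting = evaluate(patterns)

# Step 7: present the knowledge to the user.
for item, avg in interesting.items():
    print(f"{item}: average amount {avg:.1f}")
```

Steps 1 to 4 only prepare the data; the mining and evaluation steps are where patterns are produced and filtered.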

The preceding view shows data mining as one step in the knowledge discovery process, although
an essential one because it uncovers hidden patterns for evaluation. However, in industry, in media,
and in the research environment, the term data mining is often used to refer to the entire knowledge
discovery process (perhaps because the term is shorter than knowledge discovery from data).
Therefore, we adopt a broad view of data mining functionality: Data mining is the process of
discovering interesting patterns and knowledge from large amounts of data. The data sources can
include databases, data warehouses, the Web, other information repositories, or data that are
streamed into the system dynamically.

Fig. Steps in the process of knowledge discovery.

Architecture of DM System

Typically, a DM system (DMS) consists of the following components:

• Database, data warehouse, WWW, or other information repository (spreadsheets, files)
• Data Warehouse Server:
This server is responsible for fetching the relevant data, based on the user's data mining request.
• Knowledge base:
This is the domain knowledge that is used to guide the search or to evaluate the interestingness of the resulting patterns.
• DM Engine:
Consists of a set of methods/functions such as characterization, association, correlation analysis, classification, cluster analysis, prediction, outlier analysis, etc.
• Pattern Evaluation:
Employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.
• User Interface:
The interface between the user and the DMS. The user specifies queries and tasks, browses data, and visualizes output.

Fig: Architecture of DM System

Which technologies are used for DM?

Data mining integrates techniques from multiple disciplines, such as databases, data warehousing, statistics, machine learning, pattern recognition, neural networks, data visualization, information retrieval, image/signal processing, and spatial and temporal data analysis.

Fig: Data Mining adopts many domains

Data mining on what kinds of data?

DM can be used to mine knowledge from any kind of data source, such as:

• Relational databases (RDBMS)
• Data warehouses (DW)
• Transactional DB
• Flat files
• Data streams
• WWW
• Multimedia DB
• Object-relational DB
• Text DB (unstructured)
• Time series DB (hourly, weekly)
• Spatial DB

Mining Other Kinds of Data

• Mining Spatial Data
  o Spatial frequent/co-located patterns, spatial clustering and classification
• Mining Spatiotemporal and Moving Object Data
  o Spatiotemporal data mining, trajectory mining, swarm, …
• Mining Cyber-Physical System Data
  o Applications: healthcare, air-traffic control, flood simulation
• Mining Multimedia Data
  o Social media data, geo-tagged spatial clustering, periodicity analysis, …
• Mining Text Data
  o Topic modeling, i-topic model, integration with geo- and networked data
• Mining Web Data
  o Web content, web structure, and web usage mining
• Mining Data Streams
  o Dynamics, one-pass, patterns, clustering, classification, outlier detection
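
The "one-pass" requirement for data streams means each value is looked at once and then discarded. A minimal sketch of that idea, assuming we only want a running mean and variance, is Welford's online algorithm. The class name `RunningStats` and the sample readings are hypothetical.

```python
class RunningStats:
    """Keeps count, mean and variance of a stream using constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new value into the summary (Welford's update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0

def sensor_stream():
    """Hypothetical stream; a generator stands in for data arriving over time."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)   # each value is seen once and never stored

print(stats.n, stats.mean, stats.variance)
```

The summary never stores the stream itself, so memory use stays constant no matter how long the stream runs.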

DM functionalities – What kinds of patterns can be mined?

DM tasks can be classified into two categories:

1) Descriptive: Characterizes the general properties of the data in the database.
2) Predictive: Performs inference on the current data in order to make predictions.

Different kinds of patterns that can be discovered are:

1) Concept/class description:
Data entries can be associated with classes or concepts. For example, classes of items for
sale include computers and printers, and concepts of customers include bigSpenders and
budgetSpenders. It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms. Such descriptions of a class or a concept are
called class/concept descriptions.

These descriptions can be derived by:

1. Data characterization, by summarizing the data of the class under study. (Ex: summarizing customers by gender or buying behavior)
2. Data discrimination, by comparing the target class with one or a set of comparative classes. (Ex: comparing sales of computers with sales of laptops)
3. Both data characterization and discrimination

The output of data characterization can be presented in various forms. Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables.
2) Mining Frequent Patterns, Associations & Correlations:
Frequent patterns are patterns that occur frequently in data. Mining frequent patterns leads to the discovery of interesting associations and correlations within data. Different kinds of frequent patterns are:
• Itemsets: A frequent itemset typically refers to a set of items that often appear together in a transactional data set, for example milk and bread, which are frequently bought together in grocery stores by many customers.
• Subsequences: A frequently occurring subsequence, such as the pattern that customers tend to purchase certain items one after another. For example: a mobile phone, then a back case, then a screen guard.
3) Classification & Prediction:
Classification is the process of building a model that describes the classes and then using the model to assign objects to those classes. The model can be represented as if-then rules, a decision tree, a neural network, etc. Other methods for constructing classification models include Bayesian classification, SVM, and k-nearest neighbor.
Ex: A bank manager wants to analyze which loan applicants are safe and which pose a risk.
4) Regression Analysis:
Regression analysis is a statistical process that estimates the relationship between a dependent variable and one or more independent variables. It helps identify which variables have an impact on a topic of interest: which factors matter most, which factors can be ignored, and how these factors influence each other.
• E.g., Logistic regression:
Used to predict categorical dependent variables, such as yes or no, true or false, or 0 or 1. For example, insurance companies use logistic regression to decide whether to approve a new policy.
5) Cluster Analysis:
Clustering groups data objects without consulting class labels; the groups are formed from the data itself.
Ex: Clusters of customers formed by buying preferences.
6) Outlier Analysis:
Finding data objects that differ drastically from the rest of the data.
Ex: Fraud detection.
7) Evolution Analysis:
Describes and models trends for objects whose behavior changes over time. Ex: Share prices.
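
Cluster analysis (item 5 above) can be illustrated with a minimal one-dimensional k-means sketch. k-means is one of the partitioning methods listed in the course contents; the function name `kmeans_1d`, the spending values, and the initial centers are assumptions chosen only for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data.

    `centers` holds the initial guesses; no class labels are used anywhere.
    """
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # converged: assignments will not change
            break
        centers = new_centers
    return centers, clusters

# Hypothetical monthly spend of eight customers.
monthly_spend = [12, 15, 14, 90, 95, 100, 13, 92]
centers, clusters = kmeans_1d(monthly_spend, centers=[0, 50])
print(centers)
print(clusters)
```

The two groups (low spenders and high spenders) emerge from the data alone, which is what distinguishes clustering from classification.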

Major Statistical Data Mining Methods

o Regression
o Generalized Linear Model
o Analysis of Variance
o Mixed-Effect Models
o Factor Analysis
o Discriminant Analysis
o Survival Analysis

Visual Data Mining

o Visualization: Use of computer graphics to create visual images which aid in the
understanding of complex, often massive representations of data

o Visual Data Mining: discovering implicit but useful knowledge from large data sets using
visualization techniques

Visual data mining discovers implicit and useful knowledge from large data sets using data and/or
knowledge visualization techniques. Visual data mining can be viewed as an integration of two
disciplines: data visualization and data mining. It is also closely related to computer graphics,
multimedia systems, human–computer interaction, pattern recognition, and high-performance
computing.

In general, data visualization and data mining can be integrated in the following ways:
Data visualization: Data in a database or data warehouse can be viewed at different granularity or
abstraction levels, or as different combinations of attributes or dimensions. Data can be presented
in various visual forms, such as boxplots, 3-D cubes, data distribution charts, curves, surfaces, and
link graphs, etc. Visual display can help give users a clear impression and overview of the data
characteristics in a large data set.

Data mining result visualization: Visualization of data mining results is the presentation of the results or knowledge obtained from data mining in visual forms. Such forms may include scatter plots and boxplots, as well as decision trees, association rules, clusters, outliers, and generalized rules.

Data mining process visualization: This type of visualization presents the various processes of data
mining in visual forms so that users can see how the data are extracted and from which database
or data warehouse they are extracted, as well as how the selected data are cleaned, integrated,
preprocessed, and mined. Moreover, it may also show which method is selected for data mining,
where the results are stored, and how they may be viewed.

Audio Data Mining

• Uses audio signals to indicate the patterns of data or the features of data mining results.
• It is an interesting alternative to visual mining.
• It is the inverse of mining audio (such as music) databases, where the task is to find patterns in audio data.
• Visual data mining may disclose interesting patterns using graphical displays, but requires users to concentrate on watching patterns.
• Instead, audio data mining transforms patterns into sound and music, so that users can listen to pitches, rhythms, tune, and melody in order to identify anything interesting or unusual.

KDD Vs Data Mining

| KDD | Data Mining |
|---|---|
| Knowledge Discovery in Databases (KDD) is a process that automatically discovers patterns, rules, and other regular contents in large amounts of data. | Data mining (DM) is a step in the KDD process that involves applying algorithms to extract patterns from data. |
| KDD is a systematic process for identifying patterns in large and complex data sets. | Data mining is the foundation of KDD and is essential to the entire methodology. |
| The overall set of processes for knowledge extraction: data cleaning, data selection, data integration, data mining, pattern evaluation, knowledge presentation. | Data mining is the process of extracting hidden knowledge from large data. Intelligent algorithms are used to extract useful information through data categorization, data characterization, data discrimination, association, frequent pattern mining, regression, outlier analysis, classification, clustering, etc. |
| Contains several steps. | It is one step in KDD. |
| Sometimes used as another name for data mining. | Sometimes used as another name for KDD. |

DBMS Vs Data Mining

| DBMS | Data Mining |
|---|---|
| A system to manage the data in a database: creation, insertion, deletion, updating, etc. | Data mining is the process of extracting hidden knowledge from large data. Intelligent algorithms are used to extract useful information through data categorization, data characterization, data discrimination, association, frequent pattern mining, regression, outlier analysis, classification, clustering, etc. |
| Stores data in a format suitable for data management. | Data from the database is used for mining. |
| Application oriented | Fact oriented |
| Concerned with business transactions such as insertion, deletion, etc. | Concerned with hidden knowledge extraction using intelligent algorithms |
| Uses SQL | Uses algorithms |
| Stores and manages data | Analyzes data |
| Used to manage the data of an organization | Used to extract valuable information from the data generated in an organization |
| Query-based processing (transactions) | Analytical processing |

Fundamentals of Data Science, Dr. Chandrajit M, MIT First Grade College
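
The contrast between query-based processing and analytical processing can be sketched with Python's built-in `sqlite3` module. The `sales` table and its rows are hypothetical; the "mining" part is deliberately tiny (counting item pairs that occur in the same transaction) and only hints at what a real algorithm does.

```python
import sqlite3
from collections import Counter
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (txn_id INTEGER, item TEXT)")
rows = [
    (1, "milk"), (1, "bread"),
    (2, "milk"), (2, "bread"), (2, "eggs"),
    (3, "bread"), (3, "eggs"),
]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# DBMS side: a SQL query answers a question the user already knows how to ask.
milk_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE item = 'milk'"
).fetchone()[0]

# Data mining side: an algorithm searches for patterns nobody asked for explicitly,
# here the pairs of items that occur together in the same transaction.
baskets = {}
for txn_id, item in conn.execute("SELECT txn_id, item FROM sales"):
    baskets.setdefault(txn_id, set()).add(item)

pair_counts = Counter(
    pair
    for items in baskets.values()
    for pair in combinations(sorted(items), 2)
)
conn.close()

print("rows containing milk:", milk_count)
print("co-occurring pairs:", dict(pair_counts))
```

Note that the mining step still reads its input from the database, which matches the table row "Data from the database is used for mining".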

Major Issues in DM

a. Mining Methodology:
Researchers have been vigorously developing new DM techniques. This involves the investigation of new kinds of knowledge, mining in multidimensional space, integrating methods from other disciplines, and the consideration of semantic ties among data objects.
b. User Interaction:
Users play an important role in the DM process. Interesting areas of research include how to interact with a DMS, how to incorporate a user's background knowledge in mining, and how to visualize and comprehend data mining results.
c. Efficiency & Scalability:
DM algorithms must be efficient and scalable in order to effectively extract information from huge amounts of data in many data repositories or in dynamic data streams. In other words, the running time of an algorithm must be short.
d. Diversity of database types:
The discovery of knowledge from different sources of structured or unstructured yet interconnected data with diverse data semantics poses great challenges to DM.
e. DM & Society:
I. Social Impact of DM:
The improper disclosure or use of data and the potential violation of individual privacy and data protection rights are areas of concern that need to be addressed.
II. Privacy-Preserving DM:
DM poses a risk of disclosing an individual's personal information. Research in this area aims to respect data sensitivity and preserve people's privacy while still performing successful DM.
III. Invisible DM:
When purchasing online, users might be unaware that the store is likely collecting data on the buying patterns of its customers, which may be used to recommend other items for purchase in the future.

Data Mining Applications

Two highly successful and popular application examples of data mining:

1. Business Intelligence:
BI technologies provide historical, current, and predictive views of business operations. Without data mining, many businesses may not be able to perform effective market analysis, compare customer feedback on similar products, discover the strengths and weaknesses of competitors, perform predictive analysis, etc.
2. Web Search Engine:
Web search engines are very large DM applications. Various DM tasks such as crawling, indexing, ranking, and searching are used.
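
The indexing, searching, and ranking tasks can be sketched with a tiny inverted index. This is a minimal illustration, not how any real search engine works: the three "crawled" pages are hypothetical, and ranking by raw word count is an assumption made to keep the example short.

```python
from collections import defaultdict

# Hypothetical "crawled" pages; a real crawler would fetch these over the network.
pages = {
    "page1": "data mining finds patterns in data",
    "page2": "a data warehouse stores integrated data",
    "page3": "mining association rules",
}

def build_index(docs):
    """Indexing: map each word to {page: number of occurrences}."""
    index = defaultdict(dict)
    for page, text in docs.items():
        for word in text.lower().split():
            index[word][page] = index[word].get(page, 0) + 1
    return index

def search(index, query):
    """Searching and ranking: score pages by how often they contain the query words."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for page, count in index.get(word, {}).items():
            scores[page] += count
    # Highest score first; ties are broken by page name to keep the order stable.
    return sorted(scores, key=lambda p: (-scores[p], p))

index = build_index(pages)
results = search(index, "data mining")
print(results)
```

The index is built once and reused for every query, which is why indexing is treated as a separate task from searching.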

Other important applications of Data Mining are:

Data Mining for Financial Data Analysis

o Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality. For example, a credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Typical tasks:
  • Credit card fraud detection
  • Identifying "loyal" customers
  • Extraction of information related to customers
  • Determining credit card spending by customer groups
  • Consumer credit rating
o Classification and clustering of customers for targeted marketing
  - Multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer with an appropriate customer group
o Detection of money laundering and other financial crimes
  - Integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)
  - Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)

Data Mining for Retail & Telecom Industries

o Retail industry: huge amounts of data on sales, customer shopping history, e-commerce, etc.
o Applications of retail data mining
  - Identify customer buying behaviors
  - Discover customer shopping patterns and trends
  - Improve the quality of customer service
  - Achieve better customer retention and satisfaction
  - Enhance goods consumption ratios
  - Design more effective goods transportation and distribution policies
o Telecom and many other industries: share many goals and expectations with retail data mining

Data Mining Practice for Retail Industry

o Design and construction of data warehouses
o Multidimensional analysis of sales, customers, products, time, and region
o Analysis of the effectiveness of sales campaigns
o Customer retention: analysis of customer loyalty
  - Use customer loyalty card information to register sequences of purchases of particular customers
  - Use sequential pattern mining to investigate changes in customer consumption or loyalty
  - Suggest adjustments to the pricing and variety of goods
o Product recommendation and cross-reference of items
o Fraudulent analysis and the identification of unusual patterns
o Use of visualization tools in data analysis

Data Mining in Science and Engineering

o Data warehouses and data preprocessing
  - Resolving inconsistencies or incompatible data collected in diverse environments and different periods (e.g. eco-system studies)
o Mining complex data types
  - Spatiotemporal, biological, diverse semantics and relationships
o Graph-based and network-based mining
  - Links, relationships, data flow, etc.
o Visualization tools and domain-specific knowledge
o Other issues
  - Data mining in social sciences and social studies: text and social media
  - Data mining in computer science: monitoring systems, software bugs, network intrusion

Data Mining for Intrusion Detection and Prevention

Data mining techniques play a vital role in intrusion detection, that is, in finding network attacks and anomalies. They help select and refine useful and relevant information from large data sets and classify the data relevant to an Intrusion Detection System (IDS). The IDS raises alarms when network traffic indicates a foreign invasion of the system. For example:
• Detecting security violations
• Misuse detection
• Anomaly detection

o The majority of intrusion detection and prevention systems use
  - Signature-based detection: uses signatures, i.e. attack patterns that are preconfigured and predetermined by domain experts
  - Anomaly-based detection: builds profiles (models of normal behavior) and detects behavior that deviates substantially from the profiles
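
The two detection styles can be contrasted in a short sketch. Everything concrete here is an assumption for illustration: the two signature strings, the logins-per-hour training data, and the three-standard-deviation threshold are made up and are not taken from any real intrusion detection product.

```python
from statistics import mean, stdev

# Signature-based: hypothetical attack patterns written down in advance by experts.
SIGNATURES = ["' OR 1=1", "../../etc/passwd"]

def matches_signature(request):
    """Return True if the request contains any known attack pattern."""
    return any(sig in request for sig in SIGNATURES)

# Anomaly-based: build a profile of normal behavior, then flag large deviations.
def build_profile(normal_values):
    """The profile is simply the mean and sample standard deviation."""
    return mean(normal_values), stdev(normal_values)

def is_anomalous(value, profile, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the normal mean."""
    mu, sigma = profile
    return abs(value - mu) > threshold * sigma

normal_logins_per_hour = [4, 5, 6, 5, 4, 6, 5, 5]   # made-up training data
profile = build_profile(normal_logins_per_hour)

print(matches_signature("GET /index.php?id=1' OR 1=1"))
print(is_anomalous(5, profile), is_anomalous(60, profile))
```

Signature matching can only catch attacks that are already listed, while the profile-based check can flag behavior never seen before, at the cost of possible false alarms.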

Data Mining and Recommender Systems

o Recommender systems: personalization, making product recommendations that are likely to be of interest to a user
o Approaches: content-based, collaborative, or their hybrid
  - Content-based: recommends items that are similar to items the user preferred or queried in the past
  - Collaborative filtering: considers a user's social environment, i.e. the opinions of other customers who have similar tastes or preferences
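
Collaborative filtering can be sketched as follows. This is a simplified user-based variant under stated assumptions: the users, items, and ratings are hypothetical, and similarity is computed with cosine similarity over the items two users have both rated, which is one common choice among several.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}.
ratings = {
    "asha":  {"phone": 5, "case": 4, "laptop": 1},
    "ravi":  {"phone": 5, "case": 5, "charger": 4},
    "meena": {"laptop": 5, "mouse": 4, "phone": 1},
}

def cosine_similarity(a, b):
    """Cosine similarity between two users over the items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Suggest unseen items, weighting other users' ratings by their similarity."""
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        for item, rating in their_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    # Best score first; ties are broken alphabetically.
    return sorted(scores, key=lambda i: (-scores[i], i))

suggestions = recommend("asha", ratings)
print(suggestions)
```

Items rated by users with similar tastes rank higher than items rated by users whose tastes differ.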

Business Transactions: Every transaction in the business industry is often recorded in perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations. Using this data effectively and within a reasonable time frame for competitive decision-making is the most important problem to solve for businesses that struggle to survive in a highly competitive world. Data mining helps to analyze these business transactions and to identify marketing approaches and support decision-making. Examples:
• Direct mail targeting
• Stock trading
• Customer segmentation
Market Basket Analysis: Market basket analysis is a technique for the careful study of the purchases made by customers in a supermarket. It identifies the patterns of items that customers frequently purchase together. Companies can use this analysis to promote deals, offers, and sales, and data mining techniques help to carry it out. Examples:
• Data mining is used in sales and marketing to provide better customer service, to improve cross-selling opportunities, and to increase direct mail response rates.
• Customer retention, in the form of pattern identification and prediction of likely defections, is possible with data mining.
• Risk assessment and fraud detection also use data mining to identify inappropriate or unusual behavior, etc.
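
A market basket pattern is usually quantified with two standard measures, support and confidence. This unit does not define them, so the sketch below is offered as an assumption about how the analysis is typically done; the five transactions are made up.

```python
# Hypothetical supermarket transactions, one set of items per basket.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

s = support({"milk", "bread"}, transactions)
c = confidence({"milk"}, {"bread"}, transactions)
print(f"support(milk, bread) = {s:.2f}")
print(f"confidence(milk -> bread) = {c:.2f}")
```

Here milk and bread appear together in 3 of 5 baskets, and 3 of the 4 baskets containing milk also contain bread, so a store might place the two items near each other or bundle them in an offer.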
Education: To analyze the education sector, data mining uses the Educational Data Mining (EDM) method. This method generates patterns that can be used by both learners and educators. With EDM we can perform educational tasks such as:
• Predicting student admission to higher education
• Profiling students
• Predicting student performance
• Evaluating teachers' teaching performance
• Curriculum development
• Predicting student placement opportunities
Research: In research, data mining techniques can perform prediction, classification, clustering, association, and grouping of data. In most technical research in data mining, we create a training model and a testing model. The training/testing approach is a strategy for measuring the precision of the proposed model. It is called Train/Test because we split the data set into two sets: a training data set and a testing data set. The training data set is used to build the model, whereas the testing data set is used to evaluate it. Examples:
• Classification of uncertain data
• Information-based clustering
• Decision support systems
• Web mining
• Domain-driven data mining
• IoT (Internet of Things) and cybersecurity
• Smart farming with IoT (Internet of Things)
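
The Train/Test strategy can be sketched in a few lines. The assumptions are stated up front: the loan data is hypothetical, the model is a one-nearest-neighbor classifier chosen for brevity, and the quality measure used is plain accuracy (the fraction of correct predictions) rather than precision in the strict sense.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def predict_1nn(train, x):
    """Predict the label of x as the label of its nearest training point."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test rows whose predicted label equals the true label."""
    correct = sum(1 for x, label in test if predict_1nn(train, x) == label)
    return correct / len(test)

# Hypothetical data: (income in thousands, loan decision).
data = [(10, "risky"), (12, "risky"), (15, "risky"), (18, "risky"),
        (50, "safe"), (55, "safe"), (60, "safe"), (65, "safe")]

train, test = train_test_split(data)
print(len(train), "training rows,", len(test), "testing rows")
print("accuracy on unseen rows:", accuracy(train, test))
```

The key point is that the testing rows are never shown to the model while it is built, so the score reflects behavior on unseen data.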
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians and to work out which marketing activities will have the best effect in the upcoming months. In the insurance sector, data mining can help to predict which customers will buy new policies, identify the behavior patterns of risky customers, and identify fraudulent behavior by customers.
• Claims analysis, i.e. which medical procedures are claimed together.
• Identifying successful medical therapies for different illnesses.
• Characterizing patient behavior to predict office visits.


Transportation: A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. A large consumer merchandise organization can apply data mining to improve its sales process to retailers.
• Determine the distribution schedules among outlets.
• Analyze loading patterns.
