Unit 1_Data Science BCA
Program Name: B.C.A    Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2    No. of Credits: 03
Contact hours: 42 Hours    Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40    Summative Assessment Marks: 60
Course Outcomes (COs): After the successful completion of the course, the student will be able to:
CO1 Understand the concepts of data and pre-processing of data.
CO2 Know simple pattern recognition methods
CO3 Understand the basic concepts of Clustering and Classification
CO4 Know the recent trends in Data Science
Contents (42 Hrs)
Unit I: Data Mining: Introduction, Data Mining Definitions, Knowledge Discovery in
Databases (KDD) Vs Data Mining, DBMS Vs Data Mining, DM techniques, Problems,
Issues and Challenges in DM, DM applications. (8 Hrs)
Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning,
Data Integration and transformation, Data reduction, Discretization. (8 Hrs)
Mining Frequent Patterns: Basic Concept - Frequent Item Set Mining Methods - Apriori
and Frequent Pattern Growth (FPGrowth) algorithms - Mining Association Rules. (8 Hrs)
Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction, Bayes
Classification Methods, Rule-Based Classification, Lazy Learners (or Learning from your
Neighbors), k Nearest Neighbor. Prediction - Accuracy - Precision and Recall. (10 Hrs)
Clustering: Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density-Based
Methods, Grid-Based Methods, Evaluation of Clustering. (8 Hrs)
Unit 1
Topics:
Data Mining:
Def 1: Data mining refers to extracting, or mining, knowledge from large amounts of data stored in
databases, data warehouses, or other repositories, i.e. the extraction of a small amount of valuable
information from huge data.
Def 2: It is the process of discovering interesting patterns and knowledge from large amounts of data.
Data archaeology, data dredging, and data/pattern analysis are other terms for data mining. Another
popular term is Knowledge Discovery from Data (KDD).
Huge volumes of data are generated, and there is a need to turn them into useful information and
knowledge. This information and knowledge is used in various applications such as market analysis
(consumer buying patterns), fraud detection (fraudulent accounts, fraudulent credit card holders),
science exploration (hidden facts in data), telecommunication, etc.
Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are
presented to the user and may be stored as new knowledge in the knowledge base.
The preceding view shows data mining as one step in the knowledge discovery process, although
an essential one because it uncovers hidden patterns for evaluation. However, in industry, in media,
and in the research environment, the term data mining is often used to refer to the entire knowledge
discovery process (perhaps because the term is shorter than knowledge discovery from data).
Therefore, we adopt a broad view of data mining functionality: Data mining is the process of
discovering interesting patterns and knowledge from large amounts of data. The data sources can
include databases, data warehouses, the Web, other information repositories, or data that are
streamed into the system dynamically.
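As a rough illustration of data mining as one step within a larger knowledge discovery process, the sketch below chains cleaning, selection, transformation, mining, and pattern evaluation over a few made-up purchase records. The function names, the record layout, and the 0.5 support threshold are assumptions chosen for illustration, and data integration is omitted because there is only one data source.

```python
from collections import Counter

def clean(records):
    """Data cleaning: drop records whose item list is missing or empty."""
    return [r for r in records if r.get("items")]

def select(records):
    """Data selection: keep only the field relevant to the mining task."""
    return [r["items"] for r in records]

def transform(baskets):
    """Data transformation: normalize item names and remove duplicates."""
    return [sorted({item.strip().lower() for item in b}) for b in baskets]

def mine(baskets):
    """Data mining: count how many baskets contain each item."""
    return Counter(item for b in baskets for item in b)

def evaluate(counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep items that appear in enough baskets."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

def kdd(records):
    """Run the preprocessing steps, then mining, then evaluation."""
    baskets = transform(select(clean(records)))
    return evaluate(mine(baskets), len(baskets))

# Made-up raw records, including one with missing data.
raw = [
    {"id": 1, "items": ["Milk", "bread "]},
    {"id": 2, "items": ["milk", "Eggs"]},
    {"id": 3, "items": None},
    {"id": 4, "items": ["Bread", "milk"]},
]
patterns = kdd(raw)
print(patterns)  # knowledge presentation: show the surviving patterns
```

Only the `mine` function corresponds to data mining in the narrow sense; everything before it is preprocessing, and everything after it is evaluation and presentation.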
Architecture of DM System
User interface: the interface between the user and the data mining system (DMS). The user
specifies queries and tasks, browses the data, and visualizes the output.
Data mining involves an integration of techniques from multiple disciplines such as databases,
data warehousing, statistics, machine learning, pattern recognition, neural networks, data
visualization, information retrieval, image/signal processing, and spatial and temporal data analysis.
DM can be used to mine knowledge from any kind of data source, such as:
Relational databases (RDBMS)
Data warehouses (DW)
Transactional databases
Flat files
Data streams
World Wide Web (WWW)
Multimedia databases
Object-relational databases
Text databases (unstructured)
Time-series databases (hourly, weekly)
Spatial databases
o Topic modeling, i-topic model, integration with geo- and networked data
1) Concept/class description:
Data entries can be associated with classes or concepts. For example, classes of items for
sale include computers and printers, and concepts of customers include bigSpenders and
budgetSpenders. It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms. Such descriptions of a class or a concept are
called class/concept descriptions.
These descriptions can be derived using:
1. Data characterization, by summarizing the data of the class under study. (Ex: based on
gender, buying behavior)
2. Data discrimination, by comparing the target class with one or a set of comparative classes.
(Ex: comparing sales of computers with sales of laptops)
3. Both data characterization and discrimination.
The output of data characterization can be presented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables.
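A minimal sketch of characterization and discrimination, using the bigSpenders and budgetSpenders concepts mentioned above. The customer records, field names, and the choice of average spend as the summary measure are assumptions made for illustration.

```python
from statistics import mean

# Made-up customer records.
customers = [
    {"segment": "bigSpenders", "spend": 1200},
    {"segment": "bigSpenders", "spend": 800},
    {"segment": "budgetSpenders", "spend": 200},
    {"segment": "budgetSpenders", "spend": 300},
]

def characterize(rows, segment):
    """Data characterization: summarize one class (size and average spend)."""
    members = [r for r in rows if r["segment"] == segment]
    return {"count": len(members),
            "avg_spend": mean(r["spend"] for r in members)}

def discriminate(rows, target, contrast):
    """Data discrimination: compare the target class with a contrasting class."""
    t = characterize(rows, target)
    c = characterize(rows, contrast)
    return {"avg_spend_ratio": t["avg_spend"] / c["avg_spend"]}

big = characterize(customers, "bigSpenders")
comparison = discriminate(customers, "bigSpenders", "budgetSpenders")
print(big)
print(comparison)
```

In this toy data, the target class spends four times as much on average as the contrasting class; a real system would present such summaries as charts or data cubes.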
2) Mining Frequent Pattern, Association & correlation:
Frequent patterns are patterns that occur frequently in data. Mining frequent patterns leads
to the discovery of interesting associations and correlations within data. Different kinds of
frequent patterns are:
Item sets - A frequent itemset typically refers to a set of items that often appear together
in a transactional data set—for example, milk and bread, which are frequently bought
together in grocery stores by many customers.
Subsequences – A frequently occurring subsequence, such as the pattern that customers tend
to purchase first a mobile phone, followed by a back case and then a screen guard.
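To make the idea of a frequent itemset concrete, the sketch below counts how often each pair of items appears together in a handful of made-up transactions and keeps the pairs whose support (fraction of transactions containing the pair) meets a threshold. This is plain brute-force pair counting, not Apriori or FP-Growth, and the data and the 0.5 threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

# Made-up transactional data set: each transaction is a set of items.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

def frequent_pairs(transactions, min_support):
    """Return item pairs whose support is at least min_support."""
    counts = Counter()
    for t in transactions:
        # Sort the items so each pair is always counted under the same key.
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(transactions, min_support=0.5)
print(pairs)
```

Here milk and bread appear together in three of the four transactions, so the pair has support 0.75 and counts as frequent.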
3) Classification & Prediction:
It is the process of building a model that describes the classes and then using the model to
assign objects to those classes. The model can be represented using if-then rules, decision
trees, neural networks, etc. Methods for constructing classification models include Bayesian
classification, SVM, and k-nearest neighbor.
Ex: A bank manager wants to analyze which loan applicants are safe and which pose a
risk.
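The loan example can be sketched with a k-nearest-neighbor classifier, one of the methods named above. The two features (income and existing debt, in thousands), the training records, and the labels "safe" and "risky" are all made up for illustration.

```python
import math
from collections import Counter

# Made-up training data: (income, existing debt) -> class label.
training = [
    ((60, 5), "safe"),
    ((75, 10), "safe"),
    ((80, 2), "safe"),
    ((20, 15), "risky"),
    ((25, 30), "risky"),
    ((30, 25), "risky"),
]

def knn_predict(training, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    nearest = sorted(training, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(training, (70, 8)))
print(knn_predict(training, (22, 20)))
```

The first applicant lies closest to the "safe" examples and the second to the "risky" ones, so the votes go the same way. In practice the features would usually be scaled first so that no single attribute dominates the distance.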
4) Regression Analysis:
Regression analysis is a statistical process that estimates the relationship between a dependent
variable and one or more independent variables. It is a method of identifying which variables
have an impact on a topic of interest: performing a regression helps determine which factors
matter most, which factors can be ignored, and how these factors influence each other.
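For a single independent variable, the relationship can be estimated with ordinary least squares. The sketch below fits a straight line y = slope * x + intercept; the variable names and the advertising/sales numbers are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: advertising spend (independent) and sales (dependent).
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(ad_spend, sales)
print(slope, intercept)
```

The slope estimates how much the dependent variable changes per unit change in the independent variable; with these numbers the fitted line is y = 2x + 1.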
o Regression
o Analysis of Variance
o Mixed-Effect Models
o Factor Analysis
o Discriminant Analysis
o Survival Analysis
o Visualization: Use of computer graphics to create visual images which aid in the
understanding of complex, often massive representations of data
o Visual Data Mining: discovering implicit but useful knowledge from large data sets using
visualization techniques
Visual data mining discovers implicit and useful knowledge from large data sets using data and/or
knowledge visualization techniques. Visual data mining can be viewed as an integration of two
disciplines: data visualization and data mining. It is also closely related to computer graphics,
multimedia systems, human–computer interaction, pattern recognition, and high-performance
computing.
In general, data visualization and data mining can be integrated in the following ways:
Data visualization: Data in a database or data warehouse can be viewed at different granularity or
abstraction levels, or as different combinations of attributes or dimensions. Data can be presented
in various visual forms, such as boxplots, 3-D cubes, data distribution charts, curves, surfaces, and
link graphs, etc. Visual display can help give users a clear impression and overview of the data
characteristics in a large data set.
Data mining result visualization: Visualization of data mining results is the presentation of the
results or knowledge obtained from data mining in visual forms. Such forms may include scatter
plots and boxplots, as well as decision trees, association rules, clusters, outliers, and generalized
rules.
Data mining process visualization: This type of visualization presents the various processes of data
mining in visual forms so that users can see how the data are extracted and from which database
or data warehouse they are extracted, as well as how the selected data are cleaned, integrated,
preprocessed, and mined. Moreover, it may also show which method is selected for data mining,
where the results are stored, and how they may be viewed.
KDD Vs Data Mining
1. KDD: Knowledge Discovery in Databases (KDD) is a process that automatically discovers
   patterns, rules, and other regular contents in large amounts of data.
   Data Mining: Data mining (DM) is a step in the KDD process that involves applying
   algorithms to extract patterns from data.
2. KDD: KDD is a systematic process for identifying patterns in large and complex data sets.
   Data Mining: Data mining is the foundation of KDD and is essential to the entire methodology.
3. KDD: Overall set of processes for knowledge extraction, like data cleaning, data selection,
   data integration, data mining, pattern evaluation, knowledge presentation.
   Data Mining: Data mining is the process of extraction of hidden knowledge from large data.
   Intelligent algorithms are used to extract useful information like data categorization, data
   characterization, data discrimination, association, frequent pattern mining, regression,
   outlier analysis, classification, clustering, etc.
4. KDD: Contains several steps.
   Data Mining: It is one step in KDD.
5. KDD: Sometimes used as another name for Data Mining.
   Data Mining: Sometimes used as another name for KDD.
DBMS Vs Data Mining
1. DBMS: System to manage the data in a database, like creation, insertion, deletion,
   updating, etc.
   Data Mining: Data mining is the process of extraction of hidden knowledge from large data.
   Intelligent algorithms are used to extract useful information like data categorization, data
   characterization, data discrimination, association, frequent pattern mining, regression,
   outlier analysis, classification, clustering, etc.
2. DBMS: Stores data in a format suitable for data management.
   Data Mining: Data from the database is used for mining.
Major Issues in DM
a. Mining Methodology:
Researchers have been vigorously developing new DM techniques. This involves the
investigation of new kinds of knowledge, mining in multidimensional space, integrating
methods from other disciplines, and the consideration of semantic ties among data objects.
b. User Interaction:
Users play an important role in the DM process. Interesting areas of research include how
to interact with a DMS, how to incorporate a user's background knowledge in mining, and
how to visualize and comprehend data mining results.
c. Efficiency & Scalability:
DM algorithms must be efficient and scalable in order to effectively extract information
from huge amounts of data in many data repositories or in dynamic data streams. In other
words, the running time of a DM algorithm must be short.
d. Diversity of database types:
The discovery of knowledge from different sources of structured or unstructured yet
interconnected data with diverse data semantics poses great challenges to DM.
e. DM & Society:
I. Social Impact of DM:
The improper disclosure or use of data & the potential violations of individual
privacy and data protection rights are areas of concern that need to be addressed.
II. Privacy – Preserving DM:
DM poses a risk of disclosing an individual's personal information. Research in this
area aims to respect data sensitivity and preserve people's privacy while performing
successful DM.
III. Invisible DM:
When purchasing online, the users might be unaware that the store is likely
collecting data on the buying patterns of its customers, which may be used to
recommend other items for purchase in the future.
1. Business Intelligence:
BI technologies provide historical, current, and predictive views of business operations.
Without data mining, many businesses may not be able to perform effective market analysis,
compare customer feedback on similar products, discover the strengths and weaknesses of
competitors, perform predictive analysis, etc.
2. Web Search Engine:
Web search engines are very large DM applications. Various DM tasks such as crawling,
indexing, ranking, and searching are used.
o Telecommunication and many other industries: share many goals and expectations similar
to those of retail data mining
o Other issues
- Data mining in social sciences and social studies: text and social media
Intrusion Detection: Data mining techniques play a vital role in detecting intrusions, network
attacks, and anomalies. These techniques help in selecting and refining useful and relevant
information from large data sets, and in classifying relevant data for an Intrusion Detection
System. An Intrusion Detection System raises alarms when network traffic indicates a foreign
invasion of the system. For example:
Detect security violations
Misuse Detection
Anomaly Detection
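One simple way to illustrate anomaly detection is to flag values that lie far from the mean in units of standard deviation (a z-score test). This is only one of many possible techniques; the traffic numbers and the threshold of 2.0 are assumptions made for illustration.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return the indexes of values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Made-up network traffic: requests per minute, with one sudden burst.
requests_per_minute = [52, 48, 50, 51, 49, 50, 47, 53, 400, 50]

suspicious = find_anomalies(requests_per_minute)
print(suspicious)
```

The burst of 400 requests is the only value flagged. Misuse detection, by contrast, would match traffic against known attack signatures rather than look for deviations from normal behavior.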
- Content-based: Recommends items that are similar to items the user preferred or
queried in the past
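A content-based recommendation can be sketched by describing each item with a feature vector and recommending the items most similar to one the user liked. The movie names, the three genre features, and the use of cosine similarity as the similarity measure are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Made-up items described by features: [action, comedy, romance].
items = {
    "Movie A": [1, 0, 0],
    "Movie B": [1, 1, 0],
    "Movie C": [0, 0, 1],
}

def recommend(liked, items, top_n=1):
    """Recommend the items most similar to the one the user liked."""
    scores = {name: cosine(items[liked], vec)
              for name, vec in items.items() if name != liked}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("Movie A", items))
```

Movie B shares the action feature with Movie A, so it ranks above Movie C, which shares none.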
Business Transactions: Every transaction in the business world is recorded permanently. Such
transactions are usually time-related and can be inter-business deals or intra-business operations.
Using this data effectively and within a reasonable time frame for competitive decision-making
is one of the most important problems for businesses that struggle to survive in a highly
competitive world. Data mining helps to analyze these business transactions, identify
marketing approaches, and support decision-making. Example:
Direct mail targeting
Stock trading
Customer segmentation
Market Basket Analysis: Market basket analysis is a technique for carefully studying the
purchases made by customers in a supermarket. It identifies patterns of items that customers
frequently purchase together. Companies can use this analysis to promote deals, offers, and
sales, and data mining techniques help to carry it out. Example:
Data mining concepts are used in sales and marketing to provide better customer
service, to improve cross-selling opportunities, and to increase direct mail response rates.
Customer retention, in the form of pattern identification and prediction of likely
defections, is possible with data mining.
Risk assessment and fraud detection also use data mining to identify
inappropriate or unusual behavior, etc.
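Market basket analysis is commonly expressed through association rules such as "milk => bread", measured by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for made-up baskets; the function names and the data are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies in this data
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

# Made-up supermarket baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread", "butter"},
]

rule_support = support(baskets, {"milk", "bread"})
rule_confidence = confidence(baskets, {"milk"}, {"bread"})
print(rule_support, rule_confidence)
```

Here milk and bread occur together in half of the baskets, and two of the three baskets containing milk also contain bread, which is the kind of evidence a store might use for a cross-selling offer.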
Education: For analyzing the education sector, data mining uses the Educational Data Mining
(EDM) method. This method generates patterns that can be used by both learners and educators.
Using EDM, we can perform educational tasks such as:
Predicting student admission in higher education
Student profiling
Predicting student performance
Evaluating teachers' teaching performance
Curriculum development
Predicting student placement opportunities
Research: Data mining techniques can perform prediction, classification, clustering,
association, and grouping of data in the research area. The rules generated by data mining
are specific to the data set under study. In most technical research in data mining, we create a
training model and a testing model. The train/test approach is a strategy to measure the
accuracy of the proposed model. It is called train/test because we split the data set into two
sets: a training data set and a testing data set. The training data set is used to build the model,
whereas the testing data set is used to evaluate it. Example:
Classification of uncertain data.
Information-based clustering.
Decision support system
Web Mining
Domain-driven data mining
IoT (Internet of Things) and Cybersecurity
Smart farming IoT (Internet of Things)
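The train/test strategy described under Research can be sketched as follows: shuffle the data, hold back part of it for testing, build a model on the rest, and measure how often the model is right on the held-back part. The 25% test ratio, the made-up scores, and the very simple threshold model are assumptions made for illustration.

```python
import random
from statistics import mean

def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle the rows and split them into (training set, testing set)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]

def fit_threshold(train_rows):
    """'Train' a one-feature model: a cut-off halfway between the class means."""
    low_mean = mean(x for x, label in train_rows if label == "low")
    high_mean = mean(x for x, label in train_rows if label == "high")
    cutoff = (low_mean + high_mean) / 2
    return lambda x: "high" if x >= cutoff else "low"

def accuracy(model, test_rows):
    """Fraction of testing rows whose predicted label matches the true label."""
    return sum(model(x) == label for x, label in test_rows) / len(test_rows)

# Made-up data: a score and its label.
data = [(score, "high" if score >= 50 else "low") for score in range(0, 100, 5)]

train_rows, test_rows = train_test_split(data)
model = fit_threshold(train_rows)
print(len(train_rows), len(test_rows), accuracy(model, test_rows))
```

Because the testing rows are never seen while the model is built, the accuracy on them gives a fairer estimate of how the model would behave on new data.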
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity
and its outcomes to improve the targeting of high-value physicians and to work out which
marketing activities will have the best effect in the upcoming months. In the insurance
sector, data mining can help to predict which customers will buy new policies, identify
behavior patterns of risky customers, and identify fraudulent behavior of customers.
Claims analysis, i.e. which medical procedures are claimed together.
Identifying successful medical therapies for different illnesses.