Computer Science 3rd Year Specialization
11ans) The basic concept of a Data Warehouse is to provide a single version of the truth for a company for decision making and forecasting. A data warehouse is an information system that contains historical and cumulative data from single or multiple sources. The data warehouse concept simplifies an organization's reporting and analysis processes.
The four key characteristics of a data warehouse are:
Subject-Oriented
Integrated
Time-variant
Non-volatile
Subject-Oriented
A data warehouse is subject-oriented because it offers information about a theme rather than a company's ongoing operations. These subjects can be sales, marketing, distribution, etc.
A data warehouse never focuses on ongoing operations. Instead, it puts emphasis on the modeling and analysis of data for decision making. It also provides a simple and concise view of a specific subject by excluding data that is not helpful to the decision process.
Integrated
In a data warehouse, integration means the establishment of a common unit of measure for all similar data from dissimilar databases. The data also needs to be stored in the data warehouse in a common and universally acceptable manner.
A data warehouse is developed by integrating data from varied sources such as mainframes, relational databases, flat files, etc. Moreover, it must keep consistent naming conventions, formats, and coding.
This integration helps in effective analysis of data. Consistency in naming conventions, attribute measures, encoding structures, etc. has to be ensured. Consider the following example:
In this example, there are three different applications labeled A, B, and C. The information stored in these applications includes Gender, Date, and Balance. However, each application stores its data in a different way.
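As a minimal sketch of such integration (the per-application formats below are invented for illustration, not taken from any real system), the snippet normalizes Gender, Date, and Balance values from three sources into one common warehouse representation:

```python
from datetime import datetime

# Hypothetical per-application formats (assumptions for illustration):
# App A: gender "M"/"F",       date "MM/DD/YYYY", balance in dollars
# App B: gender "1"/"0",       date "DD-MM-YYYY", balance in cents
# App C: gender "male"/"female", date "YYYY.MM.DD", balance in dollars

GENDER_MAP = {"M": "M", "F": "F", "1": "M", "0": "F", "male": "M", "female": "F"}
DATE_FORMATS = {"A": "%m/%d/%Y", "B": "%d-%m-%Y", "C": "%Y.%m.%d"}

def integrate(app, gender, date, balance):
    """Convert one record into the warehouse's common representation."""
    return {
        "gender": GENDER_MAP[gender],                        # common coding
        "date": datetime.strptime(date, DATE_FORMATS[app]).date().isoformat(),
        "balance": balance / 100 if app == "B" else balance  # common unit: dollars
    }

print(integrate("B", "1", "25-12-2023", 150000))
# {'gender': 'M', 'date': '2023-12-25', 'balance': 1500.0}
```

The point is the common unit of measure and coding: whatever a source calls "1" or "male", the warehouse stores one consistent value.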
16ans) Data mining is one of the most widely used methods to extract data from different sources and organize it for better usage. In spite of the availability of different commercial data mining systems, many challenges come up when they are actually implemented. With the rapid evolution of the field, companies are expected to stay abreast of all new developments.
Complex algorithms form the basis of data mining, as they allow for data segmentation to identify trends and patterns, detect variations, and predict the probabilities of various events. The raw data may come in both analog and digital formats, depending on the source of the data. Companies need to keep track of the latest data mining trends and stay updated to do well in the industry and overcome challenging competition.
Definition: In simple words, data mining is defined as a process used to extract usable data from a larger set of raw data. It implies analyzing data patterns in large batches of data using one or more pieces of software. Data mining has applications in multiple fields, such as science and research. As an application of data mining, businesses can learn more about their customers, develop more effective strategies for various business functions, and in turn leverage resources in a more optimal and insightful manner. This helps businesses get closer to their objectives and make better decisions. Data mining involves effective data collection and warehousing as well as computer processing. For segmenting the data and evaluating the probability of future events, data mining uses sophisticated mathematical algorithms. Data mining is also known as Knowledge Discovery in Data (KDD).
• Clustering: based on finding and visually documenting groups of facts not previously known.
The Data Mining Process: Technological Infrastructure Required:
1. Database Size: To create a more powerful system, more data needs to be processed and maintained.
2. Query Complexity: The more complex the queries and the greater the number of queries, the more powerful the system required.
Uses:
1. Data mining techniques are useful in many research projects, including mathematics, cybernetics, genetics, and marketing.
2. With data mining, a retailer could manage and use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. The retailer could also develop products and promotions that appeal to specific customer segments based on mining demographic data from comment or warranty cards.
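The retailer use case above can be sketched very simply (the purchase records and promotion rules here are invented for illustration): group point-of-sale records by customer and choose a promotion from each customer's most frequent category.

```python
from collections import Counter

# Hypothetical point-of-sale records: (customer_id, product_category)
sales = [
    ("c1", "garden"), ("c1", "garden"), ("c1", "kitchen"),
    ("c2", "toys"), ("c2", "toys"), ("c3", "kitchen"),
]

# Hypothetical promotions per category (an assumption, not real data).
promotions = {"garden": "10% off seeds", "toys": "buy one get one", "kitchen": "free spatula"}

def targeted_promotions(records):
    """Pick a promotion from each customer's most frequent purchase category."""
    by_customer = {}
    for customer, category in records:
        by_customer.setdefault(customer, Counter())[category] += 1
    return {c: promotions[counts.most_common(1)[0][0]]
            for c, counts in by_customer.items()}

print(targeted_promotions(sales))
# {'c1': '10% off seeds', 'c2': 'buy one get one', 'c3': 'free spatula'}
```

A real system would mine far richer patterns, but the shape is the same: segment by history, then act on the segments.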
The volume of information we can handle is increasing every day, from business transactions, scientific data, sensor data, pictures, videos, etc. So, we need a system capable of extracting the essence of the information available and automatically generating reports, views, or summaries of the data for better decision making.
Why is Data Mining used in Business?
Data mining is used in business to make better managerial decisions by:
Automatic summarization of data
Extracting the essence of the information stored
Discovering patterns in raw data
Data Mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
Cleaning in the case of missing values.
Cleaning noisy data, where noise is a random or variance error.
Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common source (data warehouse).
Data integration using data migration tools.
Data integration using data synchronization tools.
Data integration using the ETL (Extract-Transform-Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided upon and retrieved from the data collection.
Data selection using neural networks.
Data selection using decision trees.
Data selection using Naive Bayes.
Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming data into the form required by the mining procedure.
Data transformation is a two-step process:
Data mapping: assigning elements from the source base to the destination to capture transformations.
Code generation: creation of the actual transformation program.
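The four steps above can be sketched end to end; everything here (the records, the selection rule, the transformation) is an illustrative assumption rather than a real pipeline:

```python
# Minimal sketch of the cleaning -> integration -> selection -> transformation
# stages of the KDD process; the data and rules are invented for illustration.

source_a = [{"name": "Ann", "amount": "120"}, {"name": None, "amount": "35"}]
source_b = [{"name": "Bob", "amount": "abc"}, {"name": "Eve", "amount": "990"}]

def clean(records):
    """1. Data cleaning: drop rows with missing names or non-numeric amounts."""
    return [r for r in records if r["name"] and r["amount"].isdigit()]

def integrate(*sources):
    """2. Data integration: combine the cleaned sources into one collection."""
    return [r for s in sources for r in clean(s)]

def select(records, threshold=100):
    """3. Data selection: keep only the records relevant to the analysis."""
    return [r for r in records if int(r["amount"]) >= threshold]

def transform(records):
    """4. Data transformation: map into the form the mining step needs."""
    return [(r["name"], int(r["amount"])) for r in records]

print(transform(select(integrate(source_a, source_b))))
# [('Ann', 120), ('Eve', 990)]
```

Each function stands in for a whole family of tools (discrepancy detection, ETL, etc.); the point is only the order and role of the stages.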
17ans) A query language (QL) refers to any computer programming language that requests and retrieves data from database and information systems by sending queries. It works on user-entered, structured, formal programming commands to find and extract data from host databases.
A query language may also be termed a database query language.
A query language is primarily created for creating, accessing, and modifying data in a database management system (DBMS). Typically, a QL requires users to input a structured command that is similar and close to an English-language querying construct.
For example, the SQL query SELECT * FROM Customer will retrieve all data from the Customer records/table.
The simple programming context makes it one of the easiest programming languages to learn. There are several different variants of QL, and it has wide implementation in various database-centered services, such as extracting data from deductive and OLAP databases, providing API-based access to remote applications and services, and more.
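As a concrete sketch (the table and rows are invented), the SQL example above can be run against an in-memory SQLite database using Python's standard library:

```python
import sqlite3

# Build a throwaway in-memory database with a hypothetical Customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")
conn.executemany("INSERT INTO Customer (name, balance) VALUES (?, ?)",
                 [("Ann", 120.0), ("Bob", 35.5)])

# The query from the text: retrieve all data from the Customer table.
rows = conn.execute("SELECT * FROM Customer").fetchall()
print(rows)  # [(1, 'Ann', 120.0), (2, 'Bob', 35.5)]
conn.close()
```

The structured command reads close to English, which is what makes query languages comparatively easy to learn.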
Section-C
19ans) Designing Graphical User Interfaces Based on a Data Mining Query Language
Inexperienced users may find a data mining query language awkward to use and its syntax difficult to remember. Instead, users may prefer to communicate with data mining systems through a GUI. Such a GUI typically consists of the following components:
Data collection and data mining query composition: This component allows the user to specify task-relevant data sets and to compose data mining queries.
Presentation of discovered patterns: This component allows the display of the discovered patterns in various forms, including tables, graphs, charts, curves, and other visualization techniques.
Hierarchy specification and manipulation: This component allows for concept hierarchy specification, either manually by the user or automatically. It also allows for the modification of concept hierarchies by the user or automatically.
Manipulation of data mining primitives: This component allows for the dynamic adjustment of data mining thresholds, as well as the selection, display, and modification of discovered patterns, debugging, etc.
20ans) Data mining is a very important process where potentially useful and previously unknown information is
extracted from large volumes of data. There are a number of components involved in the data mining process.
These components constitute the architecture of a data mining system.
Data Mining Architecture
The major components of any data mining system are data source, data warehouse server, data mining engine,
pattern evaluation module, graphical user interface and knowledge base.
a) Data Sources
Database, data warehouse, World Wide Web (WWW), text files and other documents are the actual sources of
data. You need large volumes of historical data for data mining to be successful. Organizations usually store data
in databases or data warehouses. Data warehouses may contain one or more databases, text files,
spreadsheets or other kinds of information repositories. Sometimes, data may reside even in plain text files or
spreadsheets. The World Wide Web, or the Internet, is another big source of data.
Different Processes
The data needs to be cleaned, integrated and selected before passing it to the database or data warehouse
server. As the data is from different sources and in different formats, it cannot be used directly for the data mining
process because the data might not be complete and reliable. So, first data needs to be cleaned and integrated.
Again, more data than required will be collected from different data sources and only the data of interest needs to
be selected and passed to the server. These processes are not as simple as they might seem. A number of techniques may be performed on the data as part of cleaning, integration, and selection.
b) Database or Data Warehouse Server
The database or data warehouse server contains the actual data that is ready to be processed. Hence, the
server is responsible for retrieving the relevant data based on the data mining request of the user.
c) Data Mining Engine
The data mining engine is the core component of any data mining system. It consists of a number of modules for
performing data mining tasks including association, classification, characterization, clustering, prediction, time-
series analysis etc.
d) Pattern Evaluation Module
The pattern evaluation module is mainly responsible for the measure of interestingness of the pattern by using a
threshold value. It interacts with the data mining engine to focus the search towards interesting patterns.
e) Graphical User Interface
The graphical user interface module communicates between the user and the data mining system. This module
helps the user use the system easily and efficiently without knowing the real complexity behind the process.
When the user specifies a query or a task, this module interacts with the data mining system and displays the
result in an easily understandable manner.
f) Knowledge Base
The knowledge base is helpful in the whole data mining process. It might be useful for guiding the search or
evaluating the interestingness of the result patterns. The knowledge base might even contain user beliefs and
data from user experiences that can be useful in the process of data mining. The data mining engine might get
inputs from the knowledge base to make the result more accurate and reliable. The pattern evaluation module
interacts with the knowledge base on a regular basis to get inputs and also to update it.
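A toy sketch of how these components interact (the classes, transaction data, and support threshold below are all illustrative assumptions, and the "engine" merely counts item frequencies as a stand-in for real mining modules):

```python
from collections import Counter

class WarehouseServer:
    """Database/data warehouse server: retrieves the relevant data."""
    def __init__(self, transactions):
        self.transactions = transactions
    def fetch_relevant(self):
        return self.transactions

class MiningEngine:
    """Data mining engine: here it just computes item support (frequency),
    standing in for association/classification/clustering modules."""
    def mine(self, transactions):
        counts = Counter(item for t in transactions for item in t)
        return {item: n / len(transactions) for item, n in counts.items()}

class PatternEvaluator:
    """Pattern evaluation module: filters by an interestingness threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
    def interesting(self, patterns):
        return {p: s for p, s in patterns.items() if s >= self.threshold}

server = WarehouseServer([{"milk", "bread"}, {"milk"},
                          {"bread", "eggs"}, {"milk", "eggs"}])
patterns = MiningEngine().mine(server.fetch_relevant())
print(PatternEvaluator(threshold=0.5).interesting(patterns))
```

In a full system, a GUI would sit in front of these classes and a knowledge base would feed both the engine and the evaluator; here only the server → engine → evaluator flow is shown.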
Logical design is what you draw with a pen and paper or design with Oracle Warehouse Builder or Oracle
Designer before building your data warehouse. Physical design is the creation of the database with SQL
statements.
During the physical design process, you convert the data gathered during the logical design phase into a
description of the physical database structure. Physical design decisions are mainly driven by query performance
and database maintenance aspects. For example, choosing a partitioning strategy that meets common query
requirements enables Oracle Database to take advantage of partition pruning, a way of narrowing a search
before performing it.
Physical Design
During the logical design phase, you defined a model for your data warehouse consisting of entities, attributes,
and relationships. The entities are linked together using relationships. Attributes are used to describe the entities.
The unique identifier (UID) distinguishes between one instance of an entity and another.
Figure 3-1 illustrates a graphical way of distinguishing between logical and physical designs.
During the physical design process, you translate the expected schemas into actual database structures. At this
time, you must map:
Entities to tables
Relationships to foreign key constraints
Attributes to columns
Primary unique identifiers to primary key constraints
Unique identifiers to unique key constraints
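The mapping above can be sketched with DDL; the schema below is a hypothetical example (executed here through SQLite for concreteness, not Oracle-specific syntax):

```python
import sqlite3

# Hypothetical logical model: Customer and Orders entities, a one-to-many
# relationship, attributes, and unique identifiers mapped to physical DDL.
ddl = """
CREATE TABLE customer (                      -- entity -> table
    customer_id INTEGER PRIMARY KEY,         -- primary UID -> primary key constraint
    email       TEXT UNIQUE,                 -- UID -> unique key constraint
    name        TEXT                         -- attribute -> column
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),  -- relationship -> FK constraint
    total       REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['customer', 'orders']
conn.close()
```

Each comment marks which logical-design element the physical construct realizes.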
Once you have converted your logical design to a physical one, you must create some or all of the following
structures: