Data Warehouse
Parallel DBMS features - Scope and techniques of parallel DBMS operations. Optimizer implementation. Application transparency. A parallel environment that allows the DBMS server to take full advantage of the existing facilities at a very low level. DBMS management tools that help to configure, tune, administer and monitor a parallel RDBMS as effectively as if it were a serial RDBMS. Parallel DBMS vendors - Oracle: Parallel Query Option (PQO). Informix: eXtended Parallel Server (XPS). IBM: DB2 Parallel Edition (DB2 PE). SYBASE: SYBASE MPP.

4) Data Warehouse vs DBMS - Database - A common database is based on operational or transactional processing. Each operation is an indivisible transaction. Generally, a database stores current and up-to-date data which is used for daily operations. A database is generally application specific. Example - a database stores related data, such as the student details in a school. Constructing a database is not so expensive.
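As a rough illustration of the parallel-query idea mentioned above (a single query decomposed into parts that run in parallel and are then merged), the following Python sketch sums a hypothetical sales table partition by partition. The partitioning scheme, table layout, and worker count are assumptions made for illustration only; real parallel DBMSs such as those listed above do this inside the server, transparently to the application.

```python
# Toy illustration of parallel query decomposition: a single aggregate
# "query" (total sales) is split into partial aggregates, one per data
# partition, run in parallel, and then merged. The partitions and table
# layout are hypothetical; parallel DBMSs perform this decomposition
# inside the server, transparently to the application.
from concurrent.futures import ProcessPoolExecutor

def partial_sum(partition):
    """Worker: aggregate one horizontal partition of the sales data."""
    return sum(row["amount"] for row in partition)

def parallel_total(partitions, workers=4):
    """Coordinator: fan out partial aggregates, then merge the results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, partitions))

if __name__ == "__main__":
    # Three hypothetical partitions of a sales table.
    partitions = [
        [{"amount": 10.0}, {"amount": 20.0}],
        [{"amount": 5.0}],
        [{"amount": 7.5}, {"amount": 2.5}],
    ]
    print(parallel_total(partitions))  # 45.0
```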
1) Data warehouse planning - Warehouse planning is the process of designing a facility's space with maximum efficiency in mind. The layout must account for the movement of materials, equipment placement, and the flow of traffic.

2) Steps in data warehouse implementation - Requirements analysis and capacity planning: The first process in data warehousing involves defining enterprise needs, defining architectures, carrying out capacity planning, and selecting the hardware and software tools. This step involves consulting senior management as well as the different stakeholders. Hardware integration: Once the hardware and software have been selected, they need to be put together by integrating the servers, the storage methods, and the user software tools. Modeling: Modeling is a significant stage that involves designing the warehouse schema and views. This may involve using a modeling tool if the data warehouse is sophisticated. Physical modeling: For the data warehouse to perform efficiently, physical modeling is needed. This involves designing the physical data warehouse organization, data placement, data partitioning, and deciding on access techniques and indexing. Sources: The information for the data warehouse is likely to come from several data sources. This step involves identifying and connecting the sources using gateways, ODBC drivers, or other wrappers. ETL: The data from the source systems will need to go through an ETL phase. The process of designing and implementing the ETL phase may involve selecting suitable ETL tool vendors and purchasing and implementing the tools. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools will be needed, perhaps using a staging area. User applications: For the data warehouse to be helpful, there must be end-user applications. This step involves designing and implementing the applications required by the end users. Roll-out the warehouse and applications: Once the data warehouse has been populated and the end-client applications tested, the warehouse system and the operations may be rolled out for the user community to use.

3) Implementation guidelines - Build incrementally: Data warehouses must be built incrementally. Generally, it is recommended that a data mart be created with one particular project in mind; once it is implemented, several other sections of the enterprise may also want to implement similar systems. Need a champion: A data warehouse project must have a champion who is willing to carry out considerable research into the expected costs and benefits of the project. Data warehousing projects require inputs from many units in an enterprise and therefore need to be driven by someone who is capable of interacting with people across the enterprise and can actively persuade colleagues. Senior management support: A data warehouse project must be fully supported by senior management. Given the resource-intensive nature of such projects and the time they can take to implement, a warehouse project calls for a sustained commitment from senior management. Ensure quality: Only data that has been cleaned and is of a quality accepted by the organization should be loaded into the data warehouse. Corporate strategy: A data warehouse project must fit with corporate strategy and business goals. The purpose of the project must be defined before the project begins. Business plan: The financial costs (hardware, software, and peopleware), expected benefits, and a project plan for a data warehouse project must be clearly outlined and understood by all stakeholders. Training: Data warehouse projects must not overlook training requirements. For a data warehouse project to be successful, the users must be trained to use the warehouse and to understand its capabilities. Adaptability: The project should build in flexibility so that changes may be made to the data warehouse if and when required. Like any system, a data warehouse will need to change as the needs of the enterprise change. Joint management: The project must be handled by both IT and business professionals in the enterprise.

4) Hardware and operating system used in a data warehouse - Hardware and operational design covers server hardware, network hardware, parallel technology, the impact of security on hardware design, backup and recovery, service level agreements, and operating the data warehouse. Parallel hardware technology: Symmetric Multi-Processing (SMP) - An SMP machine is a set of CPUs that share memory and disk. This is sometimes called a shared-everything environment. The CPUs in an SMP machine are all equal; a process can run on any CPU in the machine and may run on different CPUs at different times. Scalability of SMP machines - the length of the communication bus connecting the CPUs is a natural limit. As the bus gets longer, interprocessor communication speeds become slower, and each extra CPU imposes an extra bandwidth load on the bus, increases memory contention, and so on. Example - if the database software supports parallel queries, a single query can be decomposed and its separate parts processed in parallel, which makes query performance scalable. Massively Parallel Processing (MPP) - An MPP machine is made up of many loosely coupled nodes linked together by a high-speed connection. Each node has its own memory, and the disks are not shared. Most MPP systems allow a disk to be dual-connected between two nodes, which protects against an individual node failure causing disks to be unavailable. Cluster technology - A cluster is a set of loosely coupled SMP machines connected by a high-speed interconnect. Each machine has its own CPUs and memory, but they share access to disk, so these systems are called shared-disk systems. Each machine in the cluster is called a node. The aim of the cluster is to mimic a single larger machine; in this pseudo single machine, resources such as shared disk must be managed in a distributed fashion.

5) Criteria for selecting a data warehouse - Experience and expertise. Range of services. Scalability. Security and compliance. Cost. Customer support. Customization.

1) Data mining - Data mining is the process of searching and analyzing a large batch of raw data in order to identify patterns and extract useful information. Companies use data mining software to learn more about their customers.

2) Key features of data mining - Focus attribute: The focus attribute is the variable that the algorithm attempts to predict or model based on other attributes. Aggregation: An aggregation is a potent tool used in data mining to summarize and reduce massive amounts of information into more manageable and useful forms. Discretization: Using discretization, continuous features are turned into discrete ones by having their range divided into intervals or bins. Value mapping: Using established mappings or rules, value mapping translates the values of a feature from one set to another. Calculation: Calculation entails deriving new properties or features from existing data using logical or mathematical operations.

3) Issues - Mining methodology and user interaction issues: Mining different kinds of knowledge in databases − different users may be interested in different kinds of knowledge, so data mining needs to cover a broad range of knowledge discovery tasks. Interactive mining of knowledge at multiple levels of abstraction − the data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on the returned results. Incorporation of background knowledge − background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction. Data mining query languages and ad hoc data mining − a data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining. Presentation and visualization of data mining results − once patterns are discovered, they need to be expressed in high-level languages and visual representations that are easily understandable. Handling noisy or incomplete data − data cleaning methods are required to handle noise and incomplete objects while mining data regularities; without them, the accuracy of the discovered patterns will be poor. Pattern evaluation − the patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are of little value. Performance issues: Efficiency and scalability of data mining algorithms − in order to effectively extract information from huge amounts of data in databases, data mining algorithms must be efficient and scalable. Parallel, distributed, and incremental mining algorithms − factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. Diverse data types issues: Handling of relational and complex types of data − the database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc.; it is not possible for one system to mine all these kinds of data. Mining information from heterogeneous databases and global information systems − the data is available at different data sources on a LAN or WAN, and these data sources may be structured, semi-structured, or unstructured.

4) Knowledge extraction process (KDD) - KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets. The KDD process is iterative and requires multiple passes over its steps to extract accurate knowledge from the data. The following steps are included in the KDD process (a minimal Python sketch of these steps follows below): Data cleaning - removal of noisy and irrelevant data from the collection: cleaning of missing values, cleaning of noisy data (where noise is random or variance error), and cleaning with data discrepancy detection and data transformation tools. Data integration - heterogeneous data from multiple sources is combined into a common source (a data warehouse), using data migration tools, data synchronization tools, and the ETL (Extract-Transform-Load) process. Data selection - the process where data relevant to the analysis is decided upon and retrieved from the data collection; techniques such as neural networks, decision trees, naive Bayes, clustering, and regression methods can then be applied. Data transformation - the process of transforming data into the form required by the mining procedure; it is a two-step process: 1) data mapping: assigning elements from the source base to the destination to capture transformations, and 2) code generation: creation of the actual transformation program. Data mining - techniques that are applied to extract potentially useful patterns; it transforms task-relevant data into patterns and decides the purpose of the model, using classification or characterization. Pattern evaluation - identifying interesting patterns representing knowledge based on given interestingness measures; it finds the interestingness score of each pattern and uses summarization and visualization to make the data understandable to the user. Knowledge representation - this involves presenting the results in a way that is meaningful and can be used to make decisions.
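A minimal sketch of these KDD steps in Python, assuming pandas and scikit-learn are available; the tables, column names, and the choice of k-means clustering as the mining step are illustrative assumptions, not part of the notes above.

```python
# Minimal sketch of the KDD steps described above, assuming pandas and
# scikit-learn are installed. The tables, column names, and the choice
# of k-means as the "data mining" step are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical source tables to integrate.
orders = pd.DataFrame({"cust_id": [1, 1, 2, 3, 3],
                       "amount": [20.0, 35.0, None, 50.0, 55.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3], "age": [25, 40, 31]})

# 1. Data cleaning: drop rows with missing values and duplicates.
orders = orders.dropna().drop_duplicates()

# 2. Data integration: combine the sources into one dataset.
data = orders.merge(customers, on="cust_id")

# 3. Data selection: keep only the attributes relevant to the analysis.
selected = data[["amount", "age"]]

# 4. Data transformation: scale features to zero mean and unit variance.
X = StandardScaler().fit_transform(selected)

# 5. Data mining: extract patterns, here simple customer segments.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# 6. Pattern evaluation / 7. knowledge representation: summarize segments.
data["segment"] = labels
print(data.groupby("segment")[["amount", "age"]].mean())
```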
5) Classification of Data Mining Systems-Data mining refers to the process of extracting important data from raw
data. It analyses data patterns in huge sets of data with the help of software tools. To understand the
system and meet the desired requirements, data mining can be classified into the following systems:
Classification Based on the mined Databases-A data mining system can be classified based on the types of
databases that have been mined. A database system can be further segmented based on distinct principles,
such as data models, types of data, etc. Classification Based on the type of Knowledge Mined-A data mining
system categorized based on the kind of knowledge mined may have the following functionalities:
Characterization. Discrimination. Association and Correlation Analysis. Classification. Prediction. Outlier
Analysis. Evolution Analysis. Classification Based on the Techniques Utilized-A data mining system can also be
classified based on the type of techniques being incorporated. These techniques can be assessed based on the degree of user interaction involved or the methods of analysis employed. Classification Based on the Applications Adapted - Data mining systems classified based on the applications adapted are as follows: Finance. Telecommunications. DNA. Stock markets. E-mail. 6) Data Integration - It has been an
integral part of data operations because data can be obtained from several sources. It is a strategy that integrates
data from several sources to make it available to users in a single uniform view. Data
Integration Approaches-No Coupling − In this scheme, the data mining system does not utilize any of the
database or data warehouse functions. It fetches the data from a particular source and processes that data using
some data mining algorithms. The data mining result is stored in another file. Loose Coupling − In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores
the mining result either in a file or in a designated place in a database or in a data warehouse.Semi−tight
Coupling − In this scheme, the data mining system is linked with a database or a data warehouse system and in
addition to that, efficient implementations of a few data mining primitives can be provided in the database.Tight
coupling − In this coupling scheme, the data mining system is smoothly integrated into the database or data
warehouse system. The data mining subsystem is treated as one functional component of an information
system. 7) Different forms or steps of data preprocessing or data consolidation involve - Data Cleaning: This involves
identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and
duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and
transformation.Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different formats, structures,
and semantics. Techniques such as record linkage and data fusion can be used for data integration. Data
Transformation: This involves converting the data into a suitable format for analysis. Common techniques
used in data transformation include normalization, standardization, and discretization. Normalization is
used to scale the data to a common range, while standardization is used to transform the data to have zero
mean and unit variance. Discretization is used to convert continuous data into discrete categories (a short Python sketch of these transformations appears at the end of this section). Data
Reduction: This involves reducing the size of the dataset while preserving the important information. Data
reduction can be achieved through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset, while feature extraction involves
transforming the data into a lower-dimensional space while preserving the important information.
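A short pure-Python sketch of the transformation steps described above: min-max normalization, z-score standardization, and equal-width discretization, with the binned values also smoothed by bin means (a technique discussed under noisy-data handling later in these notes). The sample values are made up for illustration.

```python
# Small, pure-Python sketch of common transformation steps: min-max
# normalization, z-score standardization, and equal-width discretization
# with smoothing by bin means. The sample values are hypothetical.
from statistics import mean, pstdev

values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Min-max normalization: rescale to the range [0, 1].
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: zero mean, unit variance.
mu, sigma = mean(values), pstdev(values)
zscores = [(v - mu) / sigma for v in values]

# Equal-width discretization into 3 bins, then smoothing by bin means:
# every value in a bin is replaced by that bin's mean.
n_bins = 3
width = (hi - lo) / n_bins
bin_index = [min(int((v - lo) / width), n_bins - 1) for v in values]
bins = [[] for _ in range(n_bins)]
for v, i in zip(values, bin_index):
    bins[i].append(v)
smoothed = [round(mean(bins[i]), 2) for i in bin_index]

print([round(x, 2) for x in minmax])
print([round(x, 2) for x in zscores])
print(smoothed)
```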
8) Data consolidation - Data consolidation is the process of combining data from multiple sources, cleaning and verifying it by removing errors, and storing it in a single location, such as a data warehouse or database. Data is produced from various sources and in multiple formats in every business.

9) Managing noisy data, or the smoothing techniques followed in the data cleaning process - Binning: Binning is a technique where we sort the data and then partition it into equal-frequency bins. You may then replace the noisy data with the bin mean, the bin median, or the bin boundary. Smoothing by bin means: the values in the bin are replaced by the mean value of the bin. Smoothing by bin medians: the values in the bin are replaced by the median value. Smoothing by bin boundaries: the minimum and maximum values of the bin are taken, and each value is replaced by the closest boundary value. Regression: This is used to smooth the data and helps handle data when unnecessary data is present. For the purpose of analysis, regression helps decide the suitable variable. Linear regression refers to finding the best line to fit between two variables so that one can be used to predict the other; multiple linear regression involves more than two variables. Using regression to find a mathematical equation that fits the data helps to smooth out the noise. Clustering: This is used for finding outliers and also for grouping the data; clustering is generally used in unsupervised learning. Outlier analysis: Outliers may be detected by clustering, where similar or close values are organized into the same groups or clusters. Thus, values that fall far apart from the clusters may be considered noise or outliers. Outliers are extreme values that deviate from other observations in the data. Univariate outliers can be found when looking at a distribution of values in a single feature space. Point outliers are single data points that lie far from the rest of the distribution. Contextual outliers can be noise in data, such as punctuation symbols when performing text analysis or background noise when doing speech recognition. Collective outliers can be subsets of novelties in data, such as a signal that may indicate the discovery of new phenomena.

10) Different strategies for data cleaning - Data migration: Data migration is a useful method of transferring data from one device to another. Although this can sound very simple, a shift in storage and database or program is involved. Any data migration would require at least the transform and load phases in the sense of the extract/transform/load process. For application-specific migrations, such as platform upgrades, replication of databases and file copying, host-based software is best. Array-based software is mostly used to transfer data between related structures. Depending on their configuration, network appliances migrate volumes, files or blocks of data. Data scrubbing: The data cleaning process detects and removes errors and anomalies and improves data quality. Data quality problems arise due to misspellings during data entry, missing values, or other invalid data. In basic terms, data scrubbing is the process of guaranteeing an accurate and correct collection of information. This process is especially important for companies that rely on electronic data during the operation of their business. During the process, several tools are used to check the stability and accuracy of documents. Data auditing: Data auditing is the assessment of data for quality throughout its lifecycle to ensure its accuracy and efficacy for a specific usage. Data performance is measured and issues are identified for remediation. Data auditing results in better data quality, which enables enhanced analytics to improve operations.

11) Steps for cleaning data - Remove duplicate or irrelevant observations: Remove duplicate or pointless observations as well as undesirable observations from your dataset. The majority of duplicate observations will occur during data gathering. Duplicate data can be produced when you merge data sets from several sources, scrape data, or get data from clients or other departments. One of the most important factors to take into account in this procedure is de-duplication. Fix structural errors: When you measure or transfer data and find odd naming practices, typos, or wrong capitalization, these are structural faults. Mislabelled categories or classes may result from these inconsistencies. Filter unwanted outliers: There will frequently be isolated findings that, at first glance, do not seem to fit the data you are analyzing. Removing an outlier if you have a good reason to, such as incorrect data entry, will improve the performance of the data you are working with. Handle missing data: Because many algorithms won't tolerate missing values, you can't overlook missing data. There are a few options for handling missing data. Validate and QA: Inaccurate or noisy data can lead to false conclusions that inform poor company strategy and decision-making. False conclusions can result in a humiliating situation in a reporting meeting when you find out your data couldn't withstand further investigation. Establishing a culture of quality
data in your organization is crucial. The tools you might employ to develop this plan should be
documented to achieve this.12) Dimensionality Reduction-In dimensionality reduction, data encoding or
data transformations are applied to obtain a reduced or compressed form of the original data. It can be used to remove irrelevant or redundant attributes. In this method, some data may be lost, but it is typically irrelevant data. Methods
for dimensionality reduction are:Wavelet transformations.Principal Component Analysis. The components
of dimensionality reduction are feature selection and feature extraction. It leads to less misleading data and
more model accuracy.13) Numerosity Reduction-In Numerosity reduction, data volume is reduced by
choosing suitable alternative forms of data representation. It merely represents the original data in a smaller form. In this method, there is no loss of data. Methods for numerosity reduction are: regression or log-linear models (parametric); histograms, clustering, sampling (non-parametric). It has
no components but methods that ensure reduction of data volume. It preserves the integrity of data and the
data volume is also reduced.14) hierarchy generation for categorical data-Categorical data are discrete
data. Categorical attributes have finite number of distinct values, with no ordering among the values,
examples include geographic location, item type and job category. There are several methods for generation
of concept hierarchies for categorical data. Specification of a partial ordering of attributes explicitly at
the schema level by experts: Concept hierarchies for categorical attributes or dimensions typically involve
a group of attributes. A user or an expert can easily define concept hierarchy by specifying a partial or total
ordering of the attributes at the schema level. A hierarchy can be defined at the schema level such as street < city < province or state < country. Specification of a portion of a hierarchy by explicit data grouping:
This is essentially a manual definition of a portion of a concept hierarchy. In a large database, it is unrealistic
to define an entire concept hierarchy by explicit value enumeration. However, it is realistic to specify
explicit groupings for a small portion of the intermediate-level data. Specification of a set of attributes
but not their partial ordering: A user may specify a set of attributes forming a concept hierarchy, but omit
to specify their partial ordering. The system can then try to automatically generate the attribute ordering so
as to construct a meaningful concept hierarchy. Specification of only a partial set of attributes:
Sometimes a user can be sloppy when defining a hierarchy, or may have only a vague idea about what
should be included in a hierarchy. Consequently, the user may have included only a small subset of the relevant attributes; for example, for the location dimension, the user may have specified only street and city.
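A small sketch, in Python, of how a schema-level concept hierarchy such as street < city < state < country can be represented and climbed; the location records are hypothetical. Climbing this kind of hierarchy is exactly what the roll-up operation described next relies on.

```python
# Sketch of a schema-level concept hierarchy for a location dimension
# (street < city < state < country) and a helper that climbs it, as used
# by roll-up. The level ordering follows the text; the mapping data is a
# made-up example.
LEVELS = ["street", "city", "state", "country"]  # lowest to highest

# Hypothetical dimension records keyed by the lowest-level value.
LOCATION = {
    "5 Main St": {"city": "Pune",   "state": "Maharashtra", "country": "India"},
    "9 Oak Ave": {"city": "Mumbai", "state": "Maharashtra", "country": "India"},
}

def roll_up(street, to_level):
    """Climb the concept hierarchy from street level to a higher level."""
    if to_level == "street":
        return street
    if to_level not in LEVELS:
        raise ValueError(f"unknown level: {to_level}")
    return LOCATION[street][to_level]

print(roll_up("5 Main St", "city"))     # Pune
print(roll_up("9 Oak Ave", "country"))  # India
```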
1) OLAP operations - Roll-up: The roll-up operation (also known as the drill-up or aggregation operation) performs aggregation on a data cube by climbing up concept hierarchies, i.e., dimension reduction. Roll-up is like zooming out on the data cube. Drill-down: The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down is like zooming in on the data cube. Drill-down can be performed by either stepping down a concept hierarchy for a dimension or adding additional dimensions. Slice: A slice is a subset of the cube corresponding to a single value for one or more members of a dimension. Dice: The dice operation defines a subcube by performing a selection on two or more dimensions.

2) Classify OLAP tools - MOLAP: Multidimensional Online Analytical Processing tools. ROLAP: Relational Online Analytical Processing tools.

3) Spatial data mining - Spatial data mining is the process of discovering non-trivial, interesting, and useful patterns in large spatial datasets. The most common spatial pattern families are co-locations, spatial hotspots, spatial outliers, and location predictions.

4) Characteristics of OLAP - Multidimensional conceptual view. Multi-user support. Accessibility. Uniform reporting performance.

5) Features of OLAP - Multi-dimensional view of data. Support for complex calculations. Time intelligence.

6) Benefits of OLAP - OLAP helps managers in decision-making through the multidimensional record views that it is efficient at providing, thus increasing their productivity. OLAP functions are self-sufficient owing to the inherent flexibility they provide to the organized databases.

7) Steps required for tuning a data warehouse - Tune the business rules. Tune the data design. Tune the application design. Tune the logical structure of the database. Tune the database operations. Tune the access paths. Tune memory allocation.

8) Types of warehouse applications - Financial services. Banking services. Consumer services. Retail sector. Controlled manufacturing.

9) Recovery - Data recovery is the process of restoring data that has been lost, accidentally deleted, corrupted or made inaccessible. In enterprise IT, data recovery typically refers to the restoration of data to a desktop, laptop, server or external storage system from a backup. Types: Simple recovery. Full recovery. Bulk-logged recovery. Point-in-time recovery.

10) MQE - MQE stands for Managed Query Environment. Some products provide ad-hoc query capabilities such as data cube and slice-and-dice analysis. This is done by developing a query to select data from the DBMS, which delivers the requested data to the system where it is placed into a data cube.

11) Types of security in a data warehouse - Query manager. Load manager. Warehouse manager. Application development.

12) Functional components of data mining - Data collection and data mining query composition. Presentation of discovered patterns. Hierarchy specification and manipulation. Manipulation of data mining primitives.

Chapter 3 - 1) KDD process - Data mining, also known as Knowledge Discovery in Databases, refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data stored in databases. Steps: Data cleaning. Data integration. Data selection. Data transformation. Data mining. Pattern evaluation. Knowledge representation.

2) Advantages of KDD - Improves decision-making. Increased efficiency. Better customer service. Fraud detection. Predictive modelling.

3) Disadvantages - Privacy concerns. Complexity. Unintended consequences. Data quality. High cost.

4) Linear regression - Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable.

5) Why data cleaning routines are needed - Having clean data will ultimately increase overall productivity and allow for the highest quality information in your decision-making. Benefits include removal of errors when multiple sources of data are at play; fewer errors make for happier clients and less-frustrated employees.

6) Chi-square test - The chi-square test is a statistical tool used to check if two categorical variables are related or independent. It helps us understand whether the observed data differs significantly from the expected data. By comparing the two, we can draw conclusions about whether the variables have a meaningful association.

7) Motivation behind data mining - The major reason for using data mining techniques is the requirement for useful information and knowledge from huge amounts of data. The information and knowledge gained can be used in many applications such as business management and production control. Data mining came into existence as a result of the natural evolution of information technology.

8) Components of a data mining system - Data source, data warehouse server, data mining engine, pattern assessment module, graphical user interface, and knowledge base.

9) Z-score normalization - If a value is exactly equal to the mean of all the values of the feature, it will be normalized to 0. If it is below the mean, it will be a negative number, and if it is above the mean, it will be a positive number.

10) Information gain - A measure of how much information is gained by splitting a set of data on a particular feature. It is calculated by comparing the entropy of the original set of data to the entropy of the child sets produced by the split (a small entropy and information-gain sketch appears at the end of these notes).

11) Gain ratio (GR) - A modification of the information gain that reduces its bias. Gain ratio takes the number and size of branches into account when choosing an attribute.

12) Applications of data mining - Business intelligence. Healthcare. Fraud detection and prevention. Marketing and advertising.

13) Advantages of data mining - Marketing/retailing. Banking/crediting. Manufacturing. Customer identification. Disadvantages - Privacy issues. Safety concerns. Misused or erroneous information. Expensive.
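A small pure-Python sketch of the entropy-based measures from items 10 and 11 above (information gain and gain ratio); the toy records and attribute names are hypothetical.

```python
# Pure-Python sketch of information gain and gain ratio (items 10 and 11
# above). The toy records are hypothetical; the second element of each
# record is the class label to predict.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def split_by(records, attribute):
    """Group (attributes, label) records by the value of one attribute."""
    groups = {}
    for attrs, label in records:
        groups.setdefault(attrs[attribute], []).append(label)
    return groups

def information_gain(records, attribute):
    # Parent entropy minus the weighted entropy of the child sets.
    parent = entropy([label for _, label in records])
    groups = split_by(records, attribute)
    total = len(records)
    children = sum(len(g) / total * entropy(g) for g in groups.values())
    return parent - children

def gain_ratio(records, attribute):
    # Information gain normalized by the split information, which
    # accounts for the number and size of branches.
    groups = split_by(records, attribute)
    total = len(records)
    split_info = -sum((len(g) / total) * log2(len(g) / total) for g in groups.values())
    return information_gain(records, attribute) / split_info if split_info else 0.0

records = [
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "rain"},  "yes"),
    ({"outlook": "rain"},  "yes"),
]
print(round(information_gain(records, "outlook"), 3))  # 1.0
print(round(gain_ratio(records, "outlook"), 3))        # 1.0
```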