1) Data warehouse- A data warehouse, or enterprise data warehouse (EDW), is a system that aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, artificial intelligence (AI) and machine learning. A data warehouse system enables an organization to run powerful analytics on very large amounts of data (petabytes) in ways that a standard database cannot.
Features- Integrated data- Integration means placing a common entity to filter and capture similar items (or "like" data sets) from the unlike items in multiple database systems. These "like" data sets are then moved to the data warehouse using a general principle or standard format. This method is followed in all data warehouse systems, as it keeps the integrity of the data in the warehouse intact. Time-variant- Time variance is the time perspective of the data warehouse system. It is generally adapted to the time and date set on the operational systems, so the data warehouse can directly reflect time. Subject-oriented- The term subject-oriented refers to the subject-wise storage of data: the data in the system reside in groups revolving around a common idea. Grouping is an essential characteristic of the data warehouse, as it keeps the system organized. Examples of subjects are employees, marketing, sales, research, products, customers, etc. Non-volatile- Data once entered into a data warehouse must remain unchanged. All data is read-only, and previous data is not erased when current data is entered. This helps you analyze what has happened and when. Centralized repository- All data is stored in a single location rather than scattered across various systems or databases. This centralized approach allows for easier data management, maintenance, and organization, and helps ensure data consistency and accuracy. Data integration and transformation- A crucial feature of data warehousing that involves converting raw data into meaningful information. It includes data cleansing, integration, and transformation, ensuring data accuracy, consistency, and accessibility.
Components- Warehouse (central) database- The central database keeps all business data in the data warehouse while making it easier to report. There are various database types in which you can store specific data types in the warehouse: analytics databases, which help manage and sustain analytics data storage; cloud-based databases, which can be hosted and accessed on the cloud so that you do not have to acquire hardware to set up a data warehouse; and typical relational databases, which are row databases used on a routine basis. ETL (Extraction, Transformation, Loading) tools- ETL tools are central components of a data warehouse and help extract data from various sources. This data is then transformed into a suitable arrangement and later loaded into the data warehouse. They allow you to extract data, fill in missing data, distribute data from the central repository to BI applications, and more (a minimal sketch of this flow follows after this item). Metadata- Metadata is 'data about your data.' It is one of the major components of a data warehouse. Metadata tells you everything about the usage, values, source, and other features of the data sets in the warehouse; additionally, business metadata adds context to the technical data. Access tools- Data warehouses use a group of databases as the primary base, but users cannot work with those databases directly without access tools unless a database administrator is available. Data mining tools streamline the process of finding patterns and relationships in vast volumes of data using statistical modeling methods. OLAP tools aid in building a multi-dimensional data warehouse and allow business data to be analyzed from various viewpoints. Application development tools help develop customized reports. Query and reporting tools let corporate reports be produced quickly through spreadsheets and rich visuals. Data warehouse bus- It is among the main components of a data warehouse. The warehouse bus indicates the flow of data in a data warehousing bus architecture and includes the data marts. It is a layer that helps users transmit data and is also used to partition data produced for a specific group.
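As a rough illustration of the extract-transform-load flow mentioned above, here is a minimal sketch in Python using pandas. The file names, column names, and cleaning rules are hypothetical and only illustrate the idea, not any specific product's ETL tool.

```python
import pandas as pd

# Extract: read raw data from two hypothetical operational sources.
orders = pd.read_csv("orders.csv")        # e.g. order_id, customer_id, amount, order_date
customers = pd.read_csv("customers.csv")  # e.g. customer_id, region

# Transform: clean, fill missing values, and integrate into one conformed format.
orders["amount"] = orders["amount"].fillna(0.0)
orders["order_date"] = pd.to_datetime(orders["order_date"])
integrated = orders.merge(customers, on="customer_id", how="left")

# Load: append the conformed rows into a warehouse table (here just a staging CSV).
integrated.to_csv("warehouse_sales_fact.csv", index=False)
```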
2) Building a data warehouse- There are two reasons why organizations consider data warehousing a critical need. Business considerations: organizations interested in the development of a data warehouse can choose one of the following two approaches. Top-down approach- In the top-down approach suggested by Bill Inmon, we build a centralized repository to house corporate-wide business data; this repository is called the Enterprise Data Warehouse (EDW). Bottom-up approach- Here we build the data marts separately at different points in time, as and when the specific subject area requirements become clear; the data marts are then integrated or combined to form a data warehouse. Design considerations- To be successful, a data warehouse designer must adopt a holistic approach: consider all data warehouse components as parts of a single complex system, and take into account all possible data sources and all known usage requirements. Data content- The content and structure of the data warehouse are reflected in its data model. The data model is the template that describes how information will be organized within the integrated warehouse framework. Metadata- It defines the location and contents of data in the warehouse; metadata is searchable by users to find definitions or subject areas. Data distribution- One of the biggest challenges when designing a data warehouse is the data placement and distribution strategy, since data volumes continue to grow. Tools- A number of tools are available that are specifically designed to help in the implementation of the data warehouse. Technical considerations- The hardware platform that will house the data warehouse; the DBMS that supports the warehouse data; the communication infrastructure that connects data marts, operational systems and end users; and the hardware and software to support the metadata repository. Implementation considerations- Implementation requires the integration of many products: access tools; data extraction, clean-up, transformation and migration; data placement strategies; and user levels.

3) Mapping the data warehouse architecture to a multiprocessor architecture- There are two advantages of having parallel relational database technology for a data warehouse. Linear speed-up: the ability to increase the number of processors to reduce response time. Linear scale-up: the ability to provide the same performance on the same requests as the database size increases. Types of parallelism- Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data. Vertical parallelism: this occurs among different tasks; all query components such as scan, join, and sort are executed in parallel in a pipelined fashion, so the output of one task becomes the input of another task. Data partitioning- Data partitioning is the key component for effective parallel execution of database operations. Partitioning can be done randomly or intelligently. Random partitioning includes random data striping across multiple disks on a single server; another option is round-robin partitioning, in which each record is placed on the next disk assigned to the database. Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks. Intelligent partitioning schemes include hash partitioning, where a hash algorithm calculates the partition number from the value of the partitioning key for each row (a small sketch of this idea follows below).
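A minimal sketch of hash partitioning in Python, assuming a hypothetical four-partition layout keyed on a customer id; real DBMSs use their own internal hash functions, so this only illustrates the idea.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical number of disks/partitions

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partitioning-key value to a partition number via a hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

rows = [("C001", 250.0), ("C002", 90.5), ("C003", 410.0)]
for customer_id, amount in rows:
    print(customer_id, "-> partition", partition_for(customer_id))
```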
Database architectures for parallel processing- There are three DBMS software architecture styles for parallel processing. Shared memory architecture- Tightly coupled shared-memory systems have the following characteristics: multiple PUs share memory; each PU has full access to all shared memory through a common bus; communication between nodes occurs via shared memory; performance is limited by the bandwidth of the memory bus. Shared disk architecture- Shared-disk systems are typically loosely coupled and have the following characteristics: each node consists of one or more PUs and associated memory; memory is not shared between nodes; communication occurs over a common high-speed bus; each node has access to the same disks and other resources; a node can be an SMP if the hardware supports it. Shared nothing architecture- Shared-nothing systems are typically loosely coupled. In shared-nothing systems only one CPU is connected to a given disk; if a table or database is located on that disk, access depends entirely on the PU which owns it.

Parallel DBMS features- Scope and techniques of parallel DBMS operations. Optimizer implementation. Application transparency. A parallel environment that allows the DBMS server to take full advantage of the existing facilities at a very low level. DBMS management tools that help configure, tune, administer and monitor a parallel RDBMS as effectively as if it were a serial RDBMS. Parallel DBMS vendors- Oracle: Parallel Query Option (PQO). Informix: eXtended Parallel Server (XPS). IBM: DB2 Parallel Edition (DB2 PE). SYBASE: SYBASE MPP.
4) Data warehouse vs DBMS- Database- A common database is based on operational or transactional processing; each operation is an indivisible transaction. Generally, a database stores current and up-to-date data which is used for daily operations. A database is generally application-specific. Example- a database stores related data, such as the student details in a school. Constructing a database is not very expensive. Data warehouse- A data warehouse is based on analytical processing. A data warehouse maintains historical data over time; historical data is the data kept over years and can be used for trend analysis, future predictions and decision support. A data warehouse is generally integrated at the organization level, by combining data from different databases. Example- a data warehouse integrates the data from one or more databases, so that analysis can be done to get results such as the best performing school in a city. Constructing a data warehouse can be expensive.

5) OLTP system- It is well known as an online database modifying system. It consists of only current operational data and makes use of a standard database management system (DBMS). It is application-oriented and used for business tasks. In an OLTP database, tables are normalized (3NF). The data is used to perform day-to-day fundamental operations and reveals a snapshot of present business tasks. It serves the purpose of inserting, updating, and deleting information in the database. The size of the data is relatively small, as the historical data is archived (MB to GB). It is very fast, as queries typically operate on about 5% of the data. The data integrity constraint must be maintained in an OLTP database, and the backup and recovery process is maintained rigorously. It enhances the user's productivity. OLAP (Online Analytical Processing)- It is well known as an online database query management system. It consists of historical data from various databases and makes use of a data warehouse. It is subject-oriented and used for data mining, analytics, decision making, etc. In an OLAP database, tables are not normalized. The data is used in planning, problem-solving, and decision-making, and it provides a multi-dimensional view of different business tasks. It serves the purpose of extracting information for analysis and decision-making. A large amount of data is stored, typically TB to PB. It is relatively slow, as the amount of data involved is large; queries may take hours. The OLAP database is not often updated, so data integrity is unaffected; it only needs backup from time to time as compared to OLTP. It improves the efficiency of business analysts.

6) Metadata- Data that provides information about other data. Metadata summarizes basic information about data, making it easier to find and work with particular instances of data. It doesn't tell you what the content is, but instead describes the type of thing that it is; essentially it helps explain its provenance - its origin, nature and lineage. Importance- It enables data discoverability. It aids better decision-making. It improves data quality. It delivers time and efficiency savings. It increases collaboration. It ensures compliance.

7) Multidimensional data- A data set with many different columns, also called features or attributes. The more columns in the data set, the more likely you are to discover hidden insights; in this case, two-dimensional analysis falls flat. Think of this data as being in a cube on multiple planes. It organizes the many attributes and enables users to dig deeper into probable trends or patterns. Adv- Visualize the output of complex data. Examine relationships among data from multiple sources. Improve collaboration and analysis across geographic locations. Easily compare the impact of changes to variables on the target data.

8) Data cube- A grouping of data in a multidimensional matrix is called a data cube. In data warehousing, we generally deal with multidimensional data models, as the data is represented by multiple dimensions and multiple attributes. The data cube can be classified into two categories: multidimensional data cube and relational data cube. Data cube operations- Roll-up: aggregates similar data attributes along a dimension. Drill-down: the reverse of the roll-up operation. Slicing: filters out the unnecessary portions by fixing one dimension. Dicing: performs a multidimensional cut that selects a range of values on two or more dimensions. Pivot: rotates the view; it is important from a presentation point of view.

9) Hierarchy- In data mining, a concept hierarchy refers to the organization of data into a tree-like structure, where each level of the hierarchy represents a concept that is more general than the level below it. This hierarchical organization allows for more efficient and effective data analysis, as well as the ability to drill down to more specific levels of detail when needed. Ex- year -> month -> week -> day. Types- balanced, unbalanced, ragged.

10) DBMS schemas- Star schema- The multidimensional view of data expressed using relational database semantics is provided by the database schema design called the star schema. A star schema has one large central table (the fact table) and a set of smaller tables (the dimension tables) arranged in a radial pattern around the central table. The star schema architecture is the simplest data warehouse schema; it is called a star schema because the diagram resembles a star, with points radiating from a center. Usually the fact table in a star schema is in third normal form (3NF), whereas dimension tables are denormalized. Despite being the simplest architecture, the star schema is the most commonly used nowadays and is recommended by Oracle. Ex- consider an employment data warehouse with three dimension tables and one fact table. Snowflake schema- The result of decomposing one or more of the dimensions. The many-to-one relationships among sets of attributes of a dimension can be separated into new dimension tables, forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical structure of dimensions very well. Ex- for a company XYZ Electronics, a dimension table is normalized, resulting in two tables. Fact constellation schema- For each star schema it is possible to construct a fact constellation schema (for example by splitting the original star schema into more star schemas, each describing facts at another level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables. The main shortcoming of the fact constellation schema is a more complicated design, because many variants for particular kinds of aggregation must be considered and selected; moreover, dimension tables are still large. Ex- assume that Deccan Electronics would like another fact table for supply and delivery. It may contain five dimensions, or keys - time, item, delivery-agent, origin, destination - along with numeric measures such as the number of units supplied and the cost of delivery. Both fact tables can share the same item dimension table as well as the time dimension table. (A small star-schema sketch follows below.)
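To make the star schema idea concrete, here is a minimal sketch in Python/pandas of a hypothetical sales fact table joined to two dimension tables and then aggregated; the table and column names are illustrative, not taken from the notes.

```python
import pandas as pd

# Hypothetical dimension tables (denormalized, as in a star schema).
dim_item = pd.DataFrame({"item_key": [1, 2], "item_name": ["TV", "Radio"], "category": ["AV", "AV"]})
dim_time = pd.DataFrame({"time_key": [10, 11], "month": ["Jan", "Feb"], "year": [2024, 2024]})

# Hypothetical fact table: foreign keys to the dimensions plus numeric measures.
fact_sales = pd.DataFrame({
    "item_key": [1, 1, 2],
    "time_key": [10, 11, 10],
    "units_sold": [5, 7, 3],
    "revenue": [2500.0, 3500.0, 450.0],
})

# A typical star-schema query: join facts to dimensions, then aggregate.
joined = fact_sales.merge(dim_item, on="item_key").merge(dim_time, on="time_key")
report = joined.groupby(["category", "year"])[["units_sold", "revenue"]].sum()
print(report)
```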
1) Data warehouse planning- Warehouse planning is the process of designing the facility's space with maximum efficiency in mind. The layout must account for the movement of materials, optimized equipment placement, and flow of traffic.

2) Steps of data warehouse implementation- Requirements analysis and capacity planning: The first process in data warehousing involves defining enterprise needs, defining the architecture, carrying out capacity planning, and selecting the hardware and software tools. This step involves consulting senior management as well as the different stakeholders. Hardware integration: Once the hardware and software have been selected, they need to be put together by integrating the servers, the storage methods, and the user software tools. Modeling: Modelling is a significant stage that involves designing the warehouse schema and views. This may involve using a modeling tool if the data warehouse is sophisticated. Physical modeling: For the data warehouse to perform efficiently, physical modeling is needed. This involves designing the physical data warehouse organization, data placement, data partitioning, deciding on access methods, and indexing. Sources: The information for the data warehouse is likely to come from several data sources. This step involves identifying and connecting the sources using gateways, ODBC drivers, or other wrappers. ETL: The data from the source systems will need to go through an ETL phase. The process of designing and implementing the ETL phase may involve selecting suitable ETL tool vendors and purchasing and implementing the tools. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools will be needed, perhaps using a staging area. User applications: For the data warehouse to be helpful, there must be end-user applications. This step involves designing and implementing the applications required by the end users. Roll out the data warehouse and applications: Once the data warehouse has been populated and the end-client applications tested, the warehouse system and the applications may be rolled out for the user community to use.
3) Implementation guidelines- Build incrementally: Data warehouses must be built incrementally. Generally, it is recommended that a data mart be created for one particular project, and once it is implemented, several other sections of the enterprise may also want to implement similar systems. Need a champion: A data warehouse project must have a champion who actively carries out considerable research into the expected cost and benefit of the project. Data warehousing projects require inputs from many units in an enterprise and therefore need to be driven by someone who can interact with people across the enterprise and actively persuade colleagues. Senior management support: A data warehouse project must be fully supported by senior management. Given the resource-intensive nature of such projects and the time they can take to implement, a warehouse project calls for a sustained commitment from senior management. Ensure quality: Only data that has been cleaned and is of a quality accepted by the organization should be loaded into the data warehouse. Corporate strategy: A data warehouse project must fit corporate strategy and business goals, and the purpose of the project must be defined before the project begins. Business plan: The financial costs (hardware, software, and peopleware), expected benefits, and a project plan for the data warehouse project must be clearly outlined and understood by all stakeholders. Training: Data warehouse projects must not overlook training requirements. For a data warehouse project to be successful, the users must be trained to use the warehouse and to understand its capabilities. Adaptability: The project should build in flexibility so that changes may be made to the data warehouse if and when required. Like any system, a data warehouse will need to change as the needs of the enterprise change. Joint management: The project must be handled by both IT and business professionals in the enterprise.
4) Hardware and operating systems used in a data warehouse- Hardware and operational design covers server hardware, network hardware, parallel technology, the impact of security on hardware design, backup and recovery, service level agreements, and operating the data warehouse. Parallel hardware technology: Symmetric Multi-Processing (SMP)- An SMP machine is a set of CPUs that share memory and disk; this is sometimes called a shared-everything environment. The CPUs in an SMP machine are all equal: a process can run on any CPU in the machine and can run on different CPUs at different times. Scalability of SMP machines- the length of the communication bus connecting the CPUs is a natural limit; as the bus gets longer, interprocessor communication becomes slower, and each extra CPU imposes an extra bandwidth load on the bus, increases memory contention, and so on. Example- if the database software supports parallel queries, a single query can be decomposed and its separate parts processed in parallel, which makes query performance scalable. Massively Parallel Processing (MPP)- made up of many loosely coupled nodes linked together by a high-speed connection. Each node has its own memory, and the disks are not shared. Most MPP systems allow a disk to be dual-connected between two nodes, which protects against an individual node failure making disks unavailable. Cluster technology- A cluster is a set of loosely coupled SMP machines connected by a high-speed interconnect. Each machine has its own CPUs and memory, but they share access to disk; these systems are called shared-disk systems. Each machine in the cluster is called a node. The aim of the cluster is to mimic a single larger machine; in this pseudo single machine, resources such as shared disk must be managed in a distributed fashion.

5) Criteria for selecting a data warehouse- Experience and expertise. Range of services. Scalability. Security and compliance. Cost. Customer support. Customization.

1) Data mining- Data mining is the process of searching and analyzing a large batch of raw data in order to identify patterns and extract useful information. Companies use data mining software to learn more about their customers.

2) Key features of data mining- Focus attribute- The focus attribute is the variable that the data mining algorithm attempts to predict or model based on other attributes. Aggregation- Aggregation is a potent tool used in data mining to summarize and reduce massive amounts of information into more manageable and useful forms. Discretization- Using a data mining technique called discretization, continuous features are turned into discrete ones by dividing their range into intervals or bins. Value mapping- Using established mappings or rules, value mapping is a data mining approach used to translate the values of a feature from one set to another. Calculation- As a data mining aspect, calculation entails deriving new properties or features from existing data using logical or mathematical operations.

3) Issues- Mining methodology and user interaction issues- Mining different kinds of knowledge in databases: different users may be interested in different kinds of knowledge, so data mining must cover a broad range of knowledge discovery tasks. Interactive mining of knowledge at multiple levels of abstraction: the data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on the returned results. Incorporation of background knowledge: background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction. Data mining query languages and ad hoc data mining: a data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining. Presentation and visualization of data mining results: once the patterns are discovered they need to be expressed in high-level languages and visual representations, and these representations should be easily understandable. Handling noisy or incomplete data: data cleaning methods are required to handle noise and incomplete objects while mining the data regularities; without them, the accuracy of the discovered patterns will be poor. Pattern evaluation: the patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are not useful. Performance issues- Efficiency and scalability of data mining algorithms: in order to effectively extract information from huge amounts of data in databases, data mining algorithms must be efficient and scalable. Parallel, distributed, and incremental mining algorithms: factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. Diverse data type issues- Handling of relational and complex types of data: the database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc.; it is not possible for one system to mine all these kinds of data. Mining information from heterogeneous databases and global information systems: the data is available at different data sources on a LAN or WAN, and these data sources may be structured, semi-structured or unstructured.

4) Knowledge extraction process (KDD)- KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets. The KDD process is iterative and requires multiple passes through the steps below to extract accurate knowledge from the data (a minimal end-to-end sketch follows after this item). Data cleaning- the removal of noisy and irrelevant data from the collection: cleaning of missing values, cleaning of noisy data (where noise is a random or variance error), and cleaning with data discrepancy detection and data transformation tools. Data integration- heterogeneous data from multiple sources is combined into a common source (a data warehouse), using data migration tools, data synchronization tools and the ETL (Extract-Transform-Load) process. Data selection- the process where data relevant to the analysis is decided upon and retrieved from the data collection; methods such as neural networks, decision trees, naive Bayes, clustering, and regression may be used. Data transformation- the process of transforming data into the form required by the mining procedure; it is a two-step process: data mapping (assigning elements from the source base to the destination to capture transformations) and code generation (creation of the actual transformation program). Data mining- techniques applied to extract potentially useful patterns; it transforms task-relevant data into patterns and decides the purpose of the model using classification or characterization. Pattern evaluation- identifying interesting patterns representing knowledge based on given measures; it finds the interestingness score of each pattern and uses summarization and visualization to make the data understandable by the user. Knowledge representation- presenting the results in a way that is meaningful and can be used to make decisions.
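A minimal sketch of the KDD steps above in Python, using pandas and scikit-learn on a hypothetical customer table; the file name, columns, and the choice of k-means as the mining step are assumptions for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Data cleaning: drop duplicates and rows with missing values.
raw = pd.read_csv("customers_raw.csv")          # hypothetical source file
clean = raw.drop_duplicates().dropna()

# Data selection: keep only the attributes relevant to the analysis.
selected = clean[["age", "annual_income", "spend_score"]]

# Data transformation: z-score scaling so attributes are comparable.
transformed = (selected - selected.mean()) / selected.std()

# Data mining: cluster customers into groups (one possible mining step).
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(transformed)

# Pattern evaluation / knowledge representation: summarize each cluster.
clean = clean.assign(cluster=labels)
print(clean.groupby("cluster")[["age", "annual_income", "spend_score"]].mean())
```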
5) Classification of data mining systems- Data mining refers to the process of extracting important information from raw data. It analyses the data patterns in huge sets of data with the help of several kinds of software. To understand the system and meet the desired requirements, data mining systems can be classified as follows. Classification based on the databases mined- A data mining system can be classified based on the types of databases that are mined. A database system can be further segmented based on distinct principles, such as data models, types of data, etc. Classification based on the type of knowledge mined- A data mining system categorized by the kind of knowledge mined may have the following functionalities: characterization, discrimination, association and correlation analysis, classification, prediction, outlier analysis, evolution analysis. Classification based on the techniques utilized- A data mining system can also be classified based on the techniques incorporated; these can be assessed based on the degree of user interaction involved or the methods of analysis employed. Classification based on the applications adapted- Data mining systems classified by the application adapted include: finance, telecommunications, DNA, stock markets, e-mail.
6) Data Integration- Data integration has been an integral part of data operations because data can be obtained from several sources. It is a strategy that integrates data from several sources to make it available to users in a single uniform view that shows their status. Data integration approaches- No coupling: in this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular source, processes that data using some data mining algorithms, and stores the result in another file. Loose coupling: in this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems, performs data mining on that data, and then stores the mining result either in a file or in a designated place in a database or data warehouse (a small loose-coupling sketch follows below). Semi-tight coupling: in this scheme, the data mining system is linked with a database or data warehouse system and, in addition, efficient implementations of a few data mining primitives can be provided in the database. Tight coupling: in this coupling scheme, the data mining system is smoothly integrated into the database or data warehouse system, and the data mining subsystem is treated as one functional component of an information system.
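A rough illustration of loose coupling in Python: the mining program pulls rows out of a relational store (here SQLite via the standard library) and does the analysis itself with pandas. The database file, table, and query are hypothetical.

```python
import sqlite3
import pandas as pd

# Fetch task-relevant data from the database system (loose coupling:
# the DBMS only serves data, the mining happens outside it).
conn = sqlite3.connect("sales.db")                      # hypothetical database file
df = pd.read_sql_query("SELECT region, amount FROM sales", conn)
conn.close()

# Perform a simple mining/analysis step outside the DBMS.
summary = df.groupby("region")["amount"].agg(["count", "mean", "sum"])

# Store the mining result back in a file, as described above.
summary.to_csv("mining_result.csv")
```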
7) Different forms or steps of data preprocessing (data consolidation)- Data cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and transformation. Data integration: This involves combining data from multiple sources to create a unified dataset. Data integration can be challenging as it requires handling data with different formats, structures, and semantics. Techniques such as record linkage and data fusion can be used for data integration. Data transformation: This involves converting the data into a suitable format for analysis. Common techniques used in data transformation include normalization, standardization, and discretization. Normalization is used to scale the data to a common range, standardization is used to transform the data to have zero mean and unit variance, and discretization is used to convert continuous data into discrete categories (see the sketch after this item). Data reduction: This involves reducing the size of the dataset while preserving the important information. Data reduction can be achieved through techniques such as feature selection and feature extraction. Feature selection involves selecting a subset of relevant features from the dataset, while feature extraction involves transforming the data into a lower-dimensional space while preserving the important information.
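A minimal sketch of the three transformation techniques named above (min-max normalization, z-score standardization, and equal-width discretization) using pandas on a hypothetical numeric column.

```python
import pandas as pd

values = pd.Series([12.0, 15.0, 20.0, 22.0, 30.0, 45.0], name="income_k")

# Min-max normalization: rescale to the [0, 1] range.
normalized = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: zero mean, unit variance.
standardized = (values - values.mean()) / values.std()

# Discretization: cut the continuous range into 3 equal-width bins.
discretized = pd.cut(values, bins=3, labels=["low", "medium", "high"])

print(pd.DataFrame({"raw": values, "minmax": normalized,
                    "zscore": standardized, "bin": discretized}))
```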
8) Data consolidation- It is the process of combining data from multiple sources, cleaning and verifying it by removing errors, and storing it in a single location, such as a data warehouse or database. Data is produced from various sources and in multiple formats in every business.

9) Managing noisy data (smoothing techniques used in the data cleaning process)- Binning- Binning is a technique where we sort the data and then partition it into equal-frequency bins. You may then replace the noisy data with the bin mean, the bin median or the bin boundaries (a small sketch follows after this item). Smoothing by bin means: the values in each bin are replaced by the mean value of the bin. Smoothing by bin medians: the values in each bin are replaced by the median value of the bin. Smoothing by bin boundaries: the minimum and maximum values of each bin are taken, and each value is replaced by the closest boundary value. Regression- This is used to smooth the data and helps handle data when unnecessary variation is present; for the purpose of analysis, regression helps decide the suitable variable. Linear regression refers to finding the best line to fit between two variables so that one can be used to predict the other; multiple linear regression involves more than two variables. Using regression to find a mathematical equation that fits the data helps smooth out the noise. Clustering- This is used for finding outliers and also for grouping the data; clustering is generally used in unsupervised learning. Outlier analysis- Outliers may be detected by clustering, where similar or close values are organized into the same groups or clusters; values that fall far apart from any cluster may be considered noise or outliers. Outliers are extreme values that deviate from the other observations in the data. Univariate outliers can be found when looking at the distribution of values in a single feature space. Point outliers are single data points that lie far from the rest of the distribution. Contextual outliers can be noise in data, such as punctuation symbols in text analysis or background noise in speech recognition. Collective outliers can be subsets of novelties in data, such as a signal that may indicate the discovery of new phenomena.
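A minimal sketch of equal-frequency binning and smoothing by bin means and bin boundaries in Python, on a small hypothetical list of sorted values.

```python
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float))
n_bins = 3
bins = np.array_split(data, n_bins)   # equal-frequency partitioning of the sorted data

# Smoothing by bin means: every value in a bin is replaced by the bin's mean.
smoothed_by_mean = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: each value is replaced by the closer of min/max of its bin.
smoothed_by_boundary = np.concatenate([
    np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins
])

print(smoothed_by_mean)
print(smoothed_by_boundary)
```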
10) Different strategies for data cleaning- Data migration- Data migration is a method of transferring data from one device to another. Although this can sound very simple, a shift in storage, database or program is involved. Any data migration will require at least the transform and load phases of the extract/transform/load process. For application-specific migrations, such as platform upgrades, replication of databases and file copying, host-based software is best; array-based software is mostly used to transfer data between similar structures; and, depending on their configuration, network appliances migrate volumes, files or blocks of data. Data scrubbing- The data cleaning process detects and removes errors and anomalies and improves data quality. Data quality problems arise due to misspelling during data entry, missing values, or other invalid data. In basic terms, data scrubbing is the process of guaranteeing an accurate and correct collection of information; it is especially important for companies that rely on electronic data in the operation of their business. During the process, several tools are used to check the stability and accuracy of documents. Data auditing- Data auditing is the assessment of data quality throughout its lifecycle to ensure its accuracy and fitness for a specific usage. Data performance is measured and issues are identified for remediation. Data auditing results in better data quality, which enables enhanced analytics to improve operations.
11) Steps for cleaning data- Remove duplicate or irrelevant observations- Remove duplicate or pointless observations, as well as undesirable observations, from your dataset. The majority of duplicate observations occur during data gathering: duplicate data can be produced when you merge data sets from several sources, scrape data, or receive data from clients or other departments. De-duplication is one of the most important steps in this procedure. Fix structural errors- When you measure or transfer data and find odd naming practices, typos, or wrong capitalization, these are structural errors. Mislabelled categories or classes may result from these inconsistencies. Filter unwanted outliers- There will frequently be isolated observations that, at first glance, do not seem to fit the data you are analyzing. Removing an outlier when you have a good reason to, such as incorrect data entry, will improve the quality of the data you are working with. Handle missing data- Because many algorithms won't tolerate missing values, you can't overlook missing data; there are a few options for handling it. Validate and QA- Inaccurate or noisy data can lead to false conclusions that inform poor company strategy and decision-making, and false conclusions can result in an embarrassing situation in a reporting meeting when you find out your data couldn't withstand further investigation. Establishing a culture of quality data in your organization is crucial, and the tools you might employ to develop this plan should be documented. (A minimal sketch of these steps follows below.)
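A minimal pandas sketch of the cleaning steps above (de-duplication, structural fixes, outlier filtering, and missing-value handling) on a small hypothetical table; the column names and thresholds are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["Pune", "pune ", "Mumbai", "Pune", None],
    "sales": [120.0, 120.0, 95.0, 10_000.0, 80.0],   # 10_000.0 plays the outlier
})

# 1) Remove duplicate observations.
df = df.drop_duplicates()

# 2) Fix structural errors: trim whitespace and normalize capitalization.
df["city"] = df["city"].str.strip().str.title()
df = df.drop_duplicates()            # duplicates may reappear after normalization

# 3) Filter unwanted outliers (here: a simple fixed threshold).
df = df[df["sales"] < 5_000]

# 4) Handle missing data: fill the missing city with a placeholder.
df["city"] = df["city"].fillna("Unknown")

print(df)
```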
12) Dimensionality reduction- In dimensionality reduction, data encoding or data transformations are applied to obtain a reduced or compressed form of the original data. It can be used to remove irrelevant or redundant attributes. In this method, some data that is irrelevant may be lost. Methods for dimensionality reduction include wavelet transformations and Principal Component Analysis. The components of dimensionality reduction are feature selection and feature extraction. It leads to less misleading data and more model accuracy. (A small PCA sketch follows below.)
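A minimal sketch of Principal Component Analysis with NumPy, reducing a hypothetical 3-attribute data set to 2 components via the singular value decomposition of the centered data.

```python
import numpy as np

# Hypothetical data: 6 records, 3 attributes.
X = np.array([
    [2.5, 2.4, 1.0],
    [0.5, 0.7, 0.2],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.4],
    [2.3, 2.7, 1.2],
])

# Center the data, then take the top-2 principal directions from the SVD.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 2
X_reduced = X_centered @ Vt[:k].T      # project onto the first k components

print("explained variance ratio:", (S[:k] ** 2) / (S ** 2).sum())
print(X_reduced)
```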
13) Numerosity reduction- In numerosity reduction, data volume is reduced by choosing suitable alternative forms of data representation; it is merely a representation technique that stores the original data in a smaller form. In this method there is no loss of data. Methods for numerosity reduction include regression or log-linear models (parametric) and histograms, clustering and sampling (non-parametric). It has no components, only methods that ensure a reduction of data volume. It preserves the integrity of the data while the data volume is reduced. (A small histogram/sampling sketch follows below.)
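A minimal sketch of two non-parametric numerosity-reduction methods named above, a histogram summary and random sampling, using NumPy on a hypothetical list of measurements.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=10, size=1_000)   # hypothetical measurements

# Histogram: represent 1,000 values by 10 bucket counts and their edges.
counts, edges = np.histogram(values, bins=10)
print("bucket counts:", counts)

# Sampling: keep a 5% simple random sample instead of the full data set.
sample = rng.choice(values, size=50, replace=False)
print("sample mean vs full mean:", sample.mean(), values.mean())
```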
14) Hierarchy generation for categorical data- Categorical data are discrete data. Categorical attributes have a finite number of distinct values, with no ordering among the values; examples include geographic location, item type and job category. There are several methods for the generation of concept hierarchies for categorical data. Specification of a partial ordering of attributes explicitly at the schema level by experts: concept hierarchies for categorical attributes or dimensions typically involve a group of attributes. A user or an expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level; for example, a hierarchy can be defined at the schema level such as street < city < province or state < country. Specification of a portion of a hierarchy by explicit data grouping: this is essentially a manual definition of a portion of a concept hierarchy. In a large database it is unrealistic to define an entire concept hierarchy by explicit value enumeration, but it is realistic to specify explicit groupings for a small portion of the intermediate-level data. Specification of a set of attributes but not their partial ordering: a user may specify a set of attributes forming a concept hierarchy but omit their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy. Specification of only a partial set of attributes: sometimes a user can be sloppy when defining a hierarchy, or may have only a vague idea about what should be included in it; consequently the user may include only a small subset of the relevant attributes (for the location hierarchy, the user may have specified only street and city). (A small sketch of automatic ordering follows below.)
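A minimal sketch of the third method above: given a set of location attributes with no stated ordering, a common heuristic (used in standard data mining texts) places the attribute with the most distinct values at the lowest level of the hierarchy. The sample table is hypothetical.

```python
import pandas as pd

# Hypothetical location data with no ordering given by the user.
df = pd.DataFrame({
    "country": ["India"] * 6,
    "state":   ["MH", "MH", "MH", "KA", "KA", "KA"],
    "city":    ["Pune", "Pune", "Mumbai", "Bengaluru", "Bengaluru", "Mysuru"],
    "street":  ["MG Rd", "FC Rd", "Linking Rd", "Brigade Rd", "MG Rd", "Sayyaji Rd"],
})

# Heuristic: fewer distinct values -> more general concept -> higher level.
distinct_counts = df.nunique().sort_values()
hierarchy = " < ".join(reversed(distinct_counts.index.tolist()))
print(distinct_counts.to_dict())
print("generated hierarchy (most specific to most general):", hierarchy)
```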

1) OLAP operations- Roll-up- The roll-up operation (also known as drill-up or aggregation) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Roll-up is like zooming out on the data cube. Drill-down- The drill-down operation (also called roll-down) is the reverse of roll-up; drill-down is like zooming in on the data cube. Drill-down can be performed by either stepping down a concept hierarchy for a dimension or adding additional dimensions. Slice- A slice is a subset of the cube corresponding to a single value for one of the dimensions. Dice- The dice operation defines a subcube by performing a selection on two or more dimensions. (A small sketch of these operations follows below.)
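A minimal pandas sketch of these cube operations on a hypothetical sales cube: roll-up as grouping at a coarser level of the time hierarchy, drill-down as grouping at a finer level, slice as fixing one dimension, and dice as selecting ranges on two dimensions.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q3", "Q1", "Q2", "Q3"],
    "city":    ["Pune", "Pune", "Mumbai", "Pune", "Mumbai", "Mumbai"],
    "units":   [10, 12, 7, 15, 9, 11],
})

# Roll-up: aggregate quarters up to the year level of the time hierarchy.
rollup = sales.groupby(["year", "city"])["units"].sum()

# Drill-down: go back to the finer quarter level.
drilldown = sales.groupby(["year", "quarter", "city"])["units"].sum()

# Slice: fix a single value on one dimension (city = "Pune").
slice_pune = sales[sales["city"] == "Pune"]

# Dice: select a sub-cube on two or more dimensions.
dice = sales[(sales["city"] == "Pune") & (sales["quarter"].isin(["Q1", "Q2"]))]

print(rollup, drilldown, slice_pune, dice, sep="\n\n")
```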
2) Classify OLAP tools- MOLAP: Multidimensional Online Analytical Processing tools. ROLAP: Relational Online Analytical Processing tools.

3) Spatial data mining- It is the process of discovering non-trivial, interesting, and useful patterns in large spatial datasets. The most common spatial pattern families are co-locations, spatial hotspots, spatial outliers, and location predictions.

4) Characteristics of OLAP- Multidimensional conceptual view. Multi-user support. Accessibility. Uniform documenting performance.

5) Features of OLAP- Multi-dimensional view of data. Support for complex calculations. Time intelligence.

6) Benefits of OLAP- OLAP helps managers in decision-making through the multidimensional record views that it is efficient at providing, thus increasing their productivity. OLAP functions are self-sufficient owing to the inherent flexibility of support for the organized databases.

7) Steps required for tuning a data warehouse- Tune the business rules. Tune the data design. Tune the application design. Tune the logical structure of the database. Tune the database operations. Tune the access paths. Tune memory allocation.

8) Types of warehouse applications- Financial services. Banking services. Consumer services. Retail sector. Controlled manufacturing.

9) Recovery- Data recovery is the process of restoring data that has been lost, accidentally deleted, corrupted or made inaccessible. In enterprise IT, data recovery typically refers to the restoration of data to a desktop, laptop, server or external storage system from a backup. Types- Simple recovery. Full recovery. Bulk-logged. Point-in-time recovery.

10) MQE- MQE stands for Managed Query Environment. Some products have been able to provide ad hoc queries such as data cube and slice-and-dice analysis capabilities. This is done by developing a query to select data from the DBMS, which delivers the requested data to the system, where it is placed into a data cube.

11) Types of security in a data warehouse- Query manager. Load manager. Warehouse manager. Application development.

12) Functional components of data mining- Data collection and data mining query composition. Presentation of discovered patterns. Hierarchy specification and manipulation. Manipulation of data mining primitives.

Chapter 3- 1) KDD process- Data mining, also known as Knowledge Discovery in Databases, refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data stored in databases. Steps- Data cleaning. Data integration. Data selection. Data transformation. Data mining. Pattern evaluation. Knowledge representation.

2) Advantages of KDD- Improves decision-making. Increased efficiency. Better customer service. Fraud detection. Predictive modelling.

3) Disadvantages of KDD- Privacy concerns. Complexity. Unintended consequences. Data quality. High cost.

4) Linear regression- Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable.

5) Why data cleaning routines are needed- Having clean data will ultimately increase overall productivity and allow for the highest-quality information in your decision-making. Benefits include the removal of errors when multiple sources of data are at play; fewer errors make for happier clients and less-frustrated employees.

6) Chi-square test- The chi-square test is a statistical tool used to check if two categorical variables are related or independent. It helps us understand whether the observed data differs significantly from the expected data. By comparing the two, we can draw conclusions about whether the variables have a meaningful association (a small worked example appears at the end of these notes).

7) Motivation behind data mining- The major reason for using data mining techniques is the requirement for useful information and knowledge from huge amounts of data. The information and knowledge gained can be used in many applications such as business management, production control, etc. Data mining came into existence as a result of the natural evolution of information technology.

8) Components of a data mining system- Data source, data warehouse server, data mining engine, pattern evaluation module, graphical user interface, and knowledge base.

9) Z-score normalization- If a value is exactly equal to the mean of all the values of the feature, it is normalized to 0; if it is below the mean it becomes a negative number, and if it is above the mean it becomes a positive number.

10) Information gain- A measure of how much information is gained by splitting a set of data on a particular feature. It is calculated by comparing the entropy of the original set of data to the entropy of the child sets (a small sketch follows below).

11) Gain ratio (GR)- A modification of information gain that reduces its bias; gain ratio takes the number and size of branches into account when choosing an attribute.

12) Applications of data mining- Business intelligence. Healthcare. Fraud detection and prevention. Marketing and advertising.

13) Advantages of data mining- Marketing/retailing. Banking/crediting. Manufacturing. Customer identification. Disadvantages- Privacy issues. Safety concerns. Information that has been misused or is erroneous. Expensive.
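A minimal sketch of entropy and information gain in Python on a hypothetical binary-labelled data set, assuming base-2 logarithms; this matches the description in item 10 of Chapter 3 above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    """Entropy of the parent minus the weighted entropy of the child splits."""
    total = len(parent_labels)
    weighted = sum(len(g) / total * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

# Hypothetical example: splitting 10 yes/no records on some feature
# produces two child groups.
parent = ["yes"] * 6 + ["no"] * 4
children = [["yes"] * 5 + ["no"] * 1, ["yes"] * 1 + ["no"] * 3]
print(round(information_gain(parent, children), 3))
```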

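A minimal sketch of the chi-square test of independence described in item 6 of Chapter 3, using scipy on a hypothetical 2x2 contingency table of two categorical variables.

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = customer group, columns = product preference.
observed = [
    [30, 10],   # group A choosing product X vs product Y
    [15, 25],   # group B choosing product X vs product Y
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi2 =", round(chi2, 3), "p-value =", round(p_value, 4), "dof =", dof)

# A small p-value (e.g. below 0.05) suggests the two variables are related.
```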