
CCS341-Data Warehousing Notes-Unit I

A Data Warehouse (DW) is a centralized repository designed for data analysis and decision-making, integrating historical data from various sources. It supports business intelligence by enabling complex queries and analytics that operational databases cannot efficiently handle. Key characteristics include being subject-oriented, integrated, time-variant, and non-volatile, making it essential for organizations to maintain historical data and derive insights for strategic decisions.

Uploaded by

futureone143

DEPARTMENT OF INFORMATION TECHNOLOGY

CCS341 - DATA WAREHOUSING

UNIT 1 - INTRODUCTION TO DATA WAREHOUSE
Data Warehouse:
A data warehouse is separate from the operational DBMS. It stores a huge amount of data, which is
typically collected from multiple heterogeneous sources such as files, DBMSs, etc. The goal is to produce
statistical results that may help in decision-making.
A data warehouse, or enterprise data warehouse (EDW), is a system that aggregates data from
different sources into a single, central, consistent data store to support data analysis, data mining,
artificial intelligence (AI), and machine learning. A data warehouse system enables an organization to run
powerful analytics on huge volumes of historical data in ways that a standard database cannot.

Example Applications of Data Warehousing
Data warehousing can be applied anywhere we have a huge amount of data and want to
see statistical results that help in decision making.
➢ Social Media Websites: Social networking websites such as Facebook, Twitter, and LinkedIn are based on
analyzing large data sets. These sites gather data related to members, groups, locations, etc., and store it in a
single central repository. Because the volume of data is large, a data warehouse is needed to implement this.
➢ Banking: Most banks these days use warehouses to see the spending patterns of account holders and cardholders.
They use this to provide them with special offers, deals, etc.
➢ Government: Governments use a data warehouse to store and analyze tax payments, which helps to detect
tax evasion.
UNIT – I

INTRODUCTION TO DATA WAREHOUSE

INTRODUCTION:

A Data Warehouse is a relational database management system (RDBMS) constructed to meet
requirements that transaction processing systems are not designed for. It can be loosely described as any centralized
data repository which can be queried for business benefit. It is a database that stores information oriented
to satisfy decision-making requests. It is a group of decision support technologies that aims to enable
the knowledge worker (executive, manager, and analyst) to make better and faster decisions. So,
data warehousing supports architectures and tools for business executives
to systematically organize, understand, and use their information to make strategic decisions.

A Data Warehouse environment contains an extraction, transformation, and loading (ETL) solution, an online
analytical processing (OLAP) engine, client analysis tools, and other applications that handle the
process of gathering information and delivering it to business users.

What is a Data Warehouse?

A Data Warehouse (DW) is a relational database that is designed for query and analysis rather
than transaction processing. It includes historical data derived from transaction data from single and
multiple sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.

A Data Warehouse is a group of data specific to the entire organization, not only to a particular group
of users.

It is not used for daily operations and transaction processing but for making decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support
of management's decisions."
Characteristics of Data Warehouse

Subject-Oriented

A data warehouse targets the modeling and analysis of data for decision-makers. Therefore,
data warehouses typically provide a concise and straightforward view around a particular subject, such
as customer, product, or sales, instead of the organization's ongoing global operations. This is done
by excluding data that are not useful concerning the subject and including all data needed by the users to
understand the subject.

Integrated

A data warehouse integrates various heterogeneous data sources like RDBMSs, flat files, and
online transaction records. It requires performing data cleaning and integration during data warehousing
to ensure consistency in naming conventions, attribute types, etc., among different data sources.

Time-Variant

Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months,
6 months, 12 months, or even earlier from a data warehouse. This contrasts with a
transaction system, where often only the most current data is kept.

Non-Volatile

The data warehouse is a physically separate data store, which is transformed from the source operational
RDBMS. Operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed. It usually requires only two procedures in data accessing:
initial loading of data and access to data. Therefore, the DW does not require transaction processing,
recovery, and concurrency capabilities, which allows for a substantial speedup of data retrieval. Non-volatile
means that once entered into the warehouse, data should not change.
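
The time-variant and non-volatile properties can be illustrated with a minimal Python sketch using the standard sqlite3 module. The table names (op_account, dw_account_history), the helper functions, and the sample values are assumptions made for illustration only; they do not describe any particular warehouse product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Operational table (hypothetical): one row per account, overwritten in place.
conn.execute("CREATE TABLE op_account (id INTEGER PRIMARY KEY, balance REAL)")
# Warehouse table (hypothetical): one row per account per load date, append-only.
conn.execute(
    "CREATE TABLE dw_account_history (id INTEGER, balance REAL, load_date TEXT)"
)


def apply_transaction(account_id, new_balance):
    """Operational update: the previous balance is replaced and lost."""
    conn.execute(
        "INSERT OR REPLACE INTO op_account (id, balance) VALUES (?, ?)",
        (account_id, new_balance),
    )


def load_snapshot(load_date):
    """Warehouse load: copy the current operational state as new dated rows."""
    conn.execute(
        "INSERT INTO dw_account_history (id, balance, load_date) "
        "SELECT id, balance, ? FROM op_account",
        (load_date,),
    )


apply_transaction(1, 100.0)
load_snapshot("2024-01-31")
apply_transaction(1, 250.0)
load_snapshot("2024-02-29")

# The operational system only knows the latest value.
current = conn.execute("SELECT balance FROM op_account WHERE id = 1").fetchall()
# The warehouse keeps every loaded value, ordered by load date.
history = conn.execute(
    "SELECT load_date, balance FROM dw_account_history "
    "WHERE id = 1 ORDER BY load_date"
).fetchall()
```

Here the warehouse table is only ever appended to, so earlier snapshots remain available for analysis after the operational row has changed.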

Goals of Data Warehousing

o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making.

Need for Data Warehouse

A Data Warehouse is needed for the following reasons:

1. Business users: Business users require a data warehouse to view summarized data from the
past. Since these people are non-technical, the data may be presented to them in an elementary
form.
2. Store historical data: A data warehouse is required to store time-variant data from the past. This
data is used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in the data warehouse.
So, the data warehouse contributes to making strategic decisions.
4. For data consistency and quality: By bringing the data from different sources to a common place, the user can
effectively bring uniformity and consistency to the data.
5. Fast response time: A data warehouse has to be ready for somewhat unexpected loads and types of
queries, which demands a significant degree of flexibility and quick response time.

Benefits of Data Warehouse

1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand, and query.
4. Queries that would be complex in many normalized databases could be easier to build and maintain in
data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots
of users.
6. Data warehousing provides the capability to analyze a large amount of historical data.

Data Warehouse Components:

Architecture is the proper arrangement of the elements. We build a data warehouse with software
and hardware components. To suit the requirements of our organization, we arrange these building blocks
in a suitable way, and we may want to strengthen a particular part with extra tools and services. All of this depends on our
circumstances.
The figure shows the essential elements of a typical warehouse. The Source Data component is shown
on the left. The Data Staging element serves as the next building block. In the middle, we see the Data
Storage component that handles the data warehouse's data. This element not only stores and manages the
data; it also keeps track of the data using the metadata repository. The Information Delivery component, shown
on the right, consists of all the different ways of making the information from the data warehouse
available to the users.

Source Data Component

Source data coming into the data warehouse may be grouped into four broad categories:

Production Data: This type of data comes from the different operational systems of the enterprise. Based on the
data requirements in the data warehouse, we choose segments of the data from the
various operational systems.

Internal Data: In each organization, users keep their "private" spreadsheets, reports,
customer profiles, and sometimes even departmental databases. This is the internal data, part of which could
be useful in a data warehouse.

Archived Data: Operational systems are mainly intended to run the current business. In every operational system,
we periodically take the old data and store it in archived files.

External Data: Most executives depend on information from external sources for a large percentage of the
information they use. They use statistics relating to their industry produced by external agencies.

Data Staging Component

After we have extracted data from various operational systems and external sources, we have
to prepare the files for storing in the data warehouse. The extracted data coming from several
different sources needs to be changed, converted, and made ready in a format that is suitable to be saved
for querying and analysis.

We will now discuss the three primary functions that take place in the staging area.

1) Data Extraction: This function has to deal with numerous data sources. We have to employ
the appropriate techniques for each data source.

2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If data
extraction for a data warehouse poses big challenges, data transformation presents even greater
challenges. We perform several individual tasks as part of data transformation.

First, we clean the data extracted from each source. Cleaning may be the correction of misspellings,
providing default values for missing data elements, or the elimination of duplicates when we
bring in the same data from various source systems.

Standardization of data components forms a large part of data transformation. Data transformation
contains many forms of combining pieces of data from different sources. We combine data from a single
source record or related data parts from many source records.

On the other hand, data transformation also contains purging source data that is not useful
and separating out source records into new combinations. Sorting and merging of data take place on a
large scale in the data staging area. When the data transformation function ends, we have a collection
of integrated data that is cleaned, standardized, and summarized.

3) Data Loading: Two distinct categories of tasks form the data loading function. When we complete
the structure and construction of the data warehouse and go live for the first time, we do the initial loading of
the information into the data warehouse storage. The initial load moves high volumes of data, using up a
substantial amount of time. After that, ongoing loads apply the changes from the source systems at regular intervals.
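
The three staging functions described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the two source lists, the field names, the spelling-fix table, and the default value are all hypothetical, and a real staging area would read from actual source systems rather than in-memory lists.

```python
# Hypothetical records extracted from two source systems.
crm_rows = [
    {"customer_id": 1, "name": "Asha", "city": "Chenai"},  # misspelled city
    {"customer_id": 2, "name": "Ravi", "city": None},      # missing city
]
billing_rows = [
    {"customer_id": 1, "name": "Asha", "city": "Chennai"},  # duplicate customer
    {"customer_id": 3, "name": "Meena", "city": "Madurai"},
]

SPELLING_FIXES = {"Chenai": "Chennai"}  # assumed correction table
DEFAULT_CITY = "UNKNOWN"                # assumed default for missing values


def extract():
    """Extraction: gather rows from every source."""
    return crm_rows + billing_rows


def transform(rows):
    """Transformation: fix misspellings, fill defaults, drop duplicates."""
    cleaned = {}
    for row in rows:
        city = row["city"] or DEFAULT_CITY
        city = SPELLING_FIXES.get(city, city)
        record = {"customer_id": row["customer_id"], "name": row["name"], "city": city}
        # Keep the first record seen for each customer_id.
        cleaned.setdefault(row["customer_id"], record)
    return sorted(cleaned.values(), key=lambda r: r["customer_id"])


def load(rows, warehouse):
    """Loading: append the prepared rows to the warehouse table."""
    warehouse.extend(rows)
    return len(rows)


warehouse_table = []
loaded = load(transform(extract()), warehouse_table)
```

Keeping the first record per key is only one possible duplicate-resolution rule; which source wins is a design decision that depends on the organization.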

Data Storage Components

Data storage for the data warehouse is a separate repository. The data repositories for the
operational systems generally include only the current data. Also, these data repositories hold the data
structured in highly normalized form for fast and efficient processing.

Information Delivery Component

The information delivery element is used to enable the process of subscribing to data warehouse files and
having them transferred to one or more destinations according to some customer-specified
scheduling algorithm.

Metadata Component

Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management
system. In the data dictionary, we keep the data about the logical data structures, the data about the records
and addresses, the information about the indexes, and so on.
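
As a small illustration of a data dictionary, the sketch below queries SQLite's own catalog from Python. SQLite is used here only because it ships with Python; the table and index names (sales_fact, idx_sales_product) are hypothetical, and a warehouse metadata repository would hold far more than this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, product TEXT, amount REAL)"
)
conn.execute("CREATE INDEX idx_sales_product ON sales_fact (product)")

# sqlite_master is SQLite's built-in catalog of tables and indexes.
objects = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(sales_fact)")]
```

The catalog describes the structures (tables, columns, indexes) rather than the business data itself, which is the role the text assigns to metadata.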

Data Marts

A data mart includes a subset of corporate-wide data that is of value to a specific group of users. The scope is
confined to particular selected subjects. Data in a data warehouse should be fairly current, but not necessarily
up to the minute, although developments in the data warehouse industry have made regular and incremental
data dumps more achievable. Data marts are smaller than data warehouses and usually contain data for one part of the organization. The
current trend in data warehousing is to develop a data warehouse with several smaller related data
marts for particular kinds of queries and reports.

Management and Control Component

The management and control elements coordinate the services and functions within the
data warehouse. These components control the data transformation and the data transfer into the
data warehouse storage. On the other hand, they moderate the data delivery to the clients. They work with
the database management systems and enable data to be correctly saved in the repositories. They
monitor the movement of information into the staging area and from there into the data warehouse
storage itself.

Why do we need a separate Data Warehouse?

➢ Data warehouse queries are complex because they involve the computation of large groups of data at
summarized levels.

➢ It may require the use of distinctive data organization, access, and implementation methods
based on multidimensional views.

➢ Performing OLAP queries in an operational database degrades the performance of operational tasks.

➢ A data warehouse is used for analysis and decision making, which requires an extensive database,
including historical data, which an operational database does not typically maintain.

➢ The separation of an operational database from data warehouses is based on the
different structures and uses of data in these systems.

➢ Because the two systems provide different functionalities and require different kinds of data, it
is necessary to maintain separate databases.

Difference between Database and Data Warehouse

1. Database: It is used for Online Transactional Processing (OLTP) but can be used for other
objectives such as data warehousing. It records the data from the clients for history.
   Data Warehouse: It is used for Online Analytical Processing (OLAP). It reads the historical
information about the customers for business decisions.

2. Database: The tables and joins are complicated since they are normalized for the RDBMS. This is
done to reduce redundant data and to save storage space.
   Data Warehouse: The tables and joins are simple since they are de-normalized. This is done to
minimize the response time for analytical queries.

3. Database: Data is dynamic.
   Data Warehouse: Data is largely static.

4. Database: Entity-Relationship modeling techniques are used for RDBMS database design.
   Data Warehouse: Data modeling techniques are used for the data warehouse design.

5. Database: Optimized for write operations.
   Data Warehouse: Optimized for read operations.

6. Database: Performance is low for analysis queries.
   Data Warehouse: High performance for analytical queries.

7. Database: The database is the place where the data is taken as a base and managed to provide
fast and efficient access.
   Data Warehouse: The data warehouse is the place where the application data is handled for
analysis and reporting objectives.

Difference between Operational Database and Data Warehouse

➢ The operational database is the source of information for the data warehouse. It includes detailed
information used to run the day-to-day operations of the business. The data frequently changes as
updates are made and reflects the current value of the last transactions.

➢ Operational Database Management Systems, also called OLTP (Online Transaction Processing)
databases, are used to manage dynamic data in real time.

➢ Data warehouse systems serve users or knowledge workers for the purpose of data analysis
and decision-making. Such systems can organize and present information in specific formats
to accommodate the diverse needs of various users. These systems are called Online
Analytical Processing (OLAP) systems.

➢ The data warehouse and the OLTP database are both relational databases. However, the goals of
these two databases are different.

Operational Database versus Data Warehouse

1. Operational Database: Operational systems are designed to support high-volume transaction processing.
   Data Warehouse: Data warehousing systems are typically designed to support high-volume
analytical processing (i.e., OLAP).

2. Operational Database: Operational systems are usually concerned with current data.
   Data Warehouse: Data warehousing systems are usually concerned with historical data.

3. Operational Database: Data within operational systems is mainly updated regularly according to need.
   Data Warehouse: Non-volatile; new data may be added regularly. Once added, it is rarely changed.

4. Operational Database: It is designed for real-time business dealings and processes.
   Data Warehouse: It is designed for analysis of business measures by subject area, categories,
and attributes.

5. Operational Database: It is optimized for a simple set of transactions, generally adding or
retrieving a single row at a time per table.
   Data Warehouse: It is optimized for bulk loads and large, complex, unpredictable queries that
access many rows per table.

6. Operational Database: It is optimized for validation of incoming information during
transactions and uses validation data tables.
   Data Warehouse: It is loaded with consistent, valid information and requires no real-time validation.

7. Operational Database: It supports thousands of concurrent clients.
   Data Warehouse: It supports a few concurrent clients relative to OLTP.

8. Operational Database: Operational systems are widely process-oriented.
   Data Warehouse: Data warehousing systems are widely subject-oriented.

9. Operational Database: Operational systems are usually optimized to perform fast inserts and
updates of relatively small volumes of data.
   Data Warehouse: Data warehousing systems are usually optimized to perform fast retrievals of
relatively high volumes of data.

10. Operational Database: Data In.
    Data Warehouse: Data Out.

11. Operational Database: A small number of records is accessed.
    Data Warehouse: A large number of records is accessed.

12. Operational Database: Relational databases are created for Online Transactional Processing (OLTP).
    Data Warehouse: The data warehouse is designed for Online Analytical Processing (OLAP).

Difference between OLTP and OLAP

OLTP System

An OLTP system deals with operational data. Operational data are those data contained in the operation of a
particular system. Examples include ATM transactions, bank transactions, etc.

OLAP System

➢ OLAP deals with historical data or archival data. Historical data are those data that
are archived over a long period. For example, if we collect the last 10 years of information about
flight reservations, the data can give us much meaningful information, such as the trends in reservations.
This may provide useful information like the peak time of travel, what kind of people are traveling
in various classes (Economy/Business), etc.
➢ The major difference between an OLTP and an OLAP system is the amount of data analyzed in
a single transaction. An OLTP system manages many concurrent customers and queries
touching only an individual record or limited groups of records at a time, whereas an OLAP system must have
the capability to operate on millions of records to answer a single query.
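
The contrast between the two access patterns can be sketched with Python's sqlite3 module, following the flight reservation example above. The reservations table, its columns, and the seven sample rows are hypothetical; a real warehouse would hold millions of rows rather than a handful.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reservations ("
    "id INTEGER PRIMARY KEY, year INTEGER, month INTEGER, travel_class TEXT)"
)
rows = [
    (1, 2022, 5, "Economy"),
    (2, 2022, 5, "Business"),
    (3, 2022, 12, "Economy"),
    (4, 2023, 5, "Economy"),
    (5, 2023, 12, "Economy"),
    (6, 2023, 12, "Business"),
    (7, 2023, 12, "Economy"),
]
conn.executemany("INSERT INTO reservations VALUES (?, ?, ?, ?)", rows)

# OLTP-style access: touch a single record identified by its key.
one_booking = conn.execute(
    "SELECT travel_class FROM reservations WHERE id = ?", (4,)
).fetchone()

# OLAP-style access: scan the whole history and aggregate it to find a trend.
bookings_per_month = conn.execute(
    "SELECT month, COUNT(*) FROM reservations "
    "GROUP BY month ORDER BY COUNT(*) DESC, month"
).fetchall()
peak_month = bookings_per_month[0][0]
```

The first query reads one row; the second reads every row in the table to answer a single question about peak travel time.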

Feature comparison of OLTP and OLAP

Characteristic
   OLTP: It is a system which is used to manage operational data.
   OLAP: It is a system which is used to manage informational data.

Users
   OLTP: Clerks, clients, and information technology professionals.
   OLAP: Knowledge workers, including managers, executives, and analysts.

System orientation
   OLTP: An OLTP system is customer-oriented; transaction and query processing are done by clerks,
clients, and information technology professionals.
   OLAP: An OLAP system is market-oriented; data analysis is done by knowledge workers, including
managers, executives, and analysts.

Data contents
   OLTP: An OLTP system manages current data that is too detailed to be easily used for decision making.
   OLAP: An OLAP system manages a large amount of historical data, provides facilities for
summarization and aggregation, and stores and manages data at different levels of granularity.
This makes the data easier to use in informed decision making.

Database size
   OLTP: 100 MB to GB
   OLAP: 100 GB to TB

Database design
   OLTP: An OLTP system usually uses an entity-relationship (ER) data model and an
application-oriented database design.
   OLAP: An OLAP system typically uses either a star or a snowflake model and a subject-oriented
database design.

View
   OLTP: An OLTP system focuses primarily on the current data within an enterprise or department,
without referring to historical information or data in different organizations.
   OLAP: An OLAP system often spans multiple versions of a database schema, due to the evolutionary
process of an organization. OLAP systems also deal with data that originates from various
organizations, integrating information from many data stores.

Volume of data
   OLTP: Not very large.
   OLAP: Because of their large volume, OLAP data are stored on multiple storage media.

Access patterns
   OLTP: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a
system requires concurrency control and recovery techniques.
   OLAP: Accesses to OLAP systems are mostly read-only operations, because data warehouses store
historical data.

Access mode
   OLTP: Read/write
   OLAP: Mostly read

Inserts and updates
   OLTP: Short and fast inserts and updates initiated by end-users.
   OLAP: Periodic long-running batch jobs refresh the data.

Number of records accessed
   OLTP: Tens
   OLAP: Millions

Normalization
   OLTP: Fully normalized
   OLAP: Partially normalized

Processing speed
   OLTP: Very fast
   OLAP: It depends on the amount of data contained, batch data refresh, and co

Data Warehouse Architecture

➢ A data warehouse architecture is a method of defining the overall architecture of
data communication, processing, and presentation that exists for end-client computing within
the enterprise. Each data warehouse is different, but all are characterized by standard
vital components.

➢ Production applications such as payroll, accounts payable, product purchasing, and inventory control
are designed for online transaction processing (OLTP). Such applications gather detailed data from
day-to-day operations.

➢ Data warehouse applications are designed to support the user's ad-hoc data requirements, an activity
recently dubbed online analytical processing (OLAP). These include applications such
as forecasting, profiling, summary reporting, and trend analysis.

➢ Production databases are updated continuously, either by hand or via OLTP applications.
In contrast, a warehouse database is updated from operational systems periodically, usually
during off-hours. As OLTP data accumulates in production databases, it is regularly extracted,
filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the
warehouse is populated, it must be restructured: tables de-normalized, data cleansed of errors and
redundancies, and new fields and keys added to reflect the needs of the user for sorting, combining,
and summarizing data.

➢ Data warehouses and their architectures vary depending upon the elements of an
organization's situation.

Three common architectures are:

o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic

Operational System

➢ An operational system is a term used in data warehousing to refer to a system that is
used to process the day-to-day transactions of an organization.

Flat Files

➢ A flat file system is a system of files in which transactional data is stored, and every file in
the system must have a different name.

Meta Data

➢ A set of data that defines and gives information about other data.

➢ Meta data is used in a data warehouse for a variety of purposes, including:

➢ Meta data summarizes necessary information about data, which can make
finding and working with particular instances of data easier. For example,
author, date created, date modified, and file size are examples of very basic
document metadata.

➢ Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

➢ This area of the data warehouse stores all the predefined lightly and
highly summarized (aggregated) data generated by the warehouse manager.
➢ The goal of the summarized information is to speed up query performance.
The summarized records are updated continuously as new information is loaded into
the warehouse.
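
The idea of lightly and highly summarized data can be sketched as follows. The detail rows, the grouping levels (month and region versus month only), and the function names are assumptions chosen for illustration; a warehouse manager would normally maintain such aggregates inside the database rather than in Python dictionaries.

```python
from collections import defaultdict

# Hypothetical detailed rows: (date, region, amount).
detail = [
    ("2024-01-05", "North", 100.0),
    ("2024-01-05", "South", 50.0),
    ("2024-01-20", "North", 25.0),
    ("2024-02-02", "South", 75.0),
]


def summarize(rows, key):
    """Sum the amount column (index 2) for each group produced by key."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)


def refresh_summaries(rows):
    """Rebuild both summary levels from the detailed rows."""
    lightly = summarize(rows, key=lambda r: (r[0][:7], r[1]))  # month and region
    highly = summarize(rows, key=lambda r: r[0][:7])           # month only
    return lightly, highly


lightly, highly = refresh_summaries(detail)

# New information is loaded, so the summaries are refreshed.
detail.append(("2024-02-10", "North", 10.0))
lightly_after, highly_after = refresh_summaries(detail)
```

A query for monthly totals can then read the small highly summarized structure instead of scanning every detailed row, which is where the speed-up comes from.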

End-User Access Tools

➢ The principal purpose of a data warehouse is to provide information to the business
managers for strategic decision-making. These users interact with the
warehouse using end-client access tools.

Examples of end-user access tools include:

o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

Data Warehouse Architecture: With Staging Area

o We must clean and process our operational information before putting it into the warehouse.
o We can do this programmatically, although most data warehouses use a staging area (a place
where data is processed before entering the warehouse).
o A staging area simplifies data cleansing and consolidation for operational data coming
from multiple source systems, especially for enterprise data warehouses where all relevant data of
an enterprise is consolidated.

The Data Warehouse Staging Area is a temporary location where records from source systems are copied.

Data Warehouse Architecture: With Staging Area and Data Marts

➢ We may want to customize our warehouse's architecture for multiple groups within
our organization.

➢ We can do this by adding data marts. A data mart is a segment of a data warehouse
that can provide information for reporting and analysis on a section, unit,
department, or operation in the company, e.g., sales, payroll, production, etc.

➢ The figure illustrates an example where purchasing, sales, and stocks are separated.
In this example, a financial analyst wants to analyze historical data for purchases
and sales or mine historical information to make predictions
about customer behavior.
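
A data mart as a subject-specific subset of the warehouse can be sketched like this. The warehouse rows, the subject labels, and the build_data_mart helper are hypothetical and stand in for what would normally be tables populated by the warehouse's load process.

```python
# Hypothetical warehouse rows covering several subjects.
warehouse = [
    {"subject": "sales", "year": 2023, "amount": 1200.0},
    {"subject": "purchasing", "year": 2023, "amount": 800.0},
    {"subject": "sales", "year": 2024, "amount": 1500.0},
    {"subject": "inventory", "year": 2024, "amount": 300.0},
]


def build_data_mart(rows, subjects):
    """Copy only the rows whose subject is relevant to one group of users."""
    wanted = set(subjects)
    return [dict(row) for row in rows if row["subject"] in wanted]


# A mart for the sales department and one for a financial analyst who
# needs both purchases and sales.
sales_mart = build_data_mart(warehouse, ["sales"])
finance_mart = build_data_mart(warehouse, ["sales", "purchasing"])
```

Each mart holds a copy of part of the warehouse, so the same row can appear in more than one mart.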
Properties of Data Warehouse Architectures

The following architecture properties are necessary for a data warehouse system:

1. Separation: Analytical and transactional processing should be kept apart as much as possible.

2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume,
which has to be managed and processed, and the number of user requirements, which have to be
met, progressively increase.

3. Extensibility: The architecture should be able to accommodate new operations and technologies
without redesigning the whole system.

4. Security: Monitoring accesses is necessary because of the strategic data stored in the data warehouse.

5. Administerability: Data warehouse management should not be complicated.
Types of Data Warehouse Architectures

Single-Tier Architecture
➢ Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of
data stored; to reach this goal, it removes data redundancies.

➢ The figure shows that the only layer physically available is the source layer. In this method,
data warehouses are virtual. This means that the data warehouse is implemented as a multidimensional
view of operational data created by specific middleware, or an intermediate processing layer.
The weakness of this architecture lies in its failure to meet the requirement for separation
between analytical and transactional processing. Analysis queries are submitted to operational data after
the middleware interprets them. In this way, queries affect transactional workloads.
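
A virtual warehouse of this kind can be approximated with a database view. In the sketch below, the orders table and the v_sales_by_product view are hypothetical, and an ordinary SQL view stands in for the middleware layer described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Operational table (hypothetical): the only data physically stored.
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, product TEXT, qty INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "pen", 10, 2.0), (2, "book", 1, 15.0), (3, "pen", 5, 2.0)],
)

# The "virtual warehouse": a view over operational data, with no copy made.
conn.execute(
    "CREATE VIEW v_sales_by_product AS "
    "SELECT product, SUM(qty * unit_price) AS revenue "
    "FROM orders GROUP BY product"
)

query = "SELECT product, revenue FROM v_sales_by_product ORDER BY product"
before = conn.execute(query).fetchall()

# A new transaction is immediately visible to the analysis query,
# because the view is evaluated against the operational table itself.
conn.execute("INSERT INTO orders VALUES (4, 'book', 2, 15.0)")
after = conn.execute(query).fetchall()
```

No redundant data is stored, but every analysis query runs directly against the operational table, which is the lack of separation noted above.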

Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a
data warehouse system, as shown in the figure.

Although it is typically called a two-layer architecture to highlight the separation between physically available
sources and data warehouses, it in fact consists of four subsequent data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is
stored initially in corporate relational databases or legacy databases, or it may come from an information
system outside the corporate walls.
2. Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies
and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The
so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous
schemata, extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
3. Data warehouse layer: Information is saved to one logically centralized individual repository: a data
warehouse. The data warehouse can be directly accessed, but it can also be used as a
source for creating data marts, which partially replicate data warehouse contents and are designed
for specific enterprise departments. Meta-data repositories store information on sources, access
procedures, data staging, users, data mart schema, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue
reports, dynamically analyze information, and simulate hypothetical business scenarios. It should
feature aggregate information navigators, complex query optimizers, and customer-friendly GUIs.

Three-Tier Architecture
➢ The three-tier architecture consists of the source layer (containing multiple source systems),
the reconciled layer, and the data warehouse layer (containing both data warehouses and data
marts). The reconciled layer sits between the source data and the data warehouse.
➢ The main advantage of the reconciled layer is that it creates a standard reference data model for a whole
enterprise. At the same time, it separates the problems of source data extraction and integration
from those of data warehouse population. In some cases, the reconciled layer is also
directly used to better accomplish some operational tasks, such as producing daily reports that cannot
be satisfactorily prepared using the corporate applications or generating data flows to feed external
processes periodically to benefit from cleaning and integration.
➢ This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage
of this structure is the extra file storage space used by the redundant reconciled layer. It also
makes the analytical tools a little further away from being real-time.

Three-Tier Data Warehouse Architecture
Data warehouses usually have a three-level (tier) architecture that includes:

1. Bottom Tier (Data Warehouse Server)
2. Middle Tier (OLAP Server)
3. Top Tier (Front-end Tools).

➢ A bottom tier that consists of the Data Warehouse server, which is almost always an RDBMS.
It may include several specialized data marts and a metadata repository.
➢ Data from operational databases and external sources (such as user profile data provided
by external consultants) are extracted using application program interfaces called gateways.
A gateway is provided by the underlying DBMS and allows customer programs to generate
SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database),
by Microsoft, and JDBC (Java Database Connectivity).

A middle tier which consists of an OLAP server for fast querying of the data warehouse.

The OLAP server is implemented using either

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps functions on multidimensional
data to standard relational operations.

(2) A Multidimensional OLAP (MOLAP) model,
i.e., a special-purpose server that directly implements multidimensional information and
operations.
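
The ROLAP idea of mapping multidimensional operations to standard relational ones can be sketched with two common cube operations, roll-up and slice, expressed as SQL GROUP BY and WHERE clauses. The sales table, its three dimensions, and the roll_up and slice_cube helpers are assumptions made for illustration and are not the interface of any real OLAP server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("North", "pen", 2023, 10.0),
        ("North", "book", 2023, 20.0),
        ("South", "pen", 2023, 5.0),
        ("North", "pen", 2024, 12.0),
        ("South", "book", 2024, 30.0),
    ],
)

DIMENSIONS = {"region", "product", "year"}


def _check(names):
    """Only known dimension names may be placed into the SQL text."""
    unknown = set(names) - DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")


def roll_up(keep):
    """Roll-up: aggregate the measure over every dimension not in keep."""
    _check(keep)
    cols = ", ".join(keep)
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


def slice_cube(dimension, value, keep):
    """Slice: fix one dimension to a value, then aggregate by keep."""
    _check([dimension, *keep])
    cols = ", ".join(keep)
    sql = (
        f"SELECT {cols}, SUM(amount) FROM sales "
        f"WHERE {dimension} = ? GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql, (value,)).fetchall()


by_region = roll_up(["region"])
products_2023 = slice_cube("year", 2023, ["product"])
```

A MOLAP server would instead store the cube in a multidimensional structure and answer these requests without generating SQL.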

Atop-tierthatcontains front-endtools fordisplayingresultsprovidedby OLAP,aswellasadditionaltools for data


mining of the OLAP-generated data.

The overallDataWarehouseArchitectureis shownin fig:


The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle-tier and top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimensions, hierarchies,
data mart locations and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data, i.e.,
active, archived, or purged, and warehouse monitoring information, i.e., usage statistics, error
reports, audit trails, etc.
3. System performance data, which includes indices used to improve data access and
retrieval performance.
4. Information about the mapping from operational databases, which covers source RDBMSs
and their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business metadata, which
includes business terms and definitions, ownership information, etc.

Principles of Data Warehousing

Load Performance
Data warehouses require incremental loading of new data on a periodic basis within narrow
time windows; performance of the load process should be measured in hundreds of millions of rows
and gigabytes per hour and must not artificially constrain the volume of data the business requires.

Load Processing

Many steps must be taken to load new or updated data into the data warehouse, including
data conversion, filtering, reformatting, indexing, and metadata update.
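
The load steps listed above can be sketched as one small function. This is a minimal illustration, assuming an in-memory sqlite3 database as the warehouse; the fact_sales and load_log tables, the sample rows, and the load function are hypothetical.

```python
import sqlite3
from datetime import date

raw = [  # hypothetical extract: (customer, amount as text, ISO date)
    ("alice ", "10.50", "2024-01-05"),
    ("BOB", "n/a", "2024-01-06"),
    ("carol", "7", "2024-01-07"),
]

def load(conn, rows):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (customer TEXT, amount REAL, day TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_log (loaded_on TEXT, row_count INTEGER)"
    )
    clean = []
    for customer, amount, day in rows:
        try:
            value = float(amount)  # data conversion
        except ValueError:
            continue  # filtering: drop rows whose amount cannot be parsed
        clean.append((customer.strip().title(), value, day))  # reformatting
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", clean)
    # indexing
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_day ON fact_sales(day)")
    # metadata update: record when the load ran and how many rows it kept
    conn.execute(
        "INSERT INTO load_log VALUES (?, ?)", (date.today().isoformat(), len(clean))
    )
    conn.commit()
    return len(clean)

warehouse = sqlite3.connect(":memory:")
loaded = load(warehouse, raw)
```

A production load would run these phases in bulk and in parallel to meet the narrow time windows mentioned under Load Performance; the sketch only shows the order of the steps.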

Data Quality Management

Fact-based management demands the highest data quality. The warehouse ensures
local consistency, global consistency, and referential integrity despite "dirty" sources and massive
database size.
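
As one example of such a check, referential integrity between a fact table and a dimension can be tested by looking for fact rows whose key has no matching dimension entry. The sketch below is an assumption-laden illustration; orphan_keys and the sample data are hypothetical.

```python
def orphan_keys(fact_rows, dimension_keys, key_index=0):
    """Return fact rows whose foreign key has no match in the dimension."""
    known = set(dimension_keys)
    return [row for row in fact_rows if row[key_index] not in known]

customers = [1, 2, 3]                      # keys present in the dimension
sales = [(1, 20.0), (4, 15.0), (2, 9.5)]   # (customer_key, amount)
bad = orphan_keys(sales, customers)
```

Rows returned by the check would be rejected or routed to a cleaning step before they reach the warehouse.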

Query Performance

Fact-based management must not be slowed by the performance of the data warehouse
RDBMS; large, complex queries must complete in seconds, not days.

Terabyte Scalability

Data warehouse sizes are growing at astonishing rates. Today they range from a few to hundreds of
gigabytes, up to terabyte-sized data warehouses.

Snowflake vs. Oracle: Which Data Warehouse Is Better?
Snowflake and Oracle Autonomous Data Warehouse are two cloud data warehouses that provide you with
a single source of truth (SSOT) for all the data that exists in your organization. You can use either of these
warehouses to run data through business intelligence (BI) tools and automate insights for decision-making.
But which one should you add to your tech stack? In this guide, learn the differences between Snowflake
and Oracle and how you can transfer data to the warehouse of your choice.

Here are the key takeaways to know about Snowflake vs. Oracle:

• Snowflake and Oracle are both powerful data warehousing platforms with their own
unique strengths and capabilities.
• Snowflake is a cloud-native platform known for its scalability, flexibility, and performance.
It offers a shared data model and separation of compute and storage, enabling seamless scaling
and cost-efficiency.
• Oracle, on the other hand, has a long-standing reputation and offers a comprehensive suite of data
management tools and solutions. It is recognized for its reliability, scalability, and
extensive ecosystem.
• Snowflake excels in handling large-scale, concurrent workloads and provides native
integration with popular data processing and analytics tools.
• Oracle provides powerful optimization capabilities and offers a robust platform for enterprise-
scale data warehousing, analytics, and business intelligence.

What Is Snowflake?

Snowflake is a data warehouse built for the cloud. It centralizes data from multiple sources, enabling
you to run in-depth business insights that power your teams.

At its core, Snowflake is designed to handle structured and semi-structured data from various
sources, allowing organizations to integrate and analyze data from diverse systems seamlessly. Its
unique architecture separates compute and storage, enabling users to scale each independently based on
their specific needs. This elasticity ensures optimal resource allocation and cost-efficiency, as users only
pay for the actual compute and storage utilized.

Snowflake uses a SQL-based query language, making it accessible to data analysts and SQL developers. Its
intuitive interface and user-friendly features allow for efficient data exploration, transformation,
and analysis. Additionally, Snowflake provides robust security and compliance
features, ensuring data privacy and protection.

One of Snowflake’s notable strengths is its ability to handle large-scale, concurrent workloads
without performance degradation. Its auto-scaling capabilities automatically adjust resources based on
the workload demands, eliminating the need for manual tuning and optimization.

Another key advantage of Snowflake is its native integration with popular data processing and analytics
tools, such as Apache Spark, Python, and R. This compatibility enables seamless data integration,
data engineering, and advanced analytics workflows.

What Is Oracle?

Oracle is available as a cloud data warehouse and an on-premises warehouse (available through Oracle
Exadata Cloud Service). For this comparison, DreamFactory will review Oracle’s cloud service.

Like Snowflake, Oracle provides a centralized location for analytical data activities, making it easier
for businesses like yours to identify trends and patterns in large sets of big data.

Oracle’s flagship product, Oracle Database, is a robust and highly scalable relational database
management system (RDBMS). It is known for its reliability, performance, and extensive feature set,
making it suitable for handling large-scale enterprise data requirements. Oracle Database supports a
wide range of data types and provides advanced features for data modeling, indexing, and querying.

In addition to its RDBMS, Oracle provides a complete ecosystem of data management tools and
technologies. Oracle data warehouse solutions, such as Oracle Exadata and Oracle Autonomous Data
Warehouse, offer high-performance, optimized platforms specifically designed for data warehousing
and analytics workloads.

Oracle’s data warehousing offerings come with a suite of powerful analytics and business intelligence
tools. Oracle Analytics Cloud (OAC) provides comprehensive self-service analytics capabilities, enabling
users to explore and visualize data, build interactive dashboards, and generate actionable insights.

Snowflake vs. Oracle: Pricing

Snowflake and Oracle’s cloud data warehouse adopt a pay-as-you-go model, where you only pay for
the resources you consume. This model can work out to be expensive if you have large amounts
of data, but Snowflake might save you more money in the long run. That’s because clusters will stop
when you’re not running any queries (and resume when queries run again).
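
The effect of suspending idle clusters can be sketched with simple arithmetic. The hourly rate and usage hours below are hypothetical placeholders, not actual Snowflake or Oracle prices, and the sketch ignores storage charges and any minimum billing increments.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month

def monthly_compute_cost(rate_per_hour, busy_hours, auto_suspend):
    """Compute-only cost; storage is billed separately and ignored here."""
    billed_hours = busy_hours if auto_suspend else HOURS_PER_MONTH
    return rate_per_hour * billed_hours

# Hypothetical cluster: 4.0 currency units per hour, queried 200 hours a month.
always_on = monthly_compute_cost(4.0, busy_hours=200, auto_suspend=False)
suspended = monthly_compute_cost(4.0, busy_hours=200, auto_suspend=True)
savings = always_on - suspended
```

Under these assumed numbers the cluster that suspends when idle is billed for 200 hours instead of 730, which is where the potential saving comes from.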
Ease of Use
Snowflake automatically applies all upgrades, fixes, and security features, reducing your
workload. Oracle, however, typically requires a database administrator of some kind, which can add to the
cost of data warehousing in your organization. Similar problems exist with scaling these warehouses to meet
the needs of your business. Snowflake manages partitioning, indexing, and other
data management tasks automatically; Oracle usually requires a database administrator to execute
any scalability-related changes. Consider these differences when comparing Snowflake vs. Oracle.

Features

What about Snowflake vs. Oracle features? Oracle lets you build and run machine learning
algorithms inside its warehouse, which can prove valuable for your analytical objectives. Snowflake
lacks this capability, requiring users to invest in a stand-alone machine learning platform to run
algorithms. Oracle also offers support for cursors, making it simple to program data.

On the flip side, Snowflake comes with an integrated automatic query performance optimization
feature that makes it easy to query data without playing around with too many settings.

Snowflake vs. Oracle: Data Security

Snowflake and Oracle take data security seriously, with features such as data encryption, IP
blocklists, multi-factor authentication, access controls, and adherence to data security standards such as
PCI DSS.

Data Governance
Users should be aware of data governance principles when transferring data to Snowflake or
Oracle. Legislation such as GDPR and HIPAA means businesses can incur expensive penalties for
incorrectly moving sensitive information between data sources and a warehouse. Both platforms handle
data governance adequately, with the ability to manage data quality rules and data stewardship workflows.

What to Consider Before Using Snowflake vs. Oracle

While Snowflake and Oracle are effective data warehouses for analytics, both have steep learning
curves that many businesses might struggle with. Companies will need coding knowledge (SQL)
when operationalizing data in these warehouses and require a data engineer to ensure a smooth transfer of
data between sources and their warehouse of choice.

Moving data to Snowflake or Oracle typically involves a process called Extract, Transform, Load, or
ETL. That means users have to extract data from a source like a relational database, transactional
database, customer relationship management (CRM) system, enterprise resource planning (ERP) system,
or other data platform. After data extraction, users must transform the data into the correct format for
analytics before loading it to Snowflake or Oracle. Another data integration option is Extract,
Load, Transform (ELT), where users extract data and load it to Snowflake or Oracle before transforming
that data into a suitable format.
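
The only difference between the two approaches is where the transformation happens, which the following sketch tries to make concrete. It is an illustration under stated assumptions: an in-memory sqlite3 database stands in for Snowflake or Oracle, and the table names, sample rows, and the etl and elt functions are hypothetical.

```python
import sqlite3

source_rows = [("  Widget ", "3"), ("Gadget", "5")]  # hypothetical raw extract

def etl(conn, rows):
    """Transform in the pipeline, then load the finished rows."""
    conn.execute("CREATE TABLE items_etl (name TEXT, qty INTEGER)")
    transformed = [(name.strip().lower(), int(qty)) for name, qty in rows]
    conn.executemany("INSERT INTO items_etl VALUES (?, ?)", transformed)

def elt(conn, rows):
    """Load raw rows first, then transform inside the warehouse with SQL."""
    conn.execute("CREATE TABLE items_raw (name TEXT, qty TEXT)")
    conn.executemany("INSERT INTO items_raw VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE items_elt AS "
        "SELECT lower(trim(name)) AS name, CAST(qty AS INTEGER) AS qty "
        "FROM items_raw"
    )

wh = sqlite3.connect(":memory:")
etl(wh, source_rows)
elt(wh, source_rows)
etl_result = wh.execute("SELECT name, qty FROM items_etl ORDER BY name").fetchall()
elt_result = wh.execute("SELECT name, qty FROM items_elt ORDER BY name").fetchall()
```

Both routes end with the same cleaned table. ETL keeps raw data out of the warehouse, while ELT keeps the raw copy and relies on the warehouse's own compute for the transformation.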

ETL, ELT, and other data integration methods require a specific skill set because these processes are
so complicated. Using DreamFactory can provide a solution to this problem. It connects data sources
to Snowflake or Oracle through a live, documented, and standardized REST API, offering an alternative
to data warehousing.

Snowflake vs. Oracle: Key Differences

Snowflake and Oracle are two prominent players in the data warehousing space, each offering its
own strengths and capabilities. Understanding the key differences between Snowflake and Oracle can
help organizations make informed decisions when choosing a data warehousing solution.
One of the primary differences lies in their architecture. Snowflake is designed as a cloud-native
platform, built from the ground up for the cloud environment. It offers a unique separation of
compute and storage, allowing independent scaling and optimized
performance. This architecture enables seamless scalability, cost-efficiency, and flexibility, making it an
attractive choice for organizations operating in the cloud.

On the other hand, Oracle has a long-standing history in the data warehousing market, initially built
for on-premises deployments and later transitioning to the cloud. Oracle provides a comprehensive suite
of tools and solutions, including its flagship Oracle Database, which is widely recognized for its
reliability, scalability, and robust features. Oracle’s offering appeals to organizations with existing
Oracle deployments, as it allows them to leverage their familiarity with Oracle tools, interfaces, and
ecosystem.

In terms of performance and scalability, Snowflake excels in its ability to handle large-scale workloads. Its
multi-cluster architecture and auto-scaling capabilities ensure optimal performance even with concurrent
workloads. Additionally, Snowflake’s native support for semi-structured data allows organizations to work
with diverse data types more efficiently.

Oracle, on the other hand, offers powerful optimization capabilities, particularly with its Exadata
and Autonomous Data Warehouse offerings. These platforms are specifically designed to deliver high-
performance data processing, analytics, and query optimization for enterprise-scale workloads.

Data integration and analytics are also key areas of differentiation. Snowflake provides native integration with
various data processing and analytics tools, making it easier for organizations to leverage their existing
analytics ecosystem. On the other hand, Oracle offers a comprehensive ecosystem of data integration and
analytics tools, enabling organizations to tap into a wide range of solutions for their specific requirements.

Snowflake vs. Oracle: Which Is Best?

When comparing Snowflake and Oracle, two prominent players in the data warehousing
landscape, several factors come into play. Let’s delve into the comparison to help you determine which
platform might be the best fit for your needs.

1. Scalability and Performance:
• Snowflake: Snowflake’s cloud-native architecture provides strong scalability,
allowing you to scale compute and storage resources independently. Its multi-cluster
architecture ensures optimal performance even with large-scale, concurrent workloads.
• Oracle: Oracle offers robust scalability options, particularly with its Exadata and Autonomous Data
Warehouse offerings. These solutions are engineered for high-performance data warehousing,
enabling organizations to handle massive data volumes effectively.
2. Flexibility and Agility:
• Snowflake: Snowflake’s separation of compute and storage, along with its cloud-based nature,
grants users the flexibility to scale resources on demand and pay only for what is utilized. It
also supports semi-structured data natively, allowing for easy integration and analysis of
diverse data types.
• Oracle: Oracle provides a comprehensive suite of data management tools and technologies that
enable agility and flexibility. With its extensive ecosystem, organizations can leverage various
Oracle products and services for seamless integration and advanced analytics capabilities.
3. Ease of Use and User Experience:
• Snowflake: Snowflake boasts a user-friendly interface and intuitive SQL-based
query language, making it accessible to data analysts and SQL developers. Its self-tuning
capabilities and auto-scaling features simplify administration and optimize performance.
• Oracle: Oracle has a long-standing reputation for its user-friendly interfaces and robust
tools. Oracle Database, combined with its analytics and business intelligence solutions, offers
a familiar environment for users already experienced with Oracle technologies.
4. Integration and Ecosystem:
• Snowflake: Snowflake provides native integration with popular data processing and
analytics tools, facilitating seamless data integration and workflows. It has a growing
ecosystem of partners and connectors, expanding its compatibility with various third-party
systems.
• Oracle: Oracle’s extensive ecosystem offers a wide range of tools, applications, and industry-
specific solutions. With its strong integration capabilities and partnerships, Oracle
enables organizations to connect and consolidate their data across multiple sources effectively.
5. Security and Compliance:
• Snowflake: Snowflake places a strong emphasis on security and compliance. It provides robust
security features, including encryption, access controls, and compliance certifications, ensuring
data protection and regulatory compliance.
• Oracle: Oracle has a long history of prioritizing security and compliance. Its data management
solutions offer advanced security features, auditing capabilities, and data governance controls
to safeguard sensitive information.

Snowflake vs. Oracle: How DreamFactory Can Help

When comparing Snowflake vs. Oracle, realize that both providers offer superior data warehouses that help
you operationalize and analyze real-time data in your organization. Snowflake might be easier to use and
work out cheaper because of its ability to pause clusters when not running queries. However, Oracle
comes with support for cursors and built-in machine learning capabilities, helping you program and
generate advanced insights from workloads.

You can also compare Snowflake vs. Oracle with other data warehouses such as Amazon (AWS) Redshift,
Microsoft Azure, and Google BigQuery. Whatever option you choose, think about how your business will
transfer data to a warehouse.


Frequently Asked Questions: Snowflake vs. Oracle

What is Snowflake?

Snowflake is a cloud-based data warehousing platform known for its modern architecture, scalability, and
performance. It offers a shared data model, separating compute and storage, and provides flexibility, ease
of use, and native integration with various data processing tools.

What is Oracle?

Oracle is a renowned provider of data warehousing and database management systems. It offers
a comprehensive suite of products and services, including Oracle Database, designed for enterprise-
scale data management, analytics, and business intelligence.

What are the key advantages of Snowflake?

Snowflake excels in scalability, allowing independent scaling of compute and storage. It offers a cloud-
native architecture, flexibility, native support for semi-structured data, and strong performance even with
concurrent workloads. It provides an intuitive interface and self-tuning capabilities.

What are the strengths of Oracle?

Oracle is recognized for its reliability, scalability, and comprehensive ecosystem. It offers a
robust relational database management system (Oracle Database) along with a suite of data
management, analytics, and business intelligence tools. Oracle has a strong reputation and extensive
integration capabilities.

Which platform is more suitable for cloud deployments?
Both Snowflake and Oracle offer cloud-based options. However, Snowflake is built as a cloud-native
solution, while Oracle has transitioned its traditional offerings to the cloud. Snowflake’s
architecture and pricing model are optimized for the cloud, providing seamless scalability and
cost-efficiency.

Can Snowflake and Oracle handle large-scale data workloads?

Yes, both Snowflake and Oracle have the capability to handle large-scale data workloads.
Snowflake’s multi-cluster architecture and auto-scaling capabilities ensure performance, while Oracle’s
Exadata and Autonomous Data Warehouse offer optimized platforms for data warehousing.

What about data integration and analytics capabilities?

Snowflake provides native integration with various data processing and analytics tools,
facilitating seamless data integration and analytics workflows. Oracle offers a comprehensive ecosystem of
tools and solutions, enabling organizations to leverage its wide range of data integration and analytics
offerings.

How do Snowflake and Oracle differ in terms of pricing?

Snowflake follows a consumption-based pricing model, where users pay for the actual compute
and storage resources utilized. Oracle typically follows a traditional licensing model, although it
has introduced more flexible pricing options for its cloud-based offerings.

Which platform is better for existing Oracle users?

Oracle provides advantages for existing Oracle users due to its compatibility with existing
Oracle deployments, familiarity of tools and interfaces, and the ability to leverage the Oracle
ecosystem. However, Snowflake’s cloud-native architecture and scalability may also be worth
considering.

Which data warehousing solution should I choose?
The choice between Snowflake and Oracle depends on various factors, including scalability
needs, flexibility, cloud readiness, integration requirements, existing infrastructure, and preferences.
Conducting a thorough evaluation based on your specific needs and priorities is recommended to make
an informed decision.
