FINAL DW Record PDF
HINDUSTHAN INSTITUTE OF TECHNOLOGY
COIMBATORE 641 032
Name : GURUMOORTHY M S
Class/Sec: CSE-A
HINDUSTHAN
INSTITUTE OF TECHNOLOGY
COIMBATORE 641 032
…………………………………………………………………………………………………
…………………………………….
in the 22CS509L – DATA WAREHOUSING LABORATORY of this
B.E COMPUTER SCIENCE AND ENGINEERING for the V Semester during
the year 2024.
Place:
Date:
……………………………….
CONTENTS
S.No Date Experiment Page No Marks Sign
AIM:
To explore the data and perform data integration with WEKA.
PROCEDURE:
WEKA Installation
To install WEKA on your machine, visit WEKA’s official website and download the installation
file. WEKA supports installation on Windows, Mac OS X and Linux. You just need to follow the
instructions on this page to install WEKA for your OS.
The WEKA GUI Chooser application will then start.
The GUI Chooser application allows you to run five different types of applications as listed
here:
• Explorer
• Experimenter
• KnowledgeFlow
• Workbench
• Simple CLI
The Explorer is the application used throughout this experiment. Besides the Preprocess tab described below, it provides the following tabs:
• Classify
• Cluster
• Associate
• Select Attributes
• Visualize
Under these tabs, there are several pre-implemented machine learning algorithms. Let us
look into each of them in detail now.
Preprocess Tab
Initially as you open the explorer, only the Preprocess tab is enabled. The first step in machine
learning is to preprocess the data. Thus, in the Preprocess option, you will select the data file,
process it and make it fit for applying the various machine learning algorithms.
Classify Tab
The Classify tab provides you several machine learning algorithms for the classification of your
data. To list a few, you may apply algorithms such as Linear Regression, Logistic Regression,
Support Vector Machines, Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on.
The list is very exhaustive and provides both supervised and unsupervised machine learning
algorithms.
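As an illustration, the same classifiers can be driven through the WEKA Java API instead of the GUI. The following is a minimal sketch, assuming weka.jar is on the classpath and the iris sample dataset is available at the path shown (the file location is an assumption):
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifyExample {
    public static void main(String[] args) throws Exception {
        // Load the iris sample dataset (file location is an assumption)
        Instances data = DataSource.read("data/iris.arff");
        data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class

        J48 tree = new J48();                            // C4.5-style decision tree
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1)); // 10-fold cross-validation
        System.out.println(eval.toSummaryString());
    }
}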
Cluster Tab
Under the Cluster tab, there are several clustering algorithms provided - such as SimpleKMeans,
FilteredClusterer, HierarchicalClusterer, and so on.
Associate Tab
Under the Associate tab, you would find Apriori, FilteredAssociator and FPGrowth.
Select Attributes Tab
Select Attributes allows you to perform feature selection based on several algorithms such as
ClassifierSubsetEval, PrincipalComponents, etc.
Visualize Tab
Lastly, the Visualize option allows you to visualize your processed data for analysis. As you
noticed, WEKA provides several ready-to-use algorithms for testing and building your machine
learning applications. To use WEKA effectively, you must have a sound knowledge of these
algorithms, how they work, which one to choose under what circumstances, what to look for in
their processed output, and so on. In short, you must have a solid foundation in machine learning
to use WEKA effectively in building your apps.
Loading Data
The Preprocess tab lets you load data from the following sources:
• Local file system (Open file ...)
• Web (Open URL ...)
• Database (Open DB ...)
Click on the Open file ... button. A directory navigator window opens, from which you can browse
to and select the data file.
As you will notice, it supports several formats, including CSV and JSON. The default file type
is ARFF.
Arff Format
An ARFF file contains two sections - header and data.
• The header describes the attribute types.
• The @data tag starts the list of data rows, each containing comma-separated fields.
• The attributes can take nominal values, as in the case of outlook shown here:
@attribute outlook {sunny, overcast, rainy}
• The attributes can take real values, as in this case: @attribute temperature real
• You can also set a Target or a Class variable called play as shown here:
@attribute play {yes, no}
• The Target assumes two nominal values, yes or no.
As an example of the ARFF format, the Weather data file loaded from the WEKA sample databases is
shown below.
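The following is a minimal sketch of such a file (abridged; the values are illustrative and follow the weather.numeric sample that ships with WEKA):
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes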
Understanding Data
Let us first look at the highlighted Current relation sub window. It shows the name of the database
that is currently loaded. You can infer two points from this sub window:
• There are 14 instances - the number of rows in the table.
• The table contains 5 attributes - the fields, which are discussed in the upcoming sections.
On the left side, notice the Attributes sub window that displays the various fields in the
database.
The weather database contains five fields - outlook, temperature, humidity, windy and play.
When you select an attribute from this list by clicking on it, further details on the attribute
itself are displayed on the right-hand side.
Let us select the temperature attribute first. When you click on it, the Selected attribute panel on the right shows the details of the attribute:
• The panel shows the attribute's name, type, and the number of missing, distinct, and unique values.
• The table underneath this information shows the distribution of values for this field.
Removing Attributes
Many a time, the data that you want to use for model building comes with many irrelevant fields.
For example, a customer database may contain the customer's mobile number, which is irrelevant in
analysing their credit rating.
To remove one or more attributes, select them and click on the Remove button at the bottom.
The selected attributes would be removed from the database. After you fully preprocess the data,
you can save it for model building.
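The same removal can also be scripted outside the GUI. Below is a minimal sketch using the Remove filter of the WEKA Java API; the dataset path and the attribute index are hypothetical placeholders:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttributeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/customer.arff"); // hypothetical dataset
        Remove remove = new Remove();
        remove.setAttributeIndices("3");   // hypothetical 1-based index of the irrelevant column
        remove.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, remove);
        System.out.println("Attributes before: " + data.numAttributes()
                + ", after: " + cleaned.numAttributes());
    }
}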
Next, you will learn to preprocess the data by applying filters on this data.
Data Integration
Suppose you have 2 datasets and need to merge them together. WEKA can do this from the command line:
• java -cp weka.jar weka.core.Instances merge <path to file1> <path to file2> > <path to output file>
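For example, assuming two ARFF files named dataset1.arff and dataset2.arff in the current directory (hypothetical file names) and weka.jar in the same directory, the merged dataset can be written to merged.arff as follows:
java -cp weka.jar weka.core.Instances merge dataset1.arff dataset2.arff > merged.arff
Note that merge combines the two files attribute-wise, so both files are expected to contain the same number of instances.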
AIM:
To perform data validation and preprocessing (sampling, duplicate removal, reduction and transformation) using WEKA.
PROCEDURE:
Data validation is the process of verifying and validating data that is collected before it is
used. Any type of data handling task, whether it is gathering data, analyzing it, or structuring it
for presentation, must include data validation to ensure accurate results.
1. Data Sampling
• Click on Choose (certain sample datasets do not allow this operation; the breast-cancer dataset is used for this experiment).
• Filters -> supervised -> instance -> Resample
• Click on the name of the algorithm to change its parameters.
• Change biasToUniformClass to obtain a biased sample. If you set it to 1, the resulting dataset
will have an equal number of instances for each class, e.g. breast-cancer: 20 positive and 20 negative.
• Change noReplacement accordingly.
• Change sampleSizePercent accordingly (self-explanatory).
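The same sampling can be done programmatically. A minimal sketch using the supervised Resample filter of the WEKA Java API is given below (the dataset path is an assumption; the parameter values mirror the steps above):
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class ResampleExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/breast-cancer.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);  // supervised filter needs the class set

        Resample resample = new Resample();
        resample.setBiasToUniformClass(1.0);  // 1.0 -> equal number of instances per class
        resample.setNoReplacement(false);     // default: sample with replacement; change as needed
        resample.setSampleSizePercent(50.0);  // keep 50% of the dataset size
        resample.setInputFormat(data);

        Instances sampled = Filter.useFilter(data, resample);
        System.out.println("Original: " + data.numInstances()
                + " instances, sampled: " + sampled.numInstances());
    }
}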
2. Removing Duplicates
• Filters -> unsupervised -> instance -> RemoveDuplicates removes duplicate instances from the loaded dataset.
3. Data Reduction
PCA
• Load the iris dataset.
• Filters -> unsupervised -> attribute -> PrincipalComponents
• The original iris dataset has 5 columns (4 data + 1 class). Let us reduce that to 3 columns (2 data + 1 class).
• The PCA algorithm calculates 4 principal components for this dataset. From these we select the 2
components that cover the most variance (PC1 and PC2) and re-represent the data using only the
selected components (reducing the 4D data to 2D). The number of principal components kept when
regenerating the values can be set on the filter, while maximumAttributeNames only controls how
many of the original attributes appear in the name of each generated component. In the final
result, the new columns are linear combinations of the original attributes, each multiplied by its
respective loading.
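A minimal sketch of the same reduction through the WEKA Java API (the dataset path is an assumption; here the two strongest components are kept):
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;

public class PcaExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);       // class is passed through untouched

        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(1.0);   // rank all components before truncating
        pca.setMaximumAttributes(2);   // keep only PC1 and PC2
        pca.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, pca);
        System.out.println("Attributes after PCA: " + reduced.numAttributes()); // 2 components + class
    }
}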
4. Data Transformation
Normalization
• Load the iris dataset.
• Filters -> unsupervised -> attribute -> Normalize
• Normalization is important when you don't know the distribution of the data beforehand.
• Scale is the length of the target interval and translation is its lower bound.
• Ex: scale 2 and translation -1 => [-1, 1]; scale 4 and translation -2 => [-2, 2]
• This filter gets applied to all numeric columns; you can't selectively normalize.
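A minimal sketch of the Normalize filter via the Java API, using the scale and translation values from the example above (the dataset path is an assumption):
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff"); // path is an assumption

        Normalize norm = new Normalize();
        norm.setScale(2.0);         // length of the target interval
        norm.setTranslation(-1.0);  // lower bound -> numeric values end up in [-1, 1]
        norm.setInputFormat(data);

        Instances normalized = Filter.useFilter(data, norm);
        System.out.println(normalized.firstInstance()); // all numeric columns now lie in [-1, 1]
    }
}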
Standardization
• Load the iris dataset.
• Used when the dataset is known to follow a Gaussian (bell curve) distribution.
• Filters -> unsupervised -> attribute -> Standardize
• This filter gets applied to all numeric columns; you can't selectively standardize.
Discretization
• Load the diabetes dataset.
• Discretization comes in handy when using decision trees.
• Suppose you need to change a numeric column such as weight into two values, low and high.
• Filters -> unsupervised -> attribute -> Discretize, then set bins to 2 and attributeIndices to the required column.
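A minimal sketch of the Discretize filter via the Java API (the dataset path and the attribute index are assumptions):
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/diabetes.arff"); // path is an assumption

        Discretize disc = new Discretize();
        disc.setBins(2);                // two intervals, e.g. "low" and "high"
        disc.setAttributeIndices("6");  // hypothetical 1-based index of the column to discretize
        disc.setInputFormat(data);

        Instances discretized = Filter.useFilter(data, disc);
        System.out.println(discretized.attribute(5)); // the discretized attribute is now nominal
    }
}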
AIM:
To plan the architecture for a real-time application.
PROCEDURE:
DESIGN STEPS:
1. Gather Requirements: Aligning the business goals and needs of different departments
with the overall data warehouse project.
2. Set Up Environments: This step is about creating three environments for data warehouse
development, testing, and production, each running on separate servers.
3. Data Modeling: Design the data warehouse schema, including the fact tables and
dimension tables, to support the business requirements.
4. Develop Your ETL Process: ETL stands for Extract, Transform, and Load. This process
is how data gets moved from its source into your warehouse.
5. OLAP Cube Design: Design OLAP cubes to support analysis and reporting requirements.
6. Reporting & Analysis: Developing and deploying the reporting and analytics tools that
will be used to extract insights and knowledge from the data warehouse.
7. Optimize Queries: Optimizing queries ensures that the system can handle large amounts
of data and respond quickly to queries.
8. Establish a Rollout Plan: Determine how the data warehouse will be introduced to the
organization, which groups or individuals will have access to it, and how the data will be
presented to these users.
AIM:
To write queries for Star, Snowflake and Galaxy schema definitions.
PROCEDURE:
STAR SCHEMA
• A star schema has a single central fact table (sales) linked directly to a set of denormalized dimension tables.
SNOWFLAKE SCHEMA
• A snowflake schema normalizes some of the dimension tables into additional tables (for example, item -> supplier and location -> city).
GALAXY (FACT CONSTELLATION) SCHEMA
• A fact constellation has multiple fact tables. It is also known as a galaxy schema.
• The sales fact table is the same as that in the star schema.
• The shipping fact table also contains two measures, namely dollars cost and units shipped.
SYNTAX:
Cube Definition:
define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension Definition:
define dimension <dimension_name> as (<attribute_or_dimension_list>)
SAMPLE PROGRAM:
Star Schema:
define cube sales star [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
Snowflake Schema:
define cube sales snowflake [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))
Galaxy (Fact Constellation) Schema:
define cube sales [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
define cube shipping [time, item, shipper, from location, to location]:
dollars cost = sum(cost in dollars), units shipped = count(*)
RESULT:
Thus the queries for the Star, Snowflake and Galaxy schemas were written successfully.
AIM:
To design a data warehouse for real-time applications.
PROCEDURE:
Dropping Tables
Since decision-making is concerned with the trends related to students' history, behavior, and
academic performance, the tables "assets" and "item" are not needed; therefore, they are discarded
and excluded from the data warehouse.
DROP TABLE assets;
DROP TABLE item;
Merging Tables
Based on the design assumptions, the three tables "department", "section", and "course" do not
separately constitute important parameters for extracting relevant patterns and discovering
knowledge. Therefore, they are merged altogether with the "transcript_fact_table" table.
SELECT co_name FROM course, section, transcript
WHERE tr_id = n AND str_semester/year = se_semester/year AND tr_se_num = se_num AND se_code = co_code;
ALTER TABLE transcript_fact_table ADD co_course TEXT;
DROP TABLE department;
DROP TABLE section;
DROP TABLE course;
Furthermore, table "Activities" is merged with table "RegistrationActivities" and a new table is
produced called "RegisteredActivities".
SELECT act_name FROM activities, registrationActivities WHERE reg_act_id = act_id;
New Columns
During transformation, new columns can be added. In fact, tr_courseDifficulty is added to table
"transcript_fact_table" in order to increase the degree of knowledge and information.
ALTER TABLE transcript_fact_table ADD tr_courseDifficulty TEXT;
Moreover, a Boolean column called re_paidOnDueDate is added to table "receipt":
ALTER TABLE receipt ADD re_paidOnDueDate BOOLEAN;
Removing Columns
Unnecessary columns can be removed too during the transformation process. Below is a list of
useless columns that were discarded during the transformation process from tables "Account",
"Student", "Receipt" and "Activities" respectively:
ALTER TABLE Receipt REMOVE re_dueDate REMOVE re_dateOfPayment;
ALTER TABLE Activities REMOVE ac_supervisor;
ALTER TABLE Student REMOVE st_phone REMOVE st_email;
Conceptual Schema – The Snowflake Schema
The proposed data warehouse is a snowflake-type design with one central fact table and seven dimensions.
Output:
AIM:
To perform dimensional data modelling for a data warehouse.
PROCEDURE:
Step-1: Identifying the business objective: The first step is to identify the business objective.
Sales, HR, Marketing, etc., are some examples of organizational needs. Since this is the most
important step of data modelling, the selection of the business objective also depends on the
quality of data available for the process.
Step-2: Identifying Granularity: Granularity is the lowest level of information stored in the table.
The grain describes the level of detail for the business problem and its solution.
Step-3: Identifying Dimensions and their Attributes: Dimensions are objects or things.
Dimensions categorize and describe data warehouse facts and measures in a way that supports
meaningful answers to business questions. A data warehouse organizes descriptive attributes as
columns in dimension tables. For example, the date dimension may contain data like year, month,
and weekday.
Step-4: Identifying the Fact: The measurable data is held by the fact table. Most of the fact table
rows are numerical values like price or cost per unit, etc.
Step-5: Building of Schema: We implement the Dimension Model in this step. A schema is a
database structure. There are two popular schemes: Star Schema and
Dimensional data modeling is a technique used in data warehousing to organize and structure data in
a way that makes it easy to analyze and understand. In a dimensional data model, data is organized
into dimensions and facts.
Overall, dimensional data modeling is an effective technique for organizing and structuring data in a
data warehouse for analysis and reporting. By providing a simple and intuitive structure for the data,
the dimensional model makes it easy for users to access and understand the data they need to make
informed business decisions.
• Simplified Data Access: Dimensional data modeling enables users to easily access data
through simple queries, reducing the time and effort required to retrieve and analyze data.
• Enhanced Query Performance: The simple structure of dimensional data modeling allows
for faster query performance, particularly when compared to relational models.
• Ease of Understanding: Dimensional data modeling uses simple, intuitive structures that are easy to understand, even for non-technical users.
• Limited Complexity: Dimensional data modeling may not be suitable for very complex data
relationships, as it relies on simple structures to organize data.
• Limited Integration: Dimensional data modeling may not integrate well with other data
models, particularly those that rely on normalization techniques.
OUTPUT
RESULT:
Thus the dimensional data model for the data warehouse was designed successfully.
AIM:
To study how a real-time application (WhatsApp) uses Online Transaction Processing (OLTP).
INTRODUCTION:
WhatsApp is one of the most popular messaging applications worldwide, known for its real-time
communication capabilities. To manage its vast amount of data generated by millions of users, WhatsApp
employs Online Transaction Processing (OLTP) systems. This case study explores how WhatsApp utilizes
OLTP to enhance user experience, maintain data integrity, and ensure scalability.
Objectives
• Data Integrity: Maintain accuracy and consistency of user data (messages, contacts, etc.).
How WhatsApp Uses OLTP
1. Real-Time Transaction Processing
- WhatsApp's architecture relies on OLTP to handle a large number of transactions per second. Each
message sent or received is processed as a transaction, ensuring that messages are delivered promptly.
2. ACID Compliance
- WhatsApp's OLTP system adheres to ACID (Atomicity, Consistency, Isolation, Durability) properties to
ensure that every message transaction is reliable. This is crucial for maintaining the integrity of message data,
especially in scenarios where messages are sent or received while users are offline.
3. User Authentication and Account Management
- User authentication is a critical OLTP function. WhatsApp uses its OLTP system to manage user
accounts, including registration, login, and account recovery processes. This ensures secure and consistent
user experiences.
4. Scalable Architecture
- WhatsApp's OLTP system is designed to scale horizontally, allowing it to handle increased loads by
adding more servers. This architecture helps maintain performance during peak times when user activity
spikes.
5. Data Synchronization
- The app syncs messages across devices using an OLTP approach, ensuring that users have access to their
chat history regardless of the device being used. This requires real-time updates to the database as messages
are sent and received.
Challenges and Solutions
1. High Transaction Volume
- Solution: Distributed database systems and load balancing are used to spread the load across multiple
servers, ensuring smooth operation.
2. Network Reliability
- Solution: WhatsApp implements message queuing to store messages temporarily during poor
connectivity, ensuring messages are delivered once the connection is restored.
3. Multi-Device Consistency
- Challenge: Users often switch between devices, leading to potential data inconsistencies.
- Solution: The OLTP system synchronizes data across all devices in real-time, utilizing unique
identifiers for messages to ensure that all devices reflect the same information.
Results
• User Satisfaction: The implementation of OLTP has resulted in minimal delays in message delivery,
leading to high user satisfaction and retention rates.
• Robust Security: By ensuring data integrity and secure user authentication, WhatsApp maintains user
trust, which is vital for any messaging platform.
• Growth Management: The scalable nature of the OLTP architecture has allowed WhatsApp to
accommodate rapid growth in user numbers without sacrificing performance.
Conclusion
WhatsApp's effective use of OLTP systems is crucial for its success as a leading messaging application. By
ensuring real-time processing, data integrity, and scalability, WhatsApp can provide a seamless and reliable
user experience, positioning itself as a trusted platform for communication worldwide. This case study
highlights the importance of OLTP in handling high-volume transactions and maintaining consistent,
real-time data across a vast user base.
AIM:
To study how a core banking application uses OLTP for large-scale, real-time transaction processing (a case study of the Finacle Core Banking Solution).
INTRODUCTION:
A major global bank with millions of customers and thousands of branches across various countries
needed a robust solution to manage its high volume of daily transactions. The bank's primary
challenge was processing millions of banking transactions in real-time while ensuring data
accuracy, security, and high availability. It required an efficient Online Transaction Processing
(OLTP) system to handle transactions such as deposits, withdrawals, money transfers, and account
balance inquiries.
Objective:
To implement an OLTP system that could handle large- scale, real-time banking transactions while
providing high availability, reliability, and scalability.
The bank decided to implement the Finacle Core Banking Solution, an OLTP system developed by
Infosys that supports real-time transaction processing and helps banks manage day-to-day
operations efficiently.
Finacle’s OLTP capabilities enabled the bank to process millions of transactions in real-time across
various banking channels, such as branch banking, online banking, mobile banking, and ATMs.
Each transaction (such as a cash deposit, loan disbursement, or account balance update) was
processed instantly, ensuring immediate updates to the system.
Transactions included:
• Customer Deposits and Withdrawals
- Finacle ensured 24/7 availability, which was critical for the bank’s global operations. Its
architecture supported real-time replication, ensuring that all transactions were instantly recorded in
the database, even during high-traffic periods.
- To enhance data integrity, Finacle’s OLTP system maintained ACID (Atomicity, Consistency,
Isolation, Durability) properties to guarantee that every transaction was processed accurately and
securely. This prevented issues like double spending or data corruption.
- As the bank grew, Finacle’s OLTP system could scale horizontally and vertically to accommodate
the increasing volume of transactions. The system could handle thousands of simultaneous
transactions without any performance degradation.
- The system's database (often integrated with Oracle or Microsoft SQL Server) allowed for
concurrent access by thousands of users, ensuring smooth operations even during peak hours.
Multi-Channel Support:
- Finacle supported seamless integration across multiple banking channels:
- Branches: Bank staff could quickly process customer transactions, reducing wait times.
- Online and Mobile Banking: Real-time balance inquiries, bill payments, and transfers were made
possible for millions of users.
- All these channels accessed the same underlying database, ensuring consistency across the entire
system. For example, if a customer deposited money at an ATM, their account balance would be
immediately updated, and the same information would be available at the branch and on mobile
banking.
- As a banking system, Finacle adhered to strict security protocols to protect sensitive financial data.
The OLTP system included encryption, authentication and fraud detection mechanisms to ensure
that only authorized users could access or modify data.
- The system also ensured regulatory compliance with banking regulations in various countries,
allowing the bank to operate globally without risk of non- compliance.
Finacle’s OLTP system provided robust backup and disaster recovery mechanisms, ensuring that
data could be recovered in case of system failures or disasters. This minimized downtime and
prevented any potential data loss.
CONCLUSION:
By implementing the Finacle Core Banking Solution, the bank was able to efficiently manage its
high-volume, real-time transactions across various channels. The system's OLTP capabilities
ensured fast, accurate, and secure transaction processing, leading to a better customer experience
and improved operational efficiency.
Finacle’s OLTP system not only addressed the bank’s immediate needs but also provided a
scalable, secure, and reliable solution that could grow with the bank’s expanding operations. This
case study highlights how OLTP systems like Finacle can transform banking operations by
ensuring real-time processing, high availability, and enhanced security in a complex, fast-paced
environment.