DWDM Lab Manual Exercises
On the left-side navigation pane, we can see the different databases & their related tables. Now
we are going to build tables & populate their data in the database through SQL
queries. These tables in the database can then be used for building the data warehouse.
In the above two windows, we created a database named “sample” & in that
database we created two tables named “user_details” & “hockey” through SQL
queries.
Now, we are going to populate (fill) sample data through SQL queries into those
two tables, as represented in the windows below.
Through MySQL Administrator & SQLyog, we can import databases from other
sources (.XLS, .CSV, .sql) & also export our databases as backups for
further processing. We can connect MySQL to other applications for data analysis
& reporting, as sketched below.
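For instance, a Python script can query the same “sample” database directly. This is a minimal sketch, assuming the mysql-connector-python package is installed; the hostname, user name and password shown are placeholders for your own connection details:

# minimal sketch: querying the "sample" database from Python
# (hostname, user and password below are placeholders)
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="your_password", database="sample")
cursor = conn.cursor()
cursor.execute("SELECT * FROM user_details")   # table created earlier via SQL
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()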
In the above window, the left-side navigation bar contains a database named
“sales_dw” in which six different tables (dimcustdetails, dimcustomer,
dimproduct, dimsalesperson, dimstores, factproductsales) have been created.
After creating the tables in the database, we are going to use a tool called
“Microsoft Visual Studio 2012 for Business Intelligence” for building multi-dimensional models.
In the above window, we see Microsoft Visual Studio before creating a
project; the right-side navigation bar contains different options such as Data
Sources, Data Source Views, Cubes, Dimensions etc.
Through Data Sources, we can connect to our MySQL database named
“sales_dw”.
Then all the tables in that database are automatically retrieved into this tool for
creating multi-dimensional models.
Through Data Source Views & Cubes, we can see the retrieved tables as multi-dimensional models. We also need to add dimensions through the Dimensions option.
In general, multi-dimensional models consist of dimension tables & fact tables.
Process Extract:
The Extract step covers the data extraction from the source system and makes
it accessible for further processing. The main objective of the extract step is to
retrieve all the required data from the source system with as few resources as
possible. The extract step should be designed in a way that it does not
negatively affect the source system in terms of performance, response time or
any kind of locking.
There are several ways to perform the extract:
Update notification - if the source system is able to provide a
notification that a record has been changed and to describe the change,
this is the easiest way to get the data.
Incremental extract - some systems may not be able to provide
notification that an update has occurred, but they are able to identify
which records have been modified and provide an extract of such
records. During further ETL steps, the system needs to identify
changes and propagate them down (see the sketch after this list). Note
that, by using a daily extract, we may not be able to handle deleted
records properly.
Full extract - some systems are not able to identify which data has
been changed at all, so a full extract is the only way to get the
data out of the system. A full extract requires keeping a copy of the
last extract in the same format in order to be able to identify changes.
A full extract handles deletions as well.
When using incremental or full extracts, the extract frequency is
extremely important, particularly for full extracts, where the data volumes
can be in the tens of gigabytes.
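As an illustration of the incremental approach above, the following is a minimal Python sketch that pulls only the rows modified since the last run; the table name, the modified_at column and the connection details are assumptions chosen for illustration:

# minimal sketch of an incremental extract (assumed table/column names)
import mysql.connector
from datetime import datetime

last_extracted = datetime(2024, 1, 1)          # timestamp saved from the previous run
conn = mysql.connector.connect(host="localhost", user="root",
                               password="your_password", database="sales_dw")
cursor = conn.cursor()
cursor.execute(
    "SELECT * FROM factproductsales WHERE modified_at > %s", (last_extracted,))
changed_rows = cursor.fetchall()               # only records changed since the last extract
print(len(changed_rows), "changed rows extracted")
conn.close()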
Clean:
The cleaning step is one of the most important as it ensures the quality of the
data in the data warehouse.
Cleaning should perform basic data unification rules (a small pandas sketch follows this list), such as:
Making identifiers unique (sex categories Male/Female/Unknown,
M/F/null, Man/Woman/Not Available are translated to the standard
Male/Female/Unknown)
Converting null values into a standardized Not Available/Not Provided value
Converting phone numbers and ZIP codes to a standardized form
Validating address fields and converting them to a consistent naming, e.g.
Street/St/St./Str./Str
Validating address fields against each other (State/Country, City/State,
City/ZIP code, City/Street).
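The following is a minimal pandas sketch of the first two rules above (standardizing sex categories and null values); the column name and mapping values are assumptions chosen for illustration:

# minimal sketch: unifying sex categories and null values with pandas (assumed column name)
import pandas as pd

df = pd.DataFrame({"sex": ["M", "Woman", None, "Male", "F"]})
sex_map = {"M": "Male", "Man": "Male", "Male": "Male",
           "F": "Female", "Woman": "Female", "Female": "Female"}
df["sex"] = df["sex"].map(sex_map).fillna("Unknown")   # unmapped or null values become Unknown
print(df)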
Transform:
The transform step applies a set of rules to transform the data from the source
to the target. This includes converting any measured data to the same
dimension (i.e. a conformed dimension) using the same units so that they can
later be joined. The transformation step also requires joining data from
several sources, as sketched below.
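As an illustration, here is a minimal pandas sketch of a transform that converts a measure to common units and then combines two sources; the column names and the unit conversion are assumptions chosen for illustration:

# minimal sketch: unit conversion + combining two sources (assumed column names)
import pandas as pd

sales_eu = pd.DataFrame({"product_id": [1, 2], "amount_kg": [2.0, 5.5]})
sales_us = pd.DataFrame({"product_id": [3], "amount_lb": [11.0]})

sales_us["amount_kg"] = sales_us["amount_lb"] * 0.45359237   # convert pounds to kilograms
combined = pd.concat([sales_eu[["product_id", "amount_kg"]],
                      sales_us[["product_id", "amount_kg"]]], ignore_index=True)
print(combined)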
Among the above tools, we are going to use the OpenRefine 2.8 ETL tool on different
sample datasets for extraction, data cleaning, transformation & loading.
Perform various OLAP operations such as slice, dice, roll-up, drill-up and pivot.
ANS:
OLAP Operations are being implemented practically using Microsoft Excel.
Procedure for OLAP Operations:
1. Open Microsoft Excel, go to the Data tab at the top & click on “Existing Connections”.
2. The Existing Connections window will open; there, the “Browse for more” option
should be clicked to import a .cub extension file for performing OLAP
operations. As a sample, I took the music.cub file.
5. Now we are going to perform the roll-up (drill-up) operation. In the above window, I
selected the January month, and the Drill-up option is automatically enabled at the top. When we
click on the Drill-up option, the window below is displayed.
While inserting slicers for the slicing operation, we select only 2 dimensions (e.g.,
CategoryName & Year) with one measure (e.g., Sum of Sales). After inserting a
slicer & adding a filter (CategoryName: AVANT ROCK & BIG BAND; Year: 2009 &
2010), we get the table shown below.
8. Finally, the pivot (rotate) OLAP operation is performed by swapping the rows (Order
Date-Year) & columns (Values: Sum of Quantity & Sum of Sales) through the
bottom-right navigation bar as shown below.
After swapping (rotating), we get the result represented below, with a pie chart for
Category: Classical & year-wise data.
The GUI Chooser application allows you to run five different types of
applications -
The Explorer is the central panel where most data mining tasks are
performed.
The Experimenter panel is used to run experiments and conduct statistical
tests between learning schemes.
The KnowledgeFlow panel is used to provide an interface to drag and
drop components, connect them to form a knowledge flow and analyze the
data and results.
The Workbench panel combines all of the other GUI interfaces within a
single application.
The Simple CLI panel provides the command-line interface powers to run
WEKA.
The Explorer - When you click on the Explorer button in the Applications selector,
it opens the following screen.
Click the “Add New” button in the Datasets pane and select the
required dataset (ARFF format files).
Click the “Add New” button in the “Algorithms” pane and click “OK”
to add the required algorithm.
2. The run tab is for running your designed experiments. Experiments can be
started and stopped. There is not a lot to it.
Click the “Start” button to run the small experiment you designed.
3. The analyze tab is for analyzing the results collected from an experiment.
Results can be loaded from a file, from the database or from an experiment
just completed in the tool. A number of performance measures are collected from a
given experiment, and these can be compared between algorithms using tools like
statistical significance tests.
Click the “Experiment” button in the “Source” pane to load the results
from the experiment you just ran.
Click the “Perform Test” button to summarize the classification
accuracy results for the single algorithm in the experiment.
The KnowledgeFlow – When you click on the KnowledgeFlow button in the
Applications selector, it opens the following screen.
The Weka Workbench is an environment that combines all the GUI interfaces
into a single interface. It is useful if you find yourself jumping a lot between two or
more different interfaces, such as between the Explorer and the Experiment
Environment. This can happen if you try out a lot of what-ifs in the Explorer and
quickly take what you learn and put it into controlled experiments.
The Simple CLI – When you click on the Simple CLI button in the Applications
selector, it opens the following screen.
Weka can be used from a simple Command Line Interface (CLI). This is
powerful because you can write shell scripts to use the full API from command line
calls with parameters, allowing you to build models, run experiments and make
predictions without a graphical user interface.
The Simple CLI provides an environment where you can quickly and easily
experiment with the Weka command line interface commands.
Navigate the options available in WEKA (ex. Select attributes panel,
Preprocess panel, Classify panel, Cluster panel, Associate panel and Visualize
panel)
ANS:
EXPLORER PANEL
Preprocessor Panel
1. A variety of dataset formats can be loaded: WEKA's ARFF format (.arff
extension), CSV format (.csv extension), C4.5 format (.data & .names
extensions), or serialized Instances format (.bsi extension).
2. Load a standard dataset in the data/ directory of your Weka installation,
specifically data/breast-cancer.arff.
Classify Panel
Test Options
1. The result of applying the chosen classifier will be tested according to the
options that are set by clicking in the Test options box.
2. There are four test modes:
Use training set: The classifier is evaluated on how well it predicts the
class of the instances it was trained on.
Supplied test set: The classifier is evaluated on how well it predicts
the class of a set of instances loaded from a file. Clicking the Set...
button brings up a dialog allowing you to choose the file to test on.
Cross-validation: The classifier is evaluated by cross-validation, using
the number of folds that are entered in the Folds text field.
Percentage split: The classifier is evaluated on how well it predicts a
certain percentage of the data which is held out for testing. The amount
of data held out depends on the value entered in the % field.
3. Click the “Start” button to run the ZeroR classifier on the dataset and
summarize the results.
Cluster Panel
1. Click the “Start” button to run the EM clustering algorithm on the dataset and
summarize the results.
Associate Panel
1. Click the “Start” button to run the Apriori association algorithm on the dataset
and summarize the results.
Visualize Panel
1. Increase the point size and the jitter and click the “Update” button to get an
improved plot of the categorical attributes of the loaded dataset.
EXPERIMENTER
Setup Panel
1. Click the “New” button to create a new Experiment.
2. Click the “Add New” button in the Datasets pane and select
the data/diabetes.arff dataset.
3. Click the “Add New” button in the “Algorithms” pane and click “OK” to add
the ZeroR algorithm.
Run Panel
1. Click the “Start” button to run the small experiment you designed.
Analyse Panel
1. Click the “Experiment” button in the “Source” pane to load the results from the
experiment you just ran.
2. Click the “Perform Test” button to summarize the classification accuracy results
for the single algorithm in the experiment.
Study the ARFF file format. Explore the available data sets in WEKA. Load a
data set (ex. Weather dataset, Iris dataset, etc.)
ANS:
1. An ARFF (Attribute-Relation File Format) file is an ASCII text file that
describes a list of instances sharing a set of attributes.
2. ARFF files have two distinct sections – The Header & the Data.
The Header describes the name of the relation, a list of the attributes,
and their types.
The Data section contains a comma separated list of data.
The ARFF Header Section
The ARFF Header section of the file contains the relation declaration and attribute
declarations.
The @relation Declaration
The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes
spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of @attribute
statements. Each attribute in the data set has its own @attribute statement which
uniquely defines the name of that attribute and its data type. The order the
attributes are declared indicates the column position in the data section of the file.
The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If spaces are
to be included in the name, then the entire name must be quoted.
The <datatype> can be any of the four types:
1. numeric
2. <nominal-specification>
3. string
4. date [<date-format>]
ARFF Data Section
The ARFF Data section of the file contains the data declaration line and the actual
instance lines.
The @data Declaration
The @data declaration is a single line denoting the start of the data segment in the
file. The format is:
@data
The instance data:
Each instance is represented on a single line, with carriage returns denoting the
end of the instance.
Attribute values for each instance are delimited by commas. They must appear in
the order that they were declared in the header section (i.e. the data corresponding
to the nth @attribute declaration is always the nth field of the instance).
Missing values are represented by a single question mark, as in:
@data
4.4, ?, 1.5, ?, Iris-setosa
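To tie these pieces together, the following is a minimal sketch that writes a small iris-style ARFF file (the relation name, attributes and data rows are made up for illustration) and loads it back with scipy.io.arff, which is one way to read ARFF files outside of WEKA:

# minimal sketch: a tiny hand-written ARFF file, loaded with scipy (illustrative data only)
import io
from scipy.io import arff

arff_text = """@relation iris-sample
@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor}
@data
5.1, 3.5, Iris-setosa
4.4, ?, Iris-setosa
7.0, 3.2, Iris-versicolor
"""
data, meta = arff.loadarff(io.StringIO(arff_text))   # the missing value '?' becomes NaN
print(meta)    # relation name, attribute names and types
print(data)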
4. Plot Histogram
Applying the Resample filter on the selected dataset yields the following results.
Load the weather.nominal, Iris and Glass datasets into Weka and run the Apriori
algorithm with different support and confidence values.
ANS:
Loading WEATHER.NOMINAL dataset
1. Select WEATHER.NOMINAL dataset from the available datasets in the
preprocessing tab.
2. Apply Apriori algorithm by selecting it from the Associate tab and click start
button.
3. The Associator output displays the following result.
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S
-1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
=== Classifier model (full training set) ===
J48 pruned tree
Extract if-then rules from the decision tree generated by the classifier and observe
the confusion matrix.
ANS:
Loading CONTACT-LENSES dataset and Run JRip algorithm.
1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply JRip algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available
in Test Options.
5. Now click the Start button.
6. The Classifier output displays the following result.
=== Run information ===
Scheme: weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1
Relation: contact-lenses
Instances: 24
Attributes: 5
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
JRIP rules:
===========
(tear-prod-rate = normal) and (astigmatism = yes) => contact-lenses=hard (6.0/2.0)
(tear-prod-rate = normal) => contact-lenses=soft (6.0/1.0)
=> contact-lenses=none (12.0/0.0)
Number of Rules : 3
Time taken to build model: 0 seconds
a b c <-- classified as
5 0 0 | a = soft
0 4 0 | b = hard
1 2 12 | c = none
Load each dataset into Weka and perform Naïve Bayes classification and k-Nearest
Neighbour classification. Interpret the results obtained.
ANS:
Loading CONTACT-LENSES dataset and Run Naïve-Bayes algorithm.
1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply Naïve-Bayes algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available
in Test Options.
5. Now click the Start button.
6. The Classifier output displays the following result.
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: contact-lenses
Instances: 24
Attributes: 5
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute soft hard none
(0.22) (0.19) (0.59)
==========================================
age
young 3.0 3.0 5.0
pre-presbyopic 3.0 2.0 6.0
presbyopic 2.0 2.0 7.0
[total] 8.0 7.0 18.0
spectacle-prescrip
myope 3.0 4.0 8.0
hypermetrope 4.0 2.0 9.0
[total] 7.0 6.0 17.0
astigmatism
no 6.0 1.0 8.0
yes 1.0 5.0 9.0
[total] 7.0 6.0 17.0
tear-prod-rate
Plot ROC Curves.
ANS:
1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply Naïve-Bayes algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available
in Test Options.
5. Now click the Start button.
6. The Classifier output displays the result.
7. To plot ROC curves, right-click on the bayes.NaiveBayes entry in the
result list and select the Visualize Threshold Curve option, where we can select any
of the available classes (soft, hard, none).
8. After selecting a class, the ROC curve plot will be displayed with False Positive Rate
on the X-axis and True Positive Rate on the Y-axis.
Compare the classification results of the ID3, J48, Naïve Bayes and k-NN classifiers for
each dataset, deduce which classifier performs best and which performs worst for each
dataset, and justify.
ANS:
By observing all the classification results of the ID3, k-NN, J48 & Naïve
Bayes algorithms -
The ID3 algorithm's accuracy & performance is the best.
The J48 algorithm's accuracy & performance is the poorest.
mean/mode
RESULT
RESULT
Set up the knowledge flow to load an ARFF (batch mode) and perform a
cross validation using J48 algorithm.
ANS:
Knowledge flow to load an ARFF (batch mode) and perform a cross
validation using J48 algorithm.
RESULT
Demonstrate plotting multiple ROC curves in the same plot window by using
J48 and Random Forest tree.
Plotting multiple ROC curves in the same plot window by using J48 and
Random Forest tree.
RESULT
OUTPUT:
Set: 12, 37, 1, 38, 84, 28, 24, 61, 88, 65, 66, 72, 85, 75, 64, 91, 27, 47, 42
9. Write a Python program to generate frequent item sets / association rules using
Apriori algorithm.
PROGRAM:
# installing the apyori package
!pip install apyori
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
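The listing above shows only the setup; the following is a minimal sketch of the remaining steps using the apyori package (imported above), where the transactions and the support/confidence thresholds are assumptions chosen for illustration rather than the manual's own dataset:

# minimal sketch of the elided steps (illustrative transactions and thresholds)
transactions = [['milk', 'bread'], ['milk', 'butter'],
                ['bread', 'butter'], ['milk', 'bread', 'butter']]

from apyori import apriori
rules = apriori(transactions, min_support=0.5, min_confidence=0.7, min_lift=1.0)
results = list(rules)

# collecting the frequent item sets into a DataFrame (pandas was imported above as pd)
output_DataFrame = pd.DataFrame(
    [(tuple(r.items), r.support) for r in results],
    columns=['Items', 'Support'])
print(output_DataFrame)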
OUTPUT:
# Displaying the results non-sorted
output_DataFrame
10. Write a program to calculate chi-square value using Python. Report your
observation.
PROGRAM:
# importing libraries
import numpy as np
from scipy.stats import chi2
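The listing shows only the imports; a minimal sketch of how the elided computation might look follows, continuing from the numpy and chi2 imports above. The observed and expected frequencies are assumptions chosen for illustration, so the exact figures printed need not match the output shown below:

# minimal sketch of the elided computation (observed/expected values are illustrative)
observed = np.array([30, 10])
expected = np.array([20, 20])

chi_square = np.sum((observed - expected) ** 2 / expected)   # sum of (O - E)^2 / E
df = len(observed) - 1                                        # degrees of freedom
p_value = chi2.sf(chi_square, df)                             # upper-tail probability

print("Chi-square statistic:", chi_square)
print("P-value:", round(p_value, 4))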
OUTPUT:
Chi-square statistic: 10.0
P-value: 0.0014
OBSERVATION:
The chi-square statistic is a measure of the discrepancy between the observed and
expected frequencies in a contingency table. The p-value is the probability of
obtaining a chi-square statistic as large or larger than the one observed, assuming that
the null hypothesis is true. In this case, the null hypothesis is that there is no
association between the two variables in the contingency table. The p-value of 0.0014
is less than the significance level of 0.05, so we reject the null hypothesis and
conclude that there is a significant association between the two variables.
return dict
# Calculating Mean
def mean(numbers):
    return sum(numbers) / float(len(numbers))

def MeanAndStdDev(mydata):
    info = [(mean(attribute), std_dev(attribute)) for attribute in zip(*mydata)]
    # e.g.: list = [[a, b, c], [m, n, o], [x, y, z]]
    # here mean of 1st attribute = (a + m + x)/3, mean of 2nd attribute = (b + n + y)/3
    # delete summaries of the last class
    del info[-1]
    return info
return probabilities
# Accuracy score
def accuracy_rate(test, predictions):
    correct = 0
    for i in range(len(test)):
        if test[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(test))) * 100.0
# driver code
# add the data path in your system
filename = r'E:\user\MACHINE LEARNING\machine learning algos\Naive bayes\filedata.csv'
# prepare model
info = MeanAndStdDevForClass(train_data)
# test model
predictions = getPredictions(info, test_data)
accuracy = accuracy_rate(test_data, predictions)
print("Accuracy of your model is: ", accuracy)
OUTPUT:
Total number of examples are: 200
Out of these, training examples are: 140
Test examples are: 60
Accuracy of your model is: 71.2376788
}
// creating array for the 2-frequency itemset
int nt1[][] = new int[10][10];
for (j = 0; j < 10; j++)
{
    // generating unique items for the 2-frequency itemlist
    for (m = j + 1; m < 10; m++)
    {
        for (i = 0; i < n; i++)
        {
            if (item[i][j] == 1 && item[i][m] == 1)
            // checking that there is at least 1 itemset in the 1-frequency itemset and the 2-frequency itemlist
            {
                nt1[j][m] = nt1[j][m] + 1;
                // incrementing for each item paired with all other items in the 2-frequency itemset
            }
        }
        if (nt1[j][m] != 0) // if the 2-frequency itemlist is present
            System.out.println("Number of Items of " + itemlist[j] + " & " + itemlist[m] + " : " + nt1[j][m]);
        // printing the number of items of each item with the other items, together with their frequency
    }
}
for (j = 0; j < 10; j++)
{
    for (m = j + 1; m < 10; m++)
    {
        if (((nt1[j][m] / (float) n) * 100) >= 50)
            q[j] = 1;
        else
            q[j] = 0;
        if (q[j] == 1)
        {
            System.out.println("Item " + itemlist[j] + " & " + itemlist[m] + " is selected ");
        }
    }
}
}
}
OUTPUT:
Enter the number of transaction:
3
items: 1--Milk 2--Bread 3--Coffee 4--Juice 5--Cookies 6--Jam 7--Tea 8--Butter 9--Sugar 10--Water
Transaction 1:
Is Item MILK present in this transaction(1/0)? :
1
Is Item BREAD present in this transaction(1/0)? :
1
Is Item COFFEE present in this transaction(1/0)? :
1
Is Item JUICE present in this transaction(1/0)? :
1
Is Item COOKIES present in this transaction(1/0)? :
1
Is Item JAM present in this transaction(1/0)? :
1
Is Item TEA present in this transaction(1/0)?
:1
Is Item BUTTER present in this transaction(1/0)? :
1
Is Item SUGAR present in this transaction(1/0)? :
1
Is Item WATER present in this transaction(1/0)?
:1
Transaction 2:
Is Item MILK present in this transaction(1/0)? :
1
Is Item BREAD present in this transaction(1/0)? :
1
Is Item COFFEE present in this transaction(1/0)? :
0
Is Item JUICE present in this transaction(1/0)? :
0
Is Item COOKIES present in this transaction(1/0)? :
1
Is Item JAM present in this transaction(1/0)? :
1
Is Item TEA present in this transaction(1/0)?
:0
13. Write a program to cluster data of your choice using the simple k-means algorithm
in JDK.
PROGRAM:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
if (dataPoint.getClusterIndex() == i) {
clusterCentroid.add(dataPoint);
}
}
clusterCentroid.divide(dataPoints.size());
clusterCentroids.set(i, clusterCentroid);
}
@Override
public String toString() {
return "(" + x + ", " + y + ")";
}
}
}
OUTPUT:
(0.8, 1.2)
(3.2, 3.6)
(1.0, 1.2)
14. Write a program for cluster analysis using the simple k-means algorithm in the Python
programming language.
PROGRAM:
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
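The listing above shows only the imports; the following is a minimal sketch of the elided steps, written to match the observation below (100 random points, 3 clusters, a plot of points and centroids). Using scikit-learn's KMeans here is an assumption, since the manual's own code is not shown; it relies on the numpy and matplotlib imports above.

# minimal sketch of the elided steps (assumes scikit-learn is installed)
from sklearn.cluster import KMeans

np.random.seed(0)
X = np.random.rand(100, 2)                      # 100 random 2-D data points

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centroids = kmeans.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)                      # points coloured by cluster
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x')   # cluster centroids
plt.show()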
OUTPUT:
OBSERVATION:
This program will generate 100 random data points and then cluster them using the k-
means algorithm. The number of clusters is chosen to be 3. The program will then plot
the data points and the centroids.
15. Write a program to compute/display dissimilarity matrix (for your own dataset
containing at least four instances with two attributes) using Python.
PROGRAM:
# importing library
import numpy as np
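Only the import is shown above; a minimal sketch of the elided computation follows. The four two-attribute instances used here are an assumption chosen so that the pairwise Euclidean distances resemble the matrix shown below:

# minimal sketch of the elided steps (the dataset below is an illustrative assumption)
data = np.array([[1, 1], [3, 3], [5, 5], [7, 7]])   # four instances, two attributes

n = len(data)
dissimilarity = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # Euclidean distance between instance i and instance j
        dissimilarity[i, j] = np.sqrt(np.sum((data[i] - data[j]) ** 2))

print(dissimilarity)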
OUTPUT:
[[0. 2.82842712 5.65685425 8.48528137]
[2.82842712 0. 2.82842712 5.65685425]
[5.65685425 2.82842712 0. 2.82842712]
[8.48528137 5.65685425 2.82842712 0. ]]
OBSERVATION:
This program first creates a dataset containing four instances with two attributes.
Then, it computes the dissimilarity matrix using the Euclidean distance formula.
Finally, it displays the dissimilarity matrix.
16. Visualize datasets using matplotlib in Python (histogram, box plot, bar
chart, pie chart, etc.)
PROGRAM:
LINE CHART
import matplotlib.pyplot as plt
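Only the import is shown; a minimal sketch of a line chart follows, using made-up x/y values for illustration:

# minimal sketch of a line chart (illustrative data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]
plt.plot(x, y)
plt.title("Line Chart")
plt.show()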
OUTPUT:
HISTOGRAM
import matplotlib.pyplot as plt
import numpy as np
# sample data (assumed here, since the original dataset is not shown)
data = np.random.randn(1000)
# Create a histogram
plt.hist(data)
plt.show()
OUTPUT:
BOXPLOT
import matplotlib.pyplot as plt
import numpy as np
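Only the imports are shown; a minimal sketch of a box plot follows, using randomly generated data as an assumption:

# minimal sketch of a box plot (illustrative random data)
data = np.random.randn(100)
plt.boxplot(data)
plt.title("Box Plot")
plt.show()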
OUTPUT:
BAR CHART
import matplotlib.pyplot as plt
import numpy as np
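Only the imports are shown; a minimal sketch of a bar chart follows, with made-up categories and values:

# minimal sketch of a bar chart (illustrative data)
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 8]
plt.bar(categories, values)
plt.title("Bar Chart")
plt.show()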
OUTPUT:
PIE CHART
import matplotlib.pyplot as plt
import numpy as np
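Only the imports are shown; a minimal sketch of a pie chart follows, with made-up shares and labels:

# minimal sketch of a pie chart (illustrative data)
sizes = [40, 30, 20, 10]
labels = ['A', 'B', 'C', 'D']
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title("Pie Chart")
plt.show()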
OUTPUT: