FINAL EXAM – EXERCISE – To Better Understand Attrition.
This is a final project – you are going to examine the HR-BalancedSheet dataset and write a short report on what you found. I will guide you through the analysis, but as we go you will need to capture output for the final report.
1. Load the dataset into Statistica
2. Generate Histograms for all of the data
a. Make notes on what you observe from the histograms. Can
you learn anything about the business from these histograms?
b. Capture all of the histograms.
3. Now generate a correlation matrix to see if any variables are
highly correlated. If variables are highly correlated and you
are doing a supervised method (e.g., decision tree), then one of
them must be omitted from the analysis. Do you know why?
Go to Statistics -> Nonparametrics -> Correlations and click OK.
Now select ALL of the variables and select “Spearman rank R”.
4. Let’s copy this out to Excel.
a. Open a blank Excel file
b. Go to Statistica – the output correlation matrix –
i. Hit Ctrl – A - this will select everything.
ii. Right Click - select “Copy with Headers”
iii. Go To Excel – select Paste
5. Select all of the numbers in Excel
a. Go To Conditional Formatting
i. Highlight all values greater than 0.70
6. This tells you the values that are highly correlated. Record
what they are – these cannot be used in a supervised modeling
exercise together. For example, JobLevel and
TotalWorkingYears are highly correlated.
a. Make a list of all of the variables that are highly correlated
(>0.7).
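The same screen can be done outside Statistica/Excel with pandas (a sketch; the toy column names mirror the JobLevel/TotalWorkingYears example, and the real dataset's columns may differ):

```python
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.7):
    """Return variable pairs whose Spearman rank correlation exceeds the threshold."""
    corr = df.corr(method="spearman")
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 3)))
    return pairs

# Toy data: JobLevel and TotalWorkingYears move together, so they get flagged.
df = pd.DataFrame({
    "JobLevel": [1, 2, 3, 4, 5, 2, 3],
    "TotalWorkingYears": [2, 5, 9, 15, 22, 6, 10],
    "DistanceFromHome": [10, 3, 7, 1, 12, 8, 2],
})
print(correlated_pairs(df))
```

Only one variable from each flagged pair should go into the supervised model.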
BUSINESS PROBLEM: The company has employee data for
the last several years. In this data set we have a wide range of
data, including whether or not they left the company (i.e.,
Attrition). If Attrition is set to “Yes”, they left the company.
If Attrition is set to “No”, they did not leave the company.
The first thing we want to do is take a high-level look at those people who left the company.
Go to Selection Criteria – that is accessible through the Sel:Off
setting at the bottom of the Statistica window. Click on
“Sel:Off”
Set the selection criteria to Attrition = “Yes”.
7. Generate Histograms for all of the data
a. Make notes on what you observe from the histograms. Can
you learn anything about the business from these histograms?
b. Capture the histograms that tell you something about the
business.
Go back to the selection criteria and turn the Sel: back to “Off”.
8. Now build a decision tree (C&RT) to see if we can find out what influences whether or not individuals decide to leave the company.
If you exclude the variables that are highly correlated, you can
generate a tree.
Generate a C&RT tree
Pick your variables (Quick)
· Attrition is your dependent variable
· Select the categorical and continuous variables that you
reasonably think could be an issue with respect to attrition.
· Select your response codes
· ALL
Don’t do anything on Classification (YET) – you may want to
go back and play with the classification weights – but, don’t do
that yet.
On the “Stopping” tab, change the minimum n to 20. This will
allow it to build a deeper tree.
Select V-fold cross validation on the Validation tab
Set Surrogate to 2 on the Advanced tab and hit OK.
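If you want to replicate this setup outside Statistica, scikit-learn's DecisionTreeClassifier is a rough analogue of C&RT (a sketch on synthetic data; scikit-learn does not support surrogate splits, and min_samples_split stands in for the "minimum n" stopping rule):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the HR data: four numeric predictors, Attrition as the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.where(X[:, 0] + rng.normal(scale=0.5, size=200) > 0, "Yes", "No")

# min_samples_split plays the role of the "minimum n" stopping rule;
# cross_val_score gives a v-fold (here 5-fold) cross-validated accuracy.
tree = DecisionTreeClassifier(min_samples_split=20, random_state=0)
scores = cross_val_score(tree, X, y, cv=5)
print(round(scores.mean(), 2))

# Analogue of the "Importance" summary: which variables drive the splits.
tree.fit(X, y)
print(tree.feature_importances_)
```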
Look at your tree –
Look at the Predicted Versus Observed – under classification.
Look at “Importance” on the Summary tab – this tells you which
variables have the greatest impact.
This is your initial tree ---
Now – the best you’re going to be able to do is about 80% accuracy on predicting both Yes and No.
So – play with it and see how good you can get it.
· Play with the classification costs
· You may try to create a stratified subsample using Attrition as
the strata variable
FINAL EXAM NAME: _________________________
1. When we evaluate models we often discuss things like
predictive accuracy, speed, robustness, scalability, and
interpretability. Briefly discuss what is meant by
“interpretability” and why it is important.
2. You have been hired by the county government to help
automate a system to detect fraudulent spending by government
employees.
You have been given a database of transactions for the past 10
years to work with. Each record in the database contains all of
the details of each transaction as well as information related to
the particular employee. In this database accountants have
manually gone through the data and marked each transaction as
either “Good” or “Fraudulent”.
The goal – build a model based on the historical data that will
flag future transactions as either “Good” or “Fraudulent”. This
will eliminate the need for the accountants to have to go
through each transaction manually in the future. What type of
modeling technique (e.g., decision tree, association analysis,
clustering, etc.) would you use and why?
3. AT&T has been losing customers to Verizon. They want to
try to understand why this is the case. They have customer
records for the past 5 years that contain demographic
information (age, gender, etc.) for the customers, the type of
plan that they have, the number of interactions they have had
with customer support and whether or not those customers left
AT&T.
AT&T wants you to build a model that can be used to predict
whether or not a customer is going to leave and switch to
another provider. What type of technique (e.g., decision tree, association analysis, clustering, etc.) would you use and why?
4. Kroger is trying to find ways to improve sales. They have
all of their receipts for the past 5 years. The receipts contain
information about what was purchased, who purchased, and the
date and time of the transaction.
Your task is to analyze sales patterns and make recommendations with respect to store layouts that, you hope, will increase sales.
What type of modeling technique (e.g., decision trees,
association analysis, clustering, etc.) would you use and why?
5. You work for a cable service provider. You provide a
variety of services for your customers. Your company provides
cable TV, home phone, security systems, and internet services.
Your customer base is very diverse. Your customers could be
male/female, young/old, single/married/married with children,
etc. You have a wide range of ethnic backgrounds and income
levels.
You want to make your marketing campaigns more effective.
This means targeting the right groups with the right messages
using the right media. (For example, marketing via social
media may be more or less effective for 18 year olds as
compared to 80 year olds).
You have been tasked to use the customer database and determine
what the different customer segments are and what they look
like. Then, once you figure out what the unique groups are
you can go through and develop a targeted campaign for each
group. What type of modeling technique (e.g., decision trees,
association analysis, clustering, etc.) would you use to
determine the different market sectors and why?
Text and Web Analytics
6. When we do text analytics, we read in the data, we transform
the data into documents, and then we must generate a
term/document matrix. This term/document matrix is what we
use to perform analysis.
Generation of the term/document matrix involves some
processing of the document (see figure on the right).
Briefly describe each step and what it does.
a. Tokenize:
b. Transform Cases:
c. Filter Stopwords (English):
d. Filter Tokens by Length:
e. Stem (Porter):
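The steps above can be sketched in plain Python (a simplified stand-in: the real Porter stemmer and a full English stopword list are more involved; the suffix-stripper and stopword set below are illustrative only):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "was", "were"}

def naive_stem(token: str) -> str:
    # Crude stand-in for Porter stemming: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(document: str, min_len: int = 3, max_len: int = 15) -> list:
    tokens = re.findall(r"[A-Za-z]+", document)                   # a. Tokenize
    tokens = [t.lower() for t in tokens]                          # b. Transform Cases
    tokens = [t for t in tokens if t not in STOPWORDS]            # c. Filter Stopwords
    tokens = [t for t in tokens if min_len <= len(t) <= max_len]  # d. Filter Tokens by Length
    return [naive_stem(t) for t in tokens]                        # e. Stem

print(preprocess("The runners were running in the park"))  # → ['runner', 'runn', 'park']
```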
7. Briefly describe what “Sentiment Analysis” is and how it
might be used by a company.
8. What is the difference between text mining and data mining?
9. FINAL EXAM – 2018 – BETTER UNDERSTAND
ATTRITION “projects”
Write me a short report that tells me the following (I’d like for
this report to be uploaded in a separate standalone word file and
look like something you would give an employer):
Business Scenario – write this like you worked for the company.
Tell me what the issue is you are exploring and why.
What you did and why you did it – just discuss the technique
you used, why it was appropriate and what you did. If you did
several iterations, let me know what the final configuration was.
I don’t need to know everything that went on – just what you
did to get the final results.
What you found – tell me everything you found/learned.
Include screen shots, graphs, etc. Anything appropriate to
communicate what you found. Do NOT show me everything
that was generated – just those things that support your
“findings”.
Recommendations - What impact this would have to the
business AND what your recommendations are for the business.
Association Analysis
BINS 4352
Learning Objectives
Gain an understanding of how Association Analysis is used
Understand how Associations are created and how to
interpret/evaluate those Associations
Discuss and understand Association metrics – Lift, Support, and
Confidence
Gain familiarity with RapidMiner
Association Analysis
(Market Basket Analysis)
This is a widely used and, in many ways, one of the most successful data mining algorithms.
It can be used to determine what products people purchase together.
Uses
Stores can use this information to determine store layout and
product placement
Direct marketers can use this information to determine which
new products to offer to their current customers.
Inventory policies can be improved if reorder points reflect the
demand for the complementary products.
Any application where you are looking to see if there is a
pattern where strong associations are present
Parable Of “Beer And Diapers”
Customers who bought diapers at a grocery store between 5-7pm also tend to buy beer.
This is a good example of the business value present in big data
analytics.
More than a parable – it was the result of a study commissioned by Osco in the 1990s and represented a starting point in big data analytics.
The finding led to the notion that discovering uncommon relationships in data can drive business value.
Association Rules for
Market Basket Analysis
Rules are written in the form “left-hand side implies right-hand
side” and an example is:
Yellow Peppers IMPLIES Red Peppers, Bananas
To make effective use of a rule, three numeric measures about
that rule must be considered:
(1) support
(2) confidence
(3) lift
Measures of Predictive Ability
Support and Confidence: An Illustration
Five transactions: {A, B, C}, {A, C, D}, {B, C, D}, {A, D, E}, {B, C}

Rule         Support   Confidence   Lift
A => D       2/5       2/3          (2/3)/(2/5) = 1.67
C => A       2/5       2/4          (2/4)/(2/5) = 1.25
A => C       2/5       2/3          (2/3)/(2/5) = 1.67
B & C => D   1/5       1/3          (1/3)/(1/5) = 1.67
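Using the definition in these slides (lift = confidence / support), the numbers in the table can be reproduced in a few lines of Python:

```python
transactions = [{"A", "B", "C"}, {"A", "C", "D"}, {"B", "C", "D"}, {"A", "D", "E"}, {"B", "C"}]

def support(lhs: set, rhs: set) -> float:
    # Fraction of all transactions containing both sides of the rule.
    return sum(1 for t in transactions if lhs | rhs <= t) / len(transactions)

def confidence(lhs: set, rhs: set) -> float:
    # Of the transactions containing the left-hand side, the fraction also containing the right.
    containing_lhs = [t for t in transactions if lhs <= t]
    return sum(1 for t in containing_lhs if rhs <= t) / len(containing_lhs)

def lift(lhs: set, rhs: set) -> float:
    # Lift as defined on the slide: confidence divided by support.
    return confidence(lhs, rhs) / support(lhs, rhs)

print(round(lift({"A"}, {"D"}), 2))       # A => D      → 1.67
print(round(lift({"B", "C"}, {"D"}), 2))  # B & C => D  → 1.67
```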
A Note On Lift
Lift is an interesting measurement and one that has undergone a
great deal of scrutiny
For our purposes we defined Lift as Confidence/Support
However, there are other ways to calculate this measure
Some have argued that one must take into account the frequency
of the observation
You don’t necessarily want a product that is in 100,000
transactions to be penalized over a product that is involved in
10 transactions simply due to the number of occurrences (or
vice versa)
As such – when looking at this value in a tool keep in mind that
it is the “relative” value that is important and not the “absolute”
value.
Market Basket Analysis Methodology
We first need a list of transactions and what was purchased.
Receipts from stores
This may have to be “reformatted” depending on the tool that
you’re using
Next, we choose a list of products to analyze, and tabulate how
many times each was purchased with the others.
The diagonal of the table shows how often a product is purchased in any combination, and the off-diagonals show which combinations were bought together.
A Convenience Store Example
Consider the following simple example about five transactions
at a convenience store:
Transaction 1: Frozen pizza, cola, milk
Transaction 2: Milk, potato chips
Transaction 3: Cola, frozen pizza
Transaction 4: Milk, pretzels
Transaction 5: Cola, pretzels
These need to be cross tabulated and displayed in a table.
A Convenience Store Example (cont)
The diagonal shows how many times a product was purchased
(in any combination)
Pizza and Cola sell together more often than any other combo; a
cross-marketing opportunity?
Milk sells well with everything – people probably come here specifically to buy it.

Product Bought   Pizza also   Milk also   Cola also   Chips also   Pretzels also
Pizza            2            1           2           0            0
Milk             1            3           1           1            1
Cola             2            1           3           0            1
Chips            0            1           0           1            0
Pretzels         0            1           1           0            2
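The cross-tabulation can be rebuilt directly from the five transactions (a toy sketch using the item names from the example):

```python
from collections import defaultdict
from itertools import permutations

transactions = [
    {"pizza", "cola", "milk"},
    {"milk", "chips"},
    {"cola", "pizza"},
    {"milk", "pretzels"},
    {"cola", "pretzels"},
]

counts = defaultdict(int)
for basket in transactions:
    for item in basket:
        counts[(item, item)] += 1   # diagonal: purchases in any combination
    for a, b in permutations(basket, 2):
        counts[(a, b)] += 1         # off-diagonal: pairwise co-occurrence

print(counts[("pizza", "cola")])  # → 2
print(counts[("milk", "milk")])   # → 3
```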
Using The Results
The tabulations can immediately be translated into association
rules and the numerical measures computed.
Comparing this week’s table to last week’s table can
immediately show the effect of this week’s promotional
activities.
But, you need to be careful that the results were not impacted by some external event (e.g., bad weather)
Some rules are going to be trivial (hot dogs and buns sell
together) or inexplicable (toilet rings sell only when a new
hardware store is opened).
Using The Results
Barbie® => Candy
Forbes (Palmeri 1997) reported that a major retailer has
determined that customers who buy Barbie dolls have a 60%
likelihood of buying one of three types of candy bars. The
retailer was unsure what to do with this nugget. The online
newsletter Knowledge Discovery Nuggets invited suggestions
(Piatetsky-Shapiro 1998):
Put them closer together in the store.
Put them far apart in the store.
Package candy bars with the dolls.
Package Barbie + candy + poorly selling item.
Raise the price on one, lower it on the other.
Barbie accessories for proofs of purchase.
Do not advertise candy and Barbie together.
Offer candies in the shape of a Barbie Doll.
Augmenting Data to Yield More Insights
The sales data can be augmented with the addition of virtual
items.
For example, we could record that the customer was new to us
or had children.
The transaction record might look like:
Item 1: Sweater Item 2: Jacket Item 3: New
This might allow us to see what patterns new customers have
versus existing customers.
Limitations to Market Basket Analysis
A large number of real transactions are needed to do an
effective basket analysis, but the data’s accuracy is
compromised if all the products do not occur with similar
frequency.
The analysis can sometimes capture results that were due to
some external event
For example:
The success of previous marketing campaigns (and not natural
tendencies of customers).
Weather or natural disaster.
Association Analysis in Rapidminer
The Dataset
The data is organized into “Transactions”
Each transaction represents a grocery store receipt
The items we are interested in include: Herring, Baguette, Avocado, Heineken, Olives, Sardines, Corned Beef, Peppers, Soda, Cracker, Bourbon, Artichoke, Coke, Apples, Chicken, Ham, Turkey, Ice Cream, Steak, Bordeaux
Data is coded where “YES” indicates that it was purchased and
“NO” indicates that it was not purchased
Running Association Analysis in RapidMiner
Select
New Process
RapidMiner Studio Professional Main Menu
RapidMiner Studio is very similar in layout to SAS Enterprise
Miner
Design Pane – where you lay out the analysis you want to run
Drag/Drop Objects from the Operator list into the Design Space
Importing Dataset
There are several ways to import data
I am going to read the Excel file that has pre-processed grocery
store receipt data
I drag the “Read Excel” operator into the design space.
Connect the inp port on the side of the design space to the fil
port on the operator
Many operators have 2 output ports – one for processed data
and the other for an original data “pass through”
Configuring The Read Excel Operator
Parameters associated with the Read Excel Operator appear on
the right side of the screen when the operator is selected.
Go to “Import Configuration”
Select The Excel File
Select the data file that you want to import
Select “Next”
Preview The Data
The file that you import can contain multiple sheets
At this point you can select the sheet and the range of cells that
you wish to import
The data file we are working with has 1 sheet and by default all
of the entries are selected
Select “Next”
Annotating Data
Now you have the opportunity to add annotations to the data
We don’t need to set an attribute name for this data set
Select “Next”
Selecting attribute types
RapidMiner tries to determine the types of the attributes from
the data.
For Association Analysis we need to set the types to either
“Binomial” or “Nominal”
I am going to select “Binomial”
Select “Next”
Finished !
Once all of the data types are changed to Binomial – Select
“Finish”
Attribute Selection
Next we need to select the attributes that will be used in the
analysis.
This can be found on the Operator Pane under Blending ->
Attributes -> Selection
Drag/Drop Select Attributes into the Design
Select Attributes
Connect the “out” port of Read Excel to the “exa” port on the
Select Attributes.
Select the Select Attributes Operator – the parameters for the
operator will appear on the right side of the screen.
Selecting Attributes
On the Parameters Pane
Select the “subset” Attribute filter type
Select “Select Attributes”
The Select Attributes dialog box appears
Select all of the attributes except for “Transaction”
This is an ID for the transaction and is not needed for the
analysis
FP-Growth (Frequency Calculations)
Next we drag/drop an FP-Growth operator into the design
Connect the “exa” port of Select Attributes to “exa” of FP-
Growth
The FP Growth operator determines the “frequent item sets”
A frequent item set denotes the items (products) in the set that
have been purchased together frequently (in a certain ratio of
transactions)
We also need to define the positive value (open advanced
parameters)
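For intuition, frequent item sets can be computed by brute force on small data; FP-Growth arrives at the same sets far more efficiently using a prefix tree. A sketch using the earlier convenience-store transactions:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.4):
    """Brute-force frequent itemset mining (FP-Growth finds the same sets
    more efficiently via a prefix tree)."""
    items = sorted({i for t in transactions for i in t})
    min_count = min_support * len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                result[frozenset(candidate)] = count / len(transactions)
                found = True
        if not found:
            break  # no frequent itemset of this size, so none larger either (Apriori property)
    return result

transactions = [
    {"pizza", "cola", "milk"},
    {"milk", "chips"},
    {"cola", "pizza"},
    {"milk", "pretzels"},
    {"cola", "pretzels"},
]
print(frequent_itemsets(transactions))
```

At 40% minimum support the only frequent pair is {pizza, cola}, matching the cross-tabulation shown earlier.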
Create Association Rules
Drag/drop “Create Association Rules” from the Operator Pane
into the Design space.
Connect the “fre” (frequencies) of FP-Growth to “ite” of Create
Association Rules
Parameters driving the rule creation can be set (Confidence, lift,
… and Thresholds)
Ready To Run !
Connect the “rul” and “ite” ports of the Create Association
Rules operator to the output (res) of the design space.
Select “Run” -
Output
We get 2 sets of output
One tab is for the FP-Growth operator and shows the Frequent Item Sets
The other contains the Association Rules
The Frequency data shows you the support for every
combination of products in the data set
Associations Rules (Sorted By Support)
Association Rules (Sorted By Confidence)
Association Rules (Sorted By Lift)
Interpreting The Rules
Rule: IF (turkey, baguette) THEN (ham, olives)
Support: The percentage of the time that the rule was true
26.7% of the time the basket contained both (turkey,baguette)
and (ham, olives)
Confidence: The percentage of the time that the baskets that did contain (turkey, baguette) also contained (ham, olives)
85% of the time when the basket contained (turkey, baguette) it
also contained (ham, olives)
Lift: is the relative measure that looks at how many times
Confidence is larger than the expected level (similar to what we
discussed earlier – better than a baseline model)
Greater than 1 is desired
The larger the value the better
Association Analysis in Statistica
Same Data File
Reformatted data just a little for Statistica (in Excel)
Each line contains what was sold for that transaction
Link Analysis
Go to Data Mining -> Link Analysis
Select Non-sequential association analysis
Select Variables as we have done in the past
Transaction – is the Transaction ID
Food items – are Multi-response variables
Database Selection
If you run the analysis multiple times, you may have to select a
database name.
This should only be the case if you exit the tool, reload data,
and try to run again.
Select “OK” and run the analysis
Results
Association Rules
Frequency Itemsets
Rule Graph
Web Graph
Questions?
Data Preparation
Objectives
Provide a perspective on how data may have to be prepared for
a given Analytics task
Understand the relationship between the data mining technique
and the business problem being addressed
Gain an understanding of the types of data issues that might be
found
Understand options when it comes to “fixing” or “addressing”
data issues
Understand what impact “uncleaned” data may have on the
analysis
Provide a “practical” and “applied” understanding of statistical
concepts
Data in Data Mining
Data: a collection of facts usually obtained as the result of
experiences, observations, or experiments.
Data may consist of numbers, words, images, …
Data: lowest level of abstraction (from which information and
knowledge are derived).
Nominal – mutually exclusive, but not ordered, categories (e.g., male, female)
Ordinal – order matters (e.g., Freshman, Sophomore, Junior,
Senior).
Interval – measure where difference between two variables is
meaningful.
Ratio – all the properties of Interval, but with an absolute 0
which means complete lack of that variable.
Inspecting Data and Preparation in Excel
Preparing Data In Excel
We are looking for anomalies in the data
Missing values
Values out of a defined range
Etc.
Once we find these values – we can
Repair the data
Impute missing values
Delete the response
Data Dictionary
The data dictionary tab describes everything that is in the data
set
If you download a data set from the Internet, it will often
include a data dictionary in a separate file.
Raw data
The raw data is the data as it was collected.
There have been no changes/modifications to the data at this
point.
Highlight Blanks
All of the data was selected in the spreadsheet
Go to FIND & SELECT in Excel (on the HOME Tab)
Select “GO TO SPECIAL”
Select “Blanks”
Then set the fill color to RED
Analyze Rows / Columns
Count non-blank responses
Calculate the frequency of the “expected values”
Calculate difference from the means
Calculate single answer bias
How To Handle Anomalies In The Data
Decide what you’re going to do about the anomalies you found in the data.
Filter values
Repair the file (impute values)
Leave them alone
Delete responses
Etc.
If you are going to “change” data or “delete” data, move the
original values to a separate sheet.
You are documenting what you did
This makes it easier to “undo” your change if you need to.
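In code, the same discipline looks like this (a pandas sketch with made-up column names and thresholds: keep an untouched copy of the originals, then repair/impute):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, 29, np.nan, 41, 250],
    "income": [52000, np.nan, 48000, 61000, 57000],
})

original = df.copy()  # the "separate sheet": an untouched copy of the originals

df.loc[df["age"] > 120, "age"] = np.nan       # treat an impossible age as missing
df = df.fillna(df.median(numeric_only=True))  # impute missing values with the column median

print(df)
```

Keeping `original` around documents what changed and makes the edit easy to undo.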
Statistica
Data
This sheet contains the “cleaned” data that you’re going to load
into Statistica for further analysis.
The Excel manipulation was intended to address the “obvious” problems
Graphical Inspection Credit Scoring Data
in Statistica
Application
As we have discussed practically all data will need some
preparation
Moreover, that preparation may be slightly different based on the application and the type of analysis that you are doing
It is important to have a good understanding of both the data
and what you’re trying to accomplish through the data mining
process
Data preparation
Handling missing data and outliers
Selecting important variables
Sampling
Data preparation is specific to BOTH the data set and the task.
The preparation method and decisions made during data
preparation may change if either change
Application For Credit Scoring Data
Business Need
A financial institution has data about their past customers.
These customers are classified as either good or bad credit risks
based on their history with the institution.
The classification (good/bad) is based on whether or not the
loan payment was delinquent and the magnitude of the loss
A financial institution needs a way to decide if and how much
credit to extend to customers who apply for loans.
Business goal: reduce the losses due to bad loans
Goals of the data mining process
Determine the variables that are best predictors of credit risk
Find a high performance predictive model that classifies
customers
Deploy that model to make decisions on future credit
applications
Update the model as more data is collected
Credit Scoring Data Set
We are going to explore the credit scoring data set
This data will be used to explore
Data preparation
Classification
Try to keep in mind – this will be a “classification exercise”.
It can be applied to different data sets and domains where
classification is appropriate
Examine the data

Data                                      Type
Credit rating                             Categorical
Balance of current account                Categorical
Duration of credit                        Continuous
Payment of previous credit                Categorical
Purpose of credit                         Categorical
Amount of credit                          Continuous
Value of savings                          Categorical
Employed by current employer              Categorical
Installment in % of available income      Categorical
Marital status                            Categorical
Gender                                    Categorical
Living in current household for           Categorical
Most valuable asset                       Categorical
Age                                       Continuous
Further running credits                   Categorical
Type of apartment                         Categorical
Number of previous credits at this bank   Categorical
Occupation                                Categorical
Train/Test                                Categorical
Credit Scoring
Start by looking at the credit scoring application and the
business need
Review the variables in the credit risk data set
Discuss the next steps for the data mining process
Classification
Classification can be used to classify a variable with 2 or more
groups
Find the probability of a particular predicted classification.
For example:
Loan denied
Loan approved
Examine The Data
Below is the data in Statistica
It was opened by FILE-> OPEN
Look At Some Histograms Of The Data
Histograms
Credit rating is the dependent variable – it is the one we want to
make predictions for
Notice that there have been more than twice as many customers
with good credit as compared to bad
This may mean that we need to adjust our sample to keep the
analysis from being “good” credit biased
Histograms
Here we have the number of previous credits at the bank
5-6 and 7 or more are relatively small compared to the other
categories
Hence, we may want to recode the data to have a 5 or more
category
This is a good general rule of thumb
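In pandas this recoding is a one-liner (a sketch; the category labels are taken from the histogram):

```python
import pandas as pd

credits = pd.Series(["one", "2-4", "one", "5-6", "7 or more", "2-4", "one"])

# Collapse the sparse "5-6" and "7 or more" categories into a single "5 or more" bin.
recoded = credits.replace({"5-6": "5 or more", "7 or more": "5 or more"})
print(recoded.value_counts().to_dict())  # → {'one': 3, '2-4': 2, '5 or more': 2}
```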
Remaining Variables
Remaining Variables
Note the majority of customers either have no previous credit or
have paid back their previous loans
Remaining Variables
Remaining Variables
Remaining Variables
Note: that there have been more than twice as many male
customers as female customers
Remaining Variables
Remaining Variables
Age is interesting in that you would expect that customers need
to be at least 18 years old to apply for credit, so we need to
make sure that is the case in the 15-20 year old group
Remaining Variables
Remaining Variables
Next Steps
We have stated the business goals and have data available to do
the analysis
We have visually inspected the data in order to gain a high-level
understanding of the data
We still need to do more work here
But, we have identified our dependent variable (Credit Risk)
and the potential predictor variables
We need to continue exploring the data and look at things that
we can do to prepare the data for analysis
Further Analysis
Go to the interactive drill down tool
Select the “Drill Variables”
Select “Payment of Previous Credits”
Drill Down
Select “No Previous Credits”
Go back to the “Drill down variables” and select “Number of Previous Credits”
Select “Brush”
Previous payment
So here we have an apparent contradiction.
We have drilled down to look at customers where the Payment
of Previous Credits = no previous credits
But, yet the number of previous credits at this bank has values
for 2-4, 5-6, and 7 or more?
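A quick consistency check can surface such contradictions programmatically (a pandas sketch with hypothetical column names; per the drill-down above, the contradictory values are "2-4", "5-6", and "7 or more"):

```python
import pandas as pd

df = pd.DataFrame({
    "payment_of_previous_credits": ["no previous credits", "paid back", "no previous credits"],
    "number_of_previous_credits": ["one", "2-4", "2-4"],
})

# Rows claiming "no previous credits" while also reporting multiple previous credits.
contradictions = df[
    (df["payment_of_previous_credits"] == "no previous credits")
    & (df["number_of_previous_credits"].isin(["2-4", "5-6", "7 or more"]))
]
print(contradictions.index.tolist())  # → [2]
```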
Scatter Plot Matrix
A scatter plot matrix can help us look for potential outliers
Scatter Plot Matrix
We can combine this with the Drill Down and look at scatter
plots of all of the continuous variables with respect to those
where the Payment of Previous Credits = no previous credits
Scatter Plot – Duration Of Credit Vs Amount Of Credit
Scatter Plot
We can look for outliers here
If we know that there is a maximum loan amount, then we can
remove those that are greater than that value
If we know that there is a maximum duration, we can remove all
of those that are greater than that
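As a sketch (the thresholds here are hypothetical; substitute the institution's actual maximum loan amount and duration):

```python
import pandas as pd

loans = pd.DataFrame({
    "duration_months": [12, 24, 48, 300],
    "amount": [2000, 9000, 15000, 80000],
})

# Hypothetical business limits on duration (months) and loan amount.
MAX_DURATION, MAX_AMOUNT = 72, 50000

# Keep only rows within the known limits; the rest are treated as outliers.
clean = loans[(loans["duration_months"] <= MAX_DURATION) & (loans["amount"] <= MAX_AMOUNT)]
print(len(clean))  # → 3
```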
Other Graphical Techniques
Box Plots
A box plot or boxplot is a convenient way of graphically
depicting groups of numerical data through their quartiles.
Box plots may also have lines extending vertically from the
boxes (whiskers) indicating variability outside the upper and
lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.
Outliers may be plotted as individual points.
Box plots are non-parametric: they display variation in samples
of a statistical population without making any assumptions of
the underlying statistical distribution.
The spacing between the different parts of the box indicates the degree of dispersion (spread) and skewness in the data, and shows outliers.
There Are Several Variations of Box Plots
The “box” :
The band inside the box is the second quartile (the median).
Statistica gives you the option to make this the “mean”
The ends of the “whiskers” can represent several possible
alternative values:
Min and Max of all of the data
Lowest datum still within 1.5 interquartile range (IQR) of the
lower quartile, and the highest datum still within 1.5 IQR of the
upper quartile (often called the Tukey boxplot)
one standard deviation above and below the mean of the data
Etc.
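The Tukey variant (whiskers at 1.5 * IQR beyond the quartiles) is easy to compute directly; a sketch:

```python
import numpy as np

def tukey_outliers(values):
    """Flag points outside 1.5 * IQR beyond the quartiles (Tukey's rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

print(tukey_outliers([1, 2, 3, 4, 5, 100]))  # → [100]
```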
Box Plot
These may seem more “primitive” than a histogram, but they do
have some advantages
They take up less space – so they are particularly useful in
comparing distributions among several groups
The number and width of bins can greatly impact the appearance
of a histogram
Box plots are particularly good at identifying outliers in
continuous data
Box Plots in Statistica
Go to “GRAPHS” and select “Box Plots”
Graph Type
Box Whiskers
Regular (will give you 1 variable per graph)
Multiple (will give you 1 graph with all variables on it)
Variables – where you select the variables for analysis
Box Plots
Leave the defaults for Grouping Intervals
Middle point
Value – median
Style – determines the graphics that show the median in the box
Box Plots
I have re-selected “Multiple” Graph type
I prefer the median be marked with a “Line”
Box Plots
The graph is fairly crowded – but, I can see outliers that are
identified by the small circles.
I will re-run this with fewer variables to see if we can see it
better
Box Plots
Now we can more clearly see where the outliers are in the data
set.
These need to be examined for deletion
Next we will look at this in SPSS
Marking The Outliers
Right clicking on the outliers will pull up a menu where you can
tell Statistica to “Mark the Outliers”
SPSS – Box Plots
The data is loaded into
SPSS
This can be done by loading the same file we used for Statistica
Go to “Graphs” and then to “Legacy Dialogs” and Select “Box
Plots”
Select “Simple” and “Summaries of separate variables”
SPSS – Box Plots
Next we select the variables and move them to the “Boxes
Represent” area
Select “OK”
SPSS – Box Plots
We see output very similar to Statistica.
The outliers are marked with “o” on the graph
In SPSS the default is to display the “number” of the response
with the outlier
Inspecting Outliers
In Box Plot
We want to be very conservative when identifying responses as
“outliers”
Count how many “outliers” each respondent has
For example, response 188 has 3 outliers, 187 has 4 outliers,
and so on.
I may delete 1 or 2 responses that have a “large” number of
outliers and then rerun the box plot.
This will cause things to shift a bit
I then iterate until I’m happy with the data.
Leaving outliers in the data is fine – this may be a “true”
response
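The small circles on these plots come from the standard 1.5 x IQR whisker rule; here is a minimal sketch of that rule (illustrative Python with made-up numbers, not Statistica or SPSS output):

```python
import numpy as np

def iqr_outliers(values):
    """Flag points beyond the usual box-plot whiskers (1.5 * IQR past the quartiles)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# A mostly ordinary survey column with two suspicious responses.
scores = [3, 4, 4, 5, 5, 5, 6, 6, 7, 42, -20]
print(iqr_outliers(scores))  # [42, -20]
```

Counting how many variables flag the same respondent is then just a matter of running this per column and tallying hits per row.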
QUESTIONS?
Data
Categorical: Nominal, Ordinal
Numerical: Interval, Ratio
[Histograms of the CreditScoring dataset (19v*1000c); y-axis: No of obs]
Credit Rating: bad, good
Number of previous credits at this bank: one, 2-4, 5-6, 7 or more
Balance of Current Account: no running account, no balance, <= $300, > $300
Duration of Credit: continuous, axis -10 to 80
Payment of Previous Credits: hesitant, problematic running accounts, no previous credits, no problems with current credits, paid back
Purpose of Credit: other, new car, used car, furniture, television, household appliances, repair, education, vacation, retraining, business
Amount of Credit: continuous, axis -$5,000 to $30,000
Value of Savings: no savings, <140, 140-700, 700-1400, >1400
Employed by Current Employer for: unemployed, <1 year, 1-5 years, 5-8 years, > 8 years
Installment in % of Available Income: > 35, 25-35, 15-25, < 15
Marital Status: divorced/living apart, divorced/living apart/married, single, married/widowed
Gender: male, female
Living in Current Household for: < 1 year, 1-5 years, 5-8 years, > 8 years
Most Valuable Assets: no assets, car, life insurance, ownership of house or land
Age: continuous, axis 10 to 80
Further running credits: at other banks, at department store, no further running credits
Type of Apartment: free, rented, owned
Number of previous credits at this bank (repeated): one, 2-4, 5-6, 7 or more
Occupation: unskilled with no permanent residence, unskilled with permanent residence, skilled employee, executive/self-employed
[Histogram for brushing: Number of previous credits at this bank; N Total: 1000, Selected: 530; selection: Payment of Previous Credits = no previous credits]
[Correlation matrices for Duration of Credit, Amount of Credit, and Age: one for the full sample (N Total: 1000, Selected: 1000) and one for the subset with Payment of Previous Credits = no previous credits (N Total: 1000, Selected: 530)]
[Scatterplot of Amount of Credit against Duration of Credit (CreditScoring 19v*1000c); fitted line: Amount of Credit = 298.4367 + 204.818*x]
[Two box plots of multiple variables from 1-ORIGINAL-FULL-DATASET-WALK-THROUGH (61v*261c), marking the median, the 25%-75% box, the non-outlier range whiskers, outliers, and extremes. The first covers CCSC_1, CCOC_1, CCAC_1, CCAC_6, HCA_5, UCA_4, PEOU_3, PU_4, PP_5, BI_5, PF_4, and PIIT_1; the second covers UCA_1, UCA_3, UCA_5, PEOU_1, PEOU_3, PU_1, PU_3, PP_1, PP_3, and PP_5. Both use a 0-8 scale.]
Big Data Analytics Tools./Final Exam materials/2 BINS 4352 -
Cluster Analysis.pptx
Cluster Analysis
Cluster Analysis - Review
Outline
What is it? What is a cluster?
How is it different from a decision tree?
What is distance and linkage?
What is hierarchical clustering?
What is a scree plot and a dendrogram?
What is non-hierarchical clustering (k-means)?
How to learn it in detail?
Simple Case
Student   Physics   Calculus
Joe       15        20
Bill      20        15
Paula     26        21
Jane      44        52
Jack      50        45
Carlos    57        38
Carla     80        85
Russell   90        88
Eddie     98        98
If we look at the student data on the left, we can easily see a
pattern emerging
There are natural groupings of students that seem to be apparent
in the data.
Plotting these on a 2-dimensional scatter plot makes it more
visible
Similarities And Dissimilarities
If we look at the objects in any one of the groupings, they are
all very similar to the other objects in that grouping
Also, if we look at any one object in a grouping, it is very
dissimilar to any object in another grouping
This gives rise to the notion of:
Homogeneous within and Heterogeneous across based on
characteristics
Clusters
These groupings are clusters
They represent a “natural grouping of similar objects” based on
a collection of input parameters
Now, if you add a third or fourth dimension (e.g., English,
History), then the clusters may change
However, the way in which they are constructed is the same
Clusters
Clusters
There is no objective function (i.e., an equation to be optimized
given certain constraints with variables that need to be
minimized or maximized such as trying to express a business
goal in mathematical terms)
There is no dependent variable
This is sometimes called subjective segmentation
The segmentation is developed on its own based on the values
of the input variables
It is called an unsupervised learning technique.
Once the segments are developed
They need to be understood
You need to decide how you are going to deal with the segments
that have emerged
How Is This Different From A Decision Tree
The decision tree technique requires a clearly defined dependent
variable
For example, GOOD/BAD credit
The technique is based on identifying those variables
(characteristics) which are closely associated with the
dependent variable.
Distances
[Scatter plot of the nine students (Joe, Bill, Paula, Jane, Jack, Carlos, Carla, Russell, Eddie) with pairwise distances illustrated]
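Distance here is typically Euclidean; a sketch using the (Physics, Calculus) scores from the student table:

```python
import math

# (Physics, Calculus) scores from the simple case above.
students = {
    "Joe": (15, 20), "Bill": (20, 15), "Paula": (26, 21),
    "Jane": (44, 52), "Jack": (50, 45), "Carlos": (57, 38),
    "Carla": (80, 85), "Russell": (90, 88), "Eddie": (98, 98),
}

def distance(a, b):
    """Euclidean distance between two students' score pairs."""
    return math.dist(students[a], students[b])

# Within-group distances are small, across-group distances are large.
print(round(distance("Joe", "Bill"), 1))   # 7.1
print(round(distance("Joe", "Carla"), 1))  # 91.9
```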
Distance Between 2 Clusters
Sometimes called the linkage function: intermediate cluster
distance
How do you calculate the distance between 2 clusters
Single linkage
Calculate the distance between each point in one cluster to each
point in the neighboring cluster and then find the shortest
distance.
Complete linkage
Similar to single linkage – except we look for the furthest
distance
Centroid distance
Calculate the “center” of each cluster and then calculate the
distance between centroids
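The three linkage functions can be written out directly (plain Python, using two of the student groups from the earlier example):

```python
import math

def single_linkage(c1, c2):
    """Shortest pairwise distance between the two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Longest pairwise distance between the two clusters."""
    return max(math.dist(p, q) for p in c1 for q in c2)

def centroid_distance(c1, c2):
    """Distance between the clusters' mean points."""
    def centroid(c):
        return [sum(dim) / len(c) for dim in zip(*c)]
    return math.dist(centroid(c1), centroid(c2))

a = [(15, 20), (20, 15), (26, 21)]  # Joe, Bill, Paula
b = [(44, 52), (50, 45), (57, 38)]  # Jane, Jack, Carlos

print(round(single_linkage(a, b), 1),
      round(complete_linkage(a, b), 1),
      round(centroid_distance(a, b), 1))  # 33.9 45.7 39.9
```

Note that the three measures disagree on "how far apart" the same two clusters are, which is why the linkage choice changes the tree you get.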
Hierarchical Clustering
Used when you have a small number of observations (usually
hundreds)
You cannot use this method for a large dataset because it
becomes computationally impractical
The way this works is to join objects (cases) together into
successively larger clusters using some measure of similarity or
distance
During the clustering process it shows how the clusters are
formed
The result of the clustering is the hierarchical tree
In SAS this is done by using the “proc cluster” command
In Statistica this is called Joining or Tree Clustering
Hierarchical Clustering
We begin by each case in a cluster by itself
In each step we slowly “relax” the criterion which defines
“uniqueness”
In other words, we lower the threshold for declaring two or more
objects to be within the same cluster
As a result we group more and more objects together and each
“layer” consists of increasingly dissimilar objects
In the last step – all objects are grouped together into a single
cluster
Scree Plot
When we look at a scree plot what we have is the “within
cluster variance”
Total variance = between variance + within variance
The “elbow” in the graph indicates the optimal number of
clusters
When the number of clusters = 1, the within group variance =
total variance
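The decomposition "Total variance = between variance + within variance" can be checked numerically; a sketch using the Physics scores from the student example, split into their three visible groups:

```python
import numpy as np

# Physics scores, grouped as Joe/Bill/Paula, Jane/Jack/Carlos, Carla/Russell/Eddie.
scores = np.array([15, 20, 26, 44, 50, 57, 80, 90, 98], dtype=float)
groups = [scores[:3], scores[3:6], scores[6:]]

mean = scores.mean()
total = ((scores - mean) ** 2).sum()                            # total SS
within = sum(((g - g.mean()) ** 2).sum() for g in groups)       # within-cluster SS
between = sum(len(g) * (g.mean() - mean) ** 2 for g in groups)  # between-cluster SS

print(abs(total - (within + between)) < 1e-9)  # True
print(round(within / total, 2))                # 0.04, a tiny within share: tight clusters
```

With one cluster, `within` equals `total`; as the number of clusters grows, `within` shrinks, and the "elbow" is where the shrinking slows down.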
Dendrogram
When the data contain a clear structure in terms of clusters,
then its structure is often reflected in the hierarchical tree
The result of a successful analysis is that you can detect and
interpret the structure by looking at the branches
Hierarchical Clustering - Students
Everything starts out in a cluster by itself
Find the closest objects.
Merge those into a cluster.
Recalculate distances (from the center of the new cluster)
Find the closest objects.
Merge those into a cluster.
Recalculate distances (from the center of the new cluster)
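Those steps can be turned into a small agglomerative loop (an illustrative Python sketch that merges by centroid distance, not the Statistica implementation):

```python
import math

def centroid(cluster):
    """Mean point of a cluster of (x, y) tuples."""
    return [sum(dim) / len(cluster) for dim in zip(*cluster)]

def agglomerate(points, k):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[p] for p in points]  # every case starts in a cluster by itself
    while len(clusters) > k:
        # Find the closest pair of clusters (centroid distance).
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        # Merge them; centroids are recalculated on the next pass.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

scores = [(15, 20), (20, 15), (26, 21), (44, 52), (50, 45),
          (57, 38), (80, 85), (90, 88), (98, 98)]
for c in agglomerate(scores, 3):
    print(sorted(c))
```

Run on the nine students, this recovers the three groupings visible in the scatter plot.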
[Dendrogram illustration: Joe, Bill, Paula, Jack, Jane, …]
K-means Clustering
Non-hierarchical
Used when you have a large number of observations
You decide up front how many clusters you need (k)
K-means algorithm
Partition objects into k non-empty subsets (randomly)
Compute the centroid for each of the clusters
Centroid is the center (i.e., mean point of the cluster)
This defines the “seed” point for each cluster
Assign each object to the cluster with the nearest seed point
Go back to step 2 and repeat until the assignment does not
change
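The four steps translate almost line for line into code (an illustrative sketch of the algorithm, not SAS's proc fastclus):

```python
import math
import random

def k_means(points, k, seed=0):
    """Plain k-means: pick seed points, then reassign/recompute until stable."""
    rng = random.Random(seed)
    # Step 1 (simplified): use k randomly chosen cases as the initial seed points.
    centroids = [list(p) for p in rng.sample(points, k)]
    while True:
        # Step 3: assign each object to the cluster with the nearest seed point.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2: recompute the centroid of each (non-empty) cluster.
        new_centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Step 4: repeat until the assignment does not change.
        if new_centroids == centroids:
            return [c for c in clusters if c]
        centroids = new_centroids

scores = [(15, 20), (20, 15), (26, 21), (44, 52), (50, 45),
          (57, 38), (80, 85), (90, 88), (98, 98)]
clusters = k_means(scores, k=3)
print([len(c) for c in clusters])
```

With well-separated groups this typically settles in a few passes; the result can depend on the random seed points, which is why tools often restart from several seeds.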
In SAS you do this by using the “proc fastclus” command
EM Clustering in Statistica
Getting started
EM algorithm
Uses distributions of the (continuous) data to find the clusters
You specify the distribution technique
Very similar to K-means clustering
First step – hypothesize how many clusters will be in the data
With K-means and EM, the optimum number of clusters can be
determined with V-fold cross validation
Setting Up A Cluster Analysis
Go to DATA MINING -> Cluster
This will pull up the K-means / EM clustering dialog box
Use EM clustering
Select all of the variables (Except for Sample)
Select Variables And Configure V-fold Validation
Select EM algorithm
Number of clusters 2
Number of iterations 50
Keep the defaults on the EM tab – we have limited options here
because all of the variables are categorical
On the Validation tab select
V-fold cross validation
Statistica will search for the optimum number of clusters
between 2 and 25 clusters
Select “OK”
Big Data Analytics Tools..DS_Store__MACOSXBig Data Analyti.docx

  • 1. Big Data Analytics Tools./.DS_Store __MACOSX/Big Data Analytics Tools./._.DS_Store Big Data Analytics Tools./ Final Exam/PROJECT - BETTER UNDERSTAND ATTRITION.docx FINAL EXAM – EXERCISE – To Better Understand Attrition. This is a final project – you are going to exam the HR- BalanceSheet dataset and write a short report on what you found. I will guide you through the analysis, but as we go through the analysis you are going to need to capture data for the final report. 1. Load the dataset into Statistica 2. Generate Histograms for all of the data a. Make notes on what you observe from the histograms. Can you learn anything about the business from these histograms? b. Capture all of the histograms. 3. Now generate a correlation matrix to see if any variables are highly correlated. If variables are highly correlated and you are doing a supervised method (e.g., decision tree), then one of them must be omitted from the analysis. Do you know why? Statistics->Nonparametrics->Correlations Okay. Now select ALL of the variables and select “Spearman rank R”. 4. Let’s copy this out to Excel. a. Open a blank Excel file b. Go to Statistica – the output correlation matrix – i. Hit Ctrl – A - this will select everything.
  • 2. ii. Right Click - select “Copy with Headers” iii. Go To Excel – select Paste 5. Select all of the numbers in Excel a. Go To Conditional Formatting i. Highlight all values greater than 0.70 6. This tells you the values that are highly correlated. Record what they are – these cannot be used in a supervised modeling exercise together. For example, JobLevel and TotalWorkingYears are highly correlated. a. Make a list of all of the variables that are highly correlated (>0.7). BUSINESS PROBLEM: The company has employee data for the last several years. In this data set we have a wide range of data, including whether or not they left the company (i.e., Attrition). If Attrition is set to “Yes”, they left the company. If Attrition is set to “No”, they did not leave the company. The first thing we want to do is take a “high” level look at those people who left the company. Go to Selection Criteria – that is accessible through the Sel:Off setting at the bottom of the Statistica window. Click on “Sel:Off” Set the selection criteria to Attribute = “Yes”. 7. Generate Histograms for all of the data a. Make notes on what you observe from the histograms. Can you learn anything about the business from these histograms? b. Capture the histograms that tell you something about the business. Go back to the selection criteria and turn the Sel: back to “Off”.
  • 3. 8. Now build a decision tree (C&RT) to see if we can find out what influences where or not individuals decide to leave the company. If you exclude the variables that are highly correlated, you can generate a tree. Generate a C&RT tree Pick your variables (Quick) · Attrition is your dependent variable · Select the categorical and continuous variables that you reasonably think could be an issue with respect to attrition. · Select your response codes · ALL Don’t do anything on Classification (YET) – you may want to go back and play with the classification weights – but, don’t do that yet. On the “Stopping” tab, change the minimum n to 20. This will allow it to build a deeper tree. Select V-fold cross validation on the Validation tab
  • 4. Set Surrogate to 2 on the Advanced tab and hit OK. Look at your tree – Look at the Predicted Versus Observed – under classification. Look at “Importance” on the Summary tab – this tells you which variables have the greatest impact. This is your initial tree --- Now – the best you’re going to be able to do is get about 80% accuracy on both Predicting yes and no. So – play with it and see how good you can get it. · Play with the classification costs · You may try to create a stratified subsample using Attrition as the strata variable 1 __MACOSX/Big Data Analytics Tools./ Final Exam/._PROJECT - BETTER UNDERSTAND ATTRITION.docx Big Data Analytics Tools./ Final Exam/.DS_Store
  • 5. __MACOSX/Big Data Analytics Tools./ Final Exam/._.DS_Store Big Data Analytics Tools./ Final Exam/FINAL EXAM - 2018.docx FINAL EXAM NAME: _________________________ 1. When we evaluate models we often discuss things like predictive accuracy, speed, robustness, scalability, and interpretability. Briefly discuss what is meant by “interpretability” and why it is important. 2. You have been hired by the county government to help automate a system to detect fraudulent spending by government employees. You have been given a database of transactions for the past 10 years to work with. Each record in the database contains all of the details of each transaction as well as information related to the particular employee. In this database accountants have manually gone through the data and marked each transaction as either “Good” or “Fraudulent”. The goal – build a model based on the historical data that will flag future transactions as either “Good” or “Fraudulent”. This will eliminate the need for the accountants to have to go through each transaction manually in the future. What type of modeling technique (e.g., decision tree, association analysis, clustering, etc.) would you use and why? 3. AT&T has been losing customers to Verizon. They want to try to understand why this is the case. They have customer records for the past 5 years that contain demographic information (age, gender, etc.) for the customers, the type of plan that they have, the number of interactions they have had with customer support and whether or not those customers left
  • 6. AT&T. AT&T wants you to build a model that can be used to predict whether or not a customer is going to leave and switch to another provider. What type of technique (e.g., decision tree, association analysis, clustering, etc.) would you use and why?? 4. Kroger is trying to find ways to improve sales. They have all of their receipts for the past 5 years. The receipts contain information about what was purchased, who purchased, and the date and time of the transaction. You task is to analyze sales patterns and make recommendations with respect to store layouts that, you hope, will increase sales. What type of modeling technique (e.g., decision trees, association analysis, clustering, etc.) would you use and why? 5. You work for a cable service provider. You provide a variety of services for your customers. Your company provides cable TV, home phone, security systems, and internet services. Your customer base is very diverse. Your customers could be male/female, young/old, single/married/married with children, etc. You have a wide range of ethnic backgrounds and income levels. You want to make your marketing campaigns more effective. This means targeting the right groups with the right messages using the right media. (For example, marketing via social media may be more or less effective for 18 year olds as compared to 80 year olds). You been tasked to use the customer database and determine what the different customer segments are and what they look like. Then, once you figure out what the unique groups are you can go through and develop a targeted campaign for each
  • 7. group. What type of modeling technique (e.g., decision trees, association analysis, clustering, etc.) would you use to determine the different market sectors and why? Text and Web Analytics 6. When we do text analytics, we read in the data, we transform the data into documents, and then we must generate a term/document matrix. This term/document matrix is what we use to perform analysis. Generation of the term/document matrix involves some processing of the document (see figure on the right). Briefly describe each step and what it does. a. Tokenize: b. Transform Cases: c. Filter Stopwords (English): d. Filter Tokens by Length: e. Stem (Porter): 7. Briefly describe what “Sentiment Analysis” is and how it might be used by a company. 8. What is the difference between text mining and data mining? 9. FINAL EXAM – 2018 – BETTER UNDERSTAND ATTRITION “projects” Write me a short report that tells me the following (I’d like for this report to be uploaded in a separate standalone word file and look like something you would give an employer): Business Scenario – write this like you worked for the company. Tell me what the issue is you are exploring and why. What you did and why you did it – just discuss the technique you used, why it was appropriate and what you did. If you did several iterations, let me know what the final configuration was.
  • 8. I don’t need to know everything that went on – just what you did to get the final results. What you found – tell me everything you found/learned. Include screen shots, graphs, etc. Anything appropriate to communicate what you found. Do NOT show me everything that was generated – just those things that support your “findings”. Recommendations - What impact this would have to the business AND what your recommendations are for the business. 1 __MACOSX/Big Data Analytics Tools./ Final Exam/._FINAL EXAM - 2018.docx Big Data Analytics Tools./ Final Exam/HR-BalancedSheet.sta __MACOSX/Big Data Analytics Tools./ Final Exam/._HR- BalancedSheet.sta Big Data Analytics Tools./Final Exam materials/1 BINS 4352 - Association Analysis.pptx Association Analysis BINS 4352 Learning Objectives Gain an understanding of how Association Analysis is used Understand how Associations are created and how to interpret/evaluate those Associations Discuss and understand Association metrics – Lift, Support, and Confidence
  • 9. Gain familiarity with RapidMiner Association Analysis (Market Basket Analysis) This is a widely used and, in many ways, one of the most successful data mining algorithm. It can be used to determines what products people purchase together. Uses Stores can use this information to determine store layout and product placement Direct marketers can use this information to determine which new products to offer to their current customers. Inventory policies can be improved if reorder points reflect the demand for the complementary products. Any application where you are looking to see if there is a pattern where strong associations are present Parable Of “Beer And Diapers” Customers who bought diapers at a grocery store between 5- 7pm also tend to by beer. This is a good example of the business value present in big data analytics. More than a parable – it was the result of a study commissioned by Osco in the 1990’s and represented a starting point in big data analytics The finding led to the notion that there is value in discovering uncommon relationships in data can be used to drive business value.
  • 10. Association Rules for Market Basket Analysis Rules are written in the form “left-hand side implies right-hand side” and an example is: Yellow Peppers IMPLIES Red Peppers, Bananas To make effective use of a rule, three numeric measures about that rule must be considered: (1) support (2) confidence (3) lift Measures of Predictive Ability Support and Confidence: An Illustration Five example transactions: {A, B, C}, {A, C, D}, {B, C, D}, {A, D, E}, {B, C, E}
  • 11.
RULE         SUPPORT   CONFIDENCE   LIFT
A => D       2/5       2/3          (2/3)/(2/5) = 1.67
C => A       2/5       2/4          (2/4)/(2/5) = 1.25
A => C       2/5       2/3          (2/3)/(2/5) = 1.67
B & C => D   1/5       1/3          (1/3)/(1/5) = 1.67
A Note On Lift Lift is an interesting measurement and one that has undergone a great deal of scrutiny For our purposes we defined Lift as Confidence/Support However, there are other ways to calculate this measure Some have argued that one must take into account the frequency of the observation You don’t necessarily want a product that is in 100,000
  • 12. transactions to be penalized over a product that is involved in 10 transactions simply due to the number of occurrences (or vice versa) As such – when looking at this value in a tool keep in mind that it is the “relative” value that is important and not the “absolute” value. Market Basket Analysis Methodology We first need a list of transactions and what was purchased. Receipts from stores This may have to be “reformatted” depending on the tool that you’re using Next, we choose a list of products to analyze, and tabulate how many times each was purchased with the others. The diagonals of the table show how often a product is purchased in any combination, and the off-diagonals show which combinations were bought. A Convenience Store Example Consider the following simple example about five transactions at a convenience store: Transaction 1: Frozen pizza, cola, milk Transaction 2: Milk, potato chips Transaction 3: Cola, frozen pizza Transaction 4: Milk, pretzels Transaction 5: Cola, pretzels These need to be cross-tabulated and displayed in a table.
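The slides do this cross-tabulation in a spreadsheet; as a language-agnostic illustration, the same counts and the three rule metrics can be sketched in plain Python (a minimal sketch, using the five convenience-store transactions above and the Confidence/Support definition of lift from these slides):

```python
# The five convenience-store transactions from the example above.
transactions = [
    {"pizza", "cola", "milk"},
    {"milk", "chips"},
    {"cola", "pizza"},
    {"milk", "pretzels"},
    {"cola", "pretzels"},
]
products = ["pizza", "milk", "cola", "chips", "pretzels"]

# Diagonal = how often a product was bought at all;
# off-diagonal = how often the pair was bought together.
crosstab = {
    (a, b): sum(1 for t in transactions if a in t and b in t)
    for a in products for b in products
}

def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift (confidence/support, the
    definition used in these slides) for antecedent => consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    lift = confidence / support if support else 0.0
    return support, confidence, lift

# pizza => cola: in 2 of 5 baskets (support 0.4), and in 2 of the
# 2 baskets that contained pizza (confidence 1.0).
support, confidence, lift = rule_metrics({"pizza"}, {"cola"})
```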
  • 13. A Convenience Store Example (cont) The diagonal shows how many times a product was purchased (in any combination) Pizza and Cola sell together more often than any other combo; a cross-marketing opportunity? Milk sells well with everything – people probably come here specifically to buy it.
Product bought   Pizza also   Milk also   Cola also   Chips also   Pretzels also
Pizza                 2            1           2           0             0
Milk                  1            3           1           1             1
Cola                  2            1           3           0             1
Chips                 0            1           0           1             0
Pretzels              0            1           1           0             2
Using The Results The tabulations can immediately be translated into association rules and the numerical measures computed. Comparing this week’s table to last week’s table can immediately show the effect of this week’s promotional activities. But, you need to be careful that the results were not impacted by some external event (e.g., bad weather) Some rules are going to be trivial (hot dogs and buns sell together) or inexplicable (toilet rings sell only when a new hardware store is opened). Using The Results
  • 14. Barbie® => Candy Forbes (Palmeri 1997) reported that a major retailer has determined that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars. The retailer was unsure what to do with this nugget. The online newsletter Knowledge Discovery Nuggets invited suggestions (Piatetsky-Shapiro 1998) Put them closer together in the store. Put them far apart in the store. Package candy bars with the dolls. Package Barbie + candy + poorly selling item. Raise the price on one, lower it on the other. Barbie accessories for proofs of purchase. Do not advertise candy and Barbie together. Offer candies in the shape of a Barbie Doll. Augmenting Data to Yield More Insights The sales data can be augmented with the addition of virtual items. For example, we could record that the customer was new to us or had children. The transaction record might look like: Item 1: Sweater Item 2: Jacket Item 3: New This might allow us to see what patterns new customers have versus existing customers. Limitations to Market Basket Analysis A large number of real transactions are needed to do an effective basket analysis, but the data’s accuracy is compromised if all the products do not occur with similar frequency.
  • 15. The analysis can sometimes capture results that were due to some external event For example: The success of previous marketing campaigns (and not natural tendencies of customers). Weather or natural disaster. Association Analysis in RapidMiner The Dataset The data is organized into “Transactions” Each transaction represents a grocery store receipt The items we are interested in include: Herring, Baguette, Avocado, Heineken, Olives, Sardines, Corned Beef, Peppers, Soda, Cracker, Bourbon, Artichoke, Coke, Apples, Chicken, Ham, Turkey, Ice Cream, Steak, Bordeaux Data is coded where “YES” indicates that it was purchased and “NO” indicates that it was not purchased Running Association Analysis in RapidMiner Select New Process RapidMiner Studio Professional Main Menu RapidMiner Studio is very similar in layout to SAS Enterprise Miner Design Pane – where you lay out the analysis you want to run Drag/Drop Objects from the Operator list into the Design Space
  • 16. Importing Dataset There are several ways to import data I am going to read the Excel file that has pre-processed grocery store receipt data I drag the “Read Excel” operator into the design space. Connect the inp port on the side of the design space to the fil port on the operator Many operators have 2 output ports – one for processed data and the other for an original data “pass through” Configuring The Read Excel Operator Parameters associated with the Read Excel Operator appear on the right side of the screen when the operator is selected. Go to “Import Configuration” Select The Excel File Select the data file that you want to import Select “Next” Preview The Data The file that you import can contain multiple sheets At this point you can select the sheet and the range of cells that you wish to import The data file we are working with has 1 sheet and by default all of the entries are selected Select “Next”
  • 17. Annotating Data Now you have the opportunity to add annotations to the data We don’t need to set an attribute name for this data set Select “Next” Selecting attribute types RapidMiner tries to determine the types of the attributes from the data. For Association Analysis we need to set the types to either “Binomial” or “Nominal” I am going to select “Binomial” Select “Next” Finished! Once all of the data types are changed to Binomial – Select “Finish” Attribute Selection Next we need to select the attributes that will be used in the analysis. This can be found on the Operator Pane under Blending -> Attributes -> Selection Drag/Drop Select Attributes into the Design Select Attributes Connect the “out” port of Read Excel to the “exa” port on the
  • 18. Select Attributes. Select the Select Attributes Operator – the parameters for the operator will appear on the right side of the screen. Selecting Attributes On the Parameters Pane Select the “subset” Attribute filter type Select “Select Attributes” The Select Attributes dialog box appears Select all of the attributes except for “Transaction” This is an ID for the transaction and is not needed for the analysis FP-Growth (Frequency Calculations) Next we drag/drop an FP-Growth operator into the design Connect the “exa” port of Select Attributes to the “exa” port of FP-Growth The FP-Growth operator determines the “frequent item sets” A frequent item set denotes the items (products) in the set that have been purchased together frequently (in a certain ratio of transactions) We also need to define the positive value (open advanced parameters) Create Association Rules Drag/drop “Create Association Rules” from the Operator Pane into the Design space. Connect the “fre” (frequencies) of FP-Growth to “ite” of Create
  • 19. Association Rules Parameters driving the rule creation can be set (Confidence, lift, … and Thresholds) Ready To Run! Connect the “rul” and “ite” ports of the Create Association Rules operator to the output (res) of the design space. Select “Run” - Output We get 2 sets of output One tab is for the FP-Growth operator and shows the Frequent Item Sets The other contains the Association Rules The Frequency data shows you the support for every combination of products in the data set Association Rules (Sorted By Support) Association Rules (Sorted By Confidence)
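The “frequent item sets” that the FP-Growth operator reports can be illustrated without the tool. FP-Growth itself uses a compact prefix tree, but the definition is simple enough to show with a brute-force Python sketch (toy baskets made up for illustration, not the course data set):

```python
from itertools import combinations

# Toy baskets, not the course data set.
transactions = [
    {"ham", "olives", "turkey"},
    {"ham", "olives", "baguette", "turkey"},
    {"soda", "cracker"},
    {"ham", "turkey", "baguette"},
]

def frequent_itemsets(transactions, min_support=0.5):
    """Every item set whose support (fraction of baskets containing
    all of its items) meets min_support. FP-Growth finds the same
    sets far more efficiently on real data."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count / n >= min_support:
                result[combo] = count / n
    return result

sets = frequent_itemsets(transactions, min_support=0.5)
# ("ham", "turkey") appears in 3 of the 4 baskets -> support 0.75
```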
  • 20. Association Rules (Sorted By Lift) Interpreting The Rules Rule: IF (turkey, baguette) THEN (ham, olives) Support: The percentage of the time that the rule was true 26.7% of the time the basket contained both (turkey, baguette) and (ham, olives) Confidence: The percentage of the baskets that contained (turkey, baguette) that also contained (ham, olives) 85% of the time when the basket contained (turkey, baguette) it also contained (ham, olives) Lift: a relative measure of how many times Confidence is larger than the expected level (similar to what we discussed earlier – better than a baseline model) Greater than 1 is desired The larger the value the better Association Analysis in Statistica Same Data File Reformatted the data just a little for Statistica (in Excel) Each line contains what was sold for that transaction
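The reshaping described here – from one YES/NO column per product to one row per transaction listing only what was sold – was done in Excel, but the idea can be sketched in a few lines of Python (column names here are hypothetical, not the actual file’s):

```python
# Hypothetical wide-format rows: one YES/NO column per product,
# mimicking the RapidMiner layout.
wide = [
    {"Transaction": 1, "Ham": "YES", "Olives": "YES", "Soda": "NO"},
    {"Transaction": 2, "Ham": "NO", "Olives": "YES", "Soda": "YES"},
]

def to_item_lists(rows):
    """One output row per transaction, listing only purchased items."""
    out = []
    for row in rows:
        items = [col for col, val in row.items()
                 if col != "Transaction" and val == "YES"]
        out.append((row["Transaction"], items))
    return out

long_form = to_item_lists(wide)
```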
  • 21. Link Analysis Go to Data Mining -> Link Analysis Select Non-sequential association analysis Select Variables as we have done in the past Transaction – is the Transaction ID Food items – are Multi-response variables Database Selection If you run the analysis multiple times, you may have to select a database name. This should only be the case if you exit the tool, reload data, and try to run again. Select “OK” and run the analysis Results Association Rules Frequency Itemsets Rule Graph Web Graph
  • 22. Questions?
Big Data Analytics Tools./Final Exam materials/2 BINS 4352- Data Preparation.pptx
Data Preparation Objectives Provide a perspective on how data may have to be prepared for a given Analytics task Understand the relationship between the data mining technique and the business problem being addressed Gain an understanding of the types of data issues that might be found Understand options when it comes to “fixing” or “addressing” data issues Understand what impact “uncleaned” data may have on the analysis Provide a “practical” and “applied” understanding of statistical concepts Data in Data Mining Data: a collection of facts usually obtained as the result of experiences, observations, or experiments. Data may consist of numbers, words, images, … Data: lowest level of abstraction (from which information and knowledge are derived). Nominal – mutually exclusive, but not ordered categories (e.g.,
  • 23. male, female) Ordinal – order matters (e.g., Freshman, Sophomore, Junior, Senior). Interval – measure where the difference between two values is meaningful. Ratio – all the properties of Interval, but with an absolute 0 which means complete lack of that variable. Inspecting Data and Preparation in Excel Preparing Data In Excel We are looking for anomalies in the data Missing values Values out of a defined range Etc. Once we find these values – we can Repair the data Impute missing values Delete the response Data Dictionary The data dictionary tab describes everything that is in the data set If you download a data set from the Internet, it will often include a data dictionary in a separate file.
  • 24. Raw data The raw data is the data as it was collected. There have been no changes/modifications to the data at this point. Highlight Blanks All of the data was selected in the spreadsheet Go to FIND & SELECT in Excel (on the HOME Tab) Select “GO TO SPECIAL” Select “Blanks” Then set the fill color to RED Analyze Rows / Columns Count non-blank responses Calculate the frequency of the “expected values” Calculate differences from the means Calculate single-answer bias
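The slides run these row checks in Excel; as an illustration, the same checks can be sketched in Python (toy survey-style rows; `None` marks a blank cell, and the 1–5 “expected values” range is an assumption for the example):

```python
# Assumed expected range for the toy data (e.g., Likert-style 1..5).
EXPECTED = set(range(1, 6))

def inspect_row(row):
    """The row checks from the slides: non-blank count, out-of-range
    count, and single-answer bias (every given answer identical)."""
    answered = [v for v in row if v is not None]
    return {
        "non_blank": len(answered),
        "out_of_range": sum(1 for v in answered if v not in EXPECTED),
        "single_answer": len(answered) > 1 and len(set(answered)) == 1,
    }

report = inspect_row([1, None, 1, 1])  # one blank, identical answers
```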
  • 25. How To Handle Anomalies In The Data Decide what you’re going to do about the anomalies you found in the data. Filter values Repair the file (impute values) Leave them alone Delete responses Etc. If you are going to “change” data or “delete” data, move the original values to a separate sheet. You are documenting what you did This makes it easier to “undo” your change if you need to. Statistica Data This sheet contains the “cleaned” data that you’re going to load into Statistica for further analysis. The Excel manipulation was intended to address the “obvious” problems Graphical Inspection Credit Scoring Data in Statistica Application As we have discussed, practically all data will need some preparation Moreover – that preparation may be slightly different based on the application and the type of analysis that you are doing
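One of the repair options above is imputation. A minimal mean-imputation sketch in Python, which also keeps the original (position, value) pairs on a separate “audit” list so the change is documented and can be undone, as the slides recommend (toy numeric column):

```python
def impute_mean(values):
    """Fill blanks (None) with the mean of the present values, and
    return an audit trail of what was changed."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    audit = [(i, v) for i, v in enumerate(values) if v is None]
    filled = [mean if v is None else v for v in values]
    return filled, audit

column = [12.0, None, 15.0, 11.0, None, 14.0]  # toy column
filled, audit = impute_mean(column)
# mean of 12, 15, 11, 14 is 13.0, so both blanks become 13.0
```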
  • 26. It is important to have a good understanding of both the data and what you’re trying to accomplish through the data mining process Data preparation Handling missing data and outliers Selecting important variables Sampling Data preparation is specific to BOTH the data set and the task. The preparation method and decisions made during data preparation may change if either change Application For Credit Scoring Data Business Need A financial institution has data about their past customers. These customers are classified as either good or bad credit risks based on their history with the institution. The classification (good/bad) is based on whether or not the loan payment was delinquent and the magnitude of the loss A financial institution needs a way to decide if and how much credit to extend to customers who apply for loans. Business goal: reduce the losses due to bad loans Goals of the data mining process Determine the variables that are best predictors of credit risk Find a high performance predictive model that classifies customers Deploy that model to make decisions on future credit applications Update the model as more data is collected Credit Scoring Data Set We are going to explore the credit scoring data set
  • 27. This data will be used to explore Data preparation Classification Try to keep in mind – this will be a “classification exercise”. It can be applied to different data sets and domains where classification is appropriate Examine the data:
Variable                                   Type
Credit rating                              Categorical
Balance of current account                 Categorical
Duration of credit                         Continuous
Payment of previous credit                 Categorical
Purpose of credit                          Categorical
Amount of credit                           Continuous
Value of savings                           Categorical
Employed by current employer               Categorical
Installment in % of available income       Categorical
Marital status                             Categorical
Gender                                     Categorical
Living in current household for            Categorical
Most valuable asset                        Categorical
Age                                        Continuous
Further running credits                    Categorical
Type of apartment                          Categorical
Number of previous credits at this bank    Categorical
Occupation                                 Categorical
Train/Test                                 Categorical
Credit Scoring Start by looking at the credit scoring application and the business need Review the variables in the credit risk data set Discuss the next steps for the data mining process Classification Classification can be used to classify a variable with 2 or more
  • 28. groups Find the probability of a particular predicted classification. For example: Loan denied Loan approved Examine The Data Below is the data in Statistica It was opened by FILE-> OPEN Look At Some Histograms Of The Data Histograms Credit rating is the dependent variable – it is the one we want to make predictions for Notice that there have been more than twice as many customers with good credit as compared to bad This may mean that we need to adjust our sample to keep the analysis from being “good” credit biased Histograms Here we have the number of previous credits at the bank 5-6 and 7 or more are relatively small compared to the other categories Hence, we may want to recode the data to have a 5 or more category This is a good general rule of thumb
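The recoding suggested above (collapsing the sparse “5-6” and “7 or more” groups into a single “5 or more” category) can be sketched in Python; the counts here are made up for illustration:

```python
# Toy category counts for "Number of previous credits at this bank".
counts = {"one": 650, "2-4": 300, "5-6": 30, "7 or more": 20}

def recode(category):
    """Collapse the two sparse categories into one bucket."""
    return "5 or more" if category in ("5-6", "7 or more") else category

recoded = {}
for cat, n in counts.items():
    recoded[recode(cat)] = recoded.get(recode(cat), 0) + n
# the two small groups merge into a single "5 or more" count of 50
```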
  • 29. Remaining Variables Remaining Variables Note the majority of customers either have no previous credit or have paid back their previous loans Remaining Variables Remaining Variables Remaining Variables Note: that there have been more than twice as many male customers as female customers Remaining Variables Remaining Variables
  • 30. Age is interesting in that you would expect that customers need to be at least 18 years old to apply for credit, so we need to make sure that is the case in the 15-20 year old group Remaining Variables Remaining Variables Next Steps We have stated the business goals and have data available to do the analysis We have visually inspected the data in order to gain a high-level understanding of the data We still need to do more work here But, we have identified our dependent variable (Credit Risk) and the potential predictor variables We need to continue exploring the data and look at things that we can do to prepare the data for analysis Further Analysis Go to the interactive drill down tool Select the “Drill Variables” Select “Payment of Previous Credits” Drill Down Select “No Previous Credits” Go back to the “Drill down variables” and select “Number of Previous Payments”
  • 31. Select “Brush” Previous payment So here we have an apparent contradiction. We have drilled down to look at customers where the Payment of Previous Credits = no previous credits But, yet the number of previous credits at this bank has values for 2-4, 5-6, and 7 or more? Scatter Plot Matrix A scatter plot matrix can help us look for potential outliers Scatter Plot Matrix We can combine this with the Drill Down and look at scatter plots of all of the continuous variables with respect to those where the Payment of Previous Credits = no previous credits Scatter Plot – Duration Of Credit Vs Amount Of Credit Scatter Plot We can look for outliers here If we know that there is a maximum loan amount, then we can remove those that are greater than that value
  • 32. If we know that there is a maximum duration, we can remove all of those that are greater than that Other Graphical Techniques Box Plots A box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box- and-whisker diagram. Outliers may be plotted as individual points. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution. The spacing between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers. There Are Several Variations of Box Plots The “box” : The band inside the box is the second quartile (the median). Statistica gives you the option to make this the “mean” The ends of the “whiskers” can represent several possible alternative values: Min and Max of all of the data Lowest datum still within 1.5 interquartile range (IQR) of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile (often called the Tukey boxplot) one standard deviation above and below the mean of the data
  • 33. Etc. Box Plot These may seem more “primitive” than a histogram, but they do have some advantages They take up less space – so they are particularly useful in comparing distributions among several groups The number and width of bins can greatly impact the appearance of a histogram Box plots are particularly good at identifying outliers in continuous data Box Plots in Statistica Go to “GRAPHS” and select “Box Plots” Graph Type Box Whiskers Regular (will give you 1 variable per graph) Multiple (will give you 1 graph with all on it” Variables – where you select the variables for analysis Box Plots Leave the defaults for Grouping Intervals Middle point Value – median Style – determines the graphics that show the median in the box
  • 34. Box Plots I have re-selected “Multiple” Graph type I prefer the median be marked with a “Line” Box Plots The graph is fairly crowded – but, I can see outliers that are identified by the small circles. I will re-run this with fewer variables to see if we can see it better Box Plots Now we can more clearly see where the outliers are in the data set. These need to be examined for deletion Next we will look at this in SPSS Marking The Outliers Right clicking on the outliers will pull up a menu where you can tell Statistica to “Mark the Outliers” SPSS – Box Plots The data is loaded into SPSS This can be done by loading the same file we used for Statistica
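The Tukey-fence rule the box plots use to flag outliers can be sketched in Python. Note this is only a sketch: Statistica and SPSS may compute quartiles with slightly different interpolation methods, so boundary cases can differ; the duration values below are toy data:

```python
def tukey_outliers(values):
    """Flag points beyond 1.5*IQR outside the quartiles (Tukey fences)."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        pos = q * (n - 1)          # linear interpolation between ranks
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

durations = [6, 12, 12, 18, 24, 24, 30, 36, 48, 120]  # toy durations
# 120 sits far above the upper fence and is flagged
```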
  • 35. Go to “Graphs” and then to “Legacy Dialogs” and Select “Box Plots” Select “Simple” and “Summaries of separate variables” SPSS – Box Plots Next we select the variables and move them to the “Boxes Represent” area Select “OK” SPSS – Box Plots We see output very similar to Statistica. The outliers are marked with “o” on the graph In SPSS the default is to display the “number” of the response with the outlier Inspecting Outliers In Box Plot We want to be very conservative when identifying responses as “outliers” Count how many “outliers” each respondent has For example, response 188 has 3 outliers, 187 has 4 outliers, and so on. I may delete 1 or 2 responses that have a “large” number of outliers and then rerun the box plot. This will cause things to shift a bit I then iterate until I’m happy with the data. Leaving outliers in the data is fine – this may be a “true”
  • 36. response QUESTIONS? Data types: Data divides into Categorical (Nominal, Ordinal) and Numerical (Interval, Ratio). [Slides 36-48: Statistica histograms (CreditScoring, 19v*1000c) for Credit Rating, Number of previous credits at this bank, Balance of Current Account, Duration of Credit, Payment of Previous Credits, Purpose of Credit, Amount of Credit, Value of Savings, Employed by Current Employer for, Installment in % of Available Income, Marital Status, Gender, Living in Current Household for, Most Valuable Assets, Age, Further running credits, Type of Apartment, and Occupation; the brushing histogram of Number of previous credits at this bank where Payment of Previous Credits = no previous credits (530 of 1000 cases selected); correlation tables for Duration of Credit, Amount of Credit, and Age; the scatterplot of Amount of Credit against Duration of Credit (fitted line: Amount of Credit = 298.4367 + 204.818*x); and box plots of multiple variables (median; box: 25%-75%; whisker: non-outlier range).]
Big Data Analytics Tools./Final Exam materials/2 BINS 4352 - Cluster Analysis.pptx
Cluster Analysis Cluster Analysis - Review Outline What is it? What is a cluster? How is it different from a decision tree? What is distance and linkage? What is hierarchical clustering?
  • 49. What is a scree plot and dendrogram? What is non-hierarchical clustering (k-means)? How to learn it in detail? Simple Case Study
Student   Physics   Calculus
Joe          15        20
Bill         20        15
Paula        26        21
Jane         44        52
Jack         50        45
Carlos       57        38
Carla        80        85
Russell      90        88
Eddie        98        98
If we look at the student data in the table, we can easily see a pattern emerging There are natural groupings of students that seem to be apparent in the data. Plotting these on a 2-dimensional scatter plot makes it more visible Similarities And Dissimilarities If we look at the objects in any one of the groupings, they are all very similar to the other objects in that grouping Also, if we look at any one object in a grouping, it is very dissimilar to any object in another grouping This gives rise to the notion of: Homogeneous within and Heterogeneous across based on characteristics Clusters
  • 50. These groupings are clusters They represent a “natural grouping of similar objects” based on a collection of input parameters Now, if you add a third or fourth dimension (e.g., English, History), then the clusters may change However, the way in which they are constructed is the same Clusters Clusters There is no objective function (i.e., an equation to be optimized given certain constraints with variables that need to be minimized or maximized, such as trying to express a business goal in mathematical terms) There is no dependent variable This is sometimes called subjective segmentation The segmentation is developed on its own based on the values of the input variables It is called an unsupervised learning technique. Once the segments are developed They need to be understood You need to decide how you are going to deal with the segments that have emerged How Is This Different From A Decision Tree The decision tree technique requires a clearly defined dependent variable For example, GOOD/BAD credit The technique is based on identifying those variables (characteristics) which are closely associated with the dependent variable.
  • 52. Distances Distances (cont) [Slides: pairwise distance matrix between the nine students – Paula, Bill, Joe, Jane, Jack, Carlos, Carla, Russell, Eddie] Distance Between 2 Clusters Sometimes called the linkage function: intermediate cluster distance How do you calculate the distance between 2 clusters
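The pairwise distances behind the “Distances” slides can be reproduced with plain Euclidean distance on the (Physics, Calculus) scores from the case-study table; a short Python sketch:

```python
import math

# (Physics, Calculus) scores from the case-study table.
students = {
    "Joe": (15, 20), "Bill": (20, 15), "Paula": (26, 21),
    "Jane": (44, 52), "Jack": (50, 45), "Carlos": (57, 38),
    "Carla": (80, 85), "Russell": (90, 88), "Eddie": (98, 98),
}

def distance(a, b):
    """Euclidean distance between two students' score vectors."""
    return math.dist(students[a], students[b])

# Joe-Bill is small (same grouping); Joe-Eddie is large (different
# groupings) -- exactly the pattern the distance matrix shows.
```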
  • 53. Single linkage Calculate the distance between each point in one cluster and each point in the neighboring cluster and then find the shortest distance. Complete linkage Similar to single linkage – except we look for the furthest distance Centroid distance Calculate the “center” of each cluster and then calculate the distance between centroids Hierarchical Clustering Used when you have a small number of observations (usually hundreds) You cannot use this method for a large dataset because it becomes computationally impractical The way this works is to join objects (cases) together into successively larger clusters using some measure of similarity or distance During the clustering process it shows how the clusters are formed The result of the clustering is the hierarchical tree In SAS this is done by using the “proc cluster” command In Statistica this is called Joining or Tree Clustering Hierarchical Clustering We begin with each case in a cluster by itself In each step we slowly “relax” the criterion which defines
  • 54. “uniqueness” In other words, we lower the threshold for what it means to declare two or more objects within the same cluster As a result we group more and more objects together and each “layer” consists of increasingly dissimilar objects In the last step – all objects are grouped together into a single cluster Scree Plot When we look at a scree plot what we have is the “within cluster variance” Total variance = between variance + within variance The “elbow” in the graph indicates the optimal number of clusters When the number of clusters = 1, the within group variance = total variance Dendrogram When the data contain a clear structure in terms of clusters, then that structure is often reflected in the hierarchical tree The result of a successful analysis is that you can detect and interpret the structure by looking at the branches Hierarchical Clustering - Students Everything starts out in a cluster by itself Find the closest objects. Merge those into a cluster. Recalculate distances (from center of new cluster)
  • 55. Find the closest objects. Merge those into a cluster. Recalculate distances (from center of new cluster) Joe Bill Paula Jack Jane … K-means Clustering Non-hierarchical Used when you have a large number of observations You decide up front how many clusters you need (k) K-means algorithm Partition objects into k non-empty subsets (randomly) Compute the centroid for each of the clusters Centroid is the center (i.e., mean point) of the cluster This defines the “seed” point for each cluster Assign each object to the cluster with the nearest seed point Go back to step 2 and repeat until the assignment does not change In SAS you do this by using the “proc fastclus” command EM Clustering in Statistica Getting started EM algorithm Uses distributions of the (continuous) data to find the clusters You specify the distribution technique Very similar to K-means clustering
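The k-means steps listed above can be sketched directly in Python on the student scores. One caveat: real tools seed the centroids randomly (or with k-means++), while this sketch takes hand-picked starting centroids, one near each of the three visible groupings, so the result is deterministic:

```python
import math

# (Physics, Calculus) scores from the case study.
points = [(15, 20), (20, 15), (26, 21),   # Joe, Bill, Paula
          (44, 52), (50, 45), (57, 38),   # Jane, Jack, Carlos
          (80, 85), (90, 88), (98, 98)]   # Carla, Russell, Eddie

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Step 1: assign each point to the nearest centroid (seed).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2: recompute each centroid as its cluster's mean point.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

clusters = kmeans(points, centroids=[(20, 18), (50, 45), (89, 90)])
# converges to the three natural groupings of three students each
```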
  • 56. First step – hypothesize how many clusters will be in the data With K-means and EM, the optimum number of clusters can be determined with V-fold cross-validation Setting Up A Cluster Analysis Go to DATA MINING -> Cluster This will pull up the K-means / EM clustering dialog box Use EM clustering Select all of the variables (except for Sample) Select Variables And Configure V-fold Validation Select EM algorithm Number of clusters 2 Number of iterations 50 Keep defaults on the EM tab – we have limited options here because all of the variables are categorical On the Validation tab select V-fold cross validation Statistica will search for the optimum number of clusters between 2 and 25 clusters Select “OK”
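Whether read off a scree plot’s “elbow” or chosen by V-fold cross-validation, the quantity being tracked when picking the number of clusters is the within-cluster variance. A Python sketch on the student scores, comparing one big cluster against the three natural groupings (the clusterings here are hand-made for illustration):

```python
import math

def wss(clusters):
    """Within-cluster variance: sum of squared distances from each
    point to its cluster's centroid."""
    total = 0.0
    for c in clusters:
        centroid = tuple(sum(dim) / len(c) for dim in zip(*c))
        total += sum(math.dist(p, centroid) ** 2 for p in c)
    return total

students = [(15, 20), (20, 15), (26, 21),
            (44, 52), (50, 45), (57, 38),
            (80, 85), (90, 88), (98, 98)]

k1 = [students]                                      # k = 1
k3 = [students[0:3], students[3:6], students[6:9]]   # k = 3
# wss(k3) is far smaller than wss(k1) -- the sharp drop that forms
# the "elbow" on a scree plot
```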