R: A Proposed Analysis and Visualization Environment for Network
Security Data
Joshua McNutt
CERT Network Situational Awareness Group,
Carnegie Mellon University, Pittsburgh, PA 15213, USA
jmcnutt@cert.org
Abstract
The R statistical language provides an analysis environ-
ment which is flexible, extensible and analytically pow-
erful. This paper details its potential as an analysis and
visualization interface to SiLK flow analysis tools as part
of a network situational awareness capability.
1 Introduction
The efficacy of network security analysis is highly de-
pendent upon the data interface and analysis environment
made available to the analyst. The command line seldom
offers adequate visual displays of data, while many GUI
designs necessarily limit the query specificity afforded
at the command line. This paper proposes the use of R,
a statistical analysis and visualization environment, for
interfacing with flow data. R is a complete program-
ming language and, consequently, is highly extensible.
Its built-in analysis and visualization capabilities provide
the analyst with a powerful means for investigating and
modeling network behavior.
2 R! What is it good for?
R is a language and environment for statistical comput-
ing and graphics used by statisticians worldwide. It is
syntactically very similar to the S language which was
developed at Bell Laboratories (now Lucent Technolo-
gies). Unlike S, R is available as free software under the
terms of the Free Software Foundation’s GNU General
Public License in source code form. Additional details
are provided by the R Project for Statistical Computing
(http://www.r-project.org/). The website also provides
links to documentation and program files for download-
ing. Supported platforms include Windows, Linux and
MacOS X.
R is an object-based environment which can run inter-
actively or in batch mode. It has the ability to generate
publication-quality graphical displays on-screen or for
hardcopy. Users can write scripts and functions which
leverage the programming language’s many features, in-
cluding loops, conditionals, user-defined recursive func-
tions and input/output facilities. For computationally-
intensive tasks, C and Fortran code can be linked.
There are a handful of packages supplied with the
R distribution covering virtually all standard statistical
analyses. Many more packages are available through the
Comprehensive R Archive Network (CRAN), a family
of Internet sites covering a very wide range of modern
statistical methods.
3 SiLK Tools
The suite of command line tools known as the System for
Internet-Level Knowledge (SiLK) is used for the collection
and examination of Cisco NetFlow version 5 data. The CERT
Network Situational Awareness (NetSA) Team wrote SiLK (see
Note 1) for the purpose of analyzing flow data collected on
large-volume networks. Flow data summarizes host
communications, providing a comprehensive view of network
traffic.
The SiLK analysis tools provide Unix-like commands
with functionality that includes selecting (a.k.a. filter-
ing), displaying (ASCII output), sorting and summariz-
ing packed binary flow data. Multiple commands can
also be piped together for complex filtering. In this pa-
per, we utilize the tools rwfilter (to select the data) and
rwcount to generate binned time series of flow records,
bytes and packets and feed the results into R for analysis.
Further details on the functionality of SiLK can be found
in [1].
4 Motivation: Command Line versus GUI
Many experienced users enjoy the query specificity afforded
by the command line. But, in order to visualize their data,
they must make do with a third-party graphing program. They
often do not favor a graphical user interface because their
options for both queries and visualization tend to become
more limited. What we hope to provide with the R interface
is a preservation of command line control with the added
features of integrated visualization and analysis.
Essentially, we would describe it as an enhanced command
line experience, but it also provides the analyst with all
of the benefits of the R language’s object-based workspace
model.

R Objects
Object      Description
vector      ordered collection of numbers
scalar      single-element vector
array       multi-dimensional vector
matrix      two-dimensional array
factor      vector of categorical data (see Note 2)
data frame  matrix-like structure in which the columns can
            be of different types (e.g., numerical and
            categorical variables)
list        general form of vector in which the various
            elements need not be of the same type, and are
            often themselves vectors or lists. Lists provide
            a convenient way to return the results of a
            statistical computation.
function    an object in R which manipulates other objects
Table 1: Data object types in R
5 R Data Manipulation
5.1 R Data Objects
Every entity in the R environment is an object. Numeric
vectors, ordered collections of numbers, are the simplest
and most common type of object, but there are many oth-
ers. See Table 1 for a description of the object types.
In this paper, our example uses a data frame to store
our data. The data frame object is a very flexible matrix-
like entity which, unlike a matrix, allows the columns to
be of different types.
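
As a brief illustration of the object types in Table 1, the following sketch builds a vector, a factor, a matrix, a data frame and a list. All names and values are invented for illustration and do not come from real flow data.

```r
# Illustrative values only; none of these numbers come from real flow data.
records <- c(120, 95, 143)                  # numeric vector
proto   <- factor(c("tcp", "udp", "tcp"))   # factor (categorical data)
m       <- matrix(1:6, nrow = 2)            # matrix: two-dimensional array

# Data frame: columns of different types, one row per time bin.
flows <- data.frame(bin = 1:3, proto = proto, records = records)

# List: elements of different types, convenient for returning results.
result <- list(data = flows, total = sum(flows$records), note = "example")

str(flows)       # shows the type of each column
result$total     # list elements are accessed with $
```

The list at the end mirrors how a wrapper function can bundle a data frame together with derived results in a single returned object.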
5.2 SiLK Data Access
It should be noted that while we use R to interface with
SiLK, virtually any command-line tool could be used
with R. Also, R has multiple SQL database interface li-
braries. Many methods exist for interfacing with data
stores. We detail below the R-SiLK interface being used
at this time.
Within R, wrapper functions tied to specific tools in
the SiLK suite read in the user-specified SiLK command
line as a text-string parameter. The wrapper function
makes a system call to the computer running the flow
tools. Then, using a standard R data input function, the
wrapper function reads in the ASCII output of the com-
mand line call. The results of the wrapper function call
are assigned to a list object in R. Each element of that
list represents a different analysis result, e.g. a matrix
of the data, summary statistics, etc. Subsequent analysis
and visualization operations can then be applied to that
output object or any of its elements.
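
The read-in step described above might be sketched as follows. This is not the paper's implementation: the helper name read_tool_output(), the runner parameter, and the assumption that the tool prints pipe-delimited text with a header row are all hypothetical. The runner parameter exists only so that the parsing can be tried without SiLK installed.

```r
# Hypothetical helper, not the paper's code: run a command line and parse
# its delimited text output into a data frame.
read_tool_output <- function(command,
                             runner = function(cmd) system(cmd, intern = TRUE),
                             sep = "|") {
  lines <- runner(command)          # character vector, one element per line
  # Assumption: the first line is a header row and fields are split by `sep`.
  df <- read.table(text = lines, sep = sep, header = TRUE,
                   strip.white = TRUE, stringsAsFactors = FALSE)
  # A trailing delimiter produces an all-NA final column; drop such columns.
  df[, colSums(!is.na(df)) > 0, drop = FALSE]
}

# Canned output standing in for the tool (values are made up).
fake_runner <- function(cmd) c(
  "Date|Records|Bytes|Packets|",
  "05/18/2005 12:00:00|10|4000|50|",
  "05/18/2005 12:00:30|12|5200|61|"
)

flows <- read_tool_output("placeholder command", runner = fake_runner)
print(flows)
```

With the default runner, the same function would pass the command string to the operating system via system(); a real wrapper would also need to handle command failures and empty output.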
5.3 R Workspace
All objects are located in the user’s workspace, which can
be saved at the conclusion of the R session and restored
at the start of the next session. The command history()
displays the commands most recently submitted to R by the
user (the last 25 by default).
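
A minimal sketch of saving and restoring an object is given below. It uses save() and load() on a temporary file so that it can be run without touching the user's real workspace file; save.image() applies the same mechanism to the whole workspace. The object contents are invented.

```r
# Save a results object to a file and restore it later.
obj  <- list(type = "example", value = 42)
path <- tempfile(fileext = ".RData")
save(obj, file = path)

rm(obj)        # simulate the end of a session
load(path)     # restores `obj` under its original name
obj$value
```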
5.4 Analysis Capability
From simple summary statistics to advanced simulations,
the R platform provides functions, extension packages
(available through CRAN) and visualization capabilities
appropriate to a wide range of flow analysis tasks. The
object-based nature of the R environment makes it a use-
ful platform for the network security analyst. Objects
from different analyses can be preserved in the user’s
workspace for comparison purposes. Also, rapid proto-
typing of new analysis tools is possible due to the wealth
of built-in capabilities and the ease with which new func-
tions can be written.
The CERT/NetSA Team has used R for a variety of
analysis tasks, from logistic regression to robust correla-
tion analysis. We have used its SQL interface functional-
ity to access hourly roll-ups of flow data summarized by
port and protocol from a special database created specifi-
cally for port analysis. This has made it possible to study
temporal correlations in port activity and identify ports
which are exhibiting substantial volumetric changes.
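
The kind of port analysis described above can be illustrated with a small sketch on synthetic data. The port columns, the injected surge, the use of Spearman correlation as the robust measure, and the 3x threshold are all assumptions made for this example; they are not the method or data used by the CERT/NetSA Team.

```r
set.seed(1)                                  # synthetic data, for illustration
hours <- 48
base  <- 1000 + 200 * sin(2 * pi * (1:hours) / 24)   # shared daily cycle
vol <- data.frame(
  p80  = base + rnorm(hours, sd = 20),
  p443 = 0.5 * base + rnorm(hours, sd = 20),
  p21  = 100 + rnorm(hours, sd = 5)
)
vol$p21[hours] <- 1000                       # injected surge in the last hour

# Rank-based (Spearman) correlation is less sensitive to single outliers.
port_cor <- cor(vol, method = "spearman")

# Flag ports whose latest hour exceeds 3x the median of the earlier hours.
ratio   <- sapply(vol, function(x) x[hours] / median(x[-hours]))
flagged <- names(ratio)[ratio > 3]
flagged
```

Comparing the latest value against a median of earlier values is only one simple way to express a "substantial volumetric change"; the threshold would need tuning on real traffic.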
5.5 Graphing Capability
One of the most important features of R is its ability to
create publication quality graphical displays. R has a
huge set of standard statistical graphs, stemplots,
boxplots, scatterplots, etc. Extension packages are
available for more advanced 3D plotting and
highly-specialized display types. The advantage for the
analyst running R in interactive mode is the ability to
make slight changes to the SiLK query and quickly visualize
those changes in a newly drawn graph. Given the flexibility
of its graphical facilities, R is also an ideal environment
for advanced analysts to perform visualization prototyping.

[Figure 1 panels: time series plot (log scale, 05/18/2005
12:00:00 to 05/18/2005 12:36:00), side-by-side boxplots (log
scale), and 3D scatterplot of time periods; series shown:
Records, Bytes, Packets]
Figure 1: Graphical output of rwcount.analyze()
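
As a rough illustration of the base graphics involved, the sketch below draws a log-scale time series panel and side-by-side boxplots for three synthetic series. It is not the plotting code behind Figure 1: the data are randomly generated, the output goes to a temporary PDF file, and the 3D scatterplot is omitted because it requires an extension package.

```r
set.seed(2)                                  # synthetic series, not real data
n <- 60
d <- data.frame(Records = rpois(n, 80),
                Packets = rpois(n, 600),
                Bytes   = rpois(n, 40000))

out <- tempfile(fileext = ".pdf")
pdf(out, width = 8, height = 4)              # draw to a file, not the screen
par(mfrow = c(1, 2))                         # two panels side by side

# Panel 1: the three series against bin index, on a log-scaled y axis.
matplot(seq_len(n), as.matrix(d), type = "l", lty = 1, col = 1:3, log = "y",
        xlab = "Time bin", ylab = "Count (log scale)")
legend("right", legend = names(d), col = 1:3, lty = 1, bty = "n")

# Panel 2: side-by-side boxplots on the same log scale.
boxplot(d, log = "y", ylab = "Count (log scale)")
dev.off()
```

In an interactive session the pdf() and dev.off() calls can be dropped so that the plots appear on screen and can be redrawn after each change to the query.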
6 R-SiLK wrapper function prototype: rwcount.analyze()
Our first proof-of-concept SiLK interface function is the
wrapper rwcount.analyze(), which calls the SiLK tool
rwcount. Details of this wrapper function are provided in
Table 2. The function has two input parameters, command and
plot. The parameter command is a text string which is
assigned a SiLK command line call to rwcount, which returns
binned time series of records, bytes, and packets. The other
input parameter, plot, determines whether a graphical
display will be generated at runtime. The default is
plot=TRUE. The visualization provided in our prototype
includes three plots: a time series plot, side-by-side
boxplots, and a 3D scatterplot of the data. Figure 1
provides an example of the graphical output generated by
rwcount.analyze().
When rwcount.analyze() is called, its output is as-
signed to a list object in R. The list it generates contains
five elements: data, command, stats, cor, and type. These
elements are defined in Table 2.
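
One way such a five-element list could be assembled from a data frame is sketched below. Only the element names (data, command, stats, cor, type) come from the paper; the helper name make_rwcount_result(), the choice of summary() and cor() for the statistics, and the value stored in type are assumptions.

```r
# Hypothetical reconstruction of the output list; only the element names
# (data, command, stats, cor, type) come from the paper.
make_rwcount_result <- function(data, command) {
  series <- data[, c("Records", "Bytes", "Packets")]
  list(
    data    = data,
    command = command,
    stats   = summary(series),   # per-series summary statistics
    cor     = cor(series),       # Pearson correlation matrix
    type    = "rwcount"          # assumed label; the actual value is unknown
  )
}

# Made-up values for illustration.
d <- data.frame(Records = c(10, 12, 9, 15),
                Bytes   = c(4000, 5200, 3500, 6100),
                Packets = c(50, 61, 44, 75))
obj <- make_rwcount_result(d, command = "<SiLK command line>")
names(obj)
```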
A sample R session using rwcount.analyze() to examine FTP
traffic is provided below. The parameter command is
assigned a SiLK command line. In our example, we specify
TCP traffic (--proto=6) directed at destination port 21
(--dport=21) for the hour between noon and 1 p.m. on May 18,
2005. Those specifications are provided to rwfilter via
switches, and the selected flows (in binary, packed format)
are piped into rwcount where we have specified a bin size of
thirty seconds (--bin-size=30). The output of rwcount
consists of the time series of bytes, records and packets
which are read into a data frame object in R. This data
frame is also an element in the output list object returned
by rwcount.analyze().
In this example, the output list returned by the function
is assigned to obj. The names of the list elements are
printed with the function names() and correspond to the
items in Table 2. As an example of automated analysis that
can be returned in a results object, the correlation matrix
of the series is found in obj$cor. This output shows that
bytes, records and packets are highly correlated with each
other (ρ > 0.99). Since obj$data is a data frame of the
three time series, we can print the records field by typing
obj$data$Records. This is one of the time series plotted in
Figure 1.
> obj <- rwcount.analyze(command=
    "rwrun rwfilter
       --start-date=2005/05/18:12:00:00
       --proto=6
       --dport=21
       --print-file
       --pass=stdout |
     rwcount
       --bin-size=30",
    plot=TRUE)
> names(obj)
[1] "data"    "command" "stats"   "cor"
[5] "type"
> obj$cor
        Records  Bytes Packets
Records  1.0000 0.9944  0.9951
Bytes    0.9944 1.0000  0.9964
Packets  0.9951 0.9964  1.0000
> obj$data$Records
                    Records
05/18/2005 12:00:00   76218
05/18/2005 12:05:00   73374
05/18/2005 12:10:00   55743
...

rwcount.analyze() details

Input Parameters
Parameter     Description
command       SiLK command line text string
plot          Logical value that determines whether R will
              perform runtime plotting

Output List Elements
List Element  Description
data          Data frame containing rwcount time series for
              Bytes, Records and Packets
command       Same as input parameter description
stats         Summary statistics for Bytes, Records and
              Packets
cor           Correlation matrix for Bytes, Records and
              Packets
type          Text string to indicate which wrapper function
              generated this object
Table 2: rwcount.analyze() function description
7 Analyst Benefits
One of the advantages of R is its potential for rapid anal-
ysis prototyping. A user can very quickly write functions
that generate a slew of experimental analysis results de-
scribing a host, a subnet, or traffic volumes. Each result
can be included in the function’s output list and evalu-
ated. Analysis results which prove useful can be quickly
integrated and become standard output elements.
In analytical work, the ability to label preliminary re-
sults objects provides the investigator with a facility for
generating an audit trail. In R, this labeling is performed
by the addition of object elements which describe the ob-
ject to either the analyst or other functions which will
operate on the object. By default, rwcount.analyze() re-
turns the elements type and command. The element type
can be used to describe the object to other functions.
For example, a generic graphing function (perhaps called
rw.visualize()) would read in an object and determine
how it should be displayed based upon its type. The ele-
ment command describes to the user how the object was
created by storing the SiLK command. Additional elements
can also be added to existing objects. For instance, a user
may wish to attach a comment (e.g. "Surge in host count
lasted for 6 hours") to an object by adding a text string
element.
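
The labeling idea can be sketched as follows. The paper only suggests the name rw.visualize(); the function body here is invented, and it returns a text description instead of drawing a plot so that the dispatch on type is easy to follow. The type value "rwcount" is likewise an assumption.

```r
# Hypothetical generic display function: chooses a display based on obj$type.
rw.visualize <- function(obj) {
  switch(obj$type,
         rwcount = sprintf("time series plot of %d bins", nrow(obj$data)),
         stop("no display defined for type: ", obj$type))
}

# Made-up results object for illustration.
obj <- list(data = data.frame(Records = c(10, 12, 9)),
            command = "<SiLK command line>",
            type = "rwcount")

# Attach an analyst comment as an additional list element.
obj$comment <- "Surge in host count lasted for 6 hours"

rw.visualize(obj)
```

Because the comment is just another list element, it is saved and restored together with the rest of the object.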
Since objects are preserved when the users save their
workspace in R, comparison with objects from future
analyses is very simple. Also, the user can graph objects
from a previous analysis side-by-side with new results.
We believe the experienced analyst will leverage the
enhanced command line experience, fast visualization
and rapid analysis prototyping. For analyses requiring
longer data pulls, R can also serve as an integrated script-
ing and analysis environment.
We envision a hierarchy of analysis functions. At the
lowest level would be functions like rwcount.analyze()
which use a SiLK command line call as a parameter. A
function at the next level of the hierarchy would allow a
user to specify criteria of interest via function parameters
(e.g. dport=80, proto=6). This function would both gen-
erate the necessary SiLK command line and submit it to
rwcount.analyze() for processing. Using these functions,
novice analysts unacquainted with the SiLK command
line would be able to perform real analysis tasks imme-
diately. These functions could also be used for learning
purposes since the SiLK command line needed for the
query is provided in the output object.
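
A function at the second level of this hierarchy might look like the sketch below. The name build_rwcount_command() and its parameters are hypothetical; the switches are limited to those that appear in the sample session, and no input validation or shell quoting is attempted.

```r
# Hypothetical higher-level helper: turn query criteria into a SiLK command
# line using only the switches shown in the sample session.
build_rwcount_command <- function(start_date, proto, dport, bin_size = 30) {
  paste(
    "rwrun rwfilter",
    paste0("--start-date=", start_date),
    paste0("--proto=", proto),
    paste0("--dport=", dport),
    "--print-file",
    "--pass=stdout",
    "|",
    "rwcount",
    paste0("--bin-size=", bin_size)
  )
}

cmd <- build_rwcount_command("2005/05/18:12:00:00", proto = 6, dport = 80)
cmd
# The resulting string could then be handed to the lower-level wrapper, e.g.
# obj <- rwcount.analyze(command = cmd, plot = TRUE)
```

Returning the command as a string keeps it available for storage in the output object, which supports the learning use described above.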
8 Future Work
Our wrapper function rwcount.analyze() is merely a
proof-of-concept prototype of an interface between R
and SiLK. Next steps include the development of addi-
tional wrapper functions, making further improvements
to rwcount.analyze(), and developing a generic visualiza-
tion scheme that reads the type field in an output object
to determine the appropriate display.
9 Conclusion
This paper has introduced the reader to R, demonstrat-
ing an overlap between its capabilities and the needs of
network security analysts. R provides a truly integrated
environment for data analysis and visualization. Further,
the ability to interface with SiLK flow analysis tools and
other data storage formats makes it an ideal environment
for enhancing and extending a network situational aware-
ness capability.
References
[1] CARRIE GATES, MICHAEL COLLINS, ET AL. More NetFlow tools:
For performance and security. In LISA XVIII (2004), pp. 121–131.
Notes
1. http://silktools.sourceforge.net/
2. We are using "categorical" here to describe string character data
(e.g. "male" versus "female").
How does cryptography work? by Jeroen Ooms
Ajay Ohri
 
Using R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsUsing R for Social Media and Sports Analytics
Using R for Social Media and Sports Analytics
Ajay Ohri
 
Kush stats alpha
Kush stats alpha Kush stats alpha
Kush stats alpha
Ajay Ohri
 
Analyze this
Analyze thisAnalyze this
Analyze this
Ajay Ohri
 