Insights into Customer Behavior from Clickstream Data
Ronald J. Nowling
Red Hat, Inc.
rnowling@redhat.com
http://rnowling.github.io/
Who Am I?
•  Software Engineer at Red Hat
•  Data Science Team, Emerging Technologies
–  Evaluate solutions in open-source Big Data space
–  Ensure software works for Red Hat customers
–  Promote data science internally through consulting projects
•  Apache Bigtop PMC
  
Clickstream Data
61 million page views
125,000 registered users
500,000 pages
125,000 knowledgebase articles
  
Potential Applications
•  Build customer profiles to aid sales teams
•  Recommendation system for knowledgebase
•  Improve customer portal search
•  Guide selection of new knowledgebase topics by content writers
  
Strip Formatting -> Clean Words -> Vectorize -> Cluster
What are the different types of kernel packages in Red Hat Enterprise Linux?
=============================================================
Issue
------
What are the different types of kernel packages in Red Hat Enterprise Linux?
Environment
---------------
Red Hat Enterprise Linux
Resolution
------------
Red Hat Enterprise Linux contains the following kernel packages:
  
Strip Formatting -> Clean Words -> Vectorize -> Cluster
What are the different types of kernel packages in Red Hat Enterprise Linux
Issue
What are the different types of kernel packages in Red Hat Enterprise Linux
Environment
Red Hat Enterprise Linux
Resolution
Red Hat Enterprise Linux contains the following kernel packages some may not apply to your architecture and not all are available in all major releases kernel contains the kernel and following key features
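A minimal sketch of the Strip Formatting step, assuming the article markup is limited to underlined headings and punctuation as in the example above. The function name `strip_formatting` and the regular expressions are illustrative assumptions, not the code used in the talk.

```python
import re


def strip_formatting(text: str) -> str:
    """Remove heading underlines and punctuation, then collapse whitespace."""
    # Drop lines made only of '=' or '-' characters (heading underlines).
    lines = [
        ln for ln in text.splitlines()
        if not re.fullmatch(r"\s*[=\-]{3,}\s*", ln)
    ]
    # Replace anything that is not a word character or whitespace with a space.
    no_punct = re.sub(r"[^\w\s]", " ", " ".join(lines))
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", no_punct).strip()


raw = """What are the different types of kernel packages in Red Hat
Enterprise Linux?
=============================================================
Issue
------
What are the different types of kernel packages in Red Hat
Enterprise Linux?"""

print(strip_formatting(raw))
```

The result is a single stream of words with no markup, which is the form the later cleaning steps work on.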
  
Strip Formatting -> Clean Words -> Vectorize -> Cluster
What are the different type of kernel package in Red Hat Enterprise Linux
Issue
What are the different type of kernel package in Red Hat Enterprise Linux
Environment
Red Hat Enterprise Linux
Resolution
Red Hat Enterprise Linux contain the follow kernel package some may not apply to your architecture and not all are available in all major release kernel contain the kernel and follow key feature
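The word forms above look like the output of a stemming pass. The sketch below is a toy suffix stripper, assuming only plural "s" and "-ing" endings need handling; `toy_stem` and `stem_text` are hypothetical names, and a real pipeline would more likely use an established stemmer such as Porter's.

```python
def toy_stem(word: str) -> str:
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    lower = word.lower()
    # "following" -> "follow"; the length guard leaves short words like "thing" alone.
    if len(lower) > 5 and lower.endswith("ing"):
        return word[:-3]
    # "packages" -> "package"; skip endings such as "ss", "us", "is".
    if len(lower) > 3 and lower.endswith("s") and not lower.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def stem_text(text: str) -> str:
    """Apply toy_stem to every whitespace-separated word."""
    return " ".join(toy_stem(w) for w in text.split())


print(stem_text("Red Hat Enterprise Linux contains the following kernel packages"))
```

On this sentence the sketch yields the same word forms as the slide, but it is far cruder than a real stemmer and will mangle many other words.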
  
Strip Formatting -> Clean Words -> Vectorize -> Cluster
different type kernel package Red Hat Enterprise Linux
Issue
different type kernel package Red Hat Enterprise Linux
Environment
Red Hat Enterprise Linux
Resolution
Red Hat Enterprise Linux contain kernel package apply architecture available major release kernel contain kernel follow key feature
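The text above has lost its common function words. A small sketch of that kind of stop-word removal follows; the `STOP_WORDS` set is an assumption chosen to cover this example, not the list used in the talk.

```python
# Hypothetical stop-word list, just large enough for the example sentence.
STOP_WORDS = {
    "what", "are", "the", "of", "in", "some", "may",
    "not", "to", "your", "and", "all",
}


def remove_stop_words(text: str, stop_words=STOP_WORDS) -> str:
    """Drop every word whose lowercase form is in the stop-word set."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


title = "What are the different type of kernel package in Red Hat Enterprise Linux"
print(remove_stop_words(title))
```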
  
Strip Formatting -> Clean Words -> Vectorize -> Cluster
kernel: 5
red: 4
hat: 4
enterprise: 4
linux: 4
package: 3
contain: 3
different: 2
type: 2
intel: 2
environment: 1
resolution: 1
follow: 1
system: 1
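Term counts like these can be turned into a fixed-length vector once a vocabulary is chosen. The sketch below assumes a simple bag-of-words representation with raw counts; `count_terms` and `to_vector` are hypothetical helper names, and the sample text is made up.

```python
from collections import Counter


def count_terms(text: str) -> Counter:
    """Count lowercase, whitespace-separated terms."""
    return Counter(text.lower().split())


def to_vector(counts, vocabulary):
    """Lay the counts out in vocabulary order, using 0 for absent terms."""
    return [counts.get(term, 0) for term in vocabulary]


doc = "kernel package Red Hat Enterprise Linux kernel contain kernel"
counts = count_terms(doc)
vocabulary = sorted(counts)
vector = to_vector(counts, vocabulary)

print(vocabulary)
print(vector)
```

In a full pipeline the vocabulary would be built over all articles so that every article maps into the same vector space.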
  
Strip Formatting -> Clean Words -> Vectorize -> Cluster
kernel: 5
red: 4
hat: 4
enterprise: 4
linux: 4
package: 3
contain: 3
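The shorter list above is what remains when low-count terms are dropped. A sketch of such a filter follows; the threshold of 3 is an assumption picked because it reproduces the slide, and `filter_rare_terms` is a hypothetical name.

```python
def filter_rare_terms(counts, min_count=3):
    """Keep only the terms that occur at least min_count times."""
    return {term: n for term, n in counts.items() if n >= min_count}


# Counts copied from the previous slide.
slide_counts = {
    "kernel": 5, "red": 4, "hat": 4, "enterprise": 4, "linux": 4,
    "package": 3, "contain": 3, "different": 2, "type": 2, "intel": 2,
    "environment": 1, "resolution": 1, "follow": 1, "system": 1,
}

print(filter_rare_terms(slide_counts))
```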
  
Topics
openshift gear cartridge online node broker
vm rhev virtualization disk
glusterfs storage volume brick rhs glusterd node client mount geo
rhel support driver hp hardware version firmware card intel
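One way to produce term lists like the ones above, assuming each cluster is summarized by a centroid vector of term weights over a shared vocabulary, is to rank the terms by weight. The vocabulary, the weights, and the name `top_terms` below are made up for illustration.

```python
def top_terms(centroid, vocabulary, n=3):
    """Return the n vocabulary terms with the largest centroid weights."""
    ranked = sorted(zip(centroid, vocabulary), key=lambda pair: pair[0], reverse=True)
    return [term for _weight, term in ranked[:n]]


# Hypothetical vocabulary and centroid weights (not data from the talk).
vocabulary = ["brick", "broker", "gear", "glusterfs", "openshift", "volume"]
centroids = [
    [0.0, 0.6, 0.8, 0.0, 0.9, 0.1],
    [0.7, 0.0, 0.0, 0.9, 0.1, 0.8],
]

for centroid in centroids:
    print(" ".join(top_terms(centroid, vocabulary)))
```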
  
Topic Article Counts
  
Clickstream Processing
[Diagram: several Raw Daily Page Views inputs, each passed through Parse and then Clean & Filter; the remaining boxes are labeled Project onto Topics, Accounts, Aggregate, and Topic View Counts]
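A rough sketch of how the boxes in the Clickstream Processing diagram might fit together, assuming each page view carries an account id and a URL and that projecting onto topics means looking up the cluster of the viewed article. The two-column record layout, the `ARTICLE_TOPIC` mapping, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from knowledgebase article URL to topic (cluster) label.
ARTICLE_TOPIC = {"/kb/1001": "gluster", "/kb/1002": "openshift", "/kb/1003": "gluster"}


def parse(line):
    """Split one raw page-view line into (account_id, url), or None if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        return None
    return fields[0], fields[1]


def clean_and_filter(records):
    """Keep well-formed views of known articles from identified accounts."""
    for record in records:
        if record is None:
            continue
        account, url = record
        if account and url in ARTICLE_TOPIC:
            yield account, url


def topic_view_counts(daily_files):
    """Project each view onto its topic and aggregate the counts per account."""
    totals = defaultdict(Counter)
    for lines in daily_files:
        for account, url in clean_and_filter(parse(ln) for ln in lines):
            totals[account][ARTICLE_TOPIC[url]] += 1
    return totals


day1 = ["acct1\t/kb/1001\n", "acct2\t/kb/1002\n", "bad line\n"]
day2 = ["acct1\t/kb/1003\n", "\t/kb/1001\n", "acct1\t/home\n"]
counts = topic_view_counts([day1, day2])
print({account: dict(topics) for account, topics in counts.items()})
```

Each daily file is parsed and filtered independently, so this shape maps naturally onto a parallel framework where the per-day work is distributed and only the final aggregation is shared.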
  
Customer Profiles
•  Dominant topics
– JBoss
– Red Hat Enterprise Virtualization
– Hardware support
– Gluster
– Booting into rescue mode
– Packages
  
Customer Profiles
•  Supporting topics
– Logging
– LDAP
– Samba
– High resource usage
– File systems / LVM / block devices
– Networking
  
Customer Profiles
•  JBoss and RHEV appear in combination with a number of other products
•  Some products only appear by themselves with supporting topics (logging, networking, filesystems)
– OpenShift
– Gluster
  
Topic Enrichments
  
Malformed TSV Files
•  Gzip files need to be read sequentially
•  Tab-separated, no quoting (in theory!)
•  Escaped tabs and newlines within records
•  E.g., \n or \t
•  Improperly escaped tabs and newlines
•  E.g., t vs t
•  Extraneous unmatched quote marks
•  E.g., ‘some_user
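A minimal sketch of reading such files, assuming backslash escapes for tabs and newlines and treating quote marks as ordinary data. Records with an unexpected number of fields are skipped here, which is only one possible policy; the function names and the two-field layout are assumptions.

```python
import gzip
import io
import re

_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\"}


def unescape(field):
    """Turn backslash-escaped n, t, and backslash back into the real characters."""
    return re.sub(r"\\([nt\\])", lambda m: _ESCAPES[m.group(1)], field)


def parse_record(line, expected_fields):
    """Split on tabs with no quote handling; return None if the field count is off."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != expected_fields:
        return None  # likely an improperly escaped tab or newline
    return [unescape(f) for f in fields]


def read_records(fileobj, expected_fields):
    """Stream a gzip-compressed TSV sequentially, yielding only parseable records."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            record = parse_record(line, expected_fields)
            if record is not None:
                yield record


# Three sample lines: an escaped tab, an unmatched quote, and a stray raw tab.
raw = "alice\thello\\tworld\n'some_user\tline one\\nline two\nbob\tbroken\tfield\n"
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))
records = list(read_records(buffer, expected_fields=2))
print(records)
```

Because the unmatched quote is kept as plain data, it cannot swallow the following fields the way it would in a quote-aware CSV reader.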
  
Lessons Learned
•  Consider custom Hadoop input formats for tricky file formats
•  Verify everything – what works in general may not work for you
– Stemming
– Filtering most frequent words
– K-Means vs LDA
  
Lessons Learned
•  K-Means
– Improve accuracy: Multiple runs, more iterations
•  Watch out for memory leaks
– Un-persist cached RDDs
– Un-persist broadcasted variables
•  Parquet for performance
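The K-Means advice above can be illustrated with a small sketch: run the algorithm several times from different random starts and keep the run with the lowest cost. This is a plain-Python stand-in, not the Spark code behind the talk, and the function names, sample points, and default parameters are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations, rng):
    """One K-Means run from a random initialization; returns (centroids, cost)."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    cost = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, cost


def best_of_runs(points, k, runs=10, iterations=20, seed=0):
    """Repeat K-Means with different random starts and keep the lowest-cost run."""
    rng = random.Random(seed)
    return min((kmeans(points, k, iterations, rng) for _ in range(runs)), key=lambda r: r[1])


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
centroids, cost = best_of_runs(points, k=2)
print(centroids, cost)
```

K-Means only converges to a local optimum that depends on the starting centroids, so taking the best of several runs can never do worse than the first run alone.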
  
Potential Applications
•  Build customer profiles to aid sales teams
•  Recommendation system for knowledgebase
•  Improve customer portal search
•  Guide selection of new knowledgebase topics for content writers
  
Resources
http://rnowling.github.io/
  
QUESTIONS
