Featurizing log data before XGBoost
Xavier Conort
Thursday, August 20, 2015 @
The competition host
● XuetangX, a Chinese MOOC learning platform initiated
by Tsinghua University
● launched online on Oct 10th, 2013
● more than 100 Chinese courses and over 260
international courses
● high dropout rate
Problem to solve
● challenge: predict whether a user will drop out of a course
within the next 10 days based on his or her prior activities.
● data:
○ enrollment_train (120K rows) / enrollment_test (80K rows):
■ Columns: enrollment_id, username, course_id
○ log_train / log_test
■ Columns: enrollment_id, time, source, event, object
○ object
■ Columns: course_id, module_id, category, children, start
○ truth_train
■ Columns: enrollment_id, dropped_out
Log data: 5890 objects
Team
Chief Product Officer Chief Data Scientist
Data Scientist Data Scientist
(O. Zhang)
How we worked as a Team
● worked separately on feature engineering. 90% of
our time was spent here.
● delegated Modeling part to DataRobot to:
○ find the best algorithm (with XGBoost as the winner!)
○ model text features
○ tune hyperparameters
○ experiment with different feature sets and blend 8 XGBoost
models built on different sets
○ communicate results
Feature engineering techniques used
● counts
● time statistics (min, mean, max, diff)
● entropy
● sequences treated as text on which we ran
○ SVD on 3-grams
○ DataRobot text mining solution
● first 20 components of SVD on the user x object matrix
NB: removed duplicated log info and used training + test
sets to build most features
How to build efficient features in R
Key course features
● course_id
● first log time
● enrollment counts
● unique log counts
● mean time interval
Key enrollment count features
● log counts
● unique log counts
● ratio of unique log counts to log counts
● unique log counts by event (nagivate, access,
problem, video, page_close, discussion, wiki)
● unique log counts before end of course (5 days, 10
days and 30 days before)
● sequence number of enrollment in that course
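The count features above can be sketched in a few lines. This is a minimal illustration in Python, not the team's code (the deck refers to R). The sample rows and the name count_features are hypothetical, and "unique" is assumed to mean distinct (time, source, event, object) rows within an enrollment.

```python
from collections import Counter, defaultdict

# Hypothetical log rows: (enrollment_id, time, source, event, object)
logs = [
    (1, "2014-06-01T10:00:00", "browser", "access", "obj_a"),
    (1, "2014-06-01T10:00:00", "browser", "access", "obj_a"),  # exact duplicate
    (1, "2014-06-01T10:05:00", "server", "problem", "obj_b"),
    (2, "2014-06-02T09:00:00", "browser", "video", "obj_c"),
]


def count_features(rows):
    """Per enrollment: log count, unique log count, their ratio, unique counts by event."""
    total = Counter()
    unique_rows = defaultdict(set)
    for row in rows:
        total[row[0]] += 1
        unique_rows[row[0]].add(row)  # a set drops exact duplicates

    features = {}
    for eid, n in total.items():
        uniq = unique_rows[eid]
        features[eid] = {
            "log_count": n,
            "unique_log_count": len(uniq),
            "unique_ratio": len(uniq) / n,
            "unique_by_event": dict(Counter(r[3] for r in uniq)),
        }
    return features


features = count_features(logs)
```

Here enrollment 1 has 3 logs of which 2 are unique, so its ratio is 2/3.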
Key enrollment time stats
● log time stats (min, mean, max)
● gap between first and last log of enrollment
● gap between enrollment first log and course first log
● gap between enrollment last log and course last log
● difference between mean log time and mid point
between first and last log
● log interval stats (mean; 90th, 99th and 100th percentiles)
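A rough Python sketch of the per-enrollment time stats listed above. The function names and the sample timestamps are hypothetical, times are assumed to be ISO strings in UTC, and the percentile uses the nearest-rank definition, which may differ from the interpolation the team's tooling applied. The course-level gaps are omitted because they need the course's own first and last log.

```python
import math
from datetime import datetime, timezone
from statistics import mean


def percentile(sorted_vals, q):
    """Nearest-rank percentile; other tools may interpolate instead."""
    k = max(0, math.ceil(q / 100 * len(sorted_vals)) - 1)
    return sorted_vals[k]


def time_stats(timestamps):
    """Summarise the log times of one enrollment, in seconds since the epoch."""
    ts = sorted(
        datetime.fromisoformat(t).replace(tzinfo=timezone.utc).timestamp()
        for t in timestamps
    )
    # Intervals between consecutive logs
    gaps = sorted(b - a for a, b in zip(ts, ts[1:]))
    stats = {
        "min": ts[0],
        "mean": mean(ts),
        "max": ts[-1],
        "first_last_gap": ts[-1] - ts[0],
        # Negative when activity is concentrated early in the enrollment
        "mean_minus_midpoint": mean(ts) - (ts[0] + ts[-1]) / 2,
    }
    if gaps:
        stats["interval_mean"] = mean(gaps)
        for q in (90, 99, 100):
            stats[f"interval_p{q}"] = percentile(gaps, q)
    return stats


example = time_stats([
    "2014-06-01T10:00:00",
    "2014-06-01T10:10:00",
    "2014-06-01T10:40:00",
])
```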
Enrollment entropy features
enrollment entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
● hours of the day for the last 1/3/7 days before the last
log
● object (when event == problem)
● chapter ids
Example of entropy feature
For each weekday of an enrollment, compute
- log(weekday_log_count / enrollment_log_count) * (weekday_log_count / enrollment_log_count)
Sum over weekdays => weekday_entropy[enrollment_id==1] = 1.589988
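The entropy formula above, written out as a small Python sketch. The natural logarithm is assumed, the function names are hypothetical, and the sample timestamps are made up, so the slide's value of 1.589988 is not reproduced here.

```python
import math
from collections import Counter
from datetime import datetime


def entropy(counts):
    """Shannon entropy (natural log) of a list of category counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def weekday_entropy(timestamps):
    """Entropy of one enrollment's logs over the weekdays on which they occur."""
    weekday_counts = Counter(datetime.fromisoformat(t).weekday() for t in timestamps)
    return entropy(weekday_counts.values())


# Two logs on a Sunday and two on a Monday: evenly split over 2 weekdays
value = weekday_entropy([
    "2014-06-01T10:00:00",
    "2014-06-01T11:00:00",
    "2014-06-02T09:00:00",
    "2014-06-02T09:30:00",
])
```

An enrollment whose logs all fall on one weekday gets entropy 0, and one spread evenly over k weekdays gets log(k).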
Enrollment sequence features
● for each enrollment_id, built sequences of
○ weekdays
○ objects
■ all objects / 'problem' and 'video' objects only
○ events
● treated sequences as 4 text variables. Ran for each
○ SVD on 3-grams => first 10 components
○ DataRobot stacked predictions from logistic regression
& Nystroem SVM on (tuned) n-grams
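A minimal Python sketch of the sequence step: order each enrollment's events by time, then count 3-grams as if the sequence were a sentence. The names build_sequences and ngram_counts and the sample rows are hypothetical. The SVD itself is not shown; under this reading, the counts would form an enrollment x 3-gram matrix whose leading components become features.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (enrollment_id, time, event)
rows = [
    (1, "2014-06-01T10:04:00", "video"),
    (1, "2014-06-01T10:00:00", "video"),
    (1, "2014-06-01T10:01:00", "problem"),
    (1, "2014-06-01T10:02:00", "video"),
    (1, "2014-06-01T10:03:00", "problem"),
    (2, "2014-06-02T09:00:00", "access"),
]


def build_sequences(rows):
    """Return {enrollment_id: [event, ...]} with events in time order."""
    by_enrollment = defaultdict(list)
    for eid, _time, event in sorted(rows, key=lambda r: (r[0], r[1])):
        by_enrollment[eid].append(event)
    return dict(by_enrollment)


def ngram_counts(tokens, n=3):
    """Count the n-grams of a token sequence, each joined into one string."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


sequences = build_sequences(rows)
trigrams = {eid: ngram_counts(seq) for eid, seq in sequences.items()}
```

Sequences shorter than 3 tokens, such as enrollment 2 here, yield no 3-grams.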
Extract of enrollment object sequences
1/2-grams from Object sequences
DataRobot on Object 1-2 grams
Key user count features and time stats
● enrollment count
● binary indicator whether user signed up for each of
the 38 courses
● unique log count
● mean log time interval
● sequence number of enrollment for that user
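A Python sketch of three of the user-level features above: enrollment count, sequence number of the enrollment for that user, and the binary course indicators. The sample data and the name user_features are hypothetical, and ordering enrollments by their first log time is an assumption; the deck does not say how the sequence number was ordered.

```python
from collections import defaultdict

# Hypothetical enrollment rows: (enrollment_id, username, course_id)
enrollments = [(1, "u1", "c1"), (2, "u1", "c2"), (3, "u2", "c1")]
# Hypothetical first log time of each enrollment
first_log = {
    1: "2014-06-01T10:00:00",
    2: "2014-05-20T08:00:00",
    3: "2014-06-03T12:00:00",
}
courses = ["c1", "c2", "c3"]  # the real data has 38 courses


def user_features(enrollments, first_log, courses):
    """Attach user-level count features to each enrollment_id."""
    by_user = defaultdict(list)
    for eid, user, course in enrollments:
        by_user[user].append((first_log[eid], eid, course))

    features = {}
    for items in by_user.values():
        items.sort()  # assumed order: earliest first log comes first
        signed_up = {course for _, _, course in items}
        for rank, (_, eid, _) in enumerate(items, start=1):
            features[eid] = {
                "user_enrollment_count": len(items),
                "user_enrollment_seq": rank,
                "course_indicators": [int(c in signed_up) for c in courses],
            }
    return features


features = user_features(enrollments, first_log, courses)
```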
User entropy features
user entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
User sequence features
● for each user, built sequences of
○ weekdays
○ chapter_ids
○ events
● treated them as 3 text variables. Ran
○ SVD on 3-grams => first 10 components
○ DataRobot stacked predictions from logistic regression
+ Nystroem SVM on (tuned) n-grams
How we got to the TOP3
● entropy features mentioned before
● exploited info in
○ log count in the 5 / 10 / 20 days after end of course
○ log counts by event, sign_up counts and day entropy in the next
10 days after end of course
○ time to sign up for new course
○ time until the next log for the same user
added ~0.001 to AUC (vs less powerful features)
added ~0.002 to AUC
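A hedged Python sketch of the "after end of course" features: count a user's logs (across all of their courses) in the 5 / 10 / 20 days after a course ends, and measure the time until their next log. The function name and sample dates are hypothetical, and measuring the wait from the course end date is an assumption; the deck does not state the reference point.

```python
from datetime import datetime, timedelta


def after_course_features(user_log_times, course_end, windows=(5, 10, 20)):
    """user_log_times: ISO timestamps of all logs of one user, over every course."""
    end = datetime.fromisoformat(course_end)
    later = sorted(
        t for t in map(datetime.fromisoformat, user_log_times) if t > end
    )
    feats = {
        f"logs_within_{d}d_after_end": sum(t <= end + timedelta(days=d) for t in later)
        for d in windows
    }
    # None when the user never logs again after the course ends
    feats["days_until_next_log"] = (
        (later[0] - end).total_seconds() / 86400 if later else None
    )
    return feats


feats = after_course_features(
    [
        "2014-06-20T00:00:00",  # before the course ends, ignored
        "2014-07-02T00:00:00",  # 2 days after
        "2014-07-08T00:00:00",  # 8 days after
        "2014-07-25T00:00:00",  # 25 days after
    ],
    course_end="2014-06-30T00:00:00",
)
```

These features look past the prediction date, so they only exist when test-period logs from other courses are available, as they were in this competition.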
XGBoost
Thank you!

European Accessibility Act & Integrated Accessibility Testing
Julia Undeutsch
 
Splunk Leadership Forum Wien - 20.05.2025
Splunk Leadership Forum Wien - 20.05.2025Splunk Leadership Forum Wien - 20.05.2025
Splunk Leadership Forum Wien - 20.05.2025
Splunk
 

Featurizing log data before XGBoost

  • 1. Featurizing log data before XGBoost Xavier Conort Thursday, August 20, 2015 @
  • 2. The competition host
    ● XuetangX, a Chinese MOOC learning platform initiated by Tsinghua University
    ● launched online on Oct 10th, 2013
    ● more than 100 Chinese courses and over 260 international courses
    ● high dropout rate
  • 3. Problem to solve
    ● challenge: predict whether a user will drop a course within the next 10 days based on his or her prior activities
    ● data:
      ○ enrollment_train (120K rows) / enrollment_test (80K rows)
        ■ Columns: enrollment_id, username, course_id
      ○ log_train / log_test
        ■ Columns: enrollment_id, time, source, event, object
      ○ object
        ■ Columns: course_id, module_id, category, children, start
      ○ truth_train
        ■ Columns: enrollment_id, dropped_out
  • 5. Team: Chief Product Officer, Chief Data Scientist, Data Scientist, Data Scientist (O. Zhang)
  • 6. How we worked as a Team
    ● worked separately on feature engineering; 90% of our time was spent here
    ● delegated the modeling part to DataRobot to:
      ○ find the best algorithm (with XGBoost as the winner!)
      ○ model text features
      ○ tune hyperparameters
      ○ experiment with different feature sets and blend 8 XGBoost models built on different sets
      ○ communicate results
  • 7. Feature engineering techniques used
    ● counts
    ● time statistics (min, mean, max, diff)
    ● entropy
    ● sequences treated as text, on which we ran
      ○ SVD on 3-grams
      ○ DataRobot text mining solution
    ● first 20 components of SVD on the user x object matrix
    NB: removed duplicated log info and used training + test sets to build most features
  • 8. How to build efficient features in R
  • 9. Key course features
    ● course_id
    ● first log time
    ● enrollment counts
    ● unique log counts
    ● mean time interval
  • 10. Key enrollment count features
    ● log counts
    ● unique log counts
    ● ratio of unique log counts to log counts
    ● unique log counts by event (nagivate, access, problem, video, page_close, discussion, wiki)
    ● unique log counts before end of course (5 days, 10 days and 30 days before)
    ● sequence number of enrollment in that course
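
The count features on slide 10 can be sketched in a few lines. This is a minimal illustration in Python rather than the team's R code, which is not shown in the deck. The toy log rows, the function name `enrollment_count_features` and the output keys are assumptions made for the example; only the column names (enrollment_id, time, event, object) come from the slides.

```python
from collections import Counter, defaultdict

# Toy log rows (made up): (enrollment_id, time, event, object)
logs = [
    (1, "2014-06-01T09:00:00", "access", "o1"),
    (1, "2014-06-01T09:00:00", "access", "o1"),  # exact duplicate row
    (1, "2014-06-01T09:05:00", "video", "o2"),
    (1, "2014-06-02T09:00:00", "problem", "o3"),
    (2, "2014-06-03T10:00:00", "video", "o2"),
]


def enrollment_count_features(logs):
    """Hypothetical helper: per-enrollment raw and unique log counts."""
    by_enrollment = defaultdict(list)
    for row in logs:
        by_enrollment[row[0]].append(row)

    features = {}
    for eid, rows in by_enrollment.items():
        unique_rows = set(rows)  # drops exact duplicate log rows
        by_event = Counter(event for _, _, event, _ in unique_rows)
        features[eid] = {
            "log_count": len(rows),
            "unique_log_count": len(unique_rows),
            "unique_ratio": len(unique_rows) / len(rows),
            "unique_by_event": dict(by_event),
        }
    return features


features = enrollment_count_features(logs)
```

Treating a "unique log" as a distinct (enrollment_id, time, event, object) row is one possible reading of the slide; the deck does not define it.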
  • 11. Key enrollment time stats
    ● log time stats (min, mean, max)
    ● gap between first and last log of enrollment
    ● gap between enrollment first log and course first log
    ● gap between enrollment last log and course last log
    ● difference between mean log time and midpoint between first and last log
    ● log interval stats (mean; 90th, 99th and 100th percentiles)
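
A rough sketch of the per-enrollment time statistics on slide 11, written in Python with the standard library only. The function names, the output keys and the linear-interpolation quantile are assumptions for illustration; the deck does not say how the quantiles were computed. Timestamps are assumed to be ISO-8601 strings.

```python
from datetime import datetime


def quantile(sorted_vals, q):
    """Quantile by linear interpolation between order statistics (an assumed choice)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def enrollment_time_stats(times):
    """Hypothetical helper: time statistics for one enrollment's log timestamps."""
    ts = sorted(datetime.fromisoformat(t) for t in times)
    # Seconds elapsed since the enrollment's first log
    offsets = [(t - ts[0]).total_seconds() for t in ts]
    # Intervals between consecutive logs, sorted for the quantile helper
    gaps = sorted(b - a for a, b in zip(offsets, offsets[1:]))
    stats = {
        "first_log": ts[0],
        "last_log": ts[-1],
        "span_s": offsets[-1],
        # Mean log time minus the midpoint between first and last log
        "mean_minus_midpoint_s": sum(offsets) / len(offsets) - offsets[-1] / 2,
    }
    if gaps:
        stats["interval_mean_s"] = sum(gaps) / len(gaps)
        stats["interval_q90_s"] = quantile(gaps, 0.90)
        stats["interval_q99_s"] = quantile(gaps, 0.99)
        stats["interval_max_s"] = quantile(gaps, 1.0)
    return stats


# Toy timestamps (made up) for a single enrollment
stats = enrollment_time_stats([
    "2014-06-01T09:00:00",
    "2014-06-01T09:10:00",
    "2014-06-01T09:40:00",
    "2014-06-01T11:40:00",
])
```

A negative `mean_minus_midpoint_s` means activity was concentrated early in the enrollment's active period, which is presumably why the slide lists that difference as a feature.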
  • 12. Enrollment entropy features
    enrollment entropy over
    ● days
    ● weekdays
    ● fraction (4) of weekdays
    ● hours of the day
    ● hours of the day for the last 1/3/7 days before last logs
    ● object (when event == problem)
    ● chapter ids
  • 13. Example of entropy feature
    per weekday: - log(weekday_log_count / enrollment_log_count) * weekday_log_count / enrollment_log_count
    Sum over weekdays => weekday_entropy[enrollment_id==1] = 1.589988
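
The weekday entropy on slide 13 is the Shannon entropy of an enrollment's log counts across weekdays. Below is a small Python sketch of that formula. The function names and the toy timestamps are made up, and the natural logarithm is an assumption (it is what R's `log()` uses by default). The example does not attempt to reproduce the 1.589988 value from the slide, since the underlying counts are not given.

```python
import math
from collections import Counter
from datetime import datetime


def entropy(counts):
    """Shannon entropy (natural log) of a collection of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)


def weekday_entropy(times):
    """Hypothetical helper: entropy of one enrollment's logs over weekdays."""
    per_weekday = Counter(datetime.fromisoformat(t).weekday() for t in times)
    return entropy(per_weekday.values())


# Toy timestamps (made up): two logs on each of two consecutive days
spread = weekday_entropy([
    "2014-06-02T09:00:00",
    "2014-06-02T10:00:00",
    "2014-06-03T09:00:00",
    "2014-06-03T11:00:00",
])
# All logs on one day give zero entropy
concentrated = weekday_entropy(["2014-06-02T09:00:00", "2014-06-02T10:00:00"])
```

The same `entropy` helper applies to the other groupings listed on slide 12 (days, hours of the day, objects, chapter ids) by changing the key that is counted.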
  • 14. Enrollment sequence features
    ● for each enrollment_id, built sequences of
      ○ weekdays
      ○ objects
        ■ all objects / 'problem' and 'video' objects only
      ○ events
    ● treated sequences as 4 text variables. Ran for each
      ○ SVD on 3-grams => first 10 components
      ○ DataRobot stacked predictions from logistic regression & Nystroem SVM on (tuned) n-grams
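
To make "sequences treated as text" concrete, here is a minimal Python sketch that builds one event sequence per enrollment and counts its 3-grams. The function names and toy rows are assumptions for illustration. The SVD step and the DataRobot text models from the slide are not reproduced here; they would take the resulting enrollment-by-n-gram count matrix as input.

```python
from collections import Counter, defaultdict

# Toy log rows (made up): (enrollment_id, time, event)
logs = [
    (1, "2014-06-01T09:00:00", "access"),
    (1, "2014-06-01T09:05:00", "video"),
    (1, "2014-06-01T09:20:00", "problem"),
    (1, "2014-06-02T08:00:00", "video"),
    (1, "2014-06-02T08:30:00", "problem"),
    (2, "2014-06-03T10:00:00", "video"),
]


def event_sequences(logs):
    """Hypothetical helper: one space-separated 'document' of events per enrollment."""
    by_enrollment = defaultdict(list)
    # ISO-8601 strings sort chronologically, so sorting on (id, time) orders the events
    for eid, _, event in sorted(logs, key=lambda r: (r[0], r[1])):
        by_enrollment[eid].append(event)
    return {eid: " ".join(events) for eid, events in by_enrollment.items()}


def ngram_counts(text, n=3):
    """Count word n-grams in a space-separated sequence."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


sequences = event_sequences(logs)
trigrams = {eid: ngram_counts(seq) for eid, seq in sequences.items()}
```

Sequences shorter than n tokens simply yield no n-grams, so such enrollments end up with an all-zero row in the count matrix.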
  • 15. Extract of enrollment object sequences
  • 17. DataRobot on Object 1-2 grams
  • 18. Key user count features and time stats
    ● enrollment count
    ● binary indicator whether user signed up for each of the 38 courses
    ● unique log count
    ● mean log time interval
    ● sequence number of enrollment for that user
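
Some of the user-level features on slide 18 could look like the sketch below, which attaches the user's enrollment count, a per-course sign-up indicator and the enrollment's sequence number to each enrollment_id. Everything here is illustrative: the function name, the toy data and, in particular, ordering a user's enrollments by first log time are assumptions, since the deck does not say how the sequence number was defined.

```python
from collections import defaultdict

# Toy data (made up): (enrollment_id, username, course_id)
enrollments = [(1, "u1", "c1"), (2, "u2", "c1"), (3, "u1", "c2")]
# Toy first log time per enrollment (made up), as ISO-8601 strings
first_log = {
    1: "2014-06-01T09:00:00",
    2: "2014-06-03T10:00:00",
    3: "2014-05-20T08:00:00",
}


def user_features(enrollments, first_log, courses):
    """Hypothetical helper: user-level features keyed by enrollment_id."""
    by_user = defaultdict(list)
    for eid, user, course in enrollments:
        by_user[user].append((first_log[eid], eid, course))

    out = {}
    for rows in by_user.values():
        rows.sort()  # assumed ordering: by first log time of each enrollment
        signed = {course for _, _, course in rows}
        for seq, (_, eid, _) in enumerate(rows, start=1):
            out[eid] = {
                "user_enrollment_count": len(rows),
                "user_enrollment_seq": seq,
                # One 0/1 indicator per course, in the order given by `courses`
                "signed_up": [int(c in signed) for c in courses],
            }
    return out


features = user_features(enrollments, first_log, courses=["c1", "c2"])
```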
  • 19. User entropy features
    user entropy over
    ● days
    ● weekdays
    ● fraction (4) of weekdays
    ● hours of the day
  • 20. User sequence features
    ● for each user, built sequences of
      ○ weekdays
      ○ chapter_ids
      ○ events
    ● treated them as 3 text variables. Ran
      ○ SVD on 3-grams => first 10 components
      ○ DataRobot stacked predictions from logistic regression + Nystroem SVM on (tuned) n-grams
  • 21. How we got to the TOP3
    ● entropy features mentioned before
    ● exploited info in
      ○ log count in the 5 / 10 / 20 days after end of course
      ○ log counts by event, sign_up counts and day entropy in the next 10 days after end of course
      ○ time to sign up for new course
      ○ time until the next log for same user
    Slide callouts, in order: added ~0.001 to AUC (vs less powerful features); added ~0.002 to AUC
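
Two of the "after end of course" features on slide 21 can be sketched as follows. This is an illustration under stated assumptions, not the team's code: the function names are hypothetical, the inputs are taken to be all of a user's log times across courses, and the window is assumed to be half-open on the left (logs strictly after the course end, up to and including the end of the window).

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def seconds_to_next_log(user_log_times, enrollment_last_log):
    """Seconds from an enrollment's last log to the same user's next log, or None."""
    ts = sorted(user_log_times)
    i = bisect_right(ts, enrollment_last_log)  # first log strictly after the last log
    if i == len(ts):
        return None
    return (ts[i] - enrollment_last_log).total_seconds()


def log_count_after(user_log_times, course_end, days):
    """Number of the user's logs in the `days` days following the end of the course."""
    window_end = course_end + timedelta(days=days)
    return sum(1 for t in user_log_times if course_end < t <= window_end)


# Toy log times (made up) for one user across all of their courses
user_times = [
    datetime(2014, 6, 1, 9),
    datetime(2014, 6, 5, 9),
    datetime(2014, 6, 12, 9),
    datetime(2014, 6, 20, 9),
]
course_end = datetime(2014, 6, 10)

next_gap = seconds_to_next_log(user_times, datetime(2014, 6, 5, 9))
counts = {d: log_count_after(user_times, course_end, d) for d in (5, 10, 20)}
```

These features look past the end of the course, so they are only available because the competition's log data covered that period for other courses taken by the same user.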