Deep learning and applications in non-cognitive domains II

DEEP LEARNING &
APPLICATIONS IN
NON-COGNITIVE DOMAINS
PART II: PRACTICE
5/12/16 1
Truyen Tran
Deakin University
truyen.tran@deakin.edu.au
prada-research.net/~truyen
AusDM’16, Canberra, Dec 7th 2016

dbta.com
PART II: PRACTICE
APPLYING DEEP LEARNING TO NON-COGNITIVE DOMAINS
 Hands-on:
 Introducing programming frameworks (Theano, TensorFlow, MXNet)
 Domains how-to:
 Healthcare
 Software engineering
 Anomaly detection

THEANO & TENSORFLOW
 Two most popular frameworks at present. Both in Python.
 Theano
  Academic-driven. Pioneer.
  Symbolic computation → can be tricky to debug
  Wrapper: Lasagne, Keras
 TensorFlow
  Google → native distributed computing support
  A lot of support, huge community
  Slightly bigger/messier code
  Linux/Mac only but VirtualBox will help in Windows
  Wrapper: Keras

MXNET
√ Excellent support for many languages
√ Fast, portable
√ Intuitive syntax
√ Recent choice by AWS
http://bickson.blogspot.com.au/2016/02/mxnet-vs-tensorflow.html
https://github.com/dmlc/mxnet

BUILDING A MODEL
 Everything is a computational graph
 From here to there is a tensor
 So simple stacking is fine (the idea behind Keras)
 Fit small datasets first to test the water
  But be cautious: small data do not always generalize
 Always monitor the gap between the training and validation
sets: a small gap with high error on both suggests underfitting;
a large, widening gap indicates overfitting.
 Check the model assumptions
   Is the input just a fixed-size vector → FNN?
   Is it a regular sequence → RNN?
   Are there repeated motifs → CNN?
   Is there a mix of static and dynamic features?
   What does the output look like?
   A class?
   A sequence?
   An image?
   What are the performance measures? → Choose a smooth surrogate objective function

STEPS
 Prepare a clean big dataset
 Design a suitable architecture → the main ART
 Choose an optimizer (sgd, momentum, adagrad, adadelta, rmsprop, adam)
 Normalise data (very important for fast training & well-behaved learning curve)
 Shuffle data randomly (extremely important!)
 Run the optimizer
 Sit back & wait (in fact, spend the time monitoring convergence)
 Grid-search hyperparameters if time permits (sometimes essential for proper convergence!)
 Ensemble if time permits
 Reiterate if needed
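
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any framework's API: the toy data, the logistic-regression model and all function names (`fit_scaler`, `scale`, `train`) are assumptions made for the example. It normalises with statistics from the training split only, reshuffles every epoch, runs plain SGD, and records the train/validation losses so the gap can be monitored.

```python
import math
import random

def fit_scaler(rows):
    """Per-feature mean and standard deviation, computed on training rows only."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def scale(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))          # clip so exp() cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, xs, ys, eps=1e-12):
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(w, b, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(xs)

def train(train_x, train_y, val_x, val_y, epochs=20, lr=0.1, seed=0):
    rng = random.Random(seed)
    w, b = [0.0] * len(train_x[0]), 0.0
    order = list(range(len(train_x)))
    history = []                           # (train loss, validation loss) per epoch
    for _ in range(epochs):
        rng.shuffle(order)                 # shuffle before every pass
        for i in order:
            err = predict(w, b, train_x[i]) - train_y[i]
            w = [wi - lr * err * xi for wi, xi in zip(w, train_x[i])]
            b -= lr * err
        history.append((log_loss(w, b, train_x, train_y),
                        log_loss(w, b, val_x, val_y)))
    return w, b, history

# Hypothetical toy data: two features on very different scales.
rng = random.Random(42)
xs = [[rng.gauss(50.0, 10.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
ys = [1 if x[0] + 5.0 * x[1] > 50.0 else 0 for x in xs]
train_x, val_x = xs[:150], xs[150:]
train_y, val_y = ys[:150], ys[150:]
means, stds = fit_scaler(train_x)          # validation data is never touched here
train_x, val_x = scale(train_x, means, stds), scale(val_x, means, stds)
w, b, history = train(train_x, train_y, val_x, val_y)
for epoch, (tr, va) in enumerate(history, start=1):
    if epoch % 5 == 0:
        print(f"epoch {epoch:2d}  train {tr:.3f}  val {va:.3f}  gap {va - tr:+.3f}")
```

In a real project the model and optimizer would come from Theano, TensorFlow, MXNet or Keras; the order of operations is the part worth keeping.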

THINGS TO TAKE CARE OF
 Data quality
 Leakage
  Never touch validation data for feature engineering
  Beware of overlap between training and validation sets in time-sensitive data
 Memory limitation
 CPU/GPU time
 Always shuffle the data BEFORE training – create a mixing of labels
 Initialisation matters
 Dropout: almost always helps, normally with bigger models. But be careful with RNNs.
 Numerical overflow/underflow: exp of a large number, log of zero, or division by zero
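
The overflow/underflow item can be made concrete with the standard log-sum-exp trick. The helper names below (`log_sum_exp`, `softmax`, `safe_log`) are illustrative, not taken from any particular framework; the sketch only assumes that scores are plain Python floats.

```python
import math

def log_sum_exp(scores):
    """log(sum(exp(s))) computed without overflowing: subtract the max first."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def softmax(scores):
    lse = log_sum_exp(scores)
    return [math.exp(s - lse) for s in scores]   # every exponent is <= 0

def safe_log(p, eps=1e-12):
    """log that never sees zero; eps is an assumed floor, tune per application."""
    return math.log(max(p, eps))

scores = [1000.0, 1001.0, 1002.0]
try:
    math.exp(scores[0])                          # the naive route overflows
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

probs = softmax(scores)
print(naive_overflowed, [round(p, 4) for p in probs], safe_log(0.0))
```

Most frameworks ship fused, numerically stable versions of these operations; using them instead of hand-written exp/log chains avoids this class of bug.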

APPLYING TO NON-COGNITIVE DOMAINS
  Where humans need extensive training to do well
  Domains that demand transparency & interpretability.
Healthcare
Software engineering
Anomaly detection
http://www.bentoaktechnologies.com/Images/code_scrn.jpg

WHAT MAKES NON-COGNITIVE DOMAINS HARD?
 Great diversity but may be small in size
 High uncertainty, low-quality/missing data
 Reusable models do not usually exist
 However, at the end of the day, we need only a few generic building blocks:
  Vector -> DNN (e.g., highway net)
  Sequence -> RNN (e.g., LSTM, GRU)
  Repeated Motifs -> CNN
  Set -> attention (Will visit in Part III)
  Graphs -> Column Networks (Will visit in Part III)

HEALTHCARE

HOW DOES AI WORK FOR HEALTH?
Diagnosis, Prognosis, Efficiency, Discovery
http://hubpages.com/education/Top-Medical-Inventions-of-The-1950s, http://www.ctrr.net/journal/

HEALTHCARE: CHALLENGES + OPPORTUNITIES
 Long-term dependencies
 Irregular timing
 Mixture of discrete codes and continuous measures
 Complex interaction of diseases and care processes
 Cohort of interest can be small (e.g., <1K)
 Rich domain knowledge & ontologies
 May include textual notes
 May contain physiological signals (e.g., EEG/ECG)
 May contain images (e.g., MRI, X-ray, retina)
 Genomics
 Detailed neuronal mapping (US) & simulation (EU)
 New modalities: social media, wearable devices

THIS TUTORIAL WILL COVER:
 Electronic medical records (EMR)
•  Time-stamped
•  Coded data: diagnosis, procedure & medication
•  Text not considered, but in principle can be mapped into a vector using an LSTM
 Physiological measures in the Intensive Care Unit (ICU)
•  <Time, Type, Value>
[Figure: timeline of visits/admissions separated by time gaps, ending at the prediction point (?)]
http://www.healthpages.org/brain-injury/brain-injury-intensive-care-unit-icu/

MEDICAL RECORDS: FEEDFORWARD NETS
[Figure: visits/admissions on a timeline with time gaps, split at the prediction point into history and future. The history is fragmented into data segments [0-3]m, [3-6]m, [6-12]m, [12-24]m, [24-48]m and aggregated, then passed through pooling and hidden layers. Future assessment horizons: 15, 30, 60, 120, 180 and 360 days.]
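
The fragmentation + aggregation step can be sketched as counting codes per history segment. The segment boundaries come from the slide; everything else (the `(day, code)` event format, the 30-day month, the function name `segment_features`) is an assumption made for illustration, not the authors' implementation.

```python
from collections import Counter

# Segment boundaries in months before the prediction point, as on the slide.
SEGMENTS = [(0, 3), (3, 6), (6, 12), (12, 24), (24, 48)]

def segment_features(events, prediction_day):
    """events: iterable of (day, code). Returns counts keyed by (segment, code)."""
    features = Counter()
    for day, code in events:
        months_before = (prediction_day - day) / 30.0   # assumed 30-day months
        if months_before < 0:
            continue            # event lies in the future: skipping avoids leakage
        for lo, hi in SEGMENTS:
            if lo <= months_before < hi:
                features[(f"[{lo}-{hi}]m", code)] += 1
                break
    return features

# Hypothetical patient history: (day index, ICD-10 code).
events = [(700, "E11"), (690, "I50"), (500, "E11"), (100, "I48"), (750, "N17")]
feats = segment_features(events, prediction_day=720)
for key in sorted(feats):
    print(key, feats[key])
```

The resulting sparse counts can be laid out against a fixed vocabulary to form the input vector of a feedforward net.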

SUICIDE RISK PREDICTION: MACHINE VERSUS CLINICIAN

DEEPR: CNN FOR REPEATED MOTIFS AND SHORT SEQUENCES (NGUYEN ET AL., J-BHI, 2016)
[Figure: Deepr pipeline in five numbered stages, bottom to top: medical record (visits/admissions with time gaps, ending at the prediction point) → sequencing (phrase/admission, time gaps/transfer) → embedding (word vector) → convolution -- motif detection → max-pooling (record vector) → output: prediction.]
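
The sequencing stage, which turns a record into a "sentence" of code words separated by time-gap words, might look like the sketch below. The gap bin edges, the word spellings and the function names are illustrative assumptions; the slide does not specify them.

```python
def gap_word(days):
    """Map a gap in days to a discrete time-gap word (assumed bin edges)."""
    months = days / 30.0
    if months <= 1:
        return "0-1m"
    if months <= 3:
        return "1-3m"
    if months <= 6:
        return "3-6m"
    if months <= 12:
        return "6-12m"
    return "12+m"

def record_to_sentence(visits):
    """visits: list of (day, [codes]) sorted by day."""
    words, prev_day = [], None
    for day, codes in visits:
        if prev_day is not None:
            words.append(gap_word(day - prev_day))   # gap word between admissions
        words.extend(codes)                          # one phrase per admission
        prev_day = day
    return words

def build_vocab(sentences):
    vocab = {"<pad>": 0}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab

# Hypothetical record with three admissions.
visits = [(0, ["E11", "I48"]), (20, ["I50"]), (220, ["E11", "N17"])]
sentence = record_to_sentence(visits)
vocab = build_vocab([sentence])
ids = [vocab[w] for w in sentence]
print(sentence)
print(ids)
```

The integer ids are what an embedding layer would consume before the convolution and max-pooling stages.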

DISEASE EMBEDDING & MOTIF DETECTION
Motif E11 I48 I50:
  E11  Type 2 diabetes mellitus
  I48  Atrial fibrillation and flutter
  I50  Heart failure
Motif E11 I50 N17:
  E11  Type 2 diabetes mellitus
  I50  Heart failure
  N17  Acute kidney failure

DEEPCARE: DYNAMICS
[Figure: DeepCare LSTM unit. Labels: input, input gate, forget gate, output gate, previous memory, memory, multiplicative (*) interactions, history states, aggregation over time → prediction. New in DeepCare: time gap, previous intervention, current intervention, current data, current state.]
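
One way to let the time gap act on an LSTM's memory is to shrink the forget gate as the gap grows. The scalar sketch below illustrates that idea only: the decay function, the parameter names and the omission of the intervention inputs are all assumptions for the example, not a statement of how DeepCare is implemented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def time_decay(gap_days):
    """Monotone decay in (0, 1]; equals 1 for a zero gap (illustrative choice)."""
    return 1.0 / math.log(math.e + gap_days)

def lstm_step(x, h_prev, c_prev, gap_days, p):
    """One scalar LSTM step whose forget gate is scaled by the time decay."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])
    f = time_decay(gap_days) * sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])
    c = f * c_prev + i * g          # old memory fades faster after long gaps
    h = o * math.tanh(c)
    return h, c

# Hypothetical, untrained parameters.
names = ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg")
params = {name: 0.5 for name in names}
_, c_short = lstm_step(1.0, 0.0, 1.0, gap_days=1, p=params)
_, c_long = lstm_step(1.0, 0.0, 1.0, gap_days=365, p=params)
print(round(c_short, 3), round(c_long, 3))
```

With identical inputs, the memory carried across a one-year gap ends up smaller than across a one-day gap, which is the qualitative behaviour wanted for irregularly timed admissions.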

DEEPCARE: STRUCTURE
[Figure: DeepCare architecture, bottom to top: admissions (disease, intervention) with time gaps over the history → vector embedding → a chain of LSTM (long short-term memory) units producing latent states → multiscale pooling → neural network → future risks.]

DEEPCARE: PREDICTION RESULTS
[Figure: two charts. Left: intervention recommendation (precision@3) at 12 months and 3 months. Right: unplanned readmission prediction (F-score) at 12 months and 3 months.]

DEEPIC: MORTALITY PREDICTION IN INTENSIVE CARE UNITS (WORK IN PROGRESS)
 Existing methods: LSTM with missingness and time gap as input.
 New method: Deepic
 Steps:
   Measurement quantization
   Time gap quantization
   Sequencing words into a sentence
   CNN
http://www.healthpages.org/brain-injury/brain-injury-intensive-care-unit-icu/
Time,Parameter,Value
00:00,RecordID,132539
00:00,Age,54
00:00,Gender,0
00:00,Height,-1
00:00,ICUType,4
00:00,Weight,-1
00:07,GCS,15
00:07,HR,73
00:07,NIDiasABP,65
00:07,NIMAP,92.33
00:07,NISysABP,147
00:07,RespRate,19
00:07,Temp,35.1
00:07,Urine,900
00:37,HR,77
00:37,NIDiasABP,58
00:37,NIMAP,91
00:37,NISysABP,157
00:37,RespRate,19
00:37,Temp,35.6
00:37,Urine,60
Data: PhysioNet 2012
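
The first three Deepic steps (measurement quantization, time gap quantization, sequencing) can be sketched on a few rows from the listing above. The bin edges, the word spellings and the treatment of negative values as missing are assumptions made for this illustration; the slide gives no thresholds.

```python
# Assumed bin edges per parameter; real edges would be chosen from the data.
BINS = {"HR": [60, 100], "Temp": [36.0, 38.0], "RespRate": [12, 20]}

def minutes(hhmm):
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)

def quantize(param, value):
    """Turn a measurement into a word such as 'HR_1'; None if not modelled."""
    edges = BINS.get(param)
    if edges is None or value < 0:     # assumption: negative values mark missing
        return None
    level = sum(value >= edge for edge in edges)
    return f"{param}_{level}"

def gap_word(delta_minutes):
    if delta_minutes < 15:
        return "gap_<15m"
    if delta_minutes < 60:
        return "gap_15-60m"
    return "gap_60m+"

def to_sentence(rows):
    """rows: (time, parameter, value) strings in chronological order."""
    words, prev_t = [], None
    for time, param, value in rows:
        word = quantize(param, float(value))
        if word is None:
            continue
        t = minutes(time)
        if prev_t is not None and t > prev_t:
            words.append(gap_word(t - prev_t))
        words.append(word)
        prev_t = t
    return words

rows = [
    ("00:00", "Weight", "-1"),
    ("00:07", "HR", "73"), ("00:07", "RespRate", "19"), ("00:07", "Temp", "35.1"),
    ("00:37", "HR", "77"), ("00:37", "RespRate", "19"), ("00:37", "Temp", "35.6"),
]
sentence = to_sentence(rows)
print(" ".join(sentence))
```

The resulting sentence can then be embedded and passed to a CNN in the same way as a sentence of diagnosis codes.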

DEEPIC: SYMBOLIC & TIME GAP REPRESENTATION OF DATA
[Figure: Deepic pipeline in five numbered stages, bottom to top: measurement points (ending at the prediction point) → discretization → sequencing (time gaps, measurements) → embedding (word vector) → convolution -- motif detection → max-pooling (record vector) → output: prediction.]

SOFTWARE ANALYTICS
DATA-DRIVEN SOFTWARE ENGINEERING
http://www.bentoaktechnologies.com/Images/code_scrn.jpg

TOWARDS INTELLIGENT ASSISTANTS
 Goal: To model code, text, team, user, execution, project & enabled business process → answer any queries by developers, managers, users and business
 For now:
 DeepSoft vision
 LSTM for code language model
 LD-RNN for report representation
 Stacked/deep inference (later)
http://lifelong.engr.utexas.edu/images/course/swpm_b.jpg

ANALYTICS FOR AGILE SOFTWARE PROJECT MANAGEMENT
http://www.solutionguidance.com/?page_id=1579

CHALLENGES: LONG-TERM TEMPORAL DEPENDENCIES IN SOFTWARE
 Software is similar to an evolving organism
 What will happen next to a software system depends heavily on what has previously been done to it.
 E.g. the implementation of a functionality may constrain how other functionalities are implemented in the future.
 E.g. a previous change (to fix a bug or add a new feature) may inject new bugs and lead to further changes.
 E.g. refactoring a piece of code may have long-term benefits in future maintenance.
 Today’s software products undergo rapid cycles of development, testing and release
 A software project typically has many releases
 A release requires the completion of some tasks (i.e. resolution of some issues).
 An issue is described using natural language (raw data).
 The resolution of an issue may result in code patches (raw data).

A DEEP LANGUAGE MODEL FOR SOFTWARE CODE (DAM ET AL., FSE’16 SE+NL)
 A good language model for source code would capture the long-term dependencies
 The model can be used for various prediction tasks, e.g. defect prediction, code duplication, bug localization, etc.
 The model can be extended to model software and its development process.
Slide by Hoa Khanh Dam

CHARACTERISTICS OF SOFTWARE CODE
 Repetitiveness
  E.g. for (int i = 0; i < n; i++)
 Localness
  E.g. for (int size may appear more often than for (int i in some source files.
 Rich and explicit structural information
  E.g. nested loops, inheritance hierarchies
 Long-term dependencies
  try and catch (in Java) or file open and close do not immediately follow each other.

CODE LANGUAGE MODEL
 Previous work has applied RNNs to model software code (White et al., MSR 2015)
 Plain RNNs, however, struggle to capture the long-term dependencies in code

EXPERIMENTS
 Built dataset of 10 Java projects: Ant, Batik, Cassandra, Eclipse-E4, Log4J, Lucene, Maven2, Maven3, Xalan-J, and Xerces.
 Comments and blank lines removed. Each source code file is tokenized to produce a sequence of code tokens.
 Integers, real numbers, exponential notation, hexadecimal numbers replaced with <num> token, and constant strings replaced with <str> token.
 Replaced less “popular” tokens with <unk>
 Code corpus of 6,103,191 code tokens, with a vocabulary of 81,213 unique tokens.
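
The preprocessing described above can be approximated with a small regex tokenizer. This is a rough sketch, not the tooling used in the experiments: the regular expressions are simplified (for instance, comment stripping ignores `//` inside string literals), and `min_count` is an assumed cut-off for "less popular" tokens.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r'''
    "(?:\\.|[^"\\])*"              # constant string
  | 0[xX][0-9a-fA-F]+              # hexadecimal number
  | \d+\.?\d*(?:[eE][+-]?\d+)?     # integer, real or exponential notation
  | [A-Za-z_]\w*                   # identifier or keyword
  | \S                             # any other single symbol
''', re.VERBOSE)

def tokenize(source):
    """Strip comments, then map literals to <str> and <num>."""
    source = re.sub(r"//[^\n]*|/\*.*?\*/", "", source, flags=re.DOTALL)
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok.startswith('"'):
            tokens.append("<str>")
        elif tok[0].isdigit():
            tokens.append("<num>")
        else:
            tokens.append(tok)
    return tokens

def replace_rare(tokens, min_count=2):
    """Replace tokens seen fewer than min_count times with <unk>."""
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else "<unk>" for t in tokens]

src = 'for (int i = 0; i < n; i++) { x = 0x1F + 1.5e3; s = "hi"; } // loop'
tokens = tokenize(src)
print(tokens)
print(replace_rare(tokens))
```

A production pipeline would more likely reuse a real Java lexer; the point here is only the normalisation of literals and rare tokens before language modelling.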

EXPERIMENTS (CONT.)
 Both RNN and LSTM improve with more training data (whose size grows with sequence length).
 LSTM consistently performs better than RNN: 4.7% improvement to 27.2% (varying sequence length), 10.7% to 37.9% (varying embedding size).

STORY POINT ESTIMATION
 Traditional estimation methods require experts, LOC or function points.
  Not applicable early
  Expensive
 Feature engineering is not easy!
 We need a cheap way to start from documentation alone.

LD-RNN FOR REPORT REPRESENTATION
(CHOETKIERTIKUL ET AL., WORK IN PROGRESS)
 LD = Long Deep
 LSTM for document representation
 Highway-net with tied parameters for story point estimation
[Figure: LD-RNN architecture. The words W1 ... W6 of an issue title ("Standardize XD logging to align with ...") are embedded and fed to an LSTM giving hidden states h1 ... h6; pooling yields the document representation, which passes through a recurrent highway net and a regression layer to produce the story point estimate.]
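
A highway net with tied parameters applies the same gated layer repeatedly, which is why it can be described as recurrent. The sketch below shows that idea in plain Python; the tanh candidate, the layer sizes and the hand-set weights are assumptions for illustration and are not taken from the LD-RNN work.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def highway_layer(h, Wh, bh, Wt, bt):
    """h_next = T * candidate + (1 - T) * h, with transform gate T in (0, 1)."""
    candidate = [math.tanh(a + b) for a, b in zip(matvec(Wh, h), bh)]
    gate = [sigmoid(a + b) for a, b in zip(matvec(Wt, h), bt)]
    return [t * c + (1.0 - t) * v for t, c, v in zip(gate, candidate, h)]

def recurrent_highway(h, params, depth):
    """Apply the SAME parameters at every layer (tied weights)."""
    for _ in range(depth):
        h = highway_layer(h, *params)
    return h

def regress(h, w, b):
    return sum(wi * hi for wi, hi in zip(w, h)) + b

# Hypothetical 2-dimensional document vector and untrained parameters.
doc = [0.3, -0.7]
Wh, bh = [[0.5, -0.2], [0.1, 0.4]], [0.0, 0.0]
Wt, bt = [[0.0, 0.0], [0.0, 0.0]], [-40.0, -40.0]   # gate ~ 0: pure carry
carried = recurrent_highway(doc, (Wh, bh, Wt, bt), depth=10)
mixed = recurrent_highway(doc, (Wh, bh, Wt, [0.0, 0.0]), depth=10)
print(carried, mixed, regress(mixed, [1.0, 1.0], 3.0))
```

With a strongly negative gate bias the input is carried through unchanged, which is what makes deep stacks of such layers easy to train; tying the parameters keeps the number of weights independent of depth.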

RESULTS
MAE = Mean Absolute Error
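
For reference, MAE averages the absolute differences between actual and estimated values. The numbers below are made up purely to show the arithmetic and are not results from the study.

```python
def mean_absolute_error(actual, predicted):
    """MAE = (1/N) * sum(|actual_i - predicted_i|)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical story points: absolute errors are 1, 0, 5 and 1.
actual = [3, 5, 8, 2]
estimated = [2, 5, 13, 3]
mae = mean_absolute_error(actual, estimated)
print(mae)
```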

TASK DEPENDENCY IN SOFTWARE PROJECT
(CHOETKIERTIKUL ET AL., WORK IN PROGRESS)

TASK DEPENDENCY IN SOFTWARE PROJECT (MORE IN PART III)
[Figure: column networks; stacked inference]

ANOMALY DETECTION
USING UNSUPERVISED LEARNING (PART III)
This work is partially supported by the Telstra-Deakin Centre of Excellence in Big Data and Machine Learning

DETECTION METHODS
[Figure: two routes to anomaly detection.
 Auto-encoder (deterministic): reconstruction error; external outlier detector?
 Restricted Boltzmann Machine (probabilistic): free energy surface with a detection threshold.]
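
For a binary RBM, the free energy of a visible vector v is F(v) = -a·v - Σ_j log(1 + exp(b_j + Σ_i W_ij v_i)), and a point is flagged when its free energy exceeds a threshold. The sketch below uses hand-set, untrained weights and an assumed quantile rule for the threshold, purely to illustrate the mechanics.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def free_energy(v, W, a, b):
    """W[i][j]: weight between visible i and hidden j; a, b: visible/hidden biases."""
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = sum(
        softplus(bj + sum(W[i][j] * v[i] for i in range(len(v))))
        for j, bj in enumerate(b)
    )
    return -visible_term - hidden_term

def threshold_from(energies, quantile=0.95):
    """Assumed rule: take a high quantile of free energies on normal data."""
    ordered = sorted(energies)
    return ordered[min(len(ordered) - 1, int(quantile * len(ordered)))]

def is_anomaly(v, W, a, b, threshold):
    return free_energy(v, W, a, b) > threshold

# Hypothetical model: 3 visible units, 1 hidden unit, weights set by hand.
W, a, b = [[3.0], [3.0], [-3.0]], [0.0, 0.0, 0.0], [0.0]
normal = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [0, 1, 0]]
threshold = threshold_from([free_energy(v, W, a, b) for v in normal])
print(is_anomaly([0, 0, 1], W, a, b, threshold),
      is_anomaly([1, 1, 0], W, a, b, threshold))
```

An auto-encoder would be used the same way, with reconstruction error taking the place of free energy.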

MIXED-VARIATE RBM (TRAN ET AL, 2011)
[Figure: mixed-variate RBM diagram]

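ABNORMALITY ACROSS ABSTRACTIONS
[Figure: three models of increasing depth, Mv.RBM, Mv.DBN-L2 and Mv.DBN-L3 (weights WA1, WA2, WD1, WD2, WD3), give scores F1(x1), F2(x2) and F3(x3). Each score list yields a ranking (Rank 1, Rank 2, Rank 3), and the rankings are combined by rank aggregation.]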
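
Rank aggregation combines the rankings produced at each level of abstraction into one ordering. The slide does not say which aggregation rule is used, so the sketch below assumes the simplest one, averaging ranks (a Borda-style rule); the scores are invented for illustration.

```python
def ranks(scores):
    """Rank 1 = most anomalous (highest score); ties broken by position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def aggregate(score_lists):
    """Mean rank per item across all models (assumed aggregation rule)."""
    all_ranks = [ranks(scores) for scores in score_lists]
    n_items = len(score_lists[0])
    return [sum(r[i] for r in all_ranks) / len(all_ranks) for i in range(n_items)]

# Hypothetical anomaly scores for four items from three models.
f1 = [0.1, 0.9, 0.3, 0.2]
f2 = [1.0, 4.0, 5.0, 2.0]
f3 = [-3.0, -1.0, -2.5, -2.8]
mean_rank = aggregate([f1, f2, f3])
most_anomalous = min(range(len(mean_rank)), key=lambda i: mean_rank[i])
print(mean_rank, most_anomalous)
```

Working with ranks rather than raw scores sidesteps the fact that scores from different models live on different scales.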

MALICIOUS URL CLASSIFICATION
http://www.indiainfoline.com/article/news-sector-information-technology/india-ranks-4th-in-highest-users-who-clicked-malicious-urls-in-2015-trend-micro-116052700684_1.html

MODEL OF MALICIOUS URLS
[Figure: character-level CNN in four numbered stages, bottom to top: URL characters "h t t p : / / w w w . s ..." → embedding (may be one-hot) giving char vectors → convolution -- motif detection → max-pooling giving the record vector → prediction with FFN: Safe/Unsafe.]
 Trained on 900K malicious URLs and 1,000K good URLs
 Accuracy: 96%
 No feature engineering!
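
The forward pass of such a character-level model can be sketched without any framework. Everything below is illustrative: the alphabet, the hand-built filters (which simply fire on the motifs "www" and "://"), and the untrained logistic output layer are assumptions; a real model would learn its filters from the URL corpus.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789:/.-_?=&%"

def one_hot(url, max_len=64):
    """One row per character; unknown characters become all-zero rows."""
    rows = []
    for ch in url.lower()[:max_len]:
        row = [0.0] * len(ALPHABET)
        idx = ALPHABET.find(ch)
        if idx >= 0:
            row[idx] = 1.0
        rows.append(row)
    return rows

def conv1d(seq, filters):
    """Slide each filter over the sequence and apply ReLU (filters share a width)."""
    width = len(filters[0])
    maps = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        maps.append([
            max(0.0, sum(f[k][c] * window[k][c]
                         for k in range(width) for c in range(len(ALPHABET))))
            for f in filters
        ])
    return maps

def max_pool(maps):
    """Strongest response of each filter anywhere in the URL."""
    return [max(column) for column in zip(*maps)]

def unsafe_probability(record_vector, weights, bias):
    z = sum(w * v for w, v in zip(weights, record_vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hand-built motif detectors; the one-hot encoding of a motif acts as its filter.
filters = [one_hot("www"), one_hot("://")]
record = max_pool(conv1d(one_hot("http://www.s"), filters))
other = max_pool(conv1d(one_hot("http://abc"), filters))
p = unsafe_probability(record, weights=[0.1, -0.1], bias=0.0)  # untrained weights
print(record, other, p)
```

Each pooled value says how well the best-matching window agrees with a motif, independent of where in the URL it occurs; that position invariance is why no hand-crafted features are needed.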

SUMMARY OF PART II
 Hands-on:
 Introducing programming frameworks (Theano, TensorFlow, MXNet)
 Domains how-to:
 Healthcare
 Software engineering
 Anomaly detection

