
Enabling Edge Devices that Learn from Each Other: Cross Modal Training for Activity Recognition

Tianwei Xing∗, Sandeep Singh Sandha∗, Bharathan Balaji (University of California, Los Angeles)
Supriyo Chakraborty (IBM T. J. Watson Research Center)
Mani Srivastava (University of California, Los Angeles)

ABSTRACT
Edge devices rely extensively on machine learning for intelligent inferences and pattern matching. However, edge devices use a multitude of sensing modalities and are exposed to wide-ranging contexts. It is difficult to develop separate machine learning models for each scenario, as manual labeling is not scalable. To reduce the amount of labeled data and to speed up the training process, we propose to transfer knowledge between edge devices by using unlabeled data. Our approach, called RecycleML, uses cross modal transfer to accelerate the learning of edge devices across different sensing modalities. Using human activity recognition as a case study, over our collected CMActivity dataset, we observe that RecycleML reduces the amount of required labeled data by at least 90% and speeds up the training process by up to 50 times in comparison to training the edge device from scratch.

CCS CONCEPTS
• Computing methodologies → Transfer learning; Neural networks; Learning latent representations; • Hardware → Sensor applications and deployments;

KEYWORDS
edge devices, transfer learning, cross modality, shared latent representation, activity recognition

ACM Reference format:
Tianwei Xing, Sandeep Singh Sandha, Bharathan Balaji, Supriyo Chakraborty, and Mani Srivastava. 2018. Enabling Edge Devices that Learn from Each Other: Cross Modal Training for Activity Recognition. In Proceedings of EdgeSys '18: International Workshop on Edge Systems, Analytics and Networking, Munich, Germany, June 10–15, 2018 (EdgeSys '18), 6 pages. https://doi.org/10.1145/3213344.3213351

∗ Both authors contributed equally to this work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
EdgeSys '18, June 10–15, 2018, Munich, Germany
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5837-8/18/06. . . $15.00
https://doi.org/10.1145/3213344.3213351

1 INTRODUCTION
Edge devices are typically equipped with a wide variety of sensing modalities for tracking environmental markers. To provide insights and enable context-aware applications (e.g. user activity recognition [25], workout tracking [22], speech recognition [8]), the data collected on these devices are used to train deep neural network models. However, to fully realize the learning-at-the-edge paradigm, several challenges still need to be addressed. In particular, the model training process needs to handle insufficient labeled data and the heterogeneity in inter-device sensing modalities.
As a step towards addressing the above concerns, we propose RecycleML, a mechanism to transfer knowledge between edge devices. Our approach is guided by the observation that application-specific semantic concepts can be better associated with features in the higher layers (close to the output side) of a network model [5]. This observation allows us to conceptualize the layers of the different networks as an hourglass model, as shown in Figure 1. The lower half of the hourglass corresponds to the lower layers (close to the input side) of the individual models (trained on specific sensing modalities). The narrow waist is the common layer (latent space) into which the lower layers project their data for knowledge transfer. The upper half of the hourglass comprises the task-specific higher-layer features, which are trained in a targeted fashion for task-specific transfer.
To evaluate RecycleML, we emulate edge devices with three sensing modalities - vision, audio and inertial (IMU) sensing - as shown in Figure 2. We perform zero-shot learning [23], i.e. use zero training labels, across different sensing modalities when they are performing the same classification task. We achieve this by training the target edge device model to have the same latent space as the source model. RecycleML can also learn to expand the classification tasks of the transferred model with very few training examples.
Our results across a mix of sensory substitutions and task transfers show that, over our collected CMActivity dataset, RecycleML reduces the amount of labeled data required to train edge devices by at least 90% and speeds up the training process by up to 50 times after doing knowledge transfer using unlabeled data.
Our contributions are as follows:
(1) We combine the idea of transfer learning (lower layers transfer) with sensory substitution (higher layers transfer) and propose a unified framework, where the knowledge in every part of a network can be transferred.


(2) We introduce a new dataset, CMActivity, that has synchronized data of three modalities: vision, audio, and inertial.
(3) For the activity recognition task, we verify that a shared representation exists for time series sensory data, and that it can help transfer knowledge from ambient edge devices to wearable edge devices and vice versa. The code for our experiment is available online.¹

¹ https://github.com/nesl/RecycleML

Figure 1: Shared representation between edge devices.

2 METHOD OVERVIEW

2.1 Conceptual Scenario
Suppose Alice has an edge device D_V1 with a camera in her living room, and it is trained to do activity recognition. Alice wants to replicate the inferencing ability of D_V1 on other devices: a smart watch D_W which she wears regularly, an acoustic device D_A1 in her living room to turn off D_V1 whenever needed due to privacy reasons, and a camera D_V2 and a voice assistant D_A2 in her office. Our objective is to transfer the activity recognition knowledge of D_V1 to D_A1 and D_W (Video→Audio and IMU), and later, transfer the activity recognition knowledge of D_W to D_A2 and D_V2 (IMU→Audio and Video).

2.2 RecycleML Description
RecycleML uses the same latent feature representation across edge devices of different modalities to do knowledge transfer. Knowledge transfer uses synchronous unlabeled data to map the input of the untrained model to the shared latent feature representation of the pre-trained model (details in Section 2.2.1). Later, edge devices can either reuse the upper layers across models or do task transfer on the upper layers if needed (details in Section 2.2.2).

Figure 2: Knowledge transfer across edge devices with different sensing modalities.

2.2.1 Knowledge Transfer. For simplicity, let us consider two edge devices D_X and D_Y, each with a different sensing modality capturing data X and Y respectively. Suppose D_X has a pre-trained model M_X and performs task T_X. Our goal is to train a new model M_Y for D_Y to perform task T_Y. To transfer knowledge from D_X to D_Y, we collect data X and Y from both devices while observing the same event. X and Y need not be labeled. An important requirement is time synchronization between devices D_X and D_Y, so as to capture the same event in their data X and Y. Synchronization is natural across different sensing modalities. For example, vision, audio and inertial sensors observing the same event of human motion can capture it in different signals (see Section 3.1 for details).
We input data X to the pre-trained model M_X and, instead of taking the final output value, we calculate the activation values f(X) of an intermediate layer that acts as our shared latent feature representation. f is the transformation of all the early layers before the specific activation. We use f(X) as the training target for the model M_Y of device D_Y. Specifically, we choose a new network g, specialized for input modality Y, and train the network g(Y) so that it maps Y to the same shared latent feature representation by minimizing ||g(Y) − f(X)||² as our loss function. We generate the model M_Y for device D_Y by adding the task-specific output layers to g. In this way, model M_X teaches the new model M_Y in a teacher-student data distillation manner [11].
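To make the knowledge transfer step of Section 2.2.1 concrete, the following is a minimal PyTorch-style sketch. It is not the released implementation; the module names, data loader, and optimizer settings are illustrative assumptions.

import torch
import torch.nn as nn

# teacher_lower: the early layers f of the pre-trained model M_X, up to the shared latent layer.
# student_lower: the new network g for modality Y; both output latent vectors of the same size.
def transfer_knowledge(teacher_lower, student_lower, loader, epochs=500, lr=1e-3):
    teacher_lower.eval()                          # M_X stays fixed during transfer
    for p in teacher_lower.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(student_lower.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:                       # time-synchronized, unlabeled (X, Y) pairs
            with torch.no_grad():
                target = teacher_lower(x)         # f(X): shared latent representation
            loss = mse(student_lower(y), target)  # minimize ||g(Y) - f(X)||^2
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student_lower

The task-specific output layers are then stacked on top of the returned student network, as described above and in Section 2.2.2.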


2.2.2 Task Transfer. Transferring knowledge from device D_X to D_Y does not need any ground-truth labels. However, the new model M_Y for device D_Y may need additional information before performing any classification or regression task. Therefore, three different scenarios arise when devices D_X and D_Y are performing tasks T_X and T_Y in classification settings: (i) devices D_X and D_Y are performing the same task, (ii) devices D_X and D_Y are performing related tasks T_X and T_Y, e.g. where T_X and T_Y are both human activity inferencing but with different numbers of categories, and (iii) devices D_X and D_Y are performing completely different tasks T_X and T_Y. In this paper, we study how to transfer knowledge between devices in the first two scenarios.
We explore two different methods of task transfer:
• PureTransfer directly uses the higher layers of model M_X for the new model M_Y. In this case no further training is needed and no labeled data is required.
• Transfer+LimitedTrain freezes the network g, adds higher layers to M_Y, and retrains only the higher layers using limited labeled data.
In the first scenario, since the tasks are the same, we can use both methods. In the second and third scenarios, direct transfer of the higher layers from model M_X to model M_Y does not work, as M_X does not give the desired output. Hence, we use the second method. In our experiments, we evaluate scenario (i) of task transfer using both PureTransfer and Transfer+LimitedTrain, and scenario (ii) using Transfer+LimitedTrain.
In our experiments, we used the output of the last hidden layer, after removing the final output layer from model M_X, as the f transformation. Here f and g serve as shared latent representations across modalities. We add a single task-specific layer to g to generate model M_Y. In the future, we will explore different choices of f and the addition of multiple task-specific output layers to g.

3 EVALUATION

3.1 Dataset
For our experiments, we collected a new dataset, called CMActivities, composed of videos for the vision and audio modalities, and corresponding IMU data (accelerometer and gyroscope) from sensors on the left and right wrists. We collected 767 videos of roughly 10 seconds each from 2 users² doing 7 different activities at 6 locations. Every video contains a single activity and is used to label the vision, audio and IMU data. The total duration of collected data for each modality is 125 minutes.

² The data is collected from the authors and thus does not require approval from IRB.

Table 1: Description of CMActivities dataset

Activity        Number of Videos    Duration (sec)
Go Upstairs     162                 1338
Go Downstairs   161                 1113
Walk            119                 1143
Run             115                 891
Jump            73                  995
Wash Hand       73                  1070
Jumping Jack    90                  958

We collected the videos of the user using an observer smartphone. The wrist sensors communicate their data to the smartphone of the user doing the activities. The IMU data was timestamped by the user's smartphone and the video by the observer smartphone. Time synchronization between vision and audio is naturally present because both are extracted from the same videos. However, time synchronization between the user smartphone and the observer smartphone is needed so as to synchronize the video and IMU data. In our data collection, we used the default smartphone timestamps synchronized through the Network Time Protocol (NTP) [17] service, and observed a maximum time difference of 0.5 seconds between the observer smartphone and the user smartphone. We leave it for future work to explore the effect of poor time synchronization across devices observing the same event. We expect the knowledge transfer capabilities of RecycleML to degrade as the time difference between devices increases.
The details of CMActivities are shown in Table 1. The data collection was done at different locations, with the two users wearing separate sets of clothes at each location, so as to make sure that the trained classifier learns the activity features and is least affected by environmental factors. We split the 767 videos and IMU sessions into three parts: training dataset (624), testing dataset (71) and personalization dataset (72). The training and testing datasets contain 7 activities at 5 different locations, and the personalization dataset contains 5 activities at the 6th location. We don't have the Go Upstairs and Go Downstairs activities in the personalization dataset.
The training dataset is further split into 3 parts: Pre-Training set, Transfer set and LimitedTrain set. The personalization dataset is split into PersonalTrain and PersonalTest sets. The testing dataset is used only for evaluation. The frame rate of the video is 29 fps, and the sampling frequencies of audio and IMU are 22050 Hz and 25 Hz respectively. We use a window of 2 seconds to extract vision, audio and IMU features from the dataset, with a sliding step of 0.4 seconds between consecutive windows. In the case of vision and IMU, we use the raw features directly as input to the models. We extracted features from the raw audio data using Librosa [16] and use them as the input features. Specifically, we extract mel-frequency cepstral coefficients (MFCC) [15], power spectrogram [6], mel-scaled spectrogram, spectral contrast [13] and tonal centroid features (tonnetz) [10].
In total, for each modality we have 11976 samples in training (5000 samples for the Pre-Training set, 6000 samples for the Transfer set, and 976 samples for the LimitedTrain set), 1377 samples in testing, and 1592 samples in personalization (475 samples for the PersonalTrain set and 1117 samples for the PersonalTest set).
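As an illustration of the audio pipeline described above, a windowed feature extraction along the following lines could be used. This is a sketch rather than the code used for the paper: the Librosa calls match the listed feature families, but the number of coefficients, the per-window aggregation (here a simple mean over frames), and other parameters are assumptions.

import numpy as np
import librosa

def audio_features(signal, sr=22050, win_sec=2.0, step_sec=0.4, n_mfcc=13):
    """Slide a 2 s window with a 0.4 s step and extract per-window audio features."""
    win, step = int(win_sec * sr), int(step_sec * sr)
    features = []
    for start in range(0, len(signal) - win + 1, step):
        seg = signal[start:start + win]
        spec = np.abs(librosa.stft(seg))                               # magnitude spectrogram
        power = spec ** 2                                              # power spectrogram
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc)       # MFCC
        mel = librosa.feature.melspectrogram(S=power, sr=sr)           # mel-scaled spectrogram
        contrast = librosa.feature.spectral_contrast(S=spec, sr=sr)    # spectral contrast
        tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(seg), sr=sr)  # tonal centroid
        # Aggregate each feature over frames (assumption) and concatenate into one vector.
        features.append(np.hstack([f.mean(axis=1) for f in (mfcc, power, mel, contrast, tonnetz)]))
    return np.array(features)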


3.2 Baselines
To compare the results of RecycleML, we trained Video, Audio and IMU models individually on the Pre-Training set to do activity recognition. The models we use are state-of-the-art deep learning architectures that are generally adopted in a wide range of applications:
(a) Video Network is a reduced version of the C3D [24] network. It includes four 3D-convolutional modules combined with 3D-maxpooling layers, followed by 3 fully-connected layers and one output layer. The total number of parameters is about 4.6 million.
(b) Audio Network is a multi-layer perceptron model. It has 10 fully-connected layers and a total of 810K parameters. We add dropout to avoid overfitting.
(c) IMU Network is a CNN. It has 2 convolutional modules (convolution layer + maxpooling layer), 3 fully-connected layers and an output layer. 57K parameters are trainable in this network.

Table 2: Testing accuracy of baseline models

Input Modality          Video     Audio     IMU
Accuracy                90.92%    92.81%    90.99%
Number of parameters    4.6M      0.8M      57K

Table 2 shows the summary of the individual models. The models are trained using the training dataset and tested on the testing dataset. These baseline models are trained using the SGD [4] and Adam [14] optimizers with a learning rate of 0.001. We save the models with the best test accuracy after training for 500 epochs.
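For reference, the IMU baseline described in (c) could look roughly like the sketch below. The paper only states the module counts, so the filter counts, kernel sizes, input shape (2 s windows at 25 Hz from two wrist-worn accelerometer/gyroscope pairs), and layer widths here are illustrative assumptions and are not tuned to reproduce the reported 57K parameters.

import torch.nn as nn

class IMUNet(nn.Module):
    """Sketch of the IMU baseline: 2 conv+pool modules, 3 FC layers, 1 output layer."""
    def __init__(self, channels=12, win_len=50, n_classes=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (win_len // 4), 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),       # last hidden layer: the shared latent space
        )
        self.out = nn.Linear(16, n_classes)     # task-specific output layer

    def forward(self, x):                       # x: (batch, channels, win_len)
        return self.out(self.fc(self.conv(x)))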
3.3 Knowledge Transfer Results
Knowledge transfer results are presented in Table 3. In the first and second experiments, the vision device D_V1 is trained while the acoustic device D_A1 and the wearable device D_W are untrained, respectively. In the third and fourth experiments, the wearable device D_W is trained while the vision device D_V2 and the acoustic device D_A2 are untrained. For each of these four transfers, we follow the same procedure. Taking vision device D_V1 to acoustic device D_A1 as an example, we first train the vision model of D_V1 from scratch using the Pre-Training set (5000 samples) of the training dataset. We use the standard SGD optimizer with a learning rate of 0.001. The training is finished in 500 epochs. We then use D_V1 as a pre-trained device to transfer knowledge to D_A1 following the procedure described in Section 2.2.1. In the knowledge transfer process, we use the Adam optimizer with a learning rate of 0.001 and run it for 500 epochs. The data used in the transfer process are the synchronized, unlabeled vision and sound data from the Transfer set (6000 samples) of the training dataset. After transfer, the higher layers of the audio model can be created using the two methods discussed in Section 2.2.2, Pure-Transfer and Transfer+LimitedTrain, since both D_V1 and D_A1 are doing the same task. In the Pure-Transfer method, the audio model uses the output layer of the vision model directly. In Transfer+LimitedTrain, we train a new output layer for the audio model. We select a small labeled set of 500 samples randomly out of the 976 samples in the LimitedTrain set of the training dataset and name it LimitTrainSet. We use the LimitTrainSet to train the output layer of the audio model for 100 epochs using the Adam optimizer. As a comparison, we also trained an audio model from scratch using the same LimitTrainSet for 500 epochs. We use more epochs for training from scratch as it takes more time to converge. The other three transfers are tested in the same way. The Audio and IMU models which are trained from scratch use the Adam optimizer.
Note: In the Video to IMU transfer, it takes more time to transfer the knowledge, so we perform the knowledge transfer for 1000 epochs. In real implementations, the knowledge transfer process for edge devices can either be done in the background or at the server using unlabeled data, so as to avoid this overhead.

Table 3: Comparison of knowledge transfer between devices. Significance tests (compared to training from scratch) are carried out using a t-test, with p < 0.005 in most cases.

Transfer                          Trained-Device    Pure-Transfer    Transfer+LimitedTrain    Training from Scratch
Video (D_V1) to Audio (D_A1)      90.92%            90.20%           90.36%                   84.12%
Video (D_V1) to IMU (D_W)         90.92%            94.19%           94.37%                   70.73%
IMU (D_W) to Video (D_V2)         90.99%            74.00%           75.13%                   72.26%
IMU (D_W) to Audio (D_A2)         90.99%            84.82%           87.82%                   84.28%

Table 3 shows the knowledge transfer results between devices doing the same task of activity recognition. Model performance is measured by test accuracy. Considering row 1, Trained-Device is the accuracy of the pre-trained device D_V1. Pure-Transfer and Transfer+LimitedTrain are the accuracies of device D_A1 using the two methods respectively. The last cell shows the accuracy of the audio model trained from scratch using the LimitTrainSet. As we can see, both Pure-Transfer and Transfer+LimitedTrain achieve better accuracy than training from scratch. This shows that the shared latent feature representation is successful in transferring knowledge across devices of different modalities. We also observe that Transfer+LimitedTrain usually gives the best performance.
In our experiments, we train every model 10 times to preclude the effect of randomness. Based on the results, significance tests (compared to training from scratch) are carried out using a t-test. We find that Transfer+LimitedTrain outperforms training from scratch (p < 0.005) in three cases (Video to Audio, Video to IMU, IMU to Audio), and p < 0.4 for the case of the IMU to Video transfer. This is because the video model is complicated and sensitive, and the performance of the video model trained from scratch fluctuates.

Figure 3: Transfer+LimitedTrain converges in 10 epochs whereas Training from scratch requires training for around 500 epochs.
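To summarize the two head-construction options used in these experiments, a simplified sketch is shown below. It assumes each model is split into a lower network (the transferred g) and a single output layer, as in the earlier sketches; the latent dimension, class count, and training settings are assumptions.

import copy
import torch
import torch.nn as nn

class ModalityModel(nn.Module):
    """A model split into a lower network g (input -> latent) and an output head."""
    def __init__(self, lower, latent_dim=16, n_classes=7):
        super().__init__()
        self.lower = lower
        self.out = nn.Linear(latent_dim, n_classes)

    def forward(self, x):
        return self.out(self.lower(x))

def pure_transfer(teacher, student):
    # Pure-Transfer: reuse the teacher's output layer directly; no labels, no training.
    student.out = copy.deepcopy(teacher.out)
    return student

def transfer_limited_train(student, labeled_loader, epochs=100, lr=1e-3):
    # Transfer+LimitedTrain: freeze the transferred g and retrain only the output layer.
    for p in student.lower.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(student.out.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in labeled_loader:
            loss = ce(student(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student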


3.4 RecycleML Reduces Training Time


We further compare the effect of the number of training epochs for the Transfer+LimitedTrain method and training from scratch using the LimitTrainSet (500 samples). Figure 3 shows our results for all four transfers. Clearly, the Transfer+LimitedTrain method trains a model with accuracy greater than 80% in most of the cases within 10 epochs, while training from scratch cannot achieve comparable accuracy even after 500 epochs. This makes RecycleML even more suitable for deployment on edge devices: it reduces the training time by 50x. The reason for this large gain is that the knowledge transfer uses unlabeled data, and Transfer+LimitedTrain trains only the output layer, so it requires very few epochs.

3.5 RecycleML Reduces Required Labeled Data


To study the effect of the number of labeled data samples on model accuracies, we change the size of the training data for Transfer+LimitedTrain and training from scratch. All the training samples were selected randomly from the LimitedTrain set (976 samples) of the training dataset. Although the methods converge at different speeds (Transfer+LimitedTrain converges in 10 epochs, while training from scratch takes about 500 epochs), in this experiment we only compare the converged performance of all the models. Figure 4 shows our results for the four device transfers. Consider Video (D_V1) to Audio (D_A1): Transfer+LimitedTrain is compared with training Audio (D_A1) from scratch. Using Transfer+LimitedTrain, the model achieves its best accuracy using only 50 data samples, while the model trained from scratch cannot get comparable results even if we increase the size of the available data to 976 samples, as shown in the upper left plot. The testing was performed on the entire test dataset. So RecycleML reduces the labeled data requirement by at least 90%. However, in the ideal scenario, when abundant labeled data samples are available, training from scratch slowly converges and can outperform Transfer+LimitedTrain. For IMU (D_W) to Video (D_V2), when more than 750 labeled samples are available, training from scratch can outperform Transfer+LimitedTrain.

Figure 4: With different sizes of labeled data, Transfer+LimitedTrain converges better than Training from scratch.

3.6 Related Task Transfer Using RecycleML


We tested knowledge transfer from a video device to an IMU device, with the video model doing an activity recognition task with 7 categories, while the goal of the IMU model is to do an activity recognition task with 5 categories in a totally different location.
We did knowledge transfer as described in Section 2.2.1 and finally used the Transfer+LimitedTrain method to train the output layer of the IMU model using the PersonalTrain set (475 samples). The trained models are tested on the PersonalTest set (1117 samples). In Figure 5, we plot the learning curves of Transfer+LimitedTrain and training from scratch, both trained using PersonalTrain. When transferring knowledge to a relevant task, RecycleML still learns faster: it converges in 10 epochs and gets a testing accuracy of 91.58%, while training from scratch takes 500 epochs and only gets an accuracy of 61.86%.

Figure 5: Transferring knowledge to a new task: Transfer+LimitedTrain learns faster and better than Training from Scratch.

4 RELATED WORK
RecycleML is inspired by prior work in machine learning for multimodal data. Previous works [12, 18, 20, 21] combine lower layers from multiple modalities to develop a unified model that outperforms the individual modalities. Radu et al. [20, 21] study combining modalities for human activity recognition on mobile devices. We use the idea of representing multiple modalities in the same latent space in intermediate layers of a deep network, but our focus is on knowledge transfer for machine learning models across multi-modal edge devices.
Ba et al. [3] and Hinton et al. [11] present knowledge transfer within the same modality. Ngiam et al. [19] use shared representations to improve visual speech classification. Aytar et al. [1] learn shared representations that connect multiple forms of image and text data. Frome et al. [7] show knowledge transfer from text to vision for object classification.


Gupta et al. [9] present knowledge transfer between labeled RGB images and unlabeled depth and optical flow images. Aytar et al. [2] show that visual knowledge can be transferred from vision to sound.
The prior works either focus on image and text data, or take two modalities (vision and audio) from the same source into consideration. In RecycleML, we consider three commonly available sensing modalities on edge devices from multiple sources, and create a unified representation that bridges them. This allows edge devices to use multimodal knowledge transfer across the different sensing modalities of ambient sensors (vision and audio) and wearable sensors (IMU) for the first time.

5 DISCUSSION
While RecycleML shows promise in terms of handling both the paucity of labeled data and the speed of model training across multiple modalities, the ability of the approach to generalize to different applications and larger datasets needs further investigation. Furthermore, our experiments indicate that while the trained models can be personalized to a specific environment, they need regularization to generalize to new settings.
For cross modal knowledge transfer using RecycleML, we need unlabeled but synchronized data. In our experiments, since the audio and video data are captured by the same device, they are naturally synchronized. In addition, we used the default smartphone timestamps, synchronized through the Network Time Protocol (NTP) [17] service, to synchronize the IMU device with the video and sound device. In real settings, however, edge devices have to be time synchronized in order to observe the same event at the same time.
In our experiments, we chose the fully connected layer (immediately prior to the output layer) as the common latent space. In the future, we plan to explore different choices for the shared representation layer, for efficient sensory substitution and task transfer on edge devices.

6 CONCLUSION
Heterogeneity in the sensing modalities of edge devices, together with the lack of labeled training data, represent two of the most significant barriers to enabling the learning-on-the-edge paradigm. Towards this end, we presented RecycleML, a system that enables multi-modality edge devices to perform knowledge transfer between their models by mapping their lower layers to a shared latent space representation. RecycleML further allows task-specific transfer between models by targeted retraining of the higher layers beyond the shared latent space, reducing the amount of labeled data needed for model training. Our initial experiments, performed using multi-modality data (vision, audio, IMU) for activity recognition, show that a transfer model trained using RecycleML leads to reduced training time and increased accuracy compared to an edge model trained from scratch using limited labeled data.

7 ACKNOWLEDGEMENT
This research was sponsored by the U.S. Army Research Laboratory and the UK Ministry of Defence under Agreement Number W911NF-16-3-0001, by the National Institutes of Health under award #U154EB020404, and by the National Science Foundation under award #1636916. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the funding agencies. The U.S. and UK Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

REFERENCES
[1] Aytar, Y., Castrejon, L., Vondrick, C., Pirsiavash, H., and Torralba, A. Cross-modal scene networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
[2] Aytar, Y., Vondrick, C., and Torralba, A. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (2016), pp. 892–900.
[3] Ba, J., and Caruana, R. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems (2014), pp. 2654–2662.
[4] Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177–186.
[5] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (2014), pp. 647–655.
[6] Ellis, D. Chroma feature analysis and synthesis. Resources of Laboratory for the Recognition and Organization of Speech and Audio - LabROSA (2007).
[7] Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M., and Mikolov, T. DeViSE: A deep visual-semantic embedding model. In Neural Information Processing Systems (NIPS) (2013).
[8] Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (2013), IEEE, pp. 6645–6649.
[9] Gupta, S., Hoffman, J., and Malik, J. Cross modal distillation for supervision transfer. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on (2016), IEEE, pp. 2827–2836.
[10] Harte, C., Sandler, M., and Gasser, M. Detecting harmonic change in musical audio. In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia (2006), ACM, pp. 21–26.
[11] Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[12] Huang, J., and Kingsbury, B. Audio-visual deep learning for noise robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (2013), IEEE, pp. 7596–7599.
[13] Jiang, D.-N., Lu, L., Zhang, H.-J., Tao, J.-H., and Cai, L.-H. Music type classification by spectral contrast feature. In Multimedia and Expo, 2002. ICME'02. Proceedings. 2002 IEEE International Conference on (2002), vol. 1, IEEE, pp. 113–116.
[14] Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[15] Logan, B., et al. Mel frequency cepstral coefficients for music modeling. In ISMIR (2000), vol. 270, pp. 1–11.
[16] McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., and Nieto, O. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference (2015), pp. 18–25.
[17] Mills, D. L. Internet time synchronization: The Network Time Protocol. IEEE Transactions on Communications 39, 10 (1991), 1482–1493.
[18] Münzner, S., Schmidt, P., Reiss, A., Hanselmann, M., Stiefelhagen, R., and Dürichen, R. CNN-based sensor fusion techniques for multimodal human activity recognition. In Proceedings of the 2017 ACM International Symposium on Wearable Computers (2017), ACM, pp. 158–165.
[19] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (2011), pp. 689–696.
[20] Radu, V., Lane, N. D., Bhattacharya, S., Mascolo, C., Marina, M. K., and Kawsar, F. Towards multimodal deep learning for activity recognition on mobile devices. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct (2016), ACM, pp. 185–188.
[21] Radu, V., Tong, C., Bhattacharya, S., Lane, N. D., Mascolo, C., Marina, M. K., and Kawsar, F. Multimodal deep learning for activity and context recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 4 (2018), 157.
[22] Shen, C., Ho, B.-J., and Srivastava, M. MiLift: Efficient smartwatch-based workout tracking using automatic segmentation. IEEE Transactions on Mobile Computing (2017).
[23] Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems (2013), pp. 935–943.
[24] Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Computer Vision (ICCV), 2015 IEEE International Conference on (2015), IEEE, pp. 4489–4497.
[25] Yang, J., Nguyen, M. N., San, P. P., Li, X., and Krishnaswamy, S. Deep convolutional neural networks on multichannel time series for human activity recognition. In IJCAI (2015), pp. 3995–4001.
