Enabling Edge Devices that Learn from Each Other: Cross Modal Training for Activity Recognition

ABSTRACT
Edge devices rely extensively on machine learning for intelligent inferences and pattern matching. However, edge devices use a multitude of sensing modalities and are exposed to wide-ranging contexts. It is difficult to develop separate machine learning models for each scenario, as manual labeling is not scalable. To reduce the amount of labeled data and to speed up the training process, we propose to transfer knowledge between edge devices by using unlabeled data. Our approach, called RecycleML, uses cross modal transfer to accelerate the learning of edge devices across different sensing modalities. Using human activity recognition as a case study, over our collected CMActivity dataset, we observe that RecycleML reduces the amount of required labeled data by at least 90% and speeds up the training process by up to 50 times in comparison to training the edge device from scratch.

CCS CONCEPTS
• Computing methodologies → Transfer learning; Neural networks; Learning latent representations; • Hardware → Sensor applications and deployments;

KEYWORDS
edge devices, transfer learning, cross modality, shared latent representation, activity recognition

ACM Reference format:
Tianwei Xing, Sandeep Singh Sandha, Bharathan Balaji, Supriyo Chakraborty, and Mani Srivastava. 2018. Enabling Edge Devices that Learn from Each Other: Cross Modal Training for Activity Recognition. In Proceedings of EdgeSys '18: International Workshop on Edge Systems, Analytics and Networking, Munich, Germany, June 10–15, 2018 (EdgeSys '18), 6 pages. https://doi.org/10.1145/3213344.3213351
∗ Both authors contributed equally to this work.

1 INTRODUCTION
Edge devices are typically equipped with a wide variety of sensing modalities for tracking environmental markers. To provide insights and enable context-aware applications (e.g., user activity recognition [25], workout tracking [22], speech recognition [8]), the data collected on these devices are used to train deep neural network models. However, to fully realize the learning-at-the-edge paradigm, several challenges still need to be addressed. In particular, the model training process needs to handle insufficient labeled data and the heterogeneity in inter-device sensing modalities.
As a step towards addressing these concerns, we propose RecycleML, a mechanism to transfer knowledge between edge devices. Our approach is guided by the observation that application-specific semantic concepts can be better associated with features in the higher layers (close to the output side) of a network model [5]. This observation allows us to conceptualize the layers of the different networks as an hourglass model, as shown in Figure 1. The lower half of the hourglass corresponds to the lower layers (close to the input side) of the individual models (trained on specific sensing modalities). The narrow waist is the common layer (latent space) into which the lower layers project their data for knowledge transfer. The upper half of the hourglass comprises the task-specific higher-layer features, which are trained in a targeted fashion for task-specific transfer.
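The hourglass structure can be made concrete with a short sketch. The following minimal PyTorch illustration is not the paper's architecture: the layer widths, the 128-dimensional waist, and the fully-connected encoders are hypothetical placeholders; only the overall shape (modality-specific lower layers projecting into a common latent space, with task-specific higher layers on top) follows the description above.

```python
# Minimal sketch of the hourglass model: per-modality lower layers, a shared
# latent "waist", and task-specific higher layers. Dimensions are assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 128   # width of the hourglass waist (assumed)
NUM_CLASSES = 7    # 7 activities, as in the CMActivities dataset


def make_lower_layers(input_dim: int) -> nn.Module:
    """Lower half of the hourglass: modality-specific layers that project
    raw features into the shared latent space."""
    return nn.Sequential(
        nn.Linear(input_dim, 256), nn.ReLU(),
        nn.Linear(256, LATENT_DIM), nn.ReLU(),
    )


class HourglassModel(nn.Module):
    """One modality's model: lower layers -> shared latent -> task head."""
    def __init__(self, lower: nn.Module, head: nn.Module):
        super().__init__()
        self.lower = lower      # trained per modality
        self.head = head        # task-specific higher layers, can be reused

    def forward(self, x):
        return self.head(self.lower(x))


# Example: a vision model and an IMU model that meet at the same latent waist
# and share the same task-specific head. Input dimensions are placeholders.
activity_head = nn.Linear(LATENT_DIM, NUM_CLASSES)
vision_model = HourglassModel(make_lower_layers(input_dim=1024), activity_head)
imu_model = HourglassModel(make_lower_layers(input_dim=300), activity_head)
```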
To evaluate RecycleML, we emulate edge devices with three sensing modalities, namely vision, audio and inertial (IMU) sensing, as shown in Figure 2. We perform zero-shot learning [23], i.e., we use zero training labels, across different sensing modalities when they are performing the same classification task. We achieve this by training the target edge device model to have the same latent space as the source model. RecycleML can also learn to expand the classification tasks of the transferred model with very few training examples.
Our results across a mix of sensory substitutions and task transfers show that, over our collected CMActivity dataset, RecycleML reduces the amount of labeled data required to train edge devices by at least 90% and speeds up the training process by up to 50 times after doing knowledge transfer using unlabeled data.
Our contributions are as follows:
(1) We combine the idea of transfer learning (lower-layer transfer) with sensory substitution (higher-layer transfer) and propose a unified framework in which the knowledge in every part of a network can be transferred.
We explore two different methods of task transfer:
• PureTransfer directly uses the higher layers of model M_X for the new model M_Y. In this case no further training is needed and no labeled data is required.
• Transfer+LimitedTrain freezes the network g, adds higher layers to M_Y, and retrains only the higher layers using limited labeled data.
In the first scenario, since the tasks are the same, we can use both methods. In the second and third scenarios, direct transfer of the higher layers from model M_X to model M_Y does not work, as M_X does not give the desired output. Hence, we use the second method. In our experiments, we evaluate scenario (i) of task transfer using both PureTransfer and Transfer+LimitedTrain, and scenario (ii) using Transfer+LimitedTrain.
In our experiments, we used the output of the last hidden layer, after removing the final output layer from model M_X, as the f transformation. Here f and g serve as shared latent representations across modalities. We add a single task-specific layer to g to generate model M_Y. In future work, we will explore different choices of f and the addition of multiple task-specific output layers to g.
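The transfer step implied by this description can be sketched as follows. This is a hedged illustration, not the paper's code: the mean-squared-error alignment loss, the optimizer settings, and the helper names are assumptions, since the exact procedure is defined in Section 2.2.1 (not reproduced in this excerpt).

```python
# Hedged sketch: train the target modality's lower network g so that its latent
# output matches f(x) on synchronized unlabeled pairs, then build the task head
# either by reusing the source head (PureTransfer) or by training a new head on
# a small labeled set (Transfer+LimitedTrain). Loss choice is an assumption.
import torch
import torch.nn as nn


def align_latent(f_source: nn.Module, g_target: nn.Module,
                 paired_loader, epochs: int = 500, lr: float = 1e-3):
    """Match g(y) to f(x) on time-synchronized, unlabeled (x, y) pairs."""
    f_source.eval()                      # pre-trained source lower layers stay fixed
    opt = torch.optim.Adam(g_target.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x_src, y_tgt in paired_loader:
            with torch.no_grad():
                target_latent = f_source(x_src)
            loss = mse(g_target(y_tgt), target_latent)
            opt.zero_grad()
            loss.backward()
            opt.step()


def pure_transfer(g_target: nn.Module, source_head: nn.Module) -> nn.Module:
    """PureTransfer: reuse the source model's higher layers unchanged."""
    return nn.Sequential(g_target, source_head)


def transfer_limited_train(g_target: nn.Module, head: nn.Module,
                           labeled_loader, epochs: int = 100, lr: float = 1e-3):
    """Transfer+LimitedTrain: freeze g and train only the new head on a
    small labeled set."""
    for p in g_target.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for y_tgt, labels in labeled_loader:
            loss = ce(head(g_target(y_tgt)), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return nn.Sequential(g_target, head)
```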
3 EVALUATION

3.1 Dataset
For our experiments, we collected a new dataset, called CMActivities, composed of videos for the vision and audio modalities and corresponding IMU data (accelerometer and gyroscope) from sensors on the left and right wrists. We collected 767 videos of roughly 10 seconds each from 2 users doing 7 different activities at 6 locations. Every video contains a single activity and is used to label the vision, audio and IMU data. The total duration of collected data for each modality is 125 minutes.

Table 1: Description of CMActivities dataset (Activity, Number of Videos, Duration (sec))

We checked the time synchronization across the devices and observed a maximum time difference of 0.5 seconds between the observer smartphone and the user smartphone. We leave it for future work to explore the effect of poor time synchronization across devices in observing the same event. We expect the knowledge transfer capabilities of RecycleML to degrade as the time difference between devices increases.
The details of CMActivities are shown in Table 1. The data collection was done at different locations, with the two users wearing separate sets of clothes at each location, so as to make sure that the trained classifier learns the activity features and is least affected by environmental factors. We split the 767 videos and IMU sessions into three parts: a training dataset (624), a testing dataset (71) and a personalization dataset (72). The training and testing datasets contain 7 activities at 5 different locations, and the personalization dataset contains 5 activities at the 6th location. The Go Upstairs and Go Downstairs activities are not present in the personalization dataset.
The training dataset is further split into 3 parts: a Pre-Training set, a Transfer set and a LimitTrain set. The personalization dataset is split into PersonalTrain and PersonalTest sets. The testing dataset is used only for evaluation. The frame rate of the video is 29 fps, and the sampling frequencies of audio and IMU are 22050 Hz and 25 Hz respectively. We use a window of 2 seconds to extract vision, audio and IMU features from the dataset, with a sliding window of 0.4 seconds between consecutive windows. In the case of vision and IMU, we use the raw features directly as input to the models. We extracted features from the raw audio data using Librosa [16] and use them as the input features. Specifically, we extract mel-frequency cepstral coefficients (MFCC) [15], power spectrogram [6], mel-scaled spectrogram, spectral contrast [13] and tonal centroid features (tonnetz) [10].
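A minimal sketch of this windowing and audio featurization is given below, assuming librosa defaults where the paper does not specify parameters; the number of MFCCs and the mean-over-frames aggregation are illustrative choices.

```python
# Hedged sketch of the 2 s / 0.4 s sliding-window audio featurization described
# above. n_mfcc and the aggregation of per-frame features are assumptions.
import numpy as np
import librosa

SR = 22050        # audio sampling rate, as stated in the text
WIN_SEC = 2.0     # window length in seconds
HOP_SEC = 0.4     # stride between consecutive windows


def window_features(y: np.ndarray, sr: int = SR) -> np.ndarray:
    """Concatenate the five audio features named in the text for one window."""
    power = np.abs(librosa.stft(y)) ** 2                      # power spectrogram
    return np.concatenate([
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).mean(axis=1),
        power.mean(axis=1),
        librosa.feature.melspectrogram(y=y, sr=sr).mean(axis=1),
        librosa.feature.spectral_contrast(S=np.sqrt(power), sr=sr).mean(axis=1),
        librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr).mean(axis=1),
    ])


def sliding_windows(y: np.ndarray, sr: int = SR):
    """Yield one feature vector per 2 s window, advancing 0.4 s at a time."""
    win, hop = int(WIN_SEC * sr), int(HOP_SEC * sr)
    for start in range(0, len(y) - win + 1, hop):
        yield window_features(y[start:start + win], sr)


# Example usage: features = list(sliding_windows(librosa.load("clip.wav", sr=SR)[0]))
```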
In total, we have 11976 samples in training (5000 samples for the Pre-Training set, 6000 samples for the Transfer set, and 976 samples for the LimitedTrain set), 1377 samples in testing, and 1592 samples in personalization (475 samples for the PersonalTrain set and 1117 samples for the PersonalTest set) for each modality.

Table 2: Testing accuracy of baseline models
(c) IMU Network is a CNN. It has 2 convolutional modules (convolution layer + max-pooling layer), 3 fully-connected layers and an output layer. 57K parameters are trainable in this network.
Table 2 shows the summary of the individual models. The models are trained using the training dataset and tested on the testing dataset. These baseline models are trained using the SGD [4] and Adam [14] optimizers with a learning rate of 0.001. We save the models with the best test accuracy after training for 500 epochs.
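A rough sketch of such an IMU network is shown below. Only the module counts come from the text; the channel counts, kernel sizes, and the assumed input of two-wrist accelerometer and gyroscope streams at 25 Hz are hypothetical and will not reproduce the reported 57K parameter count exactly.

```python
# Hedged sketch of the IMU baseline: 2 conv+pool modules, 3 fully-connected
# layers, and an output layer, per the text. All sizes below are assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 7          # 7 activities in CMActivities
IMU_CHANNELS = 12        # assumed: 3-axis accelerometer + gyroscope on two wrists
WINDOW_LEN = 50          # 2 s window at 25 Hz

imu_net = nn.Sequential(
    # Convolutional module 1
    nn.Conv1d(IMU_CHANNELS, 16, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool1d(2),
    # Convolutional module 2
    nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Flatten(),
    # Three fully-connected layers
    nn.Linear(32 * (WINDOW_LEN // 4), 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    # Output layer
    nn.Linear(32, NUM_CLASSES),
)

# Baseline training uses Adam with a learning rate of 0.001, as in the text.
optimizer = torch.optim.Adam(imu_net.parameters(), lr=0.001)
```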
3.3 Knowledge Transfer Results
Knowledge transfer results are presented in Table 3. In the first and second experiments, the vision device D_V1 is trained while the acoustic device D_A1 and the wearable device D_W, respectively, are untrained. In the third and fourth experiments, the wearable device D_W is trained while the vision device D_V2 and the acoustic device D_A2 are untrained. For each of these four transfers, we follow the same procedure. Taking vision device D_V1 to acoustic device D_A1 as an example, we first train the vision model of D_V1 from scratch using the Pre-Training set (5000 samples) of the training dataset. We use the standard SGD optimizer with a learning rate of 0.001, and the training finishes in 500 epochs. We then use D_V1 as a pre-trained device to transfer knowledge to D_A1, following the procedure described in Section 2.2.1. In the knowledge transfer process, we use the Adam optimizer with a learning rate of 0.001 and run it for 500 epochs. The data used in the transfer process are the synchronized, unlabeled vision and sound data from the Transfer set (6000 samples) of the training dataset.
After transfer, the higher layers of the audio model can be created using the two methods, Pure-Transfer and Transfer+LimitedTrain, discussed in Section 2.2.2, since both D_V1 and D_A1 are doing the same task. In the Pure-Transfer method, the audio model uses the output layer of the vision model directly. In Transfer+LimitedTrain, we train a new output layer for the audio model. We select a small labeled set of 500 samples randomly out of the 976 samples in the LimitedTrain set of the training dataset and name it LimitTrainSet. We use the LimitTrainSet to train the output layer of the audio model for 100 epochs using the Adam optimizer. As a comparison, we also trained an audio model from scratch using the same LimitTrainSet for 500 epochs. We use more epochs for training from scratch as it takes more time to converge. The other three transfers are tested in the same way. The audio and IMU models which are trained from scratch use the Adam optimizer. Note: in the Video to IMU transfer, it takes more time to transfer the knowledge, so we perform the knowledge transfer for 1000 epochs. In real implementations, the knowledge transfer process for edge devices can either be done in the background or at a server using unlabeled data, so as to avoid this overhead.
Table 3 shows the knowledge transfer results between devices doing the same task of activity recognition. Model performance is measured by test accuracy. Considering row 1, Trained-Device is the accuracy of the pre-trained device D_V1. Pure-Transfer and Transfer+LimitedTrain are the accuracies of device D_A1 using the two methods respectively. The last cell shows the accuracy of the audio model trained from scratch using LimitTrainSet. As we can see, both Pure-Transfer and Transfer+LimitedTrain achieve better accuracy than training from scratch. This shows that the shared latent feature representation is successful in transferring knowledge across devices of different modalities. We also observe that Transfer+LimitedTrain usually gives the best performance.

Figure 3: Transfer+LimitedTrain converges in 10 epochs whereas training from scratch requires training for around 500 epochs.

In our experiments, we train every model 10 times to preclude the effect of randomness. Based on the results, significance tests (compared to training from scratch) are carried out using a t-test. We find that Transfer+LimitedTrain outperforms training from scratch (p < 0.005) in three cases (Video to Audio, Video to IMU, IMU to Audio), and p < 0.4 for the case of IMU to Video transfer. This is because the video model is complicated and sensitive, and the performance of the video model trained from scratch fluctuates.
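The significance test can be reproduced with a few lines of SciPy. The sketch below assumes an independent two-sample t-test over the 10 runs per configuration (the paper does not state whether the test is paired), and the accuracy values are hypothetical placeholders rather than the paper's results.

```python
# Sketch of the significance test: compare 10 runs of Transfer+LimitedTrain
# against 10 runs of training from scratch. Accuracy values are placeholders.
from scipy import stats

transfer_limited_train_acc = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90, 0.93, 0.92, 0.90, 0.91]
from_scratch_acc           = [0.84, 0.86, 0.83, 0.85, 0.87, 0.84, 0.85, 0.83, 0.86, 0.84]

t_stat, p_value = stats.ttest_ind(transfer_limited_train_acc, from_scratch_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # the paper reports p < 0.005 for three of the four transfers
```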
Gupta et al. [9] perform supervision transfer between labeled RGB images and unlabeled depth and optical flow images. Aytar et al. [2] show that visual knowledge can be transferred from vision to sound.
The prior works either focus on image and text data, or take two modalities (vision and audio) from the same source into consideration. In RecycleML, we consider three commonly available sensing modalities on edge devices from multiple sources, and create a unified representation that bridges them. This allows edge devices to use multimodal knowledge transfer across the different sensing modalities of ambient sensors (vision and audio) and wearable sensors (IMU) for the first time.

5 DISCUSSION
While RecycleML shows promise both in handling the paucity of labeled data and in speeding up model training across multiple modalities, the ability of the approach to generalize to different applications and larger datasets needs further investigation. Furthermore, our experiments indicate that while the trained models can be personalized to a specific environment, they need regularization to generalize to new settings.
For cross modal knowledge transfer using RecycleML, we need unlabeled but synchronized data. In our experiments, since audio and video data are captured by the same device, they are naturally synchronized. In addition, we used the default smartphone timestamps, synchronized through the Network Time Protocol (NTP) [17] service, to synchronize the IMU device with the video and sound device. In real settings, however, edge devices have to be time synchronized in order to observe the same event at the same time.
In our experiments, we chose the fully connected layer (immediately prior to the output layer) as the common latent space. In future work, we plan to explore different choices for the shared representation layer, for efficient sensory substitution and task transfer on edge devices.

6 CONCLUSION
Heterogeneity in the sensing modalities of edge devices, together with the lack of labeled training data, represent two of the most significant barriers to enabling the learning-on-the-edge paradigm. Towards this end, we presented RecycleML, a system that enables multi-modality edge devices to perform knowledge transfer between their models by mapping their lower layers to a shared latent space representation. RecycleML further allows task-specific transfer between models by targeted retraining of the higher layers beyond the shared latent space, reducing the amount of labeled data needed for model training. Our initial experiments, performed using multi-modality data (vision, audio, IMU) for activity recognition, show that a transfer model trained using RecycleML leads to reduced training time and increased accuracy compared to an edge model trained from scratch using limited labeled data.

7 ACKNOWLEDGEMENT
This research was sponsored by the U.S. Army Research Laboratory and the UK Ministry of Defence under Agreement Number W911NF-16-3-0001, by the National Institutes of Health under award #U154EB020404, and by the National Science Foundation under award #1636916. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the funding agencies. The U.S. and UK Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

REFERENCES
[1] Aytar, Y., Castrejon, L., Vondrick, C., Pirsiavash, H., and Torralba, A. Cross-modal scene networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
[2] Aytar, Y., Vondrick, C., and Torralba, A. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (2016), pp. 892–900.
[3] Ba, J., and Caruana, R. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems (2014), pp. 2654–2662.
[4] Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177–186.
[5] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (2014), pp. 647–655.
[6] Ellis, D. Chroma feature analysis and synthesis. Resources of Laboratory for the Recognition and Organization of Speech and Audio (LabROSA) (2007).
[7] Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M., and Mikolov, T. DeViSE: A deep visual-semantic embedding model. In Neural Information Processing Systems (NIPS) (2013).
[8] Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (2013), IEEE, pp. 6645–6649.
[9] Gupta, S., Hoffman, J., and Malik, J. Cross modal distillation for supervision transfer. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on (2016), IEEE, pp. 2827–2836.
[10] Harte, C., Sandler, M., and Gasser, M. Detecting harmonic change in musical audio. In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia (2006), ACM, pp. 21–26.
[11] Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[12] Huang, J., and Kingsbury, B. Audio-visual deep learning for noise robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (2013), IEEE, pp. 7596–7599.
[13] Jiang, D.-N., Lu, L., Zhang, H.-J., Tao, J.-H., and Cai, L.-H. Music type classification by spectral contrast feature. In Multimedia and Expo, 2002. ICME'02. Proceedings. 2002 IEEE International Conference on (2002), vol. 1, IEEE, pp. 113–116.
[14] Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[15] Logan, B., et al. Mel frequency cepstral coefficients for music modeling. In ISMIR (2000), vol. 270, pp. 1–11.
[16] McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., and Nieto, O. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference (2015), pp. 18–25.
[17] Mills, D. L. Internet time synchronization: The Network Time Protocol. IEEE Transactions on Communications 39, 10 (1991), 1482–1493.
[18] Münzner, S., Schmidt, P., Reiss, A., Hanselmann, M., Stiefelhagen, R., and Dürichen, R. CNN-based sensor fusion techniques for multimodal human activity recognition. In Proceedings of the 2017 ACM International Symposium on Wearable Computers (2017), ACM, pp. 158–165.
[19] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (2011), pp. 689–696.
[20] Radu, V., Lane, N. D., Bhattacharya, S., Mascolo, C., Marina, M. K., and Kawsar, F. Towards multimodal deep learning for activity recognition on mobile devices. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct (2016), ACM, pp. 185–188.
[21] Radu, V., Tong, C., Bhattacharya, S., Lane, N. D., Mascolo, C., Marina, M. K., and Kawsar, F. Multimodal deep learning for activity and context recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 4 (2018), 157.
[22] Shen, C., Ho, B.-J., and Srivastava, M. MiLift: Efficient smartwatch-based workout tracking using automatic segmentation. IEEE Transactions on Mobile Computing (2017).
[23] Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems (2013), pp. 935–943.
[24] Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Computer Vision (ICCV), 2015 IEEE International Conference on (2015), IEEE, pp. 4489–4497.
[25] Yang, J., Nguyen, M. N., San, P. P., Li, X., and Krishnaswamy, S. Deep convolutional neural networks on multichannel time series for human activity recognition. In IJCAI (2015), pp. 3995–4001.