This document is a slide presentation on recent advances in deep learning. It discusses self-supervised learning, which involves using unlabeled data to learn representations by predicting structural information within the data. The presentation covers pretext tasks, invariance-based approaches, and generation-based approaches for self-supervised learning in computer vision and natural language processing. It provides examples of specific self-supervised methods like predicting image rotations, clustering representations to generate pseudo-labels, and masked language modeling.
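One of the pretext tasks mentioned, masked language modeling, can be illustrated with a minimal sketch (plain Python; the `mask_tokens` helper, mask rate, and token list are illustrative assumptions, not from the slides):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; the pretext task is
    to predict the original token at each masked position."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)    # original token the model must predict
        else:
            masked.append(tok)
            targets.append(None)   # no loss at unmasked positions
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.5)
```

Because the inputs are corrupted versions of the data itself, no human labels are needed; the supervision signal comes entirely from the structure of the text.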
This document summarizes recent research on applying self-attention mechanisms from Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to divided image patches. It also covers more general attention modules such as the Perceiver, which aims to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities through frozen weights, showing that they can function as universal computation engines.
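The patch-based input used by ViT-style models can be sketched as follows (a minimal NumPy example; the `image_to_patches` helper and the patch/image sizes are illustrative assumptions, not taken from the papers):

```python
import numpy as np

def image_to_patches(img, patch=4):
    """Split an image (H, W, C) into non-overlapping patch x patch tiles and
    flatten each tile into a vector -- the "token" sequence a ViT-style
    Transformer consumes (before the learned linear projection)."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    tiles = img.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)       # (H/p, W/p, p, p, C)
    return tiles.reshape(-1, patch * patch * c)  # (num_patches, p*p*C)

img = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
patches = image_to_patches(img, patch=4)
print(patches.shape)  # (4, 48)
```

Treating each flattened patch as one token is what lets the otherwise language-oriented Transformer architecture operate on images with almost no modification.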
Presentation slides from the ICASSP 2019 speech & acoustics paper reading session (https://connpass.com/event/128527/).
They introduce the AASP (Audio and Acoustic Signal Processing) field and survey trends at ICASSP 2019. #icassp2019jp
Utilization of social networking services toward accessible academic meetings
From the viewpoint of ``reasonable information accessibility,''
the popularization of real-time social networking services,
such as Twitter and Ustream,
can be regarded as an opportunity to turn academic meetings
into ``universally designed'' events.
Accessibility should always be considered to some extent
without excessive cost.
A captioning service using automatic speech recognition is also investigated
as part of universally designed events using social media.
Interspeech 2019 paper reading session @ Sony, 2019/11/24
References
wav2vec: Unsupervised Pre-training for Speech Recognition, Steffen Schneider, et al., https://arxiv.org/abs/1904.05862
Representation Learning with Contrastive Predictive Coding, Aaron van den Oord, et al., https://arxiv.org/abs/1807.03748
Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech, Yu-An Chung, et al., https://arxiv.org/abs/1803.08976
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, et al., https://arxiv.org/abs/1810.04805
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, Jiasen Lu, et al., https://arxiv.org/abs/1908.02265
RoBERTa: A Robustly Optimized BERT Pretraining Approach, Yinhan Liu, et al., https://arxiv.org/abs/1907.11692
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations, Alexei Baevski, et al., https://arxiv.org/abs/1910.05453