The document discusses the importance of knowing young people in order to evangelize them. Youth is seen as a significant theological reality, and young people need to hear about a God who is real in their experience of being young. The Church's evangelization should show young people the beauty and sacredness of their youth.
Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that predict masked patches of the input image.
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image (a minimal loss sketch follows this list).
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
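To make the self-distillation idea in item 3 concrete, below is a minimal DINO-style loss sketch in PyTorch. The function and argument names, the temperatures, and the EMA momentum are illustrative assumptions rather than the exact published configuration.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the sharpened, centered teacher distribution
    and the student distribution, for one pair of augmented views."""
    teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    student_logp = F.log_softmax(student_out / student_temp, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    """Teacher weights track an exponential moving average of the student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)
```

In the full method the loss is applied crosswise (student on one view against teacher on the other, and vice versa), and the centering term is itself a running mean of teacher outputs that helps prevent collapse.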
This document summarizes recent research on applying self-attention mechanisms from Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to divided image patches. It also covers more general attention modules such as the Perceiver, which aims to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities by freezing their weights, showing that they can function as universal computation engines.
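As a concrete illustration of the "divided image patches" step these models share, here is a minimal ViT-style patch-embedding sketch in PyTorch. The class name, image size, patch size, embedding dimension, and the zero initialization are illustrative choices, not taken from any particular implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each patch
    to a token embedding, as in ViT (a strided conv does both steps)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))            # init simplified
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                  # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, 196, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # (B, 1, dim)
        tokens = torch.cat([cls, tokens], dim=1)           # (B, 197, dim)
        return tokens + self.pos_embed

# The resulting token sequence is fed to a standard Transformer encoder,
# and the class token is typically used for classification.
```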
ConvMixer is a simple CNN-based model that achieves competitive results on ImageNet classification. It divides the input image into patches and embeds them into high-dimensional vectors, similar to ViT. Unlike ViT, however, it does not use attention; instead it applies simple convolutional layers between the patch embedding and classification layers. Experiments show that, despite its simplicity, ConvMixer outperforms similarly sized models such as ResNet, ViT, and MLP-Mixer on ImageNet, suggesting that the patch embedding itself may be as important as the attention mechanism for vision tasks.
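Because the architecture is so compact, it can be sketched directly. The PyTorch sketch below mirrors the structure described above (patch embedding, then depthwise convolutions for spatial mixing and pointwise convolutions for channel mixing, with residual connections); the hyperparameter defaults are chosen for illustration only.

```python
import torch.nn as nn

class Residual(nn.Module):
    """Wrap a sub-network with a skip connection: y = f(x) + x."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(dim=256, depth=8, kernel_size=9, patch_size=7, n_classes=1000):
    """Patch embedding, then `depth` blocks of depthwise conv (spatial mixing,
    with residual) and pointwise conv (channel mixing), then pooling + linear."""
    return nn.Sequential(
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim))),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim))
          for _ in range(depth)],
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(dim, n_classes))
```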
This document summarizes recent advances in single image super-resolution (SISR) using deep learning methods. It discusses early SISR networks like SRCNN, VDSR and ESPCN. SRResNet is presented as a baseline method, incorporating residual blocks and pixel shuffle upsampling. SRGAN and EDSR are also introduced, with EDSR achieving state-of-the-art PSNR results. The relationship between reconstruction loss, perceptual quality and distortion is examined. While PSNR improves yearly, a perception-distortion tradeoff remains. Developments are ongoing to produce outputs that are both accurately restored and naturally perceived.
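To illustrate the two building blocks highlighted above, here is a minimal PyTorch sketch of an SRResNet-style residual block and a pixel-shuffle upsampling block. The class names, channel counts, and the 2x-per-stage upsampling are illustrative assumptions rather than the exact published configuration.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv-BN-PReLU-Conv-BN with a skip connection, in the SRResNet style."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))
    def forward(self, x):
        return x + self.body(x)

class UpsampleBlock(nn.Module):
    """Double the resolution: a conv expands channels 4x, then
    nn.PixelShuffle rearranges them into a 2x larger spatial grid."""
    def __init__(self, channels=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels * 4, 3, padding=1),
            nn.PixelShuffle(2),
            nn.PReLU())
    def forward(self, x):
        return self.block(x)

# A 4x super-resolution network stacks residual blocks on low-resolution
# features, then applies two UpsampleBlocks before a final conv back to RGB.
```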
These are the slides used in the following video on the YouTube nnabla channel.
【Deep Learning Training】Transformer Fundamentals and Applications -- Part 3: Applying Transformers to Images
https://youtu.be/rkuayDInyF0
【References】
・Deep Residual Learning for Image Recognition
https://arxiv.org/abs/1512.03385
・An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
https://arxiv.org/abs/2010.11929
・On the Relationship between Self-Attention and Convolutional Layers
https://arxiv.org/abs/1911.03584
・Image Style Transfer Using Convolutional Neural Networks
https://ieeexplore.ieee.org/document/7780634
・Are Convolutional Neural Networks or Transformers more like human vision?
https://arxiv.org/abs/2105.07197
・How Do Vision Transformers Work?
https://arxiv.org/abs/2202.06709
・Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
https://arxiv.org/abs/1610.02391
・Quantifying Attention Flow in Transformers
https://arxiv.org/abs/2005.00928
・Transformer Interpretability Beyond Attention Visualization
https://arxiv.org/abs/2012.09838
・End-to-End Object Detection with Transformers
https://arxiv.org/abs/2005.12872
・SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
https://arxiv.org/abs/2105.15203
・Training data-efficient image transformers & distillation through attention
https://arxiv.org/abs/2012.12877
・Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
https://arxiv.org/abs/2103.14030
・Masked Autoencoders Are Scalable Vision Learners
https://arxiv.org/abs/2111.06377
・Emerging Properties in Self-Supervised Vision Transformers
https://arxiv.org/abs/2104.14294
・Scaling Laws for Neural Language Models
https://arxiv.org/abs/2001.08361
・Learning Transferable Visual Models From Natural Language Supervision
https://arxiv.org/abs/2103.00020
・Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
https://arxiv.org/abs/2403.03206
・Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
https://arxiv.org/abs/2402.17177
・SSII2024 Technology Map
https://confit.atlas.jp/guide/event/ssii2024/static/special_project_tech_map
Presentation slides from the cvpaper.challenge2019 Meta Study Group.
A Meta Study conducted after the survey of deep learning for point clouds ( https://www.slideshare.net/naoyachiba18/ss-120302579 ).
Daniel Neimark, Omri Bar, Maya Zohar, Dotan Asselmann; Video Transformer Network, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021, pp. 3163-3172
https://openaccess.thecvf.com/content/ICCV2021W/CVEU/html/Neimark_Video_Transformer_Network_ICCVW_2021_paper.html
https://arxiv.org/abs/2102.00719