Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that predict masked patches of the input image.
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
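To make the masked-prediction idea above concrete, here is a minimal NumPy sketch: patchify an image ViT-style, hide most patches, predict the hidden ones from the visible context, and compute the loss only on masked positions. The "model" here is a deliberately trivial stand-in (a fixed linear map on the visible-patch mean), not any specific paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an HxW image into non-overlapping p x p patches, flattened."""
    h, w = img.shape
    return img.reshape(h // p, p, w // p, p).swapaxes(1, 2).reshape(-1, p * p)

# Toy image and patch grid (ViT-style tokenisation).
img = rng.standard_normal((8, 8))
patches = patchify(img, 2)            # 16 patches of 4 values each

# Randomly mask 75% of the patches, as in masked-prediction pretraining.
n = patches.shape[0]
mask = rng.permutation(n) < int(0.75 * n)

# Stand-in "model": predict every masked patch from the mean of the
# visible patches via a fixed linear map. (A real ViT encoder/decoder
# with learned weights goes here.)
context = patches[~mask].mean(axis=0)
W = rng.standard_normal((patches.shape[1], patches.shape[1])) * 0.1
pred = context @ W

# The self-supervised loss is computed only on the masked patches.
loss = np.mean((patches[mask] - pred) ** 2)
```

The key structural points, shared by the real methods, are that the target is the image itself (no labels) and that the reconstruction error is restricted to the masked positions.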
Presentation slides for the Computer Vision Study Group @ Kanto "ECCV 2018 Reading Session", held on 2018/10/20.
Yew, Z. J., & Lee, G. H. (2018). 3DFeat-Net: Weakly Supervised Local 3D Features for Point Cloud Registration. European Conference on Computer Vision.
This document summarizes a paper titled "DeepI2P: Image-to-Point Cloud Registration via Deep Classification". The paper proposes a method for estimating the camera pose within a point cloud map using a deep learning model. The model first classifies whether points in the point cloud fall within the camera's frustum or image grid. It then performs pose optimization to estimate the camera pose by minimizing the projection error of inlier points onto the image. The method achieves more accurate camera pose estimation compared to existing techniques based on feature matching or depth estimation. It provides a new approach for camera localization using point cloud maps without requiring cross-modal feature learning.
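The frustum classification step described above can be illustrated with a purely geometric version: project camera-frame 3D points through assumed pinhole intrinsics and test whether they land inside the image. In DeepI2P this inside/outside label is what the deep network learns to predict without an explicit pose; the code below only shows how the label is defined, with toy intrinsics and points.

```python
import numpy as np

def in_frustum(points_cam, K, img_w, img_h):
    """Classify which camera-frame 3D points project inside the image.

    points_cam: (N, 3) points already expressed in the camera frame.
    K: 3x3 pinhole intrinsics. Returns a boolean mask per point.
    """
    z = points_cam[:, 2]
    uv = (K @ points_cam.T).T              # homogeneous pixel coordinates
    u = uv[:, 0] / uv[:, 2]
    v = uv[:, 1] / uv[:, 2]
    # Inside the frustum: in front of the camera and within image bounds.
    return (z > 0) & (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)

# Assumed toy intrinsics: focal length 100, principal point (32, 32).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0,   0.0,  1.0]])
pts = np.array([[0.0, 0.0, 5.0],     # straight ahead  -> inside
                [0.0, 0.0, -5.0],    # behind the camera -> outside
                [10.0, 0.0, 5.0]])   # far off to the side -> outside
labels = in_frustum(pts, K, 64, 64)
```

The pose-optimization stage then searches for the camera pose that makes the network's predicted inside/outside labels geometrically consistent with a projection like this one.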
Presentation slides from the 4th All-Japan Computer Vision Study Group session "Paper Reading Session on Recognizing and Understanding People", held on 2020/10/10.
The following two papers were covered:
Harmonious Attention Network for Person Re-identification (CVPR 2018)
Weakly Supervised Person Re-Identification (CVPR 2019)
24. An Example Using CRFs
Because a CRF can build knowledge about the target classes into the model, CRF-based methods perform well on semantic segmentation. For segmentation that is not semantic, MRFs are more commonly used.
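How class knowledge enters the model can be seen in the energy itself: a minimal sketch of a grid CRF, with per-pixel unary costs plus a pairwise term whose cost matrix encodes which labels may plausibly sit next to each other. This is a generic Potts-style illustration, not the model of any one paper below.

```python
import numpy as np

def crf_energy(unary, labels, pair_cost):
    """Energy of a grid CRF: per-pixel unary costs plus a pairwise term
    over 4-neighbour edges. pair_cost[a, b] encodes class knowledge,
    i.e. how implausible it is for label a to border label b."""
    h, w = labels.shape
    e = unary[np.arange(h)[:, None], np.arange(w), labels].sum()
    e += pair_cost[labels[:, :-1], labels[:, 1:]].sum()   # horizontal edges
    e += pair_cost[labels[:-1, :], labels[1:, :]].sum()   # vertical edges
    return e

# Two classes; Potts cost penalising disagreement between neighbours.
pair_cost = np.array([[0.0, 1.0],
                      [1.0, 0.0]])
unary = np.zeros((2, 2, 2))             # uniform unaries for illustration
smooth = np.array([[0, 0], [0, 0]])     # one coherent region
noisy = np.array([[0, 1], [1, 0]])      # checkerboard labelling
e_smooth = crf_energy(unary, smooth, pair_cost)
e_noisy = crf_energy(unary, noisy, pair_cost)
```

Inference picks the labelling of minimum energy, so the pairwise term favours the coherent labelling over the checkerboard even when the unaries are indifferent.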
X. He, R. S. Zemel, M. A. Carreira-Perpinan, "Multiscale Conditional Random Fields for Image Labeling", CVPR 2004
J. Shotton, J. Winn, C. Rother, A. Criminisi, "TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context", IJCV 2009
P. Krahenbuhl, V. Koltun, "Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials", NIPS 2011
P. Arbelaez, B. Hariharan, C. Gu, S. Gupta, L. D. Bourdev, J. Malik, "Semantic Segmentation using Regions and Parts", CVPR 2012 (non-CRF)
25. CRF for Image Labeling (He, et al., 2004)
The first paper to apply CRFs to semantic segmentation. It builds and optimizes a model that accounts for local features, global features, and the spatial relationships between labels.
28. Semantic Segmentation using Regions and Parts (Arbelaez, et al., 2012)
First performs a rough region segmentation, computes multi-class scores for each region, and then uses those scores as features for labeling.
Instead of a CRF, per-pixel scores are obtained by aggregating the scores of the regions.
(Figure: pipeline combining SVM scoring, part compatibility, global appearance, semantic contours, geometrical properties, and multi-class outputs)
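The aggregation step at the heart of this non-CRF approach can be sketched in a few lines: broadcast per-region class scores out to every pixel of the region, then label each pixel by argmax. The region map and scores below are toy inputs; in the paper the scores come from the SVM/part/appearance pipeline shown in the figure.

```python
import numpy as np

def pixel_scores(region_map, region_scores):
    """Broadcast per-region class scores to per-pixel scores and label
    each pixel by argmax - the non-CRF aggregation step described above."""
    scores = region_scores[region_map]      # (H, W, n_classes)
    return scores, scores.argmax(axis=-1)

# Toy over-segmentation: two regions on a 2x4 grid.
region_map = np.array([[0, 0, 1, 1],
                       [0, 0, 1, 1]])
# Multi-class scores per region (e.g. from per-region classifiers).
region_scores = np.array([[0.9, 0.1],      # region 0 prefers class 0
                          [0.2, 0.8]])     # region 1 prefers class 1
scores, labels = pixel_scores(region_map, region_scores)
```

Because every pixel in a region inherits the same score vector, label boundaries can only fall on region boundaries, which is why the quality of the initial segmentation matters so much here.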
36. Examples of Neural-Network-Based Methods
P. H. Pinheiro, R. Collobert, "Recurrent Convolutional Neural Networks for Scene Labeling", ICML 2014
J. Long, E. Shelhamer, T. Darrell, "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015
S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. H. S. Torr, "Conditional Random Fields as Recurrent Neural Networks", ICCV 2015
H. Noh, S. Hong, B. Han, "Learning Deconvolution Network for Semantic Segmentation", ICCV 2015
G. Lin, C. Shen, A. van den Hengel, I. Reid, "Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation", CVPR 2016
P. Isola, J. Y. Zhu, T. Zhou, A. A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks", arXiv:1611.07004, 2016
37. RCNN for Scene Labeling (Pinheiro and Collobert, 2014)
A network f predicts a label for each pixel; the prediction is then added to the input and f is applied again, repeatedly, so that label accuracy improves stage by stage.
It does not evaluate context (the spatial relationships between labels), which would correspond to the smoothing term of a CRF; in effect it simply classifies each pixel's label from its features.
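The feed-the-prediction-back loop can be sketched as follows. This is only the control flow of the recurrent scheme, not the paper's trained CNN: the stand-in f below is a fixed per-pixel linear map with a softmax, taking the image features concatenated with the previous class probabilities.

```python
import numpy as np

def recurrent_labeling(features, f, n_iter=3, n_classes=2):
    """Apply f repeatedly, each time feeding the previous per-pixel class
    probabilities back in alongside the image features (the recurrent
    refinement scheme of Pinheiro and Collobert, in skeleton form)."""
    h, w, _ = features.shape
    probs = np.full((h, w, n_classes), 1.0 / n_classes)  # uninformative start
    for _ in range(n_iter):
        inp = np.concatenate([features, probs], axis=-1)
        probs = f(inp)
    return probs.argmax(axis=-1)

def toy_f(inp):
    """Stand-in per-pixel 'network': softmax over a fixed linear map.
    A real implementation uses a learned CNN with spatial context here."""
    W = np.array([[2.0, -2.0],     # image feature channel
                  [0.5, -0.5],     # previous prob of class 0
                  [-0.5, 0.5]])    # previous prob of class 1
    logits = inp @ W
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

features = np.array([[[1.0], [-1.0]]])   # 1x2 "image", one feature channel
labels = recurrent_labeling(features, toy_f)
```

Note that, as the slide observes, each pixel's output depends only on its own input vector; nothing in this loop couples neighbouring labels the way a CRF's pairwise term would.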