6. Datasets
Oakland 3D Point Cloud Dataset
Munoz, D., Bagnell, J. A., Vandapel, N., & Hebert, M. (2009). Contextual Classification with Functional Max-Margin Markov Networks. In IEEE Conference on Computer Vision and Pattern Recognition.
Paris-rue-Madame
Serna, A., Marcotegui, B., Goulette, F., & Deschaud, J.-E. (2014). Paris-rue-Madame database: a 3D mobile laser scanner dataset for benchmarking urban detection, segmentation and classification methods. In International Conference on Pattern Recognition Applications and Methods (ICPRAM).
IQmulus
Bredif, M., Vallet, B., Serna, A., Marcotegui, B., & Paparoditis, N. (2015). TerraMobilita/iQmulus Urban Point Cloud Analysis Benchmark. Computers and Graphics, 49, 126–133.
7. Datasets
Semantic3D
Hackel, T., Savinov, N., Ladicky, L., Wegner, J. D., Schindler, K., & Pollefeys, M. (2017). Semantic3D.net: A New Large-Scale Point Cloud Classification Benchmark. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, IV-1-W1, 91–98.
Paris-Lille-3D
Roynard, X., Deschaud, J.-E., & Goulette, F. (2018). Paris-Lille-3D: a large and high-quality ground truth urban point cloud dataset for automatic segmentation and classification. In IEEE Conference on Computer Vision and Pattern Recognition Workshops.
8. Oakland 3D Point Cloud Dataset
Point cloud data + labels acquired around CMU in Oakland
http://www.cs.cmu.edu/~vmr/datasets/oakland_3d/cvpr09/doc/
Acquired with a SICK LMS laser scanner mounted on the side of a vehicle
1.61M points
44 category labels
9. Paris-rue-Madame
Point clouds and labels acquired with a Mobile Laser System (MLS) along an approximately 160 m stretch of rue Madame in Paris
http://www.cmm.mines-paristech.fr/~serna/rueMadameDataset.html
20M points
17 classes
(Table: object labels and their object classes)
12. Paris-Lille-3D
Point cloud + label dataset acquired in Paris and Lille using a Mobile Laser System (MLS)
http://npm3d.fr/paris-lille-3d
Total length: 1,940 m
143.1M points
50 classes
13. Semantic Segmentation using LiDAR
[Hackel2016] Hackel, T., Wegner, J. D., & Schindler, K. (2016). Fast Semantic Segmentation of 3D Point Clouds with Strongly Varying Density. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 3(July).
[Thomas2018] Thomas, H., Deschaud, J.-E., Marcotegui, B., Goulette, F., & Le Gall, Y. (2018). Semantic Classification of 3D Point Clouds with Multiscale Spherical Neighborhoods. International Conference on 3D Vision (3DV).
[Tchapmi2017] Tchapmi, L. P., Choy, C. B., Armeni, I., Gwak, J., & Savarese, S. (2017). SEGCloud: Semantic Segmentation of 3D Point Clouds. In International Conference on 3D Vision (3DV).
[Dewan2017] Dewan, A., Oliveira, G. L., & Burgard, W. (2017). Deep Semantic Classification for 3D LiDAR Data. In International Conference on Intelligent Robots and Systems.
[Boulch2017] Boulch, A., Le Saux, B., & Audebert, N. (2017). Unstructured point cloud semantic labeling using deep segmentation networks. In Eurographics Workshop on 3D Object Retrieval.
14. Semantic Segmentation using LiDAR
[Roynard2018] Roynard, X., Deschaud, J.-E., & Goulette, F. (2018). Classification of Point Cloud Scenes with Multiscale Voxel Deep Network. ArXiv, 1804.03583.
[Landrieu2018] Landrieu, L., & Simonovsky, M. (2018). Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs. IEEE Conference on Computer Vision and Pattern Recognition.
[Wu2018] Wu, B., Wan, A., Yue, X., & Keutzer, K. (2018). SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud. IEEE International Conference on Robotics and Automation (ICRA).
[Wu2018_2] Wu, B., Zhou, X., Zhao, S., Yue, X., & Keutzer, K. (2018). SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a LiDAR Point Cloud.
[Ye2018] Ye, X., Li, J., Du, L., & Zhang, X. (2018). 3D Recurrent Neural Networks with Context Fusion for Point Cloud Semantic Segmentation. In European Conference on Computer Vision.
22. [Dewan2017] Deep Semantic Classification for LiDAR Data (1/4)
Labels each point in the point cloud as one of three types: Movable, Non-movable, or Dynamic (currently in motion)
Projects the point cloud onto a 3-channel image (depth, height, intensity) and estimates objectness with a CNN (Fast-Net); a projection sketch follows after this overview
Estimates the per-point motion (6 DoF) from two point clouds using rigid flow
Estimates the label of each point with a Bayes filter based on the objectness and the point motion
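A minimal sketch of the projection step above, assuming a spherical (range-image) projection; the image size and vertical field of view are illustrative values only and are not taken from the paper:

```python
import numpy as np

def project_to_image(points, intensity, h=64, w=512,
                     fov_up=np.deg2rad(3.0), fov_down=np.deg2rad(-25.0)):
    """Project an (N, 3) LiDAR point cloud onto an h x w image with three
    channels: depth, height (z), and intensity.  Resolution and field of
    view are placeholder values, not the paper's settings."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)

    yaw = np.arctan2(y, x)                          # horizontal angle
    pitch = np.arcsin(z / np.maximum(depth, 1e-6))  # vertical angle

    # map angles to pixel coordinates
    u = ((yaw + np.pi) / (2 * np.pi) * w).astype(np.int32) % w
    v = np.clip(((fov_up - pitch) / (fov_up - fov_down) * h).astype(np.int32), 0, h - 1)

    img = np.zeros((h, w, 3), dtype=np.float32)
    order = np.argsort(-depth)          # write far points first, keep the closest
    img[v[order], u[order], 0] = depth[order]       # depth channel
    img[v[order], u[order], 1] = z[order]           # height channel
    img[v[order], u[order], 2] = intensity[order]   # intensity channel
    return img
```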
23. [Dewan2017] Deep Semantic Classification for LiDAR Data (2/4)
Fast-Net
Oliveira, G. L., Burgard, W., & Brox, T. (2016). Efficient Deep Models for Monocular Road Segmentation. In International Conference on Intelligent Robots and Systems.
Rigid Flow
Dewan, A., Caselitz, T., Tipaldi, G. D., & Burgard, W. (2016). Rigid Scene Flow for 3D LiDAR Scans. In International Conference on Intelligent Robots and Systems.
Computes the 6-DoF motion τ_i of each point by maximizing an objective φ between the two point clouds that keeps the motions of neighboring points similar and makes the features of corresponding points in the two point clouds close (a schematic form is sketched below)
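Only the two properties stated on the slide are formalized here; this is a schematic of the general form of such an objective, not the paper's actual factor-graph formulation, and ψ_data, ψ_reg, f_1, f_2 are placeholders:

```latex
\phi(\tau_1,\dots,\tau_N)=
\underbrace{\sum_{i}\psi_{\mathrm{data}}\!\bigl(f_1(p_i),\,f_2(\tau_i\oplus p_i)\bigr)}_{\text{features of corresponding points should agree}}
+\underbrace{\sum_{(i,j)\in\mathcal{N}}\psi_{\mathrm{reg}}(\tau_i,\tau_j)}_{\text{neighboring motions should be similar}}
```

Here τ_i ⊕ p_i denotes applying the rigid motion τ_i to point p_i, f_1 and f_2 are per-point features of the two scans, and the second sum runs over neighboring point pairs.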
24. [Dewan2017] Deep Semantic Classification for LiDAR Data (3/4)
Bayes Filter
At each time t, estimates for every point the probability distribution over the labels x_t ∈ {dynamic, movable, non-movable}
Observations: the point motion and the objectness (whether the point belongs to an object)
Each observation model is defined in the original paper
The distribution can be computed recursively by propagating the belief from the previous frame (a minimal sketch follows below)
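A minimal sketch of such a recursive Bayes filter over the three labels, with placeholder transition and observation likelihoods (the actual motion and objectness models are those defined in the paper):

```python
import numpy as np

LABELS = ["dynamic", "movable", "non-movable"]

def bayes_filter_step(prior, transition, motion_lik, objectness_lik):
    """One recursive update of a per-point label belief.

    prior          : (3,) belief over LABELS from the previous frame
    transition     : (3, 3) label transition model p(x_t | x_{t-1})
    motion_lik     : (3,) likelihood of the observed point motion per label
    objectness_lik : (3,) likelihood of the observed objectness per label
    (both likelihood vectors are placeholders for the paper's models)
    """
    predicted = transition.T @ prior                      # propagate previous belief
    posterior = predicted * motion_lik * objectness_lik   # fuse the two observations
    return posterior / posterior.sum()                    # normalize

# toy example: a point that appears to be moving and object-like
prior = np.array([1 / 3, 1 / 3, 1 / 3])
transition = np.array([[0.80, 0.15, 0.05],
                       [0.10, 0.80, 0.10],
                       [0.05, 0.15, 0.80]])
motion_lik = np.array([0.9, 0.4, 0.1])       # large motion favors "dynamic"
objectness_lik = np.array([0.8, 0.8, 0.2])   # object-like favors the movable classes
belief = bayes_filter_step(prior, transition, motion_lik, objectness_lik)
print(dict(zip(LABELS, belief.round(3))))
```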
25. [Dewan2017] Deep Semantic Classification for LiDAR Data (4/4)
KITTI 3D Object Detection Benchmark
Movable and Non-movable labels are derived from the object labels
A dataset in which the point clouds are annotated with Movable, Non-movable and Dynamic labels
Dewan, A., Caselitz, T., Tipaldi, G. D., & Burgard, W. (2016). Motion-based Detection and Tracking in 3D LiDAR Scans. In IEEE International Conference on Robotics and Automation (ICRA).
46. Convolutional Neural Networks for Point Clouds
Here only methods that are important or that have been applied to outdoor environments are introduced.
[Qi2017] Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. IEEE Conference on Computer Vision and Pattern Recognition.
[Qi2017_2] Qi, C. R., Yi, L., Su, H., & Guibas, L. J. (2017). PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. Conference on Neural Information Processing Systems.
[Tatarchenko2018] Tatarchenko, M., Park, J., Koltun, V., & Zhou, Q. (2018). Tangent Convolutions for Dense Prediction in 3D. IEEE Conference on Computer Vision and Pattern Recognition.
[Wang2018] Wang, S., Suo, S., Ma, W., & Urtasun, R. (2018). Deep Parametric Continuous Convolutional Neural Networks. IEEE Conference on Computer Vision and Pattern Recognition.
47. [Qi2017] PointNet (1/2)
Convolves each point of the point cloud independently
Obtains a feature for the entire point cloud with global max pooling
Normalizes the point cloud by rotating it with a T-Net
Code: https://github.com/charlesq34/pointnet
(Architecture figure: per-point convolution → affine transformation (T-Net) → aggregation of the per-point features)
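A minimal PyTorch-style sketch of the structure described above (a shared per-point MLP followed by global max pooling); the T-Net alignment modules and the segmentation branch of the official implementation are omitted, and the layer sizes follow the paper's classification network:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Simplified PointNet classifier: shared per-point MLP + global max pooling."""

    def __init__(self, num_classes=40):
        super().__init__()
        # 1x1 convolutions act as an MLP shared across points
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):
        # points: (batch, num_points, 3) -> (batch, 3, num_points)
        x = self.point_mlp(points.transpose(1, 2))
        # global max pooling over points gives one feature vector per cloud
        x = torch.max(x, dim=2).values
        return self.classifier(x)

# usage: classify a batch of 8 clouds with 1024 points each
logits = TinyPointNet()(torch.randn(8, 1024, 3))
print(logits.shape)  # torch.Size([8, 40])
```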
58. Appendix
[Zheng2015] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., … Torr, P. H. S. (2015). Conditional Random Fields as Recurrent Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition.
[Iandola2016] Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. ArXiv, 1602.07360.
59. [Zheng2015] CRF as RNN
Constructs an RNN that is equivalent to mean-field approximation of a fully connected CRF
By using an FCN (Fully Convolutional Network) for feature extraction, the whole network can be trained end to end with backpropagation
(Figure: the CNN representing one iteration of the mean-field approximation)
(Figure: overview of the whole network)
Source code: https://github.com/torrvision/crfasrnn (Caffe)
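A rough numpy sketch of one mean-field iteration for a fully connected CRF, using an explicit dense pairwise kernel matrix instead of the efficient permutohedral-lattice filtering used in the paper; the kernel and the compatibility matrix below are placeholders:

```python
import numpy as np

def mean_field_iteration(q, unary, kernel, compatibility):
    """One mean-field update for a fully connected CRF.

    q             : (N, L) current label marginals per pixel/point
    unary         : (N, L) unary energies (e.g. negative log scores from the FCN)
    kernel        : (N, N) pairwise kernel weights; a dense stand-in for the
                    paper's efficient Gaussian filtering step
    compatibility : (L, L) label compatibility matrix (learned in CRF-as-RNN)
    """
    message = kernel @ q                 # message passing: filter the marginals
    pairwise = message @ compatibility   # compatibility transform
    logits = -(unary + pairwise)         # add unary energies, flip sign
    logits -= logits.max(axis=1, keepdims=True)
    q_new = np.exp(logits)               # softmax normalization over labels
    return q_new / q_new.sum(axis=1, keepdims=True)

# toy usage: 5 pixels, 3 labels, a few iterations starting from the unaries
rng = np.random.default_rng(0)
unary = rng.random((5, 3))
kernel = np.exp(-rng.random((5, 5)))     # placeholder pairwise kernel
np.fill_diagonal(kernel, 0.0)            # no message from a pixel to itself
compatibility = 1.0 - np.eye(3)          # Potts-style compatibility
q = np.exp(-unary) / np.exp(-unary).sum(axis=1, keepdims=True)
for _ in range(5):
    q = mean_field_iteration(q, unary, kernel, compatibility)
print(q.round(3))
```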