This document summarizes recent research on applying the self-attention mechanism of Transformers to domains beyond language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to images divided into patches. It also covers more general attention modules, such as the Perceiver, which aims to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities with frozen weights, showing that they can function as universal computation engines.
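As a rough illustration of the patch-based idea mentioned above (not code from any of the papers), the snippet below splits an image into fixed-size patches and flattens each patch into a token, which is the input format a ViT-style Transformer consumes. The patch size of 16 and the tensor shapes are illustrative assumptions.

```python
# Illustrative sketch: turn an image into a sequence of flattened patch tokens.
import torch

def image_to_patch_tokens(img, patch=16):
    """img: (B, C, H, W) -> (B, num_patches, C*patch*patch) patch tokens."""
    B, C, H, W = img.shape
    patches = img.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

tokens = image_to_patch_tokens(torch.randn(1, 3, 224, 224))  # -> (1, 196, 768)
```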
[DL Paper Reading Group] Neural Radiance Flow for 4D View Synthesis and Video Processing (NeRFlow), Deep Learning JP
Neural Radiance Flow (NeRFlow) is a method that extends Neural Radiance Fields (NeRF) to model dynamic scenes from video data. NeRFlow simultaneously learns two fields: a radiance field that reconstructs images as in NeRF, and a flow field that models how points in space move over time, supervised by optical flow. This allows it to render novel views at new time points. The model is trained end-to-end by minimizing losses for color reconstruction from volume rendering and for optical flow reconstruction. However, the method requires training a separate model for each scene and does not generalize to unseen scenes.
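A minimal sketch of the two-part training objective described above, assuming simple mean-squared-error terms and a hypothetical weighting factor (the paper's actual losses and regularizers differ in detail):

```python
# Hedged sketch of a NeRFlow-style objective: photometric loss from volume
# rendering plus an optical-flow reconstruction loss. flow_weight is a
# hypothetical hyperparameter, not a value from the paper.
import torch

def nerflow_style_loss(rendered_rgb, gt_rgb, rendered_flow, gt_flow, flow_weight=0.1):
    color_loss = torch.mean((rendered_rgb - gt_rgb) ** 2)   # rendered pixels vs. video frame
    flow_loss = torch.mean((rendered_flow - gt_flow) ** 2)  # predicted flow vs. estimated optical flow
    return color_loss + flow_weight * flow_loss

loss = nerflow_style_loss(torch.rand(64, 3), torch.rand(64, 3), torch.rand(64, 2), torch.rand(64, 2))
```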
Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that predict masked patches of the input image (a minimal sketch of this idea follows the list).
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
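The sketch below illustrates the first approach in the list (masked patch prediction) in a simplified form; the module sizes, mask ratio, and pixel-regression target are assumptions for illustration rather than the setup of any specific paper.

```python
# Simplified masked-patch-prediction sketch for a ViT-style encoder:
# mask a random subset of patch tokens and regress the original patch pixels.
import torch
import torch.nn as nn

class MaskedPatchPredictor(nn.Module):
    def __init__(self, patch_dim=768, embed_dim=256, mask_ratio=0.5):
        super().__init__()
        self.embed = nn.Linear(patch_dim, embed_dim)              # patch embedding
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))    # learned mask token
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decoder = nn.Linear(embed_dim, patch_dim)            # reconstruct raw patch pixels
        self.mask_ratio = mask_ratio

    def forward(self, patches):                                   # patches: (B, N, patch_dim)
        tokens = self.embed(patches)
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        pred = self.decoder(self.encoder(tokens))
        return nn.functional.mse_loss(pred[mask], patches[mask])  # loss on masked positions only

loss = MaskedPatchPredictor()(torch.randn(2, 196, 768))
```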
These are the slides used in the following video on the YouTube nnabla channel.
[Deep Learning Training] Transformer Fundamentals and Applications, Part 1: The Basics of the Transformer
https://youtu.be/Ry_AeJzMzU0?si=YjSaRmhEQhaa43k-
References
・On the Opportunities and Risks of Foundation Models
https://arxiv.org/abs/2108.07258
・Attention Is All You Need
https://arxiv.org/abs/1706.03762
・An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
https://arxiv.org/abs/2010.11929
・FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
https://arxiv.org/abs/2205.14135
・Gaussian Error Linear Units (GELUs)
https://arxiv.org/abs/1606.08415
・Language Modeling with Gated Convolutional Networks
https://arxiv.org/abs/1612.08083
・GLU Variants Improve Transformer
https://arxiv.org/abs/2002.05202
・Deep Residual Learning for Image Recognition
https://arxiv.org/abs/1512.03385
・Layer Normalization
https://arxiv.org/abs/1607.06450
・Learning Deep Transformer Models for Machine Translation
https://arxiv.org/abs/1906.01787
・Understanding the Difficulty of Training Transformers
https://arxiv.org/abs/2004.08249
・The Annotated Transformer
https://nlp.seas.harvard.edu/2018/04/03/attention.html
・Self-Attention with Relative Position Representations
https://arxiv.org/abs/1803.02155
・RoFormer: Enhanced Transformer with Rotary Position Embedding
https://arxiv.org/abs/2104.09864
These slides were used by Umemoto of our company at an internal technical study session.
They explain the Transformer, an architecture that has attracted much attention in recent years.
"Arithmer Seminar" is weekly held, where professionals from within and outside our company give lectures on their respective expertise.
The slides are made by the lecturer from outside our company, and shared here with his/her permission.
Arithmer Inc. is a mathematics company that originated from the Graduate School of Mathematical Sciences at the University of Tokyo. We apply modern mathematics to bring advanced new AI systems into solutions across a wide range of fields. Our job is to figure out how to use AI well to make work more efficient and to produce results that benefit people.
Arithmer began at the University of Tokyo Graduate School of Mathematical Sciences. Today, our research in modern mathematics and AI systems can provide solutions to tough, complex issues. At Arithmer we believe it is our job to realize the potential of AI by improving work efficiency and producing more useful results for society.
Presentation material for the Computer Vision Study Group @ Kanto "ECCV Reading Session 2018", held on 2018/10/20.
Yew, Z. J., & Lee, G. H. (2018). 3DFeat-Net: Weakly Supervised Local 3D Features for Point Cloud Registration. European Conference on Computer Vision.
This document summarizes a paper titled "DeepI2P: Image-to-Point Cloud Registration via Deep Classification". The paper proposes a method for estimating the camera pose within a point cloud map using a deep learning model. The model first classifies whether points in the point cloud fall within the camera's frustum or image grid. It then performs pose optimization to estimate the camera pose by minimizing the projection error of inlier points onto the image. The method achieves more accurate camera pose estimation compared to existing techniques based on feature matching or depth estimation. It provides a new approach for camera localization using point cloud maps without requiring cross-modal feature learning.
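As a rough, hypothetical sketch of the pose-optimization step described above (not the paper's implementation), the snippet below finds a camera pose that minimizes the reprojection error of classified inlier points; the pinhole intrinsics, the reduced 4-DoF pose parameterization, and the target pixel locations are all illustrative assumptions.

```python
# Hypothetical sketch: optimize a camera pose so inlier map points project
# close to their assigned image locations (reprojection-error minimization).
import numpy as np
from scipy.optimize import least_squares

def project(points_cam, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    return np.stack([fx * points_cam[:, 0] / points_cam[:, 2] + cx,
                     fy * points_cam[:, 1] / points_cam[:, 2] + cy], axis=1)

def pose_residuals(pose, points_map, target_pixels):
    """pose = (tx, ty, tz, yaw): a reduced pose parameterization for illustration."""
    t, yaw = pose[:3], pose[3]
    R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                  [np.sin(yaw),  np.cos(yaw), 0.0],
                  [0.0, 0.0, 1.0]])
    return (project(points_map @ R.T + t) - target_pixels).ravel()

# Inlier points and their target image locations would come from the classifier.
points = np.random.rand(50, 3) + [0.0, 0.0, 5.0]
targets = np.random.rand(50, 2) * [640, 480]
result = least_squares(pose_residuals, x0=np.zeros(4), args=(points, targets))
```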
Presentation material from the 4th All-Japan Computer Vision Study Group "Paper Reading Session on Recognizing and Understanding People", held on 2020/10/10.
The following two papers were covered:
Harmonious Attention Network for Person Re-identification. (CVPR2018)
Weakly Supervised Person Re-Identification (CVPR2019)
18. PointNet Recap: References
PointNet
Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
PointNet++
Qi, C. R., Yi, L., Su, H., & Guibas, L. J. (2017). PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. Conference on Neural Information Processing Systems (NeurIPS).
PointNeXt
Qian, G., Li, Y., Peng, H., Mai, J., Hammoud, H. A. A. K., Elhoseiny, M., & Ghanem, B. (2022). PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies. Conference on Neural Information Processing Systems (NeurIPS).
38. Transformer Recap: References
Transformer
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).
Vision Transformer
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations (ICLR).
39. Transformer Recap: References
MLP-Mixer
Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., Lucic, M., & Dosovitskiy, A. (2021). MLP-Mixer: An all-MLP Architecture for Vision. Advances in Neural Information Processing Systems (NeurIPS).
MetaFormer (PoolFormer)
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., & Yan, S. (2022). MetaFormer Is Actually What You Need for Vision. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
64. Point Transformer V2
Wu, X., Lao, Y., Jiang, L., Liu, X., & Zhao, H. (2022). Point Transformer V2: Grouped Vector Attention and Partition-based Pooling. Advances in Neural Information Processing Systems (NeurIPS).
Improves performance over Point Transformer by introducing the following (a sketch of grouped vector attention follows this list):
Grouped vector attention
A stronger positional embedding
Partition-based pooling
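A minimal sketch of the grouped-vector-attention idea, heavily simplified from the paper: instead of a single scalar attention weight per neighbor, an MLP on the query-key relation produces one weight per channel group. The neighborhood construction, relation definition, and layer sizes are assumptions for illustration.

```python
# Simplified grouped vector attention over K neighbors of each point.
import torch
import torch.nn as nn

class GroupedVectorAttention(nn.Module):
    def __init__(self, dim=64, groups=8):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        self.to_qkv = nn.Linear(dim, dim * 3)
        self.weight_mlp = nn.Linear(dim, groups)       # relation -> one weight per channel group

    def forward(self, x):                              # x: (N, K, dim), features of K neighbors per point
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        rel = q[:, :1] - k                             # relation to the center point (assumed to be index 0)
        w = self.weight_mlp(rel).softmax(dim=1)        # (N, K, groups), normalized over neighbors
        v = v.reshape(*v.shape[:2], self.groups, -1)   # split channels into groups
        out = (w.unsqueeze(-1) * v).sum(dim=1)         # weighted aggregation per group
        return out.flatten(1)                          # (N, dim)

out = GroupedVectorAttention()(torch.randn(10, 16, 64))  # -> (10, 64)
```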
70. PointMixer
Choe, J., Park, C., Rameau, F., Park, J., & Kweon, I. S. (2022). PointMixer: MLP-Mixer for Point Cloud Understanding. European Conference on Computer Vision (ECCV).
To apply MLP-Mixer to sparse, unordered data such as point clouds, the token-mixing part is replaced with a combination of channel mixing and softmax.
Mixing is performed in three patterns: inter-set, intra-set, and hierarchical-set.
Highly efficient.
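A rough sketch of the replacement described above (my simplification, not the authors' code): scores produced by a channel MLP are normalized with a softmax over the points in a set, giving an order-invariant aggregation that also works for variable-size point sets.

```python
# Simplified softmax-based token mixing over an unordered point set.
import torch
import torch.nn as nn

class SoftmaxTokenMixing(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.score = nn.Linear(dim, dim)        # channel-mixing MLP producing per-point scores

    def forward(self, x):                       # x: (num_sets, set_size, dim)
        w = self.score(x).softmax(dim=1)        # normalize over the points in each set
        return (w * x).sum(dim=1)               # order-invariant aggregation per set

pooled = SoftmaxTokenMixing()(torch.randn(4, 24, 32))  # -> (4, 32)
```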
89. PCT: Point Cloud Transformer
Guo, M. H., Cai, J. X., Liu, Z. N., Mu, T. J., Martin, R. R., & Hu, S. M. (2021). PCT: Point Cloud Transformer. Computational Visual Media, 7(2), 187–199.
Point coordinates are converted into features, and, as in a standard Transformer, attention is computed from the dot product of keys and queries and used to weight the values.
Self-attention is computed among all pairs of points.
Offset attention, based on the Laplacian matrix used in graph theory, is introduced to implement permutation-invariant attention.
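A minimal sketch of the offset-attention idea as described above (my reading, not the authors' code): standard dot-product self-attention is computed over all points, and the offset between the input features and the attention output is fed forward, analogous to how a graph Laplacian L = D - A acts on node features. PCT's actual normalization of the attention map differs.

```python
# Simplified offset attention over all points.
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.lbr = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())   # "LBR"-style block (norm omitted)

    def forward(self, x):                                          # x: (B, N, dim) point features
        scores = self.q(x) @ self.k(x).transpose(1, 2) / x.shape[-1] ** 0.5
        attended = torch.softmax(scores, dim=-1) @ self.v(x)       # (B, N, dim)
        return self.lbr(x - attended) + x                          # offset (Laplacian-like) + residual

out = OffsetAttention()(torch.randn(2, 100, 64))
```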
112. Fast Point Transformer
Park, C., Jeong, Y., Cho, M., & Park, J. (2022). Fast Point Transformer. Conference on Computer Vision and Pattern Recognition (CVPR).
Introduces a lightweight self-attention block over local regions.
A voxel-hashing-based architecture makes inference 129x faster than Point Transformer.
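A toy sketch of the voxel-hashing idea (an illustrative assumption, not the paper's implementation): points are bucketed by their quantized coordinates so that self-attention only needs to be computed among points that share a voxel.

```python
# Bucket points by voxel so attention can be restricted to local regions.
import torch
from collections import defaultdict

def voxel_hash_buckets(xyz, voxel_size=0.1):
    """xyz: (N, 3) coordinates -> dict mapping voxel key to the indices of points inside it."""
    keys = torch.floor(xyz / voxel_size).long()      # integer voxel coordinates
    buckets = defaultdict(list)
    for i, key in enumerate(keys.tolist()):
        buckets[tuple(key)].append(i)                # hash on the voxel tuple
    return buckets

buckets = voxel_hash_buckets(torch.rand(1000, 3))
# self-attention would then run independently inside each bucket
```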
122. Point-BERT
Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., & Lu, J. (2022). Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling. Conference on Computer Vision and Pattern Recognition (CVPR).
Builds a pre-trained model for point cloud analysis.
Classification is performed by adding a two-layer MLP head.
Object part segmentation computes per-point labels from the features of several intermediate Transformer layers and the final layer.
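An illustrative sketch of the two heads described above (sizes and layer choices are assumptions, not the paper's configuration): a two-layer MLP for classification, and a per-point segmentation head that concatenates features from several intermediate layers and the final layer.

```python
# Hypothetical classification and part-segmentation heads on top of a point Transformer.
import torch
import torch.nn as nn

embed_dim, num_classes, num_parts = 384, 40, 50

# two-layer MLP classification head
cls_head = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))

# per-point segmentation from concatenated intermediate + final layer features
seg_head = nn.Linear(embed_dim * 3, num_parts)

def segment(layer_feats):
    """layer_feats: list of (B, N, embed_dim) features from chosen Transformer layers."""
    per_point = torch.cat(layer_feats, dim=-1)       # fuse intermediate and final layers
    return seg_head(per_point)                       # (B, N, num_parts) per-point logits

logits = segment([torch.randn(2, 1024, embed_dim) for _ in range(3)])
```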
149. Self-Positioning Point-based Transformer (SPoTr)
Park, J., Lee, S., Kim, S., Xiong, Y., & Kim, H. J. (2023). Self-Positioning Point-based Transformer for Point Cloud Understanding. Conference on Computer Vision and Pattern Recognition (CVPR).
To reduce resource usage, instead of computing self-attention among all pairs of points, it uses self-positioning points (SP points) that capture global and local features.
By computing local and global cross-attention with the SP points, it achieves SOTA on three benchmarks (SONN, SN-Part, and S3DIS).
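A minimal sketch of the SP-point idea (my own simplification, not the authors' code): a small set of learned query points cross-attends to all input points, so the cost scales with num_sp x N rather than N x N.

```python
# Simplified cross-attention from a few learned self-positioning queries to all points.
import torch
import torch.nn as nn

class SPPointCrossAttention(nn.Module):
    def __init__(self, dim=64, num_sp=32):
        super().__init__()
        self.sp_queries = nn.Parameter(torch.randn(num_sp, dim))   # learned SP-point queries
        self.kv = nn.Linear(dim, dim * 2)

    def forward(self, x):                                          # x: (B, N, dim) point features
        k, v = self.kv(x).chunk(2, dim=-1)
        q = self.sp_queries.unsqueeze(0).expand(x.shape[0], -1, -1)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                            # (B, num_sp, dim) SP-point features

sp_feats = SPPointCrossAttention()(torch.randn(2, 2048, 64))       # cost ~ num_sp * N, not N * N
```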