6.
Figure 1. The intuitions behind latent Dirichlet allocation. We assume that some number of “topics,” which are distributions over words,
exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the
histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic.
The topics and topic assignments in this figure are illustrative—they are not fit from real data. See Figure 2 for topics fit from data.
[Figure 1 contents (panels: "Topics", "Documents", "Topic proportions and assignments"). Example topics and their most probable words: gene 0.04, dna 0.02, genetic 0.01, …; life 0.02, evolve 0.01, organism 0.01, …; brain 0.04, neuron 0.02, nerve 0.01, …; data 0.02, number 0.02, computer 0.01, …. The document's words carry color-coded topic assignments, summarized in the histogram of topic proportions.]
(Blei, CACM, 2012)
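As a rough sketch of the generative process described in the caption above, the following toy code samples a document word by word: first a distribution over the topics, then a per-word topic assignment, then the word itself. The topics, vocabulary, and probabilities are invented for illustration (they are not the ones in the figure or fit from data); only NumPy is assumed.

```python
# Toy sketch of the LDA generative process: topics are fixed distributions
# over words; each document draws its own topic proportions, then each word
# draws a topic assignment and a word from that topic.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["gene", "dna", "genetic", "life", "evolve", "organism",
         "brain", "neuron", "nerve", "data", "number", "computer"]

# Four invented topics, each a distribution over the whole vocabulary.
topics = np.array([
    [0.30, 0.25, 0.20, 0.05, 0.05, 0.05, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01],  # "genetics"
    [0.02, 0.02, 0.02, 0.30, 0.25, 0.20, 0.05, 0.05, 0.05, 0.02, 0.01, 0.01],  # "evolution"
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.30, 0.25, 0.20, 0.05, 0.05, 0.03],  # "neuroscience"
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.30, 0.25, 0.27],  # "data analysis"
])

def generate_document(n_words=20, alpha=0.5):
    # 1. Choose the document's distribution over topics (the histogram).
    theta = rng.dirichlet(alpha * np.ones(len(topics)))
    words, assignments = [], []
    for _ in range(n_words):
        # 2a. Choose a topic assignment (the colored coin).
        z = rng.choice(len(topics), p=theta)
        # 2b. Choose the word from that topic's distribution.
        w = rng.choice(len(vocab), p=topics[z])
        assignments.append(z)
        words.append(vocab[w])
    return theta, assignments, words

theta, z, doc = generate_document()
print(np.round(theta, 2))   # topic proportions of this document
print(list(zip(doc, z)))    # each word with its topic assignment
```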
7.
(Blei, CACM, 2012)
Excerpts from the article text shown on this slide (Blei, 2012):
… evolutionary biology, and each word is drawn from one of those three topics. Notice that the next article in the collection might be about data analysis and neuroscience; its distribution over topics would place probability on those two topics. This is the distinguishing characteristic of …
… algorithm assumed that there were 100 topics.) We then computed the inferred topic distribution for the example article (Figure 2, left), the distribution over topics that best describes its particular collection of words. Notice that this topic distribution, though it can use any of the topics, has only "acti-" …
… about subjects like genetics and data analysis are replaced by topics about discrimination and contract law. The utility of topic models stems from the property that the inferred hidden structure resembles the thematic structure of the collection. This interpretable hidden structure annotates …
Figure 2. Real inference with LDA. We fit a 100-topic LDA model to 17,000 articles from the journal Science. At left are the inferred
topic proportions for the example article in Figure 1. At right are the top 15 most frequent words from the most frequent topics found
in this article.
[Figure 2 contents. Left: bar plot of the inferred topic proportions for the example article (Probability, 0.0 to 0.4, over Topics 1 to 100). Right: top 15 words of four topics.
“Genetics”: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences.
“Evolution”: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common.
“Disease”: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis.
“Computers”: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations.]
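A minimal sketch of the kind of inference shown in Figure 2, under the assumption that scikit-learn is available: fit an LDA model to a corpus, then read off each topic's top words and a document's inferred topic proportions. The tiny `docs` list below is a stand-in only; the 17,000 Science articles are not included here.

```python
# Fit a small topic model and inspect (a) top words per topic and
# (b) per-document topic proportions, mirroring the two panels of Figure 2.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene dna genetic sequence genome",
    "evolution species organisms phylogenetic diversity",
    "computer data network software models",
    "disease bacteria infection host resistance",
    # ... a real corpus would contain thousands of documents
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)            # document-word count matrix

lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topics = lda.fit_transform(X)             # per-document topic proportions

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]           # indices of the top words
    print(f"topic {k}:", ", ".join(vocab[i] for i in top))

print("topic proportions of doc 0:", doc_topics[0].round(2))
```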
16.
[Lee & Seung (1999), Figure 1: non-negative matrix factorization (NMF) learns a parts-based representation of faces, whereas vector quantization (VQ) and principal component analysis (PCA) learn holistic representations. Each method factorizes the n × m matrix V of m = 2,429 facial images into r = 49 basis images (W) and encodings (H), V ≈ W × H; negative values are shown with red pixels, and each face is reconstructed as a superposition of the basis images.]
Non-negative matrix factorization (NMF)
Principal component analysis (PCA)
(Lee & Seung, 1999)
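A minimal sketch of the factorization compared in the figure, assuming scikit-learn: approximate a nonnegative matrix V as V ≈ WH with NMF, and contrast it with PCA on the same data. The data here is random and merely stands in for the m = 2,429 face images; r = 49 echoes the number of basis images in the figure.

```python
# Factorize a nonnegative matrix V (one image per column in the original
# figure) as V ≈ W H with W, H >= 0, and compare with an unconstrained
# PCA reconstruction of the same data.
import numpy as np
from sklearn.decomposition import NMF, PCA

rng = np.random.default_rng(0)
n, m, r = 361, 500, 49                 # pixels per image, images, basis images
V = rng.random((n, m))                 # random stand-in for the n x m face matrix

nmf = NMF(n_components=r, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(V)               # n x r: columns are nonnegative basis images
H = nmf.components_                    # r x m: columns are per-image encodings
print("NMF reconstruction error:", np.linalg.norm(V - W @ H))

# PCA allows negative values, so its basis vectors are holistic rather
# than parts-based.
pca = PCA(n_components=r)
scores = pca.fit_transform(V.T)        # treat each image (column of V) as a sample
V_pca = (scores @ pca.components_ + pca.mean_).T
print("PCA reconstruction error:", np.linalg.norm(V - V_pca))
```

For the multiplicative update rules themselves, see Lee & Seung (2000) in the references; scikit-learn's solver="mu" option implements updates of that kind.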
27. References
• Strang, Introduction to Linear Algebra, Wellesley-Cambridge Press, 2009.
– Japanese translation: 『線形代数イントロダクション』, Kindai Kagaku Sha, 2015.
• Lee & Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, 1999.
• Lee & Seung, “Algorithms for non-negative matrix factorization,” Adv. Neural Inf. Process. Syst. (NIPS), 2000.
• Morioka, Kanemura, et al., “Learning a common dictionary for subject-transfer decoding with resting calibration,” NeuroImage, 2015.
• Blei, “Probabilistic topic models,” Commun. ACM, 2012.
• Barnard et al., “Matching words and pictures,” J. Mach. Learn. Res., 2003.
• Zhang et al., “Learning from incomplete ratings using non-negative matrix factorization,” SIAM Conf. Data Mining (SDM), 2006.