20181212 - PGconf.ASIA - LT

In-database Analytics using GPU
~ [I tried implementing it] Logistic regression analysis ~
HeteroDB,Inc
Chief Architect & CEO
KaiGai Kohei <kaigai@heterodb.com>

Hello everyone. Are you using PL/CUDA?
These English captions were not generated by machine learning; I wrote them by hand in advance.
PGconf.ASIA 2018 LT - In-database Analytics using GPU

PL/CUDA user-defined functions
▌What is PL/CUDA?
 CUDA C code that runs on the GPU can be written as an SQL user-defined function.
▌Features
 Code optimized for the GPU can be written by hand.
 SQL, with its flexible data manipulation, is available for pre-processing and post-processing.
[Diagram: All In-database Analytics: Scan → Pre-Process → Analytics → Post-Process → Result]
CREATE FUNCTION
my_logic( reggstore, text )
RETURNS matrix
AS $$
/* custom CUDA C code block (runs on GPU device) */
$$ LANGUAGE 'plcuda';
✓ Manual optimization for statistical analysis and machine learning
✓ Makes full use of thousands of compute cores and high-bandwidth memory
PL/CUDA allows a UDF to be written as a CUDA C program that runs on the GPU. It is valuable because it combines
manual (extreme) optimization for the GPU with flexible data operations in SQL.

PL/CUDA use case: similar-compound search in drug discovery
Data structure of chemical compounds:
ID  NAME          Fingerprint (1024bit)
1   CHEMBL153534  00000000000100000010000000000010001000000...
2   CHEMBL405398  00000000000000010010000000000000000100000...
3   CHEMBL503634  00000100000000000000010000000000000000000...
:   :             :
[Diagram: query compounds (~1,000) are sent to the DB server, whose similarity-calculation logic compares them against the database compounds (about 10 million) and returns a list of similar compounds.]
Combinations to search = about 10 billion
For similarity search in drug discovery, the GPU calculated 10 billion distances between chemical compounds
150 times faster than a C binary on the CPU. It is a very compute-intensive workload.
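The algorithm used in that project is proprietary and not shown in the talk. Purely as an illustration of what a fingerprint similarity search looks like, the sketch below swaps in the Tanimoto (Jaccard) coefficient, a common choice for bit fingerprints. The function names and the compound names are hypothetical.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two bit fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention chosen here for two all-zero fingerprints
    return bin(a & b).count("1") / union


def top_matches(query: int, database: dict, k: int = 3):
    """Return the k most similar (score, name) pairs for one query fingerprint."""
    scored = [(tanimoto(query, fp), name) for name, fp in database.items()]
    return sorted(scored, reverse=True)[:k]


# Hypothetical 8-bit fingerprints; the talk uses 1024-bit ones.
database = {"compound_a": 0b11001010, "compound_b": 0b00110101, "compound_c": 0b11001000}
print(top_matches(0b11001010, database, k=2))

# Size of the search space quoted on the slide: 10 million x 1,000 pairs.
print(10_000_000 * 1_000)
```

Every query/database pair is independent of the others, which is why this kind of workload maps well onto thousands of GPU threads.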

Is there a sample program lying around somewhere?
Unfortunately, that case used a proprietary algorithm, so we have no sample code available in public.

So I made one.
Subject: logistic regression analysis

What is logistic regression analysis? (1/2)
Logistic regression analysis is a machine-learning method for binary classification.
[Figure: data points classified as True or False]

What is logistic regression analysis? (2/2)
The probability that a data point is classified correctly follows the logistic function:
$$\sigma(\alpha) = \frac{1}{1 + e^{-\alpha}}$$
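As a small illustration, here is the logistic function in its standard form, 1 / (1 + e^(-α)), written so that large negative inputs do not overflow. The function name `sigmoid` is chosen for this sketch and is not taken from the sample program.

```python
import math


def sigmoid(alpha: float) -> float:
    """Logistic function 1 / (1 + e^(-alpha)), evaluated without overflow."""
    if alpha >= 0:
        return 1.0 / (1.0 + math.exp(-alpha))
    e = math.exp(alpha)  # alpha < 0, so e lies in [0, 1)
    return e / (1.0 + e)


for a in (-5.0, 0.0, 5.0):
    print(a, sigmoid(a))
```

The output is exactly 0.5 on the dividing surface (α = 0) and approaches 0 or 1 as the point moves away from it.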

Finding the parameters (1/3)
Determining the dividing surface is equivalent to finding the weights (slopes) of the explanatory variables and the intercept:
$$0 = w_0 + w_1 x + w_2 y$$
Generalizing:
Parameters: $w = (w_0, w_1, \cdots, w_m)$
Explanatory variables: $\varphi_i = (1, x_1, \cdots, x_m)_i$
Dependent variable: $t_i = 0 \text{ or } 1$
The training data, however, only tell us a boolean state for each combination of explanatory variables.

Problem setting: maximize the probability of obtaining the training set.
When $z_i = \sigma(w^T \varphi_i)$,
$$P = \prod_{i=1}^{N} P_i = \prod_{i=1}^{N} z_i^{t_i} (1 - z_i)^{1 - t_i}$$
The farther a point lies from the dividing surface, the more likely it is to be true (or false).
We assume that the training set we observed was the most likely outcome, and look for the parameter w that maximizes this likelihood.
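To make the objective concrete, the sketch below evaluates the logarithm of that product for a given w (the log is used only to avoid multiplying many tiny numbers; it has the same maximizer). The helper names and the toy data are assumptions made for this example.

```python
import math


def sigmoid(a: float) -> float:
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def log_likelihood(w, rows, labels) -> float:
    """Sum over rows of t*log(z) + (1-t)*log(1-z), with z = sigmoid(w . phi)."""
    total = 0.0
    for x, t in zip(rows, labels):
        phi = [1.0] + list(x)  # leading 1 pairs with the intercept w[0]
        z = sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))
        z = min(max(z, 1e-12), 1.0 - 1e-12)  # keep log() finite
        total += t * math.log(z) + (1 - t) * math.log(1.0 - z)
    return total


# Toy data: one explanatory variable, label is 1 for the larger values.
rows = [(0.0,), (1.0,), (2.0,), (3.0,)]
labels = [0, 0, 1, 1]
print(log_likelihood([-3.0, 2.0], rows, labels))  # boundary at x = 1.5
print(log_likelihood([3.0, -2.0], rows, labels))  # same boundary, wrong side
```

A parameter vector that puts the points on the correct side of the dividing surface scores a higher likelihood than one that puts them on the wrong side.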

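Estimate the parameters by repeating the following update:
$$\bar{w}_{new} = \bar{w}_{old} - (\Phi^T R \Phi)^{-1} \Phi^T (\bar{z} - \bar{t})$$
where
$$\Phi = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ \vdots & & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix}$$
$$\bar{t} = (t_1, \ldots, t_n)$$
$$\bar{z} = (z_1, \ldots, z_n)$$
$$R = \mathrm{diag}\left(z_1 (1 - z_1), \ldots, z_n (1 - z_n)\right)$$
For more details, check out the book. In short, w is updated on each iteration so that w_new is a more
reasonable parameter than w_old. Eventually the difference between w_new and w_old becomes very small.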
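The update rule can be sketched on the CPU in a few lines. This is a minimal pure-Python illustration of the formula, not the PL/CUDA implementation. The helper names (`solve`, `irls_step`, `train`) and the toy data are assumptions, and the linear system is solved by Gaussian elimination rather than by forming the inverse explicitly.

```python
import math


def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def irls_step(w, rows, labels):
    """One update: w_new = w_old - (Phi^T R Phi)^-1 Phi^T (z - t)."""
    m = len(w)
    H = [[0.0] * m for _ in range(m)]  # accumulates Phi^T R Phi
    g = [0.0] * m                      # accumulates Phi^T (z - t)
    for x, t in zip(rows, labels):
        phi = [1.0] + list(x)
        z = sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))
        r = z * (1.0 - z)              # diagonal element of R
        for i in range(m):
            g[i] += phi[i] * (z - t)
            for j in range(m):
                H[i][j] += phi[i] * r * phi[j]
    delta = solve(H, g)
    return [wi - di for wi, di in zip(w, delta)]


def train(rows, labels, iterations=20, tol=1e-9):
    """Repeat the update until w stops changing or the iteration cap is hit."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(iterations):
        w_new = irls_step(w, rows, labels)
        if max(abs(a - b) for a, b in zip(w_new, w)) < tol:
            return w_new
        w = w_new
    return w


# Toy data with overlapping classes, so the estimate stays finite.
rows = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
labels = [0, 0, 1, 0, 1, 1]
print(train(rows, labels))
```

The two accumulation loops over the training rows are the parts whose cost grows with the number of rows; the solve step only involves a small matrix.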

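Considering the amount of computation
▌The number of explanatory variables is not large: from a few to about a hundred ... m
▌The number of training data items may be large: from hundreds to tens of millions ... n
$$\bar{w}_{new} = \bar{w}_{old} - \bar{w}_{\Delta} = \bar{w}_{old} - (\Phi^T R \Phi)^{-1} \Phi^T (\bar{z} - \bar{t})$$
[Diagram: matrix shapes. $\Phi^T$ is m × n, $R$ is n × n and $\Phi$ is n × m, so $(\Phi^T R \Phi)^{-1}$ is m × m. $\Phi^T (\bar{z} - \bar{t})$ is (m × n) times (n × 1), giving m × 1. Their product $\bar{w}_{\Delta}$ is m × 1.]
Estimating the amount of computation: the number of explanatory variables is at most a few hundred, but the number of
training data items can exceed a million. This makes the problem suitable for parallel computation on a GPU.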
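To put rough numbers on this, the sketch below counts multiply-add operations per update for a naive dense implementation that exploits only the fact that R is diagonal. The function name and the cost model are assumptions for illustration; m here counts the columns of Φ, including the intercept column.

```python
def irls_costs(n: int, m: int) -> dict:
    """Rough multiply-add counts for one update; n rows, m columns of Phi."""
    return {
        "phi_t_r_phi": n * m * m,   # m x m output, each element sums n terms
        "phi_t_residual": n * m,    # m x 1 output, each element sums n terms
        "inverse": m ** 3,          # order of magnitude for an m x m inverse
        "final_product": m * m,     # (m x m) times (m x 1)
    }


# Figures used later in the talk: 40M rows, four variables plus intercept.
costs = irls_costs(n=40_000_000, m=5)
for name, ops in costs.items():
    print(f"{name:>15}: {ops:,}")
```

Only the first two terms depend on n, and both are large sums of independent per-row products, which is the part worth handing to the GPU.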

Example code that computes the matrix product Φ^T R Φ in parallel
KERNEL_FUNCTION_MAXTHREADS(void)
logregr_update_P(cl_double **Preg,      /* out */
                 cl_float **Xp,
                 cl_int width,
                 VectorTypeFloat *Z)
{
    cl_double *P = Preg[0];
    __shared__ cl_float v[MAXTHREADS_PER_BLOCK];    // shared variables
    /* declarations of nitems, nitems_bs, nloops, loop, i, j, k and sum are not shown on the slide */

    nitems_bs = TYPEALIGN(get_local_size(), nitems);
    nloops = width * width * nitems_bs;
    for (loop = get_global_id();            // unique identifier of GPU threads
         loop < nloops;
         loop += get_global_size()) {       // add total number of GPU threads
        k = loop % nitems_bs;               // index of R column/row (training row)
        i = (loop / nitems_bs) % width;     // index of Φ^T row
        j = loop / (nitems_bs * width);     // index of Φ column
        if (k < nitems) {
            cl_float z = Z->values[k];
            cl_float x1 = (i == 0 ? 1.0 : Xp[i-1][k]);
            cl_float x2 = (j == 0 ? 1.0 : Xp[j-1][k]);
            v[get_local_id()] = x1 * z * (1.0 - z) * x2;
        }
        else
            v[get_local_id()] = 0.0;        // padding threads contribute zero
        sum = pgstromTotalSum(v, MAXTHREADS_PER_BLOCK);  // total sum of the elements
        if (get_local_id() == 0)                         // calculated by the sibling threads
            atomicAdd(&P[i + j * width], sum);
        __syncthreads();
    }
}
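The index arithmetic in the kernel is easier to follow when run serially. The sketch below emulates the flat loop on the CPU: one loop counter is decomposed into a training row k and an output element (i, j), with the row count padded up to a multiple of the block size. It is an illustration only. The names are hypothetical, and `type_align` assumes TYPEALIGN rounds its second argument up to a multiple of the first, as the PostgreSQL macro of that name does.

```python
def type_align(block: int, n: int) -> int:
    """Round n up to a multiple of block."""
    return (n + block - 1) // block * block


def phi_t_r_phi(xp, z, block=4):
    """Serially emulate the kernel's flat loop; xp[c][k] is variable c of row k."""
    width = len(xp) + 1                  # explanatory variables plus intercept
    nitems = len(z)
    nitems_bs = type_align(block, nitems)
    p = [0.0] * (width * width)
    for loop in range(width * width * nitems_bs):
        k = loop % nitems_bs             # training row
        i = (loop // nitems_bs) % width  # row of Phi^T
        j = loop // (nitems_bs * width)  # column of Phi
        if k < nitems:                   # padding rows contribute nothing
            x1 = 1.0 if i == 0 else xp[i - 1][k]
            x2 = 1.0 if j == 0 else xp[j - 1][k]
            p[i + j * width] += x1 * z[k] * (1.0 - z[k]) * x2
    return p


# One explanatory variable, three rows, all with z = 0.5.
print(phi_t_r_phi([[1.0, 2.0, 3.0]], [0.5, 0.5, 0.5]))
```

Because the padded row count is a multiple of the block size, all threads of one GPU block share the same (i, j), so their partial products can be summed inside the block before a single atomicAdd.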

Computing on the GPU: example of a reduction algorithm
[Diagram: computing the array sum Σ(i=0...N-1) item[i] on the GPU. item[0] through item[15] are added pairwise in four steps (step.1 to step.4); each step halves the number of partial sums until the total is left in item[0].]
The sum of items[] is computed in log2 N steps.
Hardware-assisted synchronization between cores.
The same mechanism is used to compute aggregate functions:
SELECT count(X),
       sum(Y),
       avg(Z)
  FROM my_table;
Values in shared memory can be accessed by multiple GPU cores simultaneously. The hardware supports
inter-core synchronization, which makes it possible to calculate the total sum in log2 N steps.
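The pairwise scheme can be emulated serially to check the step count. In the sketch below (names are hypothetical), each pass of the `while` loop corresponds to one step of the reduction; on a GPU all additions within a pass run in parallel and are followed by a barrier.

```python
def tree_sum(items):
    """Sum a sequence by pairwise addition; returns (total, number_of_steps)."""
    buf = list(items)
    n = len(buf)
    steps = 0
    stride = 1
    while stride < n:
        # On a GPU, every addition in this pass is done by a different thread,
        # and the threads synchronize before the next pass starts.
        for i in range(0, n - stride, 2 * stride):
            buf[i] += buf[i + stride]
        stride *= 2
        steps += 1
    return (buf[0] if buf else 0), steps


print(tree_sum(range(16)))
```

For 16 items the total is reached after 4 passes, that is log2(16), instead of 15 sequential additions.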

Sample program for logistic regression analysis
$ git clone https://github.com/heterodb/toybox.git
$ cd toybox/logistic_regression/
$ make && make install
$ psql postgres
postgres=# create extension logregr;
CREATE EXTENSION
To get the sample code, open "heterodb/toybox" on GitHub, then go to "logistic_regression".
If PG-Strom is set up correctly, you can install it with CREATE EXTENSION.
https://github.com/heterodb/toybox/ ➔ logistic_regression

Trying it out (1/4) - creating artificial test data
postgres=# CREATE TABLE logreg (
t bool,
x1 float,
x2 float,
x3 float,
x4 float );
CREATE TABLE
-- ↓ Training data in which every row with 1 + 2*x1 - 3*x2 + x3 + 0.5*x4 > 0 is classified as true;
-- load 40 million rows
postgres=# INSERT INTO logreg
(SELECT (1.0+2.0*x1-3.0*x2+x3+0.5*x4) > 0 t, x1, x2, x3, x4
FROM (SELECT random() x1,
random() x2,
random() x3,
random() x4
FROM generate_series(1,40000000)) x);
INSERT 0 40000000
OK, let's run the PL/CUDA function. First of all, create an ordinary table with 40M rows of random data.
All the rows that satisfy 1 + 2*x1 - 3*x2 + x3 + 0.5*x4 > 0 are marked as 'true'.

Trying it out (2/4) - loading the data into GPU device memory (1)
postgres=# CREATE FOREIGN TABLE ft (
t bool,
x1 real,
x2 real,
x3 real,
x4 real
) SERVER gstore_fdw
OPTIONS (pinning '0');
CREATE FOREIGN TABLE
postgres=# INSERT INTO ft
(SELECT * FROM logreg);
INSERT 0 40000000
Gstore_Fdw is an FDW extension that represents GPU device memory; the device is specified by the 'pinning' option.
INSERT INTO the Gstore_Fdw table loads the 40M rows of the 'logreg' table.
[Diagram: foreign table (gstore_fdw) in front of GPU device memory]
✓ Data format conversion
✓ Data compression
✓ Transaction control

Trying it out (3/4)
[kaigai@saba src]$ nvidia-smi
Thu Dec 6 12:10:56 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:02:00.0 Off | N/A |
| N/A 42C P0 52W / 250W | 817MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 27650 C ...bgworker: PG-Strom GPU memory keeper 807MiB |
+-----------------------------------------------------------------------------+
807MB of GPU device memory is reserved: the dataset consumes 680MB, on top of about 120MB
for device management.
Device management: about 120MB +
(sizeof(bool) + 4*sizeof(float)) * 40M = 680MB
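The dataset figure can be reproduced with simple arithmetic. The sketch below assumes, as the slide's formula does, 1 byte per bool and 4 bytes per 'real' column with no per-row overhead; the function name is hypothetical. Since nvidia-smi reports MiB, the same size is also printed in that unit.

```python
ROWS = 40_000_000
SIZEOF_BOOL = 1    # bytes, as assumed by the slide's formula
SIZEOF_FLOAT = 4   # bytes per 'real' column


def dataset_bytes(rows: int, n_real_columns: int) -> int:
    """Raw payload size of one bool column plus n 'real' columns."""
    return (SIZEOF_BOOL + n_real_columns * SIZEOF_FLOAT) * rows


total = dataset_bytes(ROWS, 4)
print(total)                      # bytes
print(total / 1_000_000)          # MB (decimal)
print(round(total / 2 ** 20, 1))  # MiB, the unit nvidia-smi uses
```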

Trying it out (4/4)
postgres=# SELECT logregr_train('ft',
attnum_of('ft','t'),
attnums_of('ft','{x1,x2,x3,x4}'));
logregr_train
------------------------------------------
{3376.4,6752.71,-10129.1,3376.3,1688.27}
(1 row)
Time: 3647.059 ms (00:03.647)
The weights of the explanatory variables are estimated. Five elements are returned because there are four
explanatory variables plus the intercept. It takes 3.6 sec.

Comparison with a CPU implementation (1/3)
Using MADLib's logregr_train() function
postgres=# SELECT madlib.logregr_train('logreg', 'hoge',
't','ARRAY[1,x1,x2,x3,x4]',
NULL, 20);
logregr_train
---------------
(1 row)
Time: 1301307.361 ms (21:41.307)
postgres=# SELECT coef FROM hoge;
coef
------------------------------------------------------
{3041.82722783601,6083.57794939209,-9125.44857123801,3041.73992459095,1520.98287953044}
(1 row)
For the same job, MADLib's logregr_train() took 21 min 41 sec. The PL/CUDA implementation was about 356 times
faster than the CPU-based implementation.
1301307.36 / 3647.06 = 356.8 (the CPU implementation took 356.8 times as long)

Comparison with a CPU implementation (2/3) - verification
[Figure: one line shows the "slope" of the explanatory variables used when the test data was generated; the other shows the slope given by the parameters that logregr_train() estimated.]
         w0       w1       w2        w3       w4
PL/CUDA  3376.4   6752.71  -10129.1  3376.3   1688.27
MADLib   3041.83  6083.58  -9125.45  3041.74  1520.98
The result of logregr_train() differs from the weights we used when generating the dataset artificially, because it
returns the gradient and intercept of the normal vector to the dividing surface.
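One way to check the estimates is to compare directions rather than magnitudes: multiplying every weight by the same positive constant leaves the dividing surface unchanged. The sketch below divides each result by its intercept w0 and compares it with the coefficients used to generate the data (1, 2, -3, 1, 0.5). The function name is hypothetical; the numbers are the ones reported in the talk.

```python
TRUE_W = [1.0, 2.0, -3.0, 1.0, 0.5]   # 1 + 2*x1 - 3*x2 + x3 + 0.5*x4 > 0
PLCUDA = [3376.4, 6752.71, -10129.1, 3376.3, 1688.27]
MADLIB = [3041.83, 6083.58, -9125.45, 3041.74, 1520.98]


def normalize(w):
    """Divide by the intercept w0 so that the first element becomes 1."""
    return [wi / w[0] for wi in w]


for name, w in (("PL/CUDA", PLCUDA), ("MADLib", MADLIB)):
    print(name, [round(v, 4) for v in normalize(w)])
```

After normalization both results agree with the generating coefficients to about four decimal places. The large common scale factor is expected for perfectly separable data, where the likelihood keeps increasing as the weights grow along the same direction.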

Comparison with a CPU implementation (3/3) - verification
Caution: running inference on the training set is normally a no-no!
postgres=# SELECT COUNT(*)
FROM (SELECT t, logregr_predict(ARRAY[ 3376.4, 6752.71,
-10129.1, 3376.3,
1688.27]::float[],
ARRAY[x1,x2,x3,x4]) p
FROM logreg) data
WHERE t != p;
count
-------
90
(1 row)
postgres=# SELECT COUNT(*)
FROM (SELECT t, logregr_predict(hoge.coef,
ARRAY[x1,x2,x3,x4]) p
FROM logreg, hoge) data
WHERE t != p;
count
-------
70
(1 row)
Prediction with our PL/CUDA result misclassified 90 of the 40M rows, and prediction with the MADLib result misclassified 70 of the 40M.
Note that in real data analytics we would not normally run prediction on the training set.
The queries count the rows whose prediction is "wrong".
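The internals of logregr_predict() are not shown in the talk. A minimal sketch, assuming it classifies a row as true when σ(w·φ) exceeds 0.5 (equivalently, when w0 + w1*x1 + ... is positive), could look like this. The function names are hypothetical.

```python
def predict(w, x) -> bool:
    """Classify as True when w0 + w1*x1 + ... + wm*xm is positive."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0


def error_rate(errors: int, rows: int = 40_000_000) -> float:
    """Fraction of rows that were misclassified."""
    return errors / rows


W = [3376.4, 6752.71, -10129.1, 3376.3, 1688.27]  # PL/CUDA result from the talk

print(predict(W, (0.5, 0.1, 0.5, 0.5)))  # far on the 'true' side
print(predict(W, (0.0, 0.9, 0.0, 0.0)))  # far on the 'false' side
print(error_rate(90), error_rate(70))
```

Both error counts correspond to roughly two rows per million, all of which must lie very close to the dividing surface.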

Summary
▌Sample program for PL/CUDA
https://github.com/heterodb/toybox
▌PL/CUDA is great
▌Workloads expected to benefit most
 Machine learning
 Similarity search
 Anomaly detection
 Image generation
 ...and others
Conclusion: we were able to write a sample PL/CUDA program and publish it. PL/CUDA is fun.
