In-database Analytics using GPU
~ [I tried implementing it] Logistic Regression Analysis ~
HeteroDB,Inc
Chief Architect & CEO
KaiGai Kohei <kaigai@heterodb.com>
Hello, everyone. Are you using PL/CUDA?
(These captions are not machine-generated; I wrote them by hand in advance.)
PGconf.ASIA 2018 LT - In-database Analytics using GPU
PL/CUDA user-defined functions
▌What is PL/CUDA?
• It lets you write CUDA C code, executable on the GPU, as an SQL user-defined function.
▌Features
• You can hand-write code that is optimized for the GPU.
• You can use SQL, with its flexible data manipulation, for pre-processing and post-processing.
[Figure: the whole analytics pipeline runs in-database: Scan → Pre-Process → Analytics → Post-Process → Result]
CREATE FUNCTION
my_logic( reggstore, text )
RETURNS matrix
AS $$
  /* custom CUDA C code block (runs on the GPU device) */
$$ LANGUAGE 'plcuda';
✓ Manual optimization for statistical analysis and machine learning
✓ Makes full use of thousands of compute cores and high-bandwidth memory
PL/CUDA lets you write a UDF as a CUDA C program that runs on the GPU. Its value comes from combining hand-tuned (extreme) GPU optimization with flexible data manipulation in SQL.
PL/CUDA use case: similar-compound search in drug discovery
ID  NAME          Fingerprint (1024bit)
1   CHEMBL153534  00000000000100000010000000000010001000000...
2   CHEMBL405398  00000000000000010010000000000000000100000...
3   CHEMBL503634  00000100000000000000010000000000000000000...
:   :             :
Data structure of chemical compounds
[Figure: query compounds (~1,000) are sent to the DB server, where the similarity-calculation logic compares them against the database compounds (about 10 million) and returns a list of similar compounds. Combinations to search = about 10 billion.]
For similarity search in drug discovery, the GPU calculated about 10 billion distances between chemical compounds 150 times faster than a C binary on the CPU. It is a very compute-intensive workload.
Is there a sample program lying around somewhere?
Unfortunately, that case used a proprietary algorithm, so there was no public sample code.
So I made one.
Subject: logistic regression analysis
What is logistic regression analysis? (1/2)
Logistic regression analysis is a machine-learning method for binary classification.
[Figure: data points labeled True and False]
What is logistic regression analysis? (2/2)
The probability that a data point is classified correctly follows the logistic function:
$$\sigma(\alpha) = \frac{1}{1 + e^{-\alpha}}$$
Finding the parameters (1/3)
Determining the dividing surface is equivalent to finding the weights (slopes) of the explanatory variables and the intercept. With two explanatory variables the surface is the line
$$0 = w_0 + w_1 x + w_2 y$$
Generalizing:
Parameters: $w = (w_0, w_1, \cdots, w_m)$
Explanatory variables: $\varphi_i = (1, x_1, \cdots, x_m)_i$
Dependent variable: $t_i = 0 \text{ or } 1$
The training data, however, only gives a boolean state for each combination of explanatory variables.
Finding the parameters (2/3)
Problem setting: maximize the probability of obtaining the training set.
When $z_i = \sigma(w^T \varphi_i)$,
$$P = \prod_{i=1}^{n} P_i = \prod_{i=1}^{n} z_i^{t_i} (1 - z_i)^{1 - t_i}$$
The farther a point lies from the dividing surface, the more likely it is to be true (or false). We assume that the training set is the outcome with the highest likelihood, and look for the parameter $w$ that maximizes $P$.
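The slide does not show the intermediate step, but the usual way to maximize this product is to work with its logarithm. The following is the standard textbook identity, written in the slide's symbols, together with the gradient and Hessian of the negative log-likelihood that the iterative update relies on:

```latex
\log P = \sum_{i=1}^{n} \left[ t_i \log z_i + (1 - t_i) \log (1 - z_i) \right]

\nabla_w \left( -\log P \right) = \sum_{i=1}^{n} (z_i - t_i)\, \varphi_i = \Phi^T (\bar{z} - \bar{t})

\nabla_w^2 \left( -\log P \right) = \sum_{i=1}^{n} z_i (1 - z_i)\, \varphi_i \varphi_i^T = \Phi^T R \Phi
```

Here $\Phi$ is the matrix whose rows are the $\varphi_i$ and $R = \mathrm{diag}(z_i (1 - z_i))$.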
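Finding the parameters (3/3)
Estimate the parameters by repeating the following update:
$$\bar{w}_{new} = \bar{w}_{old} - (\Phi^T R \Phi)^{-1} \Phi^T (\bar{z} - \bar{t})$$
where
$$\Phi = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ \vdots & & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix}, \quad \bar{t} = (t_1, \ldots, t_n)^T, \quad \bar{z} = (z_1, \ldots, z_n)^T$$
$$R = \mathrm{diag}\left( z_1 (1 - z_1), \ldots, z_n (1 - z_n) \right)$$
For more details, see the book shown on the slide. In short, $w$ is updated on every iteration so that $w_{new}$ is a better estimate than $w_{old}$; eventually the difference between $w_{new}$ and $w_{old}$ becomes very small.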
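Considering the amount of computation
▌The number of explanatory variables is small: a few to about a hundred ... m
▌The number of training rows may be large: hundreds to tens of millions ... n
$$\bar{w}_{new} = \bar{w}_{old} - \bar{w}_{\Delta} = \bar{w}_{old} - (\Phi^T R \Phi)^{-1} \Phi^T (\bar{z} - \bar{t})$$
[Figure: matrix shapes. $\Phi^T$ ($m \times n$), $R$ ($n \times n$) and $\Phi$ ($n \times m$) give $(\Phi^T R \Phi)^{-1}$ of size $m \times m$; $\Phi^T$ ($m \times n$) times $(\bar{z} - \bar{t})$ ($n \times 1$) gives an $m \times 1$ vector; their product $\bar{w}_{\Delta}$ is $m \times 1$.]
Estimating the amount of computation: the number of explanatory variables is at most in the hundreds, while the training set can exceed a million rows. The sums over the n rows are therefore the expensive part, and they are well suited to parallel calculation on the GPU.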
Code example: computing the matrix product Φ^T R Φ in parallel
KERNEL_FUNCTION_MAXTHREADS(void)
logregr_update_P(cl_double **Preg,          /* out */
                 cl_float **Xp,
                 cl_int width,
                 VectorTypeFloat *Z)
{
    cl_double *P = Preg[0];
    /* The slide omits the local declarations; the types below and the
     * source of nitems (Z->height) are assumptions. */
    cl_uint nitems = Z->height;             /* number of training rows */
    cl_uint nitems_bs, nloops, loop, i, j, k;
    cl_float sum;
    __shared__ cl_float v[MAXTHREADS_PER_BLOCK];    /* shared by the threads of a block */

    nitems_bs = TYPEALIGN(get_local_size(), nitems);    /* rows rounded up to the block size */
    nloops = width * width * nitems_bs;
    for (loop = get_global_id();            /* unique identifier of this GPU thread */
         loop < nloops;
         loop += get_global_size())         /* stride by the total number of GPU threads */
    {
        k = loop % nitems_bs;               /* index of R column/row (training row) */
        i = (loop / nitems_bs) % width;     /* index of Φ^T row */
        j = loop / (nitems_bs * width);     /* index of Φ column */
        if (k < nitems)
        {
            cl_float z = Z->values[k];
            cl_float x1 = (i == 0 ? 1.0 : Xp[i-1][k]);
            cl_float x2 = (j == 0 ? 1.0 : Xp[j-1][k]);
            v[get_local_id()] = x1 * z * (1.0 - z) * x2;
        }
        else
            v[get_local_id()] = 0.0;
        sum = pgstromTotalSum(v, MAXTHREADS_PER_BLOCK);   /* total of the elements computed by the sibling threads */
        if (get_local_id() == 0)
            atomicAdd(&P[i + j * width], sum);
        __syncthreads();
    }
}
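The index arithmetic in the kernel may be easier to follow in a serial form. The sketch below is a plain C reference, not part of the sample program: the function name `update_P_reference` and its signature are hypothetical, and it drops the block-size padding (`nitems_bs`) because there are no thread blocks on the CPU. Each value of `loop` still maps to exactly one (i, j, k) term of Φ^T R Φ.

```c
#include <stddef.h>

/* Serial reference for P = Phi^T R Phi using the same flattened index as the
 * kernel. Xp[c][k] is column c of the explanatory variables for row k;
 * column 0 of Phi is the implicit constant 1. z[k] is the current
 * prediction for row k. P must hold width * width doubles. */
void update_P_reference(double *P, const float *const *Xp,
                        int width, const float *z, size_t nitems)
{
    size_t nloops = (size_t)width * width * nitems;

    for (int e = 0; e < width * width; e++)
        P[e] = 0.0;
    for (size_t loop = 0; loop < nloops; loop++) {
        size_t k = loop % nitems;                   /* training row   */
        int i = (int)((loop / nitems) % width);     /* row of Phi^T   */
        int j = (int)(loop / (nitems * width));     /* column of Phi  */
        double x1 = (i == 0) ? 1.0 : Xp[i - 1][k];
        double x2 = (j == 0) ? 1.0 : Xp[j - 1][k];

        P[i + j * width] += x1 * z[k] * (1.0 - z[k]) * x2;
    }
}
```

In the GPU version, consecutive values of `loop` within one block share the same (i, j), so the block can reduce its partial products in shared memory and issue a single `atomicAdd` per block.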
Computing with the GPU: an example of the reduction algorithm
[Figure: tree reduction of item[0] ... item[15] in four steps (step.1 to step.4), computing the array sum $\sum_{i=0}^{N-1} item[i]$ on the GPU. The total of items[] is obtained in $\log_2 N$ steps, using hardware-assisted synchronization between cores.]
This is the mechanism used to compute aggregate functions:
SELECT count(X),
       sum(Y),
       avg(Z)
  FROM my_table;
Values in shared memory can be accessed by multiple GPU cores simultaneously. The hardware supports synchronization between cores, which makes it possible to compute a total sum in log2 N steps.
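The log2 N behaviour can be illustrated with a serial simulation of the reduction. This is only a sketch under stated assumptions: `tree_sum` is a hypothetical name, the outer loop stands for one synchronized parallel step, and the inner loop stands in for the threads that would run at the same time on the GPU.

```c
#include <stddef.h>

/* Sum n values by pairwise tree reduction, in place.
 * Each pass of the outer loop corresponds to one parallel step; the inner
 * loop plays the role of the concurrently running threads.
 * The number of steps is stored in *steps when steps is not NULL. */
double tree_sum(double *item, size_t n, int *steps)
{
    int count = 0;

    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            item[i] += item[i + stride];
        count++;    /* on a GPU, threads would synchronize here */
    }
    if (steps)
        *steps = count;
    return n > 0 ? item[0] : 0.0;
}
```

For 16 items the loop runs four steps, which matches step.1 to step.4 in the figure.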
Sample program for logistic regression analysis
$ git clone https://github.com/heterodb/toybox.git
$ cd toybox/logistic_regression/
$ make && make install
$ psql postgres
postgres=# create extension logregr;
CREATE EXTENSION
To get the sample code, open "heterodb/toybox" on GitHub and go to "logistic_regression". If PG-Strom is set up correctly, you can install it with CREATE EXTENSION.
https://github.com/heterodb/toybox/ ➔ logistic_regression
Trying it out (1/4): creating artificial test data
postgres=# CREATE TABLE logreg (
             t bool,
             x1 float,
             x2 float,
             x3 float,
             x4 float );
CREATE TABLE
-- ↓ Training data in which every row with 1 + 2*x1 - 3*x2 + x3 + 0.5*x4 > 0 is classified as true;
-- insert 40 million rows
postgres=# INSERT INTO logreg
             (SELECT (1.0+2.0*x1-3.0*x2+x3+0.5*x4) > 0 t, x1, x2, x3, x4
                FROM (SELECT random() x1,
                             random() x2,
                             random() x3,
                             random() x4
                        FROM generate_series(1,40000000)) x);
INSERT 0 40000000
OK, let's run the PL/CUDA function. First, create a normal table with 40M rows of random data. Every row that satisfies $1 + 2x_1 - 3x_2 + x_3 + 0.5x_4 > 0$ is marked as 'true'.
Trying it out (2/4): loading data into GPU device memory
postgres=# CREATE FOREIGN TABLE ft (
             t bool,
             x1 real,
             x2 real,
             x3 real,
             x4 real
           ) SERVER gstore_fdw
             OPTIONS (pinning '0');
CREATE FOREIGN TABLE
postgres=# INSERT INTO ft
             (SELECT * FROM logreg);
INSERT 0 40000000
Gstore_Fdw is an FDW extension that acts as a front end to GPU device memory; the GPU device is selected by the 'pinning' option. INSERT INTO the Gstore_Fdw table loads the 40M rows of the 'logreg' table into device memory.
[Figure: the foreign table (gstore_fdw) sits in front of GPU device memory and handles: ✓ data format conversion ✓ data compression ✓ transaction control]
Trying it out (3/4)
[kaigai@saba src]$ nvidia-smi
Thu Dec 6 12:10:56 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:02:00.0 Off | N/A |
| N/A 42C P0 52W / 250W | 817MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 27650 C ...bgworker: PG-Strom GPU memory keeper 807MiB |
+-----------------------------------------------------------------------------+
807MB of GPU device memory is reserved: the dataset consumes 680MB, on top of about 120MB for device management.
Device management: about 120MB, plus data: (sizeof(bool) + 4*sizeof(float)) * 40M = 680MB
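The 680MB figure follows directly from the column widths. A tiny C sketch of the arithmetic, assuming a 1-byte bool and 4-byte real columns as on the slide (the function name `payload_bytes` is illustrative only):

```c
/* Raw payload of a table with one 1-byte bool column and n_float 4-byte
 * float (real) columns, ignoring any per-table management overhead. */
unsigned long long payload_bytes(unsigned long long nrows, unsigned n_float)
{
    return nrows * (1ULL + 4ULL * n_float);
}
```

For 40 million rows and four float columns this gives 680,000,000 bytes. Note that nvidia-smi reports MiB, so the breakdown on the slide should be read as approximate.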
Trying it out (4/4)
postgres=# SELECT logregr_train('ft',
attnum_of('ft','t'),
attnums_of('ft','{x1,x2,x3,x4}'));
logregr_train
------------------------------------------
{3376.4,6752.71,-10129.1,3376.3,1688.27}
(1 row)
Time: 3647.059 ms (00:03.647)
The weights of the explanatory variables are estimated. Five elements are returned because there are four explanatory variables plus the intercept. It takes 3.6 seconds.
Comparing with a CPU implementation (1/3)
Using MADLib's logregr_train() function
postgres=# SELECT madlib.logregr_train('logreg', 'hoge',
                                       't', 'ARRAY[1,x1,x2,x3,x4]',
                                       NULL, 20);
logregr_train
---------------
(1 row)
Time: 1301307.361 ms (21:41.307)
postgres=# SELECT coef FROM hoge;
coef
------------------------------------------------------
{3041.82722783601,6083.57794939209,-9125.44857123801,3041.73992459095,1520.98287953044}
(1 row)
For the same job, MADLib's logregr_train() took 21 min 41 s. The PL/CUDA implementation was about 356 times faster than the CPU-based implementation.
1301307.36 / 3647.06 = 356.8, i.e. the CPU run took 356.8 times as long.
Comparing with a CPU implementation (2/3): verification
[Figure: two lines. One shows the "slope" of the explanatory variables used when the test data was created; the other shows the slope given by the parameters estimated by logregr_train().]
          w0       w1       w2        w3       w4
PL/CUDA   3376.4   6752.71  -10129.1  3376.3   1688.27
MADLib    3041.83  6083.58  -9125.45  3041.74  1520.98
The result of logregr_train() differs from the weights used to generate the dataset because it returns the slope and intercept of a normal vector to the dividing surface, and such a vector is determined only up to a positive scale factor. Both results are close to a multiple of the original weights (1, 2, -3, 1, 0.5).
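One way to check this is to divide each coefficient vector by its intercept, so that vectors differing only in scale become directly comparable. A minimal C sketch (the helper name `normalize_by_intercept` is hypothetical):

```c
/* Divide every coefficient by the intercept w[0], so that vectors that
 * differ only by a scale factor can be compared directly.
 * w and out both have n entries; w[0] must be non-zero. */
void normalize_by_intercept(const double *w, int n, double *out)
{
    for (int i = 0; i < n; i++)
        out[i] = w[i] / w[0];
}
```

Applied to the two rows of the table, both normalized vectors agree with (1, 2, -3, 1, 0.5) to within about 0.001 per component, which is the set of weights used to generate the test data.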
Comparing with a CPU implementation (3/3): verification
Caution: running inference on the training set is normally forbidden!
postgres=# SELECT COUNT(*)
FROM (SELECT t, logregr_predict(ARRAY[ 3376.4, 6752.71,
-10129.1, 3376.3,
1688.27]::float[],
ARRAY[x1,x2,x3,x4]) p
FROM logreg) data
WHERE t != p;
count
-------
90
(1 row)
postgres=# SELECT COUNT(*)
FROM (SELECT t, logregr_predict(hoge.coef,
ARRAY[x1,x2,x3,x4]) p
FROM logreg, hoge) data
WHERE t != p;
count
-------
70
(1 row)
The queries count the rows where the prediction is wrong. With the coefficients from our PL/CUDA function, 90 of 40M rows (about 0.0002%) were misclassified; with MADLib's coefficients, 70 of 40M. Note that in real data analytics we usually do not run prediction on the training set.
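The slides do not show how logregr_predict() is implemented. As a sketch of what such a function plausibly does, assuming it labels a row true when the estimated probability exceeds 0.5 (equivalently, when the linear term is positive), a hypothetical C stand-in could look like this:

```c
/* Hypothetical stand-in for a predict function (not the extension's code):
 * returns 1 (true) when w[0] + w[1]*x[0] + ... + w[m]*x[m-1] > 0, which is
 * the same as sigma(w^T phi) > 0.5. w has m + 1 entries, x has m entries. */
int predict_label(const double *w, const double *x, int m)
{
    double a = w[0];

    for (int i = 0; i < m; i++)
        a += w[i + 1] * x[i];
    return a > 0.0;
}
```

Because only the sign of the linear term matters under this assumption, scaling all coefficients by a positive constant does not change any prediction, which is why the large coefficient values in the table classify almost every row correctly.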
Summary
▌Sample program for PL/CUDA
https://github.com/heterodb/toybox
▌PL/CUDA is great
▌Workloads expected to benefit most
• Machine learning
• Similarity search
• Anomaly detection
• Image generation
• ...and others
Conclusion: we wrote a sample PL/CUDA program and published it. PL/CUDA is fun, and it should be valuable for machine learning, similarity search, anomaly detection, image generation, and more.
20181212 - PGconf.ASIA - LT
