Use GPUs with Dataproc Serverless

You can attach GPU accelerators to your Dataproc Serverless batch workloads to achieve the following results:

Speed up the processing of large-scale data analytics workloads.
Accelerate model training on large datasets using GPU machine learning libraries.
Perform advanced data analytics, such as video or natural language processing.

All supported Dataproc Serverless Spark runtimes add the Spark RAPIDS library to each workload node. Dataproc Serverless Spark runtime version 1.1 also adds the XGBoost library to workload nodes. These libraries provide data transformation and machine learning tools that you can use in your GPU-accelerated workloads.

GPU benefits

Here are some of the benefits of using GPUs with your Dataproc Serverless Spark workloads:

Performance improvement: GPU acceleration can significantly boost the performance of Spark workloads, particularly for compute-intensive tasks such as machine learning and deep learning, graph processing, and complex analytics.
Faster model training: For machine learning tasks, attaching GPUs can dramatically reduce the time required to train models, so data scientists and engineers can iterate and experiment quickly.
Scalability: You can add more GPU nodes, or more powerful GPUs, to handle increasingly complex processing needs.
Cost efficiency: Although GPUs require an initial investment, you can save costs over time through shorter processing times and more efficient resource utilization.
Enhanced data analytics: GPU acceleration lets you perform advanced analytics, such as image and video analysis and natural language processing, on large datasets.
Improved products: Faster processing enables faster decision-making and more responsive applications.

Limitations and considerations

You can attach NVIDIA A100 or NVIDIA L4 GPUs to Dataproc Serverless batch workloads. A100 and L4 accelerators are subject to Compute Engine GPU regional availability.
The XGBoost library is provided to Dataproc Serverless GPU-accelerated workloads only when using Dataproc Serverless Spark runtime version 1.x.
Dataproc Serverless GPU-accelerated batches with XGBoost use increased Compute Engine quotas. For example, to run a serverless batch workload that uses an NVIDIA L4 GPU, you must allocate the NVIDIA_L4_GPUS quota.
Accelerator-enabled jobs are not compatible with the constraints/compute.requireShieldedVm organization policy. If your organization enforces this policy, its accelerator-enabled jobs will not run successfully.
You must set the default character set to UTF-8 when using RAPIDS GPU acceleration with supported Dataproc Serverless runtimes earlier than version 2.2. See Create a serverless batch workload with GPU accelerators for more information.
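
The UTF-8 requirement depends only on the runtime version, so it can be checked before a workload is submitted. The following is a minimal sketch in Python and is not part of any Google Cloud library: `needs_utf8_lang` and `utf8_properties` are hypothetical helper names, and the version string is assumed to have the form `MAJOR.MINOR`. The two property names are the ones used in the submit command later on this page.

```python
def needs_utf8_lang(runtime_version: str) -> bool:
    """Return True when the runtime version is earlier than 2.2."""
    # Assumption: the version looks like "1.1" or "2.2"; extra parts are ignored.
    major, minor = (int(part) for part in runtime_version.split(".")[:2])
    return (major, minor) < (2, 2)


def utf8_properties(runtime_version: str) -> dict:
    """Return the LANG properties to add for this runtime, or an empty dict."""
    if not needs_utf8_lang(runtime_version):
        return {}
    return {
        "spark.dataproc.driverEnv.LANG": "C.UTF-8",
        "spark.executorEnv.LANG": "C.UTF-8",
    }
```

The returned dictionary can be merged into whatever structure you use to assemble the `--properties` flag.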

Pricing

See Dataproc Serverless pricing for accelerator pricing information.

Before you begin

Before creating a serverless batch workload with attached GPU accelerators, do the following:

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the Dataproc, Compute Engine, and Cloud Storage APIs.

Enable the APIs

Install the Google Cloud CLI.

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command:

gcloud init

In the Google Cloud console, go to the Cloud Storage Buckets page.
Go to Buckets
Click Create.
On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
1. In the Get started section, do the following:
  - Enter a globally unique name that meets the bucket naming requirements.
  - To add a bucket label, expand the Labels section, click Add label, and specify a key and a value for your label.
2. In the Choose where to store your data section, do the following:
  1. Select a Location type.
  2. Choose a location where your bucket's data is permanently stored from the Location type drop-down menu.
    - If you select the dual-region location type, you can also choose to enable turbo replication by using the relevant checkbox.
  3. To set up cross-bucket replication, select Add cross-bucket replication via Storage Transfer Service and follow these steps:
    Set up cross-bucket replication
    
    In the Bucket menu, select a bucket.
    
    In the Replication settings section, click Configure to configure settings for the replication job.
    
    The Configure cross-bucket replication pane appears.
    
    To filter objects to replicate by object name prefix, enter a prefix that you want to include or exclude objects from, then click Add a prefix.
    
    To set a storage class for the replicated objects, select a storage class from the Storage class menu. If you skip this step, the replicated objects will use the destination bucket's storage class by default.
    
    Click Done.
3. In the Choose how to store your data section, do the following:
  1. Select a default storage class for the bucket or Autoclass for automatic storage class management of your bucket's data.
  2. To enable hierarchical namespace, in the Optimize storage for data-intensive workloads section, select Enable hierarchical namespace on this bucket.
    Note: You cannot enable hierarchical namespace in existing buckets.
4. In the Choose how to control access to objects section, select whether or not your bucket enforces public access prevention, and select an access control method for your bucket's objects.
  Note: You cannot change the Prevent public access setting if this setting is enforced at an organization policy.
5. In the Choose how to protect object data section, do the following:
  - Select any of the options under Data protection that you want to set for your bucket.
    - To enable soft delete, click the Soft delete policy (For data recovery) checkbox, and specify the number of days you want to retain objects after deletion.
    - To set Object Versioning, click the Object versioning (For version control) checkbox, and specify the maximum number of versions per object and the number of days after which the noncurrent versions expire.
    - To enable the retention policy on objects and buckets, click the Retention (For compliance) checkbox, and then do the following:
      - To enable Object Retention Lock, click the Enable object retention checkbox.
      - To enable Bucket Lock, click the Set bucket retention policy checkbox, and choose a unit of time and a length of time for your retention period.
  - To choose how your object data will be encrypted, expand the Data encryption section, and select a Data encryption method.
Click Create.

Create a serverless batch workload with GPU accelerators

Submit a Dataproc Serverless batch workload that uses NVIDIA L4 GPUs to run a parallelized PySpark task. Follow these steps using the gcloud CLI:

Create and save the following PySpark code to a file named test-py-spark-gpu.py on your local machine using a text or code editor.

#!/usr/bin/env python

"""S8s Accelerators Example."""

import subprocess
from typing import Any
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType
from pyspark.sql.types import StructField
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("joindemo").getOrCreate()


def get_num_gpus(_: Any) -> int:
  """Returns the number of GPUs."""
  p_nvidia_smi = subprocess.Popen(
      ["nvidia-smi", "-L"], stdin=None, stdout=subprocess.PIPE
  )
  p_wc = subprocess.Popen(
      ["wc", "-l"],
      stdin=p_nvidia_smi.stdout,
      stdout=subprocess.PIPE,
      stderr=subprocess.PIPE,
      universal_newlines=True,
  )
  [out, _] = p_wc.communicate()
  return int(out)


num_workers = 5
result = (
    spark.sparkContext.range(0, num_workers, 1, num_workers)
    .map(get_num_gpus)
    .collect()
)
num_gpus = sum(result)
print(f"Total accelerators: {num_gpus}")

# Run the join example
schema = StructType([StructField("value", IntegerType(), True)])
df = (
    spark.sparkContext.parallelize(range(1, 10000001), 6)
    .map(lambda x: (x,))
    .toDF(schema)
)
df2 = (
    spark.sparkContext.parallelize(range(1, 10000001), 6)
    .map(lambda x: (x,))
    .toDF(schema)
)
joined_df = df.select(col("value").alias("a")).join(
    df2.select(col("value").alias("b")), col("a") == col("b")
)
# explain() prints the physical plan and returns None, so call it separately.
joined_df.explain()
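
The `get_num_gpus` helper in the script counts GPUs by piping `nvidia-smi -L` through `wc -l`. As an alternative, the counting can be done in Python, which avoids the second process and lets the function return 0 when the tool is missing. This is a sketch only: `count_gpus_in_listing` and `get_num_gpus_safe` are hypothetical names, and the code assumes that each device line printed by `nvidia-smi -L` starts with `GPU `.

```python
import subprocess


def count_gpus_in_listing(listing: str) -> int:
    """Count device lines in text shaped like `nvidia-smi -L` output."""
    # Assumption: one line per GPU, each starting with "GPU ".
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def get_num_gpus_safe(_=None) -> int:
    """Return the number of GPUs visible on this node, or 0 on failure."""
    try:
        completed = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # nvidia-smi is not installed, or it exited with an error.
        return 0
    return count_gpus_in_listing(completed.stdout)
```

Because `get_num_gpus_safe` accepts and ignores one argument, it can be passed to `.map(...)` in the same position as `get_num_gpus`.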

Use the gcloud CLI on your local machine to submit the Dataproc Serverless batch job with five workers, each accelerated with L4 GPUs:

gcloud dataproc batches submit pyspark test-py-spark-gpu.py \
    --project=PROJECT_ID \
    --region=REGION \
    --deps-bucket=BUCKET_NAME \
    --version=1.1 \
    --properties=spark.dataproc.executor.compute.tier=premium,spark.dataproc.executor.disk.tier=premium,spark.dataproc.executor.resource.accelerator.type=l4,spark.executor.instances=5,spark.dataproc.driverEnv.LANG=C.UTF-8,spark.executorEnv.LANG=C.UTF-8,spark.shuffle.manager=com.nvidia.spark.rapids.RapidsShuffleManager
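
For scripted submissions, the long `--properties` value is easier to maintain as a dictionary than as a single comma-separated string. The sketch below assembles the same command in Python. `build_submit_command` is a hypothetical helper rather than a gcloud or Dataproc API, and `PROJECT_ID`, `REGION`, and `BUCKET_NAME` remain placeholders that you replace with your own values.

```python
import shlex


def build_submit_command(script, project_id, region, bucket, version, properties):
    """Return the gcloud argument list for a PySpark batch submission."""
    # Join the properties in insertion order as key=value pairs.
    props = ",".join(f"{key}={value}" for key, value in properties.items())
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", script,
        f"--project={project_id}",
        f"--region={region}",
        f"--deps-bucket={bucket}",
        f"--version={version}",
        f"--properties={props}",
    ]


properties = {
    "spark.dataproc.executor.compute.tier": "premium",
    "spark.dataproc.executor.disk.tier": "premium",
    "spark.dataproc.executor.resource.accelerator.type": "l4",
    "spark.executor.instances": 5,
    "spark.dataproc.driverEnv.LANG": "C.UTF-8",
    "spark.executorEnv.LANG": "C.UTF-8",
    "spark.shuffle.manager": "com.nvidia.spark.rapids.RapidsShuffleManager",
}

command = build_submit_command(
    "test-py-spark-gpu.py", "PROJECT_ID", "REGION", "BUCKET_NAME", "1.1", properties
)
# Print a copy-pasteable command line; nothing is executed here.
print(shlex.join(command))
```

To run the command from Python instead of printing it, the list can be passed to `subprocess.run(command, check=True)` on a machine where the gcloud CLI is installed and initialized.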

Notes:

PROJECT_ID: Your Google Cloud project ID.
REGION: An available Compute Engine region to run the workload.
BUCKET_NAME: The name of your Cloud Storage bucket. Spark uploads workload dependencies to a /dependencies folder in this bucket before running the batch workload.
--version: All supported Dataproc Serverless runtimes add the RAPIDS library to each node of a GPU-accelerated workload. Only runtime version 1.1 adds the XGBoost library to each node of a GPU-accelerated workload.

--properties (see Spark resource allocation properties):

spark.dataproc.driverEnv.LANG=C.UTF-8 and spark.executorEnv.LANG=C.UTF-8 (required with runtime versions earlier than 2.2): These properties set the default character set to C.UTF-8.
spark.dataproc.executor.compute.tier=premium (required): GPU-accelerated workloads are billed using premium Data Compute Units (DCUs). See Dataproc Serverless accelerator pricing.
spark.dataproc.executor.disk.tier=premium (required): Nodes with A100-40, A100-80, or L4 accelerators must use the premium disk tier.
spark.dataproc.executor.resource.accelerator.type=l4 (required): Specify exactly one GPU type. The example job selects the L4 GPU. The following accelerator types can be specified with the following argument names:

GPU type	Argument name
L4	`l4`
A100 40GB	`a100-40`
A100 80GB	`a100-80`

spark.executor.instances=5 (required): Must be at least two. Set to five for this example.
spark.executor.cores (optional): You can set this property to specify the number of vCPU cores. Valid values for L4 GPUs are 4 (the default), 8, 12, 16, 24, 48, or 96. The only valid value, and the default, for A100 GPUs is 12. Configurations with L4 GPUs and 24, 48, or 96 cores have 2, 4, or 8 GPUs attached to each executor, respectively. All other configurations have 1 GPU attached.
spark.dataproc.executor.disk.size (required): L4 GPUs have a fixed disk size of 375 GB, except for configurations with 24, 48, or 96 cores, which have 750, 1,500, or 3,000 GB, respectively. If you set this property to a different value when submitting an L4-accelerated workload, an error occurs. If you select an A100 40 or A100 80 GPU, valid sizes are 375g, 750g, 1500g, 3000g, 6000g, and 9000g.

spark.executor.memory (optional) and spark.executor.memoryOverhead (optional): You can set one of these properties, but not both. The amount of available memory not consumed by the property you set is applied to the unset property. By default, spark.executor.memoryOverhead is set to 40% of available memory for PySpark batch workloads, and 10% for other workloads (see Spark resource allocation properties).

The following table shows the maximum amount of memory that can be set for different A100 and L4 GPU configurations. The minimum value for either property is 1024 MB.

	A100 (40 GB)	A100 (80 GB)	L4 (4 core)	L4 (8 core)	L4 (12 core)	L4 (16 core)	L4 (24 core)	L4 (48 core)	L4 (96 core)
Maximum total memory (MB)	78040	165080	13384	26768	40152	53536	113072	160608	321216
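
The limits in this table can be checked locally before submission. The following sketch copies the table values into a lookup and validates a single memory setting against the 1024 MB minimum and the per-configuration maximum. `check_memory_mb` is a hypothetical helper, and treating the table value as an upper bound on the one property you set is an assumption made here for illustration; the service's own validation is authoritative.

```python
# Maximum total memory (MB) per configuration, copied from the table above.
# A100 rows are keyed with 12 cores, the only valid core count for A100 GPUs.
MAX_TOTAL_MEMORY_MB = {
    ("a100-40", 12): 78040,
    ("a100-80", 12): 165080,
    ("l4", 4): 13384,
    ("l4", 8): 26768,
    ("l4", 12): 40152,
    ("l4", 16): 53536,
    ("l4", 24): 113072,
    ("l4", 48): 160608,
    ("l4", 96): 321216,
}
MIN_MEMORY_MB = 1024


def check_memory_mb(accelerator: str, cores: int, memory_mb: int) -> int:
    """Validate one memory setting and return the MB left under the maximum."""
    key = (accelerator, cores)
    if key not in MAX_TOTAL_MEMORY_MB:
        raise ValueError(f"unknown configuration: {accelerator} with {cores} cores")
    limit = MAX_TOTAL_MEMORY_MB[key]
    if not MIN_MEMORY_MB <= memory_mb <= limit:
        raise ValueError(
            f"{memory_mb} MB is outside the range {MIN_MEMORY_MB}..{limit} MB"
        )
    return limit - memory_mb
```

The return value is the headroom that remains under the table maximum, which corresponds to the memory that the text above describes as being applied to the unset property.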

Spark RAPIDS properties (optional): By default, Dataproc Serverless sets the following Spark RAPIDS property values:
- spark.plugins=com.nvidia.spark.SQLPlugin
- spark.executor.resource.gpu.amount=1
- spark.task.resource.gpu.amount=1/$spark_executor_cores
- spark.shuffle.manager=''. By default, this property is unset. However, NVIDIA recommends turning on the RAPIDS shuffle manager when using GPUs to improve performance. To do so, set spark.shuffle.manager=com.nvidia.spark.rapids.RapidsShuffleManager when you submit a workload.
See RAPIDS Accelerator for Apache Spark Configuration to set Spark RAPIDS properties, and RAPIDS Accelerator for Apache Spark Advanced Configuration to set Spark advanced properties.
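
The core, GPU, and disk rules given for spark.executor.cores and spark.dataproc.executor.disk.size can be collected into a small lookup, which makes it easier to catch an invalid combination before a submission fails. This is an illustrative sketch built only from the values stated on this page; `gpus_per_executor` is a hypothetical helper, and disk sizes are expressed as plain integers in GB.

```python
# L4 executors: cores -> (GPUs per executor, fixed disk size in GB).
L4_SHAPES = {
    4: (1, 375),
    8: (1, 375),
    12: (1, 375),
    16: (1, 375),
    24: (2, 750),
    48: (4, 1500),
    96: (8, 3000),
}
A100_TYPES = ("a100-40", "a100-80")
A100_CORES = 12
A100_DISK_SIZES_GB = (375, 750, 1500, 3000, 6000, 9000)


def gpus_per_executor(accelerator: str, cores: int, disk_gb: int) -> int:
    """Validate an executor configuration and return its GPU count."""
    if accelerator == "l4":
        if cores not in L4_SHAPES:
            raise ValueError(f"invalid core count for L4: {cores}")
        gpus, fixed_disk_gb = L4_SHAPES[cores]
        if disk_gb != fixed_disk_gb:
            raise ValueError(
                f"L4 with {cores} cores requires a {fixed_disk_gb} GB disk"
            )
        return gpus
    if accelerator in A100_TYPES:
        if cores != A100_CORES:
            raise ValueError("A100 executors must use 12 cores")
        if disk_gb not in A100_DISK_SIZES_GB:
            raise ValueError(f"invalid A100 disk size: {disk_gb} GB")
        return 1
    raise ValueError(f"unknown accelerator type: {accelerator}")
```

For example, `gpus_per_executor("l4", 48, 1500)` accepts the configuration, while `gpus_per_executor("l4", 24, 375)` raises `ValueError` because 24-core L4 executors use a 750 GB disk.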
