
Migrating data from HDFS to BigQuery

This guide walks you through migrating 500 Hive external tables stored in Parquet format to BigQuery using Google Cloud services such as Google Cloud Storage (GCS), Dataflow, Dataform, and Cloud Composer.

Step 1: Set Up IAM Permissions

Ensure the necessary IAM roles are assigned to the service accounts used by Cloud
Composer, Dataflow, and other Google Cloud services:

 Google Cloud Storage: roles/storage.admin
 BigQuery: roles/bigquery.admin
 Dataflow: roles/dataflow.admin
 Cloud Composer: roles/composer.admin
 Service Account: ensure the service account used by Airflow is granted the roles above so it can act on each of these services (see the example grant below).
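
As a minimal sketch of how these grants can be made, the gcloud command below binds one role to a service account; the project ID and service-account email are placeholders for your own values:

sh
gcloud projects add-iam-policy-binding your-project-id \
    --member="serviceAccount:composer-sa@your-project-id.iam.gserviceaccount.com" \
    --role="roles/storage.admin"

Repeat the command for roles/bigquery.admin, roles/dataflow.admin, and roles/composer.admin.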

Step 2: Transfer Data from HDFS to GCS using hadoop distcp

If you have direct access to the Hadoop cluster, you can use hadoop distcp to transfer the
data from HDFS to GCS:

sh
hadoop distcp hdfs://namenode:8020/path-to-hive-tables/ gs://your-bucket/path/
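
This command assumes the Cloud Storage connector is installed on the cluster and can authenticate to GCS. If the connector is not configured cluster-wide, the relevant properties can be passed on the command line; property names vary by connector version, and the key file path below is a placeholder, so treat this as an illustrative sketch rather than a drop-in command:

sh
hadoop distcp \
  -D fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  -D google.cloud.auth.service.account.json.keyfile=/path/to/sa-key.json \
  hdfs://namenode:8020/path-to-hive-tables/ gs://your-bucket/path/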

Step 3: Create Dataflow Pipeline for Transformation and Loading

Create a Dataflow job to read Parquet files from GCS, add new columns, and load the data
into BigQuery.

Dataflow Pipeline (Python)

1. Create the Dataflow script (dataflow_pipeline.py):

python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions
from apache_beam.io import ReadFromParquet, WriteToBigQuery
from datetime import datetime


def add_columns(element):
    """Add audit columns to each record read from Parquet."""
    element['timestamp'] = datetime.utcnow().isoformat()
    element['source_name'] = 'hive_source'
    return element


def run():
    options = PipelineOptions()
    google_cloud_options = options.view_as(GoogleCloudOptions)
    google_cloud_options.project = 'your-project-id'
    google_cloud_options.job_name = 'hive-to-bigquery'
    google_cloud_options.staging_location = 'gs://your-bucket/staging'
    google_cloud_options.temp_location = 'gs://your-bucket/temp'
    options.view_as(StandardOptions).runner = 'DataflowRunner'

    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadFromParquet' >> ReadFromParquet('gs://your-bucket/path-to-parquet-files/*.parquet')
         | 'AddColumns' >> beam.Map(add_columns)
         | 'WriteToBigQuery' >> WriteToBigQuery(
             'your-project-id:your_dataset.your_table',
             schema='SCHEMA_AUTODETECT',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )


if __name__ == '__main__':
    run()

2. Upload the script to GCS:

sh
gsutil cp dataflow_pipeline.py gs://your-bucket/path-to-dataflow-script.py
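
To smoke-test the pipeline before wiring it into Airflow, the script can also be launched directly. PipelineOptions reads command-line flags, so settings not hard-coded in the script, such as the region (us-central1 is only an assumed example), can be supplied at launch time:

sh
python dataflow_pipeline.py --region=us-central1

Because the migration covers around 500 tables, hard-coding a single input path and target table does not scale. One possible approach, sketched below with illustrative option names (--input_path and --output_table are custom options, not built into Beam), is to parameterize the script and launch one job per table:

python
from apache_beam.options.pipeline_options import PipelineOptions

class HiveToBigQueryOptions(PipelineOptions):
    """Custom options so one script can serve every table (illustrative sketch)."""
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--input_path', help='GCS glob of Parquet files for one table')
        parser.add_argument('--output_table', help='Target table as project:dataset.table')

# Inside run(), read the values with
#     custom = options.view_as(HiveToBigQueryOptions)
# and pass custom.input_path to ReadFromParquet and custom.output_table to WriteToBigQuery.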

Step 4: Set Up Dataform for Schema Management

1. Initialize a Dataform project:

sh
dataform init my_dataform_project
cd my_dataform_project

2. Configure dataform.json:

json
{
  "warehouse": "bigquery",
  "defaultSchema": "your_dataset"
}

3. Create SQLX files for table definitions:

Create a definitions directory if it doesn't exist, and add SQLX files for your tables.
For example, definitions/example_table.sqlx:

sqlx
config {
  type: "table",
  description: "This is an example table created from a Dataform script",
  columns: {
    id: "The unique identifier",
    name: "The name of the entity",
    timestamp: "The timestamp when the row was inserted",
    source_name: "The source of the data"
  }
}

select
  id,
  name,
  current_timestamp() as timestamp,
  'hive_source' as source_name
from
  your_dataset.your_table
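
If you prefer Dataform-managed references over the hard-coded your_dataset.your_table above, one option (a sketch, assuming the table loaded by Dataflow lives in your_dataset) is to declare it as a source in, say, definitions/sources/your_table.sqlx and then select from ${ref("your_table")} in the SQLX:

sqlx
config {
  type: "declaration",
  database: "your-project-id",
  schema: "your_dataset",
  name: "your_table"
}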

4. Run Dataform:

sh
dataform run

Step 5: Orchestrate the Process Using Cloud Composer

Create modular Airflow DAGs to automate each step of the process.

5.1 Airflow DAG to Transfer Data to GCS

Create an Airflow DAG (transfer_data_to_gcs_dag.py) to transfer data from HDFS to GCS:

python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
    'retries': 1,
}

with DAG('transfer_data_to_gcs', default_args=default_args,
         schedule_interval='@daily') as dag:

    # Copy the Hive table files from HDFS to the GCS staging bucket.
    transfer_data = BashOperator(
        task_id='distcp_to_gcs',
        bash_command='hadoop distcp hdfs://namenode:8020/path-to-hive-tables/ gs://your-bucket/path/'
    )

5.2 Airflow DAG to Run Dataflow Job

Create an Airflow DAG (load_data_to_bigquery_dag.py) to run the Dataflow job:

python
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
    'retries': 1,
}

with DAG('load_data_to_bigquery', default_args=default_args,
         schedule_interval='@daily') as dag:

    # Launch the Dataflow pipeline stored in GCS to load the Parquet data into BigQuery.
    run_dataflow_job = DataflowCreatePythonJobOperator(
        task_id='run_dataflow',
        py_file='gs://your-bucket/path-to-dataflow-script.py',
        dataflow_default_options={
            'project': 'your-project-id',
            'region': 'us-central1',
            'staging_location': 'gs://your-bucket/staging',
            'temp_location': 'gs://your-bucket/temp',
            'runner': 'DataflowRunner'
        }
    )

5.3 Airflow DAG to Run Dataform

Create an Airflow DAG (run_dataform_dag.py) to run Dataform:

python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
    'retries': 1,
}

with DAG('run_dataform', default_args=default_args,
         schedule_interval='@daily') as dag:

    # Execute the Dataform project to (re)build the managed BigQuery tables.
    run_dataform = BashOperator(
        task_id='run_dataform',
        bash_command='dataform run --project-dir /path/to/your/dataform/project'
    )
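
To run the three DAGs as a single end-to-end flow rather than on independent daily schedules, one option is a small parent DAG that triggers them in sequence. This is a sketch assuming Airflow 2.x and the DAG IDs defined above; adjust names and schedules to your environment (for example, the child DAGs can use schedule_interval=None when they only run via this trigger):

python
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.utils.dates import days_ago

with DAG('hive_to_bigquery_migration', start_date=days_ago(1),
         schedule_interval='@daily', catchup=False) as dag:

    # Trigger each stage and wait for it to finish before starting the next one.
    transfer = TriggerDagRunOperator(
        task_id='trigger_transfer_data_to_gcs',
        trigger_dag_id='transfer_data_to_gcs',
        wait_for_completion=True)

    load = TriggerDagRunOperator(
        task_id='trigger_load_data_to_bigquery',
        trigger_dag_id='load_data_to_bigquery',
        wait_for_completion=True)

    dataform = TriggerDagRunOperator(
        task_id='trigger_run_dataform',
        trigger_dag_id='run_dataform',
        wait_for_completion=True)

    transfer >> load >> dataform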

Conclusion

By following these steps, you can efficiently migrate your Hive external tables to BigQuery.
This approach leverages GCS for intermediate storage, Dataflow for transformation and
loading, Dataform for schema management, and Cloud Composer for orchestration. Each
component is modular, allowing for maintainability and scalability.
