Datacamp ETL Documentation
November 2009
[email protected]
www.knowerce.sk
knowerce|consulting
Document information
Creator: Knowerce, s.r.o.
Vavilovova 16
851 01 Bratislava
[email protected]
www.knowerce.sk
Document revision: 1
Document Restrictions
Copyright (C) 2009 Knowerce, s.r.o., Stefan Urbanek
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU
Free Documentation License, Version 1.3 or any later version published by the Free Software
Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
license is included in the section entitled "GNU Free Documentation License".
Contents
Introduction
Overview
  System Context
  Objects and classes
Installation
  Software Requirements
  Preparation
  Database initialisation
  Configuration
Running ETL Jobs
  Launching
  Manual Launching
  Scheduled using cron
  Running Programmatically
  What jobs will be run
  Job Status
Job Management
  Scheduling
  Forced run
Creating a Job Bundle
  Example: Public Procurement Extraction ETL job
  Job Utility Methods
  Errors and Failing a Job
Defaults
  ETL System Defaults
  Using defaults in jobs
Appendix: ETL Tables
  etl_jobs
  etl_job_status
  etl_defaults
  etl_batch
Cron Example
Introduction
This document describes the architecture, structures and processes of the Datacamp Extraction, Transformation and Loading (ETL) framework. The purpose of the framework is to perform automated, scheduled data processing, usually in the background. Main features:
Wiki Documentation:
http://wiki.github.com/Stiivi/Datacamp-ETL/
Support
General Discussion Mailing List
http://groups.google.com/group/datacamp
Overview
System Context
The Datacamp ETL framework has a plug-in based architecture and runs on top of a database server.
[Figure: system context diagram showing ETL module directories with job module bundles, the Job Manager, and the DB server hosting the ETL staging database and ETL defaults.]
Objects and classes
Batch
Download Batch – list of files and additional information for automated, parallel downloading and processing
Job – abstract class for ETL jobs; provides utilities for running, logging and error handling
Job Status – information about a job run: when it was run, what the result was and the reason for failure
Installation
Software Requirements
■ database server (currently only MySQL works, as there are a couple of MySQL-specific code residues; this will change in the future)
■ ruby
■ rails
■ gems: sequel
Preparation
I. Create a directory where working files, such as dumps and ETL files, will be stored, for example:
/var/lib/datacamp
II. Create a database. For use with the Datacamp web application, create two schemas:
■ data schema, for example: datacamp_data
■ staging schema (for ETL), for example: datacamp_staging
III. Create a database user that has full access (SELECT, INSERT, UPDATE, CREATE TABLE, …) to the Datacamp ETL schemas.
In summary, you need the following (a quick connection check is sketched after this list):
■ sources
■ a working directory
■ one or two database schemas
■ a database user with appropriate permissions
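As a quick sanity check, the database user can be verified with the Sequel gem from the requirements list. The following sketch uses placeholder credentials and the example schema names from step II; adjust them to your installation.

require 'sequel'

# Placeholder credentials; replace them with the user created in step III.
%w[datacamp_data datacamp_staging].each do |schema|
  db = Sequel.connect(adapter: 'mysql2', host: 'localhost',
                      user: 'datacamp', password: 'secret',
                      database: schema)
  db.test_connection            # raises if the user cannot connect
  puts "#{schema}: ok"
  db.disconnect
end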
Database initialisation
To initialize the ETL database schema, run the appropriate SQL script from the install directory, for example:
mysql -u root -p datacamp_staging < install/etl_tables.mysql.sql
Configuration
Create config.yml in the ETL directory. You can use config.yml.example as a template.
Configuration variables are:
Variable           Description
etl_files_path     path for working files – downloaded, extracted and temporary files
dataset_dump_path  Datacamp application specific: where Datacamp datasets are dumped (dumps are shared by the ETL and the application)
log_file           file where the log is written; if not set, standard error output (stderr) is used
# ETL Configuration
#
###########################################################
# Paths

# Where temporary ETL files are stored (such as files downloaded from web)
etl_files_path: /var/lib/datacamp-etl
job_search_path: /usr/lib/datacamp-etl/jobs

###########################################################
# Database Connection
connection:
  host: localhost
  username: root
  password:
  charset: utf8

staging_schema: datacamp_staging
dataset_schema: datacamp_data
app_schema: datacamp_app
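The configuration file is plain YAML. The following sketch shows how it can be read from Ruby, for example when debugging a job; it is illustrative only and is not the framework's own loading code.

require 'yaml'

# Read the same configuration file the ETL tool uses; keys are plain strings.
config = YAML.load_file('config.yml')
puts config['etl_files_path']
puts config['connection']['host']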
Running ETL Jobs
Launching
Manual Launching
Jobs are run by simply launching the etl.rb script:
ruby etl.rb
The script looks for config.yml in the current directory. You can pass another configuration file:
ruby etl.rb --config another_config.yml
Running Programmatically
Alternatively, configure a JobManager manually and run all scheduled jobs:
job_manager = JobManager.new
# … configure job_manager here …
job_manager.run_scheduled_jobs
The log is written to the preconfigured file or to standard error output. See the Installation chapter for how to configure the log file.
Job Status
Each job leaves a footprint of its run in the etl_job_status table. The table contains the following information:
Column  Description
phase   if the job has more phases, this column identifies which phase the job is in
status  result of the job run, one of:
        ■ running – the job is still running (or the ETL crashed and did not reset the job status)
        ■ ok – the job finished correctly
        ■ failed – the job did not finish correctly; see phase and message for more information
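Failed runs can be inspected directly in the staging schema. The sketch below uses the Sequel gem and assumes the status and message column names described above; the credentials are placeholders.

require 'sequel'

# Connect to the staging schema (credentials are placeholders).
DB = Sequel.connect(adapter: 'mysql2', host: 'localhost',
                    user: 'root', password: '',
                    database: 'datacamp_staging')

# List failed job runs; the status and message column names are assumptions
# based on the description above.
DB[:etl_job_status].where(status: 'failed').each do |row|
  puts "#{row[:phase]}: #{row[:message]}"
end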
Job Management
Jobs are managed through the etl_jobs table, where you specify:
Column     Description
run_order  number which specifies the order in which jobs are run. Jobs are run from the lowest number to the highest; if the number is the same for several jobs, the behaviour is undefined.
To add a new job, insert a row into the table and set the job information. To remove a job, simply delete its row.
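As an illustration, a new job row could be inserted as follows, assuming the Sequel connection DB from the Job Status example. Apart from run_order and force_run, the column names are assumptions, so check the etl_jobs table of your installation.

# A sketch of registering a job; column names other than run_order and
# force_run are illustrative.
DB[:etl_jobs].insert(
  name:      'public_procurement_extraction',  # hypothetical job name
  schedule:  'daily',                           # daily or a week-day name
  run_order: 10,                                # lower numbers run first
  force_run: 0
)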
Scheduling
Jobs can currently be scheduled on a daily basis:
■ daily – run each day
■ monday, tuesday, wednesday, thursday, friday, saturday, sunday – run on the particular week day
Once a job has been successfully run by the scheduler, the job manager does not run it again unless explicitly requested with the force_run flag.
Forced run
Jobs can be run out of schedule by setting the force_run flag. This allows data managers to re-run an ETL job remotely, without requiring access to the system where the ETL processes are hosted. The job will be run the next time the scheduler runs. For example, if the ETL is scheduled in cron to run hourly, the job is re-run within the next hour; if it is scheduled for daily runs, it will be run the next day.
The flag is reset to 0 after each run. The reason for this behaviour is to prevent unintentionally running lengthy, time- and CPU-consuming jobs and to protect already processed data from possible inconsistencies introduced by running jobs at unexpected times.
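Setting the flag is a single update of the etl_jobs table. The sketch below assumes the Sequel connection DB and the hypothetical name column from the previous example.

# Force a single job to run on the next scheduler pass; the framework
# clears the flag again after the run.
DB[:etl_jobs].where(name: 'public_procurement_extraction').update(force_run: 1)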
Creating a Job Bundle
The job class should implement a run method containing the main job code:
def run
  # … job code goes here …
end
Each job also has access to the defaults dictionary. See the Defaults chapter for more information.
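Putting this together, a job bundle class might look like the following sketch. Only Job, run and defaults are taken from this document; the class name, the defaults key and the error handling are illustrative assumptions.

# A minimal sketch of a job bundle class, assuming the framework's abstract
# Job class and a hash-like defaults dictionary.
class PublicProcurementExtraction < Job
  def run
    # Read a job-specific setting from the defaults dictionary; the key is
    # hypothetical and would live in the job's own domain.
    source_url = defaults[:source_url]

    # The framework's own failure-reporting utilities are not documented in
    # this text, so a plain exception is used as a stand-in.
    raise "no source URL configured" if source_url.nil?

    # … download, extract and load the data here …
  end
end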
Defaults
Defaults is a configurable key-value dictionary used by ETL jobs and by the ETL system itself. The key-value pairs are grouped by domain. A domain usually corresponds to a job name; for example, the invoices loading job and the invoices transformation job share the common domain invoices. The domain etl is reserved for ETL system configuration. The purpose of defaults is to make it possible to configure ETL jobs remotely and in a more convenient way.
Defaults are stored in the etl_defaults table, which contains domain, default_key and value columns.
ETL System Defaults
The etl domain uses the following keys:
■ force_run_all (default FALSE) – on the next ETL run all enabled jobs are launched, regardless of their scheduling (see Running ETL Jobs)
■ reset_force_run_flag (default TRUE) – after running a forced job (see Running ETL Jobs), clear its flag so it will not be run again
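As an illustration, a job-specific default can be set directly in the etl_defaults table. The sketch assumes the Sequel connection DB from the earlier examples; the invoices domain comes from the text above, while the source_url key and its value are purely illustrative.

# Set or update a default for the invoices domain.
key = { domain: 'invoices', default_key: 'source_url' }
if DB[:etl_defaults].where(key).empty?
  DB[:etl_defaults].insert(key.merge(value: 'http://example.com/invoices.csv'))
else
  DB[:etl_defaults].where(key).update(value: 'http://example.com/invoices.csv')
end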
Appendix: ETL Tables
etl_jobs
Column         Type      Description
run_order      int       order in which the jobs are run; if more jobs have the same order number, the behaviour is undefined
last_run_date  datetime  date and time when the job was last run
etl_job_status
Column  Type     Description
phase   varchar  phase in which the job currently is while running, or was in when it finished
etl_defaults
Column  Type  Description
id      int   association id
etl_batch
Column            Type
id                int
batch_type        varchar
batch_source      varchar
data_source_name  varchar
data_source_url   varchar
valid_due_date    date
batch_date        date
username          varchar
created_at        datetime
updated_at        datetime
Cron Example
#!/bin/bash
#
# ETL cron job script
#
# Ubuntu/Debian: put this script in /etc/cron.daily
# Other systems: schedule it appropriately in /etc/crontab

#####################################################################
# ETL Configuration
#####################################################################

# Adjust these paths to your installation; the values below are examples.
RUBY=/usr/bin/ruby
ETL_PATH=/usr/lib/datacamp-etl
CONFIG=$ETL_PATH/config.yml
ETL_TOOL=etl.rb

$RUBY -I $ETL_PATH $ETL_PATH/$ETL_TOOL --config $CONFIG