BUSINESS DASHBOARDS
using Bonobo, Airflow and Grafana
makersquad.fr
Romain Dorgueil

romain@makersquad.fr


Maker ~0 years
Containers ~5 years
Python ~10 years
Web ~15 years
Linux ~20 years
Code ~25 years
rdorgueil
Intro. Product
1. Plan.
2. Implement.
3. Visualize.
4. Monitor.
5. Iterate.
Outro. References, Pointers, Sidekick
Content
DISCLAIMERS
If you build a product, do your own research.

Take time to learn and understand tools you consider.





I don’t know nothing, and I recommend nothing.

I assume things and change my mind when I hit a wall.

I’m the creator and main developer of Bonobo ETL.

I’ll try to be objective, but there is a bias here.
PRODUCT
GET https://aprc.it/api/800x600/ep2018.europython.eu/
January 2009 → timedelta(years=9) → February 2018 → March 2018 → April 2018 → June 2018 → July 2018 (not expected)
UNDER THE HOOD
[Architecture diagram] Load Balancer (TCP / L4) → Reverse Proxy (HTTP2 / L7) → Website (django) and API Server (tornado, backed by a local cache). A Janitor (asyncio) and Spiders (asyncio) consume the “Events” and “Orders” message queues (messages: “CRAWL”, “CREATED”, “MISS”); state lives in a Database (postgres), Object Storage and Redis. Link legend: AMQP/RabbitMQ, HTTP/HTTPS, SQL/Storage.
[Architecture diagram, continued] The same edge: Load Balancer (TCP / L4) → Reverse Proxy (HTTP2 / L7), plus the Database (postgres). Link legend: HTTP, HTTP/HTTPS, SQL/Storage.
Management services: Prometheus, AlertManager, Grafana, Weblate.
External services: Google Analytics, Stripe, Slack, Sentry, Mailgun, Drift, MixMax, …
Prometheus exporters: Prometheus, Kubernetes, RabbitMQ, Redis, PostgreSQL, NGINX + VTS, Apercite, …
PLAN
« If you can’t measure it, you can’t improve it. »
- Peter Drucker (?)

« What gets measured gets improved. »
- Peter Drucker (?)
- Take your time to choose metrics wisely.
- Cause vs Effect.
- Less is More.
- One at a time. Or one by team.
- There is not one answer to this question.
Planning
Vanity metrics will waste your time.
- What business are you in?
- What stage are you at?
- Start with a framework.
- You may build your own, later.
Planning
Pirate Metrics
Pirate Metrics (based on the Ash Maurya version)
Lean Analytics (book by Alistair Croll & Benjamin Yoskovitz)
Plan A
- What business? Software as a Service
- What stage? Empathy / Stickiness
- What metric matters?
  - Rate from acquisition to activation.
  - QOS (both for display and to measure improvements).
IMPLEMENT
Idea
DataSources (anything) → Aggregated (dims, metrics) → Database (metric → value)
Model
Metric: (id) → name
HourlyValue: (metric, date, hour) → value
DailyValue: (metric, date) → value
One Metric has many HourlyValues and many DailyValues (1-n relationships).
Quick to write.
Not the best.
Keywords to read more: Star and Snowflake Schemas
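For reference, a minimal sketch of what such a model could look like as SQLAlchemy declarative classes. This is an assumption, not the talk’s actual code: the writers later rely on bonobo_sqlalchemy and __tablename__, so this is a plausible shape.

import sqlalchemy as sa
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Metric(Base):
    __tablename__ = 'metric'
    id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column(sa.String, unique=True, nullable=False)

class HourlyValue(Base):
    __tablename__ = 'hourly_value'
    # One metric has many hourly values, keyed by (metric, date, hour).
    metric_id = sa.Column(sa.ForeignKey('metric.id'), primary_key=True)
    date = sa.Column(sa.Date, primary_key=True)
    hour = sa.Column(sa.Integer, primary_key=True)
    value = sa.Column(sa.Float, nullable=False)

class DailyValue(Base):
    __tablename__ = 'daily_value'
    # One metric has many daily values, keyed by (metric, date).
    metric_id = sa.Column(sa.ForeignKey('metric.id'), primary_key=True)
    date = sa.Column(sa.Date, primary_key=True)
    value = sa.Column(sa.Float, nullable=False)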
Bonobo
Extract Transform Load
Select('''
    SELECT *
    FROM …
    WHERE …
''')

def qualify(row):
    yield (
        row,
        'active' if …
        else 'inactive'
    )

Join('''
    SELECT count(*)
    FROM …
    WHERE uid = %(id)s
''')

def report(row):
    send_email(
        render(
            'email.html', row
        )
    )
Bonobo
- Independent threads.
- Data is passed first in, first out.
- Supports any kind of directed acyclic graph.
- Standard Python callables and iterators.
- Getting started still fits in here (ok, barely)
$ pip install bonobo
$ bonobo init somejob.py
$ python somejob.py
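Roughly what the generated job looks like (a sketch from memory, not the exact scaffolded file): three plain callables wired into a graph.

import bonobo

def extract():
    # Any generator or iterable works as a producer node.
    yield 'hello'
    yield 'world'

def transform(row):
    yield row.upper()

def load(row):
    print(row)

def get_graph():
    return bonobo.Graph(extract, transform, load)

if __name__ == '__main__':
    bonobo.run(get_graph())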
Bonobo
Let’s write our jobs.
Extract
import datetime

from bonobo.config import use_context, Service
from bonobo_sqlalchemy.readers import Select

@use_context
class ObjectCountsReader(Select):
    engine = Service('website.engine')
    query = '''
        SELECT count(%(0)s.id) AS cnt
        FROM %(0)s
    '''
    output_fields = ['dims', 'metrics']

    def formatter(self, input_row, row):
        now = datetime.datetime.now()
        return ({
            'date': now.date(),
            'hour': now.hour,
        }, {
            'objects.{}.count'.format(input_row[1]): row['cnt']
        })
… counts from the website’s database
Extract
# AsIs lets psycopg2 substitute raw identifiers (table names) in the query.
from psycopg2.extensions import AsIs

TABLES_METRICS = {
    AsIs('apercite_account_user'): 'users',
    AsIs('apercite_account_userprofile'): 'profiles',
    AsIs('apercite_account_apikey'): 'apikeys',
}

def get_readers():
    return [
        TABLES_METRICS.items(),
        ObjectCountsReader(),
    ]
… counts from the website’s database
Normalize
bonobo.SetFields(['dims', 'metrics'])
All data should look the same
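Concretely, every row leaving the normalization step is a (dims, metrics) pair, whatever reader produced it. The values below are made up for illustration:

import datetime

row = (
    {'date': datetime.date(2018, 7, 25), 'hour': 14},  # dims
    {'objects.users.count': 1234.0},                   # metrics
)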
Load
from bonobo.config import Option
from bonobo_sqlalchemy import InsertOrUpdate

class AnalyticsWriter(InsertOrUpdate):
    dims = Option(required=True)
    filter = Option(required=True)

    @property
    def discriminant(self):
        return ('metric_id', *self.dims)

    def get_or_create_metrics(self, context, connection, metrics):
        …

    def __call__(self, connection, table, buffer, context, row, engine):
        dims, metrics = row
        if not self.filter(dims, metrics):
            return
        # Get database rows for metric objects.
        db_metrics_ids = self.get_or_create_metrics(context, connection, metrics)
        # Insert or update values.
        for metric, value in metrics.items():
            yield from self._put(table, connection, buffer, {
                'metric_id': db_metrics_ids[metric],
                **{dim: dims[dim] for dim in self.dims},
                'value': value,
            })
Compose
import bonobo

def get_graph():
    normalize = bonobo.SetFields(['dims', 'metrics'])
    graph = bonobo.Graph(*get_readers(), normalize)
    graph.add_chain(
        AnalyticsWriter(
            table_name=HourlyValue.__tablename__,
            dims=('date', 'hour',),
            filter=lambda dims, metrics: 'hour' in dims,
            name='Hourly',
        ),
        _input=normalize,
    )
    graph.add_chain(
        AnalyticsWriter(
            table_name=DailyValue.__tablename__,
            dims=('date',),
            filter=lambda dims, metrics: 'hour' not in dims,
            name='Daily',
        ),
        _input=normalize,
    )
    return graph
Configure
def get_services():
    return {
        'sqlalchemy.engine': EventsDatabase().create_engine(),
        'website.engine': WebsiteDatabase().create_engine(),
    }
Inspect
bonobo inspect --graph job.py | dot -o graph.png -T png
Run
$ python -m apercite.analytics read objects --write
- dict_items in=1 out=3 [done]
- ObjectCountsReader in=3 out=3 [done]
- SetFields(['dims', 'metrics']) in=3 out=3 [done]
- HourlyAnalyticsWriter in=3 out=3 [done]
- DailyAnalyticsWriter in=3 [done]
Got it.
Let’s add readers.
We’ll run through them quickly; you’ll have the code.
Google Analytics
from bonobo.config import use

@use('google_analytics')
def read_analytics(google_analytics):
    reports = google_analytics.reports().batchGet(
        body={…}
    ).execute().get('reports', [])
    for report in reports:
        dimensions = report['columnHeader']['dimensions']
        metrics = report['columnHeader']['metricHeader']['metricHeaderEntries']
        rows = report['data']['rows']
        for row in rows:
            dim_values = zip(dimensions, row['dimensions'])
            yield (
                {
                    GOOGLE_ANALYTICS_DIMENSIONS.get(dim, [dim])[0]:
                        GOOGLE_ANALYTICS_DIMENSIONS.get(dim, [None, IDENTITY])[1](val)
                    for dim, val in dim_values
                },
                {
                    GOOGLE_ANALYTICS_METRICS.get(metric['name'], metric['name']):
                        GOOGLE_ANALYTICS_TYPES[metric['type']](value)
                    for metric, value in zip(metrics, row['metrics'][0]['values'])
                },
            )
Prometheus
from bonobo.config import Configurable, Service

class PrometheusReader(Configurable):
    http = Service('http')
    endpoint = 'http://{}:{}/api/v1'.format(PROMETHEUS_HOST, PROMETHEUS_PORT)
    queries = […]

    def __call__(self, *, http):
        start_at, end_at = self.get_timerange()
        for query in self.queries:
            for result in http.get(…).json().get('data', {}).get('result', []):
                metric = result.get('metric', {})
                for ts, val in result.get('values', []):
                    name = query.target.format(**metric)
                    _date, _hour = …
                    yield {
                        'date': _date,
                        'hour': _hour,
                    }, {
                        name: float(val)
                    }
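The elided http.get(…) above talks to Prometheus’s HTTP API. A hypothetical sketch of the range query it would issue: /api/v1/query_range is the real endpoint, while query.expr and the step value are assumptions of mine.

results = http.get(
    self.endpoint + '/query_range',
    params={
        'query': query.expr,           # hypothetical attribute holding the PromQL expression
        'start': start_at.timestamp(),
        'end': end_at.timestamp(),
        'step': '1h',                  # one sample per hour, to match HourlyValue
    },
).json().get('data', {}).get('result', [])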
Spider counts
from bonobo.config import Option
from bonobo_sqlalchemy.readers import Select

class SpidersReader(Select):
    kwargs = Option()
    output_fields = ['row']

    @property
    def query(self):
        return '''
            SELECT spider.value AS name,
                   spider.created_at AS created_at,
                   spider_status.attributes AS attributes,
                   spider_status.created_at AS updated_at
            FROM spider
            JOIN …
            WHERE spider_status.created_at > %(now)s
            ORDER BY spider_status.created_at DESC
        '''

    def formatter(self, input_row, row):
        return (row, )
Spider counts
def spider_reducer(self, left, right):
    result = dict(left)
    result['spider.total'] += len(right.attributes)
    for worker in right.attributes:
        if 'stage' in worker:
            result['spider.active'] += 1
        else:
            result['spider.idle'] += 1
    return result
Spider counts
import datetime

now = datetime.datetime.utcnow() - datetime.timedelta(minutes=30)

def get_readers():
    return (
        SpidersReader(kwargs={'now': now}),
        Reduce(spider_reducer, initializer={
            'spider.idle': 0,
            'spider.active': 0,
            'spider.total': 0,
        }),
        (lambda x: ({'date': now.date(), 'hour': now.hour}, x)),
    )
You got the idea.
Inspect
We can generate ETL graphs with all readers or only a few.
Run
$ python -m apercite.analytics read all --write
- read_analytics in=1 out=91 [done]
- EventsReader in=1 out=27 [done]
- EventsTimingsReader in=1 out=2039 [done]
- group_timings in=2039 out=24 [done]
- format_timings_for_metrics in=24 out=24 [done]
- SpidersReader in=1 out=1 [done]
- Reduce in=1 out=1 [done]
- <lambda> in=1 out=1 [done]
- PrometheusReader in=1 out=3274 [done]
- dict_items in=1 out=3 [done]
- ObjectCountsReader in=3 out=3 [done]
- SetFields(['dims', 'metrics']) in=3420 out=3420 [done]
- HourlyAnalyticsWriter in=3420 out=3562 [done]
- DailyAnalyticsWriter in=3420 out=182 [done]
Easy to build.
Easy to add or replace parts.
Easy to run.
Told ya, slight bias.
VISUALIZE
Grafana
Analytics & Monitoring
Dashboards
Quality of Service
Public Dashboards
Acquisition Rate = User Counts + New Sessions
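For instance, such a panel can be fed by a plain SQL query against the metrics database (a sketch, shown as the deck shows SQL, inside a Python string; table and metric names follow the model above but are my assumptions, not the talk’s actual dashboard):

ACQUISITION_RATE_QUERY = '''
    SELECT u.date AS time,
           u.value / NULLIF(s.value, 0) AS acquisition_rate
    FROM daily_value u
    JOIN metric mu ON mu.id = u.metric_id AND mu.name = 'users.new.count'
    JOIN metric ms ON ms.name = 'sessions.new.count'
    JOIN daily_value s ON s.metric_id = ms.id AND s.date = u.date
    ORDER BY u.date
'''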
We’re just getting started.
MONITOR
Iteration 0
- Cron job runs everything every 30 minutes.
- No way to know if something fails.
- Expensive tasks.
- Hard to run manually.
Monitoring? I’ll handle that.
Airflow




«Airflow is a platform to programmatically author, schedule and monitor workflows.»
- Official docs
Airflow
- Created by Airbnb, later joined the Apache Incubator.
- Schedules & monitors jobs.
- Distributes workloads through Celery, Dask, K8s…
- Can run anything, not just Python.
Airflow
[Architecture diagram] Webserver, Scheduler, Metadata database, and a pool of Workers.
Simplified to show the high-level concept.
Depends on executor (celery, dask, k8s, local, sequential …)
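The executor is a one-line choice in airflow.cfg; a sketch, with CeleryExecutor as the kind of choice a multi-worker topology like the one above implies:

# airflow.cfg
[core]
# SequentialExecutor (default), LocalExecutor, CeleryExecutor,
# DaskExecutor or KubernetesExecutor, depending on how you scale.
executor = CeleryExecutor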
DAGs
import shlex

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def _get_bash_command(*args, module='apercite.analytics'):
    return '(cd /usr/local/apercite; /usr/local/env/bin/python -m {} {})'.format(
        module, ' '.join(map(shlex.quote, args)),
    )

def build_dag(name, *args, schedule_interval='@hourly'):
    # default_args and env are defined elsewhere in the DAG file.
    dag = DAG(
        name,
        schedule_interval=schedule_interval,
        default_args=default_args,
        catchup=False,
    )
    dag >> BashOperator(
        dag=dag,
        task_id=args[0],
        bash_command=_get_bash_command(*args),
        env=env,
    )
    return dag
DAGs
# Build datasource-to-metrics-db related dags.
for source in ('google-analytics', 'events', 'events-timings', 'spiders',
               'prometheus', 'objects'):
    name = 'apercite.analytics.' + source.replace('-', '_')
    globals()[name] = build_dag(name, 'read', source, '--write')

# Cleanup dag.
name = 'apercite.analytics.cleanup'
globals()[name] = build_dag(name, 'clean', 'all', schedule_interval='@daily')
Data Sources
from airflow.models import Connection
from airflow.settings import Session

session = Session()
website = session.query(Connection).filter_by(conn_id='apercite_website').first()
events = session.query(Connection).filter_by(conn_id='apercite_events').first()
session.close()

env = {}

if website:
    env['DATABASE_HOST'] = str(website.host)
    env['DATABASE_PORT'] = str(website.port)
    env['DATABASE_USER'] = str(website.login)
    env['DATABASE_NAME'] = str(website.schema)
    env['DATABASE_PASSWORD'] = str(website.password)

if events:
    env['EVENT_DATABASE_USER'] = str(events.login)
    env['EVENT_DATABASE_NAME'] = str(events.schema)
    env['EVENT_DATABASE_PASSWORD'] = str(events.password)
Warning: sub-optimal (this runs at DAG-parse time, i.e. on every scheduler cycle).
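A less raw approach (a sketch of mine, not from the talk): Airflow’s BaseHook resolves a connection by id without touching the session directly. Note it raises instead of returning None when the connection is missing.

from airflow.hooks.base_hook import BaseHook

website = BaseHook.get_connection('apercite_website')  # raises if undefined
env = {
    'DATABASE_HOST': str(website.host),
    'DATABASE_PORT': str(website.port),
    'DATABASE_USER': str(website.login),
    'DATABASE_NAME': str(website.schema),
    'DATABASE_PASSWORD': str(website.password),
}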
Airflow
- Where to store the DAGs?
- Build: separate virtualenv.
- Everything we do runs locally.
- Deployment.
Learnings
- Multiple services, not trivial
- Helm charts :-(
- Astronomer Distro :-)
- Read the Source, Luke
ITERATE
Plan N+1
- Create a framework for experiments.
- Timebox constraint.
- Objective & Key Result.
- Decide upon results.
Plan N+1 (read: Scaling Lean, by Ash Maurya)
Tech Side
- Month on Month
- Year on Year
- % Growth
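These all boil down to the same arithmetic (trivial, but worth pinning down): growth is the relative change between two periods.

def growth(previous, current):
    """Percentage growth between two periods (month on month, year on year, …)."""
    return (current - previous) / previous * 100.0

growth(1200, 1380)  # 15.0, i.e. 15% month-on-month growth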
Ideas …
- Revenue (stripe, billing …)
- Traffic & SEO (analytics, console …)
- Conversion (AARRR)
- Quality of Service
- Processing (AMQP, …)
- Service Level (HTTP statuses, Time of Requests …)
- Vanity metrics
- Business metrics
Pick one.

Rinse.

Repeat.
OUTRO
Airflow helps you manage the whole factory.

Does not care about the jobs’ content.

Bonobo helps you build assembly lines.

Does not care about the surrounding factory.
References
Feedback + Bonobo ETL Sprint
Slides, resources, feedback …
apercite.fr/europython
romain@makersquad.fr rdorgueil
