BUSINESS DASHBOARDS
using Bonobo, Airflow and Grafana
makersquad.fr
Romain Dorgueil

romain@makersquad.fr


Maker ~0 years
Containers ~5 years
Python ~10 years
Web ~15 years
Linux ~20 years
Code ~25 years
rdorgueil
Intro. Product
1. Plan.
2. Implement.
3. Visualize.
4. Monitor.
5. Iterate.
Outro. References, Pointers, Sidekick
Content
DISCLAIMERS
If you build a product, do your own research.

Take time to learn and understand tools you consider.





I don’t know nothing, and I recommend nothing.

I assume things and change my mind when I hit a wall.

I’m the creator and main developer of Bonobo ETL.

I’ll try to be objective, but there is a bias here.
PRODUCT
GET https://aprc.it/api/800x600/ep2018.europython.eu/
January 2009 → timedelta(years=9) → February 2018 → March 2018 → April 2018 → June 2018 → July 2018 (not expected)
UNDER THE HOOD
[Architecture diagram] Load Balancer (TCP / L4) → Reverse Proxy (HTTP2 / L7) → Website (django) and API Server (tornado, backed by a local cache). A Janitor (asyncio) and Spiders (asyncio) consume the “Events” and “Orders” message queues (messages: “CRAWL”, “CREATED”, “MISS”); state lives in a Database (postgres), Object Storage and Redis. Link legend: AMQP/RabbitMQ, HTTP/HTTPS, SQL/Storage.
[Architecture diagram, continued] The same edge: Load Balancer (TCP / L4) → Reverse Proxy (HTTP2 / L7), plus the Database (postgres). Link legend: HTTP, HTTP/HTTPS, SQL/Storage.
Management services: Prometheus, AlertManager, Grafana, Weblate.
External services: Google Analytics, Stripe, Slack, Sentry, Mailgun, Drift, MixMax, …
Prometheus exporters: Prometheus, Kubernetes, RabbitMQ, Redis, PostgreSQL, NGINX + VTS, Apercite, …
PLAN
« If you can’t measure it, you can’t improve it. »
- Peter Drucker (?)

« What gets measured gets improved. »
- Peter Drucker (?)
- Take your time to choose metrics wisely.
- Cause vs Effect.
- Less is More.
- One at a time. Or one by team.
- There is not one answer to this question.
Planning
Vanity metrics will waste your time.
- What business are you in?
- What stage are you at?
- Start with a framework.
- You may build your own, later.
Planning
Pirate Metrics
Pirate Metrics (based on the Ash Maurya version)
Lean Analytics (book by Alistair Croll & Benjamin Yoskovitz)
Plan A
- What business? Software as a Service
- What stage? Empathy / Stickiness
- What metric matters?
  - Rate from acquisition to activation.
  - QOS (both for display and to measure improvements).
IMPLEMENT
Idea
DataSources (anything) → Aggregated (dims, metrics) → Database (metric → value)
Model
Metric: (id) → name
HourlyValue: (metric, date, hour) → value
DailyValue: (metric, date) → value
One Metric has many HourlyValues and many DailyValues (1-n relationships).
Quick to write.
Not the best.
Keywords to read more: Star and Snowflake Schemas
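For reference, a minimal sketch of what such a model could look like as SQLAlchemy declarative classes. This is an assumption, not the talk’s actual code: the writers later rely on bonobo_sqlalchemy and __tablename__, so this is a plausible shape.

import sqlalchemy as sa
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Metric(Base):
    __tablename__ = 'metric'
    id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column(sa.String, unique=True, nullable=False)

class HourlyValue(Base):
    __tablename__ = 'hourly_value'
    # One metric has many hourly values, keyed by (metric, date, hour).
    metric_id = sa.Column(sa.ForeignKey('metric.id'), primary_key=True)
    date = sa.Column(sa.Date, primary_key=True)
    hour = sa.Column(sa.Integer, primary_key=True)
    value = sa.Column(sa.Float, nullable=False)

class DailyValue(Base):
    __tablename__ = 'daily_value'
    # One metric has many daily values, keyed by (metric, date).
    metric_id = sa.Column(sa.ForeignKey('metric.id'), primary_key=True)
    date = sa.Column(sa.Date, primary_key=True)
    value = sa.Column(sa.Float, nullable=False)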
Bonobo
Extract Transform Load
Select('''
    SELECT *
    FROM …
    WHERE …
''')

def qualify(row):
    yield (
        row,
        'active' if …
        else 'inactive'
    )

Join('''
    SELECT count(*)
    FROM …
    WHERE uid = %(id)s
''')

def report(row):
    send_email(
        render(
            'email.html', row
        )
    )
Bonobo
- Independent threads.
- Data is passed first in, first out.
- Supports any kind of directed acyclic graph.
- Standard Python callables and iterators.
- Getting started still fits in here (ok, barely)
$ pip install bonobo
$ bonobo init somejob.py
$ python somejob.py
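Roughly what the generated job looks like (a sketch from memory, not the exact scaffolded file): three plain callables wired into a graph.

import bonobo

def extract():
    # Any generator or iterable works as a producer node.
    yield 'hello'
    yield 'world'

def transform(row):
    yield row.upper()

def load(row):
    print(row)

def get_graph():
    return bonobo.Graph(extract, transform, load)

if __name__ == '__main__':
    bonobo.run(get_graph())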
Bonobo
Let’s write our jobs.
Extract
import datetime

from bonobo.config import use_context, Service
from bonobo_sqlalchemy.readers import Select

@use_context
class ObjectCountsReader(Select):
    engine = Service('website.engine')
    query = '''
        SELECT count(%(0)s.id) AS cnt
        FROM %(0)s
    '''
    output_fields = ['dims', 'metrics']

    def formatter(self, input_row, row):
        now = datetime.datetime.now()
        return ({
            'date': now.date(),
            'hour': now.hour,
        }, {
            'objects.{}.count'.format(input_row[1]): row['cnt']
        })
… counts from the website’s database
Extract
# AsIs lets psycopg2 substitute raw identifiers (table names) in the query.
from psycopg2.extensions import AsIs

TABLES_METRICS = {
    AsIs('apercite_account_user'): 'users',
    AsIs('apercite_account_userprofile'): 'profiles',
    AsIs('apercite_account_apikey'): 'apikeys',
}

def get_readers():
    return [
        TABLES_METRICS.items(),
        ObjectCountsReader(),
    ]
… counts from the website’s database
Normalize
bonobo.SetFields(['dims', 'metrics'])
All data should look the same
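Concretely, every row leaving the normalization step is a (dims, metrics) pair, whatever reader produced it. The values below are made up for illustration:

import datetime

row = (
    {'date': datetime.date(2018, 7, 25), 'hour': 14},  # dims
    {'objects.users.count': 1234.0},                   # metrics
)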
Load
from bonobo.config import Option
from bonobo_sqlalchemy import InsertOrUpdate

class AnalyticsWriter(InsertOrUpdate):
    dims = Option(required=True)
    filter = Option(required=True)

    @property
    def discriminant(self):
        return ('metric_id', *self.dims)

    def get_or_create_metrics(self, context, connection, metrics):
        …

    def __call__(self, connection, table, buffer, context, row, engine):
        dims, metrics = row
        if not self.filter(dims, metrics):
            return
        # Get database rows for metric objects.
        db_metrics_ids = self.get_or_create_metrics(context, connection, metrics)
        # Insert or update values.
        for metric, value in metrics.items():
            yield from self._put(table, connection, buffer, {
                'metric_id': db_metrics_ids[metric],
                **{dim: dims[dim] for dim in self.dims},
                'value': value,
            })
Compose
import bonobo

def get_graph():
    normalize = bonobo.SetFields(['dims', 'metrics'])
    graph = bonobo.Graph(*get_readers(), normalize)
    graph.add_chain(
        AnalyticsWriter(
            table_name=HourlyValue.__tablename__,
            dims=('date', 'hour',),
            filter=lambda dims, metrics: 'hour' in dims,
            name='Hourly',
        ),
        _input=normalize,
    )
    graph.add_chain(
        AnalyticsWriter(
            table_name=DailyValue.__tablename__,
            dims=('date',),
            filter=lambda dims, metrics: 'hour' not in dims,
            name='Daily',
        ),
        _input=normalize,
    )
    return graph
Configure
def get_services():
    return {
        'sqlalchemy.engine': EventsDatabase().create_engine(),
        'website.engine': WebsiteDatabase().create_engine(),
    }
Inspect
bonobo inspect --graph job.py | dot -o graph.png -T png
Run
$ python -m apercite.analytics read objects --write
- dict_items in=1 out=3 [done]
- ObjectCountsReader in=3 out=3 [done]
- SetFields(['dims', 'metrics']) in=3 out=3 [done]
- HourlyAnalyticsWriter in=3 out=3 [done]
- DailyAnalyticsWriter in=3 [done]
Got it.
Let’s add readers.
We’ll run through them quickly; you’ll have the code.
Google Analytics
from bonobo.config import use

@use('google_analytics')
def read_analytics(google_analytics):
    reports = google_analytics.reports().batchGet(
        body={…}
    ).execute().get('reports', [])
    for report in reports:
        dimensions = report['columnHeader']['dimensions']
        metrics = report['columnHeader']['metricHeader']['metricHeaderEntries']
        rows = report['data']['rows']
        for row in rows:
            dim_values = zip(dimensions, row['dimensions'])
            yield (
                {
                    GOOGLE_ANALYTICS_DIMENSIONS.get(dim, [dim])[0]:
                        GOOGLE_ANALYTICS_DIMENSIONS.get(dim, [None, IDENTITY])[1](val)
                    for dim, val in dim_values
                },
                {
                    GOOGLE_ANALYTICS_METRICS.get(metric['name'], metric['name']):
                        GOOGLE_ANALYTICS_TYPES[metric['type']](value)
                    for metric, value in zip(metrics, row['metrics'][0]['values'])
                },
            )
Prometheus
from bonobo.config import Configurable, Service

class PrometheusReader(Configurable):
    http = Service('http')
    endpoint = 'http://{}:{}/api/v1'.format(PROMETHEUS_HOST, PROMETHEUS_PORT)
    queries = […]

    def __call__(self, *, http):
        start_at, end_at = self.get_timerange()
        for query in self.queries:
            for result in http.get(…).json().get('data', {}).get('result', []):
                metric = result.get('metric', {})
                for ts, val in result.get('values', []):
                    name = query.target.format(**metric)
                    _date, _hour = …
                    yield {
                        'date': _date,
                        'hour': _hour,
                    }, {
                        name: float(val)
                    }
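The elided http.get(…) above talks to Prometheus’s HTTP API. A hypothetical sketch of the range query it would issue: /api/v1/query_range is the real endpoint, while query.expr and the step value are assumptions of mine.

results = http.get(
    self.endpoint + '/query_range',
    params={
        'query': query.expr,           # hypothetical attribute holding the PromQL expression
        'start': start_at.timestamp(),
        'end': end_at.timestamp(),
        'step': '1h',                  # one sample per hour, to match HourlyValue
    },
).json().get('data', {}).get('result', [])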
Spider counts
from bonobo.config import Option
from bonobo_sqlalchemy.readers import Select

class SpidersReader(Select):
    kwargs = Option()
    output_fields = ['row']

    @property
    def query(self):
        return '''
            SELECT spider.value AS name,
                   spider.created_at AS created_at,
                   spider_status.attributes AS attributes,
                   spider_status.created_at AS updated_at
            FROM spider
            JOIN …
            WHERE spider_status.created_at > %(now)s
            ORDER BY spider_status.created_at DESC
        '''

    def formatter(self, input_row, row):
        return (row, )
Spider counts
def spider_reducer(self, left, right):
    result = dict(left)
    result['spider.total'] += len(right.attributes)
    for worker in right.attributes:
        if 'stage' in worker:
            result['spider.active'] += 1
        else:
            result['spider.idle'] += 1
    return result
Spider counts
import datetime

now = datetime.datetime.utcnow() - datetime.timedelta(minutes=30)

def get_readers():
    return (
        SpidersReader(kwargs={'now': now}),
        Reduce(spider_reducer, initializer={
            'spider.idle': 0,
            'spider.active': 0,
            'spider.total': 0,
        }),
        (lambda x: ({'date': now.date(), 'hour': now.hour}, x)),
    )
You got the idea.
Inspect
We can generate ETL graphs with all readers or only a few.
Run
$ python -m apercite.analytics read all --write
- read_analytics in=1 out=91 [done]
- EventsReader in=1 out=27 [done]
- EventsTimingsReader in=1 out=2039 [done]
- group_timings in=2039 out=24 [done]
- format_timings_for_metrics in=24 out=24 [done]
- SpidersReader in=1 out=1 [done]
- Reduce in=1 out=1 [done]
- <lambda> in=1 out=1 [done]
- PrometheusReader in=1 out=3274 [done]
- dict_items in=1 out=3 [done]
- ObjectCountsReader in=3 out=3 [done]
- SetFields(['dims', 'metrics']) in=3420 out=3420 [done]
- HourlyAnalyticsWriter in=3420 out=3562 [done]
- DailyAnalyticsWriter in=3420 out=182 [done]
Easy to build.
Easy to add or replace parts.
Easy to run.
Told ya, slight bias.
VISUALIZE
Grafana
Analytics & Monitoring
Dashboards
Quality of Service
Public Dashboards
Acquisition Rate = User Counts + New Sessions
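For instance, such a panel can be fed by a plain SQL query against the metrics database (a sketch, shown as the deck shows SQL, inside a Python string; table and metric names follow the model above but are my assumptions, not the talk’s actual dashboard):

ACQUISITION_RATE_QUERY = '''
    SELECT u.date AS time,
           u.value / NULLIF(s.value, 0) AS acquisition_rate
    FROM daily_value u
    JOIN metric mu ON mu.id = u.metric_id AND mu.name = 'users.new.count'
    JOIN metric ms ON ms.name = 'sessions.new.count'
    JOIN daily_value s ON s.metric_id = ms.id AND s.date = u.date
    ORDER BY u.date
'''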
We’re just getting started.
MONITOR
Iteration 0
- Cron job runs everything every 30 minutes.
- No way to know if something fails.
- Expensive tasks.
- Hard to run manually.
Monitoring? I’ll handle that.
Airflow




«Airflow is a platform to programmatically author, schedule and monitor workflows.»
- Official docs
Airflow
- Created by Airbnb, later joined the Apache Incubator.
- Schedules & monitors jobs.
- Distributes workloads through Celery, Dask, K8s…
- Can run anything, not just Python.
Airflow
[Architecture diagram] Webserver, Scheduler, Metadata database, and a pool of Workers.
Simplified to show the high-level concept.
Depends on executor (celery, dask, k8s, local, sequential …)
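The executor is a one-line choice in airflow.cfg; a sketch, with CeleryExecutor as the kind of choice a multi-worker topology like the one above implies:

# airflow.cfg
[core]
# SequentialExecutor (default), LocalExecutor, CeleryExecutor,
# DaskExecutor or KubernetesExecutor, depending on how you scale.
executor = CeleryExecutor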
DAGs
import shlex

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def _get_bash_command(*args, module='apercite.analytics'):
    return '(cd /usr/local/apercite; /usr/local/env/bin/python -m {} {})'.format(
        module, ' '.join(map(shlex.quote, args)),
    )

def build_dag(name, *args, schedule_interval='@hourly'):
    # default_args and env are defined elsewhere in the DAG file.
    dag = DAG(
        name,
        schedule_interval=schedule_interval,
        default_args=default_args,
        catchup=False,
    )
    dag >> BashOperator(
        dag=dag,
        task_id=args[0],
        bash_command=_get_bash_command(*args),
        env=env,
    )
    return dag
DAGs
# Build datasource-to-metrics-db related dags.
for source in ('google-analytics', 'events', 'events-timings', 'spiders',
               'prometheus', 'objects'):
    name = 'apercite.analytics.' + source.replace('-', '_')
    globals()[name] = build_dag(name, 'read', source, '--write')

# Cleanup dag.
name = 'apercite.analytics.cleanup'
globals()[name] = build_dag(name, 'clean', 'all', schedule_interval='@daily')
Data Sources
from airflow.models import Connection
from airflow.settings import Session

session = Session()
website = session.query(Connection).filter_by(conn_id='apercite_website').first()
events = session.query(Connection).filter_by(conn_id='apercite_events').first()
session.close()

env = {}

if website:
    env['DATABASE_HOST'] = str(website.host)
    env['DATABASE_PORT'] = str(website.port)
    env['DATABASE_USER'] = str(website.login)
    env['DATABASE_NAME'] = str(website.schema)
    env['DATABASE_PASSWORD'] = str(website.password)

if events:
    env['EVENT_DATABASE_USER'] = str(events.login)
    env['EVENT_DATABASE_NAME'] = str(events.schema)
    env['EVENT_DATABASE_PASSWORD'] = str(events.password)
Warning: sub-optimal (this runs at DAG-parse time, i.e. on every scheduler cycle).
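A less raw approach (a sketch of mine, not from the talk): Airflow’s BaseHook resolves a connection by id without touching the session directly. Note it raises instead of returning None when the connection is missing.

from airflow.hooks.base_hook import BaseHook

website = BaseHook.get_connection('apercite_website')  # raises if undefined
env = {
    'DATABASE_HOST': str(website.host),
    'DATABASE_PORT': str(website.port),
    'DATABASE_USER': str(website.login),
    'DATABASE_NAME': str(website.schema),
    'DATABASE_PASSWORD': str(website.password),
}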
Airflow
- Where to store the DAGs?
- Build: separate virtualenv.
- Everything we do runs locally.
- Deployment.
Learnings
- Multiple services, not trivial
- Helm charts :-(
- Astronomer Distro :-)
- Read the Source, Luke
ITERATE
Plan N+1
- Create a framework for experiments.
- Timebox constraint.
- Objective & Key Result.
- Decide upon results.
Plan N+1 (read: Scaling Lean, by Ash Maurya)
Tech Side
- Month on Month
- Year on Year
- % Growth
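These all boil down to the same arithmetic (trivial, but worth pinning down): growth is the relative change between two periods.

def growth(previous, current):
    """Percentage growth between two periods (month on month, year on year, …)."""
    return (current - previous) / previous * 100.0

growth(1200, 1380)  # 15.0, i.e. 15% month-on-month growth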
Ideas …
- Revenue (stripe, billing …)
- Traffic & SEO (analytics, console …)
- Conversion (AARRR)
- Quality of Service
- Processing (AMQP, …)
- Service Level (HTTP statuses, Time of Requests …)
- Vanity metrics
- Business metrics
Pick one.

Rinse.

Repeat.
OUTRO
Airflow helps you manage the whole factory.

Does not care about the jobs’ content.

Bonobo helps you build assembly lines.

Does not care about the surrounding factory.
References
Feedback + Bonobo ETL Sprint
Slides, resources, feedback …
apercite.fr/europython
romain@makersquad.fr rdorgueil
