The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analytics Engineer

Nov 24, 2020 • 2 likes • 2,917 views

A traditional data team has roles including data engineer, data scientist, and data analyst. However, many organizations are finding success by integrating a new role: the analytics engineer. The analytics engineer develops a code-based data infrastructure that can serve both analytics and data science teams. They develop reusable data models using the software engineering practices of version control and unit testing, and provide the critical domain expertise that ensures that data products are relevant and insightful. In this talk we’ll cover the role and skill set of the analytics engineer, and discuss how dbt, an open source programming environment, empowers anyone with a SQL skill set to fulfill this new role on the data team. We’ll demonstrate how to use dbt to build version-controlled data models on top of Delta Lake, test both the code and our assumptions about the underlying data, and orchestrate complete data pipelines on Apache Spark™.

The modern data team for the modern data stack:
dbt & the role of the analytics engineer

Welcome
Jeremy Cohen
Associate Product Manager
he/him
jeremy@fishtownanalytics.com
@jerco (community.getdbt.com)

The modern data team
Data Engineer
▪ Custom ingestion
▪ Orchestration
▪ ML endpoints
▪ Platform, architecture, tooling: inform build vs. buy
Analytics Engineer
▪ Provide lean, transformed data ready for analysis
▪ SWE practices to analytics code
▪ Maintain data documentation
Data Analyst
▪ Deep insights & forecasting
▪ Close partnership with business users
▪ Build & guarantee critical reporting

What is dbt?
A. A Python program
B. The heart of the modern data stack
C. An analytics engineer’s best friend
D. A community of top-class data professionals
E. All of the above

What is dbt, actually?
▪ Define, test, document, and reuse complex data transformation logic, just by writing SQL (and a little bit of YAML).
▪ dbt infers a DAG of transformations and runs models in order.
▪ Auto-generated documentation site, built from the same code as your transformations.
The power of a framework, not the limitations of a GUI.
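To make the DAG bullet above concrete, here is a minimal Python sketch of the idea, not dbt's actual implementation. It assumes models declare their dependencies with a `{{ ref('...') }}` call inside their SQL; the model names, the SQL, and the helper functions `infer_dag` and `run_order` are all hypothetical and chosen only for illustration.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical model SQL keyed by model name (illustrative, not from the talk).
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "customer_orders": (
        "select * from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id"
    ),
    "revenue_report": "select * from {{ ref('customer_orders') }}",
}

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def infer_dag(models):
    """Map each model name to the set of model names it references."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def run_order(models):
    """Return an order in which every model comes after its dependencies."""
    return list(TopologicalSorter(infer_dag(models)).static_order())


order = run_order(MODELS)
print(order)
```

The point of the sketch is that nobody writes the run order by hand: it falls out of the references in the SQL, so adding a new model only requires pointing it at its inputs.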


Extending SQL with Jinja
▪ Loops
▪ Macros
▪ Packages
A Pythonic templating engine to write DRYer code and leverage open source innovations.
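As a rough illustration of what a templating loop buys you, the sketch below uses plain Python (not Jinja syntax) to expand a list of values into repetitive SQL columns, which is the kind of expansion a Jinja for-loop performs inside a model. The table name `payments`, the column names, and the payment methods are assumptions made up for the example.

```python
# Hypothetical payment methods; in a dbt model this list would drive a Jinja loop.
PAYMENT_METHODS = ["credit_card", "bank_transfer", "gift_card"]


def pivot_payments_sql(methods):
    """Build one sum(case ...) column per payment method instead of hand-writing each."""
    columns = [
        f"    sum(case when payment_method = '{m}' then amount else 0 end) as {m}_amount"
        for m in methods
    ]
    return (
        "select\n"
        "    order_id,\n"
        + ",\n".join(columns)
        + "\nfrom payments\ngroup by order_id"
    )


sql = pivot_payments_sql(PAYMENT_METHODS)
print(sql)
```

Adding a fourth payment method then means changing one list entry rather than copying another column expression, which is the "DRYer code" the slide refers to.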

The dbt community, by the numbers
▪ 2800+ companies running dbt in production across 12+ databases
▪ 48 open source packages of reusable macros and models
▪ 23k views: our opinionated best practices for dbt project design
▪ 7k data professionals at the top of their game in dbt Slack

dbt + Apache Spark
▪ Open source plugin
▪ pip install dbt-spark
▪ Write business logic in SparkSQL
▪ Dynamically template repetitive SQL with Jinja
▪ Connect to any Spark cluster + dbt run

Analytics engineering meets Delta Lake
▪ Access all core dbt features when you materialize models as Delta tables
▪ Use merge to build incremental models + snapshot slowly changing dimensions
▪ optimize zorder with hooks, operations, macros...
The power of a data lake, the flexibility of a modern data warehouse, the intuition of a common modeling framework.
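The merge bullet above describes upsert semantics: rows whose key already exists in the target are updated, and the rest are inserted. The Python sketch below shows only that logic on in-memory rows. It is not how Delta Lake or the Spark plugin executes a merge, and the function name `merge_incremental`, the `order_id` key, and the sample rows are hypothetical.

```python
def merge_incremental(target, new_rows, unique_key):
    """Upsert new_rows into target: update rows whose key matches, insert the rest."""
    merged = {row[unique_key]: row for row in target}
    for row in new_rows:
        merged[row[unique_key]] = row  # matched key: update; unmatched key: insert
    return list(merged.values())


# Hypothetical existing table contents and a new batch of changed rows.
existing = [
    {"order_id": 1, "status": "placed"},
    {"order_id": 2, "status": "placed"},
]
batch = [
    {"order_id": 2, "status": "shipped"},
    {"order_id": 3, "status": "placed"},
]

result = merge_incremental(existing, batch, "order_id")
print(result)
```

An incremental model built this way only has to process the new batch on each run, rather than rebuilding the whole table from scratch.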

Announcing: dbt Cloud + Databricks
Develop
▪ Hosted IDE
▪ Compile + run SQL in real time
▪ Straightforward git flow
▪ No installation hassle
Deploy
▪ Configurable job scheduler
▪ Continuous integration
▪ Host data documentation
▪ Persist dbt artifacts
Now in closed beta

How to deploy dbt?
▪ SaaS: up & running in minutes
▪ Enterprise: Fishtown-managed VPC, client-managed VPC, airgapped on-prem, …
▪ You! dbt, the Spark plugin, the documentation site: it’s all open source and can be deployed using standard infrastructure.
Build, buy, or balance

Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.


Resume_APRIL_updatedSukanta Saha

Sukanta Saha is a data warehousing and investment banking professional with over 3 years of experience implementing projects using Informatica Power Center. He has expertise in ETL processes, data modeling, working with databases like Oracle, SQL Server, and Teradata. Currently working as an Informatica developer at Tata Consultancy Services in Pune, his past projects include data migration for Barclays Bank and developing mappings for loading data into a data warehouse serving the banking and financial services domain. He has certifications in Oracle Database SQL and seeks to further contribute his skills in data integration.

ResumeSukanta Saha

Sukanta Saha is a data warehousing and investment banking professional with over 3 years of experience implementing projects using Informatica Power Center. He has expertise in ETL processes, data modeling, working with databases like Oracle, SQL Server, and Teradata. Currently working as an Informatica developer at Tata Consultancy Services in Pune, his past projects include data migration for Barclays Bank and developing mappings for loading data into a Teradata data warehouse. He has certifications in Oracle Database SQL and skills in technologies like Unix, SQL, and data warehousing concepts.

Resume_sukanta_updatedSukanta Saha

Sukanta Saha is a data warehousing and investment banking professional with over 3 years of experience implementing projects using Informatica Power Center. He has expertise in ETL processes, data modeling, working with databases like Oracle, SQL Server, and Teradata. Currently working as an Informatica developer at Tata Consultancy Services in Pune, his past projects include data migration for Barclays Bank and developing mappings for loading data into a data warehouse serving the banking and financial services domain. He has certifications in Oracle Database SQL and skills in technologies like Unix, SQL, and data warehousing concepts.

Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksDatabricks

Simplifying AI integration on Apache SparkDatabricks

Data Versioning and Reproducible ML with DVC and MLflowDatabricks

DWX 2023 - Datenbank-Schema Deployment im Kubernetes ReleaseMarc Müller

Data Con LA 2022 - Pre- Recorded - Simplifying AI/ML using Databricks feature...Data Con LA

Workshop on Google Cloud Data PlatformGoDataDriven

COSCUP 2019 - The discussion between Knex.js and PostgreSQLLen Chang

New Developments in SparkDatabricks

VKS-Python-FIe Handling text CSV Binary.pptxVinod Srivastava

03 Daniel 2-notes.ppt seminario escatologiaAlexander Romero Arosquipa

Data Science Courses in India iim skillsdharnathakur29

GenAI for Quant Analytics: survey-analytics.aiInspirient

How iCode cybertech Helped Me Recover My Lost Fundsireneschmid345

Cleaned_Lecture 6666666_Simulation_I.pdfalcinialbob1234

Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnncegiver630

Simple_AI_Explanation_English somplr.pptxssuser2aa19f

computer organization and assembly language.docxalisoftwareengineer1

C++_OOPs_DSA1_Presentation_Template.pptxaquibnoor22079

Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...gmuir1066

Stack_and_Queue_Presentation_Final (1).pptxbinduraniha86

FPET_Implementation_2_MA to 360 Engage Direct.pptxssuser4ef83d

Defense Against LLM Scheming 2025_04_28.pptxGreg Makowski

Digilocker under workingProcess Flow.pptxsatnamsadguru491

Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Abodahab

chapter 4 Variability statistical research .pptxjustinebandajbn

AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsContify

IAS-slides2-ia-aaaaaaaaaaain-business.pdfmcgardenlevi9

VKS-Python Basics for Beginners and advance.pptxVinod Srivastava

The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analytics Engineer

1. The modern data team for the modern data stack: dbt & the role of the analytics engineer

2. Welcome Jeremy Cohen Associate Product Manager he/him jeremy@fishtownanalytics.com @jerco (community.getdbt.com)

3. The modern data stack

4. The modern data team. Data Engineer: ▪ Custom ingestion ▪ Orchestration ▪ ML endpoints ▪ Platform, architecture, tooling: inform build vs. buy. Analytics Engineer: ▪ Provide lean, transformed data ready for analysis ▪ SWE practices to analytics code ▪ Maintain data documentation. Data Analyst: ▪ Deep insights & forecasting ▪ Close partnership with business users ▪ Build & guarantee critical reporting

5. What is dbt? A. A python program B. The heart of the modern data stack C. An analytics engineer’s best friend D. A community of top-class data professionals E. All of the above

6. What is dbt, actually? ▪ Define, test, document, and reuse complex data transformation logic—just by writing SQL (and a little bit of YAML). ▪ dbt infers a DAG of transformations and runs models in order. ▪ Auto-generated documentation site, built from the same code as your transformations. The power of a framework, not the limitations of a GUI.
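The DAG inference on this slide can be sketched roughly as follows. This is a minimal illustration, not dbt's actual implementation: it assumes dependencies are declared with the `{{ ref('model_name') }}` convention, and the model names, SQL, and helper functions (`find_refs`, `run_order`) are hypothetical.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models: name -> SQL text. Names and queries are invented
# for illustration; only the ref() convention comes from dbt.
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "customer_orders": (
        "select c.id, count(o.id) as n_orders "
        "from {{ ref('stg_customers') }} c "
        "left join {{ ref('stg_orders') }} o on o.customer_id = c.id "
        "group by c.id"
    ),
    "top_customers": (
        "select * from {{ ref('customer_orders') }} where n_orders > 10"
    ),
}

# Matches {{ ref('name') }} or {{ ref("name") }} and captures the name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def find_refs(sql):
    """Return the set of model names referenced in one model's SQL."""
    return set(REF_PATTERN.findall(sql))


def run_order(models):
    """Build a dependency graph and return a valid execution order."""
    # TopologicalSorter expects a mapping of node -> its predecessors.
    graph = {name: find_refs(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = run_order(MODELS)
print(order)
```

Every model appears after the models it references, so staging models run before `customer_orders`, which runs before `top_customers`. The relative order of independent models is not significant.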

8. Extending SQL with Jinja ▪ Loops ▪ Macros ▪ Packages A Pythonic templating engine to write DRYer code and leverage open source innovations.
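To make the "DRYer code" point concrete, here is a small sketch of what a templating loop buys you. It is written in plain Python rather than Jinja, so it only imitates the expansion a Jinja `for` loop would perform inside a dbt model; the `payments` table, its columns, and the `pivot_sql` helper are assumptions made up for the example.

```python
def pivot_sql(methods, table="payments"):
    """Generate one aggregate column per payment method.

    Stand-in for a templated loop: the repetitive `sum(case when ...)`
    expression is written once and expanded for each value.
    """
    columns = [
        f"sum(case when payment_method = '{m}' then amount end) as {m}_amount"
        for m in methods
    ]
    select_list = ",\n    ".join(["order_id"] + columns)
    return f"select\n    {select_list}\nfrom {table}\ngroup by order_id"


sql = pivot_sql(["bank_transfer", "credit_card", "gift_card"])
print(sql)
```

Adding a fourth payment method means adding one list entry instead of copying another hand-written column expression.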

9. The dbt community, by the numbers ▪ 2800+ companies running dbt in production across 12+ databases ▪ 48 open source packages of reusable macros and models ▪ 23k views: our opinionated best practices for dbt project design ▪ 7k data professionals at the top of their game in dbt Slack

10. +

11. dbt + Apache Spark ▪ Open source plugin ▪ pip install dbt-spark ▪ Write business logic in SparkSQL ▪ Dynamically template repetitive SQL with Jinja ▪ Connect to any Spark cluster + dbt run

12. Analytics engineering meets Delta Lake ▪ Access all core dbt features when you materialize models as Delta tables ▪ Use merge to build incremental models + snapshot slowly changing dimensions ▪ optimize zorder with hooks, operations, macros... The power of a data lake, the flexibility of a modern data warehouse, the intuition of a common modeling framework.
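The merge behaviour behind incremental models can be illustrated with a toy upsert. This is only a sketch of the semantics (update rows whose key matches, insert the rest) on in-memory dictionaries; it is not Delta Lake's `MERGE` implementation or the SQL that dbt generates, and the `merge_rows` helper and sample rows are hypothetical.

```python
def merge_rows(target, updates, key="id"):
    """Upsert `updates` into `target`, matching rows on `key`.

    Matched keys are overwritten by the incoming row; unmatched keys
    are inserted. Returns a new list sorted by key.
    """
    merged = {row[key]: row for row in target}
    for row in updates:
        merged[row[key]] = row  # update if present, insert otherwise
    return sorted(merged.values(), key=lambda r: r[key])


# Existing table contents and a new batch of changed or new rows.
existing = [{"id": 1, "status": "placed"}, {"id": 2, "status": "placed"}]
new_batch = [{"id": 2, "status": "shipped"}, {"id": 3, "status": "placed"}]

result = merge_rows(existing, new_batch)
print(result)
```

Only the rows in the new batch are processed on each run, which is the point of an incremental model: the full table is not rebuilt every time.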

13. Announcing: dbt Cloud + Databricks (now in closed beta). Develop: ▪ Hosted IDE ▪ Compile + run SQL in real time ▪ Straightforward git flow ▪ No installation hassle. Deploy: ▪ Configurable job scheduler ▪ Continuous integration ▪ Host data documentation ▪ Persist dbt artifacts

14. Demo

15. How to deploy dbt? ▪ SaaS: up & running in minutes ▪ Enterprise: Fishtown-managed VPC, client-managed VPC, airgapped on-prem, … ▪ You! dbt, the Spark plugin, the documentation site: it’s all open source and can be deployed using standard infrastructure. Build, buy, or balance

16. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
