Yelp operates a connector ecosystem that feeds vital data to domain-specific teams and data stores. We share some of our learnings and experiences from operating such a system, and touch on the next phase of its evolution.
Data Quality as a prerequisite for your business success: when should I start ...Anastasija Nikiforova
These are slides for my talk "Data Quality as a prerequisite for your business success: when should I start taking care of it?", delivered as an invited keynote at the HackCodeX Forum, which gathered international experts to share their experience and knowledge on emerging technologies and areas such as Artificial Intelligence, Security, Data Quality, Quantum Computing, Sustainability, Open Data, and Privacy.
Creating your Center of Excellence (CoE) for data driven use casesFrank Vullers
The document discusses creating a data-driven culture and organization. It provides advice on building a data-driven culture, developing the right team and skills, adopting an agile approach, efficiently operationalizing insights, and implementing proper data governance. Specific recommendations include establishing executive sponsorship, advocating for data use, developing data science, engineering, and analytics teams, prioritizing work using agile methodologies, and communicating a business roadmap to operationalize insights.
This document discusses how to build a successful data lake by focusing on the right data, platform, and interface. It emphasizes the importance of saving raw data to analyze later, organizing the data lake into zones with different governance levels, and providing self-service tools to find, understand, provision, prepare, and analyze data. It promotes the use of a smart data catalog like Waterline Data to automate metadata tagging, enable data discovery and collaboration, and maximize business value from the data lake.
The document discusses data mesh vs data fabric architectures. It defines data mesh as a decentralized data processing architecture with microservices and event-driven integration of enterprise data assets across multi-cloud environments. The key aspects of data mesh are that it is decentralized, processes data at the edge, uses immutable event logs and streams for integration, and can move all types of data reliably. The document then provides an overview of how data mesh architectures have evolved from hub-and-spoke models to more distributed designs using techniques like kappa architecture and describes some use cases for event streaming and complex event processing.
A Data Vault Modeling and Methodology introduction that I presented at a Montreal event in September 2011. It covers an introduction and overview of the Data Vault components for Business Intelligence and Data Warehousing. I am Dan Linstedt, the author and inventor of Data Vault Modeling and methodology.
If you use the images anywhere in your presentations, please credit http://LearnDataVault.com as the source (me).
Thank-you kindly,
Daniel Linstedt
The document discusses data governance and outlines several key points:
1) Many organizations have little or no focus on data governance, though most CIOs plan to implement enterprise-wide data governance in the next three years.
2) Data governance refers to the overall management of availability, usability, integrity and security of enterprise data.
3) Effective data governance requires policies, processes, business rules, roles and responsibilities, and technologies to be successfully implemented.
Data Virtualization Discovery SessionDenodo
Watch full webinar here: https://bit.ly/38mIuTp
Denodo offers a virtual session to discover Data Virtualization. Whatever your role (IT manager, architect, data scientist, analyst, or CDO), you will discover how the Denodo Platform, the leading platform for data integration, data management, and real-time data delivery, provides access to any type of data source so you can derive value from it.
Intuit's Data Mesh - Data Mesh Learning Community meetup 5.13.2021Tristan Baker
Past, present, and future of data mesh at Intuit. This deck describes a vision and strategy for improving data worker productivity through a Data Mesh approach to organizing data and holding data producers accountable. Delivered at the inaugural Data Mesh Learning meetup on 5/13/2021.
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products, and previously Jeff was an independent architect for US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and "Adaptive Information,” a frequent keynote at industry conferences, author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
Business Intelligence (BI) and Data Management Basics amorshed
This document provides an overview of business intelligence (BI) and data management basics. It discusses topics such as digital transformation requirements, data strategy, data governance, data literacy, and becoming a data-driven organization. The document emphasizes that in the digital age, data is a key asset and organizations need to focus on data management in order to make informed decisions. It also stresses the importance of data culture and competency for successful BI and data initiatives.
Big Data, Big Deal: For Future Big Data ScientistsWay-Yen Lin
Big Data, Big Deal is a document that discusses big data. It begins by defining big data as high-volume, high-velocity, and high-variety information that requires new processing methods. It then discusses the key drivers for big data, including technical drivers like increased data storage and social media, as well as business drivers like customer analytics and public opinion analysis. The document concludes by discussing challenges for big data like data quality, privacy, and the need for skilled data scientists with technical expertise, curiosity, storytelling abilities, and cleverness.
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
In essence, a data lake is a commodity distributed file system that acts as a repository for raw data file extracts from all the enterprise source systems, so that it can serve the data management and analytics needs of the business. A data lake system provides means to ingest data, perform scalable big data processing, and serve information, in addition to managing, monitoring, and securing the environment. In these slides, we discuss building data lakes using Azure Data Factory and Data Lake Analytics. We delve into the architecture of the data lake and explore its various components. We also describe the various data ingestion scenarios and considerations. We introduce the Azure Data Lake Store, then discuss how to build an Azure Data Factory pipeline to ingest data into the lake. After that, we move into big data processing with Data Lake Analytics, and we delve into U-SQL.
Big data architectures and the data lakeJames Serra
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (SMP) architectures
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...Databricks
How did eBay move their ETL computation from a conventional RDBMS environment over to Spark? What did it take to go from a strategic vision to a viable solution? This paper will take you through a journey which led to an implementation of a 1000+ node Spark cluster running 10,000+ ETL jobs daily, all done in a span of less than 6 months, by a team with limited Spark experience. We will share the vision, technical architecture, critical management decisions, challenges, and the road ahead. This will be a unique opportunity to look into this awesome Spark success story at eBay!
This document discusses the use of big data, artificial intelligence, and social media data in healthcare and diabetes management. It presents research that was able to predict medical diagnoses from language on social media and identifies markers of disease. It also discusses tools that use AI and case-based reasoning to provide insulin dosing recommendations for type 1 diabetes patients based on similar past cases and temporal patient data. The document notes both the promise and limitations of AI in healthcare and that AI will likely require human oversight rather than replacing physicians.
Data Quality: A Rising Data Warehousing ConcernAmin Chowdhury
Characteristics of Data Warehouse
Benefits of a data warehouse
Designing of Data Warehouse
Extract, Transform, Load (ETL)
Data Quality
Classification Of Data Quality Issues
Causes Of Data Quality Issues
Impact of Data Quality Issues
Cost of Poor Data Quality
Confidence and Satisfaction-based impacts
Impact on Productivity
Risk and Compliance impacts
Why Data Quality Influences?
Causes of Data Quality Problems
How to deal: Missing Data
Data Corruption
Data: Out of Range error
Techniques of Data Quality Control
Data warehousing security
Data protection and privacy regulations such as the EU’s General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and Singapore’s Personal Data Protection Act (PDPA) have been major drivers for data governance initiatives and the emergence of data catalog solutions. Organizations have an ever-increasing appetite to leverage their data for business advantage, either through internal collaboration, data sharing across ecosystems, direct commercialization, or as the basis for AI-driven business decision-making. This requires data governance and especially data asset catalog solutions to step up once again and enable data-driven businesses to leverage their data responsibly, ethically, compliantly, and accountably.
This presentation explores how data catalog has become a key technology enabler in overcoming these challenges.
Slides: Knowledge Graphs vs. Property GraphsDATAVERSITY
We are in the era of graphs. Graphs are hot. Why? Flexibility is one strong driver: Heterogeneous data, integrating new data sources, and analytics all require flexibility. Graphs deliver it in spades.
Over the last few years, a number of new graph databases came to market. As we start the next decade, dare we say “the semantic twenties,” we also see vendors that never before mentioned graphs starting to position their products and solutions as graphs or graph-based.
Graph databases are one thing, but “Knowledge Graphs” are an even hotter topic. We are often asked to explain Knowledge Graphs.
Today, there are two main graph data models:
• Property Graphs (also known as Labeled Property Graphs)
• RDF Graphs (Resource Description Framework) aka Knowledge Graphs
Other graph data models are possible as well, but over 90 percent of the implementations use one of these two models. In this webinar, we will cover the following:
I. A brief overview of each of the two main graph models noted above
II. Differences in Terminology and Capabilities of these models
III. Strengths and Limitations of each approach
IV. Why Knowledge Graphs provide a strong foundation for Enterprise Data Governance and Metadata Management
Data Warehouse Concepts | Data Warehouse Tutorial | Data Warehousing | EdurekaEdureka!
This tutorial on data warehouse concepts will tell you everything you need to know about performing data warehousing and business intelligence. The data warehouse concepts explained in this video are:
1. What Is Data Warehousing?
2. Data Warehousing Concepts:
i. OLAP (On-Line Analytical Processing)
ii. Types Of OLAP Cubes
iii. Dimensions, Facts & Measures
iv. Data Warehouse Schema
This document provides an introduction and overview of implementing Data Vault 2.0 on Snowflake. It begins with an agenda and the presenter's background. It then discusses why customers are asking for Data Vault and provides an overview of the Data Vault methodology including its core components of hubs, links, and satellites. The document applies Snowflake features like separation of workloads and agile warehouse scaling to support Data Vault implementations. It also addresses modeling semi-structured data and building virtual information marts using views.
Delivering Data Democratization in the Cloud with SnowflakeKent Graziano
This is a brief introduction to Snowflake Cloud Data Platform and our revolutionary architecture. It contains a discussion of some of our unique features along with some real world metrics from our global customer base.
This is a presentation I gave in 2006 for Bill Inmon. The presentation covers Data Vault and how it integrates with Bill Inmon's DW2.0 vision. This is focused on the business intelligence side of the house.
If you want to use these slides, please put (C) Dan Linstedt, all rights reserved, http://LearnDataVault.com
Emerging Trends in Data Architecture – What’s the Next Big Thing?DATAVERSITY
Digital Transformation is a top priority for many organizations, and a successful digital journey requires a strong data foundation. Achieving this digital transformation requires a number of core data management capabilities, such as MDM. With technological innovation and change occurring at an ever-increasing rate, it's hard to keep track of what's hype and what can provide practical value for your organization. Join this webinar to see the results of a recent DATAVERSITY survey on emerging trends in Data Architecture, along with practical commentary and advice from industry expert Donna Burbank.
How a Semantic Layer Makes Data Mesh Work at ScaleDATAVERSITY
Data Mesh is a trending approach to building a decentralized data architecture by leveraging a domain-oriented, self-service design. However, the pure definition of Data Mesh lacks a center of excellence or central data team and doesn’t address the need for a common approach for sharing data products across teams. The semantic layer is emerging as a key component to supporting a Hub and Spoke style of organizing data teams by introducing data model sharing, collaboration, and distributed ownership controls.
This session will explain how data teams can define common models and definitions with a semantic layer to decentralize analytics product creation using a Hub and Spoke architecture.
Attend this session to learn about:
- The role of a Data Mesh in the modern cloud architecture.
- How a semantic layer can serve as the binding agent to support decentralization.
- How to drive self service with consistency and control.
Data mesh is a decentralized approach to managing and accessing analytical data at scale. It distributes responsibility for data pipelines and quality to domain experts. The key principles are domain-centric ownership, treating data as a product, and using a common self-service infrastructure platform. Snowflake is well-suited for implementing a data mesh with its capabilities for sharing data and functions securely across accounts and clouds, with built-in governance and a data marketplace for discovery. A data mesh implemented on Snowflake's data cloud can support truly global and multi-cloud data sharing and management according to data mesh principles.
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...Big Data Week
We have seen vast improvements to data collection, storage, processing and transport in recent years. An increasing number of networked devices are emitting data and all of us are preparing to handle this wave of valuable data.
Have we, as data professionals, been too focused on the technical challenges and analytical results?
What about the data quality? Are we confident about it? How can we be sure we are making good decisions?
We need to revisit methods of assessing data quality on our modernized data platforms. The quality of our decision making depends on it.
This document discusses big data workflows. It begins by defining big data and workflows, noting that workflows are task-oriented processes for decision making. Big data workflows require many servers to run one application, unlike traditional IT workflows which run on one server. The document then covers the 5Vs and 1C characteristics of big data: volume, velocity, variety, variability, veracity, and complexity. It lists software tools for big data platforms, business analytics, databases, data mining, and programming. Challenges of big data are also discussed: dealing with size and variety of data, scalability, analysis, and management issues. Major application areas are listed in private sector domains like retail, banking, manufacturing, and government.
The document describes scientific workflows for big data and the challenges they present. It discusses Prof. Shiyong Lu's work on developing the VIEW system for designing, executing, and analyzing scientific workflows. The VIEW system provides a runtime environment for workflows, supports their execution on servers or clouds, and enables efficient storage, querying and visualization of workflow provenance data.
Data warehousing and business intelligence project reportsonalighai
Developed a data warehouse project with structured, semi-structured, and unstructured data sources and generated business intelligence reports. The project topic was tobacco product consumption in America. The study examined which products are most popular and found that middle-school students are soft targets for tobacco companies, since most people start using tobacco products at that age.
Tools used: SSMS, SSIS, SSAS, SSRS, R-Studio, Power BI, Excel
The purpose of this presentation is to highlight what end-to-end machine learning looks like in a real-world enterprise. It is meant to give insight to aspiring data scientists whose courses and education in ML mostly focus on algorithms rather than the end-to-end pipeline.
The architecture and components mentioned in Slide 11 will be discussed in detail in a series of posts on LinkedIn over the course of the next few months.
To get updates, follow me on LinkedIn or search for and follow the hashtag #end2endDS. Posts will begin in August 2019 and continue through September 2019.
The Right Data Warehouse: Automation Now, Business Value ThereafterInside Analysis
The Briefing Room with Dr. Robin Bloor and WhereScape
Live Webcast on April 1, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=7b23b14b532bd7be60a70f6bd5209f03
In the Big Data shuffle, everyone is looking at Hadoop as “the answer” to collect interesting data from a new set of sources. While Hadoop has given organizations the power to gather more information assets than ever before, the question still looms: which data, regardless of source, structure, volume and all the rest, are significant for affecting business value – and how do we harness it? One effective approach is to bolster the data warehouse environment with a solution capable of integrating all the data sources, including Hadoop, and automating delivery of key information into the right hands.
Register for this episode of The Briefing Room to hear veteran Analyst Robin Bloor as he explains how a rapidly changing information landscape impacts data management. He will be briefed by Mark Budzinski of WhereScape, who will tout his company’s data warehouse automation solutions. Budzinski will discuss how automation can be the cornerstone for closing the gap between those responsible for data management and the people driving business decisions.
Visit InsideAnalysis.com for more information.
This document provides an introduction to the Semantic Web and RDF (Resource Description Framework). It discusses how the Semantic Web aims to extend the current web by giving data well-defined meaning to enable computers and people to better work together. It introduces RDF as a standard for representing information in the Semantic Web and provides examples of how RDF can be used to represent different types of data, such as relational data and evolving data scenarios.
The document discusses the NoSQL movement and non-relational databases. It provides background on the limitations of relational databases that led to the development of NoSQL databases. Examples of NoSQL databases are described like Voldemort, CouchDB, and Cassandra. Benefits of NoSQL databases include horizontal scaling, high availability, and faster performance.
The document discusses Microsoft's approach to implementing a data mesh architecture using their Azure Data Fabric. It describes how the Fabric can provide a unified foundation for data governance, security, and compliance while also enabling business units to independently manage their own domain-specific data products and analytics using automated data services. The Fabric aims to overcome issues with centralized data architectures by empowering lines of business and reducing dependencies on central teams. It also discusses how domains, workspaces, and "shortcuts" can help virtualize and share data across business units and data platforms while maintaining appropriate access controls and governance.
This document describes a training course on the Federation Business Data Lake. The FBDL allows organizations to ingest diverse data sources, perform various types of analytics including real-time, interactive, and exploratory analytics, and develop applications using insights from big data. The document provides a use case of a restaurant chain that uses the FBDL to analyze social media data and inform menu decisions. It details how the company ingests Twitter data, analyzes it using Hadoop and NoSQL, and uses a dashboard to aid management decisions. The FBDL provides an integrated solution for the full analytics lifecycle from data ingestion to application development.
Big data journey to the cloud maz chaudhri 5.30.18Cloudera, Inc.
We hope this session was valuable in teaching you more about Cloudera Enterprise on AWS, and how fast and easy it is to deploy a modern data management platform—in your cloud and on your terms.
Why Data Virtualization? An Introduction by DenodoJusto Hidalgo
Data Virtualization means Real-time Data Access and Integration. But why do I need it? This presentation tries to answer it in a simple yet clear way.
By Alberto Pan, CTO of Denodo, and Justo Hidalgo, VP Product Management.
Mark Finley has over 20 years of experience developing databases and software using technologies like SQL Server, Oracle, C#, and .NET. He has extensive experience architecting, analyzing, designing, developing, documenting, testing, deploying, and supporting complex systems. Some of his past roles include data architect at Quintiles, where he built a data warehouse and data integration systems, and senior developer/architect at MF Global, where he developed risk management applications. He is proficient in technologies like SQL Server, Oracle, SSIS, C#, ASP.NET, and Agile methodologies.
3. Who am I?
My name is Steven; my preferred pronoun is "he".
I graduated from UC Berkeley EECS in 2005.
This is my second term at Yelp (2017 - now); my last term was 2011 - 2015.
I consider myself a generalist in the field.
4. Who am I?
I work in the team metrics-data within metrics-platform.
6. Data powers decision making
OnLine Transaction Processing (OLTP)
We use MySQL to power yelp.com.
Each transaction interacts with a small amount of data: displaying the reviews, photos, and tips of a business.
OLTP query results are expected to return quickly; no one wants to wait more than 2 seconds for a business page to load.
7. OLTP example: find the titles an author has written, taking advantage of an index.
https://en.wikipedia.org/wiki/Library_catalog#/media/File:Schlagwortkatalog.jpg
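To make the card-catalog analogy concrete, here is a minimal sketch of how an index answers an OLTP lookup without scanning the table. The books table, its author index, and the sample rows are invented for illustration; this is not Yelp's actual schema.

```python
import sqlite3

# Hypothetical library-catalog schema, invented for illustration;
# not Yelp's actual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (title TEXT, author TEXT, pages INTEGER);
    CREATE INDEX idx_books_author ON books (author);
    INSERT INTO books VALUES
        ('Adaptive Information', 'Jeff Pollock', 400),
        ('Semantic Web for Dummies', 'Jeff Pollock', 432),
        ('Super Charge Your Data Warehouse', 'Dan Linstedt', 320);
""")

# OLTP-style lookup: the index on `author` lets the engine jump
# straight to the matching rows instead of scanning the table.
titles = conn.execute(
    "SELECT title FROM books WHERE author = ?", ("Jeff Pollock",)
).fetchall()
print([t for (t,) in titles])
```

MySQL serves yelp.com pages in the same spirit: an index narrows each lookup to the handful of rows a page needs.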
8. Data powers decision making
Developers want to find out which local business has the most reviews. A table scan on the review table?
OnLine Analytical Processing (OLAP): queries that scan the majority of the data relative to the total amount of data. Such queries need a specialized system to support them.
Yelp uses AWS Redshift as a data warehouse to support OLAP queries.
9. OLAP example: the average number of pages of the books stored in the main stacks. Need to scan all the titles.
https://www.dailycal.org/2013/12/08/best-worst-foods-sneak-main-stacks/
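The review-count question from the previous slide can be sketched the same way. The review table and its rows below are hypothetical, but they show why the query must aggregate over every row rather than use an index:

```python
import sqlite3

# Hypothetical review table; the row data is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (business_id INTEGER, stars INTEGER)")
conn.executemany(
    "INSERT INTO review VALUES (?, ?)",
    [(1, 5), (1, 4), (2, 3), (2, 5), (2, 4), (3, 2)],
)

# OLAP-style query: finding the most-reviewed business aggregates
# over every row, so no index can answer it directly.
top = conn.execute(
    "SELECT business_id, COUNT(*) AS n FROM review"
    " GROUP BY business_id ORDER BY n DESC LIMIT 1"
).fetchone()
print(top)  # (2, 3): business 2 has the most reviews
```

At Yelp's scale, this full-scan-plus-aggregation shape is what makes a warehouse like Redshift the right home for such queries rather than the OLTP MySQL fleet.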
12. Data Fabric
We want to avoid n * m programs to transport data, where n is the number of sources and m is the number of sinks.
Domain-specific data stores are here to stay (Stonebraker, "'One Size Fits All': An Idea Whose Time Has Come and Gone").
Stream-Table Duality: we can formulate the transport of data as streams.
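A minimal sketch of the stream-table duality idea: a table is what you get by folding a changelog stream, so any sink that can consume the stream can materialize its own copy. The (op, key, value) event shape here is an illustrative assumption, not Yelp's actual connector format.

```python
# Stream-table duality: a table is the fold of a changelog stream.
# The (op, key, value) event shape is an illustrative assumption.
events = [
    ("upsert", "biz:1", {"name": "Cafe A", "review_count": 10}),
    ("upsert", "biz:2", {"name": "Cafe B", "review_count": 3}),
    ("upsert", "biz:1", {"name": "Cafe A", "review_count": 11}),
    ("delete", "biz:2", None),
]

def materialize(stream):
    """Replay a changelog stream into a table of latest values."""
    table = {}
    for op, key, value in stream:
        if op == "upsert":
            table[key] = value       # last write for a key wins
        elif op == "delete":
            table.pop(key, None)     # tombstone removes the key
    return table

print(materialize(events))
# {'biz:1': {'name': 'Cafe A', 'review_count': 11}}
```

With a shared changelog like this, each of the n sources writes one stream and each of the m sinks reads from it, so roughly n + m connectors replace n * m point-to-point programs.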
17. Benefits: Connector Ecosystem
Lowers the barrier to entry: it's easy to move data between data stores.
High-performance implementations: each data store has its own performance characteristics.
Stream processing over batch processing: near real-time data availability.
19. Lessons Learned: Connector Ecosystem
Schematized data is good: it lessens the likelihood of malformed data.
Schema evolution can be difficult: making incompatible schema changes can break many things, so discourage them in the registration phase.
Decouple data producers and data consumers: we need automation to inform data producers how to manage the data life cycle, as producers do not think about who uses the data.
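One way a registration phase can discourage incompatible changes is an automated compatibility check. The sketch below approximates an Avro-style backward-compatibility rule; the rule, field representation, and sample schemas are assumptions for illustration, not Yelp's actual policy.

```python
def is_backward_compatible(old_fields, new_fields):
    """Approximate, Avro-style backward-compatibility check.

    Both arguments map field name -> (type, has_default). A change is
    accepted only if no existing field changes type and every added
    field carries a default for records written under the old schema.
    """
    for name, (new_type, has_default) in new_fields.items():
        if name in old_fields:
            old_type, _ = old_fields[name]
            if new_type != old_type:
                return False   # a type change would break old data
        elif not has_default:
            return False       # added field has no value for old records
    return True

old = {"business_id": ("string", False), "stars": ("int", False)}
good = dict(old, source=("string", True))     # added field with a default
bad = {"business_id": ("string", False), "stars": ("string", False)}

print(is_backward_compatible(old, good))  # True  -> accepted
print(is_backward_compatible(old, bad))   # False -> rejected at registration
```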
21. Desirable Improvements
Data producers should own their data life cycle: a specific connector's owner does not have visibility into data semantics.
Data consumers are stakeholders: consumers don't want to find out about incompatible changes after they have been rolled out.
A self-serve mechanism accelerates changes: the only way to evolve rapidly is to self-serve.
22. Data Mesh
Data specifications are like microservice APIs: they are contracts between producers and consumers.
Each team owns its data specifications, to avoid accidental abstraction leakage.
Decentralization allows rapid experiments; common conventions are promoted to minimize friction among different domain systems.
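To picture "data specifications as contracts", here is a hypothetical sketch of a small, versioned spec owned by a producing team. The fields, spec name, and consumer names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataSpec:
    """A hypothetical data-product contract, owned by the producing team."""
    name: str
    version: int
    owner: str        # owning team, analogous to a microservice owner
    schema: dict      # field name -> type
    consumers: tuple  # registered stakeholders to consult before changes

review_events_v2 = DataSpec(
    name="review_events",
    version=2,
    owner="metrics-data",
    schema={"business_id": "string", "stars": "int", "source": "string"},
    consumers=("search-ranking", "ads-metrics"),
)
```

Registering consumers on the spec gives the producing team the visibility into stakeholders that the previous slide says connector owners lack.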
24. yelp.com/dataset_challenge
Academic dataset from 10 cities across the globe!
Your academic project, research, or visualization submitted by December 31, 2019 = a $5,000 prize*!
*See full terms on the website.
6M reviews
1M business attributes
190K businesses
200K photos