Data platform architecture

May 28, 2019

The document discusses data architecture solutions for solving real-time, high-volume data problems with low latency response times. It recommends a data platform capable of capturing, ingesting, streaming, and optionally storing data for batch analytics. The solution should provide fast data ingestion, real-time analytics, fast action, and quick time to value. Multiple data sources like logs, social media, and internal systems would be ingested using Apache Flume and Kafka and analyzed with Spark/Storm streaming. The processed data would be stored in HDFS, Cassandra, S3, or Hive. Kafka, Spark, and Cassandra are identified as key technologies for real-time data pipelines, stream analytics, and high availability persistent storage.

DATA ARCHITECTURE & ROAD MAP
NEXT GENERATION DATA PLATFORM AND ARCHITECTURAL PATTERNS
BY SUDHEER KONDLA
SENIOR DATA PLATFORM ARCHITECT

OVERVIEW
• Define the problem
• Understand the problem
• Articulate the problem
• Craft a solution

DATA ARCHITECTURE SOLUTION
• To solve a real-time, high-volume data problem with low-latency response times, we need a data platform capable of capturing, ingesting, streaming and optionally storing data for batch analytics. In most real-time streaming data platforms the data is short-lived once it has been processed to build the predictive models that let marketing offer real-time recommendations. The following characteristics are expected:
• Fast Data
• Require fast ingestion
• Real-time analytics
• Fast action
• Time to value
• Benefits
• Capture and use (or discard – time to live or purge)
• Insights in real time or near real time
• Agile and Responsive
• Expressive
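The "capture and use (or discard)" benefit can be pictured as a store whose entries expire after a time to live. The following is only a minimal Python sketch under stated assumptions: the class name `TTLStore`, its methods and the injectable clock are hypothetical illustrations and do not come from any platform named in this deck.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLStore:
    """Toy in-memory store whose entries expire after a time to live (hypothetical)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._items: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        # Store the value together with the moment it stops being valid.
        self._items[key] = (self.clock() + self.ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Expired: discard on read.
            del self._items[key]
            return None
        return value

    def purge(self) -> int:
        """Remove every expired entry and return how many were dropped."""
        now = self.clock()
        expired = [k for k, (exp, _) in self._items.items() if now >= exp]
        for k in expired:
            del self._items[k]
        return len(expired)
```

Expiry on read keeps stale values from being served, while `purge` reclaims memory for keys that are never read again. A production store would typically handle both itself.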

ECOSYSTEM & INFRASTRUCTURE
• Multiple Data Sources:
• Web/app logs, Twitter (trending) and other social media, blogs, SOR (internal systems of record), HDFS
• Ingestion/Streaming
• Apache Flume (log capture/aggregation), Kafka (event streaming, data pipelines & messaging)
• Stream Analytics
• Spark/Storm API
• Data Store/Persistence
• HDFS, Cassandra, S3, Hive
• Infrastructure
• IaaS (Cloud) or On-premise or Hybrid Private Cloud
• Orchestration
• Mesos

REAL-TIME DATA PIPELINES
Real-time data pipeline
• Kafka: collect data into Kafka (channel data)
• Spark: process micro-batches (aggregate, predict & act)
• Cassandra: persist data for later use (historical, analytics)
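The three pipeline stages (collect, process micro-batches, persist) can be sketched in plain Python. This is an illustration under assumptions, not the deck's implementation: a `deque` stands in for a Kafka topic, a `Counter` stands in for a Cassandra table, batches are cut by size rather than by time interval, and the function names `micro_batches` and `run_pipeline` are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Dict, Iterator, List


def micro_batches(channel: Deque[str], batch_size: int) -> Iterator[List[str]]:
    """Drain the channel in fixed-size micro-batches (the last one may be smaller)."""
    while channel:
        size = min(batch_size, len(channel))
        yield [channel.popleft() for _ in range(size)]


def run_pipeline(events: List[str], batch_size: int) -> Dict[str, int]:
    channel: Deque[str] = deque(events)  # stand-in for a Kafka topic
    store: Counter = Counter()           # stand-in for a Cassandra table
    for batch in micro_batches(channel, batch_size):
        aggregated = Counter(batch)      # "aggregate" step on one micro-batch
        store.update(aggregated)         # "persist" step: merge into running totals
    return dict(store)
```

Calling `run_pipeline(["view", "click", "view", "buy", "view"], batch_size=2)` processes three micro-batches and leaves per-event totals in the store. The "predict & act" step would sit between aggregation and persistence.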

CHOOSING THE RIGHT ECOSYSTEM
• Kafka:
• Distributed pub-sub messaging and data pipeline system
• Designed for processing real-time activity streams (logs, metrics)
• When to use: real-time decision making, working with streams of continuous data
• Why Kafka: persistent messaging, high throughput, fault tolerant
• Spark:
• What is it: a distributed computing framework that scales and integrates real-time data from many event streams (Kafka, Flume, HDFS, S3, Twitter and other sources)
• Event-driven, asynchronous, scalable, type-safe and fault tolerant
• Where does it fit:
• When you need real-time decision making: recommendation, fraud detection, real-time forecasting
• Why Spark Streaming:
• Provides high throughput and reliability for live data streams
• Batch, iterative and streaming on the same platform
• Fits machine learning workloads
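"Persistent messaging" in a pub-sub system means a message stays in the log after it is read, so each group of consumers can read at its own pace. The sketch below is a toy model only: the class `TopicLog` and its methods are hypothetical, and it leaves out the partitions, replication and retention that a real Kafka deployment has.

```python
from collections import defaultdict
from typing import Dict, List


class TopicLog:
    """Toy append-only log with one read offset per consumer group (hypothetical)."""

    def __init__(self) -> None:
        self._messages: List[str] = []
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> int:
        """Append a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, group: str, max_messages: int = 10) -> List[str]:
        """Return the next unread messages for this group and advance its offset."""
        start = self._offsets[group]
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch
```

Because reading only moves an offset and never deletes anything, a slow analytics consumer and a fast alerting consumer can share one topic without interfering with each other.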

CHOOSING THE RIGHT ECOSYSTEM
• Cassandra:
• What is it: distributed database with high availability (multi-master, high write throughput)
• When to use: scaling, data needed in multiple data centers (geo locations), always available with fast response times
• Why Cassandra: easy to scale out, high throughput, continuous availability, no SPOFs (single points of failure). Easy to integrate with Spark; supports Spark Streaming and Solr search.
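One way to see why a multi-master design has no single point of failure is to place each key on several peer nodes arranged in a hash ring. The sketch below is loosely modeled on that idea and is not Cassandra's actual code: it assumes one MD5-derived token per node, whereas Cassandra uses its own partitioner and virtual nodes, and the function names and node names are hypothetical.

```python
import hashlib
from bisect import bisect_right
from typing import List


def token(value: str) -> int:
    """Map a string to a position on the ring using a stable hash."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key: str, nodes: List[str], replication_factor: int) -> List[str]:
    """Return the nodes that would hold `key`: the owner plus its ring successors."""
    ring = sorted(nodes, key=token)
    tokens = [token(n) for n in ring]
    # The owner is the first node whose token is greater than the key's token,
    # wrapping around to the start of the ring if there is none.
    start = bisect_right(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]
```

With four nodes and a replication factor of 3, every key lives on three distinct peers. Losing any one node still leaves at least two copies, and no node plays a special "master" role.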


Trend Micro Big Data Platform and Apache BigtopEvans Ye

Solving Office 365 Big Challenges using Cassandra + Spark Anubhav Kale

ClickHouse Paris Meetup. Pragma Analytics Software Suite w/ClickHouse, by Mat...Altinity Ltd

Pragma Innovation is an IT services company focused on time series data solutions. Their PASS (Pragma Analytics Software Suite) allows companies to analyze, report on, and make decisions from time series network data using open source software. It is designed for ISPs, hosting providers, and telecom companies. The solution ingests network and log data, standardizes it, enriches it using tools like GeoIP, and stores it in a time series database. This allows customers to build applications for traffic engineering, security, and business intelligence use cases. Key challenges addressed in version 2.0 of the solution include data sampling, IPv4/IPv6 support, and using ClickHouse as the time series database for its performance and simplicity

20160331 sa introduction to big data pipelining berlin meetup 0.3Simon Ambridge

Data Pipelines with Spark & DataStax EnterpriseDataStax

Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataMike Percy

Survey of Real-time Processing Systems for Big DataLuiz Henrique Zambom Santana

Cloud Lambda Architecture PatternsAsis Mohanty

real time data processing is a tsubtopic in the topic in the domain bigdataArasuVishnu

Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectSoftServe

Pacemaker hadoop infrastructure and soft serve experienceVitaliy Bashun

Big Data_Architecture.pptxbetalab

Webinar: What's new in CDAP 3.5?Cask Data

Innovation in the Data Warehouse - StampedeCon 2016StampedeCon

Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely

Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime

Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion

Introduction To Hadoop EcosystemInSemble

How Data Drives Business at Choice HotelsCloudera, Inc.

Data Analytics Service Company and Its Ruby UsageSATOSHI TAGOMORI

Trend Micro Big Data Platform and Apache BigtopEvans Ye

Solving Office 365 Big Challenges using Cassandra + Spark Anubhav Kale

ClickHouse Paris Meetup. Pragma Analytics Software Suite w/ClickHouse, by Mat...Altinity Ltd

Data platform architecture

1. DATA ARCHITECTURE & ROAD MAP
NEXT GENERATION DATA PLATFORM AND ARCHITECTURAL PATTERNS
BY SUDHEER KONDLA, SENIOR DATA PLATFORM ARCHITECT

2. OVERVIEW
• Define the problem
• Understand the problem
• Articulate the problem
• Craft a solution

3. DATA ARCHITECTURE SOLUTION
• To solve a real-time, high-volume data problem with low-latency response times, we need a data platform capable of capturing, ingesting, streaming, and optionally storing data for batch analytics. On most real-time streaming platforms, data is short-lived once it has been processed to build the predictive models that enable marketing to offer real-time recommendations. The following characteristics are expected:
• Fast Data
  • Fast ingestion
  • Real-time analytics
  • Fast action
  • Quick time to value
• Benefits
  • Capture and use (or discard, via time to live or purge)
  • Real-time or near-real-time insights
  • Agile and responsive
  • Expressive
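The "capture and use (or discard)" benefit can be illustrated with a small time-to-live store. This is a minimal sketch in plain Python rather than the API of any product named in the deck; `TTLStore` and its injectable `clock` parameter are hypothetical names chosen for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLStore:
    """Toy in-memory store whose entries expire after a time to live (TTL)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._items: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        # Keep the value together with the moment it stops being valid.
        self._items[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # lazy discard when an expired key is read
            return None
        return value

    def purge(self) -> int:
        """Drop every expired entry and return how many were removed."""
        now = self._clock()
        expired = [k for k, (expires_at, _) in self._items.items() if now >= expires_at]
        for key in expired:
            del self._items[key]
        return len(expired)
```

Expiry here is checked lazily on read and eagerly in `purge()`; a production store would typically apply both, and the TTL value itself is a business decision rather than something this sketch can suggest.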

5. ECOSYSTEM & INFRASTRUCTURE
• Multiple data sources:
  • Web/app logs, Twitter (trending) and other social media, blogs, SOR (internal systems), HDFS
• Ingestion/streaming:
  • Apache Flume (log capture/aggregation), Kafka (event streaming, data pipelines & messaging)
• Stream analytics:
  • Spark/Storm API
• Data store/persistence:
  • HDFS, Cassandra, S3, Hive
• Infrastructure:
  • IaaS (cloud), on-premises, or hybrid private cloud
• Orchestration:
  • Mesos

6. STREAM DATA ANALYTICS DATA FLOW

7. REAL-TIME DATA PIPELINES
• Kafka: collect data into Kafka (channel data)
• Spark: process micro-batches (aggregate, predict & act)
• Cassandra: persist data for later use (historical, analytics)
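The collect, process, persist flow on this slide can be sketched in a few lines of plain Python. Everything below is an in-memory stand-in: a list plays the role of the Kafka channel, a dictionary plays the role of the Cassandra table, and batches are cut by record count rather than by time interval (the usual approach in Spark Streaming) so the example stays deterministic. The names `micro_batches`, `run_pipeline`, and the `page` field are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, str]


def micro_batches(events: Iterable[Event], batch_size: int) -> Iterator[List[Event]]:
    """Group an event stream into fixed-size micro-batches (the last may be short)."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_pipeline(events: Iterable[Event], batch_size: int) -> Dict[str, int]:
    """Aggregate page hits per micro-batch and persist running totals."""
    store: Dict[str, int] = {}  # stand-in for the persistent store
    for batch in micro_batches(events, batch_size):
        counts = Counter(event["page"] for event in batch)  # aggregate step
        for page, hits in counts.items():
            store[page] = store.get(page, 0) + hits  # persist step
    return store


# Stand-in for events collected into the channel.
clicks = [
    {"page": "home"}, {"page": "cart"}, {"page": "home"},
    {"page": "home"}, {"page": "cart"},
]
totals = run_pipeline(clicks, batch_size=2)
print(totals)
```

The "predict & act" step is left out on purpose; it would sit between the aggregate and persist steps and depends on a model the deck does not describe.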

8. DATA GOVERNANCE & DATA LIFE CYCLE

9. CHOOSING THE RIGHT ECOSYSTEM
• Kafka:
  • Distributed pub-sub messaging and data pipeline system
  • Designed for processing real-time activity streams (logs, metrics)
  • When to use: real-time decision making, working with streams of continuous data
  • Why Kafka: persistent messaging, high throughput, fault tolerance
• Spark:
  • What it is: a scalable distributed computing framework that integrates real-time data from many event streams (Kafka, Flume, HDFS, S3, Twitter and other sources)
  • Event-driven, asynchronous, scalable, type-safe and fault-tolerant
  • Where it fits:
    • When you need real-time decision making: recommendation, fraud detection, real-time forecasting
  • Why Spark Streaming:
    • Provides high throughput and reliability for live data streams
    • Batch, iterative and streaming on the same platform
    • Fits machine learning
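"Persistent messaging" in a pub-sub log means that records stay in the log after they are read, and each consumer group tracks its own position. The sketch below models that idea with an in-memory list. It is not the Kafka client API; `TopicLog`, `publish`, and `poll` are hypothetical names, and partitions, replication, and retention are all omitted.

```python
from collections import defaultdict
from typing import Any, Dict, List


class TopicLog:
    """Toy append-only log with one read offset per consumer group."""

    def __init__(self) -> None:
        self._records: List[Any] = []
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, record: Any) -> int:
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10) -> List[Any]:
        """Return the next unread records for this group and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for line in ["login", "click", "logout"]:
    log.publish(line)

first = log.poll("analytics", max_records=2)  # analytics reads two records
rest = log.poll("analytics")                  # then the remaining one
audit = log.poll("audit")                     # audit starts from the beginning
```

Because reading does not remove records, a second group such as `audit` sees the full history even after `analytics` has consumed it, which is what lets one stream feed both real-time and batch consumers.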

10. CHOOSING THE RIGHT ECOSYSTEM
• Cassandra:
  • What it is: a distributed database with high availability (multi-master, high write throughput)
  • When to use: scaling out, data needed in multiple data centers (geo-locations), always-on availability and fast response times
  • Why Cassandra: easy to scale out, high throughput, continuous availability, no SPOFs; easy to integrate with Spark, with support for Spark Streaming and Solr search
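The "no SPOFs" and "easy to scale out" points rest on placing each key on several nodes of a hash ring. The following is a simplified sketch of that placement idea, assuming one token per node and MD5 as a stand-in hash; it is not Cassandra's partitioner or driver API, and `Ring`, `token`, and `replicas` are hypothetical names.

```python
import bisect
import hashlib
from typing import List


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 used only as a stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring with one token per node."""

    def __init__(self, nodes: List[str], replication_factor: int = 3) -> None:
        if not 1 <= replication_factor <= len(nodes):
            raise ValueError("replication factor must be between 1 and the node count")
        self._rf = replication_factor
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key: str) -> List[str]:
        """Return the nodes holding a key: the first node clockwise, then its successors."""
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(self._rf)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = Ring(nodes, replication_factor=3)
owners = ring.replicas("user:42")
```

With a replication factor of 3, any single node in `owners` can be lost and the key is still readable from the other two, which is the property the slide summarizes as continuous availability.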

11. Q & A
• Questions?
