Apache Doris (incubating) introduction

Sep 25, 2019

Apache Doris (incubating) is an MPP-based interactive SQL data warehouse for reporting and analysis, open-sourced by Baidu. It mainly integrates the technology of Google Mesa and Apache Impala. Unlike other popular SQL-on-Hadoop systems, Doris is designed as a simple, single, tightly coupled system that does not depend on other systems. It provides both high-concurrency, low-latency point queries and high-throughput ad-hoc analytical queries, and it supports both batch data loading and near-real-time mini-batch loading. Doris also provides high availability, reliability, fault tolerance, and scalability. Its main features are simplicity (of developing, deploying, and using) and the ability to meet many data-serving requirements in a single system.

Apache Doris
(incubating)
A simple and single tightly coupled OLAP system
lide@apache.org

About me
• I'm Reed, an Apache committer from China.
• I work at Baidu Inc., China's Google, as a senior software engineer and core developer of Doris.
• 10+ years of experience in software development and 5 years of experience managing a team.
• Previously worked at Naver, South Korea's largest web search engine, and at Qihoo 360, China's largest internet security company.

Overview
1 INTRODUCTION – What is Doris
2 ARCHITECTURE – Key techniques
3 FEATURES – Features and functions
4 STORAGE – Data storage model

What is Doris
• An MPP-based interactive SQL data warehouse for reporting and analysis
• A simple, single, tightly coupled system that does not depend on other systems
2014
Project start
2018

What It Is Based On
• Doris combines the technology of Google Mesa and Apache Impala.

Used in Baidu Inc.
• 1000+ deployed machines
• 200+ business lines
• 1PB maximum in one cluster

Use case – JD.com
-- Provided by JD.com

Why Doris by JD
• Google's Mesa theory combined with Baidu's engineering practice
• MySQL protocol compatibility
• High concurrency and high QPS support
• Convenient operation and maintenance, with a clear structure
-- Provided by JD.com

2 ARCHITECTURE – Key techniques

Doris Architecture
[Figure: Doris architecture diagram; surviving labels: FE, Catalog]

Frontend and Backend
Doris' implementation consists of two daemons: frontend (FE) and backend (BE).
• FE: consists of the query coordinator and the catalog manager; sends query fragments to the BEs.
• BE: stores the data and executes the query fragments; sends the results to the FE.
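
The split of work between the two daemons can be pictured as a small scatter-gather loop. This is a plain Python model, not Doris code: the `Backend` and `Frontend` classes, their method names, and the sample rows are all hypothetical, and a real FE plans SQL and reaches the BEs over the network.

```python
from collections import defaultdict


class Backend:
    """Toy BE: stores a slice of the rows and executes a query fragment on it."""

    def __init__(self, rows):
        self.rows = rows  # list of (user_id, cost) tuples held by this BE

    def execute_fragment(self):
        # Fragment: partial SUM(cost) GROUP BY user_id over local data only.
        partial = defaultdict(int)
        for user_id, cost in self.rows:
            partial[user_id] += cost
        return dict(partial)


class Frontend:
    """Toy FE: knows which BEs exist (its catalog) and coordinates the query."""

    def __init__(self, backends):
        self.catalog = list(backends)

    def query_sum_cost_by_user(self):
        # Send the fragment to every BE, then merge the partial results.
        merged = defaultdict(int)
        for be in self.catalog:
            for user_id, subtotal in be.execute_fragment().items():
                merged[user_id] += subtotal
        return dict(merged)


fe = Frontend([
    Backend([(1, 10), (2, 5)]),
    Backend([(1, 7), (3, 2)]),
])
result = fe.query_sum_cost_by_user()
print(result)
```

The point of the model is only the direction of traffic: fragments go from FE to BE, partial results come back, and the FE produces the final answer.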

Key Technology
Doris = Google Mesa + Apache Impala + Apache ORCFile
• Google Mesa: Aggregate Data Model, Versioned Data Management, Prefix Index, Online Schema Change, ...
• Apache Impala: Query Engine, ...
• Apache ORCFile: File Format, Indexes, Compression, Encoding, ...

3 FEATURES – Features and functions

Compatible MySQL Protocol
• ODBC/JDBC
• Compatible with the MySQL protocol

Two-Level Partitioning
1. Range partitioning: the user can specify a range of values on a column (usually the time-series column) for each data partition.
2. Hash partitioning: the user can also specify one or more columns and a number of buckets for hash partitioning.

Two-Level Partitioning
CREATE TABLE IF NOT EXISTS example_db.expamle_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "user id",
`date` DATE NOT NULL COMMENT "date and time",
... /* Omit other columns */
) AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
PARTITION BY RANGE(`date`)
(
PARTITION `p201801` VALUES LESS THAN ("2018-02-01"),
PARTITION `p201802` VALUES LESS THAN ("2018-03-01"),
PARTITION `p201803` VALUES LESS THAN ("2018-04-01")
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
... /* Omitted other information */
;
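
The two levels declared in the statement above can be read as a routing function from a row to a (partition, bucket) pair. A minimal sketch in plain Python, assuming the three partitions and 16 buckets from the DDL; the `route` helper is hypothetical, and CRC32 stands in for whatever hash Doris actually uses, which the slides do not state.

```python
import bisect
import zlib
from datetime import date

# Upper bounds copied from the PARTITION ... VALUES LESS THAN clauses above.
PARTITIONS = [
    ("p201801", date(2018, 2, 1)),
    ("p201802", date(2018, 3, 1)),
    ("p201803", date(2018, 4, 1)),
]
BUCKETS = 16  # from DISTRIBUTED BY HASH(`user_id`) BUCKETS 16


def route(user_id, day):
    """Return (partition name, bucket number) for one row."""
    # Level 1: range partitioning on `date`. bisect_right finds the first
    # partition whose exclusive upper bound is greater than the row's date.
    bounds = [upper for _, upper in PARTITIONS]
    idx = bisect.bisect_right(bounds, day)
    if idx == len(PARTITIONS):
        raise ValueError(f"no partition covers {day}")
    # Level 2: hash partitioning on `user_id`. CRC32 is an assumption made
    # for illustration only.
    bucket = zlib.crc32(str(user_id).encode()) % BUCKETS
    return PARTITIONS[idx][0], bucket


print(route(10001, date(2018, 1, 15)))
print(route(10001, date(2018, 2, 1)))
```

Because the bucket depends only on `user_id`, the same user lands in the same bucket number in every partition, while the date alone decides the partition.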

Materialized view
ALTER TABLE expamle_tbl ADD ROLLUP rollup_cost(user_id, cost);
SHOW ALTER TABLE ROLLUP;
EXPLAIN SELECT user_id, sum(cost) FROM expamle_tbl GROUP BY user_id;
+-----------------------------------------+
| Explain String |
+-----------------------------------------+
| PLAN FRAGMENT 0 |
...
| 1:AGGREGATE (update serialize) |
...
| 0:OlapScanNode |
| TABLE: expamle_tbl |
| rollup: rollup_cost |
...
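
One way to picture the rollup above is as a second, smaller copy of the table that is already aggregated on fewer columns. The sketch below is a plain Python model with hypothetical sample rows and a hypothetical `aggregate` helper; it does not model how the Doris planner picks a rollup.

```python
from collections import defaultdict


def aggregate(rows, key_len):
    """Group rows by their first key_len fields and SUM the last field (cost)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[:key_len]] += row[-1]
    return [key + (cost,) for key, cost in sorted(totals.items())]


# Hypothetical base-table rows: (user_id, date, cost).
base = [
    (10001, "2018-01-01", 20),
    (10001, "2018-01-02", 15),
    (10002, "2018-01-01", 30),
    (10002, "2018-01-03", 5),
]

# The rollup keeps only (user_id, cost), so rows collapse to one per user.
rollup_cost = aggregate(base, key_len=1)

# SELECT user_id, sum(cost) ... GROUP BY user_id, answered two ways.
from_base = aggregate(base, key_len=1)           # scans every base row
from_rollup = aggregate(rollup_cost, key_len=1)  # scans the smaller rollup

print(len(base), len(rollup_cost))
print(from_rollup)
```

Both paths return the same totals; the rollup path simply reads fewer rows, which is why the plan above reports `rollup: rollup_cost`.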

4 STORAGE – Data storage model

Data Model
• In Mesa, a table schema specifies the key space K for the table and the corresponding value space V.
• The table schema also specifies the aggregation function F : V × V → V.
• Doris combines Mesa's data model with ORC File / Parquet storage technology.
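
The aggregation function F : V × V → V can be made concrete with a few lines of plain Python. The function names SUM, MAX, MIN and REPLACE are the ones the later slides list; the `fold` helper and the sample values are hypothetical.

```python
from functools import reduce

# Each aggregation function F takes the stored value and a newly loaded value
# and returns the value to keep, i.e. F : V x V -> V.
AGG_FUNCS = {
    "SUM": lambda old, new: old + new,
    "MAX": lambda old, new: max(old, new),
    "MIN": lambda old, new: min(old, new),
    "REPLACE": lambda old, new: new,
}


def fold(func_name, values):
    """Combine the values for one key, in load order, with the named function."""
    return reduce(AGG_FUNCS[func_name], values)


loads = [5, 12, 7]  # values loaded for the same key, oldest first
results = {name: fold(name, loads) for name in AGG_FUNCS}
print(results)
```

All four functions are associative, so earlier loads can be combined first and merged with later loads afterwards without changing the result, as long as load order is preserved for REPLACE.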

Types of Model
• AGGREGATE KEY (SUM, REPLACE, MAX and MIN)
• UNIQUE KEY
• DUPLICATE KEY

Aggregate key model
CREATE TABLE IF NOT EXISTS example_db.expamle_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "user ID",
`date` DATE NOT NULL COMMENT "date and time",
`city` VARCHAR(20) COMMENT "the city user lives",
`age` SMALLINT COMMENT "age of user",
`sex` TINYINT COMMENT "sex of user",
`last_visit_date` DATETIME REPLACE COMMENT "the date last visited",
`cost` BIGINT SUM DEFAULT "0" COMMENT "total cost",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "maximum dwell time",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "minimum dwell time"
) AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
... /* Omitted the information of Partition and Distribution */
;
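
What the table above does with two rows that share the same key can be sketched in plain Python. The column names and aggregate types are taken from the statement; the `load` helper and the sample rows are hypothetical, and the sketch says nothing about when Doris actually performs the merge.

```python
# Column layout taken from the CREATE TABLE above.
KEY_COLS = ("user_id", "date", "city", "age", "sex")
VALUE_AGGS = {
    "last_visit_date": lambda old, new: new,  # REPLACE
    "cost": lambda old, new: old + new,       # SUM
    "max_dwell_time": max,                    # MAX
    "min_dwell_time": min,                    # MIN
}


def load(table, row):
    """Merge one incoming row (a dict) into table, a dict keyed by the key columns."""
    key = tuple(row[c] for c in KEY_COLS)
    stored = table.get(key)
    if stored is None:
        table[key] = {c: row[c] for c in VALUE_AGGS}
    else:
        for col, agg in VALUE_AGGS.items():
            stored[col] = agg(stored[col], row[col])


table = {}
# Hypothetical sample rows; both share the same key.
load(table, {"user_id": 10001, "date": "2018-01-01", "city": "Beijing",
             "age": 20, "sex": 0,
             "last_visit_date": "2018-01-01 06:00:00", "cost": 20,
             "max_dwell_time": 10, "min_dwell_time": 10})
load(table, {"user_id": 10001, "date": "2018-01-01", "city": "Beijing",
             "age": 20, "sex": 0,
             "last_visit_date": "2018-01-01 07:00:00", "cost": 15,
             "max_dwell_time": 2, "min_dwell_time": 2})
merged = table[(10001, "2018-01-01", "Beijing", 20, 0)]
print(merged)
```

The two loaded rows end up as one stored row: the later visit time replaces the earlier one, the costs are added, and the dwell times keep their maximum and minimum.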

Duplicate key model
CREATE TABLE IF NOT EXISTS example_db.expamle_tb2
(
`timestamp` DATETIME NOT NULL COMMENT "log time",
`type` INT NOT NULL COMMENT "log type",
`error_code` INT COMMENT "error code",
`error_msg` VARCHAR(1024) COMMENT "error message",
`op_id` BIGINT COMMENT "operator ID",
`op_time` DATETIME COMMENT "operation time"
)
DUPLICATE KEY(`timestamp`, `type`)
... /* Omitted the information of Partition and Distribution */
;
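
In the duplicate key model rows are kept exactly as loaded, and the DUPLICATE KEY columns act as sort columns; that reading follows the Doris documentation rather than anything stated on the slide. A small plain Python illustration with hypothetical log rows:

```python
# Rows are (timestamp, type, error_code); the first two fields are the
# DUPLICATE KEY columns from the CREATE TABLE above.
rows = [
    ("2018-01-01 10:00:02", 2, 500),
    ("2018-01-01 10:00:01", 1, 404),
    ("2018-01-01 10:00:01", 1, 404),  # exact duplicate of the previous row
]

# Sort by the key columns only; nothing is merged or dropped.
stored = sorted(rows, key=lambda r: (r[0], r[1]))
print(stored)
```

This suits raw logs such as the table above, where two identical lines are two real events and must both survive.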

Unique key model
CREATE TABLE IF NOT EXISTS example_db.expamle_tb3
(
`user_id` LARGEINT NOT NULL COMMENT "user id",
`username` VARCHAR(50) NOT NULL COMMENT "user name",
`city` VARCHAR(20) COMMENT "the city user lives",
`age` SMALLINT COMMENT "age of user",
`sex` TINYINT COMMENT "sex of user",
`phone` LARGEINT COMMENT "phone number",
`address` VARCHAR(500) COMMENT "address of user",
`register_time` DATETIME COMMENT "registered time"
)
UNIQUE KEY(`user_id`, `username`)
... /* Omitted the information of Partition and Distribution */
;
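
The unique key model behaves like an upsert: a later row with the same key replaces the earlier one, which is the same as applying REPLACE to every value column. A minimal plain Python sketch, using the `user_id` and `username` columns defined above as the key; the `upsert` helper and the sample users are hypothetical.

```python
# Key columns: user_id plus the user-name column defined in the table above.
KEY_COLS = ("user_id", "username")


def upsert(table, row):
    """Store row under its key; a later row with the same key replaces the earlier one."""
    key = tuple(row[c] for c in KEY_COLS)
    table[key] = {c: v for c, v in row.items() if c not in KEY_COLS}


table = {}
upsert(table, {"user_id": 10001, "username": "alice", "city": "Beijing", "age": 30})
upsert(table, {"user_id": 10001, "username": "alice", "city": "Shanghai", "age": 31})
upsert(table, {"user_id": 10002, "username": "bob", "city": "Shenzhen", "age": 25})
print(table)
```

Three loads leave two stored rows, and the row for the first user carries only the values from its most recent load.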

Contact us
• Official website and source code
  • http://doris.apache.org
  • https://github.com/apache/incubator-doris
• E-mail
  • dev@doris.apache.org
• Documentation
  • http://doris.incubator.apache.org/documentation/en/

Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workload. You’ll also learn about Facebook’s new efforts towards automatically tuning several important configurations based on nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.ing several important configurations based on nature of the workload. We will conclude by sharing our result with automatic tuning and future directions for the project.

Deep Dive into the New Features of Apache Spark 3.0Databricks

Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama

This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.

History of Computer Virus Ammy Vijay

A computer virus is a type of malicious software that replicates by inserting copies of itself into other computer programs, files or boot sectors. It can damage files or systems. The first computer viruses emerged in the early 1980s and have since affected many platforms. Viruses function by infecting files and replicating. They often exploit software bugs or user errors to spread. Common types include macro viruses, boot sector viruses and resident viruses. Viruses continue to evolve techniques to avoid detection. A brief history outlines some of the most famous early viruses from the 1970s-2010s like Brain, Jerusalem, ILOVEYOU, and Cryptolocker. Viruses spread via software vulnerabilities, social engineering, or by targeting common file types like documents

Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks

The document discusses optimizations made to Spark SQL performance when working with Parquet files at ByteDance. It describes how Spark originally reads Parquet files and identifies two main areas for optimization: Parquet filter pushdown and the Parquet reader. For filter pushdown, sorting columns improved statistics and reduced data reads by 30%. For the reader, splitting it to first filter then read other columns prevented loading unnecessary data. These changes improved Spark SQL performance at ByteDance without changing jobs.

Optimizing Apache Spark SQL JoinsDatabricks

Join operations in Apache Spark is often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames – to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performance joins in Spark SQL that scale and are zippy fast! This session will cover different ways of joining tables in Apache Spark. Speaker: Vida Ha This talk was originally presented at Spark Summit East 2017.

Business Plan VS Feasibility studyNeveenJamal

The feasibility analysis is an internationally accepted process used to evaluate various project dimensions important for achieving the desired project benefits. An effective tool for appraising the project from standpoints of all project stakeholders It is not a waste of time. It significantly reduces the risks in project implementation The business plan provides a planning function and outlines the actions needed to take the proposal from “idea” to “reality” The feasibility study outlines and analyzes several alternatives and identifies the best business scenario(s). The business plan deals with only one alternative or scenario.

Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks

Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks

This document summarizes the results of a performance analysis conducted by the Barcelona Supercomputing Center comparing Apache Spark and Presto on cloud environments using the TPC-DS benchmark. It finds that Databricks Spark was about 4x faster than AWS EMR Presto without statistics and about 3x faster with statistics. Databricks was also more cost effective and had a more efficient runtime, caching, and query optimizer. While EMR Presto required more tuning, Databricks and EMR Spark were easier to configure and use interactive notebooks.

Change Data Feed in DeltaDatabricks

This document discusses Delta Change Data Feed (CDF), which allows capturing changes made to Delta tables. It describes how CDF works by storing change events like inserts, updates and deletes. It also outlines how CDF can be used to improve ETL pipelines, unify batch and streaming workflows, and meet regulatory needs. The document provides examples of enabling CDF, querying change data and storing the change events. It concludes by offering a demo of CDF in Jupyter notebooks.

When Apache Spark Meets TiDB with Xiaoyu MaDatabricks

During the past 10 years, big-data storage layers mainly focus on analytical use cases. When it comes to analytical cases, users usually offload data onto Hadoop cluster and perform queries on HDFS files. People struggle dealing with modifications on append only storage and maintain fragile ETL pipelines. On the other hand, although Spark SQL has been proven effective parallel query processing engine, some tricks common in traditional databases are not available due to characteristics of storage underneath. TiSpark sits directly on top of a distributed database (TiDB)’s storage engine, expand Spark SQL’s planning with its own extensions and utilizes unique features of database storage engine to achieve functions not possible for Spark SQL on HDFS. With TiSpark, users are able to perform queries directly on changing / fresh data in real time. The takeaways from this two are twofold: — How to integrate Spark SQL with a distributed database engine and the benefit of it — How to leverage Spark SQL’s experimental methods to extend its capacity.

RedisConf17- Using Redis at scale @ TwitterRedis Labs

The document discusses Nighthawk, Twitter's distributed caching system which uses Redis. It provides caching services at a massive scale of over 10 million queries per second and 10 terabytes of data across 3000 Redis nodes. The key aspects of Nighthawk's architecture that allow it to scale are its use of a client-oblivious proxy layer and cluster manager that can independently scale and rebalance partitions across Redis nodes. It also employs replication between data centers to provide high availability even in the event of node failures. Some challenges discussed are handling "hot keys" that get an unusually high volume of requests and more efficiently warming up replicas when nodes fail.

Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.

Performance Tuning RocksDB for Kafka Streams’ State Storesconfluent

Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfAltinity Ltd

Join the Altinity experts as we dig into ClickHouse sharding and replication, showing how they enable clusters that deliver fast queries over petabytes of data. We’ll start with basic definitions of each, then move to practical issues. This includes the setup of shards and replicas, defining schema, choosing sharding keys, loading data, and writing distributed queries. We’ll finish up with tips on performance optimization. #ClickHouse #datasets #ClickHouseTutorial #opensource #ClickHouseCommunity #Altinity ----------------- Join ClickHouse Meetups: https://www.meetup.com/San-Francisco-... Check out more ClickHouse resources: https://altinity.com/resources/ Visit the Altinity Documentation site: https://docs.altinity.com/ Contribute to ClickHouse Knowledge Base: https://kb.altinity.com/ Join the ClickHouse Reddit community: https://www.reddit.com/r/Clickhouse/ ---------------- Learn more about Altinity! Site: https://www.altinity.com LinkedIn: https://www.linkedin.com/company/alti... Twitter: https://twitter.com/AltinityDB

ポスト・ラムダアーキテクチャの切り札? Apache Hudi（NTTデータテクノロジーカンファレンス 2020 発表資料）NTT DATA Technology & Innovation

RocksDB Performance and Reliability PracticesYoshinori Matsunobu

Meta/Facebook's database serving social workloads is running on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depends a lot on RocksDB. Not just MyRocks, but also we have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale. In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.

Building an open data platform with apache icebergAlluxio, Inc.

Apache Hadoopの未来 3系になって何が変わるのか?NTT DATA OSS Professional Services

Everything you always wanted to know about Redis but were afraid to askCarlos Abalde

HBase Advanced - Lars GeorgeJAX London

Introduction to redisTanu Siwag

This document provides an overview and introduction to Redis, including: - Redis is an open source, in-memory data structure store that can be used as a database, cache, and message broker. - It supports common data structures like strings, hashes, lists, sets, sorted sets with operations like GET, SET, LPUSH, SADD. - Redis has advantages like speed, rich feature set, replication, and persistence to disk. - The document outlines how to install and use Redis, and covers additional features like pub/sub, transactions, security and backup.

Building Reliable Data Lakes at Scale with Delta LakeDatabricks

Most data practitioners grapple with data reliability issues—it’s the bane of their existence. Data engineers, in particular, strive to design, deploy, and serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets. Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Built on open standards, Delta Lake employs co-designed compute and storage and is compatible with Spark API’s. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingests, fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data engineering, the challenges data engineers face when it comes to data reliability and performance and how Delta Lake can help. Through presentation, code examples and notebooks, we will explain these challenges and the use of Delta Lake to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain. This tutorial will be both instructor-led and hands-on interactive session. Instructions on how to get tutorial materials will be covered in class. What you’ll learn: Understand the key data reliability challenges How Delta Lake brings reliability to data lakes at scale Understand how Delta Lake fits within an Apache Spark™ environment How to use Delta Lake to realize data reliability improvements Prerequisites A fully-charged laptop (8-16GB memory) with Chrome or Firefox Pre-register for Databricks Community Edition

Kafka replication apachecon_2013Jun Rao

The document discusses intra-cluster replication in Apache Kafka, including its architecture where partitions are replicated across brokers for high availability. Kafka uses a leader and in-sync replicas approach to strongly consistent replication while tolerating failures. Performance considerations in Kafka replication include latency and durability tradeoffs for producers and optimizing throughput for consumers.

MyRocks Deep DiveYoshinori Matsunobu

Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks

The document discusses the Spark Operator, which allows deploying, managing, and monitoring Spark clusters on Kubernetes. It describes how the operator extends Kubernetes by defining custom resources and reacting to events from those resources, such as SparkCluster, SparkApplication, and SparkHistoryServer. The operator takes care of common tasks to simplify running Spark on Kubernetes and hides the complexity through an abstract operator library.

Under the Hood of a Shard-per-Core Database ArchitectureScyllaDB

Most databases are based on architectures that pre-date advances to modern hardware. This results in performance issues, the need to overprovision, and a high total cost of ownership. In this webinar we will discuss the advances to modern server technology and take a deep dive into Scylla’s shard-per-core architecture and our asynchronous engine, the Seastar framework. Join us to learn how Seastar (and Scylla): Avoid locks and contention on the CPU level Bypass kernel bottlenecks Implement its per-core shared-nothing autosharding mechanism Utilize modern storage hardware Leverage NUMA to get the best RAM performance Balance your data across CPUs and nodes for best and smoothest performance Plus we’ll cover the advantages of unlocking vertical scalability.

How Prometheus Store the DataHao Chen

When to no sql and when to know sql javaoneSimon Elliston Ball

Simplifying & accelerating application development with MongoDB's intelligent...Maxime Beugnet

The document discusses MongoDB's Intelligent Operational Data Platform and how it allows developers to simplify application development. It highlights how MongoDB uses a document model which is more flexible than a relational database and allows for embedding of related data. MongoDB also provides features like multi-document transactions, full indexing capabilities, advanced aggregations, and change streams for building reactive applications in real-time.

More Related Content

What's hot (20)

Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks

Change Data Feed in DeltaDatabricks

When Apache Spark Meets TiDB with Xiaoyu MaDatabricks

RedisConf17- Using Redis at scale @ TwitterRedis Labs

Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.

Performance Tuning RocksDB for Kafka Streams’ State Storesconfluent

Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfAltinity Ltd

ポスト・ラムダアーキテクチャの切り札? Apache Hudi（NTTデータテクノロジーカンファレンス 2020 発表資料）NTT DATA Technology & Innovation

RocksDB Performance and Reliability PracticesYoshinori Matsunobu

Building an open data platform with apache icebergAlluxio, Inc.

Apache Hadoopの未来 3系になって何が変わるのか?NTT DATA OSS Professional Services

Everything you always wanted to know about Redis but were afraid to askCarlos Abalde

HBase Advanced - Lars GeorgeJAX London

Introduction to redisTanu Siwag

Building Reliable Data Lakes at Scale with Delta LakeDatabricks

Kafka replication apachecon_2013Jun Rao

MyRocks Deep DiveYoshinori Matsunobu

Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks

Under the Hood of a Shard-per-Core Database ArchitectureScyllaDB

How Prometheus Store the DataHao Chen

Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks

Change Data Feed in DeltaDatabricks

When Apache Spark Meets TiDB with Xiaoyu MaDatabricks

RedisConf17- Using Redis at scale @ TwitterRedis Labs

Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.

Performance Tuning RocksDB for Kafka Streams’ State Storesconfluent

Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfAltinity Ltd

ポスト・ラムダアーキテクチャの切り札? Apache Hudi（NTTデータテクノロジーカンファレンス 2020 発表資料）NTT DATA Technology & Innovation

RocksDB Performance and Reliability PracticesYoshinori Matsunobu

Building an open data platform with apache icebergAlluxio, Inc.

Apache Hadoopの未来 3系になって何が変わるのか?NTT DATA OSS Professional Services

Everything you always wanted to know about Redis but were afraid to askCarlos Abalde

HBase Advanced - Lars GeorgeJAX London

Introduction to redisTanu Siwag

Building Reliable Data Lakes at Scale with Delta LakeDatabricks

Kafka replication apachecon_2013Jun Rao

MyRocks Deep DiveYoshinori Matsunobu

Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks

Under the Hood of a Shard-per-Core Database ArchitectureScyllaDB

How Prometheus Store the DataHao Chen

Similar to Apache doris (incubating) introduction (20)

When to no sql and when to know sql javaoneSimon Elliston Ball

Simplifying & accelerating application development with MongoDB's intelligent...Maxime Beugnet

The Heterogeneous Data lakeDataWorks Summit/Hadoop Summit

Dremio is a startup founded in 2015 by experts in big data and open source. It aims to provide a platform for interactive analysis across disparate data sources through a storage-agnostic and client-agnostic approach leveraging Apache Arrow for high performance in-memory columnar execution. Dremio uses Apache Drill as its query engine, allowing users to query data across different systems like HDFS, S3, MongoDB as if it was a single relational database through SQL. It has an extensible architecture that allows new data sources to be easily added via plugins.

Database Management SystemAbishek V S

Python business intelligence (PyData 2012 talk)Stefan Urbanek

Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)Kyle Davis

Where are yours vertexes and what are they talking about?Roberto Franchini

The document discusses OrientDB, a multi-model database that combines document and graph functionality. It provides an overview of key OrientDB concepts like data models, schemas, indexing, and spatial and full-text search capabilities. Examples are given of modeling a Twitter graph using OrientDB classes, properties, indexes and relationship types. The document concludes with information on getting started with OrientDB.

NoSQL Endgame DevoxxUA Conference 2020Thodoris Bais

Full-stack Web Development with MongoDB, Node.js and AWSMongoDB

Akira Technologies will share its experience of building a universal scalable high-performance platform for conducting surveys. Using MongoDB allowed replacing dozens unique survey systems with a single flexible solution, improved data and questionnaire reusability, simplified data analysis. We will also cover full-stack development and integration with Node.js, Hadoop, deployment to AWS Cloud, offline caching and stress-tecting the entire system with Tsung. A working prototype will be demonstrated including multiple surveys, dynamically rebuilding interface, geolocation, data analysis and visualization.

01 nosql and multi model databaseMahdi Atawneh

This document summarizes a presentation on NoSQL and multi-model databases. It begins with an introduction to NoSQL databases, describing them as non-relational systems designed for big data and scalability. The main NoSQL models are outlined as key-value, document, columnar, and graph databases. Document databases are discussed in more detail. The presentation then covers multi-model databases, which combine features of document and graph databases, and allows for flexible querying. Popular multi-model databases like OrientDB and ArangoDB are presented. Finally, the document concludes with a demo of OrientDB's querying capabilities.

Introduction to azure document dbAntonios Chatzipavlis

This document provides an introduction and overview of Azure DocumentDB. It discusses how DocumentDB is a fully managed NoSQL database service that provides fast and predictable performance for JSON data through SQL querying capabilities. It also describes how DocumentDB offers features like elastic scaling, high availability, global distribution and ease of development. The document then provides information on starting with DocumentDB, writing queries, and programming capabilities within DocumentDB like stored procedures and triggers.

Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital.AI

This document provides an overview of MetaQL, which allows composing queries across NoSQL, SQL, SPARQL, and Spark databases using a domain model. Key points include: - MetaQL uses a domain model to define concepts and compose typed queries in code that can execute across different databases. - This separates concerns and improves developer efficiency over managing schemas and databases separately. - Examples demonstrate MetaQL queries in graph, path, select, and aggregation formats across SQL, NoSQL, and RDF implementations.

OrientDB for real & Web App developmentLuca Garulli

The document discusses how NoSQL databases like OrientDB can improve web application development compared to traditional relational databases. OrientDB provides a fast, scalable, and flexible storage solution with transactions, SQL, and security. It combines the best features of newer NoSQL solutions with relational databases. OrientDB supports document, graph, and object-oriented data models and can be used for both online backup solutions and CRM applications. It also introduces OrientWEB.js, a new JavaScript library for building web applications with OrientDB.

Application development with Oracle NoSQL Database 3.0Anuj Sahni

The document introduces table-based data modeling features for Oracle NoSQL Database. It discusses using tables to simplify application data modeling with familiar concepts like tables and data types. Examples show how to model user and email data using tables, including defining the schema using DDL, querying the data using DML, and indexing the tables. The document also provides an example of modeling user and email data from an email client application to illustrate how to approach data modeling.

Jump Start into Apache® Spark™ and DatabricksDatabricks

These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016. --- Spark is a fast, easy to use, and unified engine that allows you to solve many Data Sciences and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.

2021 04-20 apache arrow and its impact on the database industry.pptxAndrew Lamb

Data stores: beyond relational databasesJavier García Magna

Apache doris (incubating) introduction

1. Apache Doris (incubating): A simple and single tightly coupled OLAP system. lide@apache.org

2. About me • I'm Reed, an Apache committer from China. • I work for Baidu Inc., China's Google, as a senior software engineer and core developer of Doris. • 10+ years of experience in software development and 5 years of experience managing a team. • Previously worked for Naver, South Korea's largest web search engine, and Qihoo 360, China's largest internet security company.

3. Overview: 1 INTRODUCTION – What is Doris; 2 ARCHITECTURE – Key techniques; 3 FEATURES – Features and functions; 4 STORAGE – Data storage model

4. Overview: 1 INTRODUCTION – What is Doris; 2 ARCHITECTURE – Key techniques; 3 FEATURES – Features and functions; 4 STORAGE – Data storage model

5. What is Doris • An MPP-based interactive SQL data warehouse for reporting and analysis. • A simple and single tightly coupled system that does not depend on other systems. • Timeline: 2014, project start; 2018, contributed to Apache.

6. What it is based on • Doris combines the technology of Google Mesa and Apache Impala.

7. Used in Baidu Inc. • 1000+ deployed machines • 200+ business lines • 1 PB maximum in one cluster

8. Companies who use

9. Use case: JD.com (provided by JD.com)

10. Why Doris by JD • Google's Mesa theory with Baidu's engineering practice • Compatible with the MySQL protocol • High concurrency and high QPS support • Convenient operation and maintenance with a clear structure (provided by JD.com)

11. Overview: 1 INTRODUCTION – What is Doris; 2 ARCHITECTURE – Key techniques; 3 FEATURES – Features and functions; 4 STORAGE – Data storage model

12. Doris Architecture (diagram: multiple FE nodes, each holding a Catalog)

13. Frontend and Backend • Doris' implementation consists of two daemons: frontend (FE) and backend (BE). • Frontend: consists of the query coordinator and the catalog manager; sends query fragments to the BE. • Backend: stores the data and executes the query fragments; sends results to the FE.

14. Key Technology: Doris = Google Mesa (aggregate data model, versioned data management, prefix index, online schema change, ...) + Apache Impala (query engine, ...) + Apache ORCFile (file format, indexes, compression, encoding, ...)

15. Overview: 1 INTRODUCTION – What is Doris; 2 ARCHITECTURE – Key techniques; 3 FEATURES – Features and functions; 4 STORAGE – Data storage model

16. MySQL-Compatible Protocol • ODBC / JDBC • MySQL-compatible protocol

17. Multiple Frontends

18. Distributed Queries

19. Two-Level Partitioning • 1. Range partitioning: the user can specify a range of values of a column (usually the time series column) for each data partition. • 2. Hash partitioning: the user can also specify one or more columns and a number of buckets for hash partitioning.

20. Two-Level Partitioning

    CREATE TABLE IF NOT EXISTS example_db.expamle_tbl
    (
        `user_id` LARGEINT NOT NULL COMMENT "user id",
        `date` DATE NOT NULL COMMENT "date and time",
        ... /* Omit other columns */
    )
    AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
    PARTITION BY RANGE(`date`)
    (
        PARTITION `p201801` VALUES LESS THAN ("2018-02-01"),
        PARTITION `p201802` VALUES LESS THAN ("2018-03-01"),
        PARTITION `p201803` VALUES LESS THAN ("2018-04-01")
    )
    DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
    ... /* Omitted other information */
    ;

21. MVCC and Compaction

22. Materialized view

24. Columnar storage

25. Overview: 1 INTRODUCTION – What is Doris; 2 ARCHITECTURE – Key techniques; 3 FEATURES – Features and functions; 4 STORAGE – Data storage model

26. Data Model • Doris combines Mesa's data model and ORC File / Parquet storage technology. • In Mesa, a table schema specifies the key space K for the table and the corresponding value space V. • The table schema also specifies the aggregation function F: V × V → V.

27. Data Model

28. Types of Model • 01 AGGREGATE KEY (SUM, REPLACE, MAX and MIN) • 02 DUPLICATE KEY • 03 UNIQUE KEY

29. Aggregate key model

    CREATE TABLE IF NOT EXISTS example_db.expamle_tbl
    (
        `user_id` LARGEINT NOT NULL COMMENT "user ID",
        `date` DATE NOT NULL COMMENT "date and time",
        `city` VARCHAR(20) COMMENT "the city user lives",
        `age` SMALLINT COMMENT "age of user",
        `sex` TINYINT COMMENT "sex of user",
        `last_visit_date` DATETIME REPLACE COMMENT "the date last visited",
        `cost` BIGINT SUM DEFAULT "0" COMMENT "total cost",
        `max_dwell_time` INT MAX DEFAULT "0" COMMENT "maximum dwell time",
        `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "minimum dwell time"
    )
    AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
    ... /* Omitted the information of Partition and Distribution */
    ;

30. Duplicate key model

    CREATE TABLE IF NOT EXISTS example_db.expamle_tb2
    (
        `timestamp` DATETIME NOT NULL COMMENT "log time",
        `type` INT NOT NULL COMMENT "log type",
        `error_code` INT COMMENT "error code",
        `error_msg` VARCHAR(1024) COMMENT "error message",
        `op_id` BIGINT COMMENT "operator ID",
        `op_time` DATETIME COMMENT "operation time"
    )
    DUPLICATE KEY(`timestamp`, `type`)
    ... /* Omitted the information of Partition and Distribution */
    ;

31. Unique key model

    CREATE TABLE IF NOT EXISTS example_db.expamle_tb3
    (
        `user_id` LARGEINT NOT NULL COMMENT "user id",
        `username` VARCHAR(50) NOT NULL COMMENT "user name",
        `city` VARCHAR(20) COMMENT "the city user lives",
        `age` SMALLINT COMMENT "age of user",
        `sex` TINYINT COMMENT "sex of user",
        `phone` LARGEINT COMMENT "phone number",
        `address` VARCHAR(500) COMMENT "address of user",
        `register_time` DATETIME COMMENT "registered time"
    )
    UNIQUE KEY(`user_id`, `username`)
    ... /* Omitted the information of Partition and Distribution */
    ;

32. Contact us • Official website and source code: http://doris.apache.org and https://github.com/apache/incubator-doris • E-Mail: [email protected] • Document: http://doris.incubator.apache.org/documentation/en/

33. Thank you

Editor's Notes

#2: Hello, I'm very happy to have this opportunity to share Apache Doris with you. Thank you so much for coming. Today my topic is "Apache Doris (incubating): a simple and single tightly coupled OLAP system".
#3: First, let me introduce myself. I'm Reed, an Apache committer from China. I work for Baidu Inc., China's Google, as a senior software engineer and core developer of Doris. I have more than ten years of experience in software development and five years of experience managing a team. I previously worked for Naver, South Korea's largest web search engine, and Qihoo 360, China's largest internet security company.
#4: This talk has four parts. First, I will introduce Doris. Second, I will explain its architecture and key techniques. Third, I will list the features and functions of Doris. Lastly, I will describe the data storage model of Doris.
#5: 1. The introduction, what is Doris.
#6: Doris is an MPP-based interactive SQL data warehouse for reporting and analysis. It was originally started by Baidu Inc. in 2014, was contributed to the Apache Software Foundation by Baidu in July 2018, and is currently in its incubation phase. Unlike other popular SQL-on-Hadoop systems, Doris is designed to be a simple, single, tightly coupled system that does not depend on other systems. Doris provides highly concurrent, low-latency point queries as well as high-throughput ad-hoc analytical queries. It supports both batch data loading and real-time stream data loading. Doris also provides high availability, reliability, fault tolerance, and scalability. Its main features are simplicity (of developing, deploying and using) and the ability to meet many data serving requirements in a single system.
#7: Generally speaking, Doris combines the technology of Google Mesa and Apache Impala. Mesa is a highly scalable analytic data storage system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and system requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Mesa can satisfy many of our storage requirements, but Mesa itself does not provide a SQL query engine; Impala is a very good MPP SQL query engine, but at that time it lacked a suitable distributed storage engine. So in the end we chose to combine the two technologies.
#8: Doris is deployed at Baidu on more than 1000 machines and serves more than 200 business lines, including Baidu Fengchao and Baidu Statistics. The largest single business data volume exceeds 1 PB. It has also been well received in Baidu's public cloud and toB business.
#9: Since it was open-sourced, more than ten companies, including Sina Weibo, Sohu, Meituan, Lianjia, Guazi, iPinYou, Jingdong, Sichuan Airlines, Goldwind, Shanghai Electric and Xiaomi, have used Doris in their online businesses.
#10: Take JD.com as an example. JD.com is one of the biggest e-commerce platforms in China. In 2018, JD Group's market transaction volume was close to 1.7 trillion yuan. JD created the famous 618 shopping festival: during its store anniversary month, JD launches a series of large-scale promotional activities. Advertising is one of JD.com's important businesses, especially during 618 each year. According to the JD.com advertising platform team, which applied Doris in its system, Doris was stable and efficient during this year's 618 festival.
#11: Why did JD.com decide to use Doris in its online business? According to JD.com, there are four reasons. 1. Google's Mesa theory combined with Baidu's engineering practice. Doris has also entered Apache incubation, so its future looks promising and ongoing development and maintenance are assured. 2. Complete feature support and the standard MySQL protocol. Many existing MySQL peripheral modules can be used, overall use is very convenient, and the cost of migrating users from an existing MySQL database is very low. 3. High concurrency and high QPS support. The core code is implemented entirely in C++, which JD.com credits for its performance, and good design also helps Apache Doris perform better than other open-source products under high concurrency. 4. Convenient operation and maintenance with a clear structure. There are only the FE and BE modules, with few external dependencies, so the team can concentrate on maintaining the Doris system. Other ETL work can be handled by the business departments, freeing up manpower.
#12: 2. The ARCHITECTURE and Key techniques of Doris.
#13: Doris' implementation consists of two daemons: frontend (FE) and backend (BE). It is that simple: there are just two processes, FE and BE. Users can connect to the FE daemon with any client that is compatible with the MySQL protocol, including the native C API, JDBC, ODBC, PHP, Python, Perl, Ruby and so on. A typical Doris cluster is generally composed of several frontend daemons and dozens to hundreds of backend daemons.
#14: The frontend daemon consists of a query coordinator and a catalog manager. The query coordinator is responsible for receiving users' SQL queries, compiling them and managing their execution. The catalog manager is responsible for managing metadata such as databases, tables, partitions and replicas. Several frontend daemons can be deployed to provide fault tolerance and load balancing. The backend daemon stores the data and executes the query fragments. Many backend daemons can also be deployed to provide scalability and fault tolerance. Clients can use MySQL-related tools to connect to any frontend daemon and submit SQL queries. The frontend receives a query and compiles it into query plans executable by the backends. The frontend then sends the query plan fragments to the backends. The backends build a query execution DAG, and data is fetched and pipelined into the DAG. The final result is sent to the client via the frontend.
#15: As I said before, generally speaking, Doris combines the technologies of Google Mesa and Apache Impala, and also Apache ORCFile. In fact, we developed a distributed storage engine based on Google Mesa and Apache ORCFile. To be more exact, the techniques learned from Mesa include the aggregate data model, versioned data management, prefix index, online schema change and so on. For the data format, indexes, compression and encoding, we refer to Apache ORCFile. Unlike Mesa, the storage engine of Doris does not rely on any distributed file system. We also deeply integrated this storage engine with the Impala query engine. Query compiling, query execution coordination and catalog management of the storage engine are integrated into the frontend daemon; query execution and data storage are integrated into the backend daemon. With this integration, we implemented a single, full-featured, high-performance MPP database while maintaining simplicity. As I said before, simplicity (of developing, deploying and using) and meeting many data serving requirements in a single system are the main features of Doris.
#16: 3. Features and functions
#17: As an OLAP system, Doris can ingest data from local files, real-time data from Kafka, and HDFS files, and the data can be pre-aggregated as it enters Doris. A MySQL-compatible network protocol is implemented in Doris' frontend. The main reasons for using a MySQL-compatible protocol are as follows. First, engineers prefer a SQL interface. Second, compatibility with the MySQL protocol makes integration with existing BI software, such as Tableau, easier. Lastly, the rich set of MySQL client libraries and tools reduces both our development costs and users' adoption costs.
#18: The frontends of Doris can be configured in three kinds of roles: leader, follower and observer. Through a voting protocol, the follower frontends first elect a leader frontend. All metadata write requests are forwarded to the leader, which writes the operation into the replicated log. Once the new log entry has been replicated successfully to a quorum of followers, the leader commits the operation into memory and responds to the write request. Followers continuously replay the replicated log to apply it to their in-memory metadata. If the leader crashes, a new leader is elected from the remaining followers. Leader and followers mainly solve the problem of write availability and partly solve the problem of read scalability. The leader replicates the log stream to observers asynchronously. Observers do not take part in leader election. The replicated state machine is implemented on top of Berkeley DB Java Edition (BDB-JE for short). BDB-JE achieves high availability by implementing a Paxos-like consensus algorithm. We use BDB-JE to implement Doris' log replication and leader election. The in-memory catalog storage has three functional modules: real-time in-memory data structures, memory checkpoints on local disk and an operation relay log. When the catalog is modified, the mutation operation is first written into the log file. Then the mutation operation is applied to the in-memory data structures. Periodically, a thread takes a checkpoint that dumps an image of the in-memory data structures to local disk. The checkpoint mechanism enables fast startup of the frontend and reduces disk usage. The in-memory catalog also simplifies the implementation of multiple frontends.
#19: In the FE, a query from the client is translated into a distributed query plan, that is, a set of query fragments. These query fragments are sent to the backend nodes for execution. The placement of query fragments aims mainly to minimize data movement and maximize scan locality. Because Doris is designed for interactive analysis, the average execution time of queries is short. Considering this, we adopt query re-execution to provide fault tolerance for query execution.
#20: Like most distributed database systems, data in Doris is horizontally partitioned. However, a single-level partitioning rule (hash partitioning or range partitioning) may not be a good solution for all scenarios. For example, consider a user-based fact table that stores rows of the form (date, userid, metric). Hash partitioning only by column userid may lead to uneven distribution of data when one user's data is very large. Range partitioning by column date may also lead to uneven distribution, because data volume can surge in a certain period of time. Therefore we support a two-level partitioning rule. The first level is range partitioning: the user can specify a range of values of a column (usually the time series column) for each data partition. Within a partition, the user can also specify one or more columns and a number of buckets for hash partitioning. Users can combine different partitioning rules to divide the data better.
#21: Within a partition, the user can also specify one or more columns and a number of buckets for hash partitioning. Users can combine different partitioning rules to divide the data better. The two-level partitioning mechanism brings three benefits. First, old and new data can be separated and stored on different storage media. Second, the backend storage engine can reduce the IO and CPU spent on unnecessary data merging, because the data in some partitions is no longer updated. Lastly, each partition's bucket count can be different and can be adjusted as the data size changes.
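
The two-level rule described in these notes can be sketched in a few lines of Python. This is only an illustrative model, not Doris code: the function name `route` is hypothetical, the partition bounds are copied from the example DDL on slide 20, and `crc32` is an assumed stand-in for whatever hash function Doris actually uses.

```python
from bisect import bisect_right
from datetime import date
from zlib import crc32

# Range partitions taken from the slide's example DDL. Each partition holds
# dates strictly less than its upper bound (VALUES LESS THAN).
PARTITIONS = [
    ("p201801", date(2018, 2, 1)),
    ("p201802", date(2018, 3, 1)),
    ("p201803", date(2018, 4, 1)),
]
BUCKETS = 16  # matches BUCKETS 16 in the example DDL


def route(row_date, user_id, partitions=PARTITIONS, buckets=BUCKETS):
    """Return (partition name, bucket number) for one row."""
    bounds = [upper for _, upper in partitions]
    # First level: the first partition whose upper bound is greater than row_date.
    idx = bisect_right(bounds, row_date)
    if idx == len(partitions):
        raise ValueError(f"no partition covers {row_date}")
    # Second level: hash the distribution column into a bucket.
    # crc32 is an illustrative assumption, not the hash Doris uses.
    bucket = crc32(str(user_id).encode()) % buckets
    return partitions[idx][0], bucket
```

A row dated 2018-01-15 lands in `p201801`, while a row dated exactly 2018-02-01 lands in `p201802` because the upper bound is exclusive. The bucket depends only on `user_id`, so one user's rows within a partition stay together.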
#22: To achieve high update throughput, Doris applies updates only in batches, at most once per minute. Each update batch is assigned an increasing version number and generates a delta data file; the version is committed once the update has completed on a quorum of replicas. Queries read all committed data using the committed version; uncommitted versions are not used in queries. Update versions are strictly increasing. If an update involves more than one table, the versions of these tables are committed atomically. The MVCC mechanism allows Doris to guarantee atomic multi-table updates and query consistency. In addition, Doris uses compaction policies to merge delta files, which reduces the number of deltas and the cost of merging deltas at query time.
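
As a rough illustration of the versioned-delta idea in this note, here is a toy in-memory model. It is a sketch under stated assumptions, not how Doris is implemented: the class `VersionedTable` and its methods are hypothetical, SUM is assumed as the aggregation function, a quorum is assumed to be a simple majority of replicas, and the replica acknowledgements are passed in as a plain number.

```python
class VersionedTable:
    """Toy model: every committed batch becomes a delta tagged with a version."""

    def __init__(self, replicas=3):
        self.quorum = replicas // 2 + 1  # assumed: simple majority
        # Committed deltas as (first_version, last_version, {key: summed value}).
        self.deltas = []
        self.committed_version = 0

    def load_batch(self, rows, acks):
        """Commit one batch of (key, value) rows if a quorum acknowledged it.

        Returns the new version, or None if the batch was not committed.
        """
        if acks < self.quorum:
            return None  # uncommitted data never becomes visible to queries
        delta = {}
        for key, value in rows:
            delta[key] = delta.get(key, 0) + value
        version = self.committed_version + 1  # versions strictly increase
        self.deltas.append((version, version, delta))
        self.committed_version = version
        return version

    def query(self, key):
        """Merge all committed deltas at query time (SUM aggregation)."""
        return sum(delta.get(key, 0) for _, _, delta in self.deltas)

    def compact(self):
        """Merge all deltas into one so later queries touch fewer deltas."""
        if len(self.deltas) < 2:
            return
        merged = {}
        for _, _, delta in self.deltas:
            for key, value in delta.items():
                merged[key] = merged.get(key, 0) + value
        first_version = self.deltas[0][0]
        last_version = self.deltas[-1][1]
        self.deltas = [(first_version, last_version, merged)]
```

Compaction changes how many deltas a query has to merge but not the query result, which is the property the note relies on.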
#23: Mesa also supports creating materialized rollups, which contain a subset of the schema's columns to achieve better aggregation. In Doris, a rollup is a materialized view that contains a subset of the table's columns. A table may contain multiple rollups with columns in different orders. Based on the sort key index and column coverage of the rollups, Doris can select the best rollup for each query. Because most rollups contain only a few columns, the aggregated data is typically much smaller and query performance can improve greatly. All the rollups of a table are updated atomically. Because rollups are materialized, users should weigh query latency against storage space when using them.
#24: We create a materialized view, namely a rollup, with an ALTER TABLE statement. After it is created, we can view it with the SHOW ALTER TABLE statement. To find out whether the rollup takes effect, we can use the EXPLAIN statement to view the query plan, which shows the name of the rollup table.
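
The rollup selection mentioned in these notes can be approximated with a simple covering check. The sketch below is a deliberate simplification and an assumption on my part: it considers only column coverage and uses the column count as a proxy for data size, whereas the notes say Doris also takes the sort key index into account. The function name `choose_rollup` and the rollup names in the usage note are hypothetical.

```python
def choose_rollup(rollups, query_columns):
    """Pick the rollup with the fewest columns that covers the query.

    rollups maps a rollup name to its column list; the base table is
    included as one of the candidates.
    """
    needed = set(query_columns)
    covering = [
        (len(columns), name)
        for name, columns in rollups.items()
        if needed <= set(columns)  # the rollup must contain every queried column
    ]
    if not covering:
        raise ValueError("no rollup covers the query columns")
    # Fewest columns wins; ties are broken alphabetically by name.
    return min(covering)[1]
```

With candidates such as a full base table and a two-column rollup on `city` and `cost`, a query that touches only `city` and `cost` would be served by the rollup, while a query that also needs `user_id` falls back to the base table.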
#25: The storage engine of Doris uses columnar storage. Compared with a row-oriented layout, column-oriented organization is more efficient when an aggregate needs to be computed over many rows but only over a small subset of the columns, because reading that smaller subset of data can be faster than reading whole rows. Columnar storage is also space-friendly thanks to the high compression ratio of each column. Furthermore, columns support block-level indexes such as the min/max index and the bloom filter index. The query executor can filter out many blocks that do not satisfy the predicate, which further improves query performance.
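
To make the block-level min/max index concrete, here is a minimal sketch of block skipping for an equality predicate on one column. The function names and the block size are illustrative assumptions; the actual on-disk format in Doris is not described in these notes.

```python
def build_min_max_index(values, block_size):
    """Split one column into blocks and record (min, max, block) per block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block), block) for block in blocks]


def scan_equal(index, target):
    """Return (matching values, number of blocks actually read)."""
    matches = []
    blocks_read = 0
    for low, high, block in index:
        if target < low or target > high:
            continue  # the min/max range rules this block out without reading it
        blocks_read += 1
        matches.extend(value for value in block if value == target)
    return matches, blocks_read
```

Because data in Doris tables is sorted by key, value ranges of neighbouring blocks tend not to overlap, which is when this kind of pruning is most effective.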
#26: 4. STORAGE – Data storage model
#27: Doris combines Google Mesa's data model with ORCFile / Parquet storage technology. Data in Mesa is inherently a multi-dimensional fact table. The facts in a table typically consist of two types of attributes: dimensional attributes (which we call keys) and measure attributes (which we call values). The table schema also specifies the aggregation function F: V × V → V, which is used to aggregate the values corresponding to the same key.
#28: To achieve high update throughput, Mesa loads data in batches. Each batch of data is converted to a delta file. Mesa uses an MVCC approach to manage these delta files and thereby enforce update atomicity.
#29: Like traditional databases, Doris stores structured data represented as tables. Each table has a well-defined schema consisting of a finite number of columns. Users can create three types of tables to meet different needs in interactive query scenarios: aggregate key, duplicate key and unique key. Data in all three types of tables is sorted by KEY.
#30: Type one, the aggregate key model. We define this model by specifying the keywords AGGREGATE KEY, shown in red here. According to Mesa's data model, all columns of a table are divided into two parts, keys and values. The columns listed in the parentheses are the keys, and the rest of the columns are values, such as last visit date, cost, max dwell time and min dwell time. We can define an aggregation function for each value, such as SUM, MIN, MAX or REPLACE, and this function aggregates the values that share the same key.
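
The aggregate key model can be pictured as folding rows with the same key through a per-column function F: V × V → V. The following sketch assumes rows are plain dictionaries and that, for REPLACE, the row that appears later in the input wins; the names `AGGREGATORS` and `aggregate` are hypothetical and this is not Doris' implementation.

```python
# One binary function per aggregation type, each of the form F(old, new) -> value.
AGGREGATORS = {
    "SUM": lambda old, new: old + new,
    "MAX": max,
    "MIN": min,
    "REPLACE": lambda old, new: new,  # assumed: the later row in the input wins
}


def aggregate(rows, key_columns, value_functions):
    """Fold rows that share a key into a single row.

    rows is a list of dicts; value_functions maps each value column to the
    name of its aggregation function.
    """
    table = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key not in table:
            # First row for this key: take its values as they are.
            table[key] = {column: row[column] for column in value_functions}
            continue
        current = table[key]
        for column, function_name in value_functions.items():
            current[column] = AGGREGATORS[function_name](current[column], row[column])
    return table
```

Using the value columns from slide 29, two rows for the same user and date with costs 20 and 15 would collapse into one row with `cost` 35, the larger `max_dwell_time`, the smaller `min_dwell_time`, and the `last_visit_date` of the later row.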
#31: Next, type two, the duplicate key model. We define this model by specifying the keywords DUPLICATE KEY, shown in red here. As in the aggregate key model, the columns listed in the parentheses are the keys, and the rest of the columns are values, such as error code, error message, op id and op time. In this model, Doris does not aggregate rows: data is stored as loaded, and rows that share the same key are all kept. The key columns only determine the sort order.
#32: Next, type three, the unique key model. We define this model by specifying the keywords UNIQUE KEY, shown in red here. As in the aggregate key model, the columns listed in the parentheses are the keys, and the rest of the columns are values, such as city, age and so on. In this model, Doris replaces rows that have the same key; in fact, this model is exactly equivalent to an aggregate key model in which every value column uses the REPLACE function.
#33: Finally, here is how to contact us: the official website and the source code on GitHub, the e-mail address and the documentation.
#34: Thank you!
