An Introduction to Distributed Search with Cassandra and Solr (DataStax Academy)
Cassandra is a distributed database that can be paired with Solr for distributed search. Data is written to Cassandra and indexed by Solr, enabling fast, scalable full-text search across nodes. Queries can be issued directly against Cassandra or through the Solr API, each with different performance tradeoffs. Production deployments typically mix Cassandra and Solr nodes to serve analytics and search workloads.
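To illustrate the two query paths, here is a minimal sketch using the DataStax Python driver; the keyspace, table, and search column are assumptions for the example, and the solr_query pseudo-column is a DSE Search feature whose syntax varies by version:

    # Sketch: querying by key vs. full-text search (hypothetical keyspace/table).
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("demo")

    # Plain CQL lookup by partition key.
    rows = session.execute("SELECT id, body FROM articles WHERE id = %s", ("a1",))

    # Full-text search pushed down to the Solr index via solr_query.
    rows = session.execute(
        "SELECT id, body FROM articles WHERE solr_query = 'body:cassandra'"
    )
    for row in rows:
        print(row.id, row.body)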
Working with Complex Types in DataFrames: Optics to the Rescue (Databricks)
This document summarizes Alfonso Roa's presentation on using optics to work with complex types in Spark DataFrames. The presentation introduces the problem of manipulating nested structures in DataFrames and demonstrates how optics libraries like Monocle can be used to focus on specific elements. It then shows how Spark optics provides a similar lens-based API for DataFrames, allowing changes to nested elements to be made easily through composition of lenses. The presentation concludes by discussing additional lens functionality for schema changes and future work to improve Spark optics.
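The spark-optics library itself is Scala; as a rough analogue of the problem a lens solves, here is a PySpark sketch using Column.withField (Spark 3.1+) to update one nested field without rebuilding the whole struct — the schema is invented for the example:

    # Sketch: a "focused" update on a nested field, the problem optics address.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, upper

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, ("Alice", "Madrid"))],
        "id INT, person STRUCT<name: STRING, city: STRING>",
    )

    # Transform person.city in place, leaving sibling fields untouched.
    updated = df.withColumn(
        "person", col("person").withField("city", upper(col("person.city")))
    )
    updated.show(truncate=False)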
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd... (Riccardo Zamana)
Riccardo Zamana presented on time series analytics using Azure Data Explorer (ADX). The presentation covered ADX basics, architecture, quickstart, ingestion techniques including LightIngest and One Click ingestion, query optimization techniques like materialized views and caching, and visualization using dashboards in Kusto and Grafana. The document provided code examples of queries, functions, and operators for time series analysis in ADX.
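For flavor, a minimal Python sketch running a time-series KQL query through the azure-kusto-data client; the cluster URL, database, and table are placeholders, and the query is abbreviated:

    # Sketch: a make-series query against ADX (placeholder cluster/database/table).
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://mycluster.westeurope.kusto.windows.net"
    )
    client = KustoClient(kcsb)

    query = """
    Telemetry
    | make-series avg_value = avg(Value) default = 0 on Timestamp step 1h
    | extend anomalies = series_decompose_anomalies(avg_value)
    """
    response = client.execute("mydb", query)
    for row in response.primary_results[0]:
        print(row)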
The document summarizes a presentation on the internals of InnoDB file formats and source code structure. The presentation covers InnoDB's design goal of being optimized for online transaction processing (OLTP) with performance, reliability, and scalability. It describes the InnoDB architecture and on-disk file formats, including tablespaces, pages, rows, and indexes, and it also discusses the source code structure.
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015PostgreSQL-Consulting
This document discusses how PostgreSQL works with disks and provides recommendations for disk subsystem monitoring, hardware selection, and configuration tuning to optimize performance. It explains that PostgreSQL relies on disk I/O for reading pages, writing the write-ahead log (WAL), and checkpointing. It recommends monitoring disk utilization, IOPS, latency, and I/O wait. The document also provides tips for choosing hardware like SSDs or RAID configurations and configuring the operating system, file systems, and PostgreSQL to improve performance.
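As one concrete monitoring example, checkpoint and background-writer activity can be sampled from the statistics views; a minimal psycopg2 sketch, with connection details as placeholders:

    # Sketch: sampling checkpoint/bgwriter counters to watch write pressure.
    import psycopg2

    conn = psycopg2.connect("dbname=postgres user=postgres")
    with conn.cursor() as cur:
        # pg_stat_bgwriter: checkpoint counts and buffers written by each
        # path (column layout as of the PostgreSQL versions of that era).
        cur.execute("""
            SELECT checkpoints_timed, checkpoints_req,
                   buffers_checkpoint, buffers_backend
            FROM pg_stat_bgwriter
        """)
        print(cur.fetchone())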
Observability for Data Pipelines With OpenLineage (Databricks)
Data is increasingly becoming core to many products, whether it is providing recommendations to users, gaining insight into how they use the product, or using machine learning to improve the experience. This creates a critical need for reliable data operations and for understanding how data flows through our systems. Data pipelines must be auditable, reliable, and run on time. This proves particularly difficult in a constantly changing, fast-paced environment.
Collecting this lineage metadata while data pipelines run provides an understanding of the dependencies between the many teams consuming and producing data, and of how constant changes impact them. It is the underlying foundation that enables the many use cases related to data operations. The OpenLineage project is an API that standardizes this metadata across the ecosystem, reducing complexity and duplicated work in collecting lineage information. It enables the many producers and consumers of lineage in the ecosystem, whether they focus on operations, governance, or security.
Marquez is an open source project, part of the LF AI & Data Foundation, that instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making dependencies visible across organizations and technologies as they change over time.
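At its core, OpenLineage is an HTTP API that accepts run events; a minimal sketch posting a start event to a local Marquez instance — the namespace, job, and dataset names are illustrative:

    # Sketch: emitting an OpenLineage run event (illustrative job/dataset names).
    import datetime, uuid, requests

    event = {
        "eventType": "START",
        "eventTime": datetime.datetime.utcnow().isoformat() + "Z",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "my-namespace", "name": "daily_orders_etl"},
        "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
        "outputs": [{"namespace": "warehouse", "name": "analytics.daily_orders"}],
        "producer": "https://example.com/my-pipeline",
    }
    requests.post("http://localhost:5000/api/v1/lineage", json=event, timeout=10)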
This document summarizes a presentation by Vinay Chella and Joey Lynch from Netflix on building and running cloud native Cassandra. They outline some of Cassandra's limitations for cloud deployments including development friction, packaging issues, cluster startup difficulties, and lack of scaling tools. Their proposals aim to address these by improving documentation, automating builds/tests, packaging for containers/packages, adding cluster control planes, and integrating metrics/monitoring. The speakers believe targeted changes can help Cassandra better support cloud-native principles of flexibility, scalability, and reliability.
Managing your Black Friday logs - CloudConf.IT (David Pilato)
Monitoring an entire application is not a simple task, but with the right tools it is not a hard one either. However, events like Black Friday can push your application to the limit and even cause crashes. As the system comes under stress, it generates far more logs, which may crash the monitoring system as well. In this talk I will walk through best practices for using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you handle the huge increase in traffic typical of Black Friday.
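One such practice is putting time-based log indices under index lifecycle management, so a traffic spike rolls indices over by size instead of overwhelming a single one; a hedged sketch with the official Python client, with policy settings and names made up for the example:

    # Sketch: an ILM policy that rolls hot indices over by size/age (illustrative).
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    es.ilm.put_lifecycle(
        name="logs-policy",
        policy={
            "phases": {
                # Roll over before any single index becomes unmanageable.
                "hot": {"actions": {"rollover": {"max_size": "50gb", "max_age": "1d"}}},
                # Drop old log indices to cap disk usage.
                "delete": {"min_age": "30d", "actions": {"delete": {}}},
            }
        },
    )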
WebLogic in Practice: SSL Configuration (Simon Haslam)
The document provides an overview of SSL configuration in Oracle WebLogic Server. It discusses key SSL concepts like key pairs, certificates, and certificate authorities. It describes how WebLogic uses Java keystores for identity and trust, and the tools like keytool and orapki that can be used to manage keys and certificates. The document also covers best practices for SSL configuration in WebLogic like always enabling hostname verification and not using demo certificates in production.
Some Iceberg Basics for Beginners (CDP).pdf (Michael Kogan)
The document describes the recommended Iceberg workflow, which includes 8 steps (a short sketch of a few of the steps follows the list):
1) Create Iceberg tables from existing datasets or sample datasets
2) Batch insert data to prepare for time travel scenarios
3) Create security policies for fine-grained access control
4) Build BI queries for reporting
5) Build visualizations from query results
6) Perform time travel queries to audit changes
7) Optimize partition schemas to improve query performance
8) Manage and expire snapshots for table maintenance
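A hedged Spark SQL sketch of steps 1, 2, 6, and 8; the catalog, table, and paths are placeholders, and exact syntax varies by Iceberg and Spark version:

    # Sketch: create, insert, time travel, and snapshot expiry (placeholder names).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Step 1: create an Iceberg table from an existing dataset.
    spark.sql("CREATE TABLE demo.db.flights USING iceberg AS "
              "SELECT * FROM parquet.`/data/flights`")

    # Step 2: batch insert to create more snapshots for time travel.
    spark.sql("INSERT INTO demo.db.flights SELECT * FROM parquet.`/data/flights_day2`")

    # Step 6: time travel to audit an earlier state (Spark 3.3+ syntax).
    spark.sql("SELECT count(*) FROM demo.db.flights "
              "TIMESTAMP AS OF '2024-01-01 00:00:00'").show()

    # Step 8: expire old snapshots for table maintenance.
    spark.sql("CALL demo.system.expire_snapshots(table => 'db.flights', "
              "older_than => TIMESTAMP '2024-01-01 00:00:00')")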
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in... (Julian Hyde)
What if Looker saw the queries you just executed and could predict your next query? Could it make those queries faster, by smarter caching, or aggregate navigation? Could it read your past SQL queries and help you write your LookML model? Those are some of the reasons to add relational algebra into Looker’s query engine, and why Looker hired Julian Hyde, author of Apache Calcite, to lead the effort. In this talk about the internals of Looker’s query engine, Julian Hyde will describe how the engine works, how Looker queries are described in Calcite’s relational algebra, and some features that it makes possible.
A talk by Julian Hyde at JOIN 2019 in San Francisco.
Cassandra Data Modeling - Practical Considerations @ Netflix (nkorla1share)
The Cassandra community has consistently requested that we cover C* schema design concepts. This presentation goes in depth on the following topics (a short CQL sketch follows the list):
- Schema design
- Best Practices
- Capacity Planning
- Real World Examples
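For instance, the core schema-design rule — model tables around queries, with partition keys bucketed for even distribution — looks like this in CQL; the schema is a made-up illustration, run here through the Python driver:

    # Sketch: a query-driven time-series table (illustrative schema).
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("""
        CREATE TABLE IF NOT EXISTS media.views_by_title_day (
            title     text,
            day       date,        -- day bucket bounds partition size
            viewed_at timestamp,
            user_id   uuid,
            PRIMARY KEY ((title, day), viewed_at)
        ) WITH CLUSTERING ORDER BY (viewed_at DESC)
    """)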
1. What is Solr?
2. When should I use Solr vs. Azure Search?
3. Why is Solr great (and its downside)?
4. How does Solr compare to Azure Search?
5. Why SearchStax? (Solr is complex; SearchStax makes it as easy as Azure Search)
There are parallels between storing JSON data in PostgreSQL and storing vectors that are produced from AI/ML systems. This lightning talk briefly covers the similarities in use-cases in storing JSON and vectors in PostgreSQL, shows some of the use-cases developers have for querying vectors in Postgres, and some roadmap items for improving PostgreSQL as a vector database.
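A short sketch of the vector use-case with the pgvector extension, mirroring how a JSON column would be queried; the table and data are invented for the example:

    # Sketch: storing and querying embeddings with pgvector (illustrative table).
    import psycopg2

    conn = psycopg2.connect("dbname=postgres user=postgres")
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("CREATE TABLE IF NOT EXISTS items "
                    "(id bigserial PRIMARY KEY, embedding vector(3))")
        cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')")
        # Nearest-neighbor search by Euclidean distance (the <-> operator).
        cur.execute("SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 1")
        print(cur.fetchone())
    conn.commit()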
Building a data lake is a daunting task. The promise of a virtual data lake is to provide the advantages of a data lake without consolidating all data into a single repository. With Apache Arrow and Dremio, companies can, for the first time, build virtual data lakes that provide full access to data no matter where it is stored and no matter what size it is.
Maxscale switchover, failover, and auto rejoin (Wagner Bianchi)
How MariaDB MaxScale switchover, failover, and rejoin work under the hood, by Esa Korhonen and Wagner Bianchi.
You can watch the video of the presentation at https://www.linkedin.com/feed/update/urn:li:activity:6381185640607809536
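For reference, automatic failover and rejoin are driven by the MariaDB Monitor; a minimal maxscale.cnf excerpt, with server names and credentials as placeholders:

    [MariaDB-Monitor]
    type=monitor
    module=mariadbmon
    servers=server1,server2,server3
    user=maxuser
    password=maxpwd
    monitor_interval=2000ms
    auto_failover=true   # promote a replica if the primary fails
    auto_rejoin=true     # rejoin a recovered old primary as a replica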
How to size up an Apache Cassandra cluster (Training) (DataStax Academy)
This document discusses how to size a Cassandra cluster based on replication factor, data size, and performance needs. It explains that replication factor, data size, data velocity, and hardware considerations such as CPU, memory, and disk type should all be examined to determine the appropriate number of nodes. The goal is to have enough nodes to store the data, hit target throughput levels, and maintain performance and availability even when nodes fail.
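To make the arithmetic concrete, a back-of-the-envelope sizing sketch; every number here is an assumed example, not a recommendation:

    # Sketch: naive node-count estimate from capacity and throughput (assumed inputs).
    raw_tb           = 10        # data before replication, TB
    replication      = 3         # replication factor
    usable_per_node  = 1.0       # usable TB per node after compaction headroom
    target_reads_s   = 200_000   # cluster-wide read ops/sec
    reads_per_node_s = 15_000    # measured per-node capacity (benchmark-dependent)

    nodes_for_capacity   = raw_tb * replication / usable_per_node   # 30 nodes
    nodes_for_throughput = target_reads_s / reads_per_node_s        # ~14 nodes
    nodes = max(nodes_for_capacity, nodes_for_throughput)
    print(f"provision at least {nodes:.0f} nodes, plus headroom for node failures")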
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath... (Databricks)
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood that make all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming.
In particular, I’m going to discuss the following (a small sketch follows the list):
• Different stateful operations in Structured Streaming
• How state data is stored in a distributed, fault-tolerant manner using State Stores
• How you can write custom State Stores for saving state to external storage systems.
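As a minimal example of implicit state, a streaming aggregation whose running counts live in the state store and survive restarts via the checkpoint; the source and paths are illustrative:

    # Sketch: stateful streaming aggregation backed by the state store.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window

    spark = SparkSession.builder.getOrCreate()
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # Running counts per 1-minute window: the partial counts per window are
    # the "state" kept in the state store between micro-batches.
    counts = events.groupBy(window(events.timestamp, "1 minute")).count()

    query = (counts.writeStream
             .outputMode("update")
             .format("console")
             .option("checkpointLocation", "/tmp/checkpoints/rate_counts")
             .start())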
Spark + Cassandra = Real Time Analytics on Operational Data (Victor Coustenoble)
This document discusses using Apache Spark and Cassandra together for real-time analytics on transactional data. It provides an overview of Cassandra and how it can be used for operational applications like recommendations, fraud detection, and messaging. It then discusses how the Spark Cassandra Connector allows reading and writing Cassandra data from Spark, enabling real-time analytics on streaming and batch data using Spark SQL, MLlib, and Spark Streaming. It also covers some DataStax Enterprise features for high availability and integration of Spark and Cassandra.
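A minimal read/write round trip with the connector's DataFrame API; the keyspace and table names are placeholders, and the connector package must be on the Spark classpath:

    # Sketch: DataFrame read/write through the Spark Cassandra Connector.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    # Read an operational table into Spark for analytics.
    df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(keyspace="shop", table="orders")
          .load())

    # Write an aggregate back to a query-specific Cassandra table.
    agg = df.groupBy("customer_id").count()
    (agg.write.format("org.apache.spark.sql.cassandra")
        .options(keyspace="shop", table="orders_by_customer")
        .mode("append")
        .save())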
Key-Key-Value Store: Generic NoSQL Datastore with Tombstone Reduction and Aut... (ScyllaDB)
This document describes a key-key-value store that Discord built to store composite primary keys in ScyllaDB. The store aims to mitigate performance issues from tombstones and large partitions by using application tombstones and automatic partition splitting. It provides APIs for creating, getting, deleting and scanning entities under a parent identifier. When partitions grow too large, it doubles the shard count and copies data to new shards in a resilient process. This generic datastore has supported over 20 use cases at Discord with no production incidents attributed to ScyllaDB.
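Discord's datastore is internal, but the key-key-value shape it describes — a parent key, a child key, and a value, with shard-aware placement — can be sketched as a hypothetical wrapper over the Python driver; every name below is invented for illustration:

    # Sketch: a hypothetical key-key-value API over a Scylla/Cassandra table.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("kkv")
    # Schema idea: PRIMARY KEY ((parent_id, shard), child_id); the shard count
    # can be doubled and data copied over when a parent's partition grows too big.

    def put(parent_id, shard, child_id, value):
        session.execute(
            "INSERT INTO entries (parent_id, shard, child_id, value) "
            "VALUES (%s, %s, %s, %s)",
            (parent_id, shard, child_id, value))

    def scan(parent_id, shard):
        return session.execute(
            "SELECT child_id, value FROM entries WHERE parent_id = %s AND shard = %s",
            (parent_id, shard))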
Understanding what a region is in HBase, why regions transition between states, and how to troubleshoot and fix potential problems that may arise from this important HBase internal operation.
re:Invent 2022 DAT326 Deep dive into Amazon Aurora and its innovations (Grant McAlister)
With an innovative architecture that decouples compute from storage as well as advanced features like Global Database and low-latency read replicas, Amazon Aurora reimagines what it means to be a relational database. The result is a modern database service that offers performance and high availability at scale, fully open-source MySQL- and PostgreSQL-compatible editions, and a range of developer tools for building serverless and machine learning-driven applications. In this session, dive deep into some of the most exciting features Aurora offers, including Aurora Serverless v2 and Global Database. Also learn about recent innovations that enhance performance, scalability, and security while reducing operational challenges.
This document provides technical details about PostgreSQL WAL (Write Ahead Log) buffers. It describes the structure and purpose of WAL segments, WAL records, and their components. It also explains how the WAL is used to safely recover transactions after a server crash by replaying the log.
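To poke at these structures from a session, PostgreSQL exposes the current WAL insert position and the segment file it maps to; a small sketch, with connection details as placeholders:

    # Sketch: inspecting the WAL position and the segment that holds it.
    import psycopg2

    conn = psycopg2.connect("dbname=postgres user=postgres")
    with conn.cursor() as cur:
        cur.execute("SELECT pg_current_wal_lsn(), "
                    "pg_walfile_name(pg_current_wal_lsn())")
        lsn, segment = cur.fetchone()
        print(f"current LSN {lsn} lives in WAL segment {segment}")
        cur.execute("SHOW wal_buffers")   # size of the in-memory WAL buffer
        print("wal_buffers =", cur.fetchone()[0])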
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud (Michael Stack)
"New Journey of HBase in Alibaba and Cloud" discusses Alibaba's use of HBase over eight years and the improvements made along the way. Key points discussed include:
- Alibaba began using HBase in 2010 and has since contributed to the open source community while developing internal improvements.
- Challenges addressed include JVM garbage collection pauses, separating computing and storage, and adding cold/hot data tiering. A diagnostic system was also created.
- Alibaba uses HBase across many core scenarios and has integrated it with other databases in a multi-model approach to support different workloads.
- Benefits of running HBase on cloud include flexibility, cost savings, and making it...
Oracle E-Business Suite 12.2 - The Upgrade to End All Upgrades (Shiri Amit)
This business-led session discusses key roadmap and project planning considerations for organizations that are thinking to upgrade. It combines lessons learned from customers that have completed the upgrade with advice from Oracle user group leaders.
2. Topic
- Business Requirement
- Process of solution
- Data Analysis & Features Impact
- Features Impact & Features Selection
- Model Training and Prediction
- Customer selection to recommend service
- Service recommendation
3. Business Requirement
- High customer churn rate
- High cost of acquiring new customers
- Causes of churn
- How to retain existing customers
12. Model Training & Evaluation

Decision Tree
- Area under ROC = 0.747369
- Area under PR = 0.685616
- Test error = 0.202934

Random Forest
- Area under ROC = 0.7007775324935394
- Area under PR = 0.6600632136509201
- Test error = 0.20293

Logistic Regression
- Area under ROC = 0.5
- Threshold = 0.26079802550390785
- F-measure = 0.41370309951060363
- Binomial intercept = -1.041824932346333
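These figures are the standard Spark MLlib evaluator outputs; a hedged sketch of how such numbers are typically computed, where the predictions DataFrame is assumed to be the output of a fitted classifier's transform():

    # Sketch: computing area under ROC/PR and test error for model predictions.
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # `predictions` is assumed to hold rawPrediction/prediction/label columns,
    # e.g. from a fitted DecisionTreeClassifier applied to the test set.
    roc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(predictions)
    pr  = BinaryClassificationEvaluator(metricName="areaUnderPR").evaluate(predictions)
    test_error = predictions.filter("prediction != label").count() / predictions.count()
    print(roc, pr, test_error)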
1. Contract (4)
2. Tenure (3)
3. Fiber Optic (7)
4. Monthly Charges (18)
5. DSL (6)
6. Payment Method (15)
7. Technical Support (11)
8. Streaming Movies (13)