Galaxy Big Data with MariaDB 10 by Bernard Garros, Sandrine Chirokoff and Stéphane Varoqui.
Presented 26.6.2014 at the MariaDB Roadshow in Paris, France.
Spider's HA structure includes data nodes, Spider nodes, and monitoring nodes. Data nodes store the data, Spider nodes provide load balancing and failover, and monitoring nodes watch the data nodes. To add a new data node without stopping service: 1) create the table on the new node, 2) alter the tables on the monitoring nodes to include the new node, 3) alter the clustered Spider table's connection definition to include the new node, and 4) copy the data to the new node. This preserves redundancy, so the failure of a node can be absorbed without service interruption.
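A minimal sketch of these steps, assuming a Python client using mysql.connector and hypothetical host names (backend3 for the new data node, spider1 for the Spider node); the Spider COMMENT/CONNECTION syntax shown is illustrative and varies by setup and version:

```python
# Hypothetical sketch: adding a data node to a Spider setup.
# Host names, credentials, table names, and the Spider COMMENT syntax are
# assumptions, not taken from the original slides -- adapt to your topology.
import mysql.connector

NEW_NODE = {"host": "backend3", "user": "repl", "password": "secret"}
SPIDER_NODE = {"host": "spider1", "user": "admin", "password": "secret"}

DDL_ON_NEW_NODE = """
CREATE TABLE shop.orders (
  id BIGINT PRIMARY KEY,
  amount DECIMAL(10,2)
) ENGINE=InnoDB
"""

# Steps 2/3: point the monitoring and Spider table definitions at the new
# backend. The COMMENT below is illustrative; real Spider definitions list
# backends via server/table parameters that differ between versions.
ALTER_ON_SPIDER_NODE = """
ALTER TABLE shop.orders
  COMMENT='wrapper "mysql", table "orders", srv "backend1 backend2 backend3"'
"""

def run(conn_params, statement):
    conn = mysql.connector.connect(database="shop", **conn_params)
    try:
        cur = conn.cursor()
        cur.execute(statement)
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    run(NEW_NODE, DDL_ON_NEW_NODE)          # 1) create the table on the new node
    run(SPIDER_NODE, ALTER_ON_SPIDER_NODE)  # 2)+3) update monitoring/Spider definitions
    # 4) copy existing data to the new node, e.g. with mariadb-dump / LOAD DATA.
```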
RMAN backup scripts should be improved in the following ways:
1. Log backups thoroughly and send failure alerts to ensure recoverability.
2. Avoid relying on a single backup and use redundancy to protect against data loss.
3. Back up control files last and do not delete archives until backups are complete.
4. Check backups regularly to ensure they meet recovery needs.
Graph Structure in the Web - Revisited. WWW2014 Web Science Track, by Chris Bizer
The document discusses research that revisits the graph structure of the web using a new large crawl from Common Crawl. It finds that the web has become more dense and connected over time, with the largest strongly connected component growing significantly. While previous research found power laws for in- and out-degrees, this data does not fit power laws and instead has heavy-tailed distributions. The shape of the bow-tie structure also depends on the specific crawl used. The authors provide the new crawl data and analysis to enable further research on the evolving structure of the web graph.
Oracle Active Data Guard 12c: Far Sync Instance, Real-Time Cascade and Other ... by Ludovico Caldara
Slides used for my Oracle Open World 2014 #OOW14 session.
The new release of Oracle Database has come with many new exciting enhancements for high availability. The aim of this presentation is to introduce some new Oracle Active Data Guard features through practical examples and live demos. Among the various enhancements, the new Far Sync Instance and Real-Time Cascade Standby features receive special attention in the session.
The document discusses Bloom filters, which are compact data structures used to represent sets probabilistically. Bloom filters allow membership queries to determine if an element is in a set, but may return false positives. They provide a more space-efficient alternative to other data structures like hash tables. The key properties of Bloom filters are that they require less memory than other solutions, allow fast membership checking, and never return false negatives, though they can return false positives. Several applications of Bloom filters are also mentioned such as spell checkers, password checking, and caching.
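As a concrete illustration (not taken from the slides), here is a minimal Bloom filter sketch in Python; the hash scheme and sizing are simplified for readability:

```python
# Minimal Bloom filter sketch: k hash functions over an m-bit array.
# False positives are possible, false negatives are not.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
for word in ["color", "colour", "centre"]:
    bf.add(word)
print("colour" in bf)    # True
print("behavior" in bf)  # False (or, rarely, a false positive)
```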
The document discusses different RAID levels for storing data across multiple disks. It provides details on RAID levels 0 through 6, including the minimum number of drives required, how data and parity are distributed, and example diagrams. The benefits of RAID include preventing data loss from disk failures through techniques like mirroring, striping, and parity.
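To make the parity idea concrete, here is a small illustrative sketch (not from the slides) of RAID 5 style XOR parity and single-disk reconstruction in Python:

```python
# RAID 5 style parity sketch: the parity block is the XOR of the data blocks,
# so any single missing block can be rebuilt from the surviving blocks.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d0 = b"AAAA"
d1 = b"BBBB"
d2 = b"CCCC"
parity = xor_blocks([d0, d1, d2])

# Simulate losing d1 and rebuilding it from the surviving blocks plus parity.
rebuilt_d1 = xor_blocks([d0, d2, parity])
assert rebuilt_d1 == d1
print("reconstructed:", rebuilt_d1)
```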
Oracle Fleet Patching and Provisioning Deep Dive Webcast Slides, by Ludovico Caldara
Oracle Fleet Patching and Provisioning allows users to provision, patch, and upgrade Oracle databases and Grid Infrastructure across many servers from a central location. It uses a repository of gold images and working copies to deploy consistent configurations at scale while minimizing errors. Key features include Oracle home management, provisioning, patching, upgrading, and integration with REST APIs.
YugaByte DB Internals - Storage Engine and Transactions, by Yugabyte
This document introduces YugaByte DB, a high-performance, distributed, transactional database. It is built to scale horizontally on commodity servers across data centers for mission-critical applications. YugaByte DB uses a transactional document store based on RocksDB, Raft-based replication for resilience, and automatic sharding and rebalancing. It supports ACID transactions across documents, provides APIs compatible with Cassandra and Redis, and is open source. The architecture is designed for high performance, strong consistency, and cloud-native deployment.
Snowflake concepts & hands on expertise to help get you started on implementing Data warehouses using Snowflake. Necessary information and skills that will help you master Snowflake essentials.
Open Source 101 2022 - MySQL Indexes and Histograms, by Frederic Descamps
Nobody complains that the database is too fast. But when things slow down, the complaints come quickly. The two most popular approaches to speeding up queries are indexes and histograms. But there are so many options and types of indexes that it can get confusing. Histograms are fairly new to MySQL, and they do not work for all types of data. This talk covers how indexes and histograms work and shows you how to test just how effective they are, so you can measure the performance of your queries.
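A short hedged sketch of the two techniques the talk compares, using made-up table and column names; the SQL shown is standard MySQL 8.0 syntax, executed here through Python's mysql-connector:

```python
# Hedged sketch: creating an index vs. building a histogram in MySQL 8.0.
# Table/column names (orders, amount, customer_id) are invented for illustration.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="shop")
cur = conn.cursor()

# Option 1: an index -- best for selective lookups and range scans.
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# Option 2: a histogram -- helps the optimizer estimate row counts on columns
# that are filtered but not worth indexing.
cur.execute("ANALYZE TABLE orders UPDATE HISTOGRAM ON amount WITH 32 BUCKETS")
for row in cur.fetchall():
    print(row)

# Measure the effect: compare the optimizer's row estimates.
cur.execute("EXPLAIN FORMAT=JSON SELECT * FROM orders "
            "WHERE customer_id = 42 AND amount > 100")
print(cur.fetchone()[0])

conn.close()
```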
This document discusses PySpark DataFrames. It notes that DataFrames can be constructed from various data sources and are conceptually similar to tables in a relational database. The document explains that DataFrames allow richer optimizations than RDDs due to avoiding context switching between Java and Python. It provides links to resources that demonstrate how to create DataFrames, perform queries using DataFrame APIs and Spark SQL, and use an example flight data DataFrame.
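A minimal PySpark sketch along the lines the document describes, using an invented flights dataset rather than the one linked from the slides:

```python
# Minimal PySpark DataFrame sketch: build a DataFrame, query it with the
# DataFrame API, then with Spark SQL. The sample rows are invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

flights = spark.createDataFrame(
    [("SFO", "JFK", 337), ("SFO", "SEA", 95), ("JFK", "LAX", 402)],
    ["origin", "dest", "delay_minutes"],
)

# DataFrame API: filter and aggregate.
(flights.filter(flights.delay_minutes > 100)
        .groupBy("origin")
        .count()
        .show())

# The same question through Spark SQL.
flights.createOrReplaceTempView("flights")
spark.sql("""
    SELECT origin, COUNT(*) AS delayed
    FROM flights
    WHERE delay_minutes > 100
    GROUP BY origin
""").show()

spark.stop()
```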
PostgreSQL Replication High Availability Methods, by Mydbops
These slides illustrate the need for replication in PostgreSQL: why you need a replicated DB topology, key terminology, the replication node roles, and more.
A transaction is a logical unit of work in a database that ensures data integrity and consistency. Transactions are executed concurrently using concurrency control protocols such as locking and timestamp ordering to prevent problems like dirty reads, lost updates, and deadlocks. The locking protocol uses shared and exclusive locks to control concurrent access to data. The two-phase locking protocol divides a transaction's lock handling into a growing phase and a shrinking phase to guarantee serializability.
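As an illustrative sketch (not from the document), the following Python fragment mimics shared/exclusive locks under two-phase locking: all locks are acquired during the growing phase and released only at the end, in the shrinking phase:

```python
# Toy two-phase locking sketch: shared (S) and exclusive (X) locks, acquired
# during the growing phase and released only when the transaction finishes.
class LockManager:
    def __init__(self):
        self.locks = {}  # item -> (mode, {owning transaction ids})

    def acquire(self, txn, item, mode):
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, owners = held
        if mode == "S" and held_mode == "S":
            owners.add(txn)                 # shared locks are compatible
            return True
        if owners == {txn}:
            self.locks[item] = ("X" if mode == "X" else held_mode, owners)
            return True                     # lock upgrade by the sole owner
        return False                        # conflict: caller must wait or abort

    def release_all(self, txn):
        # Shrinking phase: drop every lock the transaction holds at once.
        for item in list(self.locks):
            mode, owners = self.locks[item]
            owners.discard(txn)
            if not owners:
                del self.locks[item]

lm = LockManager()
assert lm.acquire("T1", "row:42", "S")
assert lm.acquire("T2", "row:42", "S")      # both readers allowed
assert not lm.acquire("T2", "row:42", "X")  # writer blocked by T1's shared lock
lm.release_all("T1")
```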
Best Practice for Achieving High Availability in MariaDB, by MariaDB plc
This document discusses high availability and MariaDB replication. It defines high availability and outlines key components like data redundancy, failover solutions, and monitoring. It then describes MariaDB replication in detail, covering asynchronous and semi-synchronous replication as well as Galera cluster synchronous replication. MaxScale is introduced as a tool for load balancing, monitoring, and facilitating failovers in MariaDB replication topologies.
This document provides an introduction to data mining and big data. It defines data mining as the process of analyzing data from different perspectives to discover useful patterns and relationships. The document lists some common applications of data mining in industries like finance, insurance, and telecommunications. It also outlines the typical steps involved in data mining, including data integration, cleaning, transformation, and knowledge presentation. Big data is defined as extremely large data sets that are difficult to process using traditional tools. The rapid growth of data from sources like social media and mobile devices is driving the need for tools to handle big data's volume, velocity, and variety of data types.
Auto-Pilot for Apache Spark Using Machine Learning, by Databricks
At Qubole, users run Spark at scale in the cloud (900+ concurrent nodes). At that scale, tuning Spark configurations is essential for efficiently running SLA-critical jobs, but it continues to be a difficult undertaking, largely driven by trial and error. In this talk, we address the problem of auto-tuning SQL workloads on Spark; the same technique can also be adapted for non-SQL Spark workloads. In our earlier work [1], we proposed a model based on simple rules and insights. It was simple yet effective at optimizing queries and finding the right instance types to run them. However, with respect to auto-tuning Spark configurations, we saw room for improvement. On exploration, we found previous work that addresses auto-tuning with machine learning techniques. One major drawback of the simple model [1] is that it cannot use multiple runs of a query to improve its recommendations, whereas the major drawback of machine learning techniques is that they lack domain-specific knowledge. Hence, we decided to combine both: our auto-tuner interacts with both models to arrive at good configurations. Once a user selects a query to auto-tune, the next configuration is computed from the models and the query is run with it. Metrics from the run's event log are fed back to the models to obtain the next configuration, and the auto-tuner continues exploring configurations until it exhausts the fixed budget specified by the user. We found that in practice this method gives much better configurations than those chosen even by experts on real workloads, and converges quickly to a near-optimal configuration. In this talk, we present the novel ML model technique and the way it was combined with our earlier approach. Results on real workloads are presented, along with limitations and challenges in productionizing them. [1] Margoor et al., 'Automatic Tuning of SQL-on-Hadoop Engines', 2018, IEEE CLOUD
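A hedged sketch of the feedback loop the abstract describes; the model interface, metric names, and budget handling here are invented placeholders, not Qubole's implementation:

```python
# Illustrative auto-tuning loop: propose a config, run the query, feed the
# run's metrics back into the model, stop when the trial budget is exhausted.
# run_query() and SimpleTuner are stand-ins, not a real tuner.
import random

SEARCH_SPACE = {
    "spark.executor.memory": ["4g", "8g", "16g"],
    "spark.executor.cores": [2, 4, 8],
    "spark.sql.shuffle.partitions": [200, 400, 800],
}

def run_query(config):
    """Stand-in for launching the Spark job and parsing its event log."""
    return {"runtime_s": random.uniform(60, 600), "config": config}

class SimpleTuner:
    def __init__(self):
        self.history = []

    def propose(self):
        # Real systems combine rule-based insights with an ML model here;
        # random sampling keeps the sketch self-contained.
        return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

    def observe(self, metrics):
        self.history.append(metrics)

    def best(self):
        return min(self.history, key=lambda m: m["runtime_s"])

tuner = SimpleTuner()
for trial in range(10):              # fixed budget of 10 runs
    cfg = tuner.propose()
    tuner.observe(run_query(cfg))

print("best configuration found:", tuner.best()["config"])
```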
DB Time, Average Active Sessions, and ASH Math - Oracle performance fundamentals, by John Beresniewicz
RMOUG 2020 abstract:
This session will cover core concepts for Oracle performance analysis first introduced in Oracle 10g and forming the backbone of many features in the Diagnostic and Tuning packs. The presentation will cover the theoretical basis and meaning of these concepts, as well as illustrate how they are fundamental to many user-facing features in both the database itself and Enterprise Manager.
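For readers new to these concepts, the core relationship is commonly expressed as Average Active Sessions = DB Time / elapsed time; a tiny illustrative calculation with invented numbers:

```python
# Average Active Sessions (AAS) from two DB Time samples.
# DB Time is cumulative, so use the delta over the sampling interval.
db_time_start_s = 12_400.0   # seconds of DB Time at the start of the window
db_time_end_s   = 13_300.0   # seconds of DB Time at the end of the window
elapsed_s       = 300.0      # wall-clock length of the window (5 minutes)

aas = (db_time_end_s - db_time_start_s) / elapsed_s
print(f"Average Active Sessions over the window: {aas:.1f}")  # 3.0
```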
As cloud computing continues to gather speed, organizations with years’ worth of data stored on legacy on-premise technologies are facing issues with scale, speed, and complexity. Your customers and business partners are likely eager to get data from you, especially if you can make the process easy and secure.
Challenges with performance are not uncommon and ongoing interventions are required just to “keep the lights on”.
Discover how Snowflake empowers you to meet your analytics needs by unlocking the potential of your data.
Agenda of the webinar:
~Understand Snowflake and its Architecture
~Quickly load data into Snowflake
~Leverage the latest in Snowflake’s unlimited performance and scale to make the data ready for analytics
~Deliver secure and governed access to all data – no more silos
Setting up a GeoServer can sometimes be deceptively simple. However, going from proof of concept to production requires a number of steps to optimize the server in terms of availability, performance and scalability. The presentation shows how to get from a basic setup to a battle-ready, rock-solid installation by walking through the steps an advanced user has already mastered.
In-memory OLTP storage with persistence and transaction support, by Alexander Korotkov
Nowadays it is becoming evident that a single storage engine can't be "one size fits all". The PostgreSQL community has started moving towards pluggable storage. A significant restriction imposed by the current approach is compatibility: pluggable storages are expected to be compatible with (at least some) existing index access methods. That means we have a long way to go, because we have to extend our index AMs before we can add the corresponding features to the pluggable storages themselves.
In this talk we would like to look at the problem from another angle and see what we can achieve if we build a storage engine completely from scratch (using the FDW interface for prototyping). We will show a prototype of an in-memory OLTP storage engine with transaction support and snapshot isolation. Internally it is implemented as an index-organized table (B-tree) with an undo log and optional persistence, which makes it quite different from what we have in PostgreSQL now.
The benchmark-proven advantages of this in-memory storage are better multicore scalability (thanks to having no buffer manager), reduced bloat (thanks to the undo log) and optimized IO (thanks to logical WAL logging).
Performance analysis using PostgreSQL statistics, by Matheus de Oliveira
This talk demystifies PostgreSQL statistics, covering the entire statistics collection system, the tables currently available, and the concepts, techniques and example queries that are useful for PostgreSQL performance analysis and monitoring.
Using all of the high availability options in MariaDB, by MariaDB plc
MariaDB provides a number of high availability options, including replication with automatic failover and multi-master clustering. In this session Wagner Bianchi, Principal Remote DBA, provides a comprehensive overview of the high availability features in MariaDB, highlights their impact on consistency and performance, discusses advanced failover strategies and introduces new features such as causal reads and transparent connection failover.
1. RAID (Redundant Array of Independent Disks) is a data storage virtualization technology that combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both.
2. There are different RAID levels that provide redundancy through techniques like mirroring, parity, or a combination of both. The most common levels are RAID 0, 1, 5 and 10 but there are also less common levels like RAID 2-4 and 6.
3. The presenter discusses the advantages and disadvantages of various RAID levels for improving performance, reliability, and fault tolerance of disk storage systems. RAID can help address issues like increasing storage capacity
The document discusses big data and distributed computing. It explains that big data refers to large, unstructured datasets that are too large for traditional databases. Distributed computing uses multiple computers connected via a network to process large datasets in parallel. Hadoop is an open-source framework for distributed computing that uses MapReduce and HDFS for parallel processing and storage across clusters. HDFS stores data redundantly across nodes for fault tolerance.
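As a concrete illustration of the MapReduce model (not from the document), here is the classic word-count pair of mapper and reducer, written in the Hadoop Streaming style but run in-process so the sketch is self-contained:

```python
# Word count in the MapReduce style: the mapper emits (word, 1) pairs and the
# reducer sums the counts per word. Hadoop would shuffle/sort between the two.
from collections import defaultdict

def mapper(lines):
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

documents = [
    "big data needs distributed computing",
    "hadoop stores big data in hdfs",
]
print(reducer(mapper(documents)))
# {'big': 2, 'data': 2, 'needs': 1, ...}
```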
This document discusses NoSQL and the CAP theorem. It begins with an introduction of the presenter and an overview of topics to be covered: What is NoSQL and the CAP theorem. It then defines NoSQL, provides examples of major NoSQL categories (document, graph, key-value, and wide-column stores), and explains why NoSQL is used, including to handle large, dynamic, and distributed data. The document also explains the CAP theorem, which states that a distributed data store can only satisfy two of three properties: consistency, availability, and partition tolerance. It provides examples of how to choose availability over consistency or vice versa. Finally, it concludes that both SQL and NoSQL have valid use cases and a combination
A look at what HA is and what PostgreSQL has to offer for building an open source HA solution. Covers various aspects in terms of Recovery Point Objective and Recovery Time Objective. Includes backup and restore, PITR (point in time recovery) and streaming replication concepts.
- MariaDB Corporation was founded by original developers of MySQL and provides commercial support for MariaDB and MySQL. It has over 400 enterprise customers globally.
- MariaDB is an enhanced, drop-in replacement for MySQL that is open source and offers additional features like improved performance, security, and scalability. It has been adopted by several Linux distributions as the default database.
- MariaDB offers several advantages over MySQL for applications like Drupal, including its XtraDB storage engine, SphinxSE search engine, thread pool feature for handling many concurrent queries efficiently, and Galera Cluster for high availability.
TokuDB is an ACID/transactional storage engine that makes MySQL even better by increasing performance, adding high compression, and allowing for true schema agility. All of these features are made possible by Tokutek's Fractal Tree indexes.
A short introduction showing the interesting new options in MariaDB 10 and the other forks of MySQL.
Slides on MariaDB/MySQL for experienced system administrators.
These slides include a refresher on locks and classic MariaDB/MySQL use cases.
They conclude with a section on performance and high availability, covering replication and Galera Cluster.
Building a High Performance Analytics Platform, by Santanu Dey
The document discusses using flash memory to build a high performance data platform. It notes that flash memory is faster than disk storage and cheaper than RAM. The platform utilizes NVMe flash drives connected via PCIe for high speed performance. This allows it to provide in-memory database speeds at the cost and density of solid state drives. It can scale independently by adding compute nodes or storage nodes. The platform offers a unified database for both real-time and analytical workloads through common APIs.
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data, by Hakka Labs
This document discusses Apache Kudu, an open source columnar storage system for analytics workloads on Hadoop. Kudu is designed to enable both fast analytics queries as well as real-time updates on fast changing data. It aims to fill gaps in the current Hadoop storage landscape by supporting simultaneous high throughput scans, low latency reads/writes, and ACID transactions. An example use case described is for real-time fraud detection on streaming financial data.
AquaQ Analytics Kx Event - Data Direct Networks Presentation, by AquaQ Analytics
This document discusses using DDN's parallel file systems to improve the performance of kdb+ analytics queries on large datasets. Running kdb+ on a parallel file system can significantly reduce query latency by distributing data and queries across multiple file system servers. This allows queries to achieve near linear speedups as more servers are added. The shared namespace also allows multiple independent kdb+ instances to access the same consolidated datasets.
Accelerating HBase with NVMe and Bucket Cache, by Nicolas Poggi
The Non-Volatile Memory Express (NVMe) standard promises an order of magnitude faster storage than regular SSDs, while at the same time being more economical than regular RAM in TB/$. This talk evaluates the use cases and benefits of NVMe drives for use in Big Data clusters with HBase and Hadoop HDFS.
First, we benchmark the different drives using system level tools (FIO) to get maximum expected values for each different device type and set expectations. Second, we explore the different options and use cases of HBase storage and benchmark the different setups. And finally, we evaluate the speedups obtained by the NVMe technology for the different Big Data use cases from the YCSB benchmark.
In summary, while the NVMe drives show up to 8x speedup in best case scenarios, testing the cost-efficiency of new device technologies is not straightforward in Big Data, where we need to overcome system level caching to measure the maximum benefits.
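A hedged sketch of the kind of device-level baseline described in the first step, driving fio from Python; the target device path and parameter values are placeholders to adapt to your hardware:

```python
# Run a 4k random-read baseline with fio and report IOPS.
# The device path below is a placeholder -- point it at a test device or file,
# never at a disk holding data you care about.
import json
import subprocess

FIO_CMD = [
    "fio",
    "--name=nvme-baseline",
    "--filename=/dev/nvme0n1",   # placeholder target
    "--rw=randread",
    "--bs=4k",
    "--iodepth=32",
    "--ioengine=libaio",
    "--direct=1",
    "--runtime=60",
    "--time_based",
    "--output-format=json",
]

result = subprocess.run(FIO_CMD, capture_output=True, text=True, check=True)
report = json.loads(result.stdout)
read_iops = report["jobs"][0]["read"]["iops"]
print(f"4k random read: {read_iops:,.0f} IOPS")
```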
HPC DAY 2017 | HPE Storage and Data Management for Big Data, by HPC DAY
HPC DAY 2017 - http://www.hpcday.eu/
HPE Storage and Data Management for Big Data
Volodymyr Saviak | CEE HPC & POD Sales Manager at HPE
Accelerating HBase with NVMe and Bucket Cache, by David Grier
This set of slides describes initial experiments we designed to discover performance improvements for Hadoop technologies using NVMe storage.
Pilot Hadoop Towards 2500 Nodes and Cluster Redundancy, by Stuart Pook
Hadoop has become a critical part of Criteo's operations. What started out as a proof of concept has turned into two in-house bare-metal clusters of over 2200 nodes. Hadoop contains the data required for billing and, perhaps even more importantly, the data used to create the machine learning models, computed every 6 hours by Hadoop, that participate in real time bidding for online advertising.
Two clusters do not necessarily mean a redundant system, so Criteo must plan for any of the disasters that can destroy a cluster.
This talk describes how Criteo built its second cluster in a new datacenter and how to do it better next time. It explains how a small team is able to run and expand these clusters. More importantly, the talk describes how a redundant data and compute solution at this scale must function, what Criteo has already done to build it, and what remains to be done.
QCT Ceph Solution - Design Consideration and Reference Architecture, by Patrick McGarry
This document discusses QCT's Ceph storage solutions, including an overview of Ceph architecture, QCT hardware platforms, Red Hat Ceph software, workload considerations, reference architectures, test results and a QCT/Red Hat whitepaper. It provides technical details on QCT's throughput-optimized and capacity-optimized solutions and shows how they address different storage needs through workload-driven design. Hands-on testing and a test drive lab are offered to explore Ceph features and configurations.
QCT Ceph Solution - Design Consideration and Reference Architecture, by Ceph Community
This document discusses QCT's Ceph storage solutions, including an overview of Ceph architecture, QCT hardware platforms, Red Hat Ceph software, workload considerations, benchmark testing results, and a collaboration between QCT, Red Hat, and Intel to provide optimized and validated Ceph solutions. Key reference architectures are presented targeting small, medium, and large storage capacities with options for throughput, capacity, or IOPS optimization.
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes, by Arnon Shimoni
This talk will present SQream’s journey to building an analytics data warehouse powered by GPUs. SQream DB is an SQL data warehouse designed for larger than main-memory datasets (up to petabytes). It’s an on-disk database that combines novel ideas and algorithms to rapidly analyze trillions of rows with the help of high-throughput GPUs. We will explore some of SQream’s ideas and approaches to developing its analytics database – from simple prototype and tech demos, to a fully functional data warehouse product containing the most important features for enterprise deployment. We will also describe the challenges of working with exotic hardware like GPUs, and what choices had to be made in order to combine the CPU and GPU capabilities to achieve industry-leading performance – complete with real world use case comparisons.
As part of this discussion, we will also share some of the real issues that were discovered, and the engineering decisions that led to the creation of SQream DB’s high-speed columnar storage engine, designed specifically to take advantage of streaming architectures like GPUs.
Logging at OVHcloud:
Logs Data Platform is OVHcloud's platform for centralised log collection, analysis and management. The platform was built to meet the challenge of indexing more than 4,000 billion logs for a company like OVHcloud. This presentation describes the overall architecture of Logs Data Platform around its central components, Elasticsearch and Graylog, and covers the scalability, availability, performance and evolvability issues that the Observability team at OVHcloud deals with day to day.
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C... by Red_Hat_Storage
Cisco uses Ceph for storage in its OpenStack cloud platform. The initial Ceph cluster design used HDDs which caused stability issues as the cluster grew to petabytes in size. Improvements included throttling client IO, upgrading Ceph versions, moving MON metadata to SSDs, and retrofitting journals to NVMe SSDs. These steps stabilized performance and reduced recovery times. Lessons included having clear stability goals and automating testing to prevent technical debt from shortcuts.
Ceph Community Talk on High-Performance Solid State Ceph, by Ceph Community
The document summarizes a presentation given by representatives from various companies on optimizing Ceph for high-performance solid state drives. It discusses testing a real workload on a Ceph cluster with 50 SSD nodes that achieved over 280,000 read and write IOPS. Areas for further optimization were identified, such as reducing latency spikes and improving single-threaded performance. Various companies then described their contributions to Ceph performance, such as Intel providing hardware for testing and Samsung discussing SSD interface improvements.
HAMR, HDMR, DuraWrite, SHIELD and RAISE are technologies with which Seagate addresses the constantly growing volume of data in enterprises. In this webinar you will learn directly from the manufacturer what lies behind them.
This document discusses scalable storage configuration for physics database services. It outlines challenges with storage configuration, best practices like using all available disks and striping data, and Oracle's ASM solution. The document presents benchmark data measuring performance of different storage configurations and recommendations for sizing new projects based on stress testing and benchmark data.
Big data refers to large, complex datasets that are difficult to process using traditional methods. This document discusses three examples of real-world big data challenges and their solutions. The challenges included storage, analysis, and processing capabilities given hardware and time constraints. Solutions involved switching databases, using Hadoop/MapReduce, and representing complex data structures to enable analysis of terabytes of ad serving data. Flexibility and understanding domain needs were key to feasible versus theoretical solutions.
Red Hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware, by Red_Hat_Storage
This document discusses how data growth driven by mobile, social media, IoT, and big data/cloud is requiring a fundamental shift in storage cost structures from scale-up to scale-out architectures. It provides an overview of key storage technologies and workloads driving public cloud storage, and how Ceph can help deliver on the promise of the cloud by providing next generation storage architectures with flash to enable new capabilities in small footprints. It also illustrates the wide performance range Ceph can provide for different workloads and hardware configurations.
Scalable and Highly Available Distributed File System Metadata Service Using gR... by Alluxio, Inc.
Alluxio Community Office Hour
Apr 7, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker: Bin Fan
Alluxio (alluxio.io) is an open-source data orchestration system that provides a single namespace federating multiple external distributed storage systems. It is critical for Alluxio to be able to store and serve the metadata of all files and directories from all mounted external storage both at scale and at speed.
This talk shares our design, implementation, and optimization of the Alluxio metadata service (the master node) to address these scalability challenges. In particular, we focus on how to apply and combine techniques including tiered metadata storage (based on the off-heap KV store RocksDB), a fine-grained file system inode tree locking scheme, an embedded replicated state machine (based on Raft), and the exploration and performance tuning of RPC frameworks (Thrift vs. gRPC). As a result of combining the above techniques, Alluxio 2.0 is able to store at least 1 billion files with a significantly reduced memory requirement, serving 3000 workers and 30000 clients concurrently.
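To make the tiered-metadata idea concrete, here is a small illustrative sketch (not Alluxio's code) of a two-tier key-value store: a bounded in-memory cache backed by an on-disk store, using only the Python standard library:

```python
# Toy two-tier metadata store: hot entries live in a bounded in-memory dict,
# and everything is also persisted to an on-disk dbm file. (Alluxio's RocksDB
# tier is far more sophisticated -- this only shows the layering idea.)
import dbm
from collections import OrderedDict

class TieredStore:
    def __init__(self, path, hot_capacity=1000):
        self.disk = dbm.open(path, "c")
        self.hot = OrderedDict()
        self.hot_capacity = hot_capacity

    def put(self, key, value):
        self.disk[key] = value            # durable tier
        self.hot[key] = value             # cache tier
        self.hot.move_to_end(key)
        if len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.disk[key]            # fall back to the on-disk tier
        self.hot[key] = value
        return value

store = TieredStore("/tmp/inode-metadata")
store.put("/data/file1", b'{"owner": "alice", "size": 1024}')
print(store.get("/data/file1"))
```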
In this Office Hour, we will go over:
- Metadata storage challenges
- How to combine different open source technologies as building blocks
- The design, implementation, and optimization of Alluxio metadata service
Webinar: MariaDB Enterprise and MariaDB Enterprise Cluster, by MariaDB Corporation
This document provides information about MariaDB Enterprise and MariaDB Enterprise Cluster from Ralf Gebhardt, including:
- An agenda covering MariaDB, MariaDB Enterprise, MariaDB Enterprise Cluster, services, and more info.
- Background on MariaDB, the MariaDB Foundation, MariaDB.com, and SkySQL.
- A timeline of MariaDB releases from 5.1 to the current 10.0 and Galera Cluster 10.
- An overview of key features and optimizations in MariaDB 10 like multi-source replication and improved query optimization.
- Mention of Fusion-IO page compression providing a 30% performance increase with atomic writes.
Scalability with MariaDB and MaxScale - MariaDB Roadshow Summer 2014 Hambur... by MariaDB Corporation
Scalability with MariaDB and MaxScale
Presented by Ralf Gebhardt at the MariaDB Roadshow Germany: 4.7.2014 in Hamburg, 8.7.2014 in Berlin and 11.7.2014 in Frankfurt.
High Availability with MariaDB Enterprise
Presented by Ralf Gebhardt at the MariaDB Roadshow Germany: 4.7.2014 in Hamburg, 8.7.2014 in Berlin and 11.7.2014 in Frankfurt.
Automation & Management of Database Clusters with Severalnines - Mari... by MariaDB Corporation
Automation & Management of Database Clusters with Severalnines
Presented by Jean-Jérôme Schmidt 8.7.2014 at the MariaDB Roadshow in Berlin, Germany.
Automation and Management of Database Clusters, MariaDB Roadshow, by MariaDB Corporation
Automation and Management of Database Clusters, by Jean-Jérôme Schmidt, Severalnines
Presented 26.6.2014 at the MariaDB Roadshow in Paris, France.
Automation and Management of Database Clusters MariaDB Roadshow 2014, by MariaDB Corporation
The document discusses automation and management of database clusters using ClusterControl software from Severalnines. It covers the database infrastructure lifecycle including deploying, monitoring, managing, and scaling database clusters. ClusterControl provides automation and management capabilities for provisioning, monitoring systems and database performance, managing multiple database clusters across data centers, and automating tasks like repair, recovery, upgrades, backups and scaling. The presentation includes a demo and discusses Severalnines customers.
MariaDB 10 and Beyond - the Future of Open Source Databases by Ivan Zoratti.
Presented 24.6.2014 at the MariaDB Roadshow in Maarssen, Utrecht, The Netherlands.
MaxScale is an open-source, highly scalable, and transparent load balancing solution for MySQL and MariaDB databases. It acts as a proxy between applications and databases, authenticating clients, routing queries, and monitoring database nodes. MaxScale supports features like read/write splitting, connection load balancing, and filtering of queries through extensible plugin modules. Typical use cases include balancing read loads across database replicas and distributing connections among nodes in a Galera cluster.
Codership provides high availability, no-data-loss, and scalable data replication and clustering solutions for open source databases to securely store customers' valuable data. They do this through solutions like Galera replication which allows for synchronous multi-master replication across MariaDB and MySQL database clusters.
Secure Test Infrastructure: The Backbone of Trustworthy Software Development, by Shubham Joshi
A secure test infrastructure ensures that the testing process doesn’t become a gateway for vulnerabilities. By protecting test environments, data, and access points, organizations can confidently develop and deploy software without compromising user privacy or system integrity.
How can one start with crypto wallet development.pptx, by laravinson24
This presentation is a beginner-friendly guide to developing a crypto wallet from scratch. It covers essential concepts such as wallet types, blockchain integration, key management, and security best practices. Ideal for developers and tech enthusiasts looking to enter the world of Web3 and decentralized finance.
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily? by steaveroggers
Migrating from Lotus Notes to Outlook can be a complex and time-consuming task, especially when dealing with large volumes of NSF emails. This presentation provides a complete guide on how to batch export Lotus Notes NSF emails to Outlook PST format quickly and securely. It highlights the challenges of manual methods, the benefits of using an automated tool, and introduces eSoftTools NSF to PST Converter Software — a reliable solution designed to handle bulk email migrations efficiently. Learn about the software’s key features, step-by-step export process, system requirements, and how it ensures 100% data accuracy and folder structure preservation during migration. Make your email transition smoother, safer, and faster with the right approach.
Read More:- https://www.esofttools.com/nsf-to-pst-converter.html
AgentExchange is Salesforce’s latest innovation, expanding upon the foundation of AppExchange by offering a centralized marketplace for AI-powered digital labor. Designed for Agentblazers, developers, and Salesforce admins, this platform enables the rapid development and deployment of AI agents across industries.
Email: [email protected]
Phone: +1(630) 349 2411
Website: https://www.fexle.com/blogs/agentexchange-an-ultimate-guide-for-salesforce-consultants-businesses/?utm_source=slideshare&utm_medium=pptNg
Who Watches the Watchmen (SciFiDevCon 2025), by Allon Mureinik
Tests, especially unit tests, are the developers’ superheroes. They allow us to mess around with our code and keep us safe.
We often trust them with the safety of our codebase, but how do we know that we should? How do we know that this trust is well-deserved?
Enter mutation testing – by intentionally injecting harmful mutations into our code and seeing if they are caught by the tests, we can evaluate the quality of the safety net they provide. By watching the watchmen, we can make sure our tests really protect us, and we aren’t just green-washing our IDEs to a false sense of security.
Talk from SciFiDevCon 2025
https://www.scifidevcon.com/courses/2025-scifidevcon/contents/680efa43ae4f5
What Do Contribution Guidelines Say About Software Testing? (MSR 2025), by Andre Hora
Software testing plays a crucial role in the contribution process of open-source projects. For example, contributions introducing new features are expected to include tests, and contributions with tests are more likely to be accepted. Although most real-world projects require contributors to write tests, the specific testing practices communicated to contributors remain unclear. In this paper, we present an empirical study to understand better how software testing is approached in contribution guidelines. We analyze the guidelines of 200 Python and JavaScript open-source software projects. We find that 78% of the projects include some form of test documentation for contributors. Test documentation is located in multiple sources, including CONTRIBUTING files (58%), external documentation (24%), and README files (8%). Furthermore, test documentation commonly explains how to run tests (83.5%), but less often provides guidance on how to write tests (37%). It frequently covers unit tests (71%), but rarely addresses integration (20.5%) and end-to-end tests (15.5%). Other key testing aspects are also less frequently discussed: test coverage (25.5%) and mocking (9.5%). We conclude by discussing implications and future research.
Join Ajay Sarpal and Miray Vu to learn about key Marketo Engage enhancements. Discover improved in-app Salesforce CRM connector statistics for easy monitoring of sync health and throughput. Explore new Salesforce CRM Synch Dashboards providing up-to-date insights into weekly activity usage, thresholds, and limits with drill-down capabilities. Learn about proactive notifications for both Salesforce CRM sync and product usage overages. Get an update on improved Salesforce CRM synch scale and reliability coming in Q2 2025.
Key Takeaways:
Improved Salesforce CRM User Experience: Learn how self-service visibility enhances satisfaction.
Utilize Salesforce CRM Synch Dashboards: Explore real-time weekly activity data.
Monitor Performance Against Limits: See threshold limits for each product level.
Get Usage Over-Limit Alerts: Receive notifications for exceeding thresholds.
Learn About Improved Salesforce CRM Scale: Understand upcoming cloud-based incremental sync.
This presentation explores code comprehension challenges in scientific programming based on a survey of 57 research scientists. It reveals that 57.9% of scientists have no formal training in writing readable code. Key findings highlight a "documentation paradox" where documentation is both the most common readability practice and the biggest challenge scientists face. The study identifies critical issues with naming conventions and code organization, noting that 100% of scientists agree readable code is essential for reproducible research. The research concludes with four key recommendations: expanding programming education for scientists, conducting targeted research on scientific code quality, developing specialized tools, and establishing clearer documentation guidelines for scientific software.
Presented at: The 33rd International Conference on Program Comprehension (ICPC '25)
Date of Conference: April 2025
Conference Location: Ottawa, Ontario, Canada
Preprint: https://arxiv.org/abs/2501.10037
Not So Common Memory Leaks in Java Webinar, by Tier1 app
This SlideShare presentation is from our May webinar, “Not So Common Memory Leaks & How to Fix Them?”, where we explored lesser-known memory leak patterns in Java applications. Unlike typical leaks, subtle issues such as thread local misuse, inner class references, uncached collections, and misbehaving frameworks often go undetected and gradually degrade performance. This deck provides in-depth insights into identifying these hidden leaks using advanced heap analysis and profiling techniques, along with real-world case studies and practical solutions. Ideal for developers and performance engineers aiming to deepen their understanding of Java memory management and improve application stability.
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi... by Egor Kaleynik
This case study explores how we partnered with a mid-sized U.S. healthcare SaaS provider to help them scale from a successful pilot phase to supporting over 10,000 users—while meeting strict HIPAA compliance requirements.
Faced with slow, manual testing cycles, frequent regression bugs, and looming audit risks, their growth was at risk. Their existing QA processes couldn’t keep up with the complexity of real-time biometric data handling, and earlier automation attempts had failed due to unreliable tools and fragmented workflows.
We stepped in to deliver a full QA and DevOps transformation. Our team replaced their fragile legacy tests with Testim’s self-healing automation, integrated Postman and OWASP ZAP into Jenkins pipelines for continuous API and security validation, and leveraged AWS Device Farm for real-device, region-specific compliance testing. Custom deployment scripts gave them control over rollouts without relying on heavy CI/CD infrastructure.
The result? Test cycle times were reduced from 3 days to just 8 hours, regression bugs dropped by 40%, and they passed their first HIPAA audit without issue—unlocking faster contract signings and enabling them to expand confidently. More than just a technical upgrade, this project embedded compliance into every phase of development, proving that SaaS providers in regulated industries can scale fast and stay secure.
Explaining GitHub Actions Failures with Large Language Models Challenges, In... by ssuserb14185
GitHub Actions (GA) has become the de facto tool that developers use to automate software workflows, seamlessly building, testing, and deploying code. Yet when GA fails, it disrupts development, causing delays and driving up costs. Diagnosing failures becomes especially challenging because error logs are often long, complex and unstructured. Given these difficulties, this study explores the potential of large language models (LLMs) to generate correct, clear, concise, and actionable contextual descriptions (or summaries) for GA failures, focusing on developers’ perceptions of their feasibility and usefulness. Our results show that over 80% of developers rated LLM explanations positively in terms of correctness for simpler/small logs. Overall, our findings suggest that LLMs can feasibly assist developers in understanding common GA errors, thus, potentially reducing manual analysis. However, we also found that improved reasoning abilities are needed to support more complex CI/CD scenarios. For instance, less experienced developers tend to be more positive on the described context, while seasoned developers prefer concise summaries. Overall, our work offers key insights for researchers enhancing LLM reasoning, particularly in adapting explanations to user expertise.
https://arxiv.org/abs/2501.16495
Adobe Lightroom Classic Crack FREE Latest link 2025kashifyounis067
🌍📱👉COPY LINK & PASTE ON GOOGLE http://drfiles.net/ 👈🌍
Adobe Lightroom Classic is a desktop-based software application for editing and managing digital photos. It focuses on providing users with a powerful and comprehensive set of tools for organizing, editing, and processing their images on their computer. Unlike the newer Lightroom, which is cloud-based, Lightroom Classic stores photos locally on your computer and offers a more traditional workflow for professional photographers.
Here's a more detailed breakdown:
Key Features and Functions:
Organization:
Lightroom Classic provides robust tools for organizing your photos, including creating collections, using keywords, flags, and color labels.
Editing:
It offers a wide range of editing tools for making adjustments to color, tone, and more.
Processing:
Lightroom Classic can process RAW files, allowing for significant adjustments and fine-tuning of images.
Desktop-Focused:
The application is designed to be used on a computer, with the original photos stored locally on the hard drive.
Non-Destructive Editing:
Edits are applied to the original photos in a non-destructive way, meaning the original files remain untouched.
Key Differences from Lightroom (Cloud-Based):
Storage Location:
Lightroom Classic stores photos locally on your computer, while Lightroom stores them in the cloud.
Workflow:
Lightroom Classic is designed for a desktop workflow, while Lightroom is designed for a cloud-based workflow.
Connectivity:
Lightroom Classic can be used offline, while Lightroom requires an internet connection to sync and access photos.
Organization:
Lightroom Classic offers more advanced organization features like Collections and Keywords.
Who is it for?
Professional Photographers:
PCMag notes that Lightroom Classic is a popular choice among professional photographers who need the flexibility and control of a desktop-based application.
Users with Large Collections:
Those with extensive photo collections may prefer Lightroom Classic's local storage and robust organization features.
Users who prefer a traditional workflow:
Users who prefer a more traditional desktop workflow, with their original photos stored on their computer, will find Lightroom Classic a good fit.
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...Andre Hora
Unittest and pytest are the most popular testing frameworks in Python. Overall, pytest provides some advantages, including simpler assertion, reuse of fixtures, and interoperability. Due to such benefits, multiple projects in the Python ecosystem have migrated from unittest to pytest. To facilitate the migration, pytest can also run unittest tests, thus, the migration can happen gradually over time. However, the migration can be time-consuming and take a long time to conclude. In this context, projects would benefit from automated solutions to support the migration process. In this paper, we propose TestMigrationsInPy, a dataset of test migrations from unittest to pytest. TestMigrationsInPy contains 923 real-world migrations performed by developers. Future research proposing novel solutions to migrate frameworks in Python can rely on TestMigrationsInPy as a ground truth. Moreover, as TestMigrationsInPy includes information about the migration type (e.g., changes in assertions or fixtures), our dataset enables novel solutions to be verified effectively, for instance, from simpler assertion migrations to more complex fixture migrations. TestMigrationsInPy is publicly available at: https://github.com/altinoalvesjunior/TestMigrationsInPy.
2. Galaxy confidential
Galaxy Big Data scalability Menu
• About Galaxy Semiconductor (BG)
• The big data challenge (BG)
• Scalable, fail-safe architecture for big data (BG)
• MariaDB challenges: compression (SV)
• MariaDB challenges: sharding (SC)
• Results (BG)
• Next Steps (BG)
• Q&A
2
3. Galaxy confidential
About Galaxy Semiconductor
• A software company dedicated to the semiconductor industry:
Quality improvement
Yield enhancement
NPI acceleration
Test cell OEE optimization
• Founded in 1988
• Track record of building products that offer the best user experience + premier customer support
• Products used by 3500+ users and all major ATE companies
3
4. Galaxy confidential
4
Worldwide Presence
• Galaxy Teo, Ireland: HQ, G&A
• Galaxy East: Sales, Marketing, Apps
• Galaxy France: R&D, QA & Apps
• Partner, Taiwan: Sales & Apps
• Partner, Israel: Sales
• Partner, Singapore: Sales & Apps
• Galaxy West: Sales, Apps
• Partner, Japan: Sales & Apps
• Partner, China: Sales & Apps
5. Galaxy confidential
Test Data production / consumption
5
[Data-flow diagram] ATE test data files are ingested through ETL and data cleansing (Yield-Man) into the Galaxy TDR; a further ETL stage builds the data cube(s). Consumers include Examinator-Pro, browser-based dashboards and custom agents, plus automated agents for data mining, OEE alarms, PAT and SYA.
7. Galaxy confidential
Big Data, Big Problem
• More data can produce more knowledge and higher profits
• Modern systems make it easy to generate more data
• The problem is how to create a hardware and software platform that can make full and effective use of all this data as it continues to grow
• Galaxy has the expertise to guide you to a solution for this big data problem that includes:
– Real-time data streams
– High data insertion rates
– Scalable database to extreme data volumes
– Automatic compensation for server failures
– Use of inexpensive, commodity servers
– Load balancing
7
8. Galaxy confidential
First-level solutions
• Partitioning
– SUMMARY data
• High level reports
• 10% of the volume
• Must be persistent for a long period (years)
– RAW data
• Detailed data inspection
• 90% of the volume
• Must be persistent for a short period (months)
• PURGE (see the SQL sketch below)
– Partitioning per date (e.g. daily) on RAW data tables
– Instant purge by dropping partitions
• Parallel insertion (multiple Yield-Man instances in the diagram)
8
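To make the PURGE idea concrete, here is a minimal sketch of daily RANGE partitioning on a RAW table and the instant purge it enables. The table name and columns are hypothetical; the deck does not show Galaxy's actual schema.

```sql
-- Hypothetical RAW results table, partitioned per day so old data can be
-- purged instantly by dropping whole partitions instead of running DELETEs.
CREATE TABLE raw_ptest_results (
  lot_id     VARCHAR(32) NOT NULL,
  test_time  DATETIME    NOT NULL,
  test_num   INT         NOT NULL,
  test_value DOUBLE,
  PRIMARY KEY (lot_id, test_time, test_num)
)
PARTITION BY RANGE (TO_DAYS(test_time)) (
  PARTITION p20140616 VALUES LESS THAN (TO_DAYS('2014-06-17')),
  PARTITION p20140617 VALUES LESS THAN (TO_DAYS('2014-06-18')),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- Instant purge: dropping a partition is a metadata operation and frees an
-- entire day at once, which keeps the 90% RAW volume within its retention.
ALTER TABLE raw_ptest_results DROP PARTITION p20140616;
```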
9. Galaxy confidential
New customer use case
9
• Solution needs to be easy to set up
• Solution needs to handle large (~50TB+) data
• Need to handle large insertion speed of approximately 2 MB/sec
Solutions
• Solution 1: Single scale-up node (lots of RAM, lots of CPU, expensive high-speed SSD storage, single point of failure, not scalable, heavy for replication)
• Solution 2: Cluster of commodity nodes (see later)
10. Galaxy confidential
Cluster of Nodes
[Architecture diagram] Test-floor data sources (STDF data files, other test data files, event data stream, ATE config & maintenance events, real-time tester status, MES, test hardware management system) and other customer applications and systems feed a test data stream and a real-time interface into the Galaxy cluster of commodity servers through RESTful APIs. The cluster consists of a head node, a dashboard node, DB nodes, and compute nodes running Yield-Man and PAT-Man.
10
11. Galaxy confidential
Easy Scalability
[Architecture diagram] Same topology as the previous slide: capacity is scaled out simply by adding DB nodes and compute nodes to the cluster, while the test-floor data sources, RESTful APIs and real-time interface remain unchanged.
11
12. Galaxy confidential
MariaDB challenges
12
❏ From a single box to elastic architecture
❏ Reducing the TCO
❏ OEM solution
❏ Minimizing the impact on existing code
❏ Reach 200B records
13. Galaxy confidential
A classic case
13
[Diagram] Many sensors stream into a single store, which is then hit by many concurrent queries.
❏ Millions of records/s, arriving in timeline order
❏ Data is queried in a different order
❏ Indexes don’t fit into main memory
❏ Disk IOps become the bottleneck
14. Galaxy confidential
B-tree gotcha
14
With 2 ms disk or network latency (about 100 head seeks/s), there are two options:
❏ Increase concurrency
❏ Increase the packet size
Both were increased long ago using innodb_write_io_threads, innodb_io_capacity and bulk loading, as sketched below.
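As an illustration only (the slide does not give actual values), those knobs translate into server settings plus a bulk-load pattern along these lines; the table name and file path are hypothetical.

```sql
-- Server-side settings referenced on the slide, shown as comments because
-- innodb_write_io_threads is not dynamic and belongs in my.cnf
-- (the numbers are illustrative, not Galaxy's actual values):
--   innodb_write_io_threads = 16
--   innodb_io_capacity      = 2000

-- Bulk load: fewer, larger "packets" (one LOAD DATA or multi-row INSERT per
-- batch) instead of many single-row INSERTs, with per-session checks relaxed.
SET SESSION unique_checks = 0;
SET SESSION foreign_key_checks = 0;

LOAD DATA LOCAL INFILE '/tmp/stdf_batch.csv'
  INTO TABLE raw_ptest_results
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';

SET SESSION unique_checks = 1;
SET SESSION foreign_key_checks = 1;
```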
15. Galaxy confidential
B-tree gotcha
15
With a billion records, a single-partition B-tree no longer stays in main memory, so a single write produces read IOps just to traverse the tree. Mitigations:
❏ Use partitioning
❏ Insert in primary-key order
❏ Big redo log and a smaller amount of dirty pages
❏ Covering indexes (see the sketch below)
The next step is to radically change the IO pattern.
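The covering index is the one item above that is easy to show in isolation. The sketch reuses the hypothetical raw_ptest_results table from earlier: because the index contains every column the query touches, the query is answered from the index alone and avoids random reads into the base rows.

```sql
-- Hypothetical covering index for a "results per lot over time" query.
ALTER TABLE raw_ptest_results
  ADD INDEX idx_lot_time_value (lot_id, test_time, test_value);

-- EXPLAIN should report "Using index" in the Extra column, i.e. the
-- clustered rows never need to be read.
EXPLAIN
SELECT test_time, test_value
FROM raw_ptest_results
WHERE lot_id = 'LOT42'
ORDER BY test_time;
```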
17. Galaxy confidential
Data Structure for big data
• Indexes maintenance, T-tree (NDB, in-memory): average compression N/A; IO size N/A; read disk access N/A; write disk access N/A
• Indexes maintenance, B-tree (InnoDB, MyISAM, ZFS): average compression 1/2; IO size 4K to 64K; read O(log(N)/log(B)); write O(log(N)/log(B))
• Indexes maintenance, fractal tree (TokuDB): average compression 1/6; IO size variable, based on compression & depth; read ~O(log(N)/log(B)); write ~O(log(N)/B)
• No indexes (LevelDB, Cassandra, HBase): average compression 1/3; IO size 64M; read O(N/B); write O(1/B)
• Column store (InfiniDB): average compression 1/12; IO size 8M to 64M; read O(N/B) with block elimination; write O(1/B)
17
18. Galaxy confidential
Top-10 Alexa petabyte stores run on InnoDB
18
Top Alexa sites on InnoDB:
❏ DBAs to tune the insert buffer + dirty pages
❏ Admins to monitor IO
❏ Admins to increase the number of nodes
❏ Use of flash & hybrid storage
❏ DBAs to partition and shard
❏ DBAs to organize maintenance
❏ DBAs to set covering and clustering indexes
❏ Zipf read distribution
Galaxy on TokuDB:
❏ Concurrent by design
❏ Removes fragmentation
❏ Constant insert rate regardless of the memory/disk ratio
❏ High compression rate
❏ No control over the client architecture
❏ All indexes can be clustered
20. Galaxy confidential
20
TokuDB insert time is about 2 times slower than uncompressed InnoDB,
but about 2.5 times faster than compressed InnoDB.
Key point for 200 billion records.
21. Galaxy confidential
21
Galaxy takeaways for 200 billion records
❏ Disk IOps on InnoDB were the bottleneck, despite partitioning
❏ Moving to TokuDB shifted the bottleneck to the CPU (compression)
❏ So how to increase performance further? Sharding!!
22. Galaxy confidential
22
Sharding to fix the CPU bottleneck
• Indexes maintenance, T-tree (NDB): clustering native; number of nodes +++++
• Indexes maintenance, B-tree (InnoDB, MyISAM, ZFS): clustering manual, or via Spider, Vitess, Fabric, Shardquery; number of nodes +++
• Indexes maintenance, fractal tree (TokuDB): clustering manual, or via Spider, Vitess, Fabric, Shardquery; number of nodes ++
• No indexes (LevelDB, Cassandra, HBase): clustering native; number of nodes +++++
• Column store (InfiniDB, Vertica): clustering native; number of nodes +
22
26. Galaxy confidential
Implemented architecture
26
[Architecture diagram] The head node and compute nodes #1 and #2 run Spider: they hold no data and handle monitoring. Data nodes #1 to #4 run TokuDB with compressed, partitioned data. SUMMARY tables are universal tables; RAW tables are sharded across the data nodes, each node holding 1/4 or 1/2 of the data depending on the shard layout. Re-sharding is done by delaying current insertion and replaying it with a new shard key. A sketch of the Spider/TokuDB table definitions follows.
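To make the diagram concrete, here is a minimal sketch of the two table flavours, under stated assumptions: hypothetical backend servers backend1..backend4 for the four data nodes, a simplified schema, and Spider's documented COMMENT syntax. It is not Galaxy's actual DDL.

```sql
-- On each DATA NODE: the physical table, TokuDB-compressed and partitioned
-- by date (row format and columns are illustrative).
CREATE TABLE raw_ptest_results (
  lot_id     VARCHAR(32) NOT NULL,
  test_time  DATETIME    NOT NULL,
  test_num   INT         NOT NULL,
  test_value DOUBLE,
  PRIMARY KEY (lot_id, test_time, test_num)
) ENGINE=TokuDB ROW_FORMAT=TOKUDB_LZMA
PARTITION BY RANGE (TO_DAYS(test_time)) (
  PARTITION p20140616 VALUES LESS THAN (TO_DAYS('2014-06-17')),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- On the HEAD/COMPUTE NODE: register the data nodes...
CREATE SERVER backend1 FOREIGN DATA WRAPPER mysql
  OPTIONS (HOST '10.0.0.1', DATABASE 'galaxy', USER 'spider',
           PASSWORD 'secret', PORT 3306);
-- (likewise backend2, backend3, backend4)

-- ...and a Spider table that holds no data and shards by key hash (modulo)
-- across the four data nodes.
CREATE TABLE raw_ptest_results (
  lot_id     VARCHAR(32) NOT NULL,
  test_time  DATETIME    NOT NULL,
  test_num   INT         NOT NULL,
  test_value DOUBLE,
  PRIMARY KEY (lot_id, test_time, test_num)
) ENGINE=SPIDER
COMMENT='wrapper "mysql", table "raw_ptest_results"'
PARTITION BY KEY (lot_id) (
  PARTITION pt1 COMMENT = 'srv "backend1"',
  PARTITION pt2 COMMENT = 'srv "backend2"',
  PARTITION pt3 COMMENT = 'srv "backend3"',
  PARTITION pt4 COMMENT = 'srv "backend4"'
);
```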
27. Galaxy confidential
Re-sharding without data copy
27
[Diagram] Before: a level-1 Spider table L1.1 shards by node modulo across nodes 01 and 02, over Toku tables partitioned by date (e.g. weeks 01 and 02). After adding nodes 03 and 04, a new level-1 Spider table L1.2 shards across all four nodes, but only for the new partitions (e.g. weeks 03 and 04). A level-2 Spider table L2 shards by date range, routing old weeks to L1.1 and current weeks to L1.2, so existing data never has to be copied. In short: partition by date (e.g. daily), shard by node modulo, then shard by date range, as sketched below.
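Continuing the same assumptions (hypothetical names, Spider's documented partition-comment parameters, and a hypothetical server entry headnode that points at wherever the level-1 Spider tables live), the level-2 table might look roughly like this: a date-range-partitioned Spider table whose partitions point at the old and new level-1 tables, so re-sharding only changes routing and never copies rows.

```sql
-- Hypothetical level-2 Spider table: old weeks stay on the 2-node layout
-- (raw_ptest_results_l1_1), new weeks go to the 4-node layout
-- (raw_ptest_results_l1_2). No existing data is moved.
CREATE TABLE raw_ptest_results_l2 (
  lot_id     VARCHAR(32) NOT NULL,
  test_time  DATETIME    NOT NULL,
  test_num   INT         NOT NULL,
  test_value DOUBLE,
  PRIMARY KEY (lot_id, test_time, test_num)
) ENGINE=SPIDER
COMMENT='wrapper "mysql"'
PARTITION BY RANGE (TO_DAYS(test_time)) (
  PARTITION p_old VALUES LESS THAN (TO_DAYS('2014-06-17'))
    COMMENT = 'srv "headnode", table "raw_ptest_results_l1_1"',
  PARTITION p_new VALUES LESS THAN MAXVALUE
    COMMENT = 'srv "headnode", table "raw_ptest_results_l1_2"'
);
```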
28. Galaxy confidential
Proven Performance
28
Galaxy has deployed its big data solution at a major test subcontractor in Asia
with the following performance:
• Peak data insertion rate : 2 TB of STDF data per day
• Data compression of raw data : 60-80 %
• DB retention of raw data : 3 months
• DB retention of summary data : 1 year
• Archiving of test data : Automatic
• Target was 2 MB/sec; we achieve about 10 MB/sec
• Since 17th June, steady production :
– Constant insertion speed
– 1400 files/day, 120 GB/day
– ft_ptest_results: 92 billion rows / 1.5 TB across 4 nodes
– ft_mptest_results: 14 billion rows / 266 GB across 4 nodes
– wt_ptest_results: 9 billion rows / 153 GB across 4 nodes
– 50TB available volume, total DB size is 8TB across all 4 nodes
• 7 servers (22k$) + SAN ($$$) OR DAS (15k$)
29. Galaxy confidential
File count inserted per day
29
• Integration issues up to May 7
• Raw & Summary-only data insertion up to May 18
• Raw & Summary data insertion, Problem solving, fine tuning up to June 16
• Steady production insertion of Raw & Summary data since June 17