Query processing in Impala involves parsing the SQL, semantic analysis to validate the query, planning to generate an executable query plan, and finally execution. Based on table and column statistics, the query planner considers different join orders and strategies, such as broadcast and partitioned joins, to minimize data transfer during execution. The explain output details how the query will be executed in a distributed fashion across nodes.
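To make the planner's behavior visible, here is a minimal sketch assuming the impyla client library; the coordinator host and the sales/stores tables are hypothetical.

```python
# Minimal sketch of inspecting Impala's distributed plan, assuming the
# impyla client library; the host and the sales/stores tables are
# hypothetical.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# Statistics drive the planner's choice between broadcast and partitioned
# joins, so gather table and column stats first.
cur.execute("COMPUTE STATS sales")
cur.execute("COMPUTE STATS stores")

# EXPLAIN shows the distributed plan: scan fragments, exchange nodes
# (data transfer), and the chosen join strategy (BROADCAST vs. PARTITIONED).
cur.execute("""
    EXPLAIN
    SELECT st.region, SUM(sa.amount)
    FROM sales sa JOIN stores st ON sa.store_id = st.id
    GROUP BY st.region
""")
for (line,) in cur.fetchall():
    print(line)
```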
Machine Learning - Convolutional Neural Network (Richard Kuo)
The document provides an overview of convolutional neural networks (CNNs) for visual recognition. It discusses the basic concepts of CNNs such as convolutional layers, activation functions, pooling layers, and network architectures. Examples of classic CNN architectures like LeNet-5 and AlexNet are presented. Modern architectures such as Inception and ResNet are also discussed. Code examples for image classification using TensorFlow, Keras, and Fastai are provided.
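As a flavor of the TensorFlow/Keras examples the summary mentions, here is a minimal CNN classifier sketch; this is not code from the slides, and the 28x28x1 input assumes MNIST-style grayscale images.

```python
# A minimal Keras CNN for image classification, in the spirit of the code
# examples the slides describe (a sketch, not code from the deck).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),   # convolutional layer
    layers.MaxPooling2D(),                     # pooling layer
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),    # 10-class output
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```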
Top 5 Mistakes When Writing Spark Applications (Spark Summit)
This document discusses 5 common mistakes when writing Spark applications:
1) Improperly sizing executors by not considering cores, memory, and overhead. The optimal configuration depends on the workload and cluster resources.
2) Applications failing due to shuffle blocks exceeding 2GB size limit. Increasing the number of partitions helps address this.
3) Jobs running slowly due to data skew in joins and shuffles. Techniques like salting keys can help address skew.
4) Not managing the DAG to minimize shuffles and bring the work to the data; prefer reduceByKey over groupByKey and treeReduce over reduce when possible (see the sketch after this list).
5) Classpath conflicts arising from mismatched library versions, which can be addressed using shading.
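A small PySpark sketch of points 3 and 4, using hypothetical data: reduceByKey combines values map-side where groupByKey ships them all, and a salt prefix spreads a hot key across partitions.

```python
# PySpark sketch for points 3 and 4: prefer reduceByKey over groupByKey,
# and salt a hot key to spread skewed data. Data and salt factor are
# hypothetical.
import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("hot", 1)] * 1000 + [("cold", 1)] * 10)

# groupByKey ships every value across the shuffle before summing ...
sums_grouped = pairs.groupByKey().mapValues(sum)
# ... while reduceByKey combines map-side and shuffles only partial sums.
sums_reduced = pairs.reduceByKey(lambda a, b: a + b)

# Skew mitigation by salting: aggregate per (key, salt), then strip the
# salt and aggregate again, so the hot key is spread over 8 partitions.
SALT = 8
salted = pairs.map(lambda kv: ((kv[0], random.randrange(SALT)), kv[1]))
partial = salted.reduceByKey(lambda a, b: a + b)
final = (partial
         .map(lambda kv: (kv[0][0], kv[1]))
         .reduceByKey(lambda a, b: a + b))
print(final.collect())
```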
Hardware Acceleration for Machine Learning (CastLabKAIST)
This document provides an overview of a lecture on hardware acceleration for machine learning. The lecture will cover deep neural network models like convolutional neural networks and recurrent neural networks. It will also discuss various hardware accelerators developed for machine learning, including those designed for mobile/edge and cloud computing environments. The instructor's background and the agenda topics are also outlined.
This document discusses how QEMU translates guest binaries to run on the host machine. It first generates an intermediate representation called TCG IR from the guest binary code, then translates the TCG IR into native host machine code. To achieve high performance, it chains translated blocks together by patching jump targets. Key techniques include just-in-time compilation, translation block lookup, block chaining, and helper functions to emulate unsupported guest instructions.
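As a loose analogy in plain Python (not real TCG), the sketch below mimics the translate-once, cache, and chain loop: each guest block is "translated" to a host callable on first execution, cached by program counter, and linked to its successor.

```python
# Toy analogy of QEMU's translate-and-cache loop: each guest "block" is
# translated once into a host callable, cached by program counter, and
# linked to its successor. Purely illustrative; real TCG emits native code.
guest_program = {0: ("add", 1, 4), 4: ("add", 2, 8), 8: ("halt", 0, None)}
block_cache = {}  # pc -> (host_fn, next_pc), like the translation block cache

def translate(pc):
    op, arg, next_pc = guest_program[pc]
    def host_block(state):            # the "translated" host code
        if op == "add":
            state["acc"] += arg
    return host_block, next_pc

def run(entry_pc):
    state = {"acc": 0}
    pc = entry_pc
    while pc is not None:
        if pc not in block_cache:     # translate only on first execution
            block_cache[pc] = translate(pc)
        host_fn, pc = block_cache[pc]  # "chained" jump to the next block
        host_fn(state)
    return state["acc"]

print(run(0))  # -> 3
```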
ClickHouse Features for Advanced Users, by Aleksei Milovidov (Altinity Ltd)
This document summarizes key features for advanced users of ClickHouse, an open-source column-oriented database management system. It describes sample keys that can be defined in MergeTree tables to generate instant reports on large customer data. It also summarizes intermediate aggregation states, consistency modes, and tools for processing data without a server like clickhouse-local.
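For instance, the sample-key feature reads a fraction of the data for fast approximate answers. A minimal sketch, assuming the clickhouse-driver package and a hypothetical `hits` MergeTree table declared with a SAMPLE BY key:

```python
# Sketch of instant approximate reports via a sample key, assuming the
# clickhouse-driver package and a hypothetical MergeTree table `hits`
# created with SAMPLE BY intHash32(user_id).
from clickhouse_driver import Client

client = Client(host="localhost")

# SAMPLE 1/10 reads roughly 10% of the data; scaling the count back up
# gives a fast approximate total over a large dataset.
rows = client.execute(
    "SELECT count() * 10 AS approx_hits FROM hits SAMPLE 1/10"
)
print(rows)
```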
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201... (Andrew Gardner)
Note: these are the slides from a presentation at Lexis Nexis in Alpharetta, GA, on 2014-01-08 as part of the DataScienceATL Meetup. A video of this talk from Dec 2013 is available on vimeo at http://bit.ly/1aJ6xlt
Note: Slideshare mis-converted the images in slides 16-17. Expect a fix in the next couple of days.
---
Deep learning is a hot area of machine learning named one of the "Breakthrough Technologies of 2013" by MIT Technology Review. The basic ideas extend neural network research from past decades and incorporate new discoveries in statistical machine learning and neuroscience. The results are new learning architectures and algorithms that promise disruptive advances in automatic feature engineering, pattern discovery, data modeling and artificial intelligence. Empirical results from real world applications and benchmarking routinely demonstrate state-of-the-art performance across diverse problems including: speech recognition, object detection, image understanding and machine translation. The technology is employed commercially today, notably in many popular Google products such as Street View, Google+ Image Search and Android Voice Recognition.
In this talk, we will present an overview of deep learning for data scientists: what it is, how it works, what it can do, and why it is important. We will review several real world applications and discuss some of the key hurdles to mainstream adoption. We will conclude by discussing our experiences implementing and running deep learning experiments on our own hardware data science appliance.
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc... (Altinity Ltd)
This document summarizes Cloudflare's use of ClickHouse to analyze over 6 million HTTP requests per second. Some key points:
- Cloudflare previously used PostgreSQL, Citus, and Flink but these did not scale sufficiently.
- ClickHouse was chosen as it is fast, scalable, fault tolerant, and Cloudflare had existing expertise in it.
- Cloudflare designed ClickHouse schemas to aggregate HTTP data into totals, breakdowns by category, and unique counts into two tables using different engines.
- Tuning ClickHouse index granularity improved query latency by 50% and throughput by 3x.
- The new ClickHouse pipeline is more scalable and fault tolerant than the systems it replaced.
The document discusses instruction set architecture (ISA), describing it as the interface between software and hardware that defines the programming model and machine language instructions. It provides details on RISC ISAs like MIPS and how they aim to have simpler instructions, more registers, load/store architectures, and pipelining to improve performance compared to CISC ISAs. The document also discusses different types of ISA designs including stack-based, accumulator-based, and register-to-register architectures.
This Edureka Recurrent Neural Networks tutorial will help you understand why we need Recurrent Neural Networks (RNNs) and what exactly they are. It also explains a few issues with training a Recurrent Neural Network and how to overcome those challenges using LSTMs. The last section includes a use-case of an LSTM predicting the next word in a sample short story; a toy sketch in that spirit follows the topic list.
Below are the topics covered in this tutorial:
1. Why Not Feedforward Networks?
2. What Are Recurrent Neural Networks?
3. Training A Recurrent Neural Network
4. Issues With Recurrent Neural Networks - Vanishing And Exploding Gradient
5. Long Short-Term Memory Networks (LSTMs)
6. LSTM Use-Case
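Here is the toy next-word sketch promised above, assuming TensorFlow/Keras; the miniature "story" and the model sizes are illustrative only.

```python
# Toy next-word prediction with an LSTM (topic 6); the "story" and model
# sizes are hypothetical, not taken from the tutorial.
import numpy as np
from tensorflow.keras import layers, models

story = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(story))
idx = {w: i for i, w in enumerate(vocab)}

# Training pairs: two previous words -> next word.
X = np.array([[idx[story[i]], idx[story[i + 1]]]
              for i in range(len(story) - 2)])
y = np.array([idx[story[i + 2]] for i in range(len(story) - 2)])

model = models.Sequential([
    layers.Embedding(len(vocab), 8),
    layers.LSTM(16),
    layers.Dense(len(vocab), activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=50, verbose=0)

probe = np.array([[idx["the"], idx["cat"]]])
print(vocab[int(model.predict(probe, verbose=0).argmax())])
```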
Adventures with the ClickHouse ReplacingMergeTree Engine (Altinity Ltd)
Presentation on ReplacingMergeTree by Robert Hodges of Altinity at the 14 December 2022 SF Bay Area ClickHouse Meetup (https://www.meetup.com/san-francisco-bay-area-clickhouse-meetup/events/289605843/)
Federated Learning makes it possible to build machine learning systems without direct access to training data. The data remains in its original location, which helps to ensure privacy, reduces network communication costs, and taps edge device computing resources. The principles of data minimization established by the GDPR, and the growing prevalence of smart sensors make the advantages of federated learning more compelling. Federated learning is a great fit for smartphones, industrial and consumer IoT, healthcare and other privacy-sensitive use cases, and industrial sensor applications.
We’ll present the Fast Forward Labs team’s research on this topic and the accompanying prototype application, “Turbofan Tycoon”: a simplified working example of federated learning applied to a predictive maintenance problem. In this demo scenario, customers of an industrial turbofan manufacturer are not willing to share the details of how their components failed with the manufacturer, but want the manufacturer to provide them with a strategy to maintain the part. Federated learning allows us to satisfy the customer's privacy concerns while providing them with a model that leads to fewer costly failures and less maintenance downtime.
We’ll discuss the advantages and tradeoffs of taking the federated approach. We’ll assess the state of tooling for federated learning, circumstances in which you might want to consider applying it, and the challenges you’d face along the way.
Speaker
Chris Wallace
Data Scientist
Cloudera
High Performance, High Reliability Data Loading on ClickHouse (Altinity Ltd)
This document provides a summary of best practices for high reliability data loading in ClickHouse. It discusses ClickHouse's ingestion pipeline and strategies for improving performance and reliability of inserts. Some key points include using larger block sizes for inserts, avoiding overly frequent or compressed inserts, optimizing partitioning and sharding, and techniques like buffer tables and compact parts. The document also covers ways to make inserts atomic and handle deduplication of records through block-level and logical approaches.
This document discusses unsupervised learning and clustering. It defines unsupervised learning as modeling the underlying structure or distribution of input data without corresponding output variables. Clustering is described as organizing unlabeled data into groups of similar items called clusters. The document focuses on k-means clustering, describing it as a method that partitions data into k clusters by minimizing distances between points and cluster centers. It provides details on the k-means algorithm and gives examples of its steps. Strengths and weaknesses of k-means clustering are also summarized.
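The algorithm the summary describes fits in a few lines of NumPy; a minimal sketch (random data, fixed k) follows.

```python
# Minimal k-means sketch in NumPy: assign each point to its nearest center,
# move each center to the mean of its cluster, repeat until stable.
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: squared distance to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(pts, k=2)
print(centers)
```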
This document discusses various addressing modes and instruction formats used in computer architecture. It describes immediate, direct, indirect, register, register indirect, displacement, and stack addressing modes. It also discusses instruction formats used by processors like PDP-8, PDP-10, PDP-11, VAX, Pentium, and PowerPC that allocate bits differently based on factors like memory size, addressing modes, operands, and register sets.
The document analyzes the performance of Google's Tensor Processing Unit (TPU) compared to CPUs and GPUs for neural network inference workloads. It finds that the TPU, an ASIC designed specifically for neural network operations, achieves a 25-30x speedup over CPUs and GPUs. This is due to the TPU having many more simple integer math cores and on-chip memory optimized for neural network computations. The document concludes the TPU is 30-80x more energy efficient than other hardware and its performance could increase further with higher memory bandwidth.
"Attention Is All You Need" (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, https://bit.ly/2y7yAD2 presented by Maroua Maachou (Veepee)
DPDK greatly improves packet processing performance and throughput by allowing applications to directly access hardware and bypass kernel involvement. It can improve performance by up to 10 times, allowing over 80 Mpps of throughput on a single CPU, or double that with two CPUs. This enables telecom and networking equipment manufacturers to develop products faster and with lower costs. DPDK achieves these gains through techniques like dedicated core affinity, userspace drivers, polling instead of interrupts, and lockless synchronization.
A convolutional neural network (CNN, or ConvNet) is a machine learning algorithm used in computer vision for tasks such as image classification, image detection, digit recognition, and many more. https://technoelearn.com
Performance Optimizations in Apache Impala (Cloudera, Inc.)
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, which is not delivered by batch frameworks such as Apache Hive or Spark. Impala is written in C++ and Java. It maintains Hadoop's flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par with or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
The document discusses big data integration techniques. It defines big data integration as combining heterogeneous data sources into a unified form. The key techniques discussed are schema mapping to match data schemas, record linkage to identify matching records across sources, and data fusion to resolve conflicts by techniques like voting and source quality assessment. The document also briefly mentions research areas in big data integration and some tools for performing integration.
The document discusses program execution in the central processing unit (CPU). It explains that the CPU fetches instructions from memory one at a time and executes them using its control unit, arithmetic logic unit, and registers. The execution process involves fetching the instruction from memory into the instruction register, decoding what type of instruction it is, executing the appropriate operation using components like the accumulator and memory address register, and storing the output, which may update the program counter. Key components like the control unit, registers, and arithmetic logic unit work together to precisely carry out the steps specified in the stored program.
Comparing Incremental Learning Strategies for Convolutional Neural Networks (Vincenzo Lomonaco)
In the last decade, Convolutional Neural Networks (CNNs) have been shown to perform incredibly well in many computer vision tasks, such as object recognition and object detection, being able to extract meaningful, high-level invariant features. However, partly because of their complex training and tricky hyper-parameter tuning, CNNs have been scarcely studied in the context of incremental learning, where data are available in consecutive batches and retraining the model from scratch is unfeasible. In this work we compare different incremental learning strategies for CNN-based architectures, targeting real-world applications.
If you are interested in this work please cite:
Lomonaco, V., & Maltoni, D. (2016, September). Comparing Incremental Learning Strategies for Convolutional Neural Networks. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition (pp. 175-184). Springer International Publishing.
For further information visit my website: http://www.vincenzolomonaco.com/
The presentation introduces the basic concept of cache memory, covering its background and all necessary details, along with the different mapping techniques used inside a cache.
Meta-learning, or learning how to learn, is our innate ability to learn new, ever more complex tasks very efficiently by building on prior experience. It is a very exciting direction for machine learning (and AI in general). In this tutorial, I introduce the main concepts and state of the art.
The document discusses several key factors for optimizing HBase performance including:
1. Reads and writes compete for disk, network, and thread resources so they can cause bottlenecks.
2. Memory allocation needs to balance space for memstores, block caching, and Java heap usage.
3. The write-ahead log can be a major bottleneck and increasing its size or number of logs can improve write performance.
4. Flushes and compactions need to be tuned to avoid premature flushes causing "compaction storms".
This document provides an overview and best practices for operating HBase clusters. It discusses HBase and Hadoop architecture, how to set up an HBase cluster including Zookeeper and region servers, high availability considerations, scaling the cluster, backup and restore processes, and operational best practices around hardware, disks, OS, automation, load balancing, upgrades, monitoring and alerting. It also includes a case study of a 110 node HBase cluster.
This document discusses tuning HBase and HDFS for performance and correctness. Some key recommendations, illustrated by the sketch after this list, include:
- Enable HDFS sync on close and sync behind writes for correctness on power failures.
- Tune HBase compaction settings like blockingStoreFiles and compactionThreshold based on whether the workload is read-heavy or write-heavy.
- Size RegionServer machines based on disk size, heap size, and number of cores to optimize for the workload.
- Set client and server RPC chunk sizes like hbase.client.write.buffer to 2MB to maximize network throughput.
- Configure various garbage collection settings in HBase, like -Xmn512m and -XX:+UseCMSInitiatingOccupancyOnly.
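As an illustration of the settings named above, here is a small Python sketch that prints them in hbase-site.xml form; the values are hypothetical starting points, not recommendations from the talk.

```python
# Hypothetical starting points for the properties named above, printed in
# hbase-site.xml form; values must be tuned per workload.
hbase_site = {
    "dfs.datanode.synconclose": "true",           # sync on close for correctness
    "hbase.hstore.blockingStoreFiles": "16",      # raise for write-heavy loads
    "hbase.hstore.compactionThreshold": "3",      # store files before compaction
    "hbase.client.write.buffer": str(2 * 1024 * 1024),  # 2MB client buffer
}
for name, value in hbase_site.items():
    print(f"<property><name>{name}</name><value>{value}</value></property>")
```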
HBase can be an intimidating beast for someone considering its adoption. For what kinds of workloads is it well suited? How does it integrate into the rest of my application infrastructure? What are the data semantics upon which applications can be built? What are the deployment and operational concerns? In this talk, I'll address each of these questions in turn. As supporting evidence, both high-level application architecture and internal details will be discussed. This is an interactive talk: bring your questions and your use-cases!
HBase Applications - Atlanta HUG - May 2014 (larsgeorge)
HBase is good at various workloads, ranging from sequential range scans to purely random access. These access patterns can be translated into application types, usually falling into two major groups: entities and events. This presentation discusses the underlying implications and how to approach those use-cases. Examples taken from Facebook show how this has been tackled in real life.
These are my slides for the 5 minute overview talk I gave during a recent workshop at the European Commission in Brussels, on the topic of "Big Data Skills in Europe".
HBase Status Report - Hadoop Summit Europe 2014 (larsgeorge)
This document provides a summary of new features and improvements in recent versions of Apache HBase, a distributed, scalable, big data store. It discusses major changes and enhancements in HBase 0.92+, 0.94+, and 0.96+, including new HFile formats, coprocessors, caching improvements, performance tuning, and more. The document is intended to bring readers up to date on the current state and capabilities of HBase.
Designing Scalable Data Warehouse Using MySQL (Venu Anuganti)
The document discusses designing scalable data warehouses using MySQL. It covers topics like the role of MySQL in data warehousing and analytics, typical data warehouse architectures, scaling out MySQL, and limitations of MySQL for large datasets or as a scalable warehouse solution. Real-time analytics are also discussed, noting the challenges of performance and scalability for near real-time analytics.
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase (HBaseCon)
In this presentation, we will introduce Hotspot's Garbage First collector (G1GC) as the most suitable collector for latency-sensitive applications running with large memory environments. We will first discuss G1GC internal operations and tuning opportunities, and also cover tuning flags that set desired GC pause targets, change adaptive GC thresholds, and adjust GC activities at runtime. We will provide several HBase case studies using Java heaps as large as 100GB that show how to best tune applications to remove unpredicted, protracted GC pauses.
This document summarizes a presentation about optimizing HBase performance through caching. It discusses how baseline tests showed low cache hit rates and CPU/memory utilization. Reducing the table block size improved cache hits but increased overhead. Adding an off-heap bucket cache to store table data minimized JVM garbage collection latency spikes and improved memory utilization by caching frequently accessed data outside the Java heap. Configuration parameters for the bucket cache are also outlined.
This document introduces HBase, an open-source, non-relational, distributed database modeled after Google's BigTable. It describes what HBase is, how it can be used, and when it is applicable. Key points include that HBase stores data in columns and rows accessed by row keys, integrates with Hadoop for MapReduce jobs, and is well-suited for large datasets, fast random access, and write-heavy applications. Common use cases involve log analytics, real-time analytics, and messages-centered systems.
Many of the systems we want to monitor produce a stream of events; examples include event data from web or mobile applications, sensors, and medical devices. What do we need to do to build a real-time streaming application, and how do we do this with high performance at scale?
This Free Code Friday will help you get a jump-start on scaling distributed computing by taking an example time series application and coding through different aspects of working with such a dataset. We will cover building an end to end distributed processing pipeline using MapR Streams (Kafka API), Apache Spark, and MapR-DB (HBase API), to rapidly ingest, process and store large volumes of high speed data.
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing (Eric Kacz..., Spark Summit)
The document discusses tuning the Garbage First (G1) garbage collector in Java 8 to reduce garbage collection pauses for the large heaps used in Spark graph computing workloads. It was found that the default G1 settings resulted in lengthy full garbage collections of over 100 seconds. After analyzing the garbage collection logs, the main issue was identified as the concurrent marking phase not completing before a full collection was needed. Increasing the number of concurrent marking threads from 8 to 20 addressed this by speeding up the concurrent phase. With this tuning, no full collections occurred and total stop-the-world pause time was reduced to under a minute, a significant improvement over the original configuration.
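A sketch of how such settings might be applied from PySpark; the 20-thread value mirrors the talk's 8 to 20 change, while the memory size and pause target are hypothetical placeholders.

```python
# Sketch of applying the described G1 tuning from PySpark: more concurrent
# marking threads so marking finishes before the heap fills. The memory
# size and pause target are placeholders, not values from the talk.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("g1gc-tuning")
    .config("spark.executor.memory", "100g")
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -XX:ConcGCThreads=20 -XX:MaxGCPauseMillis=500")
    .getOrCreate()
)
```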
HBase is a scalable NoSQL database modeled after Google's Bigtable. It is built on top of HDFS for storage, and uses Zookeeper for distributed coordination and failover. Data in HBase is stored in tables and sorted by row key, with columns grouped into families and cells containing values and timestamps. HBase tables are split into regions for scalability and fault tolerance, with a master server coordinating region locations across multiple region servers.
Social Networks and the Richness of Data (larsgeorge)
Social networks by their nature deal with large amounts of user-generated data that must be processed and presented in a time sensitive manner. Much more write intensive than previous generations of websites, social networks have been on the leading edge of non-relational persistence technology adoption. This talk presents how Germany's leading social networks Schuelervz, Studivz and Meinvz are incorporating Redis and Project Voldemort into their platform to run features like activity streams.
This document discusses Bronto's use of HBase for their marketing platform. Some key points:
- Bronto uses HBase for high volume scenarios, realtime data access, batch processing, and as a staging area for HDFS.
- HBase tables at Bronto are designed with the read/write patterns and necessary queries in mind. Row keys and column families are structured to optimize for these access patterns (see the sketch after this list).
- Operations of HBase at scale require tuning of JVM settings, monitoring tools, and custom scripts to handle compactions and prevent cascading failures during high load. Table design also impacts operations and needs to account for expected workloads.
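A sketch of the kind of access-pattern-driven key design described above; the account/event fields are hypothetical, and the reversed timestamp makes a prefix scan over one account return its newest events first.

```python
# Sketch of access-pattern-driven row key design; the account/event
# fields are hypothetical, not Bronto's actual schema.
import struct
import time

LONG_MAX = 2**63 - 1

def event_row_key(account_id: str, ts_millis: int) -> bytes:
    # <account>:<reversed-timestamp> keeps an account's events contiguous
    # while sorting them newest-first within the prefix.
    reversed_ts = LONG_MAX - ts_millis
    return account_id.encode() + b":" + struct.pack(">q", reversed_ts)

print(event_row_key("acct42", int(time.time() * 1000)).hex())
```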
MemSQL is an in-memory distributed database that provides fast data processing for real-time analytics. It allows companies to extract greater insights from big data in real time. MemSQL is used by companies for applications like ad targeting, recommendations, fraud detection, and more. It provides rapid data loading and querying, horizontal scalability, and supports both relational and JSON data. Case studies describe how companies like Comcast, Zynga, CPXi, and others use MemSQL to power applications that require real-time insights from massive datasets.
In-memory Data Management Trends & Techniques (Hazelcast)
- Hardware trends like increasing cores/CPU and RAM sizes enable in-memory data management techniques. Commodity servers can now support terabytes of memory.
- Different levels of data storage have vastly different access times, from registers (<1ns) to disk (4-7ms). Caching data in faster levels of storage improves performance.
- Techniques to exploit data locality, cache hierarchies, tiered storage, parallelism and in-situ processing can help overcome hardware limitations and achieve fast, real-time processing. Emerging in-memory databases use these techniques to enable new types of operational analytics.
The document discusses Ceph, an open-source distributed storage system. It provides an overview of Ceph's architecture and components, how it works, and considerations for setting up a Ceph cluster. Key points include: Ceph provides unified block, file, and object storage interfaces and can scale out massively. It uses CRUSH to deterministically map data across a cluster for redundancy. Setup choices like network, storage nodes, disks, caching, and placement groups impact performance and must be tuned for the workload.
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015), by Lars Marowsky-Brée
This document discusses modeling and predicting performance for Ceph storage clusters. It describes many of the hardware, software, and configuration factors that impact Ceph performance, including network setup, storage nodes, disks, redundancy, placement groups and more. The document advocates for developing standardized benchmarks to better understand Ceph performance under different workloads and cluster configurations in order to answer customers' questions.
In this session we review the design of the newly released off-heap storage feature in Apache Geode, and discuss use cases and potential directions for additional capabilities of this feature.
The document outlines topics covered in "The Impala Cookbook" published by Cloudera. It discusses physical and schema design best practices for Impala, including recommendations for data types, partition design, file formats, and block size. It also covers estimating and managing Impala's memory usage, and how to identify the cause when queries exceed memory limits.
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters (Ceph Community)
This document discusses modeling and predicting performance in Ceph distributed storage systems. It provides an overview of Ceph, including its object storage, block storage, and file system capabilities. It then discusses various factors that impact Ceph performance, such as network configuration, storage node hardware, number of disks, caching, redundancy settings, and placement groups. The document notes there are many configuration choices and tradeoffs to consider when designing a Ceph cluster to meet performance requirements.
Responding rapidly when you have 100+ GB data sets in Java (Peter Lawrey)
One way to speed up your application is to bring more of your data into memory. But how do you handle hundreds of GB of data in a JVM, and what tools can help you?
Mentions: Speedment, Azul, Terracotta, Hazelcast and Chronicle.
Taking Splunk to the Next Level - Architecture Breakout Session (Splunk)
This document provides an overview and agenda for taking a Splunk deployment to the next level by addressing scaling needs and high availability requirements. It discusses growing use cases and data volumes, making Splunk mission critical through clustering, and supporting global deployments. The agenda covers scaling strategies like indexer clustering, search head clustering, and hybrid cloud deployments. It also promotes justifying increased spending by mapping dependencies and costs of failures across an organization's systems.
In-memory Caching in HDFS: Lower Latency, Same Great Taste (DataWorks Summit)
This document discusses in-memory caching in HDFS to improve query latency. The implementation caches important datasets in the DataNode memory and allows clients to directly access cached blocks via zero-copy reads without checksum verification. Evaluation shows the zero-copy reads approach provides significant performance gains over short-circuit and TCP reads for both microbenchmarks and Impala queries, with speedups of up to 7x when the working set fits in memory. MapReduce jobs see more modest gains as they are often not I/O bound.
JavaOne 2015 - Work With Hundreds of Hot Terabytes in JVMs (Speedment, Inc.)
Presentation summary: By leveraging memory-mapped files, the Chronicle Engine supports large maps that can easily exceed the size of your server's RAM, allowing application developers to create huge JVMs where data can be obtained quickly and with predictable latency. The Chronicle Engine can be synchronized with an underlying database using Speedment, so that your in-memory maps will be "alive" and change whenever data changes in the underlying database. Speedment can also automatically derive domain models directly from the database so that you can start using the solution very quickly. Because the Java maps are mapped onto files, the maps can be shared instantly between several JVMs, and when you restart a JVM it may start very quickly without having to reload data from the underlying database. The mapped files can be hundreds of terabytes, which has been done in real-world deployments.
Motivation and goals for off-heap storage
Off-heap features and usage
Implementation overview
Preliminary benchmarks: off-heap vs. heap
Tips and best practices
PGConf.ASIA 2019 Bali - Tune Your Linux Box, Not Just PostgreSQL - Ibrar Ahmed (Equnix Business Solutions)
This document discusses tuning Linux and PostgreSQL for performance. It recommends:
- Tuning Linux kernel parameters like huge pages, swappiness, and overcommit memory. Huge pages can improve TLB performance.
- Tuning PostgreSQL parameters like shared_buffers, work_mem, and checkpoint_timeout. shared_buffers stores the most frequently accessed data (a back-of-envelope sizing sketch follows this list).
- Other tips include choosing proper hardware, OS, and database based on workload. Tuning queries and applications can also boost performance.
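A back-of-envelope sketch of the sizing rules of thumb commonly paired with these settings; the inputs and ratios are assumptions, not figures from the talk.

```python
# Back-of-envelope calculator for commonly cited starting points:
# shared_buffers ~25% of RAM, work_mem derived from RAM and the expected
# number of concurrent sort/hash operations. Inputs are hypothetical.
ram_gb = 64
max_connections = 200

shared_buffers_gb = ram_gb * 0.25
work_mem_mb = (ram_gb * 1024 * 0.25) / (max_connections * 2)

print(f"shared_buffers = {shared_buffers_gb:.0f}GB")
print(f"work_mem = {work_mem_mb:.0f}MB")
```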
Apache Impala is a complex engine and requires a thorough technical understanding to utilize it fully. Without proper configuration or usage, Impala’s performance becomes unpredictable, and end-user experience suffers. However, for many users and administrators, the right configuration of Impala is still a mystery.
Drawing on work with some of the largest clusters in the world, Manish Maheshwari shares ingestion best practices to keep an Impala deployment scalable and details admission control configuration to provide a consistent experience to end users. Manish also takes a high-level look at Impala’s query profile, which is used as a first step in any performance troubleshooting, and discusses common mistakes users and BI tools make when interacting with Impala. Manish concludes by detailing an ideal setup to show all of this in practice.
Azure Data Factory Data Flow Performance Tuning 101 (Mark Kromer)
The document provides performance timing results and recommendations for optimizing Azure Data Factory data flows. Sample 1 processed a 421MB file with 887k rows in 4 minutes using default partitioning on an 80-core Azure IR. Sample 2 processed a table with the same size and transforms in 3 minutes using source and derived column partitioning. Sample 3 processed the same size file in 2 minutes with default partitioning. The document recommends partitioning strategies, using memory optimized clusters, and scaling cores to improve performance.
This document discusses common mistakes made when implementing Oracle Exadata systems. It describes how improperly sized SGAs can hurt performance on data warehouses. It also discusses issues like not using huge pages, over- or under-use of indexing, too much parallelization, selecting the wrong disk types, failing to patch systems, and not implementing tools like Automatic Service Request and exachk. The document provides guidance on optimizing these areas to get the best performance from Exadata.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
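A sketch of one common form of the "intelligent key design" mentioned above, a hash-based salt prefix that spreads monotonically increasing keys across regions; the bucket count is a hypothetical design-time choice.

```python
# Sketch of hash-salted row keys: a small salt prefix spreads
# monotonically increasing keys (e.g. time series) across regions so
# writes don't hammer a single region server.
import hashlib

BUCKETS = 16

def salted_key(raw_key: str) -> bytes:
    bucket = int(hashlib.md5(raw_key.encode()).hexdigest(), 16) % BUCKETS
    return f"{bucket:02d}|{raw_key}".encode()

print(salted_key("metric-2014-05-01T12:00:00"))
# Reads must fan out over all 16 bucket prefixes and merge the results.
```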
From: DataWorks Summit 2017 - Munich - 20170406
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features, HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging, and so on. This talk is based on the research for an upcoming second edition of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
MariaDB Performance Tuning and Optimization (MariaDB plc)
This document discusses MariaDB performance tuning and optimization. It covers common principles like tuning from the start of application development. Specific topics discussed include server hardware, OS settings, MariaDB configuration settings like innodb_buffer_pool_size, database design best practices, and query monitoring and tuning tools. The overall goal is to efficiently use hardware resources, ensure best performance for users, and avoid outages.
MariaDB Server Performance Tuning & Optimization (MariaDB plc)
This document discusses various techniques for optimizing MariaDB server performance, including:
- Tuning configuration settings like the buffer pool size, query cache size, and thread pool settings.
- Monitoring server metrics like CPU usage, memory usage, disk I/O, and MariaDB-specific metrics.
- Analyzing slow queries with the slow query log and EXPLAIN statements to identify optimization opportunities like adding indexes.
From: DataWorks Summit Munich 2017 - 20170406
While you could be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance, as far as backup and disaster recovery (BDR) is concerned, is not (yet) important. This talk first introduces the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightheartedly, and what is needed to implement and guarantee continuous operation of Hadoop-cluster-based solutions.
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv (larsgeorge)
This talk shows the complexity of building a data pipeline in Hadoop, starting with the technology aspect and then correlating it to the skill sets of current Hadoop adopters.
Parquet is an open-source columnar storage format that provides an efficient data layout for analytical queries. Twitter uses Parquet to store logs and analytics data across multiple large Hadoop clusters, saving petabytes of storage and reducing query times by up to 66% by reading only needed columns. Parquet defines a language-independent file format that stores data by column rather than row to optimize analytical access patterns.
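A minimal sketch of that columnar benefit using pyarrow, Parquet's Python bindings; the table contents and file name are hypothetical.

```python
# Columnar access with pyarrow: write a Parquet file, then read back only
# the column a query needs. Table contents and file name are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "url": ["/a", "/b", "/c"],
    "latency_ms": [12, 7, 31],
})
pq.write_table(table, "logs.parquet")

# A query touching one column reads only that column's data from disk.
latencies = pq.read_table("logs.parquet", columns=["latency_ms"])
print(latencies.to_pydict())
```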
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa (larsgeorge)
Keynote during BiDaTA 2013 in Genoa, a special track of the ADBIS 2013 conference. URL: http://dbdmg.polito.it/bidata2013/index.php/keynote-presentation
HBase Advanced Schema Design - Berlin Buzzwords - June 2012 (larsgeorge)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second. This talk addresses the architectural challenges when designing for either read or write performance imposed by HBase. It includes examples of real-world use-cases and how they were addressed.
http://berlinbuzzwords.de/sessions/advanced-hbase-schema-design
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012 (larsgeorge)
This document summarizes Lars George's presentation on moving from batch to real-time processing with Hadoop. It discusses using Hadoop (HDFS and MapReduce) for batch processing of large amounts of data and integrating real-time databases and stream processing tools like HBase and Storm to enable faster querying and analytics. Example architectures shown combine batch and real-time systems by using real-time tools to process streaming data and periodically syncing results to Hadoop and HBase for long-term storage and analysis.
Realtime Analytics with Hadoop and HBase (larsgeorge)
The document discusses realtime analytics using Hadoop and HBase. It begins by introducing the speaker and their experience. It then discusses moving from batch processing with Hadoop to more realtime needs, and how systems like HBase can help bridge that gap. Several designs are presented for using HBase and Hadoop together to enable both realtime and batch analytics on large datasets.
How Can I use the AI Hype in my Business Context? (Daniel Lehner)
Is AI just hype? Or is it the game changer your business needs?
Everyone’s talking about AI, but is anyone really using it to create real value?
Most companies want to leverage AI. Few know how.
✅ What exactly should you ask to find real AI opportunities?
✅ Which AI techniques actually fit your business?
✅ Is your data even ready for AI?
If you’re not sure, you’re not alone. This is a condensed version of the slides I presented at a Linkedin webinar for Tecnovy on 28.04.2025.
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf (Software Company)
Explore the benefits and features of advanced logistics management software for businesses in Riyadh. This guide delves into the latest technologies, from real-time tracking and route optimization to warehouse management and inventory control, helping businesses streamline their logistics operations and reduce costs. Learn how implementing the right software solution can enhance efficiency, improve customer satisfaction, and provide a competitive edge in the growing logistics sector of Riyadh.
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I... (Impelsys Inc.)
Impelsys provided a robust testing solution, leveraging a risk-based and requirement-mapped approach to validate ICU Connect and CritiXpert. A well-defined test suite was developed to assess data communication, clinical data collection, transformation, and visualization across integrated devices.
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API (UiPathCommunity)
Join this UiPath Community Berlin meetup to explore the Orchestrator API, Swagger interface, and the Test Manager API. Learn how to leverage these tools to streamline automation, enhance testing, and integrate more efficiently with UiPath. Perfect for developers, testers, and automation enthusiasts!
📕 Agenda
Welcome & Introductions
Orchestrator API Overview
Exploring the Swagger Interface
Test Manager API Highlights
Streamlining Automation & Testing with APIs (Demo)
Q&A and Open Discussion
Perfect for developers, testers, and automation enthusiasts!
👉 Join our UiPath Community Berlin chapter: https://community.uipath.com/berlin/
This session streamed live on April 29, 2025, 18:00 CET.
Check out all our upcoming UiPath Community sessions at https://community.uipath.com/events/.
Technology Trends in 2025: AI and Big Data Analytics (InData Labs)
At InData Labs, we have been keeping an ear to the ground, looking out for AI-enabled digital transformation trends coming our way in 2025. Our report will provide a look into the technology landscape of the future, including:
-Artificial Intelligence Market Overview
-Strategies for AI Adoption in 2025
-Anticipated drivers of AI adoption and transformative technologies
-Benefits of AI and Big data for your business
-Tips on how to prepare your business for innovation
-AI and data privacy: Strategies for securing data privacy in AI models, etc.
Download your free copy now and implement the key findings to improve your business.
HCL Nomad Web – Best Practices and Managing Multiuser Environments (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-and-managing-multiuser-environments/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client upgrades will be installed “automatically” in the background. This significantly reduces the administrative footprint compared to traditional HCL Notes clients. However, troubleshooting issues in Nomad Web present unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how to simplify the troubleshooting process in HCL Nomad Web, ensuring a smoother and more efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser’s cache (using OPFS)
- Understanding the difference between single- and multi-user scenarios
- Utilizing Client Clocking
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfAbi john
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility that are elevating them into a new era of cryptocurrency.
Dev Dives: Automate and orchestrate your processes with UiPath MaestroUiPathCommunity
This session is designed to equip developers with the skills needed to build mission-critical, end-to-end processes that seamlessly orchestrate agents, people, and robots.
📕 Here's what you can expect:
- Modeling: Build end-to-end processes using BPMN.
- Implementing: Integrate agentic tasks, RPA, APIs, and advanced decisioning into processes.
- Operating: Control process instances with rewind, replay, pause, and stop functions.
- Monitoring: Use dashboards and embedded analytics for real-time insights into process instances.
This webinar is a must-attend for developers looking to enhance their agentic automation skills and orchestrate robust, mission-critical processes.
👨🏫 Speaker:
Andrei Vintila, Principal Product Manager @UiPath
This session streamed live on April 29, 2025, 16:00 CET.
Check out all our upcoming Dev Dives sessions at https://community.uipath.com/dev-dives-automation-developer-2025/.
Big Data Analytics Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxshyamraj55
We’re bringing the TDX energy to our community with 2 power-packed sessions:
🛠️ Workshop: MuleSoft for Agentforce
Explore the new version of our hands-on workshop featuring the latest Topic Center and API Catalog updates.
📄 Talk: Power Up Document Processing
Dive into smart automation with MuleSoft IDP, NLP, and Einstein AI for intelligent document workflows.
TrsLabs - Fintech Product & Business ConsultingTrs Labs
Hybrid Growth Mandate Model with TrsLabs
Strategic investments, inorganic growth, and business model pivots are critical activities that businesses don't undertake every day. In cases like these, it may benefit your business to engage a temporary external consultant.
An unbiased plan, driven by clear-cut deliverables and market dynamics and free from the influence of internal office politics, empowers business leaders to make the right choices.
Getting things done within budget and on time is key to growing a business, whether you are a start-up or a large company.
Talk to us & Unlock the competitive advantage
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, presentation slides, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
Artificial Intelligence is providing benefits in many areas of work within the heritage sector, from image analysis, to ideas generation, and new research tools. However, it is more critical than ever for people, with analogue intelligence, to ensure the integrity and ethical use of AI. Including real people can improve the use of AI by identifying potential biases, cross-checking results, refining workflows, and providing contextual relevance to AI-driven results.
News about the impact of AI often paints a rosy picture. In practice, there are many potential pitfalls. This presentation discusses these issues and looks at the role of analogue intelligence and analogue interfaces in providing the best results to our audiences. How do we deal with factually incorrect results? How do we get content generated that better reflects the diversity of our communities? What roles are there for physical, in-person experiences in the digital world?
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Aqusag Technologies
In late April 2025, a significant portion of Europe, particularly Spain, Portugal, and parts of southern France, experienced widespread, rolling power outages that continue to affect millions of residents, businesses, and infrastructure systems.
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungenpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-und-verwaltung-von-multiuser-umgebungen/
HCL Nomad Web is celebrated as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client updates are installed “automatically” in the background, which significantly reduces the administrative overhead compared to traditional HCL Notes clients. However, troubleshooting in Nomad Web presents unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how the troubleshooting process in HCL Nomad Web can be simplified to ensure a smooth and efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including:
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser's cache (using OPFS)
- Understanding the differences between single- and multi-user scenarios
- Using the Client Clocking feature
6. HBase Sizing Is...
• Making the most out of the cluster you have by...
– Understanding how HBase uses low-level resources
– Helping HBase understand your use-case by configuring it appropriately - and/or -
– Designing the use-case to help HBase along
• Being able to gauge how many servers are needed for a given use-case
8. HBase Dilemma
Although HBase can host many applications, they may require completely opposite features - for example, event-oriented use-cases (time series) versus entity-oriented use-cases (message stores).
9. Competing Resources
• Reads and Writes compete for the same low-level resources
– Disk (HDFS) and Network I/O
– RPC Handlers and Threads
– Memory (Java Heap)
• Otherwise, reads and writes exercise completely separate code paths
10. Memory Sharing
• By default every region server divides its memory (i.e. the given maximum heap) into:
– 40% for in-memory stores (write ops)
– 20% (or 40%) for block caching (read ops)
– The remaining space (here 40% or 20%, respectively) goes towards usual Java heap usage:
• Objects etc.
• Region information (HFile metadata)
• These memory shares need to be tweaked to fit the use-case (see the sketch below)
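These shares map to two hbase-site.xml properties - a minimal sketch, assuming a 0.94-era configuration (newer releases renamed the memstore limit to hbase.regionserver.global.memstore.size):

    <property>
      <name>hbase.regionserver.global.memstore.upperLimit</name>
      <value>0.4</value>   <!-- 40% of the heap for all memstores -->
    </property>
    <property>
      <name>hfile.block.cache.size</name>
      <value>0.2</value>   <!-- 20% of the heap for the block cache -->
    </property>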
11. Writes
• The cluster size is often determined by the write performance
– Simple schema design implies writing to all (entities) or only one region (events)
• Writes follow the log-structured merge tree approach:
– Store mutations in the in-memory store and the write-ahead log
– Flush out aggregated, sorted maps at a specified threshold - or - when under pressure
– Discard logs with no pending edits
– Perform regular compactions of store files
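To make the write path concrete, here is a minimal client-side put, a sketch using the 0.94-era API (table, family, and row key are hypothetical); each such mutation is appended to the write-ahead log and buffered in the memstore before any flush:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WritePath {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");             // hypothetical table
        Put put = new Put(Bytes.toBytes("row-20130101-0001")); // event-style row key
        put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("..."));
        table.put(put);   // appended to the WAL, then buffered in the memstore
        table.close();
      }
    }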
12. Writes: Flushes and Compactions
[Chart: store file size (MB, 0-1000) over time (older ➜ newer), illustrating flushes and compactions]
13. Flushes
• Every mutation call (put, delete etc.) causes a check for a flush
• If threshold is met, flush file to disk and schedule a compaction
– Try to compact newly flushed files quickly
• The compaction returns - if necessary - where a region should be split
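The threshold itself is the per-region flush size - a sketch of the hbase-site.xml entry (128MB was the default in this era):

    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>134217728</value>   <!-- 128MB per-region flush threshold, in bytes -->
    </property>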
14. Compaction Storms
• Premature flushing because of # of logs or memory pressure
– Files will be smaller than the configured flush size
• The background compactions are hard at work merging small flush files into the existing, larger store files
– Rewrite hundreds of MB over and over
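The knobs that govern when minor compactions kick in and when writers are blocked - a sketch, with defaults that vary by version:

    <property>
      <name>hbase.hstore.compactionThreshold</name>
      <value>3</value>    <!-- minimum store files before a minor compaction runs -->
    </property>
    <property>
      <name>hbase.hstore.blockingStoreFiles</name>
      <value>10</value>   <!-- block writes to a region once a store has this many files -->
    </property>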
15. Dependencies
• Flushes happen across all stores/column families, even if just one triggers it
• The flush size is compared to the size of all stores combined
– Many column families dilute the size
– Example: 55MB + 5MB + 4MB across three families adds up to 64MB, so a 64MB threshold triggers a flush that writes two tiny files alongside the larger one
16. Write-Ahead Log
• Currently only one per region server
– Shared across all stores (i.e. column families)
– Synchronized on file append calls
• Work being done on mitigating this
– WAL Compression
– Multithreaded WAL with Ring Buffer
– Multiple WALs per region server ➜ Start more than one region server per node?
17. Write-Ahead Log (cont.)
• Size set to 95% of default block size
– 64MB or 128MB, but check config!
• Keep number low to reduce recovery time
– Limit set to 32, but can be increased
• Increase size of logs - and/or - increase the number of logs before blocking
• Compute number based on fill distribution and flush frequencies
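Both limits are configurable - a sketch of the era's hbase-site.xml entries:

    <property>
      <name>hbase.regionserver.logroll.multiplier</name>
      <value>0.95</value>   <!-- roll the WAL at 95% of the HDFS block size -->
    </property>
    <property>
      <name>hbase.regionserver.maxlogs</name>
      <value>32</value>     <!-- force memstore flushes once this many WALs accumulate -->
    </property>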
18. Write-Ahead Log (cont.)
• Writes are synchronized across all stores
– A large cell in one family can stop all writes of another
– In this case the RPC handlers go binary, i.e. they either all work or all block
• Can be bypassed on writes, but means no real durability and no replication
– Maybe use coprocessor to restore dependent data sets (preWALRestore)
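Continuing the WritePath sketch above, the client-side bypass looks like this in the 0.94-era API (newer clients express the same thing as put.setDurability(Durability.SKIP_WAL)):

    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    put.setWriteToWAL(false);   // skip the WAL: faster, but no durability and no replication
    table.put(put);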
19. Some Numbers
• Typical write performance of HDFS is 35-50MB/s; dividing by the cell size gives a theoretical upper bound on operations per second:
Cell Size | OPS
0.5MB | 70-100
100KB | 350-500
10KB | 3500-5000 ??
1KB | 35000-50000 ????
This is way too high in practice - Contention!
20. Some More Numbers
• Under real-world conditions the rate is lower, more like 15MB/s or less
– Thread contention and serialization overhead cause a massive slowdown
Cell Size | OPS
0.5MB | 10
100KB | 100
10KB | 800
1KB | 6000
21. Write Performance
• There are many factors to the overall write performance of a cluster
– Key Distribution ➜ Avoid region hotspots
– Handlers ➜ Do not pile up too early
– Write-ahead log ➜ Bottleneck #1
– Compactions ➜ Badly tuned can cause ever increasing background noise
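One common remedy for hotspotting is salting the row key - a minimal sketch (the bucket count and the one-byte prefix layout are illustrative choices, not part of the original deck):

    import java.util.Arrays;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKeys {
      static final int BUCKETS = 16;   // illustrative: roughly the number of region servers

      // Prefix the key with a one-byte bucket derived from its hash so that
      // monotonically increasing keys (e.g., timestamps) spread across regions.
      public static byte[] salt(byte[] key) {
        int bucket = (Arrays.hashCode(key) & Integer.MAX_VALUE) % BUCKETS;
        return Bytes.add(new byte[] { (byte) bucket }, key);
      }
    }

The trade-off is on reads: gets must recompute the bucket, and scans must fan out across all buckets.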
22. Cheat Sheet
• Ensure you have enough, and large enough, write-ahead logs
• Ensure you do not oversubscribe available memstore space
• Ensure the flush size is set large enough, but not too large
• Check write-ahead log usage carefully
• Enable compression to store more data per node
• Tweak compaction algorithm to peg background I/O at some level
• Consider putting uneven column families in separate tables
• Check metrics carefully for block cache, memstore, and all queues
23. Example: Write to All Regions
• Java Xmx heap at 10GB
• Memstore share at 40% (default)
– 10GB Heap x 0.4 = 4GB
• Desired flush size at 128MB
– 4GB / 128MB = 32 regions max!
• For a WAL size of 128MB x 0.95
– 4GB / (128MB x 0.95) = ~33 partially uncommitted logs to keep around
• Region size at 20GB
– 20GB x 32 regions = 640GB raw storage used
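The same back-of-the-envelope math as a runnable sketch:

    public class WriteSizing {
      public static void main(String[] args) {
        double heapMB = 10 * 1024;          // 10GB Java heap
        double memstoreMB = heapMB * 0.4;   // 40% memstore share = 4096MB
        double flushSizeMB = 128;
        double walSizeMB = 128 * 0.95;      // WAL rolls at 95% of the block size

        System.out.printf("max written-to regions: %.0f%n", memstoreMB / flushSizeMB);       // 32
        System.out.printf("logs to keep around:    %.1f%n", memstoreMB / walSizeMB);         // ~33.7
        System.out.printf("raw storage used:       %.0fGB%n", (memstoreMB / flushSizeMB) * 20); // 640
      }
    }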
24. Notes
• Compute memstore sizes based on number of written-to regions x flush size
• Compute number of logs to keep based on fill and flush rate
• Ultimately the capacity is driven by
– Java Heap
– Region Count and Size
– Key Distribution
25. Reads
• Locate and route request to appropriate region server
– Client caches information for faster lookups
• Eliminate store files if possible using time ranges or Bloom filter
• Try block cache, if block is missing then load from disk
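Store-file elimination can be driven from the client - a sketch using the 0.94-era API (family name and timestamps are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.regionserver.StoreFile;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReadPruning {
      // A scan restricted to a time range lets the server skip store files
      // whose recorded time ranges do not overlap the requested window.
      public static Scan timeBoundedScan() throws IOException {
        Scan scan = new Scan();
        scan.setTimeRange(1357000000000L, 1357086400000L);  // hypothetical window
        scan.addFamily(Bytes.toBytes("d"));                 // hypothetical family
        return scan;
      }

      // A ROW Bloom filter, set on the column family at table-creation time,
      // lets point gets skip store files that cannot contain the row.
      public static HColumnDescriptor bloomedFamily() {
        HColumnDescriptor family = new HColumnDescriptor("d");
        family.setBloomFilterType(StoreFile.BloomType.ROW);
        return family;
      }
    }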
27. Writes: Where's the Data at?
[Chart: store file size (MB, 0-1000) over time (older ➜ newer); series: Existing Row Mutations vs. Unique Row Inserts]
28. Block Cache
• Use exported metrics to see effectiveness of block cache
– Check fill and eviction rate, as well as hit ratios ➜ random reads are not ideal
• Tweak up or down as needed, but watch overall heap usage
• You absolutely need the block cache
– Set it to at least 10%, even if just for short-term benefits
40. HBase Heap Usage
• The overall addressable amount of data is driven by the heap size
– Only read-from regions need space for indexes and filters
– Written-to regions also need MemStore space
• Java heap space is still limited, as garbage collections will cause pauses
– Typically up to 20GB heap
– Or invest in a pause-less GC
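In hbase-env.sh that might look like the sketch below (the heap value and GC flags are era-typical illustrations, not recommendations):

    # hbase-env.sh (sketch)
    export HBASE_HEAPSIZE=20480   # region server heap in MB (~20GB)
    export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"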