Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, GetInData
Did you like it? Check out our E-book: Apache NiFi - A Complete Guide
https://ebook.getindata.com/apache-nifi-complete-guide
Apache NiFi is one of the most popular services for running ETL pipelines, even though it is not the youngest technology. The talk covers all the details of migrating pipelines from an old Hadoop platform to Kubernetes, managing everything as code, monitoring NiFi's corner cases and making it a robust solution that is user-friendly even for non-programmers.
Author: Albert Lewandowski
Linkedin: https://www.linkedin.com/in/albert-lewandowski/
___
Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of the best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including, among others, Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
Ever tried to get clarity on what kinds of memory there are and how to tune each of them? If not, very likely your jobs are configured incorrectly. As we found out, it is not straightforward and it is not well documented either. This session will provide information on the types of memory to be aware of, the calculations involved in determining how much is allocated to each type of memory and how to tune it depending on the use case.
Understanding and Optimizing Metrics for Apache Kafka Monitoring - SANG WON PARK
As Apache Kafka takes on a bigger and more important role in big data architectures, concerns about its performance grow as well.
Working across various projects, I studied the metrics needed to monitor Apache Kafka and summarized the configuration settings used to optimize them.
[Understanding and Optimizing Metrics for Apache Kafka Monitoring]
Covers the metrics needed for Apache Kafka performance monitoring and summarizes how to optimize performance from four perspectives (throughput, latency, durability, availability). Performance optimization is described for each of the three modules that make up Kafka (producer, broker, consumer) …
[Understanding Metrics for Apache Kafka Monitoring]
To monitor the state of Apache Kafka, you need to look at the metrics coming from four sources: the system (OS), producers, brokers and consumers.
This article focuses on the JMX metrics exposed by the JVM and summarizes the producer/broker/consumer indicators.
Not every metric is covered; the summary concentrates on the indicators I found meaningful.
[Optimizing Apache Kafka Performance Configuration]
Performance goals are divided into four categories (throughput, latency, durability, availability), and for each goal the relevant Kafka configuration adjustments are summarized.
After applying the tuned parameters, run performance tests and monitor the resulting metrics to keep optimizing for the workload at hand.
This document discusses solutions for generating unique identifiers at high speeds. It compares auto-increment, UUID, hash, and Snowflake approaches. Snowflake is highlighted as able to generate up to 4 billion IDs per second while maintaining order, supporting distribution and sharding, and providing security benefits. The document outlines how Snowflake works by combining a timestamp, node ID determined via file, random number, IP address or ZooKeeper, and an increasing sequence number stored in Redis to generate the IDs at high speeds with strong ordering properties.
The document discusses Apache NiFi and its role in the Hadoop ecosystem. It provides an overview of NiFi, describes how it can be used to integrate with Hadoop components like HDFS, HBase, and Kafka. It also discusses how NiFi supports stream processing integrations and outlines some use cases. The document concludes by discussing future work, including improving NiFi's high availability, multi-tenancy, and expanding its ecosystem integrations.
Like many other messaging systems, Kafka has put limit on the maximum message size. User will fail to produce a message if it is too large. This limit makes a lot of sense and people usually send to Kafka a reference link which refers to a large message stored somewhere else. However, in some scenarios, it would be good to be able to send messages through Kafka without external storage. At LinkedIn, we have a few use cases that can benefit from such feature. This talk covers our solution to send large message through Kafka without additional storage.
This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include:
- Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries.
- Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering.
- The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans.
- Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.
Aljoscha Krettek is the PMC chair of Apache Flink and Apache Beam, and co-founder of data Artisans. Apache Flink is an open-source platform for distributed stream and batch data processing. It allows for stateful computations over data streams in real-time and historically. Flink supports batch and stream processing using APIs like DataSet and DataStream. Data Artisans originated Flink and provides an application platform powered by Flink and Kubernetes for building stateful stream processing applications.
The document discusses various techniques for profiling CPU and memory performance in Rust programs, including:
- Using the flamegraph tool to profile CPU usage by sampling a running process and generating flame graphs.
- Integrating pprof profiling into Rust programs to expose profiles over HTTP similar to how it works in Go.
- Profiling heap usage by integrating jemalloc profiling and generating heap profiles on program exit.
- Some challenges with profiling asynchronous Rust programs due to the lack of backtraces.
The key takeaways are that there are crates like pprof-rs and techniques like jemalloc integration that allow collecting CPU and memory profiles from Rust programs, but profiling asynchronous programs
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli... - Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
Practical learnings from running thousands of Flink jobs - Flink Forward
Flink Forward San Francisco 2022.
Task Managers constantly running out of memory? Flink job keeps restarting from cryptic Akka exceptions? Flink job running but doesn’t seem to be processing any records? We share practical learnings from running thousands of Flink Jobs for different use-cases and take a look at common challenges they have experienced such as out-of-memory errors, timeouts and job stability. We will cover memory tuning, S3 and Akka configurations to address common pitfalls and the approaches that we take on automating health monitoring and management of Flink jobs at scale.
by
Hong Teoh & Usamah Jassat
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud - Noritaka Sekiyama
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake - Databricks
Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational OLTP database and replays these changes in a timely manner to external storage such as Delta or Kudu for real-time OLAP. To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle OLTP source schema changes, and whether it is easy to support a variety of databases with little code.
Time-to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey... - Flink Forward
A common requirement for many stateful streaming applications is to automatically cleanup application state for effective management of state size and visibility. The state time-to-live (TTL) feature enables application state cleanup in Apache Flink.
In this talk, we will first discuss the State TTL feature and its use cases. We will then outline the semantics of the feature and provide code examples before taking a closer look at the implementation details to tackle the encountered challenges associated with the background cleanup process. Finally, we will talk about the roadmap of the TTL feature including potential improvements of the feature in future Flink releases.
Apache Kafka Fundamentals for Architects, Admins and Developers - Confluent
This document summarizes a presentation about Apache Kafka. It introduces Apache Kafka as a modern, distributed platform for data streams made up of distributed, immutable, append-only commit logs. It describes Kafka's scalability similar to a filesystem and guarantees similar to a database, with the ability to rewind and replay data. The document discusses Kafka topics and partitions, partition leadership and replication, and provides resources for further information.
Apache Tez - A New Chapter in Hadoop Data Processing - DataWorks Summit
Apache Tez is a framework for accelerating Hadoop query processing. It is based on expressing a computation as a dataflow graph and executing it in a highly customizable way. Tez is built on top of YARN and provides benefits like better performance, predictability, and utilization of cluster resources compared to traditional MapReduce. It allows applications to focus on business logic rather than Hadoop internals.
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ... - HostedbyConfluent
At Stripe, we operate a general ledger modeled as double-entry bookkeeping for all financial transactions. Warehousing such data is challenging due to its high volume and high cardinality of unique accounts.
Furthermore, it is financially critical to get up-to-date, accurate analytics over all records. Due to the changing nature of real-time transactions, it is impossible to pre-compute the analytics as a fixed time series. We have overcome the challenge by creating a real-time key-value store inside Pinot that can sustain half a million QPS across all the financial transactions.
We will talk about the details of our solution and the interesting technical challenges faced.
Git 101 - Crash Course in Version Control using Git - Geoff Hoffman
Find out why more and more developers are switching to Git - distributed version control. This intro to Git covers the basics, from cloning to pushing for beginners.
Exactly-Once Financial Data Processing at Scale with Flink and Pinot - Flink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Introduction to Apache Flink - Fast and reliable big data processing - Till Rohrmann
This presentation introduces Apache Flink, a massively parallel data processing engine which currently undergoes the incubation process at the Apache Software Foundation. Flink's programming primitives are presented and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it a unique system in the world of Big Data processing.
Apache Kafka is an open-source distributed event streaming platform used for building real-time data pipelines and streaming apps. It was developed by LinkedIn in 2011 to solve problems with data integration and processing. Kafka uses a publish-subscribe messaging model and is designed to be fast, scalable, and durable. It allows both streaming and storage of data and acts as a central data backbone for large organizations.
Real-time Analytics with Trino and Apache Pinot - Xiang Fu
Trino summit 2021:
Overview of Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support to the power of Apache Pinot's realtime analytics, giving you the best of both worlds.
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. It provides a unified, scalable, and durable platform for handling real-time data feeds. Kafka works by accepting streams of records from one or more producers and organizing them into topics. It allows both storing and forwarding of these streams to consumers. Producers write data to topics which are replicated across clusters for fault tolerance. Consumers can then read the data from the topics in the order it was produced. Major companies like LinkedIn, Yahoo, Twitter, and Netflix use Kafka for applications like metrics, logging, stream processing and more.
This document discusses JavaScript performance best practices. It covers loading and execution performance, DOM scripting performance, and patterns to minimize repaints and reflows. Some key points include batching DOM changes, event delegation to reduce event handlers, and taking elements out of the document flow during animations. References are provided to resources on JavaScript performance testing and design patterns.
Presentation to the MIT IAP HTML5 Game Development Class on Debugging and Optimizing Javascript, Local storage, Offline Storage and Server side Javascript with Node.js
ASTs are an incredibly powerful tool for understanding and manipulating JavaScript. We'll explore this topic by looking at examples from ESLint, a pluggable static analysis tool, and Browserify, a client-side module bundler. Through these examples we'll see how ASTs can be great for analyzing and even for modifying your JavaScript. This talk should be interesting to anyone that regularly builds apps in JavaScript either on the client-side or on the server-side.
Java 7 was released in July 2011 with improvements to performance, concurrency, and memory management. Plans for Java 8 include modularity, lambda expressions, and date/time APIs. The Java Community Process is also being improved to increase transparency, participation, and agility through JSR 348. Overall, the Java ecosystem continues to grow with new languages on the JVM and an active community.
WebNet Conference 2012 - Designing complex applications using html5 and knock... - Fabio Franzini
This document provides an overview of designing complex applications using HTML5 and KnockoutJS. It discusses HTML5 and why it is useful, introduces JavaScript and frameworks like KnockoutJS and SammyJS that help manage complexity. It also summarizes several JavaScript libraries and patterns including the module pattern, revealing module pattern, and MV* patterns. Specific libraries and frameworks discussed include RequireJS, AmplifyJS, UnderscoreJS, and LINQ.js. The document concludes with a brief mention of server-side tools like ScriptSharp.
Mock what? What Mock? Learn what mocking is and how to use mocking with ColdFusion testing, development, and continuous integration. Look at mocking and stubbing with a touch of theory and a lot of examples, including what you could test, what you should test… and what you shouldn't test (but might be fun).
Javascript done right - Open Web Camp III - Dirk Ginader
The document discusses what makes for "good" JavaScript from a developer perspective. It argues that good JavaScript should be understandable, reusable, extensible, optimized, secure, internationalized, optional, and accessible. It provides examples and recommendations for each of these qualities, such as using clear naming and documentation to improve understandability, writing code in a modular and structured way to improve reusability, and using frameworks to avoid reinventing patterns. It also discusses performance optimizations like reducing DOM reflows and different techniques for modifying the DOM efficiently.
Spring Day | Spring and Scala | Eberhard Wolff - JAX London
2011-10-31 | 09:45 AM - 10:30 AM
Spring is widely used in the Java world - but does it make any sense to combine it with Scala? This talk gives an answer and shows how and why Spring is useful in the Scala world. All areas of Spring such as Dependency Injection, Aspect-Oriented Programming and the Portable Service Abstraction as well as Spring MVC are covered.
This document provides an agenda and overview for a presentation on JavaScript. It discusses JavaScript's history and popularity, current implementations of JavaScript engines in browsers, and proliferation of JavaScript frameworks. The agenda outlines discussing objects, functions, scope, primitives, common mistakes, inheritance, best practices, modularity, and more. It also includes code examples demonstrating functions, closures, scope, operators, and error handling in JavaScript.
He will start you at the beginning and cover the prerequisites, setting up your development environment first. Afterward, you will use npm to install react-native-cli. The CLI is our go-to tool; we use it to create and deploy our app.
Next, you will explore the code. React Native will look familiar to all React developers, since it is React. The main difference between React in the browser and on a mobile device is the lack of a DOM. We take a look at many of the different UI components that are available.
With React Native you have access to all of the device's hardware features, like cameras, GPS, fingerprint reader and more, so we'll show some JavaScript code samples demonstrating them. We will wrap up the evening by deploying our app to both iOS and Android devices, with tips on getting ready for both app stores.
Why you should be using the shiny new C# 6.0 features now! - Eric Phan
C# 6.0 will change the way you write C#. There are many language features that are so much more efficient you’ll wonder why they weren’t there since the beginning.
This document provides an introduction to JavaScript, including what JavaScript is used for, how it interacts with HTML and CSS, and some basic JavaScript concepts. JavaScript allows making web pages interactive by inserting dynamic text, reacting to events like clicks, performing calculations, and getting information about the user's computer. It is commonly used for calculations, waiting for and responding to events, and manipulating HTML tags. The document discusses JavaScript's role on the client-side, using variables, data types, operators, arrays, functions, and the console for debugging. It provides examples of declaring variables, strings, logical operators, arrays, and functions.
Spring Data Requery is an alternative to Spring Data JPA.
Requery is a lightweight ORM for various DBMSs (MySQL, PostgreSQL, H2, SQLite, Oracle, SQL Server).
Spring Data Requery provides Query by Native Query, Query by Example and Query by Property, like Spring Data JPA.
Spring Data Requery offers better performance than JPA.
This document summarizes a presentation about rapid prototyping with Solr. It discusses getting documents indexed into Solr quickly, adjusting Solr's schema to better match needs, and showcasing data in a flexible search UI. It outlines how to leverage faceting, highlighting, spellchecking and debugging in rapid prototyping. Finally, it discusses next steps in developing a search application and taking it to production.
Java 9 is expected to include several new features and changes, including:
- New collection factory methods like Set.of() and Map.of() that provide immutable collections.
- Enhancements to the Stream API such as takeWhile() and dropWhile().
- Syntax changes like allowing effectively final variables in try-with-resources and @SafeVarargs for private methods.
- The addition of JShell to provide a Java REPL.
- Garbage First (G1) garbage collector becoming the default collector.
- Various performance and logging improvements.
[Session given at Engage 2019, Brussels, 15 May 2019]
In this session, Tim Davis (Technical Director at The Turtle Partnership Ltd) takes you through the new Domino Query Language (DQL), how it works, and how to use it in LotusScript, in Java, and in the new domino-db Node.js module. Introduced in Domino 10, DQL provides a simple, efficient and powerful search facility for accessing Domino documents. Originally only used in the domino-db Node.js module, with 10.0.1 DQL also became available to both LotusScript and Java. This presentation will provide code examples in all three languages, ensuring you will come away with a good understanding of DQL and how to use it in your projects.
Front end fundamentals session 1: javascript core - Web Zhao
This document provides an overview of JavaScript fundamentals presented in a session on JavaScript core concepts. It defines what JavaScript is, demonstrates basic syntax and data types including numbers, strings, Booleans, objects and arrays. It also covers control structures, functions, scope, and built-in objects like Date. The document contains examples and links to interactive demos of JavaScript concepts.
Schibsted collects and analyzes 900 million events/day using AWS. This presentation gives an overview of the systems and architecture, including the solutions to GDPR.
NoSQL databases were created to solve scalability problems with SQL databases. It turns out these problems are profoundly connected with Einstein's theory of relativity (no, honestly), and understanding this illuminates the SQL/NoSQL divide in surprising ways.
The document discusses the traditional farmhouse brewing of maltøl beer in various regions of Norway. It provides details on the brewing processes used historically and in some places still today, including ingredients like malted barley, juniper, and distinctive kveik yeast strains. The document also outlines some of the cultural and social traditions surrounding farmhouse brewing in Norway.
This document discusses integrating additional systems with Mattilsynet's archive using semantic technologies. It proposes:
1. Integrating WebCruiter, a recruiting system, with ePhorte using RDF to provide a simple first step toward a new architecture. This can be done inexpensively and easily extended to other integrations.
2. Basing all integrations on RDF and SDShare feeds to allow dynamic data flows without hard bindings between code and data models, making the system more flexible to changes.
3. Using SESAM principles including extracting data in its native form, translating as needed, and managing changes through configuration rather than code for easier maintenance as systems evolve.
This document summarizes Ted Dunning's approach to recommendations based on his 1993 paper. The approach involves:
1. Analyzing user data to determine which items are statistically significant co-occurrences
2. Indexing items in a search engine with "indicator" fields containing IDs of significantly co-occurring items
3. Providing recommendations by searching the indicator fields for a user's liked items
The approach is demonstrated in a simple web application using the MovieLens dataset. Further work could optimize and expand on the approach.
This document discusses the importance of open cultural data and linked open data. It provides examples of how cultural data can be used by various sectors like publishing, travel, and media. The document explains key concepts of linked open data including using URLs for identifiers, the RDF data model, SPARQL query language, and linking data to make connections between datasets. It also discusses challenges in linking cultural data from different sources and introduces record linkage tools that can connect similar records without common IDs based on attributes. The goal is to make more open cultural data accessible and interlinked through applying linked open data practices and technologies.
NoSQL databases, the CAP theorem, and the theory of relativity - Lars Marius Garshol
The document discusses NoSQL databases and the CAP theorem. It begins by providing an overview of NoSQL databases, their key features like being schemaless and supporting eventual consistency over ACID transactions. It then explains the CAP theorem - that a distributed system can only provide two of consistency, availability, and partition tolerance. It also discusses how Google's Spanner database achieves consistency and scalability using ideas from Lamport's Paxos algorithm and a new time service called TrueTime.
- Bitcoin is a digital currency based on cryptography. Transactions are recorded on a decentralized peer-to-peer network, without a central authority.
- The document discusses how the Bitcoin protocol works, including how the blockchain solves the double spending problem and incentivizes miners to verify transactions through cryptocurrency rewards.
- While Bitcoin has potential advantages like low fees and no central control, there are also concerns about its ability to replace national currencies, provide true anonymity, and be regulated by governments.
This document provides an introduction to machine learning. It begins with an agenda that lists topics such as introduction, theory, top 10 algorithms, recommendations, classification with naive Bayes, linear regression, clustering, principal component analysis, MapReduce, and conclusion. It then discusses what big data is and how data is accumulating at tremendous rates from various sources. It explains the volume, variety, and velocity aspects of big data. The document also provides examples of machine learning applications and discusses extracting insights from data using various algorithms. It discusses issues in machine learning like overfitting and underfitting data and the importance of testing algorithms. The document concludes that machine learning has vast potential but is very difficult to realize that potential as it requires strong mathematics skills.
Hops are used in beer for bitterness, as a preservative, and to add flavor. They balance the sweetness from malt. The classic hoppy beer is IPA, originally from England but reinvented in the US with resiny and citrus flavors. Imperial IPAs have even more hops. Cascade was an early American variety that drove West Coast IPA popularity. Common aromatic varieties now include Cascade, Centennial, Chinook, Citra and Amarillo. Bitterness is measured in IBUs while BU/GU considers bitterness relative to sweetness. The evening's beers include IPAs and an American pale ale made with popular varieties like Chinook, Columbus, Amarillo and Cent
Big Data 101 provides an overview of big data concepts. It defines big data as data that is too large to fit into a typical database or spreadsheet due to its volume, variety and velocity. It discusses how data is accumulating rapidly from various sources and the challenges of storing and processing all this data. It also introduces common big data techniques like MapReduce and how they can be used to extract insights from large, unstructured data sets.
This document discusses Linked Open Data and how to publish open government data. It explains that publishing data in open, machine-readable formats and linking it to other external data sources increases its value. It provides examples of published open government data and outlines best practices for making data open through licensing, standard formats like CSV and XML, using URIs as identifiers, and linking to related external data. The key benefits outlined are empowering others to build upon the data and improving transparency, competition and innovation.
This document provides an overview of the SESAM project, which aims to increase the usage and quality of an archive system for an energy company by automatically enriching document metadata and connecting documents to structured business data. It describes how metadata is extracted from source systems into a triple store using separate ontologies for each system. Documents can then be searched across systems and metadata can be translated between them. When archiving documents, additional metadata is automatically attached based on information from the triple store.
Approximate string comparators measure the similarity between two strings when an exact match is insufficient. Levenshtein distance measures the minimum number of edit operations (insert, remove, substitute characters) required to change one string into another. Jaro-Winkler distance compares characters and transpositions within a threshold and is commonly used for name comparisons. Soundex and Metaphone produce phonetic codes for strings to match similar-sounding names irrespective of spelling variations. There are many string similarity measures for different use cases.
Genetic programming is used to evolve data matching configurations that maximize accuracy on test data. The algorithm generates random initial configurations, evaluates them on the test data, and uses genetic operations of selection, crossover and mutation to evolve better configurations over generations. On several datasets, the genetic algorithm is able to find configurations that improve accuracy over manual configurations. However, the evolved configurations are not always intuitive and may represent local optima rather than global optima. More techniques from genetic programming literature could help address these issues.
Decision Trees in Artificial-Intelligence.pdf - Saikat Basu
Have you heard of something called 'Decision Tree'? It's a simple concept which you can use in life to make decisions. Believe you me, AI also uses it.
Let's find out how it works in this short presentation. #AI #Decisionmaking #Decisions #Artificialintelligence #Data #Analysis
https://saikatbasu.me
This comprehensive Data Science course is designed to equip learners with the essential skills and knowledge required to analyze, interpret, and visualize complex data. Covering both theoretical concepts and practical applications, the course introduces tools and techniques used in the data science field, such as Python programming, data wrangling, statistical analysis, machine learning, and data visualization.
Lalit Wangikar, a partner at CKM Advisors, is an experienced strategic consultant and analytics expert. He started looking for data driven ways of conducting process discovery workshops. When he read about process mining the first time around, about 2 years ago, the first feeling was: “I wish I knew of this while doing the last several projects!".
Interviews are subject to all the whims human recollection is subject to: specifically, recency, simplification and self preservation. Interview-based process discovery, therefore, leaves out a lot of “outliers” that usually end up being one of the biggest opportunity area. Process mining, in contrast, provides an unbiased, fact-based, and a very comprehensive understanding of actual process execution.
4. Routing
• We send data to ~210 different destinations
• Filters on the data specify which data should go where
  • often very detailed conditions on many fields
• Full routing tree has ~600 filter/transform/sink nodes
5. Transforms
• Because of GDPR we need to anonymize most incoming data formats
• Some data has data quality issues that cannot be fixed at source, requires transforms to solve
• In many cases data needs to be transformed from one format to another
  • Pulse to Amplitude
  • Pulse to Adobe Analytics
  • ClickMeter to Pulse
• Convert data to match database structures
• …
6. Who configures?
• Schibsted has >100 business units
  • for Data Platform to do detailed configuration for all of these isn’t going to scale
  • for sites to do it themselves saves lots of time
• Configuration requires domain knowledge
  • each site has its own specialities in Pulse tracking
  • to transform and route these correctly requires knowing all this
9. What if?
• We had an expression language for JSON
  • something like, say, XPath for JSON
  • could write routing filters using that
• We had a transformation language for JSON
  • write as JSON template, using expression language to compute values to insert
• A custom routing language for both batch and streaming, based on this language
  • designed for easy expressivity & deploy
10. • Already existing query language for JSON
• https://stedolan.github.io/jq/
• Originally implemented in C
• there is a Java implementation, too
• Can do things like
• .foo
• .foo.bar
• .foo.bar > 25
• …
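To make the jq-style path expressions above concrete, here is a minimal Java/Jackson sketch of what .foo and .foo.bar mean against a sample event. The event content is invented for illustration, and Jackson's JSON Pointer syntax (/foo/bar) is used only as the closest built-in analogue, not as jq itself.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JqStyleLookup {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    // Hypothetical event, invented for illustration only
    JsonNode event = mapper.readTree("{\"foo\": {\"bar\": 42}, \"type\": \"View\"}");

    JsonNode foo = event.get("foo");          // jq: .foo      -> {"bar": 42}
    JsonNode bar = event.at("/foo/bar");      // jq: .foo.bar  -> 42
    boolean keep = bar.asInt() > 25;          // jq: .foo.bar > 25

    System.out.println(foo + " " + bar + " " + keep);
  }
}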
14. Proof-of-concept
• Implement real-world transforms in this language
  • before it was implemented
• Helped improve and solidify the design
• Verified that the language could do what we needed
• Transforms looked quite reasonable.
15. A simple language
• JSON is written in JSON syntax
• evaluates to itself
• if <expr> <expr> else <expr>
• [for <expr> <expr>]
• let <name> = <expr>
• ${ … jq … }
17. Stunt prototype
• Most of it implemented in two days
• Implemented in Scala
• using Antlr 3 to generate the parser
• jackson-jq for jq
• jackson for JSON
• A simple object tree interpreter
• Constructor.construct(Context, JsonNode) => JsonNode
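As a rough illustration of what "a simple object tree interpreter" can look like (a hand-written sketch, not the actual prototype code, and omitting the Context parameter the prototype used for variables): each node of the parsed expression tree evaluates itself against the input JsonNode, and composite nodes delegate to their children.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.BooleanNode;
import com.fasterxml.jackson.databind.node.NullNode;

// Every expression node evaluates itself against the input event
interface ExpressionNode {
  JsonNode apply(JsonNode input);
}

// Corresponds to ".foo": look up a key in the input object
class DotKey implements ExpressionNode {
  private final String key;
  DotKey(String key) { this.key = key; }
  public JsonNode apply(JsonNode input) {
    JsonNode value = input.get(key);
    return value == null ? NullNode.getInstance() : value;
  }
}

// Corresponds to "<left> == <right>": evaluate both children, then compare
class EqualsNode implements ExpressionNode {
  private final ExpressionNode left, right;
  EqualsNode(ExpressionNode left, ExpressionNode right) { this.left = left; this.right = right; }
  public JsonNode apply(JsonNode input) {
    return BooleanNode.valueOf(left.apply(input).equals(right.apply(input)));
  }
}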
22. The parser
• Code that takes a character stream and builds the expression tree
• Use a parser generator to handle the difficult part
  • requires writing a grammar
• Parser generator produces Abstract Syntax Tree
  • basically corresponds to the grammar structure
25. Language in use
• Implemented Data Quality Tooling using jq
• filters done in jq
• Implemented routing using jq filters
• and transforms in JSLT
• Wrote some transforms using the language
• anonymization of tracking data
• cleanup transforms to handle bad data
• …
26. The good
• The language works
• proven by DQT, routing, and transforms
• Minimal implementation effort required
• Users approved of the language
• general agreement it was a major improvement
• people started writing their own transforms
27. The bad
• Performance could be better
• not horrible, but not great, either
• The ${ … } wrappers are really ugly
• jq
• does not handle missing data well
• has dangerous features
• has weird and difficult syntax for some things
• Too many dependencies
• Scala runtime (with versioning issues)
• Antlr runtime
28. 2.0
• Implement the complete language ourselves
• goodbye ${ … }
• Get rid of the jq strangeness
• Add some new functionality
• Implement in pure Java with JavaCC
• JavaCC has no runtime dependencies
• only dependency is Jackson
29. JSLT expressions
.foo Get “foo” key from input object
.foo.bar Get “foo”, then “.bar” on that
.foo == 231 Comparison
.foo and .bar < 12 Boolean operator
$baz.foo Variable reference
test(.foo, “^[a-z0-9]+$”) Functions (& regexps)
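To try expressions like these from Java, the open-source JSLT library can be driven with a couple of calls. The snippet below follows the usage documented in the project README; the sample event and the transform are invented for illustration.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.schibsted.spt.data.jslt.Expression;
import com.schibsted.spt.data.jslt.Parser;

public class JsltDemo {
  public static void main(String[] args) throws Exception {
    JsonNode input = new ObjectMapper().readTree("{\"foo\": {\"bar\": 30}}");  // invented event

    // Build a new object from pieces of the input
    Expression expr = Parser.compileString("{\"id\": .foo.bar, \"big\": .foo.bar > 25}");
    System.out.println(expr.apply(input));   // {"id":30,"big":true}
  }
}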
42. Turing-complete?
• Means that the language can express any computation
• It’s known that all that’s required is
  • conditionals (we have if tests)
  • recursion (our functions can call themselves)
• But can this really be true?
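Since both ingredients are in the language, a recursive function is easy to write. This is a hand-made example, not from the talk, with the syntax as I understand it from the JSLT documentation (def declarations, parameters referenced as variables, if with a parenthesized condition):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.schibsted.spt.data.jslt.Expression;
import com.schibsted.spt.data.jslt.Parser;

public class RecursionDemo {
  public static void main(String[] args) throws Exception {
    // Conditionals + self-calling function = recursion
    String program =
        "def fact(n)\n" +
        "  if ($n < 2) 1\n" +
        "  else $n * fact($n - 1)\n" +
        "\n" +
        "fact(.n)";
    Expression expr = Parser.compileString(program);
    System.out.println(expr.apply(new ObjectMapper().readTree("{\"n\": 5}")));  // 120
  }
}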
43. N-queens
• Write a function that takes the size of the chessboard and returns it with queens
• queens(4) =>
  [
    [ 0, 1, 0, 0 ],
    [ 0, 0, 0, 1 ],
    [ 1, 0, 0, 0 ],
    [ 0, 0, 1, 0 ]
  ]
https://github.com/schibsted/jslt/blob/master/examples/queens.jslt
45. Danger?
• It’s possible to implement operations that run forever
• But in practice the stack quickly gets too deep
• The JVM will then terminate the transform
46. Performance
• 5-10 times faster than 1.0
• The main difference: no more jackson-jq
• jackson-jq is not very efficient
• internal model is List<JsonNode>
• creates many unnecessary objects during evaluation
• does work at run-time that should be done at compile-time
47. JSLT improvements
• Value model is JsonNode
• can usually just return data from input object or from code
• Efficient internal structures
• all collections are arrays
• very fast to traverse
• Boolean short-circuiting
• once we know the result, stop evaluating
• Cache regular expressions to avoid recompiling
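Caching compiled regular expressions is a generally applicable trick. A minimal sketch of the idea (not the actual JSLT code): compile each regex string once and reuse the Pattern on every event.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

public class RegexCache {
  private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

  // Compile on first use, then reuse the compiled Pattern for all later events
  public static boolean test(String input, String regex) {
    Pattern p = CACHE.computeIfAbsent(regex, Pattern::compile);
    return p.matcher(input).find();
  }
}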
48. The optimizer
• An optimizer is a function that takes an expression and outputs an expression such that
  • the new expression is at least as fast, and
  • always outputs the same value
• Improves performance quite substantially even with very simple techniques
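One example of such a simple technique is constant folding: if both operands of an operator are literals, evaluate it once at compile time. The sketch below uses hypothetical node classes, not the real JSLT AST.

// Hypothetical node types for illustration; the real optimizer works on JSLT's own AST
interface Expr { }
record Literal(double value) implements Expr { }
record Plus(Expr left, Expr right) implements Expr { }

class Optimizer {
  // Returns an expression that is at least as fast and always yields the same value
  static Expr optimize(Expr e) {
    if (e instanceof Plus p) {
      Expr left = optimize(p.left());
      Expr right = optimize(p.right());
      if (left instanceof Literal l && right instanceof Literal r)
        return new Literal(l.value() + r.value());   // fold "2 + 3" into "5" at compile time
      return new Plus(left, right);
    }
    return e;
  }
}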
51. Performance
• Test case: pulse-cleanup.jslt, real data, my laptop
• a complicated transform: 165 lines
• Transforms 132,000 events/second in one thread
• 1.0 did about 20,000 events/second
52. Three strategies
• Syntax tree interpreter
• known to be the slowest approach
• Bytecode compiler with virtual machine
• C version of jq does this
• Java does that (until the JIT kicks in)
• Python does this
• Native compilation
• what JIT compiler in Java does
53. Designing a VM
Opcode        Param
DUP
MKOBJ
CALL          <func>

int[] bytecode;
JsonNode[] stack;
int top;

switch (opcode) {
  case DUP:
    stack[++top] = stack[top-1];
    break;
  case MKOBJ:
    stack[++top] = mapper.createObj…
    break;
  // …
}
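To make those fragments concrete, here is a small self-contained toy stack machine in the same spirit. The opcodes and the SET_PAYLOAD instruction are invented for this sketch; the actual JSLT VM prototype looked different.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class ToyVm {
  // Invented opcodes, just to illustrate the design
  static final int DUP = 1, MKOBJ = 2, SET_PAYLOAD = 3;
  static final ObjectMapper mapper = new ObjectMapper();

  static JsonNode run(int[] bytecode, JsonNode input) {
    JsonNode[] stack = new JsonNode[16];
    int top = -1;
    stack[++top] = input;                              // the input event starts on the stack
    for (int pc = 0; pc < bytecode.length; pc++) {
      switch (bytecode[pc]) {
        case DUP:
          stack[top + 1] = stack[top]; top++;          // duplicate top of stack
          break;
        case MKOBJ:
          stack[++top] = mapper.createObjectNode();    // push a fresh, empty object
          break;
        case SET_PAYLOAD: {                            // pop object and value, set key, push object
          ObjectNode obj = (ObjectNode) stack[top--];
          JsonNode value = stack[top--];
          obj.set("payload", value);
          stack[++top] = obj;
          break;
        }
      }
    }
    return stack[top];
  }

  public static void main(String[] args) throws Exception {
    JsonNode input = mapper.readTree("{\"foo\": 1}");
    // Roughly equivalent to the transform {"payload": .}
    System.out.println(run(new int[]{ MKOBJ, SET_PAYLOAD }, input));  // {"payload":{"foo":1}}
  }
}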
54. Compiler
• Traverse down the object tree
  • emit bytecode as you go
• Stack nature of the VM matches object tree structure
  • each Expression produces code that leaves the value of that expression on the stack
• Example:
  • MKARR, <first value>, ARRADD, <second>, ARRADD, …
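A sketch of the traversal-and-emit idea for the array example above, again with hypothetical node classes; only the opcode names follow the slide.

import java.util.List;

// Hypothetical AST classes for the sketch
interface Node { void compile(List<Integer> out); }

class Opcodes { static final int MKARR = 10, ARRADD = 11; }

class ArrayLiteral implements Node {
  final List<Node> elements;
  ArrayLiteral(List<Node> elements) { this.elements = elements; }

  // Each element's code leaves its value on the stack; ARRADD pops it into the array below
  public void compile(List<Integer> out) {
    out.add(Opcodes.MKARR);
    for (Node element : elements) {
      element.compile(out);
      out.add(Opcodes.ARRADD);
    }
  }
}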
55. Prototype
• Stunt implemented over a couple of days
• Depressing result: object tree interpreter ~40% faster
• Anthony Sparks tried the same thing
  • original VM implementation 5x slower
  • eventually managed to achieve performance parity
• So far: performance does not justify complexity
56. Java bytecode?
• The JVM is actually a stack-based VM
• can simply compile to Java bytecode instead
• Tricky to learn tools for generating bytecode
• no examples, very little documentation
• In the end decided to use the Asm library
• not very nice to use
• very primitive API
• crashes with NullPointerException on bad bytecode
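To give a flavour of what working with Asm involves, here is a minimal, hedged example that generates a class with one static method and loads it. It computes a constant rather than compiling a JSLT expression, and the class and method names are invented.

import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class AsmSketch {
  static class Loader extends ClassLoader {
    Class<?> define(String name, byte[] b) { return defineClass(name, b, 0, b.length); }
  }

  public static void main(String[] args) throws Exception {
    ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES | ClassWriter.COMPUTE_MAXS);
    cw.visit(Opcodes.V1_8, Opcodes.ACC_PUBLIC, "GeneratedExpr", null, "java/lang/Object", null);

    // public static int eval() { return 21 + 21; } -- written as raw stack operations
    MethodVisitor mv = cw.visitMethod(
        Opcodes.ACC_PUBLIC | Opcodes.ACC_STATIC, "eval", "()I", null, null);
    mv.visitCode();
    mv.visitLdcInsn(21);              // push 21
    mv.visitInsn(Opcodes.DUP);        // the JVM is itself a stack machine: duplicate it
    mv.visitInsn(Opcodes.IADD);       // add the two values
    mv.visitInsn(Opcodes.IRETURN);    // return the int on top of the stack
    mv.visitMaxs(0, 0);               // sizes recomputed because of COMPUTE_MAXS
    mv.visitEnd();
    cw.visitEnd();

    Class<?> generated = new Loader().define("GeneratedExpr", cw.toByteArray());
    System.out.println(generated.getMethod("eval").invoke(null));   // 42
  }
}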
59. Results
• Hard work to build
• many surprising issues in Java bytecode
• Performance boost of 15-25%
• code lives on jvm-bytecode branch in Github
• Ideas for how it could be even faster…
• through type inference
60. Type inference benefits
"sdrn:" + $namespace + ":" + $rType + ":" + $rId
Plus
Plus(“sdrn” $namespace)
Plus(
Plus(“:” $rType)
Plus(“:” $rId)
)
60
Plus:
JsonNode -> String
JsonNode -> String
String + String
new String -> new JsonNode
Will make 4 unnecessary TextNode objects
Will wrap and unwrap String repeatedly
Will check types unnecessarily
61. Solution
• + operator can ask both sides: what type will you produce?
• If one side says “string” then the result will be a string
• When compiling, do compile(generator, String)
  • will compile code that produces a Java String object
• + operator will make a new String if that’s what’s wanted
  • or turn it into a TextNode if the context wants Any
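A rough sketch of the idea with hypothetical classes (not the real JSLT implementation): every node can be asked at compile time what it will produce, and the concatenation node uses the answer to stay on java.lang.String for the whole chain, wrapping into a TextNode only when the context asks for Any.

// Hypothetical classes to illustrate the compile-time question "what type will you produce?"
enum JsonType { STRING, ANY }

interface TypedExpr {
  JsonType producedType();
  String computeString(Object input);   // only called when producedType() == STRING
}

record Const(String value) implements TypedExpr {
  public JsonType producedType() { return JsonType.STRING; }
  public String computeString(Object input) { return value; }
}

record Concat(TypedExpr left, TypedExpr right) implements TypedExpr {
  // If either side is known to be a string, '+' means concatenation and produces a string
  public JsonType producedType() {
    return left.producedType() == JsonType.STRING || right.producedType() == JsonType.STRING
        ? JsonType.STRING : JsonType.ANY;
  }
  public String computeString(Object input) {
    return left.computeString(input) + right.computeString(input);   // no TextNode objects in between
  }
}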
62. Freedom from Jackson
• The current codebase is bound to Jackson
• JVM bytecode compilation might be a way to escape that
• Could build compilers that can interface with different JSON representations
• Have ideas for a more efficient JSON representation
  • basically encode everything as arrays of ints
  • should save memory, GC, and produce faster code
63. Freedom from JSON
• If we aren’t bound to Jackson, why should we be bound to JSON?
• Could support Avro, too
• Perhaps also other formats
65. Internal status
• JSLT now used in
• Data Quality Tooling (to express tests on data)
• routing filters
• transforms
• In Schibsted we have
• 52 transforms, 2370 lines of code
• written by many people in different parts of the company
• Data Platform runs ~11 billion transforms/day
66. Open source status
• Released in June
• People are using it for real
• one certain case, several more examples
• details unknown
• Useful contributions from outsiders
• several bug fixes to datetime/number handling
• Two alternative implementations being worked on
• one in .NET
• one is virtual machine-based in Java
67. Lessons learned
• A custom language can make life much simpler
• if it fits the use case well
• Implementing a language is easier than it seems
• basically doable in a week
• Designing a language is not easy
• unfortunately