Apple must detect a wide variety of security threats, and it rises to the challenge using Apache Spark across a diverse pool of telemetry. This talk covers some of the home-grown solutions we’ve built to address the complications of operating at that scale.
This document discusses using Alfresco actions, rules, and workflows to automate document review and approval processes. It provides an example of a simple two-step review/approve workflow and describes how to add status tracking and user tasks. Actions are used to copy workflow status, rules trigger actions like starting advanced workflows, and simple workflows integrate with advanced workflows that include user tasks. The document demonstrates configuring and implementing these capabilities in Alfresco.
Talk at "Istanbul Tech Talks" in Istanbul, April, 17, 2018. http://www.istanbultechtalks.com/
In this talk I will show how to get started with MySQL query tuning. I will give a short introduction to physical table structure and demonstrate how it may influence query execution time. Then we will discuss basic query tuning instruments and techniques, mainly the EXPLAIN command and its latest variations. You will learn how to understand its output and how to rewrite a query or change the table structure to achieve better performance.
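As a taste of the output discussed here, below is a minimal sketch (assuming a reachable MySQL server, placeholder credentials, and the mysql-connector-python package; the table and query are made up) that runs EXPLAIN and prints the optimizer's JSON plan:

    import json
    import mysql.connector  # pip install mysql-connector-python

    # Hypothetical connection details; replace with your own server and schema.
    conn = mysql.connector.connect(host="localhost", user="app", password="secret", database="shop")
    cur = conn.cursor()

    # EXPLAIN shows the chosen access path; FORMAT=JSON adds cost estimates.
    cur.execute("EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 42")
    plan = json.loads(cur.fetchone()[0])
    print(json.dumps(plan, indent=2))

    cur.close()
    conn.close()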
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity (Databricks)
Cloud, cost, complexity, and threat coverage are top of mind for every security leader. The Lakehouse architecture has emerged in recent years to help address these concerns with a single unified architecture for all of your threat data, analytics, and AI in the cloud. In this talk, we will show how the Lakehouse is essential for effective cybersecurity and walk through popular security use cases. We will also share how Databricks empowers the security data scientist and analyst of the future, and how this technology allows cyber data sets to be used to solve business problems.
- The document discusses common web application vulnerabilities like SQL injection, cross-site scripting, and cross-site request forgery.
- It provides examples of vulnerable code and outlines secure coding practices to prevent these vulnerabilities, such as using parameterized queries to prevent SQL injection, encoding user input to prevent XSS, and using anti-forgery tokens to prevent CSRF.
- Additional topics covered include secure password storage, configuration hardening through web.config settings, and implementation of security controls like encryption and encoding using libraries like ESAPI.
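To make the parameterized-query and output-encoding points above concrete, here is a minimal Python sketch using the standard library's sqlite3 and html modules as stand-ins for the stack covered in the document:

    import sqlite3
    import html

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

    # Parameterized query: user input is bound as data, never spliced into the SQL string.
    user_input = "alice' OR '1'='1"
    rows = conn.execute("SELECT id, name FROM users WHERE name = ?", (user_input,)).fetchall()
    print(rows)  # [] -- the injection attempt matches nothing

    # Output encoding: escape user-controlled text before rendering it as HTML.
    comment = '<script>alert("xss")</script>'
    print(html.escape(comment))  # &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;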
Spencer Christensen
There are many aspects to managing an RDBMS. Some of these are handled by an experienced DBA, but there are a good many things that any sys admin should be able to take care of if they know what to look for.
This presentation will cover the basics of managing Postgres, including creating database clusters, an overview of configuration, and logging. We will also look at tools that help monitor Postgres and keep an eye on what is going on. Some of the tools we will review are:
* pgtop
* pg_top
* pgfouine
* check_postgres.pl
Check_postgres.pl is a great tool that can plug into your Nagios or Cacti monitoring systems, giving you even better visibility into your databases.
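Alongside those tools, a quick ad hoc check from a script is often handy; the sketch below (assuming a local PostgreSQL instance and the psycopg2 package; the connection string is a placeholder) pulls a couple of the numbers such monitoring checks typically watch:

    import psycopg2  # pip install psycopg2-binary

    conn = psycopg2.connect("dbname=postgres user=postgres host=localhost")
    cur = conn.cursor()

    # Current connections versus the configured limit.
    cur.execute("SELECT count(*) FROM pg_stat_activity")
    connections = cur.fetchone()[0]
    cur.execute("SHOW max_connections")
    max_connections = int(cur.fetchone()[0])
    print(f"connections: {connections}/{max_connections}")

    # Per-database size, another common check_postgres.pl-style check.
    cur.execute("SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database")
    for name, size in cur.fetchall():
        print(name, size)

    cur.close()
    conn.close()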
How to Shot Web - Jason Haddix at DEFCON 23 - See it Live: Details in Descrip... (bugcrowd)
1. The document provides tips for effective hacking and bug hunting in 2015, focusing on web applications.
2. It discusses philosophy shifts towards crowdsourced testing, and techniques for discovery such as finding less tested application parts and acquisitions.
3. The document also covers mapping methodology, parameters to attack, and bypassing filters for XSS, SQLi, file inclusion, and CSRF vulnerabilities.
The Golden Rules - Detecting more with RSA Security Analytics (Demetrio Milea)
The document discusses techniques for detecting threats using security analytics. It begins by explaining how a typical attack sequence is too simplistic and can fail to detect real threats. It then advocates for using a threat analysis approach to understand assets, data flows, threats and tactics. This involves profiling assets, mapping components and access points, and identifying threats, sources and techniques. The document shows how to write threat indicators using security analytics tools. It provides examples of anomaly detection rules in Event Processing Language to detect complex scenarios. The goal is to leverage threat analysis to implement risk-based indicators that effectively address residual risks.
This document discusses how Apache Calcite makes it easier to write database management systems (DBMS) by decomposing them into modular components like a query parser, catalog, algorithms, and storage engines. It presents Calcite as a framework that allows these components to be mixed and matched, with a core relational algebra and rule-based optimization. Calcite powers systems like Apache Hive, Drill, Phoenix, and Kylin by translating SQL and other queries to relational algebra and optimizing queries using over 100 rules before executing them using configurable engines and data sources.
Building Data Quality pipelines with Apache Spark and Delta Lake (Databricks)
Technical Leads and Databricks Champions Darren Fuller & Sandy May will give a fast-paced view of how they have productionised Data Quality pipelines across multiple enterprise customers. Their vision of empowering business decisions on data remediation actions and self-healing of data pipelines led them to build a library of Data Quality rule templates and an accompanying reporting data model and Power BI reports.
With the drive for more and more intelligence driven from the Lake and less from the Warehouse, also known as the Lakehouse pattern, Data Quality at the Lake layer becomes pivotal. Tools like Delta Lake become building blocks for Data Quality with schema protection and simple column checking; however, for larger customers they often do not go far enough. Quick-fire notebook demos will show how Spark can be leveraged at the point of Staging or Curation to apply rules over data.
Expect to see simple rules, such as Net sales = Gross sales + Tax or values existing within a list, as well as complex rules such as validation of statistical distributions and complex pattern matching. The session ends with a quick view into future work in the realm of Data Compliance for PII data, with generation of rules using regex patterns and machine learning rules based on transfer learning.
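As an illustration of the kind of rule described above, here is a minimal PySpark sketch (column names and thresholds are hypothetical, not from the talk) that flags rows violating a Net sales = Gross sales + Tax check and a value-in-list check:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dq-rules-sketch").getOrCreate()

    df = spark.createDataFrame(
        [(100.0, 90.0, 10.0, "GBP"), (50.0, 45.0, 4.0, "XYZ")],
        ["net_sales", "gross_sales", "tax", "currency"],
    )

    checked = (
        df.withColumn("rule_net_equals_gross_plus_tax",
                      F.abs(F.col("net_sales") - (F.col("gross_sales") + F.col("tax"))) < 0.01)
          .withColumn("rule_currency_in_list",
                      F.col("currency").isin("GBP", "USD", "EUR"))
    )

    # Rows failing any rule could be routed to a quarantine table for remediation.
    checked.filter(~F.col("rule_net_equals_gross_plus_tax") | ~F.col("rule_currency_in_list")).show()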
Python Data Wrangling: Preparing for the Future (Wes McKinney)
The document is a slide deck for a presentation on Python data wrangling and the future of the pandas project. It discusses the growth of the Python data science community and key projects like NumPy, pandas, and scikit-learn that have contributed to pandas' popularity. It outlines some issues with the current pandas codebase and proposes a new C++-based core called libpandas for pandas 2.0 to improve performance and interoperability. Benchmark results show serialization formats like Arrow and Feather outperforming pickle and CSV for transferring data.
The document provides an overview of PostgreSQL performance tuning. It discusses caching, query processing internals, and optimization of storage and memory usage. Specific topics covered include the PostgreSQL configuration parameters for tuning shared buffers, work memory, and free space map settings.
How to use histograms to get better performance (MariaDB plc)
Sergei Petrunia and Varun Gupta, software engineers at MariaDB, show how histograms can be used to improve query performance. They begin by introducing histograms and explaining why they’re needed by the query optimizer. Next, they discuss how to determine whether or not histograms are needed, and if so, which tables and columns they should be applied to. Finally, they cover best practices and recent improvements to histograms.
Delta from a Data Engineer's Perspective (Databricks)
This document describes the Delta architecture, which unifies batch and streaming data processing. Delta achieves this through a continuous data flow model using structured streaming. It allows data engineers to read consistent data while being written, incrementally read large tables at scale, rollback in case of errors, replay and process historical data along with new data, and handle late arriving data without delays. Delta uses transaction logging, optimistic concurrency, and Spark to scale metadata handling for large tables. This provides a simplified solution to common challenges data engineers face.
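A minimal PySpark sketch of that pattern (assuming a Spark session configured with the delta-spark package; the path is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

    # Batch write into a Delta table; the transaction log makes the write atomic.
    events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
    events.write.format("delta").mode("append").save("/tmp/delta/events")

    # Readers see a consistent snapshot even while writers append.
    spark.read.format("delta").load("/tmp/delta/events").show()

    # Time travel: roll back to or replay an earlier version of the table.
    spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events").show()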
This document provides an introduction and overview of PostgreSQL, including its history, features, installation, usage and SQL capabilities. It describes how to create and manipulate databases, tables, views, and how to insert, query, update and delete data. It also covers transaction management, functions, constraints and other advanced topics.
Red Team Revenge - Attacking Microsoft ATA (Nikhil Mittal)
Nikhil Mittal presented methods for evading detection by Microsoft Advanced Threat Analytics (ATA). ATA detects attacks by monitoring traffic to domain controllers, but can be bypassed by avoiding direct queries to the DC. Reconnaissance techniques like SPN scanning and hunting domain admin tokens on other machines go undetected. Overpass-the-hash and golden tickets can bypass ATA if the encryption type matches normal traffic. False events can also be generated by triggering unusual detections for fake users.
How to improve ELK log pipeline performance (Steven Shim)
The document discusses improving the processing speed of logs in an ELK stack. It finds that logs are beginning to back up due to high average request volumes of around 1 million requests per minute. It analyzes various logging pipeline architectures and patterns to address this. It recommends measuring key parts of the pipeline to identify bottlenecks, improving the Logstash grok parser performance, increasing Kafka partitions to distribute load more evenly, and scaling Logstash instances to parallelize ingestion. These changes aim to reduce the risks of high throughput, lost records, and latency in the logging pipeline.
This document discusses Pinot, Uber's real-time analytics platform. It provides an overview of Pinot's architecture and data ingestion process, describes a case study on modeling trip data in Pinot, and benchmarks Pinot's performance on ingesting large volumes of data and answering queries in real-time.
Deep learning has come a long way over the past few years; with advances in cloud computing, frameworks, and open source tooling, working with images has gotten simpler over time. Delta Lake has been amazing at creating a tabular, structured, transactional layer on object storage, but what about images? Would you like to know how to gain a 45x improvement in your image processing pipeline? Join Jason and Rohit to find out how!
Postgres MVCC - A Developer Centric View of Multi Version Concurrency Control (Reactive.IO)
Scaling a data tier requires multiple concurrent database connections that are all vying for read and write access to the same data. In order to cater to this complex demand, PostgreSQL implements a concurrency method known as Multi Version Concurrency Control, or MVCC. By understanding MVCC, you will be able to take advantage of advanced features such as transactional memory, atomic data isolation, and point-in-time consistent views.
This presentation will show you how MVCC works on both a theoretical and a practical level. Furthermore, you will learn how to optimize common tasks such as database writes, vacuuming, and index maintenance. Afterwards, you will have a fundamental understanding of how PostgreSQL operates on your data.
Key points discussed:
* MVCC; what is really happening when I write data.
* Vacuuming; why it is needed and what is really going on.
* Transactions; much more than just an undo button.
* Isolation levels; seeing only the data you want to see.
* Locking; ensure writes happen in the order you choose.
* Cursors; how to stream chronologically correct data more efficiently.
SQL examples given during the presentation are available here: http://www.reactive.io/academy/presentations/postgresql/mvcc/mvcc-examples.zip
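To see MVCC from a developer's seat, here is a minimal sketch (assuming a local PostgreSQL instance and psycopg2; not part of the linked examples) that inspects the xmin/xmax system columns tracking which transaction created or expired each row version:

    import psycopg2

    conn = psycopg2.connect("dbname=postgres user=postgres host=localhost")
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("CREATE TEMP TABLE accounts (id int, balance int)")
    cur.execute("INSERT INTO accounts VALUES (1, 100)")

    # xmin is the transaction that created this row version.
    cur.execute("SELECT xmin, xmax, id, balance FROM accounts")
    print(cur.fetchone())

    # An UPDATE writes a new row version with a new xmin; vacuum later reclaims the old one.
    cur.execute("UPDATE accounts SET balance = 150 WHERE id = 1")
    cur.execute("SELECT xmin, xmax, id, balance FROM accounts")
    print(cur.fetchone())

    cur.close()
    conn.close()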
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa... (Dremio Corporation)
Essentially every successful analytical DBMS in the market today makes use of column-oriented data structures. In the Hadoop ecosystem, Apache Parquet (and Apache ORC) provide similar advantages in terms of processing and storage efficiency. Apache Arrow is the in-memory counterpart to these formats and has been embraced by over a dozen open source projects as the de facto standard for in-memory processing. In this session, the PMC Chair for Apache Arrow and the PMC Chair for Apache Parquet discuss the future of column-oriented processing.
Slides for a college course based on "The Web Application Hacker's Handbook", 2nd Ed.
Teacher: Sam Bowne
Twitter: @sambowne
Website: https://samsclass.info/129S/129S_F16.shtml
Common Strategies for Improving Performance on Your Delta Lakehouse (Databricks)
The Delta Architecture pattern has made the lives of data engineers much simpler, but what about improving query performance for data analysts? What are some common places to look when tuning query performance? In this session we will cover some common techniques to apply to our Delta tables to make them perform better for data analysts’ queries. We will look at a few examples of how you can analyze a query and determine what to focus on to deliver better performance results.
This document discusses how to organize and manipulate files in Python. It introduces the shutil module, which contains functions for copying, moving, renaming, and deleting files. It describes how to use shutil functions like copy(), copytree(), move(), rmtree() to perform common file operations. It also introduces the send2trash module as a safer alternative to permanently deleting files. Finally, it discusses walking directory trees using os.walk() to perform operations on all files within a folder and its subfolders.
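A minimal sketch of those pieces working together (paths are made up; send2trash is a third-party package and is left commented out):

    import os
    import shutil
    # from send2trash import send2trash  # pip install send2trash; safer than permanent deletion

    src_dir = "/tmp/example_src"
    backup_dir = "/tmp/example_backup"
    os.makedirs(src_dir, exist_ok=True)
    with open(os.path.join(src_dir, "notes.txt"), "w") as f:
        f.write("hello")

    # shutil.copytree copies a whole folder tree; shutil.move renames or relocates it.
    if not os.path.exists(backup_dir):
        shutil.copytree(src_dir, backup_dir)

    # os.walk visits every folder and file under a root, one directory at a time.
    for folder, subfolders, filenames in os.walk(backup_dir):
        for filename in filenames:
            print(os.path.join(folder, filename))

    # send2trash(os.path.join(src_dir, "notes.txt"))  # moves the file to the trash instead of deleting it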
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna (PostgreSQL-Consulting)
Autovacuum is PostgreSQL's automatic vacuum process that helps manage bloat and garbage collection. It is critical for performance but is often improperly configured by default settings. Autovacuum works table-by-table to remove expired rows in small portions to avoid long blocking operations. Its settings like scale factors, thresholds, and costs can be tuned more aggressively for OLTP workloads to better control bloat and avoid long autovacuum operations.
My slides for understanding Pentesting for GraphQL Applications. I presented this content at c0c0n and bSides Delhi 2018. Also contains details of my Burp Extension for GraphQL parsing and scanning located here https://github.com/br3akp0int/GQLParser
The talk covers the concepts and internal mechanisms of how PostgreSQL, a popular open-source database, operates. While doing so, I'll also draw similarities to other RDBMSs such as Oracle, MySQL and SQL Server.
Some topics touched on during this presentation:
- PostgreSQL internal concepts: table, index, page, heap, vacuum, toast, etc.
- MVCC and relational transactions
- Indexes and how they affect performance
- Discussion of Uber's blog post about moving from PostgreSQL to MySQL
The talk is suitable for a technical audience that has worked with databases before (software engineers/data analysts) and wants to learn about their internal mechanisms.
Speaker: Huy Nguyen, CTO & Cofounder, Holistics Software
Huy is currently the CTO of Holistics, a Business Intelligence (BI) and Data Infrastructure product. Holistics helps customers generate reports and insights from their data. Holistics customers include tech companies like Grab, Traveloka, The Coffee House, Tech In Asia and e27.
Before Holistics, Huy worked at Viki, helping build their end-to-end data platform that scales to over 100M records a day. Previously, Huy spent a year writing medical simulation software in Europe, and did an internship at Facebook HQ working on their growth team.
Huy's proudest achievement is a score of 251 on Flappy Bird.
Language: Vietnamese, with slides in English.
This document provides an introduction to automated testing. It discusses the motivations for automated testing such as improving quality and catching bugs early. It covers basic testing concepts like unit, integration, and system tests. It explains testing principles such as keeping tests independent and focusing on visible behavior. The document also discusses popular testing frameworks for different programming languages and provides examples of tests from a codebase.
This document discusses database unit testing fundamentals. It defines unit testing as code that exercises specific portions of code to return a pass/fail result. The goals of unit testing are to catch mistakes early, ensure code works as expected, and maintain tight code. The document reviews how to unit test databases, using tSQLt and SQL Test frameworks. It provides examples of unit testing stored procedures, functions, views and constraints. Overall, the document promotes unit testing databases to write testable code and help prevent errors.
A year and a half ago we rolled out a new integrated full-text search engine for our Intranet based on Apache Solr. The search engine integrates various data sources such as file systems, wikis, internal websites and web applications, shared calendars, our corporate database, CRM system, email archive, task management and defect tracking, etc. This talk is an experience report about some of the good things, the bad things and the surprising things we have encountered over two years of developing with, operating and using an Intranet search engine based on Apache Solr.
After setting the scene, we will discuss some interesting requirements that we have for our search engine and how we solved them with Apache Solr (or at least tried to solve). Using these concrete examples, we will discuss some interesting features and limitations of Apache Solr.
In the second part of the talk, we will tell a couple of "war stories" and walk through some interesting, annoying and surprising problems that we faced, how we analyzed the issues, identified the cause of the problems and eventually solved them.
The talk is aimed at software developers and architects with some basic knowledge of Apache Solr, the Apache Lucene project family or similar full-text search engines. It is not an introduction to Apache Solr, and we will dive right into the interesting and juicy bits.
Property-based testing (PBT) focuses on testing specifications rather than implementations. It uses random testing against properties expressed as code to generate many test cases, reducing testing effort. PBT represents a system as states, commands to transition between states, and properties relating commands to expected states. This allows effective testing of stateful systems. PBT has been used successfully for concurrency, distributed systems, and finding bugs unit tests missed. Popular PBT libraries include Scalacheck, QuickCheck, and Hypothesis.
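For a flavour of property-based testing, here is a minimal Hypothesis sketch (the encoder under test and the round-trip property are illustrative, not taken from the talk):

    # pip install hypothesis pytest
    from hypothesis import given, strategies as st

    def run_length_encode(xs):
        """Toy implementation under test: returns [(value, count), ...]."""
        out = []
        for x in xs:
            if out and out[-1][0] == x:
                out[-1] = (x, out[-1][1] + 1)
            else:
                out.append((x, 1))
        return out

    def run_length_decode(pairs):
        return [x for x, n in pairs for _ in range(n)]

    # The property: decoding an encoding returns the original input, for any generated list.
    @given(st.lists(st.integers()))
    def test_encode_decode_roundtrip(xs):
        assert run_length_decode(run_length_encode(xs)) == xs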
Sumo Logic QuickStart Webinar - Jan 2016 (Sumo Logic)
QuickStart your Sumo Logic service with this exclusive webinar. At these monthly live events you will learn how to capitalize on critical capabilities that can amplify your log analytics and monitoring experience while providing you with meaningful business and IT insights
Lessons Learned Replatforming A Large Machine Learning Application To Apache ... (Databricks)
Morningstar’s Risk Model project is created by stitching together statistical and machine learning models to produce risk and performance metrics for millions of financial securities. Previously, we were running a single version of this application, but needed to expand it to allow for customizations based on client demand. With the goal of running hundreds of custom Risk Model runs at once at an output size of around 1TB of data each, we had a challenging technical problem on our hands! In this presentation, we’ll talk about the challenges we faced replatforming this application to Spark, how we solved them, and the benefits we saw.
Some things we’ll touch on include how we created customized models, the architecture of our machine learning application, how we maintain an audit trail of data transformations (for rigorous third party audits), and how we validate the input data our model takes in and output data our model produces. We want the attendees to walk away with some key ideas of what worked for us when productizing a large scale machine learning platform.
This document provides tips and tricks for debugging Arbortext applications. It discusses challenges like debugging components with multiple interfaces and custom code. It recommends using messages like response() and eval to monitor state, and debugging tools like the Java console. It also suggests adding debug messages programmatically, using binary search, and getting a second set of eyes to help find bugs. Maintaining backups and good documentation are emphasized.
How do you design a database that can ingest more than four million ... (javier ramirez)
In this session I will cover the technical decisions we made when developing QuestDB, an open source time-series database compatible with Postgres, and how we managed to write more than four million rows per second without blocking or slowing down queries.
I will talk about things such as (zero) garbage collection, instruction vectorization using SIMD, rewriting instead of reusing in order to shave off microseconds, taking advantage of advances in processors, hard disks and operating systems (for example io_uring support), and the balance between user experience and performance when new features are proposed.
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont... (CODE BLUE)
One of the most prevalent methods used by attackers to exploit vulnerabilities is ROP - Return Oriented Programming. Many times during the exploitation process, code will run very differently than it does usually - calls will be made to the middle of functions, functions won’t return to their callers, etc. These anomalies in control flow could be detected if a log of all instructions executed by the processor were available.
In the past, tracing the execution of a processor incurred a significant slowdown, rendering such an anti-exploitation method impractical. However, recent Intel processors, such as Broadwell and Skylake, are now able to trace execution with low overhead, via a feature called Processor Trace. A similar feature called CoreSight exists on new ARM processors.
The lecture will discuss an anti-exploitation system we built which scans files and detects control flow violations by using these new processor features.
--- Ron Shina
Ron has been staring at binary code for over the past decade, occasionally running it. Having spent a lot of his time doing mathematics, he enjoys searching for algorithmic opportunities in security research and reverse engineering. He is a graduate of the Israel Defense Forces’ Talpiot program. In his spare time he works on his jump shot.
--- Shlomi Oberman
Shlomi Oberman is an independent security researcher with over a decade of experience in security research. Shlomi spent many years in the attacker’s shoes for different companies and knows too well how hard it is to stop a determined attacker. In the past years his interest has shifted from breaking things to helping stop exploits – while software is written and after it has shipped. Shlomi is a veteran of the IDF Intelligence Corps and used to head the security research efforts at NSO Group and other companies.
Debugging Complex Systems - Erlang Factory SF 2015 (lpgauth)
Debugging complex systems can be difficult. Luckily, the Erlang ecosystem is full of tools to help you out. With the right mindset and the right tools, debugging complex Erlang systems can be easy. In this talk, I'll share the debugging methodology I've developed over the years.
This document provides an introduction to the CSE 326: Data Structures course. It discusses the following key points in 3 sentences or less:
The course will cover common data structures and algorithms, how to choose the appropriate data structure for different needs, and how to justify design decisions through formal reasoning. It aims to help students become better developers by understanding fundamental data structures and when to apply them. The document provides examples of stacks and queues to illustrate abstract data types, data structures, and their implementations in different programming languages.
This document provides an overview of a Data Structures course. The course will cover basic data structures and algorithms used in software development. Students will learn about common data structures like lists, stacks, and queues; analyze the runtime of algorithms; and practice implementing data structures. The goal is for students to understand which data structures are appropriate for different problems and be able to justify design decisions. Key concepts covered include abstract data types, asymptotic analysis to evaluate algorithms, and the tradeoffs involved in choosing different data structure implementations.
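To make the stack/queue example concrete in one language, here is a minimal Python sketch of the two abstract data types and why the implementation choice matters:

    from collections import deque

    # Stack: last-in, first-out. A Python list's append/pop give O(1) work at the end.
    stack = []
    stack.append("a")
    stack.append("b")
    print(stack.pop())  # "b" -- the most recently pushed item

    # Queue: first-in, first-out. collections.deque gives O(1) at both ends,
    # whereas list.pop(0) would shift every remaining element and cost O(n).
    queue = deque()
    queue.append("a")
    queue.append("b")
    print(queue.popleft())  # "a" -- the earliest enqueued item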
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und... (rschuppe)
Application performance doesn't come easy. How do you find the root cause of performance issues in modern, complex applications when all you have to start with is a complaining user?
In this presentation (mainly in German, but understandable for English speakers) I reprise the fundamentals of troubleshooting and give some new examples of how to tackle issues.
Follow up presentation to "Performance Trouble Shooting 101 - Schweine, Schlangen und Papierschnitte"
Illuminate - Performance Analytics driven by Machine Learning (jClarity)
illuminate is a machine learning-based performance analytics tool that automatically diagnoses performance issues in servers and applications without human intervention. It has a small memory, CPU, and network footprint, uses adaptive machine learning to interpret data and scale with applications, and provides a holistic view of both application and system performance across servers. illuminate identifies the largest bottlenecks through machine learning, aggregates similar issues across servers, and auto-triggers on SLA breaches. It supports Linux systems and has a secure web-based dashboard.
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics (Databricks)
Have you ever hit mysterious random process hangs, performance regressions, or OOM errors that leave barely any useful traces, yet are hard or expensive to reproduce? No matter how tricky the bugs are, they always leave some breadcrumbs along the way.
This tutorial will discuss and demonstrate how to implement different real-time streaming analytics patterns. We will start with counting use cases and progress to complex patterns like time windows, tracking objects, and detecting trends. We will start with Apache Storm and progress to Complex Event Processing based technologies.
This document discusses various patterns for real-time streaming analytics. It begins by providing background on data analytics and how real-time streaming has become important for use cases where insights need to be generated very quickly. It then covers basic patterns like preprocessing, alerts and thresholds, counting, and joining event streams. Further patterns discussed include detecting trends, interacting with databases, running batch and real-time queries, and using machine learning models. The document also reviews tools for implementing real-time analytics like stream processing frameworks and complex event processing. Finally, it provides examples of implementing several patterns in Storm and WSO2 CEP.
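As a toy illustration of the counting and time-window patterns mentioned above (plain Python, not tied to Storm or WSO2 CEP):

    import time
    from collections import deque

    class SlidingWindowCounter:
        """Counts events seen within the last `window_seconds` seconds."""

        def __init__(self, window_seconds):
            self.window_seconds = window_seconds
            self.timestamps = deque()

        def record(self, ts=None):
            ts = time.time() if ts is None else ts
            self.timestamps.append(ts)
            self._evict(ts)

        def count(self, now=None):
            self._evict(time.time() if now is None else now)
            return len(self.timestamps)

        def _evict(self, now):
            # Drop timestamps that have fallen out of the window.
            while self.timestamps and self.timestamps[0] < now - self.window_seconds:
                self.timestamps.popleft()

    counter = SlidingWindowCounter(window_seconds=60)
    for _ in range(5):
        counter.record()
    print(counter.count())  # 5 events seen in the last minute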
In this talk, Azlam Abdulsalam and Ramzi Akremi will share their experiences from an ongoing Salesforce program and how they build, deploy and maintain 20+ unlocked packages through a highly optimised pipeline.
(ATS3-PLAT07) Pipeline Pilot Protocol Tips, Tricks, and Challenges (BIOVIA)
This document provides tips and tricks for using Pipeline Pilot, including how to use protocol search, favorites bar, tool tips, component profiling, design mode, protocol recovery, recursion vs looping, merge/join operations, debugging tips, and RTC subprotocols. It emphasizes best practices like avoiding loops and using recursion instead. Design mode and checkpoints are highlighted as useful debugging aids. Resources like training, support, and the user community are recommended for additional help.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
* Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
* Performing data quality validations using libraries built to work with Spark
* Dynamically generating pipelines that can be abstracted away from users
* Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
* Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
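A minimal sketch of the stage-level scheduling API described above (it assumes a cluster manager that supports stage-level scheduling, e.g. YARN or Kubernetes with dynamic allocation, plus discoverable GPUs; the resource amounts are illustrative):

    from pyspark import SparkContext
    from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder

    sc = SparkContext(appName="stage-level-scheduling-sketch")

    # The ETL-style stage runs with the default resource profile.
    etl = sc.parallelize(range(1_000_000)).map(lambda x: (x % 10, x))

    # Request different containers (here, GPUs) for the training-style stages.
    ereqs = ExecutorResourceRequests().cores(4).memory("8g").resource("gpu", 1)
    treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
    profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

    # Only the stages produced from this RDD onwards use the new profile.
    result = etl.withResources(profile).map(lambda kv: kv[1] * 2).count()
    print(result)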
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
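A minimal sketch of that conversion flow (assuming the petastorm and tensorflow packages are installed; the cache path and toy DataFrame are placeholders):

    from pyspark.sql import SparkSession
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    spark = SparkSession.builder.appName("converter-sketch").getOrCreate()
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///tmp/petastorm_cache")

    df = spark.range(1000).selectExpr("id", "id * 2.0 AS feature")

    # The converter materializes the DataFrame once, then serves it to TF or PyTorch.
    converter = make_spark_converter(df)

    with converter.make_tf_dataset(batch_size=32) as dataset:
        for batch in dataset.take(1):
            print(batch)

    converter.delete()  # clean up the cached copy when done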
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
* Understanding key traits of Apache Spark on Kubernetes
* Things to know when running Apache Spark on Kubernetes, such as autoscaling
* Demonstrating analytics pipelines running on Apache Spark, orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not "abelian groups" operating over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
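Below is a minimal, notebook-style Scala sketch of the distributed-counter niche above, using the Jedis client against a Redis hash. The Redis host, hash key, input path, and column name are illustrative assumptions, not details from the talk.

import org.apache.spark.sql.{Row, SparkSession}
import redis.clients.jedis.Jedis

// Count records per category across executors with a Redis hash.
val spark = SparkSession.builder.appName("redis-counter-sketch").getOrCreate()
val events = spark.read.format("delta").load("/mnt/data/events")

events.select("category").rdd.foreachPartition { (rows: Iterator[Row]) =>
  // One connection per partition; HINCRBY is atomic on the Redis side, so a
  // retried or speculative task can at most over-count (hence the precautions above).
  val jedis = new Jedis("redis-host", 6379)
  try rows.foreach(r => jedis.hincrBy("event_counts", r.getString(0), 1L))
  finally jedis.close()
}

Unlike Spark accumulators, the hash can be read by any external process while the job is still running, which is what makes it attractive as a distributed counter.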
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade-Offs with Various Formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Lalit Wangikar, a partner at CKM Advisors, is an experienced strategic consultant and analytics expert. He started looking for data-driven ways of conducting process discovery workshops. When he read about process mining for the first time, about 2 years ago, his first feeling was: "I wish I knew of this while doing the last several projects!"
Interviews are subject to all the whims human recollection is subject to: specifically, recency, simplification and self-preservation. Interview-based process discovery, therefore, leaves out a lot of "outliers" that usually end up being some of the biggest opportunity areas. Process mining, in contrast, provides an unbiased, fact-based, and very comprehensive understanding of actual process execution.
This comprehensive Data Science course is designed to equip learners with the essential skills and knowledge required to analyze, interpret, and visualize complex data. Covering both theoretical concepts and practical applications, the course introduces tools and techniques used in the data science field, such as Python programming, data wrangling, statistical analysis, machine learning, and data visualization.
Just-in-time: Repetitive production system in which processing and movement of materials and goods occur just as they are needed, usually in small batches
JIT is characteristic of lean production systems
JIT operates with very little “fat”
Scaling Security Threat Detection with Apache Spark and Databricks
2. Scaling Security Threat Detection
with Spark and Databricks
Josh Gillner
Apple Detection Engineering
3. ▪ Protecting Apple’s Systems
▪ Finding & responding to security
threats using log data
▪ Threat research and hunting
^^^ Looking for this guy
Who are we? - Apple Detection Engineering
9. Problem #1 — Development Overhead
▪ Average time to write, test, and deploy a
basic detection === 1 week
▪ New ideas/week > deployed jobs/week
(unsustainable)
▪ Writing scalatests, preserving test
samples…testing is too cumbersome
▪ > 60% of new code is boilerplate (!!)
10. Problem #2 — Mo’ Detections, Mo’ Problems
Want to add a cool new feature to all detections? Refactor many different notebooks.
Want to configure multiple detections at once? Config all over the place in disparate notebooks.
Ongoing tuning and maintenance? One-off tuning doesn't scale to hundreds of detections.
11. Problem #3 — No Support for Common Patterns
▪ Common enrichments or exclusions
▪ Creating and using statistical
baselines
▪ Write detection test using scalatest
Things People Often Do
(but must write code for)
…everyone implements in
a different way
…fixes/updates must be
applied in 10 places
14. Input
▪ All detection begins with input loading
▪ Pass in inputs through config object
▪ External control through config
▪ decide spark.read vs .readStream
▪ path, schema, format
▪ no hardcoding -> dynamic input
behavior
▪ Abstracts away details of getting data
^^^ This should not change if
someDataset is a production table
or test sample file
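As a rough illustration of this idea, here is a hypothetical Scala sketch of config-driven input loading; the InputConfig fields and defaults are assumptions made for the example, not the framework's actual API.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

// Illustrative input config: everything a detection needs to load its data,
// supplied from outside the code.
case class InputConfig(
    path: String,
    format: String = "delta",
    schema: Option[StructType] = None,
    streaming: Boolean = true) // decides spark.readStream vs spark.read

def loadInput(spark: SparkSession, cfg: InputConfig): DataFrame =
  if (cfg.streaming) {
    var r = spark.readStream.format(cfg.format)
    cfg.schema.foreach(s => r = r.schema(s))
    r.load(cfg.path)
  } else {
    var r = spark.read.format(cfg.format)
    cfg.schema.foreach(s => r = r.schema(s))
    r.load(cfg.path)
  }

// The detection itself never knows whether cfg.path points at a production
// table or a test sample file.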
15. Detection and Alert Abstraction
▪ Logic is described
in form of Spark
DataFrame
▪ Supports additional
post-processing
transformation
▪ Basic interface for
consumption by
other code
Detection
val alerts: Map[String, Alert] =
Alert
val modules: ArrayBuffer[Transformer] =
def PostProcessor(input: DataFrame): DataFrame = ???
def df: DataFrame = /* alert logic here */
val config: DetectionConfig
Input and other runtime configs
Test generation
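The fragments above can be read as a small trait hierarchy. The following Scala sketch reconstructs a plausible shape for it from the slide; the names and signatures are assumptions, not the real framework.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.DataFrame

// Placeholder config type; fleshed out on the config slides below.
case class DetectionConfig(name: String, description: String = "")

// A post-processing step applied to an alert's output.
trait Transformer { def apply(df: DataFrame): DataFrame }

trait Alert {
  def df: DataFrame                                          // alert logic as a DataFrame
  val modules: ArrayBuffer[Transformer] = ArrayBuffer.empty  // post-processing transforms
  def postProcess(input: DataFrame): DataFrame =
    modules.foldLeft(input)((acc, t) => t(acc))
}

trait Detection {
  val config: DetectionConfig    // input and other runtime configs
  val alerts: Map[String, Alert] // one detection can expose several alerts
}

Because every detection exposes the same small interface, the surrounding machinery (emitters, test generation, job deployment) can consume any of them uniformly.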
16. Emitter
▪ Takes output from Alert and send them elsewhere
▪ Also schedules the job in Spark cluster
Alert
MemoryEmitter
FileEmitter
KinesisEmitter
DBFS on AWS S3
In-memory Table
AWS Kinesis
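A hypothetical sketch of the emitter interface, with two of the targets named on the slide; class and method names are illustrative, and the real emitters also handle scheduling and streaming writes.

import org.apache.spark.sql.DataFrame

trait Emitter { def emit(alertName: String, alerts: DataFrame): Unit }

// Handy for tests: expose results as an in-memory temp view.
class MemoryEmitter extends Emitter {
  def emit(alertName: String, alerts: DataFrame): Unit =
    alerts.createOrReplaceTempView(s"${alertName}_alerts")
}

// Batch-style file emitter writing Delta to DBFS (backed by S3); a streaming
// variant would use writeStream with a checkpoint location instead.
class FileEmitter(basePath: String) extends Emitter {
  def emit(alertName: String, alerts: DataFrame): Unit =
    alerts.write.format("delta").mode("append").save(s"$basePath/$alertName")
}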
17. Config Inference
▪ If things can (and should) be changed, move it outside of code
▪ eg. detection name, description, input dataset, emitter
▪ Where possible, supply a sane default or infer them
val checkpointLocation: String =
"dbfs:/mnt/defaultbucket/chk/detection/ / / .chk/"
name = "CodeRed: Something Has Happened"
alertName = "JoshsCoolDetection"
version = "1"
DetectionConfigInfer
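A notebook-style sketch of the inference idea: derive the checkpoint location from the detection name, alert name, and version when none is supplied explicitly. The bucket path mirrors the pattern on the slide but is illustrative, and path sanitization is omitted.

// Hypothetical helper; explicit config wins, otherwise infer a sane default.
def inferCheckpointLocation(
    name: String,
    alertName: String,
    version: String,
    explicit: Option[String] = None): String =
  explicit.getOrElse(
    s"dbfs:/mnt/defaultbucket/chk/detection/$name/$alertName/$version.chk/")

// e.g. inferCheckpointLocation("CodeRed: Something Has Happened", "JoshsCoolDetection", "1")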
18. Config Inheritance
▪ Fine-grained configurability
▪ Could be multiple Alerts in
same Detection
▪ Individually configurable,
otherwise inherit parent
config
Detection
Alert
val config: DetectionConfig
Alert
Alert
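A sketch of how the inheritance could work, expanding the placeholder DetectionConfig from the earlier sketch: an alert-level config only overrides the fields it sets, and everything else falls back to the parent detection's config. All field names here are illustrative.

// Hypothetical emitter target and configs for the inheritance example.
case class EmitterTarget(kind: String, location: String)

case class DetectionConfig(
    name: String,
    emitter: EmitterTarget,
    checkpointRoot: String = "dbfs:/mnt/defaultbucket/chk/detection")

case class AlertConfigOverride(
    emitter: Option[EmitterTarget] = None,
    checkpointRoot: Option[String] = None) {

  // Resolve this alert's config against the parent detection's config.
  def resolve(parent: DetectionConfig): DetectionConfig =
    parent.copy(
      emitter = emitter.getOrElse(parent.emitter),
      checkpointRoot = checkpointRoot.getOrElse(parent.checkpointRoot))
}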
19. Modular Pre/PostProcessing
▪ DataFrame -> DataFrame transform
applied to input dataset
▪ Supplied in config
▪ Useful for things like date filtering
without changing detection
Preprocessing
Postprocessing
▪ Mutable Seq of transform functions
inside Detection
▪ Applied sequentially to output
foreachBatch Transformers
▪ Some operations not stream-safe
▪ Where the crazy stuff happens
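Concretely, the pre/post hooks can be nothing more than DataFrame => DataFrame functions applied in order, with anything that is not stream-safe deferred to foreachBatch. The sketch below is notebook-style Scala; the column name, paths, and variable names are assumptions.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

type Transform = DataFrame => DataFrame

// Preprocessing supplied via config, e.g. a date filter applied to the input
// without touching the detection logic itself.
val dateFilter: Transform = _.filter(col("event_date") >= "2021-01-01")

// Postprocessing: a mutable sequence of transforms applied to the output.
val postProcessors = ArrayBuffer.empty[Transform]

def applyAll(df: DataFrame, transforms: Seq[Transform]): DataFrame =
  transforms.foldLeft(df)((acc, t) => t(acc))

// Operations that are not stream-safe can run inside foreachBatch, e.g.
//   alertStream.writeStream.foreachBatch { (batch: DataFrame, _: Long) =>
//     applyAll(batch, postProcessors).write.format("delta").mode("append").save(outPath)
//   }.start()
// where alertStream and outPath are hypothetical.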
20. Manual Tuning Lifecycle
▪ Tuning overhead scales
with number of detections
▪ Feedback loop can take
days while analysts
suffer :(
▪ This needs to be faster… ideally automated and self-service
(Tuning feedback loop: the data/environment changes -> false positive alerts -> analyst requests tuning (pain) -> DE tweaks detection -> and around again)
22. Complex Exclusions
▪ Arbitrary SQL expressions applied
on all results in forEachBatch
▪ Stored in rev-controlled TSV
▪ Integrated into Detection Test
CI…malformed or over-selective
items will fail tests
▪ Preservation of excluded alerts in
a separate table
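To make the mechanics concrete, here is a hypothetical Scala sketch of applying TSV-driven exclusions inside foreachBatch; the output path and helper name are assumptions, not the production implementation.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// Each exclusion is an arbitrary SQL expression loaded from the rev-controlled TSV.
def applyExclusions(batch: DataFrame, exclusionExprs: Seq[String]): DataFrame =
  if (exclusionExprs.isEmpty) batch
  else {
    val combined = exclusionExprs.map(e => s"($e)").mkString(" OR ")
    // Preserve excluded alerts in a separate table instead of dropping them silently.
    batch.filter(expr(combined))
      .write.format("delta").mode("append").save("dbfs:/mnt/alerts/excluded")
    batch.filter(!expr(combined))
  }

Because the expressions live in a TSV under revision control, the same CI that runs detection tests can parse every row and fail the build on a malformed or over-selective exclusion.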
Eventually, detections look like this >>>
So….
23. Repetitive Investigations…What Happens?
• Analysts run queries
in notebooks to
investigate
• Most of these queries look the same, just with different filters
Analyst Review
Alert Orchestration System
24. Automated Investigation Templates
▪ Find corresponding
template notebook
▪ Fill it out
▪ Attach to cluster
▪ Execute
Alert Orchestration
System
Workspace API
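As a rough sketch of the plumbing, the orchestration system can push a filled-out template into the workspace with the Databricks Workspace API (POST /api/2.0/workspace/import) and then trigger it via the Jobs API. The Scala below shows only the import call; host, token, paths, and the Python language choice are placeholders, and error handling is omitted.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.Base64

// Import a generated notebook into the workspace; returns the HTTP status code.
def importNotebook(host: String, token: String, targetPath: String, source: String): Int = {
  val body =
    s"""{"path": "$targetPath",
       | "format": "SOURCE",
       | "language": "PYTHON",
       | "overwrite": true,
       | "content": "${Base64.getEncoder.encodeToString(source.getBytes("UTF-8"))}"}""".stripMargin
  val request = HttpRequest.newBuilder(URI.create(s"$host/api/2.0/workspace/import"))
    .header("Authorization", s"Bearer $token")
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(body))
    .build()
  HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).statusCode()
}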
25. This lets us automate useful things like…
Interactive Process Trees in D3
Baselines of Typical Activity
26. Automated Containment
Machines can find, investigate, and contain issues without humans
Automated Investigation
Alert Orchestration System
ODBC API
• Run substantiating
queries via ODBC
• Render verdict
Contain
Issue
27. Detection Testing
Why is it so painful?
▪ Preserving/exporting JSON
samples
▪ Local SparkSession isn’t a real
cluster
▪ Development happens in
notebooks, testing happens in
IDE
▪ Brittle to even small changes
to schema, etc
28. Detection Functional Tests
▪ 85% reduction in test LoC
▪ write and run tests in
notebooks!
▪ use Delta sample files in
dbfs, no more exporting
JSON
▪ scalatest generation using
config and convention
Trait: DetectionTest
^^ this is a complete test ^^
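The slide's DetectionTest trait hides most of the boilerplate; as a rough approximation of the shape, here is a plain scalatest version that reads a Delta sample from DBFS and asserts on the output. The sample path, filter, and expected count are invented for the example.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.funsuite.AnyFunSuite

class ExampleDetectionTest extends AnyFunSuite {
  private val spark = SparkSession.builder.getOrCreate()

  // Stand-in for the real detection's DataFrame logic.
  private def alertLogic(input: DataFrame): DataFrame =
    input.filter("event_type = 'something_bad'")

  test("detection fires on the known-bad Delta sample") {
    val sample = spark.read.format("delta")
      .load("dbfs:/mnt/test-samples/example_detection")
    assert(alertLogic(sample).count() === 1)
  }
}

// In a notebook, run it interactively with: (new ExampleDetectionTest).execute()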
29. Detection Test CI
Git PR
CI System
Test
Notebooks
Workspace API
/Alerts/Test/PRs/<Git PR number>_<Git commit hash>
Jobs API
Build
Scripts pass/fail
“Testing has never been this fun!!”
— detection engineers, probably
30. Jobs CI — Why?
▪ Managing hundreds of jobs in Databricks UI
▪ Each job has associated notebook, config, dbfs files
▪ No inventory of which jobs should be running, where
▪ We need job linting >>>
32. Deploy/Reconfigure Jobs with Single PR
CI System
Config Linter
Stacks CLI
Jobs Helper
Deploy Job/
Notebooks/Files
Kickstart/Restart
Set Permissions
33. Cool Things with Jobs CI!
▪ Deploy or reconfigure many
jobs concurrently
▪ Auto job restarts on notebook/
config change
▪ Standardization of retries,
timeout, permissions
▪ Automate alarm creation for
new jobs
^^^ No one likes manually crafting
Stacks JSON — so we generate it
35. Problem #1 — Cyclical Investigations
▪ Alert comes in, analysts spend hours
looking into it
▪ But the same thing happened 3
months ago and was determined to be
benign
▪ Lots of wasted cycles on duplicative
investigations
36. Problem #2 — Disparate Context
▪ Want to find historical incident
data?
▪ look in many different places
▪ many search UIs, syntaxes
▪ Manual, slow & painful
▪ New analysts won’t have
historical knowledge
37. Problem #3 — Finding Patterns
Which incidents relate to other
incidents?
Do we see common infrastructure,
actors?
How much work is repeated?
Case #55557, Case #44447, Case #33337 } (some IP address shared across them)
38. Solution: Document Recommendations
▪ Collect all incident-related
tickets, correspondence, and
investigations
▪ Normalize them into a Delta
table
▪ Automate suggestion of
related knowledge using our
own corpus of documents
Emails
Tickets
Alerts
Notebooks
Detection Code
Wikis
39. “Has This Happened Before?” -> Automated
Includes analyst comments and
verdicts
displayHTML suggestions,
clickable links to original document
41. Anatomy of an Alert
These are not valuable for search! (too
common)
These are good indicators of document
relevance
42. Entity Tokenization and Enrichment
IP Address
Regex
Domain
Hashes
Accounts
Serials
UDIDs
File Path
Emails
MAC Addresses
Alert Payload
VPN Sessions
Enrichments
DHCP Sessions
Asset Data
Account Data
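A simplified Scala sketch of the tokenization step: pull typed entities out of an alert payload with regexes, then hand them to the enrichment joins. Only three entity types are shown and the patterns are deliberately loose; the real system covers many more types with production-grade patterns.

// Entity type -> extraction regex (illustrative patterns only).
val entityPatterns: Map[String, scala.util.matching.Regex] = Map(
  "ip_address" -> raw"\b(?:\d{1,3}\.){3}\d{1,3}\b".r,
  "email"      -> raw"\b[\w.+-]+@[\w.-]+\.[A-Za-z]{2,}\b".r,
  "md5"        -> raw"\b[a-fA-F0-9]{32}\b".r
)

// Tokenize one alert payload into entity type -> distinct matches.
def tokenize(payload: String): Map[String, Seq[String]] =
  entityPatterns.map { case (name, rx) =>
    name -> rx.findAllIn(payload).toSeq.distinct
  }

// Extracted entities are then joined against enrichment sources such as
// VPN/DHCP sessions, asset data, and account data before weighting.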
43. Suggestion Algorithm
▪ Gather match statistics for each
entity:
▪ historical rarity
▪ document count rarity
▪ doc type distribution
▪ Compute entity weight based on
average ranked percentiles of those
features
▪ More common terms == less
valuable
▪ Return the best n hits by confidence
▪ Not That Expensive™
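A rough Spark sketch of this weighting idea: rank each entity's rarity features as percentiles, average them into a weight, and score candidate documents by the summed weight of shared entities. Column names and the equal-weight average are assumptions for illustration.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// stats: one row per entity (entity, historical_count, document_count, doc_type_count).
// Global windows are acceptable here because the entity-stats table is comparatively small.
def entityWeights(stats: DataFrame): DataFrame =
  stats
    .withColumn("hist_pct", percent_rank().over(Window.orderBy(col("historical_count").desc)))
    .withColumn("doc_pct",  percent_rank().over(Window.orderBy(col("document_count").desc)))
    .withColumn("type_pct", percent_rank().over(Window.orderBy(col("doc_type_count").desc)))
    // Common entities sort first and get low percentiles, hence low weight.
    .withColumn("weight", (col("hist_pct") + col("doc_pct") + col("type_pct")) / 3)
    .select("entity", "weight")

// matches: (entity, document_id) pairs shared between the alert and past documents.
def topSuggestions(matches: DataFrame, weights: DataFrame, n: Int): DataFrame =
  matches.join(weights, "entity")
    .groupBy("document_id")
    .agg(sum("weight").as("confidence"))
    .orderBy(col("confidence").desc)
    .limit(n)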