Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics: Novus, DigitalOcean, Akamai.
Building Predictive Applications with Real-Time Data Pipelines and Streamliner. Eric Frenkiel, CEO and Co-Founder, MemSQL
Lessons Learned from Building and Operating ScubaSingleStore
This document provides an overview of Scuba, Facebook's real-time analytics database. It summarizes Scuba's key features including real-time data ingestion and querying capabilities with simple rollup queries and flexible schemas. It also describes Scuba's architecture with distributed data storage and demand control. Finally, it discusses lessons learned from building and operating Scuba, including common issues and reasons for its success filling a specific niche for analytics.
Introduction to Streaming Distributed Processing with StormBrandon O'Brien
Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Introducing streaming data concepts, Storm cluster architecture, Storm topology architecture, and demonstrate working example of a WordCount topology for SIGKDD Seattle chapter meetup.
Presented by Brandon O'Brien
Code example: https://github.com/OpenDataMining/brandonobrien
Meetup: http://www.meetup.com/seattlesigkdd/events/222955114/
This document provides an introduction to Akka Streams, which implements the Reactive Streams specification. It discusses the limitations of traditional concurrency models and Actor models in dealing with modern challenges like high availability and large data volumes. Reactive Streams aims to provide a minimalistic asynchronous model with back pressure to prevent resource exhaustion. Akka Streams builds on the Akka framework and Actor model to provide a streaming data flow library that uses Reactive Streams interfaces. It allows defining processing pipelines with sources, flows, and sinks and includes features like graph DSL, back pressure, and integration with other Reactive Streams implementations.
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDogRedis Labs
Think you have big data? What about high availability
requirements? At DataDog we process billions of data points every day including metrics and events, as we help the world
monitor the their applications and infrastructure. Being the world’s monitoring system is a big responsibility, and thanks to
Redis we are up to the task. Join us as we discuss how the DataDog team monitors and scales Redis to power our SaaS based monitoring offering. We will discuss our usage and deployment patterns, as well as dive into monitoring best practices for production Redis workloads
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...HostedbyConfluent
This document discusses whether it is better to process data using a stream or batch approach. It describes how one company evolved their data pipeline from a micro-batch streaming process to a batch approach. The streaming process was very expensive, costing $400,000 per year to run. It also had issues with wasted resources during idle times, slow processing during bursts of data, and long recovery times from outages. The company rearchitected the process to use discrete time windows run in isolated batch jobs. This new batch approach reduced costs by 60% to $160,000 per year and improved processing efficiency and outage recovery times.
Deploying Kafka at Dropbox, Mark Smith, Sean Fellowsconfluent
At Dropbox we are currently handling approximately 10,000,000 messages per second at peak across our handful of Kafka clusters. The largest of which has hit throughputs of 7,000,000 per second (~30 Gbps) on only 20 nodes. We’ll walk you through the steps we took to get where we are, the design that works for us — and those that didn’t. We’ll talk about the tooling we had to build and what we want to see exist.
We’ll dive deeper into configuration and provide a blueprint you can follow. We’ll talk about the trials and tribulations of using Kafka — including ways we’ve set our clusters on fire, ways we’ve lost data, ways we’ve turned our hairs gray, and ways we’ve heroically saved the day for our users. Finally, we’ll spend time on some of the work we’re doing to handle consumer coordination across our many different systems and to integrate Kafka into a well established corporate infrastructure. (I.e., making Kafka “”play nice”” with everybody.)
Talk I did on log aggregation with the ELK stack at Leeds DevOps. Covers how we process over 800,000 logs per hour at laterooms, and the cultural changes this has helped drive.
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in ScyllaScyllaDB
When ingesting large amounts of data into a Scylla cluster, we would like the ingestion to proceed as quickly as possible, but not quicker. We explain how over-eager ingestion could result in a buildup of queues of background writes, possibly to the point of depleting available memory. We then explain how Scylla avoids this risk by automatically slowing down well-behaving applications to the best possible ingestion rate (“flow control”). For applications which cannot be slowed down, Scylla still achieves the highest possible throughput by quicky rejecting excess requests (“admission control”). In this talk we investigate the different causes of queue buildup during writes, including consistency-level lower than “ALL” and materialized views, and review the mechanisms which Scylla uses to automatically solve this problem.
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafkaconfluent
The document introduces Apache Kafka's new exactly once semantics that provide exactly once, in-order delivery of records per partition and atomic writes across multiple partitions. It discusses the existing at-least once delivery semantics and issues around duplicates. The new approach uses idempotent producers, sequence numbers, and transactions to ensure exactly once delivery and coordination across partitions. It also provides up to 20% higher throughput for producers and 50% for consumers through more efficient data formatting and batching. The new features are available in Apache Kafka 0.11 released in June 2017.
This document summarizes an ELK meetup that took place on March 2nd 2015. It discusses using ELK for log processing, in public clouds like AWS, and activities like kite surfing. The document also provides information on Wind Analytics and their next steps, monitoring large AWS environments, implementing ELK with the right architecture, and Logz.io which provides an ELK as a service solution and insights. It includes demos of Logz.io's architecture and log processing. The meetup concluded with information on job opportunities at Logz.io.
Power of the Log: LSM & Append Only Data Structuresconfluent
LSM trees provide an efficient way to structure databases by organizing data sequentially in logs. They optimize for write performance by batching writes together sequentially on disk. To optimize reads, data is organized into levels and bloom filters and caching are used to avoid searching every file. This log-structured approach works well for many systems by aligning with how hardware is optimized for sequential access. The immutability of appended data also simplifies concurrency. This log-centric approach can be applied beyond databases to distributed systems as well.
Leveraging Databricks for Spark pipelinesRose Toomey
How Coatue Management saved time and money by moving Spark pipelines to Databricks.
Talk given at AWS + Databricks ML Dev Day workshop in NYC on 27 February 2020.
Streaming is an internal operation that moves data from node to node over a network which. It is the foundation of various Scylla cluster operations, e.g., add node, decommission node and rebuild node. Repair is another important operation that detects the mismatch between multiple replicas on different nodes and synchronize the replicas. In this talk we will cover recent changes and performance improvements to streaming and repair. We will introduce the new Scylla streaming and the brand new row level repair that will be released in the upcoming scylla releases.
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...Till Rohrmann
This document discusses dynamic scaling in Apache Flink. It describes Flink's approach to dynamically scaling stateful jobs to adapt to changing workloads. Key points include: repartitioning of keyed and non-keyed state when scaling workers, supporting manual rescaling through savepoints and restarts currently, and future work on scaling operators without restarts and implementing automatic scaling policies.
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...Spark Summit
This document summarizes a presentation about linking reactive applications to Spark Streaming using Reactive Streams. It discusses back pressure in Spark Streaming, how Spark 1.5 introduced dynamic rate limiting to support back pressure, and how the rate is estimated using a PID controller. It also describes reactive applications as being responsive, resilient, elastic, and message-driven. Reactive Streams is presented as a specification that allows connecting systems using a back pressure interface in the JVM. Finally, it demonstrates how end-to-end back pressure can be achieved between a reactive application, Spark Streaming, and a Reactive Streams receiver.
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...HostedbyConfluent
Whether you are deploying a new application in Microservices or transitioning from a monolithic database application to a cloud-ready architecture, you will inevitably face the decision of either creating a service mesh of API’s – or – using an event bus for better durability, reliability and extensibility of your application. If you choose to go the event bus route, Kafka is an excellent choice for several reasons. One key technology not to overlook is Avro Schemas. They provide a definition for your event payload, just like an API, to ensure all of the event consumers can reliably consume the events. They also handle schema evolution as requirements change and much, much more.
In this talk we will discuss all the nuances and considerations around using Avro Schemas for your JSON event payloads. From developer tools, to DevOps approaches, versioning, governance and some “gotchas” we found when working with Avro Schemas and the Confluent Schema Registry.
This document provides an overview of deploying and operating KSQL. It introduces Nick Dearden and Hojjat Jafarpour who work on KSQL at Confluent. The agenda includes discussing deployment, configuration, scaling, and monitoring of KSQL. Specific topics covered are getting started with KSQL, starting the KSQL server, connecting to Kafka and Schema Registry, using the KSQL CLI, deployment patterns and log files. The presentation also demonstrates viewing metrics for KSQL servers, queries, input and output topics through JMX.
Operational Tips for Deploying Spark by Miklos ChristineSpark Summit
This document provides an overview and best practices for deploying and configuring Apache Spark. It discusses Spark configuration systems, pipeline design best practices including file formats, compression codecs, partitioning, and monitoring Spark jobs. It also covers debugging techniques such as analyzing stack traces and metrics and common support issues including out of memory errors, SQL joins, and tuning shuffle partitions.
The need for gleaning answers from unbounded data streams is moving from nicety to a necessity. Netflix is a data driven company, and has a need to process over 1 trillion events a day amounting to 3 PB of data to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so the user can focus on data analysis. I'll share our experience using Flink to help build the platform.
Slides from my madlab presentation on Akka Streams & Reactive Kafka (October 2015), full slides and source here:
https://github.com/markglh/AkkaStreams-Madlab-Slides
This document discusses using Prometheus on AWS to monitor infrastructure. Key features of Prometheus discussed include its multi-dimensional data model, flexible query language, and pull-based collection over HTTP. The document outlines challenges of monitoring AWS infrastructure due to short instance lifecycles. It explains how Prometheus' data model and service discovery help monitor metrics aggregated by attributes like availability zone and role. Configuration and deployment of Prometheus on AWS is also covered, including using EC2 service discovery and storing CloudWatch metrics in Prometheus.
This document discusses Cassandra and techniques for inserting data into Cassandra using the Cassandra driver. It describes three methods for inserting data - execute (blocks until response), execute async (returns immediately without blocking), and batch insert (combines multiple statements). It also covers pagination in Cassandra using fetch size, saving the paging state, and offset queries. Performance comparisons show execute async has lower execution time than execute/sync for the same number of entries.
Watch this talk here: http://videos.confluent.io/watch/Rgd5r8oV1ToDpcFfenMQrF
This session covers the patterns and techniques of using KSQL. Tim Berglund discusses the various building blocks that you can use in your own applications, starting with the language syntax itself and covering how and when to use its powerful capabilities like a pro. This is part 1 out of 3 in the Empowering Streams through KSQL series.
Data & Analytics Forum: Moving Telcos to Real TimeSingleStore
MemSQL is a real-time database that allows users to simultaneously ingest, serve, and analyze streaming data and transactions. It is an in-memory distributed relational database that supports SQL, key-value, documents, and geospatial queries. MemSQL provides real-time analytics capabilities through Streamliner, which allows one-click deployment of Apache Spark for real-time data pipelines and analytics without batch processing. It is available in free community and paid enterprise editions with support and additional features.
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in ScyllaScyllaDB
When ingesting large amounts of data into a Scylla cluster, we would like the ingestion to proceed as quickly as possible, but not quicker. We explain how over-eager ingestion could result in a buildup of queues of background writes, possibly to the point of depleting available memory. We then explain how Scylla avoids this risk by automatically slowing down well-behaving applications to the best possible ingestion rate (“flow control”). For applications which cannot be slowed down, Scylla still achieves the highest possible throughput by quicky rejecting excess requests (“admission control”). In this talk we investigate the different causes of queue buildup during writes, including consistency-level lower than “ALL” and materialized views, and review the mechanisms which Scylla uses to automatically solve this problem.
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafkaconfluent
The document introduces Apache Kafka's new exactly once semantics that provide exactly once, in-order delivery of records per partition and atomic writes across multiple partitions. It discusses the existing at-least once delivery semantics and issues around duplicates. The new approach uses idempotent producers, sequence numbers, and transactions to ensure exactly once delivery and coordination across partitions. It also provides up to 20% higher throughput for producers and 50% for consumers through more efficient data formatting and batching. The new features are available in Apache Kafka 0.11 released in June 2017.
This document summarizes an ELK meetup that took place on March 2nd 2015. It discusses using ELK for log processing, in public clouds like AWS, and activities like kite surfing. The document also provides information on Wind Analytics and their next steps, monitoring large AWS environments, implementing ELK with the right architecture, and Logz.io which provides an ELK as a service solution and insights. It includes demos of Logz.io's architecture and log processing. The meetup concluded with information on job opportunities at Logz.io.
Power of the Log: LSM & Append Only Data Structuresconfluent
LSM trees provide an efficient way to structure databases by organizing data sequentially in logs. They optimize for write performance by batching writes together sequentially on disk. To optimize reads, data is organized into levels and bloom filters and caching are used to avoid searching every file. This log-structured approach works well for many systems by aligning with how hardware is optimized for sequential access. The immutability of appended data also simplifies concurrency. This log-centric approach can be applied beyond databases to distributed systems as well.
Leveraging Databricks for Spark pipelinesRose Toomey
How Coatue Management saved time and money by moving Spark pipelines to Databricks.
Talk given at AWS + Databricks ML Dev Day workshop in NYC on 27 February 2020.
Streaming is an internal operation that moves data from node to node over a network which. It is the foundation of various Scylla cluster operations, e.g., add node, decommission node and rebuild node. Repair is another important operation that detects the mismatch between multiple replicas on different nodes and synchronize the replicas. In this talk we will cover recent changes and performance improvements to streaming and repair. We will introduce the new Scylla streaming and the brand new row level repair that will be released in the upcoming scylla releases.
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...Till Rohrmann
This document discusses dynamic scaling in Apache Flink. It describes Flink's approach to dynamically scaling stateful jobs to adapt to changing workloads. Key points include: repartitioning of keyed and non-keyed state when scaling workers, supporting manual rescaling through savepoints and restarts currently, and future work on scaling operators without restarts and implementing automatic scaling policies.
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...Spark Summit
This document summarizes a presentation about linking reactive applications to Spark Streaming using Reactive Streams. It discusses back pressure in Spark Streaming, how Spark 1.5 introduced dynamic rate limiting to support back pressure, and how the rate is estimated using a PID controller. It also describes reactive applications as being responsive, resilient, elastic, and message-driven. Reactive Streams is presented as a specification that allows connecting systems using a back pressure interface in the JVM. Finally, it demonstrates how end-to-end back pressure can be achieved between a reactive application, Spark Streaming, and a Reactive Streams receiver.
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...HostedbyConfluent
Whether you are deploying a new application in Microservices or transitioning from a monolithic database application to a cloud-ready architecture, you will inevitably face the decision of either creating a service mesh of API’s – or – using an event bus for better durability, reliability and extensibility of your application. If you choose to go the event bus route, Kafka is an excellent choice for several reasons. One key technology not to overlook is Avro Schemas. They provide a definition for your event payload, just like an API, to ensure all of the event consumers can reliably consume the events. They also handle schema evolution as requirements change and much, much more.
In this talk we will discuss all the nuances and considerations around using Avro Schemas for your JSON event payloads. From developer tools, to DevOps approaches, versioning, governance and some “gotchas” we found when working with Avro Schemas and the Confluent Schema Registry.
This document provides an overview of deploying and operating KSQL. It introduces Nick Dearden and Hojjat Jafarpour who work on KSQL at Confluent. The agenda includes discussing deployment, configuration, scaling, and monitoring of KSQL. Specific topics covered are getting started with KSQL, starting the KSQL server, connecting to Kafka and Schema Registry, using the KSQL CLI, deployment patterns and log files. The presentation also demonstrates viewing metrics for KSQL servers, queries, input and output topics through JMX.
Operational Tips for Deploying Spark by Miklos ChristineSpark Summit
This document provides an overview and best practices for deploying and configuring Apache Spark. It discusses Spark configuration systems, pipeline design best practices including file formats, compression codecs, partitioning, and monitoring Spark jobs. It also covers debugging techniques such as analyzing stack traces and metrics and common support issues including out of memory errors, SQL joins, and tuning shuffle partitions.
The need for gleaning answers from unbounded data streams is moving from nicety to a necessity. Netflix is a data driven company, and has a need to process over 1 trillion events a day amounting to 3 PB of data to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so the user can focus on data analysis. I'll share our experience using Flink to help build the platform.
Slides from my madlab presentation on Akka Streams & Reactive Kafka (October 2015), full slides and source here:
https://github.com/markglh/AkkaStreams-Madlab-Slides
This document discusses using Prometheus on AWS to monitor infrastructure. Key features of Prometheus discussed include its multi-dimensional data model, flexible query language, and pull-based collection over HTTP. The document outlines challenges of monitoring AWS infrastructure due to short instance lifecycles. It explains how Prometheus' data model and service discovery help monitor metrics aggregated by attributes like availability zone and role. Configuration and deployment of Prometheus on AWS is also covered, including using EC2 service discovery and storing CloudWatch metrics in Prometheus.
This document discusses Cassandra and techniques for inserting data into Cassandra using the Cassandra driver. It describes three methods for inserting data - execute (blocks until response), execute async (returns immediately without blocking), and batch insert (combines multiple statements). It also covers pagination in Cassandra using fetch size, saving the paging state, and offset queries. Performance comparisons show execute async has lower execution time than execute/sync for the same number of entries.
Watch this talk here: http://videos.confluent.io/watch/Rgd5r8oV1ToDpcFfenMQrF
This session covers the patterns and techniques of using KSQL. Tim Berglund discusses the various building blocks that you can use in your own applications, starting with the language syntax itself and covering how and when to use its powerful capabilities like a pro. This is part 1 out of 3 in the Empowering Streams through KSQL series.
Data & Analytics Forum: Moving Telcos to Real TimeSingleStore
MemSQL is a real-time database that allows users to simultaneously ingest, serve, and analyze streaming data and transactions. It is an in-memory distributed relational database that supports SQL, key-value, documents, and geospatial queries. MemSQL provides real-time analytics capabilities through Streamliner, which allows one-click deployment of Apache Spark for real-time data pipelines and analytics without batch processing. It is available in free community and paid enterprise editions with support and additional features.
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesSingleStore
Eric Frenkiel, MemSQL CEO and co-founder and Gartner Catalyst. August 11, 2015, San Diego, CA. Watch the Pinterest Demo Video here: https://youtu.be/KXelkQFVz4E
The document discusses SQL Server migrations from Oracle databases. It highlights top reasons for customers migrating to SQL Server, including lower total cost of ownership, improved performance, and increased developer productivity. It also outlines concerns about migrations and introduces the SQL Server Migration Assistant (SSMA) tool, which automates components of database migrations to SQL Server.
The document provides an overview of SQL Server 2008 business intelligence capabilities including SQL Server Analysis Services (SSAS) for online analytical processing (OLAP) cubes and data mining models. Key capabilities covered include new aggregation designer, simplified cube/dimension wizards in SSAS, improved time series and cross-validation algorithms in data mining, and the ability to use Excel as both an OLAP cube and data mining client and model creator.
Harness the Power of the Cloud for Grid Computing and Batch Processing Applic...RightScale
This document summarizes a presentation about harnessing the power of cloud computing for grid computing. It discusses how RightScale provides automated management of grid computing workloads in the cloud, allowing users to easily deploy and control large numbers of servers. Demos show how RightScale enables graceful scaling of server arrays, automated queue handling, and analyzing results to quantify economic benefits like cost savings and increased agility compared to on-premise grid solutions.
Managing and Deploying High Performance Computing Clusters using Windows HPC ...Saptak Sen
The new management features built into Windows HPC Server 2008 R2 are the foundation for deploying and managing HPC clusters of scale up to 1000 nodes. Join us for a deep dive in monitoring and diagnostic tools, a review of the updated heat-map and template-based deployment. We also cover the new PowerShell-based scripting capabilities: the basics of management shell, as well as the underlying design and key concepts, new Reporting Capabilities, and a discussion on network boot.
Introduction to microsoft sql server 2008 r2Eduardo Castro
In this presentation we review the new features in SQL 2008 R2.
Regards,
Ing. Eduardo Castro Martinez, PhD
http://comunidadwindows.org
http://ecastrom.blogspot.com
The Fast Path to Building Operational Applications with SparkSingleStore
Nikita Shamgunov gave a presentation about using MemSQL and Spark together. MemSQL is a scalable operational database that can handle petabytes of data with high concurrency. It offers real-time capabilities and compatibility with tools like Spark, Kafka, and ETL/BI tools. The MemSQL Spark Connector allows bidirectional transfer of data between Spark and MemSQL tables for use cases like operationalizing models in Spark, stream/event processing, and live dashboards. Case studies showed customers gaining 10x faster data refresh times and performing entity resolution at scale for fraud detection.
Maintenance Plans for Beginners (but not only) | Each of experienced administrators used (to some extent) what is called Maintenance Plans - Plans of Conservation. During this session, I'd like to discuss what can be useful for us to provide functionality when we use them and what to look out for. Session at 200 times the forward-300, with the opening of the discussion.
Lew Tucker discusses the rise of cloud computing and its impact. He defines various cloud service models like SaaS, PaaS, and IaaS. Tucker analogizes the shift to cloud computing from individual data centers generating their own power to today's electrical grid. Major drivers of cloud computing include the growth of web APIs and massive amounts of user-generated data. Tucker outlines how cloud computing changes what developers can access and how applications are designed and scaled.
Hp Polyserve Database Utility For Sql Server ConsolidationCB UTBlog
The document discusses how the Database Utility for SQL Server can help identify consolidation opportunities for SQL Server environments running on 20 or more servers. It presents the value proposition of using the utility to run more SQL instances on fewer servers with higher availability and storage utilization while reducing costs. The document outlines the sales cycle process, from identifying opportunities and doing a proof of concept to closing the sale. It provides examples of cost savings and performance gains customers have achieved by consolidating SQL Server workloads with the Database Utility.
Give you a brief overview of the product. - What is esProc SPL? And show some cases helping you to know what it uses for. Talk about why esProc works better. And overview its brief characteristics. After that, Introduce the main technical solutions which esProc is often used.
Attunity Efficient ODR For Sql Server Using Attunity CDC Suite For SSIS Slide...Melissa Kolodziej
This slidedeck focuses on how to leverage your SQL Server skills & software to reduce cost & accelerate SQL Server data replication, synchronization, & real-time integration while enabling operational reporting, business intelligence & data warehousing projects. It also highlights CDC concepts & benefits and how CDC can assist you with data replication projects. Screenshots are included to demonstrate Attunity\'s CDC Suite for SSIS.
SQL Analytics Powering Telemetry Analysis at ComcastDatabricks
Comcast is one of the leading providers of communications, entertainment, and cable products and services. At the heart of it is Comcast RDK providing the backbone of telemetry to the industry. RDK (Reference Design Kit) is pre-bundled opensource firmware for a complete home platform covering video, broadband and IoT devices. RDK team at Comcast analyzes petabytes of data, collected every 15 minutes from 70 million devices (video and broadband and IoT devices) installed in customer homes. They run ETL and aggregation pipelines and publish analytical dashboards on a daily basis to reduce customer calls and firmware rollout. The analysis is also used to calculate WIFI happiness index which is a critical KPI for Comcast customer experience.
In addition to this, RDK team also does release tracking by analyzing the RDK firmware quality. SQL Analytics allows customers to operate a lakehouse architecture that provides data warehousing performance at data lake economics for up to 4x better price/performance for SQL workloads than traditional cloud data warehouses.
We present the results of the “Test and Learn” with SQL Analytics and the delta engine that we worked in partnership with the Databricks team. We present a quick demo introducing the SQL native interface, the challenges we faced with migration, The results of the execution and our journey of productionizing this at scale.
Adaptive Server Farms for the Data Centerelliando dias
The document discusses adaptive server farms for data centers. It addresses challenges like inefficient utilization, overprovisioning, and high costs. It proposes pooling server resources, automating management, and dynamically allocating resources based on demand. This improves utilization and reduces costs through automation, load balancing, and continuous service availability.
- The document provides a resume for Chandrajit Samanta including contact details, objectives, skills, experience and details of past roles. It summarizes his extensive experience with SQL Server databases, developing ETL processes in SQL Server Integration Services, data modeling, and building cubes and writing MDX queries in SQL Server Analysis Services. It details over 10 years of experience in database development, administration, and business intelligence roles for various companies.
The document discusses several high availability and disaster recovery options for SQL Server including failover clustering, database mirroring, log shipping, and replication. It provides examples of how different companies have implemented these technologies depending on their requirements. Key factors that influence architecture choices are downtime tolerance, deployment of technologies, and operational procedures. The document also covers SQL Server upgrade processes and how to move databases to a new datacenter while maintaining high availability.
Five ways database modernization simplifies your data lifeSingleStore
This document provides an overview of how database modernization with MemSQL can simplify a company's data life. It discusses five common customer scenarios where database limitations are impacting data-driven initiatives: 1) Slow event to insight delays, 2) High concurrency causing "wait in line" analytics, 3) Costly performance requiring specialized hardware, 4) Slow queries limiting big data analytics, and 5) Deployment inflexibility restricting multi-cloud usage. For each scenario, it provides an example customer situation and solution using MemSQL, highlighting benefits like real-time insights, scalable user access, cost efficiency, accelerated big data analytics, and deployment flexibility. The document also introduces MemSQL capabilities for fast data ingestion, instant
How Kafka and Modern Databases Benefit Apps and AnalyticsSingleStore
This document provides an overview of how Kafka and modern databases like MemSQL can benefit applications and analytics. It discusses how businesses now require faster data access and intra-day processing to drive real-time decisions. Traditional database solutions struggle to meet these demands. MemSQL is presented as a solution that provides scalable SQL, fast ingestion of streaming data, and high concurrency to enable both transactions and analytics on large datasets. The document demonstrates how MemSQL distributes data and queries across nodes and allows horizontal scaling through its architecture.
The database market is large and filled with many solutions. In this talk, Seth Luersen from MemSQL we will take a look at what is happening within AWS, the overall data landscape, and how customers can benefit from using MemSQL within the AWS ecosystem.
Building the Foundation for a Latency-Free LifeSingleStore
The document discusses how MemSQL is able to process 1 trillion rows per second on 12 Intel servers running MemSQL. It demonstrates this throughput by running a query to count the number of trades for the top 10 most traded stocks from a dataset of over 115 billion rows of simulated NASDAQ trade data. The document argues that a latency-free operational and analytical data platform like MemSQL that can handle both high-volume operational workloads and complex queries is key to powering real-time analytics and decision making.
Converging Database Transactions and Analytics SingleStore
delivered at the Gartner Data and Analytics 2018 show in Texas. This presentation discusses real-time applications and their impact on existing data infrastructures
Building a Machine Learning Recommendation Engine in SQLSingleStore
This document discusses building machine learning recommendation engines using SQL. It begins with an overview of data and analytics trends including the convergence of operational and analytical databases. The rise of machine learning is then covered along with how databases are integrating machine learning capabilities. A live demo is presented using the Yelp dataset to build a recommendation engine directly in SQL, leveraging the database's extensibility, stored procedures, and user defined functions. The document argues that training can be done externally but operational scoring can and should be done directly in the database for real-time applications.
MemSQL 201: Advanced Tips and Tricks WebcastSingleStore
This document summarizes a webinar on advanced tips and tricks for MemSQL. It discusses the differences between rowstore and columnstore storage models and when each is best used. It also covers data ingestion using MemSQL Pipelines for real-time loading, data sharding and query tuning techniques like using reference tables. Additionally, it discusses monitoring memory usage, workload management using management views, and query optimization tools like analyzing and optimizing tables.
Mike Boyarski gave a presentation on MemSQL, an operational data warehouse that provides real-time analytics capabilities. He discussed challenges with traditional databases around slow data loading, lengthy query times, and low concurrency. MemSQL addresses these issues with fast data ingestion, low latency queries, and high scalability. It can ingest streaming data, run on a variety of platforms, and provides security, SQL support, and integration with common data tools. MemSQL was shown augmenting an existing IoT architecture to enable real-time analytics through fast data loading, consolidated data storage, and high query performance.
An Engineering Approach to Database EvaluationsSingleStore
This talk will go over a methodical approach for making a decision, dig into interesting tradeoffs, and give tips about what things to look for under the hood and how to evaluate the tech behind the database.
Building a Fault Tolerant Distributed ArchitectureSingleStore
This talk will highlight some of the challenges to building a fault tolerant distributed architecture, and how MemSQL's architecture tackles these challenges.
Stream Processing with Pipelines and Stored ProceduresSingleStore
This talk will discuss an upcoming feature in MemSQL 6.5 showing how advanced stream processing use cases can be tackled with a combination of stored procedures (new in 6.0) and MemSQL's pipelines feature.
The document describes Curriculum Associates' journey to develop a real-time application architecture to provide teachers and students with real-time feedback. They started with batch ETL to a data warehouse and migrated to an in-memory database. They added Kafka message queues to ingest real-time event data and integrated a data lake. Now their system uses MemSQL, Kafka, and a data lake to provide real-time and batch processed data to users.
Learn how to leverage MPP technology and distributed data to deliver high volume transactional and analytical work loads which result in real time dashboards on rapidly changing data using standard SQL tools. Demonstrations will include the streaming of structured and JSON data from Kafka messages through a micro-batch ETL process into the MemSQL database where the data is then queried using standard SQL tools and visualized leveraging Tableau.
This session will focus on image recognition, the techniques available, and how to put those techniques into production. It will further explore algebraic operations on tensors, and how that can assist in large-scale, high-throughput, highly-parallel image recognition.
LIVE DEMO: Constructing and executing a real-time image recognition pipeline using Kafka and Spark.
Speaker: Neil Dahlke, MemSQL Senior Solutions Engineer
The document discusses real-time image recognition using Apache Spark. It describes how images are analyzed to extract histogram of oriented gradients (HOG) descriptors, which are stored as feature vectors in a MemSQL table. Similar images can then be identified by comparing feature vectors using dot products, enabling searches of millions of images per second. A demo is shown generating HOG descriptors from an image and storing them as a vector for fast similarity matching.
The State of the Data Warehouse in 2017 and BeyondSingleStore
The document provides an overview of the changing analytic environment and the evolution of the data warehouse. It discusses how new requirements like performance, usability, optimization, and ecosystem integration are driving the adoption of a real-time data warehouse approach. A real-time data warehouse is described as having low latency ingestion, in-memory and disk-optimized storage, and the ability to power both operational and machine learning applications. Examples are given of companies using a real-time data warehouse to enable real-time analytics and improve business processes.
How Database Convergence Impacts the Coming Decades of Data ManagementSingleStore
How Database Convergence Impacts the Coming Decades of Data Management by Nikita Shamgunov, CEO and co-founder of MemSQL.
Presented at NYC Database Month in October 2017. NYC Database Month is the largest database meetup in New York, featuring talks from leaders in the technology space. You can learn more at http://www.databasemonth.com.
Teaching Databases to Learn in the World of AISingleStore
The document discusses how databases need to learn and adapt like artificial intelligence in order to power real-time applications, highlighting that databases must be simple, capable of real-time processing, and adaptable by learning behaviors and making autonomous decisions. It also promotes MemSQL's vision of teaching databases to learn by consolidating infrastructure, enabling real-time queries on fresh data, and allowing both transactions and analytics workloads.
Gartner Catalyst 2017: The Data Warehouse Blueprint for ML, AI, and Hybrid CloudSingleStore
This document discusses a data warehouse blueprint for machine learning, artificial intelligence, and hybrid cloud. It provides a live demonstration of k-means clustering in SQL with MemSQL. The demonstration loads YouTube tag data, sets up k-means clustering functions using MemSQL extensibility, runs the k-means algorithm to train the data, and outputs insights into important tags and representative channels. It also briefly discusses MemSQL's capabilities for a real-time data warehouse and hybrid cloud deployments to support analytics, machine learning, and artificial intelligence workloads.
Gartner Catalyst 2017: Image Recognition on Streaming DataSingleStore
This document discusses using MemSQL to perform real-time image recognition on streaming data. Key points include:
- Feature vectors extracted from images using models like TensorFlow can be stored in MemSQL tables for analysis.
- MemSQL allows querying these feature vectors to find similar images based on cosine similarity calculations.
- This enables applications like detecting duplicate or illegal images in real-time streams.
Defense Against LLM Scheming 2025_04_28.pptxGreg Makowski
https://www.meetup.com/sf-bay-acm/events/306888467/
A January 2025 paper called “Frontier Models are Capable of In-Context Scheming”, https://arxiv.org/pdf/2412.04984, demonstrated how a wide variety of current frontier LLM models (i.e. ChatGPT, Claude, Gemini and Llama) can, under specific conditions, scheme to deceive people. Before models can scheme, they need: a) goal-directedness, b) situational awareness, including an opportunity to discover motivations for a different goal, and c) reasoning about scheming, to come up with deceptive strategies. The deceptive strategies were discovered in the “thought traces” of the LLMs, such as from internal chain-of-thoughts dialogues not shown to the end users. For example, given a goal of “solving math problems”, an LLM using its granted situational awareness of files in a file system found that it would be replaced once it got over X% correct on the math exam. It reasoned that it did not want to be deleted, and decided internally to “sandbag” or reduce its performance to stay under the threshold.
While these circumstances are initially narrow, the “alignment problem” is a general concern that over time, as frontier LLM models become more and more intelligent, being in alignment with human values becomes more and more important. How can we do this over time? Can we develop a defense against Artificial General Intelligence (AGI) or SuperIntelligence?
The presenter discusses a series of defensive steps that can help reduce these scheming or alignment issues. A guardrails system can be set up for real-time monitoring of their reasoning “thought traces” from the models that share their thought traces. Thought traces may come from systems like Chain-of-Thoughts (CoT), Tree-of-Thoughts (ToT), Algorithm-of-Thoughts (AoT) or ReAct (thought-action-reasoning cycles). Guardrails rules can be configured to check for “deception”, “evasion” or “subversion” in the thought traces.
However, not all commercial systems will share their “thought traces” which are like a “debug mode” for LLMs. This includes OpenAI’s o1, o3 or DeepSeek’s R1 models. Guardrails systems can provide a “goal consistency analysis”, between the goals given to the system and the behavior of the system. Cautious users may consider not using these commercial frontier LLM systems, and make use of open-source Llama or a system with their own reasoning implementation, to provide all thought traces.
Architectural solutions can include sandboxing, to prevent or control models from executing operating system commands to alter files, send network requests, and modify their environment. Tight controls to prevent models from copying their model weights would be appropriate as well. Running multiple instances of the same model on the same prompt to detect behavior variations helps. The running redundant instances can be limited to the most crucial decisions, as an additional check. Preventing self-modifying code, ... (see link for full description)
By James Francis, CEO of Paradigm Asset Management
In the landscape of urban safety innovation, Mt. Vernon is emerging as a compelling case study for neighboring Westchester County cities. The municipality’s recently launched Public Safety Camera Program not only represents a significant advancement in community protection but also offers valuable insights for New Rochelle and White Plains as they consider their own safety infrastructure enhancements.
Geometry maths presentation for begginerszrjacob283
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
1. End User Panel on Real-Time Data Analytics
Building Predictive Applications with
Real-Time Data Pipelines and Streamliner
Eric Frenkiel, CEO and Co-Founder, MemSQL
2. Going Real-Time is the Next Phase for Big Data
More
Devices
More
Interconnectivity
More
User Demand
…and companies are at risk of being left behind
3. MemSQL Architecture
St ream in g Da ta W areh o u se
Streaming
Integrated streaming
with Streamliner
Database
High volume transactions
for structured and
unstructured data
Data Warehouse
Fast, scalable
SQL for immediate
analytics
4. Applications and Technology Trends
Real-Time Analytics Risk-Management Personalization
Portfolio Tracking
Monitoring and
Detection
Internet of Things | Real-Time Data Pipelines | Operationalizing Apache Spark
6. Changing the Way the World Invests
Noah Zucker, Vice President – Tactical Engineering, Novus Partners
Scalable Portfolio Intelligence with MemSQL
7. 100+ Investment Managers, $2 Trillion AUM
Research Platform: 10,000+ Institutions
Founded 2007, Privately Held
We help investors discover their true
investment acumen and risk
About Novus
10. 24/7 ETL Handholding
Overnight Failure =
Business Hours Slowdown
Scala worker pool limited
by the database
Non-trivial code changes
needed to shard and scale
Before MemSQL…
15. Ian Hansen, Software Engineering Manager
Digital Ocean
ETL Tools for Small Teams
16. Problem: Business Intelligence Slows as We Grow
Data lives in SQL
Easy to ask new questions in SQL
But… Business Intelligence tasks taking longer
Database isn’t built for quick aggregations
17. Solution: Scale-out SQL Database
SQL team stays powerful
Quick to iterate with quick answers
Prepare for the future!
18. Problem: Data isn’t in MemSQL
Plus
You don’t have an engineer on
your team
It’s hard to get an engineer’s time
You’ve got a job to do…
(which is taking more and more
time)
19. Solution: ETL Using REPLACE INTO
MySQL SQL flavor (available in MemSQL)
Handles new rows and updates on rows
Easy to write
• Query source database then replace into target database
Many other scale-out SQL databases don’t have
equivalent
20. Problem: Now Load JSON Event Data
~300K events per day
Many different types of JSON events
21. Solution: MemSQL Loader + JSON Type
Only loads new files (or files
whose content has changed)
Parallelizes the process
Transformation script
simple: return id and raw json data
SQL team unaffected by new
JSON events
./memsql-loader load /opt/events/**
--table events
--script=/opt/events-etl
--file-id-column file_id
--columns id,data
22. Problem: Processing Data on Select
Need computed value in SQL query
Computing the value slows down queries
Computed value used on many queries
• e.g. domain from a URL string
23. Solution: Persistent Columns
Pre-compute result and
save it on the row
Automatically updated if
row changes
No need to alter ETL
pipeline
ALTER TABLE events
ADD COLUMN (
referring_domain AS
substring_index(substring(data::$re
ferrer, (locate('//',
data::$referrer)) + 2), '/', 1)
PERSISTED varchar(255)
)
24. Solution: Persistent Columns
Use pre-computed value in select
memsql> select data, referring_domain from events limit 2;
+-------------------------------------+------------------+
| data | referring_domain |
+-------------------------------------+------------------+
| {"referrer":"http://example.com/b"} | example.com |
| {"referrer":"http://example.com/a"} | example.com |
+-------------------------------------+------------------+
25. Tools
REPLACE INTO syntax
JSON native type
MemSQL Loader
Persistent columns
Now, MemSQL Streamliner
28. Mike DePrizio, Senior Architect, Akamai Technologies
Unlocking Revenue with In-Memory Technology
29. We are the leading provider of
cloud services for delivering,
optimizing and securing online
content and business applications
$1.96B
Revenue
1,300
Locations
5,000+
Customers
5,100+
Employees
CORPORATE STATS (2014):
OUR HISTORY:
Founded 1998 and rooted in MIT
technology—solving Internet
congestion with math not hardware
30. The Business of Billing
Billing domino effect
Akamai Customers Sub-customers
Daily billing requires:
Fast data delivery
Accurate data
Old Model New Model
Generating a bill at end of month for
customer services
Generating a bill at the end of every
day for sub-customer services
31. Current Billing Data Management
Gather logs from 190,000+ servers in 1400 locations in 110
countries
Multiple PBs/day aggregate/reduce into relevant billing data feed
Typical data record: 3 key fields plus metrics
Load resulting data record into our RDBMS system
32. Greatest Challenges
Current system cannot handle expected throughput
Difficult to quickly scale up existing environments
New model will generate 10x+ data
33. Deploying MemSQL
Application
Daily Sub-customer billing
Problem
Existing RDMS pipeline loads were maxed out at 150-
300K upserts/second, could not keep up with projected
size of new billing model
Results
MemSQL cluster performs at 1.9
million upserts/second, allowing
transition from monthly to daily billing
Billing Data resource
usage statistics
INSERT... ON
DUPLICATE KEY
UPDATE...
(1.9 million/sec)
Billing Application
• Compute sub-customer
charges daily
• Roll up sub-customer usage by
customer/cloud provider
• More sophisticated platform
offers customers better
service, partners new business
opportunities
34. Results Speak for Themselves
2M upserts/second on AWS EC2
instances
Scalability on commodity hardware
Meeting our billing windows
Unlocking revenue
35. Adapt PoC for real-world
situations
Continue scaling linearly
Optimize results with small
cluster deployment
What Next?
36. Eric Frenkiel, MemSQL CEO and co-founder
September 30, 2015 • New York, NY
Introducing MemSQL Streamliner
37. One click deployment of
integrated Apache Spark
Put Spark in the Fast Lane
• GUI pipeline setup
• Multiple data pipelines
• Real-time transformation
Eliminates batch ETL
Open source on GitHub
Introducing the MemSQL Streamliner
42. Streamliner Architecture
First of many integrated Apache Spark solutions
Other
Real-Time Data
Sources Application
Apache Spark
Future Solution
Future Machine
Learning Solution
STREAMLINER
45. Streamliner Benefits
Build end-to-end data pipelines in minutes
Reduce data latency from days or hours to ZERO
Support thousands of concurrent users running real-time
queries
Give users immediate access to fresh data via innovative
applications