These are the slides from my talk at the 2012 Sphinx Search Day in Santa Clara, California. They provide a high-level picture of where Sphinx is used at craigslist, along with a bit of history, current issues, and future work.
- Craigslist is a classified advertising website serving over 500 cities worldwide, handling over 20 billion pageviews and 50 million users per month. It allows users to post free classified ads for jobs, housing, items for sale, and other services.
- The technical challenges for Craigslist include high ad churn rate, growth in traffic volume, need for data archiving and search capabilities, and maintaining the system with a small team.
- Craigslist uses open source technologies like MySQL, memcached, Apache, and Sphinx to power its infrastructure while keeping it simple, efficient and low cost. It employs techniques like vertical and horizontal data partitioning and incremental indexing to handle its scale.
Living with SQL and NoSQL at craigslist, a Pragmatic Approach (Jeremy Zawodny)
From the 2012 Percona Live MySQL Conference in Santa Clara, CA.
Craigslist uses a variety of data storage systems in its backend systems: in-memory, SQL, and NoSQL. This talk is an overview of how craigslist works with a focus on the data storage and management choices that were made in each of its major subsystems. These include MySQL, memcached, Redis, MongoDB, Sphinx, and the filesystem. Special attention will be paid to the benefits and tradeoffs associated with choosing from the various popular data storage systems, including long-term viability, support, and ease of integration.
Lessons Learned Migrating 2+ Billion Documents at Craigslist (Jeremy Zawodny)
Lessons Learned from Migrating 2+ Billion Documents at Craigslist outlines Craigslist's migration from MySQL to MongoDB. Some key lessons include: knowing your hardware limitations, that replica sets provide high availability during reboots, understanding your data types and sizes, and being aware of limitations with sharding and replica set re-sync processes. The migration addressed issues with their archive data storage and provided a more scalable and performant system.
This document discusses Craigslist's migration from older MySQL database servers to new servers equipped with Fusion-io SSDs. It describes Craigslist's high database load of over 100 million postings and 1 billion daily page views. The migration involved replacing 14 older, less performant servers with just 3 new servers using Fusion-io SSDs. This reduced total power usage from 4,500 watts to 570 watts while greatly increasing I/O performance and reducing query response times.
Understanding and tuning WiredTiger, the new high performance database engine... (Ontico)
MongoDB 3.0 introduced the concept of pluggable storage engines. The new engine, known as WiredTiger, introduces document-level MVCC locking, compression, and a choice between B-tree or LSM indexes. In this talk you will learn about the storage engine architecture, specifically WiredTiger, and how to tune and monitor it for best performance.
Frontera: a distributed crawler for large-scale web crawling / Alexander S... (Ontico)
In this talk I am going to share our experience crawling the Spanish internet. We set ourselves the goal of crawling about 600 thousand websites in the .es zone in order to collect statistics about the hosts and their sizes. I will cover the crawler's architecture, the storage, the problems we ran into during the crawl, and how we solved them.
Our solution is available as the open source framework Frontera. The framework lets you build a distributed crawler for downloading pages from the Internet at large scale in real time. It can also be used to build focused crawlers that fetch a subset of websites known in advance.
The framework offers: configurable storage for URLs and documents (RDBMS or key-value), crawl strategy management, a transport layer abstraction, and a download module abstraction.
The talk is structured as an engaging story: a description of the problem, the solution, and the issues that came up while building that solution.
«Scrapy internals» Alexander Sibiryakov, Scrapinghub (it-people)
- Scrapy is a framework for web scraping that allows for extraction of structured data from HTML/XML through selectors like CSS and XPath. It provides features like an interactive shell, feed exports, encoding support, and more.
- Scrapy is built on top of the Twisted asynchronous networking framework, which provides an event loop and deferreds. It handles protocols and transports like TCP, HTTP, and more across platforms.
- Scrapy architecture includes components like the downloader, scraper, and item pipelines that communicate internally. Flow control is needed between these to limit memory usage and scheduling through techniques like concurrent item limits, memory limits, and delays between calls.
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger (MongoDB)
This document summarizes GameChanger's journey from handling 1.48 billion events to scaling MongoDB to support increasing load. It discusses modeling data for MongoDB's flexible schema, scaling to handle more users and load by decreasing latency, growing the database by denormalizing and propagating changes, and extending the database by leveraging MongoDB's features. Key advice includes designing for MongoDB's strengths, using monolithic documents, avoiding live querying, and considering the overall architecture when scaling.
Optimizing MongoDB: Lessons Learned at Localytics (andrew311)
Tips, tricks, and gotchas learned at Localytics for optimizing MongoDB installs. Includes information about document design, indexes, fragmentation, migration, AWS EC2/EBS, and more.
View all the MongoDB World 2016 Poster Sessions slides in one place!
Table of Contents:
1: BigData DB Infrastructure for Modeling the Fly Brain
2: Taming the WiredTiger Cache
3: Sharding with MongoDB 3.2 Kick the tires and pop the hood!
4: Scaling Proactive Anomaly Detection
5: MongoTx: Transactions with Sharding and Queries
6: MongoDB: It’s Not Too Late To Shard
7: DLIFLC usage of MongoDB
MongoDB can be used simply as a log collector using for example a capped collection. Fotopedia has such a system which is used for quick introspection and realtime analysis.
Talk given on the 23rd of March, 2011 at the MongoFR days in Paris (La Cantine) by Pierre Baillet and Mathieu Poumeyrol.
A New MongoDB Sharding Architecture for Higher Availability and Better Resour... (leifwalsh)
Most modern databases concern themselves with their ability to scale a workload beyond the power of one machine. But maintaining a database across multiple machines is inherently more complex than it is on a single machine. As soon as scaling out is required, suddenly a lot of scaling out is required, to deal with new problems like index suitability and load balancing.
Write optimized data structures are well-suited to a sharding architecture that delivers higher efficiency than traditional sharding architectures. This talk describes a new sharding architecture for MongoDB applications that can be achieved with write optimized storage like TokuMX's Fractal Tree indexes.
This document provides an overview and introduction to key MongoDB concepts including:
- Replication which allows for failover, backups, and high availability through asynchronous replication across replica sets.
- Sharding which provides horizontal scalability by automatically distributing and balancing data across multiple shards in a cluster.
- Consistency and durability models including eventual consistency and different write acknowledgement options for ensuring data is safely written.
- Flexibility in data modeling through embedding and linking of related data as well as the use of JSON which maps easily to objects.
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team (Redis Labs)
Redis is an in-memory database that provides fast performance for powering lightning fast apps. It supports many data structures like strings, hashes, lists, sets and sorted sets. Redis is efficient due to its support for many data structures and commands, as well as its complexity-aware design. Redis Labs provides fully-managed cloud services for Redis and Memcached, and helps customers with challenges around scalability, high availability, performance and monitoring for large-scale Redis deployments.
1. The document describes SPEEDA's use of Elasticsearch to improve search performance over their previous MySQL solution.
2. Key points include how Elasticsearch allowed them to handle a large volume of search queries for 1000 companies and 1000 motors with real-time performance.
3. It also discusses their use of Elasticsearch features like phrase prefix searching and analyzer configurations to support searches in both Japanese and English.
Redis is an open source, advanced key-value store that can be used as a data structure server since it supports strings, hashes, lists, sets and sorted sets. It is written in C, works on most POSIX systems, and can be accessed from many programming languages. Redis provides options for data persistence like snapshots and write-ahead logging, and can be replicated for scalability and high availability. It supports master-slave replication, sentinel-based master detection, and sharding via Redis clusters. Redis has been widely adopted by many companies and is used in applications like microblogging services.
Presentation on MongoDB given at the Hadoop DC meetup in October 2009. Some of the slides at the end are extra examples that didn't appear in the talk, but might be of interest.
Back to Basics Webinar 6: Production Deployment (MongoDB)
This is the final webinar of a Back to Basics series that will introduce you to the MongoDB database. This webinar will guide you through production deployment.
Sharding allows you to distribute load across multiple servers and keep your data balanced across those servers. This session will review MongoDB’s sharding support, including an architectural overview, design principles, and automation.
Recording: https://www.youtube.com/watch?v=qHkXVY2LpwU
External links: https://gist.github.com/itamarhaber/dddc3d4d9c19317b1477
Applications today are required to process massive amounts of data and return responses in real time. Simply storing Big Data is no longer enough; insights must be gleaned and decisions made as soon as data rushes in. In-memory databases like Redis provide the blazing fast speeds required for sub-second application response times. Using a combination of in-memory Redis and disk-based MongoDB can significantly reduce the “digestive” challenge associated with processing high velocity data.
This document provides information about using MongoDB with Ruby. It discusses installing MongoDB on Mac OS X and Linux, running MongoDB, comparing MongoDB and CouchDB, using MongoDB ORMs like MongoMapper in Ruby applications, defining models and relationships, and additional features of MongoDB and MongoMapper. The conclusion recommends considering MongoDB as an alternative to MySQL for some web applications due to its speed, features, and schema-less flexibility.
This document discusses PostgreSQL and its use with Drupal. It provides an overview of PostgreSQL, highlighting its features such as being object-relational, open source, standards compliant, and supporting advanced data types and indexes. It also discusses installing and managing PostgreSQL and Drupal together, and the benefits of using PostgreSQL with Drupal due to its advanced optimizer and support in Drupal through its database abstraction layer. Finally, it provides recommendations for different roles, such as considering PostgreSQL for its growth opportunities, learning new skills, and optimizing queries and caching when using it with Drupal.
Webinar Back to Basics 3: Introduction to Replica Sets (MongoDB)
A replica set in MongoDB is a group of processes that maintain copies of the data on different database servers. Replica sets provide redundancy and high availability and are the foundation of all production MongoDB deployments.
Redis is a key-value store that can be used as a database, cache, and message broker. It supports basic data structures like strings, hashes, lists, sets, sorted sets with operations that are fast thanks to storing the entire dataset in memory. Redis also provides features like replication, transactions, pub/sub messaging and can be used for caching, queueing, statistics and inter-process communication.
From MySQL to MongoDB at Wordnik (Tony Tam), MongoSF
Wordnik migrated their live application from MySQL to MongoDB to address scaling issues. They moved over 5 billion documents totaling over 1.2 TB of data with zero downtime. The migration involved setting up MongoDB infrastructure, designing the data model and software to match their existing object model, migrating the data, and optimizing performance of the new system. They achieved insert rates of over 100,000 documents per second during the migration process and saw read speeds increase to 250,000 documents per second after completing the move to MongoDB.
This document discusses using Redis as a database for the backend of a Facebook game application. It describes the requirements of supporting 1 million daily users with high write throughput needs. A Redis database was chosen because it provides fast in-memory performance suitable for the application's random access workload. Redis was able to meet the throughput requirements of 200,000 requests per minute and support storing 100KB of data per user in memory. The document provides advice to choose the right tool for the job and avoid sharding until necessary to keep the database configuration simple.
Fulltext engine for non-fulltext searches (Adrian Nuta)
Or, better said: when Sphinx can help MySQL with queries that at first glance don't involve any fulltext searching.
Sphinx was built with helping the database on fulltext queries in mind, but it can also help where there is no text search at all: the everyday queries that combine filtering, grouping and sorting, used for analytics, reporting or simply general usage.
In Sphinx, the fulltext query is executed first, creating a result set that is passed to the remaining operations (filters, groups, sorts). By reducing the size of the set that is interrogated, the whole query is not only faster, it also consumes fewer resources.
Because it is designed for speed, Sphinx can group and sort a lot faster, and can easily do segmentation or fetch the top-N best group matches in a single query.
The result is that heavy work can be offloaded from the database nodes to even a single Sphinx server.
Slides were presented at PerconaLive London 2013
Sphinx is a fulltext search engine that provides more advanced indexing and querying capabilities than MySQL fulltext search. It uses an inverted index for fast searching and supports various ranking factors, search operators, and morphology tools. Sphinx can be easily integrated with MySQL for indexing and querying via SphinxQL.
Mwasaha Mwagambo Mwasaha successfully completed the online course "Managing Big Data with MySQL" offered through Coursera and authorized by Duke University. The certificate confirms Mwasaha's identity and participation in the course, as verified by Daniel Egger, Director of the Center for Quantitative Modeling at Pratt School of Engineering, and Jana Schaich Borg, a Post-doctoral Fellow in Psychiatry and Behavioral Sciences.
Sphinx - High performance full-text search for MySQL (Nguyen Van Vuong)
The document discusses Sphinx, an open source full-text search engine. It begins with an overview of full-text search and what Sphinx is - a high performance search engine that integrates well with SQL databases. The document then covers Sphinx's workflow, including indexing data, searching via its API or SphinxQL, and its query syntax. It also discusses how Sphinx scales horizontally across nodes and clusters.
MySQL Indexing - Best practices for MySQL 5.6 (MYXPLAIN)
This document provides an overview of MySQL indexing best practices. It discusses the types of indexes in MySQL, how indexes work, and how to optimize queries through proper index selection and configuration. The presentation emphasizes understanding how MySQL utilizes indexes to speed up queries through techniques like lookups, sorting, avoiding full table scans, and join optimizations. It also covers new capabilities in MySQL 5.6 like index condition pushdown that provide more flexible index usage.
The technology world has almost written MySQL off in favor of fancy new NoSQL databases like MongoDB and Cassandra, or even Hadoop for aggregation. But MySQL has a lot to offer in terms of ACIDity, performance and simplicity. For many use cases MySQL works well. In this week's ShareThis workshop we discuss different tips & techniques to improve performance and extend the lifetime of your MySQL deployment.
As your data grows, the need to establish proper indexes becomes critical to performance. MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application?
In this talk we’ll cover how indexing works, the various indexing options, and use cases where each can be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale.
This document provides tips for tuning a MySQL database to optimize performance. It discusses why tuning is important for cost effectiveness, performance, and competitive advantage. It outlines who should be involved in tuning including application designers, developers, DBAs and system administrators. The document covers what can be tuned such as applications, databases structures, and hardware. It provides best practices for when and how much to tune a database. Specific tuning techniques are discussed for various areas including application development, database design, server configuration, and storage engine optimizations.
MySQL users commonly ask: Here's my table, what indexes do I need? Why aren't my indexes helping me? Don't indexes cause overhead? This talk gives you some practical answers, with a step by step method for finding the queries you need to optimize, and choosing the best indexes for them.
10 SQL Tricks that You Didn't Think Were Possible (Lukas Eder)
SQL is the winning language of Big Data. Whether you’re running a classic relational database, a column store (“NewSQL”), or a non-relational storage system (“NoSQL”), a powerful, declarative, SQL-based query language makes the difference. The SQL standard has evolved drastically in the past decades, and so have its commercial and open source implementations.
In this fast-paced talk, we’re going to look at very peculiar and interesting data problems and how we can solve them with SQL. We’ll explore common table expressions, hierarchical SQL, table-valued functions, lateral joins, row value expressions, window functions, and advanced data types, such as XML and JSON. And we’ll look at Oracle’s mysterious MODEL and MATCH_RECOGNIZE clauses, devices whose mystery is only exceeded by their power. Most importantly, however, we’re going to learn that everyone can write advanced SQL. Once you learn the basics in these tricks, you’re going to love SQL even more.
Indexes are references to documents that are efficiently ordered by key and maintained in a tree structure for fast lookup. They improve the speed of document retrieval, range scanning, ordering, and other operations by enabling the use of the index instead of a collection scan. While indexes improve query performance, they can slow down document inserts and updates since the indexes also need to be maintained. The query optimizer aims to select the best index for each query but can sometimes be overridden.
Tomas Doran presented on their implementation of Logstash at TIM Group to process over 55 million messages per day. Their applications are all Java/Scala/Clojure and they developed their own library to send structured log events as JSON to Logstash using ZeroMQ for reliability. They index data in Elasticsearch and use it for metrics, alerts and dashboards but face challenges with data growth.
This document summarizes Jeremy Zawodny's work with MySQL and search at Craigslist. It discusses how Craigslist uses MySQL for its classified listings but encountered scaling issues as traffic grew. To address this, Craigslist implemented the Sphinx search engine, which improved performance and allowed them to reduce their MySQL cluster size. The document also outlines Craigslist's data archiving strategy using eventual consistency and their goals for further optimizing their database and search infrastructure.
Frontera: open source, large scale web crawling framework (Scrapinghub)
This document describes Frontera, an open source framework for large scale web crawling. It discusses the architecture and components of Frontera, which includes Scrapy for network operations, Apache Kafka as a data bus, and Apache HBase for storage. It also outlines some challenges faced during the development of Frontera and solutions implemented, such as handling large websites that flood the queue, optimizing traffic to HBase, and prioritizing URLs. The document provides details on using Frontera to crawl the Spanish (.es) web domain and presents results and future plans.
This document summarizes a keynote speech given by John Adams, an early Twitter engineer, about scaling Twitter operations from 2008-2009. Some key points:
1) Twitter saw exponential growth rates from 2008-2009, processing over 55 million tweets per day and 600 million searches per day.
2) Operations focused on improving performance, reducing errors and outages, and using metrics to identify weaknesses and bottlenecks like network latency and database delays.
3) Technologies like Unicorn, memcached, Flock, Cassandra, and daemons were implemented to improve scalability beyond a traditional RDBMS and handle Twitter's data volumes and real-time needs.
4) Caching,
This document discusses using ZeroMQ and Elasticsearch for log aggregation. It proposes using ZeroMQ to transmit structured log data from application servers to a central Logstash server, which would then insert the logs into Elasticsearch for querying and analysis. This approach aims to provide a lightweight logging solution that doesn't block application servers like traditional logging to databases can. The document also provides background on tools like Logstash, Elasticsearch, and Splunk.
RedisConf18 - Redis at LINE - 25 Billion Messages Per Day (Redis Labs)
LINE uses Redis for caching and primary storage of messaging data. It operates over 60 Redis clusters with over 1,000 machines and 10,000 nodes to handle 25 billion messages per day. LINE developed its own Redis client and monitoring system to support client-side sharding without a proxy, automated failure detection, and scalable cluster monitoring. While the official Redis Cluster was tested, it exhibited some issues around memory usage and maximum node size for LINE's large scale needs.
Messaging, interoperability and log aggregation - a new framework (Tomas Doran)
In this talk, I will cover why log files are horrible, logging structured log lines and performance metrics from large-scale production applications, as well as building reliable, scalable and flexible large-scale software systems in multiple languages.
Why (almost) all log formats are horrible will be explained, and why JSON is a good solution for logging will be discussed, along with a number of message queuing, middleware and network transport technologies, including STOMP, AMQP and ZeroMQ.
The Message::Passing framework will be introduced, along with the logstash.net project which the perl code is interoperable with. These are pluggable frameworks in ruby/java/jruby and perl with pre-written sets of inputs, filters and outputs for many many different systems, message formats and transports.
They were initially designed to be aggregators and filters of data for logging. However they are flexible enough to be used as part of your messaging middleware, or even as a replacement for centralised message queuing systems.
You can have your cake and eat it too - an architecture which is flexible, extensible, scalable and distributed. Build discrete, loosely coupled components which just pass messages to each other easily.
Integrate and interoperate with your existing code and code bases easily, consume from or publish to any existing message queue, logging or performance metrics system you have installed.
Simple examples using common input and output classes will be demonstrated using the framework, as will easily adding your own custom filters. A number of common messaging middleware patterns will be shown to be trivial to implement.
Some higher level use-cases will also be explored, demonstrating log indexing in ElasticSearch and how to build a responsive platform API using webhooks.
Interoperability is also an important goal for messaging middleware. The logstash.net project will be highlighted and we'll discuss crossing the single language barrier, allowing us to have full integration between java, ruby and perl components, and to easily write bindings into libraries we want to reuse in any of those languages.
Elasticsearch is a distributed, RESTful search and analytics engine that can be used for processing big data with Apache Spark. Data is ingested from Spark into Elasticsearch for features generation and predictive modeling. Elasticsearch allows for fast reads and writes of large volumes of time-series and other data through its use of inverted indexes and dynamic mapping. It is deployed on AWS for its elastic scalability, high availability, and integration with Spark via fast queries. Ongoing maintenance includes archiving old data, partitioning indices, and reindexing large datasets.
This document summarizes a lecture on key-value storage systems. It introduces the key-value data model and compares it to relational databases. It then describes Cassandra, a popular open-source key-value store, including how it maps keys to servers, replicates data across multiple servers, and performs reads and writes in a distributed manner while maintaining consistency. The document also discusses Cassandra's use of gossip protocols to manage cluster membership.
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ... (smallerror)
Twitter's operations team manages software performance, availability, capacity planning, and configuration management for Twitter. They use metrics, logs, and analysis to find weak points and take corrective action. Some techniques include caching everything possible, moving operations to asynchronous daemons, and optimizing databases to reduce replication delay and locks. The team also created several open source projects like CacheMoney for caching and Kestrel for asynchronous messaging.
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ... (xlight)
Fixing Twitter and Finding your own Fail Whale document discusses Twitter operations. The operations team manages software performance, availability, capacity planning, and configuration management using metrics, logs, and data-driven analysis to find weak points and take corrective action. They use managed services for infrastructure to focus on computer science problems. The document outlines Twitter's rapid growth and challenges in maintaining performance as traffic increases. It provides recommendations around caching, databases, asynchronous processing, and other techniques Twitter uses to optimize performance under heavy load.
Twitter's operations team manages software performance, availability, capacity planning, and configuration management. They use metrics, logs, and analysis to find weak points and take corrective action. Some techniques include caching everything possible, moving operations to asynchronous daemons, optimizing databases, and instrumenting all systems. Their goal is to process requests asynchronously when possible and avoid overloading relational databases.
Fixing Twitter and Finding your own Fail Whale document discusses Twitter operations. The Twitter operations team focuses on software performance, availability, capacity planning, and configuration management using metrics, logs, and science. They use a dedicated managed services team and run their own servers instead of cloud services. The document outlines Twitter's rapid growth and challenges in maintaining performance. It discusses strategies for monitoring, analyzing metrics to find weak points, deploying changes, and improving processes through configuration management and peer reviews.
Percona Live London 2014: Serve out any page with an HA Sphinx environment (spil-engineering)
Sphinx is a full-text search engine that Spil Games uses to provide fast and complex search across their databases and indexes. Some key ways Spil Games uses Sphinx include searching for games by title or URL, finding friends across their networks, and filtering search results based on browser capabilities. To ensure high availability, Spil Games implements distributed and mirrored Sphinx indexes across multiple nodes and uses load balancers. Benchmarking shows Sphinx significantly outperforms MySQL for certain search queries.
Sharding in MongoDB allows for scaling of data and queries across multiple servers. When determining the number of shards needed, key factors to consider include total storage requirements, latency needs, and throughput requirements. These are used to calculate the necessary disk capacity, disk throughput, and RAM across shards. Different types of sharding include range, tag-aware, and hashed, with range being best for query isolation. Choosing a high cardinality shard key that matches common queries is important for performance and scalability.
This document compares Cassandra and Redis for use as a backend for a Facebook game with 1 million daily users and 10 million total users. Redis was chosen over Cassandra due to its simpler architecture, higher write throughput, and ability to meet the capacity and performance requirements using a single node. The Redis master handled all reads and writes, with a slave for failover. User data was stored in Redis hashes to turn it into a "document DB" and allow for atomic operations on parts of the data.
Elasticsearch is a distributed, RESTful search and analytics engine that can be used for processing big data with Apache Spark. It allows ingesting large volumes of data in near real-time for search, analytics, and machine learning applications like feature generation. Elasticsearch is schema-free, supports dynamic queries, and integrates with Spark, making it a good fit for ingesting streaming data from Spark jobs. It must be deployed with consideration for fast reads, writes, and dynamic querying to support large-scale predictive analytics workloads.
The relational database model was designed to solve the problems of yesterday’s data storage requirements. The massively connected world of today presents different problems and new challenges. We’ll explore the NoSQL philosophy, before comparing and contrasting the strengths and weaknesses of the relational model versus the NoSQL model. While stepping through real-world scenarios, we’ll discuss the reasons for choosing one solution over the other.
To complete this session, let’s demonstrate our findings with an application written with a NoSQL storage layer and explain the advantages that accrue from that decision. By taking a look at the new challenges we face with our data storage needs, we’ll examine why the principles behind NoSQL make it a better candidate as a solution, than yesterday’s relational model.
Speed up your Symfony2 application and build awesome features with Redis (Ricard Clau)
Redis is an extremely fast data structure server that can be easily added to your existing stack and act like a Swiss army knife to help solve many problems that would be extremely difficult to workaround with the traditional RDBMS. In this session we will focus on what Redis is, how it works, what awesome features we can build with it and how we can use it with PHP and integrate it with Symfony2 applications making them blazing fast.
3. CL Sphinx Infrastructure
• Live Sphinx
• ~30 million postings
• end users searching for stuff on craigslist
• Team Sphinx
• ~100 million postings
• additional indexes of postings for internal use
(including non-live postings)
4. CL Sphinx Infrastructure
• Archive Sphinx
• older postings (~3 billion)
• constantly growing in size
• Real-Time Sphinx
• last ~2 days worth of postings
• Forums Sphinx
• ~150 million forum postings
6. Back in 2008
• MySQL FULL TEXT (MyISAM)
• 25 Servers
• Melted Down Frequently
• Desperately Needed a Solution
• This was my first project at craigslist...
• Looked at Solr, Sphinx, Xapian
• Sphinx felt like the right fit
7. Making Sphinx Work
• Benchmarking showed promising results
• Query performance was great
• ~800qps/instance
• back then we only needed 1,200/sec
• Indexing performance too
• Can index documents far faster than I can make the XML for input (from Perl)
• Can’t index and serve at the same time, though...
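The xmlpipe2 feed mentioned above is just a program that prints an XML stream for indexer to consume. Below is a minimal sketch of that idea in Python (the real craigslist pipeline was written in Perl, and the schema and field names here are made up); note the kill-list at the end, which suppresses stale copies of changed postings in older indexes:

```python
# Minimal xmlpipe2 feeder sketch (hypothetical schema; not craigslist's code).
# indexer runs this command and reads the XML stream from stdout.
import sys
from xml.sax.saxutils import escape

POSTS = [  # stand-in for rows pulled from MySQL
    {"id": 101, "title": "red bicycle", "body": "barely used", "posted_at": 1325376000},
    {"id": 102, "title": "studio apartment", "body": "sunny, near park", "posted_at": 1325379600},
]
KILLED = [57, 99]  # ids updated or deleted since the last main index build

def main(out=sys.stdout):
    w = out.write
    w('<?xml version="1.0" encoding="utf-8"?>\n')
    w('<sphinx:docset>\n')
    w('<sphinx:schema>\n')
    w('  <sphinx:field name="title"/>\n')
    w('  <sphinx:field name="body"/>\n')
    w('  <sphinx:attr name="posted_at" type="timestamp"/>\n')
    w('</sphinx:schema>\n')
    for p in POSTS:
        w('<sphinx:document id="%d">\n' % p["id"])
        w('  <title>%s</title>\n' % escape(p["title"]))
        w('  <body>%s</body>\n' % escape(p["body"]))
        w('  <posted_at>%d</posted_at>\n' % p["posted_at"])
        w('</sphinx:document>\n')
    # kill-list: tells Sphinx to ignore these ids in older (main) indexes
    w('<sphinx:killlist>\n')
    for doc_id in KILLED:
        w('  <id>%d</id>\n' % doc_id)
    w('</sphinx:killlist>\n')
    w('</sphinx:docset>\n')

if __name__ == "__main__":
    main()
```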
8. “Live” Sphinx
• One index per city (~700 indexes)
• Main + Delta
• xmlpipe2 input
• Data all fits on a single machine
• 32bit ids
• High churn rate
• Settled on Master/Slave model w/rsync replication
• Deployed in January, 2009
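One common way to implement the master/slave rsync model described above is to push freshly built index files to each slave under the .new names that searchd looks for, then send searchd a SIGHUP so it rotates them in. The sketch below only illustrates that pattern; the hostnames, paths, and index names are hypothetical, not craigslist's actual tooling:

```python
# Sketch of the master/slave "rsync replication" pattern (hypothetical hosts,
# paths, and index names). Freshly built index files are shipped to each slave
# as <index>.new.* so the running searchd keeps serving the old copy, then a
# SIGHUP triggers rotation.
import subprocess

SLAVES = ["sphinx-slave1", "sphinx-slave2"]           # hypothetical hostnames
INDEX_DIR = "/var/data/sphinx/"                       # hypothetical path
INDEXES = ["posts_sfbay_main", "posts_sfbay_delta"]   # hypothetical index names
EXTS = ["spa", "spd", "sph", "spi", "spm", "spp"]     # core index file extensions

def push_index(slave, index):
    for ext in EXTS:
        src = "%s%s.%s" % (INDEX_DIR, index, ext)
        dst = "%s:%s%s.new.%s" % (slave, INDEX_DIR, index, ext)
        subprocess.run(["rsync", "-a", src, dst], check=True)

def rotate(slave):
    # searchd swaps <index>.new.sp* files in when it receives SIGHUP
    subprocess.run(["ssh", slave, "kill -HUP $(cat /var/run/searchd.pid)"],
                   check=True)

for slave in SLAVES:
    for index in INDEXES:
        push_index(slave, index)
    rotate(slave)
```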
10. Main+Delta Indexes
[Diagram: a transient "delta" index is regularly merged into a "today" index; a periodic merge from "today" into the logical (main) index cleans house.]
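The merges in that diagram are ordinary indexer invocations. A housekeeping sketch (hypothetical index names and config path) could look like this: rebuild the small delta frequently, then periodically fold it into the main index with indexer --merge:

```python
# Housekeeping sketch for the main+delta scheme (hypothetical names/paths).
import subprocess

CONF = "/etc/sphinx/sphinx.conf"   # hypothetical config path

def rebuild_delta(delta="posts_sfbay_delta"):
    # re-index only the recent postings; --rotate tells searchd to swap it in
    subprocess.run(["indexer", "--config", CONF, "--rotate", delta], check=True)

def merge_delta_into_main(main="posts_sfbay_main", delta="posts_sfbay_delta"):
    # fold the delta into the main index so the delta stays small
    subprocess.run(["indexer", "--config", CONF, "--merge", main, delta, "--rotate"],
                   check=True)

if __name__ == "__main__":
    rebuild_delta()
    merge_delta_into_main()
```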
11. Early Issues
• Monitoring
• Persistent Connections w/prefork
• hacked up my own initially
• Index merge crashes/bugs
• We’re always running svn snapshots
12. Early Success
• Replaced the 25 MySQL servers
• Used 10 sphinx servers (2 masters, 8 slaves)
• Search traffic continued to increase
• Tons of headroom!
• Typical search is under 5ms
• New Features
• “nearby” search
• sort by: recent, price, best match
13. Early Mistakes
• Stopwords
• Not setting query limits
• Sphinx handled this just fine!
• ASCII-only
• Query mangling
• need to understand how users search and what they expect to find
• UpdateAttributes (no kill lists!)
15. Growth
• Wanted Sphinx for “internal” use
• Created internal “team sphinx” with more indexed data
• includes not visible postings
• includes additional fields
• Space became an issue, so had to build some simple sharding into our code
• 2 clusters: even/odd split for indexes
16. Live Sphinx Today
• 300+ million queries/day
• 5,000 queries/sec peak load
• removed stopwords
• threaded workers
• dict=keywords
• wildcard search enabled
• UTF-8 (mostly) and charset_table
• blend_chars
• kill lists (no searchd on masters)
• sharded (3 masters, 18 slaves) on blades
19. Archive Sphinx
• The Archive Project!
• 2.5 billion postings
• Growing by ~1.6 million daily
• String attributes
• 4 shards, each is a 1 master, 2 slave cluster
• Bucket based on UserID (not city)
• Low query volume
• Need a way to reindex all docs
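As a sketch of the bucketing idea (hostnames are hypothetical, not production names), routing an archive query means picking the shard from the UserID and then one of that shard's slaves:

```python
# Illustrative routing for the archive tier (hypothetical hostnames): postings
# are bucketed by UserID into 4 shards, each shard a 1-master/2-slave cluster;
# the low-volume queries go to one of the slaves.
import random

ARCHIVE_SHARDS = [
    {"master": "arch0-m", "slaves": ["arch0-s1", "arch0-s2"]},
    {"master": "arch1-m", "slaves": ["arch1-s1", "arch1-s2"]},
    {"master": "arch2-m", "slaves": ["arch2-s1", "arch2-s2"]},
    {"master": "arch3-m", "slaves": ["arch3-s1", "arch3-s2"]},
]

def shard_for_user(user_id):
    # bucket on UserID (not city) so all of a user's old postings live together
    return ARCHIVE_SHARDS[user_id % len(ARCHIVE_SHARDS)]

def search_host_for_user(user_id):
    return random.choice(shard_for_user(user_id)["slaves"])

print(search_host_for_user(31337))   # one of the slaves in shard 31337 % 4
```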
20. Real-Time Sphinx
• There’s a delay in indexing data on the master and replicating to the slaves...
• What if we want to offer “real-time search” of your own postings?
21. So I built something...
• Known as rtsd (real-time search daemon)
• Sphinx instance with MySQL Protocol
• Primarily uses in-memory indexes
• Used to bridge the gap between “now” and “archive sphinx”
• Configured as an N day rolling window
• Runs on archive sphinx master hosts
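Because rtsd speaks the MySQL wire protocol, clients can query it with an ordinary MySQL client library using SphinxQL. A sketch in Python (host, port, index name, and attribute names are assumptions, not the production schema):

```python
# Querying a Sphinx/rtsd instance over its MySQL protocol with SphinxQL.
# Hypothetical host, port, index, and attributes; the listener port is
# whatever the searchd config exposes for the MySQL protocol.
import pymysql

conn = pymysql.connect(host="rtsd-host", port=9306, user="", password="")
try:
    with conn.cursor() as cur:
        # find a user's own most recent postings in the rolling RT window
        cur.execute(
            "SELECT id, posted_at FROM rt_posts "
            "WHERE MATCH(%s) AND user_id = %s "
            "ORDER BY posted_at DESC LIMIT 20",
            ("red bicycle", 31337),
        )
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```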
22. Sphinx Time Horizons
                Classic     Team    Archive    rtsd
0-20 min        -           -       -          All
20 min-1 day    Visible     All     -          All
1-60 days       Visible     All     All        -
60+ days        -           -       All        -

Note: Visible postings are findable on the site.
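Read as code, the table says which tiers hold a posting of a given age. A small sketch (tier names are shorthand; "classic" only indexes postings that are visible on the site):

```python
# The time-horizon table as a lookup: given a posting's age in days,
# which Sphinx tiers hold it?
def tiers_for_age(age_days):
    if age_days < 20.0 / (60 * 24):          # younger than ~20 minutes
        return ["rtsd"]                      # only the real-time daemon has it yet
    if age_days <= 1:
        return ["classic", "team", "rtsd"]
    if age_days <= 60:
        return ["classic", "team", "archive"]
    return ["archive"]

print(tiers_for_age(0.005))   # ['rtsd']
print(tiers_for_age(0.5))     # ['classic', 'team', 'rtsd']
print(tiers_for_age(30))      # ['classic', 'team', 'archive']
print(tiers_for_age(400))     # ['archive']
```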
28. Future Work
• autonomous nodes (no master/slave)
• many-core blades with SSD storage
• better performance metrics
• we drop a lot of data on the floor
• log mining and analysis
• sphinx for “table of contents” (browsing)
• haproxy in front of sphinx
• generic sharding code
• testing framework
29. Sphinx Wishlist
• 32 -> 64 bit migration tool
• capture stats at daemon shut down
• RT optimizations for DELETE (high churn)
• distributed search (agent) config with multiple servers per index (for failover and load)
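That last wishlist item asks for something searchd-side: one distributed-index agent line naming several mirror servers per index. Until something like that exists, a client-side stand-in (hypothetical hosts and index name) is to try mirrors in order, which is also roughly what putting haproxy in front of Sphinx buys:

```python
# Client-side failover across mirrored searchd hosts (a sketch, not the
# requested searchd feature and not craigslist's code; hosts and index are
# hypothetical).
import pymysql

MIRRORS = ["sphinx-a.example", "sphinx-b.example", "sphinx-c.example"]

def query_with_failover(sql, params=()):
    last_err = None
    for host in MIRRORS:
        try:
            conn = pymysql.connect(host=host, port=9306, user="", password="",
                                   connect_timeout=1)
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall()
            finally:
                conn.close()
        except pymysql.err.MySQLError as err:
            last_err = err            # dead or overloaded mirror: try the next one
    raise last_err

rows = query_with_failover(
    "SELECT id FROM posts WHERE MATCH(%s) LIMIT 10", ("red bicycle",))
```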