We all know that MongoDB is one of the most flexible and feature-rich databases available. In this session we'll discuss how you can leverage this feature set and maintain high performance with your project's massive data sets and high loads. We'll cover how indexes can be designed to optimize the performance of MongoDB. We'll also discuss tips for diagnosing and fixing performance issues should they arise.
Media owners are turning to MongoDB to drive social interaction with their published content. The way customers consume information has changed and passive communication is no longer enough. They want to comment, share and engage with publishers and their community through a range of media types and via multiple channels whenever and wherever they are. There are serious challenges with taking this semi-structured and unstructured data and making it work in a traditional relational database. This webinar looks at how MongoDB’s schemaless design and document orientation gives organisations like the Guardian the flexibility to aggregate social content and scale out.
Building a Scalable Inbox System with MongoDB and Java (antoinegirbal)
Many user-facing applications present some kind of news feed/inbox system. You can think of Facebook, Twitter, or Gmail as different types of inboxes where the user can see data of interest, sorted by time, popularity, or other parameter. A scalable inbox is a difficult problem to solve: for millions of users, varied data from many sources must be sorted and presented within milliseconds. Different strategies can be used: scatter-gather, fan-out writes, and so on. This session presents an actual application developed by 10gen in Java, using MongoDB. This application is open source and is intended to show the reference implementation of several strategies to tackle this common challenge. The presentation also introduces many MongoDB concepts.
This document discusses tuning MongoDB performance. It covers tuning queries using the database profiler and explain commands to analyze slow queries. It also covers tuning system configurations like Linux settings, disk I/O, and memory to optimize MongoDB performance. Topics include setting ulimits, IO scheduler, filesystem options, and more. References to MongoDB and Linux tuning documentation are also provided.
- MongoDB is a document-oriented, non-relational database that scales horizontally and uses JSON-like documents with dynamic schemas.
- It offers features like embedded documents, indexing, replication, and sharding.
- Documents are stored and queried with simple statements in a JavaScript-like shell syntax, as in the sketch below.
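A minimal mongo shell session, with invented collection and field names, shows the idea:

    // store a document with whatever fields it needs -- no schema declaration required
    db.contacts.insert({ name: "Ada", emails: ["ada@example.com"], createdAt: new Date() })
    // query it back with a JSON-style predicate
    db.contacts.find({ name: "Ada" })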
MongoDB San Francisco 2013: Hash-based Sharding in MongoDB 2.4 presented by B... (MongoDB)
In version 2.4, MongoDB introduces hash-based sharding, a new option for distributing data in sharded collections. Hash-based sharding and range-based sharding present different advantages for MongoDB users deploying large scale systems. In this talk, we'll provide an overview of this new feature and discuss when to use hash-based sharding or range-based sharding.
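As a sketch of the era's shell commands (database and collection names here are invented), hash-based sharding is enabled like this:

    // shard the collection on a hash of _id so writes spread evenly across chunks
    sh.enableSharding("mydb")
    sh.shardCollection("mydb.users", { _id: "hashed" })

Range-based sharding would instead pass an ordinary ascending key such as { _id: 1 }, preserving locality for range queries at the cost of potential hot spots.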
Reducing Development Time with MongoDB vs. SQL (MongoDB)
Buzz Moschetti compares the development time and effort required to save and fetch contact data using MongoDB versus SQL over the course of two weeks. With SQL, each time a new field is added or the data structure changes, the SQL schema must be altered and code updated in multiple places. With MongoDB, the data structure can evolve freely without changes to the data access code - it remains a simple insert and find. By day 14, representing the more complex data structure in SQL would require flattening some data and storing it in non-ideal ways, while MongoDB continues to require no changes to the simple data access code.
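A hedged sketch of the point being made: in the mongo shell, the access code below is the same on day 1 and day 14, whatever shape the contact takes (field names are invented for illustration).

    // day 1: a flat contact
    db.contacts.insert({ name: "Buzz", phone: "212-555-1212" })
    // day 14: a richer structure -- no ALTER TABLE, no changes to the access code
    db.contacts.insert({
        name: "Buzz",
        phones: [ { type: "work", number: "212-555-1212" } ],
        addresses: [ { city: "New York", state: "NY" } ]
    })
    db.contacts.find({ name: "Buzz" })   // the insert-and-find pattern is unchanged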
MongoDB Europe 2016 - Debugging MongoDB Performance (MongoDB)
Asya is back, and so is Sherlock Holmes and his techniques to gather and analyze data from your poorly performing MongoDB clusters. In this advanced talk we take a deep look at all the diagnostic data that lives inside MongoDB - how to interrogate and interpret it to help you solve those frustrating performance bottlenecks that we all face occasionally.
10gen Presents Schema Design and Data Modeling (DATAVERSITY)
This document provides an overview of schema design in MongoDB. It discusses topics such as:
- The goals of schema design, which include avoiding anomalies, minimizing redesign, avoiding query bias, and making use of features.
- Key terminology when comparing MongoDB to relational databases, such as using collections instead of tables and embedding/linking instead of joins.
- Examples of basic collections, documents, indexing, and query operators.
- Common schema patterns for MongoDB like embedding, normalization, inheritance, one-to-many, many-to-many, and trees.
- Use cases like time series are also briefly covered.
This document discusses various indexing strategies in MongoDB to help scale applications. It covers the basics of indexes, including creating and tuning indexes. It also discusses different index types like geospatial indexes, text indexes, and how to use explain plans and profiling to evaluate queries. The document concludes with a section on scaling strategies like sharding to scale beyond a single server's resources.
Indexing in MongoDB works similarly to indexing in relational databases. An index is a data structure that can make certain queries more efficient by maintaining a sorted order of documents. Indexes are created using the ensureIndex() method and take up additional space and slow down writes. The explain() method is used to determine whether a query is using an index.
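A minimal shell sketch of that workflow (collection and field names assumed):

    // build a single-field index; ensureIndex() was the era-appropriate call
    // (newer releases spell it createIndex())
    db.orders.ensureIndex({ customerId: 1 })
    // explain() reveals whether the index is used: look for a BtreeCursor
    // (or IXSCAN in later versions) rather than a BasicCursor / collection scan
    db.orders.find({ customerId: 42 }).explain()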
Developers love MongoDB because its flexible document model enhances their productivity. But did you know that MongoDB supports rich queries and lets you accomplish some of the same things you currently do with SQL statements? And that MongoDB's powerful aggregation framework makes it possible to perform real-time analytics for dashboards and reports?
Attend this webinar for an introduction to the MongoDB aggregation framework and a walk through of what you can do with it. We'll also demo using it to analyze U.S. census data.
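The webinar's census demo is not reproduced here, but a pipeline in that spirit, with invented collection and field names, could look like:

    // total population per state for one census year, largest five first
    db.census.aggregate([
        { $match: { year: 2010 } },
        { $group: { _id: "$state", totalPop: { $sum: "$population" } } },
        { $sort: { totalPop: -1 } },
        { $limit: 5 }
    ])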
MongoDB .local Chicago 2019: Practical Data Modeling for MongoDB: Tutorial (MongoDB)
For 30 years, developers have been taught that relational data modeling was THE way to model, but as more companies adopt MongoDB as their data platform, the approaches that work well in relational design actually work against you in a document model design. In this talk, we will discuss how to conceptually approach modeling data with MongoDB, focusing on practical foundational techniques, paired with tips and tricks, and wrapping with discussing design patterns to solve common real world problems.
MongoDB .local Munich 2019: Best Practices for Working with IoT and Time-seri... (MongoDB)
Time series data is increasingly at the heart of modern applications - think IoT, stock trading, clickstreams, social media, and more. With the move from batch to real time systems, the efficient capture and analysis of time series data can enable organizations to better detect and respond to events ahead of their competitors or to improve operational efficiency to reduce cost and risk. Working with time series data is often different from regular application data, and there are best practices you should observe.
This talk covers:
• Common components of an IoT solution
• The challenges involved with managing time-series data in IoT applications
• Different schema designs, and how these affect memory and disk utilization – two critical factors in application performance.
• How to query, analyze and present IoT time-series data using MongoDB Compass and MongoDB Charts
At the end of the session, you will have a better understanding of key best practices in managing IoT time-series data with MongoDB.
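One schema pattern often recommended in this space, shown here only as an illustrative sketch with made-up device and field names, is bucketing: one document per device per hour, with readings appended to an array, which keeps document counts and index sizes down.

    // append a reading into the current hour's bucket, creating it if needed
    db.readings.update(
        { deviceId: "sensor-1", hour: ISODate("2019-06-01T10:00:00Z") },
        { $push: { samples: { ts: ISODate("2019-06-01T10:15:23Z"), temp: 21.4 } },
          $inc:  { count: 1 } },
        { upsert: true }
    )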
Webinar: Exploring the Aggregation Framework (MongoDB)
Developers love MongoDB because its flexible document model enhances their productivity. But did you know that MongoDB supports rich queries and lets you accomplish some of the same things you currently do with SQL statements? And that MongoDB's powerful aggregation framework makes it possible to perform real-time analytics for dashboards and reports?
Watch this webinar for an introduction to the MongoDB aggregation framework and a walk through of what you can do with it. We'll also demo an analysis of U.S. census data.
PistonHead's use of MongoDB for Analytics (Andrew Morgan)
Haymarket Media Group is building a reporting and analytics suite called PistonHub to provide dealers and administrators insights into classifieds and stock performance data. PistonHub will aggregate data from various sources like classifieds, calls, emails, and stock information to generate daily statistics for each dealer that can be viewed on a dashboard. This consolidated data will give dealers and sales teams more visibility to help dealers improve performance. Initial feedback on PistonHub has been positive, with users citing the extra insight it provides.
As your data grows, the need to establish proper indexes becomes critical to performance. MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application?
In this talk we’ll cover how indexing works, the various indexing options, and use cases where each can be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale.
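As one example of the kind of pitfall such talks cover (names invented), a compound index only helps queries that use its leading fields:

    db.events.ensureIndex({ userId: 1, createdAt: -1 })
    // served by the index: equality on the prefix, then sort on the next field
    db.events.find({ userId: 42 }).sort({ createdAt: -1 })
    // not served by it: the leading field is missing, so the index prefix can't be used
    db.events.find({ createdAt: { $gt: ISODate("2019-01-01") } })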
Back to Basics Webinar 5: Introduction to the Aggregation Framework (MongoDB)
The document provides information about an upcoming webinar on the MongoDB aggregation framework. Key details include:
- The webinar will introduce the aggregation framework and provide an overview of its capabilities for analytics.
- Examples will use a real-world vehicle testing dataset to demonstrate aggregation pipeline stages like $match, $project, and $group (sketched after this list).
- Attendees will learn how the aggregation framework provides a simpler way to perform analytics compared to other tools like Spark and Hadoop.
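The webinar's actual dataset is not reproduced here; a hypothetical pipeline over similarly shaped documents would chain those stages like this:

    // failures per make: filter, trim fields, then count per group
    db.vehicletests.aggregate([
        { $match: { result: "FAIL" } },
        { $project: { make: 1, year: 1 } },
        { $group: { _id: "$make", failures: { $sum: 1 } } }
    ])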
Thomas Rückstieß gave a presentation on indexing and query optimization in MongoDB. He discussed what indexes are, why they are needed, how to create and manage indexes, and how to optimize queries. He emphasized that absent or suboptimal indexes are a common performance problem and outlined some common indexing mistakes to avoid, such as trying to use multiple indexes per query, low-selectivity indexes, and queries that cannot use indexes at all, such as unanchored regular expressions and negation.
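To make the last point concrete (index and field names assumed), anchored regexes can walk an index while unanchored ones and negations generally cannot:

    db.users.ensureIndex({ username: 1 })
    db.users.find({ username: /^ada/ })            // anchored prefix: can use the index
    db.users.find({ username: /ada/ })             // unanchored: scans every index key
    db.users.find({ status: { $ne: "active" } })   // negation: usually a full scan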
These are slides from our Big Data Warehouse Meetup in April. We talked about NoSQL databases: What they are, how they’re used and where they fit in existing enterprise data ecosystems.
Mike O’Brian from 10gen introduced the syntax and usage patterns for a new aggregation system in MongoDB and gave some demonstrations of aggregation using the new system. The new MongoDB aggregation framework makes it simple to do tasks such as counting, averaging, and finding minima or maxima while grouping by keys in a collection, complementing MongoDB’s built-in map/reduce capabilities.
For more information, visit our website at https://casertaconcepts.com/ or email us at [email protected].
The document discusses schema design basics for MongoDB, including terms, considerations for schema design, and examples of modeling different types of data structures like trees, single table inheritance, and many-to-many relationships. It provides examples of creating indexes, evolving schemas, and performing queries and updates. Key topics covered include embedding data versus normalization, indexing, and techniques for modeling one-to-many and many-to-many relationships.
Data Processing and Aggregation with MongoDB (MongoDB)
The document discusses data processing and aggregation using MongoDB. It provides an example of using MongoDB's map-reduce functionality to count the most popular pub names in a dataset of UK pub locations and attributes. It shows the map and reduce functions used to tally the name occurrences and outputs the top 10 results. It then demonstrates performing a similar analysis on just the pubs located in central London using MongoDB's aggregation framework pipeline to match, group and sort the results.
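The shapes of the two approaches, roughly as described (collection and field names assumed):

    // map/reduce tally of pub names
    var map    = function () { emit(this.name, 1); };
    var reduce = function (key, values) { return Array.sum(values); };
    db.pubs.mapReduce(map, reduce, { out: { inline: 1 } })

    // the same tally via the aggregation framework, top ten only
    db.pubs.aggregate([
        { $group: { _id: "$name", count: { $sum: 1 } } },
        { $sort:  { count: -1 } },
        { $limit: 10 }
    ])

The central-London variant would simply prepend a $match stage (for example a geospatial $geoWithin filter) to the pipeline.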
Beyond the Basics 2: Aggregation Framework (MongoDB)
The aggregation framework is one of the most powerful analytical tools available with MongoDB.
Learn how to create a pipeline of operations that can reshape and transform your data and apply a range of analytics functions and calculations to produce summary results across a data set.
Relational databases are central to web applications, but they have also been the primary source of pain when it comes to scale and performance. Recently, non-relational databases (also referred to as NoSQL) have arrived on the scene. This session explains not only what MongoDB is and how it works, but when and how to gain the most benefit.
Back to Basics Webinar 4: Advanced Indexing, Text and Geospatial Indexes (MongoDB)
This is the fourth webinar of a Back to Basics series that will introduce you to the MongoDB database. This webinar will introduce you to advanced indexing, including text and geospatial indexes.
CosmosDB is a globally distributed, multi-model NoSQL database service designed for scalable, high-performance modern applications. CosmosDB is delivered as a fully managed service with an enterprise-grade SLA. It supports querying documents using familiar SQL over hierarchical JSON documents. Azure Cosmos DB is a superset of the DocumentDB service: it allows you to store and query NoSQL data, regardless of schema. In this presentation, you will learn: • How to get started with DocumentDB by provisioning a new database account • How to index documents • How to create applications using CosmosDB (using the REST API or programming libraries for several popular languages) • Best practices for designing applications with CosmosDB • Best practices for creating queries.
Map/Confused? A practical approach to Map/Reduce with MongoDB (Uwe Printz)
Talk given at MongoDB Munich on 16.10.2012 about the different approaches in MongoDB for using the Map/Reduce algorithm. The talk compares the performance of built-in MongoDB Map/Reduce, group(), aggregate(), find() and the MongoDB-Hadoop Adapter using a practical use case.
This document provides an overview of MongoDB aggregation which allows processing data records and returning computed results. It describes some common aggregation pipeline stages like $match, $lookup, $project, and $unwind. $match filters documents, $lookup performs a left outer join, $project selects which fields to pass to the next stage, and $unwind deconstructs an array field. The document also lists other pipeline stages and aggregation pipeline operators for arithmetic, boolean, and comparison expressions.
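Putting the stages it names together, on two hypothetical collections:

    db.orders.aggregate([
        { $match: { status: "shipped" } },          // filter documents
        { $unwind: "$items" },                      // one document per array element
        { $lookup: {                                // left outer join against products
            from: "products",
            localField: "items.sku",
            foreignField: "sku",
            as: "product" } },
        { $project: { _id: 0, "items.sku": 1, "product.name": 1 } }  // keep selected fields
    ])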
This document provides an overview and agenda for a presentation on NoSQL databases and MongoDB. It discusses the benefits and criticisms of SQL databases, introduces key concepts of NoSQL and MongoDB, and provides examples of MongoDB queries. The presentation covers why organizations adopt NoSQL, introduces MongoDB and demonstrates how to query and integrate it using Java. It also discusses data consistency, scaling and tips for using MongoDB.
Spark-driven audience counting by Boris Trofimov (JavaDayUA)
A story about the ad world and real-time segment counting. The size of the data does not allow straightforward calculations, so we dive into the solution step by step, involving some "secret" algorithms from Google.
This document discusses code contracts, which extend abstract data types with preconditions, postconditions, and invariants, allowing programmers to specify conditions that must hold before, after, and during execution. It discusses how to add contracts to code using Code Contracts in .NET and demonstrates contract verification, inheritance of contracts, and handling contract failures at runtime. Code Contracts allow formal specification and static/dynamic checking of interface behaviors to help catch errors and improve code quality.
1. Scalding is a library that provides a concise domain-specific language (DSL) for writing MapReduce jobs in Scala. It allows defining source and sink connectors, as well as data transformation operations like map, filter, groupBy, and join in a more readable way than raw MapReduce APIs.
2. Some use cases for Scalding include splitting or reusing data streams, handling exotic data sources like JDBC or HBase, performing joins, distributed caching, and building connected user profiles by bridging data from different sources.
3. For connecting user profiles, Scalding can be used to model the data as a graph with vertices for user interests and edges for bridging rules.
- MongoDB is a non-relational, document-oriented database that scales horizontally and uses JSON-like documents with dynamic schemas.
- It supports complex queries, embedded documents and arrays, and aggregation and MapReduce for querying and transforming data.
- MongoDB is used by many large companies for operational databases and analytics due to its scalability, flexibility, and performance.
This document summarizes a cloud-native stream processor. It discusses how the stream processor is lightweight, open source, and supports distributed deployment on Docker and Kubernetes. It also outlines key features like real-time data integration, complex pattern detection, online machine learning, and integration with databases and services. Use cases like fraud detection, IoT analytics, and real-time decision making are provided.
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise (WSO2)
The WSO2 analytics platform provides a high performance, lean, enterprise-ready, streaming solution to solve data integration and analytics challenges faced by connected businesses. This platform offers real-time, interactive, machine learning and batch processing technologies that empower enterprises to build a digital business. This session explores how to enable digital transformation by building a data analytics platform.
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp... (MongoDB)
This document discusses using machine learning and various machine learning platforms like MongoDB, Spark, Watson, Azure, and AWS to engage customers. It provides examples of using these platforms for tasks like topic detection on tweets, sentiment analysis, recommendation engines, forecasting, and marketing response prediction. It also discusses architectures, languages, and functions supported by tools like Mahout, MLlib, and Watson Developer Cloud.
OrientDB - The 2nd generation of (multi-model) NoSQL (Roberto Franchini)
This document provides an overview of OrientDB, a multi-model database that combines features of document, graph, and other databases. It discusses data modeling and schema, querying and traversing graph data, full-text and spatial search, deployment scenarios, and APIs. Examples show creating classes and properties, inserting and querying graph data, and live reactive queries in OrientDB.
CouchApps are web applications built using CouchDB, JavaScript, and HTML5. CouchDB is a document-oriented database that stores JSON documents, has a RESTful HTTP API, and is queried using map/reduce views. This talk will answer your basic questions about CouchDB, but will focus on building CouchApps and related tools.
This document discusses using F# for learning probabilistic models and projects at Microsoft. It covers factor graphs and inference in factor graphs for representing probabilistic models. It then describes two projects - analyzing the TrueSkill ranking algorithm and an internal adCenter competition. It concludes by outlining benefits of F# such as producing correct, succinct and high performance code while being fun to program in.
The document discusses MongoDB transactions and concurrency. It provides code examples of how to perform transactions in MongoDB using logical sessions, including inserting a document into a collection and updating related documents in another collection atomically. It also discusses some of the features and timeline for implementing distributed transactions in sharded MongoDB clusters.
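A sketch of that shell pattern, assuming a MongoDB 4.0+ replica set and invented database and collection names:

    var session = db.getMongo().startSession();
    session.startTransaction();
    var orders    = session.getDatabase("shop").orders;
    var inventory = session.getDatabase("shop").inventory;
    try {
        // both writes commit or abort together
        orders.insertOne({ _id: 1, sku: "abc", qty: 2 });
        inventory.updateOne({ sku: "abc" }, { $inc: { qty: -2 } });
        session.commitTransaction();
    } catch (e) {
        session.abortTransaction();
        throw e;
    }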
Christian Kvalheim gave an introduction to NoSQL and MongoDB. Some key points:
1) MongoDB is a scalable, high-performance, open source NoSQL database that uses a document-oriented model.
2) It supports indexing, replication, auto-sharding for horizontal scaling, and querying.
3) Documents are stored in JSON-like records which can contain various data types, including nested objects and arrays, as the example below shows.
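A small illustration (field names invented):

    db.people.insert({
        name: "Christian",
        address: { city: "Oslo", country: "NO" },
        langs: ["javascript", "scala"]
    })
    db.people.find({ "address.city": "Oslo" })   // dot notation reaches into nested objects
    db.people.find({ langs: "scala" })           // arrays match on any element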
Building Analytics Applications with Streaming Expressions in Apache Solr - A... (Lucidworks)
This document discusses building analytics applications with streaming expressions in Apache Solr. It introduces parallel computing frameworks, the streaming API, and streaming expressions. It provides examples of use cases like performing searches, facets, joins, and aggregations on real-time data from different sources. It also demonstrates how to execute expressions in parallel using worker collections and shuffling to improve performance.
How sitecore depends on mongo db for scalability and performance, and what it... (Antonios Giannopoulos)
Percona Live 2017 - How sitecore depends on mongo db for scalability and performance, and what it can teach you by Antonios Giannopoulos and Grant Killian
This document discusses MongoDB and the needs of Rivera Group, an IT services company. It notes that Rivera Group has been using MongoDB since 2012 to store large, multi-dimensional datasets with heavy read/write and audit requirements. The document outlines some of the challenges Rivera Group faces around indexing, aggregation, and flexibility in querying datasets.
Eagle6 is a product that uses system artifacts to create a replica model representing a near real-time view of system architecture. Eagle6 was built to collect system data (log files, application source code, etc.) and to link system behaviors so that the user can quickly identify risks associated with unknown or unwanted behavioral events that may have unknown impacts on seemingly unrelated downstream systems. This session presents the capabilities of the Eagle6 modeling product and how we are using MongoDB to support near-real-time analysis of large disparate datasets.
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016 (DataStax)
A deep learning startup has a requirement for a robust and scalable data architecture. Training a Deep Neural Network requires 10s-100s of millions of examples consisting of data and metadata. In addition to training it is necessary to support test/validation, data exploration and more traditional data science analytics workloads. As a startup we have minimal resources and an engineering team of 1.
Cassandra, Spark, and Kafka running on Mesos in AWS form a scalable architecture that is fast and easy to set up and maintain, delivering a data architecture for Deep Learning.
About the Speaker
Andrew Jefferson VP Engineering, Tractable
A software engineer specialising in realtime data systems. I've worked at companies from Startups to Apple on applications ranging from Ticketing to Genetics. Currently building data systems for training and exploiting Deep Neural Networks.
Real-Time Spark: From Interactive Queries to Streaming (Databricks)
This document summarizes Michael Armbrust's presentation on real-time Spark. It discusses:
1. The goals of real-time analytics, which include having the freshest answers as fast as possible while keeping them up to date.
2. Spark 2.0 introduces unified APIs for SQL, DataFrames and Datasets to make developing real-time analytics simpler with powerful yet simple APIs.
3. Structured streaming allows running the same SQL queries on streaming data to continuously aggregate data and update outputs, unifying batch, interactive, and streaming queries into a single API.
New feature overview of Cubes 1.0 – lightweight Python OLAP and pluggable data warehouse. Video: https://www.youtube.com/watch?v=-FDTK80zsXc Github sources: https://github.com/databrewery/cubes
The document discusses MongoDB and how it allows storing data in flexible, document-based collections rather than rigid tables. Some key points:
- MongoDB uses a flexible document model that allows embedding related data rather than requiring separate tables joined by foreign keys.
- It supports dynamic schemas that allow fields within documents to vary unlike traditional SQL databases that require all rows to have the same structure.
- Aggregation capabilities allow complex analytics to be performed directly on the data, without requiring data warehousing or manual export/import as with SQL databases. Pipelines of aggregation operations can be chained together, as the sketch after this list shows.
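A hedged sketch combining both points, with invented names: line items embedded in an order, analyzed in place by a chained pipeline.

    db.orders.insert({
        _id: 1001,
        customer: "Acme",
        items: [ { sku: "A1", qty: 2, price: 9.99 },
                 { sku: "B2", qty: 1, price: 24.50 } ]
    })
    // revenue per SKU, computed directly on the operational data
    db.orders.aggregate([
        { $unwind: "$items" },
        { $group: { _id: "$items.sku",
                    revenue: { $sum: { $multiply: ["$items.qty", "$items.price"] } } } }
    ])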
DataScience Lab, May 13, 2017
Correcting geometric distortions in optical satellite imagery
Alexey Kravchenko (Senior Data Scientist at Zoral Labs)
We survey the variety of available satellite data and its applications in agriculture, forestry, and land-cover mapping, then focus on geometric correction of imagery as the first step in the satellite data processing pipeline: georeferencing, image registration, sub-pixel identification of ground control points, and band alignment. We also describe some interesting and unexpected approaches to estimating satellite orientation and jitter and to building cloud masks.
All materials: http://datascience.in.ua/report2017
DataScience Lab 2017_Kappa Architecture: How to implement a real-time streami... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Kappa Architecture: How to implement a real-time streaming data analytics engine
Juantomás García (Data Solutions Manager at OpenSistemas, Madrid, Spain)
We will have an introduction of what is the kappa architecture vs lambda architecture. We will see how kappa architecture is a good solution to implement solutions in (almost) real time when we need to analyze data in streaming. We will show in a case of real use: how architecture is designed, how pipelines are organized and how data scientists use it. We will review the most used technologies to implement it from apache Kafka + spark using Scala to new tools like apache beam / google dataflow.
All materials: http://datascience.in.ua/report2017
Semgrex allows users to extract information from text using patterns that match syntactic dependencies in sentences. It provides examples of patterns that match parts of speech tags and dependency relations. The document also includes links to the Semgrex npm package, a demo application on GitHub, and resources for natural language processing and syntactic dependencies.
DataScience Lab 2017_A survey of face detection methods in images (GeeksLab Odessa)
DataScience Lab, May 13, 2017
A survey of face detection methods in images
Yuriy Pashchenko (Research Engineer, Ring Labs)
In this talk we survey the newest and most popular face detection methods, such as Viola-Jones, Faster-RCNN, MTCNN, and others. We discuss the main criteria for evaluating algorithm quality as well as benchmark datasets, including FDDB, WIDER, and IJB-A.
All materials: http://datascience.in.ua/report2017
DataScienceLab2017_Patient similarity: cleaning out duplicates and predicting mis... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Patient similarity: cleaning out duplicates and predicting missing diagnoses
Viktor Sarapin (CEO at V.I.Tech)
How to efficiently detect duplicates across tens of millions of patients, and how to identify missing diagnoses and treatment actions.
All materials are available at: http://datascience.in.ua/report2017
DataScience Lab, May 13, 2017
Recent deep learning approaches for speech generation
Dmitriy Belevtsov (Techlead at IBDI)
In the last half year, several important deep-neural-network models have appeared that can successfully synthesize human speech at the level of individual samples, sidestepping many shortcomings of the classical spectral approaches. In this talk I give a short overview of the architectures of the most popular networks, such as WaveNet and SampleRNN.
All materials are available at: http://datascience.in.ua/report2017
DataScience Lab, May 13, 2017
Distributed computing: using BOINC in Data Science
Vitaliy Koshura (Software Developer at Lohika)
BOINC is open-source software for distributed computing. This talk covers how BOINC is used in various areas of science that involve processing huge volumes of data, illustrated by currently active research projects.
All materials are available at: http://datascience.in.ua/report2017
DataScience Lab, May 13, 2017
The "Data Science" master's program at UCU
Orest Kupin (Master's Student at UCU)
In this talk I present the master's program specializing in data analysis at the Ukrainian Catholic University: the program's structure and core courses, my own experience as a UCU student, and the challenges we faced this year.
All materials are available at: http://datascience.in.ua/report2017
DataScienceLab2017_Serving models built on big data with A... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Serving models built on big data with Apache Spark
Stepan Pushkarev (GM (Kazan) at Provectus / CTO at Hydrosphere.io)
Once data preparation and model training on big data with Apache Spark are done, the question arises of how to use the trained models in real applications. Beyond the model itself, the whole data pre-processing pipeline must reach production exactly as the data scientist designed and implemented it. Solutions such as PMML/PFA, based on exporting/importing the model and the algorithm, have obvious drawbacks and limitations. In this talk we propose an alternative that simplifies using models and pipelines in real production applications.
All materials are available at: http://datascience.in.ua/report2017
DataScienceLab2017_BioVec: Word2Vec for genomic data analysis and bioin... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
BioVec: Word2Vec for genomic data analysis and bioinformatics
Dmitriy Novitskiy (Senior Researcher at IPMMS NANU)
This talk is devoted to BioVec: applying word2vec to bioinformatics problems. We first recall how word2vec and similar word-embedding methods work, then discuss the specifics of word2vec applied to genomic sequences, the main kind of data in bioinformatics: how to train BioVec and apply it to protein classification, function prediction, and more. We close with code examples for training and using BioVec.
All materials are available at: http://datascience.in.ua/report2017
DataScienceLab2017_Data Science and Big Data in Telecom_Alexander Saenko (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Data Science and Big Data in Telecom
Alexander Saenko (Software Engineer at SoftServe/CISCO)
Alexander presents several interesting applications of Big Data and Data Science in telecom: cellular network optimization, customer experience improvement, models for predicting mobile device location, churn prevention, fraud detection, and others, and reviews the main modern approaches to solving them with machine learning algorithms.
All materials are available at: http://datascience.in.ua/report2017
DataScienceLab2017_High-performance computing capabilities for syst... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
High-performance computing capabilities for data analysis systems
Mikhail Fedoseev (Infrastructure Solutions Architect, LanTec)
This talk covers the hardware side of data analysis systems for private clouds and on-premises high-performance computing clusters. We look at the technologies and integrated solutions from Hewlett Packard Enterprise that speed up data analysis: not only the proven, segment-leading HPE Apollo server line and high-speed HPE network switches, but also supporting components such as powerful NVIDIA graphics cards and Xeon Phi host processors. We also review the HPE Core HPC Software Stack, which lets administrators control how system resources are used.
All materials are available at: http://datascience.in.ua/report2017
DataScience Lab 2017_Monitoring fashion trends with deep learning and... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Monitoring fashion trends with deep learning and TensorFlow, Olga Romanyuk (Data Scientist at Eleks)
For the last 8 months at Eleks we worked on a fashion-trend tracking system based on a deep residual neural network with identity mappings. During training we used online data augmentation and data parallelism across two GPU cards. We built the system from scratch with TensorFlow. In this presentation I cover the practical side of the project, implementation nuances, and the pitfalls we hit along the way.
All materials: http://datascience.in.ua/report2017
DataScience Lab 2017_Who's there? Automatic speaker labeling on phone ca... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Who's there? Automatic speaker labeling in phone conversations
Yuriy Guts (Machine Learning Engineer, DataRobot)
Automatic speaker diarization is an interesting problem in multimedia data processing. We need to answer the question "Who speaks when?" without knowing anything about the number or identity of the speakers present on a recording. In this talk we look at methods that work for speaker diarization of phone conversations.
All materials: http://datascience.in.ua/report2017
DataScience Lab 2017_From bag of texts to bag of clusters_Evgeniy Terpil / P... (GeeksLab Odessa)
From bag of texts to bag of clusters
Evgeniy Terpil / Pavel Khudan (Data Scientists / NLP Engineer at YouScan)
We survey modern approaches to text clustering and visualization, from classic K-means on TF-IDF to deep-learning text representations. As a practical example, we analyze a set of social media posts and try to find the main topics of discussion.
All materials: http://datascience.in.ua/report2017
DataScience Lab 2017_Probabilistic graphical models for decision making in ... (GeeksLab Odessa)
Probabilistic graphical models for decision making in project management
Olga Tatarintseva (Data Scientist at Eleks)
How often do you make decisions using knowledge of a particular domain, and how good are those decisions? Now imagine you have gathered the knowledge of the best experts in that domain: decisions based on it should be far better grounded, shouldn't they? We talk about ProjectHealth, a system built on the experience of Eleks' best project management experts. It uses a probabilistic graphical model, namely a Bayesian network implemented in Python. Over the course of the project we went from requirements elicitation, data gathering, and building the model from scratch to a BI dashboard that can drill down all the way to the raw data. ProjectHealth now saves a great deal of top management time and company resources, monitoring the state of the business in the finest detail every day, like a true expert.
All materials: http://datascience.in.ua/report2017
DataScienceLab2017_Hyperparameter optimization for machine learning via ... (GeeksLab Odessa)
DataScienceLab, May 13, 2017
Hyperparameter optimization for machine learning via Bayesian optimization
Maksym Bevza (Research Engineer at Grammarly)
All machine learning algorithms need tuning. We often use grid search, randomized search, or our intuition to pick hyperparameters. Bayesian optimization helps steer randomized search toward the most promising regions, so that we get the same (or a better) result in fewer iterations.
All materials: http://datascience.in.ua/report2017
DataScienceLab2017_How to know everything about your customers (or almost everything)?_Darina Peremot (GeeksLab Odessa)
DataScienceLab, May 13, 2017
How to know everything about your customers (or almost everything)?
Darina Peremot (ML Engineer at SynergyOne)
We share our own answer to the question "What does the customer actually want?", present findings from our transaction research (down to whether you have a pet at home), and demonstrate how machine learning already helps us get to know you better.
All materials: http://datascience.in.ua/report2017
JS Lab 2017_Mapbox GL: how modern interactive maps work_Vladimir ... (GeeksLab Odessa)
JS Lab 2017, March 25
Mapbox GL: how modern interactive maps work
Vladimir Agafonkin (Lead JavaScript Engineer at MapBox)
Mapbox GL JS is an open-source JS library for building modern interactive maps on top of WebGL. In development for more than three years, it combines many remarkable technologies, complex algorithms, and ideas to achieve smooth rendering of thousands of vector features with millions of points in real time. In this talk you will learn how the library works inside and what difficulties developers of modern WebGL applications face: font rendering, triangulation of lines and polygons, spatial indexes, collision detection, label placement, point clustering, shape clipping, line simplification, sprite packing, compact binary formats, parallel data processing in the browser, render testing, and more.
All materials: http://jslab.in.ua/2017
JS Lab2017_Under the microscope: the splendor and misery of Node.js microservices (GeeksLab Odessa)
JS Lab2017, March 25, Odessa
Under the microscope: the splendor and misery of Node.js microservices
Ilya Klimov (CEO at Javascript.Ninja)
"- What is this?
- A microservice!
- And what does it do?
- It micro-crashes."
Everyone is talking about microservices these days: how they rescue you from development complexity, cut deployment time, and raise overall system reliability. This talk is about the pitfalls awaiting those who ride this hype wave with Node.js. We talk about the mistakes that cost me and my company sleepless nights, lost revenue and, at times, our faith in the power of microservice architecture.
All materials: http://jslab.in.ua/
Organizers: http://geekslab.org.ua/
Quantum Computing Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfAbi john
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility as they elevate themselves to a new era in cryptocurrency.
Generative Artificial Intelligence (GenAI) in BusinessDr. Tathagat Varma
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business - benefits, opportunities and limitations. I also discussed how my research on Theory of Cognitive Chasms helps address some of these issues
Mobile App Development Company in Saudi ArabiaSteve Jonas
EmizenTech is a globally recognized software development company, proudly serving businesses since 2013. With over 11+ years of industry experience and a team of 200+ skilled professionals, we have successfully delivered 1200+ projects across various sectors. As a leading Mobile App Development Company In Saudi Arabia we offer end-to-end solutions for iOS, Android, and cross-platform applications. Our apps are known for their user-friendly interfaces, scalability, high performance, and strong security features. We tailor each mobile application to meet the unique needs of different industries, ensuring a seamless user experience. EmizenTech is committed to turning your vision into a powerful digital product that drives growth, innovation, and long-term success in the competitive mobile landscape of Saudi Arabia.
HCL Nomad Web – Best Practices and Managing Multiuser Environmentspanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-and-managing-multiuser-environments/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client upgrades will be installed “automatically” in the background. This significantly reduces the administrative footprint compared to traditional HCL Notes clients. However, troubleshooting issues in Nomad Web present unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how to simplify the troubleshooting process in HCL Nomad Web, ensuring a smoother and more efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser’s cache (using OPFS)
- Understand the difference between single- and multi-user scenarios
- Utilizing Client Clocking
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfSoftware Company
Explore the benefits and features of advanced logistics management software for businesses in Riyadh. This guide delves into the latest technologies, from real-time tracking and route optimization to warehouse management and inventory control, helping businesses streamline their logistics operations and reduce costs. Learn how implementing the right software solution can enhance efficiency, improve customer satisfaction, and provide a competitive edge in the growing logistics sector of Riyadh.
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...Alan Dix
Talk at the final event of Data Fusion Dynamics: A Collaborative UK-Saudi Initiative in Cybersecurity and Artificial Intelligence funded by the British Council UK-Saudi Challenge Fund 2024, Cardiff Metropolitan University, 29th April 2025
https://alandix.com/academic/talks/CMet2025-AI-Changes-Everything/
Is AI just another technology, or does it fundamentally change the way we live and think?
Every technology has a direct impact with micro-ethical consequences, some good, some bad. However more profound are the ways in which some technologies reshape the very fabric of society with macro-ethical impacts. The invention of the stirrup revolutionised mounted combat, but as a side effect gave rise to the feudal system, which still shapes politics today. The internal combustion engine offers personal freedom and creates pollution, but has also transformed the nature of urban planning and international trade. When we look at AI the micro-ethical issues, such as bias, are most obvious, but the macro-ethical challenges may be greater.
At a micro-ethical level AI has the potential to deepen social, ethnic and gender bias, issues I have warned about since the early 1990s! It is also being used increasingly on the battlefield. However, it also offers amazing opportunities in health and educations, as the recent Nobel prizes for the developers of AlphaFold illustrate. More radically, the need to encode ethics acts as a mirror to surface essential ethical problems and conflicts.
At the macro-ethical level, by the early 2000s digital technology had already begun to undermine sovereignty (e.g. gambling), market economics (through network effects and emergent monopolies), and the very meaning of money. Modern AI is the child of big data, big computation and ultimately big business, intensifying the inherent tendency of digital technology to concentrate power. AI is already unravelling the fundamentals of the social, political and economic world around us, but this is a world that needs radical reimagining to overcome the global environmental and human challenges that confront us. Our challenge is whether to let the threads fall as they may, or to use them to weave a better future.
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Impelsys Inc.
Impelsys provided a robust testing solution, leveraging a risk-based and requirement-mapped approach to validate ICU Connect and CritiXpert. A well-defined test suite was developed to assess data communication, clinical data collection, transformation, and visualization across integrated devices.
Artificial Intelligence is providing benefits in many areas of work within the heritage sector, from image analysis, to ideas generation, and new research tools. However, it is more critical than ever for people, with their analogue intelligence, to ensure the integrity and ethical use of AI. Including real people can improve the use of AI by identifying potential biases, cross-checking results, refining workflows, and providing contextual relevance to AI-driven results.
News about the impact of AI often paints a rosy picture. In practice, there are many potential pitfalls. This presentation discusses these issues and looks at the role of analogue intelligence and analogue interfaces in providing the best results to our audiences. How do we deal with factually incorrect results? How do we get content generated that better reflects the diversity of our communities? What roles are there for physical, in-person experiences in the digital world?
Big Data Analytics Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
Java/Scala Lab: Boris Trofimov - Scorching Big Data.
1. Scalding Big ADta, or firing pots with ads
Boris Trofimov
@b0ris_1
2. Agenda
•Two stories on how AD is served inside an AD company
•Awesome Scalding
The stories mention one company that has built a multimillion-dollar business on ordinary cookies
9. The first second of an ad request (timeline reconstructed from the slide diagram):
•0 ms: Publisher receives the request
•20 ms: Publisher sends the response
•100 ms: Content delivered to the user
•150 ms: Site sends a request to the Ad Server
•170 ms: Ad Server receives the ad request and redirects to the Ad Exchange
•200 ms: SSP (Ad Exchange) receives the ad request and opens an RTB auction
•210 ms: Every bidder/DSP receives info about the user: ssp_cookie_id, geo data, site url
•280 ms: All bidders must send their decision (participate? & price) back (an ~80 ms bidding window)
•300 ms: SSP picks the winning bid and sends the redirect url back to the Ad Server
•350 ms: Ad Server shows the user a page which redirects to the bidder’s server
•400 ms: The user’s web page asks for the ad banner from the CDN, showing the ad & the bidder’s 1x1 pixel (impression)
•~1 sec: the first second is over
Notes: ~70% of users have this cookie aboard; many (>>1) independent companies take part in each auction.
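The slide reduces each bidder’s job to a simple contract: given the user info above (ssp_cookie_id, geo data, site url), return a participate/price decision within the bidding window. Below is a minimal sketch of that contract in Scala; all type and field names are invented for illustration and are not taken from the deck (real exchanges use OpenRTB JSON).

// Hypothetical request/response shapes for a toy bidder.
case class BidRequest(sspCookieId: String, geo: String, siteUrl: String)
case class BidResponse(participate: Boolean, priceCpm: BigDecimal)

object Bidder {
  // Toy policy: bid only on users we already hold segments for,
  // pricing by how many segments the profile carries.
  def decide(req: BidRequest, profiles: Map[String, Set[Int]]): BidResponse =
    profiles.get(req.sspCookieId) match {
      case Some(segments) if segments.nonEmpty =>
        BidResponse(participate = true, priceCpm = BigDecimal("0.10") * segments.size)
      case _ =>
        BidResponse(participate = false, priceCpm = BigDecimal(0))
    }
}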
10. Return info about new user interests with special markers (segments), each indicating a newly learned fact about the user, e.g. the user is a man who has an iPhone, lives in NYC, and has a dog.
Major format: <cookie_id – segment_id>
[Slide diagram: the ad-data pipeline. On the real-time side, the Bidder Farm serves auction requests coming from the SSP / Ad Exchange, and a Pixel Tracking Farm records impressions, clicks, and post-click activities. On the offline side, hourly logs, 3rd-party data, and householder data land in Hadoop’s HDFS; Scalding and MapReduce jobs, orchestrated by Oozie, update user profiles with new segments (HBase keeps the user profiles); Hive serves as the warehouse used by the Data Scientists and for data export to partners, producing a brand-new feed about user interests.]
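To make the profile-update step concrete, here is a minimal Scalding-style sketch that folds the hourly <cookie_id – segment_id> feed into one row per cookie. The job, file, and field names are illustrative, not from the deck.

import com.twitter.scalding._

// Aggregates the hourly <cookie_id, segment_id> feed into one
// row per cookie carrying its distinct set of segments.
class UpdateProfilesJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('cookieId, 'segmentId))
    .read
    .groupBy('cookieId) { _.toList[Int]('segmentId -> 'segments) }
    .map('segments -> 'segmentsCsv) { s: List[Int] => s.distinct.mkString(",") }
    .project('cookieId, 'segmentsCsv)
    .write(Tsv(args("output")))
}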
11. Why do we need all this science?
•Deep audience targeting
•Case: a customer would like to show an ad to all men who live in NYC, have an iPhone, and have a dog (see the sketch below)
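Once interests are encoded as segments, such a targeting rule reduces to a set test. A tiny sketch, with invented segment ids:

// Hypothetical segment ids for the example targeting rule.
val Man = 101
val LivesInNYC = 202
val HasIphone = 303
val HasDog = 404
val required = Set(Man, LivesInNYC, HasIphone, HasDog)

// A profile matches the campaign if it carries every required segment.
def matches(profileSegments: Set[Int]): Boolean =
  required.subsetOf(profileSegments)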
12. Facts about Data Scientists
•Data Scientists do:
–Audience Modeling
identifying new user interests [segments] and finding ways to track them
–Audience Bridging
–Insights and Analytics
•They use IBM Netezza as a local warehouse
•They use R language
13. Facts about Realtime team
•Scala, Java
•Restful Services
•Akka
•In Memory Cache : Aerospike, Redis
14. Facts about Offline team
•The tasks we solve over Hadoop:
–As a Storage to keep all logs we need
–As Profile DB to keep all users and their interests [segments]
–As MapReduce Engine to run jobs on transformations between data
–As a Warehouse to export data via hive
•We use Cloudera CDH 5.1.2
•Major language: Scala
•Pure MapReduce jobs & Scalding/Cascading
•All map reduce applications are wrapped by Oozie’s workflow(s)
•Developing a next-gen platform version based on Spark Streaming/Kafka
16. Scalding in a nutshell (a pipeline reads from hdfs and writes back to hdfs)
•Concise DSL
•Configurable source(s) and sink(s)
•Data transform operations:
–map/flatMap
–pivot/unpivot
–project
–groupBy/reduce/foldLeft
17. Just one example (Java way)

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts emitted for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
18. Just one example (Scalding way)

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )                                        // Source
    .flatMap('line -> 'word) { line : String => tokenize(line) }   // Transform operations
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )                                // Sink

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase the text and split it on whitespace.
    text.toLowerCase.split("\\s+")
  }
}
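For reference, a fields-based Scalding job like this is typically launched through com.twitter.scalding.Tool; the jar name and paths below are placeholders:

hadoop jar myjob-assembly.jar com.twitter.scalding.Tool WordCountJob --hdfs --input /data/in.txt --output /data/out.tsv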
19. Use Case 1 Split
•Motivation: reuse calculated streams
val common = Tsv("./file").map(...)
val branch1 = common.map(..).write(Tsv("output"))
val branch2 = common.groupBy(..).write(Tsv("output"))
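A slightly fuller sketch of the same split pattern with distinct sinks; field names and paths are illustrative:

import com.twitter.scalding._

class SplitJob(args: Args) extends Job(args) {
  // The common pipe is defined once and feeds both branches.
  val common = Tsv(args("input"), ('id, 'value)).read
    .map('value -> 'upper) { v: String => v.toUpperCase }

  // Branch 1: plain projection.
  common
    .project('id, 'upper)
    .write(Tsv(args("out1")))

  // Branch 2: aggregation over the same upstream computation.
  common
    .groupBy('id) { _.size('count) }
    .write(Tsv(args("out2")))
}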
20. Use Case 2 Exotic Sources JDBC (out of the box)
case object YourTableSource extends JDBCSource {
override val tableName = "tableName"
override val columns = List(
varchar("col1", 64),
date("col2"),
tinyint("col3"),
double("col4"),
)
override def currentConfig = ConnectionSpec("www.gt.com", "username", "password", "mysql")
}
YourTableSource.read.map(...) ...
21. Use Case 2 Exotic Sources HBASE
HBaseSource (https://github.com/ParallelAI/SpyGlass)
•SCAN_ALL,
•GET_LIST,
•SCAN_RANGE
HBaseRawSource (https://github.com/andry1/SpyGlass)
•Advanced filtering via base64Scan
val hbs3 = new HBaseSource( tableName, quorum, 'key,
  List("data"), List('data),
  sourceMode = SourceMode.SCAN_ALL ).read

val scan = new Scan()
scan.setCaching(caching)
// Assumes static imports: FilterList.Operator.MUST_PASS_ONE,
// CompareFilter.CompareOp.GREATER_OR_EQUAL, and Bytes.toBytes.
val activity_filters = new FilterList(MUST_PASS_ONE, {
  val scvf = new SingleColumnValueFilter(toBytes("family"), toBytes("column"), GREATER_OR_EQUAL, toBytes(value))
  scvf.setFilterIfMissing(true)
  scvf.setLatestVersionOnly(true)
  val scvf2 = ...
  List(scvf, scvf2)
})
scan.setFilter(activity_filters)
new HBaseRawSource(tableName, quorum, families,
  base64Scan = convertScanToBase64(scan)).read. ...
22. Use Case 3 Join
•Motivation: joining two streams by key
•Different join strategies:
–joinWithLarger
–joinWithSmaller
–joinWithTiny
•Inner, Left, and Right join modes
val pipe1 = Tsv("file1").read
val pipe2 = Tsv("file2").read // small file
val pipe3 = Tsv("file3").read // huge file
val joinedPipe = pipe1.joinWithTiny('id1 -> 'id2, pipe2)
val joinedPipe2 = pipe1.joinWithLarger('id1 -> 'id2, pipe3)
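The choice between them is about relative data size: joinWithTiny replicates the right-hand pipe to every mapper (a map-side join), while joinWithSmaller/joinWithLarger are reduce-side joins that differ only in which side is streamed. The join mode is selected with a cascading joiner; a hedged sketch:

import cascading.pipe.joiner.LeftJoin

// Reduce-side left outer join: keeps pipe1 rows without a match.
val leftJoined = pipe1.joinWithSmaller('id1 -> 'id2, pipe2, joiner = new LeftJoin)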
23. Use Case 4 Distributed Caching and Counters

// somewhere outside the Job definition
val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
// the resulting path can be passed into any Scalding job, e.g. through its Args object
val fileName = fl.path
...
class MyJob(args: Args) extends Job(args) {
  // once we receive fl.path we can read it like an ordinary file
  val fileName = args("fileName")
  lazy val data = readJSONFromFile(fileName)
  ...
  Tsv(args("input")).read.map('line -> 'word) {
    line: String => ... /* using the data json object */ ...
  }
}

// counter example
Stat("jdbc.call.counter", "myapp").incBy(1)
24. Use Case 5 Bridging Profiles
Motivation: bridge information from different sources and build a complete person profile
•The company’s own private cookie is set thanks to the 1x1-pixel impression (imp)
•Two SSP cookies (ssp_cookie_Id1, ssp_cookie_Id2) are bridged via the private cookie
•Profiles can also be bridged via ip address
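A hedged sketch of the private-cookie bridge as a Scalding self-join: any two SSP cookie ids ever observed alongside the same private cookie get linked. All file and field names are invented:

import com.twitter.scalding._

// Impression log rows: (privateCookie, sspCookie).
// Self-joining the log on the private cookie yields pairs of
// SSP cookies that belong to the same person.
class BridgeCookiesJob(args: Args) extends Job(args) {
  val imps = Tsv(args("impressions"), ('privateCookie, 'sspCookie)).read

  imps
    .joinWithSmaller('privateCookie -> 'privateCookie2,
      imps.rename(('privateCookie, 'sspCookie) -> ('privateCookie2, 'sspCookie2)))
    .filter(('sspCookie, 'sspCookie2)) { p: (String, String) => p._1 < p._2 } // drop self and duplicate pairs
    .project('sspCookie, 'sspCookie2)
    .write(Tsv(args("bridges")))
}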