Google Developer Group Lublin 8 - Modern Lambda architecture in Big Data
Presentation :
https://www.youtube.com/watch?v=qmlzrCMT6Wc
Given at Data Day Texas 2016.
Apache Spark has been hailed as a trail-blazing new tool for doing distributed data science. However, since it's so new, it can be difficult to set up and hard to use. In this talk, I'll discuss the journey I've had using Spark for data science at Bitly over the past year. I'll talk about the benefits of using Spark, the challenges I've had to overcome, the caveats for using a cutting-edge technology such as this, and my hopes for the Spark project as a whole.
The document discusses recent developments and trends in IT, including virtual technologies, intelligent machines, analytics, wearable systems, 3D printing, context-aware systems, cloud computing, programming languages, databases, and mobile development. Emerging technologies mentioned include the Internet of Things, which will generate large data pools, smart machines learning on their own, open source/data trends, and HTML5 bringing new capabilities to smartphones. Programming languages discussed include jQuery, CSS, Java, PHP, .NET, Python, and frameworks like Ruby on Rails and AngularJS.
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...Alexey Zinoviev
Alexey Zinoviev presented this talk at the JPoint'15 conference: javapoint.ru/talks/#zinoviev.
The talk covers the following topics: Java, JPA, Morphia, Hibernate OGM, Spring Data, Hector, Kundera, NoSQL, Mongo, Cassandra, HBase, Riak
1) Databases have been around for a long time but newer non-relational databases are gaining popularity. However, databases still have issues that can "kill you" like fragile replication and poor failover support.
2) The CAP theorem states that a distributed system cannot simultaneously provide consistency, availability, and partition tolerance. This introduced needed realism that different data stores solve different problems depending on their focus on two of these areas.
3) New non-relational databases like MongoDB provide better solutions for issues like replication and failover that were weaknesses of relational databases. Document stores in particular are good alternatives when data is not truly relational.
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
What we're about
A while ago I entered the challenging world of Big Data. As an engineer, at first I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world.
How to transform data (TXT, CSV, TSV, JSON) into Parquet or ORC?
Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?
How to handle streaming?
How to manage costs?
Performance tips?
Security tips?
Cloud best practices?
Some of our online materials:
Website:
https://big-data-demystified.ninja/
Youtube channels:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
Meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
https://www.meetup.com/Big-Data-Demystified
Facebook Group :
https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/)
Audience:
Data Engineers
Data Scientists
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
How to create a personal knowledge graph IBM Meetup Big Data Madrid 2017Juantomás García Molina
This document discusses how to create a personal knowledge graph. It begins by explaining why a knowledge graph is needed, as the speaker manages a lot of information from different sources and needs a way to organize and query it. It then discusses how to build a knowledge graph using concepts like explicit and implicit information, graph databases, and collective intelligence. The speaker advocates using cloud services, containers, notebooks and machine learning to build the knowledge graph. The first steps proposed are to name the project "Boosterme" and start a GitHub repository.
Developing high frequency indicators using real time tick data on apache supe...Zekeriya Besiroglu
This document summarizes the Central Bank of Turkey's project to develop high frequency market indicators using real-time tick data from the Thomson Reuters Enterprise Platform. It describes how they set up Apache Kafka, Druid, Spark and Superset on Hadoop to ingest, store, analyze and visualize the data. Their goal was to observe foreign exchange markets in real-time to detect risks and patterns. The architecture evolved over three phases from an initial test cluster to integrating Druid and Hive for improved querying and scaling to production. Work is ongoing to implement additional indicators and integrate historical data for enhanced analysis.
Big Data Anti-Patterns: Lessons From the Front LineDouglas Moore
This document summarizes common anti-patterns in big data projects based on lessons learned from working with over 50 clients. It identifies anti-patterns in hardware and infrastructure, tooling, and big data warehousing. Specifically, it discusses issues with referencing outdated architectures, using tools improperly for the workload, and de-normalizing schemas without understanding the implications. The document provides recommendations to instead co-locate data and computing, choose the right tools for each job, and deploy solutions matching the intended workload.
This is the presentation I gave at the Data Science KC meet up on 25Aug14. Video can be found at: https://www.youtube.com/watch?v=g1Tu7kUlNa4
Tonight's Meetup will discuss testing big data in AWS. The presenter Michael Sexton will cover what big data is in AWS, types of testing including functional and performance testing, and challenges of testing large varied datasets in the cloud. Attendees will learn about services AWS provides for big data, how to test applications that handle data, and tools for automating testing including AWS services and Python scripts. There will be time for questions at the end.
This document discusses Real Time Log Analytics using the ELK (Elasticsearch, Logstash, Kibana) stack. It provides an overview of each component, including Elasticsearch for indexing and searching logs, Logstash for collecting, parsing, and enriching logs, and Kibana for visualizing and analyzing logs. It describes common use cases for log analytics like issue debugging and security analysis. It also covers challenges like non-consistent log formats and decentralized logs. The document includes examples of log entries from different systems and how ELK addresses issues like scalability and making logs easily searchable and reportable.
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Data Con LA
This talk addresses one apparently simple question: which open source machine learning tools can one use out of the box to do binary classification (probably the most common machine learning problem) on largish datasets? We will therefore discuss the speed, scalability and accuracy of the top (most accurate) learning algorithms in R packages, Python's scikit-learn, H2O, xgboost and Spark's MLlib.
Big Data Processing Using Hadoop InfrastructureDmitry Buzdin
The document discusses using Hadoop infrastructure for big data processing. It describes Intrum Justitia SDC, which has data across 20 countries in various formats and a high number of data objects. Hadoop provides solutions like MapReduce and HDFS for distributed storage and processing at scale. The Hadoop ecosystem includes tools like Hive, Pig, HBase, Impala and Oozie that help process and analyze large datasets. Examples of using Hadoop with Java and integrating it into development environments are also included.
Big Data, Hadoop, Flume, Spark, Cloudera, Oracle Big Data Appliance, Oracle Loader for Hadoop, big data copy, Exadata to Big Data Appliance. Bilginc IT Academy.
Impala turbocharge your big data accessOphir Cohen
This document discusses challenges with accessing large amounts of data stored in Hadoop and introduces Impala as a solution. It notes that LivePerson stores over 13 terabytes of data per month in its over 1 petabyte Hadoop cluster. Traditional MapReduce jobs in Java can take hours or days to access this data, and Hive provides slower access with a SQL-like language. Impala provides interactive queries 4 to 30 times faster than Hive by bypassing MapReduce and using a scalable parallel database integrated with Hadoop formats and infrastructure.
Apache Spark is rapidly emerging as the prime platform for advanced analytics in Hadoop. This briefing is updated to reflect news and announcements as of July 2014.
Chris Curtin discusses Silverpop's journey with Hadoop. They initially used Hadoop to build flexible reports on customer data despite varying schemas. This helped with queries but was difficult to maintain. They then used Cascading to dynamically define schemas and job steps. Next, they profiled customer interactions over time which challenged Hadoop due to many small files and lack of appending. They switched to MapR which helped but recovery remained an issue. Current work includes optimizing imports, packaging the solution, and watching new real-time Hadoop technologies. The main challenges have been helping customers understand and use insights from large and complex data.
Building scalable data pipelines for big data involves dealing with legacy systems, implementing data lineage and provenance, managing the data lifecycle, and engineering pipelines that can handle large volumes of data. Effective data pipeline engineering requires understanding how to extract, transform and load data while addressing issues like privacy, security, and integrating diverse data sources. Frameworks like Cascading can help build pipelines, but proper testing and scaling is also required to develop robust solutions.
This document summarizes a presentation about Kappa Architecture 2.0 given by Juantomás García. It discusses the origins of Kappa Architecture as coined by Jay Kreps in 2014 for handling real-time data. It then outlines how Kappa Architecture uses tools like Apache Kafka, Apache Spark, and Apache Samza to handle real-time and batch data processing. The presentation also provides examples of how Kappa Architecture has been applied to a use case of monitoring car data in real-time and scaling REST APIs. It concludes by discussing future improvements and variations of Kappa Architecture.
Microservices, containers, and machine learningPaco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets; then we run machine learning on a Spark cluster to surface insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
Big Data is changing abruptly, and where it is likely headingPaco Nathan
Big Data technologies are changing rapidly due to shifts in hardware, data types, and software frameworks. Incumbent Big Data technologies do not fully leverage newer hardware like multicore processors and large memory spaces, while newer open source projects like Spark have emerged to better utilize these resources. Containers, clouds, functional programming, databases, approximations, and notebooks represent significant trends in how Big Data is managed and analyzed at large scale.
introduction to Neo4j (Tabriz Software Open Talks)Farzin Bagheri
This document provides an overview of Neo4j, a graph database. It begins with definitions of relational and NoSQL databases, categorizing NoSQL into key-value, document, column-oriented, and graph databases. Graph databases are explained to contain nodes, relationships, and properties. Neo4j is introduced as an example graph database, with Cypher listed as its query language. Examples of using Cypher to create nodes and relationships are provided. Finally, potential uses of Neo4j are listed, including social networks, network analysis, recommendations, and more.
Amazon aws big data demystified | Introduction to streaming and messaging flu...Omid Vahdaty
This document provides an overview of streaming data and messaging concepts including batch processing, streaming, streaming vs messaging, challenges with streaming data, and AWS services for streaming and messaging like Kinesis, Kinesis Firehose, SQS, and Kafka. It discusses use cases and comparisons for these different services. For example, Kinesis is suitable for complex analytics on streaming data while SQS focuses on per-event messaging. Firehose automatically loads streaming data into AWS services like S3 and Redshift without custom coding.
Big Data in the Cloud - Montreal April 2015Cindy Gross
slides:
Basic Big Data and Hadoop terminology
What projects fit well with Hadoop
Why Hadoop in the cloud is so Powerful
Sample end-to-end architecture
See: Data, Hadoop, Hive, Analytics, BI
Do: Data, Hadoop, Hive, Analytics, BI
How this tech solves your business problems
SnappyData is a new open source project started by Pivotal GemFire founders to provide a unified platform for OLTP, OLAP and streaming analytics using Spark. It aims to simplify big data architectures by supporting mixed workloads in a single clustered database, allowing for real-time operational analytics on live data without copying to other systems. This provides faster insights than current approaches that require periodic data copying between different databases and analytics systems.
Big data refers to large, complex datasets that are difficult to process using traditional methods. This document discusses three examples of real-world big data challenges and their solutions. The challenges included storage, analysis, and processing capabilities given hardware and time constraints. Solutions involved switching databases, using Hadoop/MapReduce, and representing complex data structures to enable analysis of terabytes of ad serving data. Flexibility and understanding domain needs were key to feasible versus theoretical solutions.
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaDatabricks
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals. Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth. We will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty). We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services' costs. Topics include:
Kafka and Spark Streaming for stateless and stateful use-cases
Spark Structured Streaming as a possible alternative
Combining Spark Streaming with batch ETLs
"Streaming" over Data Lake using Kafka
NoSQL databases should not be chosen just because a system is slow or to replace RDBMS. The appropriate choice depends on factors like the nature of the data, how the data scales, and whether ACID properties are needed. NoSQL databases are categorized by data model (document, column family, graph, key-value store) which affects querying. Other considerations include scalability based on the CAP theorem and operational factors like the distribution model and whether there is a single point of failure. The best choice depends on the specific requirements and risks losing data if chosen incorrectly.
Big Data! Great! Now What? #SymfonyCon 2014Ricard Clau
Big Data is one of the new buzzwords in the industry. Everyone is using NoSQL databases. MySQL is not cool anymore. But... do we really have big data? Where should we store it? Are the traditional RDBMS databases dead? Is NoSQL the solution to our problems? And most importantly, how can PHP and Symfony2 help with it?
Architecting Big Data Ingest & ManipulationGeorge Long
Here's the presentation I gave at the KW Big Data Peer2Peer meetup held at Communitech on 3rd November 2015.
The deck served as a backdrop to the interactive session
http://www.meetup.com/KW-Big-Data-Peer2Peer/events/226065176/
The scope was to drive an architectural conversation about :
o What it actually takes to get the data you need to add that one metric to your report/dashboard?
o What's it like to navigate the early conversations of an analytic solution?
o How is one technology selected over another and how do those selections impact or define other selections?
The document provides an overview of the Spark framework for lightning fast cluster computing. It discusses how Spark addresses limitations of MapReduce-based systems like Hadoop by enabling interactive queries and iterative jobs through caching data in-memory across clusters. Spark allows loading datasets into memory and querying them repeatedly for interactive analysis. The document covers Spark's architecture, use of resilient distributed datasets (RDDs), and how it provides a unified programming model for batch, streaming, and interactive workloads.
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPRMongoDB
No matter if you're thinking of migrating to MongoDB, or need to meet legal requirements for an existing on-prem cluster, this talk has you covered. We start with the basics of replication and sharding and quickly scale up, covering everything you need to know to control your data, and keep it safe from unexpected data loss or downtime - a well-designed MongoDB cluster should have no single point of failure. Learn how others are “stretching” what’s possible but why you shouldn't! I'll present real-world examples from my life in the field in Europe and beyond.
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaDataWorks Summit
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences.
To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals.
Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth.
We will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services' costs.
Topics include :
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* "Streaming" over Data Lake using Kafka
Introduction to NetGuardians' Big Data Software StackJérôme Kehrli
NetGuardians runs its Big Data Analytics Platform on three key Big Data components: ElasticSearch, Apache Mesos and Apache Spark. This is a presentation of the behaviour of this software stack.
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...StampedeCon
From the StampedeCon 2015 Big Data Conference: There is an adage, "If you fail to plan, you plan to fail". When developing systems the adage can be taken a step further: "If you fail to plan FOR FAILURE, you plan to fail". At Huffington Post data moves between a number of systems to provide statistics for our technical, business, and editorial teams. Due to the mission-critical nature of our data, considerable effort is spent building resiliency into processes.
This talk will focus on designing for failure. Some material will focus on understanding the traits of specific distributed systems, such as message queues or NoSQL databases, and the consequences of different types of failures. Other parts of the presentation will focus on how systems and software can be designed to make re-processing batch data simple, or how to determine which failure-mode semantics are important for a real-time event processing system.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
This document discusses using big data tools to build a fraud detection system. It outlines using Azure infrastructure to set up a Hadoop cluster with HDFS, HBase, Kafka and Spark. Mock transaction data will be generated and sent to Kafka. Spark jobs will process the data in batches, identifying potentially fraudulent transactions and writing them to an HBase table. The data will be visualized using Zeppelin notebooks querying Phoenix SQL on HBase. This will allow analysts to further investigate potential fraud patterns in near real-time.
The document discusses SQL versus NoSQL databases. It provides background on SQL databases and their advantages, then explains why some large tech companies have adopted NoSQL databases instead. Specifically, it describes how companies like Amazon, Facebook, and Google have such massive amounts of data that traditional SQL databases cannot adequately handle the scale, performance, and flexibility needs. It then summarizes some popular NoSQL databases like Cassandra, Hadoop, MongoDB that were developed to solve the challenges of scaling to big data workloads.
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
This document discusses hardware provisioning best practices for MongoDB. It covers key concepts like bottlenecks, working sets, and replication vs sharding. It also presents two case studies where these concepts were applied: 1) For a Spanish bank storing logs, the working set was 4TB so they provisioned servers with at least that much RAM. 2) For an online retailer storing products, testing found the working set was 270GB, so they recommended a replica set with 384GB RAM per server to avoid complexity of sharding. The key lessons are to understand requirements, test with a proof of concept, measure resource usage, and expect that applications may become bottlenecks over time.
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
Presentation for my tutorial at Big Data Tech Con http://goo.gl/4Pcvuq
It is a work in progress. I will update it with daily snapshots until done.
2. Hello world :)
■ Who am I ?
■ Java developer working at Codete
■ Keen on Big Data and modern backend approaches
■ Luckily, I can pursue this passion at Codete
■ https://github.com/Hejwo
■ [email protected]
■ Disclaimer 1 - the talk will be in Polish, but with a lot of English, business-specific terms.
■ Disclaimer 2 - the field is broad, so we will only cover the bigger picture
■ Disclaimer 3 - Live coding ? Next time
■ Disclaimer 4 - From zero to hero style
3. Recap & Intro
■ Recap - at the end of the last GDG meetup we were talking about Machine Learning
■ We'll talk about the difference between Data Science and Big Data, which are often confused
Recap
Data science
■ Data science, also known as data-driven science, is an interdisciplinary field about scientific methods and processes to extract knowledge or insights from data in various forms, either structured or unstructured
■ Data science focuses on cleaning the gathered data, math, statistics, business understanding and extracting valuable information
Big data
■ Modern methods of gathering and processing large volumes of data
■ More info in next 40 mins ;)
6. What’s Big Data ?
■ The amount of data we produce is getting larger and larger
■ The Internet of Things plays an important role here -> sensors, sensors are everywhere !
■ At some point EVEN business guys discovered that there's great value behind unstructured data
■ ETLs on a massive scale
■ Recommendation systems based on FB likes
■ Analysing user traffic on e-shops and optimizing contents
■ Raw data from car’s sensors
■ Optimizing traffic like in Lublin :)
■ POTENTIAL and AMOUNT of data that we need is HUGE
■ Fun fact - having raw data means we don't have to know up front what we're looking for, and that's great !!!
■ Discovering new relations in our data
8. How to process Big Data ?
Moore’s law is dying [*]
“Moore's law is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years”
■ Right now every further step in transistor scaling is getting more and more expensive.
■ New processors are getting more and more expensive.
■ Until now we could rely on Moore's law: if our infrastructure was struggling, after two years we could buy faster hardware at approximately the same cost.
■ But… we still have many cores. And… sometimes distributing work across the cores of a single machine is still not enough.
9. How to process Big Data ? - Scale up vs. Scale out
Scale up
■ Costly components
■ Complex application/system logic, often multithreaded
■ Poor fault-tolerance
■ Machine is getting hot as Mordor.
Scale out
■ Cheaper machines
■ Simpler application and system logic
■ Thanks to orchestration tools such as Mesos or Kubernetes it's not THAT hard to maintain.
■ Fault-tolerance - if half of our machines explode, we can still do something
■ Needs data centers :(
12. Meet Apache Spark - Big Data processing engine !
■ Created at UC Berkeley
■ At the beginning it was a proof of concept for the Mesos cluster manager
■ Much faster than its predecessor - Hadoop MapReduce
■ By default it operates in memory
■ Fewer disk writes mean more speed
■ Rich and simple caching mechanism
■ There are tons of other Big Data processing engines - Hadoop, Storm, Flink, Splunk
■ We're going to focus on Spark due to time (see the minimal sketch below)
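To make the "in-memory, cached" point concrete, here is a minimal word-count sketch using the Spark 2.x Java API. The input file name, app name and local master are my placeholders, not something from the deck; on a real cluster the master would come from spark-submit.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // Local mode for the demo; on a cluster the master is set by spark-submit
        SparkConf conf = new SparkConf().setAppName("gdg-wordcount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // "events.txt" is a placeholder input file
        JavaRDD<String> lines = sc.textFile("events.txt");

        // cache() keeps the RDD in memory, so repeated actions skip re-reading the disk
        JavaRDD<String> words = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .cache();

        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.take(10).forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        sc.stop();
    }
}
```

The cache() call is what the "rich and simple caching mechanism" bullet refers to: once the RDD is materialized in memory, further queries over it avoid going back to disk.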
14. Is Big Data processing THE only direction ?
Spark is faster than Hadoop, but still… it’s heavy machinery
15. Is Big Data THE only direction ?
Reactive Manifesto
■ Responsive - what happens when the Wifi is down ? Users want FAST responses !
■ Elastic - large systems tend to experience frequent, massive load spikes
■ Resilient - the system must stay available, and any kind of response is better than no response.
■ Message Driven - isolation and non-blocking behaviour are achieved via async communication. Thanks to that we get clear boundaries, isolation and transparency.
17. Meet our systems heart - Apache Kafka
■ Lightning-fast messaging system
■ The heart of a Big Data system
■ Distributed
■ Built by LinkedIn
■ Written in Scala
■ Producers and Consumers concept (see the producer sketch below)
■ Auto-recovery, broker detection
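A minimal producer sketch in Java to illustrate the producer/consumer concept above. The broker address, topic name and event payloads are assumptions for this example, not values from the slides.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SensorEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=all waits for in-sync replicas, trading latency for durability
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // The key controls partitioning, so events from one sensor stay ordered
                producer.send(new ProducerRecord<>("sensor-events", "sensor-" + (i % 3),
                        "{\"temperature\": " + (20 + i) + "}"));
            }
            producer.flush();
        }
    }
}
```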
22. Spark Streaming - when batch is not enough
■ Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
■ By running on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run
ad-hoc queries on stream state.
■ Used for rapid micro-batching (see the sketch after this list)
■ Before Spark Streaming, building complex pipelines that encompass streaming, batch, or even machine learning capabilities with open
source software meant dealing with multiple frameworks
■ Streaming ETL – Data is continuously cleaned and aggregated before being pushed into data stores. No more SAP.
■ Triggers – Anomalous behavior is detected in real-time and further downstream actions are triggered accordingly.
■ Data enrichment – Live data is enriched with more information by joining it with a static dataset allowing for a more complete real-time
analysis.
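A sketch of the Kafka -> Spark Streaming micro-batching described above, using the spark-streaming-kafka-0-10 integration in Java. The topic name, broker address and the 5-second batch interval are assumptions for illustration, not taken from the deck.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class StreamingEtl {
    public static void main(String[] args) throws InterruptedException {
        // local[2]: streaming needs at least one thread for receiving and one for processing
        SparkConf conf = new SparkConf().setAppName("gdg-streaming").setMaster("local[2]");
        // 5-second micro-batches: "rapid micro-batching", not record-at-a-time streaming
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "gdg-demo");
        kafkaParams.put("auto.offset.reset", "latest");

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Arrays.asList("sensor-events"), kafkaParams));

        // Simplest possible streaming ETL step: count events per micro-batch before pushing downstream
        stream.map(ConsumerRecord::value)
              .foreachRDD(rdd -> System.out.println("events in batch: " + rdd.count()));

        jssc.start();
        jssc.awaitTermination();
    }
}
```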
24. Now… let’s store it ! NoSql store it !
Why NoSQL ?
■ Large datasets
■ Easy to scale out
■ Less schema validation on write means faster writes
■ Schemaless databases can be of great value in Big Data: we often don't yet know exactly what we need, so we keep the raw, "dirty" data (a small document-store sketch follows below)
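The deck does not show which store it uses; as one illustration of "schemaless, no validation on write", here is a sketch with the MongoDB Java driver, where documents of different shapes land in the same collection. The connection string, database and collection names are made up.

```java
import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class SchemalessWrite {
    public static void main(String[] args) {
        // Connection string and names are placeholders for this sketch
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events = client.getDatabase("gdg").getCollection("raw_events");

            // Two documents with different shapes in the same collection:
            // no schema migration, no validation on write
            events.insertOne(new Document("type", "click").append("url", "https://example.com"));
            events.insertOne(new Document("type", "sensor")
                    .append("temperature", 21.5)
                    .append("tags", Arrays.asList("lublin", "iot")));
        }
    }
}
```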
27. Why it’s modern ?
■ Fast, reliable
■ More like - write once, run everywhere thanks to Spark, Spark Shell, Zeppelin
■ Less code (Hadoop’s MapReduce ? It’s an essay)
■ Compared to the older approach - less chaos, thanks to Kafka.
28. Cons
■ More like micro-batching than real time
■ A lot of the stack is still evolving (Spark, Kafka) and lacks professional customer support
■ Things tend to get complicated when Kafka message formats within a single topic evolve
■ DevOps needed; strong, experienced developers needed
■ Distributed world is complicated world
■ Thousands of frameworks and ideas every year
32. Spark ? Interesting alternative to ETL Hell
■ SAP, SAS, Elixir
■ ETL has nice visual building blocks, but this means....
■ … Click, Click, Click, Click… (RSI danger !)
■ Building blocks mean plain-text code hidden in stages: hard to debug, hard to unit-test.
■ Waste of resources: ETL jobs fire at night, when peak capacity is needed; for the rest of the day those resources sit unused.
■ Data gets out of sync, so the ETL pipeline gets out of sync.
■ In the Big Data world we have Apache Avro and a schema registry for schema evolution
■ Big Data can handle more
■ Legacy code
■ $$$ It’s for FREE $$$
■ Can throw Machine Learning into it and do interesting things. Not only batches.
■ Lacks professional support
■ Big Data is not that mature
■ Let's see what happens here (a minimal ETL sketch follows below)
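As a contrast to click-built ETL blocks, here is a sketch of the same kind of job as plain Spark SQL code in Java: read CSV, filter and rename, write partitioned Parquet. The file paths and column names are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class CsvToParquetEtl {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("gdg-etl")
                .master("local[*]") // placeholder; set via spark-submit on a cluster
                .getOrCreate();

        // Paths are placeholders; inferSchema avoids declaring columns up front
        Dataset<Row> orders = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("orders.csv");

        // The "transform" step lives in plain, testable code instead of visual ETL blocks
        Dataset<Row> cleaned = orders
                .filter(col("amount").gt(0))
                .withColumnRenamed("cust_id", "customerId");

        // Columnar Parquet output, partitioned (here by a hypothetical "country" column)
        cleaned.write().partitionBy("country").parquet("orders_parquet");

        spark.stop();
    }
}
```

Because the transformation is ordinary code, it can be unit-tested and code-reviewed like any other part of the system, which addresses the "hard to debug, hard to unit-test" point above.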
35. Apache Kafka vs. Rabbit MQ
Kafka :
■ + Fire hose of events (100k+/sec)
■ + Ability to re-read messages (good for CQRS - see the replay sketch at the end)
■ + Scale out
■ + Confluent -> Kafka Connect, Kafka Streams, Schema Registry
■ - You have to support it on your own
■ - No AMQP and no complex routing
RabbitMQ :
■ + Messages can be routed to consumers in complex ways
■ + Mature - you like yelling at support guys rather than fixing it yourself ? This is the place for you !
■ + Scale out
■ - Lower throughput (~20k messages/sec)
■ - Messages are deleted after consumers ack
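To illustrate the "re-read messages" advantage listed for Kafka, here is a consumer sketch that rewinds to the beginning of the log and replays it. The broker, topic and group id are placeholders; with RabbitMQ this replay is not possible once messages have been acked and deleted.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "replay-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("sensor-events"));
            // Simplification: one poll to join the group and receive partition assignments,
            // then rewind all assigned partitions - the log is still there, so we can replay it
            consumer.poll(Duration.ofSeconds(1));
            consumer.seekToBeginning(consumer.assignment());

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```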