Did you miss Scala Days 2015 in San Francisco? Have no fear! BoldRadius was there and we've compiled the best of the best! Here are the highlights of a great conference.
The document introduces the Dataset API in Spark, which provides type safety and performance benefits over DataFrames. Datasets allow operating on domain objects using compiled functions rather than Rows. Encoders efficiently serialize objects between their JVM representation and Spark's internal binary format. This allows type checking of operations and retaining typed objects in distributed operations. The document outlines the history of Spark APIs, the limitations of DataFrames, and how Datasets address these through compiled encoding and working with case classes rather than Rows.
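To make the contrast concrete, here is a minimal sketch of the Dataset idea using the Spark 1.6-era API; the file name and fields are illustrative:

case class Person(name: String, age: Long)

import sqlContext.implicits._ // brings the implicit Encoders for case classes into scope

// Read a DataFrame of Rows, then convert it to a typed Dataset[Person]
val people = sqlContext.read.json("people.json").as[Person]

// The lambda is compiled and type-checked; a typo in a field name fails at compile time
val adults = people.filter(_.age >= 18)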
Introduction to concurrent programming with Akka actors, by Shashank L
This document provides an introduction to concurrent programming with Akka Actors. It discusses concurrency and parallelism, how the end of Moore's Law necessitated a shift to concurrent programming, and introduces key concepts of actors including message passing concurrency, actor systems, actor operations like create and send, routing, supervision, configuration and testing. Remote actors are also discussed. Examples are provided in Scala.
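As a taste of the message-passing model described here, a minimal classic-Akka sketch (actor and message names are illustrative):

import akka.actor.{Actor, ActorSystem, Props}

case class Greet(name: String)

// An actor processes one message at a time, so its internal state needs no locks
class Greeter extends Actor {
  def receive = {
    case Greet(name) => println(s"Hello, $name!")
  }
}

object Main extends App {
  val system = ActorSystem("demo")
  val greeter = system.actorOf(Props[Greeter], "greeter")
  greeter ! Greet("Akka") // fire-and-forget send
}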
Multi Source Data Analysis using Apache Spark and Tellius, by datamantra
This document discusses analyzing data from multiple sources using Apache Spark and the Tellius platform. It covers loading data from different sources like databases and files into Spark DataFrames, defining a data model by joining the sources, and performing analysis like calculating revenues by department across sources. It also discusses challenges like double counting values when directly querying the joined data. The Tellius platform addresses this by implementing a custom query layer on top of Spark SQL to enable accurate multi-source analysis.
Improving Mobile Payments With Real time Spark, by datamantra
This document discusses improving mobile payments by implementing real-time analytics using Apache Spark streaming. The initial solution involved batch processing of mobile payment event data. The new solution uses Spark streaming to analyze data in real-time from sources like Amazon Kinesis. This allows for automatic alerts and a closed feedback loop. Challenges in moving from batch to streaming processing and optimizing the Python code are also covered.
Understanding transactional writes in datasource v2, by datamantra
This document discusses the new Transactional Writes in Datasource V2 API introduced in Spark 2.3. It outlines the shortcomings of the previous V1 write API, specifically the lack of transaction support. It then describes the anatomy of the new V2 write API, including interfaces like DataSourceWriter, DataWriterFactory, and DataWriter that provide transactional capabilities at the partition and job level. It also covers how the V2 API addresses partition awareness through preferred location hints to improve performance.
This document provides an introduction and overview of Spark's Dataset API. It discusses how Dataset combines the best aspects of RDDs and DataFrames into a single API, providing strongly typed transformations on structured data. The document also covers how Dataset moves Spark away from RDDs towards a more SQL-like programming model and optimized data handling. Key topics include the Spark Session entry point, differences between DataFrames and Datasets, and examples of Dataset operations like word count.
Anatomy of Data Source API: A deep dive into Spark Data source API, by datamantra
In this presentation, we discuss how to build a datasource from scratch using the Spark data source API. All the code discussed in this presentation is available at https://github.com/phatak-dev/anatomy_of_spark_datasource_api
Scala in Model-Driven development for Apparel Cloud Platform, by Tomoharu ASAMI
This document outlines an agenda for discussing the Apparel Cloud platform and Scala programming language. It provides a vision for the Everforth platform, which will utilize Scala for model-driven development, cloud platform architecture and real system development. Details are given on Scala products and tools used for development. Key Scala features like traits, case classes, monads and parallel/concurrent programming are discussed. The conclusion is that Scala is suitable for model-driven development and building cloud platform frameworks.
Spark uses Resilient Distributed Datasets (RDDs) as its fundamental data structure. RDDs are immutable, lazily evaluated collections of data that can be operated on in parallel. This allows RDD transformations to be computed lazily and combined for better performance. RDDs also support type inference and caching to improve efficiency. Spark programs run by submitting jobs to a cluster manager like YARN or Mesos, which then schedules tasks across worker nodes where the lazy transformations are executed.
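A small sketch of that laziness in practice; it assumes an existing SparkContext named sc, and the input path is illustrative:

val lines = sc.textFile("hdfs://data/events.log")
val errors = lines
  .filter(_.contains("ERROR")) // transformation: recorded, not executed
  .map(_.toUpperCase)          // transformation: fused with the filter above
errors.cache()                 // mark for in-memory reuse across actions
println(errors.count())        // action: only now does the work run on the cluster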
This document discusses best practices for migrating Spark applications from version 1.x to 2.0. It covers new features in Spark 2.0 like the Dataset API, catalog API, subqueries and checkpointing for iterative algorithms. The document recommends changes to existing best practices around choice of serializer, cache format, use of broadcast variables and choice of cluster manager. It also discusses how Spark 2.0's improved SQL support impacts use of HiveContext.
We are a company driven by inquisitive data scientists. Our pragmatic, interdisciplinary approach has evolved over decades of working with over 100 clients across multiple industries. Combining several data science techniques from statistics, machine learning, deep learning, decision science, cognitive science, and business intelligence with our ecosystem of technology platforms, we have produced unprecedented solutions. Welcome to the Data Science Analytics team that can do it all, from architecture to algorithms.
Our practice delivers data driven solutions, including Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics. We employ a number of technologies in the area of Big Data and Advanced Analytics such as DataStax (Cassandra), Databricks (Spark), Cloudera, Hortonworks, MapR, R, SAS, Matlab, SPSS and Advanced Data Visualizations.
This presentation is designed to help Spark enthusiasts get started; the course outline is below.
1. Introduction to Apache Spark
2. Functional Programming + Scala
3. Spark Core
4. Spark SQL + Parquet
5. Advanced Libraries
6. Tips & Tricks
7. Where do I go from here?
This document discusses principles for applying continuous delivery practices to machine learning models. It begins with background on the speaker and their company Indix, which builds location and product-aware software using machine learning. The document then outlines four principles for continuous delivery of machine learning: 1) Automating training, evaluation, and prediction pipelines using tools like Go-CD; 2) Using source code and artifact repositories to improve reproducibility; 3) Deploying models as containers for microservices; and 4) Performing A/B testing using request shadowing rather than multi-armed bandits. Examples and diagrams are provided for each principle.
Slides from my presentation on Lambda Architecture at Indix, presented at Fifth Elephant 2014.
It talks about our experience in using Lambda Architecture at Indix, to build a large scale analytics system on unstructured, dynamically changing data sources using Hadoop, HBase, Scalding, Spark and Solr.
How to Choose a Deep Learning Framework, by Navid Kalaei
The neural network trend has attracted a huge community of researchers and practitioners. However, not all of the front runners are masters of deep learning, and the many colorful frameworks can be confusing, especially for newcomers. In this presentation, I demystify the leading deep learning frameworks and provide a guideline on how to choose the most suitable option.
When Apache Spark Meets TiDB with Xiaoyu Ma (Databricks)
During the past 10 years, big-data storage layers have mainly focused on analytical use cases. For analytical workloads, users usually offload data onto a Hadoop cluster and perform queries on HDFS files. People struggle to deal with modifications on append-only storage and to maintain fragile ETL pipelines.
On the other hand, although Spark SQL has proven to be an effective parallel query processing engine, some tricks common in traditional databases are not available due to characteristics of the storage underneath. TiSpark sits directly on top of a distributed database's (TiDB's) storage engine, expands Spark SQL's planning with its own extensions, and utilizes unique features of the database storage engine to achieve functions not possible for Spark SQL on HDFS. With TiSpark, users are able to perform queries directly on changing/fresh data in real time.
The takeaways from this talk are twofold:
— How to integrate Spark SQL with a distributed database engine and the benefit of it
— How to leverage Spark SQL’s experimental methods to extend its capacity.
Scaling Search at Lendingkart discusses how Lendingkart scaled their search capabilities to handle large increases in data volume. They initially tried scaling databases vertically and horizontally, but searches were still slow at 8 seconds. They implemented ElasticSearch for its near real-time search, high scalability, and out-of-the-box functionality. Logstash was used to seed data from MySQL and MongoDB into ElasticSearch. Custom analyzers and mappings were developed. Searches then reduced to 230ms and aggregations to 200ms, allowing the business to scale as transactional data grew 3000% and leads 250%.
The story of one project's architecture evolution from zero to Lambda Architecture. Also includes information on how we scaled the cluster once the architecture was set up.
Contains nice performance charts after every architecture change.
Infrastructure Provisioning in the context of organization, by Katarína Valaliková
Nowadays companies and organizations migrate and operate their infrastructure on virtual infrastructures (Cloud/IaaS). To operate efficiently and adapt to everyday changes and requirements, they need to leverage automation that handles not only configuration but also orchestration, backup/recovery, reporting and monitoring. All of these processes relate to the organization and are used by the people in it.
Imagine a tool able to automate and simplify the whole process around IaaS: spinning up a project's entire infrastructure, setting it up, helping to operate it, assigning accounts and permissions, and deprovisioning when the project ends. In this presentation we will try to show a proposal for such a solution, using OpenStack for the private cloud infrastructure, with Chef and midPoint as their orchestrator. And we will try to cover a little bit more: think about user management and the connection between users/employees and the infrastructure...
Erich Ess, CTO of SimpleRelevance, introduces the Spark distributed computing platform and explains how to integrate it with Cassandra. He demonstrates running a distributed analytic computation on a dataset stored in Cassandra.
Lendingkart Meetup #4: Data pipeline @ lk, by Mukesh Singh
Building a Data Pipeline - Case studies
This document discusses building data pipelines at three companies: NoBroker, Treebo, and LendingKart. It describes the business needs for data and analytics that motivated building pipelines. Key aspects of data pipelines discussed include reliably moving, joining, and reformatting data between systems. Lessons from building pipelines include ensuring scalability, availability, and reliability as data volume grows.
Implicit parameters and implicit conversions are Scala language features that allow omitting explicit calls to methods or variables. Implicits enable concise and elegant code through features like dependency injection, context passing, and ad hoc polymorphism. Implicits resolve types at compile-time rather than runtime. While powerful, implicits can cause conflicts and slow compilation if overused. Frameworks like Scala collections, Spark, and Spray JSON extensively use implicits to provide type classes and conversions between Scala and Java types.
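A compact sketch of both mechanisms described above; the names are illustrative:

case class RequestContext(user: String)

// Implicit parameter: the compiler supplies the context at each call site
def audit(action: String)(implicit ctx: RequestContext): String =
  s"${ctx.user} performed $action"

// Implicit conversion via an implicit class: adds a method to Int
implicit class TimesOps(n: Int) {
  def times(body: => Unit): Unit = (1 to n).foreach(_ => body)
}

implicit val ctx = RequestContext("alice")
audit("login")         // compiler inserts ctx
3.times(println("hi")) // compiler wraps 3 in TimesOps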
The document summarizes Martin Odersky's talk at Scala Days 2016 about the road ahead for Scala. The key points are:
1. Scala is maturing with improvements to tools like IDEs and build tools in 2015, while 2016 sees increased activity with the Scala Center, Scala 2.12 release, and rethinking Scala libraries.
2. The Scala Center was formed to undertake projects benefiting the Scala community with support from various companies.
3. Scala 2.12 focuses on optimizing for Java 8 and includes many new features. Future releases will focus on improving Scala libraries and modularization.
4. The DOT calculus provides a formal
This presentation was first held at the OpenSQL Camp 2009, part of the FrOSCon conference in St. Augustin, Germany. It gives a nice overview of the project, the technology and how it will progress. Find more information at http://www.blackray.org
Sharing our best secrets: Design a distributed system from scratch, by Adelina Simion
The document summarizes a system design workshop for designing a note-taking application called TechyNotes. The workshop covers defining system requirements and interfaces, discussing database and storage options, designing initial and revised system architectures, and addressing scalability bottlenecks. Attendees learn a repeatable process for system design and discuss technologies like databases, load balancing, caching, and queues.
Jeen Broekstra is a principal engineer at metaphacts GmbH and the project lead for Eclipse RDF4J. RDF4J is a modular collection of Java libraries for working with RDF data, including reading, writing, storing, querying, and updating RDF databases or remote SPARQL endpoints. It provides a vendor-neutral API and tools like RDF4J Server, Workbench, and Console. RDF4J aims to support modern Java versions while focusing on SHACL, SPARQL 1.2, and experimental RDF* and SPARQL* features.
Senior Software Developer and Lead Trainer Alejandro Lujan explains pattern matching, a very powerful and elegant feature of Scala, using a series of examples.
Learn more about this topic and find more presentations on Scala at:
This document contains a portfolio and contact information for Paul Partlow. It includes summaries of over 30 of his illustration and graphic design projects from 2012 to 2014 for schools including Rhode Island School of Design and Suffolk County Community College. The projects cover a wide range of mediums from graphite, ink and digital to found objects. They include personal works as well as assignments covering topics like visual storytelling, conveying personality, and combining cityscapes.
In this video, senior software developer Alejandro Lujan explores the elements of the Scala language that allow you to write clean and powerful code more concisely.
This presentation explores the benefits of functional programming, especially with respect to reliability. It presents a sample of types that allow many program invariants to be enforced by compilers. We also discuss the industrial adoption of functional programming, and conclude with a live coding demo in Scala.
The curriculum document discusses social studies and focuses on the environment both before and after children wash a car. It examines the environment of the car wash location initially and then analyzes changes to the environment after the children complete the car washing activity. The document appears to use a hands-on car washing project to teach social studies lessons about environments and environmental changes.
Alejandro Lujan introduces us to String Interpolation, a feature of Scala that allows us to have placeholders inside of string definitions, and explains why you would want to use them. Video included!
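For reference, a tiny sketch of the feature:

val name = "Scala"
val pi = 3.14159
println(s"Hello, $name!")  // s-interpolator substitutes values into the string
println(f"pi is $pi%.2f")  // f-interpolator adds printf-style formatting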
Senior Software Developer Alejandro Lujan discusses the collections API in Scala, and provides some insight into what it can do, with some examples.
This presentation provides an overview on Value Classes in Scala, which is explained in the video on the last slide by Alejandro Lujan. He explains why you would want to use them, outlines the restrictions that are associated with them, and shows examples of how you would use them. Value classes are a mechanism that Scala provides to create a certain type of wrapper classes that provide memory and performance optimizations. In this video, we show a use case for Tiny Types with Value classes.
In his latest Typesafe tutorial video, Alejandro Lujan explains for expressions in Scala, and provides an example of them in action.
For expressions are a very useful construct that can simplify manipulation of collections and several other data structures. They can be used in place of nested for loops, or to replace calls to map and flatMap in non-collection structures.
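A small sketch of both uses:

// Replaces a nested loop; desugars to flatMap/withFilter/map
val pairs = for {
  x <- 1 to 3
  y <- 1 to 3
  if x < y
} yield (x, y)

// The same construct works on non-collection structures such as Option
val sum = for {
  a <- Some(1)
  b <- Some(2)
} yield a + b // Some(3)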
This document provides a summary of Korean cuisine and traditional Korean dishes. It discusses how Korean food often involves fermentation, which adds beneficial bacteria and nutrients. It also notes that the Korean diet may help prevent obesity due to its emphasis on vegetables, soups and stews. Some key Korean dishes summarized are bibimbap (mixed rice bowl), doenjang jjigae (soybean paste stew), yukgaejang (spicy beef soup), and maeuntang (hot spicy fish soup). The document promotes Korean food as a healthy and balanced way of eating.
This document provides information about the sun, earth, and moon through a presentation by Dr. Marjorie Anne Wallace. It asks and answers several questions about these celestial bodies, including what causes day and night, their relative sizes, what percentage of the atmosphere is various gases, tidal patterns, and other details. It explains concepts such as rotation, revolution, phases of the moon, and seasons in 3-5 sentences per topic.
As a full-time Scala developer, I often find myself talking about Scala and functional programming in different kinds of situations, ranging from meeting a friend working in J2EE, Ruby or C++, to dedicated Scala Meetups aiming to promote deeper understanding of the language. However, something occurred to me lately. By hanging out with people who have some Scala knowledge or experience, I am somewhat holding on to a safe place. By presenting only to people who are curious about Scala, I'm preaching to the converted.
To make a long story short, I recently made an attempt at getting out of my comfort zone by presenting about how making the transition from Java to Scala makes total sense (from Java developer point of view). The presentation went through proof-hearing of approximately 60 experienced Java programmers (with almost no prior Scala knowledge) gathered in one room for a Lunch & Learn. Here are my slides.
Punishment Driven Development #agileinthecity, by Louise Elliott
What is the first thing we do when a major issue occurs in a live system? Sort it out of course. Then we start the hunt for the person to blame so that they can suffer the appropriate punishment. What do we do if a person is being awkward in the team and won’t agree to our ways of doing things? Ostracise them of course, and see how long it is until they hand in their notice – problem solved.
This highly interactive talk delves into why humans have this tendency to blame and punish. It looks at real examples of punishment within the software world and the results which were achieved. These stories not only cover managers punishing team members but also punishment within teams and self-punishment. We are all guilty of some of the behaviours discussed.
This is aimed at everyone involved in software development. It covers:
• Why we tend to blame and punish others.
• The impact of self-blame.
• The unintended (but predictable) results from punishment.
• The alternatives to punishment, which get real results.
This document discusses immutability in Scala. It recommends immutability because it helps avoid unexpected values and concurrent-state issues, and most Scala container types are immutable by default. It provides examples of using vals for immutable variables, creating new instances instead of modifying existing ones, using the copy method on case classes, and hiding vars to prevent external modification of mutable state. It cautions readers to be mindful about modifying immutable nested structures, fields initialized on each instance, and closing over mutable state.
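A brief sketch of those practices; the class names are illustrative:

case class Account(owner: String, balance: BigDecimal)

val acct = Account("alice", BigDecimal(100))      // val: the reference cannot be reassigned
val paid = acct.copy(balance = acct.balance - 25) // copy: a new instance, no mutation

// Hiding a var so callers cannot touch the mutable state directly
class Counter {
  private var n = 0
  def increment(): Unit = n += 1
  def current: Int = n
}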
This document summarizes the development of Lore's machine learning and NLP platform using Python. It started as a monolithic Python server but evolved into a microservices architecture using Docker, Kubernetes, and Celery for parallelization. Key lessons included using DevOps tools like Docker for development and deployment, Celery to parallelize tasks, and wrapping services to improve modularity, flexibility, and performance. The platform now supports multiple products and consulting work in a scalable and maintainable way.
The world has changed, and one huge server won't do the job anymore; when you're talking about vast amounts of data that grow all the time, the ability to scale out is your savior. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
This lecture covers the basics of Apache Spark and distributed computing, and the development tools needed for a functional environment.
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka, by DataWorks Summit
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences.
To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals.
Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth.
We will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services' costs.
Topics include :
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* "Streaming" over Data Lake using Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka, by Itai Yaffe
Going into different streaming methods, we will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will also present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services’ costs.
Topics include :
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* “Streaming” over Data Lake using Kafka
Introduction to big data pipelining with Cassandra & Spark - West Mins..., by Simon Ambridge
This document provides an overview and outline of a 1-hour introduction to building a big data pipeline using Docker, Cassandra, Spark, Spark-Notebook and Akka. The introduction is presented as a half-day workshop at Devoxx November 2015. It uses a data pipeline environment from Data Fellas and demonstrates how to use scalable distributed technologies like Docker, Spark, Spark-Notebook and Cassandra to build a reactive, repeatable big data pipeline. The key takeaway is understanding how to construct such a pipeline.
What is Distributed Computing, Why we use Apache Spark, by Andy Petrella
In this talk we introduce the notion of distributed computing, then tackle Spark's advantages.
The Spark core content is very tiny because the whole explanation has been done live using a Spark Notebook (https://github.com/andypetrella/spark-notebook/blob/geek/conf/notebooks/Geek.snb).
This talk has been given together by @xtordoir and myself at the University of Liège, Belgium.
For instance, in zero-dimensional (0D) nanomaterials all the dimensions are measured within the nanoscale (no dimensions are larger than 100 nm); in two-dimensional (2D) nanomaterials, two dimensions are outside the nanoscale; and three-dimensional (3D) nanomaterials are materials that are not confined to the nanoscale in any dimension. This class can contain bulk powders, dispersions of nanoparticles, bundles of nanowires, and nanotubes as well as multi-nanolayers. Check our Frequently Asked Questions to get more details.
Data Science Salon: A Journey of Deploying a Data Science Engine to Production, by Formulatedby
Presented by Mostafa Madjipour, Senior Data Scientist at Time Inc.
Next DSS NYC Event 👉 https://datascience.salon/newyork/
Next DSS LA Event 👉 https://datascience.salon/la/
Reducing the gap between R&D and production is still a challenge for data science/ machine learning engineering groups in many companies. Typically, data scientists develop the data-driven models in a research-oriented programming environment (such as R and python). Next, the data/machine learning engineers rewrite the code (typically in another programming language) in a way that is easy to integrate with production services.
This process has some disadvantages: 1) it is time consuming; 2) it slows the data science team's impact on the business; 3) code rewriting is prone to errors.
A possible solution to overcome the aforementioned disadvantages would be to implement a deployment strategy that easily embeds/transforms the model created by data scientists. Packages such as jPMML, MLeap, PFA, and PMML among others are developed for this purpose.
In this talk we review some of the mentioned packages, motivated by a project at Time Inc. The project involves development of a near real-time recommender system, which includes a predictor engine, paired with a set of business rules.
Spark - The Ultimate Scala Collections by Martin Odersky (Spark Summit)
Spark is a domain-specific language for working with collections that is implemented in Scala and runs on a cluster. While similar to Scala collections, Spark differs in that it is lazy and supports additional functionality for paired data. Scala can learn from Spark by adding views to make laziness clearer, caching for persistence, and pairwise operations. Types are important for Spark as they prevent logic errors and help with programming complex functional operations across a cluster.
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka (Databricks)
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals. Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth. We will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty). We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services’ costs. Topics include:
Kafka and Spark Streaming for stateless and stateful use-cases
Spark Structured Streaming as a possible alternative
Combining Spark Streaming with batch ETLs
”Streaming” over Data Lake using Kafka
Experiences with Evangelizing Java Within the Database, by Marcelo Ochoa
The document discusses experiences with evangelizing the use of Java within Oracle databases. It provides a timeline of Java support in Oracle databases from 8i to 12c. It describes developing, testing, and deploying database-resident Java applications. Examples discussed include a content management system and RESTful web services implemented as stored procedures, as well as the Scotas OLS product for embedded Solr search. The conclusion covers challenges with open source projects, impedance mismatch between databases and Java, and lack of overlap between skillsets.
Develop realtime web with Scala and Xitrum, by Ngoc Dao
This document discusses a talk given by Ngoc Dao on developing realtime and distributed web applications with Scala and Xitrum. The talk covers:
1) An overview of Scala, including its functional features, object-oriented features, tools like SBT and REPL, and how to get started.
2) Using Scala for web development with the Xitrum framework, including routing, responding to requests, internationalization, and metrics.
3) Using Scala for concurrency with futures, actors, and Akka FSM.
4) Building realtime web applications with websockets, Socket.IO and SockJS.
5) Distributed systems with Akka remoting
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... (Databricks)
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized Spark environment.
In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
JPoint'15 Mom, I so wish Hibernate for my NoSQL database..., by Alexey Zinoviev
Alexey Zinoviev presented this paper at the JPoint'15 conference (javapoint.ru/talks/#zinoviev).
This paper covers the following topics: Java, JPA, Morphia, Hibernate OGM, Spring Data, Hector, Kundera, NoSQL, Mongo, Cassandra, HBase, Riak
This document summarizes a summer internship project to develop a framework for benchmarking machine learning libraries. The intern worked with Spark ML, XGBoost and scikit-learn, developing workflows to load data from S3, train models, collect accuracy and performance metrics, and log results to MySQL. Future work includes running the frameworks on larger datasets and clusters, adding hyperparameter tuning, and integrating additional libraries and test cases. The intern gained exposure to ML workflows and libraries while addressing issues like data formatting.
How to keep maintainability of long life Scala applications, by takezoe
Naoki Takezoe discusses maintaining long-term Scala applications. He outlines two main difficulties: programming style differences that impact understandability and upgrades that require coordinating framework, Scala, and Java version changes. Case studies show upgrades can be blocked until dependent libraries support new versions. Solutions include reducing dependencies, using popular libraries, custom libraries for core components, and considering Java alternatives. Regular maintenance and preparing for breaking changes are key to sustainable Scala applications.
A look under the hood at Apache Spark's API and engine evolutions, by Databricks
Spark has evolved its APIs and engine over the last 6 years to combine the best aspects of previous systems like databases, MapReduce, and data frames. Its latest structured APIs like DataFrames provide a declarative interface inspired by data frames in R/Python for ease of use, along with optimizations from databases for performance and future-proofing. This unified approach allows Spark to scale massively like MapReduce while retaining flexibility.
AWS Big Data Demystified #1: Big data architecture lessons learned, by Omid Vahdaty
A quick overview of the big data technologies that were selected or disregarded at our company.
The video: https://youtu.be/l5KmaZNQxaU
Don't forget to subscribe to the YouTube channel.
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup: https://www.meetup.com/AWS-Big-Data-Demystified/
The facebook group: https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
This document provides an overview of NoSQL databases. It discusses that NoSQL databases are non-relational and were created to overcome limitations of scaling relational databases. The document categorizes NoSQL databases into key-value stores, document databases, graph databases, XML databases, and distributed peer stores. It provides examples like MongoDB, Redis, CouchDB, and Cassandra. The document also explains concepts like CAP theorem, ACID properties, and reasons for using NoSQL databases like horizontal scaling, schema flexibility, and handling large amounts of data.
In this webinar, Michael Nash of BoldRadius explores the Typesafe Reactive Platform.
The Typesafe Reactive Platform is a suite of technologies and tools that support the creation of reactive applications, that is, applications that handle the kind of responsiveness requirements, data volume, and user load that was out of practical reach only a few years ago.
From analysis of the human genome to wearable technology to communications at a massive scale, BoldRadius has the premier team of experts with decades of collective experience in designing and building these types of applications, and in helping teams adopt these tools.
Patrick Premont of BoldRadius presented this talk at Scala By The Bay 2015.
Why do data structure lookups often return Options? Could we safely eliminate all the recovery code that we hope is never called? We will see how Scala’s type system lets us express referential integrity constraints to achieve unparalleled reliability. We apply the technique to in-memory data structures using the Total-Map library and consider how to extend the benefits to persisted data.
How You Convince Your Manager To Adopt Scala.js in Production, by BoldRadius Solutions
Dave Sugden and Katrin Shechtman of BoldRadius presented this talk at Scala By The Bay 2015.
The talk will present a fully functional sample application developed with Scala.js, scalatags, scalacss and other Scala and Typesafe technologies. We aim to show all the pros and cons of a Scala coast-to-coast approach to web-application development and encourage people not to shy away from asking difficult questions challenging this approach. Participants can expect to gain a clear view of the current state of Scala-based client-side technologies and take away an activator template with application code that could be used as a base for technical discussions with their peers and managers.
Domain Driven Design with Onion Architecture is a powerful combination of Architecture Patterns that can dramatically improve code quality and can help you learn a great deal about writing "clean" code.
Senior Software Developer and Trainer Alejandro Lujan explains sealed classes, why they are needed, and how to implement them in Scala. Read more on the BoldRadius blog: http://boldradius.com/blog-post/VBB3uzIAADYAiiSy/sealed-classes-in-scala
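As a quick illustration of the idea (not taken from the video), sealing a hierarchy lets the compiler check match exhaustiveness:

sealed trait Shape // all subtypes must live in this source file
case class Circle(radius: Double) extends Shape
case class Rectangle(w: Double, h: Double) extends Shape

def area(s: Shape): Double = s match {
  case Circle(r)       => math.Pi * r * r
  case Rectangle(w, h) => w * h // drop a case and scalac warns about the match
}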
BoldRadius' Senior Software Developer Alejandro Lujan explains how to use higher order functions in Scala and illustrates them with some examples.
See the accompanying video at www.boldradius.com/blog
Mike Kelland and the BoldRadius team lead an interactive discussion at Scala Days 2014 in Berlin on adopting the Typesafe Reactive Platform and creating change in your organization.
We explored approaches to solving the pain points that may arise, presenting tools, strategies and resources designed to help you adopt the Typesafe Reactive Platform today.
This document discusses modeling products and categories in Scala using case classes. It describes the attributes of products (name, category, optional description) and categories (name, optional parent category). It shows how to define regular classes in Java and Scala to represent these, but notes that Scala case classes provide additional benefits out of the box, such as a companion object with factory methods, automatic fields from class parameters, nicer toString output, and attribute-based equality. It cautions that case classes have bytecode overhead and generated code that could clutter interfaces.
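A sketch of that model as case classes, with the out-of-the-box benefits visible in a few lines; the sample values are illustrative:

case class Category(name: String, parent: Option[Category])
case class Product(name: String, category: Category, description: Option[String])

val books = Category("Books", parent = None)
val scalaBook = Product("Programming in Scala", books, description = None)

// Out of the box: a factory apply method, readable toString, attribute-based equality
assert(scalaBook == Product("Programming in Scala", books, None))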
2. Who?
● BoldRadius Solutions
○ boldradius.com
○ Typesafe Partner
○ Scala, Akka and Play specialists
○ Ottawa, Saskatoon, San Francisco, Boston, Chicago, Montreal, New York, Toronto
● Michael Nash, VP Capabilities
● Adam Murray, VP Business
4. What?
What was ScalaDays all about?
● Held in San Francisco March 16th thru 18th
○ About 800 attendees
○ Three categories (intermediate, beginner, advanced)
○ Four tracks, 55 presentations, three keynotes
● Followed by Two Intensive Training days
○ All the Typesafe courses offered
7. Keynotes
● Martin Odersky, Chief Architect & Co-Founder at Typesafe
○ Scala - where it came from, where it’s going
● Danese Cooper, Distinguished Member of Technical Staff - Open Source at PayPal
○ Open Languages
● Dianne Marsh, Director of Engineering Tools at Netflix
○ Technical Leadership from wherever you are
9. Highlights
Too many great sessions to summarize them all; we have to extract a few recurring themes…
○ Distributed Application Design and Development
○ Big Data/Fast Data
○ Types and Safe Scala
○ Performance and Scalability
10. Distributed Applications
● Life Beyond the Illusion of Present
● Reactive Reference Architectures
● Akka in Production: Why and How
● Easy Scalability with Akka
● Scalable task distribution with Scala, Akka and Mesos
● A Guided Tour of a Distributed Application
11. Performance and Scalability
● Scala Collections and Performance
● Shattering Hadoop’s Large-Scale Sort Record with Spark and Scala
● Type-safe off-heap memory for Scala
● Akka in Production: Why and How
● Easy Scalability with Akka
● The JVM Backend and Optimizer in Scala 2.12
12. Big Data, Fast Data
● Shattering Hadoop’s Large-Scale Sort Record with Spark and Scala
● Scala - The Real Spark of Data Science
● Apache Spark: A Large Community Project in Scala
● S3 at Scale: Async Scala Client with Play Iteratees and Composable Operations
● The Unreasonable Effectiveness of Scala for Big Data
● Scalable task distribution with Scala, Akka and Mesos
13. Types and Safer Scala
● Keynote: Scala - where it came from, where it’s going
● Towards a Safer Scala
● Leveraging Scala Macros for Better Validation
● Type-level programming in Scala 101
● Improving Correctness with Types
● Happy Paths: Functional Constructs in the Wild
● The Scalactic Way
● Delimited dependency-typed monadic checked exceptions
14. The Rest...
Many excellent talks were outside these categories
● Why Scala.js
● Reactive Slick for Database Programming
● Exercise in machine learning
● Functional Natural Language Processing
● If I Only Had a Brain...in Scala
● Akka HTTP: the Reactive Web Toolkit
● many many others
15. Highlights of Specific Sessions
A quick taster of what was in some of the more popular sessions...
17. Where it’s from, where it’s going
“Scala is a gateway drug to Haskell” (in actual fact it’s going well beyond Haskell.)
Slides: http://www.slideshare.net/Odersky/scala-days-san-francisco-45917092
Came from a practical combination of OOP and functional programming
- Funny story about hipster syntax (..) instead of [..], <..> instead of [..], ??? instead of <..>
- Trend in Type Systems
- Scala JS is no longer experimental
- TASTY: new scala-specific platform
- Introduction to DOT
- Type Parameters
- Better treatment of effects with implicit Functions instead of Monads
18. TASTY
Scala faces challenges:
● binary compatibility
● having to pick a platform: JDK (7, 8, 9, 10, ...?) or JavaScript.
Proposing a Scala-specific platform called TASTY (serialized typed abstract syntax tree) as an intermediate representation before bytecode. It carries type metadata, can be compiled with different versions of the JDK, and to JavaScript.
19. Tasty will enable:
● instrumentation
● optimization
● code analysis
● refactoring
● publish once run anywhere
● automated remapping to solve binary compatibility issues.
21. Explorations:
Hope to find something cooler than Monads to handle effects.
● Monads don’t commute
● Require Monad transformers for composition
● Monad transformers make Martin’s head explode
22. Toward a Safer Scala
Leif Wickland
http://tinyurl.com/sd15lint
● Scalac enables some error-prone code.
○ head of empty List?
● Using Static Analysis to detect errors early
● IDE based solutions
○ Inconsistencies
○ If not in release build process, doesn’t exist
● Web-based analysis
○ outside of compile loop
○ relatively immature analysis
25. Life Beyond the Illusion of Present
Jonas Bonér:
The idea of the present is an illusion. Everything we see, hear and feel is just an echo from the past. But this illusion has influenced us and the way we view the world in so many ways.
There is no present; all we have is facts derived from the merging of multiple pasts. The truth is closer to Einstein’s physics, where everything is relative to one’s perspective. As developers we need to wake up and break free from the perceived reality of living in a single globally consistent present.
26. The advent of multicore and cloud computing architectures meant that most applications today are distributed systems—multiple cores separated by the memory bus or multiple nodes separated by the network—which puts a harsh end to this illusion.
The only way to design truly scalable and performant systems that can construct a sufficiently consistent view of history—and thereby our local “present”—is by treating time as a first-class construct in our programming model and modeling the present as facts derived from the merging of multiple concurrent pasts.
27. How do we deal with failure and communication unreliability in real life?
Confirmation and repetition
We can’t force the world into a globally consistent present (CRUD).
Mentioned 2 paradigms/theories:
● CALM (consistency as logical monotonicity)
● CRDT (Commutative Replicated Data Type)
CRDTs (Commutative Replicated Data Types) are eventually consistent data types that:
● minimize contention / coordination in a distributed system.
● cover rich data types: sets, maps, graphs.
● use a monotonic merge function: all state change is monotonically increasing, no way back.
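To make the merge idea concrete, here is a minimal grow-only counter sketch (not from the talk; it assumes one counter slot per node):

final case class GCounter(counts: Map[String, Long]) {
  def increment(node: String): GCounter =
    copy(counts.updated(node, counts.getOrElse(node, 0L) + 1))

  def value: Long = counts.values.sum

  // Per-node maximum: commutative, associative, idempotent and monotonic,
  // so replicas converge without coordination
  def merge(other: GCounter): GCounter =
    GCounter((counts.keySet ++ other.counts.keySet).map { k =>
      k -> math.max(counts.getOrElse(k, 0L), other.counts.getOrElse(k, 0L))
    }.toMap)
}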
29. ● Using Wrapper Types (aka Tiny Types) instead of primitives
● Never use null, or throw Exceptions
● use === (org.scalactic.TypeCheckedTripleEquals)
○ requires the types of the two values compared to be in a subtype/supertype relationship
● Use non-empty lists when a list must be populated (org.scalactic.Every)
● Use Type Tags (ala Shapeless, Scalaz)
● Use Path Dependent Types
● Other reading
○ Self recursive types
○ Phantom Types
○ Shapeless
○ Scalactic
Types: Defensive Programming. Fail Fast. Design By Contract.
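A minimal sketch of the wrapper-types point, using value classes so the wrappers cost nothing at runtime (the names are illustrative):

final case class UserId(value: Long) extends AnyVal
final case class OrderId(value: Long) extends AnyVal

def cancelOrder(user: UserId, order: OrderId): Unit = ()

// cancelOrder(OrderId(1), UserId(2)) // does not compile: arguments swapped
cancelOrder(UserId(2), OrderId(1))    // the types document and enforce intent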
30. Function Passing Style
Heather Miller:
● A new programming model called function passing, designed to overcome many of the imperative / weakly-typed issues found in traditional “big data” processing systems.
● Provides a more principled substrate on which to build data-centric distributed systems.
● Pass safe, well-typed serializable functions to immutable distributed data
● Based on her work on Pickling and Spores
● Uses Spores (serializable functions) for a distributed model.
● Kind of an inverse of the Actor Model
● Stateless: data is stationary, functions are passed around.
● Uses data silos accessed through a SiloRef.
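A toy, purely local sketch of the idea (not the real spores/silos API): the data stays put, and functions travel to it:

final class Silo[T](private val data: T) {
  def map[S](f: T => S): Silo[S] = new Silo(f(data)) // derive a new stationary silo
  def send[S](f: T => S): S = f(data)                // materialize a result for the caller
}

val words = new Silo(Vector("scala", "days", "2015"))
val lengths = words.map(_.map(_.length)) // the function moves, the data does not
println(words.send(_.mkString(" ")))     // prints "scala days 2015"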
31. The Unreasonable Effectiveness of Scala for Big Data
Dean Wampler
● How Hadoop Works - Map Reduce
○ Problems
■ Hard to implement algorithms
■ The Hadoop API is horrible
● Scalding
○ An improved Hadoop API in Scala
○ Problems
■ Still uses a batch mode
● Spark
○ An elegant, functional API
○ Still in batch mode, but with mini-batches which approach real time.
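For contrast with the Hadoop API, the canonical Spark word count; the HDFS paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    sc.textFile("hdfs://input/docs")
      .flatMap(_.split("""\s+"""))      // split lines into words
      .map(word => (word, 1))           // pair each word with a count of 1
      .reduceByKey(_ + _)               // sum counts per word across the cluster
      .saveAsTextFile("hdfs://output/wordcounts")
    sc.stop()
  }
}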
32. Akka HTTP: The Reactive Web Toolkit
Roland Kuhn
● Replaces Spray
● Uses Akka Streams
○ Sources emit values to the stream
○ Sinks receive values, act on them
○ Sources can compose using Zip and Graph shapes
● “The pinball interpreter”
○ produce data
○ move downstream through transformations
○ get to the effect
○ go up and ask for more data
○ Filters interrupt the flow before getting to the effect, making the pinball go back upstream.
● A live coded demonstration of using Streams and Http
● Expected timeline for Streams - 4 weeks
● Expected timeline for Http - 8 weeks
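A minimal streams sketch in the spirit of the talk, assuming the Akka Streams 1.0-era API (names are illustrative):

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

object StreamsDemo extends App {
  implicit val system = ActorSystem("demo")
  implicit val materializer = ActorMaterializer()

  Source(1 to 10)           // the Source emits values downstream
    .map(_ * 2)             // a transformation stage
    .filter(_ % 3 != 0)     // a filter sends "the pinball" back upstream for more data
    .runWith(Sink.foreach(println)) // the Sink receives values and acts on them
}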
33. The Scalactic Way
Bill Venners
http://www.slideshare.net/bvenners/the-scalactic-way
ScalaTest: quality through tests
Scalactic: quality through types
SuperSafe: quality through static analysis
36. Reactive Slick for Database Programming
Stefan Zeiger
http://slick.typesafe.com/talks/scaladays2015sf/Reactive_Slick_for_Database_Programming.pdf
Slick 3.0
● JDBC is inherently blocking (and blocking ties up threads)
● Traditional Model
○ Fully synchronous
○ One thread per web request
○ Contention for Connections (getConnection blocks)
○ Database back-pressure creates more blocked threads
○ Doesn’t scale
37. The new Slick architecture makes use of a new data type to provide asynchronous database I/O
● based on State, IO and Free Monads.
● Returns a Future[R]
● Creates a separate ExecutionContext, avoiding blocking of current thread
● Works with akka-streams to create back pressure, so the DB only gives as much data as the client can process.
● For performance purposes it pre-fetches some data to keep the client busy while it waits for the next portion from the DB.
sealed trait DBIOAction[+R, +S <: NoStream, -E <: Effect] {
  // Transform the action's result once the database I/O completes
  def map[R2](f: R => R2)(implicit executor: ExecutionContext): DBIOAction[R2, NoStream, E]
  // Sequence a dependent action, accumulating the effect types
  def flatMap[R2, S2 <: NoStream, E2 <: Effect](f: R => DBIOAction[R2, S2, E2])(implicit executor: ExecutionContext): DBIOAction[R2, S2, E with E2]
  ...
}
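A hedged usage sketch against Slick 3.0: running an action returns a Future instead of blocking. The table and the config entry name are illustrative:

import scala.concurrent.Future
import slick.driver.H2Driver.api._ // Slick 3.0-era import path

class Users(tag: Tag) extends Table[(Int, String)](tag, "USERS") {
  def id = column[Int]("ID", O.PrimaryKey)
  def name = column[String]("NAME")
  def * = (id, name)
}
val users = TableQuery[Users]

val db = Database.forConfig("h2mem1") // assumes an "h2mem1" entry in application.conf
val names: Future[Seq[String]] = db.run(users.map(_.name).result) // non-blocking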
39. Easy Scalability with Akka
Michael Nash
● Reviewed Akka, CQRS, ES
● Introduced Distributed DDD
● Identical clustered system with DDDD and without
● Gatling performance tests on both
40. ConductR
● Application manager that empowers ops to deploy distributed systems
● Uses Akka, Play, Akka Streams, Akka Cluster, FSM, Akka Data Replication
● How can we run cluster-based apps ensuring the seed nodes are started first?
○ State replicated using Data Replication
● How can we consolidate logging?
○ Using Akka Streams
● How can we avoid batching?
○ Use Event Driven Architecture
● How can we monitor/test multiple nodes?
○ Use the visualizer built into ConductR
● How can we share state among the nodes?
○ Use Akka Data Replication