Slides from my talk at 2016 Apache: Big Data conference.
Resource managers like Apache YARN and Mesos have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. We present Apache REEF, a powerful yet simple framework that helps developers of big data systems to retain fine-grained control over the cloud resources and address common problems of fault-tolerance, task scheduling and coordination, caching, interprocess communication, and bulk-data transfers. We will guide the developers through a simple REEF application and discuss current state of Apache REEF project and its place in the Hadoop ecosystem.
Building large scale applications in yarn with apache twillHenry Saputra
This document summarizes a presentation about Apache Twill, which provides abstractions for building large-scale applications on Apache Hadoop YARN. It discusses why Twill was created to simplify developing on YARN, Twill's architecture and components, key features like real-time logging and elastic scaling, real-world uses at CDAP, and the Twill roadmap.
Harnessing the power of YARN with Apache TwillTerence Yim
This document discusses Apache Twill, which aims to simplify developing distributed applications on YARN. Twill provides a Java thread-like programming model for YARN applications, avoiding the complexity of directly using YARN APIs. Key features of Twill include real-time logging, resource reporting, state recovery, elastic scaling, command messaging between tasks, service discovery, and support for executing bundled JAR applications on YARN. Twill handles communication with YARN and the Application Master while providing an easy-to-use API for application developers.
Dependency Injection in Apache Spark ApplicationsDatabricks
Dependency Injection is a programming paradigm that allows for cleaner, reusable, and more easily extensible code. Though Dependency injection has existed for a while now, its use for wiring dependencies in Apache Spark applications is relatively new. In this talk, we present our adventures writing testable Spark applications with dependency injection and explain why it is different than wiring dependencies for web applications due to Spark’s unique programming model.
The document discusses Microsoft Research's ORECHEM project, which aims to integrate chemistry scholarship with web architectures, grid computing, and the semantic web. It involves developing infrastructure to enable new models for research and dissemination of scholarly materials in chemistry. Key aspects include using OAI-ORE standards to describe aggregations of web resources related to crystallography experiments. The objective is to build a pipeline that extracts 3D coordinate data from feeds, performs computations on resources like TeraGrid, and stores resulting RDF triples in a triplestore. RESTful web services are implemented to access different steps in the workflow.
The big data platforms of many organisations are underpinned by a technology that is soon to celebrate its 45th birthday: SQL. This industry stalwart is applied in a multitude of critical points in business data flows; the results that these processes generate may significantly influence business and financial decision making. However, the SQL ecosystem has been overlooked and ignored by more recent innovations in the field of software engineering best practices such as fine grained automated testing and code quality metrics. This exposes organisations to poor application maintainability, high bug rates, and ultimately corporate risk.
We present the work we’ve been doing at Hotels.com to address these issues by bringing some advanced software engineering practices and open source tools to the realm of Apache Hive SQL. We first define the relevance of such approaches and demonstrate how automated testing can be applied to Hive SQL using HiveRunner, a JUnit based testing framework. We next consider how best to structure Hive queries to yield meaningful test scenarios that are maintainable and performant. Finally, we demonstrate how test coverage reports can highlight areas of risk in SQL codebases and weaknesses in the testing process. We do this using Mutant Swarm, an open source mutation testing tool for SQL languages developed by Hotels.com that can deliver insights similar to those produced by Java focused tools such as Jacoco and PIT.
The Typesafe Stack includes Scala, Akka, and Play frameworks for building scalable applications. Scala is a statically typed, functional programming language that runs on the JVM and interoperates with Java. Akka is an event-driven middleware for building distributed, fault-tolerant applications using actors. Play is a web framework that enables fast development of RESTful APIs and web applications using Scala templates. Together, these frameworks provide tools for building scalable, distributed systems with easy development.
In this talk, we present a comprehensive framework for assessing the correctness, stability, and performance of the Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1200 contributors from all over the world. At this scale and pace of development, mistakes bound to happen. To automatically identify correctness issues and performance regressions, we have build a testing pipeline that consists of two complementary stages: randomized testing and benchmarking.
Randomized query testing aims at extending the coverage of the typical unit testing suites, while we use micro and application-like benchmarks to measure new features and make sure existing ones do not regress. We will discuss various approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we have found through random query generation and critical performance regressions we were able to diagnose within hours due to our automated benchmarking tools.
This document provides an overview of Apache Spark and machine learning using Spark. It introduces the speaker and objectives. It then covers Spark concepts including its architecture, RDDs, transformations and actions. It demonstrates working with RDDs and DataFrames. Finally, it discusses machine learning libraries available in Spark like MLib and how Spark can be used for supervised machine learning tasks.
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
This document summarizes key aspects of running Spark Streaming applications in production, including fault tolerance, performance, and monitoring. It discusses how Spark Streaming receives data streams in batches and processes them across executors. It describes how driver and executor failures can be handled through checkpointing saved DAG information and write ahead logs that replicate received data blocks. Restarting the driver from checkpoints allows recovering the application state.
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsDatabricks
This talk will cover some practical aspects of Apache Spark monitoring, focusing on measuring Apache Spark running on cloud environments, and aiming to empower Apache Spark users with data-driven performance troubleshooting. Apache Spark metrics allow extracting important information on Apache Spark’s internal execution. In addition, Apache Spark 3 has introduced an improved plugin interface extending the metrics collection to third-party APIs. This is particularly useful when running Apache Spark on cloud environments as it allows measuring OS and container metrics like CPU usage, I/O, memory usage, network throughput, and also measuring metrics related to cloud filesystems access. Participants will learn how to make use of this type of instrumentation to build and run an Apache Spark performance dashboard, which complements the existing Spark WebUI for advanced monitoring and performance troubleshooting.
"Structured Streaming was a new streaming API introduced to Spark over 2 years ago in Spark 2.0, and was announced GA as of Spark 2.2. Databricks customers have processed over a hundred trillion rows in production using Structured Streaming. We received dozens of questions on how to best develop, monitor, test, deploy and upgrade these jobs. In this talk, we aim to share best practices around what has worked and what hasn't across our customer base.
We will tackle questions around how to plan ahead, what kind of code changes are safe for structured streaming jobs, how to architect streaming pipelines which can give you the most flexibility without sacrificing performance by using tools like Databricks Delta, how to best monitor your streaming jobs and alert if your streams are falling behind or are actually failing, as well as how to best test your code."
Microservices, Containers, and Machine LearningPaco Nathan
Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuSlim Baltagi
In this hands-on Apache Flink presentation, you will learn in a step-by-step tutorial style about:
• How to setup and configure your Apache Flink environment: Local/VM image (on a single machine), cluster (standalone), YARN, cloud (Google Compute Engine, Amazon EMR, ... )?
• How to get familiar with Flink tools (Command-Line Interface, Web Client, JobManager Web Interface, Interactive Scala Shell, Zeppelin notebook)?
• How to run some Apache Flink example programs?
• How to get familiar with Flink's APIs and libraries?
• How to write your Apache Flink code in the IDE (IntelliJ IDEA or Eclipse)?
• How to test and debug your Apache Flink code?
• How to deploy your Apache Flink code in local, in a cluster or in the cloud?
• How to tune your Apache Flink application (CPU, Memory, I/O)?
ROCm and Distributed Deep Learning on Spark and TensorFlowDatabricks
ROCm, the Radeon Open Ecosystem, is an open-source software foundation for GPU computing on Linux. ROCm supports TensorFlow and PyTorch using MIOpen, a library of highly optimized GPU routines for deep learning. In this talk, we describe how Apache Spark is a key enabling platform for distributed deep learning on ROCm, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end machine learning pipeline. We will analyse the different frameworks for integrating Spark with Tensorflow on ROCm, from Horovod to HopsML to Databrick's Project Hydrogen. We will also examine the surprising places where bottlenecks can surface when training models (everything from object stores to the Data Scientists themselves), and we will investigate ways to get around these bottlenecks. The talk will include a live demonstration of training and inference for a Tensorflow application embedded in a Spark pipeline written in a Jupyter notebook on Hopsworks with ROCm.
The document discusses improvements made to Apache Flink by Alibaba, called Blink. Blink provides a unified SQL layer for both batch and streaming processes. It supports features like UDF/UDTF/UDAGG, stream-stream joins, windowing, and retraction. Blink also improves Flink's runtime to be more reliable and production-quality when running on large YARN clusters. It has a new architecture using a JobMaster and TaskExecutors. Checkpointing and state management were optimized for incremental backups. Blink has been running in production supporting many of Alibaba's critical systems and processing massive amounts of data.
Build a deep learning pipeline on apache spark for ads optimizationCraig Chao
This document discusses building deep learning pipelines on Apache Spark for ad optimization. It begins by discussing how data has become a new form of colonialism. It then explains why deep learning should be done on Apache Spark rather than just TensorFlow. The remainder of the document discusses machine learning pipelines on Apache Spark, how machine learning and deep learning can be used for ad optimization, and various approaches to deep learning on Apache Spark using tools like MMLSpark, Databricks, DL4J, BigDL, and SystemML.
The document summarizes a tutorial presentation about the Open Grid Computing Environments (OGCE) software tools for building science gateways. The OGCE tools include a gadget container, workflow composer called XBaya, and application factory service called GFac. The presentation demonstrates how these tools can be used to build portals and compose workflows to access resources like the TeraGrid.
Building Continuous Application with Structured Streaming and Real-Time Data ...Databricks
This document summarizes a presentation about building a structured streaming connector for continuous applications using Azure Event Hubs as the streaming data source. It discusses key design considerations like representing offsets, implementing the getOffset and getBatch methods required by structured streaming sources, and challenges with testing asynchronous behavior. It also outlines issues contributed back to the Apache Spark community around streaming checkpoints and recovery.
Dataservices - Processing Big Data The Microservice WayJosef Adersberger
We see a big data processing pattern emerging using the Microservice approach to build an integrated, flexible, and distributed system of data processing tasks. We call this the Dataservice pattern. In this presentation we'll introduce into Dataservices: their basic concepts, the technology typically in use (like Kubernetes, Kafka, Cassandra and Spring) and some architectures from real-life.
Apache Eagle: Architecture Evolvement and New FeaturesHao Chen
Apache Eagle is a distributed real-time monitoring framework for Hadoop clusters. It analyzes data activities, applications, metrics, and logs to identify security breaches, performance issues, and provide insights. The architecture uses streaming ingestion of data, processing, alerting through complex event processing and machine learning, and storage and dashboards for insights. New features include support for additional data sources, applications for specific use cases, and evolving the architecture for improved scalability and flexibility.
The document provides an overview of OGCE (Open Grid Computing Environment), which develops and packages reusable software components for science portals. Key components described include services, gadgets, tags, and how they fit together. Installation and usage of the various OGCE components is discussed at a high level.
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraJoe Stein
Slides for our solution we developed for using Mesos, Docker, Kafka, Spark, Cassandra and Solr (DataStax Enterprise Edition) all developed in Go for doing realtime log analysis at scale. Many organizations either need or want log analysis in real time where you can see within a second what is happening within your entire infrastructure. Today, with the hardware available and software systems we have in place, you can develop, build and use as a service these solutions.
Peter Doschkinow, langjähriger Java-Experte und Mitarbeiter bei Oracle, gab in seiner Präsentation einen Überblick über die interessantesten und spannendsten Neuerungen in der neusten Java Standard- und Enterprise Edition.
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]Accumulo Summit
Talk Abstract
A common tradeoff made by fault-tolerant, distributed systems is the ease of user interaction with the system. Implementing correct distributed operations in the face of failures often takes priority over reducing the level of effort required to use the system. Because of this, applying a problem in a specific domain to the system can require significant planning and effort by the user. Apache Accumulo, and its sorted, Key-Value data model, is subject to this same problem: it is often difficult to use Accumulo to quickly ascertain real-life answers about some concrete problem.
This problem, not unique to Accumulo itself, has spurred the growth of numerous projects to fill these kinds of gaps in usability, in addition to multiple language bindings provided by applications. Outside of the Java API, Accumulo client support varies from programming languages, like Python or Ruby, to standalone projects that provide their own query language, such as Apache Pig and Apache Hive. This talk will cover the state of client support outside of Accumulo’s Java API with an emphasis on the pros, cons, and best practices of each alternative.
Speaker
Josh Elser
Member of Technical Staff, Hortonworks
Josh is a member of the engineering staff at Hortonworks. He is strong advocate for open source software and is an Apache Accumulo committer and PMC member. He is also a committer and PMC member of Apache Slider (incubating) and regularly contributes to other Apache projects in the Apache Hadoop ecosystem. He holds a Bachelor's degree in Computer Science from Rensselaer Polytechnic Institute.
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]Accumulo Summit
Talk Abstract
The upcoming Hortonworks Data Platform (HDP) 2.3 includes significant additions to Accumulo, within the project itself and in its interactions with the larger Hadoop ecosystem.This session will cover high-level changes that improve usability, management and security of Accumulo. Administrators of Accumulo now have the ability to deploy, manage and dynamically configure Accumulo clusters using Apache Ambari.As a part of Ambari integration, the metrics system in Accumulo has been updated to use the standard “Hadoop Metrics2” metrics subsystem which provides native Ganglia and Graphite support as well as supporting the new Ambari Metrics System. On the security front,Accumulo was also improved to support client authentication via Kerberos, while earlier versions of Accumulo only supported Kerberos authentication for server processes.With these changes,Accumulo clients can authenticate solely using their Kerberos identity across the entire Hadoop cluster without the need to manage passwords.
Speakers
Billie Rinaldi Senior Member of Technical Staff, Hortonworks
Billie Rinaldi is a Senior Member of Technical Staff at Hortonworks, Inc., currently prototyping new features related to application monitoring and deployment in the Apache Hadoop ecosystem. Prior to August 2012, Billie engaged in big data science and research at the National Security Agency. Since 2008, she has been providing technical leadership regarding the software that is now Apache Accumulo. Billie is the VP of Apache Accumulo, the Accumulo Project Management Committee Chair, and a member of the Apache Software Foundation. She holds a Ph.D. in applied mathematics from Rensselaer Polytechnic Institute.
Josh Elser Member of Technical Staff, Hortonworks
Josh is a member of the engineering staff at Hortonworks. He is strong advocate for open source software and is an Apache Accumulo committer and PMC member. He is also a committer and PMC member of Apache Slider (incubating) and regularly contributes to other Apache projects in the Apache Hadoop ecosystem. He holds a Bachelor's degree in Computer Science from Rensselaer Polytechnic Institute.
This document provides an overview of big data concepts including HDFS, MapReduce, HBase, Pig, Hive and Hadoop clusters. It describes how HBase is better suited than a relational database for large-scale analytics due to its columnar data structure. It also summarizes how MapReduce programming works, with the map phase organizing data and the reduce phase aggregating it. Finally, it outlines limitations of the original Hadoop 1.0 and how YARN improved cluster resource management and scheduling in Hadoop 2.0.
This document provides an overview of Apache Spark and machine learning using Spark. It introduces the speaker and objectives. It then covers Spark concepts including its architecture, RDDs, transformations and actions. It demonstrates working with RDDs and DataFrames. Finally, it discusses machine learning libraries available in Spark like MLib and how Spark can be used for supervised machine learning tasks.
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
This document summarizes key aspects of running Spark Streaming applications in production, including fault tolerance, performance, and monitoring. It discusses how Spark Streaming receives data streams in batches and processes them across executors. It describes how driver and executor failures can be handled through checkpointing saved DAG information and write ahead logs that replicate received data blocks. Restarting the driver from checkpoints allows recovering the application state.
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsDatabricks
This talk will cover some practical aspects of Apache Spark monitoring, focusing on measuring Apache Spark running on cloud environments, and aiming to empower Apache Spark users with data-driven performance troubleshooting. Apache Spark metrics allow extracting important information on Apache Spark’s internal execution. In addition, Apache Spark 3 has introduced an improved plugin interface extending the metrics collection to third-party APIs. This is particularly useful when running Apache Spark on cloud environments as it allows measuring OS and container metrics like CPU usage, I/O, memory usage, network throughput, and also measuring metrics related to cloud filesystems access. Participants will learn how to make use of this type of instrumentation to build and run an Apache Spark performance dashboard, which complements the existing Spark WebUI for advanced monitoring and performance troubleshooting.
"Structured Streaming was a new streaming API introduced to Spark over 2 years ago in Spark 2.0, and was announced GA as of Spark 2.2. Databricks customers have processed over a hundred trillion rows in production using Structured Streaming. We received dozens of questions on how to best develop, monitor, test, deploy and upgrade these jobs. In this talk, we aim to share best practices around what has worked and what hasn't across our customer base.
We will tackle questions around how to plan ahead, what kind of code changes are safe for structured streaming jobs, how to architect streaming pipelines which can give you the most flexibility without sacrificing performance by using tools like Databricks Delta, how to best monitor your streaming jobs and alert if your streams are falling behind or are actually failing, as well as how to best test your code."
Microservices, Containers, and Machine LearningPaco Nathan
Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuSlim Baltagi
In this hands-on Apache Flink presentation, you will learn in a step-by-step tutorial style about:
• How to setup and configure your Apache Flink environment: Local/VM image (on a single machine), cluster (standalone), YARN, cloud (Google Compute Engine, Amazon EMR, ... )?
• How to get familiar with Flink tools (Command-Line Interface, Web Client, JobManager Web Interface, Interactive Scala Shell, Zeppelin notebook)?
• How to run some Apache Flink example programs?
• How to get familiar with Flink's APIs and libraries?
• How to write your Apache Flink code in the IDE (IntelliJ IDEA or Eclipse)?
• How to test and debug your Apache Flink code?
• How to deploy your Apache Flink code in local, in a cluster or in the cloud?
• How to tune your Apache Flink application (CPU, Memory, I/O)?
ROCm and Distributed Deep Learning on Spark and TensorFlowDatabricks
ROCm, the Radeon Open Ecosystem, is an open-source software foundation for GPU computing on Linux. ROCm supports TensorFlow and PyTorch using MIOpen, a library of highly optimized GPU routines for deep learning. In this talk, we describe how Apache Spark is a key enabling platform for distributed deep learning on ROCm, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end machine learning pipeline. We will analyse the different frameworks for integrating Spark with Tensorflow on ROCm, from Horovod to HopsML to Databrick's Project Hydrogen. We will also examine the surprising places where bottlenecks can surface when training models (everything from object stores to the Data Scientists themselves), and we will investigate ways to get around these bottlenecks. The talk will include a live demonstration of training and inference for a Tensorflow application embedded in a Spark pipeline written in a Jupyter notebook on Hopsworks with ROCm.
The document discusses improvements made to Apache Flink by Alibaba, called Blink. Blink provides a unified SQL layer for both batch and streaming processes. It supports features like UDF/UDTF/UDAGG, stream-stream joins, windowing, and retraction. Blink also improves Flink's runtime to be more reliable and production-quality when running on large YARN clusters. It has a new architecture using a JobMaster and TaskExecutors. Checkpointing and state management were optimized for incremental backups. Blink has been running in production supporting many of Alibaba's critical systems and processing massive amounts of data.
Build a deep learning pipeline on apache spark for ads optimizationCraig Chao
This document discusses building deep learning pipelines on Apache Spark for ad optimization. It begins by discussing how data has become a new form of colonialism. It then explains why deep learning should be done on Apache Spark rather than just TensorFlow. The remainder of the document discusses machine learning pipelines on Apache Spark, how machine learning and deep learning can be used for ad optimization, and various approaches to deep learning on Apache Spark using tools like MMLSpark, Databricks, DL4J, BigDL, and SystemML.
The document summarizes a tutorial presentation about the Open Grid Computing Environments (OGCE) software tools for building science gateways. The OGCE tools include a gadget container, workflow composer called XBaya, and application factory service called GFac. The presentation demonstrates how these tools can be used to build portals and compose workflows to access resources like the TeraGrid.
Building Continuous Application with Structured Streaming and Real-Time Data ...Databricks
This document summarizes a presentation about building a structured streaming connector for continuous applications using Azure Event Hubs as the streaming data source. It discusses key design considerations like representing offsets, implementing the getOffset and getBatch methods required by structured streaming sources, and challenges with testing asynchronous behavior. It also outlines issues contributed back to the Apache Spark community around streaming checkpoints and recovery.
Dataservices - Processing Big Data The Microservice WayJosef Adersberger
We see a big data processing pattern emerging using the Microservice approach to build an integrated, flexible, and distributed system of data processing tasks. We call this the Dataservice pattern. In this presentation we'll introduce into Dataservices: their basic concepts, the technology typically in use (like Kubernetes, Kafka, Cassandra and Spring) and some architectures from real-life.
Apache Eagle: Architecture Evolvement and New FeaturesHao Chen
Apache Eagle is a distributed real-time monitoring framework for Hadoop clusters. It analyzes data activities, applications, metrics, and logs to identify security breaches, performance issues, and provide insights. The architecture uses streaming ingestion of data, processing, alerting through complex event processing and machine learning, and storage and dashboards for insights. New features include support for additional data sources, applications for specific use cases, and evolving the architecture for improved scalability and flexibility.
The document provides an overview of OGCE (Open Grid Computing Environment), which develops and packages reusable software components for science portals. Key components described include services, gadgets, tags, and how they fit together. Installation and usage of the various OGCE components is discussed at a high level.
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraJoe Stein
Slides for our solution we developed for using Mesos, Docker, Kafka, Spark, Cassandra and Solr (DataStax Enterprise Edition) all developed in Go for doing realtime log analysis at scale. Many organizations either need or want log analysis in real time where you can see within a second what is happening within your entire infrastructure. Today, with the hardware available and software systems we have in place, you can develop, build and use as a service these solutions.
Peter Doschkinow, langjähriger Java-Experte und Mitarbeiter bei Oracle, gab in seiner Präsentation einen Überblick über die interessantesten und spannendsten Neuerungen in der neusten Java Standard- und Enterprise Edition.
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]Accumulo Summit
Talk Abstract
A common tradeoff made by fault-tolerant, distributed systems is the ease of user interaction with the system. Implementing correct distributed operations in the face of failures often takes priority over reducing the level of effort required to use the system. Because of this, applying a problem in a specific domain to the system can require significant planning and effort by the user. Apache Accumulo, and its sorted, Key-Value data model, is subject to this same problem: it is often difficult to use Accumulo to quickly ascertain real-life answers about some concrete problem.
This problem, not unique to Accumulo itself, has spurred the growth of numerous projects to fill these kinds of gaps in usability, in addition to multiple language bindings provided by applications. Outside of the Java API, Accumulo client support varies from programming languages, like Python or Ruby, to standalone projects that provide their own query language, such as Apache Pig and Apache Hive. This talk will cover the state of client support outside of Accumulo’s Java API with an emphasis on the pros, cons, and best practices of each alternative.
Speaker
Josh Elser
Member of Technical Staff, Hortonworks
Josh is a member of the engineering staff at Hortonworks. He is strong advocate for open source software and is an Apache Accumulo committer and PMC member. He is also a committer and PMC member of Apache Slider (incubating) and regularly contributes to other Apache projects in the Apache Hadoop ecosystem. He holds a Bachelor's degree in Computer Science from Rensselaer Polytechnic Institute.
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]Accumulo Summit
Talk Abstract
The upcoming Hortonworks Data Platform (HDP) 2.3 includes significant additions to Accumulo, within the project itself and in its interactions with the larger Hadoop ecosystem.This session will cover high-level changes that improve usability, management and security of Accumulo. Administrators of Accumulo now have the ability to deploy, manage and dynamically configure Accumulo clusters using Apache Ambari.As a part of Ambari integration, the metrics system in Accumulo has been updated to use the standard “Hadoop Metrics2” metrics subsystem which provides native Ganglia and Graphite support as well as supporting the new Ambari Metrics System. On the security front,Accumulo was also improved to support client authentication via Kerberos, while earlier versions of Accumulo only supported Kerberos authentication for server processes.With these changes,Accumulo clients can authenticate solely using their Kerberos identity across the entire Hadoop cluster without the need to manage passwords.
Speakers
Billie Rinaldi Senior Member of Technical Staff, Hortonworks
Billie Rinaldi is a Senior Member of Technical Staff at Hortonworks, Inc., currently prototyping new features related to application monitoring and deployment in the Apache Hadoop ecosystem. Prior to August 2012, Billie engaged in big data science and research at the National Security Agency. Since 2008, she has been providing technical leadership regarding the software that is now Apache Accumulo. Billie is the VP of Apache Accumulo, the Accumulo Project Management Committee Chair, and a member of the Apache Software Foundation. She holds a Ph.D. in applied mathematics from Rensselaer Polytechnic Institute.
Josh Elser Member of Technical Staff, Hortonworks
Josh is a member of the engineering staff at Hortonworks. He is strong advocate for open source software and is an Apache Accumulo committer and PMC member. He is also a committer and PMC member of Apache Slider (incubating) and regularly contributes to other Apache projects in the Apache Hadoop ecosystem. He holds a Bachelor's degree in Computer Science from Rensselaer Polytechnic Institute.
This document provides an overview of big data concepts including HDFS, MapReduce, HBase, Pig, Hive and Hadoop clusters. It describes how HBase is better suited than a relational database for large-scale analytics due to its columnar data structure. It also summarizes how MapReduce programming works, with the map phase organizing data and the reduce phase aggregating it. Finally, it outlines limitations of the original Hadoop 1.0 and how YARN improved cluster resource management and scheduling in Hadoop 2.0.
Modern application development with oracle cloud sangam17Vinay Kumar
How Oracle cloud helps in building modern application development. This explains Oracle Application container cloud with developer cloud service and etc. Spring boot application deployed in Oracle ACCS and CI/CD part done in Oracle Developer cloud service.
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...DataStax Academy
This document discusses real-time log analysis using Mesos, Docker, Kafka, Spark, Cassandra and Solr at scale. It provides an overview of the architecture, describing how data from various sources like syslog can be ingested into Kafka via Docker producers. It then discusses consuming from Kafka to write to Cassandra in real-time and running Spark jobs on Cassandra data. The document uses these open source tools together in a reference architecture to enable real-time analytics and search capabilities on streaming data.
"Data Provenance: Principles and Why it matters for BioMedical Applications"Pinar Alper
Tutorial given at Informatics for HEalth 2017 COnference These slides are for the second part of the tutorial describing provenance capture and management tools.
The aim of the EU FP 7 Large-Scale Integrating Project LarKC is to develop the Large Knowledge Collider (LarKC, for short, pronounced “lark”), a platform for massive distributed incomplete reasoning that will remove the scalability barriers of currently existing reasoning systems for the Semantic Web. The LarKC platform is available at larkc.sourceforge.net. This talk, is part of a tutorial for early users of the LarKC platform, and introduces the platform and the project in general.
Amit Kumar is a technical professional with 3+ years of experience in Spark, Scala, Java, Hadoop and AWS. He has experience developing data ingestion frameworks using these technologies. His current project involves ingesting data from multiple sources into AWS S3 and creating a golden record for each customer. He is responsible for data quality checks, creating jobs to ingest and process the data, and automating the workflow using AWS Lambda and EMR. Previously he has worked on projects involving data migration from Teradata to Hadoop, converting graphs to XML/Java code to replicate workflows, and developing software for aircraft cabin systems.
The Taverna Suite provides tools for interactive and batch workflow execution. It includes a workbench for graphical workflow construction, various client interfaces, and servers for multi-user workflow execution. The suite utilizes a plug-in framework and supports a variety of domains, infrastructures, and tools through custom plug-ins.
This document provides an overview of Kubernetes and containerization concepts including Docker containers, container orchestration with Kubernetes, deploying and managing applications on Kubernetes, and using Helm to package and deploy applications to Kubernetes. Key terms like pods, deployments, services, configmaps and secrets are defined. Popular container registries, orchestrators and cloud offerings are also mentioned.
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...confluent
Application teams in JPMC have started shifting towards building event driven architectures and real time steaming pipelines and Kafka has been at core in this journey. As application teams have started adopting Kafka rapidly, need for a centrally managed Kafka as a service has emerged. We have started delivering Kafka as a service in early 2018 and running in production for more than an year now operating 80+ clusters (and growing) in all environments together. One of the key requirements is to provide truly segregated, secured multi-tenant environment with RBAC model while satisfying financial regulations and controls at the same time. Operating clusters at large scale requires scalable self-service capabilities and cluster management orchestration. In this talk we will present - Our experiences in delivering and operating secured, multi-tenant and resilient Kafka clusters at scale. - Internals of our service framework/control plane which enables self-service capabilities for application teams, cluster build/patch orchestration and capacity management capabilities for TSE/admin teams. - Our approach in enabling automated Cross Datacenter failover for application teams using service framework and confluent replicator.
Revolutionary container based hybrid cloud solution for MLPlatform
Ness' data science platform, NextGenML, puts the entire machine learning process: modelling, execution and deployment in the hands of data science teams.
The entire paradigm approaches collaboration around AI/ML, being implemented with full respect for best practices and commitment to innovation.
Kubernetes (onPrem) + Docker, Azure Kubernetes Cluster (AKS), Nexus, Azure Container Registry(ACR), GlusterFS
Workflow
Argo->Kubeflow
DevOps
Helm, kSonnet, Kustomize,Azure DevOps
Code Management & CI/CD
Git, TeamCity, SonarQube, Jenkins
Security
MS Active Directory, Azure VPN, Dex (K8s) integrated with GitLab
Machine Learning
TensorFlow (model training, boarding, serving), Keras, Seldon
Storage (Azure)
Storage Gen1 & Gen2, Data Lake, File Storage
ETL (Azure)
Databricks, Spark on K8, Data Factory (ADF), HDInsight (Kafka and Spark), Service Bus (ASB)
Lambda functions & VMs, Cache for Redis
Monitoring and Logging
Graphana, Prometeus, GrayLog
Web Scale Reasoning and the LarKC ProjectSaltlux Inc.
The LarKC project aims to build an integrated pluggable platform for large-scale reasoning. It supports parallelization, distribution, and remote execution. The LarKC platform provides a lightweight core that gives standardized interfaces for combining plug-in components, while the real work is done in the plug-ins. There are three types of LarKC users: those building plug-ins, configuring workflows, and using workflows.
TDC Connections 2023 - A High-Speed Data Ingestion Service in Java Using MQTT...Juarez Junior
The document discusses a Java-based high-speed data ingestion service that can ingest data using several protocols including MQTT, AMQP, and STOMP. It introduces Reactive Streams Ingestion (RSI), a Java library that allows streaming and reactive ingestion of data into an Oracle database. The document also discusses using ActiveMQ and JMS messaging to consume messages and presents a sample project structure and architecture for a data ingestion application.
OpenStack Identity - Keystone (liberty) by Lorenzo Carnevale and Silvio TavillaLorenzo Carnevale
OpenStack Identity Service (Keystone) seminar.
Distributed Systems course at Engineering and Computer Science (ECS), University of Messina.
By Lorenzo Carnevale and Silvio Tavilla.
Seminar’s topics
❖ OpenStack Identity - Keystone (liberty)
❖ Installation and first configuration of Keystone
❖ Identity service configuration
➢ Identity API protection with RBAC
➢ Use Trusts
➢ Certificates for PKI
❖ Hierarchical Projects
❖ Identity API v3 client example
Innovate2014 Better Integrations Through Open InterfacesSteve Speicher
- The document discusses open interfaces and integrated lifecycle tools through linked data and open standards like OSLC, taking inspiration from principles of the World Wide Web.
- It promotes using open protocols like REST and HTTP for tool integration instead of tight coupling, and outlines guidelines for using URIs, HTTP, and semantic standards like RDF and SPARQL to represent and share resource data on the web.
- OSLC is presented as a solution for lifecycle integration across requirements management, quality management, change management and other tools using common resource definitions and linked data over open APIs.
The document discusses various topics related to developing successful cloud services, including microservices architecture, Platform as a Service (PaaS), multi-tenancy, and DevOps. It defines a successful service as one that offers subscription-based delivery according to an international survey. Characteristics of successful services include automation, monetization through subscriptions, implementation using techniques like mashups and multi-tenancy, and using microservices focusing on separation of concerns. The document outlines a journey to developing successful services and discusses related topics like operational automation, revenue generation, implementation approaches, and microservices.
Building and deploying LLM applications with Apache AirflowKaxil Naik
Behind the growing interest in Generate AI and LLM-based enterprise applications lies an expanded set of requirements for data integrations and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and to add additional Airflow Providers to make it easier to interact with LLMs such as the ones from OpenAI (such as GPT4) and the ones on HuggingFace, while working with both structured and unstructured data.
In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
https://airflowsummit.org/sessions/2023/keynote-llm/
Distributed & Highly Available server applications in Java and ScalaMax Alexejev
This document summarizes a presentation about distributed and highly available server applications in Java and Scala. It discusses the Talkbits architecture, which uses lightweight SOA principles with stateless edge services and specialized systems to manage state. The presentation describes using the Finagle library as a distributed RPC framework with Apache Zookeeper for service discovery. It also covers configuration, deployment, monitoring and logging of services using tools like SLF4J, Logback, CodaHale metrics, Jolokia, Fabric, and Datadog.
Environmental Sciences is the scientific study of the environmental system and
the status of its inherent or induced changes on organisms. It includes not only the study
of physical and biological characters of the environment but also the social and cultural
factors and the impact of man on environment.
The human eye is a complex organ responsible for vision, composed of various structures working together to capture and process light into images. The key components include the sclera, cornea, iris, pupil, lens, retina, optic nerve, and various fluids like aqueous and vitreous humor. The eye is divided into three main layers: the fibrous layer (sclera and cornea), the vascular layer (uvea, including the choroid, ciliary body, and iris), and the neural layer (retina).
Here's a more detailed look at the eye's anatomy:
1. Outer Layer (Fibrous Layer):
Sclera:
The tough, white outer layer that provides shape and protection to the eye.
Cornea:
The transparent, clear front part of the eye that helps focus light entering the eye.
2. Middle Layer (Vascular Layer/Uvea):
Choroid:
A layer of blood vessels located between the retina and the sclera, providing oxygen and nourishment to the outer retina.
Ciliary Body:
A ring of tissue behind the iris that produces aqueous humor and controls the shape of the lens for focusing.
Iris:
The colored part of the eye that controls the size of the pupil, regulating the amount of light entering the eye.
Pupil:
The black opening in the center of the iris that allows light to enter the eye.
3. Inner Layer (Neural Layer):
Retina:
The light-sensitive layer at the back of the eye that converts light into electrical signals that are sent to the brain via the optic nerve.
Optic Nerve:
A bundle of nerve fibers that carries visual signals from the retina to the brain.
4. Other Important Structures:
Lens:
A transparent, flexible structure behind the iris that focuses light onto the retina.
Aqueous Humor:
A clear, watery fluid that fills the space between the cornea and the lens, providing nourishment and maintaining eye shape.
Vitreous Humor:
A clear, gel-like substance that fills the space between the lens and the retina, helping maintain eye shape.
Macula:
A small area in the center of the retina responsible for sharp, central vision.
Fovea:
The central part of the macula with the highest concentration of cone cells, providing the sharpest vision.
These structures work together to allow us to see, with the light entering the eye being focused by the cornea and lens onto the retina, where it is converted into electrical signals that are transmitted to the brain for interpretation.
he eye sits in a protective bony socket called the orbit. Six extraocular muscles in the orbit are attached to the eye. These muscles move the eye up and down, side to side, and rotate the eye.
The extraocular muscles are attached to the white part of the eye called the sclera. This is a strong layer of tissue that covers nearly the entire surface of the eyeball.he layers of the tear film keep the front of the eye lubricated.
Tears lubricate the eye and are made up of three layers. These three layers together are called the tear film. The mucous layer is made by the conjunctiva. The watery part of the tears is made by the lacrimal gland
The overall process of metabolism involves the complex anabolic and catabolic pathways. This depicts how our digestive system aids our body in the absorption of nutrients and storage.
STR Analysis and DNA Typing in Forensic Science: Techniques, Steps & Applicat...home
This presentation dives deep into the powerful world of DNA profiling and its essential role in modern forensic science. Beginning with the history of DNA fingerprinting, pioneered by Sir Alec Jeffreys in 1985, the presentation traces the evolution of forensic DNA analysis from the early days of RFLP (Restriction Fragment Length Polymorphism) to today's highly efficient STR (Short Tandem Repeat) typing methods.
You will learn about the key steps involved in STR analysis, including DNA extraction, amplification using PCR, capillary electrophoresis, and allele interpretation using electropherograms (EPGs). Detailed slides explain how STR markers, classified by repeat unit length and structure, are analyzed for human identification with remarkable precision—even from minute or degraded biological samples.
The presentation also introduces other DNA typing techniques such as Y-chromosome STR analysis, mitochondrial DNA (mtDNA) profiling, and SNP typing, alongside a comparative view of their strengths and limitations.
Real-world forensic applications are explored, from crime scene investigations, missing persons identification, and disaster victim recovery, to paternity testing and cold case resolution. Ethical considerations are addressed, emphasizing the need for informed consent, privacy protections, and responsible DNA database management.
Whether you're a forensic science student, a researcher, or simply curious about genetic identification methods, this presentation offers a comprehensive and clear overview of how STR typing works, its scientific basis, and its vital role in modern-day justice.
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...home
This comprehensive assignment explores the pivotal role of DNA profiling and Short Tandem Repeat (STR) analysis in forensic science and genetic studies. The document begins by laying the molecular foundations of DNA, discussing its double helix structure, the significance of genetic variation, and how forensic science exploits these variations for human identification.
The historical journey of DNA fingerprinting is thoroughly examined, highlighting the revolutionary contributions of Dr. Alec Jeffreys, who first introduced the concept of using repetitive DNA regions for identification. Real-world forensic breakthroughs, such as the Colin Pitchfork case, illustrate the life-saving potential of this technology.
A detailed breakdown of traditional and modern DNA typing methods follows, including RFLP, VNTRs, AFLP, and especially PCR-based STR analysis, now considered the gold standard in forensic labs worldwide. The principles behind STR marker types, CODIS loci, Y-chromosome STRs, and the capillary electrophoresis (CZE) method are thoroughly explained. The steps of DNA profiling—from sample collection and amplification to allele detection using electropherograms (EPGs)—are presented in a clear and systematic manner.
Beyond crime-solving, the document explores the diverse applications of STR typing:
Monitoring cell line authenticity
Detecting genetic chimerism
Tracking bone marrow transplant engraftment
Studying population genetics
Investigating evolutionary history
Identifying lost individuals in mass disasters
Ethical considerations and potential misuse of DNA data are acknowledged, emphasizing the need for careful policy and regulation.
Whether you're a biotechnology student, a forensic professional, or a researcher, this document offers an in-depth look at how DNA and STRs transform science, law, and society.
Poultry require at least 38 dietary nutrients inappropriate concentrations for a balanced diet. A nutritional deficiency may be due to a nutrient being omitted from the diet, adverse interaction between nutrients in otherwise apparently well-fortified diets, or the overriding effect of specific anti-nutritional factors.
Major components of foods are – Protein, Fats, Carbohydrates, Minerals, Vitamins
Vitamins are A- Fat soluble vitamins: A, D, E, and K ; B - Water soluble vitamins: Thiamin (B1), Riboflavin (B2), Nicotinic acid (niacin), Pantothenic acid (B5), Biotin, folic acid, pyriodxin and cholin.
Causes: Low levels of vitamin A in the feed. oxidation of vitamin A in the feed, errors in mixing and inter current disease, e.g. coccidiosis , worm infestation
Clinical signs: Lacrimation (ocular discharge), White cheesy exudates under the eyelids (conjunctivitis). Sticky of eyelids and (xerophthalmia). Keratoconjunctivitis.
Watery discharge from the nostrils. Sinusitis. Gasping and sneezing. Lack of yellow pigments,
Respiratory sings due to affection of epithelium of the respiratory tract.
Lesions:
Pseudo diphtheritic membrane in digestive and respiratory system (Keratinized epithelia).
Nutritional roup: respiratory sings due to affection of epithelium of the respiratory tract.
Pustule like nodules in the upper digestive tract (buccal cavity, pharynx, esophagus).
The urate deposits may be found on other visceral organs
Treatment:
Administer 3-5 times the recommended levels of vitamin A @ 10000 IU/ KG ration either through water or feed.
Lesions:
Pseudo diphtheritic membrane in digestive and respiratory system (Keratinized epithelia).
Nutritional roup: respiratory sings due to affection of epithelium of the respiratory tract.
Pustule like nodules in the upper digestive tract (buccal cavity, pharynx, esophagus).
The urate deposits may be found on other visceral organs
Treatment:
Administer 3-5 times the recommended levels of vitamin A @ 10000 IU/ KG ration either through water or feed.
Lesions:
Pseudo diphtheritic membrane in digestive and respiratory system (Keratinized epithelia).
Nutritional roup: respiratory sings due to affection of epithelium of the respiratory tract.
Pustule like nodules in the upper digestive tract (buccal cavity, pharynx, esophagus).
The urate deposits may be found on other visceral organs
Treatment:
Administer 3-5 times the recommended levels of vitamin A @ 10000 IU/ KG ration either through water or feed.
Data cleaning with the Kurator toolkit: Bridging the gap between conventional scripting and high-performance workflow automation
1. Data cleaning with the Kurator toolkit
Bridging the gap between conventional scripting and
high-performance workflow automation
Timothy McPhillips, David Lowery, James Hanken,
Bertram Ludäscher, James A. Macklin, Paul J. Morris,
Robert A. Morris, Tianhong Song, and John Wieczorek
TDWG 2015 - Biodiversity Informatics Services and Workflows Symposium
Nairobi, Kenya - September 30, 2015
2. Kurator: workflow automation for
cleaning biodiversity data
Project aims
§ Facilitate cleaning of biodiversity data.
§ Support both traditional scripting and
high-performance scientific workflows.
§ Deliver much more than a fixed set of
configurable workflows.
Technical strategy
§ Wrap Akka actor toolkit in a curation-oriented
scientific workflow language and runtime.
§ Enable scientists who write scripts to add their
own new data validation and cleaning actors.
§ Bring to scripts major advantages of workflow
automation: prospective and retrospective
provenance.
§ Bridge gaps between data validation services,
data cleaning scripts, and pipelined data curation
workflows.
Empower users and developers of scripts, actors, and workflows.
3. What some of us think of when we hear
the term ‘scientific workflows’
Phylogenetics workflow in Kepler (2005)
Graphical interface
§ Canvas for assembling and
displaying the workflow.
§ Library of workflow blocks
(‘actors’) that can be
dragged onto the canvas
and connected.
§ Arrows that represent
control dependencies or
paths of data flow.
§ A run button.
These features are not
essential to managing
actual scientific
workflows.
4. 10 essential functions
of a scientific workflow system
1. Automate programs and services scientists already use.
2. Schedule invocations of programs and services correctly and efficiently
– in parallel where possible.
3. Manage data flow to, from, and between programs and services.
4. Enable scientists (not just developers) to author or modify workflows easily.
5. Predict what a workflow will do when executed: prospective provenance.
6. Record what actually happens during workflow execution.
7. Reveal retrospective provenance – how workflow products were derived
from inputs via programs and services.
8. Organize intermediate and final data products as desired by users.
9. Enable scientists to version, share and publish their workflows.
10. Empower scientists who wish to automate additional programs and
services themselves.
These functions–not actors—distinguish scientific workflow
automation from general scientific software development.
5. Why build yet another system?
Available systems
§ Kepler (Ptolemy II), Taverna, VisTrails…
§ Familiar graphical programming
environments.
Limitations
§ Often little support for organizing
intermediate and final data products in
ways familiar to scientists.
§ Professional software developers
frequently are needed to develop new
components or workflows.
Huge gap between how these systems
are used and how scientists already
automate their analyses—via scripting.
Part of a Kepler workflow for
inferring phylogenetic trees from
protein sequences.
6. Avoiding actor ‘overuse injuries’
Overuse the actor paradigm…
§ In many systems workflows can be reused as
actors, or ‘subworkflows’ in other workflows.
§ This is a necessary abstraction when
workflow systems are not well-integrated
with scripting languages.
§ Each actor at left is a page of Java code. But
this whole ‘subworkflow’ could be written in a
half a page of Python!
…or use the right tools for the right job!
§ In Kurator we want to enable the actor
abstraction where it pays off the most—as
the unit of parallelism.
§ For specifying behavior inside of actors why
not use easy to understand scripts?
New actors and workflows must be
easy and fast to develop.
Part of a Kepler workflow for
inferring phylogenetic trees from
protein sequences.
7. Data curation workflow using Kepler
FilteredPush explored using workflows for data cleaning
§ First used COMAD workflow model supported by Kepler.
§ Enabled graphical assembly and configuration of custom workflows
from library of actors.
Highlighted potential performance limitations of workflow engines.
8. FP-Akka workflows
load data
check scientific name
check basis of record
check date collected
check lat/long
write out results
§ FilteredPush next investigated use of
the Akka actor toolkit and platform.
§ Widely used in industry, well
supported, and rapidly advancing.
§ Efficient parallel execution across
multiple processors and compute nodes.
§ Improved performance and scalability
compared to Kepler.
Limitations of directly using Akka
§ Writing new Akka actors and programs
requires Java (or Scala) experience.
§ Must address many parallel
programming challenges from scratch.
Advanced programming skills
required to write Akka programs
(‘workflows’) that run correctly.
9. Akka partly supports two essential
workflow platform requirements
The Kurator toolkit will satisfy the rest as needed.
1. Automate programs and services scientists already use.
2. Schedule invocations of programs and services correctly and
efficiently–in parallel where possible.
3. Manage data flow to, from, and between programs and services.
4. Enable scientists (not just developers) to author or modify workflows easily.
5. Predict what a workflow will do when executed: prospective provenance.
6. Record what actually happens during workflow execution.
7. Reveal retrospective provenance – how workflow products were derived from
inputs via programs and services.
8. Organize intermediate and final data products as desired by users.
9. Enable scientists to version, share and publish their workflows.
10. Empower scientists who wish to automate additional programs and services
themselves.
10. The Kurator Toolkit
YesWorkflow (YW)
§ Add YW annotations to any script or program
that supports text comments. Highlight the
workflow structure in the script.
§ Visualize or query prospective provenance
before running a script.
§ Reconstruct, visualize, and query retrospective
provenance after running the script.
§ Integrate provenance gathered from file names
and paths, log files, data file headers, run
metadata, and records of run-time events.
Kurator-Akka
§ Write functions or classes in Python or Java. Mark up with YesWorkflow.
§ Declare how scripts or Java code can be used as actors. Short YAML blocks.
§ Declare workflows. List actors and specify their connections. More YAML.
§ Run workflow. Use Akka for parallelization transparently—and correctly.
§ Reconstruct retrospective provenance of workflow products.
11. Example: data validation and
cleaning using WoRMS web services
1) Write simple Python functions (or a class) that wrap the web services provided
by the World Register of Marine Species (WoRMS).
2) Develop a Python script that uses the service wrapper functions from (1) to
clean a set of records provided in CSV format.
3) Mark up the script written in (2) with YesWorkflow annotations and
graphically display the script as a workflow.
4) Factor out of (2) a Python function that can clean a single record using the
WoRMS wrapper functions in (1).
5) Write a block of YAML that declares how the record-cleaning function in (4) can
be used as an actor in a Kurator-Akka workflow.
6) Declare using YAML a workflow that uses the actor declared in (5) along with
actors for reading and writing CSV files.
7) Run the workflow (6) on a sample data set with the CSV reader, CSV writer
and WoRMS validation actors all running in parallel.
README at https://github.com/kurator-org/kurator-validation shows how to:
Illustrates how Kurator aims to facilitate composition of actors and
high-performance workflows from simple functions and scripts.
12. class WoRMSService(object):
"""
Class for accessing the WoRMS taxonomic name database via the AphiaNameService.
The Aphia names services are described at http://marinespecies.org/aphia.php?p=soap.
"""
WORMS_APHIA_NAME_SERVICE_URL = 'http://marinespecies.org/aphia.php?p=soap&wsdl=1’
def __init__(self, marine_only=False):
""" Initialize a SOAP client using the WSDL for the WoRMS Aphia names service"""
self._client = Client(self.WORMS_APHIA_NAME_SERVICE_URL)
self._marine_only = marine_only
def aphia_record_by_exact_taxon_name(self, name):
"""
Perform an exact match on the input name against the taxon names in WoRMS.
This function first invokes an Aphia names service to lookup the Aphia ID for
the taxon name. If exactly one match is returned, this function retrieves the
Aphia record for that ID and returns it.
"""
aphia_id = self._client.service.getAphiaID(name, self._marine_only);
if aphia_id is None or aphia_id == -‐999: # -‐999 indicates multiple matches
return None
else:
return self._client.service.getAphiaRecordByID(aphia_id)
def aphia_record_by_fuzzy_taxon_name(self, name):
:
:
Python class wrapping WoRMS services
WoRMS services
13. # @BEGIN find_matching_worms_record
# @IN original_scientific_name
# @OUT matching_worms_record
# @OUT worms_lsid
worms_match_result = None
worms_lsid = None
# first try exact match of the scientific name against WoRMS
matching_worms_record = worms.aphia_record_by_exact_taxon_name(original_scientific_name)
if matching_worms_record is not None:
worms_match_result = 'exact'
# otherwise try a fuzzy match
else:
matching_worms_record = worms.aphia_record_by_fuzzy_taxon_name(original_scientific_name)
if matching_worms_record is not None:
worms_match_result = 'fuzzy’
# if either match succeeds extract the LSID for the taxon
if matching_worms_record is not None:
worms_lsid = matching_worms_record['lsid']
# @END find_matching_worms_record
Finding matching WoRMS records
in the data cleaning script
YesWorkflow annotation
marking start of workflow step.
End of workflow step.
Variables serving as inputs and
outputs to workflow step.
Calls to WoRMS service
wrapper functions.
14. YW rendering of script
Workflow steps each
delimited by @BEGIN
and @END annotations
in script.
Data flowing into and out of
find_matching_worms_record
workflow step.
YesWorkflow infers connections
between workflow steps and what
data flows through them by matching
@IN and @OUT annotations.
15. # @BEGIN compose_cleaned_record
# @IN original_record
# @IN worms_lsid
# @IN updated_scientific_name
# @IN original_scientific_name
# @IN updated_authorship
# @IN original_authorship
# @OUT cleaned_record
cleaned_record = original_record
cleaned_record['LSID'] = worms_lsid
cleaned_record['WoRMsMatchResult'] = worms_match_result
if updated_scientific_name is not None:
cleaned_record['scientificName'] = updated_scientific_name
cleaned_record['originalScientificName'] = original_scientific_name
if updated_authorship is not None:
cleaned_record['scientificNameAuthorship'] = updated_authorship
cleaned_record['originalAuthor'] = original_authorship
# @END compose_cleaned_record
The compose_cleaned_record step in
the data cleaning script
16. A function for cleaning one record
def curate_taxon_name_and_author(self, input_record):
# look up record for input taxon name in WoRMS taxonomic database
is_exact_match, aphia_record = (
self._worms.aphia_record_by_taxon_name(input_record['TaxonName']))
if aphia_record is not None:
# save taxon name and author values from input record in new fields
input_record['OriginalName'] = input_record['TaxonName']
input_record['OriginalAuthor'] = input_record['Author']
# replace taxon name and author fields in input record with values in aphia record
input_record['TaxonName'] = aphia_record['scientificname']
input_record['Author'] = aphia_record['authority']
# add new fields
input_record['WoRMsExactMatch'] = is_exact_match
input_record['lsid'] = aphia_record['lsid’]
else:
input_record['OriginalName'] = None
input_record['OriginalAuthor'] = None
input_record['WoRMsExactMatch'] = None
input_record['lsid'] = None
return input_record
Factoring out the core functionality
of a script into a reusable function
is a natural step in script evolution.
17. Declaring the function as an actor
-‐ id: WoRMSNameCurator
type: PythonClassActor
properties:
pythonClass: org.kurator.validation.actors.WoRMSCurator.WoRMSCurator
onData: curate_taxon_name_and_author
Actor type identifier
referenced when composing a
workflow that uses the actor.
Python class declaring
the function as a
method (optional).
Name of Python function called
for each data item received by
the actor at run time.
Besides this block of YAML, no additional code needs
to be written to convert the function into an actor.
18. Simple workflow using the new actor
components:
-‐ id: ReadInput
type: CsvFileReader
-‐ id: CurateRecords
type: WoRMSNameCurator
properties:
listensTo:
-‐ !ref ReadInput
-‐ id: WriteOutput
type: CsvFileWriter
properties:
listensTo:
-‐ !ref CurateRecords
-‐ id: WoRMSNameValidationWorkflow
type: Workflow
properties:
actors:
-‐ !ref ReadInput
-‐ !ref CurateRecords
-‐ !ref WriteOutput
Actor that reads input from a CSV
file. Emits records one at a time.
The WoRMSNameCurator actor
declared on the previous slide.
Processes each received record in turn.
Actor that writes received records
to an output CSV file one by one.
The listensTo property is used to
declare how actors are connected.
Declaration of workflow as a
composition of the three actors.
Actors run concurrently, each working on
different records at the same time, when the
workflow is executed by Kurator-Akka.
20. Running the workflow
$ ka -‐f WoRMS_name_validation.yaml < WoRMS_name_validation_input.csv
ID,TaxonName,Author,OriginalName,OriginalAuthor,WoRMsExactMatch,lsid
37929,Architectonica reevei,"(Hanley, 1862)",Architectonica
reevi,,false,urn:lsid:marinespecies.org:taxname:588206
37932,Rapana rapiformis,"(Born, 1778)",Rapana rapiformis,"(Von Born,
1778)",true,urn:lsid:marinespecies.org:taxname:140415
180593,Buccinum donomani,"(Linnaeus, 1758)",,,,
179963,Codakia paytenorum,"(Iredale, 1937)",Codakia paytenorum,"Iredale,
1937",true,urn:lsid:marinespecies.org:taxname:215841
0,Rissoa venusta,"Garrett, 1873",Rissoa
venusta,,true,urn:lsid:marinespecies.org:taxname:607233
62156,Rissoa venusta,"Garrett, 1873",Rissoa
venusta,Phil.,true,urn:lsid:marinespecies.org:taxname:607233
$
Workflow can be run
at the command line.
Actors can be created from simple scripts, and workflows can be
run like scripts. Simple YAML files wire everything together.
Actors can read and write standard
input and output like any script.
21. The road ahead
The immediate future
§ YesWorkflow support for graphically rendering Kurator-Akka workflows.
§ Combining prospective and retrospective provenance from workflow
declarations and YesWorkflow-annotated scripts used as actors.
§ Enhancements to YAML workflow declarations enabling Akka support for
creating and managing multiple instances of each actor for higher
throughput.
§ Exploration of advanced Akka features in FP-Akka framework, followed
by generalization of these features in Kurator-Akka so that all users can
benefit.
Future possibilities
§ Support for additional actor scripting languages (e.g., R).
§ Actor function wrappers for other workflow and dataflow systems.
§ Graphical user interface for composing and running workflows.
§ Your suggestions?