Creating a Data Science Team from an Architect's Perspective. This talk is about team building: how to staff a data science team with the right people, including data engineers and DevOps.
Moving to a data-centric architecture: Toronto Data Unconference 2015 – Adam Muise
Why use a data lake? Why use a lambda architecture? A conversation starter for Toronto Data Unconference 2015. We will discuss technologies such as Hadoop, Kafka, Spark Streaming, and Cassandra.
The document discusses the challenges of managing large volumes of data from various sources with a traditional, siloed approach. It argues that Hadoop provides a solution by allowing all data to be stored together in a single system and processed as needed. This addresses the problems caused by keeping data isolated in different silos and enables new types of analysis across all available data.
Turn Data Into Actionable Insights - StampedeCon 2016 – StampedeCon
At Monsanto, emerging technologies such as IoT, advanced imaging and geo-spatial platforms; molecular breeding, ancestry and genomics data sets have made us rethink how we approach developing, deploying, scaling and distributing our software to accelerate predictive and prescriptive decisions. We created a Cloud based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and integrate analytics with our core product platforms.
As part of this talk, we will share our journey of transformation, showing how we enabled: a collaborative discovery analytics environment for data science teams to perform model development; data provisioning through APIs and streams; model deployment to production on our auto-scaling big-data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical and batch analytics@scale; and integration of analytics with our core product platforms to turn data into actionable insights.
Innovation in the Data Warehouse - StampedeCon 2016 – StampedeCon
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with Big Data and outline three common big data architectures (batch, lambda, and kappa). Then, we'll dive into the decision points necessary for your own cluster, for example: cloud vs. on premises, physical vs. virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we'll share some lessons learned about which pieces of our architecture worked well and rant about those which didn't. No deep Hadoop knowledge is necessary; the talk is aimed at the architect or executive level.
How to get started in Big Data without Big Costs - StampedeCon 2016 – StampedeCon
Looking to implement Hadoop but haven’t pulled the trigger yet? You are not alone. Many companies have heard the hype about how Hadoop can solve the challenges presented by big data, but few have actually implemented it. What’s preventing them from taking the plunge? Can it be done in small steps to ensure project success?
This session will discuss some of the items to consider when getting started with Hadoop and how to go about making the decision to move to the de facto big data platform. Starting small can be a good approach when your company is learning the basics and deciding what direction to take. There is no need to invest large amounts of time and money up front if a proof of concept is all you aim to provide. Using well known data sets on virtual machines can provide a low cost and effort implementation to know if your big data journey will be successful with Hadoop.
The document discusses Talend's big data solutions and sandbox. It introduces Rajan Kanitkar as a senior solutions engineer at Talend with 15 years of experience in data integration. It then summarizes Talend's big data platform and ecosystem including Hadoop, MapReduce, HDFS, Hive and more. The rest of the document describes Talend's sandbox, which provides a pre-configured virtual image with Hadoop distributions, Talend software, and data scenarios to demonstrate ingesting, transforming and delivering big data.
What is Big Data Discovery, and how it complements traditional business anal... – Mark Rittman
Data Discovery is an analysis technique that complements traditional business analytics, and enables users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization.
In this session we'll look at Oracle Big Data Discovery and how it provides a "visual face" to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W... – StampedeCon
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
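To make the Kafka-plus-Avro pattern concrete, here is a minimal Python sketch (not Shutterstock's actual code) that serializes an event with an Avro schema and publishes it to a Kafka topic using the fastavro and kafka-python libraries. The topic name, schema, and broker address are illustrative; a real deployment would fetch and register schemas through a schema registry rather than hard-coding them.

```python
# Minimal sketch: Avro-serialize an event and publish it to Kafka.
# Schema, topic, and broker address are assumptions for illustration.
import io

from fastavro import parse_schema, schemaless_writer
from kafka import KafkaProducer

event_schema = parse_schema({
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
})

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def publish(event: dict) -> None:
    # Serialize the record with the Avro schema, then ship the bytes to Kafka.
    buf = io.BytesIO()
    schemaless_writer(buf, event_schema, event)
    producer.send("pageviews", value=buf.getvalue())

publish({"user_id": "u-123", "url": "/search?q=cats", "ts": 1465000000000})
producer.flush()
```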
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and creation and maintenance of Hive databases and tables becomes much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old-fashioned Hive as a tool for easily and efficiently converting existing datasets into these formats.
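As a rough illustration of that conversion step, the sketch below rewrites an existing row-oriented table into ORC and Parquet. The talk describes doing this with plain Hive; the same idea is shown here through PySpark's Hive support, and the table and column names are hypothetical.

```python
# Minimal sketch: convert an existing text table into columnar formats.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

raw = spark.table("web_logs_text")          # existing delimited-text table (hypothetical)

# Rewrite the same rows as columnar, compressed tables.
raw.write.mode("overwrite").format("orc").saveAsTable("web_logs_orc")
raw.write.mode("overwrite").format("parquet").saveAsTable("web_logs_parquet")

# Analytical queries that touch only a few columns now read far less data.
spark.sql("SELECT status, COUNT(*) FROM web_logs_orc GROUP BY status").show()
```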
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
Apache Hadoop is quickly becoming the technology of choice for organizations investing in big data, powering their next generation data architecture. With Hadoop serving as both a scalable data platform and computational engine, data science is re-emerging as a center-piece of enterprise innovation, with applied data solutions such as online product recommendation, automated fraud detection and customer sentiment analysis. In this talk Ofer will provide an overview of data science and how to take advantage of Hadoop for large scale data science projects: * What is data science? * How can techniques like classification, regression, clustering and outlier detection help your organization? * What questions do you ask and which problems do you go after? * How do you instrument and prepare your organization for applied data science with Hadoop? * Who do you hire to solve these problems? You will learn how to plan, design and implement a data science project with Hadoop
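For readers new to those techniques, the toy sketch below shows classification, clustering, and outlier detection on synthetic data with scikit-learn; at Hadoop scale the same workflow would run on the cluster (for example with Spark MLlib), but the split, fit, score pattern is identical. Everything in it is illustrative rather than taken from the talk.

```python
# Toy sketch of classification, clustering, and outlier detection on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classification: e.g. "will this transaction be fraudulent?"
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Clustering: e.g. customer segmentation with no labels at all.
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Outlier detection: e.g. flag unusual behaviour for review (-1 marks outliers).
outliers = IsolationForest(random_state=0).fit_predict(X)
print("clusters:", np.unique(segments), "outliers:", int((outliers == -1).sum()))
```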
The document discusses how database design is an important part of agile development and should not be neglected. It advocates for an evolutionary design approach where the database schema can change over time without impacting application code through the use of procedures, packages, and views. A jointly designed transactional API between the application and database is recommended to simplify changes. Both agile principles and database normalization are seen as valuable to achieve flexibility and avoid redundancy.
The ecosystem for big data and analytics has become too large and complex, with too many vendors, distributions, engines, projects, and iterations. This leads to problems like analysis paralysis during platform decisions, solutions becoming obsolete too quickly, and constant stress over choosing the right engine for each job. The document suggests that industries, vendors, investors, analysts, technologists, and customers all need to take steps to reduce complexity and focus on standards and merit-based evaluations of options.
ING Bank has developed a data lake architecture to centralize and govern all of its data. The data lake will serve as the "memory" of the bank, holding all data relevant for reporting, analytics, and data exchanges. ING formed an international data community to collaborate on Hadoop implementations and identify common patterns for file storage, deep data analytics, and real-time usage. Key challenges included the complexity of Hadoop, difficulty of large-scale collaboration, and ensuring analytic data received proper security protections. Future steps include standardizing building blocks, defining analytical model production, and embedding analytics in governance for privacy compliance.
Владимир Слободянюк «DWH & BigData – architecture approaches» – Anna Shymchenko
This document discusses approaches to data warehouse (DWH) and big data architectures. It begins with an overview of big data, describing its large size and complexity that makes it difficult to process with traditional databases. It then compares Hadoop and relational database management systems (RDBMS), noting pros and cons of each for distributed computing. The document outlines how Hadoop uses MapReduce and has a structure including HDFS, HBase, Hive and Pig. Finally, it proposes using Hadoop as an ETL and data quality tool to improve traceability, reduce costs and handle exception data cleansing more effectively.
Incorporating the Data Lake into Your Analytic Architecture – Caserta
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
Unlock the value in your big data reservoir using oracle big data discovery a... – Mark Rittman
The document discusses Oracle Big Data Discovery and how it can be used to analyze and gain insights from data stored in a Hadoop data reservoir. It provides an example scenario where Big Data Discovery is used to analyze website logs, tweets, and website posts and comments to understand popular content and influencers for a company. The data is ingested into the Big Data Discovery tool, which automatically enriches the data. Users can then explore the data, apply additional transformations, and visualize relationships to gain insights.
Big Data in the Cloud - Montreal April 2015 – Cindy Gross
Slides:
- Basic Big Data and Hadoop terminology
- What projects fit well with Hadoop
- Why Hadoop in the cloud is so powerful
- Sample end-to-end architecture
- See: Data, Hadoop, Hive, Analytics, BI
- Do: Data, Hadoop, Hive, Analytics, BI
- How this tech solves your business problems
The document discusses big data and Oracle technologies. It provides an overview of big data, describing what it is and examples of big data in different industries. It then discusses several Oracle technologies for working with big data, including Oracle NoSQL Database for scalable key-value storage, Oracle R for statistical analysis and connecting to Hadoop, and Oracle Endeca for information discovery.
IDERA Live | The Ever Growing Science of Database Migrations – IDERA Software
You can watch the replay for this webcast in the IDERA Resource Center: http://ow.ly/QHaG50A58ZB
Many information technology professionals may not recognize it, but the bulk of their work has been and continues to be nothing more than database migrations. In the old days it was sharing files across systems, then moving files into relational databases, then loading them into data warehouses, and now we're moving to NoSQL and the cloud. In the presentation we'll delve into the ever growing and increasingly complex world of database migrations. Some of these considerations include what issues must be planned for and overcome, what problems are likely to occur, and what types of tools exist.
Database expert Bert Scalzo will cover these and many other database migration concerns.
About Bert: Bert Scalzo is an Oracle ACE, author, speaker, consultant, and a major contributor to many popular database tools used by millions of people worldwide. He has 30+ years of database experience and has worked for several major database vendors. He has a BS, MS, and Ph.D. in computer science, plus an MBA. He has presented at numerous events and webcasts. His areas of key interest include data modeling, database benchmarking, database tuning, SQL optimization, "star schema" data warehousing, running databases on Linux or VMware, and using NVMe flash-based technology to speed up database performance.
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016 – StampedeCon
Hadoop adoption is a journey. Depending on the business, the process can take weeks, months, or even years. Hadoop is a transformative technology, so the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. It is challenging for companies that have lived with an application-driven business for the last two decades to suddenly become data-driven. Companies need to begin thinking less in terms of single, silo’d servers and more about “the cluster”.
The concept of the cluster becomes the center of data gravity, drawing all the applications to it. Companies, especially the IT organizations, embark on a process of understanding how to maintain and operationalize this environment and provide the data lake as a service to the businesses. They must empower the business by providing the resources for the use cases which drive both renovation and innovation. IT needs to adopt new technologies and new methodologies which enable the solutions. This is not technology for technology's sake. Hadoop is a data platform servicing and enabling all facets of an organization. Building out and expanding this platform is the ongoing journey as word gets out to businesses that they can have any data they want, at any time. Success is what drives the journey.
The length of the journey varies from company to company. Sometimes the challenges are based on the size of the company but many times the challenges are based on the difficulty of unseating established IT processes companies have adopted without forethought for the past two decades. Companies must navigate through the noise. Sifting through the noise to find those solutions which bring real value takes time. As the platform matures and becomes mainstream, more and more companies are finding it easier to adopt Hadoop. Hundreds of companies have already taken many steps; hundreds more have already taken the first step. As the wave of successful Hadoop adoption continues, more and more companies will see the value in starting the journey and paving the way for others.
The Rise of the DataOps - Dataiku - J On the Beach 2016 – Dataiku
Many organisations are creating groups dedicated to data. These groups have many names: Data Team, Data Labs, Analytics Teams…
But whatever the name, the success of those teams depends a lot on the quality of the data infrastructure and their ability to actually deploy data science applications in production.
In that regard, a new role of “DataOps” is emerging. Similar to DevOps for (web) development, the DataOps role is a merge between a data engineer and a platform administrator. Well versed in cluster administration and optimisation, a DataOps engineer would also have a perspective on data quality and the relevance of predictive models.
Do you want to be a DataOps engineer? We'll discuss the role and its challenges during this talk.
This document discusses Dataiku Flow and DCTC. Dataiku Flow is a data-driven orchestration framework for complex data pipelines that manages data dependencies and parallelization. It allows defining datasets and tasks to transform data. DCTC is a tool that can manipulate files across different storage systems like S3, GCS, and HDFS to perform operations like copying, synchronizing, and dispatching files. It aims to simplify common data transfer pains. The presentation concludes with contact information for Dataiku executives.
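As a rough illustration of what a dependency-driven pipeline does, here is a minimal Python sketch in which tasks declare the datasets they depend on and a runner executes them in topological order. It is not Dataiku Flow's API, just the general idea.

```python
# Minimal sketch of a dependency-driven pipeline runner (illustrative only).
from graphlib import TopologicalSorter  # Python 3.9+

datasets = {}  # dataset name -> materialized data

def extract_orders():
    datasets["orders"] = [{"id": 1, "amount": 40.0}, {"id": 2, "amount": 60.0}]

def total_revenue():
    datasets["revenue"] = sum(o["amount"] for o in datasets["orders"])

# task name -> (callable, set of tasks it depends on)
tasks = {
    "extract_orders": (extract_orders, set()),
    "total_revenue": (total_revenue, {"extract_orders"}),
}

# Run every task after the tasks it depends on.
graph = {name: deps for name, (_, deps) in tasks.items()}
for name in TopologicalSorter(graph).static_order():
    tasks[name][0]()

print(datasets["revenue"])  # 100.0
```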
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha... – Seeling Cheung
Citizens Bank was implementing a BigInsights Hadoop Data Lake with PureData System for Analytics to support all internal data initiatives and improve the customer experience. Testing BigInsights on the ViON Hadoop Appliance yielded the productivity, maintenance, and performance Citizens was looking for. Citizens Bank moved some analytics processing from Teradata to Netezza for better cost and performance, implemented BigInsights Hadoop for a data lake, and avoided large capital expenditures for additional Teradata capacity.
1) The document discusses GoDaddy's use of Hadoop to build a data warehouse that enables greater data integration and supports all phases of analytics.
2) Key principles for the data team include making data easy to discover, understand, consume and maintain through automation, and delivering value quickly through an Agile approach.
3) The data warehouse uses a variant of the Kimball design, with wide, denormalized fact tables and integrated, conformed dimensions, to support all types of analytics using data at the lowest level of granularity (see the sketch below).
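A minimal, hypothetical illustration of that Kimball pattern: a fact table kept at the lowest grain joined to a conformed dimension and then rolled up however the analyst needs. Table and column names are made up.

```python
# Tiny star-schema sketch: lowest-grain fact table + conformed dimension.
import pandas as pd

fact_order_line = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "customer_key": [10, 10, 11, 10],
    "product": ["hosting", "domain", "hosting", "ssl"],
    "amount": [9.99, 12.00, 9.99, 5.00],
})

dim_customer = pd.DataFrame({
    "customer_key": [10, 11],
    "segment": ["smb", "enterprise"],
    "country": ["US", "CA"],
})

# Because the fact rows are at the lowest grain, any rollup is possible later.
report = (fact_order_line
          .merge(dim_customer, on="customer_key")
          .groupby(["segment", "country"])["amount"]
          .sum())
print(report)
```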
Summary introduction to data engineering – Novita Sari
Data engineering involves designing, building, and maintaining data warehouses to transform raw data into queryable forms that enable analytics. A core task of data engineers is Extract, Transform, and Load (ETL) processes - extracting data from sources, transforming it through processes like filtering and aggregation, and loading it into destinations. Data engineers help divide systems into transactional (OLTP) and analytical (OLAP) databases, with OLTP providing source data to data warehouses analyzed through OLAP systems. While the roles are similar, data engineers focus more on infrastructure and ETL processes, while data scientists focus more on analysis, modeling, and insights.
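A minimal end-to-end ETL sketch in plain Python makes the extract/transform/load split concrete: rows are pulled from a CSV source, filtered and aggregated, and loaded into an analytical store (SQLite stands in for the warehouse). File and table names are hypothetical.

```python
# Minimal ETL sketch: extract from CSV, transform, load into SQLite.
import csv
import sqlite3
from collections import defaultdict

# Extract: read raw transactional rows from the source system's export.
with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))   # assumed columns: order_id, country, amount

# Transform: drop incomplete rows and aggregate to the grain the analysts need.
revenue_by_country = defaultdict(float)
for row in rows:
    if row["amount"]:
        revenue_by_country[row["country"]] += float(row["amount"])

# Load: write the transformed result into the warehouse table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS revenue (country TEXT, total REAL)")
conn.executemany("INSERT INTO revenue VALUES (?, ?)", revenue_by_country.items())
conn.commit()
conn.close()
```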
Rob peglar introduction_analytics _big data_hadoop – Ghassan Al-Yafie
This document provides an introduction to analytics and big data using Hadoop. It discusses the growth of digital data and challenges of big data. Hadoop is presented as a solution for storing and processing large, unstructured datasets across commodity servers. The key components of Hadoop - HDFS for distributed storage and MapReduce for distributed processing - are described at a high level. Examples of industries using big data analytics are also listed.
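The MapReduce model it describes can be mimicked in a few lines of plain Python: map each input line to (word, 1) pairs, group by key, and reduce each group. A real Hadoop job runs these phases in parallel across HDFS blocks; this toy version only shows the shape of the computation.

```python
# Toy word count illustrating the map / shuffle / reduce phases.
from collections import defaultdict

lines = ["big data big insight", "data beats opinion"]

# Map phase: emit (key, value) pairs from each input record.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key (the framework does this between phases).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each key's values into a final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 2, 'insight': 1, 'beats': 1, 'opinion': 1}
```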
Introduction to streaming and messaging flume,kafka,SQS,kinesis – Omid Vahdaty
Does big data leave you a bit confused? Messaging? Batch processing? Data streaming? In-flight analytics? Cloud? Open source? Flume? Kafka? Flafka (both)? SQS? Kinesis? Firehose?
This document compares Apache Flume and Apache Kafka for use in data pipelines. It describes Conversant's evolution from a homegrown log collection system to using Flume and then integrating Kafka. Key points covered include how Flume and Kafka work, their capabilities for reliability, scalability, and ecosystems. The document also discusses customizing Flume for Conversant's needs, and how Conversant monitors and collects metrics from Flume and Kafka using tools like JMX, Grafana dashboards, and OpenTSDB.
This document provides an overview and comparison of the Avro and Parquet data formats. It begins with introductions to Avro and Parquet, describing their key features and uses. The document then covers Avro and Parquet schemas, file structures, and includes code examples. Finally, it discusses considerations for choosing between Avro and Parquet and shares experiences using the two formats.
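As a small illustration of the two formats side by side, the sketch below writes the same records as row-oriented Avro (with fastavro) and as columnar Parquet (with pyarrow). The schema and file names are illustrative, not taken from the document.

```python
# Minimal sketch: the same records written as Avro (row-oriented) and Parquet (columnar).
from fastavro import parse_schema, reader, writer
import pyarrow as pa
import pyarrow.parquet as pq

records = [
    {"user_id": "u-1", "country": "CA", "amount": 12.5},
    {"user_id": "u-2", "country": "IN", "amount": 7.0},
]

# Avro: the schema travels with the file; rows are stored one after another.
avro_schema = parse_schema({
    "type": "record",
    "name": "Purchase",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "country", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})
with open("purchases.avro", "wb") as f:
    writer(f, avro_schema, records)
with open("purchases.avro", "rb") as f:
    print(list(reader(f)))

# Parquet: values are stored column by column, which favours analytical scans.
table = pa.Table.from_pydict({
    "user_id": [r["user_id"] for r in records],
    "country": [r["country"] for r in records],
    "amount": [r["amount"] for r in records],
})
pq.write_table(table, "purchases.parquet")
print(pq.read_table("purchases.parquet", columns=["country"]))  # column projection
```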
Parquet is a column-oriented storage format for Hadoop that supports efficient compression and encoding techniques. It uses a row group structure to store data in columns in a compressed and encoded column chunk format. The schema and metadata are stored in the file footer to allow for efficient reads and scans of selected columns. The format is designed to be extensible through pluggable components for schema conversion, record materialization, and encodings.
This document summarizes a benchmark study of file formats for Hadoop, including Avro, JSON, ORC, and Parquet. It found that ORC with zlib compression generally performed best for full table scans. However, Avro with Snappy compression worked better for datasets with many shared strings. The document recommends experimenting with the benchmarks, as performance can vary based on data characteristics and use cases like column projections.
Choosing an HDFS data storage format - Avro vs. Parquet and more - StampedeCon... – StampedeCon
At the StampedeCon 2015 Big Data Conference: Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats have different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others.
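If you want a rough feel for that effect on your own data, a simple (and admittedly unscientific) check is to write the same dataset as plain text and as Parquet and time a single-column scan of each, as in the sketch below; the actual numbers will vary widely with data shape, codec, and hardware.

```python
# Rough timing sketch: single-column scan of CSV vs. Parquet for the same data.
import time

import pandas as pd

df = pd.DataFrame({
    "user_id": range(1_000_000),
    "country": ["CA", "IN", "US", "BR"] * 250_000,
    "amount": [1.0, 2.5, 3.25, 0.5] * 250_000,
})
df.to_csv("events.csv", index=False)
df.to_parquet("events.parquet")          # requires pyarrow (or fastparquet)

start = time.perf_counter()
pd.read_csv("events.csv", usecols=["amount"]).sum()
csv_seconds = time.perf_counter() - start

start = time.perf_counter()
pd.read_parquet("events.parquet", columns=["amount"]).sum()
parquet_seconds = time.perf_counter() - start

print(f"CSV: {csv_seconds:.2f}s  Parquet: {parquet_seconds:.2f}s")
```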
Flume is an Apache project for log aggregation and movement, optimized for Hadoop ecosystems. It uses a push model with agents and channels. Kafka is a distributed publish-subscribe messaging system optimized for high throughput and availability. It uses a pull model and supports multiple consumers. Kafka generally has higher throughput than Flume. Flume and Kafka can be combined, with Flume using Kafka as a channel or source/sink, to take advantage of both systems.
The buzzword of the past year is "Data Science". But what does it really mean? What does a "Data Scientist" do? What tools does Microsoft provide? And what other tools are there besides Microsoft's?
Lean Analytics is a set of rules to make data science more streamlined and productive. It touches on many aspects of what a data scientist should be and how a data science project should be defined to be successful. During this presentation Richard will cover where data science projects go wrong, how you should think of data science projects, what constitutes success in data science, and how you can measure progress. This session will be loaded with terms, stories and descriptions of project successes and failures. If you're wondering whether you're getting value out of data science, how to get more value out of it, and even whether you need it, then this talk is for you!
What you will take away from this session:
- Learn how to make your data science projects successful
- Evaluate how to track progress and report on the efficacy of data science solutions
- Understand the role of engineering and data scientists
- Understand your options for processes and software
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution - a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the 3 other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ... – Big Data Spain
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: more data beats algorithm improvements, scale trumps noise and sample-size effects, and manual tasks can be brute-forced.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
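The sketch below strings those steps together on a made-up customer-churn CSV using pandas and scikit-learn: gather and prepare the data, explore it briefly, then build and validate a model on a held-out test set. File and column names are hypothetical.

```python
# Compact sketch of the workflow: gather, prepare, explore, model, validate.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Gather + prepare: load the raw extract and handle missing values.
df = pd.read_csv("customers.csv")            # assumed columns: age, tenure, spend, churned
df = df.dropna(subset=["age", "tenure", "spend", "churned"])

# Explore: quick sanity checks before modelling.
print(df.describe())
print(df["churned"].value_counts(normalize=True))

# Model + validate: hold out a test set and report performance honestly.
X = df[["age", "tenure", "spend"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```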
The Right Data Warehouse: Automation Now, Business Value Thereafter – Inside Analysis
The Briefing Room with Dr. Robin Bloor and WhereScape
Live Webcast on April 1, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=7b23b14b532bd7be60a70f6bd5209f03
In the Big Data shuffle, everyone is looking at Hadoop as “the answer” to collect interesting data from a new set of sources. While Hadoop has given organizations the power to gather more information assets than ever before, the question still looms: which data, regardless of source, structure, volume and all the rest, are significant for affecting business value – and how do we harness it? One effective approach is to bolster the data warehouse environment with a solution capable of integrating all the data sources, including Hadoop, and automating delivery of key information into the right hands.
Register for this episode of The Briefing Room to hear veteran Analyst Robin Bloor as he explains how a rapidly changing information landscape impacts data management. He will be briefed by Mark Budzinski of WhereScape, who will tout his company’s data warehouse automation solutions. Budzinski will discuss how automation can be the cornerstone for closing the gap between those responsible for data management and the people driving business decisions.
Visit InsideAnalysis.com for more information.
Dapper: the microORM that will change your life – Davide Mauri
ORM or Stored Procedures? Code First or Database First? Ad-Hoc Queries? Impedance Mismatch? If you're a developer, or a DBA working with developers, you have heard all these terms at least once in your life…and usually in the middle of a strong discussion, debating about one or the other. Well, thanks to StackOverflow's Dapper, all these fights are finished. Dapper is a blazing fast microORM that allows developers to map SQL queries to classes automatically, while allowing (and encouraging) the use of stored procedures, parameterized statements and all the good stuff that SQL Server offers (JSON and TVPs are supported too!). In this session I'll show how to use Dapper in your projects, from the very basics to some more complex usages that will help you to create *really fast* applications without the burden of huge and complex ORMs. The days of Impedance Mismatch are finally over!
This document provides an overview of how to build your own personalized search and discovery tool like Microsoft Delve by combining machine learning, big data, and SharePoint. It discusses the Office Graph and how signals across Office 365 are used to populate insights. It also covers big data concepts like Hadoop and machine learning algorithms. Finally, it proposes a high-level architectural concept for building a Delve-like tool using Azure SQL Database, Azure Storage, Azure Machine Learning, and presenting insights.
How to build your own Delve: combining machine learning, big data and SharePoint – Joris Poelmans
You are experiencing the benefits of machine learning every day through product recommendations on Amazon and Bol.com, credit card fraud prevention, etc. So how can we leverage machine learning together with SharePoint and Yammer? We will first look into the fundamentals of machine learning and big data solutions, and next we will explore how we can combine tools such as Windows Azure HDInsight, R, and Azure Machine Learning to extend and support collaboration and content management scenarios within your organization.
Mapping Data Flows in Azure Data Factory 1st Edition Mark Kromer – divacazokey
We recently presented our technology solution for metadata discovery to the Boulder Business Intelligence Brains Trust in Colorado. (www.bbbt.us)
The whole session was also recorded on video, and there is a link to the recording at the end of the presentation.
Journey of The Connected Enterprise - Knowledge Graphs - Smart Data – Benjamin Nussbaum
We live in an era where the world is more connected than ever before and the trajectory is such that data relationships will only continue to increase with no signs of slowing down.
Connected data is the key to your business succeeding and growing in today’s connected world.
Leading enterprises will be the ones that utilize relationship-centric technologies to leverage connections from their internal operations and supply chain to their customer and user interactions. This ability to utilize connected data to understand all the nuanced relationships within their organization will propel them forward as they act on more holistic insights.
Every organization needs a knowledge graph because connected data is an essential foundation to advancing business. Knowledge graphs provide:
- Increased visibility between internal groups
- Efficiency gains
- Cross-functional data collaboration
- More complete and reliable business insights
- Better customer engagement
The live presentation and discussion can be found here: https://www.youtube.com/watch?v=RQGdw82rAes
Additional reading on why connected data is beneficial: https://www.graphgrid.com/why-connected-data-is-more-useful/
Connected data solutions available by Benjamin and his team via GraphGrid and AtomRain: https://www.graphgrid.com and https://www.atomrain.com
Business in the Driver’s Seat – An Improved Model for Integration – Inside Analysis
The Briefing Room with Dr. Robin Bloor and WhereScape
Live Webcast on September 30, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=bfff40f7c9645fc398770ea11152b148
The fueling of information systems will always require some effort, but a confluence of innovations is fundamentally changing how quickly and accurately it can be done. Gone are long cycle times for development. Today, organizations can embrace a more rapid and collaborative approach for building analytical applications and data warehouses. The key is to have business experts working hand-in-hand with data professionals as the solutions take shape, thus expediting the speed to valuable insights.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains the changing nature of information design. He’ll be briefed by WhereScape President Mark Budzinski, who will discuss his company’s data warehouse automation solutions and how they enable collaborative development. He will share use cases that illustrate how, by aligning business and IT, organizations can enable faster and more agile data warehouse development.
Visit InsideAnalysis.com for more information.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
Rental Cars and Industrialized Learning to Rank with Sean Downes – Databricks
Data can be viewed as the exhaust of online activity. With the rise of cloud-based data platforms, barriers to data storage and transfer have crumbled. The demand for creative applications and learning from those datasets has accelerated. Rapid acceleration can quickly accrue disorder, and disorderly data design can turn the deepest data lake into an impenetrable swamp.
In this talk, I will discuss the evolution of the data science workflow at Expedia with a special emphasis on Learning to Rank problems. From the heroic early days of ad-hoc Spark exploration to our first production sort model on the cloud, we will explore the process of industrializing the workflow. Layered over our story, I will share some best practices and suggestions on how to keep your data productive, or even pull your organization out of the data swamp.
Synapse is a solution provider with an innovative alternative to commercial off-the-shelf IT applications. Empowering business professionals to shape business processes without being chained to IT applications.
2015 nov 27_thug_paytm_rt_ingest_brief_final – Adam Muise
The document discusses Paytm Labs' transition from batch data ingestion to real-time data ingestion using Apache Kafka and Confluent. It outlines their current batch-driven pipeline and some of its limitations. Their new approach, called DFAI (Direct-From-App-Ingest), will have applications directly write data to Kafka using provided SDKs. This data will then be streamed and aggregated in real-time using their Fabrica framework to generate views for different use cases. The benefits of real-time ingestion include having fresher data available and a more flexible schema.
2015 feb 24_paytm_labs_intro_ashwin_armandoadam – Adam Muise
The document discusses building a data science pipeline for fraud detection using various technologies like Node.js, RabbitMQ, Spark, HBase and Cassandra. It describes the different stages of the pipeline from collecting and processing data to building models for tasks like fraud detection, recommendations and personalization. Finally, it discusses implementation considerations and technologies for building a scalable real-time and batch processing architecture.
Hadoop at the Center: The Next Generation of Hadoop – Adam Muise
This document discusses Hortonworks' approach to addressing challenges around managing large volumes of diverse data. It presents Hortonworks' Hadoop Data Platform (HDP) as a solution for consolidating siloed data into a central data lake on a single cluster. This allows different data types and workloads like batch, interactive, and real-time processing to leverage shared services for security, governance and operations while preserving existing tools. The HDP also enables new use cases for analytics like real-time personalization and segmentation using diverse data sources.
The document discusses the Lambda architecture, which combines batch and stream processing. It provides an example implementation using Hadoop, Kafka, Storm and other tools. The Lambda architecture handles batch loading and querying of large datasets as well as real-time processing of data streams. It also discusses using YARN and Spark for distributed processing and refreshing enrichments.
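As a toy illustration of the Lambda idea, the sketch below keeps a batch view that is recomputed periodically, a speed view that absorbs events as they arrive, and a query function that merges the two. The stack in the deck (Hadoop, Kafka, Storm) does this at scale; the dictionaries here only show the shape of the merge.

```python
# Toy Lambda-architecture sketch: batch view + speed view merged at query time.
from collections import Counter

batch_view = Counter({"page:/home": 10_000, "page:/pricing": 2_500})   # recomputed nightly
speed_view = Counter()                                                  # updated per event

def on_stream_event(page: str) -> None:
    speed_view[f"page:{page}"] += 1           # speed layer: cheap incremental update

def query(page: str) -> int:
    key = f"page:{page}"
    return batch_view[key] + speed_view[key]  # serving layer: merge both views

on_stream_event("/home")
on_stream_event("/home")
print(query("/home"))     # 10002
```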
An overview of securing Hadoop. Content primarily by Balaji Ganesan, one of the leaders of the Apache Argus project. Presented on Sept 4, 2014 at the Toronto Hadoop User Group by Adam Muise.
This document discusses big data and Hadoop. It notes that traditional technologies are not well-suited to handle the volume of data generated today. Hadoop was created by companies like Google and Yahoo to address this challenge through its distributed file system HDFS and processing framework MapReduce. The document promotes Hadoop and the Hortonworks Data Platform for storing, processing, and analyzing large volumes of diverse data in a cost-effective manner.
2014 feb 24_big_datacongress_hadoopsession1_hadoop101 – Adam Muise
This document provides an introduction to Hadoop using the Hortonworks Sandbox virtual machine. It discusses how Hadoop was created to address the limitations of traditional data architectures for handling large datasets. It then describes the key components of Hadoop like HDFS, MapReduce, YARN and Hadoop distributions like Hortonworks Data Platform. The document concludes by explaining how to get started with the Hortonworks Sandbox VM which contains a single node Hadoop cluster within a virtual machine, avoiding the need to install Hadoop locally.
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture – Adam Muise
An introduction to Hadoop's core components as well as the core Hadoop use case: the Data Lake. This deck was delivered at Big Data Congress 2014 in Saint John, NB on Feb 24.
The document discusses the challenges of dealing with large volumes of data from different sources. Traditional approaches of separating data into isolated silos are inadequate for analyzing today's vast amounts of data. The presenter argues that a better approach is to bring all available data together into a unified system so it can be analyzed and queried as a whole to generate useful insights. This approach treats all data as an integrated whole rather than separate, disconnected parts.
The document discusses challenges related to large volumes of data, or "Big Data". Traditional technologies try to divide and separate data across different systems, but this becomes difficult to manage at scale. The presenter introduces Hadoop as an alternative approach that can handle large volumes of data in a single system and democratize access to data. Hadoop provides a framework for storage, management and processing of large datasets in a distributed manner across commodity hardware.
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0 – Adam Muise
The document discusses Hadoop 2.2.0 and new features in YARN and MapReduce. Key points include: YARN introduces a new application framework and resource management system that replaces the jobtracker, allowing multiple data processing engines besides MapReduce; MapReduce is now a library that runs on YARN; Tez is introduced as a new data processing framework to improve performance beyond MapReduce.
The document discusses the challenges of managing large volumes of data from different sources. Traditional approaches of separating data into isolated data silos are no longer effective. The emerging solution is to bring all data together into a unified platform like Hadoop that can store, process, and analyze large amounts of diverse data in a distributed manner. This allows organizations to gain deeper insights by asking new questions of all their combined data.
The document is a presentation on big data and Hadoop. It introduces the speaker, Adam Muise, and discusses the challenges of dealing with large and diverse datasets. Traditional approaches of separating data into silos are no longer sufficient. The presentation argues that a distributed system like Hadoop is needed to bring all data together and enable it to be analyzed as a whole.
Sept 17 2013 - THUG - HBase a Technical IntroductionAdam Muise
HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).
2013 July 23 Toronto Hadoop User Group Hive TuningAdam Muise
This document provides an overview of Hive and its performance capabilities. It discusses Hive's SQL interface for querying large datasets stored in Hadoop, its architecture which compiles SQL queries into MapReduce jobs, and its support for SQL features. The document also covers techniques for optimizing Hive performance, including data abstractions like partitions, buckets and skews. It describes different join strategies in Hive like shuffle joins, broadcast joins and sort-merge bucket joins, and how shuffle joins are implemented in MapReduce.
2013 march 26_thug_etl_cdc_talking_pointsAdam Muise
This document summarizes Adam Muise's proposed agenda for a March 26, 2013 data integration working session. The agenda includes introductions, discussing common data integration patterns, a roundtable on user group members' CDC/ETL use cases, and an overview of new data integration solutions like Hadoop data lakes, streaming technologies, data governance tools, and LinkedIn's Databus system.
The document discusses Apache HCatalog, which provides metadata services and a unified view of data across Hadoop tools like Hive, Pig, and MapReduce. It allows sharing of data and metadata between tools and external systems through a consistent schema. HCatalog simplifies data management by allowing tools to access metadata like the schema, location, and format of data from a shared metastore instead of encoding that information within each application.
KnittingBoar Toronto Hadoop User Group Nov 27 2012Adam Muise
This document discusses machine learning and parallel iterative algorithms. It provides an introduction to machine learning and Mahout. It then describes Knitting Boar, a system for parallelizing stochastic gradient descent on Hadoop YARN. Knitting Boar partitions data among workers that perform online logistic regression in batches. The workers send gradient updates to a master node, which averages the updates to produce a new global model. Experimental results show Knitting Boar achieves roughly linear speedup. The document concludes by discussing developing YARN applications and the Knitting Boar codebase.
The document summarizes two use cases for Hadoop in biotech companies. The first case discusses a large biotech firm "N" that implemented Hadoop to improve their drug development workflow using next generation DNA sequencing. Hadoop reduced the workflow from 6 weeks to 2 days. The second case discusses challenges at another biotech firm "M" around scaling genomic data analysis and Hadoop's role in addressing those challenges through improved data ingestion, storage, querying and analysis capabilities.
1. So you want to data science.
Adam Muise
Chief Architect
2. Who am I?!
• Chief Architect at Paytm Labs!
• Paytm Labs is a data-driven lab founded to take on
the really hard problems of scaling up Fraud,
Recommendation, Rating, and Platform at Paytm!
• Paytm is an Indian Payments/Wallet company; it has
50 Million wallets already, adds almost 1 Million
wallets a day, and will have more than 100 Million
customers by the end of the year. Alibaba recently
invested in us, perhaps you heard. !
• I’ve also worked with Data Science teams at IBM,
Cloudera, and Hortonworks!
7. The Leadership!
If you are creating a data science
team, chances are that you are not a
Data Scientist. Data Scientists are
best applied to the problems of data,
not management.!
8. The Leadership!
Your boss (should ask): Why do you even
need data science to solve the problem?!
You (should) answer: The problem is too
complex to solve without machine
learning. Here’s why.!
You (should not) answer: Big data and
data science are on the roadmap.!
9. The Leadership!
You have your budget for a team of 2
data scientists. That’s a good start
right? Get ready to ask for more
money. !
10. The Leadership!
You need to ask your management for:!
- Budget for 2 data engineers for every data scientist you hire!
- Access to the data lake; failing that, access to the data warehouse!
- DevOps!
- Time to gain domain expertise before producing results!
- Exec-level cooperation from those teams who own the data and
tools you need and those who understand the data you need!
- A budget for servers/tools/additional storage based on a TCO
calculation you already did (right?)!
- A dedicated place for your team to work!
11. The Leadership!
Got DataLake?!
!
No? Depending on your
problem space,
chances are you are
building one unless you
can pull what you need
from an Existing Data
Warehouse.!
12. The Leadership!
You didn’t do a TCO (Total Cost of Ownership) calculation?
Ok, here you go:!
1. Internal/External cloud instances that can run Spark/
Hadoop/etc!
2. Storage costs (S3, internal, etc) for your analytical data
sets!
3. Lead time to get started, something like 1-2 months
depending on the complexity of the problem (Fraud
might take 3 months whereas Recommendation Engines
might be 1 month)!
4. Training time and costs for tools you didn’t know you
needed!
What / How much:!
- 24-32 medium to large instances on AWS each month: $15,000 to $45,000 per month!
- Storage costs for S3 (400TB to 2PB): $12,000 to $57,000 per month!
- Salaries & Operating Expenses: 2 x $xxxxx, your operating costs including salaries for yourself and 3 people!
- Training (courses for tools and perhaps a conference trip for hiring): $5,000 to $15,000!
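To make the "TCO calculation you already did" concrete, here is a minimal back-of-the-envelope sketch that simply sums the illustrative ranges in the table above; the figures are placeholders from the slide, not quotes, so substitute your own numbers.

```scala
// Back-of-the-envelope monthly TCO using the illustrative ranges above.
object TcoSketch {
  def main(args: Array[String]): Unit = {
    val compute  = (15000, 45000) // 24-32 medium/large AWS instances, per month
    val storage  = (12000, 57000) // S3, 400TB to 2PB, per month
    val training = (5000, 15000)  // courses, maybe a conference trip (largely one-off)

    val monthlyLow  = compute._1 + storage._1
    val monthlyHigh = compute._2 + storage._2
    println(s"Infrastructure alone: $$${monthlyLow} to $$${monthlyHigh} per month,")
    println(s"before salaries for yourself and 3 people, plus $$${training._1}-$$${training._2} of training.")
  }
}
```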
14. The Team!
So you have permission, resources,
and a corner in an office. How do you
start? !
15. The Team!
Assemble your team in the following
order:!
1. Get a Data Engineer with a good
analytical mind. Have him beg,
borrow, or steal whatever data sets
might be applicable to the
problem. Without data, no data
sciencey stuff can happen.!
16. The Team!
Assemble your team in the
following order:!
2. While you are getting
your data, hire or recruit
an internal Data Scientist. !
Easy, right?!
17. !!!!!!WARNING!!!!!!!
Data Science is not a mystical art form handed down by monks and taught over 50 years. You just need:!
• a good math background!
• academic or job experience with machine learning!
• business context!
• the ability to code!
That can be easier to find than you think.!
!
That being said, everybody seems to think they are a data scientist these days, from the guy who writes the monthly SQL reports to your office manager who is a whiz at Excel.!
18. The Team!
Assemble your team in the following
order:!
3. More Data Engineers. !
4. DevOps support (if you don’t have
a common resource pool to draw
from).!
19. The Team!
Keep your data science team innovative, keep
them away from bureaucracy, keep them cool.
Don’t discount the cool factor.!
They are supposed to solve hard problems, not
deal with the everyday business issues. To stay
objective they need to be decoupled from the
emergencies and the mediocre.!
If that sounds elitist then I challenge you to
create a scaling fraud detection system with your
existing data warehouse team. No really, try it. !
20. The Team!
What will they do?!
The Data Engineer !
Your data engineer is the heart and soul of your data science
team and will get almost none of the credit in the end. They
will help build your data pipeline, perform data
transformations, optimize training, automate validation, and
take the results into production. !
If you are lucky, you have Data Scientists that respect this
role and will often take some of these roles on to help ensure
their vision reaches production. Instead of relying on luck,
you can hire this way too. !
21. The Team!
What will they do?!
The Data Scientist!
Your Data Scientist will explore the data, create models, validate,
explore the data again, go in a different direction, clarify
requirements, model again, validate, retract, and then produce a
good model. The process is not deterministic and is a mix of
research and implementation. A good Data Scientist will be able to
code in the tools that you intend to implement production code
with, something like Scala on Spark.!
Your Data Scientist will have or at least learn the business context
required to solve your problem. They will need to communicate with
business experts to validate their solutions actually solve the
problem or to help drive them in a new direction. !
22. The Team!
What will they do?!
DevOps!
Developer Operations will help
build that data pipeline for you. If
you have to build a Data Lake from
scratch, you are going to really rely
on these folks. They should be
elite, understand distributed
systems, ride a motorcycle, and be
someone you feel uncomfortable
standing next to in an elevator.!
23. Managing The Team!
If your Data Scientists are not stellar
coders, put a Data Engineer in their
grill and make them produce code.
They can’t contribute if they can’t get
their hands dirty. Data Science is not
an ivory tower. !
24. Managing The Team!
Introduce your team to the
business team that knows the
data or business processes
better than anyone else. Often
that’s not the CIO-favored DWH
team, but rather the Customer
Service Representatives*!
*This was especially true in fighting Fraud. !
25. Managing The Team!
Ways to make your team hate you:!
Data Scientists:!
• Don’t provide the data they need to create their models!
• Suggest that they create their own training data, from scratch!
• Provide ambiguous goals for the accuracy and precision of their models!
• Tell them to mine the data / don't have a plan!
• Don’t respect the time it takes to create a model!
Data Engineers:!
• Let the Data Scientists use whatever tool they want without respect to parallel processing or
implementation!
• Have no management control over your data sources!
DevOps:!
• Use anything by IBM, Microsoft, SAS, or Oracle in your pipeline!
• Let the Data Engineers decide on the infrastructure!
27. The Work!
Start out with a clear goal that is
unambiguous.!
“I want to detect and prevent 50% of
Fraud in my payments system”!
“I want to increase conversion rates in
my eCommerce platform by 20%”!
28. The Work!
Get as much of the raw data as soon as you can
and as fast as you can. Don’t have a Data Lake?
Get your Hadoop on ASAP. !
!
29. The Work!
Give the team time to research the
data, gain context and become
experts. !
!
30. The Work!
Data without context == a complete
lack of direction in research. !
Research needs constant checks to
ensure that the primary problem is
being solved. !
!
31. The Work!
Data Science Development !=
Engineering Software Development.!
You will have to separate your
research process from the
engineering process that delivers the
models to production. !
!
32. The Work!
Data Engineering is an ongoing
process. You will need to maintain
pipelines, adapt to schema changes,
implement data cleansing, maintain
metadata in the data lake, optimize
processing workflows, etc. You will
never outgrow the need for your Data
Engineers. !
!
34. The Architecture!
Start with the cloud. You need to get
your infrastructure up as quickly as
possible. At the beginning, this is
cheaper than you think compared to the
time and startup costs for creating an
on-premise data lake, even/especially if
you have an existing IT Team*!
!
*If you are a big corporation, your IT team is often the biggest barrier to your success in
creating an independent Data Science team.!
36. The Architecture! Lambda Architecture!
Batch Ingest:!
• SQOOP from MySQL instances!
• Keep as much in HDFS as you can, offload to S3 for
DR/Archive and when you have colder data!
• Spark and other Hadoop processing tools can run
natively over S3 data so it’s never really gone (don’t
use Glacier in a processing workflow)!
Realtime Ingest:!
• Mypipe to get events from binary log data and push
into Kafka topics (under construction)!
• VoltDB connector to get events from DB and push to
Kafka (under construction)!
• Streaming data piped through Kafka!
• All Realtime data processed with Spark Streaming or
Storm from Kafka!
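A minimal sketch of how the realtime-ingest leg above might look with Spark Streaming's direct Kafka API (Spark 1.x era). The broker addresses, topic name, and HDFS landing path are assumptions for illustration; the Mypipe/VoltDB connectors feeding Kafka are out of scope here.

```scala
// Realtime ingest sketch: Kafka -> Spark Streaming -> raw landing zone in HDFS.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object RealtimeIngest {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("realtime-ingest"), Seconds(10))

    // Direct (receiver-less) Kafka stream; broker list and topic are assumptions.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc,
      Map("metadata.broker.list" -> "broker1:9092,broker2:9092"),
      Set("payments"))

    // Keep the raw payloads and land each micro-batch under a timestamped HDFS path.
    stream.map(_._2).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) rdd.saveAsTextFile(s"hdfs:///datalake/raw/payments/${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```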
37. The Architecture!
As you grow, your processing and
storage needs will likely mature.
Consider moving to an on-premise
solution for your Hadoop/Processing
architecture. You can always archive
to S3 if you need DR and don’t have
the appetite to create two clusters.!
38. The Architecture!
With an on-premise architecture, you
can interact with existing on-premise
production systems quickly. For us,
that means real-time Fraud detection
and action. You may find yourself
maintaining both in the long run.!
40. [email protected] - @jabenitez
Supervised learning vs Anomaly detection
Anomaly detection:
๏ Very small number of positive examples.
๏ Large number of negative examples.
๏ Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
Supervised learning:
๏ Ideally a large number of positive and negative examples.
๏ Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set.
* Anomaly Detection - Andrew Ng - Coursera ML Course
41. [email protected] - @jabenitez
What approach to follow?
๏ Not so good: One model to rule them all
๏ Better:
๏ Many models competing against each other
๏ 100s or 1000s of rules running in parallel
๏ Know thy customer
42. [email protected] - @jabenitez
Feature Selection
๏ Want:
p(x) large (small) for normal examples,
p(x) small (large) for anomalous examples
๏ Most common problem:
comparable distributions for both normal and anomalous examples
๏ Possible solutions:
๏ Apply transformations and variable combinations, e.g. x_{n+1} = (x_1 + x_4)^2 / x_3
๏ Focus on variable ratios and transaction velocity
๏ Use deep learning for feature extraction
๏ Dimensionality reduction
๏ your solution here
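A hedged sketch of the transformation ideas above: a combined/ratio feature in the spirit of x_{n+1} = (x_1 + x_4)^2 / x_3, plus a simple transaction-velocity feature. The field names and the one-hour window are assumptions for illustration, not the actual Paytm feature set.

```scala
// Feature-engineering sketch: a combined/ratio feature and a transaction-velocity feature.
case class Txn(userId: String, amount: Double, balance: Double, tsMillis: Long)

object FeatureSketch {
  // A combined feature in the spirit of x_{n+1} = (x_1 + x_4)^2 / x_3,
  // here expressed on named (hypothetical) transaction fields.
  def ratioFeature(t: Txn): Double =
    math.pow(t.amount + 1.0, 2) / math.max(t.balance, 1.0)

  // Transaction velocity: transactions per user within the last hour (assumed window).
  def velocity(txns: Seq[Txn], now: Long, windowMillis: Long = 3600000L): Map[String, Int] =
    txns.filter(t => now - t.tsMillis <= windowMillis)
      .groupBy(_.userId)
      .mapValues(_.size)
      .toMap
}
```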
45. [email protected] - @jabenitez
What we have tried
๏ Density estimator
๏ 2D Profiles
๏ Anomaly detection
๏ Clustering
๏ Model ensemble (Random forest)
๏ Deep learning (RBM)
๏ Logistic Regression
Combine
47. [email protected] - @jabenitez
Anomaly Detection* - Example
๏ Choose features x_i that are indicative of anomalous examples.
๏ Fit parameters μ_j, σ_j² of a normal distribution to each feature on the training set.
๏ Given a new example x, compute p(x) = ∏_j p(x_j; μ_j, σ_j²).
๏ Anomaly if p(x) < ε.
* Anomaly Detection - Andrew Ng - Coursera ML Course
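A minimal sketch of that recipe, following the cited Coursera formulation: fit a per-feature Gaussian on normal data, multiply the densities, and flag examples whose p(x) falls below a threshold ε. The training rows and ε here are made up for illustration.

```scala
// Per-feature Gaussian density estimation for anomaly detection.
object GaussianAnomaly {
  // Fit mu_j and sigma^2_j for every feature j over a training matrix of normal examples.
  def fit(train: Array[Array[Double]]): (Array[Double], Array[Double]) = {
    val m = train.length
    val n = train.head.length
    val mu     = Array.tabulate(n)(j => train.map(_(j)).sum / m)
    val sigma2 = Array.tabulate(n)(j => train.map(x => math.pow(x(j) - mu(j), 2)).sum / m)
    (mu, sigma2)
  }

  // p(x) = product over features of the univariate normal density.
  def p(x: Array[Double], mu: Array[Double], sigma2: Array[Double]): Double =
    x.indices.map { j =>
      math.exp(-math.pow(x(j) - mu(j), 2) / (2 * sigma2(j))) / math.sqrt(2 * math.Pi * sigma2(j))
    }.product

  def main(args: Array[String]): Unit = {
    val train = Array(Array(100.0, 2.0), Array(110.0, 3.0), Array(95.0, 2.5)) // assumed normal data
    val (mu, sigma2) = fit(train)
    val epsilon = 1e-3                          // threshold tuned on the cross-validation set
    val candidate = Array(400.0, 30.0)
    println(if (p(candidate, mu, sigma2) < epsilon) "anomaly" else "normal")
  }
}
```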
48. [email protected] - @jabenitez
Algorithm Evaluation
๏ Fit model on training set
๏ On a cross-validation/test example x, predict y = 1 if p(x) < ε (anomaly), else y = 0
๏ Possible evaluation metrics:
๏ True positive, false positive, false negative, true negative
๏ Precision/Recall
๏ F1-score
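A small worked example of those metrics from raw confusion-matrix counts. The counts are invented, and the true negatives are deliberately huge to show why plain accuracy is misleading on skewed fraud data.

```scala
// Precision, recall, and F1 from confusion-matrix counts (counts are invented).
object EvalMetrics {
  def main(args: Array[String]): Unit = {
    val (tp, fp, fn, tn) = (40.0, 10.0, 20.0, 9930.0) // tn dominates: accuracy alone would look
                                                      // great even for a useless model
    val precision = tp / (tp + fp)
    val recall    = tp / (tp + fn)
    val f1        = 2 * precision * recall / (precision + recall)
    println(f"precision=$precision%.2f recall=$recall%.2f F1=$f1%.2f")
  }
}
```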
50. [email protected] - @jabenitez
Anomaly Detection*
* Anomaly Detection - Andrew Ng - Coursera ML Course
Assume we have some labeled data of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.
Training set: assume normal (non-anomalous) examples only.
Cross validation set: labeled examples, including some known anomalies.
Test set: labeled examples, including some known anomalies.
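A minimal sketch of that split, assuming labeled examples with y = 0/1 as above: train only on (assumed) normal data and divide the known anomalies between the cross-validation and test sets. The proportions are arbitrary assumptions.

```scala
// Splitting labeled data for anomaly detection: train on normals, hold anomalies for CV/test.
case class Example(features: Array[Double], y: Int) // y = 0 standard behaviour, y = 1 anomalous

object AnomalySplit {
  def split(data: Seq[Example]): (Seq[Example], Seq[Example], Seq[Example]) = {
    val (normal, anomalous) = data.partition(_.y == 0)
    // Keep 60% of normals for training (illustrative proportion), share the rest with CV/test.
    val (trainNormal, heldOutNormal) = normal.splitAt((normal.size * 0.6).toInt)
    val (cvNormal, testNormal) = heldOutNormal.splitAt(heldOutNormal.size / 2)
    val (cvAnom, testAnom)     = anomalous.splitAt(anomalous.size / 2)
    (trainNormal, cvNormal ++ cvAnom, testNormal ++ testAnom)
  }
}
```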
54. [email protected] - @jabenitez
The lake again
[Architecture diagram: the data lake growing from Lake Simcoe into Lake Superior, drawn as a classic Lambda Architecture with various processing frameworks and near-realtime scoring/alerting*]
55. [email protected] - @jabenitez
Fraud Capabilities and Technology
A. Batch ingest and analysis of transaction data from the database:
traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing
B. Batch behavioural and portfolio heuristic fraud detection:
model analysis with iPython/Scala notebooks, Spark for processing, HDFS/HBase/Cassandra for storage
C. Near-realtime anomaly and heuristic fraud detection:
Kafka real-time ingest, introduce Storm/Spark Streaming for near-realtime interception of data, HBase for model/rule storage and lookup
D. Online model scoring:
JPMML/Spark Streaming for realtime model scoring