Pentaho Data Integration session delivered in November 2015 as part of the Big Data and Business Intelligence Programme at the University of Deusto (details here: http://bit.ly/1PhIVgJ).
Pentaho Data Integration 4.0 and MySQL - Ahmed Ennaji
This document provides an overview of Pentaho Data Integration (PDI) version 4 and its support for MySQL. It begins with an introduction to Pentaho as an open source business intelligence suite. It then discusses the key components and features of PDI, including extraction, transformation, loading, and support for over 35 database types. New features in version 4 are highlighted, such as improved visualization, logging, and plugin architecture. The document concludes with a section focused on MySQL support in PDI, including JDBC/ODBC integration and bulk loading jobs.
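To make the MySQL side concrete, here is a minimal sketch (not PDI itself) of the kind of bulk load that a PDI MySQL bulk-loading job wraps. It assumes the mysql-connector-python package is installed; the connection details, table name and CSV file are hypothetical placeholders.

```python
# A minimal sketch (not PDI itself) of the kind of MySQL bulk load a PDI
# bulk-loading job step wraps. Connection details, the table name and the
# CSV path are hypothetical placeholders.
import mysql.connector  # assumes mysql-connector-python is installed

conn = mysql.connector.connect(
    host="localhost", user="etl_user", password="secret",
    database="warehouse", allow_local_infile=True,
)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE 'sales_2015.csv'
    INTO TABLE staging_sales
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    IGNORE 1 LINES
""")
conn.commit()
cur.close()
conn.close()
```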
Pentaho Data Integration: preparing and blending data from any source for analytics, thus enabling data-driven decision making. Applications in education, especially academic and learning analytics.
The document discusses Kettle, an open source ETL tool from Pentaho. It provides an introduction to the ETL process and describes Kettle's major components: Spoon for designing transformations and jobs, Pan for executing transformations, and Kitchen for executing jobs. Transformations in Kettle perform tasks like data filtering, field manipulation, lookups and more. Jobs are used to call and sequence multiple transformations. The document also covers recent Kettle releases and how it can help address challenges in data integration projects.
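As a rough illustration of how Pan and Kitchen are driven outside of Spoon, the sketch below launches them from Python. It assumes a standard PDI client install; the install directory and the .ktr/.kjb paths are hypothetical.

```python
# A minimal sketch of driving Kettle's command-line runners from Python.
# The data-integration directory and the .ktr/.kjb paths are hypothetical.
import subprocess

PDI_HOME = "/opt/data-integration"   # hypothetical install location

# Pan executes a single transformation (.ktr)
subprocess.run(
    [f"{PDI_HOME}/pan.sh", "-file=/etl/load_sales.ktr", "-level=Basic"],
    check=True,
)

# Kitchen executes a job (.kjb), which can sequence several transformations
subprocess.run(
    [f"{PDI_HOME}/kitchen.sh", "-file=/etl/nightly_job.kjb", "-level=Basic"],
    check=True,
)
```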
Business Intelligence and Big Data Analytics with Pentaho - Uday Kothari
This webinar gives an overview of the Pentaho technology stack and then delves deep into its features, such as ETL, reporting, dashboards, analytics and Big Data. It also offers a cross-industry perspective on how Pentaho can be leveraged effectively for decision making, and highlights how, beyond its strong technological features, low TCO is central to Pentaho’s value proposition. For BI technology enthusiasts, this webinar presents one of the easiest ways to learn an end-to-end analytics tool. For those interested in developing a BI / analytics toolset for their organization, it presents an interesting option for leveraging low-cost technology. For big data enthusiasts, it presents an overview of how Pentaho has emerged as a leader in the data integration space for big data.
Pentaho is one of the leading niche players in Business Intelligence and Big Data Analytics. It offers a comprehensive, end-to-end open source platform for Data Integration and Business Analytics. Pentaho’s leading product, Pentaho Business Analytics, is a data integration, BI and analytics platform composed of ETL, OLAP, reporting, interactive dashboards, ad hoc analysis, data mining and predictive analytics.
Pentaho Data Integration (Kettle) is an open-source extract, transform, load (ETL) tool. It allows users to visually design data transformations and jobs to extract data from source systems, transform it, and load it into data warehouses. Kettle includes components like Spoon for designing transformations and jobs, Pan for executing transformations, and Carte for remote execution. It supports various databases and file formats through flexible components and transformations.
Pentaho | Data Integration & Report Designer - Hamdi Hmidi
Pentaho provides a suite of open source business intelligence tools for data integration, dashboarding, reporting, and data mining. It includes Pentaho Data Integration (Kettle) for ETL processes, Pentaho Dashboard for visualization dashboards, Pentaho Reporting for report generation, and incorporates Weka for data mining algorithms. Pentaho Report Designer is a visual report writer that allows querying data from various sources and generating reports in different formats like PDF, HTML, and Excel. It requires Java and involves downloading, unpacking, and installing the Pentaho reporting files.
The document is a 20-page comparison of ETL tools. It includes an introduction, descriptions of four ETL tools (Pentaho Kettle, Talend, Informatica PowerCenter, Inaplex Inaport), and a section comparing the tools on various criteria such as cost, ease of use, speed and data quality. The comparison chart suggests Informatica PowerCenter is the fastest and most full-featured tool, while open source options like Pentaho Kettle and Talend offer lower costs but require more manual configuration.
Pentaho Data Integration/Kettle is an open source ETL tool that the presenter has used for two years. It allows users to extract, transform and load data from various sources like databases, files and NoSQL stores into destinations like data warehouses. Some advantages of Kettle include its graphical user interface, large library of components, performance when processing large datasets, and ability to leverage Java libraries. The presenter demonstrates syncing and processing data between different sources using Kettle.
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle) - Roland Bouman
This document introduces Pentaho Data Integration (Kettle), an open source extract-transform-load (ETL) tool. It is part of Pentaho's full stack business intelligence platform. The document discusses Kettle's capabilities for extracting, transforming and loading data from various sources through jobs and transformations. It also provides an overview of Pentaho's community and resources for using and contributing to its open source software.
This document provides an introduction to the Pentaho business intelligence (BI) platform. It discusses what BI is and why organizations need it. It then describes Pentaho's suite of open source BI tools, including Pentaho Data Integration for ETL, the Pentaho Report Designer for reporting, and the Pentaho BA Server for analytics, dashboards, and administration. The document also presents a case study of how Lufthansa used Pentaho to create real-time dashboards for monitoring airline operations. Finally, it outlines the course curriculum for an Edureka training on Pentaho.
Pentaho is an open-source business intelligence (BI) suite that provides query and reporting, OLAP analysis, data integration, dashboards, and data mining capabilities. It was founded in 2004 and offers these features through an enterprise edition with professional support. The suite allows users to access and analyze data through tools like reporting, interactive analysis, data integration, and machine learning algorithms for pattern detection.
This document discusses using ETL Metadata Injection in Pentaho to dynamically load metadata at runtime. Specifically, it describes how to use a template transformation to load cost files with a dynamic header structure, where budget files contain 12 months of data but forecast files can contain between 1-12 months. It provides examples of injecting metadata into text file input and row normalization steps in the template using the Metadata Injection step, and how to programmatically inject metadata into steps that don't natively support injection using the Pentaho API.
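The sketch below is not the Pentaho Metadata Injection API; it only illustrates the underlying idea in plain Python: a single template routine whose column layout is supplied at runtime, so the same logic reads budget files with 12 month columns and forecast files with fewer. File paths and column names are hypothetical.

```python
# Plain-Python illustration of the metadata-injection idea (this is NOT the
# Pentaho API): one template routine, with the column layout injected at
# runtime, handles files whose header structure varies.
import csv

def read_cost_file(path, month_columns):
    """Template 'transformation': normalise a wide cost file into
    (cost_center, month, amount) rows, whatever months are present."""
    rows = []
    with open(path, newline="") as f:
        for record in csv.DictReader(f):
            for month in month_columns:          # injected metadata
                rows.append((record["cost_center"], month,
                             float(record[month])))
    return rows

# Budget files carry all 12 months; forecast files may carry fewer.
budget = read_cost_file("budget.csv", [f"M{i:02d}" for i in range(1, 13)])
forecast = read_cost_file("forecast.csv", ["M01", "M02", "M03"])
```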
Here is a case study that I developed to explain the different sets of functionality in the Pentaho Suite. I focused on the functionality, features, illustrative tools and key strengths, and provide guidance for evaluating BI tools when selecting vendors. Enjoy!
Pentaho is an open source business intelligence suite founded in 2004 that provides reporting, online analytical processing (OLAP) analysis, data integration, dashboards, and data mining capabilities. It can be downloaded for free from pentaho.com or sourceforge.net. Pentaho's commercial open source model eliminates licensing fees and provides annual subscription support and services. Key features include flexible reporting, a report designer, ad hoc reporting, security roles, OLAP analysis, ETL workflows, dashboard creation and alerts, and data mining algorithms.
Pentaho Analysis provides interactive data analysis capabilities through components like the BI server, client tools, and Mondrian OLAP engine. It allows users to analyze data warehouses across dimensions like time, product, and customer. Key functionality includes an interactive web interface, scheduling, and sharing analysis views. Mondrian in particular provides fast response times, automated aggregation, and support for any JDBC data source through its OLAP capabilities like drilling and slicing.
This document compares different ETL (extract, transform, load) tools. It begins with introductions to ETL tools in general and four specific tools: Pentaho Kettle, Talend, Informatica PowerCenter, and Inaplex Inaport. The document then compares the tools across various criteria like cost, ease of use, speed, and connectivity. It aims to help readers evaluate the tools for different use cases.
The document provides an overview of the ETL tool Informatica. It discusses that ETL stands for Extraction, Transformation, and Loading and is the process of extracting data from sources, transforming it, and loading it into a data warehouse or other target. It describes the key components of Informatica including the repository, client, server, transformations like filters and aggregators, and how mappings are used to move data from sources to targets. Finally, it provides examples of how to create simple mappings in Informatica Designer.
The document provides an overview of the Pentaho BI Suite presentation layer. It discusses interactive reports, which allow casual users to access data sources through drag and drop fields; Analyzer, designed for power users to access data sources and perform advanced sorting/filtering with charts; and Dashboards, which allow easy construction of layouts including dynamic filters and any Pentaho reports. The presentation is accompanied by screenshots demonstrating the creation and customization of reports and dashboards.
This document provides an overview of data warehousing and ETL concepts like OLTP vs OLAP, data warehouse architecture, and Informatica PowerCenter. It defines key terms, describes why organizations implement data warehouses to help with analytics and decision making, and outlines the typical layers of a data warehouse including the ETL process. The document also provides high-level information on Informatica PowerCenter's architecture and functionality for automating ETL jobs, and discusses some common errors and Unix commands for monitoring and managing Informatica services.
Talend Open Studio Introduction - OSSCamp 2014 - OSSCube
Talend Open Studio is the most open, innovative and powerful data integration solution on the market today. Talend Open Studio for Data Integration allows you to create ETL (extract, transform, load) jobs.
The document discusses Extract, Transform, Load (ETL) processes. It defines extract as reading data from a database, transform as converting extracted data into a form suitable for another database, and load as writing transformed data into the target database. It then lists several common ETL tools and databases they can connect to.
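For readers who prefer code to prose, here is a minimal, self-contained sketch of those three steps in Python, with sqlite3 standing in for both the source system and the target database; the table and column names are invented for the example.

```python
# A minimal sketch of the three ETL steps using sqlite3 as a stand-in for
# both the source system and the target warehouse. Table and column names
# are hypothetical.
import sqlite3

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, country TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 1250, "es"), (2, 990, "FR"), (3, 40000, "es")])
target.execute("CREATE TABLE fact_orders (id INTEGER, amount_eur REAL, country TEXT)")

# Extract: read from the source database
rows = source.execute("SELECT id, amount_cents, country FROM orders").fetchall()

# Transform: convert units and normalise codes for reporting
transformed = [(oid, cents / 100.0, country.upper()) for oid, cents, country in rows]

# Load: write into the target (warehouse) table
target.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)
target.commit()
print(target.execute("SELECT * FROM fact_orders").fetchall())
```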
Data scientists and machine learning practitioners nowadays seem to be churning out models by the dozen, and they continuously experiment to find ways to improve their accuracy. They also use a variety of ML and DL frameworks and languages, and a typical organization may find that this results in a heterogeneous, complicated collection of assets that require different types of runtimes, resources and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production"? How does an organization scale inference engines out and make them available for real-time applications without significant latencies? There need to be different techniques for batch (offline) inference and instant, online scoring. Data needs to be accessed from various sources, and cleansing and transformation of data need to be enabled prior to any predictions. In many cases, there may be no substitute for customized data handling with scripting either.
Enterprises also require additional auditing and authorization built in, plus approval processes, while still supporting a "continuous delivery" paradigm whereby a data scientist can enable insights faster. Not all models are created equal, nor are the consumers of a model, so enterprises require both metering and allocation of compute resources for SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes-based offering for the private cloud, optimized for the Hortonworks Data Platform. DSX essentially brings typical software engineering development practices to data science, organizing the dev -> test -> production flow for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor accuracy of, and even roll back models and custom scorers, as well as how API-based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker: Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
This document provides a summary of Maharshi Amin's professional experience and technical skills. He has over 10 years of experience developing software applications for the financial industry using technologies like Perl, Java, Sybase, and Informatica. His experience includes roles supporting trading, risk management, and regulatory reporting systems. He has strong skills in database design, application development, performance tuning, and leading development teams.
Any data source becomes an SQL query with all the power of Apache Spark. Querona is a virtual database that seamlessly connects any data source with Power BI, TARGIT, Qlik, Tableau, Microsoft Excel or others. It lets you build your own universal data model and share it among reporting tools.
Querona does not create another copy of your data, unless you want to accelerate your reports and use the built-in execution engine created for the purpose of Big Data analytics. Just write a standard SQL query and let Querona consolidate data on the fly, use one of its execution engines, and accelerate processing no matter what kind of sources you have or how many.
This document discusses ETL (extract, transform, load) processes using the open source tool Talend. It provides an overview of ETL and describes the extract, transform and load steps. It then outlines a tutorial showing how to set up Talend and run an ETL job that extracts data from a database, loads it into HDFS, runs a Hive query to analyze the data, and outputs the results to HBase.
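The sketch below is not Talend-generated code; it only approximates the same pipeline steps from Python using the standard Hadoop command-line clients. It assumes the hdfs and hive CLIs are installed and on PATH, and all paths, table names and the query itself are hypothetical.

```python
# A rough sketch (not Talend-generated code) of the same pipeline driven
# through the standard Hadoop command-line clients. Paths, table names and
# the query are hypothetical; the HBase export step is omitted here.
import subprocess

# Load an extracted file from the local filesystem into HDFS
subprocess.run(["hdfs", "dfs", "-put", "-f", "orders.csv", "/staging/orders.csv"],
               check=True)

# Run a Hive query over an external table defined on that HDFS location
subprocess.run(["hive", "-e",
                "INSERT OVERWRITE TABLE order_totals "
                "SELECT country, SUM(amount) FROM staging_orders GROUP BY country"],
               check=True)
```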
The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading.
This document provides a comparative study of various ETL (Extraction, Transformation, Loading) tools. It first discusses the ETL process and concepts. It then reviews some popular ETL tools, including Pentaho Data Integration, Talend Open Studio, Informatica Power Center, Oracle Warehouse Builder, IBM Information Server, and Microsoft SQL Server Integration Services. The document establishes criteria for comparing ETL tools, such as architecture, functionality, usability, reusability, connectivity, and interoperability. It then provides a comparative analysis and graphs ranking the tools based on their support for each criteria. The analysis finds that Informatica Power Center and Oracle Warehouse Builder provide strong architectural support, while Informatica and SQL Server Integration Services
This document discusses implementing Agile methodology for business intelligence (BI) projects. It begins by addressing common misconceptions about Agile BI, noting that it does not require specific tools or methodologies and can be applied using existing technologies. The document then examines extract, transform, load (ETL) tools and how some may not be well-suited for Agile due to issues like proprietary coding and lack of integration with version control and continuous integration practices. However, ETL tools can still be used when appropriate. The document provides recommendations for setting up an Agile BI environment, including using ETL tools judiciously and mitigating issues through practices like sandboxed development environments and test data sets to enable test-driven development.
Big data analytics beyond beer and diapers - Kai Zhao
This document discusses big data analytics. It begins with background on traditional business intelligence and defines big data in terms of volume, variety, and velocity. It then outlines the big data analytics technology stack, including ETL/ELT, MPP data warehouses, MapReduce, NoSQL, web services, data analytics tools, data visualization, and BI tools. Finally, it discusses big data analytics platform architectures.
ETL tools extract data from various sources, transform it for reporting and analysis, cleanse errors, and load it into a data warehouse. They save time and money compared to manual coding by automating this process. Popular open-source ETL tools include Pentaho Kettle and Talend, while Informatica is a leading commercial tool. A comparison found that Pentaho Kettle uses a graphical interface and standalone engine, has a large user community, and includes data quality features, while Talend generates code to run ETL jobs.
What are the various tools used in ETL testing? - ishansharma200107
In the dynamic field of data integration, proficiency in ETL testing and mastery of relevant tools are indispensable skills. The highlighted tools cover a broad spectrum of ETL testing needs, from performance testing to data validation and transformation. Technogeeks IT Institute in Pune stands as a beacon, providing comprehensive training on these tools and preparing professionals for the real-world challenges of ETL testing. By enrolling in Technogeeks IT Institute’s ETL testing courses, individuals gain practical experience and contribute to the seamless flow of data across diverse systems.
The process of data warehousing is undergoing rapid transformation, giving rise to various new terminologies, especially due to the shift from the traditional ETL to the new ELT. For someone new to the process, these additional terminologies and abbreviations might seem overwhelming; some may even ask, “Why does it matter if the L comes before the T?”
The answer lies in the infrastructure and the setup. Here is what the fuss is all about: the sequencing of the words and, more importantly, why you should be shifting from ETL to ELT.
The document compares ETL and ELT data integration processes. ETL extracts data from sources, transforms it, and loads it into a data warehouse. ELT loads extracted data directly into the data warehouse and performs transformations there. Key differences include that ETL is better for structured data and compliance, while ELT handles any size/type of data and transformations are more flexible but can slow queries. AWS Glue, Azure Data Factory, and SAP BODS are tools that support these processes.
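A small sketch of the ELT ordering, using sqlite3 as a stand-in warehouse: raw rows are loaded first, and the transformation runs afterwards as SQL inside the warehouse. Table and column names are hypothetical.

```python
# A small sketch of ELT with sqlite3 standing in for the warehouse: raw,
# untyped rows are loaded as-is, and the transform runs later in-warehouse
# as plain SQL. Table and column names are hypothetical.
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
wh.execute("CREATE TABLE clean_events (user_id TEXT, amount REAL)")

# Extract + Load: push raw data straight into the warehouse
wh.executemany("INSERT INTO raw_events VALUES (?, ?)",
               [("u1", "12.50"), ("u2", "bad value"), ("u3", "7.00")])

# Transform: done afterwards, in-warehouse, with SQL
wh.execute("""
    INSERT INTO clean_events
    SELECT user_id, CAST(amount AS REAL)
    FROM raw_events
    WHERE amount GLOB '[0-9]*'      -- discard rows that are not numeric
""")
wh.commit()
print(wh.execute("SELECT * FROM clean_events").fetchall())
```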
Shivaprasada Kodoth is seeking a position as an ETL Lead/Architect with experience in data warehousing and ETL. He has over 8 years of experience in data warehousing and Informatica design and development. He is proficient in technologies like Oracle, Teradata, SQL, and PL/SQL. Some of his key projects include developing ETL mappings and workflows for integrating various systems at Boehringer Ingelheim and UBS. He is looking for opportunities in Bangalore, Mangalore, Cochin, Europe, the USA, Australia, or Singapore.
Airbyte @ Airflow Summit - The new modern data stack - Michel Tricot
The document introduces the modern data stack of Airbyte, Airflow, and dbt. It discusses how ELT addresses issues with traditional ETL processes by separating extraction, loading, and transformation. Extraction and loading involve general-purpose routines to pull and push raw data, while transformation uses business logic specific to the organization. The stack is presented as an open solution that allows composing with best of breed tools for each part of the data pipeline. Airbyte provides data integration, dbt enables data transformation with SQL, and Airflow handles scheduling. The demo shows how these tools can be combined to build a flexible, autonomous, and future proof modern data stack.
This document is a curriculum vitae for Rajeswari Pothala. It outlines her professional experience working for Tata Consultancy Services for over 6 years leading teams of up to 8 members on data warehousing and ETL development projects. It also lists her educational qualifications including a B.Tech in Electronics and Communication Engineering. Key projects outlined include work on the TCS Trimatrix EDW project and several projects for Aviva involving data integration, mappings development, and module lead responsibilities.
The document is a presentation about Oracle Beehive, Oracle's unified collaboration platform. It provides an overview of Beehive's features and services, how it compares to Oracle's previous collaboration product Oracle Collaboration Suite, and real-world use cases and scenarios for how Beehive enables collaboration. The presentation also discusses some potential disadvantages of Beehive such as Oracle's lack of market share in the collaboration space and skepticism from analysts about Beehive's chances of success.
ETL extracts raw data from sources, transforms it on a separate server, and loads it into a target database. ELT loads raw data directly into a data warehouse, where data cleansing, enrichment, and transformations occur. While ETL has been used longer and has more supporting tools, ELT allows for faster queries, greater flexibility, and takes advantage of cloud data warehouse capabilities by performing transformations within the warehouse. However, ELT can present greater security risks and increased latency compared to ETL.
Gowthami S is a software developer and designer with over 2 years of experience in data warehousing using databases like Teradata and Oracle. She has extensive experience with ETL tools like Informatica and data loading utilities for Teradata. She has worked on full data warehouse development lifecycles including requirements, design, implementation and maintenance. She currently works as a software engineer at Tech Mahindra, where her projects include developing ETL processes and Teradata SQL queries to load and transform data from various sources into a Cisco enterprise data warehouse supporting business intelligence reporting and analytics.
This document discusses modern data pipelines and compares Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT) approaches. ETL extracts data from sources, transforms it, and loads it into a data warehouse. ELT extracts and loads raw data directly into a data warehouse or data lake and then transforms it. The document argues that ELT is better suited for modern data as it handles both structured and unstructured data, supports cloud-based data warehouses/lakes, and is more efficient and cost-effective than ETL. Key advantages of ELT over ETL include lower costs, faster data loading, and better support for large, heterogeneous datasets.
[DSC DACH 24] Automatic ETL Migration - on-prem to cloud and more - Miljenko ... - DataScienceConferenc1
Nowadays, Business Intelligence is a mature technology, and technological migrations are more frequent than one might imagine. Regardless of whether we are moving to the cloud or just re-engineering old integration jobs, rewriting ETL from scratch in a new technology might be looming over our heads. As one might imagine, doing this manually can be an absolutely colossal task: hundreds if not thousands of mutually interdependent jobs, written by somebody else years ago, often badly. Knowing what a job is actually supposed to do is more often an exception than a rule. For this purpose, we can use tools for automatic ETL migration from one technology to another. They offer obvious benefits, but also have limitations. In this lecture, we will dive deeply into the world of ETL migration. When to do it automatically, and when to do it by hand? What are the threats on our journey? Last but not least, we shall provide some real-world experiences and examples: success stories, perhaps an epic failure or two. There is only one way to find out - attend the DSC DACH 24 Conference!
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIG DATA - csandit
In this paper we investigate the problem of providing scalability to the near-real-time ETL+Q (Extract, transform, load and querying) process of data warehouses. In general, data loading, transformation and integration are heavy tasks that are performed only periodically during small fixed time windows.
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIG DATA - cscpconf
In this paper we investigate the problem of providing scalability to the near-real-time ETL+Q (Extract, transform, load and querying) process of data warehouses. In general, data loading, transformation and integration are heavy tasks that are performed only periodically during small fixed time windows. We propose an approach to enable the automatic scalability and freshness of any data warehouse and ETL+Q process for near-real-time Big Data scenarios. A general framework for testing the proposed system was implemented, supporting parallelization solutions for each part of the ETL+Q pipeline. The results show that the proposed system is capable of handling scalability to provide the desired processing speed.
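As a generic illustration (not the framework proposed in the paper), the sketch below scales out the transformation stage by partitioning incoming rows across worker processes; the transform itself is a placeholder rule.

```python
# A generic illustration of parallelizing the transform stage by spreading
# rows across worker processes. The transform is a hypothetical placeholder,
# not the paper's framework.
from multiprocessing import Pool

def transform(row):
    user_id, amount_cents = row
    return (user_id, amount_cents / 100.0)   # placeholder business rule

def run_parallel_transform(rows, workers=4):
    with Pool(processes=workers) as pool:
        # chunksize keeps per-task overhead low for high-volume streams
        return pool.map(transform, rows, chunksize=1000)

if __name__ == "__main__":
    incoming = [(f"u{i}", i * 10) for i in range(10_000)]
    print(len(run_parallel_transform(incoming)))
```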
ETL involves extracting data from source databases, transforming it to fit the structure of target databases, and loading it into those targets. Talend is an open source tool that can be used to perform ETL processes. It allows connecting to various data sources and targets, has built-in components for common ETL tasks like string manipulation and slowly changing dimensions, and supports exporting data to formats like SQL, MySQL, and file types. Setting up Talend requires downloading the open studio software, virtualization software, and a sandbox VM, then configuring a workspace with components, repositories, and palettes.
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle) - Rittman Analytics
Oracle Data Integration Platform is a cornerstone for big data solutions that provides five core capabilities: business continuity, data movement, data transformation, data governance, and streaming data handling. It includes eight core products that can operate in the cloud or on-premise, and is considered the most innovative in areas like real-time/streaming integration and extract-load-transform capabilities with big data technologies. The platform offers a comprehensive architecture covering key areas like data ingestion, preparation, streaming integration, parallel connectivity, and governance.
To Study ETL (Extract, Transform, Load) Tools, Especially SQL Server Integration Services - Shahzad
This document discusses SQL Server Integration Services (SSIS), an extract, transform, load (ETL) tool from Microsoft. It provides an overview of what ETL is and the typical steps involved, including extracting data from sources, transforming the data, and loading it into a destination. The document then describes the key components of SSIS, such as the SSIS designer, runtime engine, tasks, data flow engine, API/object model, and packages. It also lists some common uses of SSIS, such as merging data, populating data warehouses, cleaning data, and automating data loads.
Big Data in sales management: market(ing) intelligence - Alex Rayón Jerez
A session in which, using the case method, we looked at different applications of data analysis to the world of sales management. Part of the Expert Programme in Sales Management at Deusto Business School.
Big Data tools and methodologies for accessing unstructured data - Alex Rayón Jerez
Talk "Big Data tools and methodologies for accessing unstructured data" given at the conference "Research to Improve Care Appropriateness", a healthcare forum interested in applying Big Data technologies and methodologies to extract knowledge from unstructured data.
Digital competences as a method for observing generic competences - Alex Rayón Jerez
Talk "Digital competences as a method for observing generic competences" given on 21 April 2016 at Innobasque, Zamudio, Bizkaia, as part of the "Brunch & Learn" sessions organised by Innobasque: a day devoted to professional and digital competences, what they contribute to business, and what they really consist of, with much discussion of their importance in the 21st century.
Talk "Big Data in my company: what is it good for?" given in Donostia - San Sebastián on 20 April 2016 at the "Big Data for SMEs" event. I talk about the Big Data profile and its competences, as well as its usefulness for companies.
Applying Big Data to improve business competitiveness - Alex Rayón Jerez
Talk "Applying Big Data to improve business competitiveness" held on 21 March 2016 in Palma de Mallorca, at the University of the Balearic Islands. The aim was to glimpse the opportunities Big Data opens up in the context of companies and their competitiveness.
Social Network Analysis and Text Mining - Alex Rayón Jerez
Slides for the session "Social Network Analysis and Text Mining", part of the Executive Programme in Big Data and Business Intelligence held in Madrid in February 2016 at our University of Deusto campus.
Marketing intelligence with an omnichannel strategy and the Customer Journey - Alex Rayón Jerez
The document presents a marketing intelligence and business intelligence programme that uses Big Data. It describes omnichannel marketing strategies and customer journey analysis to better understand customers, and discusses the use of data to segment customers, predict behaviour, personalise offers and measure marketing ROI.
This document describes propensity models and their use in data analysis. It explains that propensity models estimate the probability that a customer will take an action such as buying a product, churning or defaulting. It then discusses techniques such as decision trees, neural networks and logistic regression that can be used to build these predictive models. Finally, it presents application cases such as customer churn detection and price sensitivity.
Slides for the session "Customer Lifetime Value Management with Big Data", part of the Executive Programme in Big Data and Business Intelligence held in Madrid in February 2016 at our University of Deusto campus.
Slides for the session "Big Data: the Management Revolution", part of the Executive Programme in Big Data and Business Intelligence held in Madrid in February 2016 at our University of Deusto campus.
Slides for the session "Process optimisation with Big Data", part of the Executive Programme in Big Data and Business Intelligence held in Madrid in February 2016 at our University of Deusto campus.
The data economy: transforming sectors, generating opportunities - Alex Rayón Jerez
Talk "The data economy: transforming sectors, generating opportunities" prepared for the first Databeers Euskadi, promoted and organised by Decidata (www.decidata.es), discussing the challenges and opportunities this era of data has brought.
How to grow and become more efficient and competitive through Big Data - Alex Rayón Jerez
Talk "How to grow and become more efficient and competitive through Big Data" given at the 14th HORECA Congress of AECOC (the Spanish Commercial Coding Association), on applying Big Data to the HORECA channel.
The power of data: towards an intelligent, yet ethical, society - Alex Rayón Jerez
Lectio Brevis by Professor Alex Rayón of the Faculty of Engineering on the power data has acquired in this era, what has come to be known as Big Data: an area that also poses the legal and ethical challenges set out in the text.
Searching for, organising and presenting learning resources - Alex Rayón Jerez
Internal training course "Searching for, organising and presenting learning resources" at the University of Deusto: how to search for, organise and present learning resources for later use in educational contexts.
Deusto Knowledge Hub as a tool for publishing and discovering knowledge - Alex Rayón Jerez
Internal training course "Google Calendar for planning my course with my students" at the University of Deusto: how the Deusto Knowledge Hub repository helps me day to day as a tool for publishing and discovering knowledge.
Fostering collaboration in the classroom through social tools - Alex Rayón Jerez
Internal training course "Fostering collaboration in the classroom through social tools" at the University of Deusto: social tools for fostering collaboration in the classroom between teacher and students.
Using Google Drive and Google Docs in the classroom to work with my students - Alex Rayón Jerez
Internal training course "Using Google Drive and Google Docs in the classroom to work with my students" at the University of Deusto: how to use Google Drive and Docs to work with my students in the classroom.
Data processing and visualisation to generate new knowledge - Alex Rayón Jerez
Internal training course "Data processing and visualisation to generate new knowledge" at the University of Deusto: processing data on a small, precise scale (Smart Data) to improve my day-to-day work at the university.
Big Data and Business Intelligence in my company: what is it good for? - Alex Rayón Jerez
Talk "Big Data and Business Intelligence in my company: what is it good for?" given in Medellín, Colombia, in September 2015. A session aimed at companies, so that they can learn about the possibilities Big Data opens up for their day-to-day work.
2. Before starting….
Who has used a relational database?
Source: http://www.agiledata.org/essays/databaseTesting.html
3. Before starting…. (II)
Who has written scripts or Java code to move data from one source and load it to another?
Source: http://www.theguardian.com/teacher-network/2012/jan/10/how-to-teach-code
9. Pentaho at a glance (III)
Business Intelligence & Analytics
Open Core
GPL v2
Apache 2.0
Enterprise and OEM licenses
Java-based
Web front-ends
10. Pentaho at a glance (IV)
The Pentaho Stack
Data Integration / ETL
Big Data / NoSQL
Data Modeling
Reporting
OLAP / Analysis
Data Visualization
Source: http://helicaltech.com/blogs/hire-pentaho-consultants-hire-pentaho-developers/
11. Pentaho at a glance (V)
Modules
Pentaho Data Integration (Kettle)
Pentaho Analysis (Mondrian)
Pentaho Reporting
Pentaho Dashboards
12. Pentaho at a glance (VI)
Figures
+ 10.000 deployments
+ 185 countries
+ 1.200 customers
Since 2012, in Gartner Magic Quadrant for BI Platforms
1 download / 30
22. Table of Contents
Pentaho at a glance
In the academic field
ETL
Kettle
Big Data
Predictive Analytics
23. ETL
Definition and characteristics
An ETL tool is a tool that
Extracts data from various data sources (usually legacy data)
Transforms data
from → being optimized for transactions
to → being optimized for reporting and analysis
synchronizes the data coming from different databases
cleanses the data to remove errors
Loads data into a data warehouse
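A tiny sketch of the cleansing and synchronizing part of the Transform step, in plain Python: trim and normalise fields and drop erroneous or duplicate records coming from different sources before the load. The record layout is hypothetical.

```python
# A tiny sketch of the cleansing part of the Transform step: trim, normalise
# and de-duplicate records coming from different source databases before
# they are loaded. The field layout is hypothetical.
def cleanse(records):
    seen, clean = set(), []
    for customer_id, email, country in records:
        email = email.strip().lower()
        country = country.strip().upper() or "UNKNOWN"
        if not email or customer_id in seen:      # drop errors and duplicates
            continue
        seen.add(customer_id)
        clean.append((customer_id, email, country))
    return clean

raw = [(1, " [email protected] ", "es"), (1, "[email protected]", "ES"), (2, "", "fr")]
print(cleanse(raw))
```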
24. ETL
Why do I need it?
ETL tools save time and money when developing a data warehouse by removing the need for hand-coding
It is very difficult for database administrators to connect between different brands of databases without using an external tool
In the event that databases are altered or new databases need to be integrated, a lot of hand-coded work needs to be completely redone
25. ETL
Business Intelligence
ETL is the heart and soul of business intelligence (BI)
ETL processes bring together and combine data from multiple source systems into a data warehouse
Source: http://datawarehouseujap.blogspot.com.es/2010/08/data-warehouse.html
26. ETL
Business Intelligence (II)
According to most practitioners, ETL design and development work consumes 60 to 80 percent of an entire BI project
Source: http://www.dwuser.com/news/tag/optimization/
Source: The Data Warehousing Institute. www.dw-institute.com
30. ETL
CloverETL
Creates a basic archive of functions for mapping and transformations, allowing companies to move large amounts of data as quickly and efficiently as possible
Uses building blocks called components to create a transformation graph, which is a visual depiction of the intended data flow
31. ETL
CloverETL (II)
The graphical presentation simplifies even complex data transformations, allowing for drag-and-drop functionality
Limited to approximately 40 different components to simplify graph creation
Yet each component can be configured to meet specific needs
It also features extensive debugging capabilities to ensure all transformation graphs work
32. ETL
KETL
Contains a scalable, platform-independent engine capable of supporting multiple computers and 64-bit servers
The program also offers performance monitoring, extensive data source support, XML compatibility and a scheduling engine for time-based and event-driven job execution
33. ETL
Kettle
The Pentaho company produced Kettle as an open source alternative to commercial ETL software
No relation to Kinetic Networks' KETL
Kettle features a drag-and-drop graphical environment with progress feedback for all data transactions, including automatic documentation of executed jobs
XML Input Stream to handle huge XML files without suffering a loss in performance or a spike in memory usage
Users can also upgrade the free Kettle version to a commercially supported edition
34. ETL
Talend
Provides a graphical environment for data integration, migration and synchronization
Drag-and-drop graphic components to create the Java code required to execute the desired task, saving time and effort
Pre-built connectors enable compatibility with a wide range of business systems and databases
Users gain real-time access to corporate data, allowing for the monitoring and debugging of transactions to ensure smooth data integration
35. ETL
Comparison
The set of criteria used for the ETL tools comparison was divided into seven categories:
TCO
Risk
Ease of use
Support
Deployment
Speed
37. ETL
Comparison (III)
Total Cost of Ownership
The overall cost for a certain product
This can mean initial ordering, licensing, servicing, support, training, consulting, and any other additional payments that need to be made before the product is in full use
Commercial Open Source products are typically free to use, but the support, training and consulting are what companies need to pay for
38. ETL
Comparison (IV)
Risk
There are always risks with projects, especially big projects
The risks for projects failing are:
Going over budget
Going over schedule
Not meeting the requirements or expectations of the customers
Open Source products carry much lower risk than commercial ones, since they do not restrict the use of their products with pricey licenses
39. ETL
Comparison (V)
Ease of use
All of the ETL tools, apart from Inaport, have a GUI to simplify the development process
Having a good GUI also reduces the time needed to learn and use the tools
Pentaho Kettle has the easiest-to-use GUI of all the tools
Training can also be found online or within the community
40. ETL
Comparison (VI)
Support
Nowadays, all software products have support and all of the
ETL tool providers offer support
Pentaho Kettle – Offers support from US, UK and has a
partner consultant in Hong Kong
Deployment
Pentaho Kettle is a stand-alone java engine that can run on
any machine that can run java. Needs an external
scheduler to run automatically.
It can be deployed on many different machines and used as40
41. ETL
Comparison (VII)
Speed
The speed of ETL tools depends largely on the data that
needs to be transferred over the network and the
processing power involved in transforming the data.
Pentaho Kettle is faster than Talend, but the Java-connector
slows it down somewhat. Also requires manual tweaking
like Talend. Can be clustered by placed on many machines
to reduce network traffic
41
42. ETL
Comparison (VIII)
Data Quality
Data Quality is fast becoming the most important feature in any data integration tool
Pentaho – has DQ features in its GUI and allows for customized SQL statements, as well as the use of JavaScript and Regular Expressions. Some additional modules are available with a subscription
Monitoring
Pentaho Kettle – has practical monitoring tools and logging
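To illustrate the kind of rule such regular-expression checks express, here is a generic Java sketch of a data-quality validation (the field values and the e-mail pattern are hypothetical examples, shown in plain Java rather than Kettle's own JavaScript scripting step):

import java.util.List;
import java.util.regex.Pattern;

public class EmailQualityCheck {
    // Hypothetical rule: accept only rows whose e-mail field matches a simple pattern
    private static final Pattern EMAIL = Pattern.compile("^[\\w.+-]+@[\\w-]+\\.[\\w.-]+$");

    public static void main(String[] args) {
        List<String> rows = List.of("ana@example.com", "not-an-email", "bob@deusto.es");
        for (String value : rows) {
            boolean valid = EMAIL.matcher(value).matches();   // flag rows that fail the rule
            System.out.println(value + " -> " + (valid ? "OK" : "reject"));
        }
    }
}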
43. ETL
Comparison (IX)
Connectivity
In most cases, ETL tools transfer data from legacy systems, so their connectivity is very important to their usefulness
Kettle can connect to a very wide variety of databases, flat files, XML files, Excel files and web services
44. Table of Contents
Pentaho at a glance
In the academic field
ETL
Kettle
Big Data
Predictive Analytics
46. Kettle
Introduction (II)
What is Kettle?
A batch data integration and processing tool written in Java
Exists to retrieve, process and load data
PDI (Pentaho Data Integration) is a synonymous term
Source: http://www.dreamstime.com/stock-photo-very-old-kettle-isolated-image16622230
47. Kettle
Introduction (III)
It uses an innovative metadata-driven approach
It has a very easy-to-use GUI
Strong community of 13,500 registered users
It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files
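Because that engine is also available as a Java library, a transformation designed in Spoon can be launched from plain Java code. A minimal sketch, assuming the PDI/Kettle jars are on the classpath and a hypothetical sample.ktr file:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                       // initialize the Kettle engine
        TransMeta meta = new TransMeta("sample.ktr");   // transformation designed in Spoon (hypothetical file)
        Trans trans = new Trans(meta);
        trans.execute(null);                            // no extra command-line arguments
        trans.waitUntilFinished();
        if (trans.getErrors() > 0) {
            throw new IllegalStateException("Transformation finished with errors");
        }
    }
}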
52. Kettle
Data Integration
Changing input into the desired output
Jobs
Synchronous workflow of job entries (tasks)
Transformations
Stepwise parallel & asynchronous processing of a record stream
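Jobs can be driven from Java just like transformations; a minimal sketch, again assuming the Kettle jars are on the classpath and a hypothetical nightly_load.kjb file saved from Spoon:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.Result;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class RunJob {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        JobMeta jobMeta = new JobMeta("nightly_load.kjb", null);  // hypothetical job file, no repository
        Job job = new Job(null, jobMeta);
        job.start();                                              // job entries run as a synchronous workflow
        job.waitUntilFinished();
        Result result = job.getResult();
        if (result.getNrErrors() > 0) {
            throw new IllegalStateException("Job finished with errors");
        }
    }
}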
53. Kettle
Data Integration challenges
Data is everywhere
Data is inconsistent
Records are different in each system
Performance issues
Running queries that summarize data over long periods ties up the operating system
Brings the OS to maximum load
54. Kettle
Transformations
String and Date Manipulation
Data Validation / Business Rules
Lookup / Join
Calculation, Statistics
Cryptography
Decisions, Flow control
55. Kettle
What is it good for?
Mirroring data from master to slave
Syncing two data sources
Processing data retrieved from multiple sources and pushed to multiple destinations
Loading data into an RDBMS
Datamart / Datawarehouse
65. Table of Contents
Pentaho at a glance
In the academic field
ETL
Kettle
Big Data
Predictive Analytics
66. Big Data
Business Intelligence
Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)
A brief (BI) history….
67. Big Data
WEKA
Project Weka
A comprehensive set of tools for Machine Learning and Data Mining
Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)
68. Big Data
Among Pentaho’s products
Mondrian
OLAP server written in Java
Kettle
ETL tool
Weka
Machine learning and Data Mining tool
69. Big Data
WEKA platform
WEKA (Waikato Environment for Knowledge Analysis)
Funded by the New Zealand Government (for more than 10 years)
Develop an open-source, state-of-the-art workbench of data mining tools
Explore fielded applications
Develop new fundamental methods
Became part of the Pentaho platform in 2006 (PDM – Pentaho Data Mining)
70. Big Data
Data Mining with WEKA
(One-of-the-many) Definition: Extraction of implicit, previously unknown, and potentially useful information from data
Goal: improve marketing, sales and customer support operations, risk assessment, etc.
Who is likely to remain a loyal customer?
What products should be marketed to which prospects?
What determines whether a person will respond to a certain offer?
71. Big Data
Data Mining with WEKA (II)
Central idea: historical data contains information that will be useful in the future (patterns → generalizations)
Data Mining employs a set of algorithms that automatically detect patterns and regularities in data
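As a concrete example of letting an algorithm find such patterns, a minimal Weka sketch that trains a decision tree on historical records; it assumes the Weka jar is on the classpath, and customers.arff is a hypothetical labelled dataset whose last attribute is the outcome to predict (e.g. loyal / not loyal):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CustomerModel {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("customers.arff");   // historical, labelled records (hypothetical file)
        data.setClassIndex(data.numAttributes() - 1);          // last attribute = outcome to predict
        J48 tree = new J48();                                  // C4.5-style decision tree learner
        tree.buildClassifier(data);                            // detect patterns and regularities
        System.out.println(tree);                              // the induced rules, i.e. the generalization
    }
}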
72. Big Data
Data Mining with WEKA (III)
A bank’s case as an example
Problem: prediction (probability score) of a corporate customer delinquency (or default) in the next year
Customer historical data used include:
Customer footings behavior (assets & liabilities)
Customer delinquencies (rates and time data)
Business sector behavioral data
73. Big Data
Data Mining with WEKA (IV)
Variable selection using the Information Value (IV) criterion
Automatic binning of continuous variables was used (Chi-merge); manual corrections were made to address particularities in the data distribution of some variables (again using IV)
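For reference, the IV criterion is commonly computed per binned variable as IV = Σ over bins of (share of goods − share of bads) × ln(share of goods / share of bads). A small Java sketch of that standard formula (the bin counts are made-up illustrative numbers, not the bank's data):

public class InformationValue {
    // Standard IV formula over pre-computed bins of one variable
    static double iv(long[] goodPerBin, long[] badPerBin) {
        long goodTotal = 0, badTotal = 0;
        for (int i = 0; i < goodPerBin.length; i++) {
            goodTotal += goodPerBin[i];
            badTotal += badPerBin[i];
        }
        double iv = 0.0;
        for (int i = 0; i < goodPerBin.length; i++) {
            double distGood = (double) goodPerBin[i] / goodTotal;
            double distBad  = (double) badPerBin[i] / badTotal;
            iv += (distGood - distBad) * Math.log(distGood / distBad);   // difference × weight of evidence
        }
        return iv;
    }

    public static void main(String[] args) {
        long[] good = {400, 300, 200, 100};   // non-delinquent customers per bin (illustrative)
        long[] bad  = {20, 30, 60, 90};       // delinquent customers per bin (illustrative)
        System.out.printf("IV = %.3f%n", iv(good, bad));
    }
}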
76. Big Data
Data Mining with WEKA (VII)
Limitations
Traditional algorithms need to have all data in (main) memory
Big datasets are an issue
Solution
Incremental schemes
Stream algorithms
MOA (Massive Online Analysis)
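Weka's updateable classifiers illustrate the incremental idea: rows are read and learned one at a time rather than loading the whole dataset into memory. A minimal sketch, assuming the Weka jar is on the classpath and a hypothetical large big.arff file:

import java.io.File;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class IncrementalTraining {
    public static void main(String[] args) throws Exception {
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("big.arff"));                 // hypothetical large dataset
        Instances header = loader.getStructure();             // header only, no rows in memory yet
        header.setClassIndex(header.numAttributes() - 1);

        NaiveBayesUpdateable model = new NaiveBayesUpdateable();
        model.buildClassifier(header);                        // initialize with the structure
        Instance row;
        while ((row = loader.getNextInstance(header)) != null) {
            model.updateClassifier(row);                      // learn one instance at a time
        }
        System.out.println(model);
    }
}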
80. Predictive analytics
Unified solution for Big Data Analytics (II)
Current release: Pentaho Business Analytics Suite 4.8
Instant and interactive data discovery for iPad
● Full analytical power on the go – unique to Pentaho
● Mobile-optimized user interface
81. Predictive analytics
Unified solution for Big Data Analytics (III)
Current release: Pentaho Business Analytics Suite 4.8
Instant and interactive data discovery and development for big data
● Broadens big data access to data analysts
● Removes the need for separate big data visualization tools
● Further improves productivity for big data developers
82. Predictive analytics
Unified solution for Big Data Analytics (IV)
Pentaho Instaview
● Instaview is simple
○ Created for data analysts
○ Dramatically simplifies ways to access Hadoop and NoSQL data stores
● Instaview is instant & interactive
○ Time accelerator – 3 quick steps from data to analytics
○ Interact with big data sources – group, sort, aggregate & visualize
● Instaview is big data analytics
○ Marketing analysis for weblog data in Hadoop
○ Application log analysis for data in MongoDB
85. Copyright (c) 2015 University of Deusto
This work (except the quoted images, whose rights are reserved to their owners*) is licensed under the Creative Commons “Attribution-ShareAlike” License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/
Alex Rayón
November 2015