Overview of Big Data, Hadoop and Microsoft BI - version 1 - Thanh Nguyen
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
This document summarizes Andrew Brust's presentation on using the Microsoft platform for big data. It discusses Hadoop and HDInsight, MapReduce, using Hive with ODBC and the BI stack. It also covers Hekaton, NoSQL, SQL Server Parallel Data Warehouse, and PolyBase. The presentation includes demos of HDInsight, MapReduce, and using Hive with the BI stack.
Big Data raises challenges about how to process such a vast pool of raw data and how to turn it into value for our lives. To address these demands, an ecosystem of tools named Hadoop was conceived.
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
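To make the word count example mentioned above concrete, here is a minimal sketch of a MapReduce word count written for Hadoop Streaming in Python. The file names and the tab-separated output format are conventions assumed for this illustration, not details taken from the slides.

    # mapper.py - hypothetical Hadoop Streaming mapper: emits one "word<TAB>1" pair per word
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - hypothetical Hadoop Streaming reducer: input arrives sorted by key,
    # so counts for the same word are adjacent and can be summed in a single pass
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")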
Top Hadoop Big Data Interview Questions and Answers for Freshers - JanBask Training
The document discusses big data and Hadoop. It describes the three V's of big data - variety, volume, and velocity. It also discusses Hadoop components like HDFS, MapReduce, Pig, Hive, and YARN. Hadoop is a framework for storing and processing large datasets in a distributed computing environment, making it possible to store and use all types of data at scale on commodity hardware.
This document provides an overview of big data concepts, including NoSQL databases, batch and real-time data processing frameworks, and analytical querying tools. It discusses scalability challenges with traditional SQL databases and introduces horizontal scaling with NoSQL systems like key-value, document, column, and graph stores. MapReduce and Hadoop are described for batch processing, while Storm is presented for real-time processing. Hive and Pig are summarized as tools for running analytical queries over large datasets.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
The document discusses and compares MapReduce and relational database management systems (RDBMS) for large-scale data processing. It describes several hybrid approaches that attempt to combine the scalability of MapReduce with the query optimization and efficiency of parallel RDBMS. HadoopDB is highlighted as a system that uses Hadoop for communication and data distribution across nodes running PostgreSQL for query execution. Performance evaluations show hybrid systems can outperform pure MapReduce but may still lag specialized parallel databases.
The document provides an introduction to NoSQL and HBase. It discusses what NoSQL is, the different types of NoSQL databases, and compares NoSQL to SQL databases. It then focuses on HBase, describing its architecture and components such as the HMaster, RegionServers, and ZooKeeper. It explains how HBase stores and retrieves data, including the write process involving memstores and compaction. It also covers HBase shell commands for creating, inserting, querying and deleting data.
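As a rough illustration of the HBase operations described above (insert, query, scan, delete), here is a hedged Python sketch using the happybase client over the HBase Thrift gateway. The host name, the table name "users", and the column family "cf" are assumptions made for this example, not values from the document.

    # Hypothetical example: basic HBase put/get/scan/delete via happybase (assumes a running Thrift gateway)
    import happybase

    connection = happybase.Connection("hbase-thrift-host")   # assumed host name
    table = connection.table("users")                        # assumed existing table with column family 'cf'

    # insert: writes land in the RegionServer's memstore before being flushed and compacted into files
    table.put(b"row-001", {b"cf:name": b"Alice", b"cf:city": b"Hanoi"})

    # read a single row, then scan a range of rows
    print(table.row(b"row-001"))
    for key, data in table.scan(row_prefix=b"row-"):
        print(key, data)

    # delete the row and close the connection
    table.delete(b"row-001")
    connection.close()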
The document provides an overview of Big Data technology landscape, specifically focusing on NoSQL databases and Hadoop. It defines NoSQL as a non-relational database used for dealing with big data. It describes four main types of NoSQL databases - key-value stores, document databases, column-oriented databases, and graph databases - and provides examples of databases that fall under each type. It also discusses why NoSQL and Hadoop are useful technologies for storing and processing big data, how they work, and how companies are using them.
The document provides an overview of Apache Hadoop and related big data technologies. It discusses Hadoop components like HDFS for storage, MapReduce for processing, and HBase for columnar storage. It also covers related projects like Hive for SQL queries, ZooKeeper for coordination, and Hortonworks and Cloudera distributions.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop components" will provide you with detailed knowledge about the top Hadoop Components and it will help you understand the different categories of Hadoop Components. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
The document summarizes a technical seminar on Hadoop. It discusses Hadoop's history and origin, how it was developed from Google's distributed systems, and how it provides an open-source framework for distributed storage and processing of large datasets. It also summarizes key aspects of Hadoop including HDFS, MapReduce, HBase, Pig, Hive and YARN, and how they address challenges of big data analytics. The seminar provides an overview of Hadoop's architecture and ecosystem and how it can effectively process large datasets measured in petabytes.
The document provides information about Hadoop, its core components, and MapReduce programming model. It defines Hadoop as an open source software framework used for distributed storage and processing of large datasets. It describes the main Hadoop components like HDFS, NameNode, DataNode, JobTracker and Secondary NameNode. It also explains MapReduce as a programming model used for distributed processing of big data across clusters.
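For readers who want to see what interacting with the components above looks like in practice, below is a small, hedged sketch that drives the standard hdfs command-line client from Python. The paths are placeholders, and it assumes a Hadoop client is installed and configured on the machine; none of this comes from the document itself.

    # Hypothetical example: basic HDFS operations through the 'hdfs dfs' CLI
    import subprocess

    def hdfs(*args):
        """Run an 'hdfs dfs' subcommand and fail loudly if it returns a non-zero exit code."""
        subprocess.run(["hdfs", "dfs", *args], check=True)

    hdfs("-mkdir", "-p", "/data/input")             # create a directory on HDFS (NameNode records the metadata)
    hdfs("-put", "local_logs.txt", "/data/input/")  # copy a local file into HDFS (blocks replicated across DataNodes)
    hdfs("-ls", "/data/input")                      # list the directory to confirm the upload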
Big Data Architecture Workshop - Vahid Amiri - datastack
Big Data Architecture Workshop
This slide deck is about big data tools, technologies and layers that can be used in enterprise solutions.
TopHPC Conference
2019
This document provides an overview of 4 solutions for processing big data using Hadoop and compares them. Solution 1 involves using core Hadoop processing without data staging or movement. Solution 2 uses BI tools to analyze Hadoop data after a single CSV transformation. Solution 3 creates a data warehouse in Hadoop after a single transformation. Solution 4 implements a traditional data warehouse. The solutions are then compared based on benefits like cloud readiness, parallel processing, and investment required. The document also includes steps for installing a Hadoop cluster and running sample MapReduce jobs and Excel processing.
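As a companion to the installation and sample-job steps mentioned above, here is a hedged sketch of submitting a Hadoop Streaming job from Python. The jar path, the HDFS paths, and the mapper.py/reducer.py scripts (from the word count sketch earlier) are assumptions for illustration, not values from the document.

    # Hypothetical example: launching a Hadoop Streaming job with Python mapper/reducer scripts
    import subprocess

    subprocess.run([
        "hadoop", "jar", "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar",  # assumed jar location
        "-files", "mapper.py,reducer.py",    # ship the scripts to the cluster nodes
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-input", "/data/input",             # HDFS input directory
        "-output", "/data/output/wordcount"  # HDFS output directory (must not already exist)
    ], check=True)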
The document provides an introduction to big data and Apache Hadoop. It discusses big data concepts like the 3Vs of volume, variety and velocity. It then describes Apache Hadoop including its core architecture, HDFS, MapReduce and running jobs. Examples of using Hadoop for a retail system and with SQL Server are presented. Real world applications at Microsoft and case studies are reviewed. References for further reading are included at the end.
Big data refers to the massive amounts of unstructured data that are growing exponentially. Hadoop is an open-source framework that allows processing and storing large data sets across clusters of commodity hardware. It provides reliability and scalability through its distributed file system HDFS and MapReduce programming model. The Hadoop ecosystem includes components like Hive, Pig, HBase, Flume, Oozie, and Mahout that provide SQL-like queries, data flows, NoSQL capabilities, data ingestion, workflows, and machine learning. Microsoft integrates Hadoop with its BI and analytics tools to enable insights from diverse data sources.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
This presentation is based on a project for installing Apache Hadoop on a single node cluster along with Apache Hive for processing of structured data.
Supporting Financial Services with a More Flexible Approach to Big Data - WANdisco Plc
In this webinar, WANdisco and Hortonworks look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples, and see how a flexible next-generation Hadoop architecture lets you get a step up on improving your business performance. Join us to learn:
- How to leverage data from across an entire global enterprise
- How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
- What industry leaders have put in place
Fundamentals of Big Data and Hadoop project design, with a case study / use case.
General planning considerations and the most essential parts of the Hadoop ecosystem and Hadoop projects.
This provides the basis for choosing the right Hadoop implementation, integrating Hadoop technologies, driving adoption, and creating an infrastructure.
Building applications using Apache Hadoop, with a use case of Wi-Fi log analysis as a real-life example.
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison... - Cloudera, Inc.
The document discusses integrating Hadoop with relational databases. It describes scenarios where reference data is stored in an RDBMS and used in Hadoop, Hadoop is used for offline analytics on data stored in an RDBMS, and exporting MapReduce outputs to an RDBMS. It then presents a case study on extending SQOOP for optimized Oracle integration and compares performance with and without the extension. Other tools for Hadoop-RDBMS integration are also briefly outlined.
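To illustrate the kind of RDBMS-to-Hadoop transfer the case study discusses, here is a hedged sketch of invoking Sqoop from Python. The JDBC URL, username, table name, and target directory are placeholders, and this shows a plain Sqoop import rather than the Oracle-optimized extension presented in the talk.

    # Hypothetical example: importing an RDBMS table into HDFS with Sqoop
    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@//db-host:1521/ORCL",  # assumed JDBC connection string
        "--username", "reporting",                             # credentials handling (e.g. --password-file) omitted here
        "--table", "CUSTOMERS",                                # source table in the RDBMS
        "--target-dir", "/data/customers",                     # destination directory in HDFS
        "--num-mappers", "4"                                   # parallel map tasks performing the copy
    ], check=True)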
This document discusses SQL Server 2012 FileTables, which allow files to be stored and managed directly within a SQL Server database. FileTables bridge the two traditional options of keeping files and metadata together in the database or keeping them separate across file systems and databases. They provide full Windows file system access to files stored in SQL Server tables while retaining relational properties and queries, enabling applications to access those files seamlessly without changes to client code.
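A short, hedged sketch of what querying a FileTable can look like from client code follows. It assumes a FileTable named dbo.Documents already exists and uses pyodbc with a generic SQL Server ODBC driver; the server, database, and table names are assumptions, not details from the document.

    # Hypothetical example: inserting and listing files in a SQL Server 2012 FileTable via pyodbc
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql-host;DATABASE=FileDemo;Trusted_Connection=yes"
    )
    cursor = conn.cursor()

    # a FileTable exposes fixed columns such as name and file_stream (the file's bytes)
    cursor.execute(
        "INSERT INTO dbo.Documents (name, file_stream) VALUES (?, CAST(? AS VARBINARY(MAX)))",
        "hello.txt", b"Hello from a FileTable",
    )
    conn.commit()

    # the same rows remain queryable with ordinary T-SQL
    for name, size in cursor.execute("SELECT name, DATALENGTH(file_stream) FROM dbo.Documents"):
        print(name, size)
    conn.close()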
SQLIOSIM and SQLIO are tools used for stress testing and determining the I/O capacity of disk subsystems. SQLIOSIM simulates SQL Server workloads to test reliability, while SQLIO directly tests disk throughput under different I/O configurations. The presentation provides an overview of what the tools are and are not, how to use them, and how to interpret the results. Key points include how to configure SQLIOSIM, run SQLIO tests in batches, and understand when disk saturation occurs based on latency and throughput metrics.
This document provides SQL Server best practices for improving maintenance, performance, availability, and quality. It discusses generic best practices that are independent of the SQL Server version as well as SQL Server 2012 specific practices. Generic best practices include coding standards, using Windows authentication, normalizing data, ensuring data integrity, clustered index design, and set-based querying. SQL Server 2012 specific practices cover AlwaysOn availability groups, columnstore indexes, contained databases, FileTables, and how AlwaysOn compares to mirroring and clustering. The document emphasizes the importance of following best practices to take advantage of new SQL Server 2012 technologies and stresses considering data partitioning and the Resource Governor.
SQL Server 2012: A Successful Migration - Stéphane Haby - Antonio De Santo - dbi services
A database migration is not always a simple task. Discover the pitfalls to avoid and the issues you may run into, as well as the best practices drawn from our SQL Server 2012 experience. You will learn everything you need to know to carry out a migration with complete peace of mind.
The document compares three methods for consolidating SQL Server databases: 1) multiple databases on a single SQL Server instance, 2) multiple SQL Server instances on a single server, and 3) hypervisor-based virtualization. It finds that consolidating multiple databases onto a single instance has the lowest direct costs but reduces security and manageability. Using multiple instances improves security but has higher resource needs. Hypervisor-based virtualization maintains security while enabling features like high availability, but has higher licensing costs. The document aims to help decide which approach best balances these technical and business factors for a given environment.
SQL Server consolidation and virtualization - Ivan Donev
This document discusses SQL Server consolidation and virtualization. It begins with defining consolidation as combining units into more efficient larger units to improve cost efficiency. It then discusses approaches to consolidation like combining databases or instances. Considerations for consolidation like workloads, applications, and manageability are covered. SQL Server virtualization is also discussed, noting the benefits of isolation, migration, and simplification. The market section outlines products that can help like SQL Server 2008 R2 and the HP ProLiant DL980 server. It concludes with discussing how to start a consolidation project through inventory, testing, and migration planning. Tools to help are also listed.
This document discusses server consolidation using SQL Server 2008. It describes options for consolidating multiple databases, instances, or virtual servers onto a single physical server to reduce hardware costs while maintaining isolation and flexibility. It also covers manageability features for centralized administration, auditing, configuration, and monitoring across multiple consolidated servers. SQL Server 2008 technologies like Resource Governor allow differentiating and prioritizing workloads to improve utilization and scalability.
This document discusses upgrading to SQL Server 2012. It begins by stating the goals of modernizing platforms, discovering SQL Server 2012 resources, and helping businesses grow. It then discusses SQL Server 2012's abilities to improve availability, speed, compliance, productivity and other factors. Specific editions of SQL Server 2012 are presented as solutions to challenges around scaling, performance, accessibility and reducing costs. Real world examples are provided of companies benefiting from SQL Server 2012 capabilities like AlwaysOn, Power View and Data Quality Services. Licensing models are also summarized.
Advanced SQL injection to operating system full control (slides) - Bernardo Damele A. G.
Over ten years have passed since a famous hacker coined the term "SQL injection", and it is still considered one of the major web application threats, affecting over 70% of web applications on the Net. A lot has been said about this specific vulnerability, but not all of its aspects and implications have been uncovered yet.
It's time to explore new ways to get complete control over the database management system's underlying operating system through a SQL injection vulnerability in those overlooked and theoretically not exploitable scenarios: from command execution on MySQL and PostgreSQL to a stored procedure buffer overflow exploitation on Microsoft SQL Server. These and much more will be unveiled and demonstrated with the new version of my own tool, which I will release at the conference (http://www.blackhat.com/html/bh-europe-09/bh-eu-09-speakers.html#Damele).
These slides were presented at the Black Hat Europe conference in Amsterdam on April 16, 2009.
Big Data Analytics with Hadoop, MongoDB and SQL Server - Mark Kromer
This document discusses SQL Server and big data analytics projects in the real world. It covers the big data technology landscape, big data analytics, and three big data analytics scenarios using different technologies like Hadoop, MongoDB, and SQL Server. It also discusses SQL Server's role in the big data world and how to get data into Hadoop for analysis.
Alphorm.com Training: Microsoft SQL Server 2016 Business Intelligence (SSIS) - Alphorm
Full course here:
http://www.alphorm.com/tutoriel/formation-en-ligne-microsoft-sql-server-2016-ssis-implementer-une-solution-etl
To improve their BI capabilities, companies must securely manage data migration across many platforms. In this SSIS training, you will gain the skills to automate complex migration tasks and monitor the success or failure of migration processes.
SQL Server Integration Services (SSIS) is a powerful ETL tool used in integration and BI projects. Through this hands-on SSIS training, you will learn to implement an advanced ETL solution with SSIS 2016. You will cover data processing and loading, as well as securing and optimizing data flows.
This SSIS training will teach you to design and deploy a Business Intelligence solution with SQL Server 2016. By the end of the training, you will have acquired the knowledge and skills needed to apply core ETL methods, implement control and data flows in Integration Services, debug and implement error handling in Integration Services, and manage and secure packages…
Microsoft's Big Play for Big Data - Visual Studio Live! NY 2012 - Andrew Brust
This document discusses Microsoft's efforts to make big data technologies like Hadoop more accessible through its products. It describes Hadoop, MapReduce, HDFS, and other big data concepts. It then outlines Microsoft's project to create a Hadoop distribution that runs on Windows Server and Windows Azure, including building an ODBC driver to allow tools like Excel to query Hadoop. This will help bring big data to more business users and integrate it with Microsoft's existing BI technologies.
Big Data Developers Moscow Meetup 1 - SQL on Hadoop - bddmoscow
This document summarizes a meetup about Big Data and SQL on Hadoop. The meetup included discussions on what Hadoop is, why SQL on Hadoop is useful, what Hive is, and introduced IBM's BigInsights software for running SQL on Hadoop with improved performance over other solutions. Key topics included HDFS file storage, MapReduce processing, Hive tables and metadata storage, and how BigInsights provides a massively parallel SQL engine instead of relying on MapReduce.
The document provides an introduction to Hadoop. It discusses how Google developed its own infrastructure using Google File System (GFS) and MapReduce to power Google Search due to limitations with databases. Hadoop was later developed based on these Google papers to provide an open-source implementation of GFS and MapReduce. The document also provides overviews of the HDFS file system and MapReduce programming model in Hadoop.
Apache Hadoop is a popular open-source framework for storing and processing large datasets across clusters of computers. It includes Apache HDFS for distributed storage, YARN for job scheduling and resource management, and MapReduce for parallel processing. The Hortonworks Data Platform is an enterprise-grade distribution of Apache Hadoop that is fully open source.
Oozie and Sqoop are tools that make Hadoop more efficient.
Oozie acts as a boon for Hadoop; it was developed by Yahoo and donated to Apache for further development.
Yahoo uses the MapReduce technique, developed by Google, to tackle and monitor Big Data.
Open source stack of big data techs - openSUSE Asia - Muhammad Rifqi
This document summarizes the key technologies in the open source stack for big data. It discusses Hadoop, the leading open source framework for distributed storage and processing of large data sets. Components of Hadoop include HDFS for distributed file storage and MapReduce for distributed computations. Other related technologies are also summarized like Hive for data warehousing, Pig for data flows, Sqoop for data transfer between Hadoop and databases, and approaches like Lambda architecture for batch and real-time processing. The document provides a high-level overview of implementing big data solutions using open source Hadoop technologies.
http://www.learntek.org/product/big-data-and-hadoop/
http://www.learntek.org
Learntek is a global online training provider for Big Data Analytics, Hadoop, Machine Learning, Deep Learning, IoT, AI, Cloud Technology, DevOps, Digital Marketing and other IT and management courses. We are dedicated to designing, developing and implementing training programs for students, corporate employees and business professionals.
This document discusses cloud and big data technologies. It provides an overview of Hadoop and its ecosystem, which includes components like HDFS, MapReduce, HBase, Zookeeper, Pig and Hive. It also describes how data is stored in HDFS and HBase, and how MapReduce can be used for parallel processing across large datasets. Finally, it gives examples of using MapReduce to implement algorithms for word counting, building inverted indexes and performing joins.
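To complement the word count example, here is a hedged Hadoop Streaming sketch of the inverted-index algorithm mentioned above. It assumes each input line is formatted as "doc_id<TAB>text", which is an assumption of this example rather than something specified in the document.

    # inverted_index_mapper.py - hypothetical mapper: emits "word<TAB>doc_id" for each word in a document
    import sys

    for line in sys.stdin:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        for word in text.lower().split():
            print(f"{word}\t{doc_id}")

    # inverted_index_reducer.py - hypothetical reducer: collects the set of documents containing each word
    import sys

    current_word, docs = None, set()
    for line in sys.stdin:
        word, doc_id = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{','.join(sorted(docs))}")
            current_word, docs = word, set()
        docs.add(doc_id)
    if current_word is not None:
        print(f"{current_word}\t{','.join(sorted(docs))}")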
Hive is a data warehouse infrastructure tool used to process large datasets in Hadoop. It allows users to query data using SQL-like queries. Hive resides on HDFS and uses MapReduce to process queries in parallel. It includes a metastore to store metadata about tables and partitions. When a query is executed, Hive's execution engine compiles it into a MapReduce job which is run on a Hadoop cluster. Hive is better suited for large datasets and queries compared to traditional RDBMS which are optimized for transactions.
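The query path described above, where Hive's execution engine compiles a SQL-like query into MapReduce jobs, can also be exercised from client code. Below is a hedged sketch using the PyHive library against HiveServer2; the host, port, username, and table name are assumptions made for illustration.

    # Hypothetical example: running a HiveQL aggregation through HiveServer2 with PyHive
    from pyhive import hive

    conn = hive.Connection(host="hive-server-host", port=10000, username="analyst")  # assumed endpoint
    cursor = conn.cursor()

    # Hive compiles this SQL-like query into one or more jobs that run on the Hadoop cluster
    cursor.execute("SELECT country, COUNT(*) AS visits FROM web_logs GROUP BY country")
    for country, visits in cursor.fetchall():
        print(country, visits)

    cursor.close()
    conn.close()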
The initiative for Apache Hive began at Facebook in 2007, driven by its data growth.
Facebook's existing ETL system began to fail over the following years as more people joined the platform.
In August 2008, Facebook decided to move to a more scalable open-source Hadoop environment: Hive.
Facebook, Netflix and Amazon now support Apache Hive's SQL dialect, known as HiveQL.
Hadoop is an open source software project that allows distributed processing of large datasets across computer clusters. It was developed based on research from Google and has two main components - the Hadoop Distributed File System (HDFS) which reliably stores data in a distributed manner, and MapReduce which allows parallel processing of this data. Hadoop is scalable, cost effective, and fault tolerant for processing terabytes of data on commodity hardware. It is commonly used for batch processing of large unstructured datasets.
This document discusses big data and the Apache Hadoop framework. It defines big data as large, complex datasets that are difficult to process using traditional tools. Hadoop is an open-source framework for distributed storage and processing of big data across commodity hardware. It has two main components - the Hadoop Distributed File System (HDFS) for storage, and MapReduce for processing. HDFS stores data across clusters of machines with redundancy, while MapReduce splits tasks across processors and handles shuffling and sorting of data. Hadoop allows cost-effective processing of large, diverse datasets and has become a standard for big data.
This document provides an overview of Hadoop infrastructure and related technologies:
- Hadoop grew out of Apache's implementation of Google's distributed-systems designs (such as BigTable) and uses Java VMs to run its workloads. It allows reading, writing, and manipulating very large datasets using sequential writes and column-based file structures in HDFS.
- HDFS is the backend file system for Hadoop that allows for easy node management and operability. Technologies like HBase can augment or build on HDFS.
- Middleware such as Hive, Pig, and Cassandra helps connect to and utilize Hadoop. Each has different uses: Hive is a data warehouse, Pig uses its own query language, and Sqoop connects databases and datasets.
M. Florence Dayana - Hadoop Foundation for Analytics.pptx - Dr. Florence Dayana
Hadoop Foundation for Analytics
History of Hadoop
Features of Hadoop
Key Advantages of Hadoop
Why Hadoop
Versions of Hadoop
Eco Projects
Essentials of the Hadoop ecosystem
RDBMS versus Hadoop
Key Aspects of Hadoop
Components of Hadoop
Big Data and NoSQL for Database and BI Pros - Andrew Brust
This document discusses how Microsoft business intelligence (BI) tools can integrate with big data technologies like Hadoop. It describes how Excel, PowerPivot, and SQL Server Analysis Services can connect to Hadoop data stored in HDFS or Hive via an ODBC driver. It also explains how SQL Server Parallel Data Warehouse uses PolyBase to directly query Hadoop, bypassing MapReduce. The document provides an overview of Hadoop concepts like MapReduce, HDFS, and Hive, as well as ETL tools like Sqoop that can move data between Hadoop and other data sources.
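As a rough illustration of querying Hive through an ODBC driver from the Microsoft side of the stack, here is a hedged Python sketch using pyodbc. The DSN name "HiveDSN" and the table are assumptions; Excel and PowerPivot would use the same driver through their own connection dialogs rather than code.

    # Hypothetical example: querying Hive over ODBC (the same kind of driver Excel/PowerPivot can use)
    import pyodbc

    conn = pyodbc.connect("DSN=HiveDSN", autocommit=True)  # assumed pre-configured Hive ODBC data source
    cursor = conn.cursor()

    # HiveQL statement passed through the ODBC driver to Hive, which executes it on the cluster
    cursor.execute("SELECT product, SUM(quantity) AS total FROM sales GROUP BY product")
    for product, total in cursor.fetchall():
        print(product, total)
    conn.close()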
Hadoop is an open source software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce as a programming model and HDFS for storage. Hadoop supports various big data applications like HBase for distributed column storage, Hive for data warehousing and querying, Pig and Jaql for data flow languages, and Hadoop ecosystem projects for tasks like system monitoring and machine learning.
The document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop addresses the growing volume, variety and velocity of big data through its core components: HDFS for storage, and MapReduce for distributed processing. Key features of Hadoop include scalability, flexibility, reliability and economic viability for large-scale data analytics.
During this session we will look into Windows 10 for the Enterprise.
Let’s explore the new management capabilities and choices.
Let’s understand the Windows 10 deployment infrastructure and mechanisms.
Let’s discover new Windows 10 features and improvements.
You are eager to learn about Windows 10 and want to gather early-stage info about this exciting Operating System…?
Well you know what to do! See you there!
Compliance settings, formerly known as DCM, remains one of the often unexplored features in Configuration Manager. During this session we will walk through the new capabilities and improvements of this feature in ConfigMgr 2012, discuss implementation details, and demonstrate how you can start using it to fulfill actual business requirements.
Discover what’s new in Windows 8.1 regarding interface, settings, deployment, security, … How will Windows 8.1 fit in your enterprise? How do you upgrade? All answers are here!
The document discusses how to get started with monitoring after a successful installation of System Center Operations Manager (SCOM). It recommends doing an initial health check of the SCOM management server and database. It also covers installing SCOM agents, selecting appropriate management packs to monitor key components, and defining a phased approach for starting monitoring. The presentation provides tips on leveraging the community, backing up the SCOM environment, and finding quick wins to show management.
RMS, EFS, and BitLocker are Microsoft data protection technologies that can help prevent data leakage. RMS allows users to apply usage policies to files and encrypts files to control access. EFS transparently encrypts files stored locally on a computer. BitLocker encrypts fixed and removable drives to protect data at rest. The technologies provide different levels of protection and have varying capabilities for controlling access to data inside and outside an organization.
The document discusses Configuration Manager client deployment and health. It covers supported platforms for Windows, Linux, and Mac clients. Deployment methods include SUP, Group Policy, scripts, and manual installation. Client health is monitored from the server and client. Components include Client Check for prerequisites, dependencies and remediation, and Client Activity for tracking server interactions and status. Dashboards and reports provide visibility into client health and alerts surface issues.
This document discusses the history and evolution of self-service business intelligence (BI) tools from the 1980s to the present. It traces how BI tools have shifted from being developed primarily by IT to being user-focused end tools. It highlights key Microsoft products at different stages, from Excel in the 1980s to the addition of new apps like GeoFlow and Data Explorer in 2013. The document also demos some new self-service BI capabilities and resources.
This document discusses Cluster-Aware Updating (CAU) in Windows Server 2012. It provides an overview of how CAU works to update nodes in a failover cluster. The CAU update coordinator manages the updating process, pausing nodes, draining virtual machines, updating nodes, and failing back virtual machines in a coordinated manner. The document also provides links to Microsoft articles about CAU and integrating it with Dell server update tools.
The document discusses Microsoft's antimalware management platform which provides a common antimalware platform across Microsoft clients with proactive protection against known and unknown threats while reducing complexity. It integrates features such as early-launch antimalware, measured boot, and secure boot through UEFI to prevent malware from bypassing antimalware inspection during the boot process. The platform also provides simplified administration through a single console experience for endpoint protection and management.
This LiveMeeting presentation introduces Application Performance Monitoring (APM) in System Center Operations Manager 2012. APM allows monitoring of .NET and WCF applications to identify performance issues. It requires SCOM 2012 or later with the IIS management pack installed. APM bridges the gap between development and operations teams by integrating with Team Foundation Server and collecting traces in an IntelliTrace format. It provides various tools for client-side monitoring, server-side monitoring, and analyzing application diagnostics and advisors to help answer common support questions about application slowdowns and errors.
This document discusses Microsoft Lync Server 2013's persistent chat feature. It provides an overview of persistent chat's history and integration within Microsoft products. It also describes Lync 2013's unified client, improved server infrastructure and manageability, rich platform capabilities, and tools to easily migrate from previous versions. Configuration and management of persistent chat policies, categories, rooms and add-ins are examined. The document concludes with a section on licensing requirements for persistent chat.
The document discusses desktop virtualization and remote desktop services. It explains that with these services, the desktop workload is centralized on a virtual machine in the datacenter while the presentation of the UI is managed remotely via protocols like RDP. It also discusses mobility options that allow Lync to work across devices like PCs, Macs, smartphones and tablets through different applications. Finally, it provides a table comparing Lync support and requirements for various Windows Phone models.
Office 365 ProPlus can be deployed using Click-to-Run installation, which uses an App-V foundation for a streaming installation. This allows deploying Office fast without sacrificing control. The Office Deployment Tool can be used to download Click-to-Run packages, customize configurations, and deploy the packages across an organization. Telemetry data is collected to help optimize the user experience and identify issues, and a Telemetry Dashboard provides tools to manage data collection and settings.
This document discusses identity and authentication options for Office 365. It covers Directory Synchronization (DirSync) which synchronizes on-premises Active Directory with Azure Active Directory. It also discusses Active Directory Federation Services (ADFS) which provides single sign-on for federated identities and different ADFS topologies including on-premises, hybrid and cloud. Additionally, it covers Windows Azure Active Directory and how it can be used to provide identity services for cloud applications. The key takeaways are to check Active Directory health before using DirSync, understand the different Office 365 authentication flows with ADFS, and that WAAD can extend identity functionality to websites.
This document discusses options for upgrading a SharePoint environment from 2010 to 2013. It outlines the upgrade process which involves learning about the options, validating the environment, preparing by cleaning up and managing customizations, implementing the upgrade by building servers and upgrading content and services, and testing the upgraded environment. The key aspects are performing the upgrade on a new farm by attaching content databases to avoid downtime, allowing site collections to upgrade individually to minimize disruption, and thoroughly testing the upgraded environment.
This document discusses System Center Configuration Manager 2012's application model. It provides an overview of the application model, including the vision behind it of lifecycle management and user-centric deployment. Key concepts covered include requirement rules, detection methods, the application evaluation flow, application supersedence, and application uninstalls. Challenges and potential workarounds are also mentioned.
This document discusses FlexPod for Microsoft Private Cloud, an integrated solution from NetApp and Cisco for implementing a Microsoft Private Cloud using their technologies. It is a pre-validated reference implementation that is fully integrated with Microsoft System Center 2012 and provides a scalable Hyper-V platform. It accelerates private cloud deployments with reduced risk. Key components include Cisco UCS blade servers and switches, NetApp FAS storage, and tight integration and management capabilities through Cisco UCS Manager and NetApp OnCommand with Microsoft System Center.
Windows RT devices can be used in corporate environments if managed properly. Windows RT provides limited management capabilities compared to full Windows devices, but supports application deployment and some policy enforcement through Intune and ConfigMgr. Key challenges include application delivery restrictions, limited VPN configuration options, and lack of remote control and software metering capabilities. Proper infrastructure like Intune, ConfigMgr and VPN servers is required to securely connect and manage Windows RT devices in an enterprise.
The document discusses the evolution from device-centric management to user-centric management. Device-centric management involved managing individual devices, but user-centric management focuses on managing all of a user's devices through a single interface. The document outlines how Microsoft System Center Configuration Manager 2012 and Microsoft Intune can be used to implement user-centric management, including managing applications, settings, and security across devices. A hybrid approach using both Configuration Manager and Intune is also presented.
The document discusses steps for deploying a successful virtual network, including designing the network, building and configuring hardware, and configuring the virtual machine manager. It covers providing isolation through techniques like VLANs and software defined networking. Topics include logical network addressing, host configuration options, and creating logical switches. Tenant configuration using network virtualization is described for isolation.
This is the keynote of the Into the Box conference, highlighting the release of the BoxLang JVM language, its key enhancements, and its vision for the future.
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Aqusag Technologies
In late April 2025, a significant portion of Europe, particularly Spain, Portugal, and parts of southern France, experienced widespread, rolling power outages that continue to affect millions of residents, businesses, and infrastructure systems.
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveScyllaDB
Want to learn practical tips for designing systems that can scale efficiently without compromising speed?
Join us for a workshop where we’ll address these challenges head-on and explore how to architect low-latency systems using Rust. During this free interactive workshop oriented for developers, engineers, and architects, we’ll cover how Rust’s unique language features and the Tokio async runtime enable high-performance application development.
As you explore key principles of designing low-latency systems with Rust, you will learn how to:
- Create and compile a real-world app with Rust
- Connect the application to ScyllaDB (NoSQL data store)
- Negotiate tradeoffs related to data modeling and querying
- Manage and monitor the database for consistently low latencies
AI and Data Privacy in 2025: Global TrendsInData Labs
In this infographic, we explore how businesses can implement effective governance frameworks to address AI data privacy. Understanding it is crucial for developing effective strategies that ensure compliance, safeguard customer trust, and leverage AI responsibly. Equip yourself with insights that can drive informed decision-making and position your organization for success in the future of data privacy.
This infographic contains:
-AI and data privacy: Key findings
-Statistics on AI data privacy in the today’s world
-Tips on how to overcome data privacy challenges
-Benefits of AI data security investments.
Keep up-to-date on how AI is reshaping privacy standards and what this entails for both individuals and organizations.
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...Alan Dix
Talk at the final event of Data Fusion Dynamics: A Collaborative UK-Saudi Initiative in Cybersecurity and Artificial Intelligence funded by the British Council UK-Saudi Challenge Fund 2024, Cardiff Metropolitan University, 29th April 2025
https://alandix.com/academic/talks/CMet2025-AI-Changes-Everything/
Is AI just another technology, or does it fundamentally change the way we live and think?
Every technology has a direct impact with micro-ethical consequences, some good, some bad. However more profound are the ways in which some technologies reshape the very fabric of society with macro-ethical impacts. The invention of the stirrup revolutionised mounted combat, but as a side effect gave rise to the feudal system, which still shapes politics today. The internal combustion engine offers personal freedom and creates pollution, but has also transformed the nature of urban planning and international trade. When we look at AI the micro-ethical issues, such as bias, are most obvious, but the macro-ethical challenges may be greater.
At a micro-ethical level AI has the potential to deepen social, ethnic and gender bias, issues I have warned about since the early 1990s! It is also being used increasingly on the battlefield. However, it also offers amazing opportunities in health and educations, as the recent Nobel prizes for the developers of AlphaFold illustrate. More radically, the need to encode ethics acts as a mirror to surface essential ethical problems and conflicts.
At the macro-ethical level, by the early 2000s digital technology had already begun to undermine sovereignty (e.g. gambling), market economics (through network effects and emergent monopolies), and the very meaning of money. Modern AI is the child of big data, big computation and ultimately big business, intensifying the inherent tendency of digital technology to concentrate power. AI is already unravelling the fundamentals of the social, political and economic world around us, but this is a world that needs radical reimagining to overcome the global environmental and human challenges that confront us. Our challenge is whether to let the threads fall as they may, or to use them to weave a better future.
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersToradex
Toradex brings robust Linux support to SMARC (Smart Mobility Architecture), ensuring high performance and long-term reliability for embedded applications. Here’s how:
• Optimized Torizon OS & Yocto Support – Toradex provides Torizon OS, a Debian-based easy-to-use platform, and Yocto BSPs for customized Linux images on SMARC modules.
• Seamless Integration with i.MX 8M Plus and i.MX 95 – Toradex SMARC solutions leverage NXP’s i.MX 8 M Plus and i.MX 95 SoCs, delivering power efficiency and AI-ready performance.
• Secure and Reliable – With Secure Boot, over-the-air (OTA) updates, and LTS kernel support, Toradex ensures industrial-grade security and longevity.
• Containerized Workflows for AI & IoT – Support for Docker, ROS, and real-time Linux enables scalable AI, ML, and IoT applications.
• Strong Ecosystem & Developer Support – Toradex offers comprehensive documentation, developer tools, and dedicated support, accelerating time-to-market.
With Toradex’s Linux support for SMARC, developers get a scalable, secure, and high-performance solution for industrial, medical, and AI-driven applications.
Do you have a specific project or application in mind where you're considering SMARC? We can help with Free Compatibility Check and help you with quick time-to-market
For more information: https://www.toradex.com/computer-on-modules/smarc-arm-family
Big Data Analytics Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul
Artificial intelligence is changing how businesses operate. Companies are using AI agents to automate tasks, reduce time spent on repetitive work, and focus more on high-value activities. Noah Loul, an AI strategist and entrepreneur, has helped dozens of companies streamline their operations using smart automation. He believes AI agents aren't just tools—they're workers that take on repeatable tasks so your human team can focus on what matters. If you want to reduce time waste and increase output, AI agents are the next move.
TrsLabs - Fintech Product & Business ConsultingTrs Labs
Hybrid Growth Mandate Model with TrsLabs
Strategic Investments, Inorganic Growth, Business Model Pivoting are critical activities that business don't do/change everyday. In cases like this, it may benefit your business to choose a temporary external consultant.
An unbiased plan driven by clearcut deliverables, market dynamics and without the influence of your internal office equations empower business leaders to make right choices.
Getting things done within a budget within a timeframe is key to Growing Business - No matter whether you are a start-up or a big company
Talk to us & Unlock the competitive advantage
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungenpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-und-verwaltung-von-multiuser-umgebungen/
HCL Nomad Web is being hailed as the next generation of the HCL Notes client and offers numerous advantages, such as eliminating the need for packaging, distribution, and installation. Nomad Web client updates are installed "automatically" in the background, which significantly reduces administrative overhead compared to traditional HCL Notes clients. However, troubleshooting in Nomad Web poses unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how the troubleshooting process in HCL Nomad Web can be simplified to ensure a smooth and efficient user experience.
In this webinar we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including:
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder in the browser's cache (using OPFS)
- Understanding the differences between single-user and multi-user scenarios
- Using the Client Clocking feature
Semantic Cultivators: The Critical Future Role to Enable AI (artmondano)
By 2026, AI agents will consume 10x more enterprise data than humans, but with none of the contextual understanding that prevents catastrophic misinterpretations.
Complete Guide to Advanced Logistics Management Software in Riyadh (Software Company)
Explore the benefits and features of advanced logistics management software for businesses in Riyadh. This guide delves into the latest technologies, from real-time tracking and route optimization to warehouse management and inventory control, helping businesses streamline their logistics operations and reduce costs. Learn how implementing the right software solution can enhance efficiency, improve customer satisfaction, and provide a competitive edge in the growing logistics sector of Riyadh.
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I... (Impelsys Inc.)
Impelsys provided a robust testing solution, leveraging a risk-based and requirement-mapped approach to validate ICU Connect and CritiXpert. A well-defined test suite was developed to assess data communication, clinical data collection, transformation, and visualization across integrated devices.
What is Model Context Protocol (MCP) - The new technology for communication bw... (Vishnu Singh Chundawat)
The MCP (Model Context Protocol) is a framework designed to manage context and interaction within complex systems. This SlideShare presentation will provide a detailed overview of the MCP Model, its applications, and how it plays a crucial role in improving communication and decision-making in distributed systems. We will explore the key concepts behind the protocol, including the importance of context, data management, and how this model enhances system adaptability and responsiveness. Ideal for software developers, system architects, and IT professionals, this presentation will offer valuable insights into how the MCP Model can streamline workflows, improve efficiency, and create more intuitive systems for a wide range of use cases.
Dev Dives: Automate and orchestrate your processes with UiPath Maestro (UiPathCommunity)
This session is designed to equip developers with the skills needed to build mission-critical, end-to-end processes that seamlessly orchestrate agents, people, and robots.
📕 Here's what you can expect:
- Modeling: Build end-to-end processes using BPMN.
- Implementing: Integrate agentic tasks, RPA, APIs, and advanced decisioning into processes.
- Operating: Control process instances with rewind, replay, pause, and stop functions.
- Monitoring: Use dashboards and embedded analytics for real-time insights into process instances.
This webinar is a must-attend for developers looking to enhance their agentic automation skills and orchestrate robust, mission-critical processes.
👨🏫 Speaker:
Andrei Vintila, Principal Product Manager @UiPath
This session streamed live on April 29, 2025, 16:00 CET.
Check out all our upcoming Dev Dives sessions at https://community.uipath.com/dev-dives-automation-developer-2025/.
SQL Server 2012 and Big Data
1. SQL SERVER 2012 AND BIG DATA
Hadoop Connectors for SQL Server
2. TECHNICALLY – WHAT IS HADOOP
• Hadoop consists of two key services:
• Data storage using the Hadoop Distributed File System (HDFS)
• High-performance parallel data processing using a technique called
MapReduce.
3. HADOOP IS AN ENTIRE ECOSYSTEM
• HBase as the database
• Hive as the data warehouse
• Pig as the query language
• All built on top of Hadoop and the MapReduce framework.
4. HDFS
• HDFS is designed to scale seamlessly
• That's its strength!
• Scaling horizontally is non-trivial in most cases.
• HDFS scales by throwing more hardware at it.
• A lot of it!
• HDFS is asynchronous
• This is what links Hadoop to cloud computing.
5. DIFFERENCES
• How does HDFS compare with SQL Server and Windows Server 2008 R2's NTFS?
• Data is not stored in the traditional table-and-column format.
• HDFS supports only forward-only parsing.
• Databases built on HDFS don't guarantee ACID properties.
• Taking code to the data
• SQL Server scales better vertically
6. UNSTRUCTURED DATA
• Doesn't know or care about column names, column data types, column sizes, or even the number of columns.
• Data is stored in delimited flat files.
• You're on your own with respect to data cleansing.
• Data input in Hadoop is as simple as loading your data file into HDFS.
• It's very close to copying files on an OS.
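As a concrete illustration, loading data really is close to an OS-level file copy. A minimal sketch using the standard Hadoop shell commands (the file name and paths are placeholders, and the target directory is assumed to exist):

# Copy a local file into HDFS - conceptually just a file copy
hadoop fs -put sales_2012.csv /user/demo/input/
# List the directory and peek at the first lines of what landed
hadoop fs -ls /user/demo/input
hadoop fs -cat /user/demo/input/sales_2012.csv | head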
7. NO SQL, NO TABLES, NO COLUMNS
NO DATA?
• Write code to do MapReduce
• You have to write code to get data
• The best way to get data: write code that calls the MapReduce framework to slice and dice the stored data
• Step 1 is Map and Step 2 is Reduce.
8. MAP (REDUCE)
• Mapping
• Pick your selection of keys from each record (records are typically linefeed-separated)
• Tell the framework what your key is and what values that key will hold
• MapReduce will deal with the actual creation of the map
• You control which keys to include and which values to filter out
• You end up with a giant hashtable of key-value pairs
9. (MAP) REDUCE
• Reducing data: once the map phase is complete, the code moves on to the reduce phase. The reduce phase works on the mapped data and can potentially do all the aggregation and summation activities.
• Finally you get a blob of the mapped and reduced data.
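To make the two steps concrete, here is a minimal sketch of the classic word-count job against Hadoop's Java MapReduce API. It is an illustration, not the deck's own code: the class, job, and path names are placeholders, and it assumes a Hadoop 2.x-style client library on the classpath.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each input record is one line; emit (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // the framework groups these pairs by key
      }
    }
  }

  // Reduce step: receives (word, [1, 1, ...]) and sums the counts.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged as a jar, the job would typically be submitted with something like hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output (jar name and paths are again placeholders).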
10. JAVA… VS. PIG…
• Pig is a querying engine
• Has a 'business-friendly' syntax
• Spits out MapReduce code
• The syntax for Pig is called Pig Latin (don't ask)
• Pig Latin is syntactically very similar to LINQ
• Pig converts the query into MapReduce, sends it off to Hadoop, then retrieves the results
• About half the performance of hand-written MapReduce code
• Roughly 10 times faster to write
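For comparison, a sketch of the same word count written in Pig Latin (input and output paths are made up); each statement defines a relation, and Pig compiles the whole script into MapReduce jobs when STORE runs:

-- Word count in Pig Latin
lines   = LOAD '/user/demo/input/log.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
ordered = ORDER counts BY total DESC;
top10   = LIMIT ordered 10;
STORE top10 INTO '/user/demo/output/top_words';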
11. HBASE
• HBase is a key-value store on top of HDFS
• This is the NoSQL database
• A very thin layer over raw HDFS
• Data is grouped in a table that has rows of data
• Each row can have multiple 'column families'
• Each 'column family' contains multiple columns
• Each column name is the key, and it has its corresponding column value
• Each row doesn't need to have the same number of columns
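A small sketch of that model in the HBase shell; the table, column-family, and row names here are invented for illustration:

# One table with two column families; columns inside a family are free-form
create 'customers', 'profile', 'orders'
put 'customers', 'row1', 'profile:name',   'Ada'
put 'customers', 'row1', 'profile:email',  '[email protected]'
put 'customers', 'row1', 'orders:last_id', '42'
# row2 simply stores fewer columns - there is no fixed schema to violate
put 'customers', 'row2', 'profile:name', 'Bob'
# Fetch a single row, or scan the whole table
get 'customers', 'row1'
scan 'customers'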
12. HIVE
• Hive is a little closer to RDBMS systems
• It is a data-warehouse (DWH) system on top of HDFS and HBase
• Performs join operations between HBase tables
• Maintains a meta layer
• Supports data summation, ad-hoc queries, and analysis of large data stores in HDFS
• High-level language
• Hive Query Language (HiveQL) looks like SQL but is restricted
• No updates or deletes are allowed
• Partitioning can be used to update information
o Essentially re-writing a chunk of data.
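A hedged HiveQL sketch of those points: a partitioned table over delimited files, a partition overwrite standing in for an update, and a SQL-like query (table, column, and path names are placeholders):

-- Warehouse table over tab-delimited files in HDFS, partitioned by day
CREATE TABLE page_views (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (view_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- No UPDATE or DELETE: refreshing a day means overwriting that day's partition
LOAD DATA INPATH '/user/demo/staging/2012-06-01.tsv'
OVERWRITE INTO TABLE page_views PARTITION (view_date = '2012-06-01');

-- HiveQL looks like SQL; Hive compiles it into MapReduce jobs
SELECT url, COUNT(*) AS hits
FROM page_views
WHERE view_date = '2012-06-01'
GROUP BY url
ORDER BY hits DESC;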
13. WINDOWS HADOOP- PROJECT ISOTOPE
• Two flavours
• Cloud
o Azure CTP
• On-premises
o Integration of the Hadoop File System with Active Directory
o Integration of System Center Operations Manager with Hadoop
o BI integration
• These are not all that interesting in and of themselves, but the data and tools are
o Sqoop
– Integration with SQL Server
o Flume
– Access to lots of data
15. SQOOP
• A framework that facilitates data transfer between relational databases (RDBMS) and HDFS.
• Uses MapReduce programs to import and export data.
• Imports and exports are performed in parallel with fault tolerance.
• Source / Target files being used by Sqoop can be:
• delimited text files
• binary SequenceFiles containing serialized record data.
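A hedged sketch of the two target formats using the stock Sqoop command line; the connection string, credentials, table, and paths are placeholders, and the SQL Server JDBC driver is assumed to be on Sqoop's classpath:

# Import a table as delimited text files, using four parallel map tasks
sqoop import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=Sales" \
  --username etl --password secret \
  --table Orders \
  --target-dir /user/demo/orders_text \
  --as-textfile -m 4

# The same import written as binary SequenceFiles instead
sqoop import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=Sales" \
  --username etl --password secret \
  --table Orders \
  --target-dir /user/demo/orders_seq \
  --as-sequencefile -m 4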
16. SQL SERVER – HORTONWORKS - HADOOP
• Hortonworks is a spin-off from Yahoo
• Aims to bridge the technological gaps between Hadoop and Windows Server
• CTP of the Hadoop-based distribution for Windows Server (expected sometime in 2012)
• Will work with Microsoft's business-intelligence tools
• including
o Excel
o PowerPivot
o PowerView
18. WITH SQL SERVER-HADOOP CONNECTOR, YOU CAN:
• Sqoop-based connector
• Import
• tables in SQL Server to delimited text files on HDFS
• tables in SQL Server to SequenceFiles on HDFS
• tables in SQL Server to tables in Hive
• Result of queries executed on SQL Server to delimited text files on HDFS
• Result of queries executed on SQL Server to SequenceFiles on HDFS
• Result of queries executed on SQL Server to tables in Hive
• Export
• Delimited text files on HDFS to SQL Server
• SequenceFiles on HDFS to SQL Server
• Hive Tables to tables in SQL Server
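As an illustration of the import and export directions above, a hedged Sqoop sketch against SQL Server; server, database, table, and path names are placeholders, and the connector plus the Microsoft JDBC driver are assumed to be installed on the Sqoop machine:

# Import a SQL Server table straight into a Hive table
sqoop import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=Sales" \
  --username etl --password secret \
  --table Orders \
  --hive-import --hive-table orders \
  -m 4

# Export aggregated results from delimited files on HDFS back into SQL Server
sqoop export \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=Sales" \
  --username etl --password secret \
  --table OrderSummary \
  --export-dir /user/demo/order_summary \
  --input-fields-terminated-by '\t'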
19. SQL SERVER 2012 ALONGSIDE THE ELEPHANT
• PowerView utilizes its own class of apps, if you will, that Microsoft is
calling insights.
• SQL Server will extend insights to Hadoop data sets
• Interesting insights can be:
• brought into a SQL Server environment using connectors
• used to drive analysis across the data with BI tools.
20. WHY USE HADOOP WITH SQL SERVER
• Don't just think of big data as being about large volumes
• Analyze both structured and unstructured datasets
• Think about workload, growth, accessibility, and even location
• Can the amount of data stored every day be reliably written to a traditional HDD?
• MapReduce is more complex than T-SQL
• Many companies try to avoid writing Java for queries
• Front ends are immature relative to the tooling available in the relational database world
• It's not going to replace your database, but your database isn't likely to replace Hadoop either.
21. MICROSOFT AND HADOOP
• Broader access to Hadoop for:
• End users
• IT professionals
• Developers
• Enterprise-ready Hadoop distribution with greater security, performance, and ease of management.
• Breakthrough insights through the use of familiar tools such as Excel,
PowerPivot, SQL Server Analysis Services and Reporting Services.
23. MICROSOFT ENTERPRISE HADOOP
• Machines in the Hadoop cluster must be running Windows Server 2008
or higher
• IPv4 networking enabled on all nodes
• Deployment does not work on an IPv6-only network.
• The ability to create a new user account called "Isotope".
• Will be created on all nodes of the cluster.
• Used for running Hadoop daemons and running jobs.
• Must be able to copy and install the deployment binaries to each machine
• Windows File Sharing services must be enabled on each machine that will
be joined to the Hadoop cluster.
• .NET Framework 4 installed on all nodes.
• Minimum of 10 GB of free space on the C: drive (a JBOD HDFS configuration is supported)
#6: 1. Data is not stored in the traditional table-column format. At best some of the database layers mimic this, but deep in the bowels of HDFS there are no tables, no primary keys, no indexes. Everything is a flat file with predetermined delimiters. HDFS is optimized to recognize a <Key, Value> mode of storage; everything maps down to <Key, Value> pairs.
2. HDFS supports only forward-only parsing, so you are either reading ahead or appending to the end. There is no concept of 'Update' or 'Insert'.
3. Databases built on HDFS don't guarantee ACID properties, especially 'Consistency'. They offer what is called 'eventual consistency', meaning data will be saved eventually, but because of the highly asynchronous nature of the file system you are not guaranteed at what point it will finish. So HDFS-based systems are NOT ideal for OLTP systems; RDBMS still rock there.
4. Taking code to the data. In traditional systems you fire a query to get data and then write code on it to manipulate it. In MapReduce, you write code and send it to Hadoop's data store and get back the manipulated data. Essentially you are sending code to the data.
5. Traditional databases like SQL Server scale better vertically, so more cores, more memory, and faster cores are the way to scale. Hadoop, by design, scales horizontally: keep throwing hardware at it and it will scale.
#9: Mapping data: if it is plain delimited text data, you have the freedom to pick your selection of keys from the record (remember records are typically linefeed-separated) and values, and to tell the framework what your key is and what values that key will hold. MapReduce will deal with the actual creation of the map. When the map is being created you can control which keys to include and which values to filter out. In the end you end up with a giant hashtable of filtered key-value pairs. Now what?
#11: Well, if you are that scared of Java, then you have Pig. No, I am not calling names here. Pig is a querying engine that has a more 'business-friendly' syntax but spits out MapReduce code in the backend and does all the dirty work for you. The syntax for Pig is called, of course, Pig Latin. When you write queries in Pig Latin, Pig converts them into MapReduce and sends them off to Hadoop, then retrieves the results and hands them back to you. Analysis shows you get about half the performance of raw, optimally hand-written MapReduce Java code, but the same code takes more than 10 times as long to write as a Pig query. If you are in the mood for a start-up idea, generating optimal MapReduce code from Pig Latin is a topic to consider… For those in the .NET world, Pig Latin is very similar syntactically to LINQ.
#12: HBase is a key-value store that sits on top of HDFS. It is a NoSQL database. It has a very thin veneer over raw HDFS, wherein it mandates that data is grouped in a table that has rows of data. Each row can have multiple 'column families' and each 'column family' can contain multiple columns. Each column name is the key and it has its corresponding column value. So a column of data can be represented as row[family][column] = value. Each row need not have the same number of columns. Think of each row as a horizontal linked list that links to a column family, and each column family then links to multiple columns as <Key, Value> pairs: row1 -> family1 -> colA = valA, row1 -> family2 -> colB = valB, and so on.
#13: Hive is a little closer to traditional RDBMS systems. In fact it is a data-warehousing system that sits on top of HDFS but maintains a meta layer that helps with data summation, ad-hoc queries, and analysis of large data stores in HDFS. Hive supports a high-level language called Hive Query Language, which looks like SQL but is restricted in a few ways; for example, no updates or deletes are allowed. However, Hive has the concept of partitioning, which can be used to update information by essentially re-writing a chunk of data whose granularity depends on the schema design. Hive can actually sit on top of HBase and perform join operations between HBase tables.
#14: Isotope is more than the distributions that the Softies are building with Hortonworks. Isotope also refers to the whole "tool chain" of supporting big-data analytics offerings that Microsoft is packaging up around the distributions. Microsoft's big-picture concept is that Isotope will give all kinds of users, from technical to "ordinary" productivity workers, access from inside data-analysis tools they know — like Microsoft's own SQL Server Analysis Services, PowerPivot and Excel on their PCs — to data stored in Windows Server and/or Windows Azure. (The Windows Azure Marketplace fits in here, as this is the place where third-party providers can publish free or paid collections of data that users will be able to download or buy.)
To accelerate its adoption in the enterprise, Microsoft will make Hadoop enterprise-ready through:
o Active Directory integration: providing enterprise-class security through integration of Hadoop with Active Directory
o High performance: boosting Hadoop performance to offer consistently high data throughput
o System Center integration: simplifying management of the Hadoop infrastructure through integration with Microsoft's management tools such as System Center
o BI integration: enabling integration of relational and Hadoop data into enterprise BI solutions with Hadoop connectors
o Flexibility and choice, with deployment options for Windows Server and Windows Azure, which offers customers:
  o Freedom to choose: more control, as they can choose which data to keep in-house instead of in the cloud
  o Lower TCO: cost savings, as fewer resources are required to run their Hadoop deployment in the cloud
  o Elasticity to meet demand: elasticity reduces your costs, since more nodes can be added to the Windows Azure deployment for more demanding workloads. In addition, the Azure deployment of Hadoop can be used to extend the on-premises solution in periods of high demand
  o Increased performance: bringing computing closer to the data – our solution enables customers to process data closer to where the data is born, whether on-premises or in the cloud
We do this while maintaining compatibility with existing Hadoop tools such as Pig, Hive, and Java. Our goal is to ensure that applications built on Apache Hadoop can be easily migrated to our distribution to run on Windows Azure or Windows Server.
#15: For developers, Microsoft is investing to make JavaScript a first-class language within big data by making it possible to write high-performance Map/Reduce jobs using JavaScript. In addition, our JavaScript console will allow users to write JavaScript Map/Reduce jobs, Pig Latin, and Hive queries from the browser to execute their Hadoop jobs.
• Analyze Hadoop data with familiar tools such as Excel, thanks to a Hive Add-in for Excel
• Reduce time to solution through integration of Hive and Microsoft BI tools such as PowerPivot and Power View
• Build corporate BI solutions that include Hadoop data, through integration of Hive and leading BI tools such as SQL Server Analysis Services and Reporting Services
Customers can use this connector (on an already deployed Hadoop cluster) to analyze unstructured or semi-structured data from various sources and then load the processed data into PDW:
• Efficiently transfer terabytes of data between Hadoop and PDW
• Get the best of both worlds: Hadoop for processing large volumes of unstructured data, and PDW for analyzing structured data with easy integration to BI tools
• Use MapReduce and the PDW bulk load/extract tool for fast import/export
#16: Sqoop is an open-source connectivity framework that facilitates transfer between multiple relational database management systems (RDBMS) and HDFS. Sqoop uses MapReduce programs to import and export data; the imports and exports are performed in parallel with fault tolerance. The source/target files used by Sqoop can be delimited text files (for example, with commas or tabs separating each field) or binary SequenceFiles containing serialized record data. Please refer to section 7.2.7 in the Sqoop User Guide for more details on supported file types. For information on the SequenceFile format, please refer to the Hadoop API page.
#20: Broader access to Hadoop through simplified deployment and programmability. Microsoft has simplified setup and deployment of Hadoop, making it possible to set up and configure Hadoop on Windows Azure in a few hours instead of days. Since the service is hosted on Windows Azure, customers only download a package that includes the Hive Add-in and Hive ODBC Driver. In addition, Microsoft has introduced new JavaScript libraries to make JavaScript a first-class programming language in Hadoop. Through this library, JavaScript programmers can easily write MapReduce programs in JavaScript and run these jobs from simple web browsers. These improvements reduce the barrier to entry by enabling customers to easily deploy and explore Hadoop on Windows.
Breakthrough insights through integration with Microsoft Excel and BI tools. This preview ships with a new Hive Add-in for Excel that enables users to interact with data in Hadoop from Excel. With the Hive Add-in, customers can issue Hive queries to pull and analyze unstructured data from Hadoop in the familiar Excel environment. Second, the preview includes a Hive ODBC Driver that integrates Hadoop with Microsoft BI tools. This driver enables customers to integrate and analyze unstructured data from Hadoop using award-winning Microsoft BI tools such as PowerPivot and PowerView. As a result, customers can gain insight into all their data, including unstructured data stored in Hadoop.
Elasticity, thanks to Windows Azure. This preview of the Hadoop-based service runs on Windows Azure, offering an elastic and scalable platform for distributed storage and compute.
#21: Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, it can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the "big data" mantra is misguided at times… The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data. The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem. Hadoop also lets you attack unstructured and semi-structured datasets without the overhead of an ETL step to insert them into a traditional relational database: from CSV to XML, we can load in a single step and begin querying.
#22: Through easy installation and configuration and simplified programming with JavaScript. The CTP of Microsoft's Hadoop-based service for Windows Azure is now available. Complete the online form with details of your big-data scenario to download the preview. Microsoft will issue a code that the selected customers will use to access the Hadoop-based service.
#23: Gain new insights from your data. Have you ever had trouble finding data you needed? Or combining data from different, incompatible sources? How about sharing the results with others in a web-friendly way? If so, we want you to try the Microsoft Codename "Data Explorer" cloud service.
With "Data Explorer" you can:
• Identify the data you care about from the sources you work with (e.g. Excel spreadsheets, files, SQL Server databases).
• Discover relevant data and services via automatic recommendations from the Windows Azure Marketplace.
• Enrich your data by combining it and visualizing the results.
• Collaborate with your colleagues to refine the data.
• Publish the results to share them with others or power solutions.
In short, we help you harness the richness of data on the web to generate new insights.