BIG DATA ANALYTICS
 IN A HETEROGENEOUS WORLD



JOYDEEP DAS
DIRECTOR, ANALYTICS PRODUCT MANAGEMENT
SYBASE INC, AN SAP COMPANY

FEBRUARY 16, 2012
AGENDA
 The real world means business


Change is afoot – Myriad solution trends


Building bridges across a heterogeneous world


 Summary
BIG DATA ANALYTICS
REAL WORLD MEANS BUSINESS
BIG DATA ANALYTICS ISSUES
DEALING WITH VOLUME, VARIETY, VELOCITY, COSTS, SKILLS

 Volume: Managing and harnessing massive data sets
 Variety: Harmonizing silos of structured and unstructured data
 Velocity: Keeping up with unpredictable data and query flows
 Costs: Too expensive to acquire, operate, and expand
 Skills: Lack of adequate skills for popular APIs
BIG DATA ANALYTICS MATURITY
FROM JARGON TO TRANSFORMATIONAL BUSINESS VALUE*




 [Figure: Big data jargon (Hadoop, NoSQL, In memory, MPP, Column Store) maturing into Business Value*: Operational Efficiencies, Revenue Growth, New Strategies & Business Models]




 *A McKinsey study titled “Big data: The next frontier for innovation, competition, and productivity”, May 2011, found large potential for Big Data
  Analytics, with metrics as impressive as a 60% improvement in Retail operating margins, an 8% reduction in (US) national healthcare expenditures, and $150B
  savings in operational efficiencies in European economies
BIG DATA ANALYTICS IN THE REAL WORLD
PREVALENT IN DATA INTENSIVE VERTICALS AND FUNCTIONAL AREAS

 Verticals
          Banking
          Telecom
          Global Capital Markets
          Retail
          Government
          Healthcare
          Information Providers

 Functional
          • Marketing Analytics: Digital channels
            Track visits, discover best channel mix: email, social media, search
          • Sales Analytics: Deep correlations
            Predict risks based on deal DNA (emails, meetings) pattern match
          • Operational Analytics: Atomic machine data
            Analyze RFIDs, weblogs, SMS, sensors; continuous operational inefficiency
          • Financial Analytics: Detailed simulations
            Liquidity, portfolio simulations; stress tests, error margins
CHANGE IS AFOOT
MYRIAD OF BIG DATA ANALYTICS SOLUTIONS
CAUSAL LINKS: VARIETY, VELOCITY, VOLUME


 Variety: Events data, Transactional data, Multi-media data, eCommerce data, Graph data
 Velocity: µSeconds; Continuous and/or Bursts
 Volume: Routinely Petabytes
GROWING USER COMMUNITIES


  Data Scientists   Business Analysts     Developers/Programmers




                    Administrators




  Business users     External consumers      Business Processes
HARDWARE IS SUPERIOR



 Small Server farms: Scale out
 Larger Servers with partitions: Scale up

 Spinning disks to SSDs: 1.2x to 2x speed up
 SSDs to Main Memory: 4x to 200x speed up
 Main Memory to CPU caches: 2x to 6x speed up
SOFTWARE EXPECTATIONS HAVE CHANGED




                           Intelligence & Automation
     Execution
   Characteristics



                                                       Performance & Scalability




      Results
   Characteristics


                     Traditional                                         Contemporary
EXECUTION CHARACTERISTICS
PERFORMANCE FOCUS


 Columnar access
 MPP: Shared Nothing, Shared Everything
 Algorithms, computations close to data: InDB Analytics (MapReduce), FPGA filtering
 In-memory processing
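
The columnar access idea can be illustrated with a minimal sketch. The column names (`symbol`, `price`, `volume`) and the values are hypothetical, and plain Python lists stand in for storage pages; a real column store adds compression, indexing, and parallel execution on top of this layout.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"symbol": "SAP", "price": 52.4, "volume": 100},
    {"symbol": "ORCL", "price": 28.2, "volume": 250},
    {"symbol": "TDC", "price": 21.3, "volume": 75},
]


def to_columns(records):
    """Pivot row records into one contiguous list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A row scan visits every record (and all of its fields) to aggregate one column.
total_volume_rows = sum(record["volume"] for record in rows)

# A column scan reads only the values the query needs.
total_volume_columns = sum(columns["volume"])
```

Both scans return the same total; the difference is how much unrelated data has to be touched to get it.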
EXECUTION CHARACTERISTICS
SCALABILITY FOCUS


 Data Compression
      Natural Compression: Column Store Databases
      Compression Techniques: Row Store Databases
      Hybrid Columnar Compression Databases

 Distributed File Systems
      SAN, DAS, NAS

 Data Filtering
      Stream Processing Engines
      Pre-processing Engines
      Transformation Engines
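
One reason column stores compress naturally is that the values of a single column sit next to each other, so repeats form long runs. The sketch below assumes run-length encoding as the technique (the slide does not name a specific algorithm) and uses a hypothetical, already sorted `country_column`.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column, sorted so that equal values are adjacent.
country_column = ["DE"] * 4 + ["US"] * 3 + ["IN"] * 2
encoded = rle_encode(country_column)
```

Nine stored values shrink to three pairs here; the same values scattered across row records would not form runs at all.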
EXECUTION CHARACTERISTICS
INTELLIGENCE FOCUS




    Query & Load Optimization                On-demand systems: Virtualization and provisioning




                                    CPU Caches

                           CPU Cache Conscious Computations
EXECUTION CHARACTERISTICS
AUTOMATION FOCUS




     Data conscious federation                  Automatic Workload Balancer/Mixer




                      User community focused collaborative services
RESULTS CHARACTERISTICS
ACCURACY TOLERANCE FOCUS


 Traditional (associated with SQL)
                            Complex schemas
                            Multiple applications
                            Schema on write
                            Atomic level locking
                            Consistency guarantees across system losses
                            Declarative API
                            Interactive
                            Does encapsulate elements of CAP

 Contemporary (associated with NoSQL)
                            Simple schemas, schema on read
                            Single application
                            Batch oriented
                            Snapshot isolations
                            Eventual consistency guarantees
                            Procedural APIs
                            Does encapsulate elements of ACID
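
The schema entries in the two lists contrast validating data when it is written with interpreting it only when it is read. A minimal sketch of that difference, assuming a hypothetical two-field schema (`ticker`, `price`) and JSON text as the raw storage format:

```python
import json

SCHEMA = {"ticker": str, "price": float}  # hypothetical two-column schema


def write_with_schema(table, record):
    """Schema on write: reject a record before it is stored."""
    for name, expected_type in SCHEMA.items():
        if not isinstance(record.get(name), expected_type):
            raise ValueError(f"bad or missing field: {name}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema on read: store anything, interpret it only when queried."""
    for line in raw_lines:
        try:
            record = json.loads(line)
            yield {"ticker": str(record["ticker"]), "price": float(record["price"])}
        except (ValueError, KeyError, TypeError):
            continue  # unreadable lines are skipped at query time


table = []
write_with_schema(table, {"ticker": "SAP", "price": 52.4})

raw_store = ['{"ticker": "ORCL", "price": "28.2"}', "not json", '{"ticker": "TDC"}']
readable = list(read_with_schema(raw_store))
```

The first approach pays the validation cost once at load time; the second accepts everything and pays at every query, silently losing the two malformed lines in this example.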
BUILDING BIG DATA BRIDGES
ACROSS A HETEROGENEOUS WORLD
COMPREHENSIVE 3-TIER FRAMEWORK
COMMERCIAL AND/OR OPEN SOURCE



                                 Eco-System
     Business Intelligence Tools, Data Integration Tools, DBA Tools, Packaged Apps




                         Application Services
      In-Database Analytics, Multi-lingual Client APIs, Federation, Web Enabled




                            Data Management
                  High Performance, Highly Scalable, Cloud Enabled
RELIABLE DATA MANAGEMENT
                                  Full Mesh High Speed Interconnect



   Data
Management




              Can handle high performance, compression, batch, ad-hoc analysis

              Can routinely scale to Petabyte class problems, thousands of concurrent jobs

              Typical characteristics
                   Massively parallel processing of complex queries
                   In-memory and on-disk optimizations
                   Elastic resources for user communities
                   ACID guarantees
                   Data variety
                   Information lifecycle management
                   User friendly automation tools
                   File systems (schema free) and/or DBMS structures (schema specific)
DATA MANAGEMENT INFRASTRUCTURE
ROBUST, SCALABLE, HIGH PERFORMANCE



                           Data Discovery     Application Modeling   Reports/Dashboards    Business Decisions
                          (Data Scientists)    (Business Analysts)    (BI Programmers)    (Business End Users)

 Infrastructure
 Management                                      Full Mesh High Speed Interconnect
     (DBAs)




                  • Dynamic, elastic MPP grid
                         – Grow, shrink, provision on-demand
                         – Heavy parallelization
                  • Load, prepare, mine, report in a workflow
                         – Privacy through isolation of resources
                         – Collaboration through sharing of results/data via sharing of resources
VERSATILE APPLICATION SERVICES
 Application Services
                           Programming APIs: Python, ADO.NET, PERL, PHP, Ruby, Java, C++
                           Web Services API
                           In-Database Analytics Plug-Ins: SQL, PMML, C++, JAVA, …




                                 Comprehensive declarative and procedural APIs
                                 In-Database Analytics Plug-In APIs
                                 In-Database Web Services
                                 Query and data federation APIs
                                 Multi-lingual Client APIs
VERSATILE APPLICATION SERVICES
RICH ALGORITHMS CLOSE TO DATA
 In-database + In-process
   [Diagram: User’s DLL “A” and User’s DLL “B” are loaded (LOAD) into the Sybase IQ Process, In Memory]
 • In-process dynamically loaded shared libraries
 • Highest possible performance
 • Incurs security risks, but manageable via privileges
 • Incurs robustness risks, but manageable via multiplex

 In-database + Out-process
   [Diagram: the Sybase IQ Process, In Memory, makes RPC CALLS to a Library Access Process, which loads (LOAD) User’s DLL “A” and User’s DLL “B”]
 • Out of process shared library
 • Lower security risks
 • Lower robustness risks
 • Lower performance than in-process but better than out of database

 Multi-lingual APIs
   Scalar to Scalar
   Scalar sets to Aggregate
   Scalar sets to Dimensional Aggregates
   Scalar sets to Multi-attribute (bulk)
   Multi-attribute (bulk) to Multi-attribute (bulk)
VERSATILE APPLICATION SERVICES
NATIVE MAPREDUCE

  For stocks in enterprise software sector, find max relative strength of a stock for a trading day*

 Input to Map Fn
   Key (k1)                Value (v1)
   30-min interval time    Ticker Symbol   TickValue Day 1   TickValue Day 2
   9:30 am                 SAP             51                52.4
   9:30 am                 ORCL            31                28.2
   9:30 am                 TDC             22                21.3
   10:00 am                SAP             50.9              53.1
   10:00 am                TDC             21.8              20.9
   10:00 am                ORCL            29.4              27.1
   …..                     ORCL            ……                …..

 Output of Map Fn, input to Reduce Fn
   Weighted variance = (A given stock’s variance / Average variance across all “N” stocks)
   Key (k2)                                 Value (v2)
   Ticker Symbol   30-min interval time     Weighted variance
   SAP             9:30 am                  +1.4 / (SUM (+1.4-2.8-0.7….)/“N” stocks)
   SAP             10:00 am                 +2.2 / (SUM (+2.2-2.3-0.9….)/“N” stocks)
   SAP             ……                       ……
   ORCL            9:30 am                  -2.8 / (SUM (+1.4-2.8-0.7….)/“N” stocks)
   ORCL            10:00 am                 -2.3 / (SUM (+2.2-2.3-0.9….)/“N” stocks)
   ORCL            …….                      …..
   TDC             9:30 am                  -0.7 / (SUM (+1.4-2.8-0.7….)/“N” stocks)
   TDC             10:00 am                 -0.9 / (SUM (+2.2-2.3-0.9….)/“N” stocks)
   TDC             …..                      ……

 Output of Reduce Fn
   Ticker Symbol    Max Absolute Weighted Variance (v3)
   SAP              Max (ABS(9:30 Wt Var), ABS(10:00 Wt Var), …..)
   ORCL             Max (ABS(9:30 Wt Var), ABS(10:00 Wt Var), …..)
   TDC              Max (ABS(9:30 Wt Var), ABS(10:00 Wt Var), …..)




 *Calculate max variance for the day by comparing each 30-min interval tick values across two days: the trading day & the
  day before, weighted by average variance of all stocks for each 30-min interval
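
The map and reduce steps described on this slide can be sketched in plain Python. This is an illustration only, not the Sybase IQ implementation: it uses just the three tickers and two intervals shown, treats "variance" as the day-over-day change in tick value as the slide does, and would fail with a division by zero if an interval's average change were exactly zero.

```python
from collections import defaultdict

# (30-min interval, ticker, tick value day 1, tick value day 2), from the slide
ticks = [
    ("9:30 am", "SAP", 51.0, 52.4),
    ("9:30 am", "ORCL", 31.0, 28.2),
    ("9:30 am", "TDC", 22.0, 21.3),
    ("10:00 am", "SAP", 50.9, 53.1),
    ("10:00 am", "TDC", 21.8, 20.9),
    ("10:00 am", "ORCL", 29.4, 27.1),
]


def map_weighted_variance(rows):
    """Map: per interval, divide each stock's change by the interval average."""
    by_interval = defaultdict(list)
    for interval, ticker, day1, day2 in rows:
        by_interval[interval].append((ticker, day2 - day1))
    for interval, changes in by_interval.items():
        average = sum(change for _, change in changes) / len(changes)
        for ticker, change in changes:
            yield (ticker, interval), change / average


def reduce_max_abs(mapped):
    """Reduce: per ticker, keep the largest absolute weighted variance."""
    result = defaultdict(float)
    for (ticker, _interval), weighted in mapped:
        result[ticker] = max(result[ticker], abs(weighted))
    return dict(result)


max_strength = reduce_max_abs(map_weighted_variance(ticks))
```

The map step is keyed by interval because the average is taken across stocks within one interval; the reduce step is keyed by ticker because the maximum is taken across intervals for one stock.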
VERSATILE APPLICATION SERVICES
NATIVE MAPREDUCE – DECLARATIVE WAY
 For stocks in enterprise software sector, find max relative strength of a stock for a trading day

  • Map function declaration: CREATE PROCEDURE MapVarTPF (IN XY TABLE (a1 char, a2 datetime, a3 float, a4 float))
                             RESULT SET YZ (b1 char, b2 datetime, b3 float)
  • Reduce function declaration: CREATE PROCEDURE RedMaxVarTPF (IN XY TABLE (a1 char, a2 datetime, a3 float))
                             RESULT SET YZ (b1 char, b2 float)
  • Query: SELECT RedMaxVarTPF.TickSymb, RedMaxVarTPF.MaxVar
               FROM RedMaxVarTPF (TABLE (SELECT MapVarTPF.TickSymb, MapVarTPF.30MinIntTime, MapVarTPF.Var
                           FROM MapVarTPF (TABLE (SELECT TickDataTab.TickSymb, TickDataTab.30MinIntTime,
                                                      TickDataTab.30MinValDay1, TickDataTab.30MinValDay2
                                                  FROM TickDataTab)
                                         OVER (PARTITION BY TickDataTab.30MinIntTime)))
                           OVER (PARTITION BY MapVarTPF.TickSymb))
              ORDER BY RedMaxVarTPF.TickSymb
  • Native MapReduce parallel execution workflow:

    MapVarTPF (Partitioned to 15 parallel instances)   RedMaxVarTPF (Partitioned to 25 parallel instances)   SQL Query collates output using 1 node

                                    …….                                                …….                                         …..

                      SAN Fabric                                         SAN Fabric                                   SAN Fabric




  • Native MapReduce with unstructured data: Native MapReduce can also easily be applied to unstructured data (e.g. text,
    multi-media, …) stored in the DBMS, or to unstructured data brought into the DBMS at execution time from external files
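
The parallel execution workflow above (partition the input, run one function instance per partition, collate on a single node) can be approximated as follows. This is a sketch under stated assumptions: threads stand in for parallel server instances, and `run_partitioned` and `count_per_symbol` are hypothetical names, not Sybase IQ APIs.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_partitioned(rows, key, table_function, workers=4):
    """Group rows by key, run table_function on each group concurrently,
    then collate all output rows in the caller (the single final step)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[key(row)].append(row)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = pool.map(table_function, partitions.values())
        return [out_row for output in outputs for out_row in output]


def count_per_symbol(partition):
    """Hypothetical table function: emits one output row per partition."""
    return [(partition[0][0], len(partition))]


rows = [("SAP", 1), ("ORCL", 2), ("SAP", 3), ("TDC", 4), ("SAP", 5)]
counts = sorted(run_partitioned(rows, key=lambda row: row[0],
                                table_function=count_per_symbol))
```

The `key` argument plays the role of the PARTITION BY clause: every row with the same key is guaranteed to reach the same function instance.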
RICH ECO-SYSTEM
              Source                                                               Answers
                       Data preparation                                 Data Usage


 Eco-System
                                               DBMS /
                                              Filesystem

                          Event Processing   Data Federation        Business Intelligence



                             Data Modeling / Database Design Tool



                        Business Intelligence Tools

                        Data Integration Tools

                        Data Mining Tools

                        Application Tools

                        DBA Tools
RICH ECO-SYSTEM
DBMS <–> HADOOP BRIDGE I

 Feature
    Client Side Federation: Join data from DBMS AND Hadoop at a client application level

 Characteristics
    • Client tool capable of querying DBMS and Hadoop
    • Better performance when results from sources are pre-computed/pre-aggregated

 Big Data Use Cases
    • Ideal for bringing together Big Data Analytics pre-computations from different domains
    • Example, in Telecommunication: DBMS has aggregated customer loyalty data & Hadoop has aggregated network
      utilization data; Quest Toad for Cloud can bring data from both sources, linking customer loyalty to
      network utilization or network faults (e.g. dropped calls)

 [Diagram: Quest Toad for Cloud connected to both DBMS and Hadoop/Hive]
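
Client-side federation amounts to joining two already aggregated result sets inside the client tool. A minimal sketch, assuming hypothetical region names, scores, and rates, and a hypothetical helper `client_side_join` (this is not how Quest Toad for Cloud is implemented):

```python
# Pre-aggregated result sets, as a client tool might receive them.
# All region names and numbers below are hypothetical.
loyalty_from_dbms = [
    {"region": "north", "avg_loyalty_score": 8.1},
    {"region": "south", "avg_loyalty_score": 6.4},
]
network_from_hadoop = [
    {"region": "north", "dropped_call_rate": 0.01},
    {"region": "south", "dropped_call_rate": 0.07},
    {"region": "west", "dropped_call_rate": 0.02},
]


def client_side_join(left, right, key):
    """Hash join performed in the client, not in either data source."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


joined = client_side_join(loyalty_from_dbms, network_from_hadoop, "region")
```

Because the join runs in the client, both inputs must fit there comfortably, which is why pre-aggregated results perform better than raw data.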
RICH ECO-SYSTEM
DBMS <–> HADOOP BRIDGE II

 Feature
    ETL
    Load Hadoop Data into DBMS column store: Extract, Transform, Load data from HDFS (Hadoop Distributed File System)
    into DBMS schemas

 Characteristics
    • Extract & load subsets of HDFS data into DBMS store
       • Raw data from HDFS
       • Results of Hadoop MR jobs
    • HDFS data stored in DBMS is treated like other DBMS data
       • Gets ACID properties of a DBMS
       • Can be indexed, joined, parallelized
       • Can be queried in an ad-hoc way
    • Visible to BI and other client tools via DBMS ANSI SQL API only
    • Currently, the bulk data transfer utility SQOOP (built by Cloudera) can be used to provide this ETL capability

 Big Data Use Cases
    • Ideal for combining subsets of HDFS unstructured data or summary of HDFS data into DBMS for mid to long
      term usage in business reports
    • Example, in eCommerce: clickstream data from weblogs stored in HDFS and outputs of MR jobs on that data
      (to study browsing behavior) are ETL’d into DBMS. The transactional sales data in DBMS is joined with
      clickstream data to understand and predict customer browsing to buying behavior

 [Diagram: Clickstream Data (Hadoop/Hive) -> SQOOP -> Sales Data (DBMS)]
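
Once the HDFS data has been loaded, it behaves like any other table: it can be indexed, joined, and queried ad hoc. The sketch below illustrates only that end state. It assumes an in-memory SQLite database as a stand-in for the DBMS and a hypothetical list `clickstream_export` in place of a real Sqoop export; table and column names are made up.

```python
import sqlite3

# Hypothetical rows standing in for clickstream data exported from HDFS.
clickstream_export = [("c1", "laptop", 5), ("c2", "laptop", 1), ("c1", "phone", 2)]

db = sqlite3.connect(":memory:")  # stand-in for the DBMS
db.execute("CREATE TABLE sales (customer TEXT, product TEXT, amount REAL)")
db.execute("CREATE TABLE clicks (customer TEXT, product TEXT, views INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)",
               [("c1", "laptop", 999.0), ("c2", "phone", 499.0)])

# Load step: inside the transaction the rows get the DBMS's usual guarantees.
with db:
    db.executemany("INSERT INTO clicks VALUES (?, ?, ?)", clickstream_export)
db.execute("CREATE INDEX clicks_by_customer ON clicks (customer, product)")

# Ad-hoc join of browsing behavior with transactional sales data.
browse_to_buy = db.execute(
    """SELECT c.customer, c.product, c.views, s.amount
       FROM clicks AS c JOIN sales AS s
         ON s.customer = c.customer AND s.product = c.product
       ORDER BY c.customer"""
).fetchall()
```

Only the customer who both viewed and bought the same product appears in the result, which is the browsing-to-buying link the use case describes.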
RICH ECO-SYSTEM
DBMS <–> HADOOP BRIDGE III

 Feature
    Join HDFS data with DBMS data on the fly: Fetch and join subsets of HDFS data on-demand using SQL queries
    from DBMS (Data Federation technique)

 Characteristics
    • Scan and fetch specified data subsets from HDFS via table UDF
       • Can read and fetch HDFS data subsets
       • Called as part of SQL query
       • Output joinable with DBMS data
       • Multiple, simultaneous UDF calls possible
       • Sample UDFs provided in JAVA, C++
    • HDFS data not stored in DBMS
       • Fetched into DBMS In-memory tables
       • ACID properties not applicable
       • Repeated use: put fetched data in tables
    • Visible to BI/other client tools via ANSI SQL API only

 Big Data Use Cases
    • Ideal for combining subsets of HDFS data with DBMS data for operational (transient) business reports
    • Example, in Retail: Point Of Sale (POS) detailed data stored in HDFS. DBMS EDW fetches POS data for
      specific hot selling SKUs from HDFS at fixed intervals and combines it with inventory data in DBMS to
      predict and prevent inventory “stockouts”.

 [Diagram: POS Data (Hadoop/HDFS) <-> UDF Bridge <-> Inventory Data (DBMS)]
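
The retail example can be sketched with a generator standing in for the table UDF: it scans external rows, returns only the requested subset, and the result is joined with DBMS-side data without being persisted. All names (`fetch_pos_rows`, `stockout_risk`), SKUs, quantities, and the threshold are hypothetical.

```python
def fetch_pos_rows(hdfs_rows, hot_skus):
    """Stand-in for a table UDF: scan external rows, return only the subset asked for."""
    for sku, units_sold in hdfs_rows:
        if sku in hot_skus:
            yield sku, units_sold


def stockout_risk(pos_rows, inventory, threshold):
    """Join fetched rows with in-DBMS inventory; nothing is persisted."""
    sold = {}
    for sku, units_sold in pos_rows:
        sold[sku] = sold.get(sku, 0) + units_sold
    return sorted(sku for sku, units in sold.items()
                  if inventory.get(sku, 0) - units < threshold)


# Hypothetical data: POS detail in "HDFS", inventory in the "DBMS".
pos_in_hdfs = [("sku1", 40), ("sku2", 5), ("sku1", 30), ("sku3", 90)]
inventory_in_dbms = {"sku1": 75, "sku2": 100, "sku3": 95}

at_risk = stockout_risk(fetch_pos_rows(pos_in_hdfs, {"sku1", "sku2"}),
                        inventory_in_dbms, threshold=10)
```

Note that `sku3` is never examined even though it would breach the threshold, because the fetch was restricted to the hot-selling SKUs.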
RICH ECO-SYSTEM
DBMS <–> HADOOP BRIDGE IV

 Feature
    Combine results of Hadoop MR jobs with DBMS data on the fly: Initiate and join results of Hadoop MR jobs
    on-demand using SQL queries from DBMS data (Query Federation technique)

 Characteristics
    • Trigger and fetch Hadoop MR job results via table UDF
       • Can trigger Hadoop MR jobs
       • Called as part of Sybase IQ SQL query
       • Output joinable with Sybase IQ data
       • No multiple, simultaneous UDF calls
       • Sample UDFs provided in JAVA only
    • HDFS data not stored in DBMS
       • Fetched into DBMS In-memory tables
       • ACID properties not applicable
       • Repeated use: put fetched data in tables
    • Visible to BI and other client tools via DBMS ANSI SQL API only

 Big Data Use Cases
    • Ideal for combining Hadoop MR job results with DBMS data for operational (transient) business reports
    • Example, in Utilities: Smart meter and smart grid data can be combined for load monitoring and demand
      forecast. Smart grid transmission quality data (multi-attribute time series data) stored in HDFS can be
      computed via Hadoop MR jobs triggered from DBMS and combined with Smart meter data stored in DBMS to
      analyze demand and workload.

 [Diagram: Smart Grid Transmission Data (Hadoop/HDFS) <-> UDF Bridge <-> Smart Meter consumption data (DBMS)]
RICH ECO-SYSTEM
DBMS <–> PREDICTIVE TOOLS BRIDGE
               Express Complex Computations In Industry Standard Predictive Modeling
                Markup Language (PMML), Plug In Models Close To data for execution


 [Diagram components: Applications exchange SQL and Predictions with the DBMS Database Server; PMML (models) pass
  through a PMML Preprocessor (convert & validate) and are exposed as UDFs via the Universal Plug-In Bridge]
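
To make the PMML idea concrete, the sketch below parses a small regression model and scores one row with it. It is illustrative only: the PMML fragment is simplified (namespace, data dictionary, and mining schema omitted), the predictor names `usage` and `tenure` and all coefficients are hypothetical, and `load_linear_model` and `score` are not part of any Sybase product.

```python
import xml.etree.ElementTree as ET

# Simplified PMML regression fragment (namespace and data dictionary omitted).
PMML_MODEL = """
<PMML version="4.1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="usage" coefficient="2.0"/>
      <NumericPredictor name="tenure" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_linear_model(pmml_text):
    """Parse and validate the model, returning (intercept, coefficients)."""
    table = ET.fromstring(pmml_text).find("./RegressionModel/RegressionTable")
    if table is None:
        raise ValueError("no RegressionTable found")
    coefficients = {p.get("name"): float(p.get("coefficient"))
                    for p in table.findall("NumericPredictor")}
    return float(table.get("intercept")), coefficients


def score(model, row):
    """Scoring function of the kind a UDF could expose to SQL."""
    intercept, coefficients = model
    return intercept + sum(weight * row[name] for name, weight in coefficients.items())


model = load_linear_model(PMML_MODEL)
prediction = score(model, {"usage": 3.0, "tenure": 4.0})
```

The model is parsed once and scored many times, which mirrors the split between a convert-and-validate step and the UDF that runs close to the data.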
RICH ECO-SYSTEM
FUNDAMENTALS OF STREAMS TECHNOLOGY

                                     Process data without storing it



 Input Streams
    Events arrive on input streams

 Derived Streams, Windows
    Apply continuous query operators to one or more input streams to produce a new stream

 Continuous Queries create a new “derived” stream or window
    SELECT FROM one or more input streams/windows
    WHERE…
    GROUP BY…

 Windows can Have State
    • Retention rules define how many or how long events are kept
    • Opcodes in events can indicate insert/update/delete and can be automatically applied to the window
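
A stateful window with a retention rule and opcode handling can be sketched in a few lines. This is plain Python, not CCL: the class `KeyedWindow`, the count-based retention rule, and the event tuples are all assumptions made for illustration.

```python
from collections import OrderedDict


class KeyedWindow:
    """Stateful window keyed by event id, keeping at most max_events rows."""

    def __init__(self, max_events):
        self.max_events = max_events
        self.rows = OrderedDict()

    def apply(self, opcode, key, value=None):
        """Apply one event's opcode: 'insert', 'update', or 'delete'."""
        if opcode == "delete":
            self.rows.pop(key, None)
            return
        if opcode == "insert":
            self.rows.pop(key, None)  # re-insert moves the key to the newest slot
        self.rows[key] = value
        while len(self.rows) > self.max_events:  # retention rule: count-based
            self.rows.popitem(last=False)        # evict the oldest event

    def total_by_symbol(self):
        """Continuous-query style aggregate (GROUP BY symbol) over current state."""
        totals = {}
        for symbol, quantity in self.rows.values():
            totals[symbol] = totals.get(symbol, 0) + quantity
        return totals


window = KeyedWindow(max_events=3)
events = [
    ("insert", 1, ("SAP", 10)),
    ("insert", 2, ("ORCL", 5)),
    ("update", 1, ("SAP", 12)),
    ("insert", 3, ("SAP", 1)),
    ("insert", 4, ("TDC", 7)),   # pushes out the oldest row (id 1)
    ("delete", 2, None),
]
for opcode, key, value in events:
    window.apply(opcode, key, value)
```

After the last event the window holds only ids 3 and 4, so the grouped totals reflect what the retention rule and the delete left behind rather than everything that ever arrived.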
RICH ECO-SYSTEM
STREAMS DATA PROCESSING VS TRADITIONAL DATA PROCESSING



      SQL                                        CCL
      Tables                                     Windows on Event Streams
      Rows                                       Events
      Columns                                    Fields
      On-Demand: query runs when                 Event-Driven: query updates when
      information is needed                      information arrives
RICH ECO-SYSTEM
STREAMS PRE-PROCESSING
   Why store Big Data when you can deal with Small Data? Pre-filter unnecessary data on the fly with Streams technologies



                                                    ESP Engine



                                                                                          Alerts Actions

               Updates



               Memory


                 Disk




                                      Hadoop/HDFS                      DBMS
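
Pre-filtering on the fly means deciding per event whether it is worth storing at all. A minimal sketch, assuming hypothetical sensor readings and a hypothetical relevance rule (temperatures outside 0 to 60 °C); a real ESP engine would express this as a continuous query rather than a Python generator.

```python
def prefilter(events, is_relevant):
    """Yield only relevant events; everything else is dropped without being stored."""
    for event in events:
        if is_relevant(event):
            yield event


# Hypothetical sensor readings; only out-of-range temperatures are worth keeping.
readings = [
    {"sensor": "a", "temp_c": 21.0},
    {"sensor": "b", "temp_c": 95.5},
    {"sensor": "a", "temp_c": 22.1},
    {"sensor": "c", "temp_c": -40.0},
]

stored = list(prefilter(readings, lambda r: not 0.0 <= r["temp_c"] <= 60.0))
```

Half of the readings never reach Hadoop/HDFS or the DBMS in this example, which is the storage saving the slide argues for.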
SUMMARY
3-LAYER LOGICAL INTEGRATION
   STREAM PROCESSING <-> NoSQL <-> DBMS




              BI TOOLS      DI TOOLS    DBA TOOLS    DATA MINING TOOLS
Eco-System
                                                                                    Unstructured
                                                                                    Data
     App                                                         Ingest + Persist   (Hadoop,
 Services      Web 2.0       Java       C/C++        SQL             Federation
                                                                                    Content Mgmt)



                                                                                    Structured Data
                                                                                    (DBMS)


   DBMS
                                                                                    Streaming Data
                                                                                    (ESP)




             The heterogeneous world will require co-existence and playing nice!
Q&A

Learn More: http://www.sybase.com/sybaseiqbigdata
Contact: 1-800-SYBASE5 (792.2735)
