Introduction To Big Data Analytics

This document provides an overview of an introduction to big data analytics course. The course consists of 1.5 credits over 7-8 lectures that cover topics like Hadoop, NoSQL technologies, machine learning concepts, and applications of big data analytics. The course structure, contents, and some key motivations for using big data analytics are outlined.

Introduction to Big Data Analytics
Pankaj Sahay
Course Structure
• 1 ½ credits – 7-8 lec x 2 hrs each
• Assignment / Test – Internal
• Presentations by Students
• Final Exam
Course Contents
• Motivation
• Introduction to Big Data
• Mining Big Data & the Platform
• Toolkits used for Big Data Analytics
• Overview of Hadoop
• Overview of NoSQL Technologies
• Review of Key ML concepts
• Enterprise Data Science
• Some Applications of Big Data Analytics – Introduction

Ref: Practical Big Data Analytics – Nataraj Dasgupta – 2018, Packt Publishing
Big Data: Understanding How Data Powers Big Business – Bill Schmarzo – 2013, Wiley
Motivation
Brief History and Timelines:
• 1980s – POS scanner Data –
Changed the balance of power between CPG Manufacturers & Retailers
e.g. P&G, Unilever, Frito Lay, Kraft v/s Walmart, Tesco
• Detailed Data – Product Sales, Customer Loyalty Data
Retailers got new insights about Product Sales, Customer Buying
Patterns, market Trends not available earlier

* Data changed Business Models of Many Companies


.. Late 1990s
• Web
• New information / knowledge currency – web clicks
Online v/s brick & mortar
• Web logs – Product sales & consumer purchase behaviour
• Manipulate user experience to influence purchase choice
• Recommendation engines, targeted ads, etc.

* Led to change in business models


Business Revolution due to Data
• Social Media
• Sensor Data, Machine generated Data
• Scale of Business and Operations, Transactions
• Telecom / Mobile / Smartphones
• Convergence of Platforms & Technologies
• SM – optimize customer engagement
• M/c or Sensor Data – Real time, high granularity
• Mobile – location based & real time customer engagement
* MASSIVE VOLUME OF DATA, UN/STRUCTURED
Explosion of Data & Sources
• Everything we do leaves a Trace - Data
• Activity Data – music, smartphones, Credit Cards
• Conversation Data – WA, FB, Twitter…
• Photo & Video Images
• Sensor Data – GPS, Temperature, Health or Environment Parameters
• IOT – smart watches, fridge, traffic sensors to alarm clock,..
Challenges to IT & the Enterprise
• Old BI & Data Warehouse / Mgmt too slow and rigid to allow
capturing and exploiting fast-moving, short-lived opportunities
• Batch processing & Analytics about past v/s predictive req
• Timely or real-time availability of SM, mobile, .. data needed for customer
experience, acquisition and retention
• Sampling & Aggregation – information and nuances are lost, even though they
contain new insights about customer, product, operations, market
Blitz of Data – Opportunity Cost
• Competitors innovating and adapting faster, lower costs, more value
• Customer acquisition, service and retention <–> Profits and Margins
• Declining market share – Right time, right customer, right service
• Real time information on all aspects – customer sentiment, product
or service performance, fleeting opportunities – Missed business
opportunities – competitors gain advantage
Transform Nature of Business
• Move NOW
• Move FAST
• Analyse Big Data, Decide & Service Customers NOW
• Rapidly changing data, customer preferences – adapt NOW
• Address rapidly changing market and product/service feedback NOW
• Analyse Technology utility, sensor data & provide real time solutions
Business Transformation – Decision Making
Past v/s With Big Data
• Rearview v/s Forward looking / Predictive
• Use very little data v/s Use Diverse Data / Sources
• Batch, Incomplete v/s Real Time, Correlated
• Business Monitoring v/s Business Transformation / Optimisation
Case Study - Walmart
• Sell goods at lowest price
• Remove Middlemen, connect with Mfg → reduce costs
• Buy cheap, large stocks, sell cheap
• Bar codes at checkout counters – track consumer behaviour → invest
in s/w for the same
• Real time data – shared with the suppliers – impact on manufacturers
– improve productivity and efficiency
• Enabled Walmart to dictate the price, volume, delivery, packaging,
quality of products
* Impact - Inverted the supplier – retailer relationship
Change in Relationship and Power
Manufacturers – Unilever, P&G, General Mills..
• Dictate Quantity, Price, Promotions
Retailers using Big Data Analytics – Walmart
• Customer Insights – POS Data
• Customer Behaviour – buying behaviour, willing to pay how much,
promotions liked
• Cluster of purchases, Customer Loyalty card
• Retailers Dictate Terms
• Monetise POS Data by selling it to Mfg
Transformation of Key Business Processes
• Procurement – Cost effective suppliers, on-time, no damage
• Product Dev – Usage Insights → new product launch
• Mfg – M/C, Process – Quality Control
• Distribution – Optimisation of - Inventory levels, Supply Chain activities –
impact of weather, holidays, economic conditions
• Marketing – Identify promotions, campaigns driving customer traffic,
engagement, sales, Optimise Marketing mix
• Pricing & Yield Mgmt – Optimisation for transient & short term goods
• Sales – Resources, products mix, commissions, ..
• Human Resources – Identify characteristics of Effective employees, churn
Big Data Business Model Maturity
Identify current state & where to be in future
• Business Monitoring – Performance - Descriptive
• Business Insights – Use Analytics, Actionable Recommendations
• Business Optimisation – Use Analytics to optimise Business Processes
• Data Monetisation – Use above to generate revenue
• Business Metamorphosis – Move from Product Centric to Platform
Centric or Ecosystem Centric Model
Business Monitoring
• Deploy BI & Data Warehouse to Monitor, Report, on-going Business &
Performance
• Use basic Analytics – Flag under/over performing areas
• Automate Alerts with req Info to certain parties – events – owners
• Trending – time series, moving averages
• Comparisons with historical data, events, situations
• Benchmark – v/s past, campaigns, industry
• Monitor and index performance – customer, product perf, financial
• Share – of market, etc
• “DASHBOARDS”
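
The trending and flagging bullets above can be sketched in a few lines. This is a minimal illustration, not a dashboard implementation: the function names, the 20% tolerance and the weekly sales figures are all hypothetical, and the baseline is assumed to be a positive number.

```python
from statistics import mean

def moving_average(values, window):
    """Simple moving average: one output per full window of `window` points."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [mean(values[i - window + 1 : i + 1])
            for i in range(window - 1, len(values))]

def flag_outliers(values, window, tolerance=0.2):
    """Flag points that deviate from the average of the previous `window`
    points by more than `tolerance` (a fraction; 0.2 is an assumed 20%)."""
    flags = []
    for i in range(window, len(values)):
        baseline = mean(values[i - window : i])  # assumed positive
        deviation = (values[i] - baseline) / baseline
        if deviation > tolerance:
            flags.append((i, "over"))
        elif deviation < -tolerance:
            flags.append((i, "under"))
    return flags

# Hypothetical weekly sales for one store
weekly_sales = [100, 104, 98, 102, 140, 101, 70]
trend = moving_average(weekly_sales, 3)
alerts = flag_outliers(weekly_sales, 3)
print(alerts)  # [(4, 'over'), (6, 'under')]
```

Week 4 is flagged as over-performing and week 6 as under-performing relative to the three weeks before each; an automated alert would forward exactly these flagged points to the owners.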
Business Insights
• Go to next step after Monitoring
• Use Advanced Stats, Predictive Analytics, Data Mining, Real-time,..
• Identify actionable Business Insights possible to Integrate into Business
Processes
• “Intelligent Dashboards”
• Uncover material & insights buried deep in Data
• Make specific, Actionable recommendations
• Identify areas where adv analytics can impact business
• Ask how users of analytics use the data to decide/act
• Launch Pilot to integrate & generate actionable recommendations
Business Optimisation
• Use Analytics to Optimise parts of Business Ops
• Marketing spend allocation based on in-flight campaign/promo
• Scheduling – purchase history, buyer behaviour, weather, events
• Distribution & Inventory Optimisation – add demographic data
• Product pricing – buying patterns, inventory, product usage, insights from SM,
sentiments, etc
• Algorithmic trading – Fintech
• Transition to Optimisation – identify areas, related business question & decision
making process, along with data sources, nature & arrival of data, models reqd,
etc
• Develop prototype – impact on business, financials
• Create analytics governance process, integrate with business
Data Monetisation
• Leverage Big Data to generate New revenue opportunities
• Customer, product, marketing, etc – insights – sell to other org
e.g. smartphone apps – sell to marketers and mfg end-point data
• “Intelligent Products” – integrated with analytics directly
Cars – driving patterns to adjust parameters to suit driver style,
TV – learn likes of user and search relevant channels
Ovens – learn food cooking preferences & cook accordingly
• Redefine customer experience based on Analytics
Identify Target User experience requirements, integrate & leverage all
capabilities
Business Metamorphosis
• Comprehensive change
• Transform business model, new services, new markets
• Move from product- to platform- or eco-centric model
• Data & Analytics is an asset to use, granular & vast, varied, ..
• Models used in your BP, decisions making, customer service, etc are
your differentiators and Intellectual Property
• Integration & Cultural change in the Org, customers, ecosystem
• Move from HiPPO (Highest Paid Person's Opinion) to Analytics based decision making
Limitations of Traditional DBMS
• Storage
• Large Query Time
• Processing Power
• Structured Data
• Architecture
• Scalability
• Cost
Big Data Analytics
• What is Big Data
• Exceeds processing power of conventional DB systems
• Tools required to analyse Big Data
• Markets more competitive - more insights and faster decision making
• Real time response
• Open Source Technologies
• Cloud Computing & Distributed Systems used to Leverage
• Hadoop – Open source s/w framework, distributed processing across
cluster of computers, scalable to thousands of servers, with local
computation and storage
• Early Adopters – Google, Yahoo, Amazon, eBay
What is Big Data
3 Vs of Big Data
Volume – Data increasing exponentially
• 90% of data at present generated in the last 2 years
• File size and sources increasing
• Sensors, RFID, User generated on SM, transaction – Petabytes/day
Velocity - Rate at which data is coming in
• Social Media – 3-4 B likes / day, 400M tweets / day
• Processing in Real Time reqd, batch processing irrelevant or too late
e.g. Fraud Detection – Transactions, Call record in M or B / day
• Process faster than arrival of new data – personalised reco instantly
Variety – Structured & Unstructured
• Social Media, Docs, log, emails..
BDA for Practitioners & End Users
Salient characteristics
• What is Big Data Mining
• Enterprise – Build use case, Stakeholders, Implementation cycle

Key Technologies in Big Data Mining


• Selecting the h/w stack – single/multi-node, cloud-based
• Selecting the s/w stack – Hadoop, Spark, NoSQL, cloud-based environment
Big Data Analytics
• Big Data Mining

Life cycle of Processing Large Scale Datasets – procurement to
implementation of tools for analysis

• Predictive Analytics
Methods to obtain insights and address business problems / actions

• Enterprise – What business objectives the solution will address


Big Data Strategy – Build the Case for it
• Determine the appropriate USE CASES & NEEDS the platform will
address
• BU Level – find relevant problems → deliver value – measurable

• Selection of the appropriate h/w & s/w stack – e.g. different for
streaming data, or internal data
Steps for Building the Case
• WHO
• WHAT
• BUY IN (STAKEHOLDERS)
• EARLY WINS, EFFORT to REWARD Ratio
• LEVERAGE Early Wins
Who needs Big Data Mining
• Business groups – most significant impact from solution
• Any groups already working with large datasets, important to
business, direct impact on revenue
• Optimise their processes – impact on daily work processes, impact on
final outcome
Determine the Use Cases
• Units identified in the previous step
• Do they already have a platform – then prioritise among the various
use cases – requires familiarity with the work being done in BU
• Hierarchical structure – Management with oversight of Unit, Staff
who are hands on with the analysis – both must collaborate
• Management – business requirement, which use case will give the
most benefit
• Staff / Practitioners – Challenges at the operational level
• Consolidate both operational & Managerial aspects – what is the
optimal outcome
Stakeholders’ Buy-in
• Decision makers, Budget owners
• Prior to starting work, establish their consensus
• Multiple buy-ins for redundancy → pool of support from primary and
secondary sources for funding support & extending early wins into a
larger project
• Baseline – Value from a certain Use Case – leverage on success
Early Wins & Effort to Reward Ratio
• After identification of Appropriate Use Cases
• Which has a good effort to reward ratio
• Small use case – short time to implement – small budget – specific business
critical function – Early WIN – increase credibility of solution
• Say E/R Ratio = ( Time + Cost + No of Resources + Criticality of use case ) /
Business Value
• Effort – time & work reqd to implement use case, procurement, man hrs,
etc
• Barrier to entry – open source tool – less barrier to entry v/s proprietary –
procurement, risk analysis, approval
• Multiple Units – resources already engaged in other projects
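
The E/R Ratio above can be used to rank candidate use cases. A minimal sketch follows, assuming every input is first scored on a common 1-10 scale so the terms can be added (the slide does not specify units); the class name, the scores and the use cases are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class UseCase:
    name: str
    time: float            # assumed score 1-10, not hours
    cost: float            # assumed score 1-10, not currency
    resources: float       # assumed score 1-10
    criticality: float     # assumed score 1-10
    business_value: float  # assumed score 1-10

    def effort_reward_ratio(self) -> float:
        """E/R Ratio = (Time + Cost + Resources + Criticality) / Business Value."""
        if self.business_value <= 0:
            raise ValueError("business_value must be positive")
        effort = self.time + self.cost + self.resources + self.criticality
        return effort / self.business_value

def rank_use_cases(cases):
    """Lowest ratio first: least effort per unit of business value."""
    return sorted(cases, key=lambda c: c.effort_reward_ratio())

# Hypothetical candidates
candidates = [
    UseCase("enterprise data lake", 9, 9, 8, 6, 10),  # 32 / 10 = 3.2
    UseCase("churn dashboard", 2, 2, 1, 3, 8),        # 8 / 8 = 1.0
    UseCase("fraud alerts", 4, 3, 3, 8, 9),           # 18 / 9 = 2.0
]
ranked = [c.name for c in rank_use_cases(candidates)]
print(ranked)  # ['churn dashboard', 'fraud alerts', 'enterprise data lake']
```

Under these assumed scores the small dashboard project has the lowest ratio, which matches the idea of picking a short, low-budget use case as the early win.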
Leverage Early Wins
• Paves way for bigger strategy & implementation across
• First crucial step in showing the value to the stakeholders & decision
makers
• Gets past sceptics or those who are not aware
Implementation Life Cycle
• Multiple Steps
• Trial & Error
• Perseverance
• Multiple Stakeholders – Collaborative Effort – best results
Stakeholders of the Solution
• Depends on the Use Case & Domain
• Business Sponsor – Individual / BU gives support & funding for Project
Most likely the beneficiary of the solution or max impact on Unit
• Implementation group – who / which team will implement hands-on
Most often the IT or Analytics Unit
• IT Procurement – Vetting – technology, cost, organisational relevance &
viability, compliance with internal / external policies, other aspects –
licensing, costs, upgrades, etc
• Legal – Terms & Conditions – permissions of use, restrictions
Proprietary – requires more vendor specific agreements & time for approval
Implementing the Solution
• Result of Collaboration & culmination of all activity
• Small size project – 3-6 months to implement
• Big project – months to years – add capabilities incrementally during
implementation & deployment
Technical Elements of the Big Data Platform
• Selection of the h/w stack
• Selection of the s/w & BI platform

• On premises
• Cloud based
Selection of the h/w stack
• Depends on type of solution chosen
• Location of h/w
• Type of data – un/structured, semi-structured
• Size of data – GB, Tera, Peta
• Update frequency of data
Models of h/w architecture
Multinode Architecture

• Multiple nodes/servers, interconnected, work on multinode or
distributed computing e.g. Hadoop
NoSQL databases e.g. Cassandra, Platform – Elasticsearch

* Leverage Commodity Servers – low end m/c working in tandem


* Multinode Architectures – host data Terabytes & above
Single Node Architecture
• Computation on a single server
• Advantages over distributed – reliability of the n/w, cost of latency,
b/w, ..
• Structured dataset, mainly text, 1-5 TB – host on single node
Cloud based Architecture
• Reduce barrier to entry into Big Data Analytics
• Platform provisions h/w (and s/w) resources On Demand –
Need Based
• Significant reduction in procuring, managing, maintenance, hosting at
in-house facilities
• Platforms – Amazon Web Services, Azure (M’soft), Google compute
environment – 10-100s of nodes at 1 cent / hr / instance
• Several complementary services – cloud mgmt. companies – Altiscale, IBM
Cloud Brokerage – select & manage multiple cloud based solns
• Exponential decrease in cost of h/w – s/w more expensive – licensing
• Hence allocate enough budget for h/w selection, h/w determines efficiency
of s/w implementation
Selection of the s/w Stack
• Based on specifics of the situation
Hadoop Ecosystem – Multiple projects run under the Apache Software Foundation
• Supports almost all types of datasets – un-, semi-, structured
• Ecosystem of Auxiliary tools – more functionalities, evolving marketplace
Four primary components –
• Hadoop Common – common utilities supporting other modules
• HDFS (distributed file system) – provides high throughput access to
application data
• YARN – Framework for job scheduling & cluster resource management
• MapReduce – YARN based system for parallel processing of large datasets
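
To make the MapReduce idea concrete, here is a single-process sketch of its three steps (map, shuffle, reduce) applied to word counting. It is plain Python and does not use the Hadoop API; the function names are illustrative, and on a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data big insights", "data moves fast"])
print(counts)  # {'big': 2, 'data': 2, 'insights': 1, 'moves': 1, 'fast': 1}
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many servers, which is what makes the model scale.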
Apache Spark
• Multinode computing Framework developed at UC Berkeley
• Seamless interface to run parallel computations
Go beyond limitations of MapReduce
• DAG – directed acyclic graphs – optimise set of operations – smaller,
computationally efficient, set of operations
• APIs – Python (PySpark) & Scala (natively available)
• Lowers the barrier to entry compared with Hadoop (where knowledge of Java is essential)
• RDD (Resilient Distributed Datasets) – store data in-memory –
improve retrieval and processing times
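
The DAG and in-memory points above rest on lazy evaluation: transformations are only recorded, and work happens when a result is requested. The toy class below illustrates that idea in plain Python. It is not the Spark API; `LazyDataset` and its methods are hypothetical names, and a real engine would also partition the data across nodes.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them on collect()."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Record the step; nothing is computed yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def collect(self):
        """Action: run the whole recorded chain in one pass over the data."""
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # filter step rejected the item
                    keep = False
                    break
            if keep:
                results.append(item)
        return results

calls = []

def square(x):
    calls.append(x)  # track when the function actually runs
    return x * x

pipeline = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_collect = list(calls)  # still empty: only a plan exists
result = pipeline.collect()
print(result)  # [1, 9, 25]
```

Since the full chain is known before anything runs, the engine is free to fuse the map and filter into a single pass rather than materialising an intermediate dataset after each step.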
.. Spark
• Cluster Manager – communicate between nodes – co-ordinate
• Options – Spark's own standalone Cluster Manager, Apache Mesos or YARN
• Distributed Storage – Spark can access data from various distributed
storage systems –
HDFS, S3 (AWS Storage), Cassandra, HBase, Hive, Tachyon, Hadoop data
sources
* SPARK – standalone usage possible – HADOOP not necessary for ops
• SPARK – supports various cluster managers & backend storage
systems
NoSQL & Traditional Databases
• Open source, commercial, cloud-based
Classifications –
• Key-value: Unique key – identifies a set of related properties
e.g. SSN, Aadhaar – related to individual’s name, address, phone no,etc
Query by the id no to directly access the information about individual
Open source Key-value d’bases – Redis, Commercial – Riak
• In-memory: earlier – cache stored in the memory
• Faster – memory access of the order of ~100 ns, v/s disk 1-10 ms
• NoSQL d’bases e.g. Redis, KDB+ - use temporary in-memory storage
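
The key-value model can be sketched with an in-memory dictionary: one unique key gives direct access to a set of related properties. This is a toy stand-in, not the Redis or Riak API; the class name, the ID format and the record are hypothetical.

```python
class KeyValueStore:
    """Toy in-memory key-value store backed by a Python dict."""

    def __init__(self):
        self._records = {}

    def put(self, key, value):
        self._records[key] = value

    def get(self, key, default=None):
        # Direct lookup by key; no scan over other records is needed.
        return self._records.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it existed."""
        if key in self._records:
            del self._records[key]
            return True
        return False

store = KeyValueStore()
# Hypothetical identity number used as the unique key
store.put("ID-1001", {"name": "A. Kumar", "phone": "555-0100", "city": "Pune"})
person = store.get("ID-1001")
missing = store.get("ID-9999")
print(person["city"], missing)  # Pune None
```

The store never inspects the value, which is why this model is fast for lookups by ID but unsuited to queries such as "everyone in Pune" without extra indexing.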
.. NoSQL & Traditional Databases
Columnar – Append multiple columns of data rather than rows
• Primary advantage – faster data access with reduced I/O overhead
• Leverage parallel processing facilities well
• E.g. Cassandra, Google BigTable, etc
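
The reduced I/O of a columnar layout can be seen by storing the same records both ways. The sketch below uses plain Python lists and dicts rather than any database engine; the order data is made up, and every row is assumed to have the same fields.

```python
# Row-oriented layout: one record per entry
rows = [
    {"order_id": 1, "region": "north", "amount": 120.0},
    {"order_id": 2, "region": "south", "amount": 80.0},
    {"order_id": 3, "region": "north", "amount": 50.0},
]

def to_columns(records):
    """Pivot row records into one list per column (assumes uniform fields)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Row layout: the aggregate has to visit every full record.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout: the aggregate reads only the one column it needs.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)  # 250.0 250.0
```

Both totals agree, but the columnar version never touches `order_id` or `region`; on disk that means far fewer bytes read for analytical queries over a few columns.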
Document-Oriented
• Do not conform to any specific schema
e.g. unstructured text – news articles
• Multiple key-value pairs not necessarily consistent in structure across all
other entries
• MongoDB – used in Media-related org – NYT, Forbes, ..
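
A document store keeps entries whose fields need not match each other. The sketch below models that with a list of Python dicts and a small query helper. It is not the MongoDB API; the `find` helper and the sample documents are hypothetical.

```python
# Documents in one collection, each with a different set of fields
documents = [
    {"_id": 1, "type": "article", "title": "Markets rally", "tags": ["finance"]},
    {"_id": 2, "type": "article", "title": "Monsoon update",
     "author": {"name": "R. Iyer"}},
    {"_id": 3, "type": "photo", "caption": "City skyline at dusk"},
]

def find(docs, **criteria):
    """Return documents whose fields equal all given criteria.
    A document lacking a requested field simply does not match
    (unless the criterion value itself is None)."""
    return [doc for doc in docs
            if all(doc.get(field) == value for field, value in criteria.items())]

articles = find(documents, type="article")
article_ids = [doc["_id"] for doc in articles]
photos = find(documents, type="photo", caption="City skyline at dusk")
print(article_ids, len(photos))  # [1, 2] 1
```

No schema had to be declared up front: the second article carries an `author` field the first lacks, and the photo has neither `title` nor `tags`.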
Cloud-based solutions
• AWS Redshift, Azure SQL Data Warehouse, Google BigQuery
• Query datasets directly on cloud-vendor’s platform w/o creating their
own architecture
• End user can have in-house specialists e.g. Redshift sysadmins
• Management – infrastructure, maintenance, day-to-day tasks mostly
carried out by vendor
• Reduce operational overhead of the client
