
DATA ANALYTICS

(BE-2015 PATTERN)
UNIT-I: INTRODUCTION AND LIFE CYCLE
Big Data Overview
• Industries that gather and exploit data
• Credit card companies monitor purchases
• Good at identifying fraudulent purchases
• Mobile phone companies analyze calling
patterns – e.g., even on rival networks
• Look for customers who might switch providers
• For social networks data is primary product
• Intrinsic value increases as data grows
Attributes Defining
Big Data Characteristics
• Huge volume of data
• Not just thousands/millions, but billions of
items
• Complexity of data types and structures
• Variety of sources, formats, structures
• Speed of new data creation and growth
• High velocity, rapid ingestion, fast analysis
Attributes Defining
Big Data Characteristics
• Volume
• Big Data observes and tracks what happens from various
sources which include business transactions, social media and
information from machine-to-machine or sensor data. This
creates large volumes of data.
• Variety
• Data comes in all formats: structured numeric data in traditional databases as well as unstructured text documents, video, audio, email, and stock ticker data.
• Velocity
• Data streams in at high speed and must be dealt with in a timely manner. The processing of streamed data to produce near-real-time or real-time results must also be fast.
Big Data Analytics Importance
• Cost savings: helps in identifying more efficient ways of doing business.
• Time reductions: helps businesses analyze data immediately and make quick decisions based on the learnings.
• New product development: by knowing the trends of customer needs and satisfaction through analytics, you can create products according to the wants of customers.
• Understand the market conditions: by analyzing big data you can get a better understanding of current market conditions.
• Control online reputation: big data tools can perform sentiment analysis, giving feedback about who is saying what about your company.
Sources of Big Data Deluge

• Mobile sensors – GPS, accelerometer, etc.


• Social media – 700 Facebook updates/sec in 2012
• Video surveillance – street cameras, stores, etc.
• Video rendering – processing video for display
• Smart grids – gather and act on information
• Geophysical exploration – oil, gas, etc.
• Medical imaging – reveals internal body structures
• Gene sequencing – more prevalent, less expensive,
healthcare would like to predict personal illnesses
Sources of Big Data Deluge
Data Structures:
Characteristics of Big Data
Data Structures:
Characteristics of Big Data

• Structured – defined data type, format, structure


• Transactional data, OLAP cubes, RDBMS, CSV files, spreadsheets
• Semi-structured
• Text data with discernable patterns – e.g., XML data
• Quasi-structured
• Text data with erratic data formats – e.g., clickstream data
• Unstructured
• Data with no inherent structure – text docs, PDFs, images, video
Example of Structured Data
Rno | Name | Address    | Phone no
----|------|------------|-----------
1   | Amit | Nashik     | 9766543267
2   | Neha | Pune       | -
3   | Jiya | Mumbai     | -
4   | Riya | Aurangabad | 8990765432
Example of Semi-Structured Data
Example of Quasi-Structured Data
Visiting 3 websites adds 3 URLs to the user's log files
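For instance, a short Python sketch of how such a log might be parsed; the log format and URLs here are hypothetical:

```python
from urllib.parse import urlparse

# Hypothetical clickstream entries: a timestamp followed by the visited URL.
log_lines = [
    "2015-09-01T10:02:11 http://example.com/products?id=42",
    "2015-09-01T10:02:45 http://example.com/cart",
    "2015-09-01T10:03:02 http://example.com/checkout",
]

# Extract the path of each visited page to reconstruct the user's session.
for line in log_lines:
    timestamp, url = line.split(" ", 1)
    print(timestamp, urlparse(url).path)
```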
Example of Unstructured Data
Video about Antarctica Expedition
Types of Data Repositories
from an Analyst Perspective
State of the Practice in Analytics
• Business Intelligence (BI) versus Data Science
• Current Analytical Architecture
• Drivers of Big Data
• Emerging Big Data Ecosystem and a New Approach to Analytics
Data Analytics Techniques
• BI (Business Intelligence)
• Data Science
Business Intelligence (BI) vs Data Science
Business Intelligence (BI) vs Data Science
Current Analytical Architecture
Typical Analytic Architecture
Current Analytical Architecture
• Data sources must be well understood
• EDW – Enterprise Data Warehouse
• From the EDW, data is read by applications
• Data scientists get data for downstream analytics processing
Current Analytical Architecture – Problem
• High-value data is hard to reach, and predictive analytics and data mining activities are last in line for data.
• Data scientists are limited to performing in-memory analytics, which restricts the size of the datasets. Analysts therefore work on samples, which can skew model accuracy.
• Data science projects remain isolated rather than centrally managed; the implication is that the organization can never tie together the power of advanced analytics.
Current Analytical Architecture – Solution
• One solution to this problem is to introduce analytic sandboxes to enable data scientists to perform advanced analytics.
Drivers of Big Data
Data Evolution & Rise of Big Data Sources
Drivers of Big Data
Data Evolution & Rise of Big Data Sources
• Medical information, such as diagnostic imaging
• Photos and video footage uploaded to the World Wide Web
• Video surveillance, such as the thousands of video cameras
across a city
• Mobile devices, which provide geospatial location data of the
users
• metadata about text messages, phone calls, and application
usage on smart phones
• Smart devices, which provide sensor-based collection of
information from smart
• Nontraditional IT devices, including the use of radio-frequency
identification (RFID) readers, GPS navigation systems, and
seismic processing
Emerging Big Data Ecosystem and a
New Approach to Analytics
• Organizations and data collectors are realizing that the data they can gather from individuals contains intrinsic value and, as a result, a new economy is emerging.
• Four main groups of players
• Data devices
• Games, smartphones, computers, etc.
• Data collectors
• Phone and TV companies, Internet, Gov’t, etc.
• Data aggregators – make sense of data
• Websites, credit bureaus, media archives, etc.
• Data users and buyers
• Banks, law enforcement, marketers, employers, etc.
Emerging Big Data Ecosystem and a New Approach to Analytics

Data devices
• Gather data from multiple locations and continuously generate new data about
this data. For each gigabyte of new data created, an additional petabyte of data
is created about that data.
• For example, playing an online video game, smartphone data, retail shopping loyalty card data
Data collectors
• Include entities that collect data from the device and its users.
• For example, retail stores tracking the path a customer takes through the store

Data aggregators – make sense of data


• They transform and package the data as products to sell to list brokers for
specific ad campaigns.
Data users and buyers
• These groups directly benefit from the data collected and aggregated by others
within the data value chain.
• For example, people wanting to determine public sentiment toward a candidate by analyzing related blogs and online comments
Emerging Big Data Ecosystem and a
New Approach to Analytics
Key Roles for the
New Big Data Ecosystem
1. Deep analytical talent
• Advanced training in quantitative disciplines – e.g., math,
statistics, machine learning
2. Data-savvy (intelligent, knowledgeable) professionals
• Savvy but less technical than group 1
3. Technology and data enablers
• Support people – e.g., DB admins, programmers, etc.
• This group represents people providing technical expertise to support analytical projects, such as provisioning and administering analytical sandboxes and managing large-scale data architectures
Three Key Roles of the
New Big Data Ecosystem
Three Recurring
Data Scientist Activities

1. Reframe business challenges as


analytics challenges
2. Design, implement, and deploy statistical
models and data mining techniques on
Big Data
3. Develop insights that lead to actionable
recommendations
Profile of Data Scientist
Five Main Sets of Skills
Profile of Data Scientist
Five Main Sets of Skills
• Quantitative skills – e.g., math, statistics
• Technical aptitude – e.g., software engineering, programming
• Skeptical mindset and critical thinking – ability to examine work critically
• Curious and creative – passionate about data and finding creative solutions
• Communicative and collaborative – can articulate ideas, can work with others
Examples of
Big Data Analytics
• Retailer Target
• Uses life events: marriage, divorce, pregnancy
• Apache Hadoop
• Open source Big Data infrastructure innovation
• MapReduce paradigm, ideal for many projects (see the toy sketch after this list)
• Social Media Company LinkedIn
• Social network for working professionals
• Can graph a user’s professional network
• 250 million users in 2014
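To illustrate the MapReduce paradigm named above, here is a toy word-count sketch in plain Python; real Hadoop jobs run distributed across a cluster, so this only mirrors the map-shuffle-reduce flow on one machine:

```python
from collections import defaultdict

documents = ["big data needs new architectures", "data drives new products"]

# Map: emit (word, 1) pairs from every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # e.g. {'big': 1, 'data': 2, 'new': 2, ...}
```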
Data Visualization of User’s
Social Network Using InMaps
Summary
• Big Data comes from myriad (many) sources
• Social media, sensors, IoT, video surveillance, and sources only recently considered
• Companies are finding creative and novel ways
to use Big Data
• Exploiting Big Data opportunities requires
• New data architectures
• New machine learning algorithms, ways of working
• People with new skill sets
Exercise
• 1. What are the three characteristics of Big Data, and
what are the main considerations in processing Big Data?
• 2. What is an analytic sandbox, and why is it important?
• 3. Explain the differences between BI and Data Science.
• 4. Describe the challenges of the current analytical
architecture for data scientists.
• 5. What are the key skill sets and behavioral
characteristics of a data scientist?
DATA ANALYTICS
LIFECYCLE
Data Analytics Lifecycle
• Data science projects differ from BI projects
• More exploratory in nature
• Critical to have a project process
• Participants should be thorough and rigorous
• Break large projects into smaller pieces
• Spend time to plan and scope the work
• Documenting adds rigor and credibility


Data Analytics Lifecycle
Data Analytics Lifecycle Overview
• Phase 1: Discovery
• Phase 2: Data Preparation
• Phase 3: Model Planning
• Phase 4: Model Building
• Phase 5: Communicate Results
• Phase 6: Operationalize
• Case Study: GINA


Data Analytics Lifecycle Overview

• The Data Analytics Lifecycle is designed for Big Data problems and data science projects

• Although there are six phases, project work can occur in several phases simultaneously

• The cycle is iterative to portray a real project

• Work can return to earlier phases as new information is uncovered


Key Roles for a Successful
Analytics Project
Background and Overview of Data
Analytics Lifecycle
• Data Analytics Lifecycle defines the analytics process and best
practices from discovery to project completion

• The Lifecycle employs aspects of


• Scientific method
• Cross Industry Standard Process for Data Mining (CRISP-DM)
• Process model for data mining
• Davenport’s DELTA framework
• Hubbard’s Applied Information Economics (AIE) approach
• MAD Skills: New Analysis Practices for Big Data by Cohen et al.
Overview of
Data Analytics Lifecycle
Phase 1: Discovery
Phase 1: Discovery

1. Learning the Business Domain


2. Resources
3. Framing the Problem
4. Identifying Key Stakeholders
5. Interviewing the Analytics Sponsor
6. Developing Initial Hypotheses
7. Identifying Potential Data Sources
Phase 2: Data Preparation
Phase 2: Data Preparation

• Includes steps to explore, preprocess, and condition data
• Create a robust environment – the analytics sandbox
• Data preparation tends to be the most labor-intensive step in the analytics lifecycle
• Often at least 50% of the data science project's time
• The data preparation phase is generally the most iterative and the one that teams tend to underestimate most often
Performing ETLT
(Extract, Transform, Load, Transform)

• In ETL, users extract, transform, and then load data into the warehouse; in ELT, raw data is loaded first and transformed inside the sandbox. ETLT combines the two: a light transformation before loading, then heavier transformations afterward.
• Example – in credit card fraud detection, outliers can represent high-risk transactions that might be inadvertently filtered out or transformed before being loaded into the database
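A minimal pandas sketch of the ETLT idea: raw data is loaded into the sandbox before any outlier filtering, so potentially fraudulent records are preserved. The file names, column names, and 3-sigma cutoff are illustrative assumptions, not part of the lifecycle itself:

```python
import pandas as pd

# Extract: read raw transactions with only light, lossless transformation
# (date parsing); defer any filtering until after the load.
raw = pd.read_csv("transactions.csv", parse_dates=["txn_time"])  # hypothetical file

# Load: persist the untouched data into the analytics sandbox first.
raw.to_csv("sandbox/transactions_raw.csv", index=False)

# Transform (inside the sandbox): flag outliers instead of dropping them,
# since extreme amounts may be exactly the high-risk cases of interest.
cutoff = raw["amount"].mean() + 3 * raw["amount"].std()  # illustrative 3-sigma rule
raw["possible_fraud"] = raw["amount"] > cutoff
raw.to_csv("sandbox/transactions_flagged.csv", index=False)
```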
Learning about the Data

• Becoming familiar with the data is critical

• This activity accomplishes several goals:


• Determines the data available to the team early in the project
• Highlights gaps – identifies data not currently available
• Identifies data outside the organization that might be useful
Learning about the Data Sample
Dataset Inventory
Data Conditioning

• Data conditioning includes cleaning data, normalizing datasets, and performing transformations
• Often viewed as a preprocessing step prior to data analysis, it might be performed by the data owner, the IT department, a DBA, etc.
• Best to have data scientists involved
• Data science teams prefer too much data to too little


Data Conditioning

• Additional questions and considerations
• What are the data sources? Target fields?
• How clean is the data?
• How consistent are the contents and files? Missing or inconsistent values?
• Assess the consistency of the data types – numeric, alphanumeric?
• Review the contents to ensure the data makes sense
• Look for evidence of systematic error
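A short pandas sketch of these checks against a hypothetical customers.csv (the column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical source file

# How clean is the data? Count missing values per column.
print(df.isna().sum())

# Are the data types consistent (numeric vs. alphanumeric)?
print(df.dtypes)

# Review the contents for values that make no sense, e.g. negative ages.
print(df[df["age"] < 0])  # assumes an 'age' column

# Look for evidence of systematic error, e.g. fully duplicated records.
print(df[df.duplicated()])
```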
Common Tools
for Data Preparation

• Hadoop can perform parallel ingest and analysis
• Alpine Miner provides a graphical user interface for creating analytic workflows
• OpenRefine (formerly Google Refine) is a free, open source tool for working with messy data
• Similar to OpenRefine, Data Wrangler is an interactive tool for data cleansing and transformation
Phase 3: Model Planning
Phase 3: Model Planning
Model Planning in Industry Verticals

• Examples of how other analysts have approached similar problems


Data Exploration
and Variable Selection
• Explore the data to understand the relationships among the
variables to inform selection of the variables and methods

• A common way to do this is to use data visualization tools

• Often, stakeholders and subject matter experts may have ideas


• For example, the hypotheses that led to the project

• Aim for capturing the most essential predictors and variables


• This often requires iterations and testing to identify key variables
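One simple way to start this iteration, sketched in Python: rank numeric candidate predictors by absolute correlation with the target. The dataset and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("sandbox/training_data.csv")  # hypothetical dataset

# Rank numeric candidate predictors by absolute correlation with the
# target column as a rough first cut, then iterate with domain experts.
numeric = df.select_dtypes("number")
ranking = (
    numeric.corr()["target"]  # assumes a numeric 'target' column
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranking.head(10))
```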
Model Selection
• The main goal is to choose an analytical technique, or
several candidates, based on the end goal of the project
• We observe events in the real world and attempt to
construct models that emulate this behavior with a set of
rules and conditions
• A model is simply an abstraction from reality

• Determine whether to use techniques best suited for structured data, unstructured data, or a hybrid approach
• Teams often create initial models using statistical software
packages such as R, SAS, or Matlab
• Which may have limitations when applied to very large datasets

• The team moves to the model building phase once it has a good idea about the type of model to try
Common Tools for the Model
Planning Phase
• R has a complete set of modeling capabilities
• R contains about 5000 packages for data analysis and graphical presentation

• SQL Analysis Services can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models

• SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connections
Phase 4: Model Building
Phase 4: Model Building
• Execute the models defined in Phase 3
• Develop datasets for training, testing, and production
• Develop analytic model on training data, test on test data (a minimal sketch follows the questions below)

• Question to consider
• Does the model appear valid and accurate on the test data?
• Does the model output/behavior make sense to the domain experts?
• Do the parameter values make sense in the context of the domain?
• Is the model sufficiently accurate to meet the goal?
• Does the model avoid intolerable mistakes?
• Are more data or inputs needed?
• Will the kind of model chosen support the runtime environment?
• Is a different form of the model required to address the business problem?
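A minimal scikit-learn sketch of the train/test step referenced above; the model choice, dataset, and column names are illustrative, not prescribed by the lifecycle:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("sandbox/training_data.csv")  # hypothetical dataset
X, y = df.drop(columns=["target"]), df["target"]

# Hold out a test set so the model is judged on data it has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Does the model appear valid and accurate on the test data?
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```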
Common Tools for
the Model Building Phase
• Commercial Tools
• SAS Enterprise Miner – built for enterprise-level computing and analytics
• SPSS Modeler (IBM) – provides enterprise-level computing and analytics
• Matlab – high-level language for data analytics, algorithms, data exploration
• Alpine Miner – provides GUI frontend for backend analytics tools
• STATISTICA and MATHEMATICA – popular data mining and analytics
tools
• Free or Open Source Tools
• R and PL/R - PL/R is a procedural language for PostgreSQL with R
• Octave – language for computational modeling
• WEKA – data mining software package with analytic workbench
• Python – language providing toolkits for machine learning and analysis
• SQL – in-database implementations provide an alternative tool
Phase 5: Communicate Results
Phase 5: Communicate Results
• Determine if the team succeeded or failed in its objectives

• Assess if the results are statistically significant and valid


• If so, identify aspects of the results that present salient findings
• Identify surprising results and those in line with the hypotheses

• Communicate and document the key findings and major insights derived from the analysis
• This is the most visible portion of the process to the outside stakeholders and sponsors
Phase 6: Operationalize
Phase 6: Operationalize
• In this last phase, the team communicates the benefits of the project
more broadly and sets up a pilot project to deploy the work in a
controlled way
• Risk is managed effectively by undertaking small scope, pilot
deployment before a wide-scale rollout
• During the pilot project, the team may need to execute the algorithm
more efficiently in the database rather than with in-memory tools like
R, especially with larger datasets
• To test the model in a live setting, consider running the model in a
production environment for a discrete set of products or a single line of
business
• Monitor model accuracy and retrain the model if necessary
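A hedged sketch of that monitoring step: score a batch from the pilot, compare accuracy to an agreed floor, and flag the model for retraining when it drifts. The file name, columns, and threshold are assumptions:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

ACCURACY_FLOOR = 0.85  # hypothetical threshold agreed with the sponsor

# Assume the pilot writes out each scored batch with the model's
# predictions alongside the eventually observed true labels.
batch = pd.read_csv("pilot/scored_batch.csv")  # hypothetical file
accuracy = accuracy_score(batch["actual"], batch["predicted"])

if accuracy < ACCURACY_FLOOR:
    print(f"Accuracy {accuracy:.2%} below floor; schedule retraining.")
else:
    print(f"Accuracy {accuracy:.2%}; model within tolerance.")
```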
Phase 6: Operationalize
Key outputs from successful analytics project
Phase 6: Operationalize
Key outputs from successful analytics project

• Business user – tries to determine business benefits and implications
• Project sponsor – wants business impact, risks, ROI
• Project manager – needs to determine if project completed on
time, within budget, goals met
• Business intelligence analyst – needs to know if reports and
dashboards will be impacted and need to change
• Data engineer and DBA – must share code and document
• Data scientist – must share code and explain model to peers,
managers, stakeholders
Phase 6: Operationalize
Four main deliverables

• Although the seven roles represent many interests, the interests overlap and can be met with four main deliverables
1. Presentation for project sponsors – high-level takeaways for executive-level stakeholders
2. Presentation for analysts – describes business process changes and reporting changes, includes details and technical graphs
3. Code for technical people
4. Technical specifications for implementing the code
Case Study: Global Innovation Network
and Analysis (GINA)
• In 2012 EMC’s new director wanted to improve the
company’s engagement of employees across the global
centers of excellence (GCE) to drive innovation, research,
and university partnerships

• This project was created to accomplish the following


• Store formal and informal data
• Track research from global technologists
• Mine the data for patterns and insights to improve the team’s
operations and strategy
Phase 1: Discovery

• Team members and roles


• Business user, project sponsor, project manager – Vice President from the Office of the CTO
• BI analyst – person from IT
• Data engineer and DBA – people from IT
• Data scientist – distinguished engineer
Phase 1: Discovery
• The data fell into two categories
• Five years of idea submissions from internal innovation
contests
• Minutes and notes representing innovation and research
activity from around the world
• Hypotheses grouped into two categories
• Descriptive analytics of what is happening to spark further
creativity, collaboration, and asset generation
• Predictive analytics to advise executive management of
where it should be investing in the future
Phase 2: Data Preparation

• Set up an analytics sandbox
• Discovered that certain data needed conditioning and normalization, and that missing datasets were critical
• The team recognized that poor-quality data could impact subsequent steps
• They discovered many names were misspelled, along with problems with extra spaces
• These seemingly small problems had to be addressed
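A small Python sketch of the kind of conditioning the GINA team describes, collapsing extra whitespace and inconsistent casing so name variants match; the names themselves are made up:

```python
import re

raw_names = ["  Amit  Kumar ", "amit kumar", "Neha   Shah"]  # made-up examples

def normalize(name: str) -> str:
    # Collapse runs of whitespace and standardize casing so that
    # variants of the same name match during aggregation.
    return re.sub(r"\s+", " ", name).strip().title()

print({normalize(n) for n in raw_names})  # {'Amit Kumar', 'Neha Shah'}
```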
Phase 3: Model Planning

• The study included the following considerations
• Identify the right milestones to achieve the goals
• Trace how people move ideas from each milestone toward the goal
• Track ideas that die and others that reach the goal
• Compare times and outcomes using a few different methods
Phase 4: Model Building

• Several analytic methods were employed
• NLP on textual descriptions
• Social network analysis using R and RStudio
• Developed social graphs and visualizations
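The GINA team used R and RStudio; as a comparable illustration, here is a minimal social-graph sketch in Python with networkx, using hypothetical submitter pairs:

```python
import networkx as nx

# Hypothetical edges: (idea submitter, collaborator) pairs.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("dave", "bob")]
G = nx.Graph(edges)

# Degree centrality highlights the most connected people, analogous
# to the top-influencer view in the GINA study.
for person, score in sorted(nx.degree_centrality(G).items(),
                            key=lambda kv: -kv[1]):
    print(person, round(score, 2))
```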
Phase 4: Model Building
Social graph of data submitters and finalists
Phase 4: Model Building
Social graph of top innovation influencers
Communicate Results

• The study was successful in identifying hidden innovators
• Found a high density of innovators in Cork, Ireland
• The CTO office launched longitudinal studies
Operationalize

• Deployment was not really discussed
• Key findings
• Need more data in the future
• Some data were sensitive
• A parallel initiative needs to be created to improve basic BI activities
• A mechanism is needed to continually reevaluate the model after deployment
Phase 6: Operationalize
Summary

• The Data Analytics Lifecycle is an approach to managing and executing analytic projects
• The lifecycle has six phases
• The bulk of the time is usually spent on preparation – phases 1 and 2
• Seven roles are needed for a data science team
• Review the exercises
References
• http://www.csis.pace.edu/~ctappert/cs816-15fall/slides/
• https://norcalbiostat.github.io/ADS/notes/Data%20Analytics%20Lifecycle%20-%20EH1.pdf
• http://srmnotes.weebly.com/it1110-data-science--big-data.html
• http://www.csis.pace.edu/~ctappert/cs816-15fall/books/2015DataScience&BigDataAnalytics.pdf
