
Lecture_2

The document provides an overview of Big Data, defining its characteristics such as volume, variety, and velocity, and illustrating its applications across various industries. It discusses the importance of Big Data analytics for cost savings, time reductions, and market understanding, while outlining the data analytics lifecycle with its six phases. Additionally, it highlights the skills required for data scientists and the tools used in data analytics.

Uploaded by

Ronak

Introduction to Big Data

Big data
• What Is Big Data
• Big data refers to large data sets that are created from a
variety of sources and at high speed (a.k.a. velocity).
• Any data set that has one of these attributes can be called Big Data.
• It is also about the veracity and value of the data.

Some real-world examples of how big data is used:
1) Big Data is used to find out consumer shopping habits.
2) It can be used to monitor health conditions through data from
wearables.
3) The transportation industry uses fuel optimization tools where big
data is used.
4) It is used for predictive inventory ordering.
5) It can help you with real-time data monitoring and cybersecurity
protocols.
Big Data Overview
• Industries that gather and exploit data
• Credit card companies monitor purchases
• Good at identifying fraudulent purchases
• Mobile phone companies analyze calling patterns –
e.g., even on rival networks
• Look for customers who might switch providers
• For social networks, data is the primary product
• Intrinsic value increases as data grows
Attributes Defining Big Data Characteristics
• Huge volume of data
• Not just thousands/millions, but billions of items
• Complexity of data types and structures
• Variety of sources, formats, structures
• Speed of new data creation and growth
• High velocity, rapid ingestion, fast analysis
What Is Big Data Analytics

• Big data analytics is the use of specialized software or
platforms to draw conclusions or to find answers to specific
questions based on correlations or relationships between data
sets from different systems.
Big Data Analytics Importance
• Cost savings: helps identify more efficient ways of doing
business.
• Time reductions: helps businesses analyze data
immediately and make quick decisions based on what they learn.
• New product development: by knowing the trends in
customer needs and satisfaction through analytics, you can
create products that match what customers want.
• Understanding market conditions: by analyzing big data, you
can get a better understanding of current market conditions.
• Controlling online reputation: big data tools can do
sentiment analysis, so you can get feedback about
who is saying what about your company.
Sources of Big Data Deluge
• Mobile sensors – GPS, accelerometer, etc.
• Social media – 700 Facebook updates/sec in 2012
• Video surveillance – street cameras, stores, etc.
• Video rendering – processing video for display
• Smart grids – gather and act on information
• Geophysical exploration – oil, gas, etc.
• Medical imaging – reveals internal body structures
• Gene sequencing – more prevalent, less expensive,
healthcare would like to predict personal illnesses
Data Structures:
Characteristics of Big Data
• Structured – defined data type, format, structure
• Transactional data, OLAP cubes, RDBMS, CSV files, spreadsheets
• Semi-structured
• Text data with discernable patterns – e.g., XML data
• Unstructured
• Data with no inherent structure – text docs, PDFs, images, video
Example of Structured Data
Rno  Name  Address     Phone no
1    Amit  Nashik      9766543267
2    Neha  Pune        -
3    Jiya  Mumbai      -
4    Riya  Aurangabad  8990765432
Example of Semi-Structured Data
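The figure for this slide is not available in this text. As a minimal sketch of what semi-structured data can look like, the Python snippet below reuses two of the student records from the structured example above. The XML layout and tag names are assumptions chosen for illustration, not taken from the lecture. The second record simply omits the phone element, which a fixed-column table can only express with a placeholder such as "-".

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; the tag names are illustrative assumptions.
XML_DATA = """
<students>
  <student rno="1">
    <name>Amit</name>
    <address>Nashik</address>
    <phone>9766543267</phone>
  </student>
  <student rno="2">
    <name>Neha</name>
    <address>Pune</address>
  </student>
</students>
"""


def parse_students(xml_text):
    """Return one dict per <student>; fields that are absent become None."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for node in root.findall("student"):
        records.append({
            "rno": int(node.get("rno")),
            "name": node.findtext("name"),
            "address": node.findtext("address"),
            "phone": node.findtext("phone"),  # None when the tag is missing
        })
    return records


students = parse_students(XML_DATA)
for record in students:
    print(record)
```

The tags give the text a discernible pattern, but each record is free to carry a different set of fields.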
Example of Unstructured Data
Video about Antarctica Expedition
Data Analytics Techniques
BI (Business Intelligence)
• BI (Business Intelligence) is a set of processes,
architectures, and technologies that convert raw data
into meaningful information that drives profitable
business actions.
• It is a suite of software and services to transform data
into actionable intelligence and knowledge.
Data Analytics Techniques
Data Science

• Data Science is a blend of various tools, algorithms, and
machine learning principles with the goal of discovering
hidden patterns in raw data.
Business Intelligence (BI) vs Data Science
Profile of Data Scientist
Five Main Sets of Skills
• Quantitative skill – e.g., math, statistics
• Technical aptitude – e.g., software engineering,
programming
• Skeptical mindset and critical thinking – ability to examine
work critically
• Curious and creative – passionate about data and finding
creative solutions
• Communicative and collaborative – can articulate ideas, can
work with others
Data Analytics Lifecycle
Data Analytics Lifecycle Overview

• The data analytics lifecycle is designed for Big Data problems and data
science projects
• It has six phases, and project work can occur in several phases
simultaneously
• The cycle is iterative, to portray a real project
• Work can return to earlier phases as new information is uncovered

Phase 1: Discovery

Phase 2: Data Preparation

Phase 3: Model Planning

Phase 4: Model Building

Phase 5: Communicate Results

Phase 6: Operationalize

Case Study: GINA


Overview of Data Analytics Lifecycle
Phase 1: Discovery

1. Learning the Business Domain
2. Resources
3. Framing the Problem
4. Identifying Key Stakeholders
5. Interviewing the Analytics Sponsor
6. Developing Initial Hypotheses
7. Identifying Potential Data Sources
Discovery
Phase 2: Data Preparation

• Includes steps to explore, preprocess, and condition data
• Create a robust environment – the analytics sandbox
• Data preparation tends to be the most labor-intensive
step in the analytics lifecycle
• Often at least 50% of the data science project’s time
• The data preparation phase is generally the most iterative
and the one that teams tend to underestimate most often
Performing ETLT
(Extract, Transform, Load, Transform)

• In ETL, users extract, transform, and then load the data
• In the sandbox the process is often ELT – loading early
preserves the raw data, which can be useful to
examine
• Example – in credit card fraud detection, outliers
can represent high-risk transactions that might be
inadvertently filtered out or transformed before
being loaded into the database
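The difference between the two orderings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transactions, the `AMOUNT_CAP` cleaning rule, and the dictionary standing in for a database are all hypothetical.

```python
# Hypothetical transactions as (transaction_id, amount); values are made up.
RAW_TRANSACTIONS = [
    ("t1", 25.0),
    ("t2", 40.0),
    ("t3", 9800.0),  # unusually large amount: a possible fraud signal
    ("t4", 31.5),
]

AMOUNT_CAP = 1000.0  # assumed cleaning rule: drop "implausible" amounts


def transform(rows):
    """Cleaning step that discards rows above the cap."""
    return [(tid, amount) for tid, amount in rows if amount <= AMOUNT_CAP]


def etl(rows):
    """Extract, Transform, Load: only the cleaned rows reach the store."""
    return {"clean": transform(rows)}


def elt(rows):
    """Extract, Load, Transform: raw rows are stored first, then cleaned."""
    store = {"raw": list(rows)}
    store["clean"] = transform(store["raw"])
    return store


etl_store = etl(RAW_TRANSACTIONS)
elt_store = elt(RAW_TRANSACTIONS)

# After ETL the outlier is gone; after ELT it survives in the raw copy.
print("raw" in etl_store)
print([tid for tid, _ in elt_store["raw"]])
```

With ELT the analyst can still inspect transaction `t3` in the raw copy, even though the same cleaning rule removed it from the cleaned data.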
Data Conditioning

• Data conditioning includes cleaning data,
normalizing datasets, and performing
transformations
• Often viewed as a preprocessing step prior to data
analysis, it might be performed by the data owner, IT
department, DBA, etc.
• Best to have data scientists involved
• Data science teams prefer too much data to too little
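As a small illustration of cleaning and normalizing, the sketch below drops missing entries from a column and rescales the rest to the range [0, 1]. Min-max scaling is just one common normalization choice, assumed here for the example; the slides do not name a specific method, and the sample values are made up.

```python
def clean(values):
    """Drop missing entries (None or blank) and convert the rest to float."""
    cleaned = []
    for value in values:
        if value is None or str(value).strip() == "":
            continue
        cleaned.append(float(value))
    return cleaned


def min_max_normalize(values):
    """Rescale values linearly so the smallest is 0.0 and the largest 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # All values are equal; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Hypothetical raw column with mixed types, stray spaces, and gaps.
raw_ages = ["20", " 30 ", None, "", 40]

ages = clean(raw_ages)
scaled = min_max_normalize(ages)

print(ages)    # [20.0, 30.0, 40.0]
print(scaled)  # [0.0, 0.5, 1.0]
```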
Data Conditioning

• Additional questions and considerations
• What are the data sources? Target fields?
• How clean is the data?
• How consistent are the contents and files? Missing or
inconsistent values?
• Assess the consistency of the data types – numeric,
alphanumeric?
• Review the contents to ensure the data makes sense
• Look for evidence of systematic error
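Some of these questions can be answered with a quick profiling pass. The following is a minimal sketch, assuming records arrive as dictionaries of strings and that "", "-", and `None` all mean "missing"; the field names and sample rows are hypothetical.

```python
# Hypothetical records; "-" and "" stand for missing phone numbers.
ROWS = [
    {"rno": "1", "name": "Amit", "phone": "9766543267"},
    {"rno": "2", "name": "Neha", "phone": "-"},
    {"rno": "x3", "name": "Jiya", "phone": ""},
]

MISSING = {"", "-", None}  # assumed markers for a missing value


def profile(rows):
    """Count missing, numeric, and non-numeric values for each field."""
    report = {}
    for field in rows[0]:
        values = [row.get(field) for row in rows]
        present = [v for v in values if v not in MISSING]
        numeric = [v for v in present if str(v).isdigit()]
        report[field] = {
            "missing": len(values) - len(present),
            "numeric": len(numeric),
            "non_numeric": len(present) - len(numeric),
        }
    return report


report = profile(ROWS)
for field, counts in report.items():
    print(field, counts)
```

A field that is mostly numeric with a few non-numeric values (here `rno`, because of "x3") is a hint of inconsistent types or a systematic entry error worth reviewing.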
Phase 3: Model Planning
Model Planning in Industry Verticals

• Example of other analysts approaching a similar problem


Phase 4: Model Building
Common Tools for the Model Building Phase
• Commercial Tools
• SAS Enterprise Miner – built for enterprise-level computing and analytics
• SPSS Modeler (IBM) – provides enterprise-level computing and analytics
• MATLAB – high-level language for data analytics, algorithms, data
exploration
• Alpine Miner – provides GUI frontend for backend analytics tools
• Statistica and Mathematica – popular data mining and analytics tools
• Free or Open Source Tools
• R and PL/R – PL/R is a procedural language for PostgreSQL that lets functions be written in R
• Octave – language for computational modeling
• WEKA – data mining software package with analytic workbench
• Python – language providing toolkits for machine learning and analysis
• SQL – in-database implementations provide an alternative tool
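To illustrate the last point, the sketch below lets the database compute a per-customer summary instead of pulling every row into the application. It uses Python's built-in `sqlite3` module with an in-memory database purely for convenience; the `purchases` table and its contents are hypothetical, and a real project would run similar SQL inside its own database.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("a", 10.0), ("a", 30.0), ("b", 5.0)],
)

# The aggregation runs inside the database; only the summary comes back.
rows = conn.execute(
    "SELECT customer, COUNT(*), AVG(amount) "
    "FROM purchases GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(rows)  # [('a', 2, 20.0), ('b', 1, 5.0)]
```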
Phase 5: Communicate Results
Phase 6: Operationalize
Thanks
