After completing this course, students should be
able to:
 CO1: Understand the significance, structure
and sources of Big data
 CO2: Assess avenues for analytical scalability.
 CO3: Comprehend stream computing and
applications
 CO4: Apply the different clustering techniques
 CO5: Use different Frameworks and
Visualization techniques
Course Outcomes
Introduction To Big Data: What Is Big Data? Is The
"Big" Part Or The "Data" Part More Important? How Is
Big Data Different? How Is Big Data More Of The
Same? Risks Of Big Data -Why You Need To Tame Big
Data -The Structure Of Big Data- Exploring Big Data,
Most Big Data Doesn't Matter- Filtering Big Data
Effectively -Mixing Big Data With Traditional Data- The
Need For Standards-Today's Big Data Is Not
Tomorrow's Big Data. Web Data: The Original Big Data
-Web Data Overview -What Web Data Reveals -Web
Data In Action? A Cross-Section Of Big Data Sources
And The Value They Hold.
Unit I
Data Analysis: Evolution Of Analytic Scalability –
Convergence – Parallel Processing Systems –
Cloud Computing – Grid Computing – Map
Reduce – Enterprise Analytic Sand Box –
Analytic Data Sets – Analytic Methods – Analytic
Tools – Cognos – Microstrategy - Pentaho.
Analysis Approaches – Statistical Significance –
Business Approaches – Analytic Innovation –
Traditional Approaches – Iterative
Unit II
Mining Data Streams : Introduction To
Streams Concepts, Stream Data Model And
Architecture, Stream Computing, Sampling
Data In A Stream, Filtering Streams, Counting
Distinct Elements In A Stream, Estimating
Moments, Counting Ones In A Window,
Decaying Window, Realtime Analytics
Platform(RTAP) Applications, Case Studies, Real
Time Sentiment Analysis, Stock Market
Predictions.
Unit III
Frequent Itemsets And Clustering : Mining
Frequent Itemsets - Market Basket Model –
Apriori Algorithm – Handling Large Data Sets In
Main Memory – Limited Pass Algorithm –
Counting Frequent Itemsets In A Stream –
Clustering Techniques – Hierarchical – K- Means
– Clustering High Dimensional Data – CLIQUE
And PROCLUS – Frequent Pattern Based
Clustering Methods – Clustering In Non-
Euclidean Space – Clustering For Streams And
Parallelism.
Unit IV
Frameworks And Visualization : Mapreduce –
Hadoop, Hive, Mapr – Sharding – Nosql
Databases - S3 - Hadoop Distributed File
Systems – Visualizations - Visual Data Analysis
Techniques, Interaction Techniques; Systems
And Applications:
Unit V
Unit I
Introduction To Big Data
What is Big Data?
According to a study reported in the literature:
•Every day, we create 2.5 quintillion bytes of data (1 quintillion is 10^18).
•So much that 90% of the data in the world today has been created
in the last two years alone.
•This data comes from everywhere: sensors used to gather climate
information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals etc.
According to another study
•From the beginning of recorded time (1990) until 2003, 5 billion
gigabytes of data were created.
•In 2011, the same amount was created every two days.
•In 2013, the same amount of data was created every 10 minutes.
•In 2015, the same amount or more was being generated every 10 minutes.
•Advances in communications, computation, and storage
have created huge collections of data, having information
of value to business, science, government and society.
•Example: Search engine companies such as Google, Yahoo!, and
Microsoft have created an entirely new business by capturing the
information freely available on the World Wide Web and providing it
to people in useful ways. (SOCIAL NETWORKING)
•These companies collect trillions of bytes of data every day and provide NEW
SERVICES such as satellite images, driving directions, image retrieval
etc.
• The societal benefits of these services are well appreciated: they have
transformed how people find and make use of information on a daily
basis.
Foundations of Big Data: Concepts, Techniques, and Applications
•Big data analytics can be used in a wide variety of areas, including business,
health care, science, and defence.
Example: Health care (AKA HEALTH INFORMATICS)
•Modern medicine collects huge amounts of information
about patients through imaging technology (CAT scans, MRI),
genetic analysis (DNA microarrays), and other forms of diagnostic
equipment.
•By applying analytics to data sets for large numbers of patients,
medical researchers are gaining fundamental insights into the
GENETIC AND ENVIRONMENTAL CAUSES OF DISEASES,
and creating more effective means of diagnosis.
•Recently a Hollywood star underwent surgery to prevent cancer.
[who]
According to a McKinsey report published in the US:
•140,000-190,000 workers with “knowledge of big data analytics”
will be needed in the US alone. (2014)
•Furthermore, 1.5 million managers will need to become data-
literate.
•Many agencies, media houses, and scientific communities across the
world have identified big data as an important research area.
•Like it or not, a massive amount of data will be coming
your way soon.
•Perhaps it has reached you already.
•Perhaps you’ve been wrestling with it for a while—
trying to figure out how to store it for later access,
address its mistakes and imperfections, or classify it
into structured categories.
GENESIS: The Beginning
KNOW THE DIFFERENCE BETWEEN BIG DATA AND BIG DATA MANAGEMENT
As the author Bill Franks puts it,
•There may soon be not only a flood of data, but also a flood of
books on big data.
•Most of these big-data books will be about the management
of big data:
 How to wrestle it into a database or data warehouse.
 How to structure and categorize unstructured data.
 If you find yourself reading a lot about Hadoop or
MapReduce or various approaches to data warehousing,
you’ve stumbled upon (or were perhaps seeking) a “big
data management” (BDM) book.
• BDM is, of course, important work. No matter how much data
you have of whatever quality, it won’t be much good unless you
get it into an environment and format in which it can be
accessed and analyzed.
• BDM alone won’t get you very far. You also have to analyze and
act on it for data of any size to be of value.
• Just as traditional database management tools didn’t
automatically analyze transaction data from traditional systems,
Hadoop and MapReduce won’t automatically interpret the
meaning of data from web sites, gene mapping, image analysis,
or other sources of big data.
BDM
WHAT IT MEANS TO US: [APPLICATION]
You receive an EMAIL: It contains an offer for a complete personal
computer system. It seems like the retailer read your mind since you were exploring
computers on their web site just a few hours prior. …
As you drive to the store to buy the computer bundle, you get an offer for a discounted
coffee from the coffee shop you are getting ready to drive past. It says that since you’re
in the area, you can get 10% off if you stop by in the next 20 minutes
As you drink your coffee, you receive an apology from the manufacturer of a
product that you complained about yesterday on your Facebook page, as well as on the
company’s web site. …
Finally, once you get back home, you receive notice of a gadget upgrade available for
purchase in your favorite online video game.
Etc…………..
• The explosion of new and powerful data sources such as Facebook, Twitter,
LinkedIn, and YouTube contributes immensely to big data and its research.
• Advanced analytics will have a great impact.
• To stay competitive, it is imperative that organizations aggressively
pursue capturing and analyzing these new data sources to gain the
insights that they offer.
• Ignoring big data will put an organization at risk and cause it to fall
behind the competition.
• Analytic professionals have a lot of work to do! It won’t be easy to
incorporate big data alongside all the other data that has been used
for analysis for years.
DATA SOURCES
 500 Million Tweets sent each day!
 More than 4 Million Hours of content uploaded to
Youtube every day!
 3.6 Billion Instagram Likes each day.
 4.3 BILLION Facebook messages posted daily!
 5.75 BILLION Facebook likes every day.
 40 Million Tweets shared each day!
 6 BILLION daily Google Searches!
And don’t think that, with these increases in social media, email is
going away any time soon! According to The Radicati Group,
205 BILLION EMAILS were sent each day in 2015, and by 2019
that number will increase by 20% to 246 billion emails each day!
Big Data?
WHAT IS BIG DATA?
•There is no consensus in the marketplace as to how to
define big data!
• Def#1: “Big data exceeds the reach of commonly used
hardware environments and software tools to capture,
manage, and process it within a tolerable elapsed time for its
user population.” [Teradata Magazine article]
• Def#2: “Big data refers to data sets whose size is beyond the
ability of typical database software tools to capture, store,
manage and analyze.” [McKinsey Global Institute]
•Def#3: The “big” in big data also refers to several other
characteristics of a big data source. These aspects include
volume, velocity, variety and veracity (optional). [Gartner
Group]
Volume:
• The sheer volume of data being stored today is exploding.
• In the year 2000, 800,000 petabytes (PB) of data were
stored in the world.
• We expect this number to reach 35 zettabytes (ZB) by 2020.
Twitter alone generates more than 7 terabytes (TB) of data
every day, Facebook 10 TB etc.
Variety : “Variety Is the Spice of Life”
• The volume associated with the Big Data phenomena
brings along new challenges for data centres trying to
deal with it: its variety.
• With the explosion of sensors, and smart devices, as
well as social collaboration technologies, data in an
enterprise has become complex, because it includes not
only traditional relational data
• But also raw, semi structured, and unstructured data
from web pages, web log files (including click-stream
data), search indexes, social media forums, e-mail,
documents, sensor data from active and passive
systems, and so on.
Velocity : How Fast Is Fast?
•The speed at which the data is flowing.
•Increase in RFID sensors and other information streams
has led to a constant flow of data at a pace that has made it
impossible for traditional systems to handle
•Competition can mean identifying a trend, problem, or
opportunity only seconds, or even microseconds, before
someone else.
•In traditional processing, you can think of running queries
against relatively static data
•For example, the query “Show me all people living in
the City X” would result in a single result set to be used
as a warning list of an incoming weather pattern.
•With streams computing [IBM], you can execute a
process similar to a continuous query that identifies
people who are currently in “City X,” but you get
continuously updated results, because location
information from GPS data is refreshed in real time.
•Big Data requires that you perform analytics against
the volume and variety of data while it is still in motion,
not just after it is at rest.
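The contrast between a one-off query on data at rest and a continuous query on data in motion can be sketched as follows. This is only an illustration, not IBM's streams product: the function names, the people, and the GPS updates are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple


def static_query(locations: Dict[str, str], city: str) -> Set[str]:
    """Traditional processing: one query against static data, one result set."""
    return {person for person, where in locations.items() if where == city}


def continuous_query(
    updates: Iterable[Tuple[str, str]], city: str
) -> Iterator[Set[str]]:
    """Stream processing: re-emit the result set after every location update."""
    current: Dict[str, str] = {}
    for person, where in updates:
        current[person] = where  # latest known location wins
        yield {p for p, w in current.items() if w == city}


# Hypothetical snapshot of data at rest.
snapshot = {"ann": "City X", "bob": "City Y"}
print(static_query(snapshot, "City X"))

# Hypothetical GPS updates arriving over time.
gps_updates = [("ann", "City X"), ("bob", "City X"), ("ann", "City Y")]
for result in continuous_query(gps_updates, "City X"):
    print(sorted(result))
```

The static query answers once; the continuous query keeps answering as each new location arrives, which is the behaviour the weather-warning example needs.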
Veracity: (Non reliable Data)
•There is volume, velocity and variety
• There is Big data Hype, also there is non-reliability with
data
• How effective will these data be?
• Example: Product Branding, Image Branding, Image
assignation
In addition a couple of V’s are also suggested:
No single definition; here is from Wikipedia:
 Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
 The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
 The trend to larger data sets is due to the additional
information derivable from analysis of a single large set of
related data, as compared to separate smaller sets with the
same total amount of data, allowing correlations to be found to
"spot business trends, determine quality of research, prevent
diseases, link legal citations, combat crime, and determine real-
time roadway traffic conditions.”
What’s Big Data?
Big Data: 3V’s
 Data Volume
◦ 44x increase from 2009 to 2020
◦ From 0.8 zettabytes (ZB) to 35 ZB
 Data volume is increasing exponentially
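The volume figures quoted in these slides can be cross-checked with a little arithmetic. The sketch below assumes decimal (SI) units, where 1 PB = 10^15 bytes and 1 ZB = 10^21 bytes; the variable names are illustrative only.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption of this sketch).
PETABYTE = 10**15
ZETTABYTE = 10**21

start_pb = 800_000  # "800,000 petabytes" as quoted on the slides
target_zb = 35      # "35 zettabytes" as quoted on the slides

start_bytes = start_pb * PETABYTE
target_bytes = target_zb * ZETTABYTE

start_zb = start_bytes / ZETTABYTE          # the starting figure in zettabytes
growth_factor = target_bytes / start_bytes  # how many times larger the target is

print(f"{start_pb:,} PB = {start_zb} ZB")
print(f"growth factor = {growth_factor:.2f}x")
```

So 800,000 PB and 0.8 ZB are the same quantity, and the growth factor rounds to the quoted 44x.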
Volume (Scale)
Exponential increase in
collected/generated data
•12+ TB of tweet data every day
•25+ TB of log data every day
•? TB of data every day
•2+ billion people on the Web by end of 2011
•30 billion RFID tags today (1.3B in 2005)
•4.6 billion camera phones worldwide
•100s of millions of GPS-enabled devices sold annually
•76 million smart meters in 2009, 200M by 2014
Maximilien Brice, © CERN
CERN’s Large Hadron Collider (LHC) generates 15 PB a
• The Earthscope is the world's largest
science project. Designed to track
North America's geological evolution,
this observatory records data over
3.8 million square miles, amassing 67
terabytes of data. It analyzes seismic
slips in the San Andreas fault, sure,
but also the plume of magma
underneath Yellowstone and much,
much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
The Earthscope
 Relational Data (Tables/Transaction/Legacy
Data)
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
◦ Social Network, Semantic Web (RDF), …
 Streaming Data
◦ You can only scan the data once
 A single application can be
generating/collecting many types of data
 Big Public Data (online, weather, finance, etc)
Variety (Complexity)
To extract knowledge, all these types
of data need to be linked together
A Single View to the Customer
Customer: Social Media, Gaming, Entertain, Banking, Finance, Our Known History, Purchase
 Data is being generated fast and needs to be
processed fast
 Online Data Analytics
 Late decisions → missed opportunities
 Examples
◦ E-Promotions: Based on your current location, your purchase
history, what you like → send promotions right now for the store next
to you
◦ Healthcare monitoring: sensors monitoring your activities and
body → any abnormal measurements require immediate reaction
Velocity (Speed)
 Progress and innovation are no longer hindered by the ability to collect data
 But, by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
Real-time/Fast Data
•Social media and networks (all of us are generating data)
•Scientific instruments (collecting all sorts of data)
•Mobile devices (tracking all objects all the time)
•Sensor technology and networks (measuring all kinds of data)
Real-Time Analytics/Decision Requirement
Customer: Influence Behavior
•Product recommendations that are relevant & compelling
•Friend invitations to join a game or activity that expands business
•Preventing fraud as it is occurring & preventing more proactively
•Learning why customers switch to competitors and their offers, in time to counter
•Improving the marketing effectiveness of a promotion while it is still in play
Variability :
•It is often confused with variety.
Example:
•Say you have a bakery that sells 10 different breads. That is
variety. Now imagine you go to that bakery three days in a row
and every day you buy the same type of bread, but each day it
tastes and smells different. That is variability.
•Variability is thus very relevant in performing sentiment
analyses.
•Variability means that the meaning is changing (rapidly).
•In (almost) the same tweets a word can have a totally
different meaning.
Some Make it 4V’s
Visualization
•This is the hard part of big data.
•Making all that vast amount of data comprehensible in a
manner that is easy to understand and read.
•It does not mean ordinary graphs or pie charts. It means
complex graphs that can include many variables of data
while still remaining understandable and readable.
•Telling a complex story in a graph is very difficult but also
extremely crucial.
•Luckily there are more and more big data startups
appearing that focus on this aspect and in the end,
visualizations will make the difference
VALUE
•Data in itself is not valuable at all.
•The value is in the analyses done on that data and how
the data is turned into information and eventually
turning it into knowledge.
•The value is in how organisations will use that data and
turn their organisation into an information-
centric company that relies on insights derived from
data analyses for their decision-making.
IS THE “BIG” PART OR THE “DATA” PART MORE IMPORTANT?
•What is the most important part of the term big data? Is it
(1) the “big” part, (2) the “data” part, (3) both, or (4) neither?
•As with any source of data, big or small, the power of big
data comes down to:
++ What is done with that data?
++ How is it analyzed?
++ What actions are taken based on the findings?
++ How is the data used to make changes to a
business?
•People are led to believe that just because big data has
high volume, velocity, and variety, it is somehow better or
more important than other data.
• Many big data sources have a far higher percentage of
useless or low-value content than virtually any other data
source.
•By the time big data is trimmed down to what you
actually need, it may not even be so big any more.
In Summary:
•Whether it stays big or whether it ends up being small
when you’re done processing it,
•the size isn’t important.
•It’s what you do with it.
HOW IS BIG DATA DIFFERENT?
The majority of big data sources have the following features:
1. Big data is often automatically generated by a machine.
• Instead of a person being involved in creating new data,
it’s generated purely by machines in an automated way.
If you think about traditional data sources, there was
always a person involved.
• For example: Consider retail or bank transactions,
telephone call detail records, product shipments, or
invoice payments. All of those involve a person doing
something in order for a data record to be generated.
• A lot of sources of big data are generated without any
human interaction at all. Example: Sensors
2.Big data is typically an entirely new source of data. It is
not simply an extended collection of existing data.
• For Example, with the use of the Internet, customers can
now execute a transaction with a bank or retailer online.
But the transactions they execute are not fundamentally
different transactions from what they would have done
traditionally.
• They’ve simply executed the transactions through a
different channel.
• An organization may capture web transactions, but they
are really just more of the same old transactions that
have been captured for years.
• However, capturing browsing behaviors as customers
execute a transaction creates fundamentally new data.
3.Many big data sources are not designed to be friendly. In
fact, some of the sources aren’t designed at all!
• Example: Text streams from a social media site.
(There is no way to ask users to follow certain standards
of grammar, or sentence ordering, or vocabulary)
• It will be difficult to work with such data at best and
very, very ugly at worst.
• Most traditional data sources were designed up-front to
be friendly.
• Systems used to capture transactions provide data in a
clean, preformatted template that makes the data easy
to load and use
4. Substantial amount of big data streams may not have
much value. In fact, much of the data may even be close to
worthless.
• Example: Within a web log, there is information that is
very powerful. There is also a lot of information that
doesn’t have much value at all. (pic)
• It is necessary to weed through and pull out the valuable
and relevant pieces
• Traditional data sources were defined up-front to be 100
percent relevant.
Example: Weblog (1)
Example: Weblog (2)
HOW IS BIG DATA MORE OF THE SAME?
•The same thing that existed in the past is out in a new form.
• In many ways, big data doesn’t pose any problems that
your organization hasn’t faced before.
•Taming new, large data sources that push the current
limits of scalability is an ongoing theme in the world of
analytics
Fig: Data Mining Process
RISKS OF BIG DATA
1. An organization will be so overwhelmed with big data
that it won’t make any progress.
[The key here is to get the right people. You need the right
people attacking big data and attempting to solve the right
kinds of problems]
2. Costs escalate too fast as too much big data is captured
before an organization knows what to do with it.
[It is not necessary to go for it all at once and capture 100
percent of every new data source.
What is necessary is to start capturing samples of the new
data sources to learn about them. Using those initial
samples, experimental analysis can be performed to
determine what is truly important within each source and
how each can be used]
3. Perhaps the biggest risk with many sources of big data is
privacy.
• If everyone in the world was good and honest, then we
wouldn’t have to worry much about privacy
• There have also been high-profile cases of major
organizations getting into trouble for having ambiguous or
poorly defined privacy policies
Example: In April 2013, Living Social, a daily-deals site partly
owned by Amazon, announced that the names, email
addresses, birth dates and encrypted passwords of more than
50 million customers worldwide had been stolen by hackers.
•This has led to data being used in ways that consumers didn’t
understand or support, causing a backlash
•Organizations should explain how they will keep data secure
and how they will use it if they expect customers to accept their data being
captured and analyzed.
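The advice under risk 2, to start by capturing samples of a new data source rather than all of it, can be sketched with reservoir sampling (Algorithm R). This technique is not named in the slides; it is one common way to keep a fixed-size uniform sample from a stream of unknown length in a single pass. The event names and the sample size are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of k items from a stream, in one pass."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical event stream, generated lazily and never stored in full.
events = (f"event-{n}" for n in range(100_000))
sample = reservoir_sample(events, k=5)
print(sample)
```

Only the k sampled items are held in memory, so storage cost stays flat while the organization learns what is valuable in the source.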
WHY YOU NEED TO TAME BIG DATA
•Many organizations have done little with big data.
•E-commerce is an exception: there, analyzing big
data is already standard.
•Today, other organizations still have a chance to get ahead of the pack.
•Within a few years, any organization that isn’t analyzing
big data will be late to the game and will be stuck playing
catch up for years to come.
•The time to start taming big data is now.
THE STRUCTURE OF BIG DATA
•Big data is often described as Unstructured
•Most traditional data sources are in the fully structured realm.
•Data is in a pre-defined format, with no variation of the
format from day to day or update to update.
•Unstructured data
•Semi-structured data
• Example : Web logs
What is the difference
between Data Mining and
Web Mining?
Machine Learning : Classification, Clustering
etc.
Semantic approach: Statistics, NLP etc.
FILTERING BIG DATA EFFECTIVELY
•The biggest challenge with big data may not be the
analytics you do with it, but the extract, transform, and load
(ETL) processes you have to build to get it ready for analysis.
(PART OF 90 %)
•Analytic processes may require filters on the front end to
remove portions of a big data stream when it first arrives.
Also there will be other filters along the way as the data is
processed.
•For example, when working with a web log, a rule might be
to filter out up front any information on browser versions or
operating systems. Such data is rarely needed except for
operational reasons.
•Later in the process, the data may be filtered to specific
pages or user actions that need to be examined for the
business issues to be addressed.
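The two filtering stages described above (dropping operational fields up front, then narrowing to specific pages later) could look roughly like this. The record layout, field names, and page paths are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator, Set

# Fields assumed to matter only for operational reasons (hypothetical names).
OPERATIONAL_FIELDS = {"browser_version", "os"}


def drop_operational_fields(
    records: Iterable[Dict[str, str]]
) -> Iterator[Dict[str, str]]:
    """Front-end filter: strip browser and OS details as the data arrives."""
    for record in records:
        yield {k: v for k, v in record.items() if k not in OPERATIONAL_FIELDS}


def keep_pages(
    records: Iterable[Dict[str, str]], pages: Set[str]
) -> Iterator[Dict[str, str]]:
    """Later filter: keep only the pages relevant to the business question."""
    return (r for r in records if r["page"] in pages)


# Hypothetical web log records.
raw_log = [
    {"user": "u1", "page": "/cart", "browser_version": "Firefox 118", "os": "Linux"},
    {"user": "u2", "page": "/help", "browser_version": "Chrome 120", "os": "Windows"},
    {"user": "u1", "page": "/checkout", "browser_version": "Firefox 118", "os": "Linux"},
]

stage1 = drop_operational_fields(raw_log)
stage2 = list(keep_pages(stage1, {"/cart", "/checkout"}))
print(stage2)
```

Both stages are generators, so each record passes through the filters once without the full log being held in memory.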
<HTML>
<TITLE>
<BODY>
Sachin is a former Indian cricketer and captain, widely regarded as
one of the greatest batsmen of all time. Sachin took up cricket at
the age of eleven, made his Test debut on 15 November 1989
against Pakistan in Karachi at the age of sixteen, and went on to
represent Mumbai domestically and India internationally for close to
twenty-four years. Sachin is the only player to have
scored one hundred international centuries, the first batsman to
score a double century in a One Day International, the holder of the
record for the number of runs in both ODI and Test cricket, and the
only player to complete more than 30,000 runs in
international cricket
</BODY>
</TITLE>
</HTML>
Example-1
Example 2 :Opinion Analysis
Step 1: Sample text
excellent phone, excellent service . i am a business
user who heavily depend on mobile service ….,,,
there is much which has been said in other reviews
about the features of this phone.
Step 2: Remove delimiters from input file
excellent phone excellent service i am a business
user who heavily depend on mobile service there is
much which has been said in other reviews about
the features of this phone
Step 3: Subject the text to parts of speech tagger
Example: JJ excellent NN phone JJ excellent NN
service FW i VBP am DT a NN business NN
user WP who RB heavily VBP depend IN on JJ
mobile NN service EX there VBZ is JJ much WDT
which VBZ has VBN been VBN said IN in JJ other
NNS reviews IN about DT the NNS features IN of
DT this NN phone
Step 4: Extract feature
JJ excellent NN phone, JJ excellent NN service
Step 5: Approaches
•Supervised approach
•Unsupervised approach
Step 6: Results:
• Positive opinion
• Negative opinion
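Steps 2 and 4 of the opinion analysis example can be sketched in a few lines. This sketch does not run a part-of-speech tagger: it takes tagged text in the same "TAG word" layout as the slide (only an excerpt is used here) and pulls out adjective-noun pairs. The function names are hypothetical, and the supervised or unsupervised classification step is not shown.

```python
import re
from typing import List, Tuple


def remove_delimiters(text: str) -> str:
    """Step 2: replace punctuation with spaces and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", text)
    return " ".join(no_punct.split())


def parse_tagged(tagged: str) -> List[Tuple[str, str]]:
    """Turn 'TAG word TAG word ...' into a list of (tag, word) pairs."""
    tokens = tagged.split()
    return list(zip(tokens[0::2], tokens[1::2]))


def adjective_noun_pairs(pairs: List[Tuple[str, str]]) -> List[str]:
    """Step 4: keep each adjective (JJ) immediately followed by a noun (NN)."""
    features = []
    for (tag1, word1), (tag2, word2) in zip(pairs, pairs[1:]):
        if tag1 == "JJ" and tag2 == "NN":
            features.append(f"{word1} {word2}")
    return features


raw = "excellent phone, excellent service . i am a business user"
clean = remove_delimiters(raw)
print(clean)

# Excerpt of the tagger output shown in Step 3.
tagged = (
    "JJ excellent NN phone JJ excellent NN service "
    "FW i VBP am DT a NN business NN user"
)
features = adjective_noun_pairs(parse_tagged(tagged))
print(features)
```

The extracted pairs are the candidate features that a classifier would then label as positive or negative opinion.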
•The complexity of the rules and the magnitude of the data
being
removed or kept at each stage will vary by data source and
by business problem.
•The load processes and filters that are put on top of big
data are absolutely critical. Without getting those correct, it
will be very difficult to succeed.
•Traditional structured data doesn’t require as much effort
in these areas since it is specified, understood, and
standardized in advance.
•With big data, it is necessary to specify, understand, and
standardize it as part of the analysis process in many cases.
Example: Application of Filtering to websites to derive
knowledge
MIXING BIG DATA WITH TRADITIONAL DATA
•Perhaps the most exciting thing about big data isn’t what
it will do for a business by itself. It’s what it will do for a
business when combined with an organization’s other
data.
Example:
1. Browsing history, for example, is very powerful.
[Knowing how valuable a customer is and what they have
bought in the past across all channels makes web data
even more powerful by putting it in a larger context].
2. Smart-grid data is very powerful for a utility company.
[Knowing the historical billing patterns of customers, their
dwelling type, and other factors makes data from a smart
meter even more powerful by putting it in a larger
context.]
3. The text from customer service online chats and e-mails is
powerful.
[Knowing the detailed product specifications of the products
being
discussed, the sales data related to those products, and
historical product defect information makes that text data
even more powerful by putting it in a larger context.] -
Amazon Recommendation system
4.Enterprise Data Warehouses (EDWs) have become such a
widespread corporate tool not just to centralize a bunch of
data marts to save hardware and software costs.
•An EDW adds value by allowing different data sources to
intermix and enhance one another.
•With an EDW, it is possible to analyze customer and
employee data
together since they are in one location. They are no longer
completely
separate.
•This is why it is critically important that organizations don’t
develop a big data strategy that is distinct from their
traditional data strategy.
To succeed, it is necessary to plan not just how to capture
and analyze big data by itself, but also how to use it in
combination with other corporate data.
a. Data Mart
b. Data
Warehouse
Hierarchy of Enterprise Data
THE NEED FOR STANDARDS
•Will big data continue to be a wild west of crazy formats,
unconstrained streams, and lack of definition?
•Probably not. Over time, standards will be developed.
•Many semi-structured data sources will become more
structured over time, and individual organizations will fine-
tune their big data feeds to be friendlier for analysis.
•Example:
• SQL or similar language : usage with Big Data
• Formats, Interfaces to support interoperability across
distributed applications
• Web semantics: XML, OWL etc., with Big Data
• Cloud computing – Big data
TODAY’S BIG DATA IS NOT TOMORROW’S BIG DATA
•There is no specific, universal definition in terms of what
qualifies as big data.
•Rather, big data is defined in relative terms tied to
available technology and resources.
•As a result, what counts as big data to one company or
industry may not count as big data to another.
•A large e-commerce company is going to have a much
“bigger”
definition of big data than a small manufacturer will.
•What qualifies as big data will necessarily change over
time as the tools and techniques to handle it evolve
alongside raw storage size and processing power.
•Household demographic (population) files with hundreds of
fields and millions of customers were huge and tough to
manage a decade or two ago.
•Now such data fits on a thumb drive and can be analyzed by
a low-end laptop.
•Transactional data in the retail, telecommunications, and
banking industries were very big and hard to handle even a
decade ago.
•What we are intimidated by today won’t be so scary a few
years down the road.
Example 1:
• Clickstream data from the web may be a standard, easily
handled data source in 10 years
Click Stream :Trail left by users as they click their way
through a website.
Click-path optimization – Using clickstream analysis, businesses can
collect and analyze data to see which pages web visitors are visiting
and in what order.
Market basket analysis – The benefit of basket analysis for marketers
is that it can give them a better understanding of aggregate customer
purchasing behavior
Next Best Product analysis :helps marketers see what products
customers tend to buy together.
Website resource allocation: Clickstream data analysis tells
marketers which paths on the site are hot and which ones are not.
Customization: personalize the user experience and convert more
web visitors from browsers to buyers.
2. Actively processing every e-mail, customer service
chat, and social media comment may become a standard
practice for most organizations.
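The click-path and market basket analyses described under Example 1 both reduce to counting. A minimal sketch, assuming hypothetical session and basket data small enough to fit in memory:

```python
from collections import Counter
from itertools import combinations

# Hypothetical clickstream: the ordered pages each session visited.
sessions = {
    "s1": ["/home", "/laptops", "/cart"],
    "s2": ["/home", "/laptops", "/cart"],
    "s3": ["/home", "/phones"],
}

# Click-path analysis: count how often each full path occurs.
path_counts = Counter(tuple(pages) for pages in sessions.values())
most_common_path, visits = path_counts.most_common(1)[0]
print(most_common_path, visits)

# Hypothetical baskets: the set of products bought together in each order.
baskets = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "bag"},
    {"phone", "case"},
]

# Market basket analysis: count how often each pair of products co-occurs.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):  # sorted gives a stable pair key
        pair_counts[pair] += 1
print(pair_counts.most_common(1))
```

At real clickstream volumes the same counting would be distributed across machines, but the logic of tallying paths and product pairs stays the same.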
As we tame the current generation of big data streams,
other even bigger data sources are going to come along and
take their place.
1. Imagine web browsing data that expands to
include millisecond-level eyeball and mouse
movement so that every tiny detail of a user’s navigation
is captured, instead of just what was clicked on. This
is another order of big.
2. Imagine video game telemetry data being upgraded to
go beyond every button pressed or movement made
3. Imagine RFID (radio frequency identification)
information being available for every single individual
item in every single store, distribution facility, and
manufacturing plant globally.
4. Imagine capturing and translating to text every
conversation anyone has with a customer service or sales
line. Add to that all the associated e-mails, online chats,
and comments from places such as social media sites or
Web Data: The Original Big Data
•Wouldn’t
1. it be great to understand customer intent instead of
just customer action?
2. it be great to understand each customer’s thought
processes to determine whether they make a
purchase or not?
•Virtually impossible to get insights into such topics in the
past
•Today, such topics can be addressed with the use of
detailed web data.
•Organizations across a number of industries have
integrated detailed, customer-level behavioral data sourced
from a web site into their enterprise analytics environments.
•However, for most organizations web integration means
the inclusion of online transactions.
•Traditional web analytics vendors provide operational
reporting (every day task) on click-through rates, traffic
sources, and metrics based only on web data.
•However, detailed web behavior data was not
historically leveraged outside of web reporting.
Is it possible to understand users better? How?
WEB DATA OVERVIEW
•Organizations have talked about a 360-degree view of their
customers for years.
•What it really meant is that the organization has as full a
view of its customers as possible considering the
technology and data available at that point in time.
•However, the finish line is always moving. Just when you
think you have finally arrived, the finish line moves farther
out again.
•A few decades ago, companies were at the top of their game
if they had the names and addresses of their customers and
they were able to append demographic information (location
& population) to those names through the then-new
third-party data enhancement services.
•Eventually, cutting-edge companies started to have basic
recency, frequency, and monetary value (RFM) metrics
attached to customers. Such metrics look at when a
customer last purchased (recency), how often they have
purchased (frequency), and how much they spent (monetary
value).
•In the past 10 to 15 years, virtually all businesses started to
collect and analyze the detailed transaction histories of their
customers.
•This led to an explosion of analytical power and a much
deeper understanding of customer behavior.
•Many organizations are still frozen at the transactional
history stage.
•Today, while this transactional view is still important, many
companies incorrectly assume that it remains the closest
view possible to a 360-degree view of their customers.
•Today, organizations need to collect from newly evolving
big data sources related to their customers from a variety of
extended and newly emerging touch points such as web
browsers, mobile applications, kiosks, social media sites, and
more.
•Just as transactional data enabled a revolution in power of
computation and depth of analysis, so too do these new
data sources enable taking analytics to a new level.
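The RFM metrics described above (recency, frequency, monetary value) can be computed directly from a transaction history. The following is a minimal sketch, not taken from the slides: the `Purchase` record, the `rfm` function, and the sample data are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Purchase:
    """One transaction row (hypothetical schema for illustration)."""
    customer_id: str
    day: date
    amount: float


def rfm(purchases, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    acc = {}
    for p in purchases:
        last, count, total = acc.get(p.customer_id, (None, 0, 0.0))
        if last is None or p.day > last:
            last = p.day  # most recent purchase date seen so far
        acc[p.customer_id] = (last, count + 1, total + p.amount)
    # Recency is measured in days between the last purchase and `as_of`.
    return {cid: ((as_of - last).days, n, total)
            for cid, (last, n, total) in acc.items()}


purchases = [
    Purchase("c1", date(2024, 1, 10), 40.0),
    Purchase("c1", date(2024, 3, 1), 60.0),
    Purchase("c2", date(2024, 2, 15), 25.0),
]
scores = rfm(purchases, as_of=date(2024, 3, 31))
print(scores["c1"])  # (30, 2, 100.0)
```

Here customer c1 last purchased 30 days before the reference date, purchased twice, and spent 100.0 in total.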
What Are You Missing? (with Traditional Data)
•Have you ever stopped to think about what happens if only
the transactions generated by a web site are captured?
Study reveals: 95 percent of browsing sessions do not
result in a basket being created. Of the remaining 5 percent,
only about half, or 2.5 percent, actually begin the checkout
process. And of that 2.5 percent, only two-thirds, or 1.7
percent, actually complete a purchase.
•What this means is that information is missing on more
than 98 percent of web sessions, if only transactions are
tracked.
•For every purchase transaction, there might be dozens or
hundreds of specific actions taken on the site to get to that
sale. That information needs to be collected and analyzed
alongside the final sales data.
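The funnel arithmetic quoted above can be checked in a few lines. This sketch simply multiplies the stage rates given in the slide; the variable names are illustrative.

```python
from fractions import Fraction

sessions = Fraction(1)
basket = sessions * Fraction(5, 100)   # 5% of sessions create a basket
checkout = basket * Fraction(1, 2)     # about half of those begin checkout
purchase = checkout * Fraction(2, 3)   # two-thirds of those complete a purchase

# Share of sessions invisible if only transactions are tracked.
missing = 1 - purchase

print(f"purchase rate: {float(purchase):.1%}")  # purchase rate: 1.7%
print(f"missing:       {float(missing):.1%}")   # missing:       98.3%
```

The result matches the slide's claim that more than 98 percent of sessions leave no trace in transaction-only data.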
Imagine the Possibilities (Organizations are trying to know)
•Imagine knowing everything customers do as they go
through the process of doing business with your
organization.
•Not just what they buy, but what they are thinking about
buying along with what key decision criteria they use.
•Such knowledge enables a new level of understanding
about your customers and a new level of interaction with
your customers.
Example:
1. Imagine you are a retailer. Imagine walking through the
store with customers and recording every place they go, every
item they look at, every item they pick up, and every item
they put in the cart and take back out. Imagine knowing whether
they read nutritional information, if they look at laundry
instructions, if they read the promotional brochure on the
shelf, or if they look at other information made available to
them in the store.
2. Imagine you are a telecom company. Imagine being
able to identify every phone model, rate plan, data plan,
and accessory that customers considered before making
a final decision.
What is the difference between Traditional
Analytics and New Scalable Analytics?
What Data Should Be Collected and from where?
•Any action that a customer takes while interacting with an
organization should be captured if it is possible to capture it
from web sites, kiosks, social media, mobile apps, etc.
•A wide range of events can be captured, such as: Purchases,
Requesting help, Product views, Forwarding a link, Shopping
basket additions, Posting a comment, Watching a video,
Registering for a webinar, Accessing a download, Executing
a search, Reading/writing a review, etc.
What about privacy? (How is Flipkart handling this?)
•Privacy is a big issue today and may become an even
bigger issue as time passes.
•Need to respect not just formal legal restrictions, but also
what your customers will view as appropriate.
•Faceless Customer: (identity of customer masked in data
stores)
An arbitrary identification number that is not personally
identifiable can be matched to each unique customer
based on a logon, cookie, or similar piece of information.
This creates what might be called a “faceless” customer
record.
•It is the patterns across faceless customers that matter,
not the behavior of any specific customer
•With today’s database technologies, it is possible to
enable analytic professionals to do analysis without
having any ability to identify the individuals involved.
•This can remove many privacy concerns.
Many organizations are in fact identifying and targeting
specific customers as a result of such analytics.
Organizations have presumably put in place privacy
policies, including opt-out options, and are careful to
follow them.
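The "faceless" customer record described above can be sketched as a lookup table that assigns an arbitrary identifier to each logon or cookie value. This is only an illustration under stated assumptions: the class name `FacelessIdMap`, the event tuples, and the choice of a random 16-character hex token are hypothetical, and a real deployment would keep the lookup table in a store that analysts cannot read.

```python
import secrets


class FacelessIdMap:
    """Maps an identifying key (logon, cookie) to an arbitrary ID."""

    def __init__(self):
        self._ids = {}  # kept separate from the analytic data store

    def faceless_id(self, identifying_key: str) -> str:
        # Assign a random ID on first sight; reuse it afterwards so that
        # one customer's events can still be linked together.
        if identifying_key not in self._ids:
            self._ids[identifying_key] = secrets.token_hex(8)
        return self._ids[identifying_key]


id_map = FacelessIdMap()
raw_events = [
    ("alice@example.com", "product_view"),
    ("bob-cookie-123", "search"),
    ("alice@example.com", "basket_add"),
]
# Analysts see only the faceless ID and the action, never the key.
analytic_events = [(id_map.faceless_id(key), action)
                   for key, action in raw_events]
```

The same customer always receives the same identifier, so patterns across sessions remain visible without exposing who the customer is.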
What Web Data Reveals
1. Shopping Behaviors:
A good starting point to understand shopping behavior is
identifying:
•How customers come to a site, how they begin shopping, and
how they navigate its pages.
•What search engine do they use?
•What specific search terms are entered?
•Do they use a bookmark they created previously?
•Analytic professionals can take this information and look for
patterns in terms of which search terms, search engines,
and referring sites are associated with higher sales rates.
•One very useful capability of web data is to identify product
sets that are of interest to a customer before they make a
purchase.
•For example, consider a customer who views computers,
backup disks, printers, and monitors. It is likely the
customer is considering a complete PC system upgrade.
•Offer a package right away that contains the specific mix of
items the customer has browsed.
•Do not wait until after customers purchase the computer
and then offer generic bundles of accessories.
•A customized bundle offer is more powerful than a generic
one. [study says]
•We find this feature lacking in many sites (project work?)
2. Customer Purchase Paths and Preferences
•It is possible to explore and identify the ways customers
arrive at their buying decisions by watching how they
navigate a site.
•It is also possible to gain insight into their preferences.
Consider, for example, an airline:
•An airline can tell a number of things about preferences
based on the ticket that is booked.
•For example:
1. How far in advance was the ticket booked?
2. What fare class was booked?
3. Did the trip span a weekend or not?
•This is all useful, but an airline can get even more from web
data.
•An airline can identify customers who value convenience.
(Such customers typically start searches with specific times
and direct flights only.)
•Airlines can also identify customers who value price first and
foremost and are willing to consider many flight options to
get the best price.
•Based on search patterns, airlines can also tell whether
customers value deals or specific destinations.
•Example: Does the customer research all of the special deals
available and then choose one for the trip? Or does the
customer look at a certain destination and pay what is
required to get there?
•For example, a college student may be open to any number
of vacation destinations and will take the one with the best
deal. On the other hand, a customer who visits family on a
regular basis will only be interested in flying to where the
family is.
3. Research Behaviors
•Understanding how customers utilize the research content on
a site can lead to tremendous insights into how to interact
with each individual customer, as well as how different aspects
of the site do or do not add value in driving sales.
For example, consider an online store selling clothes: sarees,
Zovi shirts.
•Another way to use web data to understand customers’
research patterns is to identify which of the pieces of
information offered on a site are valued by the customer base
overall and the best customers specifically.
•How often do customers look at previews (glance),
additional photos (thumbnails/regular), technical specs, or
reviews before making a purchase?
•Session data combined with other data helps reveal when
customers buy: on the same day or the next day.
4. Feedback Behaviors
•Where is the feedback expressed?
•Is it relevant? Biased?
•Does it matter?
Web Data in Action
•What an organization knows about its customers is never
the complete picture.
•It is always necessary to make assumptions based on the
information available.
•If there is only a partial view, the full view can often be
extrapolated accurately enough to get the job done.
•It is also possible that the missing information paints a
totally different picture than expected.
•In the cases where the missing information differs from
the assumptions, it is possible to make suboptimal, if not
totally wrong, decisions.
•A very common marketing example is predicting the next best
offer for a customer. Of all the available options, which
single offer should next be suggested to a customer to
maximize the chances of success?
•Can web behaviour data help?
Case 1: BANK
•Mr. Kumar has an account with PNB … etc.,
with relevant information.
•What is the best offer you can send via e-mail?
•Would it ever occur to the bank to send a promotional offer
on a mortgage or housing loan? With web data, the bank now
knows what to discuss with Mr. Kumar.
Case 2: Dominos
•Traditional data they get is:
• Historical purchases
• Marketing campaign and response history
•With web data:
• The effort leads to major changes in the promotional
efforts versus the traditional approach, providing the
following results:
• A decrease in total mailings
• A reduction in total catalog promotion pages
• A materially significant increase in total revenues
• Question: With An Example, Justify How Web Data
Contributes To Better Promotional Benefits As Against
Traditional Data?
Attrition Modelling
•In the telecommunications sector (for example), companies
have invested massive amounts of time and effort to create,
enhance, and perfect “churn” models (trying to identify
customers who are about to leave).
•Churn models flag those customers most at risk of
cancelling their accounts so that action can be taken
proactively to prevent them from doing so.
•Management of customer churn has been, and remains,
critical to understanding patterns of customer usage and
profitability.
Example :
•Mrs. Smith, as a customer of telecom Provider “AIR”, goes to
Google and types “How do I cancel my Provider AIR
contract?” (Web Data).
•Company analysts might or might not have seen her
usage dropping.
•It would take weeks to months to identify such a change in
usage pattern anyway.
•By capturing Mrs. Smith’s actions on the web, Provider
“AIR” is able to move more quickly to avert losing Mrs.
Smith.
Response Modelling
•Many models are created to help predict the choice a
customer will make when presented with a (Data set)
request for action.
•Models typically try to predict which customers will make a
purchase, or accept an offer, or click on an e-mail link.
•For such models, a technique called logistic regression is
often used. These models are usually referred to as
response models or propensity models.
•The main difference between this and an attrition model?
A churn model predicts negative behaviour; a purchase or
response model predicts positive behaviour.
WORKING
•When using a response or propensity model, all customers
are scored and ranked by likelihood of taking action.
•Then, appropriate segments (groups) are created based on
those ranks in order to reach out to the customers.
•In theory, every customer has a unique score. In practice,
since only a small number of variables define most models,
many customers end up with identical or nearly identical
scores.
•Example: Customers who are not very frequent or high-
spending.
•In many cases, many customers can end up in big groups
with very similar/ very low scores.
•Web data can help greatly increase differentiation among
customers.
For example, consider a scenario (score can increase or
decrease by delta x):
•Customer 1 has never browsed your site.
•Customer 2 viewed the product category featured in the
offer within the past month.
•Customer 3 viewed the specific product featured in the offer
within the past month.
•Customer 4 browsed the specific product featured three
• When asked about the value of incorporating web data,
a director of marketing from a multichannel American
specialty retailer replied, “It’s like printing money!”
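The scoring-and-ranking step, and the way web data separates customers who would otherwise receive identical scores, can be sketched with a hand-written logistic score. Everything here is an assumption made for illustration: the coefficient values, the feature names, and the three customers (modelled on Customers 1 to 3 in the scenario above) are hypothetical, and a real model would learn its coefficients from data.

```python
import math

# Hypothetical coefficients; a fitted logistic regression would supply these.
INTERCEPT = -3.0
COEF = {
    "purchases_last_year": 0.4,  # traditional transaction feature
    "viewed_category": 0.8,      # web feature: browsed the offer's category
    "viewed_product": 1.5,       # web feature: browsed the offer's product
}


def score(features):
    """Logistic response score in (0, 1) for one customer."""
    z = INTERCEPT + sum(COEF[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


customers = {
    "customer_1": {"purchases_last_year": 1, "viewed_category": 0, "viewed_product": 0},
    "customer_2": {"purchases_last_year": 1, "viewed_category": 1, "viewed_product": 0},
    "customer_3": {"purchases_last_year": 1, "viewed_category": 1, "viewed_product": 1},
}

# With transaction data only, all three customers get the same score.
transaction_only = {
    score({"purchases_last_year": f["purchases_last_year"]})
    for f in customers.values()
}

# With web features added, the scores separate and can be ranked.
ranked = sorted(customers, key=lambda c: score(customers[c]), reverse=True)
print(ranked)  # ['customer_3', 'customer_2', 'customer_1']
```

The set `transaction_only` contains a single value, which mirrors the slide's point that many customers end up with identical scores until browsing behaviour is added.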
Customer Segmentation (Grouping): Study
•What is segmentation?
•How was segmentation done traditionally?
•Web data also enables segmentation of customers based
on their typical browsing patterns. (Seminar/Project topic on
assessing browsing pattern of users)
•Such segmentation will provide a completely different view
of customers than traditional demographic or sales-based
segmentation schemas.
•Assignment: Create a Dreamers segment and identify the
items selected by the Dreamers.
Example:
•Consider a segment called the Dreamers that has been
derived purely from browsing behavior.
Who are they?
•Dreamers repeatedly put an item in their basket, but then
abandon it. Dreamers often add and abandon the same
item many times.
This may be especially true for a high-value item like a TV
or computer. It should be possible to identify the segment
of people that does this repeatedly.
•So, what is the outcome of this “Dreamers” segment?
1. What is it that the customers are abandoning?
•Perhaps a customer is looking at a high-end TV that is quite
expensive, or a phone, camera, etc.
•Is price the issue? From the past data, we get to know that
the customer often aims too high and later will buy a less-
expensive product than the one that was abandoned
repeatedly.
Action Plan
•Sending an e-mail pointing to less-expensive options or
other varieties of high-end TV.
2. Get to know the abandoned-basket statistics, which can
help organizations identify prospective customers who abandon
baskets.
[Helps analysts report survey results such as “97% of customers
abandoned their baskets.” It also gives insights into procedural
aspects and the unavailability of services such as COD, credit
card, etc.]
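A Dreamers segment of the kind described above could be derived from a basket event log by counting how often each customer abandons the same item. This is a minimal sketch under assumptions: the event tuple layout, the action names, and the `min_abandons` threshold are hypothetical choices, not something the slides specify.

```python
from collections import Counter


def find_dreamers(events, min_abandons=2):
    """Return {customer_id: [items abandoned at least min_abandons times]}.

    events: iterable of (customer_id, item, action) tuples, where action
    is one of "add", "abandon", or "purchase" (assumed vocabulary).
    """
    abandons = Counter()
    for customer, item, action in events:
        if action == "abandon":
            abandons[(customer, item)] += 1

    dreamers = {}
    for (customer, item), count in abandons.items():
        if count >= min_abandons:
            dreamers.setdefault(customer, []).append(item)
    return dreamers


events = [
    ("c1", "tv", "add"), ("c1", "tv", "abandon"),
    ("c1", "tv", "add"), ("c1", "tv", "abandon"),
    ("c2", "camera", "add"), ("c2", "camera", "purchase"),
    ("c3", "phone", "add"), ("c3", "phone", "abandon"),
]
print(find_dreamers(events))  # {'c1': ['tv']}
```

Only c1 qualifies: c2 completed the purchase and c3 abandoned the item once, which is below the assumed threshold.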
Assessing Advertising Results
•Assessing paid search and online advertising results is
another high-impact analysis enabled with customer level
web behavior data.
•Traditional web analytics provide high-level summaries
such as total clicks, number of searches, cost per click,
keywords leading to the most clicks, page position
statistics etc.
•Most focus on a single web channel.
•This means that all statistics are based only on what
happened during the single session generated from the
search or ad click.
•Once a customer leaves the web site and web session
ends, the scope of the analysis is complete.
•There is no attempt to account for past or future visits in
the statistics.
•By incorporating customers’ browsing data and extending
the view to other channels as well, it is possible to assess
search and advertising results at a much deeper level.
For example:
• How many sales did the first click generate over the following days or weeks?
• Are certain web sites drawing more customers from referring sites?
• Cross-channel analysis: how are sales doing after information about
the channel was provided on the web via an ad or search?
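The first question above (sales generated by a first click over days or weeks) can be sketched by extending the analysis window beyond the single session. This is a deliberately simple first-click attribution written for illustration; the function name, the data layout, and the sample figures are hypothetical.

```python
from datetime import date, timedelta


def sales_after_first_click(first_clicks, purchases, window_days):
    """Total purchase amount within window_days of each customer's first click.

    first_clicks: {customer_id: date of first ad or search click}
    purchases:    list of (customer_id, purchase_date, amount), any channel
    """
    total = 0.0
    for customer, day, amount in purchases:
        first = first_clicks.get(customer)
        if first is None:
            continue  # customer never clicked the ad
        if first <= day <= first + timedelta(days=window_days):
            total += amount
    return total


first_clicks = {"c1": date(2024, 5, 1), "c2": date(2024, 5, 3)}
purchases = [
    ("c1", date(2024, 5, 1), 20.0),   # same day as the click
    ("c1", date(2024, 5, 6), 50.0),   # five days later
    ("c2", date(2024, 5, 20), 30.0),  # seventeen days later
    ("c3", date(2024, 5, 2), 10.0),   # never clicked
]

print(sales_after_first_click(first_clicks, purchases, 0))   # 20.0
print(sales_after_first_click(first_clicks, purchases, 7))   # 70.0
print(sales_after_first_click(first_clicks, purchases, 30))  # 100.0
```

A same-day view credits the click with 20.0, while a 30-day view credits it with 100.0, which illustrates how much a single-session analysis can understate results.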
CROSS SECTION OF BIG DATA
SOURCES AND VALUE THEY HOLD
1. AUTO INSURANCE: THE VALUE OF TELEMATICS DATA
CASE STUDY
•Telematics involves putting a sensor, or black box, into a car
to capture information about what’s happening with the car.
This black box can measure any number of things
depending on how it is configured.
•It can monitor speed, mileage driven, or if there has been
any heavy braking.
•Telematics data helps insurance companies better
understand customer risk levels and set insurance rates.
•If privacy concerns are ignored and it is taken to the
extreme, a telematics device could keep track of everywhere
a car went, when it was there, how fast it was going, and
what features of the car were in use.
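As an illustration of how a telematics feed might be turned into a risk signal, the sketch below flags heavy braking from timestamped speed readings. The sample format and the 10 km/h-per-second threshold are assumptions made for this example; an insurer would choose its own definition of heavy braking.

```python
def heavy_braking_events(samples, threshold_kmh_per_s=10.0):
    """Return the timestamps at which deceleration met the threshold.

    samples: list of (time_s, speed_kmh) readings in time order.
    """
    events = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order readings
        deceleration = (v0 - v1) / dt  # km/h lost per second
        if deceleration >= threshold_kmh_per_s:
            events.append(t1)
    return events


samples = [(0, 60.0), (1, 58.0), (2, 40.0), (3, 38.0), (4, 38.0)]
print(heavy_braking_events(samples))  # [2]
```

Only the drop from 58 to 40 km/h within one second exceeds the assumed threshold, so a single event is reported at t = 2.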
2. MULTIPLE INDUSTRIES: THE VALUE OF TEXT DATA
•Text is one of the biggest and most common sources of big
data. Just imagine how much text is out there.
•There are e-mails, text messages, tweets, social media
postings, instant messages, real-time chats, and audio
recordings that have been translated into text.
•Text data is one of the least structured and largest sources of
big data in existence today.
•Luckily, a lot of work has been done already to tame text data
and utilize it to make better business decisions.
•Text mining approaches have their own
advantages/disadvantages.
•Here, we focus on how to use the results, not how to
produce them.
•For example, once the sentiment of a customer’s e-mail is
identified, it is possible to generate a variable that tags the
customer’s sentiment as negative or positive. That tag is
now a piece of structured data that can be fed into an
analytics process.
•Creating structured data out of unstructured text is often
called information extraction.
•As another example, assume that we’ve identified which
specific products a customer commented about in his or
her communications with our company.
•We can then generate a set of variables that identify the
products discussed by the customer. Those variables are
again metrics that are structured and can be used for
analysis purposes.
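The two examples above (a sentiment tag and product-mention variables) both turn free text into structured fields. The sketch below shows the shape of such a record using a toy keyword lookup. The word lists, the product names, and the "neutral" fallback are assumptions made purely for illustration; real text mining tools use far richer methods.

```python
import re

# Toy lexicons, assumed for illustration only.
NEGATIVE = {"broken", "terrible", "refund", "cancel"}
POSITIVE = {"great", "love", "thanks", "excellent"}
PRODUCTS = {"router", "phone", "tablet"}


def extract(text):
    """Turn one customer message into a flat, structured record."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    neg = len(words & NEGATIVE)
    pos = len(words & POSITIVE)
    if neg > pos:
        sentiment = "negative"
    elif pos > neg:
        sentiment = "positive"
    else:
        sentiment = "neutral"  # fallback when the toy lexicon is undecided

    record = {"sentiment": sentiment}
    for product in sorted(PRODUCTS):
        # One boolean variable per product the customer mentioned.
        record[f"mentions_{product}"] = product in words
    return record


print(extract("My router is broken and I want a refund."))
```

The resulting dictionary has one sentiment field and one flag per product, so it can be joined to other customer data like any other structured row.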
3. MULTIPLE INDUSTRIES: THE VALUE OF TIME AND LOCATION
DATA
•With the advent of global positioning systems (GPS),
personal GPS devices, and cellular phones, time and
location information is a growing source of data.
•A wide variety of services and applications, from Google
Places to Facebook Places, are centered on registering
where a person is at a given point in time.
•Cell phone applications can record your location and
movement on your behalf.
•Cell phones can even provide a fairly accurate location
using cell tower signals, if a phone is not formally GPS-
enabled.
•For example, there are applications that allow you to track the
exact routes you travel when you exercise, how long the
routes are, and how long it takes you to complete the
routes.
•The fact is, if you carry a cell phone, you can keep a record
of everywhere you’ve been. You can also open up that data
to others if you choose.
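The exercise-tracking example above (route length and completion time) can be sketched from a list of timestamped GPS fixes. This uses the standard haversine great-circle formula with a mean Earth radius of 6371 km; the point format and function names are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def route_summary(points):
    """points: list of (time_s, lat, lon) in time order.

    Returns (length_km, duration_s).
    """
    length = sum(haversine_km(a[1], a[2], b[1], b[2])
                 for a, b in zip(points, points[1:]))
    duration = points[-1][0] - points[0][0] if points else 0
    return length, duration


# Three fixes along the equator, five minutes apart.
points = [(0, 0.0, 0.0), (300, 0.0, 0.01), (600, 0.0, 0.02)]
length_km, duration_s = route_summary(points)
print(round(length_km, 2), duration_s)  # 2.22 600
```

Summing the segment distances gives the route length, and the difference between the first and last timestamps gives the time taken.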
Foundations of Big Data: Concepts, Techniques, and Applications
ayeleasefa2
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf
axonneurologycenter1
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
Process Mining and Data Science in the Financial Industry
Process Mining and Data Science in the Financial IndustryProcess Mining and Data Science in the Financial Industry
Process Mining and Data Science in the Financial Industry
Process mining Evangelist
 
Ad

Foundations of Big Data: Concepts, Techniques, and Applications

  • 1. After completing this course, students should be able to: • CO1: Understand the significance, structure and sources of big data • CO2: Assess avenues for analytical scalability • CO3: Comprehend stream computing and applications • CO4: Apply the different clustering techniques • CO5: Use different frameworks and visualization techniques Course Outcomes
  • 2. Introduction To Big Data: What Is Big Data? Is The "Big" Part Or The "Data" Part More Important? How Is Big Data Different? How Is Big Data More Of The Same? Risks Of Big Data - Why You Need To Tame Big Data - The Structure Of Big Data - Exploring Big Data, Most Big Data Doesn't Matter - Filtering Big Data Effectively - Mixing Big Data With Traditional Data - The Need For Standards - Today's Big Data Is Not Tomorrow's Big Data. Web Data: The Original Big Data - Web Data Overview - What Web Data Reveals - Web Data In Action? A Cross-Section Of Big Data Sources And The Value They Hold. Unit I
  • 3. Data Analysis: Evolution Of Analytic Scalability – Convergence – Parallel Processing Systems – Cloud Computing – Grid Computing – MapReduce – Enterprise Analytic Sandbox – Analytic Data Sets – Analytic Methods – Analytic Tools – Cognos – MicroStrategy – Pentaho. Analysis Approaches – Statistical Significance – Business Approaches – Analytic Innovation – Traditional Approaches – Iterative Unit II
  • 4. Mining Data Streams: Introduction To Streams Concepts, Stream Data Model And Architecture, Stream Computing, Sampling Data In A Stream, Filtering Streams, Counting Distinct Elements In A Stream, Estimating Moments, Counting Ones In A Window, Decaying Window, Realtime Analytics Platform (RTAP) Applications, Case Studies, Real Time Sentiment Analysis, Stock Market Predictions. Unit III
  • 5. Frequent Itemsets And Clustering: Mining Frequent Itemsets – Market Basket Model – Apriori Algorithm – Handling Large Data Sets In Main Memory – Limited Pass Algorithm – Counting Frequent Itemsets In A Stream – Clustering Techniques – Hierarchical – K-Means – Clustering High Dimensional Data – CLIQUE And PROCLUS – Frequent Pattern Based Clustering Methods – Clustering In Non-Euclidean Space – Clustering For Streams And Parallelism. Unit IV
  • 6. Frameworks And Visualization: MapReduce – Hadoop, Hive, MapR – Sharding – NoSQL Databases – S3 – Hadoop Distributed File Systems – Visualizations – Visual Data Analysis Techniques, Interaction Techniques; Systems And Applications: Unit V
  • 8. What is Big Data? According to a study reported in the literature: •Every day, we create 2.5 quintillion bytes of data (1 quintillion is 10^18). •So much that 90% of the data in the world today has been created in the last two years alone. •This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, cell phone GPS signals, etc.
  • 9. According to another study: •From the beginning of recorded time (1990) until 2003, 5 billion gigabytes of data were created. •In 2011, the same amount was created every two days. •In 2013, the same amount of data was created every 10 minutes. •In 2015, the same amount or more was generated every 10 minutes. •Advances in communications, computation, and storage have created huge collections of data, holding information of value to business, science, government and society.
  • 10. •Example: Search engine companies such as Google, Yahoo!, and Microsoft have created an entirely new business by capturing the information freely available on the World Wide Web and providing it to people in useful ways. (SOCIAL NETWORKING) •These companies collect trillions of bytes of data every day and provide NEW SERVICES such as satellite images, driving directions, image retrieval, etc. • The societal benefits of these services are well appreciated: they have transformed how people find and make use of information on a daily basis.
  • 12. •It can be used in a wide variety of areas: business, health care, science, defence, etc. Example: Health care (AKA HEALTH INFORMATICS) •Modern medicine collects huge amounts of information about patients through imaging technology (CAT scans, MRI), genetic analysis (DNA microarrays), and other forms of diagnostic equipment. •By applying analytics to data sets for large numbers of patients, medical researchers are gaining fundamental insights into the GENETIC AND ENVIRONMENTAL CAUSES OF DISEASES, and creating more effective means of diagnosis. •Recently a Hollywood star underwent surgery to prevent cancer. [who]
  • 13. According to a McKinsey report published in the US: •140,000-190,000 workers with “knowledge of big data analytics” will be needed in the US alone. (2014) •Furthermore, 1.5 million managers will need to become data-literate. •Many agencies, media houses and scientific communities across the world have identified Big Data as an important research area.
  • 14. •Like it or not, a massive amount of data will be coming your way soon. •Perhaps it has reached you already. •Perhaps you’ve been wrestling with it for a while— trying to figure out how to store it for later access, address its mistakes and imperfections, or classify it into structured categories. GENESIS………………………The Beginning
  • 19. KNOW THE DIFFERENCE BETWEEN BIG DATA AND BIG DATA MANAGEMENT As the author Bill Franks puts it, •There may soon be not only a flood of data, but also a flood of books on big data. •Most of these big-data books will be about the management of big data: • How to wrestle it into a database or data warehouse. • How to structure and categorize unstructured data. • If you find yourself reading a lot about Hadoop or MapReduce or various approaches to data warehousing, you’ve stumbled upon, or were perhaps seeking, a “big data management” (BDM) book.
  • 20. • BDM is, of course, important work. No matter how much data you have of whatever quality, it won’t be much good unless you get it into an environment and format in which it can be accessed and analyzed. • BDM alone won’t get you very far. You also have to analyze and act on it for data of any size to be of value. • Just as traditional database management tools didn’t automatically analyze transaction data from traditional systems, Hadoop and MapReduce won’t automatically interpret the meaning of data from web sites, gene mapping, image analysis, or other sources of big data. BDM
  • 21. WHAT IT MEANS TO US: [APPLICATION] You receive an EMAIL: It contains an offer for a complete personal computer system. It seems like the retailer read your mind since you were exploring computers on their web site just a few hours prior. … As you drive to the store to buy the computer bundle, you get an offer for a discounted coffee from the coffee shop you are getting ready to drive past. It says that since you’re in the area, you can get 10% off if you stop by in the next 20 minutes As you drink your coffee, you receive an apology from the manufacturer of a product that you complained about yesterday on your Facebook page, as well as on the company’s web site. … Finally, once you get back home, you receive notice of a gadget upgrade available for purchase in your favorite online video game. Etc…………..
  • 22. • The explosion of new and powerful data sources like Facebook, Twitter, LinkedIn, YouTube, etc. contributes immensely to big data and research. • Advanced analytics will have great impact. • To stay competitive, it is imperative that organizations aggressively pursue capturing and analyzing these new data sources to gain the insights that they offer. • Ignoring big data will put an organization at risk and cause it to fall behind the competition. • Analytic professionals have a lot of work to do! It won’t be easy to incorporate big data alongside all the other data that has been used for analysis for years. DATA SOURCES
  • 24. • 500 million Tweets sent each day! • More than 4 million hours of content uploaded to YouTube every day! • 3.6 billion Instagram likes each day. • 4.3 billion Facebook messages posted daily! • 5.75 billion Facebook likes every day. • 40 million Tweets shared each day! • 6 billion daily Google searches! And don’t think that, with these increases in social media, email is going away any time soon! According to The Radicati Group, 205 billion emails were sent each day in 2015, and by 2019 that number will increase by 20% to 246 billion emails each day! Big Data?
  • 27. WHAT IS BIG DATA? •There is no consensus in the marketplace as to how to define big data! • Def#1: “Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” [Teradata Magazine article] • Def#2: “Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” [McKinsey Global Institute] •Def#3: The “big” in big data also refers to several other characteristics of a big data source. These aspects include volume, velocity, variety and veracity (optional). [Gartner Group]
  • 28. Volume: • The sheer volume of data being stored today is exploding. • In the year 2000, 800,000 petabytes (PB) of data were stored in the world. • We expect this number to reach 35 zettabytes (ZB) by 2020. Twitter alone generates more than 7 terabytes (TB) of data every day, Facebook 10 TB etc.
  • 29. Variety: “Variety Is the Spice of Life” • The volume associated with the Big Data phenomenon brings a new challenge for the data centres trying to deal with it: its variety. • With the explosion of sensors and smart devices, as well as social collaboration technologies, data in an enterprise has become complex, because it includes not only traditional relational data • but also raw, semi-structured, and unstructured data from web pages, web log files (including click-stream data), search indexes, social media forums, e-mail, documents, sensor data from active and passive systems, and so on.
  • 31. Velocity : How Fast Is Fast? •The speed at which the data is flowing. •Increase in RFID sensors and other information streams has led to a constant flow of data at a pace that has made it impossible for traditional systems to handle •Competition can mean identifying a trend, problem, or opportunity only seconds, or even microseconds, before someone else. •In traditional processing, you can think of running queries against relatively static data
  • 32. •For example, the query “Show me all people living in City X” would produce a single result set to be used as a warning list for an incoming weather pattern. •With streams computing [IBM], you can execute a process similar to a continuous query that identifies people who are currently in City X, and you get continuously updated results, because location information from GPS data is refreshed in real time. •Big Data requires that you perform analytics against the volume and variety of data while it is still in motion, not just after it is at rest.
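The contrast between a one-off query on data at rest and a continuous query on data in motion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `(person, city)` update format and the sample feed are hypothetical, and it does not use IBM's streams product or any real streaming API.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple


def static_query(locations: Dict[str, str], city: str) -> Set[str]:
    """One-off query against data at rest: a single result set."""
    return {person for person, where in locations.items() if where == city}


def continuous_query(updates: Iterable[Tuple[str, str]], city: str) -> Iterator[Set[str]]:
    """Re-evaluate the result after every location update (data in motion)."""
    current: Set[str] = set()
    for person, where in updates:
        if where == city:
            current.add(person)
        else:
            current.discard(person)  # person has left the city (or was never in it)
        yield set(current)           # emit a fresh snapshot of the result


# Hypothetical GPS feed: (person, city) pairs arriving over time.
gps_feed = [("ann", "City X"), ("bob", "City Y"), ("bob", "City X"), ("ann", "City Z")]
snapshots = list(continuous_query(gps_feed, "City X"))
at_rest = static_query({"ann": "City X", "bob": "City Y"}, "City X")
```

The static query answers once; the continuous query yields a new result each time the feed changes, which is the behaviour the slide describes.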
  • 33. Veracity: (Non reliable Data) •There is volume, velocity and variety • There is Big data Hype, also there is non-reliability with data • How effective will these data be? • Example: Product Branding, Image Branding, Image assignation In addition a couple of V’s are also suggested:
  • 35. No single definition; here is from Wikipedia:  Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.  The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.  The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real- time roadway traffic conditions.” What’s Big Data?
  • 37. • Data Volume ◦ 44x increase from 2009 to 2020 ◦ From 0.8 zettabytes (ZB) to 35 ZB • Data volume is increasing exponentially Volume (Scale) Exponential increase in collected/generated data
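As a quick check of the arithmetic behind these figures, the sketch below recomputes the overall growth multiple and the compound annual growth it would imply. It assumes decimal (SI) units and steady year-on-year growth, neither of which the slide states.

```python
# Figures quoted on the slide: 0.8 ZB in 2009, 35 ZB projected for 2020.
PB_PER_ZB = 1_000_000  # assumption: SI units, 1 ZB = 10**21 bytes, 1 PB = 10**15 bytes

start_zb, end_zb = 0.8, 35.0
start_year, end_year = 2009, 2020

growth_factor = end_zb / start_zb               # overall multiple over the period
years = end_year - start_year
annual_rate = growth_factor ** (1 / years) - 1  # compound growth per year

print(f"0.8 ZB = {start_zb * PB_PER_ZB:,.0f} PB")
print(f"overall growth: {growth_factor:.2f}x")
print(f"implied compound annual growth: {annual_rate:.1%}")
```

The overall multiple is 43.75, which rounds to the 44x quoted above, and 0.8 ZB corresponds to 800,000 PB.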
  • 38. 12+ TBs of tweet data every day 25+ TBs of log data every day ? TBs of data every day 2+ billion people on the Web by end 2011 30 billion RFID tags today (1.3B in 2005) 4.6 billion camera phones world wide 100s of millions of GPS enabled devices sold annually 76 million smart meters in 2009… 200M by 2014
  • 39. Maximilien Brice, © CERN CERN’s Large Hadron Collider (LHC) generates 15 PB a
  • 40. • The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI) The Earthscope
  • 41. • Relational Data (Tables/Transaction/Legacy Data) • Text Data (Web) • Semi-structured Data (XML) • Graph Data ◦ Social Network, Semantic Web (RDF), … • Streaming Data ◦ You can only scan the data once • A single application can be generating/collecting many types of data • Big Public Data (online, weather, finance, etc.) Variety (Complexity) To extract knowledge, all these types of data need to be linked together
  • 42. A Single View to the Customer [diagram: Customer at the centre, linked to Social Media, Gaming, Entertainment, Banking, Finance, Our Known History, Purchase]
  • 43. • Data is being generated fast and needs to be processed fast • Online data analytics • Late decisions → missing opportunities • Examples ◦ E-Promotions: Based on your current location, your purchase history, and what you like → send promotions right now for the store next to you ◦ Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction Velocity (Speed)
  • 44.  The progress and innovation is no longer hindered by the ability to collect data  But, by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion Real-time/Fast Data Social media and networks (all of us are generating data) Scientific instruments (collecting all sorts of data) Mobile devices (tracking all objects all the time) Sensor technology and networks (measuring all kinds of data)
  • 45. Real-Time Analytics/Decision Requirement Customer Influence Behavior Product Recommendations that are Relevant & Compelling Friend Invitations to join a Game or Activity that expands business Preventing Fraud as it is Occurring & preventing more proactively Learning why Customers Switch to competitors and their offers; in time to Counter Improving the Marketing Effectiveness of a Promotion while it is still in Play
  • 46. Variability: •It is often confused with variety. Example: •Say you have a bakery that sells 10 different breads. That is variety. Now imagine you go to that bakery three days in a row and every day you buy the same type of bread, but each day it tastes and smells different. That is variability. •Variability is thus very relevant in performing sentiment analysis. •Variability means that the meaning is changing (rapidly). •In (almost) identical tweets, a word can have a totally different meaning.
  • 47. Some Make it 4V’s
  • 48. Visualization •This is the hard part of big data. •It means making all that vast amount of data comprehensible in a manner that is easy to understand and read. •It does not mean ordinary graphs or pie charts; it means complex graphs that can include many variables of data while still remaining understandable and readable. •Telling a complex story in a graph is very difficult but also extremely crucial. •Luckily, more and more big data startups are appearing that focus on this aspect, and in the end, visualizations will make the difference.
  • 49. VALUE •Data in itself is not valuable at all. •The value is in the analyses done on that data and how the data is turned into information and eventually turning it into knowledge. •The value is in how organisations will use that data and turn their organisation into an information- centric company that relies on insights derived from data analyses for their decision-making.
  • 50. IS THE “BIG” PART OR THE “DATA” PART MORE IMPORTANT? •What is the most important part of the term big data? Is it (1) the “big” part, (2) the “data” part, (3) both, or (4) neither? •As with any source of data, big or small, the power of big data comes from the answers to these questions: ++ What is done with that data? ++ How is it analyzed? ++ What actions are taken based on the findings? ++ How is the data used to make changes to a business? •People are led to believe that just because big data has high volume, velocity, and variety, it is somehow better or more important than other data.
  • 51. • Many big data sources have a far higher percentage of useless or low-value content than virtually any other data source. •By the time, big data is trimmed down to what you actually need, it may not even be so big any more. In Summary: •Whether it stays big or whether it ends up being small when you’re done processing it, •the size isn’t important. •It’s what you do with it.
  • 52. HOW IS BIG DATA DIFFERENT? Majority of big data sources have the following feature: 1. Big data is often automatically generated by a machine. • Instead of a person being involved in creating new data, it’s generated purely by machines in an automated way. If you think about traditional data sources, there was always a person involved. • For example: Consider retail or bank transactions, telephone call detail records, product shipments, or invoice payments. All of those involve a person doing something in order for a data record to be generated. • A lot of sources of big data are generated without any human interaction at all. Example: Sensors
  • 53. 2.Big data is typically an entirely new source of data. It is not simply an extended collection of existing data. • For Example, with the use of the Internet, customers can now execute a transaction with a bank or retailer online. But the transactions they execute are not fundamentally different transactions from what they would have done traditionally. • They’ve simply executed the transactions through a different channel. • An organization may capture web transactions, but they are really just more of the same old transactions that have been captured for years. • However, capturing browsing behaviors as customers execute a transaction creates fundamentally new data.
  • 54. 3.Many big data sources are not designed to be friendly. In fact, some of the sources aren’t designed at all! • Example: Text streams from a social media site. (There is no way to ask users to follow certain standards of grammar, or sentence ordering, or vocabulary) • It will be difficult to work with such data at best and very, very ugly at worst. • Most traditional data sources were designed up-front to be friendly. • Systems used to capture transactions provide data in a clean, preformatted template that makes the data easy to load and use
  • 55. 4. A substantial amount of big data streams may not have much value. In fact, much of the data may even be close to worthless. • Example: Within a web log, there is information that is very powerful. There is also a lot of information that doesn’t have much value at all. (pic) • It is necessary to weed through and pull out the valuable and relevant pieces. • Traditional data sources were defined up-front to be 100 percent relevant.
  • 58. HOW IS BIG DATA MORE OF THE SAME? •Same thing that existed in the past; is out in a new form. • In many ways, big data doesn’t pose any problems that your organization hasn’t faced before. •Taming new, large data sources that push the current limits of scalability is an ongoing theme in the world of analytics Fig: Data Mining Process
  • 59. RISKS OF BIG DATA 1. An organization will be so overwhelmed with big data that it won’t make any progress. [The key here is to get the right people. You need the right people attacking big data and attempting to solve the right kinds of problems] 2. cost escalates too fast as too much big data is captured before an organization knows what to do with it. [It is not necessary to go for it all at once and capture 100 percent of every new data source. What is necessary is to start capturing samples of the new data sources to learn about them. Using those initial samples, experimental analysis can be performed to determine what is truly important within each source and how each can be used]
  • 60. 3. Perhaps the biggest risk with many sources of big data is privacy. • If everyone in the world was good and honest, then we wouldn’t have to worry much about privacy. • There have also been high-profile cases of major organizations getting into trouble for having ambiguous or poorly defined privacy policies. Example: In April 2013, LivingSocial, a daily-deals site partly owned by Amazon, announced that the names, email addresses, birth dates and encrypted passwords of more than 50 million customers worldwide had been stolen by hackers. •This has led to data being used in ways that consumers didn’t understand or support, causing a backlash. •Organizations should explain how they will keep data secure and how they will use it if they expect customers to accept having their data captured and analyzed.
  • 61. WHY YOU NEED TO TAME BIG DATA •Many organizations have done little with big data. •A few industries, such as e-commerce, have already started; there, analyzing big data is already a standard. •Today, organizations still have a chance to get ahead of the pack. •Within a few years, any organization that isn’t analyzing big data will be late to the game and will be stuck playing catch-up for years to come. •The time to start taming big data is now.
  • 62. THE STRUCTURE OF BIG DATA •Big data is often described as unstructured. •Most traditional data sources are in the fully structured realm: data is in a pre-defined format, and the format does not vary from day to day or update to update. •Unstructured data •Semi-structured data • Example: Web logs
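The advice under the cost risk above, to start by capturing samples of a new data source rather than all of it, can be illustrated with reservoir sampling. This is a swapped-in technique chosen for the illustration, not one the slides name: it keeps a fixed-size uniform sample from a stream whose length is not known in advance. The function name and the `seed` parameter are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of at most k items from a stream (Algorithm R)."""
    rng = random.Random(seed)        # seeded only to make the sketch repeatable
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)    # inclusive bounds: a slot among the i + 1 items seen
            if j < k:
                sample[j] = item     # replace an existing entry with probability k / (i + 1)
    return sample


# A stand-in for a large new data source: one million event identifiers.
pilot_sample = reservoir_sample(range(1_000_000), k=5)
short_sample = reservoir_sample(["a", "b"], k=5)
```

Because the reservoir never grows beyond `k` items, the experiment's storage cost stays fixed no matter how much data flows past.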
  • 63. What is the difference between Data Mining and Web Mining? Machine Learning : Classification, Clustering etc. Semantic approach: Statistics, NLP etc.
  • 64. FILTERING BIG DATA EFFECTIVELY •The biggest challenge with big data may not be the analytics you do with it, but the extract, transform, and load (ETL) processes you have to build to get it ready for analysis. (PART OF 90 %) •Analytic processes may require filters on the front end to remove portions of a big data stream when it first arrives. Also there will be other filters along the way as the data is processed. •For example, when working with a web log, a rule might be to filter out up front any information on browser versions or operating systems. Such data is rarely needed except for operational reasons. •Later in the process, the data may be filtered to specific pages or user actions that need to be examined for the business issues to be addressed.
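The two filtering stages described above for a web log can be sketched as follows. The field names (`browser`, `os`, `page`), the sample records and the pages of interest are assumptions made for the illustration, not a real log format.

```python
from typing import Dict, Iterable, Iterator, List, Set

DROP_FIELDS = {"browser", "os"}               # operational details, rarely needed for analysis
PAGES_OF_INTEREST = {"/product", "/checkout"}  # hypothetical pages tied to the business question


def drop_operational_fields(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Front-end filter: strip fields as soon as the record arrives."""
    for record in records:
        yield {k: v for k, v in record.items() if k not in DROP_FIELDS}


def keep_pages(records: Iterable[Dict[str, str]], pages: Set[str]) -> Iterator[Dict[str, str]]:
    """Later filter: keep only the pages relevant to the analysis."""
    return (r for r in records if r["page"] in pages)


raw_log: List[Dict[str, str]] = [
    {"user": "u1", "page": "/home", "browser": "Firefox", "os": "Linux"},
    {"user": "u1", "page": "/product", "browser": "Firefox", "os": "Linux"},
    {"user": "u2", "page": "/checkout", "browser": "Chrome", "os": "Windows"},
]

filtered = list(keep_pages(drop_operational_fields(raw_log), PAGES_OF_INTEREST))
```

Both stages are generators, so records flow through one at a time and the full log never has to sit in memory.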
  • 65. <HTML> <TITLE> <BODY> Sachin is a former Indian cricketer and captain, widely regarded as one of the greatest batsmen of all time. Sachin took up cricket at the age of eleven, made his Test debut on 15 November 1989 against Pakistan in Karachi at the age of sixteen, and went on to represent Mumbai domestically and India internationally for close to twenty-four years. Sachin is the only player to have scored one hundred international centuries, the first batsman to score a double century in a One Day International, the holder of the record for the number of runs in both ODI and Test cricket, and the only player to complete more than 30,000 runs in international cricket </BODY> </TITLE> </HTML> Example-1
  • 66. Example 2 :Opinion Analysis Step 1: Sample text excellent phone, excellent service . i am a business user who heavily depend on mobile service ….,,, there is much which has been said in other reviews about the features of this phone. Step 2: Remove delimiters from input file excellent phone excellent service i am a business user who heavily depend on mobile service there is much which has been said in other reviews about
  • 67. Step 3: Subject the text to parts of speech tagger Example: JJ excellent NN phone JJ excellent NN service FW i VBP am DT a NN business NN user WP who RB heavily VBP depend IN on JJ mobile NN service EX there VBZ is JJ much WDT which VBZ has VBN been VBN said IN in JJ other NNS reviews IN about DT the NNS features IN of DT this NN phone Step 4: Extract feature JJ excellent NN phone, JJ excellent NN service
  • 68. Step 5: Approaches •Supervised approach •Unsupervised approach Step 6: Results: • Positive opinion • Negative opinion
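The opinion-analysis steps above (delimiter removal, feature extraction from tagged text, and a positive/negative result) can be sketched as below. Several things here are assumptions: the part-of-speech tags are copied from the slide's example instead of being produced by a tagger, the tiny opinion lexicons are hypothetical, and the lexicon-based scoring is a simple stand-in for the supervised and unsupervised approaches the slide mentions. The slide keeps only the two "excellent" pairs as features, so the sketch assumes that the adjective must be an opinion word.

```python
import re
from typing import List, Tuple

POSITIVE = {"excellent", "good", "great"}  # hypothetical opinion lexicon
NEGATIVE = {"poor", "bad", "terrible"}     # hypothetical opinion lexicon


def remove_delimiters(text: str) -> str:
    """Replace punctuation with spaces and collapse repeated whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text).split())


def extract_features(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (adjective, noun) pairs whose adjective is an opinion word."""
    pairs = []
    for (tag1, word1), (tag2, word2) in zip(tagged, tagged[1:]):
        if tag1 == "JJ" and tag2.startswith("NN") and word1 in POSITIVE | NEGATIVE:
            pairs.append((word1, word2))
    return pairs


def classify(pairs: List[Tuple[str, str]]) -> str:
    """Score +1 per positive adjective and -1 per negative one."""
    score = sum(1 if adj in POSITIVE else -1 for adj, _ in pairs)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


raw = "excellent phone, excellent service . i am a business user who heavily depend on mobile service ….,,,"
cleaned = remove_delimiters(raw)

# (tag, word) pairs as given on the slide for the start of the review.
tagged = [("JJ", "excellent"), ("NN", "phone"), ("JJ", "excellent"), ("NN", "service"),
          ("FW", "i"), ("VBP", "am"), ("DT", "a"), ("NN", "business"), ("NN", "user"),
          ("WP", "who"), ("RB", "heavily"), ("VBP", "depend"), ("IN", "on"),
          ("JJ", "mobile"), ("NN", "service")]

features = extract_features(tagged)
opinion = classify(features)
```

Note that "mobile service" is also an adjective-noun pair but is dropped here because "mobile" is not in either lexicon.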
  • 69. •The complexity of the rules and the magnitude of the data being removed or kept at each stage will vary by data source and by business problem. •The load processes and filters that are put on top of big data are absolutely critical. Without getting those correct, it will be very difficult to succeed. •Traditional structured data doesn’t require as much effort in these areas since it is specified, understood, and standardized in advance. •With big data, it is necessary to specify, understand, and standardize it as part of the analysis process in many cases. Example: Application of Filtering to websites to derive knowledge
  • 70. MIXING BIG DATA WITH TRADITIONAL DATA •Perhaps the most exciting thing about big data isn’t what it will do for a business by itself. It’s what it will do for a business when combined with an organization’s other data. Example: 1. Browsing history, for example, is very powerful. [Knowing how valuable a customer is and what they have bought in the past across all channels makes web data even more powerful by putting it in a larger context]. 2. Smart-grid data is very powerful for a utility company. [Knowing the historical billing patterns of customers, their dwelling type, and other factors makes data from a smart meter even more powerful by putting it in a larger context.]
  • 72. 3. The text from customer service online chats and e-mails is powerful. [Knowing the detailed product specifications of the products being discussed, the sales data related to those products, and historical product defect information makes that text data even more powerful by putting it in a larger context.] - Amazon Recommendation system 4.Enterprise Data Warehouses (EDWs) have become such a widespread corporate tool not just to centralize a bunch of data marts to save hardware and software costs. •An EDW adds value by allowing different data sources to intermix and enhance one another. •With an EDW, it is possible to analyze customer and employee data together since they are in one location. They are no longer completely separate.
  • 73. •This is why it is critically important that organizations don’t develop a big data strategy that is distinct from their traditional data strategy. To succeed, it is necessary to plan not just how to capture and analyze big data by itself, but also how to use it in combination with other corporate data.
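To make the idea of combining big data with traditional data concrete, the sketch below joins a browsing-event stream to a customer table so that each event carries the customer's context. The customer attributes, identifiers and pages are all hypothetical, and plain dictionaries stand in for a data warehouse.

```python
from collections import Counter
from typing import Dict, List

# Traditional data: one row per known customer (hypothetical attributes).
customers: Dict[str, Dict[str, object]] = {
    "c1": {"segment": "high_value", "lifetime_spend": 4200.0},
    "c2": {"segment": "occasional", "lifetime_spend": 150.0},
}

# Big data: browsing events captured from the web site.
browsing_events: List[Dict[str, str]] = [
    {"customer_id": "c1", "page": "/laptops"},
    {"customer_id": "c2", "page": "/laptops"},
    {"customer_id": "c1", "page": "/warranty"},
    {"customer_id": "c9", "page": "/laptops"},  # visitor with no customer record
]

UNKNOWN = {"segment": "unknown", "lifetime_spend": 0.0}


def enrich(events: List[Dict[str, str]],
           customer_table: Dict[str, Dict[str, object]]) -> List[Dict[str, object]]:
    """Attach customer attributes to each event (a left join on customer_id)."""
    return [{**event, **customer_table.get(event["customer_id"], UNKNOWN)} for event in events]


enriched_events = enrich(browsing_events, customers)
views_by_segment = Counter((e["page"], e["segment"]) for e in enriched_events)
```

On its own, the event stream says only that `/laptops` was viewed three times; after the join it also says which of those views came from a high-value customer.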
  • 74. a. Data Mart b. Data Warehouse
  • 76. THE NEED FOR STANDARDS •Will big data continue to be a wild west of crazy formats, unconstrained streams, and lack of definition? •Probably not. Over time, standards will be developed. •Many semi-structured data sources will become more structured over time, and individual organizations will fine- tune their big data feeds to be friendlier for analysis. •Example: • SQL or similar language : usage with Big Data • Formats, Interfaces to support interoperability across distributed applications • Web semantics: XML, OWL etc., with Big Data • Cloud computing – Big data
  • 77. TODAY’S BIG DATA IS NOT TOMORROW’S BIG DATA •There is no specific, universal definition in terms of what qualifies as big data. •Rather, big data is defined in relative terms tied to available technology and resources. •As a result, what counts as big data to one company or industry may not count as big data to another. •A large e-commerce company is going to have a much “bigger” definition of big data than a small manufacturer will. •What qualifies as big data will necessarily change over time as the tools and techniques to handle it evolve alongside raw storage size and processing power.
  • 78. •Household demographic (population) files with hundreds of fields and millions of customers were huge and tough to manage a decade or two ago. •Now such data fits on a thumb drive and can be analyzed by a low-end laptop. •Transactional data in the retail, telecommunications, and banking industries were very big and hard to handle even a decade ago. •What we are intimidated by today won’t be so scary a few years down the road. Example 1: • Clickstream data from the web may be a standard, easily handled data source in 10 years
  • 79. Clickstream: the trail left by users as they click their way through a website. Click-path optimization: using clickstream analysis, businesses can collect and analyze data to see which pages web visitors are visiting and in what order. Market basket analysis: gives marketers a better understanding of aggregate customer purchasing behavior. Next best product analysis: helps marketers see what products customers tend to buy together. Website resource allocation: clickstream data analysis tells marketers which paths on the site are hot and which ones are not. Customization: personalize the user experience and convert more web visitors from browsers to buyers.
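The click-path idea above (which pages are visited, and in what order) can be sketched in a few lines. This is only an illustration: the `(session_id, timestamp, page)` record layout and the page names are assumptions, not something the slides specify.

```python
from collections import Counter


def page_transitions(clicks):
    """Count page-to-page transitions within each session.

    clicks: iterable of (session_id, timestamp, page) tuples.
    This record layout is a hypothetical one chosen for the sketch.
    """
    paths = {}
    # Sort by session, then by time, so each path is in click order.
    for session_id, _timestamp, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        paths.setdefault(session_id, []).append(page)

    transitions = Counter()
    for pages in paths.values():
        # Pair each page with the page clicked right after it.
        transitions.update(zip(pages, pages[1:]))
    return transitions


# Made-up sample data for illustration only.
clicks = [
    ("s1", 1, "home"), ("s1", 2, "tv"), ("s1", 3, "cart"),
    ("s2", 1, "home"), ("s2", 2, "tv"),
    ("s3", 1, "home"), ("s3", 2, "laptops"),
]
for (src, dst), count in page_transitions(clicks).most_common():
    print(f"{src} -> {dst}: {count}")
```

The most frequent transitions are the "hot" paths; transitions that rarely occur point to pages that may not deserve more site resources.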
  • 80. 2. Actively processing every e-mail, customer service chat, and social media comment may become a standard practice for most organizations. As we tame the current generation of big data streams, other even bigger data sources are going to come along and take their place. 1. Imagine web browsing data that expands to include millisecond-level eyeball and mouse movement so that every tiny detail of a user’s navigation is captured, instead of just what was clicked on. This is another order of big.
  • 81. 2. Imagine video game telemetry data being upgraded to go beyond every button pressed or movement made 3. Imagine RFID (radio frequency identification) information being available for every single individual item in every single store, distribution facility, and manufacturing plant globally. 4. Imagine capturing and translating to text every conversation anyone has with a customer service or sales line. Add to that all the associated e-mails, online chats, and comments from places such as social media sites or
  • 82. Web Data: The Original Big Data •Wouldn’t it be great to: 1. understand customer intent instead of just customer action? 2. understand each customer’s thought processes to determine whether they make a purchase or not? •It was virtually impossible to get insights into such topics in the past. •Today, such topics can be addressed with the use of detailed web data. •Organizations across a number of industries have integrated detailed, customer-level behavioral data sourced from a web site into their enterprise analytics environments.
  • 83. •However, for most organizations web integration means the inclusion of online transactions. •Traditional web analytics vendors provide operational reporting (everyday tasks) on click-through rates, traffic sources, and metrics based only on web data. •However, detailed web behavior data was not historically leveraged outside of web reporting. Is it possible to understand users better? How?
  • 84. WEB DATA OVERVIEW •Organizations have talked about a 360-degree view of their customers for years. •What it really meant is that the organization has as full a view of its customers as possible considering the technology and data available at that point in time. •However, the finish line is always moving. Just when you think you have finally arrived, the finish line moves farther out again.
  • 85. •A few decades ago, companies were at the top of their game if they had the names and addresses of their customers and they were able to append demographic information(location & population) to those names through the then-new third party data enhancement services. •Eventually, cutting-edge companies started to have basic recency, frequency, and monetary value (RFM) metrics attached to customers. Such metrics look at when a customer last purchased (recency), how often they have purchased (frequency), and how much they spent (monetary value). •In the past 10 to 15 years, virtually all businesses started to collect and analyze the detailed transaction histories of their customers. •This led to an explosion of analytical power and a much deeper understanding of customer behavior.
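As a rough illustration of the RFM metrics described above, the sketch below derives recency, frequency, and monetary value from a list of transactions. The `(customer_id, purchase_date, amount)` layout and the sample values are assumptions made for the example.

```python
from datetime import date


def rfm(transactions, today):
    """Compute recency (days), frequency (count), and monetary value (sum).

    transactions: iterable of (customer_id, purchase_date, amount) tuples,
    a hypothetical layout chosen for this sketch.
    """
    summary = {}
    for customer_id, purchase_date, amount in transactions:
        entry = summary.setdefault(
            customer_id, {"last": purchase_date, "frequency": 0, "monetary": 0.0}
        )
        entry["last"] = max(entry["last"], purchase_date)  # most recent purchase
        entry["frequency"] += 1                             # number of purchases
        entry["monetary"] += amount                         # total spend
    return {
        cid: {
            "recency_days": (today - e["last"]).days,
            "frequency": e["frequency"],
            "monetary": e["monetary"],
        }
        for cid, e in summary.items()
    }


# Made-up sample data for illustration only.
transactions = [
    ("c1", date(2024, 1, 10), 50.0),
    ("c1", date(2024, 3, 1), 30.0),
    ("c2", date(2023, 12, 25), 200.0),
]
print(rfm(transactions, today=date(2024, 3, 31)))
```

A lower `recency_days` together with higher `frequency` and `monetary` values marks a more active customer; how the three are weighted is a business choice that this sketch leaves open.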
  • 86. •Many organizations are still frozen at the transactional history stage. •Today, while this transactional view is still important, many companies incorrectly assume that it remains the closest view possible to a 360-degree view of their customers. •Today, organizations need to collect from newly evolving big data sources related to their customers from a variety of extended and newly emerging touch points such as web browsers, mobile applications, kiosks, social media sites, and more. •Just as transactional data enabled a revolution in power of computation and depth of analysis, so too do these new data sources enable taking analytics to a new level.
  • 87. What Are You Missing?(with Traditional Data) •Have you ever stopped to think about what happens if only the transactions generated by a web site are captured? Study Reveals: 95 percent of browsing sessions do not result in a basket being created. Of that 5 percent, only about half, or 2.5 percent, actually begin the check out process. And, of that 2.5 percent only two-thirds, or 1.7 percent, actually complete a purchase. •What this means is that information is missing on more than 98 percent of web sessions, if only transactions are tracked. •For every purchase transaction, there might be dozens or hundreds of specific actions taken on the site to get to that sale. That information needs to be collected and analyzed alongside the final sales data.
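The funnel percentages quoted above can be checked with a short calculation. The rates are the ones stated on the slide; the session count of 100,000 is an arbitrary assumption used only to make the numbers concrete.

```python
def funnel(sessions, basket_rate=0.05, checkout_share=0.5, purchase_share=2 / 3):
    """Apply the slide's funnel rates to a number of browsing sessions."""
    baskets = sessions * basket_rate          # 5% of sessions create a basket
    checkouts = baskets * checkout_share      # about half of those begin checkout
    purchases = checkouts * purchase_share    # two-thirds of those complete a purchase
    return {
        "baskets": baskets,
        "checkouts": checkouts,
        "purchases": purchases,
        # Share of sessions invisible when only transactions are tracked.
        "untracked_share": 1 - purchases / sessions,
    }


result = funnel(100_000)  # hypothetical session count
print(f"purchases: {result['purchases']:.0f}")
print(f"sessions with no transaction: {result['untracked_share']:.1%}")
```

Multiplying the rates gives 0.05 × 0.5 × 2/3 ≈ 1.7 percent of sessions ending in a purchase, which is why more than 98 percent of sessions leave no trace in transaction data.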
  • 88. Imagine the Possibilities (Organizations are trying to know) •Imagine knowing everything customers do as they go through the process of doing business with your organization. •Not just what they buy, but what they are thinking about buying along with what key decision criteria they use. •Such knowledge enables a new level of understanding about your customers and a new level of interaction with your customers. Example: 1. Imagine you are a retailer. Imagine walking through with customers and recording every place they go, every item they look at, every item they pick up, every item they put in the cart and back out. Imagine knowing whether they read nutritional information, if they look at laundry instructions, if they read the promotional brochure on the shelf, or if they look at other information made available to them in the store.
  • 89. 2. Imagine you are a telecom company. Imagine being able to identify every phone model, rate plan, data plan, and accessory that customers considered before making a final decision. What is the difference between Traditional Analytics and New scalable Analytics ?
  • 90. What Data Should Be Collected and from where? •Any action that a customer takes while interacting with an organization should be captured if it is possible to capture it from web sites, kiosks, social media, mobile apps etc •Wide range of events can be captured like: Purchases Requesting, Product views, Forwarding a link , Shopping basket additions, Posting a comment, Watching a video, Registering for a webinar, Accessing a download, Executing a search, Reading / writing a review etc.
  • 91. What about privacy? (How is Flipkart handling this?) •Privacy is a big issue today and may become an even bigger issue as time passes. •Organizations need to respect not just formal legal restrictions, but also what their customers will view as appropriate. •Faceless customer (identity of the customer masked in data stores): an arbitrary identification number that is not personally identifiable can be matched to each unique customer based on a logon, cookie, or similar piece of information. This creates what might be called a “faceless” customer record. •It is the patterns across faceless customers that matter, not the behavior of any specific customer.
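One possible way to produce a "faceless" identifier is sketched below. The slides do not say how the arbitrary number is generated; using a keyed hash (HMAC-SHA256) of the logon or cookie value is an assumption made here, and the key value and function name are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored outside the analytics environment.
SECRET_KEY = b"replace-with-a-secret-kept-outside-the-analytics-store"


def faceless_id(login_or_cookie: str) -> str:
    """Map a logon or cookie value to a stable identifier that does not reveal it."""
    digest = hmac.new(SECRET_KEY, login_or_cookie.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability in this sketch


# The same customer always maps to the same faceless ID, so behavior can be
# linked across sessions without storing the logon itself.
print(faceless_id("kumar@example.com"))
print(faceless_id("kumar@example.com") == faceless_id("kumar@example.com"))
```

An alternative design under the same assumption is a randomly assigned number kept in a separate, access-controlled lookup table; either way, analysts work only with the faceless ID.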
  • 92. •With today’s database technologies, it is possible to enable analytic professionals to do analysis without having any ability to identify the individuals involved. •This can remove many privacy concerns. Many organizations are in fact identifying and targeting specific customers as a result of such analytics. Organizations have presumably put in place privacy policies, including opt-out options, and are careful to follow them.
  • 93. What Web Data Reveals 1. Shopping Behaviors: A good starting point to understand shopping behavior is identifying: •How customers come to a site, begin shopping and their page navigation. •What search engine do they use? •What specific search terms are entered? •Do they use a bookmark they created previously? •Analytic professionals can take this information and look for patterns in terms of which search terms, search engines, and referring sites are associated with higher sales rates.
  • 94. •One valuable capability of web data is identifying product sets that are of interest to a customer before they make a purchase. •For example, consider a customer who views computers, backup disks, printers, and monitors. It is likely the customer is considering a complete PC system upgrade. •Offer a package right away that contains the specific mix of items the customer has browsed. •Do not wait until after customers purchase the computer and then offer generic bundles of accessories. •A customized bundle offer is more powerful than a generic one. [study says] •We find this feature lacking in many sites (project work?)
  • 95. 2. Customer Purchase Paths and Preferences •It is possible to explore and identify the ways customers arrive at their buying decisions by watching how they navigate a site. •It is also possible to gain insight into their preferences. Consider, for example, an airline. •An airline can tell a number of things about preferences based on the ticket that is booked. •For example: 1. How far in advance was the ticket booked? 2. What fare class was booked? 3. Did the trip span a weekend or not? •This is all useful, but an airline can get even more from web data.
  • 96. •An airline can identify customers who value convenience. (Such customers typically start searches for specific times and direct flights only.) •Airlines can also identify customers who value price first and foremost and are willing to consider many flight options to get the best price. •Based on search patterns, airlines can also tell whether customers value deals or specific destinations. •Example: Does the customer research all of the special deals available and then choose one for the trip? Or does the customer look at a certain destination and pay what is required to get there? •For example, a college student may be open to any number of vacation destinations and will take the one with the best deal. On the other hand, a customer who visits family on a regular basis will only be interested in flying to where the family is.
  • 97. 3. Research Behaviors •Understanding how customers utilize the research content on a site can lead to tremendous insights into how to interact with each individual customer, as well as how different aspects of the site do or do not add value in driving sales. For example, consider an online store selling clothes: sarees, Zovi shirts. •Another way to use web data to understand customers’ research patterns is to identify which of the pieces of information offered on a site are valued by the customer base overall and by the best customers specifically. •How often do customers look at previews (glance), additional photos (thumbnails / regular), technical specs, or reviews before making a purchase? •Combining session data with other data helps reveal when customers buy: on the same day or the next day.
  • 98. Feedback Behaviors •Where is the feedback expressed? •Is it relevant? Biased? •Does it matter?
  • 99. Web Data in Action •What an organization knows about its customers is never the complete picture. •It is always necessary to make assumptions based on the information available. •If there is only a partial view, the full view can often be extrapolated accurately enough to get the job done. •It is also possible that the missing information paints a totally different picture than expected. •In the cases where the missing information differs from the assumptions, it is possible to make suboptimal, if not totally wrong, decisions.
  • 100. •A very common marketing example is predicting the next best offer for a customer: of all the available options, which single offer should be suggested next to maximize the chances of success? •Can web behaviour data help? Case 1: BANK •Mr. Kumar has an account with PNB………………………………….etc. with relevant information. •What is the best offer you can send via e-mail? •Would it ever occur to you to send a promotional offer on a mortgage or housing loan? With web data, the bank now knows what to discuss with Mr. Kumar.
  • 101. Case 2: Dominos •Traditional data they get is: • Historical purchases • Marketing campaign and response history •With web data: • The effort leads to major changes in the promotional efforts versus the traditional approach, providing the following results: • A decrease in total mailings • A reduction in total catalog promotions pages • A materially significant increase in total revenues • Question: With An Example, Justify How Web Data Contributes To Better Promotional Benefits As Against Traditional Data?
  • 102. Attrition Modelling •In telecommunication sector (example) , companies have invested massive amounts of time and effort to create, enhance, and perfect “churn” models. (Trying to identify leaving customers) •Churn models flag those customers most at risk of cancelling their accounts so that action can be taken proactively to prevent them from doing so. •Management of customer churn has been, and remains, critical to understanding patterns of customer usage and profitability. Example : •Mrs. Smith, as a customer of telecom Provider “AIR”, goes to Google and types “How do I cancel my Provider AIR contract?” (Web Data).
  • 103. •Company analysts might have seen her usage dropping, but perhaps not. •It would take weeks to months to identify such a change in usage pattern anyway. •By capturing Mrs. Smith’s actions on the web, Provider “AIR” is able to move more quickly to avert losing Mrs. Smith.
  • 104. Response Modelling •Many models are created to help predict the choice a customer will make when presented with a (Data set) request for action. •Models typically try to predict which customers will make a purchase, or accept an offer, or click on an e-mail link. •For such models, a technique called logistic regression is often used. These models are usually referred to as response models or propensity models. •The main difference between this and an attrition model: a churn model predicts negative behaviour, while a purchase or response model predicts positive behaviour.
  • 105. WORKING •When using a response or propensity model, all customers are scored and ranked by likelihood of taking action. •Then, appropriate segments (groups) are created based on those ranks in order to reach out to the customers. •In theory, every customer has a unique score. In practice, since only a small number of variables define most models, many customers end up with identical or nearly identical scores. •Example: Customers who are not very frequent or high-spending. •In many cases, many customers can end up in big groups with very similar or very low scores.
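The scoring-and-ranking step described above can be sketched with a hand-written logistic function. The coefficients and the two input variables below are invented for illustration; in a real response model they would be estimated by fitting a logistic regression to historical response data.

```python
import math

# Hypothetical coefficients; a fitted model would supply real ones.
COEFFS = {"intercept": -3.0, "purchases_last_year": 0.4, "is_high_spender": 1.2}


def response_score(customer):
    """Return the modelled probability that the customer responds."""
    z = (
        COEFFS["intercept"]
        + COEFFS["purchases_last_year"] * customer["purchases_last_year"]
        + COEFFS["is_high_spender"] * customer["is_high_spender"]
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic function maps z to (0, 1)


def rank_customers(customers):
    """Sort customers from most to least likely to respond."""
    return sorted(customers, key=response_score, reverse=True)


# Made-up sample data for illustration only.
customers = [
    {"id": "A", "purchases_last_year": 1, "is_high_spender": 0},
    {"id": "B", "purchases_last_year": 1, "is_high_spender": 0},
    {"id": "C", "purchases_last_year": 6, "is_high_spender": 1},
]
for c in rank_customers(customers):
    print(c["id"], round(response_score(c), 3))
```

Customers A and B receive identical scores because the model sees only two variables, which is the clustering effect the slide describes.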
  • 106. •Web data can help greatly increase differentiation among customers. For Example, consider a scenario: (score can increase or decrease by delta x) •Customer 1 has never browsed your site •Customer 2 viewed the product category featured in the offer within the past month. •Customer 3 viewed the specific product featured in the offer within the past month. •Customer 4 browsed the specific product featured three
  • 107. • When asked about the value of incorporating web data, a director of marketing from a multichannel American specialty retailer replied, “It’s like printing money!”
  • 108. Customer Segmentation (Grouping): Study •What is segmentation? •How was segmentation done traditionally? •Web data also enables segmentation of customers based on their typical browsing patterns. (Seminar/project topic on assessing browsing patterns of users) •Such segmentation will provide a completely different view of customers than traditional demographic or sales-based segmentation schemas. •Assignment: Create a Dreamers segment and identify the items selected by the Dreamers.
  • 109. Example: •Consider a segment called the Dreamers that has been derived purely from browsing behavior. Who are they? •Dreamers repeatedly put an item in their basket, but then abandon it. Dreamers often add and abandon the same item many times. This may be especially true for a high-value item like a TV or computer. It should be possible to identify the segment of people that does this repeatedly. •So, what is the outcome of this segment, “Dreamers”?
  • 110. 1. What is it that the customers are abandoning? •Perhaps a customer is looking at a high-end TV, phone, or camera that is quite expensive. •Is price the issue? From past data, we learn that the customer often aims too high and later buys a less-expensive product than the one that was abandoned repeatedly. Action plan: •Send an e-mail pointing to less-expensive options or to other varieties of high-end TV. 2. Get to know the abandoned-basket statistics, which can help organizations identify prospective customers who abandon baskets. [Helps analysts report survey results such as 97% of customers abandoned their baskets. It also gives insights into procedural aspects and the unavailability of services like COD, credit card, etc.]
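The Dreamers segment described above could be derived along the following lines. The event layout `(customer_id, action, item)`, the `"abandon"` action label, and the threshold of three abandonments are all assumptions made for this sketch; the slides do not define them.

```python
from collections import Counter


def find_dreamers(events, min_abandons=3):
    """Return {customer_id: [items]} for items abandoned at least min_abandons times.

    events: iterable of (customer_id, action, item) tuples, where action is
    "add", "abandon", or "purchase" (a hypothetical event vocabulary).
    """
    abandons = Counter(
        (customer_id, item)
        for customer_id, action, item in events
        if action == "abandon"
    )
    dreamers = {}
    for (customer_id, item), count in abandons.items():
        if count >= min_abandons:
            dreamers.setdefault(customer_id, []).append(item)
    return dreamers


# Made-up sample data for illustration only.
events = [
    ("u1", "add", "tv-65in"), ("u1", "abandon", "tv-65in"),
    ("u1", "add", "tv-65in"), ("u1", "abandon", "tv-65in"),
    ("u1", "add", "tv-65in"), ("u1", "abandon", "tv-65in"),
    ("u2", "add", "camera"), ("u2", "abandon", "camera"),
    ("u3", "add", "phone"), ("u3", "purchase", "phone"),
]
print(find_dreamers(events))
```

The returned items are candidates for the follow-up action on the slide, such as an e-mail pointing to less-expensive alternatives.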
  • 111. Assessing Advertising Results •Assessing paid search and online advertising results is another high-impact analysis enabled with customer level web behavior data. •Traditional web analytics provide high-level summaries such as total clicks, number of searches, cost per click, keywords leading to the most clicks, page position statistics etc. • Most focus on single web channel. •This means that all statistics are based only on what happened during the single session generated from the search or ad click
  • 112. •Once a customer leaves the web site and web session ends, the scope of the analysis is complete. •There is no attempt to account for past or future visits in the statistics. •By incorporating customers’ browsing data and extending the view to other channels as well, it is possible to assess search and advertising results at a much deeper level. For Example: • How many sales did the first click generate in days/weeks • Are certain web sites drawing more customers from referred sites. • Cross channel analysis study, How sales are doing, after information about the channel was provided on web via ad or search.
  • 113. CROSS SECTION OF BIG DATA SOURCES AND VALUE THEY HOLD
  • 114. 1. AUTO INSURANCE: THE VALUE OF TELEMATICS DATA CASE STUDY •Telematics involves putting a sensor, or black box, into a car to capture information about what’s happening with the car. This black box can measure any number of things depending on how it is configured. •It can monitor speed, mileage driven, or if there has been any heavy braking. •Telematics data helps insurance companies better understand customer risk levels and set insurance rates. •If privacy concerns are ignored and it is taken to the extreme, a telematics device could keep track of everywhere a car went, when it was there, how fast it was going, and what features of the car were in use.
  • 115. 2. MULTIPLE INDUSTRIES: THE VALUE OF TEXT DATA •Text is one of the biggest and most common sources of big data. Just imagine how much text is out there. •There are e-mails, text messages, tweets, social media postings, instant messages, real-time chats, and audio recordings that have been translated into text. •Text data is one of the least structured and largest sources of big data in existence today. •Luckily, a lot of work has been done already to tame text data and utilize it to make better business decisions. •Text mining approaches have their own advantages and disadvantages.
  • 116. •Here, we will focus on, how to use the results, not produce them. •For example, once the sentiment of a customer’s e-mail is identified, it is possible to generate a variable that tags the customer’s sentiment as negative or positive. That tag is now a piece of structured data that can be fed into an analytics process. •Creating structured data out of unstructured text is often called information extraction. •Another example, assume that we’ve identified which specific products a customer commented about in his or her communications with our company. •We can then generate a set of variables that identify the products discussed by the customer. Those variables are again metrics that are structured and can be used for analysis purposes.
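In line with the focus on using text-mining results rather than producing them, the sketch below assumes that a sentiment score and a list of mentioned products have already been extracted by some upstream tool. The function name, the score convention (negative values mean negative sentiment), and the product catalog are hypothetical.

```python
def to_structured_row(customer_id, sentiment_score, mentioned_products, catalog):
    """Turn text-mining output into structured variables for analysis.

    sentiment_score: number from an upstream tool; here values below zero are
    treated as negative sentiment (an assumed convention).
    """
    row = {
        "customer_id": customer_id,
        "sentiment": "negative" if sentiment_score < 0 else "positive",
    }
    # One 0/1 indicator variable per product in the catalog.
    for product in catalog:
        row[f"mentions_{product}"] = int(product in mentioned_products)
    return row


# Made-up sample values for illustration only.
catalog = ["router", "modem", "set_top_box"]
print(to_structured_row("c42", -0.7, {"router"}, catalog))
```

Each row can then be joined to sales or defect data like any other structured record.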
  • 117. MULTIPLE INDUSTRIES: THE VALUE OF TIME AND LOCATION DATA •With the advent of global positioning systems (GPS), personal GPS devices, and cellular phones, time and location information is a growing source of data. •A wide variety of services and applications, from Google Places to Facebook Places, are centered on registering where a person is at a given point in time. •Cell phone applications can record your location and movement on your behalf. •Cell phones can even provide a fairly accurate location using cell tower signals, if a phone is not formally GPS-enabled.
  • 118. •Example, there are applications that allow you to track the exact routes you travel when you exercise, how long the routes are, and how long it takes you to complete the routes. •The fact is, if you carry a cell phone, you can keep a record of everywhere you’ve been. You can also open up that data to others if you choose.

Editor's Notes

  • #70: Smart grid: The “smart grid” will offer a case study in data management. Utilities will get a vast amount of real-time information from homes, businesses, power plants, and transmission infrastructure such as this substation.
  • #78: trail left by users as they click their way through a website.
  • #104: Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. 
  • #114: Telecommunications and Informatics = Telematics