0% found this document useful (0 votes)
61 views

Big Data Defined: Andrew J. Brust

This document defines big data and describes how it is analyzed. It explains that big data involves hundreds of terabytes to petabytes of information that is too large for traditional databases. Hadoop is used to perform distributed and parallel processing on big data using MapReduce. MapReduce splits and processes the data in the map step, aggregates results in the reduce step, and is commonly employed by Hadoop but also used by other systems. The document provides an example of how MapReduce could be used to count data by floor and platform in a building. It also discusses data scientists and their skills in statistics, subject matter expertise, and asking the right questions to analyze big data.

Uploaded by

Favio90
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
61 views

Big Data Defined: Andrew J. Brust

This document defines big data and describes how it is analyzed. It explains that big data involves hundreds of terabytes to petabytes of information that is too large for traditional databases. Hadoop is used to perform distributed and parallel processing on big data using MapReduce. MapReduce splits and processes the data in the map step, aggregates results in the reduce step, and is commonly employed by Hadoop but also used by other systems. The document provides an example of how MapReduce could be used to count data by floor and platform in a building. It also discusses data scientists and their skills in statistics, subject matter expertise, and asking the right questions to analyze big data.

Uploaded by

Favio90
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

Big Data Defined

Andrew J. Brust
http://www.bluebadgeinsights.com
[email protected]
Big Data Defined

100s of TB – x PB

Uses Hadoop

Three Vs

Too big for OLTP

Uses distributed/parallel processing


MapReduce

 Map step: split the data and pre-process it


 Reduce step: aggregate the results
 Most typical of Hadoop but employed by others, to various extents
A MapReduce Example

• Count by suite, on each floor

• Send per-suite, per platform totals to lobby

• Sort totals by platform

• Send two platform packets to 10th, 20th, 30th floor

• Tally up each platform

• Collect the tallies

• Merge tallies into one spreadsheet


Data Scientists

Near Abuse of
But with:
synonyms: term:
• Statisticians • Subject • Hadoop
• “Quants” matter experts
expertise • “R”
• Good at developers
knowing (although
the right that does
questions to overlap)
ask

You might also like