BEYOND THE
BUZZWORDS
BIG DATA, MACHINE LEARNING –
WHAT DOES IT ALL MEAN?
Trones14@gmail.com
Database Concepts In Business Intelligence
August 4, 2016
Table of Contents
Introduction
Categorizing Data: How to Think About Your Organization’s Data
Defining Big Data
Primer on Machine Learning
NoSQL Database Types
Scaling
Conclusion
References
Introduction
“Data scientists spend anywhere from 50-80% of their time cleaning up data sets in order
to find usable insights” (Lohr, 2014, NYT). I can personally attest to the accuracy of this
statement. I recently built a “Tableau Jobs” visualization. Building the visualization took
about 30 minutes, but writing the script that made bulk API calls and correctly saved the
results to CSV took hours.
A number of developments needed to occur for machine learning to become practical.
Machine learning (ML) has commercial applications today largely because of two
developments:
1. Plenty of publicly usable data (stored in less-than-structured formats) to feed into
the ML models
2. Computing advances that allow these models to be trained in a relatively short
period of time (days or weeks)
This paper:
1. Categorizes data based on two characteristics:
o Degree of structure
o Source of data (internal or external)
2. Explains what is really meant by “big data” in the common vernacular, including the
link between big data, machine learning, and NoSQL storage
3. Introduces machine learning
4. Explains why NoSQL is the choice for developers
Categorizing Data: How to Think About Your Organization’s Data
Data can be categorized into three groups: structured, semi-structured, and unstructured.
There is one more key distinction: internal or external. Internal data should be structured.
If a company designs the data collection system, the data can have structure from the
moment it is generated. Structure at the time of generation is the best-case scenario
because “Data scientists spend anywhere from 50-80% of their time cleaning up data sets
[creating structure] in order to find usable insights” (Lohr, 2014).
Diagram (Internal to External; labor hours to insight):
Structured         Process: Visualize!
Semi-structured    Process: Transform/Clean, Load, Visualize!
Unstructured       Process: Find sources, Write scripts to extract, Transform/Clean, Load, Visualize!
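The Transform/Clean and Load steps for semi-structured data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the JSON payload, its field names (title, company, location), and the helper names are hypothetical and do not come from any real job-board API.

```python
import csv
import io
import json

# Hypothetical API response; the field names are illustrative only.
RAW = '''
[
  {"title": "Tableau Developer", "company": {"name": "Acme"}, "location": "Chicago, IL "},
  {"title": " BI Analyst", "company": {"name": "Globex"}},
  {"title": "Tableau Developer", "company": {"name": "Acme"}, "location": "Chicago, IL "}
]
'''

FIELDS = ["title", "company", "location"]


def clean(record):
    """Flatten one nested record into a flat row with trimmed strings."""
    return {
        "title": record.get("title", "").strip(),
        "company": record.get("company", {}).get("name", "").strip(),
        "location": record.get("location", "").strip(),  # missing -> ""
    }


def transform(raw_json):
    """Parse, clean, and de-duplicate records while keeping their order."""
    rows, seen = [], set()
    for record in json.loads(raw_json):
        row = clean(record)
        key = tuple(row[field] for field in FIELDS)
        if key not in seen:
            seen.add(key)
            rows.append(row)
    return rows


def to_csv(rows):
    """Write the rows as CSV text (a real file object works the same way)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = transform(RAW)
csv_text = to_csv(rows)
print(csv_text)
```

Even in this toy case, most of the code deals with nesting, missing fields, stray whitespace, and duplicates rather than with analysis, which is the point the diagram makes about labor hours.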
Defining Big Data
According to Jonathan Ward & Adam Barker of the University of St. Andrews, “all
definitions (of big data) make at least one of the following assertions:
• Size: the volume of the datasets is a critical factor.
• Complexity: the structure, behavior and permutations of the datasets is a critical
factor.
• Technologies: the tools and techniques which are used to process a sizable or
complex dataset is a critical factor.” (Barker & Ward, 2013, Undefined by Data)
Barker and Ward then propose a definition: “Big data is a term describing the storage
and analysis of large and or complex data sets using a series of techniques including, but
not limited to: NoSQL, MapReduce and machine learning.” (Barker & Ward, 2013)
Let’s dive into some reasons why size, complexity, and technologies are all defining
features of big data:
1. Size: The choice of storage, cleaning/transformation, and analysis tools
depends on size:
a. Small data is not a concern.
i. Storage & Analysis: The computational power required to
handle small datasets is easily achieved with a personal
computer. There is no need to scale a job across many
compute clusters if doing so doesn’t save time. Easy analysis
means that storage choice is not a concern.
ii. Cleaning Example: It may be quicker to hand-clean data in
Excel using simple find-and-replace operations than to
write a script.
2. Complexity: Big structured data is not the issue. With a relational
structure we can use SQL to easily find what we want. This data is
formatted according to the specifications of the database and needs few
modifications before it is ready to be analyzed. It is also usually smaller
than semi-structured or unstructured data because it is normalized, so it
contains little redundancy. As a result, it is usually computationally easy
to analyze without having to scale horizontally (adding compute
clusters).
3. Technologies: This is how we store big data (NoSQL) and how we
analyze it (machine learning).
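To make the point about structured data concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables and their contents are invented for illustration; the idea is only that once data is normalized into relational tables, a single SQL query answers the question with no cleaning step.

```python
import sqlite3

# In-memory database with two normalized tables (hypothetical example data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 75.0);
""")

# Each customer name is stored once; the join recombines it with the orders.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
conn.close()

print(totals)
```

Because the customer name lives in one table and is referenced by key, nothing is duplicated, which is the sense in which normalized data stays comparatively small.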
Our diagram has been narrowed down.
When people speak of big data, they are
usually talking about external semi-
structured or unstructured data. It is this
type of data that can be used by anyone for
machine learning models.
Primer on Machine Learning
Machine learning is perhaps the most misleading buzzword ever created. What’s
the difference between machine learning and data science or statistics? Why are
machine learning and big data gaining popularity at the same time? What is the
relationship between the two?
One common way to categorize machine learning (ML) is into supervised ML and
unsupervised ML. When I first began diving into the tools and algorithms of machine
learning, they seemed quite similar to predictive and descriptive statistics.
1. Supervised ML learns from examples whose correct output is known. The
data is typically split into two sets: train and test. The model is
built/trained on the train set, and then the accuracy of the model is tested
on the test set. We are interested in how well the model predicts the actual
values found in the test set.
2. Unsupervised ML deals with finding hidden structure in data without
giving the model any output goal. So what’s the difference between this
and descriptive statistics?
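The train/test workflow described in item 1 can be sketched as follows. This is a minimal illustration, not a recommended toolkit: the data is synthetic, the model is a plain least-squares line, and the helper names (train_test_split, fit_line, mean_absolute_error) are defined here for the example rather than taken from any library.

```python
import random


def train_test_split(pairs, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and hold out a fraction for testing."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, pairs):
    """Average distance between predictions and the actual values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in pairs) / len(pairs)


# Synthetic data: y is roughly 3x + 2 plus a little bounded noise.
rng = random.Random(42)
data = [(x, 3 * x + 2 + rng.uniform(-1, 1)) for x in range(40)]

train, test = train_test_split(data)          # the model never sees `test`
model = fit_line(train)                       # build/train on the train set
test_error = mean_absolute_error(model, test) # evaluate on held-out values

print(model, test_error)
```

The fitted line itself is ordinary regression from statistics; what the ML workflow adds is the habit of judging the model by its error on data it was not trained on.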
Aatash Shah of Edvancer.in gives us some insight:
“Robert Tibshirani, a statistician and machine learning expert at Stanford,
calls machine learning ‘glorified statistics’… …Both machine learning and
statistics share the same goal: Learning from data. Both these methods focus on
drawing knowledge or insights from the data… … Cheap computing power and
availability of large amounts of data allowed data scientists to train computers to
learn by analyzing data. But, statistical modeling existed long before computers
were invented.” (Shah, Aatash, 2016, Edvancer.in)
Going back to our original question: what is the relationship between machine learning
and big data? A number of developments needed to occur for machine learning to
become practical. Without the data explosion caused by the internet, the
development of NoSQL databases, and the computing advances achieved through
Moore’s law, GPGPUs, and horizontal scaling of compute clusters, machine learning
would be restricted to the academic realm and impractical for the majority of commercial
purposes.
This brings us to the linchpin of the entire discussion. The external data out there on
the internet is stored in the format that is best for the application developer:
NoSQL.
NoSQL Database Types
[Figure: NoSQL database types (Habib, 2015, Appdynamics.com)]
Scaling
“Achieving scalability and elasticity is a huge challenge for relational databases.
Relational databases were designed in a period when data could be kept small, neat, and
orderly.” (Allen, 2015, Marklogic.com) Relational databases are designed with the data
in mind: the relational structure normalizes the data to avoid duplication. Imposing a
relational structure at development time severely limits the software
developers’ flexibility for future versions of their application. The popularity of iterative,
agile-like software development life cycles (SDLCs) only exacerbates the disadvantages of
RDBMSs.
[Table: Relational vs. NoSQL (Document) comparison (Allen, 2014, Marklogic.com)]
Pay particular attention to the Data Model. Remember my story about pulling JSON and
transforming it? This data was likely pulled from a document database. If the site owner
decided to make a major change to the data that was included, this would be a simple
change in their document database. If they were using a relational structure, they
might have to go in and redesign the entire structure. As a job search board,
Indeed.com will have to scale its storage & compute power up and down based on
web traffic and the number of job postings. Scaling back down is virtually impossible with a
relational structure.
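The difference in data models can be sketched with a small, hypothetical example. Plain JSON documents stand in for a document database here, and SQLite stands in for a relational one; the posting fields (title, city, remote) are invented for illustration and are not Indeed.com's actual schema.

```python
import json
import sqlite3

# Document model: each posting is a self-describing JSON document, so a new
# field ("remote") can appear in later documents without touching older ones.
documents = [
    json.dumps({"title": "BI Analyst", "city": "Chicago"}),
    json.dumps({"title": "Data Engineer", "city": "Austin", "remote": True}),
]
postings = [json.loads(doc) for doc in documents]
# Readers decide how to treat documents that predate the new field.
remote_flags = [posting.get("remote", False) for posting in postings]

# Relational model: the same change is a schema migration that must run
# before any row can carry the new value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (title TEXT, city TEXT)")
conn.execute("INSERT INTO postings VALUES ('BI Analyst', 'Chicago')")
conn.execute("ALTER TABLE postings ADD COLUMN remote INTEGER DEFAULT 0")
conn.execute("INSERT INTO postings VALUES ('Data Engineer', 'Austin', 1)")
table_rows = conn.execute(
    "SELECT title, city, remote FROM postings ORDER BY title"
).fetchall()
conn.close()

print(remote_flags)
print(table_rows)
```

Adding one nullable column, as here, is a small migration. The cost the paragraph describes shows up when a change reshapes how tables relate to one another, because every table and query that depends on the old shape has to change with it.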
Conclusion
• There are three categories of data: structured, semi-structured, & unstructured.
• There are two sources of data: internal & external.
• The challenges associated with deriving insights from data apply mainly to external
data that is semi-structured or unstructured.
• The term “Big Data” refers to volume, but also encompasses the storage
technologies (NoSQL) and analysis tools (machine learning) because they are
integral to the big data ecosystem.
• Big Data is stored in less-structured NoSQL DBs for web developer agility.
Final Statement: Machine learning is becoming democratized due to the availability of
large amounts of less-than-structured data and cheap compute power. Although it would
be easier for data scientists to work with structured data, this will never happen because
developers need to use NoSQL databases to meet business requirements such as agility and
scalability.
References
Barker, A., & Ward, J. S. (2013, September 20). Undefined By Data: A Survey of Big Data Definitions
[Scholarly project]. Retrieved from http://arxiv.org/abs/1309.5821
Habib, O. (2015, September 21). A Newbie Guide to Databases. Retrieved from
https://blog.appdynamics.com/database/a-newbie-guide-to-databases/
Lohr, S. (2014). For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Retrieved July 20,
2016, from http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=1
Lopez, K., & D'Antoni, J. (2014). The Modern Data Warehouse--How Big Data Impacts Analytics
Architecture. Business Intelligence Journal, 19(3), 8-15.
Machine Learning Algorithms Image. Retrieved from
https://s3.amazonaws.com/MLMastery/MachineLearningAlgorithms.png?__s=c9sqnpazsd7pmpusegzy
Machine learning frees up data scientists' time, simplifies smart applications - TechRepublic. (2015,
December 14). Retrieved July 20, 2016, from
http://www.techrepublic.com/article/machine-learning-frees-up-data-scientists-time-and-simplifies-smart-applications/
Making Sense of NoSQL. (n.d.). Retrieved July 21, 2016,
from http://macc.foxia.com/files/macc/files/macc_mccreary.pdf
Relational Databases Are Not Designed For Scale | MarkLogic. (2015, November 09). Retrieved July
23, 2016, from http://www.marklogic.com/blog/relational-databases-scale/
Shah, A. (2016, August 1). Machine Learning vs. Statistics. Retrieved from
http://www.edvancer.in/machine-learning-vs-statistics/
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Ad

BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What Does It All Mean?

  • 1. BEYOND THE BUZZWORDS: BIG DATA, MACHINE LEARNING – WHAT DOES IT ALL MEAN? Trones14@gmail.com. Database Concepts in Business Intelligence. August 4, 2016
  • 2. Table of Contents
    Introduction: 1
    Categorizing Data: How to Think About Your Organization's Data: 2
    Defining Big Data: 3
    Primer on Machine Learning: 5
    NoSQL Database Types: 7
    Scaling: 7
    Conclusion: 9
    References: 10
  • 3. PAGE 1 Introduction
    “Data scientists spend anywhere from 50-80% of their time cleaning up data sets in order to find usable insights” (Lohr, 2014, NYT). I can personally attest to the accuracy of this statement. I recently built a “Tableau Jobs” visualization. The visualization itself took about 30 minutes, but writing the script to make bulk API calls and save the results correctly to CSV took hours.
    A number of developments had to occur for machine learning to become a reality. Machine learning (ML) has commercial applications today for two reasons:
    1. Plenty of publicly usable data (stored in less-than-structured formats) is available to feed into ML models.
    2. Computing advances allow these models to be trained in a relatively short period of time (days or weeks).
    This paper:
    1. Categorizes data by two characteristics:
       o Degree of structure
       o Source of data (internal or external)
    2. Explains what is really meant by “big data” in the common vernacular:
       – The link between big data and machine learning
       – NoSQL storage
    3. Introduces machine learning
    4. Explains why NoSQL is the choice for developers
  • 4. PAGE 2 Categorizing Data: How to Think About Your Organization's Data
    Data can be categorized into three groups: structured, semi-structured, and unstructured. There is one more key distinction: internal or external.
    Internal data should be structured. If a company designs the data collection system, the data can have structure at the time it is generated. Structure at the time of generation is the best-case scenario because “Data scientists spend anywhere from 50-80% of their time cleaning up data sets (creating structure) in order to find usable insights” (Lohr, 2014).
    [Figure: labor hours to insight, by data source (internal, external) and degree of structure (structured, semi-structured, unstructured). Three processes are shown, in order of increasing effort: (1) Visualize; (2) Transform/clean, load, visualize; (3) Find sources, write scripts to extract, transform/clean, load, visualize.]
  • 5. PAGE 3 Defining Big Data
    According to Jonathan Ward and Adam Barker of the University of St. Andrews, “all definitions (of big data) make at least one of the following assertions:
       o Size: the volume of the datasets is a critical factor.
       o Complexity: the structure, behavior and permutations of the datasets is a critical factor.
       o Technologies: the tools and techniques which are used to process a sizable or complex dataset is a critical factor.” (Barker & Ward, 2013, Undefined By Data)
    Barker and Ward then propose a definition: “Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.” (Barker & Ward, 2013)
    Let’s dive into some reasons why size, complexity, and technologies are all defining features of big data:
    1. Size: The choice of storage, cleaning/transformation, and analysis tools depends on size:
       a. Small data is not a concern.
          i. Storage & Analysis: The computational power required to handle small datasets is easily supplied by a personal computer. There is no need to scale a job across a computing cluster if doing so doesn’t save time. Because analysis is easy, storage choice is not a concern.
  • 6. PAGE 4
          ii. Cleaning Example: It may be quicker to clean the data by hand in Excel with simple find-and-replace operations than to write a script.
    2. Complexity: Big structured data is not the issue. With a relational structure we can use SQL to easily find what we want. This data is formatted according to the specifications of the database and needs few modifications before it is ready to be analyzed. It is also usually smaller than semi-structured or unstructured data because it is normalized, so redundancy is removed. This means it is usually computationally easy to analyze without having to scale horizontally (adding more machines).
    3. Technologies: This is how we store big data (NoSQL) and how we analyze it (machine learning).
    Our diagram has been narrowed down. When people speak of big data, they are usually talking about external semi-structured or unstructured data. It is this type of data that anyone can use for machine learning models.
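The claim that structured data can be queried directly with SQL can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name job_postings, its columns, and the sample rows are hypothetical and chosen only for illustration; they are not taken from the paper's data.

```python
import sqlite3

# Hypothetical sample rows: (title, city, salary). Not real data.
SAMPLE_ROWS = [
    ("Analyst", "Austin", 70000.0),
    ("Analyst", "Boston", 80000.0),
    ("Developer", "Austin", 90000.0),
]


def average_salary_by_city(rows):
    """Load rows into an in-memory relational table and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        # The schema is fixed up front, so every row arrives already structured.
        conn.execute(
            "CREATE TABLE job_postings ("
            "title TEXT NOT NULL, city TEXT NOT NULL, salary REAL NOT NULL)"
        )
        conn.executemany("INSERT INTO job_postings VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT city, AVG(salary) FROM job_postings "
            "GROUP BY city ORDER BY city"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for city, avg_salary in average_salary_by_city(SAMPLE_ROWS):
        print(city, avg_salary)
```

No cleaning step appears between loading and querying, which is the sense in which structured data needs few modifications before analysis.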
  • 7. PAGE 5 Primer on Machine Learning
    Machine learning is perhaps the most misleading buzzword ever created. What’s the difference between machine learning and data science or statistics? Why are machine learning and big data gaining popularity at the same time? What is the relationship between the two?
    One common way to categorize machine learning (ML) is into supervised ML and unsupervised ML. When I first began diving into the tools and algorithms of machine learning, they seemed quite similar to predictive and descriptive statistics.
    1. Supervised ML breaks the data into two sets: train and test. The model is built (trained) on the train set, and then the accuracy of the model is measured on the test set. We are interested in how well the model predicts the actual values found in the test set.
    2. Unsupervised ML deals with finding hidden structure in data without giving the model any output goal. So what’s the difference between this and descriptive statistics?
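The train/test procedure described for supervised ML can be sketched in plain Python. This is a minimal illustration, not a recommended toolkit: the helper names (train_test_split, fit_line, mean_absolute_error), the 25% test fraction, and the synthetic data are all assumptions made for the example, and the model is an ordinary least-squares line standing in for any supervised learner.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(train):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, test):
    """Average gap between predictions and the actual test values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in test) / len(test)


# Synthetic, noise-free data (an assumption for illustration): y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]
train, test = train_test_split(data)
model = fit_line(train)   # the model only ever sees the train set
error = mean_absolute_error(model, test)

if __name__ == "__main__":
    print(len(train), len(test), error)
```

The important property is that the test rows play no part in fitting, so the error measures how well the model predicts values it has not seen.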
  • 8. PAGE 6
    Aatash Shah of Edvancer.in gives us some insight: “Robert Tibshirani, a statistician and machine learning expert at Stanford, calls machine learning ‘glorified statistics’… …Both machine learning and statistics share the same goal: Learning from data. Both these methods focus on drawing knowledge or insights from the data… … Cheap computing power and availability of large amounts of data allowed data scientists to train computers to learn by analyzing data. But, statistical modeling existed long before computers were invented.” (Shah, 2016, Edvancer.in)
    Going back to our original question: what is the relationship between machine learning and big data? A number of developments had to occur for machine learning to become a reality. Without the data explosion caused by the internet, the development of NoSQL databases, and the computing advances achieved through Moore’s law, GPGPUs, and horizontal scaling of compute clusters, machine learning would be restricted to the academic realm and impractical for the majority of commercial purposes.
    This brings us to the linchpin of the entire discussion. The external data out there on the internet is stored in the format that best suits the application developer: NoSQL.
  • 9. PAGE 7 NoSQL Database Types
    [Figure: NoSQL database types (Habib, 2015, Appdynamics.com)]
    Scaling
    “Achieving scalability and elasticity is a huge challenge for relational databases. Relational databases were designed in a period when data could be kept small, neat, and orderly.” (Allen, 2015, Marklogic.com)
    Relational databases are designed with the data in mind. The relational structure normalizes the data in order to avoid duplication. Imposing a relational structure at development time severely limits software developers’ flexibility in future versions of their application.
  • 10. PAGE 8
    The popularity of iterative, agile-like software development life cycles (SDLCs) only exacerbates the disadvantages of RDBMSs.
    [Table: Relational vs. NoSQL (Document) comparison (Allen, 2015, Marklogic.com)]
    Pay particular attention to the data model. Remember my story about pulling JSON and transforming it? This data was likely pulled from a document database. If the site owner decided to make a major change to the data that was included, this would be a simple change in their document database. If they were using a relational structure, they might have to redesign the entire structure.
    As a job search board, Indeed.com has to scale its storage and compute power up and down based on web traffic and the number of job postings. Scaling back down is virtually impossible with a relational structure.
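The data-model difference described above can be sketched with Python's built-in json and sqlite3 modules. Everything here is hypothetical: a list of JSON strings stands in for a document database, the postings table and the new remote field are invented for illustration, and adding a single column is the mildest kind of relational change (splitting or re-keying tables would involve considerably more migration work).

```python
import json
import sqlite3

# Document side: each record carries its own structure, so a posting with a
# new "remote" field can sit next to older postings without any migration.
old_posting = {"title": "Analyst", "city": "Austin"}
new_posting = {"title": "Developer", "city": "Boston", "remote": True}
document_store = [json.dumps(old_posting), json.dumps(new_posting)]


def remote_flags(store):
    """Read the optional field, falling back to False for older documents."""
    return [json.loads(doc).get("remote", False) for doc in store]


# Relational side: the schema is declared up front, so the same change
# requires altering the table before the new column can be written.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (title TEXT, city TEXT)")
conn.execute("INSERT INTO postings VALUES (?, ?)", ("Analyst", "Austin"))
conn.execute("ALTER TABLE postings ADD COLUMN remote INTEGER DEFAULT 0")
conn.execute(
    "INSERT INTO postings VALUES (?, ?, ?)", ("Developer", "Boston", 1)
)
relational_rows = conn.execute(
    "SELECT title, remote FROM postings ORDER BY title"
).fetchall()
conn.close()

if __name__ == "__main__":
    print(remote_flags(document_store))
    print(relational_rows)
```

The trade-off is visible on the read side: the document reader has to cope with records that lack the field, which is part of the cleanup burden that falls on whoever later analyzes the data.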
  • 11. PAGE 9 Conclusion
       o There are three categories of data: structured, semi-structured, and unstructured.
       o There are two sources of data: internal and external.
       o The challenges associated with deriving insights from data apply to external data that is semi-structured or unstructured.
       o The term “Big Data” refers to volume, but also encompasses the storage technologies (NoSQL) and analysis tools (machine learning) because they are integral to the big data ecosystem.
       o Big data is stored in less-structured NoSQL databases for web developer agility.
    Final Statement: Machine learning is becoming democratized due to the availability of large amounts of less-than-structured data and cheap compute power. Although it would be easier for data scientists to work with structured data, this will never happen because developers need NoSQL databases to meet business requirements such as agility and scalability.
  • 12. PAGE 10 References
    Barker, A., & Ward, J. S. (2013, September 20). Undefined By Data: A Survey of Big Data Definitions [Scholarly project]. Retrieved from http://arxiv.org/abs/1309.5821
    Habib, O. (2015, September 21). A Newbie Guide to Databases. Retrieved from https://blog.appdynamics.com/database/a-newbie-guide-to-databases/
    Lohr, S. (2014). For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Retrieved July 20, 2016, from http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=1
    Lopez, K., & D'Antoni, J. (2014). The Modern Data Warehouse--How Big Data Impacts Analytics Architecture. Business Intelligence Journal, 19(3), 8-15.
    Machine Learning Algorithms Image. Retrieved from https://s3.amazonaws.com/MLMastery/MachineLearningAlgorithms.png?__s=c9sqnpazsd7pmpusegzy
    Machine learning frees up data scientists' time, simplifies smart applications - TechRepublic. (2015, December 14). Retrieved July 20, 2016, from http://www.techrepublic.com/article/machine-learning-frees-up-data-scientists-time-and-simplifies-smart-applications/
    Making Sense of NoSQL. (n.d.). Retrieved July 21, 2016, from http://macc.foxia.com/files/macc/files/macc_mccreary.pdf
    Relational Databases Are Not Designed For Scale | MarkLogic. (2015, November 09). Retrieved July 23, 2016, from http://www.marklogic.com/blog/relational-databases-scale/
    Shah, A. (2016, August 1). Machine Learning vs. Statistics. Retrieved from http://www.edvancer.in/machine-learning-vs-statistics/