Defining the Big Data Architecture Framework (BDAF)
Outcome of the Brainstorming Session at the University of Amsterdam
"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." -- Gartner, http://www.gartner.com/it-glossary/big-data/
Often termed a three-part definition, rather than a 3V definition
"Big Data: a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques." -- From "The Big Data Long Tail" blog post by Jason Bloomberg (Jan 17, 2013). http://www.devx.com/blog/the-big-data-long-tail.html
"Data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it." -- Ed Dumbill, program chair for the O'Reilly Strata Conference
*) The Fourth Paradigm: Data-Intensive Scientific Discovery. Edited by Tony Hey, Stewart Tansley, and Kristin Tolle. Microsoft, 2009.
The 5 Vs of Big Data (the first three are the commonly accepted 3Vs):
Volume: terabytes; records/archives; transactions; tables, files
Velocity: batch; real/near-time; processes; streams
Variety: structured; unstructured; multi-factor; probabilistic
Value: statistical; events; correlations; hypothetical
Veracity: trustworthiness; authenticity; origin, reputation; availability; accountability
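As a concrete illustration, a dataset's 5V properties could be captured in a small data structure. This is a minimal sketch: the field names, units, and the toy `is_big_data` heuristic are assumptions for illustration, not part of the slides.

```python
# Sketch: representing a dataset's 5V profile (illustrative names/thresholds).
from dataclasses import dataclass, field


@dataclass
class FiveVProfile:
    volume_tb: float                 # Volume: size in terabytes
    velocity: str                    # Velocity: "batch", "near-real-time", "streaming"
    variety: list = field(default_factory=list)  # Variety: e.g. ["structured", "unstructured"]
    value: str = "statistical"       # Value: e.g. "events", "correlations"
    veracity: float = 1.0            # Veracity: trust score in [0, 1]

    def is_big_data(self) -> bool:
        # Toy heuristic based only on the commonly accepted 3Vs
        # (volume, velocity, variety); thresholds are arbitrary.
        return (self.volume_tb >= 1.0
                and self.velocity != "batch"
                and len(self.variety) > 1)


profile = FiveVProfile(10.0, "streaming",
                       ["structured", "unstructured"], "correlations", 0.9)
print(profile.is_big_data())  # True
```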
Big Data (Data Intensive) Technologies target processing (1) high-volume, high-velocity, high-variety data sets/assets in order to (2) extract the intended data value and ensure high veracity of the original data and the obtained information. They demand cost-effective, innovative forms of data and information processing (analytics) for enhanced insight, decision making, and process control. All of this must be supported by (3) new data models (covering all data states and stages during the whole data lifecycle) and (4) new infrastructure services and tools that also allow obtaining (and processing) data from a variety of sources (including sensor networks) and delivering data in a variety of forms to different data and information consumers and devices.
Service Oriented Architecture (SOA): first proposed in 1996 and revived with the advent of Web Services in 2001-2002
Currently an industry standard, widely used
Provided a conceptual basis for Web Services development
Computer Grids: initially proposed in 1998 and finally shaped in 2003 with the Open Grid Services Architecture (OGSA) by the Open Grid Forum (OGF)
Currently remains a collaborative environment
Migrating to cloud and inter-cloud platforms
Cloud Computing: initially proposed in 2008
Defined new features, capabilities, operational/usage models, and effectively provided guidance for new technology development
Originated from the Service Computing domain; focused on service management
Big Data: yet to be defined
Involves more components and processes to be included in the definition
Better defined as an ecosystem in which data are the main driving factor/component
Need to define the Big Data properties and expected technology capabilities, and provide guidance/vision for future technology development
[Table: relative weight of Big Data properties across application domains -- Business; Living environment, Cities; Social media, networks; Healthcare -- column headings not recoverable from the extracted text]
[Diagram: Data -- Metadata -- Model -- Relations -- Functions]
Data: the lowest layer of abstraction (?) from which information can be derived
Information: a combination of contextualised data that can provide meaningful value or usage/action (scientific, business)
Actionable data
Presentation (?)
Where is knowledge (as a target of learning)?
PID=UID+time+Prj
DatasetID={PID+Pfj}
ModelID?=?
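The identifier composition above (PID = UID + time + Prj; DatasetID = {PID + Pfj}) can be sketched in code. The separators, field formats, and function names below are assumptions; the slide only gives the composition rules, and ModelID is left open there as well.

```python
# Hedged sketch of the slide's identifier scheme (separators are assumptions).
import time


def make_pid(uid: str, prj: str, ts=None) -> str:
    """Compose a persistent identifier: PID = UID + time + Prj."""
    ts = ts if ts is not None else time.time()
    return f"{uid}-{int(ts)}-{prj}"


def make_dataset_id(pid: str, pfj: str) -> str:
    """DatasetID = {PID + Pfj}: the PID plus a project file/fragment ID."""
    return f"{pid}/{pfj}"


# Fixed timestamp so the example is reproducible.
pid = make_pid("user42", "bdaf", ts=1374019200)
print(make_dataset_id(pid, "set01"))  # user42-1374019200-bdaf/set01
```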
[Diagram: data lifecycle with PIDs and metadata]
Data Source -> Data Collection and Registration -> Data Filter/Enrich, Classification -> Data Analytics, Modeling, Prediction -> Data Delivery, Visualisation -> Data Consumer
Each stage carries a PID and metadata: raw data from the source; structured, actionable data and archival datasets in the middle stages; model data and datasets for analytics. Delivered outputs include visualised data, statistical data models, business reports, trends, controlled processes, and social actions.
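The lifecycle stages in the diagram can be sketched as composable stage functions passing a record (data plus PID and metadata) down the chain. The record structure and per-stage logic below are illustrative assumptions, not the framework's actual interfaces.

```python
# Sketch: the lifecycle stages as functions over a {pid, data, metadata} record.

def collect_and_register(raw, pid):
    """Data Collection and Registration: attach a PID and initial metadata."""
    return {"pid": pid, "data": raw, "metadata": {"stage": "collected"}}


def filter_enrich_classify(rec):
    """Data Filter/Enrich, Classification: drop missing values here."""
    rec["data"] = [x for x in rec["data"] if x is not None]
    rec["metadata"]["stage"] = "classified"
    return rec


def analyse_model_predict(rec):
    """Data Analytics, Modeling, Prediction: a trivial statistical model."""
    rec["model"] = {"mean": sum(rec["data"]) / len(rec["data"])}
    rec["metadata"]["stage"] = "analysed"
    return rec


def deliver_visualise(rec):
    """Data Delivery, Visualisation: hand the result to the data consumer."""
    rec["metadata"]["stage"] = "delivered"
    return rec


record = collect_and_register([1, 2, None, 3], pid="user42-1374019200-bdaf")
for stage in (filter_enrich_classify, analyse_model_predict, deliver_visualise):
    record = stage(record)
print(record["model"]["mean"], record["metadata"]["stage"])  # 2.0 delivered
```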
Data structures
Structured data
Unstructured data
Data types [ref]
(a) data described via a formal data model
(b) data described via a formalized grammar
(c) data described via a standard format
(d) arbitrary textual or binary data
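A hedged sketch of telling the categories apart at runtime. Category (a), data described via a formal data model, needs out-of-band schema information and cannot be detected from the bytes alone, so the sketch covers (b)-(d); the detection heuristics (JSON as the standard format, a toy `key=value;` grammar) are assumptions for illustration.

```python
# Sketch: classifying raw bytes into data-type categories (b)-(d).
import json
import re


def classify(data: bytes) -> str:
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "(d) arbitrary binary data"
    try:
        json.loads(text)                       # (c): a standard format
        return "(c) standard format (JSON)"
    except json.JSONDecodeError:
        pass
    if re.fullmatch(r"(\w+=\w+;)+", text):     # (b): a toy formalized grammar
        return "(b) formalized grammar"
    return "(d) arbitrary textual data"


print(classify(b'{"a": 1}'))      # (c) standard format (JSON)
print(classify(b"k=v;x=y;"))      # (b) formalized grammar
print(classify(b"\xff\xfe\x00"))  # (d) arbitrary binary data
```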
Data models
Depend on target/goal, or process/object?
Evolve or chain/stack?
[Diagram: data evolution along the lifecycle]
Raw Data -> Processed Data (for target use) -> Usable Data -> Actionable Data -> Papers/Reports -> Archival Data
The same pipeline applies: Data Source -> Data Collection & Registration -> Data Filter/Enrich, Classification -> Data Analytics, Modeling, Prediction -> Data Delivery, Visualisation -> Data Consumer
Analytics: realtime, interactive, batch, streaming
Compute: high-performance computer clusters; general purpose
Storage: specialised databases; archives; general purpose
[Diagram: data analytics and repurposing in the lifecycle]
Data Source -> Data Collection & Registration -> Data Filter/Enrich, Classification -> Data Analytics, Modeling, Prediction -> Data Delivery, Visualisation -> Data Consumer (application)
Feedback paths: data repurposing, re-factoring, and secondary processing via analytics databases; re-purposed data may be released as Open Public Data.
Data Linkage Issues: Persistent Identifiers (PID); ORCID (Open Researcher and Contributor ID); Linked Data; Data Links; Metadata & Management
Data Clean-up and Retirement: Ownership and authority; Use; Data Detainment
17 July 2013, UvA Big Data Architecture Brainstorming 33
Additional Information
Using integrated/unified storage: new DB/storage technologies allow storing data during the whole lifecycle
[ref] HPCC Systems: Introduction to HPCC (High Performance Computer Cluster), Author: A.M.
Middleton, LexisNexis Risk Solutions, Date: May 24, 2011
LexisNexis HPCC System Architecture
ECL: Enterprise Control Language
THOR: Processing Cluster (Data Refinery)
Roxie: Rapid Data Delivery Engine
[Diagram: Cloud architecture layers, with technologies and solutions per layer]
Layer C4 -- Cloud Services (Infrastructure, Platform, Application, Software): IaaS, PaaS, SaaS; PaaS-IaaS interface; user access via portal/desktop; federation infrastructure
Layer C3 -- Virtual Resources Composition and Orchestration: virtual resources (VMs, VPNs); cloud management platforms and software (generic functions), e.g. OpenNebula, OpenStack, other CMS
Layer C2 -- Virtualisation: virtualisation platforms (KVM, XEN, VMware); network virtualisation; IaaS virtualisation platform interface
Layer C1 -- Hardware platform and dedicated network infrastructure
[Diagram: e-Science infrastructure layers, with cross-layer Security and AAI]
Layer B6 -- User/Scientific Applications: scientific and specialist applications, scientific portals, library resources, user datasets
Layer B5 -- Federated Access and Delivery Infrastructure (FADI): federated identity management (eduGAIN, REFEDS, VOMS, InCommon)
Layer B4 -- Shared Scientific Platform and Instruments (specific to scientific areas, also Grid based): e.g. PRACE/DEISA
Layer B3 -- Cloud/Grid Infrastructure Virtualisation and Management: infrastructure middleware, Grid/Cloud middleware, security
Layer B2 -- Datacenter and Computing Facility: compute resources, sensors and devices, storage resources, clouds
Layer B1 -- Network Infrastructure: optical network infrastructure; Autobahn, eduroam
[Diagram: FADI network infrastructure]
Components: Trusted Introducer; Discovery Directory; Federated IdP (FedIDP); SLA Repository (RepoSLA); multiple (I/P/S)aaS Providers