8 Guiding Principles to Kickstart Your Healthcare Big Data Project
White Paper | December 2018
OVERVIEW OF BIG DATA IN HEALTHCARE

Big Data technologies have seen widespread adoption across different industries over the past 3-5 years, but healthcare is only now starting to realize the benefits. This shift is driven largely by the exponential growth of unstructured and semi-structured healthcare information.
With sensors and wearables becoming a part of our
daily lives, people and organizations now have access
to enormous amounts of data, e.g., step tracking,
heartbeat / blood pressure monitoring, calorie
tracking, sleep pattern analysis, etc.
The explosion in healthcare data, while posing massive
storage and processing challenges, also has the
potential to transform the way we use data to improve
outcomes, for example:
• Predicting future care needs for specific populations
• Minimizing health risks by predicting specific events well in advance
• Identifying, or expediting the identification of, new patterns in disease detection
Our experience with a large number of healthcare Big
Data projects has shown that most customers face
significant hurdles in kick-starting their Big Data
initiatives.
With limited or no experience, customers often realize at the last minute that their Big Data implementations lack the architectural robustness to address future needs.
This white paper illustrates our experiences and
learnings across multiple Big Data implementation
projects. It contains a broad set of guidelines and best
practices around:
• Building highly secure Big Data lakes
• Efficiently processing vast amounts of data
• Providing access to downstream systems
• Best practices to mitigate project risks
• Technical hurdles and approaches to overcome them
GUIDING PRINCIPLES FOR BIG DATA IMPLEMENTATION

1. Use a Comprehensive Data Ingestion Framework
While working with a Big Data lake, you need to integrate numerous source systems with multiple feed types. Your Big Data solution should be able to handle different feed types and cater to future source system integration needs. Design a data ingestion framework that addresses the following (a minimal sketch follows the list):
• All types of data: relational, semi-structured and unstructured
• Standard feed protocols: HTTPS, SFTP, etc.
• Different loading scenarios, including initial load and incremental load
• An ELT (Extract, Load, Transform) approach as compared to traditional ETL
• Various ingestion frequencies: batch, real-time
• Relevant data ingestion mechanisms (push or pull); pulling data may not be preferable when the data lake is on the cloud and sources reside on-premises
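The sketch below shows one way such a framework can stay configuration-driven: each feed declares its format, location and load mode, and a common routine dispatches on those attributes. This is a minimal, hypothetical example; the feed definitions, paths and credentials are illustrative, not taken from any specific product.

```python
# Hypothetical config-driven ingestion step using PySpark. Feed definitions,
# paths and credentials are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-framework").getOrCreate()

# Each feed declares its source format, location and load mode; the framework
# dispatches on these attributes instead of hard-coding one pipeline per source.
feeds = [
    {"name": "claims", "format": "jdbc", "mode": "incremental",
     "options": {"url": "jdbc:postgresql://source-db/claims",
                 "dbtable": "claims", "user": "etl", "password": "***"}},
    {"name": "clinical_notes", "format": "json", "mode": "initial",
     "options": {"path": "/landing/sftp/notes/"}},
]

def ingest(feed):
    df = spark.read.format(feed["format"]).options(**feed["options"]).load()
    # ELT: land the data as-is in the raw zone; transformation happens later.
    df.write.mode("append").parquet(f"s3a://datalake/raw/{feed['name']}/")

for feed in feeds:
    ingest(feed)
```

Adding a new source then becomes a configuration change rather than new pipeline code, which is what makes the framework extensible to future integrations.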
Big Data Layers: Typical Architecture

[Figure: a typical Big Data architecture stacks a data ingestion layer over the data sources; a data storage layer (e.g., S3 or an HDFS cluster); a data processing layer with batch and real-time processing; a data query layer with a querying and analytics engine (statistical analytics, semantic analytics, predictive modelling); and a data visualization layer (dashboards and reports). Data monitoring and data security layers cut across the entire stack.]
2. Choose the Right Storage Type for Each Feed
Since Big Data ecosystems provide multiple storage components, you have the opportunity to use the most relevant and optimal storage type for each feed. Consider the following points while choosing a storage type (a simple routing sketch follows the examples):
• Feed attributes: e.g., total size of data, size of individual files, velocity at which data arrives, etc.
• Data ingestion system: ability to identify whether the ingested data is small or large
• Database architecture: based on size, data can be stored in distributed file systems, cloud storage or in NoSQL / columnar databases. For example:
  • Files of 128 MB and above (the default Hadoop block size) can be stored in HDFS; small files (in KBs) can be stored in Hadoop sequence files or in HBase
  • JSON data can be stored in a document database
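As a concrete illustration, the routing logic can be as simple as a size and format check. This is a minimal sketch under assumed thresholds; the function and storage names are hypothetical.

```python
# Hypothetical storage router: picks a storage type from feed attributes.
# The 128 MB threshold is the default HDFS block size cited above.
HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # bytes

def choose_storage(avg_file_size_bytes: int, is_json: bool) -> str:
    if is_json:
        return "document-db"          # e.g., a document database for JSON
    if avg_file_size_bytes >= HDFS_BLOCK_SIZE:
        return "hdfs"                 # large files map well to HDFS blocks
    return "hbase-or-sequence-file"   # avoids the HDFS small-files problem

print(choose_storage(200 * 1024 * 1024, is_json=False))  # -> hdfs
print(choose_storage(64 * 1024, is_json=False))          # -> hbase-or-sequence-file
```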
3. Create Separate Storage Layers
Organizations starting their Big Data implementations
often ask, “How do we arrange data in a Data Lake?”
and “How many layers should we create?”. The
answers depend on the type of data being pulled and
processed in the Data Lake. In a standard scenario, customers want to correlate data from relational systems, IoT devices, social media and unstructured data sources (e.g., notes, images, documents). In such scenarios, a three-layer approach can be used.
Raw Layer
Although not mandatory, it is always advisable to store data in its native form in the Data Lake. This forms the raw layer, or raw zone, of the Data Lake. Data scientists and analysts generally refer to the raw layer to perform analysis without waiting for operational data.
Curated Layer
While the raw layer is important from a raw analytics and reprocessing perspective, it isn't the most optimal way to store data, as it may contain duplicate, incorrect or incomplete records. It is always advisable to create a curated data layer that holds cleansed and standardized data. Analytics performed on the curated layer provides much more accurate results than the raw layer.
Operational Layer
Data stored in the curated layer isn't reconciled and still carries the context of its source system. This poses analytics challenges and leaves the possibility of duplicate records being sourced. The operational layer solves this problem by reconciling and transforming incoming data from different sources into a single, canonical model.
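A minimal PySpark sketch of the three-layer flow is shown below; the paths, column names and deduplication key are illustrative assumptions, not a prescribed model.

```python
# Minimal sketch of raw -> curated -> operational layers in PySpark.
# Paths, columns and the dedup key are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("storage-layers").getOrCreate()

# Raw layer: data exactly as ingested, kept for reprocessing and raw analytics.
raw = spark.read.parquet("s3a://datalake/raw/claims/")

# Curated layer: cleansed and standardized, still in source-system context.
curated = (raw
           .dropDuplicates(["claim_id"])                  # drop duplicate records
           .filter(F.col("claim_id").isNotNull())         # drop incomplete records
           .withColumn("service_date",                    # standardize data types
                       F.to_date("service_date", "MM/dd/yyyy")))
curated.write.mode("overwrite").parquet("s3a://datalake/curated/claims/")

# Operational layer: source fields reconciled into a single canonical model.
canonical = curated.select(
    F.col("claim_id").alias("encounter_id"),
    F.col("member_id").alias("patient_id"),
    "service_date")
canonical.write.mode("overwrite").parquet("s3a://datalake/operational/encounters/")
```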
4. Use the Right Data Processing Frameworks & Tools
Identifying the right data processing framework can be
difficult as there are multiple processing frameworks in
the Big Data ecosystem. Common data processing
tasks like data cleansing, quality reporting,
aggregation, transformation and reconciliation can be
performed by standard ETL tools. However, for Big Data processing, most standard ETL tools rely on Apache Spark underneath. While these tools provide drag-and-drop UIs and out-of-the-box adapters, their internal workings are abstracted, making them difficult to operate in certain scenarios.
Commonly used ETL tools are Talend Enterprise,
Pentaho, Informatica, DataStage and Attunity. For
simple data processing needs, IT teams can create a
custom ETL utility using Apache Spark and its in-built
transformation functions.
Keep the following best practices in mind while working with data processing frameworks (a sketch applying several of them follows the list):
• Big Data processing happens in a distributed manner. Arrange data to minimize shuffling and optimize performance, and use compression to speed up data transfer over the network and reduce shuffling time
• Joins are expensive in Big Data and should be thoughtfully implemented. You can also improve performance by de-normalizing records
• Use parameters like batch ID or date range to isolate and reprocess specific slices of data when bad / corrupt records are found
• Keep track of events (metadata, audits) during data processing, e.g., who triggered the process, which dataset was used, the size of the dataset, the count of records processed, the processing status, and the start and finish times
• Be practical with partitioning. Distributed processing often fails to take full advantage of the nodes when partitions are too small or too numerous
• For stream processing, create enough partitions on a Kafka topic to trigger parallel processing in Apache Spark, and checkpoint at regular intervals to minimize the impact of stream processing failures
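The sketch below applies several of these practices: a broadcast join to avoid shuffling a large table, and a checkpointed Kafka stream whose parallelism is driven by the topic's partitions. The paths, broker address and topic name are illustrative assumptions.

```python
# Minimal sketch: broadcast join to reduce shuffling, plus a checkpointed
# Kafka stream. Paths, broker and topic names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("processing-practices").getOrCreate()

claims = spark.read.parquet("s3a://datalake/curated/claims/")        # large table
providers = spark.read.parquet("s3a://datalake/curated/providers/")  # small lookup

# Broadcasting the small side ships it to every executor and avoids an
# expensive shuffle join on the large table.
enriched = claims.join(broadcast(providers), "provider_id")
enriched.write.mode("overwrite").parquet("s3a://datalake/operational/claims_enriched/")

# Structured Streaming from Kafka: Spark parallelizes across the topic's
# partitions, and the checkpoint lets the job resume from committed offsets.
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "vitals")
          .load())
query = (stream.writeStream.format("parquet")
         .option("path", "s3a://datalake/raw/vitals/")
         .option("checkpointLocation", "s3a://datalake/checkpoints/vitals/")
         .start())
query.awaitTermination()
```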
5. Think of Data Management Right at the Beginning

With business environments changing rapidly, organizations need to treat data management as a critical component of their business strategy. An organization's data strategy is affected by multiple scenarios, including:
• Changes in organization or technology
• Process and people changes due to mergers and acquisitions
• Changes in regulatory compliance or contractual arrangements
• Issues with the quality, availability or timeliness of data that affect decision making
• Massive investments in time and resources required to get data into the correct shape
To overcome these challenges, organizations must
start thinking of data management solutions right
from project inception.
A few frameworks provide data management capabilities for Big Data: Hortonworks pairs Apache Atlas with Apache Falcon, Cloudera Navigator offers partial functionality, and MapR uses a custom framework.
7 Pillars of Data Management
1. Data Architecture: Data analysis, enterprise data
architecture, integration with applications
2. Content Management: Organizing, consolidating
and optimizing content
3. Data Development: Requirement analysis, data
modelling, database design, implementation and
maintenance
4. Master Data and Metadata Management: Master
patient index, master provider index, master facility
index, ICD 9/10, CPT, SNOMED, LOINC, DRG and
standards, common codes, integration metadata,
control metadata, quality metadata
5. Data Quality: Measurement, assessment and improvement of data quality (see the sketch after this list)
6. Operations Management: Acquisition, recovery, tuning, retention and purging
7. Data Security: Classification, administration, privacy and confidentiality, authentication and auditing
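As one concrete example of the Data Quality pillar, simple completeness and uniqueness metrics can be computed directly on a curated table and fed into quality reports. This is an illustrative sketch; the table path, columns and metric choices are assumptions.

```python
# Illustrative data-quality measurement: completeness and uniqueness metrics
# for a curated table. Path and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-metrics").getOrCreate()
df = spark.read.parquet("s3a://datalake/curated/claims/")

total = df.count()
metrics = {
    "completeness.claim_id":
        df.filter(F.col("claim_id").isNotNull()).count() / total,
    "uniqueness.claim_id":
        df.select("claim_id").distinct().count() / total,
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")  # publish to quality reports / alerting
```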
6. Provide a Sophisticated Search Capability

A search feature becomes essential in Big Data systems because of the sheer volume of data: searching for specific attribute values is like finding a needle in a haystack. As entities are added, updated or removed from the Data Lake, there must be a way to quickly see which entities are present and to search for specific attribute values.
It's always beneficial to index your data and provide a search UI for quick discovery. Consider providing a facility to tag attributes to make them searchable, and allow users to group attributes using tags, as in the sketch below.
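A minimal sketch of tag-based indexing and search with Apache Solr via the pysolr client (an assumed choice; the core name, fields and tags are illustrative):

```python
# Hypothetical Solr-backed attribute catalogue: index attributes with tags,
# then search by tag. Core name, fields and tags are illustrative.
import pysolr

solr = pysolr.Solr("http://solr-host:8983/solr/datalake_catalog", timeout=10)

# Index entity attributes along with user-defined tags for logical grouping.
solr.add([
    {"id": "claims.provider_id", "entity": "claims",
     "attribute": "provider_id", "tags": ["provider", "identifier"]},
    {"id": "encounters.patient_id", "entity": "encounters",
     "attribute": "patient_id", "tags": ["patient", "identifier"]},
], commit=True)

# Tag-based search: find every attribute grouped under the "identifier" tag.
for doc in solr.search("tags:identifier"):
    print(doc["entity"], doc["attribute"])
```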
7. Simplify Data Access Using APIs and Data Virtualization

All data warehousing / Data Lake projects need to provide data extracts to downstream / external systems, allow users to search data, and enable analytics systems to connect and analyze data using standard interfaces. Most of these requirements can be fulfilled by a thin API access layer that provides unified access to the underlying data. The API layer implementation should support standards-based interfaces like REST, SQL or a combination of both.
Data extraction processes are scheduled jobs that extract data from specific tables and store it in a shared location (e.g., an SFTP server). A low-priority processing queue can be used for extractions during peak hours, to ensure the extraction query does not consume all processing resources. Additionally, data virtualization software (e.g., Denodo) or a custom data virtualization layer (using Apache Ignite and Spark) can be used to create a common interface across Data Lakes and other source systems. A minimal sketch of the thin API layer follows.
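The sketch below exposes a thin REST endpoint over the operational layer, using Flask as an assumed web framework; the endpoint, storage path and columns are hypothetical.

```python
# Hypothetical thin REST layer over the Data Lake's operational layer.
# Flask is an assumed choice; endpoint, path and columns are illustrative.
from flask import Flask, jsonify, request
from pyspark.sql import SparkSession

app = Flask(__name__)
spark = SparkSession.builder.appName("data-access-api").getOrCreate()

@app.route("/api/v1/encounters")
def encounters():
    # Translate simple query parameters into a filtered read; heavier
    # analytics would connect through SQL-based interfaces instead.
    patient_id = request.args.get("patient_id")
    df = spark.read.parquet("s3a://datalake/operational/encounters/")
    if patient_id:
        df = df.filter(df.patient_id == patient_id)
    return jsonify([row.asDict() for row in df.limit(100).collect()])

if __name__ == "__main__":
    app.run(port=8080)
```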
8. Provide an Analytics Workspace for Advanced Users
With the evolution of Big Data and Data Lakes, more
organizations are adopting advanced analytics tools
and technologies – e.g., Predictive Analytics, Machine
Learning, Deep Learning, Natural Language Processing
and AI algorithms. These technologies require
extensive piloting, model operationalization and
custom dashboarding before they can be applied in
real-world scenarios.
Data scientists and analysts need a dedicated workspace, with the toolsets they require, to pull, process and analyze raw, curated and aggregated data, and to share their findings. They should be able to perform activities like preliminary analysis, identifying new trends and quick dashboarding, without affecting the Data Lake.
An analytics workspace can be implemented in one of
the following ways:
A. Use Existing Data Lake Infrastructure to Carve Out Space for Individual Data Scientists

This option uses the existing Data Lake infrastructure to create slots for individual data scientists, where they can work with a copy of the data using various tools, e.g., Apache Spark based notebooks.
B. Use a Separate Cluster for Each Data Scientist
This option creates separate infrastructure for individual users and pulls data from the Data Lake. It may prove costlier, but it provides a true multitenant architecture and ensures that system performance is always optimal.
H-SCALE ADDRESSES KEY HEALTHCARE BIG DATA NEEDS

CitiusTech's H-Scale platform for healthcare data management has been specifically designed to address healthcare Big Data challenges such as data acquisition, real-time processing, Master Data Management, data security and advanced analytics. Here is how H-Scale supports the Big Data requirements discussed in this paper.

Data Ingestion
Highly configurable data ingestion pipeline that caters to structured, unstructured and semi-structured data ingestion, using Big Data ecosystem components like Sqoop and Flume. Also provides real-time streaming ingestion through a scalable, Apache Kafka and Storm based ingestion-and-processing pipeline.
Storage Types
Configurable data ingestion pipeline - dynamically chooses storage (HDFS or HBase)
based on data attributes.
Storage Layers
Ability to configure and execute data transformation and reconciliation rules using a self-service UI. CitiusTech's healthcare data model can be used to create the canonical data model in the operational layer.
Data Processing
Highly configurable and easy-to-use data processing pipeline built on top of Apache
Spark to perform data validation, curation, transformation and reconciliation. Data
processing pipeline improves time-to-market for customers by quickly integrating data
from various sources.
Data Management
Data governance adapters to capture data lineage and auditing information. H-Scale data governance adapters can be used with Apache Atlas on Hortonworks Data Platform (HDP), and with Cloudera Navigator when working with Cloudera Hadoop Distribution (CDH).
Search
Apache Solr indexing framework to index specific tables for fast search. It also provides a tag-based logical grouping facility for searching all occurrences of specific groups.
Data Access
Apache Spark and Ignite based data virtualization platform that can connect to different sources without replicating data. Data virtualization processes use a source catalogue to join data at runtime without replication.
Analytics Workspace
Big Data analytics workspace that provides a self-service UI, Zeppelin-based notebooks, and tools for creating data processing pipelines.
CONCLUSION

As healthcare organizations worldwide begin to roll out their Big Data strategies, they will face a number of challenges along the way. With the right initial approach, organizations can create more robust strategies which enable them to leverage their Big Data assets more effectively.

Our experience with Big Data implementations puts us in a strong position to define and articulate best practices for healthcare Big Data implementation. CitiusTech's H-Scale platform for healthcare data management has been aligned to fit seamlessly with the healthcare industry's Big Data implementation needs.

REFERENCES

• https://atlas.apache.org/
• https://www.redoxengine.com/blog/how-to-do-microservice-chassis-and-microservice-scaffolding-on-a-budget-2/
ABOUT THE AUTHORS
Pawan Mathur
Senior Technical Specialist – Data Management Proficiency, CitiusTech
Pawan.mathur@citiustech.com
Pawan has 20+ years of experience in the IT industry, with extensive experience in software development using Big Data technologies (Flink, Spark, Hadoop) and analytics. He has played the role of Senior Architect in the development and implementation of CitiusTech's H-Scale platform. He holds a degree in Software Enterprise Management from the Indian Institute of Management, Bangalore.
Swanand Prabhutendolkar
Vice President – Data Science Proficiency, CitiusTech
Swanand.Prabhutendolkar@citiustech.com
Swanand leads the Data Management Proficiency at CitiusTech, which includes the Healthcare Interoperability, BI-DW and Big Data practices. He has 20+ years of experience in the IT industry, of which 11+ years are in healthcare analytics and data management. Prior to CitiusTech, Swanand served leading technology organizations such as EPIC Corporation, Polaris and 3i Infotech. He holds a Master of Science degree in Information Technology and Applied Statistics from the Indian Institute of Technology (IIT), Bombay.
CitiusTech is a specialist provider of healthcare technology services and
solutions to healthcare technology companies, providers, payers and life
sciences organizations. With over 3,200 professionals worldwide,
CitiusTech enables healthcare organizations to drive clinical value chain
excellence - across integration & interoperability, data management
(EDW, Big Data), performance management (BI / analytics), predictive
analytics & data science and digital engagement (mobile, IoT).
CitiusTech helps customers accelerate innovation in healthcare through
specialized solutions, healthcare technology platforms, proficiencies and
accelerators. With cutting-edge technology expertise, world-class service
quality and a global resource base, CitiusTech consistently delivers best-
in-class solutions and an unmatched cost advantage to healthcare
organizations worldwide.
For queries contact thoughtleaders@citiustech.com
Copyright © CitiusTech 2018. All Rights Reserved.