Speeding Up
Drug Research
with MongoDB
Introducing MongoDB
into an RDBMS Environment
Doug Garrett
• Genentech Research and Early Development (gRED)
Bioinformatics and Computational Biology (B&CB)
Software Engineer
gRED: Disease and Drug Research
Bioinformatics Customers: Scientists
Most of All: Patients
Bioinformatics: Not Your Typical IT
MongoDB
• Not just about big data
MongoDB has a flexible schema
• Not just about new systems
MongoDB easily integrates with RDBMS
• Not just about software
It’s about saving lives
Time to Introduce New Genetic Test (weeks)
Drug Development Process
New Drug
Drug Development Process
New Drug
Drug Development Process
New Drug
New Mouse Model - Genetic Testing
File (csv)
J. Colin Cox Sept. 2013 Presentation
Growth In Genetic Testing (chart, in thousands): Samples and Genotypes, over successive 6-month and 3-month periods
The Best of Both Worlds: Speeding Up Drug Research with MongoDB & Oracle (Genentech)
Varies by Genetic Test
Case Study: New Genetic Test Instrument
New Instrument!
Impact?
Bio-Rad CFX384
ABI 7900HT
Case Study: New Genetic Test Instrument
New Instrument!
Impact?
DB Schema? No Impact
Project? 3 weeks
Bio-Rad CFX384
ABI 7900HT
Going Live…
Failure Mode
Failure Mode
Synch MongoDB with RDBMS
// Mark every MongoDB document that has a corresponding RDBMS row
db.NewRdbmsMongoId.find().forEach(function(doc){
  db.TestResults.update({'_id': doc._id},
                        {'$inc': {useCount: 1}})
})
// Then delete the unmarked (orphan) documents
// (assumes documents are created with useCount: 0)
db.TestResults.remove({'useCount': 0});
But Wait! There’s More…
• Flexible data collection
Load CSV to MongoDB
"_id" : ObjectId("…"),
"plate_wells" : [
  { "Well" : "A01",
    "Sample" : "308…",
    …
  }
]
Add Fields to CSV
"_id" : ObjectId("…"),
"plate_wells" : [
  { "Well" : "A01",
    "Sample" : "308…",
    …
    "New1" : "New Value"
  }
]
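Once the extra column is captured, nothing else has to change for it to be queried. A minimal mongo-shell sketch (the TestResults collection name and the New1 field follow the slide example; both are illustrative):

// Find plates whose wells carry the extra CSV column
db.TestResults.find({ "plate_wells.New1": { $exists: true } })

// Project just the first well that holds the new value from each matching plate
db.TestResults.find(
  { "plate_wells.New1": "New Value" },
  { "plate_wells.$": 1 }
)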
Future – What If…
Avoiding the typical “Catch 22”:
1. Is it worth collecting the data?
2. What is the value of the data?
3. Need the data to find the value
Future Analytics
MongoDB Aggregation Framework
R
Matlab
MongoDB Aggregation Framework
40% Discount Thru July 4
Use Code: mdbdgcf
Under Discussion
CSV
JSON
XML
Other
Lab Instruments
Government Agencies
Third Party Sources
MongoDB
Load to RDBMS
Process Directly
MongoDB
• Not just about big data
MongoDB has a flexible schema
• Not just about new systems
MongoDB easily integrates with RDBMS
• Not just about software
It’s about saving lives
“You better do it fast”
For my Father
Who I hope would have enjoyed this talk

Editor's Notes

  • #3: *** possible joke: This is the second time that I’ve been to the “First World Conference” for a groundbreaking product. You don’t get this chance very often. The first time was 18 years ago for a product that some of you may have heard of: Java. My name is Doug Garrett. I’m a software engineer in the Bioinformatics and Computational Biology department of Genentech Research and Early Development. Since that’s quite a mouthful I’ll just refer to them as gRED and Bioinformatics. Genentech was the first biotech company – the first company to produce drugs, such as insulin, from genetically engineered organisms. In 2009 Genentech was purchased by the Swiss pharmaceutical company Roche, which wisely decided to keep Genentech Research as a separate group reporting directly to the CEO. *** possible joke: describe the cultural clash between a laid-back San Francisco academic culture and a Swiss business – the first time we saw senior management together on the same stage, Genentech wore ties and Roche didn’t.
  • #4: gRED does basic research into disease mechanisms/causes and then uses those discoveries to develop new drugs. Although major successes have been in cancer, we are now investigating other areas as well, including Neurology (Alzheimer's, Parkinson's), Immunology (arthritis and asthma), Metabolism (diabetes), and Infectious Diseases (flu, Hepatitis C).
  • #5: My customers are the scientists discovering the cause of diseases and then trying to find new drugs for the diseases.
  • #6: But uppermost and most important, the ultimate customers are the patients.
  • #7: How is being a software engineer in Bioinformatics different from a typical software development environment? First, most of the people within Bioinformatics are scientists, but within Bioinformatics is a fairly small group of software engineers, such as myself. Software engineers in Bioinformatics have to speak a different language: they have to understand the terminology AND the underlying science. But ALSO – they need to be flexible and adapt quickly – it’s research. Terms used for the above word map: Heterozygous, Alleles, Genes, Polymorphism, SNP (single nucleotide polymorphism), PCR (polymerase chain reaction), IVF, Cryo, Multiplex, Primer, Probes, Genetic Assay, Colony, Congenic, Genome, Backcross, Chimera, Microinjection, hCG (human chorionic gonadotropin), PMSG
  • #8: I’m going to be discussing a recently completed project which used MongoDB. I hope to expand or extend people's understanding of what MongoDB excels at and in which situations it is best utilized. Many talks discuss MongoDB for “big data” – but that’s not all MongoDB excels at: a flexible schema can speed development and provide system flexibility. Most talks I’ve seen also cover MongoDB for new systems, where that’s all that’s used. How many of you would have to integrate MongoDB with an existing relational database? (stop to ask this question?) In fact, both a relational database and MongoDB can co-exist in the same environment. There are some simple ways to allow the two to easily work together, and in many ways the two complement each other. And for us, MongoDB is not just about software – it’s about saving lives. Many of the people in this room have probably been touched by the death of someone in their family, quite often from cancer. In my case, my father died of non-Hodgkin's lymphoma shortly after I went to work for Genentech, so I know the importance of speeding up the development of new drugs because… you never know when even a single day will make a major difference in someone’s life.
  • #9: In our case the flexible schema has helped us reduce the time needed to introduce new lab equipment from months to weeks, or even days. This reduced time is not entirely due to MongoDB, but MongoDB plays a key part in the improvement. As far as integrating MongoDB with our existing relational database environment, we did find a very simple way to integrate the two: not completely integrated, not a two-phase commit, but integrated “enough” AND simple. This allowed us to easily integrate MongoDB with the existing system and use existing tools geared towards relational databases, while still being able to take advantage of MongoDB’s flexible schema.
  • #10: This is an oversimplified view of drug development, but it illustrates the importance of mouse genetic models in many cases. Drug research begins with an idea – what is the cause of this disease? If the cause is related to genes we create new mouse genetic models, new genetic strains of mice, which are meant to reflect the underlying disease cause. This mouse genetic model is then used to verify the underlying disease cause.
  • #11: If verified – move on to trying to discover drugs to address the underlying genetic cause. They then test new drugs first on the genetically modified mouse, testing for safety and effectiveness.
  • #12: If safe and effective, only then will they move on to initial clinical trials with humans, although in many cases it’s back to the drawing board. As you can see, the mouse genetic model is an important part of disease research and drug discovery. And Increasingly we’re finding that the underlying genetic cause is much more complex than we thought
  • #13: Determining disease causes and developing drugs to address those diseases requires genetically engineered mice. We support around 500 investigators and in the area of 500 different genetic strains of mice, and new research requires that we develop in the area of 200 new genetic strains of mice per year. *** In most cases you can’t purchase a new genetic strain of an animal – a new mouse model. Creating a new mouse genetic strain requires genetic testing – LOTS of genetic testing, about 700,000 genetic tests per year for us. The entire process of developing new genetic animal strains is very complex: it requires breeding a number of generations of mice to obtain the desired genetic mutation. Today I’ll be covering only the step where we determine if a particular gene is present or not. A genetic test uses a plate of “amplified” DNA, with a well for each sample and genetic test. We run that DNA sample through one of a variety of lab instruments we use for genetic testing, then load those test results, usually a CSV file, into our database. Using these results the investigator can then decide which animals to breed. There are different types of tests and different lab instruments – new ones coming out all the time.
  • #14: This has driven demand within one of the departments that I support, the Genetic Analysis Lab. The demand for mouse genetic testing has increased both because there is the normal growth in research, and therefore in the number of samples to be tested, and because the growing complexity of sample testing is driving this even faster. We now test an average of two different genes instead of just one.
  • #15: In order to keep up with rising demand we needed to update the genotyping lab instruments. Originally we had just a 3730 genetic test. We loaded a file containing results for a plate; each file had results for one or more wells containing a genetic test.
  • #16: We added a new genetic test. From this test we would produce some of the same results information as for the original genetic test, but we needed to capture additional and different details for the new genetic test. So in our relational database we created a child table of PCR Wells, but we still generated the original PCR Well row since that was the integration point with the rest of the system. It took six months to integrate this new genetic test.
  • #17: We then added a second new genetic test, this one from the same instrument but generating additional data. This required another child table for the new data, and took an additional three months to implement. You can see where this is going… Every new instrument began to add new complexity AND, perhaps more important, it took too long. And the requirement to add new types of genetic tests was expected to increase, driven by the need to increase throughput in the lab in order to keep up with rising demand.
  • #18: To help address this we had undertaken a redesign of the system. As part of this new design we included a new DB design integrating our relational database with MongoDB. We were fortunate that a project for another department had required MongoDB; as a result our Oracle DBAs were comfortable with supporting mongod, making it easy for us to request a new MongoDB database. The key point was to isolate data which we expected to vary for different genetic tests into a new MongoDB document. For each different type of genetic test we planned to create an instrument-specific load process to: read the CSV file, parse that file into the MongoDB document, edit/validate/preprocess it, and save the preprocessed data in MongoDB.
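A rough mongo-shell sketch of that per-instrument load step (in practice this was a separate load program; the header names, sample rows, and the TestResults collection are illustrative assumptions):

// Illustrative CSV rows as they might come from an instrument
var csvLines = [
  "Well,Sample",
  "A01,308123",
  "A02,308124"
];

var header = csvLines[0].split(",");
var doc = { plate_wells: [] };

// Parse each data row into a sub-document keyed by the header names
csvLines.slice(1).forEach(function (line) {
  var values = line.split(",");
  var well = {};
  header.forEach(function (name, i) { well[name] = values[i]; });
  doc.plate_wells.push(well);
});

// Save the pre-processed plate as a single document
db.TestResults.insert(doc);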
  • #19: The next step in the process, a “Generalized Loader”, would then use certain commonly defined fields within the MongoDB document to load the relational database. Now, if we need to add a new genetic test, no time is spent modifying the database schema.
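And a sketch of the MongoDB half of that generalized step, assuming the commonly defined fields are Well and Sample as in the slide examples (the relational insert itself would be issued by the loader program, e.g. over JDBC, and appears here only as a comment; the pcr_well table name is illustrative):

// Read only the commonly defined fields the generalized loader needs
db.TestResults.find(
  {},
  { "plate_wells.Well": 1, "plate_wells.Sample": 1 }
).forEach(function (doc) {
  doc.plate_wells.forEach(function (well) {
    // The loader would issue the corresponding relational insert here, e.g.
    // INSERT INTO pcr_well (mongo_id, well, sample) VALUES (?, ?, ?)
    print(doc._id + " " + well.Well + " " + well.Sample);
  });
});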
  • #20: From a user perspective, this is how it appears. Most of the data displayed is coming from our relational database. But details within the results which come from MongoDB are combined with the relational DB data by a Java program and then displayed on the User Interface. Currently, the variable data is only needed when the genetic test results are initially being processed, though it will be available if needed. In the future we may perform further analysis on this data and we may also capture more data since that has become so easy - mainly because with MongoDB Flexible Schema we can do this without any programming effort.
  • #21: This is an actual example – before the new system was even done! The user was “nice” enough to give us an “opportunity” to test out the flexibility of our MongoDB schema. While we were in the middle of implementing the data loading for the first time, the user decided we should drop that genetic test and instead load a different, newer genetic test that was just coming online to replace the previous one.
  • #22: There was zero impact on our data model – all changes were in the MongoDB flexible schema, so no time was required to change the schema. The impact on the project was approximately three weeks, versus a previous history of three to six months. MongoDB's flexible schema was a big help in achieving this: it allowed us to use a new instrument without any changes to the data model.
  • #23: Luckily this was NOT what going live looked like. It wasn’t a circus. It might have been a circus if we hadn’t used MongoDB, though. The entire mouse breeding program, which this genetic testing is just a part of, is so important that we maintain a “Disaster Recovery Data Center” which keeps a running copy of the system, ready to take over if our main data center fails. Keeping a second copy of a database is a no-brainer in MongoDB – keeping one or two copies of the MongoDB collections is the default configuration for most production MongoDB systems. But if you’ve ever tried to do this with Oracle, the product we use, you may find it a much more difficult task. For example, when we went from Oracle 10g to Oracle 11g, somehow the defaults changed and our “disaster recovery” copy ended up being corrupted. Even scarier, we didn’t know this until a number of months later when we ran our yearly “disaster recovery” test and it failed. When we went live with MongoDB, though, we reminded our DBAs that we needed a copy of the production database at the backup site. Although they did already have a replica running, they hadn’t set one up in our disaster recovery data center. Luckily, because of MongoDB, they were able to set this up in less than an hour – something I wouldn’t have tried to do with Oracle.
  • #24: Next let’s talk about synchronizing our relational DB with MongoDB. How do we maintain consistency between the relational DB and MongoDB? Whenever you join two databases together you run into issues regarding keeping the two “synchronized”. Often this requires a complex two-phase commit or similar mechanism. In our case we always insert the complete MongoDB document first. The MongoDB document then contains a standard set of fields which are needed to define the genetic test results and are then used to load the relational database. But suppose there is a failure before the relational database insert is completed?
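A minimal sketch of that ordering (the relational insert happens in the application, not the mongo shell, so it is shown only as a comment; table and column names are illustrative):

// 1. Insert the complete genetic test document into MongoDB first
db.TestResults.insert({ plate_wells: [ /* parsed CSV wells */ ] });

// 2. Only then insert and commit the corresponding relational row, keyed by
//    the document's _id (done by the application, e.g. over JDBC):
//    INSERT INTO test_results (mongo_id, ...) VALUES ('<ObjectId>', ...); COMMIT;

// If step 2 fails, the only side effect is an orphan MongoDB document,
// which the relational database of record will simply never reference.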
  • #25: Net result: a MongoDB document left in the collection with no corresponding relational database table row. We considered a “quasi two-phase commit”: set the document “status” to “in progress”, insert and commit the relational database row, then set the document “status” to “committed”. But then we still had to deal with scripts that clean up after any failure, such as finding any MongoDB documents with a status of “in progress” and either setting the status to “committed” if there is a corresponding relational database row, or deleting the document if there wasn’t. But the question was: why bother? Who cares if there is an “extra” MongoDB document? If we just look at those which have an ID in the relational database, we’ll never see extra MongoDB documents. Our simple solution makes the relational database the DB of record and lets it handle the transaction management, something it does quite well. If an ID isn’t in the relational database, it doesn’t exist, as far as we’re concerned. If we ever begin to go against MongoDB directly, we can write a simple “clean up” script to delete any orphan documents. But for now, we just ignore them: it doesn’t cause a problem, it won’t happen often, and it’s not as if we have to worry about the MongoDB size. The main objective is keeping it simple.
  • #26: If at some future date we did go directly against MongoDB and needed to clean up the “orphan” MongoDB documents, there are various ways we could handle this. Here’s just one example of how it might be done; there are many other simple ways to do this, though. In this case we simply need to mark those documents where we do have a corresponding relational database row, and with a single delete command we can delete whatever doesn’t have a row in the relational database. The point is that there are a number of simple ways to correct this problem, in the rare case that it even happens.
  • #27: Now that we’re live we’re realizing that **IF** we could easily do so, it might be nice to load some additional data that is available from the instrument. In the past we avoided this because we’d have to add new columns to the Relational Database schema. But many lab instruments often allow users to specify additional data elements they want in the CSV file we use to load the results.
  • #28: The current CSV load program always looks for a known set of fields to load into the MongoDB document; these fields must be at the beginning of each row. But in fact our load program will also load any other fields added onto the end of each row.
  • #29: As long as the beginning of each “row” of the CSV File is what we expect, we can parse and save any additional comma separated values into the MongoDB document without programming changes. Here again the MongoDB flexible schema allows us to do things which would otherwise be difficult to support in a relational database.
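To make that concrete, a minimal sketch of a header-driven parse picking up a trailing column with no code change (the New1 column and value follow the slide example; the other names are illustrative):

// The instrument CSV now carries an extra trailing column, "New1"
var header = "Well,Sample,New1".split(",");
var row = "A01,308123,New Value".split(",");

// The same header-driven loop stores whatever columns are present
var well = {};
header.forEach(function (name, i) { well[name] = row[i]; });
// well is now { Well: "A01", Sample: "308123", New1: "New Value" }
// and is stored as-is; MongoDB needs no schema change for the extra field
db.TestResults.insert({ plate_wells: [ well ] });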
  • #30: You often can’t tell how useful the data might be until you collect it and examine it. With MongoDB’s flexible schema it becomes very easy to collect this additional data at low or no cost, providing the luxury of collecting much more than you might otherwise. So why not collect as much as you can? It’s inexpensive and it’s easy.
  • #31: As a result we may one day start to analyze some of that additional data – accessing detailed, instrument-specific lab data which would otherwise be difficult to obtain. You never know what you’ll find, or how the information can be used to improve the process. Improve accuracy? Spot problems before they occur? Who knows what else… Until you capture the data and take a look at it, you never know what you’ll find. And with MongoDB you lower the barrier so much that it becomes easy to collect all the data you’d ever want.
  • #32: This is made even easier by new Aggregation Framework capabilities which have removed some of the previous resource limitations of the framework. If you want to find out more about the MongoDB Aggregation Framework, including major revisions in the April MongoDB release, 2.6 – which removes the 16MB limit for aggregation pipeline results and provides the option to remove limits on intermediate result set sizes, allowing you to save intermediate results on disk – Chapter 6 of the soon-to-be-released 2nd Edition of MongoDB in Action will cover this. Please use code mdbdgcf for 44% off MongoDB in Action, 2e (all formats) for all attendees. Please also give away a free MEAP and send us the winner's name and email. The book is scheduled to be released this summer, but Manning has an early access program which will allow you to read the chapter when it is completed – which should be soon.
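For example, a minimal aggregation over the plate documents (field names follow the earlier slide examples and are illustrative; allowDiskUse is the 2.6 option that lets large intermediate results spill to disk):

// Count how many stored results exist for each well position across plates
db.TestResults.aggregate(
  [
    { $unwind: "$plate_wells" },
    { $group: { _id: "$plate_wells.Well", results: { $sum: 1 } } },
    { $sort: { results: -1 } }
  ],
  { allowDiskUse: true }
)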
  • #33: There are other future possibilities for MongoDB in our department, Bioinformatics, as well. While conducting an internal review of this project (BTSC), the possibilities enabled by MongoDB flexible schema started others thinking about additional ways we could leverage it. One idea was to use MongoDB to help in dealing with different formats of data arriving from a variety of sources. (actually Jan’s idea) If nothing else, MongoDB could provide a common and flexible access method for programs which need to process these data. It could also provide a common place to first store and then curate the data, if we need to do any preprocessing or validation We could then use the results to load a Relational Database, or even process it directly from MongoDB with either the aggregation framework or other languages which have MongoDB adapters, such as R. MongoDB’s flexible schema as well as easy access makes it a natural tool for this use.
  • #34: So – as you can see, MongoDB is not just about big data: the flexible schema can speed development and provide system flexibility. In our case, just for the genetic testing system, we’ve reduced the time to introduce some new lab equipment from months to weeks, and we can actually capture some new instrument data without any programming changes. And again, MongoDB is not just for new systems where you don’t need to integrate with an existing relational database: we found a very simple way to integrate the two – not completely integrated, but integrated “enough”, “eventually” as consistent as needed. And you never know when a single day will make a big difference in someone’s life.
  • #35: As we’ve seen, MongoDB does help us integrate new genetic tests faster, which in turn can help reduce drug development time. In closing I wanted to share a personal story, one that helps motivate me to do things faster. “You better do it fast” was the punch line from the last joke my father ever made. He died of cancer shortly after I went to work for Genentech. A few weeks before he died I had told him that I was joining this great company, Genentech, and that we were researching cures for cancer. He smiled, laughed and said “You better do it fast” With the help of MongoDB we’ve reduced the time needed to introduce new genetic tests. And you never know when even a single day will make a major difference in someone’s life.