The data life cycle encompasses stages from data creation and acquisition to disposal, including data use, modification, archiving, and repurposing. Key considerations during these stages involve data quality, privacy, and the efficient management of data storage and retrieval. Ultimately, data are either disposed of intentionally or lost accidentally, although legal requirements can force data to be retained even when they hold no current value.
Data Life Cycle
The overall process from data creation to disposal is normally referred to as the data life cycle. The various stages in the data life cycle are:
Data Creation and Acquisition
Data Use
Data Modification
Data Archiving
Data Repurposing
Data Disposal
Data Creation and Acquisition
The process of data creation and acquisition is a function of the source and type of data. In the pharmacogenomic laboratory, data are generated by sequencing machines and microarrays in the molecular biology laboratory, and by clinicians and clinical studies in the clinic or hospital.
The major issues in the data-creation phase of the data life cycle include tool selection, data format, standards, version control, error rate, precision, and accuracy. In particular, metrics such as error rate, precision, and accuracy are more easily ascribed to machine-generated data, whether from clinical laboratory studies or microarray analysis.
Figure: Data life cycle of a pharmacogenomic laboratory. Key steps in the process include data creation and acquisition, use, modification, repurposing, and the end game: archiving and disposal.
Depending on the difficulty in creating the data and the intended use, the creation process may be trivial and inexpensive or extremely complicated and costly. For example, recruiting test subjects to donate tissue biopsies is generally more expensive and difficult than identifying patients who are willing to provide less invasive (and less painful) tissue samples.
Data generated by clinicians and clinical studies are entered through the use of manual transcription, voice-recognition data-input systems, or desktop or handheld computers.
There is significant variation in the subjective interpretation of clinical studies. For example, five seasoned radiologists will typically provide five different interpretations of the same chest film or other radiographic study. In addition to the quality of the initial clinical observation, there are errors introduced by the hardware, software, and processes involved in capturing data, from keyboard and mouse to optical character recognition and voice recognition.
Data Use
Once clinical and genomic data are captured, they can be put to a variety of immediate uses, from simulation, statistical analysis, and visualization to communications. Issues at this stage of the data life cycle include intellectual property rights, privacy, and distribution. For example, unless patients have expressly given permission to have their names used, microarray data should be identified by ID number through a system that maintains the anonymity of the donor.
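One way to picture the ID-number scheme is the minimal Python sketch below. It is an illustration only, not the method used by any particular laboratory: the class DonorRegistry, the function deidentify, and the field names patient_name and donor_id are all hypothetical. The name-to-ID table is assumed to be kept under access control and never distributed with the data.

```python
import secrets


class DonorRegistry:
    """Maps donor names to opaque ID numbers (hypothetical helper)."""

    def __init__(self):
        # Private lookup table; in practice this would live in a
        # protected store, separate from the research data.
        self._id_by_name = {}

    def donor_id(self, name):
        # Reuse an existing ID so that all samples from one donor stay linked.
        if name not in self._id_by_name:
            self._id_by_name[name] = "D" + secrets.token_hex(8)
        return self._id_by_name[name]


def deidentify(record, registry):
    """Return a copy of the record with the name replaced by a donor ID."""
    cleaned = {k: v for k, v in record.items() if k != "patient_name"}
    cleaned["donor_id"] = registry.donor_id(record["patient_name"])
    return cleaned


registry = DonorRegistry()
sample = {"patient_name": "Jane Doe", "chip": "array-17", "intensity": 0.82}
shared = deidentify(sample, registry)  # safe to pass on: no name inside
```

Only the record in shared would be distributed; the registry stays with whoever is authorized to re-identify donors.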
Data Modification
Data are rarely used in their raw form, without some amount of editing. The data dictionary is one means of modifying data in a controlled way that ensures standards are followed. A data dictionary can be used to tag all microarray data with time and date information in a standard format so that they can be automatically correlated with clinical findings.
Fig: Relationship between clinical data and microarray data
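The time-and-date tagging described above can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary entry, the list of input formats, the field names donor_id and collected_at, and the choice to match records by donor and calendar day are all hypothetical, chosen only to show how a single agreed format makes automatic correlation possible.

```python
from datetime import datetime

# Hypothetical data dictionary entry: one agreed field and one agreed format.
DATA_DICTIONARY = {
    "collected_at": {
        "format": "%Y-%m-%dT%H:%M:%S",
        "description": "time the sample or observation was collected",
    },
}

# Formats the source systems are assumed to emit (illustrative only).
KNOWN_INPUT_FORMATS = ["%m/%d/%Y %H:%M", "%Y%m%d %H%M%S", "%Y-%m-%dT%H:%M:%S"]


def standardize_timestamp(raw):
    """Parse a timestamp in any known input format; emit the dictionary format."""
    for fmt in KNOWN_INPUT_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        return parsed.strftime(DATA_DICTIONARY["collected_at"]["format"])
    raise ValueError(f"unrecognized timestamp: {raw!r}")


def correlate(microarray_rows, clinical_rows):
    """Pair microarray and clinical rows sharing a donor ID and collection date."""
    clinical_by_key = {}
    for row in clinical_rows:
        day = standardize_timestamp(row["collected_at"])[:10]  # YYYY-MM-DD
        clinical_by_key.setdefault((row["donor_id"], day), []).append(row)
    pairs = []
    for row in microarray_rows:
        day = standardize_timestamp(row["collected_at"])[:10]
        for match in clinical_by_key.get((row["donor_id"], day), []):
            pairs.append((row, match))
    return pairs
```

Because every timestamp passes through the same dictionary format before comparison, records from systems that write dates differently can still be matched without manual editing.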
Data Archiving
Archiving is concerned with making data available for future use (backup). An archive is a container for data that are infrequently accessed, with the focus more on long-term data storage than on access speed. In the archiving process, which can range from making a backup of a local database on a CD-ROM or Zip® disk to creating a backup of an entire EMR system in a large hospital, data are named, indexed, and filed in a way that facilitates identification later.
One of the primary determinants of archive capacity is the storage media: the physical material used to form a tape, disk, or cartridge. In addition to capacity, media can be characterized in terms of properties such as compatibility, speed, data density, cost, volatility, and durability.
Compatibility is the ability of media to function within a particular software and hardware environment.
Speed is a multifaceted performance characteristic that encompasses both the time to locate data (seek time) and the time to write data to or download them from the media (data transfer rate). Seek time may be several hundred milliseconds for a CD-ROM, a few milliseconds for a hard drive, and a few microseconds for a flash memory card.
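The two components of speed combine into a rough retrieval time: seek time plus file size divided by transfer rate. The Python sketch below illustrates this under loud assumptions: the seek times follow the orders of magnitude quoted above, but the transfer rates are placeholder values invented for the example, not measurements of any real device.

```python
def access_time_s(size_bytes, seek_time_s, transfer_rate_bytes_per_s):
    """Time to retrieve one contiguous file: locate it, then transfer it."""
    return seek_time_s + size_bytes / transfer_rate_bytes_per_s


# Seek times match the magnitudes in the text; rates are assumed placeholders.
MEDIA = {
    "CD-ROM": {"seek_s": 300e-3, "rate": 5e6},
    "hard drive": {"seek_s": 5e-3, "rate": 50e6},
    "flash memory card": {"seek_s": 5e-6, "rate": 20e6},
}


def compare(size_bytes):
    """Return the estimated retrieval time in seconds for each medium."""
    return {
        name: access_time_s(size_bytes, m["seek_s"], m["rate"])
        for name, m in MEDIA.items()
    }


small = compare(1_000)          # seek time dominates for small files
large = compare(1_000_000_000)  # transfer rate dominates for large files
```

With these assumed figures, the medium with the shortest seek time is quickest for the small file, while the medium with the highest transfer rate is quickest for the large one, which is why speed cannot be reduced to a single number.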
Capacity, the maximum amount of data the media can store, is a function of the media construction. Capacity is also a function of data density.
Cost is a function of the raw materials involved in the creation of media.
Volatility, a characteristic normally ascribed to solid-state memory, refers to the status of the data when external power is removed. Flash memory, like magnetic disk or tape, is considered relatively non-volatile, and can hold data for years without loss.
Durability refers to the physical properties of the media that contribute to the longevity of the surface, mechanisms, and housing, if any, during normal use. For example, the bearings and other components in the rotational system of a hard drive undergo wear and tear over time.
Archives vary considerably in configuration and in proximity to the source data. For example, servers typically employ several independent hard drives configured as a Redundant Array of Independent Disks (RAID system) that function in part as an integrated archival system. RAID systems derive their speed from reading and writing to multiple disks in parallel. There are seven levels of RAID; level 3 is most applicable to bioinformatics computing.
In RAID-3, a disk is dedicated to storing a parity bit, an extra bit used to determine the accuracy of data transfer, for error detection and correction. If analysis of the parity bit indicates an error, the faulty disk can be identified and replaced. The data can be reconstructed by using the remaining disks and the parity disk.
For example, in the figure, disks A-D are dedicated to data and disk P is used to store the parity bit. In this case, an odd number of "1" bits corresponds to a high (1) parity bit. When data are written in parallel to the data disks, the corresponding parity bit is stored on the parity disk. Immediately after the data are written to the data disks, the data are read back and the parity bits are compared. The change noted in the figure is typical of a case in which there is an error on one disk. The error on disk "C" can be repaired or, if groups of errors are suddenly becoming apparent, indicating imminent disk failure, the entire disk can be replaced.
Fig: RAID-3. Data disks are read and written to in parallel, providing speed, while a dedicated parity disk provides increased reliability through error detection and correction. In the example, an error in disk C is detected by a different parity bit (P), indicating that the data read from disks A-D don't agree with what was written to the disks. Although the parity bit is usually based on a comparison of bytes on the data disks, bits (0 or 1) are used here for clarity.
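The parity check described above can be sketched in a few lines of Python. This is a simplified illustration that works on single bits, as the figure does, and not a model of a real RAID controller; the function names and the dictionary layout are hypothetical. Note that a parity mismatch by itself only reveals that some disk disagrees; the sketch assumes the faulty disk has been identified by other means before reconstruction.

```python
from functools import reduce
from operator import xor


def parity_bit(bits):
    """Return 1 when the bits contain an odd number of 1s, else 0."""
    return reduce(xor, bits, 0)


def write_stripe(data_bits):
    """Simulate a parallel write: data bits to disks A-D, parity to disk P."""
    return {"data": list(data_bits), "parity": parity_bit(data_bits)}


def verify_stripe(stripe):
    """Re-read the data disks and compare recomputed parity with disk P."""
    return parity_bit(stripe["data"]) == stripe["parity"]


def reconstruct(stripe, failed_index):
    """Rebuild the bit of a disk already known to be faulty."""
    survivors = [b for i, b in enumerate(stripe["data"]) if i != failed_index]
    # XOR of the surviving data bits and the parity bit yields the lost bit.
    return parity_bit(survivors + [stripe["parity"]])


stripe = write_stripe([1, 0, 1, 1])    # disks A, B, C, D
stripe["data"][2] ^= 1                 # simulate a read error on disk C
error_found = not verify_stripe(stripe)
stripe["data"][2] = reconstruct(stripe, failed_index=2)
```

After the last line the stripe again holds the bits that were originally written, recovered from disks A, B, D, and P alone.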
Data Repurposing
One of the major benefits of having data readily available in an archive is the ability to repurpose them for a variety of uses. For example, linear sequence data originally captured to discover new genes are commonly repurposed to support the 3D visualization of protein structures.
The major issue in repurposing data is the ability to efficiently locate data in archives. The difficulty of locating data once they have been incorporated into a storage system depends on the volume of data involved.
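One common way to keep lookups manageable as volume grows is to maintain an index when data are filed, so that a search consults the index instead of scanning every archived item. The Python sketch below shows the idea with an in-memory tag index; the class ArchiveIndex, the entry names, the storage locations, and the tags are all hypothetical, and a production archive would more likely rely on a database.

```python
from collections import defaultdict


class ArchiveIndex:
    """Tag-based index over archived items (illustrative sketch)."""

    def __init__(self):
        self._locations = {}                   # entry name -> storage location
        self._names_by_tag = defaultdict(set)  # tag -> names of tagged entries

    def file(self, name, location, tags):
        """Record where an entry is stored and which tags describe it."""
        self._locations[name] = location
        for tag in tags:
            self._names_by_tag[tag.lower()].add(name)

    def find(self, *tags):
        """Return {name: location} for entries carrying every requested tag."""
        if not tags:
            return {}
        tagged = [self._names_by_tag.get(t.lower(), set()) for t in tags]
        names = set.intersection(*tagged)
        return {n: self._locations[n] for n in sorted(names)}


index = ArchiveIndex()
index.file("seq_run_0421", "tape-07/slot-3", ["sequence", "Homo sapiens"])
index.file("array_0099", "raid-1/vol-2", ["microarray", "Homo sapiens"])
hits = index.find("sequence", "homo sapiens")
```

The lookup cost then depends on the number of matching entries and not on the total size of the archive, which is the property that makes later repurposing practical.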
Data Disposal
All data die, either because they are intentionally disposed of when their value has decreased to the point that it is less than the cost of maintaining them, or because of accidental loss. Often, data have to be archived for legal reasons, even though the data are of no value to the institution or researcher. For example, most official hospital or clinic patient records must be maintained for the life of the patient.