RTI Press Occasional Paper
March 2014
doi:10.3768/rtipress.2014.op.0019.1403
www.rti.org/rtipress
Verifying Data Migration Correctness
the sixth record would be truncated to be "1326 Humuhumunukunuk" in certain database platforms, resulting in the loss of address information. Likewise, the definition of the Height field is decimal with 2-digit precision in the source table. If the Height field is erroneously set up as an integer field or as a decimal with 1-digit precision, the target table will round all numbers received in this field up or down, making the numbers unusable. In both scenarios, data loss during migration results in inaccurate and unusable data.

In addition, errors can be caused by differences between the source and target databases in primary key or constraint definitions. Such a difference usually causes some records in the source table to be rejected from the target table. For example, before migrating data from Table 1, if the target table has the ID field as its primary key and the table already has a record with value 5 in this field, trying to insert record number 5 of Table 1 into the target table would cause a key violation, and the record would not be migrated correctly.

Because of these potential errors, verifying the correctness of the data is an important component of the data migration process. Next, we discuss a few popular methods that are currently used to verify whether data have been migrated correctly, along with some of the limitations of each.

Current Methods for Detecting Data Migration Errors

Once the migration is complete, most data migration systems provide only a simple report stating the number of records that have been migrated. Although this record count provides an important clue as to any records that were rejected during the migration, it does not provide any information about problems, nor does it ensure that the data have been transferred correctly. Further verification of the data content is needed.

The first and most obvious way to verify data correctness is by manual inspection and ad hoc queries. Programmers can examine the data visually to verify that the content in the target database matches that of the source database. Programmers can also run ad hoc queries to check the quality of the transfer. While these methods work well for a small amount of data, using them in large-scale data migration operations is prohibitively expensive in terms of time and effort.

Programmers can also perform post-migration testing6 or validation testing.7,8 For these tests, programmers implement a series of predefined queries that examine the content of the source and target databases. The result is a summary report that provides discrepancy information once the data migration operation is complete. In a previous work,5[p7] we illustrated one example of validation testing as follows:

• For each field with "numeric" data type, the validation test compares maximum, minimum, average, and summary (indicating) values between the two databases. For example, before migrating the data from Table 1, we can calculate these indicating values from the Weight field in the source table. After the migration, we can perform the same calculation from the Weight field of the target table. If the results of these two calculations do not match, errors have occurred during the migration.
• For each field with "string" data type, the validation test compares string length between the two databases. For example, before migration, we can make note of the total string length of the Address field of all records and compare it with the same measure after the migration. If the numbers do not match, migration errors have occurred.
• For each field with "date/time" data type, the validation converts the values into a numeric representation and compares the databases using the method applied to "numeric" fields.

In any of these comparisons, if the values in the target database do not match those of the source database, errors must have occurred during the migration process. While this method provides important information on the success of the migration, it does not check the data at the record level. If programmers detect errors, they still do not know which particular records have been migrated incorrectly. Therefore, they may have to repeat the entire migration process. Even so, repeating the process does not guarantee success, because the factors that caused errors in the first migration may come into play during the second, or other factors may cause additional problems.

4 Wei and Chen, 2014 RTI Press

Reconciliation testing examines data correctness at the record level.9-11 This testing method retrieves and compares the content of each record in the source and target databases, making it the most comprehensive way to verify whether data were migrated correctly. Because this method performs correctness verification at the record level, it will normally pinpoint the exact records that are in error. This is an improvement over the validation testing method mentioned above, but reconciliation testing has its own drawbacks:

• Depending on the data volume, reconciliation testing may be very time-consuming.
• If the source database is located remotely, this test may become difficult to perform, and the testing itself may be error-prone.
• If the primary key definitions are inconsistent between the source and target tables, or if the source table does not have a primary key, matching source and target records exactly would be impossible.

Because of the drawbacks of the methods discussed so far, we have implemented a new data migration correctness verification method using the checksum methodology. This method verifies data content at the record level, so it is more comprehensive than validation testing. It also does not require the system to retrieve the source data again, so it avoids the weaknesses of reconciliation testing. Our method is novel in that it integrates the well-known checksum methodology used in network data transmission error detection into data migration correctness verification. Before we provide additional detail on how to implement this method, we first briefly review the checksum methodology and how it is used in network data transmissions.

Checksum Methodology in Network Transmission Error Detection

The parity bit check and its extension, the checksum, are popular methods of detecting network data transmission errors. There are variations on how to implement a checksum. Bolton12 describes this methodology as follows:

    The movement of digital data from one location to another can result in transmission errors. For example, the transmitted sequence 1001 may be incorrectly received as 1101. In order to detect such errors a parity bit is often used. A parity bit is an extra 0 or 1 bit attached to a code group at transmission. In the even parity method the value of the bit is chosen so that the total number of 1s in the code group, including the parity bit, is an even number. For example, in transmitting 1001 the parity bit used would be 0 to give 01001, and thus an even number of 1s. In transmitting 1101 the parity bit used would be 1 to give 11101, and thus an even number of 1s. With odd parity the parity bit is chosen so that the total number of 1s, including the parity bit, is odd. Thus if at the receiver the number of 1s in a code group does not give the required parity, the receiver will know that there is an error and can request that the code group be retransmitted.

    An extension of the parity check is the checksum in which a block of code may be checked by sending a series of bits representing their binary sum.

For example, for each binary packet sent, an 8-bit checksum can be attached. The checksum bits can hold 2⁸ = 256 different values, so the checksum's value ranges from 0 to 255. If the value of the packet is less than 256, the checksum is identical to the packet value; otherwise, the checksum value is the remainder of the packet value divided by 256.

In summary, the checksum methodology derives extra content from the original information, attaches this content to the original information, and transmits the combined information to the receiving end. Once the information is received, the receiver re-derives the checksum content and compares it with the checksum received. If the two checksum values have discrepancies, errors have occurred. In the next section, we discuss how to apply the checksum methodology to verify whether data have migrated correctly.
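The parity and checksum behavior just described can be sketched in a few lines of Python. The function names are illustrative, not from any standard library:

```python
def even_parity_bit(bits: str) -> str:
    """Return the parity bit that makes the total number of 1s even."""
    return "1" if bits.count("1") % 2 == 1 else "0"

def attach_parity(bits: str) -> str:
    """Prepend the even-parity bit, as in the 1001 -> 01001 example."""
    return even_parity_bit(bits) + bits

def parity_ok(code_group: str) -> bool:
    """Receiver-side check: the 1s count, parity bit included, must be even."""
    return code_group.count("1") % 2 == 0

def checksum8(value: int) -> int:
    """8-bit checksum: the remainder of the value divided by 256."""
    return value % 256

# The sender attaches parity; an undamaged group passes the receiver's
# check, while a group with one flipped bit fails it.
sent = attach_parity("1001")
assert parity_ok(sent)
corrupted = "01101"  # one bit flipped in transit
assert not parity_ok(corrupted)
```

Note that parity detects any single-bit error but, like the checksum, cannot detect every multi-bit error, which is the same limitation discussed later for the migration setting.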
Table 2. Employees source table to be migrated using the checksum verification method
ID Name Address … Checksum Hash Code
1 John 1321 IPhone Road … 652adc7cc073fe69b94a48b3704e9f12
2 David 1322 Kindle street … 745f6845ff3ebf33ab618aae567f3926
3 Matthew 1323 IPad Road … 3d88779f639bb5e2bc32441009e7bb00
4 Mark 1324 Galaxy Lane … a88798f5a9025645020026f11c35c93f
5 Luke 1325 Pink Drive … 131ac65ce64cacbd3168afed2a504a30
6 Skip 1326 Humuhumunukunukuapuaa Avenue … 17c126a5e3c06f20a8f36a3f1703778c
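The hash codes in Table 2 are 32 hexadecimal characters, which is consistent with an MD5 digest. The sketch below assumes MD5 over a simple pipe-joined serialization of a record's fields; the exact field list and serialization behind the table's values are not specified here, so the digests it produces are illustrative rather than reproductions of Table 2's column:

```python
import hashlib

def record_checksum(fields):
    """Derive a per-record hash code like the Checksum Hash Code column.
    Both sides must agree on the serialization; '|' is an arbitrary choice."""
    serialized = "|".join(str(f) for f in fields)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()

row = (1, "John", "1321 IPhone Road")
digest = record_checksum(row)
assert len(digest) == 32               # 32 hex characters, as in Table 2
assert digest == record_checksum(row)  # identical content, identical hash
# Any change to the record content alters the hash, which is what the
# target side exploits to detect migration errors.
assert digest != record_checksum((1, "John", "1321 1Phone Road"))
```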
programmers do need to be mindful of its limitations and drawbacks. First, checksum verification increases the data volume being transferred because of the extra checksum field in each table. However, with improving technologies in data communications and with storage becoming less expensive, this concern becomes less of an issue almost daily.

Second, if data migration errors occur in transmitting the checksum values but not in transmitting the original record content, the checksum method may actually cause false alarms.

Third, the checksum method can only detect errors in the records that are received in the target table. If records are rejected because of differences in primary key and constraint definitions between the source and target tables, the checksum method is not effective. Fortunately, errors caused by records being rejected are easy to detect. A simple record-count comparison before and after the migration, which almost all data migration tools perform, is sufficient to detect this sort of error, overcoming the method's biggest shortcoming.

Fourth, the checksum method described here can leave errors undetected, just as the parity bit and checksum methods in network data transmission can. This is because converting the content of a record to a hash code is not a 1-to-1 relationship. Although the possibility is extremely low, errors could alter the record content while the resulting hash code remains identical to that of the original record. In such a scenario, the errors will escape the detection of the checksum method.

However, these potential problems should rarely occur, because our proposed method is based on the well-tested and commonly used checksum methodology; its limitations are simply inherited from that methodology. Given the success of the checksum methodology, and the multiple advantages of the checksum data migration verification method as illustrated in our examples, we believe this method provides an easy and effective way to verify the correctness of a data migration operation. Implementing the checksum correctness verification method when transmitting data from one database to another can greatly enhance the quality of a data migration outcome.

Future Work

Much work has been done not only on detecting errors with the parity bit and checksum methods in network data transmission, but also on correcting those errors. Further research and exploration are needed to expand the checksum correctness verification method described here to detect and correct errors more effectively and to overcome the slight possibility that some errors pass undetected. Further research into the error-missing rate will be helpful in establishing the tolerance level when applying this method.
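The complementary checks described above can be sketched as follows: a record-count comparison catches rejected records, and per-record checksum re-derivation on the target side pinpoints corrupted ones. The serialization, the MD5 choice, and the row layout are illustrative assumptions, not the paper's specification:

```python
import hashlib

def md5_of(fields):
    """Re-derive a record's checksum from its field values."""
    return hashlib.md5("|".join(map(str, fields)).encode("utf-8")).hexdigest()

def verify_migration(source_rows, target_rows):
    """source_rows: list of (id, fields); target_rows: list of
    (id, fields, migrated_checksum). Returns the record-count check
    plus the ids of records whose re-derived checksum disagrees."""
    count_ok = len(source_rows) == len(target_rows)
    mismatched_ids = [rid for rid, fields, checksum in target_rows
                      if md5_of(fields) != checksum]
    return count_ok, mismatched_ids

# Record 2's address is corrupted in transit, so its stored checksum
# (derived from the original content) no longer matches.
source = [(1, ("John", "1321 IPhone Road")),
          (2, ("David", "1322 Kindle street"))]
target = [(1, ("John", "1321 IPhone Road"), md5_of(("John", "1321 IPhone Road"))),
          (2, ("David", "1322 Kindle st"), md5_of(("David", "1322 Kindle street")))]
count_ok, bad = verify_migration(source, target)
assert count_ok and bad == [2]
```

Unlike reconciliation testing, this check never re-reads the source database: everything it needs travels with the migrated rows.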
References

1. Kimball R, Caserta J. The data warehouse ETL toolkit: practical techniques for extracting, cleaning, conforming, and delivering data. Indianapolis (IN): Wiley Publishing; 2004.
2. Morris J. Practical data migration. 2nd ed. Swindon (UK): BCS Learning & Development, Ltd.; 2012.
3. A roadmap to data migration success: approaching the unique issues of data migration [Internet]. San Jose (CA): SAP Business Objects; 2008. Available from: http://fm.sap.com/data/upload/files/A_Road_Map_to_Data_Migration_Success_2010.3.17-17.29.55.pdf
4. White C. Data communications and computer networks: a business user's approach. Boston: Course Technology; 2011.
5. Wei B, Chen TX. Criteria for evaluating general database migration tools (RTI Press publication No. OP-0009-1210). Research Triangle Park (NC): RTI Press; 2012. Available from: http://www.rti.org/publications/rtipress.cfm?pubid=20204
6. Katzoff D. How to implement an effective data migration testing strategy [Internet]. Data Migration Pro; 2009. Available from: http://www.datamigrationpro.com/data-migration-articles/2009/11/30/how-to-implement-an-effective-data-migration-testing-strateg.html
7. Singh I. How to do database migration testing effectively and quickly? [Software testing blog on the Internet]. Delhi: Inder P. Singh; 2010. Available from: http://inderpsingh.blogspot.com/2010/03/how-to-do-database-migration-testing.html
8. Paygude P, Devale P. Automated data validation testing tool for data migration quality assurance. Int J Mod Eng Res (IJMER). 2013 Jan-Feb;3(1):599-603.
9. Matthes F, Schulz C, Haller K. Testing & quality assurance in data migration projects. In: Proceedings of the IEEE 27th International Conference on Software Maintenance (ICSM 2011); 2011 Sep 25-30; Williamsburg (VA).
10. Haller K. Towards the industrialization of data migration: concepts and patterns for standard software implementation projects. In: Eck P, Gordijn J, Wieringa R, editors. 21st International Conference on Advanced Information Systems Engineering (CAiSE) Proceedings; 2009 Jun 8-12; Amsterdam, The Netherlands. p. 63-78.
11. Manjunath T, Hegadi R, Mohan H. Automated data validation for data migration security. Int J Computer App. 2011 Sep;30(6):41-6.
12. Bolton W. Mechatronics: electronic control systems in mechanical and electrical engineering. 3rd ed. Longman (NY): Prentice Hall; 2004.
13. Mironov I. Hash functions: theory, attacks, and applications. Mountain View (CA): Microsoft Research; 2005.
Acknowledgments
We would like to thank Craig Hollingsworth and Joanne Studders of RTI
International for making substantial editorial contributions to this paper.
We would also like to thank our colleagues at NOAA and RTI’s Research
Computing Division for helping us shape and improve the idea of the method
discussed in this paper.