IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
Volume: 05 Issue: 03 | Mar-2016, Available @ http://www.ijret.org
A STUDY AND SURVEY ON VARIOUS PROGRESSIVE DUPLICATE
DETECTION MECHANISMS
Ashwini V. Lakote¹, Lithin K²
¹M.Tech, Computer Science and Engineering, REVA Institute of Technology and Management, Bangalore, India
ashwini.lakote@gmail.com
²Dept. of Computer Science and Engineering, REVA Institute of Technology and Management, Bangalore, India
lithinkumble@gmail.com
Abstract
Duplicate detection is a serious problem in many applications, such as personal data management, customer relationship management, and data mining. This survey covers duplicate record detection techniques for both small and large datasets. To detect duplicates with shorter execution times and without disturbing dataset quality, methods such as Progressive Blocking and the Progressive Sorted Neighborhood Method are used. The Progressive Sorted Neighborhood Method (PSNM) finds duplicates progressively, reporting matches as they are discovered. The Progressive Blocking (PB) algorithm targets large datasets, where finding duplicates requires immense time. These algorithms enhance the duplicate detection system: efficiency can be roughly doubled over conventional duplicate detection methods. Several methods of data analysis and various approaches to duplicate detection are studied here.
Index Terms: Duplicate Detection, Progressive Deduplication, PSNM, Data Mining
---------------------------------------------------------------------***---------------------------------------------------------------------
1. INTRODUCTION
A. Data Mining
Data mining is also known as knowledge discovery in databases (KDD) [1][2]. The concept of data mining evolved from several research areas, including statistics, database systems, machine learning, neural networks, visualization, and rough sets [3][4]. Both traditional and newer domains, such as business and sports, use data mining concepts. Companies use data mining to translate raw data into valuable information: by learning about their customers and developing efficient marketing policies, businesses can increase sales and reduce costs. Efficient data collection, warehousing, and computer processing have all influenced data mining [5]. Data is the most essential asset of any company, but when data is changed or a bad data entry is made, errors such as duplicate records arise [6].
B. Duplicate Detection Problems
Duplicate detection refers to the process of recognizing different representations of the same real-world object in an information source [7][8]. Qualities of duplicate detection such as effectiveness and scalability cannot be ignored because of the size of the databases involved [9].
The problem of duplicate detection has two characteristic features:
• Different representations of the same entity are usually not identical; they differ through misspellings, missing values, changed addresses, etc., which makes detecting duplicates very difficult.
• Detecting duplicates is very costly because, in principle, all possible duplicate pairs must be compared (see the sketch below).
The progressive duplicate detection algorithms are as follows:
• PSNM, the Progressive Sorted Neighborhood Method, which works over small and clean datasets.
• PB, Progressive Blocking, which works over large and unclean datasets.
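To make the cost argument concrete, here is a minimal Python sketch of exhaustive pairwise duplicate detection; the record layout, the average-similarity matcher, and the 0.9 threshold are illustrative assumptions rather than part of any surveyed algorithm. With n records it performs n(n-1)/2 comparisons, which is exactly the work the progressive methods try to avoid spending up front.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity between two field values (1.0 = equal)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_duplicate(r1: dict, r2: dict, threshold: float = 0.9) -> bool:
    """Assumed matcher: average field similarity must exceed the threshold."""
    fields = r1.keys() & r2.keys()
    avg = sum(similarity(str(r1[f]), str(r2[f])) for f in fields) / len(fields)
    return avg >= threshold

records = [
    {"name": "John Smith", "city": "Bangalore"},
    {"name": "Jon Smith",  "city": "Bangalore"},   # misspelled duplicate
    {"name": "Mary Jones", "city": "Mumbai"},
]

# Exhaustive detection: n*(n-1)/2 comparisons -- infeasible for large n.
duplicates = [(i, j)
              for (i, r1), (j, r2) in combinations(enumerate(records), 2)
              if is_duplicate(r1, r2)]
print(duplicates)  # [(0, 1)]
```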
Figure 1: System Architecture
The system architecture above illustrates the process of duplicate detection using the progressive mechanism. This architecture is discussed in detail in Section 3 of this paper.
Definitions:
• Duplicate Detection: the process of recognizing multiple representations of the same real-world item.
• Data Cleaning: also known as data scrubbing; the process of detecting, correcting, and removing
corrupt and inappropriate records from databases, tables, record sets, etc.
• Progressiveness: improves the results, efficiency, and scalability of the algorithms used in this model. Techniques such as window intervals, look-ahead, partition caching, and Magpie Sort are used to deliver results faster.
• Entity Resolution: also called deduplication or record linkage; identifies the records that correspond to the same real-world entity.
• Pay-As-You-Go: a technique in which candidate pairs are ordered by their estimated likelihood of matching; the ER algorithm then compares records in that order, as the sketch below illustrates.
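As a concrete illustration of the pay-as-you-go idea, the sketch below orders candidate pairs by a cheap estimate of their matching likelihood before any expensive comparison runs, so that cutting execution short still yields the most promising matches first. The shared-prefix heuristic and the `compare` callback are assumptions for illustration.

```python
def key_prefix_score(r1: dict, r2: dict, key: str = "name") -> int:
    """Cheap proxy for matching likelihood: length of the shared sorting-key prefix."""
    a, b = str(r1[key]).lower(), str(r2[key]).lower()
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def pay_as_you_go(records, compare):
    """Compare candidate pairs in order of decreasing estimated likelihood,
    yielding matches as they are found (progressive, interruptible output)."""
    pairs = [(i, j) for i in range(len(records)) for j in range(i + 1, len(records))]
    pairs.sort(key=lambda p: key_prefix_score(records[p[0]], records[p[1]]),
               reverse=True)
    for i, j in pairs:
        if compare(records[i], records[j]):
            yield i, j
```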
2. RELATED WORKS
S. E. Whang et al. [10] proposed concepts for entity resolution (ER) algorithms that focus first on the records most likely to match. The technique provides various hints to the ER algorithm, although many problems remain open. Three different types of hints suit different ER algorithms: a sorted list of record pairs, a hierarchy of record partitions, and an ordered list of records. The hints are used to maximize the number of matching records identified with less work and thereby to increase ER quality. The sketch below illustrates the hierarchy-of-partitions hint.
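One of the three hint types can be sketched directly: a hierarchy of record partitions groups records by increasingly coarse versions of a key, so an ER algorithm can work through the finest (most promising) partitions first. The prefix-based key and the chosen levels below are assumptions, not the construction from [10].

```python
def partition_hierarchy(records, key_func, prefix_lengths=(4, 2, 1)):
    """Hierarchy-of-partitions hint: longer key prefixes give finer partitions
    whose members are more likely to be duplicates of one another."""
    hierarchy = {}
    for n in prefix_lengths:               # from finest to coarsest level
        level = {}
        for idx, rec in enumerate(records):
            level.setdefault(key_func(rec)[:n], []).append(idx)
        hierarchy[n] = level
    return hierarchy

# Usage: records sharing a 4-character key prefix are inspected before
# records that only share a 1-character prefix.
```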
A. K. Elmagarmid et al. [9] presented a survey that investigates the existing methods for detecting non-identical duplicate entries in database records. Two duplicate record detection approaches are covered: 1) distance-based techniques, which measure the distance between individual fields using field-appropriate distance metrics and then compute the distance between the records; and 2) rule-based techniques, which use rules to decide whether two records are the same or different. A rule-based technique can be viewed as a distance-based method in which the distances are restricted to 0 or 1. Such duplicate record detection techniques are essential for improving the quality of extracted data. A sketch of the distance-based approach follows.
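A minimal sketch of the two approaches, under assumed field weights and rules: the record distance is a weighted combination of per-field distances, and the rule-based matcher is the special case whose "distances" are only 0 or 1.

```python
from difflib import SequenceMatcher

def field_distance(a: str, b: str) -> float:
    """Per-field distance in [0, 1]; 0 means the values are identical."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_distance(r1: dict, r2: dict, weights: dict) -> float:
    """Distance-based matching: weighted average of per-field distances."""
    total = sum(weights.values())
    return sum(w * field_distance(str(r1[f]), str(r2[f]))
               for f, w in weights.items()) / total

def rule_based_match(r1: dict, r2: dict) -> bool:
    """Rule-based matching: each rule is a 0/1 decision (assumed example rules)."""
    same_zip   = r1.get("zip") == r2.get("zip")                  # exact-match rule
    close_name = field_distance(r1["name"], r2["name"]) < 0.2    # fuzzy-name rule
    return same_zip and close_name
```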
U. Draisbach et al. [11] presented the Duplicate Count Strategy, which adapts the window size to the number of duplicates detected. Three strategies are considered (see the sketch after this list):
• Key similarity strategy: the similarity of the sorting keys controls the window size, which is increased when the sorting keys are alike, since several related records can then be expected.
• Record similarity strategy: the similarity of the records controls the window size; the similarity of the records already inside the window serves as a substitute for the true resemblance of the records.
• Duplicate count strategy: the number of duplicates already found controls the window size. The DCS++ algorithm proves more reliable than the SNM algorithm without losing effectiveness; it calculates the transitive closure of detected duplicates to save comparisons.
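The duplicate count strategy can be sketched as a single adaptive sorted-neighborhood pass: the window grows while duplicates keep appearing and shrinks back otherwise. The exact growth rule of DCS/DCS++ in [11] differs; this simplified version only conveys the adaptation idea, and `compare` is any pairwise matcher.

```python
def dcs_windowing(sorted_records, compare, w_min=2, w_max=10):
    """Adaptive window over records already sorted by their key: the window
    size follows the number of duplicates found (duplicate count strategy)."""
    matches, w = [], w_min
    for i in range(len(sorted_records)):
        found = 0
        for j in range(i + 1, min(i + w, len(sorted_records))):
            if compare(sorted_records[i], sorted_records[j]):
                matches.append((i, j))
                found += 1
        # Grow the window where duplicates cluster, shrink it elsewhere.
        w = min(w + found, w_max) if found else max(w - 1, w_min)
    return matches
```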
U. Draisbach and F. Naumann [12] studied the two major methods used to reduce the number of comparisons, blocking and windowing, and analyzed Sorted Blocks, a generalization of the two. Blocking partitions the records into disjoint subsets, while windowing slides a window over the sorted records and compares only the records within the window; both variants are sketched below. Sorted Blocks offers advantages such as a variable partition size instead of a fixed window size.
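In candidate-pair terms the difference is easy to state: blocking draws pairs only from within disjoint key groups, while windowing draws pairs from a sliding window over the sort order. The first-letter blocking key and the name-based sort key below are illustrative assumptions.

```python
from itertools import combinations

def blocking_pairs(records, block_key=lambda r: r["name"][:1].lower()):
    """Blocking: candidate pairs come only from the same disjoint block."""
    blocks = {}
    for idx, rec in enumerate(records):
        blocks.setdefault(block_key(rec), []).append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)

def windowing_pairs(records, sort_key=lambda r: r["name"].lower(), w=3):
    """Windowing (sorted neighborhood): compare each record with its
    w - 1 successors in sort order."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + w]:
            yield (min(i, j), max(i, j))
```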
L. Kolb, A. Thor, and E. Rahm [13] addressed deduplication, also known as entity resolution, which determines the entities that refer to the same real-world object and is very important for data integration and data quality. MapReduce is used to execute sorted neighborhood (SN) blocking, so both blocking and parallel processing are combined to implement entity resolution over huge datasets.
The MapReduce contributions are:
1. Demonstrating how to apply MapReduce to a general entity resolution workflow with blocking and matching policies.
2. Identifying the main challenges and proposing two approaches, JobSN and RepSN, for sorted neighborhood blocking.
3. Evaluating the two approaches and demonstrating their efficiency; both the window size and data skew influence the evaluation. A simplified sketch of the map and reduce phases follows.
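The following sketch imitates the two phases in plain Python rather than on a real MapReduce cluster: the map phase range-partitions the key-sorted records and replicates the last w - 1 records of each partition into the next one (the boundary-handling idea behind RepSN), and the reduce phase runs the window inside each partition. The partitioning details are simplified assumptions, not the exact JobSN/RepSN protocols of [13].

```python
from collections import defaultdict

def map_phase(records, sort_key, num_partitions=2, w=3):
    """'Map': assign key-sorted records to range partitions, replicating
    each partition's last w - 1 records into the next partition so that
    windows spanning a boundary are not lost."""
    keyed = sorted((sort_key(r), i) for i, r in enumerate(records))
    size = -(-len(keyed) // num_partitions)        # ceiling division
    parts = defaultdict(list)
    for pos, item in enumerate(keyed):
        p = pos // size
        parts[p].append(item)
        if p + 1 < num_partitions and pos % size >= size - (w - 1):
            parts[p + 1].append(item)              # boundary replication
    return parts

def reduce_phase(parts, w=3):
    """'Reduce': run the sorted-neighborhood window inside every partition;
    duplicated boundary pairs can be filtered with a set afterwards."""
    for items in parts.values():
        items.sort()
        for pos in range(len(items)):
            for _, other in items[pos + 1 : pos + w]:
                yield tuple(sorted((items[pos][1], other)))
```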
3. PROPOSED SYSTEM
The proposed solution uses two novel algorithms for progressive duplicate detection:
PSNM – the Progressive Sorted Neighborhood Method, which is applied to small and clean datasets.
PB – Progressive Blocking, which is applied to large and dirty datasets.
Both algorithms improve efficiency over huge datasets. Compared with a conventional duplicate detection algorithm, a progressive algorithm satisfies two conditions [8] (a sketch of the progressive behavior follows this list):
• Improved early quality: let t denote the target time at which results are required, with t smaller than the runtime of the conventional algorithm. By time t, the progressive algorithm has detected more duplicate pairs than the corresponding conventional algorithm.
• Same eventual quality: if both the progressive algorithm and the conventional algorithm run to completion, without being terminated at an earlier time t, they produce the same results.
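To make the progressive behavior concrete, here is a minimal sketch of the core PSNM idea: rather than scanning each record's whole neighborhood at once, the rank distance is increased iteratively, so the closest neighbors in sort order (the most likely duplicates) are compared first across the entire dataset and results are reported early. The partitioning, look-ahead, and caching of the full algorithm are omitted, and `compare` is an assumed matcher.

```python
def psnm_sketch(records, sort_key, compare, w_max=10):
    """Progressive sorted neighborhood: emit duplicates at rank distance
    1, then 2, ..., so early output favors the most similar neighbors."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    for dist in range(1, w_max):               # progressively widen the window
        for pos in range(len(order) - dist):
            i, j = order[pos], order[pos + dist]
            if compare(records[i], records[j]):
                yield tuple(sorted((i, j)))
```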
As shown in Fig. 1 (System Architecture), a database is first selected for deduplication and, for practical processing, the data is split into numerous partitions and blocks. After sorting, clustering and classification are applied to order the data for efficiency. Next, pairwise matching is performed to find duplicates within blocks, and a new, transformed dataset is generated. Finally,
the transformed data is written back to the database after all filtering steps.
Given a fixed time slot, the progressive detection algorithms work to maximize efficiency. To this end, PSNM and PB dynamically adjust their parameters, such as window sizes, sorting keys, and block sizes, toward optimal values. The following contributions are made:
• Two algorithms, PSNM and PB, are proposed for progressive duplicate detection; each exposes different strengths.
• The approach is extended to a multi-pass method, and an algorithm for incremental transitive closure is adapted (see the union-find sketch at the end of this section).
• A quality measure for progressiveness is defined so that the performance of progressive duplicate detection can be ranked.
• The algorithms are evaluated on many real-world datasets against previously known algorithms.
There are three stages in this workflow:
• Pair selection
• Pairwise comparison
• Clustering
Only the pair selection and clustering stages need to be modified to obtain a good progressive workflow.
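The incremental transitive closure mentioned in the contributions can be maintained with a union-find structure: each duplicate pair produced by pair selection and pairwise comparison merges two clusters, keeping the clustering stage current after every match. This is a standard sketch of the technique, not the paper's exact data structure.

```python
class UnionFind:
    """Incremental transitive closure: every duplicate pair merges clusters."""
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Usage: feed each detected duplicate pair in as soon as it is reported.
uf = UnionFind(5)
for i, j in [(0, 1), (1, 2)]:       # pairs arriving progressively
    uf.union(i, j)
print(uf.find(2) == uf.find(0))     # True: records 0, 1, 2 form one cluster
```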
CONCLUSION
Several duplicate detection approaches have been studied in this paper. The surveyed techniques, whose algorithms detect duplicate records progressively, improve the ability to find duplicates when execution time is limited: the gain of the overall process within the available time is maximized by reporting most of the results early.
REFERENCES
[1]. "Data Mining Curriculum". ACM SIGKDD. 2006-04-30. Retrieved 2014-01-27.
[2]. Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996). "From Data Mining to Knowledge Discovery in Databases" (PDF). Retrieved 17 December 2008.
[3]. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). "The Elements of Statistical Learning: Data Mining, Inference, and Prediction". Retrieved 2012-08-07.
[4]. Witten, Ian H.; Frank, Eibe; Hall, Mark A. (30 January 2011). Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.). Elsevier. ISBN 978-0-12-374856-0.
[5]. "Think Before You Dig: Privacy Implications of Data Mining & Aggregation," NASCIO Research Brief, September 2004.
[6]. Clifton, Christopher (2010). "Encyclopædia Britannica: Definition of Data Mining". Retrieved 2010-12-09.
[7]. M. A. Hernández and S. J. Stolfo, "Real-world data is dirty: Data cleansing and the merge/purge problem," Data Mining and Knowledge Discovery, vol. 2, no. 1, 1998.
[8]. T. Papenbrock, A. Heise, and F. Naumann, "Progressive duplicate detection," IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 25, no. 5, 2014.
[9]. A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate record detection: A survey," IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 19, no. 1, 2007.
[10]. S. E. Whang, D. Marmaros, and H. Garcia-Molina, "Pay-as-you-go entity resolution," IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 25, no. 5, 2012.
[11]. U. Draisbach, F. Naumann, S. Szott, and O. Wonneberg, "Adaptive windows for duplicate detection," in Proceedings of the International Conference on Data Engineering (ICDE), 2012.
[12]. U. Draisbach and F. Naumann, "A generalization of blocking and windowing algorithms for duplicate detection," in International Conference on Data and Knowledge Engineering (ICDKE), 2011.
[13]. L. Kolb, A. Thor, and E. Rahm, "Parallel sorted neighborhood blocking with MapReduce," in Proceedings of the Conference Datenbanksysteme in Büro, Technik und Wissenschaft (BTW), 2011.