Normalization of Duplicate Records from Multiple Sources: Bachelor of Technology in Computer Science and Engineering

The document is a project report submitted by five students for their Bachelor of Technology degree. It discusses developing a system to normalize duplicate records from multiple sources. The system will consolidate duplicate information and generate normalized records. The report includes sections on existing systems, proposed system advantages, system requirements, design, coding, testing, execution steps, screenshots and conclusions. It was submitted under the guidance of a faculty member to fulfill degree requirements.

Uploaded by

Jeevana Pathipati


A Project Report

NORMALIZATION OF DUPLICATE
RECORDS FROM MULTIPLE SOURCES

Submitted in partial fulfillment for the award of the degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

Submitted By

P. JEEVANA (17HM1A0533)
A. DIVYA (17HM1A0502)
S. SHABAAZ (17HM1A0543)
S. ATHIF (17HM1A0542)
S. AYESHA SHAIK (17HM1A0508)

Under the esteemed guidance of

Mrs. G. SAVITRI, M.Tech.,
Assistant Professor,
Department of CSE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES
(Affiliated to J.N.T.U.A., Anantapur, Approved by A.I.C.T.E, New Delhi)
Utukur (P), C.K.Dinne (V&M), Kadapa-516003
ANDHRA PRADESH.
2017-2021
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES
(Affiliated to J.N.T.U. Anantapur, Approved by A.I.C.T.E, New Delhi)
Utukur (P), C.K.Dinne (V&M), Kadapa-516003

CERTIFICATE
This is to certify that the project work entitled "NORMALIZATION OF
DUPLICATE RECORDS FROM MULTIPLE SOURCES" is a bona fide
work done by

P. JEEVANA (17HM1A0533)
A. DIVYA (17HM1A0502)
S. SHABAAZ (17HM1A0543)
S. ATHIF (17HM1A0542)
S. AYESHA SHAIK (17HM1A0508)

in partial fulfillment of the requirement for the award of the degree of BACHELOR
OF TECHNOLOGY in COMPUTER SCIENCE AND ENGINEERING in
ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES,
KADAPA during the academic year 2017-2021. The results of this work have not
been submitted to any other university or institute for the award of any degree.

Project Guide
Mrs. G. SAVITRI, M.Tech.,
Assistant Professor,
Department of CSE,
A.I.T.S., Kadapa.

Head of the Department
Mr. C. VENKATASUBBIAH, M.Tech., (Ph.D.),
Assistant Professor,
Department of CSE,
A.I.T.S., Kadapa.

External Examiner
DECLARATION

We hereby declare that the project report entitled “NORMALIZATION OF
DUPLICATE RECORDS FROM MULTIPLE SOURCES” is a record of
project work carried out by us for the award of the degree of Bachelor of Technology
in Computer Science and Engineering. We also declare that this project is a result of
our own effort and has not been submitted earlier for the award of any degree or other
course at any other University.

PROJECT ASSOCIATES

P. JEEVANA (17HM1A0533)
A. DIVYA (17HM1A0502)
S. SHABAAZ (17HM1A0543)
S. ATHIF (17HM1A0542)
S. AYESHA SHAIK (17HM1A0508)
ACKNOWLEDGEMENT

We are deeply indebted to our supervisor Mrs. G. SAVITRI, ASSISTANT
PROFESSOR, DEPT. OF CSE, for her valuable guidance, constant encouragement,
constructive criticism and keen interest evinced throughout the course of our work.
We are really fortunate to be associated with a guide who advised and helped us
in every possible way, at all stages, for the successful completion of this project work.

We extend our gratitude to our Coordinator, Mr. P. CHANDRA SEKHAR,
for his encouragement and support throughout the project.

We are extremely thankful to SRI C. VENKATA SUBBAIAH, HEAD OF
THE DEPARTMENT of Computer Science and Engineering, "Annamacharya
Institute of Technology & Sciences" for assisting us in completion of this project.

We express our gratitude to our principal Dr. A. SUDHAKARA REDDY and
the Management for providing all the facilities and support in completing our
project work successfully.

We express our heartfelt thanks to the entire faculty of the department
of CSE of Annamacharya Institute of Technology & Sciences, for their moral support
and good wishes.

Last, but not least by any means, we are thankful to all the non-teaching staff
members of the Computer Science & Engineering Department for their extended
co-operation.

PROJECT ASSOCIATES

P. JEEVANA (17HM1A0533)
A. DIVYA (17HM1A0502)
S. SHABAAZ (17HM1A0543)
S. ATHIF (17HM1A0542)
S. AYESHA SHAIK (17HM1A0508)
TABLE OF CONTENTS

CHAPTER.NO CHAPTER NAME PAGE.NO.

List of Figures i

List of Tables ii

List of Abbreviations iii

CHAPTER 1 INTRODUCTION 1

1.1 Project Objective 1

CHAPTER 2 SYSTEM ANALYSIS 2

2.1 Existing System 2

2.1.1. Disadvantages 2

2.2 Proposed System 2

2.2.1. Advantages 3

CHAPTER 3 SYSTEM REQUIREMENTS SPECIFICATION 4

3.1 Hardware Requirements 4

3.2 Software Requirements 4

3.3 Software Description 4

CHAPTER 4 SYSTEM DESIGN 8

4.1 Architecture Design 8

4.2 Modules Description 8

4.2.1. Record-level Normalization 8

4.2.2. Field-level Normalization 9

4.2.3. Page-level Normalization 9

4.3 Introduction to UML 9

4.4 UML Diagrams 10

CHAPTER 5 SYSTEM CODING & IMPLEMENTATION 21

5.1 Coding 21

5.2 Testing

5.2.1 Testing Techniques 28

5.2.2 Test Cases 30

CHAPTER 6 EXECUTION STEPS 31

CHAPTER 7 SYSTEM EXECUTION SCREENSHOTS 38

CHAPTER 8 CONCLUSION 46

CHAPTER 9 FUTURE ENHANCEMENT 47

CHAPTER 10 BIBLIOGRAPHY 48
LIST OF FIGURES

S No. Figure Name Fig No. Page No.

1. Working of Java Program 3.1 5

2. Implementation of Java Virtual Machine 3.2 6

3. Program Running on the Java Platform 3.3 7

4. Architecture Design 4.1 8

5. Class Diagram 4.2 11

6. Use Case Diagram 4.3 12

7. Sequence Diagram 4.4 13

8. Activity Diagram 4.5 14

9. Collaboration Diagram 4.6 14

10. Deployment Diagram 4.7 15

11. State Chart Diagram 4.8 15

12. Component Diagram 4.9 15

13. Class Diagram for Overall Project 4.10 16

14. Use Case Diagram for User 4.11 17

15. Sequence Diagram 4.12 18

16. Activity Diagram for Administrator 4.13 19

17. Data Flow Diagram for Home Page 4.14 20

18. Home Page of Project 6.1 38

19. Admin Menu Page 6.2 38

20. Report Showing List of Duplicated 6.3 39

21. Report showing Normalized Records 6.4 39

22. Report Showing List of Book Marks 6.5 40

23. Graph showing Publication Records 6.6 40

24. Report Showing Publication Search History 6.7 41

25. publication frequency rank 6.8 41

26. Form for User login 6.9 42

27. Form for User Menu 6.10 42

28. Form to Search Publication Words 6.11 43

29. Report Showing Publication Report 6.12 43

30. Publisher Login Page 6.13 44

31. Publisher Menu Page 6.14 44

32. Report showing to View Various Book Marks 6.15 45

LIST OF TABLES

S No. Table Name Table No. Page No.

1. Test Cases 5.2.2 30

LIST OF ABBREVIATIONS

1. GUI Graphical User Interface
2. JFC Java Foundation Classes
3. JVM Java Virtual Machine
4. AWT Abstract Window Toolkit
5. API Application Programming Interface
6. JDK Java Development Kit
7. ODBC Open Database Connectivity
8. JDBC Java Database Connectivity
9. SQL Structured Query Language
10. OSI Open Systems Interconnection
11. IP Internet Protocol
12. SDK Software Development Kit
13. URL Uniform Resource Locator

Normalization Of Duplicate Records From Multiple Sources Introduction

1. INTRODUCTION

1.1. PROJECT OBJECTIVE

The Web has evolved into a data-rich repository containing a large amount of
structured content spread across millions of sources. The usefulness of Web data
increases exponentially (e.g., building knowledge bases, Web-scale data analytics)
when it is linked across numerous sources. Structured data on the Web resides in Web
databases and Web tables.

Web data integration is an important component of many applications
collecting data from Web databases, such as Web data warehousing (e.g., Google and
Bing Shopping; Google Scholar), data aggregation (e.g., product and service reviews),
and metasearching. Integration systems at Web scale need to automatically match
records from different sources that refer to the same real-world entity, find the true
matching records among them, and turn this set of records into a standard record for
the consumption of users or other applications. In this paper, we assume that the tasks
of record matching and truth discovery have been performed and that the groups of
true matching records have thus been identified. Our goal is to generate a uniform,
standard record for each group of true matching records for end-user consumption.

We call the generated record the normalized record. For example, in the
research publication domain, although the integrator website, such as Citeseer or
Google Scholar, contains records gathered from a variety of sources using automated
extraction techniques, it must display a normalized record to users. Otherwise, it is
unclear what can be presented to users: (i) present the entire group of matching
records or (ii) simply present some random record from the group, to just name a
couple of ad-hoc approaches. Either of these choices can lead to a frustrating
experience for a user, because in (i) the user needs to sort/browse through a
potentially large number of duplicate records, and in (ii) we run the risk of presenting
a record with missing or incorrect pieces of data.

A.I.T.S, KADAPA DEPARTMENT OF CSE Page 1


2. SYSTEM ANALYSIS
2.1. EXISTING SYSTEM
The problem of normalization of database records was first described by
Culotta et al. They provided the first attempt to formalize the record normalization
problem and proposed three solutions. The first solution uses string edit distance to
determine the most central record. The second solution optimizes the edit distance
parameters, and the third one describes a feature-based solution to improve
performance by means of a knowledge base. Their approach is an instance of typical
field value normalization. They did not consider value-component-level
normalization. In addition, their gold standard dataset has many instances of
unreasonable normalized records. Swoosh describes a record Merge operator;
however, the purpose of the operator is not to produce normalized records, but
rather to improve the ability to establish difficult record matchings. Wick et al.
propose a discriminatively-trained model to implement schema matching, reference,
and normalization jointly, but the complexity of the model is greatly increased. Their
paper also contains no discussion of complete normalization at the value-component
level. Wang et al. propose a hybrid framework for product normalization in online
shopping by schema integration and data cleaning. Although their work mainly
focuses on record matching, they consider the problem of filling missing data and
repairing incorrect data, which is relevant to record normalization.
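
The first of the solutions attributed to Culotta et al., choosing the most central record by string edit distance, can be illustrated with a minimal Java sketch. It assumes each record has been flattened to a single string and uses plain unit-cost Levenshtein distance; the class and method names (CentralRecordSketch, editDistance, mostCentral) are hypothetical and are not taken from their work or from this project's code.

```java
import java.util.List;

/** Sketch only: pick the "most central" record of a duplicate group by edit distance. */
final class CentralRecordSketch {

    /** Unit-cost Levenshtein distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the record with the smallest summed distance to the group (first wins ties). */
    static String mostCentral(List<String> records) {
        if (records.isEmpty()) {
            throw new IllegalArgumentException("record group must not be empty");
        }
        String best = null;
        int bestTotal = Integer.MAX_VALUE;
        for (String candidate : records) {
            int total = 0;
            for (String other : records) {
                total += editDistance(candidate, other);
            }
            if (total < bestTotal) {
                bestTotal = total;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical venue strings from three duplicate records.
        System.out.println(mostCentral(List.of("VLDB", "VLDB Conf", "VLDB Conference")));
    }
}
```

The quadratic comparison is acceptable here because a group of matching records is usually small; a central record chosen this way can still carry missing or abbreviated fields, which is the limitation the finer-grained levels try to address.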

2.1.1. Disadvantages

• The existing work uses only field-level normalization.

• The existing work offers no Web-scale integration system that automatically
matches records from different sources that refer to the same real-world entity.

2.2. PROPOSED SYSTEM

In this paper, we assume that the tasks of record matching and truth discovery
have been performed and that the groups of true matching records have thus been
identified. Our goal is to generate a uniform, standard record for each group of true
matching records for end-user consumption. The system calls the generated record the
normalized record. We call the problem of computing the normalized record for a
group of matching records the record normalization problem (RNP), and it is the
focus of this work. RNP is another specific interesting problem in data fusion. The
system proposes three levels of granularity for record normalization along with
methods to construct normalized records according to them. The system proposes a
comprehensive framework for systematic construction of normalized records. Our
framework is flexible and allows new strategies to be added with ease. To our
knowledge, this is the first piece of work to propose such a detailed framework. The
system proposes and compares a range of normalization strategies, from frequency,
length, centroid and feature-based to more complex ones that utilize result merging
models from information retrieval, such as (weighted) Borda. The system introduces a
number of heuristic rules to mine desirable value components from a field. We use
them to construct the normalized value for the field. The system performs empirical
studies on publication records. The experimental results show that the proposed
weighted-Borda-based approach significantly outperforms the baseline approaches.
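
To make the idea of weighted Borda-style rank merging concrete, the following Java sketch merges the ranked candidate lists produced by several strategies (for example frequency, length and centroid) into one score per candidate. It is only an illustration under stated assumptions: the scoring rule (the top of a list of n candidates earns n points, multiplied by the strategy's weight), the weights themselves, and the names WeightedBordaSketch, scores and winner are hypothetical and are not taken from the report's implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: merge several ranked candidate lists with a weighted Borda count. */
final class WeightedBordaSketch {

    /**
     * rankings.get(k) is the list produced by strategy k, best candidate first;
     * weights[k] is the (assumed) trust placed in strategy k.
     */
    static Map<String, Double> scores(List<List<String>> rankings, double[] weights) {
        if (rankings.size() != weights.length) {
            throw new IllegalArgumentException("one weight per ranking is required");
        }
        Map<String, Double> total = new LinkedHashMap<>();
        for (int k = 0; k < rankings.size(); k++) {
            List<String> ranking = rankings.get(k);
            int n = ranking.size();
            for (int pos = 0; pos < n; pos++) {
                // Borda points: n for the top candidate down to 1 for the last one.
                double points = weights[k] * (n - pos);
                total.merge(ranking.get(pos), points, Double::sum);
            }
        }
        return total;
    }

    /** Candidate with the highest merged score; the first one seen wins ties; null if empty. */
    static String winner(List<List<String>> rankings, double[] weights) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scores(rankings, weights).entrySet()) {
            if (entry.getValue() > bestScore) {
                bestScore = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical rankings of three venue strings by three strategies.
        List<List<String>> rankings = List.of(
                List.of("VLDB", "Proc. VLDB", "Very Large Data Bases"),   // frequency
                List.of("Very Large Data Bases", "Proc. VLDB", "VLDB"),   // length
                List.of("Proc. VLDB", "VLDB", "Very Large Data Bases"));  // centroid
        System.out.println(winner(rankings, new double[] {1.0, 1.0, 1.0}));
    }
}
```

With equal weights this reduces to the plain Borda count; raising one strategy's weight lets its preferred candidate win even when the other strategies rank it lower.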

2.2.1. Advantages

• The system is very fast because it identifies three levels of normalization
granularity: record, field, and value component.
• Exact duplicate records are detected by mining template
collocation-subcollocation pairs.


3. SYSTEM REQUIREMENTS SPECIFICATION

3.1. HARDWARE REQUIREMENTS

Processor : Intel Core (i3, i5, ..)
Speed : 1.2 GHz or above
RAM : 2 GB (min)
Hard Disk : 10 GB of space (min)

3.2. SOFTWARE REQUIREMENTS

Operating System : Windows XP/7/8
Front End : JSP
Database : MySQL
Programming language : Java (J2EE)
Web Server : Tomcat

3.3. SOFTWARE DESCRIPTION

Java Technology

Java technology is both a programming language and a platform.

The Java Programming Language:

The Java programming language is a high-level language that can be
characterized by all of the following buzzwords:

• Simple

• Architecture neutral

• Object oriented

• Portable

• Distributed

• High performance

• Interpreted

• Multithreaded

• Robust

• Dynamic

• Secure

With most programming languages, you either compile or interpret a program
so that you can run it on your computer. The Java programming language is unusual
in that a program is both compiled and interpreted. With the compiler, you first
translate a program into an intermediate language called Java byte codes, the
platform-independent codes interpreted by the interpreter on the Java platform. The
interpreter parses and runs each Java byte code instruction on the computer.
Compilation happens just once; interpretation occurs each time the program is
executed. The following figure illustrates how this works.

Fig 3.1: Working of Java Program
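
As a small illustration of the compile-once, interpret-on-each-run cycle described above, consider the following class. The class name HelloByteCode and its message are hypothetical and serve only as an example.

```java
/** Minimal program used to illustrate the compile-then-interpret cycle. */
final class HelloByteCode {

    /** Kept separate from main so the message can be checked without running the program. */
    static String greeting() {
        return "Hello from the Java VM";
    }

    public static void main(String[] args) {
        System.out.println(greeting());
    }
}
```

Saved as HelloByteCode.java, the program is compiled once with the command javac HelloByteCode.java, which produces the byte code file HelloByteCode.class. Each later run with the command java HelloByteCode hands that same class file to the Java VM of whatever platform is in use.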

You can think of Java byte codes as the machine code instructions for the Java
Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool
or a Web browser that can run applets, is an implementation of the Java VM. Java
byte codes help make “write once, run anywhere” possible. You can compile your
program into byte codes on any platform that has a Java compiler. The byte codes can
then be run on any implementation of the Java VM. That means that as long as a
computer has a Java VM, the same program written in the Java programming
language can run on Windows 2000, a Solaris workstation, or an iMac.

Fig 3.2: Implementation of Java Virtual Machine

The Java Platform

A platform is the hardware or software environment in which a program runs.
We’ve already mentioned some of the most popular platforms like Windows 2000,
Linux, Solaris, and MacOS. Most platforms can be described as a combination of the
operating system and hardware. The Java platform differs from most other platforms
in that it’s a software-only platform that runs on top of other hardware-based
platforms.

The Java platform has two components:

• The Java Virtual Machine (Java VM)

• The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java
platform and is ported onto various hardware-based platforms.


Fig 3.3: Program Running on the Java Platform

Native code is code that, once compiled, runs on a specific hardware platform.
As a platform-independent environment, the Java
platform can be a bit slower than native code. However, smart compilers, well-tuned
interpreters, and just-in-time byte code compilers can bring performance close to that
of native code without threatening portability.

Feasibility Study

Technical Feasibility

The GUI is developed using HTML to capture information from the customer.
HTML is used to display the content in the browser; pages are delivered over HTTP,
which runs on top of TCP/IP. HTML is a markup language that the browser interprets.
It is very easy to develop a page or document using HTML, and some RAD (Rapid
Application Development) tools are available to quickly design and develop our
application. Many objects, such as buttons, text fields, and text areas, are
provided to capture the information from the customer.

Economical Feasibility
The economic question that usually arises at the economic feasibility stage is
whether the system will be used once it is developed and implemented. The system
reduces the workload. Keeping the class of application in view, the cost of hardware
and software is considered to be economically feasible.

Operational Feasibility

In our application, the front end is developed as a GUI, so it is very easy for the
customer to enter the necessary information. However, the customer must have some
knowledge of using web applications before using our application.


4. SYSTEM DESIGN
4.1. ARCHITECTURE DESIGN

Fig 4.1: Architecture Design

Our goal is to generate a uniform, standard record for each group of true matching
records for end-user consumption. We call the generated record the normalized
record. We call the problem of computing the normalized record for a group of
matching records the record normalization problem (RNP), and it is the focus of this
work. RNP is another specific interesting problem in data fusion. Record
normalization is important in many application domains. For example, in the research
publication domain, although the integrator website, such as Citeseer or Google
Scholar, contains records gathered from a variety of sources using automated
extraction techniques, it must display a normalized record to users.

4.2. MODULES DESCRIPTION

4.2.1 Record-level Normalization

Record-level normalization assumes that each record in Re, the group of matching
records for an entity e, is a candidate normalized record. The assumption, while
intuitively appealing and useful for building the theoretical underpinnings for
constructing normalized records, needs to be taken with a grain of salt in practice. Re
contains a mixture of candidate normalized records and records with incomplete or
arcane representations of e, which may be difficult for ordinary users to understand.


4.2.2 Field-level Normalization

Field-level normalization selects a normalized value for each field fi independently
and concatenates the selected values of all fields into a normalized record. The
normalized value for the field fi is one of the values that appear among the records in
Re in the field fi, and it is selected according to some criterion (e.g., more descriptive).
The normalized record formed in this way may consist of field values from different
records.
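
Field-level selection as described above can be sketched in Java as follows. The selection criterion used here, the most frequent non-blank value per field with ties going to the longer (more descriptive) value, is an assumption chosen for illustration, as are the names FieldLevelSketch and normalize and the representation of a record as a map from field name to value.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: build a normalized record by choosing one value per field. */
final class FieldLevelSketch {

    /** Each record maps a field name to its value; missing or blank values are skipped. */
    static Map<String, String> normalize(List<Map<String, String>> records) {
        // field name -> (value -> number of records carrying that value)
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (Map<String, String> record : records) {
            for (Map.Entry<String, String> field : record.entrySet()) {
                String value = field.getValue();
                if (value == null || value.isBlank()) {
                    continue;
                }
                counts.computeIfAbsent(field.getKey(), key -> new LinkedHashMap<>())
                      .merge(value.trim(), 1, Integer::sum);
            }
        }
        Map<String, String> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> field : counts.entrySet()) {
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> candidate : field.getValue().entrySet()) {
                int count = candidate.getValue();
                boolean moreFrequent = count > bestCount;
                // Assumed tie-break: prefer the longer, more descriptive value.
                boolean tieButLonger = count == bestCount
                        && candidate.getKey().length() > best.length();
                if (moreFrequent || tieButLonger) {
                    best = candidate.getKey();
                    bestCount = count;
                }
            }
            normalized.put(field.getKey(), best);
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Hypothetical duplicate publication records from three sources.
        List<Map<String, String>> group = List.of(
                Map.of("title", "Data Fusion", "venue", "VLDB"),
                Map.of("title", "Data Fusion", "venue", "Very Large Data Bases"),
                Map.of("title", "Data fusion.", "venue", "VLDB", "year", "2009"));
        System.out.println(normalize(group));
    }
}
```

Because every field is decided independently, the resulting record may combine a title from one source with a year from another, which matches the behaviour described in the paragraph above.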

4.2.3 Page-level Normalization

Page-level normalization, the most complete form, works on the values of an entire web page.

In the following sections, we give the details of our key techniques:

(1) ranking-based strategies

(2) value component mining

(3) ranked list merging.

4.3. INTRODUCTION TO UML

The Unified Modeling Language allows the software engineer to express an
analysis model using a modeling notation that is governed by a set of syntactic,
semantic and pragmatic rules. A UML system is represented using five different
views that describe the system from distinctly different perspectives.

UML is specifically constructed through two different domains:

• UML analysis modeling, which focuses on the user model and structural
model views of the system.

• UML design modeling, which focuses on the behavioral modeling,
implementation modeling and environmental model views.


4.4. UML DIAGRAMS

Why do we have to use UML in projects?

As the strategic value of software increases for many companies, the industry
looks for techniques to automate the production of software and to improve quality
and reduce cost and time-to-market. These techniques include component technology,
visual programming, patterns and frameworks. Businesses also seek techniques to
manage the complexity of systems as they increase in scope and scale. In particular,
they recognize the need to solve recurring architectural problems, such as physical
distribution, concurrency, replication, security, load balancing and fault tolerance.
Additionally, the development for the World Wide Web, while making some things
simpler, has exacerbated these architectural problems. The Unified Modeling
Language (UML) was designed to respond to these needs. Simply put, systems design
refers to the process of defining the architecture, components, modules, interfaces,
and data for a system to satisfy specified requirements, which can be done easily
through UML diagrams.

In the project, eight basic UML diagrams are explained:

• Class Diagram

• Use Case Diagram

• Sequence Diagram

• Activity Diagram

• Collaboration Diagram

• Deployment Diagram

• State Chart Diagram

• Component Diagram

Class Diagram

In software engineering, a class diagram in the Unified Modeling Language
(UML) is a type of static structure diagram that describes the structure of a system by
showing the system's classes, their attributes, and the relationships between the
classes. This is one of the most important diagrams in development. The
diagram breaks each class into three layers: the first has the name, the second describes
its attributes, and the third its methods. A padlock to the left of a name marks a
private attribute. The relationships are drawn between the classes. Developers use the
class diagram to develop the classes. Analysts use it to show the details of the
system. Architects look at class diagrams to see whether any class has too many
functions and needs to be split.

Fig 4.2: Class Diagram

Use Case Diagram

In software engineering, a use case diagram in the Unified Modeling
Language (UML) is a type of behavioral diagram defined by and created from a
use-case analysis. Its purpose is to present a graphical overview of the functionality
provided by a system in terms of actors, their goals (represented as use cases), and
any dependencies between those use cases. The main purpose of a use case diagram is
to show what system functions are performed for which actor. Roles of the actors in
the system can be depicted. Use cases are used during requirements elicitation and
analysis to represent the functionality of the system. Use cases focus on the behavior
of the system from the external point of view. The actors are outside the boundary of
the system, whereas the use cases are inside the boundary of the system.


Fig 4.3: Use Case Diagram

Sequence Diagram

A sequence diagram in the Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is a
construct of a Message Sequence Chart. Sequence diagrams are sometimes called
event-trace diagrams, event scenarios, and timing diagrams.

• OBJECT: Objects are typically named or anonymous instances of a class, but
may also represent instances of other things such as components, collaborations
and nodes.

• LINK: A link is a semantic connection among objects, i.e., an instance of an
association is called a link.

• LIFELINE: A lifeline is a vertical dashed line that represents the lifetime of an
object.

• FOCUS OF CONTROL: A focus of control is a tall, thin rectangle that shows
the period of time during which an object is performing an action.

• MESSAGES: A message is a specification of a communication between
objects that conveys information with the expectation that activity will
ensue.


Fig 4.4: Sequence Diagram

Activity Diagram

Activity diagrams are a loosely defined diagram technique for showing
workflows of stepwise activities and actions, with support for choice, iteration and
concurrency. In the Unified Modeling Language, activity diagrams can be used to
describe the business and operational step-by-step workflows of components in a
system. An activity diagram shows the overall flow of control.

• ACTIVITY STATE: An activity state is a kind of state in an activity diagram;
it shows an ongoing non-atomic execution within a state machine. An activity
state can be further decomposed.

• ACTION STATE: Action states are states of the system, each representing
the execution of an action. An action state cannot be further decomposed.

• TRANSITION: A transition specifies the path from one action or activity
state to the next action or activity state. The transition is rendered as a simple
directed line.

• OBJECT: An object is a concrete manifestation of an abstraction; an entity
with a well-defined boundary and identity that encapsulates state and behavior;
an instance of a class. Objects may be involved in the flow of control
associated with an activity diagram.


Fig 4.5: Activity Diagram

Collaboration Diagram

A collaboration diagram (called a communication diagram in UML 2) models the
interactions between objects or parts in terms of sequenced messages. Collaboration
diagrams represent a combination of information taken from Class, Sequence, and
Use Case Diagrams, describing both the static structure and dynamic behavior of a
system.

Fig 4.6: Collaboration Diagram

Deployment Diagram

A deployment diagram in the Unified Modeling Language models the physical
deployment of artifacts on nodes. To describe a web site, for example, a deployment
diagram would show what hardware components ("nodes") exist (e.g., a web server,
an application server, and a database server), what software components ("artifacts")
run on each node (e.g., web application, database), and how the different pieces are
connected (e.g., JDBC, REST).

Fig 4.7: Deployment Diagram

State Chart Diagram

A state diagram is a type of diagram used in computer science and related
fields to describe the behavior of systems. State diagrams require that the system
described is composed of a finite number of states; sometimes this is indeed the case,
while at other times this is a reasonable abstraction. Many forms of state diagrams
exist, which differ slightly and have different semantics.

Fig 4.8: State Chart Diagram

Component Diagram

In the Unified Modeling Language, a component diagram depicts how
components are wired together to form larger components and/or software systems.
They are used to illustrate the structure of arbitrarily complex systems.

Fig 4.9: Component Diagram


The following UML diagrams are discussed for this project:

- Use case diagram

- Class diagram

- Activity diagram

- Sequence diagram

CLASS DIAGRAM

In our class diagram, the classes are data user, public cloud, and private cloud. The
attributes of the data user are user id, password, and files. The operations of this
class are login, register, upload, checking of duplicates, and logout. The data
user is related to the public cloud through a many-to-one relationship. The attributes
of the public cloud are userid, password, and file storage, and the operations of
this class include login, encrypt, decrypt, duplicate, and logout. In the same way,
when the user belongs to the private cloud, only authorized users have permission to
access or modify the data.

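Public-Cloud
Attributes: userid, password, filestorage, files
Operations: login(), storefiles(), encrypt(), decrypt(), duplicate(), logout()

DataUser
Attributes: userid, password, files, fileid, fileblocks
Operations: login(), register(), upload(), duplicatecheck(), encrypt(), decrypt(), download(), logout()

Private-Cloud
Attributes: userid, password, files, rights, ownername, permissions

Associations: DataUser (*) to Public-Cloud (1); DataUser (*) to Private-Cloud (1)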

Fig 4.10: Class Diagram for Overall Project


USE CASE DIAGRAM

A use case diagram shows a set of use cases and actors and their relationships.
These diagrams are especially important in organizing and modeling the behaviors
of a system. The actors in our use case diagram are the user and the public cloud.
First, the user registers on the webpage and, after completing the registration
process, logs in to the webpage, uploads the necessary files, checks whether
duplicates are present, and then logs out. Second, the (public cloud) admin logs in to
the webpage and checks for duplication in the list of files. If duplicates
are found, the admin removes the repeated data by using the normalization concepts.

[Figure: actors User and Public Cloud; use cases Upload File, Download File, Check for Duplicate, List of Files, Logout]

Fig. 4.11: Use Case Diagram for User


SEQUENCE DIAGRAM

In a sequence diagram, objects are generally anonymous instances of a class. Here,
the objects are named Owner, Login, Receive permission from admin,
File upload, View user details, Receive files, and Attribute. At first, the owner logs in
to the web page by entering the userid and password; then the admin verifies whether
the user who entered the details is authorized and, if so, gives permission to upload
the data. Further, if the user wants to view (search) any information, the cloud
removes the redundant data and sends the standard information.

[Figure: lifelines Owner, Login, Received Permission from admin, File Upload, View User Details, Receive File, Attribute; messages in order: uid,pwd; verify; receive permission; file upload; view user details; receive file from cloud; change key]

Fig. 4.12: Sequence Diagram

ACTIVITY DIAGRAM

An activity diagram shows the flow from activity to activity. The activity
diagram emphasizes the dynamic view of a system. It consists of activity states, action
states, transitions, and objects. In this diagram, at the starting point (pre-condition)
the user enters the user name and password. After the details are submitted, the
admin decides by checking whether the user entered valid data. If the user entered the
correct details, he is accepted as an authorized person to use the web page; if the
user did not enter valid data, he is rejected and the flow reaches the ending point
(post-condition).

Fig. 4.13: Activity Diagram for Administrator
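
The decision node of the activity diagram can be sketched as a small check that either accepts or rejects a submitted user name and password. This is an illustrative sketch only: the class LoginCheck, its in-memory map, and the sample credentials are assumptions, and passwords are held in plain text purely to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the admin's check in the activity diagram.
final class LoginCheck {

    enum Outcome { ACCEPTED, REJECTED }

    // Assumption: registered users kept in memory; the project uses a database.
    private final Map<String, String> registered = new HashMap<>();

    void register(String userName, String password) {
        registered.put(userName, password);
    }

    // Decision node: accept only when both fields are present and match a registered user.
    Outcome submit(String userName, String password) {
        if (userName == null || password == null
                || userName.isEmpty() || password.isEmpty()) {
            return Outcome.REJECTED;
        }
        return password.equals(registered.get(userName))
                ? Outcome.ACCEPTED : Outcome.REJECTED;
    }

    public static void main(String[] args) {
        LoginCheck check = new LoginCheck();
        check.register("owner1", "secret");
        System.out.println(check.submit("owner1", "secret"));
        System.out.println(check.submit("owner1", "wrong"));
    }
}
```

Both outcomes lead to the same end state of the diagram; only the accepted branch would go on to open the user's page.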

DATA FLOW DIAGRAMS

Data flow analysis studies the use of data in each activity and documents these
details in data flow diagrams.

A data flow diagram is a logical model of a system. The model does not depend on
the hardware, software, or data structures of the organization. There is no
physical implication in a data flow diagram because the diagram is a graphic
picture of the logical system. It is easy for non-technical users to understand
and thus serves as an excellent communication tool. Finally, a data flow diagram
is a good starting point for system design. A data flow diagram is constructed
from a small set of basic symbols.


The basic symbols of data flow diagrams are:

DFD NOTATIONS

• Defines the source and destination of data.
• Shows the path of the data flow.
• Represents a process that transforms or modifies the data.
• Represents an attribute.
• Data store.

[Figure: data flow from Open Login Form, to Enter Username and Password, to Check Username and Password (Validation Data), with Yes and No branches; the Yes branch leads to the User Home Page.]

Fig 4.14: Data Flow Diagram for Home Page


5. SYSTEM CODING AND IMPLEMENTATION

5.1. CODING

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
<title>Multi Cloud</title>
<style type="text/css">
.b:hover{
border-width:3px;
border-color:red;
}
.big:hover
{
color:red;

font-weight:bold;
}
.b1
{
background-color: #color; /* "#color" is a placeholder, not a valid color value */
border-bottom:solid;
border-left: #FFEEEE;
border-right:solid;
border-top: #EEEEEE;
color: brown;
font-family: Verdana, Arial;
}
</style>
<meta name="keywords" content="" />
<meta name="description" content="" />
<link rel="stylesheet" type="text/css" href="default.css" />
</head>
<body>
<div id="upbg"></div>
<div id="outer">
<div id="header">
<div id="headercontent">
<h1></h1>
</div>
</div>
<div id="headerpic"></div>
<div id="menu">

<ul>
<li><a href="#" class="active">Home</a></li>
<li><a href="user_log.jsp" >User</a></li>
<li><a href="signup.jsp">Sign up</a></li>
<li><a href="server_log.jsp">Admin</a></li>
</ul>
</div>
<div id="menubottom"></div>


<div id="content">


<div id="primarycontainer">
<div id="primarycontent">

<div class="post">
<p><strong><em><font color="#990000" size="+1" face="Verdana, Arial,
Helvetica, sans-serif">Architecture</font></em></strong>
<br/>
<br/>
<img src="images/archi.bmp" width="700" height="400" alt="Architecture" /></p>
</div>

</div>
</div>
<br>
<br>
</div>
</div>
<div id="footer"><strong><font color="#990033" face="Geneva, Arial, Helvetica,
sans-serif"></font></strong></div>

</body>
</html>

UPLOAD.JSP


<%@page import="java.sql.*,java.lang.*,databaseconnection.*,databaseconnection1.*,databaseconnection2.*,databaseconnection3.*,java.text.SimpleDateFormat,java.util.*,java.io.*,javax.servlet.*,javax.servlet.http.*" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>multi cloud</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<script type="text/javascript">
</script>
</head>
<body>
<%
java.util.Date now = new java.util.Date();
String DATE_FORMAT1 = "dd/MM/yyyy";
SimpleDateFormat sdf1 = new SimpleDateFormat(DATE_FORMAT1);
String strDateNew1 = sdf1.format(now);
String a="D:\\multi-cloud\\temp\\file1.txt";
String b="D:\\multi-cloud\\temp\\file2.txt";
String c="D:\\multi-cloud\\temp\\file3.txt";
FileInputStream fis=null;
File image=new File(a);
File image1=new File(b);
File image2=new File(c);
//String m="on process";
String ser=(String)session.getAttribute("ser");
String u=(String)session.getAttribute("u");
String name=(String)session.getAttribute("name");
String f=(String)session.getAttribute("f");
String kbs=(String)session.getAttribute("kbs");

String tfid=(String)session.getAttribute("tfid");
String fkey=(String)session.getAttribute("fkey");
String akey=(String)session.getAttribute("akey");
String m="not_verified";
String x="s1";
String y="s2";
String z="s3";
Connection con=null,con1=null,con2=null;
PreparedStatement psmt1=null,psmt2=null,psmt3=null;
try{
con=databasecon.getconnection();
psmt1=con.prepareStatement("insert into tpafile(fileid,uid,name,fname,fsize,b1,b2,b3,fkey,date,status,akey) "
+"values(?,?,?,?,?,AES_ENCRYPT(?, 'key'),AES_ENCRYPT(?, 'key'),AES_ENCRYPT(?, 'key'),?,?,?,?)");
psmt1.setString(1,tfid);
psmt1.setString(2,u);
psmt1.setString(3,name);
psmt1.setString(4,f);
psmt1.setString(5,kbs);
fis=new FileInputStream(image);
psmt1.setBinaryStream(6, (InputStream)fis, (int)(image.length()));
fis=new FileInputStream(image1);
psmt1.setBinaryStream(7, (InputStream)fis, (int)(image1.length()));
fis=new FileInputStream(image2);
psmt1.setBinaryStream(8, (InputStream)fis, (int)(image2.length()));
psmt1.setString(9,fkey);
psmt1.setString(10,strDateNew1);
psmt1.setString(11,m);
psmt1.setString(12,akey);
psmt1.executeUpdate();
// Redirect only after the insert has succeeded, so a failure is not reported as success.
response.sendRedirect("tpa_home.jsp?message=success");
}
catch(Exception ex)
{
out.println("Error in connection : "+ex);
}
finally
{
// Release the statement whether or not the insert succeeded.
try{ if(psmt1!=null) psmt1.close(); }catch(SQLException ignore){}
}
%></body></html>
<div><label style="font-size:14px;font-weight:bold;">Username :</label>&nbsp;&nbsp;<input
style="width:100px;height:20px;font-size:12px;" type="text" id="username1"
name="username" /></div>
<div><label style="font-size:14px;font-weight:bold;">Password&nbsp;&nbsp;:</label>&nbsp;&nbsp;<input
style="width:100px;height:20px;font-size:12px;" type="password"
id="password1" name="password" /></div>
<input type="button" style="width:85px;height:25px;font-size:12px;"
onclick="call1()" value="Submit" />&nbsp;&nbsp;&nbsp;&nbsp;<input
type="button" style="width:85px;height:25px;font-size:12px;" value="Close"
onclick="$.unblockUI()" />
<div style="width:120px;height:25px;font-size:12px;"><a
href="Registration.jsp">REGISTER for FREE</a>&nbsp;&nbsp;&nbsp;&nbsp;<a
href="ForgetPassword.jsp">FORGET PASSWORD</a></div>
</div>
</body>
</html>
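
The upload page listed above reassigns one FileInputStream variable three times and never closes any of the streams. A minimal sketch of one way to read each block so that its stream is always closed is given below; the class BlockReader and the method readFully are hypothetical names, and the in-memory stream in main only stands in for a file stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads one block completely and always closes its stream.
final class BlockReader {

    static byte[] readFully(InputStream source) {
        // try-with-resources closes the stream even when reading fails.
        try (InputStream in = source;
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a file stream such as new FileInputStream(image).
        InputStream block = new ByteArrayInputStream(
                "block one".getBytes(StandardCharsets.UTF_8));
        byte[] bytes = readFully(block);
        System.out.println(bytes.length + " bytes read");
    }
}
```

The resulting byte array could then be bound to a parameter with PreparedStatement.setBytes in place of setBinaryStream, which would also avoid casting the file length to int.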

5.2. TESTING
A primary purpose of testing is to detect software failures so that defects may
be uncovered and corrected.

5.2.1. Testing Techniques

Software testing methods are traditionally divided into two types:

1. Black Box Testing

2. White Box Testing

Black Box Testing

It is also known as behavioral testing, in which the internal
structure/design/implementation of the item being tested is not known to the tester.

White Box Testing

It is also known as glass box testing, in which the internal
structure/design/implementation of the item being tested is known to the tester.

Types of Tests

1. Unit Testing
2. Integration Testing
3. System Testing
4. Acceptance Testing


Unit Testing

Unit testing focuses verification effort on the smallest unit of software design,
that is, the module. Using the procedural design description as a guide, important
control paths are tested to uncover errors within the boundaries of the module.
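
As an illustration of a unit-level check, the sketch below isolates the "dd/MM/yyyy" date formatting used in the upload page and tests it on its own. The class UploadDate and the hand-rolled check are assumptions made for illustration; a framework such as JUnit would normally host the test, and the locale is pinned here so that the result does not depend on the machine's settings.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical unit: wraps the date formatting step so it can be tested alone.
final class UploadDate {

    // Unit under test: same "dd/MM/yyyy" pattern as the upload page, locale pinned.
    static String format(Date when) {
        return new SimpleDateFormat("dd/MM/yyyy", Locale.ROOT).format(when);
    }

    // Hand-rolled unit check for one known date.
    static void testFormatsDayMonthYear() {
        Date fixed = new GregorianCalendar(2021, Calendar.MARCH, 7).getTime();
        String actual = format(fixed);
        if (!"07/03/2021".equals(actual)) {
            throw new AssertionError("expected 07/03/2021 but got " + actual);
        }
    }

    public static void main(String[] args) {
        testFormatsDayMonthYear();
        System.out.println("unit check passed");
    }
}
```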

Integration Testing

Integration testing is a systematic technique for constructing the program
structure while conducting tests to uncover errors associated with the interface.
The objective is to take unit-tested methods and build a program structure that
has been dictated by design.

System Testing

System testing is actually a series of different tests whose primary purpose is
to fully exercise the computer-based system. Although each test has a different
purpose, all work to verify that all system elements have been properly integrated
to perform allocated functions.

Acceptance Testing

It is a sub-part of system testing and a critical phase for any project.


5.2.2. TEST CASES

S. No. | Test Case         | Input                           | Expected Result                  | Actual Result                     | Status
1      | User Registration | Enter all fields                | User gets registered             | Registration is successful        | Pass
2      | User Registration | If user misses any field        | User not registered              | Registration is unsuccessful      | Fail
3      | Admin Login       | Give the user name and password | Admin home page should be opened | Admin home page has been opened   | Pass
4      | User Login        | Give username and password      | User page should be opened       | User page has been opened         | Pass
5      | User Login        | Give username without password  | User page should not be opened   | User name and password is invalid | Fail
6      | Upload file       | Add/select the file to upload   | Upload to the database           | Post upload successful            | Pass
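
Test cases 1 and 2 can be expressed as executable checks. The sketch below assumes that registration succeeds only when every field is non-blank; the class RegistrationCheck, the field names, and the sample values are hypothetical and are not taken from the project code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical check mirroring test cases 1 and 2 of the table.
final class RegistrationCheck {

    // Assumed rule: the form must have fields and every field must be non-blank.
    static boolean canRegister(Map<String, String> fields) {
        if (fields.isEmpty()) {
            return false;
        }
        for (String value : fields.values()) {
            if (value == null || value.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Builds a form with three illustrative fields.
    static Map<String, String> form(String name, String password, String email) {
        Map<String, String> f = new LinkedHashMap<>();
        f.put("name", name);
        f.put("password", password);
        f.put("email", email);
        return f;
    }

    public static void main(String[] args) {
        // Test case 1: all fields entered, so the user gets registered.
        System.out.println(canRegister(form("asha", "pw123", "asha@example.com")));
        // Test case 2: one field is missing, so the user is not registered.
        System.out.println(canRegister(form("asha", "", "asha@example.com")));
    }
}
```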


6. EXECUTION STEPS

1. Installation of Java:

• Go to http://www.oracle.com/technetwork/java/javase/downloads/index.html.
• Click on the JDK DOWNLOAD button, run the exe file, and then follow the
instructions given in the wizard.
• To set up the path:
o Right-click on My PC and then go to Properties.

Fig: properties wizard

o Go to advanced settings and then click on environment variables.
o Create a class path variable and copy into it the path of the Java folder
located in Program Files.


Fig: path setting for java

2. Installation and setup of Apache Tomcat:

• Go to http://tomcat.apache.org/index.html and click on download latest
versions.
• Run the exe file, click on next, and follow the wizard instructions.

Fig: Welcome Page of Tomcat


• Click on install with port number 8090 and with aits as both the username
and the password.
• Mention the connection port as 8090, then click on next, and finally
click on finish.

Fig: Tomcat Configuration Options Page

• Click on the I Agree button in the license agreement in order to accept the
terms and conditions.


Fig: Tomcat License Agreement

3. Installation and setup of MySQL:

• Go to http://dev.mysql.com/downloads/ and click on the install button.

• After completion of installation, click on the exe file and then click on
next.
• Run the MySQL setup, click on next, and follow the instructions in the
wizard.


Fig: Welcome wizard of MySQL

• Confirm the type as typical, then click on next and follow the
instructions.

Fig: SQL setup Wizard

• Now confirm the password as root in the system settings field and then
click on finish.


Fig: Database Configuration Engine

After completion of the installations:

1. Copy the project software's folder and paste it in
C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\webapps.
2. Copy the database.sql file to the D or E drive.
3. To check whether our software is properly set up, click on Start, go to the
MySQL command line prompt, and log in with root as the password.
4. Run the command source D:/database.sql to load the data and tables.
5. To see the tables in the database, type the command show tables;

Open any browser, type localhost:8090/normalization/ and press Enter; the
project home page will then be displayed.

1. The home page has a sidebar menu with admin, user, and publisher options.
2. Clicking admin opens the admin login page. If the username and password are
correct, the admin main page opens.
3. Here the admin sees a sidebar menu with options such as view all authorized
users, uploaded publications, duplicated and normalized publication records,
bookmark and publication search history, bookmark and publication frequency
ranking, and so on.
4. Click the menu for whichever data is to be shown; the data will then be
displayed.
5. Clicking view all duplicate records shows the list of duplicated publication
records.
6. Clicking view all normalized records shows the report of normalized records.
7. Clicking view all bookmarks shows the list of bookmarks.
8. Clicking view publication or bookmark frequency ranking shows the graph of
publication and bookmark records.
9. Clicking view all publication and bookmark search history shows the report
of publication and bookmark search history.
10. Next, clicking the user button opens a form to enter the name and the
password.
11. The user menu shows options such as view your profile, search bookmark and
publication, view bookmark and publication search history, and logout.
12. Click the option for whichever data is to be displayed; the data will then
be displayed.
13. Then, clicking the publisher button displays the publisher login page.
14. After the publisher details are entered, the publisher menu options are
displayed, such as add publication and bookmarks and view all bookmarks and
publications.
15. Clicking the view all bookmarks and publications button shows a report of
the various bookmarks and publications.
16. Clicking the add bookmarks and publication button shows fields such as
name, URL, venue, and so on to be filled in.
17. Finally, after all the above steps are done, we can log out from the web
page.


7.SYSTEM EXECUTION SCREEN SHOTS

1.Home Page

Screen 7.1: Home page of project


2.Admin menu

Screen 7.2: Admin Menu Page

3.View Duplicate Records


Screen 7.3: Report Showing List of Duplicated Publication Records


4.Normalized Records

Screen 7.4: Report showing Normalized Records


5.View Book Marks

Screen 7.5: Report Showing List of Book Marks

6.Graph of Publication Rank


Screen 7.6: Graph showing Publication Records

7.Publication Search History

Screen 7.7:Report Showing Publication Search History


8.View all publication frequency ranking

Screen 7.8: publication frequency ranking

9.User Login


Screen 7.9: Form for User Login

10.User Menu

Screen 7.10: Form for User Menu


11.Search Publication

Screen 7.11: Form to Search Publication Words

12.Search Publication Report

Screen 7.12: Report Showing Publication Report


13.Publisher Login

Screen 7.13: Publisher Login Page

14.Publisher Menu

Screen 7.14: Publisher Menu Page


15.View Book Marks

Screen 7.15: Report showing to View Various Book Marks


8. CONCLUSION

In this paper, we studied the problem of record normalization over a set of
matching records that refer to the same real-world entity. We presented three
levels of normalization granularity (record level, field level, and
value-component level) and two forms of normalization (typical normalization and
complete normalization). For each form of normalization, we proposed a
computational framework that includes both single-strategy and multi-strategy
approaches. We proposed four single-strategy approaches (frequency, length,
centroid, and feature-based) to select the normalized record or the normalized
field value. For the multi-strategy approach, we used result-merging models
inspired by metasearching to combine the results from a number of single
strategies. We analyzed record-level and field-level normalization in typical
normalization. In complete normalization, we focused on field values and proposed
algorithms for acronym expansion and value-component mining to produce much
improved normalized field values. We implemented a prototype and tested it on a
real-world dataset. The experimental results demonstrate the feasibility and
effectiveness of our approach. Our method outperforms the state of the art by a
significant margin.
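
To make the single-strategy and multi-strategy ideas concrete, the sketch below ranks the candidate values of one field by frequency and by length and then merges the two rankings. It is an illustration under stated assumptions and not the algorithm evaluated in this report: the class FieldNormalizer, the Borda-style point scheme used for merging, and the sample venue values are all hypothetical, and the centroid and feature-based strategies are left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of two single strategies and one merging rule.
final class FieldNormalizer {

    // Frequency strategy: distinct values, most frequent first (ties keep first-seen order).
    static List<String> rankByFrequency(List<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort(Comparator.comparing((String v) -> counts.get(v)).reversed());
        return ranked;
    }

    // Length strategy: distinct values, longest first (ties keep first-seen order).
    static List<String> rankByLength(List<String> values) {
        List<String> ranked = new ArrayList<>(new LinkedHashSet<>(values));
        ranked.sort(Comparator.comparingInt(String::length).reversed());
        return ranked;
    }

    // Assumed merging rule (Borda-style): position i in an n-item ranking earns n - i points.
    static String merge(List<List<String>> rankings) {
        Map<String, Integer> points = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            int n = ranking.size();
            for (int i = 0; i < n; i++) {
                points.merge(ranking.get(i), n - i, Integer::sum);
            }
        }
        String best = null;
        for (Map.Entry<String, Integer> e : points.entrySet()) {
            if (best == null || e.getValue() > points.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up values of a venue field taken from four matching records.
        List<String> venues = List.of("VLDB", "Very Large Data Bases", "VLDB", "Proc. VLDB");
        System.out.println(merge(List.of(rankByFrequency(venues), rankByLength(venues))));
    }
}
```

For these sample values the frequency strategy alone prefers "VLDB" and the length strategy alone prefers "Very Large Data Bases"; with the assumed point scheme the merged choice is "Very Large Data Bases".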


9. FUTURE ENHANCEMENTS

In the future, we plan to extend our research as follows. First, conduct
additional experiments using more diverse and larger datasets; the current lack of
appropriate datasets has made this difficult. Second, investigate how to add an
effective human-in-the-loop component to the current solution, as automated
solutions alone will not be able to achieve perfect accuracy. Third, develop
solutions that handle numeric or more complex values.



120 pages
com.ranger.cheat_logcat
No ratings yet
com.ranger.cheat_logcat
78 pages
CORDIS Paraplane
No ratings yet
CORDIS Paraplane
12 pages
Contact Center Questionnaire
100% (1)
Contact Center Questionnaire
4 pages
Data Link Layer: Net 221D: Computer Networks Fundamentals
No ratings yet
Data Link Layer: Net 221D: Computer Networks Fundamentals
58 pages
Clojure For Machine Learning: Chapter No. 1 "Working With Matrices"
100% (2)
Clojure For Machine Learning: Chapter No. 1 "Working With Matrices"
38 pages
Top 10 ABAP Dumps... : 1 Comentario
No ratings yet
Top 10 ABAP Dumps... : 1 Comentario
4 pages
Template Control Tower V7
No ratings yet
Template Control Tower V7
21 pages
TSV IPB100 200 Manuel
100% (1)
TSV IPB100 200 Manuel
86 pages
Itu-T Future Networks and Its New Identifiers
No ratings yet
Itu-T Future Networks and Its New Identifiers
6 pages
Configuracion
No ratings yet
Configuracion
18 pages