Normalization of Duplicate Records from Multiple Sources: Bachelor of Technology in Computer Science and Engineering

The document is a project report submitted by five students for their Bachelor of Technology degree. It discusses developing a system to normalize duplicate records from multiple sources. The system will consolidate duplicate information and generate normalized records. The report includes sections on existing systems, proposed system advantages, system requirements, design, coding, testing, execution steps, screenshots and conclusions. It was submitted under the guidance of a faculty member to fulfill degree requirements.

A Project Report

On

NORMALIZATION OF DUPLICATE
RECORDS FROM MULTIPLE SOURCES

Submitted in partial fulfillment for the award of the degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

Submitted By

P. JEEVANA (17HM1A0533)
A. DIVYA (17HM1A0502)
S. SHABAAZ (17HM1A0543)
S. ATHIF (17HM1A0542)
S. AYESHA SHAIK (17HM1A0508)

Under the esteemed guidance of


Mrs. G. SAVITRI, M.Tech.,
Assistant Professor,
Department of CSE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES
(Affiliated to J.N.T.U.A., Anantapur, Approved by A.I.C.T.E, New Delhi)
Utukur (P), C.K.Dinne (V&M), Kadapa-516003
ANDHRA PRADESH.
2017-2021
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES
(Affiliated to J.N.T.U. Anantapur, Approved by A.I.C.T.E, New Delhi)
Utukur (P), C.K.Dinne (V&M), Kadapa-516003

CERTIFICATE
This is to certify that the project work entitled "NORMALIZATION OF
DUPLICATE RECORDS FROM MULTIPLE SOURCES" is a bona fide
work done by

P. JEEVANA (17HM1A0533)
A. DIVYA (17HM1A0502)
S. SHABAAZ (17HM1A0543)
S. ATHIF (17HM1A0542)
S. AYESHA SHAIK (17HM1A0508)

in partial fulfillment of the requirement for the award of the degree of BACHELOR
OF TECHNOLOGY in COMPUTER SCIENCE AND ENGINEERING in
ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES,
KADAPA during the academic year 2017-2021. The results of this work have not
been submitted to any other university or institute for the award of any degree.

Project Guide
Mrs. G. SAVITRI, M.Tech.,
Assistant Professor,
Department of CSE,
A.I.T.S., Kadapa.

Head of the Department
Mr. C. VENKATASUBBIAH, M.Tech., (Ph.D.),
Assistant Professor,
Department of CSE,
A.I.T.S., Kadapa.

External Examiner
DECLARATION

We hereby declare that the project report entitled “NORMALIZATION OF
DUPLICATE RECORDS FROM MULTIPLE SOURCES” is a record of
project work carried out by us for the award of the degree of Bachelor of Technology
in Computer Science and Engineering. We also declare that this project is a result of
our own effort and has not been submitted earlier for the award of any degree or other
course, or to any other university.

PROJECT ASSOCIATES

P. JEEVANA (17HM1A0533)
A. DIVYA (17HM1A0502)
S. SHABAAZ (17HM1A0543)
S. ATHIF (17HM1A0542)
S. AYESHA SHAIK (17HM1A0508)
ACKNOWLEDGEMENT

We are deeply indebted to our supervisor Mrs. G. SAVITRI, ASSISTANT
PROFESSOR, DEPT. OF CSE, for her valuable guidance, constant encouragement,
constructive criticism and keen interest evinced throughout the course of our work.
We are really fortunate to be associated with such a helpful guide, who advised us
in every possible way, at all stages, for the successful completion of this project work.

We extend our gratefulness to our Coordinator, Mr. P. CHANDRA SEKHAR,
for his encouragement and support throughout the project.

We are extremely thankful to SRI C. VENKATA SUBBAIAH, HEAD OF
THE DEPARTMENT of Computer Science and Engineering, "Annamacharya
Institute of Technology & Sciences" for assisting us in completion of this project.

We express our gratitude to our principal Dr. A. SUDHAKARA REDDY and
the Management for providing all the facilities and for their support in completing our
project work successfully.

We express our heartfelt thanks to the entire faculty of the department
of CSE of Annamacharya Institute of Technology & Sciences, for their moral support
and good wishes.

Last, but not least, we are thankful to all the non-teaching staff
members of Computer Science & Engineering Department for their extended
co-operation.

PROJECT ASSOCIATES

P. JEEVANA (17HM1A0533)
A. DIVYA (17HM1A0502)
S. SHABAAZ (17HM1A0543)
S. ATHIF (17HM1A0542)
S. AYESHA SHAIK (17HM1A0508)
TABLE OF CONTENTS

CHAPTER.NO CHAPTER NAME PAGE.NO.

List of Figures i

List of Tables ii

List of Abbreviations iii

CHAPTER 1 INTRODUCTION 1

1.1 Project Objective 1

CHAPTER 2 SYSTEM ANALYSIS 2

2.1 Existing System 2

2.1.1. Disadvantages 2

2.2 Proposed System 2

2.2.1. Advantages 3

CHAPTER 3 SYSTEM REQUIREMENTS SPECIFICATION 4

3.1 Hardware Requirements 4

3.2 Software Requirements 4

3.3 Software Description 4

CHAPTER 4 SYSTEM DESIGN 8

4.1 Architecture Design 8


4.2 Modules Description 8

4.2.1. Record-level Normalization 8

4.2.2. Field-level Normalization 9

4.2.3. Page-level Normalization 9

4.3 Introduction to UML 9

4.4 UML Diagrams 10

CHAPTER 5 SYSTEM CODING & IMPLEMENTATION 21

5.1 Coding 21

5.2 Testing 28

5.2.1 Testing Techniques 28

5.2.2 Test Cases 30

CHAPTER 6 EXECUTION STEPS 31

CHAPTER 7 SYSTEM EXECUTION SCREENSHOTS 38

CHAPTER 8 CONCLUSION 46

CHAPTER 9 FUTURE ENHANCEMENT 47

CHAPTER 10 BIBLIOGRAPHY 48
LIST OF FIGURES

S No. Figure Name Fig No. Page No.

1. Working of Java Program 3.1 5

2. Implementation of Java Virtual Machine 3.2 6

3. Program Running on the Java Platform 3.3 7

4. Architecture Design 4.1 8

5. Class Diagram 4.2 11

6. Use Case Diagram 4.3 12

7. Sequence Diagram 4.4 13

8. Activity Diagram 4.5 14

9. Collaboration Diagram 4.6 14

10. Deployment Diagram 4.7 15

11. State Chart Diagram 4.8 15

12. Component Diagram 4.9 15

13. Class Diagram for Overall Project 4.10 16

14. Use Case Diagram for User 4.11 17

15. Sequence Diagram 4.12 18

16. Activity Diagram for Administrator 4.13 19

17. Data Flow Diagram for Home Page 4.14 20

18. Home Page of Project 6.1 38

19. Admin Menu Page 6.2 38

20. Report Showing List of Duplicated 6.3 39

21. Report showing Normalized Records 6.4 39

22. Report Showing List of Book Marks 6.5 40

23. Graph showing Publication Records 6.6 40

24. Report Showing Publication Search History 6.7 41

25. Publication Frequency Rank 6.8 41

26. Form for User login 6.9 42

27. Form for User Menu 6.10 42

28. Form to Search Publication Words 6.11 43

29. Report Showing Publication Report 6.12 43

30. Publisher Login Page 6.13 44

31. Publisher Menu Page 6.14 44

32. Report showing to View Various Book Marks 6.15 45

LIST OF TABLES

S No. Table Name Table No. Page No.

1. Test Cases 5.2.2 30

LIST OF ABBREVIATIONS

1. GUI Graphical User Interface
2. JFC Java Foundation Classes
3. JVM Java Virtual Machine
4. AWT Abstract Window Toolkit
5. API Application Programming Interface
6. JDK Java Development Kit
7. ODBC Open Database Connectivity
8. JDBC Java Database Connectivity
9. SQL Structured Query Language
10. OSI Open Systems Interconnection
11. IP Internet Protocol
12. SDK Software Development Kit
13. URL Uniform Resource Locator

Normalization Of Duplicate Records From Multiple Sources Introduction

1. INTRODUCTION

1.1. PROJECT OBJECTIVE

The Web has evolved into a data-rich repository containing a large amount of
structured content spread across millions of sources. The usefulness of Web data
increases exponentially (e.g., building knowledge bases, Web-scale data analytics)
when it is linked across numerous sources. Structured data on the Web resides in Web
databases and Web tables.

Web data integration is an important component of many applications
collecting data from Web databases, such as Web data warehousing (e.g., Google and
Bing Shopping; Google Scholar), data aggregation (e.g., product and service reviews),
and metasearching. Integration systems at Web scale need to automatically match
records from different sources that refer to the same real-world entity, find the true
matching records among them, and turn this set of records into a standard record for
the consumption of users or other applications. In this paper, we assume that the tasks
of record matching and truth discovery have been performed and that the groups of
true matching records have thus been identified. Our goal is to generate a uniform,
standard record for each group of true matching records for end-user consumption.

We call the generated record the normalized record. For example, in the
research publication domain, although the integrator website, such as Citeseer or
Google Scholar, contains records gathered from a variety of sources using automated
extraction techniques, it must display a normalized record to users. Otherwise, it is
unclear what can be presented to users: (i) present the entire group of matching
records or (ii) simply present some random record from the group, to just name a
couple of ad-hoc approaches. Either of these choices can lead to a frustrating
experience for a user, because in (i) the user needs to sort/browse through a
potentially large number of duplicate records, and in (ii) we run the risk of presenting
a record with missing or incorrect pieces of data.

A.I.T.S, KADAPA DEPARTMENT OF CSE Page 1



2. SYSTEM ANALYSIS
2.1. EXISTING SYSTEM
The problem of normalization of database records was first described by
Culotta et al. They provided the first attempt to formalize the record normalization
problem and proposed three solutions. The first solution uses string edit distance to
determine the most central record. The second solution optimizes the edit distance
parameters, and the third one describes a feature-based solution to improve
performance by means of a knowledge base. Their approach is an instance of typical
field value normalization. They did not consider value-component-level
normalization. In addition, their gold standard dataset has many instances of
unreasonable normalized records. Swoosh describes a record Merge operator;
however, the purpose of the operator is not to produce normalized records, but
rather to improve the ability to establish difficult record matchings. Wick et al.
propose a discriminatively-trained model to implement schema matching, coreference,
and normalization jointly, but the complexity of the model is greatly increased. Their
paper also contains no discussion of complete normalization at the value-component
level. Wang et al. propose a hybrid framework for product normalization in online
shopping by schema integration and data cleaning. Although their work mainly
focuses on record matching, they consider the problem of filling missing data and
repairing incorrect data, which is relevant to record normalization.
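
The first of the solutions described above, choosing the most central record by string edit distance, can be illustrated with a short Java sketch. This is only an illustration under stated assumptions: each record is treated as a single string, the distance is plain Levenshtein distance with unit costs, and the class and method names (CentralRecordSelector, editDistance, mostCentral) are hypothetical and do not come from the project source.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the project code.
class CentralRecordSelector {

    // Levenshtein distance with unit insert/delete/substitute costs,
    // computed row by row so that only two rows are kept in memory.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the record with the smallest total distance to all records
    // in the group; on a tie the earliest record in the list wins.
    static String mostCentral(List<String> records) {
        if (records.isEmpty()) {
            throw new IllegalArgumentException("group of records is empty");
        }
        String best = null;
        long bestTotal = Long.MAX_VALUE;
        for (String candidate : records) {
            long total = 0;
            for (String other : records) {
                total += editDistance(candidate, other);
            }
            if (total < bestTotal) {
                bestTotal = total;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("VLDB Journal", "VLDB Journal", "VLDB J.");
        System.out.println(mostCentral(group)); // prints "VLDB Journal"
    }
}
```

Because the central record is always one of the existing records, this kind of selection cannot repair a value that is wrong or abbreviated in every record of the group.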

2.1.1. Disadvantages

• The existing system uses only field-level normalization.

• There is no integration system at Web scale that automatically
matches records from different sources that refer to the same real-world entity.

2.2. PROPOSED SYSTEM


In this paper, we assume that the tasks of record matching and truth discovery
have been performed and that the groups of true matching records have thus been
identified. Our goal is to generate a uniform, standard record for each group of true
matching records for end-user consumption. The system calls the generated record the
normalized record. We call the problem of computing the normalized record for a
group of matching records the record normalization problem (RNP), and it is the
focus of this work. RNP is another specific interesting problem in data fusion. The
system proposes three levels of granularity for record normalization along with
methods to construct normalized records according to them. The system proposes a
comprehensive framework for systematic construction of normalized records. Our
framework is flexible and allows new strategies to be added with ease. To our
knowledge, this is the first piece of work to propose such a detailed framework. The
system proposes and compares a range of normalization strategies, from frequency,
length, centroid and feature-based to more complex ones that utilize result merging
models from information retrieval, such as (weighted) Borda. The system introduces a
number of heuristic rules to mine desirable value components from a field. We use
them to construct the normalized value for the field. The system performs empirical
studies on publication records. The experimental results show that the proposed
weighted-Borda-based approach significantly outperforms the baseline approaches.
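
To make the idea of merging ranked lists concrete, the following Java sketch shows one common form of a weighted Borda count. It is a minimal illustration under assumptions, not the implementation used in the project: each strategy (for example frequency, length or centroid) is assumed to return the candidate values ordered best-first, the top value of an n-item list earns n points, and the class name WeightedBorda, the method merge and the weights are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the project code.
class WeightedBorda {

    // rankings.get(k) holds the candidate values ordered best-first by
    // strategy k, and weights[k] is the (assumed) weight of that strategy.
    static List<String> merge(List<List<String>> rankings, double[] weights) {
        if (rankings.size() != weights.length) {
            throw new IllegalArgumentException("one weight per ranking is required");
        }
        Map<String, Double> score = new LinkedHashMap<>();
        for (int k = 0; k < rankings.size(); k++) {
            List<String> ranking = rankings.get(k);
            int n = ranking.size();
            for (int pos = 0; pos < n; pos++) {
                // The top item of an n-item list earns n points, the last earns 1.
                score.merge(ranking.get(pos), weights[k] * (n - pos), Double::sum);
            }
        }
        List<String> merged = new ArrayList<>(score.keySet());
        // List.sort is stable, so tied candidates keep their first-seen order.
        merged.sort((x, y) -> Double.compare(score.get(y), score.get(x)));
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> rankings = Arrays.asList(
                Arrays.asList("VLDB", "Proc. VLDB", "Very Large Data Bases"),  // by frequency
                Arrays.asList("Very Large Data Bases", "Proc. VLDB", "VLDB"),  // by length
                Arrays.asList("Proc. VLDB", "VLDB", "Very Large Data Bases")); // by centroid
        double[] weights = {1.0, 1.0, 1.0};
        // The first element of the merged list is the suggested normalized value.
        System.out.println(merge(rankings, weights));
    }
}
```

With equal weights this reduces to the plain Borda count; raising the weight of one strategy lets its ranking dominate the merged result.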

2.2.1. Advantages

• The system is very fast because it identifies three levels of normalization
granularity: record, field, and value component.

• Exact duplicate records are detected by mining template collocation-sub-collocation
pairs.


3. SYSTEM REQUIREMENTS SPECIFICATION

3.1. HARDWARE REQUIREMENTS

Processor : Intel Core (i3, i5, ..)
Speed : 1.2 GHz or above
RAM : 2 GB (min)
Hard Disk : 10 GB of space (min)

3.2. SOFTWARE REQUIREMENTS

Operating System : Windows XP/7/8
Front End : JSP
Database : MySQL
Programming language : Java (J2EE)
Web Server : Tomcat

3.3. SOFTWARE DESCRIPTION

Java Technology

Java technology is both a programming language and a platform.

The Java Programming Language:

The Java programming language is a high-level language that can be
characterized by all of the following buzzwords:

• Simple

• Architecture neutral

• Object oriented

• Portable

• Distributed

• High performance

• Interpreted

• Multithreaded

• Robust

• Dynamic

• Secure

With most programming languages, you either compile or interpret a program
so that you can run it on your computer. The Java programming language is unusual
in that a program is both compiled and interpreted. With the compiler, first you
translate a program into an intermediate language called Java byte codes, the
platform-independent codes interpreted by the interpreter on the Java platform. The
interpreter parses and runs each Java byte code instruction on the computer.
Compilation happens just once; interpretation occurs each time the program is
executed. The following figure illustrates how this works.

Fig 3.1: Working of Java Program

You can think of Java byte codes as the machine code instructions for the Java
Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool
or a Web browser that can run applets, is an implementation of the Java VM. Java
byte codes help make “write once, run anywhere” possible. You can compile your
program into byte codes on any platform that has a Java compiler. The byte codes can
then be run on any implementation of the Java VM. That means that as long as a
computer has a Java VM, the same program written in the Java programming
language can run on Windows 2000, a Solaris workstation, or on an iMac.

Fig 3.2: Implementation of Java Virtual Machine

The Java Platform

A platform is the hardware or software environment in which a program runs.
We’ve already mentioned some of the most popular platforms like Windows 2000,
Linux, Solaris, and MacOS. Most platforms can be described as a combination of the
operating system and hardware. The Java platform differs from most other platforms
in that it’s a software-only platform that runs on top of other hardware-based
platforms.

The Java platform has two components:

• The Java Virtual Machine (Java VM)

• The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java
platform and is ported onto various hardware-based platforms.


Fig 3.3: Program Running on the Java Platform

Native code is code that, once compiled, runs on a specific hardware platform.
As a platform-independent environment, the Java
platform can be a bit slower than native code. However, smart compilers, well-tuned
interpreters, and just-in-time byte code compilers can bring performance close to that
of native code without threatening portability.

Feasibility Study

Technical Feasibility

The GUI is developed using HTML to capture information from the customer.
HTML is used to display content in the browser; pages are delivered over the TCP/IP
protocol suite and interpreted by the browser. It is very easy to develop a
page or document using HTML, and some RAD (Rapid Application Development) tools
are available to design and develop our application quickly. Many objects, such as
buttons, text fields, and text areas, are provided to capture information from the
customer.

Economical Feasibility
The question that usually arises during the economic feasibility stage is
whether the system will be used once it is developed and implemented. The system
reduces the workload. Keeping the class of application in view, the cost of hardware
and software is considered economically feasible.

Operational Feasibility

In our application, the front end is developed as a GUI, so it is very easy for the
customer to enter the necessary information. However, the customer must have some
knowledge of using web applications before using our application.


4. SYSTEM DESIGN
4.1. ARCHITECTURE DESIGN

Fig 4.1: Architecture Design


Our goal is to generate a uniform, standard record for each group of true matching
records for end-user consumption. We call the generated record the normalized
record. We call the problem of computing the normalized record for a group of
matching records the record normalization problem (RNP), and it is the focus of this
work. RNP is another specific interesting problem in data fusion. Record
normalization is important in many application domains. For example, in the research
publication domain, although the integrator website, such as Citeseer or Google
Scholar, contains records gathered from a variety of sources using automated
extraction techniques, it must display a normalized record to users.

4.2. MODULES DESCRIPTION


4.2.1 Record-level Normalization

Record-level normalization selects one of the records in Re, the group of matching
records for an entity e, as the normalized record; that is, it assumes that the group
contains a record fit to serve as the normalized record. This assumption, while
intuitively appealing and useful for building the theoretical underpinnings for
constructing normalized records, needs to be taken with a grain of salt in practice:
Re contains a mixture of candidate normalized records and records with incomplete or
arcane representations of e, which may be difficult for ordinary users to understand.


4.2.2 Field-level Normalization

Field-level normalization selects a normalized value for each field fi independently
and concatenates the selected values of all fields into a normalized record. The
normalized value for the field fi is one of the values that appear among the records in
Re in the field fi, and it is selected according to some criterion (e.g., more descriptive).
The normalized record formed in this way may consist of field values from different
records.
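
The per-field selection described above can be sketched in Java as follows. This is a minimal illustration under assumptions rather than the project's actual code: a record is modelled as a map from field name to value, the selection criterion is "most frequent non-empty value, with the longer value winning a tie" (standing in for "more descriptive"), and the names FieldLevelNormalizer and normalize are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the project code.
class FieldLevelNormalizer {

    // Chooses a value for every field independently and assembles the
    // chosen values into one normalized record.
    static Map<String, String> normalize(List<Map<String, String>> records,
                                         List<String> fields) {
        Map<String, String> normalized = new LinkedHashMap<>();
        for (String field : fields) {
            // Count how often each non-empty value occurs in this field.
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (Map<String, String> record : records) {
                String value = record.get(field);
                if (value != null && !value.trim().isEmpty()) {
                    counts.merge(value.trim(), 1, Integer::sum);
                }
            }
            // Assumed criterion: highest count first, longer value on a tie.
            String best = null;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (best == null
                        || e.getValue() > counts.get(best)
                        || (e.getValue().equals(counts.get(best))
                            && e.getKey().length() > best.length())) {
                    best = e.getKey();
                }
            }
            if (best != null) {
                normalized.put(field, best);
            }
        }
        return normalized;
    }

    // Small convenience method used only by the demo below.
    static Map<String, String> record(String title, String venue) {
        Map<String, String> r = new HashMap<>();
        r.put("title", title);
        r.put("venue", venue);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, String>> group = Arrays.asList(
                record("Data Fusion", "VLDB"),
                record("Data Fusion", "Very Large Data Bases"),
                record("Data fusion.", ""));
        // The title comes from the majority value; the venue tie is broken by length.
        System.out.println(normalize(group, Arrays.asList("title", "venue")));
    }
}
```

In this example the normalized record takes its title and its venue from different source records, which is exactly the situation the paragraph above describes.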

4.2.3 Page-level Normalization

The complete form of normalization works on the values of an entire web page.

In the following sections, we give the details of our key techniques:

(1) ranking-based strategies

(2) value component mining

(3) ranked list merging.

4.3. INTRODUCTION TO UML

The Unified Modeling Language allows the software engineer to express an
analysis model using a modeling notation that is governed by a set of syntactic,
semantic and pragmatic rules. A UML system is represented using five different
views that describe the system from distinctly different perspectives.

UML is specifically constructed through two different domains:

• UML analysis modeling, which focuses on the user model and structural
model views of the system.

• UML design modeling, which focuses on the behavioral modeling,
implementation modeling and environmental model views.


4.4. UML DIAGRAMS

Why do we have to use UML in projects?

As the strategic value of software increases for many companies, the industry
looks for techniques to automate the production of software and to improve quality
and reduce cost and time-to-market. These techniques include component technology,
visual programming, patterns and frameworks. Businesses also seek techniques to
manage the complexity of systems as they increase in scope and scale. In particular,
they recognize the need to solve recurring architectural problems, such as physical
distribution, concurrency, replication, security, load balancing and fault tolerance.
Additionally, the development for the World Wide Web, while making some things
simpler, has exacerbated these architectural problems. The Unified Modeling
Language (UML) was designed to respond to these needs. Simply, Systems design
refers to the process of defining the architecture, components, modules, interfaces,
and data for a system to satisfy specified requirements which can be done easily
through UML diagrams.

In the project, eight basic UML diagrams have been explained:

• Class Diagram

• Use Case Diagram

• Sequence Diagram

• Activity Diagram

• Collaboration Diagram

• Deployment Diagram

• State Chart Diagram

• Component Diagram

Class Diagram

In software engineering, a class diagram in the Unified Modeling Language
(UML) is a type of static structure diagram that describes the structure of a system by
showing the system's classes, their attributes, and the relationships between the
classes. This is one of the most important diagrams in development. The
diagram breaks the class into three layers. One has the name, the second describes its
attributes and the third its methods. A padlock to the left of a name represents a
private attribute. The relationships are drawn between the classes. Developers use the
class diagram to develop the classes. Analysts use it to show the details of the
system. Architects look at class diagrams to see whether any class has too many
functions and needs to be split.

Fig 4.2: Class Diagram

Use Case Diagram

In software engineering, a use case diagram in the Unified Modeling
Language (UML) is a type of behavioral diagram defined by and created from a
use-case analysis. Its purpose is to present a graphical overview of the functionality
provided by a system in terms of actors, their goals (represented as use cases), and
any dependencies between those use cases. The main purpose of a use case diagram is
to show what system functions are performed for which actor. Roles of the actors in
the system can be depicted. Use cases are used during requirements elicitation and
analysis to represent the functionality of the system. Use cases focus on the behavior
of the system from the external point of view. The actors are outside the boundary of
the system, whereas the use cases are inside the boundary of the system.


Fig 4.3: Use Case Diagram

Sequence Diagram

A sequence diagram in the Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is a
construct of a Message Sequence Chart. Sequence diagrams are sometimes called
event-trace diagrams, event scenarios, and timing diagrams.

• OBJECT: Objects are typically named or anonymous instances of a class, but
may also represent instances of other things such as components, collaborations
and nodes.

• LINK: A link is a semantic connection among objects; that is, an instance of an
association is called a link.

• LIFELINE: A lifeline is a vertical dashed line that represents the lifetime of an
object.

• FOCUS OF CONTROL: A focus of control is a tall, thin rectangle that shows
the period of time during which an object is performing an action.

• MESSAGES: A message is a specification of a communication between
objects that conveys information with the expectation that activity will
ensue.


Fig 4.4: Sequence Diagram

Activity Diagram

Activity diagrams are a loosely defined diagram technique for showing
workflows of stepwise activities and actions, with support for choice, iteration and
concurrency. In the Unified Modeling Language, activity diagrams can be used to
describe the business and operational step-by-step workflows of components in a
system. An activity diagram shows the overall flow of control.

• ACTIVITY STATE: An activity state is a kind of state in an activity diagram;
it shows an ongoing non-atomic execution within a state machine. An activity
state can be further decomposed.

• ACTION STATE: Action states are states of the system, each representing
the execution of an action. An action state cannot be further decomposed.

• TRANSITION: A transition specifies the path from one action or activity
state to the next action or activity state. The transition is rendered as a simple
directed line.

• OBJECT: An object is a concrete manifestation of an abstraction; an entity
with a well-defined boundary and identity that encapsulates state and behavior;
an instance of a class. Objects may be involved in the flow of control
associated with an activity diagram.


Fig 4.5: Activity Diagram

Collaboration Diagram

A communication diagram (called a collaboration diagram in UML 1.x) models the
interactions between objects or parts in terms of sequenced messages. Communication
diagrams represent a combination of information taken from Class, Sequence, and Use
Case Diagrams, describing both the static structure and dynamic behavior of a system.

Fig 4.6: Collaboration Diagram

Deployment Diagram

A deployment diagram in the Unified Modeling Language models the physical
deployment of artifacts on nodes. To describe a web site, for example, a deployment
diagram would show what hardware components ("nodes") exist (e.g., a web server,
an application server, and a database server), what software components ("artifacts")
run on each node (e.g., web application, database), and how the different pieces are
connected (e.g., JDBC, REST).

Fig 4.7: Deployment Diagram

State Chart Diagram

A state diagram is a type of diagram used in computer science and related
fields to describe the behavior of systems. State diagrams require that the system
described is composed of a finite number of states; sometimes this is indeed the case,
while at other times this is a reasonable abstraction. Many forms of state diagrams
exist, which differ slightly and have different semantics.

Fig 4.8: State Chart Diagram

Component Diagram

In the Unified Modeling Language, a component diagram depicts how
components are wired together to form larger components and/or software systems.
They are used to illustrate the structure of arbitrarily complex systems.

Fig 4.9: Component Diagram


The following are the UML diagrams that are discussed under this project:

- Use case diagram

- Class diagram

- Activity diagram

- Sequence diagram

CLASS DIAGRAM

In our class diagram, the classes are DataUser, Public-Cloud and Private-Cloud. The
attributes of the data user are user id, password and files. The operations of this
class are login, register, upload, duplicate check and logout. The data
user is related to the public cloud through a many-to-one relationship; the attributes
of the public cloud are userid, password and file storage, and the operations performed in
this class include login, encrypt, decrypt, duplicate and logout. In the same way, when
the user belongs to the private cloud, authorized users have permission to
access or modify the data.

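[Class diagram: DataUser (many) to Public-Cloud (one); DataUser (many) to Private-Cloud (one)]

DataUser
Attributes: userid, password, files, fileid, fileblocks
Operations: login(), register(), upload(), duplicatecheck(), encrypt(), decrypt(), downoad(), logout()

Public-Cloud
Attributes: userid, password, filestorage, files
Operations: login(), storefiles(), encrypt(), decrypt(), duplicate(), logout()

Private-Cloud
Attributes: userid, password, files, rights, ownername, permissions
Operations: login(), activiation(), permissions(), logout()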

Fig 4.10: Class Diagram for Overall Project


USE CASE DIAGRAM

A use case diagram shows a set of use cases and actors and their relationships.
These diagrams are especially important in organizing and modeling the behaviors
of a system. The actors in our use case diagram are the user and the public cloud. First,
the user registers on the webpage and, after completing the registration process,
logs in to the webpage, uploads the necessary files, checks whether duplicates
are present, and then logs out. Second, the (public cloud) admin logs in to the
webpage and checks for duplication in the list of files. If duplicates
are found, the admin removes the repeated data by using the normalization concepts.

[Use case diagram. Actors: User, Public Cloud. Use cases: Register, Login, Upload File,
Download File, Check for Duplicate, List of Files, Logout.]

Fig. 4.11: Use Case Diagram for User


SEQUENCE DIAGRAM

Generally, objects are anonymous instances of a class. In this sequence diagram,
the objects are named Owner, Login, Received Permission from admin,
File Upload, View User Details, Receive File and Attribute. First, the owner logs in to the
web page by entering the userid and password; the admin then verifies whether the
user who entered the details is authorized and, if so, gives
permission to upload the data. Further, if users want to
view (search for) any information they need, the cloud removes the redundant
data and sends the standard information.

[Sequence diagram: lifelines Owner, Login, Received Permission from Admin, File Upload, View User Details, Receive File, Attribute; messages in order: uid,pwd; verify; receive permission; file upload; view user details; receive file from cloud; change key]

Fig. 4.12: Sequence Diagram

ACTIVITY DIAGRAM

An activity diagram shows the flow from activity to activity and emphasizes the
dynamic view of a system. It consists of activity states, action states, transitions,
and objects. In this diagram, at the starting point (pre-condition) the user enters
the user name and password. After the details are submitted, the admin decides
whether the user has entered valid data. If the details are correct, the user is
accepted as an authorized person who may use the web page; if the data is not
valid, the user is rejected and the flow reaches the ending point (post-condition).

Fig. 4.13: Activity Diagram for Administrator
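
The decision step of the activity diagram can be sketched in Java as follows. This is an illustration under stated assumptions, not the project's login code: the class LoginActivity, the enum Outcome and the in-memory account map are hypothetical, and passwords are compared in plain text only to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the accept/reject decision in the activity diagram.
final class LoginActivity {
    enum Outcome { ACCEPTED, REJECTED }

    // Registered accounts: user id -> password (plain text for brevity only;
    // a real system would store a salted hash instead).
    private final Map<String, String> accounts = new HashMap<>();

    void register(String userId, String password) {
        accounts.put(userId, password);
    }

    // Pre-condition: the user has submitted a user id and a password.
    // Post-condition: the user is either accepted or rejected.
    Outcome submit(String userId, String password) {
        if (userId == null || password == null || password.isEmpty()) {
            return Outcome.REJECTED; // missing input is never valid
        }
        String expected = accounts.get(userId);
        return password.equals(expected) ? Outcome.ACCEPTED : Outcome.REJECTED;
    }

    public static void main(String[] args) {
        LoginActivity login = new LoginActivity();
        login.register("user1", "secret");
        System.out.println(login.submit("user1", "secret"));
        System.out.println(login.submit("user1", "wrong"));
    }
}
```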

DATA FLOW DIAGRAMS

Data flow analysis studies the use of data in each activity and documents these
details in data flow diagrams.

A data flow diagram is a logical model of a system. The model does not depend on
the hardware, software, or data structures of the organization. A data flow diagram
carries no physical implication, because it is a graphic picture of the logical
system; this makes it easy for non-technical users to understand, so it serves as an
excellent communication tool. Finally, a data flow diagram is a good starting point
for system design. It is constructed from a small set of basic symbols.


The basic symbols of data flow diagrams are:

DFD NOTATIONS

• Defines the source or destination of data.
• Shows the path of the data flow.
• Represents a process that transforms or modifies the data.
• Represents an attribute.
• Data store.

[Data flow diagram: Open Login Form -> Enter Username, Password -> Check Username, Password against the Login Master data store; yes -> User Home Page; no -> Validation Data]

Fig 4.14: Data Flow Diagram for Home Page


5. SYSTEM CODING AND IMPLEMENTATION


5.1. CODING

LOGIN.JSP

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--
zenlike1.0 by nodethirtythree design
http://www.nodethirtythree.com
-->
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
<title>Multi Cloud</title>
<style type="text/css">
.b:hover{
border-width:3px;
border-color:red;
}
.big:hover
{
color:red;

font-weight:bold;
}
.b1
{
background-color: #color;
border-bottom:solid;
border-left: #FFEEEE;


border-right:solid;
border-top: #EEEEEE;
color: brown;
font-family: Verdana, Arial
}
</style>
<meta name="keywords" content="" />
<meta name="description" content="" />
<link rel="stylesheet" type="text/css" href="default.css" />
</head>
<body>
<div id="upbg"></div>
<div id="outer">
<div id="header">
<div id="headercontent">
<h1></h1>
</div>
</div>
<div id="headerpic"></div>
<div id="menu">
<!-- HINT: Set the class of any menu link below to "active" to make it appear
active -->
<ul>
<li><a href="#" class="active">Home</a></li>
<li><a href="user_log.jsp" >User</a></li>
<li><a href="signup.jsp">Sign up</a></li>
<li><a href="server_log.jsp">Admin</a></li>
</ul>
</div>
<div id="menubottom"></div>


<div id="content">
<!-- Normal content: Stuff that's not going to be put in the left or right column. -->
<!-- Primary content: Stuff that goes in the primary content column (by default, the
left column) -->
<div id="primarycontainer">
<div id="primarycontent">
<!-- Primary content area start -->
<div class="post">
<p><strong><em><font color="#990000" size="+1" face="Verdana, Arial,
Helvetica, sans-serif">Architecture</font></em></strong>
<br/>
<br/>
<img src="images/archi.bmp" width="700" height="400" alt="Architecture" /></p>
</div>
<!-- Primary content area end -->
</div>
</div>
<br>
<br>
</div>
</div>
<div id="footer"><strong><font color="#990033" face="Geneva, Arial, Helvetica,
sans-serif"></font></strong>
<!--<div class="right">Design by <a
href="http://www.nodethirtythree.com/">NodeThirtyThree
Design</a></div>-->
</div>
</body>
</html>

UPLOAD.JSP


<%@page import="java.sql.*,java.lang.*,databaseconnection.*,databaseconnection1.*,databaseconnection2.*,databaseconnection3.*,java.text.SimpleDateFormat,java.util.*,java.io.*,javax.servlet.*,javax.servlet.http.*" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>multi cloud</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<script type="text/javascript">
</script>
</head>
<body>
<%
java.util.Date now = new java.util.Date();
String DATE_FORMAT1 = "dd/MM/yyyy";
SimpleDateFormat sdf1 = new SimpleDateFormat(DATE_FORMAT1);
String strDateNew1 = sdf1.format(now);
String a="D:\\multi-cloud\\temp\\file1.txt";
String b="D:\\multi-cloud\\temp\\file2.txt";
String c="D:\\multi-cloud\\temp\\file3.txt";
FileInputStream fis=null;
File image=new File(a);
File image1=new File(b);
File image2=new File(c);
//String m="on process";
String ser=(String)session.getAttribute("ser");
String u=(String)session.getAttribute("u");
String name=(String)session.getAttribute("name");
String f=(String)session.getAttribute("f");
String kbs=(String)session.getAttribute("kbs");

String tfid=(String)session.getAttribute("tfid");
String fkey=(String)session.getAttribute("fkey");
String akey=(String)session.getAttribute("akey");
String m="not_verified";
String x="s1";
String y="s2";
String z="s3";
Connection con=null,con1=null,con2=null;
PreparedStatement psmt1=null,psmt2=null,psmt3=null;
try{
con=databasecon.getconnection();
// The SQL text is concatenated because a Java string literal cannot span lines.
psmt1=con.prepareStatement("insert into "
+"tpafile(fileid,uid,name,fname,fsize,b1,b2,b3,fkey,date,status,akey) "
+"values(?,?,?,?,?,AES_ENCRYPT(?, 'key'),AES_ENCRYPT(?, 'key'),"
+"AES_ENCRYPT(?, 'key'),?,?,?,?)");
psmt1.setString(1,tfid);
psmt1.setString(2,u);
psmt1.setString(3,name);
psmt1.setString(4,f);
psmt1.setString(5,kbs);
fis=new FileInputStream(image);
psmt1.setBinaryStream(6, (InputStream)fis, (int)(image.length()));
fis=new FileInputStream(image1);
psmt1.setBinaryStream(7, (InputStream)fis, (int)(image1.length()));
fis=new FileInputStream(image2);
psmt1.setBinaryStream(8, (InputStream)fis, (int)(image2.length()));
psmt1.setString(9,fkey);
psmt1.setString(10,strDateNew1);
psmt1.setString(11,m);
psmt1.setString(12,akey);
psmt1.executeUpdate();
// Redirect with the success message only after the insert has succeeded.
response.sendRedirect("tpa_home.jsp?message=success");
}
catch(Exception ex)
{
out.println("Error in connection : "+ex);
}
%>
</body>
</html>
<div><label style="font-size:14px;font-weight:bold;">Username&nbsp;:</label>&nbsp;&nbsp;<input
style="width:100px;height:20px;font-size:12px;" type="text" id="username1"
name="username" /></div>
<div><label style="font-size:14px;font-weight:bold;">Password&nbsp;&nbsp;:</label>&nbsp;&nbsp;<input
style="width:100px;height:20px;font-size:12px;" type="password"
id="password1" name="password" /></div>

<input type="button" style="width:85px;height:25px;font-size:12px;"
onclick="call1()" value="Submit" />&nbsp;&nbsp;&nbsp;&nbsp;<input
type="button" style="width:85px;height:25px;font-size:12px;" value="Close"
onclick="$.unblockUI()" />
<div style="width:120px;height:25px;font-size:12px;"><a
href="Registration.jsp">REGISTER&nbsp;for&nbsp;FREE</a>&nbsp;&nbsp;
&nbsp;&nbsp;<a
href="ForgetPassword.jsp">FORGET&nbsp;PASSWORD</a></div>
</div>
</body>
</html>
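
The INSERT statement in UPLOAD.JSP stores three columns, b1, b2 and b3, read from three temporary files; the code that produces those files is not included in this report. Assuming the uploaded file is cut into three contiguous blocks of nearly equal size, the split and the reverse join could look like the sketch below. The class BlockSplitter and its methods are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical sketch: cuts file content into three contiguous blocks and
// joins them back; the real project may split files differently.
final class BlockSplitter {
    private static final int BLOCKS = 3;

    // Splits the data into three blocks whose lengths differ by at most one.
    static byte[][] split(byte[] data) {
        byte[][] blocks = new byte[BLOCKS][];
        int base = data.length / BLOCKS;
        int extra = data.length % BLOCKS; // the first 'extra' blocks get one more byte
        int start = 0;
        for (int i = 0; i < BLOCKS; i++) {
            int len = base + (i < extra ? 1 : 0);
            blocks[i] = Arrays.copyOfRange(data, start, start + len);
            start += len;
        }
        return blocks;
    }

    // Concatenates the blocks in order to restore the original content.
    static byte[] join(byte[][] blocks) {
        int total = 0;
        for (byte[] block : blocks) {
            total += block.length;
        }
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] block : blocks) {
            System.arraycopy(block, 0, out, pos, block.length);
            pos += block.length;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "0123456789".getBytes();
        byte[][] blocks = split(data);
        System.out.println(blocks[0].length + " " + blocks[1].length + " " + blocks[2].length);
        System.out.println(Arrays.equals(data, join(blocks)));
    }
}
```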

5.2. TESTING
A primary purpose of testing is to detect software failures so that defects may
be discovered and corrected.

5.2.1. Testing Techniques

Software testing methods are traditionally divided into two types:

1. Black Box Testing

2. White Box Testing

Black Box Testing

It is also known as behavioral testing, in which the internal
structure/design/implementation of the item being tested is not known to the tester.

White Box Testing

It is also known as glass box testing, in which the internal
structure/design/implementation of the item being tested is known to the tester.

Types of Tests

1. Unit Testing
2. Integration Testing
3. System Testing
4. Acceptance Testing


Unit Testing

Unit testing focuses verification effort on the smallest unit of software design,
that is, the module. Using the procedural design description as a guide, important
control paths are tested to uncover errors within the boundaries of the module.

Integration Testing

Integration testing is a systematic technique for constructing the program
structure while conducting tests to uncover errors associated with the interfaces.
The objective is to take unit-tested modules and build the program structure that
has been dictated by the design.

System Testing

System testing is actually a series of different tests whose primary purpose is
to fully exercise the computer-based system. Although each test has a different
purpose, all work to verify that all system elements have been properly integrated to
perform allocated functions.

Acceptance Testing

It is a subpart of system testing and is a critical phase for any project.


5.2.2. TEST CASES

S. No.  TEST CASES         INPUT                            EXPECTED RESULT                   ACTUAL RESULT                      STATUS
1       User Registration  Enter all fields                 User gets registered              Registration is successful         Pass
2       User Registration  If user misses any field         User not registered               Registration is unsuccessful       Fail
3       Admin Login        Give the user name and password  Admin home page should be opened  Admin home page has been opened    Pass
4       User Login         Give username and password       User page should be opened        User page has been opened          Pass
5       User Login         Give username without password   User page should not be opened    User name and password is invalid  Fail
6       Upload file        Add/select the file to upload    Upload to the database            Post uploaded successfully         Pass
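
Test cases 1 and 2 above can also be expressed as executable checks. The sketch below is only an illustration: the class RegistrationValidator and the field names are assumptions and are not taken from the project code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical validator used only to illustrate test cases 1 and 2; the
// field names below are assumptions, not taken from the project code.
final class RegistrationValidator {
    // Registration is accepted only when at least one field exists and
    // every field has a non-blank value.
    static boolean isComplete(Map<String, String> fields) {
        if (fields == null || fields.isEmpty()) {
            return false;
        }
        for (String value : fields.values()) {
            if (value == null || value.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> form = new LinkedHashMap<>();
        form.put("username", "user1");
        form.put("password", "secret");
        form.put("email", "user1@example.com");
        System.out.println("Case 1 registered: " + isComplete(form));

        form.put("email", ""); // the user misses one field
        System.out.println("Case 2 registered: " + isComplete(form));
    }
}
```

Checks of this kind can be run automatically after every change, whereas the table above records manual runs.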


6. EXECUTION STEPS

1. Installation of Java:

• Go to http://www.oracle.com/technetwork/java/javase/downloads/index.html.
• Click on the JDK DOWNLOAD button. Run the exe file and then follow the
instructions given in the wizard.
• To set up the path:
o Right-click on My PC and then go to Properties.

Fig: properties wizard


o Go to Advanced settings and then click on Environment Variables.
o Create a class path variable and copy into it the path of the Java folder
located in Program Files.


Fig: path setting for java

2. Installation and setup of Apache Tomcat:

• Go to http://tomcat.apache.org/index.html and click on the download link for
the latest version.
• Run the exe file, click on Next and follow the wizard instructions.

Fig: Welcome Page of Tomcat

• Click on Install with port number 8090 and with aits as both the username
and the password.
• Set the connection port to 8090, then click on Next and finally click on
Finish.

Fig: Tomcat Configuration Options Page

• Click on the I Agree button in the license agreement in order to accept the
terms and conditions.


Fig: Tomcat License Agreement

3. Installation and setup of MySQL:

• Go to http://dev.mysql.com/downloads/ and click on the install button.
• After the download completes, click on the exe file and then click on Next.
• Run the MySQL setup, click on Next and follow the instructions in the
wizard.

Fig: Welcome wizard of MySQL

• Confirm the setup type as Typical, then click on Next and follow the
instructions.

Fig: SQL setup Wizard

• Now confirm the password as root in the system settings field and then
click on Finish.


Fig: Database Configuration Engine

After completion of the installations:

1. Copy the project software folder and paste it into
C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\webapps.
2. Copy the database.sql file and paste it into the D or E drive.
3. To check whether the software is properly set up, click on Start, go to the
MySQL command line prompt and log in with root as the password.
4. Run the command source D:/database.sql to load the data and tables.
5. To see the tables in the database, run the command show tables;

Open any browser, type localhost:8090/normalization/ and press Enter; the
project home page will be displayed.

1. The home page has a sidebar menu with the entries admin, user and
publisher.
2. On clicking admin, the admin login page opens. If the username and
password are correct, the admin main page opens.


3. Here the admin sees sidebar menu items such as view all authorized users,
uploaded publications, duplicated and normalized publication records,
bookmark and publication search history, bookmark and publication frequency
ranking, and so on.
4. Click on the menu item for the data you want to see; the data will be
displayed.
5. On clicking view all duplicate records, it shows the list of duplicated
publication records.
6. On clicking view all normalized records, it shows the report of
normalized records.
7. On clicking view all bookmarks, it shows the list of bookmarks.
8. On clicking view publication or bookmark frequency ranking, it shows the
graph of publication and bookmark records.
9. On clicking view all publication and bookmark search history, it shows
the report of publication and bookmark search history.
10. Further, on clicking the user button, a form opens for entering the
name and the password.
11. The user menu shows options such as view your profile, search bookmark
and publication, view bookmark and publication search history, and
logout.
12. Click on the user menu option for the data you want to see; the data will
be displayed.
13. Then, on clicking the publisher button, the publisher login page is displayed.
14. After the publisher details are entered, the publisher menu options, such as
add publication and bookmarks and view all bookmarks and publications, are
displayed.
15. On clicking the view all bookmarks and publications button, a report of the
various bookmarks and publications is shown.
16. On clicking the add bookmarks and publication button, it shows details such
as name, url, venue and so on to be filled in.
17. Finally, after all the above steps are done, we can log out from the web
page.


7. SYSTEM EXECUTION SCREEN SHOTS

1. Home Page

Screen 7.1: Home page of project

2. Admin menu

Screen 7.2: Admin Menu Page

3. View Duplicate Records

Screen 7.3: Report Showing List of Duplicated Publication Records

4. Normalized Records

Screen 7.4: Report showing Normalized Records

5. View Book Marks

Screen 7.5: Report Showing List of Book Marks

6. Graph of Publication Rank

Screen 7.6: Graph showing Publication Records

7. Publication Search History

Screen 7.7: Report Showing Publication Search History

8. View all publication frequency ranking

Screen 7.8: publication frequency ranking

9. User Login

Screen 7.9: Form for User Login

10. User Menu

Screen 7.10: Form for User Menu

11. Search Publication

Screen 7.11: Form to Search Publication Words

12. Search Publication Report

Screen 7.12: Report Showing Publication Report

13. Publisher Login

Screen 7.13: Publisher Login Page

14. Publisher Menu

Screen 7.14: Publisher Menu Page

15. View Book Marks

Screen 7.15: Report showing to View Various Book Marks


8. CONCLUSION

In this paper, we studied the problem of record normalization over a set
of matching records that refer to the same real-world entity. We presented three levels
of normalization granularities (record-level, field-level and value-component level)
and two forms of normalization (typical normalization and complete normalization).
For each form of normalization, we proposed a computational framework that
includes both single-strategy and multi-strategy approaches. We proposed four
single-strategy approaches: frequency, length, centroid, and feature-based to select the
normalized record or the normalized field value. For the multi-strategy approach, we used
result merging models inspired from meta searching to combine the results from a
number of single strategies. We analyzed the record and field level normalization in
the typical normalization. In the complete normalization, we focused on field values
and proposed algorithms for acronym expansion and value component mining to
produce much improved normalized field values. We implemented a prototype and
tested it on a real-world dataset. The experimental results demonstrate the feasibility
and effectiveness of our approach. Our method outperforms the state-of-the-art by a
significant margin.
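
Two of the single-strategy approaches named above, frequency and length, can be sketched in Java for field-level normalization. This is a simplified illustration under stated assumptions rather than the implementation used in the project: it assumes that the frequency strategy selects the most common value and that the length strategy prefers the longest value; the class FieldNormalizer, its method names and the sample venue values are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of two single strategies for choosing the normalized
// value of one field from a group of matching (duplicate) records.
final class FieldNormalizer {
    // Frequency strategy: the value that occurs most often wins; on a tie
    // the value seen first is kept.
    static String byFrequency(List<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    // Length strategy: the longest value wins, on the assumption that a
    // longer value is more complete; on a tie the value seen first is kept.
    static String byLength(List<String> values) {
        String best = null;
        for (String value : values) {
            if (best == null || value.length() > best.length()) {
                best = value;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Sample venue values from four duplicate publication records.
        List<String> venues = Arrays.asList(
            "VLDB", "Very Large Data Bases", "VLDB", "Proc. VLDB");
        System.out.println(byFrequency(venues));
        System.out.println(byLength(venues));
    }
}
```

The two strategies can disagree on the same input, which is the motivation for the multi-strategy approach that merges the rankings of several single strategies.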


9. FUTURE ENHANCEMENTS

In the future, we plan to extend our research as follows. First, conduct
additional experiments using more diverse and larger datasets. The current lack of
appropriate datasets has made this difficult. Second, investigate how to add an
effective human-in-the-loop component into the current solution, as automated
solutions alone will not be able to achieve perfect accuracy. Third, develop
solutions that handle numeric or more complex values.


