Normalization of Duplicate Records from Multiple Sources: Bachelor of Technology in Computer Science and Engineering
On
NORMALIZATION OF DUPLICATE
RECORDS FROM MULTIPLE SOURCES
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted By
P. JEEVANA (17HM1A0533)
A. DIVYA (17HM1A0502)
S. SHABAAZ (17HM1A0543)
CERTIFICATE
This is to certify that the project work entitled "NORMALIZATION OF
DUPLICATE RECORDS FROM MULTIPLE SOURCES" is a bona fide
work done by
P. JEEVANA (17HM1A0533)
A. DIVYA (17HM1A0502)
S. SHABAAZ (17HM1A0543)
S. ATHIF (17HM1A0542)
S. AYESHA SHAIK (17HM1A0508)
in partial fulfillment of the requirement for the award of the degree of BACHELOR
OF TECHNOLOGY in COMPUTER SCIENCE AND ENGINEERING in
ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES,
KADAPA during the academic year 2017-2021. The results of this work have not
been submitted to any other university or institute for the award of any degree.
External Examiner
DECLARATION
PROJECT ASSOCIATES
P. JEEVANA (17HM1A0533)
A. DIVYA (17HM1A0502)
S. SHABAAZ (17HM1A0543)
S. ATHIF (17HM1A0542)
S. AYESHA SHAIK (17HM1A0508)
ACKNOWLEDGEMENT
Last, but not least, we are thankful to all the non-teaching staff
members of the Computer Science & Engineering Department for their extended
cooperation.
PROJECT ASSOCIATES
P. JEEVANA (17HM1A0533)
A. DIVYA (17HM1A0502)
S. SHABAAZ (17HM1A0543)
S. ATHIF (17HM1A0542)
S. AYESHA SHAIK (17HM1A0508)
TABLE OF CONTENTS
List of Figures
List of Tables
CHAPTER 1 INTRODUCTION
2.1.1. Disadvantages
2.2.1. Advantages
5.1 Coding
5.2 Testing
CHAPTER 8 CONCLUSION
CHAPTER 10 BIBLIOGRAPHY
LIST OF FIGURES
21. Report showing Normalized Records (Fig. 6.4)
LIST OF TABLES
LIST OF ABBREVIATIONS
Normalization Of Duplicate Records From Multiple Sources Introduction
1. INTRODUCTION
The Web has evolved into a data-rich repository containing a large amount of
structured content spread across millions of sources. The usefulness of Web data
increases exponentially (e.g., building knowledge bases, Web-scale data analytics)
when it is linked across numerous sources. Structured data on the Web resides in Web
databases and Web tables.
Record normalization generates a single, standard record from a group of matching
records that refer to the same real-world entity. We call the generated record the
normalized record. For example, in the research publication domain, an integrator
website such as Citeseer or Google Scholar contains records gathered from a variety
of sources using automated extraction techniques, yet it must display a normalized
record to users. Otherwise, it is unclear what can be presented to users: (i) the
entire group of matching records or (ii) some random record from the group, to name
just a couple of ad-hoc approaches. Either choice can lead to a frustrating
experience for a user, because in (i) the user needs to sort and browse through a
potentially large number of duplicate records, and in (ii) we run the risk of
presenting a record with missing or incorrect pieces of data.
2. SYSTEM ANALYSIS
2.1. EXISTING SYSTEM
The problem of normalization of database records was first described by
Culotta et al. They made the first attempt to formalize the record normalization
problem and proposed three solutions. The first solution uses string edit distance to
determine the most central record. The second solution optimizes the edit distance
parameters, and the third is a feature-based solution that improves
performance by means of a knowledge base. Their approach is an instance of typical
field value normalization. They did not consider value-component-level
normalization. In addition, their gold standard dataset has many instances of
unreasonable normalized records. Swoosh describes a record Merge operator;
however, the purpose of the operator is not to produce normalized records, but
rather to improve the ability to establish difficult record matchings. Wick et al.
propose a discriminatively-trained model to implement schema matching, coreference,
and normalization jointly, but this greatly increases the complexity of the model.
Their paper also contains no discussion of complete normalization at the
value-component level. Wang et al. propose a hybrid framework for product
normalization in online shopping by schema integration and data cleaning. Although
their work mainly focuses on record matching, they consider the problem of filling
in missing data and repairing incorrect data, which is relevant to record normalization.
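The first of the solutions mentioned above, choosing the most central record by string edit distance, can be illustrated with a minimal sketch. It assumes that each record is flattened to a single string and that every edit (insert, delete, substitute) costs 1. The names CentralRecord, editDistance and mostCentral are hypothetical; they are not taken from Culotta et al. or from this project's code.

```java
import java.util.List;

class CentralRecord {
    // Unit-cost Levenshtein distance between two strings.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    // Returns the record whose summed distance to all records in the group
    // is smallest; on a tie the earlier record wins. Null for an empty group.
    static String mostCentral(List<String> records) {
        String best = null;
        int bestTotal = Integer.MAX_VALUE;
        for (String candidate : records) {
            int total = 0;
            for (String other : records) total += editDistance(candidate, other);
            if (total < bestTotal) { bestTotal = total; best = candidate; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical group of duplicate records for one entity.
        List<String> group = List.of("J. Smith", "John Smith", "Jon Smith");
        System.out.println(mostCentral(group));
    }
}
```

The comparison is quadratic in the number of records in a group, which is usually acceptable because groups of duplicates tend to be small.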
2.1.1. Disadvantages
2.2.1. Advantages
Java Technology
Simple
Architecture neutral
Object oriented
Portable
Distributed
High performance
Interpreted
Multithreaded
Robust
Dynamic
Secure
We can think of Java byte codes as the machine code instructions for the Java
Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool
or a Web browser that can run applets, is an implementation of the Java VM. Java
byte codes help make “write once, run anywhere” possible. You can compile your
program into byte codes on any platform that has a Java compiler. The byte codes can
then be run on any implementation of the Java VM. That means that as long as a
computer has a Java VM, the same program written in the Java programming
language can run on Windows 2000, a Solaris workstation, or on an iMac.
You’ve already been introduced to the Java VM. It’s the base for the Java
platform and is ported onto various hardware-based platforms.
Native code is code that, once compiled, runs only on a
specific hardware platform. As a platform-independent environment, the Java
platform can be a bit slower than native code. However, smart compilers, well-tuned
interpreters, and just-in-time byte code compilers can bring performance close to that
of native code without threatening portability.
Feasibility Study
Technical Feasibility
The GUI is developed using HTML to capture information from the customer.
HTML is used to display the content in the browser. Pages are delivered over HTTP,
which runs on TCP/IP. HTML is a markup language that the browser interprets. It is
very easy to develop a page/document using HTML, and some RAD (Rapid Application
Development) tools are provided to quickly design/develop our application. Many
objects such as buttons, text fields and text areas are provided to capture
information from the customer.
Economical Feasibility
The economic issue that usually arises during the economic feasibility stage is
whether the system will be used once it is developed and implemented. The system
reduces the workload. Keeping the class of application in view, the cost of hardware
and software is considered to be economically feasible.
Operational Feasibility
In our application, the front end is developed as a GUI, so it is very easy for the
customer to enter the necessary information. But the customer must have some knowledge of using
4. SYSTEM DESIGN
4.1. ARCHITECTURE DESIGN
Record-level normalization assumes that each record in Re, the set of matching
records for an entity e, is a candidate normalized record. This assumption, while
intuitively appealing and useful for building the theoretical underpinnings for
constructing normalized records, needs to be taken with a grain of salt in practice.
Re contains a mixture of candidate normalized records and records with incomplete or
arcane representations of e, which may be difficult for ordinary users to understand.
The complete normalization form works at the value for entire web page.
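To show how normalization can work below the record level, the following sketch builds a normalized record field by field, taking the most frequent non-empty value of each field across the matching records. The frequency rule, the FieldVote class and the field names used in main are assumptions made for illustration only; they are not the strategy implemented in this project.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldVote {
    // For every field, keep the value that occurs most often among the
    // matching records. Empty values are ignored; on a tie the value seen
    // first wins.
    static Map<String, String> normalize(List<Map<String, String>> records) {
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (Map<String, String> record : records) {
            for (Map.Entry<String, String> field : record.entrySet()) {
                String value = field.getValue();
                if (value == null || value.trim().isEmpty()) continue;
                counts.computeIfAbsent(field.getKey(), k -> new LinkedHashMap<>())
                      .merge(value.trim(), 1, Integer::sum);
            }
        }
        Map<String, String> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> field : counts.entrySet()) {
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> c : field.getValue().entrySet()) {
                if (c.getValue() > bestCount) {
                    best = c.getKey();
                    bestCount = c.getValue();
                }
            }
            normalized.put(field.getKey(), best);
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Hypothetical matching records for one publication.
        List<Map<String, String>> group = List.of(
            Map.of("title", "Record Linkage", "venue", "VLDB"),
            Map.of("title", "Record Linkage", "venue", ""),
            Map.of("title", "Record linkage.", "venue", "VLDB"));
        System.out.println(normalize(group));
    }
}
```

Unlike picking one whole record, this approach can combine a complete title from one source with a venue from another, so a record with a missing field does not force that gap into the result.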
In UML, a system is represented using five different views, each describing the
system from a distinctly different perspective.
As the strategic value of software increases for many companies, the industry
looks for techniques to automate the production of software and to improve quality
and reduce cost and time-to-market. These techniques include component technology,
visual programming, patterns and frameworks. Businesses also seek techniques to
manage the complexity of systems as they increase in scope and scale. In particular,
they recognize the need to solve recurring architectural problems, such as physical
distribution, concurrency, replication, security, load balancing and fault tolerance.
Additionally, the development for the World Wide Web, while making some things
simpler, has exacerbated these architectural problems. The Unified Modeling
Language (UML) was designed to respond to these needs. Simply put, systems design
refers to the process of defining the architecture, components, modules, interfaces,
and data for a system to satisfy specified requirements, which can be done easily
through UML diagrams.
Class Diagram
Sequence Diagram
Activity Diagram
Collaboration Diagram
Deployment Diagram
Component Diagram
Class Diagram
showing the system's classes, their attributes, and the relationships between the
classes. This is one of the most important diagrams in development. The
diagram breaks each class into three layers: the first has the name, the second
describes its attributes and the third its methods. A padlock to the left of a name
marks a private attribute. The relationships are drawn between the classes.
Developers use the class diagram to develop the classes. Analysts use it to show the
details of the system. Architects look at class diagrams to see whether any class
has too many functions and needs to be split.
Sequence Diagram
Activity Diagram
ACTION STATE: Action states are states of the system, each representing
the execution of an action. An action state cannot be further decomposed.
Collaboration Diagram
Fig 4.6: Collaboration Diagram
Deployment Diagram
run on each node (e.g., web application, database), and how the different pieces are
connected (e.g., JDBC, REST).
Component Diagram
The following UML diagrams are discussed in this project:
- Use case diagram
- Class diagram
- Activity diagram
- Sequence diagram
CLASS DIAGRAM
In our class diagram, the classes are data user, public cloud and private cloud. The
attributes of the data user are user id, password and files. The operations of this
class are login, register, upload, checking of duplicates and logout. The data user
is related to the public cloud through a many-to-one relationship. The attributes
of the public cloud are userid, password and file storage, and the operations of
this class are login, encrypt, decrypt, duplicate and logout. In the same way, the
data user is related to the private cloud, where authorized users can have
permission to access or modify the data.
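Class: DataUser
Attributes: userid, password, files, fileid, fileblocks
Operations: login(), register(), upload(), duplicatecheck(), encrypt(), decrypt(), downoad(), logout()

Class: Public-Cloud
Attributes: userid, password, filestorage, files
Operations: login(), storefiles(), encrypt(), decrypt(), duplicate(), logout()

Class: Private-Cloud
Attributes: userid, password, files, rights, ownername, permissions
Operations: login(), activiation(), permissions(), logout()

Relationships: DataUser (*) to Public-Cloud (1); DataUser (*) to Private-Cloud (1)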
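As a rough illustration of how the data user class in the diagram above might be expressed in Java, the sketch below models only the upload and duplicate-checking operations. Detecting duplicates by comparing SHA-256 digests of the file content is an assumption made for this example, as are the field and method names; the report does not state how the project implements duplicatecheck().

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

class DataUser {
    private final String userId;
    // Digests of everything this user has uploaded so far.
    private final Set<String> uploadedDigests = new HashSet<>();

    DataUser(String userId) { this.userId = userId; }

    String getUserId() { return userId; }

    // Hex-encoded SHA-256 digest of the file content.
    static String digest(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation must provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Counterpart of duplicatecheck(): true if identical content was uploaded before.
    boolean duplicateCheck(byte[] content) {
        return uploadedDigests.contains(digest(content));
    }

    // Counterpart of upload(): returns false when the content is a duplicate.
    boolean upload(byte[] content) {
        return uploadedDigests.add(digest(content));
    }

    public static void main(String[] args) {
        DataUser user = new DataUser("u1");
        byte[] file = "sample file content".getBytes(StandardCharsets.UTF_8);
        System.out.println(user.upload(file));         // true: first upload
        System.out.println(user.duplicateCheck(file)); // true: already stored
        System.out.println(user.upload(file));         // false: duplicate
    }
}
```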
A use case diagram shows a set of use cases and actors and their relationships.
These diagrams are especially important in organizing and modeling the behaviors
of a system. The actors in our use case diagram are the user and the public cloud.
First, the user registers on the webpage; after completing registration, the user
logs in, uploads the necessary files, checks whether duplicates are present, and
then logs out. Second, the (public cloud) admin logs in to the webpage and checks
the list of files for duplication. If duplicates are found, the admin removes the
repeated data by using the normalization concepts.
[Use case diagram. Actors: User, Public Cloud. Use cases: Register, Login, Upload File, Download File, List of Files, Logout.]
SEQUENCE DIAGRAM
In a sequence diagram, objects are generally anonymous instances of a class. Here,
the objects are named Owner, Login, Receive permission from admin, File upload,
View user details, Receive files and Attribute. First, the owner logs in to the
web page by entering the userid and password. The admin then verifies whether the
user who entered the details is authorized; if so, the admin gives permission to
upload the data. Further, if users want to view (search) any information they need,
the cloud removes the redundant data and sends the standard information.
[Sequence diagram. Messages: uid, pwd; verify; receive permission; file upload; change key.]
Activity Diagram
An activity diagram shows the flow from activity to activity. The activity
diagram emphasizes the dynamic view of a system. It consists of activity states,
action states, transitions, and objects. In this diagram, at the starting point
(pre-condition) the user enters the user name and password. After the details are
submitted, the admin decides whether the user entered valid data. If the user
entered the correct details, he is accepted as an authorized person to use the web
page; if the user did not enter valid data, he is rejected and the flow reaches the
ending point (post-condition).
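The decision step of this activity flow can be sketched as a single function that either accepts or rejects a login attempt. The in-memory map stands in for the login data store and, like the names LoginFlow, Outcome and validate, is a hypothetical choice for this example; the project itself checks credentials in its JSP pages against a database.

```java
import java.util.Map;

class LoginFlow {
    enum Outcome { ACCEPTED, REJECTED }

    // Decision node: accept only a known user name with a matching password.
    static Outcome validate(Map<String, String> loginMaster,
                            String userName, String password) {
        if (userName == null || password == null) return Outcome.REJECTED;
        String stored = loginMaster.get(userName);
        return password.equals(stored) ? Outcome.ACCEPTED : Outcome.REJECTED;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the login data store.
        Map<String, String> loginMaster = Map.of("user1", "secret1");
        System.out.println(validate(loginMaster, "user1", "secret1")); // ACCEPTED
        System.out.println(validate(loginMaster, "user1", "wrong"));   // REJECTED
        System.out.println(validate(loginMaster, "nobody", "secret1")); // REJECTED
    }
}
```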
Data flow analysis studies the use of data in each activity. It documents these details
in the Data Flow Diagrams.
A data flow diagram is a logical model of a system. The model does not depend on the
hardware, software, and data structures of the organization. There is no physical
implication in a data flow diagram because the diagram is a graphic picture of the
logical system. It is easy for non-technical users to understand and thus serves
as an excellent communication tool. Finally, a data flow diagram is a good starting
point for system design. A data flow diagram is constructed from a set of basic symbols.
DFD NOTATIONS
To represent an attribute
Data Store
Login Master
Validation Data
LOGIN.JSP
font-weight:bold;
}
.b1
{
background-color: #color;
border-bottom:solid;
border-left: #FFEEEE;
border-right:solid;
border-top: #EEEEEE;
color: brown;
font-family: Verdana, Arial
}
</style>
<meta name="keywords" content="" />
<meta name="description" content="" />
<link rel="stylesheet" type="text/css" href="default.css" />
</head>
<body>
<div id="upbg"></div>
<div id="outer">
<div id="header">
<div id="headercontent">
</h1>
</div>
</div>
<div id="headerpic"></div>
<div id="menu">
<!-- HINT: Set the class of any menu link below to "active" to make it appear
active -->
<ul>
<li><a href="#" class="active">Home</a></li>
<li><a href="user_log.jsp" >User</a></li>
<li><a href="signup.jsp">Sign up</a></li>
<li><a href="server_log.jsp">Admin</a></li>
</ul>
</div>
<div id="menubottom"></div>
<div id="content">
<!-- Normal content: Stuff that's not going to be put in the left or right column. -->
<!-- Primary content: Stuff that goes in the primary content column (by default, the
left column) -->
<div id="primarycontainer">
<div id="primarycontent">
<!-- Primary content area start -->
<div class="post">
<p><strong><em><font color="#990000" size="+1" face="Verdana, Arial,
Helvetica, sans-serif">Architecture</font></em></strong>
<br/>
<br/>
<img src="images/archi.bmp" width="700" height="400"></p>
</div>
<!-- Primary content area end -->
</div>
</div>
<br>
<br>
</div>
</div>
<div id="footer"><strong><font color="#990033" face="Geneva, Arial, Helvetica,
sans-serif"></font></strong></div>
<!--<div class="right">Design by <a
href="http://www.nodethirtythree.com/">NodeThirtyThree
Design</a></div>-->
</div>
</body>
</html>
UPLOAD.JSP
<%@page
import="java.sql.*,java.lang.*,databaseconnection.*,databaseconnection1.*,databaseconnection2.*,databaseconnection3.*,java.text.SimpleDateFormat,java.util.*,java.io.*,javax.servlet.*,javax.servlet.http.*" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>multi cloud</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<script type="text/javascript">
</script>
</head>
<body>
<%
java.util.Date now = new java.util.Date();
String DATE_FORMAT1 = "dd/MM/yyyy";
SimpleDateFormat sdf1 = new SimpleDateFormat(DATE_FORMAT1);
String strDateNew1 = sdf1.format(now);
String a="D:\\multi-cloud\\temp\\file1.txt";
String b="D:\\multi-cloud\\temp\\file2.txt";
String c="D:\\multi-cloud\\temp\\file3.txt";
FileInputStream fis=null;
File image=new File(a);
File image1=new File(b);
File image2=new File(c);
//String m="on process";
String ser=(String)session.getAttribute("ser");
String u=(String)session.getAttribute("u");
String name=(String)session.getAttribute("name");
String f=(String)session.getAttribute("f");
String kbs=(String)session.getAttribute("kbs");
A.I.T.S, KADAPA DEPARTMENT OF CSE Page 24
String tfid=(String)session.getAttribute("tfid");
String fkey=(String)session.getAttribute("fkey");
String akey=(String)session.getAttribute("akey");
String m="not_verified";
String x="s1";
String y="s2";
String z="s3";
Connection con=null,con1=null,con2=null;
PreparedStatement psmt1=null,psmt2=null,psmt3=null;
try{
con=databasecon.getconnection();
// The SQL text is concatenated because a Java string literal cannot span lines.
psmt1=con.prepareStatement("insert into "
+"tpafile(fileid,uid,name,fname,fsize,b1,b2,b3,fkey,date,status,akey) "
+"values(?,?,?,?,?,AES_ENCRYPT(?, 'key'),AES_ENCRYPT(?, 'key'),"
+"AES_ENCRYPT(?, 'key'),?,?,?,?)");
psmt1.setString(1,tfid);
psmt1.setString(2,u);
psmt1.setString(3,name);
psmt1.setString(4,f);
psmt1.setString(5,kbs);
fis=new FileInputStream(image);
psmt1.setBinaryStream(6, (InputStream)fis, (int)(image.length()));
fis=new FileInputStream(image1);
psmt1.setBinaryStream(7, (InputStream)fis, (int)(image1.length()));
fis=new FileInputStream(image2);
psmt1.setBinaryStream(8, (InputStream)fis, (int)(image2.length()));
psmt1.setString(9,fkey);
psmt1.setString(10,strDateNew1);
psmt1.setString(11,m);
psmt1.setString(12,akey);
psmt1.executeUpdate();
}
catch(Exception ex)
{
out.println("Error in connection : "+ex);
}
response.sendRedirect("tpa_home.jsp?message=success");
%>
</body>
</html>
<div><label style="font-size:14px;font-weight:bold;">Username :</label> <input
style="width:100px;height:20px;font-size:12px;" type="text" id="username1"
name="username" /></div>
<div><label style="font-size:14px;font-weight:bold;">Password :</label> <input
style="width:100px;height:20px;font-size:12px;" type="password"
id="password1" name="password" /></div>
5.2. TESTING
A primary purpose of testing is to detect software failures so that defects may
be discovered and corrected.
Types of Tests
1. Unit Testing
2. Integration Testing
3. System Testing
Unit Testing
Unit testing focuses verification effort on the smallest unit of software design,
the module. Using the procedural design description as a guide, important control
paths are tested to uncover errors within the boundaries of the module.
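A unit test at this level might look like the following sketch. The module under test, normalizeKey, is a hypothetical helper that reduces a publication title to a comparison key, so that duplicates differing only in case, punctuation or spacing compare equal. It is not part of the project's code, and plain Java checks are used instead of a test framework to keep the example self-contained.

```java
import java.util.Locale;

class DuplicateKeyTest {
    // Hypothetical module under test: lower-cases the title, replaces
    // punctuation with spaces, and collapses runs of whitespace.
    static String normalizeKey(String title) {
        return title.toLowerCase(Locale.ROOT)
                    .replaceAll("[^a-z0-9 ]", " ")
                    .trim()
                    .replaceAll("\\s+", " ");
    }

    // Minimal assertion helper that fails loudly with a message.
    static void check(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }

    public static void main(String[] args) {
        // Each check exercises one control path of the module.
        check(normalizeKey("Record  Linkage!").equals("record linkage"),
              "punctuation and repeated spaces");
        check(normalizeKey("RECORD linkage").equals(normalizeKey("record Linkage")),
              "case differences");
        check(normalizeKey("").equals(""), "empty input");
        System.out.println("all unit tests passed");
    }
}
```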
Integration Testing
System Testing
Acceptance Testing
It is a sub-part of system testing and a critical phase of any project.
Test case 4: User Login
Input: Give username and password
Expected result: User page should be opened
Actual result: User page has been opened
Status: Pass
6. EXECUTION STEPS
1. Installation of Java:
Go to http://www.oracle.com/technetwork/java/javase/downloads/index.html.
Click on the JDK DOWNLOAD button. Run the exe file and then follow the
instructions given in the wizard.
To set up the path:
o Right click on My PC and then go to Properties
Click on install with port number 8090 with username and password as
aits and aits.
Mention the connection port as 8090 and then click on next and finally
click on finish.
Now confirm the password as root in system settings field and then
click on finish.
Select any browser, type localhost:8090/normalization/ and then press Enter;
the project home page will then be displayed.
1. The home page has a sidebar menu with the options admin, user and
publisher.
2. On clicking admin, the admin login page opens. If the username and
password are correct, the admin main page opens.
3. Here the admin sees a sidebar menu with options such as view all authorized
users, uploaded publications, duplicated and normalized publication records,
bookmark and publication search history, bookmark and publication frequency
ranking and so on.
4. To see any of this data, click the corresponding menu option; the data is
then displayed.
5. On clicking view all duplicate records, the list of duplicated publication
records is shown.
6. On clicking view all normalized records, the report of normalized records
is shown.
7. On clicking view all bookmarks, the list of bookmarks is shown.
8. On clicking view publication or bookmark frequency ranking, the graph of
publication and bookmark records is shown.
9. On clicking view all publication and bookmark search history, the report of
publication and bookmark search history is shown.
10. On clicking the user button, a form opens for entering the name and the
password.
11. The user menu shows options such as view your profile, search bookmark
and publication, view bookmark and publication search history, and
logout.
12. To see any of this data, click the corresponding option in the user menu;
the data is then displayed.
13. On clicking the publisher button, the publisher login page is displayed.
14. After the publisher details are entered, the publisher menu options such as
add publication and bookmarks and view all bookmarks and publications are
displayed.
15. On clicking the view all bookmarks and publications button, a report of the
bookmarks and publications is shown.
16. On clicking the add bookmarks and publication button, a form is shown with
details such as name, url, venue and so on to be filled in.
17. Finally, after all the above steps are done, we can log out from the web
page.
2. Admin menu
4. Normalized Records
9. User Login
10. User Menu
11. Search Publication
13. Publisher Login
14. Publisher Menu
8. CONCLUSION
[...] of matching records that refer to the same real-world entity. We presented three levels
[...] normalized record or the normalized field value. For the multi-strategy approach, we used
result merging models inspired by metasearching to combine the results from a
number of single strategies. We analyzed the record and field level normalization in
[...] and proposed algorithms for acronym expansion and value component mining to
[...] significant margin.
9. FUTURE ENHANCEMENTS
10. BIBLIOGRAPHY
[1] K. C.-C. Chang and J. Cho, “Accessing the web: From search to integration,” in
SIGMOD, 2006, pp. 804–805.
[2] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang, “Webtables: Exploring the
power of tables on the web,” PVLDB, vol. 1, no. 1, pp. 538–549, 2008.
[3] W. Meng and C. Yu, Advanced Metasearch Engine Technology. Morgan & Claypool
Publishers, 2010.
[4] A. Gruenheid, X. L. Dong, and D. Srivastava, “Incremental record linkage,” PVLDB, vol.
7, no. 9, pp. 697–708, May 2014.
[6] W. Su, J. Wang, and F. Lochovsky, “Record matching over query results from multiple
web databases,” TKDE, vol. 22, no. 4, 2010.
[7] H. Köpcke and E. Rahm, “Frameworks for entity matching: A comparison,” DKE, vol.
69, no. 2, pp. 197–210, 2010.
[8] X. Yin, J. Han, and S. Y. Philip, “Truth discovery with multiple conflicting information
providers on the web,” ICDE, 2008.
[10] P. Christen, “A survey of indexing techniques for scalable record linkage and
deduplication,” TKDE, vol. 24, no. 9, 2012.
[11] S. Tejada, C. A. Knoblock, and S. Minton, “Learning object identification rules for
information integration,” Inf. Sys., vol. 26, no. 8, pp. 607–633, 2001.
[12] L. Shu, A. Chen, M. Xiong, and W. Meng, “Efficient spectral neighborhood blocking for
entity resolution,” in ICDE, 2011.