Met PDF Extraction System

The document is a project report submitted by four students for their Bachelor of Engineering degree. It discusses developing a system to extract metadata from scientific PDF documents and convert the summarized text into audio for better user understanding. The system will use technologies like Python, NLP, machine learning and convert PDF text to audio in multiple languages.

Uploaded by

Shivam Gosavi

A

Project Report on

“Metadata extraction from scientific pdf”

SUBMITTED TO THE SAVITRIBAI PHULE UNIVERSITY, PUNE

IN THE PARTIAL FULFILMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE

BACHELOR OF ENGINEERING (COMPUTER ENGINEERING)

(Academic Year: 2022-23)

SUBMITTED BY

Vaibhav Rajendra Joshi (PRN No: 72030000C)

Vishal Rambhau Kardile (PRN No: 72030017H)
Bhushan Gokul Jadhav (PRN No: 72029985D)
Shivam Nanagir Gosavi (PRN No: 72029975G)

Under the guidance of

Prof. Atul Chaudhary

DEPARTMENT OF COMPUTER ENGINEERING

MET’s Institute of Engineering,

Adgaon, Nashik-422003
SAVITRIBAI PHULE UNIVERSITY, PUNE

May, 2022
Certificate
This is to certify that the project report entitled

“Metadata extraction from scientific pdf”

has been submitted by

Vaibhav Rajendra Joshi (PRN No: 72030000C)
Vishal Rambhau Kardile (PRN No: 72030017H)
Bhushan Gokul Jadhav (PRN No: 72029985D)
Shivam Nanagir Gosavi (PRN No: 72029975G)

who are bona fide students of this institute. The work has been carried out by them under the guidance of Prof. Atul Chaudhary and is approved in partial fulfillment of the requirements of Savitribai Phule Pune University for the award of the degree of Bachelor of Engineering (Computer Engineering).

Guide: Prof. Atul Chaudhary
H.O.D.: Dr. M. U. Kharat
Principal: Dr. V. P. Wani

Date: / /
Acknowledgement

We have put considerable effort into this project. However, it would not have been possible without the kind support and help of many individuals and organizations, and we would like to extend our sincere thanks to all of them. It is our proud privilege to complete the project on “Metadata extraction from scientific pdf”. We are highly indebted to our internal guide, Prof. Atul Chaudhary, for his guidance and constant supervision, for providing the necessary information regarding the project, and for his support in completing it.
We are also extremely grateful to our respected H.O.D. (Computer Department), Dr. M. U. Kharat, and to Dr. P. N. Metange (Project Co-ordinator) for providing all facilities and every help for the smooth progress of the project work.

Vaibhav Rajendra Joshi

Vishal Rambhau Kardile
Bhushan Gokul Jadhav
Shivam Nanagir Gosavi

Abstract

Project Title: Metadata extraction from scientific pdf

With the World Wide Web available in every corner of the world, the amount of information on the internet is growing at an exponential rate. Given people's hectic schedules and the immense amount of information available, there is an increasing need for information abstraction or summarization. Be it browsing through the seemingly endless pages of terms and conditions on an important official document or kicking back and flipping through an intriguing eBook, reading is an undeniable and inescapable part of our everyday lives. However, reading demands our complete, undivided attention, making it nearly impossible to multitask. This online PDF to Audio Converter and Translator, built with Python (Django), can instantly convert any PDF text into audio. Along with reading any PDF document out loud, the application can translate and vocalize text in up to five languages. Text summarization presents the user with a shorter version of the text containing only the vital information, and thus helps the user understand the text in less time. The goal of this project is to condense documents or reports into a shorter version that preserves the important content, and to convert the summarized text into audio for better understanding by the user.

Technology: Front-end development (HTML, CSS, Bootstrap, JavaScript), back-end development (PHP), SQL database (MySQL), machine learning, NLP

Keywords: Python, NLP, PDF extraction, audio converter, machine learning.

Contents

Acknowledgement

1 Introduction
1.1 Objective

2 Literature Survey
2.1 Literature Review papers
2.2 Summary

3 Problem Definition
3.1 Summary

4 Analysis
4.1 Project Plan
4.1.1 Project Plan for semester I
4.2 Requirement Analysis
4.2.1 Necessary Functions
4.2.2 Desirable Functions
4.3 Summary

5 Design
5.1 Project Scope
5.1.1 Operating Environment
5.1.2 User Classes and Characteristics
5.1.3 Design and Implementation Constraints
5.1.4 Assumptions and Dependencies
5.2 System Architecture
5.3 Workflow of the project
5.4 User Interfaces
5.5 Hardware Interfaces
5.6 Communication Interfaces
5.7 Functional Requirements
5.8 Nonfunctional Requirement
5.9 Summary

6 Modeling
6.1 Data Flow Diagrams
6.2 ER Diagrams
6.3 UML Diagram
6.3.1 Activity Diagram
6.3.2 Sequence Diagram
6.4 Component Diagram
6.5 Summary

7 Technical Specifications
7.1 Advantages
7.2 Limitations
7.3 Applications
7.4 Technology used
7.5 System Requirements
7.5.1 Database Requirements
7.5.2 Software Requirements (Platform Choice)
7.5.3 Hardware Requirements
7.6 Summary

8 Conclusion

References

MET’s Institute of Engineering v

List of Figures

1.1 Existing system
5.1 Architecture diagram
6.1 DFD 0 Diagram
6.2 DFD 1 Diagram
6.3 DFD 2 Diagram
6.4 ER Diagram
6.5 Activity Diagram
6.6 Sequence Diagram
6.7 Component diagram
7.1 NLP Technology
7.2 MySQL Database
7.3 Python software programming language
7.4 XAMPP software

List of Tables

4.1 Planner and Progress Report I for project
Chapter 1

Introduction

Natural Language Processing (NLP) is an area of research and application that explores how computers can be used to understand and manipulate natural language speech or text to do useful things. The foundations of NLP lie in a number of disciplines, namely computer and information sciences, linguistics, mathematics, electrical and electronic engineering, artificial intelligence, robotics, and psychology. NLP researchers aim to gather knowledge on how human beings use and manipulate natural languages to perform desired tasks so that appropriate tools and techniques can be developed. Applications of NLP span a number of fields of study, such as multilingual and cross-language information retrieval (CLIR), machine translation, natural language text processing and summarization, user interfaces, speech recognition, artificial intelligence, and expert systems.
Text-to-speech and related read-aloud tools are being widely implemented in an attempt to assist students' reading comprehension skills. The PDF-to-audio system is a screen reader application designed and built for effective audio communication. PDF was designed to present and exchange documents reliably; it is an open standard document format used globally and maintained by the International Organization for Standardization (ISO). The format is one of the most convenient methods for electronic communication and for the exchange of information. Hence, there is a need to make it more accessible to on-screen readers through audio. PDF documents can contain links and buttons, form fields, audio, video, and business logic. The PDF-to-audio system reads on-screen text aloud, with support for many languages [2]. The PDF to Audio Converter project provides an alternative way to access PDF books for blind users, lazy readers, and others. Using this PDF to Audio Converter, users can listen to their favorite PDF while going about their daily routine. The application converts text from PDF to audio using Python's predefined libraries [1].
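
Text extracted from a PDF usually needs cleaning before it can be spoken, because line breaks and end-of-line hyphens survive extraction. The following is a minimal sketch of that preparation step, assuming the raw text has already been extracted and that the speech engine is any callable that accepts one sentence. The helper names `clean_extracted_text`, `split_sentences`, and `read_aloud` are hypothetical and are not taken from the project's code; the report does not name the PDF or speech libraries it uses, so none is assumed here.

```python
import re
from typing import Callable, Iterable, List


def clean_extracted_text(raw: str) -> str:
    """Undo common PDF extraction damage before the text is spoken."""
    # Rejoin words hyphenated across a line break: "undi-\nvided" -> "undivided".
    # Caveat: a genuine compound split at its hyphen loses the hyphen too.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse remaining newlines and runs of spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Split after ., ! or ? followed by whitespace (a deliberately naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def read_aloud(sentences: Iterable[str], speak: Callable[[str], None]) -> int:
    """Pass each sentence to the supplied speak() callable; return the count."""
    count = 0
    for sentence in sentences:
        speak(sentence)
        count += 1
    return count


if __name__ == "__main__":
    page = "Reading demands our undi-\nvided attention.  Audio lets us\nmulti-\ntask."
    # print stands in for a real text-to-speech call.
    read_aloud(split_sentences(clean_extracted_text(page)), print)
```

Passing the engine in as a callable keeps the sketch independent of any particular speech library: in the real application `speak` would wrap whatever text-to-speech call the project uses.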

Figure 1.1: Existing system

1.1 Objective
• Grasp the idea quickly: Reading an entire article, breaking it down, and separating the important ideas from the original text takes time and effort; a summary makes the main idea available instantly.

• Important facts: Summarization improves productivity by speeding up browsing while ensuring that important facts are not missed.

• Improves quality: Some summarization software handles not only documents but also web pages.


Chapter 2

Literature Survey

Many researchers aim to gather knowledge on how human beings understand and use language so that appropriate tools and techniques can be developed to make computer systems understand and manipulate natural languages to perform the desired tasks. Phonological rules are captured through machine learning on training sets.

2.1 Literature Review papers

• “An approach to sentence-selection-based text summarization”, by Fang Chen, Kesong Han, and Guilin Chen, published in 2016. This paper introduces a newly developed text summarization system. It supports both Chinese and English, while the paper focuses on Chinese processing. The authors apply 6 word-level features and 3 sentence-level features to weigh each word and sentence. They also describe two new techniques: one for processing the topic-sensitive word feature and another for processing the sentence-length feature. A primary subjective evaluation shows that these approaches are effective and efficient, and that the performance of the system is promising.

• “Automatic Text Summarization Using Hybrid Fuzzy GA-GP”, by A. Kiani-B and M. R. Akbarzadeh-T, 2015. A novel technique is proposed for summarizing text using a combination of Genetic Algorithms (GA) and Genetic Programming (GP) to optimize the rule sets and membership functions of fuzzy systems. The novelty of the proposed algorithm is that the fuzzy system is optimized for extractive text summarization. In this method, GP is used for the structural part and GA for the string part (membership functions). The goal is to develop an optimal intelligent system that extracts important sentences from texts while reducing redundancy. The method is applied to 3 test documents and compared with standard fuzzy systems as well as two commercial summarizers: Microsoft Word and Copernic Summarizer. Simulations demonstrate several significant improvements with the proposed approach.

• “Generic text summarization using local and global properties of sentences”, by C. Kruengkrai and C. Jaruskulchai, 2015. With the proliferation of text data on the World Wide Web, the development of methods for automatically summarizing these data becomes more important. The paper proposes a practical approach for extracting the most relevant sentences from the original document to form a summary. The idea is to exploit both the local and global properties of sentences. The local property can be considered as clusters of significant words within each sentence, while the global property can be thought of as the relations among all sentences in the document. These two properties are combined into a single measure reflecting the informativeness of sentences. Experimental results show that the approach compares favorably to a commercial text summarizer.

• “A Review on Optical Character Recognition and Text to Speech Conversion”, by Swati Vikas Kodgire, 2013. An application based on image and voice, functioning in parallel, is suitable for assisting physically challenged people, so that a challenged person's dependence on others is reduced. An image-acquisition-based text reader can help visually challenged people manage handheld objects in day-to-day life. The initial steps involve capturing the image, distinguishing the text portion of the image from the remaining regions, and pre-processing the region of interest; after characters and words are extracted, the text is converted to speech. To separate text from a document, it is necessary to find all possible text regions. Text detection, line detection, character identification, feature extraction, and training on the extracted features are the steps executed in sequence.
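
The extractive approaches surveyed above share one core idea: score each sentence from features of its words, then keep the best-scoring sentences. The sketch below illustrates that idea with a single word-level feature (document-wide word frequency) and a single sentence-level adjustment (averaging over sentence length). It is a simplified stand-in, not the method of any of the cited papers, and the stop-word list and function names are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stop-word list (an assumption; real systems use larger ones).
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "it", "of", "on", "the", "to"}


def tokenize(sentence: str) -> List[str]:
    """Lowercase the sentence and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]


def summarize(sentences: List[str], max_sentences: int = 2) -> List[str]:
    """Return the highest-scoring sentences, kept in their original order."""
    # Word-level feature: how often each word occurs in the whole document.
    word_weight = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence: str) -> float:
        words = tokenize(sentence)
        # Sentence-level adjustment: average the word weights so that long
        # sentences are not favoured merely for containing more words.
        return sum(word_weight[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    document = [
        "PDF files store text.",
        "The weather was pleasant.",
        "Text summarization shortens PDF text.",
    ]
    for line in summarize(document, max_sentences=2):
        print(line)
```

Returning the chosen sentences in document order, rather than score order, keeps the summary readable when it is later converted to audio.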


2.2 Summary
In this chapter we discussed the research studies reviewed for our system and identified the needs of the system's current users.


Chapter 3

Problem Definition

The problem is to build a system for document summarization. The system produces multi-document as well as single-document summaries, using data mining techniques to identify common terms across the set of documents.
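
Identifying common terms across a set of documents can be illustrated by counting, for each term, how many documents contain it. This is a minimal sketch under assumed simplifications: terms are lowercase words of four or more letters, and the names `document_terms` and `common_terms` are hypothetical, not part of the project's code.

```python
import re
from collections import Counter
from typing import Dict, List, Set


def document_terms(text: str) -> Set[str]:
    """Distinct lowercase words of at least four letters in one document."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def common_terms(documents: List[str], min_docs: int = 2) -> Dict[str, int]:
    """Map each term to the number of documents containing it, keeping
    only terms that occur in at least min_docs documents."""
    doc_freq = Counter(term for doc in documents for term in document_terms(doc))
    return {term: n for term, n in doc_freq.items() if n >= min_docs}


if __name__ == "__main__":
    docs = [
        "Metadata extraction from PDF files",
        "Automatic metadata extraction",
        "Speech synthesis of text",
    ]
    print(common_terms(docs))
```

Terms shared by several documents are candidates for the topics a multi-document summary should cover; a single-document summary is the special case of a set with one document and `min_docs=1`.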

3.1 Summary
To gain insight into how supply chain management, when integrated with a blockchain-enabled platform, can give businesses a competitive edge and help overcome the challenges and problems that organizations face in supply chain operations.

Chapter 4

Analysis

This chapter describes the project plan adopted and presents the requirement analysis. We have implemented the project on the basis of the Rapid Application Development (RAD) model and the Model View Controller (MVC) model.

4.1 Project Plan

4.1.1 Project Plan for semester I

The following Table 4.1 describes the project plan for semester I. It describes
the various activities and accountability of the developers for the respective modules.
Following are the major activities carried out in this plan :

• Identifying the functional requirements.

• Designing of the Framework.

• Studying the necessary development tools and technologies.


Phase  Activity                                     Start Date  End Date    Group Members
1      Selection of Project Topic                   22-08-2022  24-08-2022  Team
1      Study literature survey in detail            25-08-2022  28-08-2022  Team
1      Functional Requirement Specification (FRS)   29-08-2022  09-09-2022  Team
1      Design Prototype                             11-09-2022  21-09-2022  Team
1      Set Theory and Math Model                    23-09-2022  06-09-2022  Team
1      UML Diagram Prototype                        23-09-2022  03-10-2022  Team
1      Project Problem Statement using NP Complete  08-10-2022  19-10-2022  Team
1      UML Diagram in StarUML                       05-10-2022  22-10-2022  Team
1      Paper Presentation                           05-11-2022  05-11-2022  Team
1      Software Requirement Specification           06-11-2022  10-11-2022  Team
1      Test Plan                                    11-11-2022  15-11-2022  Team

Table 4.1: Planner and Progress Report I for project


4.2 Requirement Analysis

4.2.1 Necessary Functions

• Provide a way for users to access pages.

• Display Web pages properly.

• Provide technology to check the patient data of the system.

4.2.2 Desirable Functions

• Networking interface.

• Make and save money.

• Build an online presence.

4.3 Summary
In this chapter we described the implementation details of the project plan for
Semester I and Semester II. We also studied the necessary functions and the desirable
functions of our system.


Chapter 5

Design

5.1 Project Scope

In this new era, a tremendous amount of information is available on the Internet. It is most important to provide an improved mechanism to extract information quickly and efficiently. In the current system there is corruption occurring in this field. It is very difficult for human beings to manually summarize large text documents. There is also the problem of searching for relevant documents among the many documents available.

5.1.1 Operating Environment

Natural Language Systems are very complicated to design. NLP’s future will be
redefined as it faces new technological challenges to create more user-friendly systems.
It is also forcing NLP more towards Open Source Development. If the NLP community
embraces Open Source Development, it will make NLP systems less proprietary and
therefore less expensive.

5.1.2 User Classes and Characteristics

The user operating the system should have a laptop on which the web application can run.


5.1.3 Design and Implementation Constraints

Using laptops (with touch interface) presents a very different set of challenges. The issue is not whether you have a larger screen; the devices are fundamentally different. Battery life, screen size, form factor, variations in keyboard availability, and dynamically changing orientation (horizontal or vertical positioning by the user) present a unique set of issues to be dealt with.
A typical audiobook converter turns PDF text (or images) into speech and offers volume controls with a single voice (either male or female). The user is given only a single choice of voice. Play and pause options are provided. The speed of the voice is always fixed. With today's busy schedules, people do not get time to read a book or to convert a PDF file to MP3 using third-party applications or web applications. Even I have a directory in which I store PDF books that I plan on reading but never do. So I thought: why not make them audiobooks and listen to them while I do something else? In this system, we are developing a GUI-based web application using Python to convert the PDF file into audio format and read it out to the user. The application is more user-friendly as it does not require any audio file or MP3 player.
Following are the merits of the design implementation:

• Portability: As it is web based, on-the-move learning is possible anywhere and anytime.

• Delivery Mechanism: The application is convenient to develop and very easy to use.

• User-friendly: It is user-friendly because it can be used on devices such as tablets, mobiles, and laptops.

5.1.4 Assumptions and Dependencies

The framework allows the developer to build the neural learning application with ease and deploy it on devices that run the web application. The application developed by the vendor will allow the user to use it with a high degree of interactivity and portability. The commercialization of the web application may take time. It incorporates best-practice web research into a practical framework of web-based design requirements.

5.2 System Architecture

Figure 5.1: Architecture diagram

In today's busy routine, people do not find time to read a book or to convert a PDF file to MP3 using third-party applications or web applications. In this system, I am developing an application using Python to convert the PDF file into audio format and read it out to the user. The application is more user-friendly as it does not require any audio file or MP3 player. The user only has to select the PDF file they want to listen to.


5.3 Workflow of the project

The workflow of the project is:

• In the PDF to Audio Converter, the user selects a PDF file from the desired location by pressing the Open PDF button.

• After selecting the PDF file, the user selects the type of voice: female or male.

• The user then clicks the Play button.

• If the PDF file contains page numbers, its text is extracted.

• The extracted text is printed on the console.

• The extracted text is then read aloud.

• After the text is read, it is displayed on the QLabel provided in the GUI.

• If the PDF file does not contain page numbers, the above operations are not performed.

• The program reads out the PDF in the voice the user selected.

• The user can tune the speed and volume of the speech.

• To exit the program, the user presses the Exit button.
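
The voice, speed, and volume choices in the workflow above can be sketched as a small settings object that is validated and then applied to a speech engine. This is an illustration under stated assumptions: the report does not name its speech library, so the engine is taken to be any object exposing `setProperty(name, value)` with `voice`, `rate`, and `volume` properties (the convention used by the pyttsx3 library). `VoiceSettings`, `apply_settings`, `RecordingEngine`, and the voice identifiers are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class VoiceSettings:
    """User choices from the workflow: voice type, speed, and volume."""
    voice: str = "female"   # "female" or "male"
    rate: int = 150         # speaking speed in words per minute
    volume: float = 1.0     # 0.0 (silent) to 1.0 (full)

    def validate(self) -> None:
        if self.voice not in ("female", "male"):
            raise ValueError(f"unknown voice: {self.voice!r}")
        if self.rate <= 0:
            raise ValueError("rate must be positive")
        if not 0.0 <= self.volume <= 1.0:
            raise ValueError("volume must be between 0.0 and 1.0")


def apply_settings(engine: Any, settings: VoiceSettings, voice_ids: Dict[str, str]) -> None:
    """Configure an engine that exposes setProperty(name, value)."""
    settings.validate()
    engine.setProperty("voice", voice_ids[settings.voice])
    engine.setProperty("rate", settings.rate)
    engine.setProperty("volume", settings.volume)


class RecordingEngine:
    """Stand-in engine that records calls, so the sketch runs without audio."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Any]] = []

    def setProperty(self, name: str, value: Any) -> None:
        self.calls.append((name, value))


if __name__ == "__main__":
    engine = RecordingEngine()
    ids = {"female": "voice-0", "male": "voice-1"}  # hypothetical voice identifiers
    apply_settings(engine, VoiceSettings(voice="male", rate=180, volume=0.5), ids)
    print(engine.calls)
```

Validating before touching the engine means an out-of-range slider value in the GUI fails with a clear error instead of silently producing distorted speech.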


5.4 User Interfaces

• Operating system (OS):
An operating system is a set of programs that manages computer hardware resources and provides common services for application software. It is the most important type of system software in a computer system. Without an operating system, a user cannot run an application program on their computer unless the application program is self-booting.

• Application Programming Interface (API):
An API is a particular set of rules (code) and specifications that software programs can follow to communicate with each other. It serves as an interface between different software programs and facilitates their interaction, similar to the way a user interface facilitates interaction between humans and computers.

An API can be created for applications, libraries, operating systems, etc., as a way of defining their vocabularies and resource request conventions (e.g. function-calling conventions). It may include specifications for routines, data structures, object classes, and protocols used to communicate between the consumer program and the implementer program of the API. An API consists of a core set of packages and classes. As shown in Figure 4.1, the applications will be built using the M-Learning Framework. These applications will be built by importing the libraries, include files, and style sheets developed as a whole framework. The framework is developed from the developer's point of view, that is, to make it possible to develop applications with less time and effort. Thus the developer will access the APIs present in the framework and develop applications by writing a small amount of code.


5.5 Hardware Interfaces

• Devices:
Applications built using Python will be deployed on laptops and tablets supporting Web application system version 2.2 and above.

5.6 Communication Interfaces

The most important protocols for data transmission across the Internet are TCP
(Transmission Control Protocol) and IP (Internet Protocol). Using these jointly (TCP/IP),
we can link devices that access the network; some other communication protocols asso-
ciated with the Internet are POP, SMTP and HTTP.
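
To make the layering concrete: HTTP is a text protocol whose messages are carried as bytes over a TCP connection. The sketch below only formats a request and parses the first line of a response; it opens no network connection, and the function names, host, and path are illustrative assumptions.

```python
from typing import Tuple


def build_get_request(host: str, path: str = "/") -> bytes:
    """Format a minimal HTTP/1.1 GET request as the bytes sent over TCP."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(response: bytes) -> Tuple[str, int, str]:
    """Split the first response line into (version, status code, reason)."""
    first_line = response.split(b"\r\n", 1)[0].decode("ascii")
    version, code, reason = first_line.split(" ", 2)
    return version, int(code), reason


if __name__ == "__main__":
    print(build_get_request("example.org", "/report.pdf"))
    print(parse_status_line(b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n"))
```

In a deployed web application these bytes are produced and consumed by the browser and the web server, so application code rarely builds them by hand.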

5.7 Functional Requirements

• The system should be able to retrieve the results stored in the database using a quick retrieval process.

• The system's modules must be able to encrypt the data and decrypt it whenever needed.
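
The first requirement, quick retrieval of stored results, can be sketched with a keyed table: each PDF is identified by a hash of its content, and the primary key gives an indexed lookup. The sketch uses Python's built-in `sqlite3` module as a stand-in for the MySQL database named in the report; the table name, column names, and function names are assumptions. Encryption is left out because the Python standard library provides no symmetric cipher.

```python
import hashlib
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the results table; the primary key gives an indexed lookup."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summaries ("
        "pdf_hash TEXT PRIMARY KEY, summary TEXT NOT NULL)"
    )
    return conn


def pdf_key(pdf_bytes: bytes) -> str:
    """Identify a PDF by the SHA-256 digest of its content."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def save_summary(conn: sqlite3.Connection, pdf_bytes: bytes, summary: str) -> None:
    """Insert or overwrite the stored summary for this PDF."""
    conn.execute(
        "INSERT OR REPLACE INTO summaries (pdf_hash, summary) VALUES (?, ?)",
        (pdf_key(pdf_bytes), summary),
    )
    conn.commit()


def load_summary(conn: sqlite3.Connection, pdf_bytes: bytes) -> Optional[str]:
    """Return the stored summary, or None if this PDF was never processed."""
    row = conn.execute(
        "SELECT summary FROM summaries WHERE pdf_hash = ?", (pdf_key(pdf_bytes),)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = open_store()
    save_summary(conn, b"%PDF-1.4 demo", "Short summary.")
    print(load_summary(conn, b"%PDF-1.4 demo"))
```

Keying on the content hash means a PDF uploaded twice is summarized once and the second request is answered from the database.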

5.8 Nonfunctional Requirement

• There should be minimal lag between the start of processing and the result.

• Processing should be as efficient as possible, with maximum accuracy.

• The system should give valid results for positive as well as negative test cases.

• Usability: The ease with which the system can be learned, managed, or used. Usability measures how user-friendly the system is.

• Reliability: The degree to which the system must work for users. It is also expressed through the mean time between failures and the maximum tolerable downtime.


• Performance: Performance specifications typically refer to response time, transaction throughput, and capacity. Response time means the time the system takes to load, reload, open screens, refresh, and so on.

• Scalability: The ability of the proposed software application to handle an increasing number of users or applications associated with the product.

• Open standard: It ensures the viability and future expansion of the system. All offered development tools, server software, and the application itself are based on open templates and are available under the terms of the General Public License.
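
A response-time requirement only becomes checkable once it is measured against a stated budget. The following is a minimal sketch of such a measurement; the helper names and the budget values are assumptions for illustration, since the report does not specify numeric targets.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def within_budget(elapsed: float, budget_seconds: float) -> bool:
    """True when a measured response time meets the stated budget."""
    return elapsed <= budget_seconds


if __name__ == "__main__":
    # sum() stands in for a real processing step such as summarization.
    result, elapsed = timed(sum, range(1_000_000))
    print(result, within_budget(elapsed, budget_seconds=2.0))
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.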

5.9 Summary
In this chapter we studied the operating environment and the user classes and
characteristics which describes the scope of the project. We have also described the
software system attributes and various nonfunctional requirements.


Chapter 6

Modeling

This chapter presents the modeling techniques that describe the various users of the web application. It also describes the functionality of the different features of the NLP system.

6.1 Data Flow Diagrams

A data flow diagram (DFD) is a graphical representation that uses a standardized set of symbols and notations to describe a business's operations through data movement. DFDs are often elements of a formal methodology such as the Structured Systems Analysis and Design Method (SSADM).
The objective of a DFD is to show the scope and boundaries of a system as a whole. It may be used as a communication tool between a systems analyst and any person who plays a part in the system, and it acts as a starting point for redesigning a system. The DFD is also called a data flow graph or bubble chart.


DFD 0, also called the context diagram, shows the result management system as a whole. As the bubbles are decomposed into less and less abstract bubbles, the corresponding data flows may also need to be decomposed.

Figure 6.1: DFD 0 Diagram


In DFD 1, the context diagram is decomposed into multiple bubbles/processes. At this level, we highlight the main objectives of the system and break down the high-level process of the 0-level DFD into subprocesses.

Figure 6.2: DFD 1 Diagram


DFD 2 goes one level deeper into parts of the 1-level DFD. It can be used to plan or record the specific, necessary detail about the system's functioning.

Figure 6.3: DFD 2 Diagram


6.2 ER Diagrams
An entity relationship diagram (ERD), also known as an entity relationship model,
is a graphical representation that depicts relationships among people, objects, places,
concepts or events within an information technology (IT) system.
Depending on the scale of change, it can be risky to alter a database structure directly in
a DBMS. To avoid ruining the data in a production database, it is important to plan out
the changes carefully. ERD is a tool that helps. By drawing ER diagrams to visualize
database design ideas, you have a chance to identify the mistakes and design flaws, and
to make corrections before executing the changes in the database.

Figure 6.4: ER Diagram


6.3 UML Diagram

6.3.1 Activity Diagram

Use cases show what your system should do. Activity diagrams allow you to spec-
ify how your system will accomplish its goals. Activity diagrams show high-level actions
chained together to represent a process occurring in your system. An activity diagram
is essentially a flowchart, showing flow of control from activity to activity. Unlike a tra-
ditional flowchart, an activity diagram shows concurrency as well as branches of control.
Activity diagrams focus on the dynamic flow of a system.

Figure 6.5: Activity Diagram


6.3.2 Sequence Diagram

The sequence diagram is used primarily to show the interactions between objects in the sequential order in which those interactions occur. Developers typically think sequence diagrams are meant exclusively for them. However, an organization's business staff can find sequence diagrams useful for communicating how the business currently works by showing how various business objects interact. Sequence diagrams illustrate how objects interact with each other. They focus on message sequences, that is, how messages are sent and received between a number of objects. The main purpose of a sequence diagram is to show the order of events between the parts of the system that are involved in a particular interaction.

Figure 6.6: Sequence Diagram


6.4 Component Diagram

Component diagram are one of the two kinds of diagrams found in modeling the
physical aspects of object oriented systems. A component diagram shows organization
and dependencies among set of components. Component diagram can be seen to model
the static implementation view of a system. This involves modeling the physical things
that resides on a node, such as executables, libraries, tables, files and documents.
Component diagram shows a set of components and their relationships. Graph-
ically a component diagram is a collection of vertices and arcs. Component diagrams
commonly contain,

• Components

• Interfaces

• Dependency, generalization, association and realization relationships.

Figure 6.7: Component diagram


6.5 Summary
Thus we saw the various modeling techniques used for the design of the NLP-based
system.


Chapter 7

Technical Specifications

In this chapter we will discuss the advantages and limitations of the system. We
will also go through the applications of the framework and have a brief study about the
technical requirements.

7.1 Advantages

• Text-to-speech synthesis is a rapidly growing area of computer technology and
plays an increasingly important role in the way we interact with systems
and interfaces across a variety of platforms.

• We have identified the various operations and processes involved in text-to-speech
synthesis. We have also developed a simple and attractive graphical user
interface that allows the user to type text into the text field provided in
the application.

• The code performs well when reading straightforward, text-based PDF files.

• The system should enable users to select the desired PDF, convert it to audio, and
display the text, so the user can see which text has been read.

• The system should support students with reading disabilities.
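
The read-aloud behaviour described in the points above can be illustrated with a minimal sketch. It assumes the text has already been extracted from the PDF, splits it into sentence-sized chunks, and passes each chunk to a speech callback while recording which chunks have been read. The names `split_into_chunks`, `read_aloud` and the `speak` callback are hypothetical; the report does not name the PDF or text-to-speech libraries used, so a real build would pass in the speak function of whichever engine is chosen.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str) -> List[str]:
    """Split text into sentences on '.', '!' or '?' followed by whitespace."""
    normalized = " ".join(text.split())  # collapse line breaks left by PDF extraction
    if not normalized:
        return []
    return [s for s in re.split(r"(?<=[.!?])\s+", normalized) if s]


def read_aloud(text: str, speak: Callable[[str], None]) -> List[str]:
    """Speak each chunk in order and return the chunks that were read.

    `speak` is a placeholder for a text-to-speech call (assumption).
    """
    spoken = []
    for chunk in split_into_chunks(text):
        speak(chunk)          # hand the chunk to the speech engine
        spoken.append(chunk)  # a user interface could highlight this chunk as read
    return spoken


if __name__ == "__main__":
    sample = "Metadata is data about data.\nIt describes a document. Is it useful?"
    read_aloud(sample, speak=lambda chunk: print("Reading:", chunk))
```

Reading chunk by chunk, rather than passing the whole document at once, is what makes it possible to show the user which part of the text has been read.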


7.2 Limitations
• Conversion can fail when an error occurs while processing a PDF.

• The implementation is complex.

7.3 Applications
• This feature mainly helps people with disabilities, such as those who are blind.

• Teachers and school librarians may also use these findings as a rationale for adding
audiobooks to the list of reading strategies used successfully with struggling readers.

• English Language Learners, who are the usual participants in studies on audiobook
usage.


7.4 Technology used

Natural language processing (NLP)

Figure 7.1: NLP Technology

Natural language processing (NLP) is the ability of a computer program to
understand human language as it is spoken and written, referred to as natural language.
It is a component of artificial intelligence (AI). NLP has existed for more than 50 years
and has roots in the field of linguistics. It has a variety of real-world applications in a
number of fields, including medical research, search engines and business intelligence.
NLP enables computers to understand natural language as humans do. Whether
the language is spoken or written, natural language processing uses artificial intelligence
to take real-world input, process it, and make sense of it in a way a computer can
understand. Just as humans have different sensors, such as ears to hear and eyes to
see, computers have programs to read and microphones to collect audio. And just as
humans have a brain to process that input, computers have a program to process their
respective inputs. At some point in processing, the input is converted to code that the
computer can understand.
There are two main phases of natural language processing: data preprocessing and
algorithm development. Data preprocessing involves preparing and "cleaning" text data
so that machines can analyze it. Preprocessing puts data in a workable form and
highlights features in the text that an algorithm can work with.
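
The data preprocessing phase described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: it lowercases the text, tokenizes it into words, removes stop words, and counts the remaining terms as a simple feature. The stop-word list and the function names `preprocess` and `term_frequencies` are assumptions made for the example.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stop-word list (assumption); real systems use a fuller list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}


def preprocess(text: str) -> List[str]:
    """Lowercase the text, tokenize it into words, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def term_frequencies(tokens: List[str]) -> Counter:
    """Count how often each remaining token occurs (a simple text feature)."""
    return Counter(tokens)


if __name__ == "__main__":
    tokens = preprocess("The PDF is parsed, and the text of the PDF is cleaned.")
    print(tokens)
    print(term_frequencies(tokens).most_common(2))
```

The cleaned token list is the "workable form" of the data, and the term counts are one example of features that a later algorithm could use.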


7.5 System Requirements

7.5.1 Database Requirements

Figure 7.2: MySQL Database

MySQL is a fast, easy-to-use RDBMS used by many small and large businesses.
MySQL was originally developed by the Swedish company MySQL AB and is now
developed and supported by Oracle Corporation. MySQL is popular for several reasons:

• MySQL Community Edition is released under an open-source license, so it is free
to use.

• MySQL is a very powerful program in its own right. It handles a large subset of
the functionality of the most expensive and powerful database packages.

• MySQL uses a standard form of the well-known SQL data language.

• MySQL works on many operating systems and with many languages including PHP,
Perl, C, C++, Java, etc.

• MySQL works very quickly and works well even with large data sets.

• MySQL works well with PHP, a widely used language for web development.


• MySQL supports large databases, up to 50 million rows or more in a table. The
default file size limit for a table is 4 GB, but you can increase this (if your operating
system can handle it) to a theoretical limit of 8 million terabytes (TB).

• MySQL is customizable. The open-source GPL license allows programmers to
modify the MySQL software to fit their own specific environments.
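
The report does not describe the database schema, so the following is only a sketch of how extracted metadata could be stored in and queried from a relational table. To keep the example self-contained it uses Python's built-in `sqlite3` module as a stand-in for a MySQL server. The `documents` table, its columns, and the function names are hypothetical. With a MySQL driver the connection call and the placeholder syntax would differ.

```python
import sqlite3
from typing import List, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical table holding one row of metadata per PDF."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY,
            filename TEXT NOT NULL,
            title TEXT,
            authors TEXT
        )
        """
    )


def save_metadata(conn: sqlite3.Connection, filename: str, title: str, authors: str) -> int:
    """Insert one record with a parameterized query and return its row id."""
    cur = conn.execute(
        "INSERT INTO documents (filename, title, authors) VALUES (?, ?, ?)",
        (filename, title, authors),
    )
    conn.commit()
    return cur.lastrowid


def find_by_title(conn: sqlite3.Connection, keyword: str) -> List[Tuple[str, str]]:
    """Return (filename, title) pairs whose title contains the keyword."""
    rows = conn.execute(
        "SELECT filename, title FROM documents WHERE title LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL server
    create_schema(conn)
    save_metadata(conn, "paper1.pdf", "Text Summarization Survey", "A. Author")
    print(find_by_title(conn, "Summarization"))
```

Parameterized queries are used so that titles or author names taken from arbitrary PDFs cannot alter the SQL statement.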

7.5.2 Software Requirements (Platform Choice)

• Python

Figure 7.3: Python software programming language

Python is a multi-paradigm programming language. Object-oriented programming
and structured programming are fully supported, and many of its features
support functional programming and aspect-oriented programming (including
metaprogramming and metaobjects).
Python is an interpreted, object-oriented, high-level programming language with
dynamic semantics. Its high-level built-in data structures, combined with dynamic
typing and dynamic binding, make it very attractive for Rapid Application
Development, as well as for use as a scripting or glue language to connect existing
components together. Python’s simple, easy-to-learn syntax emphasizes readability
and therefore reduces the cost of program maintenance. Python supports modules
and packages, which encourages program modularity and code reuse.


• XAMPP

Figure 7.4: XAMPP software

XAMPP is a widely used cross-platform web server package that helps developers
create and test their programs on a local web server. It was developed by
Apache Friends, and its source code is open and can be revised or modified by its
users. It consists of the Apache HTTP Server, MariaDB, and interpreters for
programming languages such as PHP and Perl. It is available in 11 languages
and is supported on different platforms, such as the IA-32 package for Windows and
the x64 package for macOS and Linux.
XAMPP lets developers test a website and its clients on a local computer
before releasing it to the main server. It provides a suitable environment to
test and verify projects based on Apache, Perl, a MySQL-compatible database,
and PHP on the host system itself. Among these technologies, Perl is a
programming language used for web development, PHP is a backend scripting
language, and MariaDB is a widely used database that originated as a fork of
MySQL.


• JavaScript
JavaScript is a lightweight, interpreted programming language designed for
creating network-centric applications. Despite the similar name, it is a separate
language from Java. JavaScript is easy to adopt because it integrates directly with
HTML, and it is open and cross-platform. JavaScript is one of the most popular
programming languages in the world, which makes it a good choice for programmers.
Once learned, JavaScript helps in developing front-end as well as back-end software
using JavaScript-based libraries and runtimes such as jQuery and Node.js.

The software requirements are as follows:

1. Operating System : Windows XP/7/8/10

2. Programming Language : Python

3. Software Version : Python 4.4

4. Tools : Anaconda/PyCharm

5. Front End : Python

7.5.3 Hardware Requirements

1. Processor - Pentium IV / Intel Core i3

2. Speed - 1.1 GHz

3. RAM - 512 MB (min)

4. Hard disk - 20 GB

5. Keyboard - Standard Keyboard

6. Mouse - Two Or Three Button Mouse

7. Monitor - LED Monitor


7.6 Summary
In this chapter we covered the advantages, limitations and applications of the
framework. We also saw the technology used and the hardware and software
requirements of the project.


Chapter 8

Conclusion

The conclusion of this project is that the client gets a web application that
executes on the client side and produces a summary of the input document according
to the client’s requirements. The automatically generated summary helps the client
understand the core concept of the document within a few lines instead of reading the
whole document. The code performs well when reading straightforward, text-based PDF
files. The system should enable users to select the desired PDF, convert it to audio, and
display the text, so the user can see which text has been read. It should also support
students with reading disabilities. The success of this research project is significant given
the broad use of audiobooks in literacy and library programs across the United States.
Teachers and school librarians may also use these findings as a rationale for adding
audiobooks to the list of reading strategies used successfully with struggling readers.
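
To illustrate how a summary of a few lines can be produced automatically, the sketch below uses a simple word-frequency extractive approach: each sentence is scored by the frequency of its non-stop words in the whole text, and the top-scoring sentences are kept in their original order. This is a generic technique chosen for illustration and is not claimed to be the method implemented in this project. The stop-word list and the function names are assumptions.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stop-word list (assumption).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}


def split_sentences(text: str) -> List[str]:
    """Split text into sentences on '.', '!' or '?' followed by whitespace."""
    normalized = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", normalized) if s]


def summarize(text: str, max_sentences: int = 2) -> str:
    """Keep the highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)

    def score(sentence: str) -> int:
        # Stop words are absent from `freq`, so they contribute zero.
        return sum(freq[w] for w in re.findall(r"[a-z0-9]+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)


if __name__ == "__main__":
    text = (
        "PDF metadata helps search. The weather is nice. "
        "Metadata extraction from PDF files uses metadata rules."
    )
    print(summarize(text, max_sentences=2))
```

Because the score is a plain sum, longer sentences tend to rank higher; a production summarizer would normally normalize for sentence length.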

