
Met PDF Extraction System

The document is a project report submitted by four students for their Bachelor of Engineering degree. It discusses developing a system to extract metadata from scientific PDF documents and convert the summarized text into audio for better user understanding. The system will use technologies like Python, NLP, machine learning and convert PDF text to audio in multiple languages.

Uploaded by

Shivam Gosavi
Copyright
© All Rights Reserved

A

Project Report on

“Metadata extraction from scientific PDF”

SUBMITTED TO THE SAVITRIBAI PHULE UNIVERSITY, PUNE


IN THE PARTIAL FULFILMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE

OF

BACHELOR OF ENGINEERING (COMPUTER ENGINEERING)


(Academic Year: 2022-23)

SUBMITTED BY

Vaibhav Rajendra Joshi (PRN No: 72030000C)


Vishal Rambhau Kardile (PRN No: 72030017H)
Bhushan Gokul Jadhav (PRN No: 72029985D)
Shivam Nanagir Gosavi (PRN No: 72029975G)

Under the guidance of

Prof. Atul Chaudhary

DEPARTMENT OF COMPUTER ENGINEERING

MET’s Institute of Engineering,


Adgaon, Nashik-422003
SAVITRIBAI PHULE UNIVERSITY, PUNE

May, 2022
Certificate
This is to certify that the project report entitled

“Metadata extraction from scientific PDF”

has been submitted by

Vaibhav Rajendra Joshi (PRN No: 72030000C)
Vishal Rambhau Kardile (PRN No: 72030017H)
Bhushan Gokul Jadhav (PRN No: 72029985D)
Shivam Nanagir Gosavi (PRN No: 72029975G)

who are bonafide students of this institute. The work has been carried out by them
under the guidance of Prof. Atul Chaudhary, and it is approved in partial fulfillment
of the requirements of Savitribai Phule Pune University for the award of the degree
of Bachelor of Engineering (Computer Engineering).

Guide: Prof. Atul Chaudhary
H.O.D.: Dr. M. U. Kharat
Principal: Dr. V. P. Wani

Date: / /
Acknowledgement

We have put considerable effort into this project. However, it would not have been
possible without the kind support and help of many individuals and organizations. We
would like to extend our sincere thanks to all of them. It is our proud privilege to
complete the project on “Metadata extraction from scientific PDF”. We are highly
indebted to our internal guide, Prof. Atul Chaudhary, for his guidance and constant
supervision, for providing the necessary information regarding the project, and for
his support in completing it.
We are also extremely grateful to our respected H.O.D. (Computer Department),
Dr. M. U. Kharat, and to Dr. P. N. Metange (Project Co-ordinator) for providing all
facilities and every help for the smooth progress of the project work.

Vaibhav Rajendra Joshi


Vishal Rambhau Kardile
Bhushan Gokul Jadhav
Shivam Nanagir Gosavi

Abstract

Project Title: Metadata extraction from scientific pdf

With the World Wide Web available in nearly every corner of the world these days, the
amount of information on the internet is growing at an exponential rate. However, given
people’s hectic schedules and the immense amount of information available, there is an
increasing need for information abstraction or summarization. Be it browsing through the
seemingly endless pages of terms and conditions on an important official document or
kicking back and flipping through an intriguing eBook, reading is an undeniable and
inescapable part of our everyday lives. However, reading demands our complete, undivided
attention, making it nearly impossible to multitask. This online PDF to Audio Converter
and Translator, created using Python (Django), can instantly convert any PDF text into
audio. Along with reading any PDF document out loud, the application can also translate
and vocalize text in up to five languages. Text summarization presents the user with a
shorter version of the text containing only the vital information, and thus helps the
user understand the text in a shorter amount of time. The goal of this project is to
condense documents or reports into a shorter version that preserves the important
content, and to convert that generated summary into audio for better understanding by
the user.

Technology: Front-end development (HTML, CSS, Bootstrap, JavaScript), back-end
development (PHP), SQL database (MySQL), machine learning, NLP

Keywords: Python, NLP, PDF extraction, audio converter, machine learning.
Contents

Acknowledgement

1 Introduction
1.1 Objective

2 Literature Survey
2.1 Literature Review papers
2.2 Summary

3 Problem Definition
3.1 Summary

4 Analysis
4.1 Project Plan
4.1.1 Project Plan for semester I
4.2 Requirement Analysis
4.2.1 Necessary Functions
4.2.2 Desirable Functions
4.3 Summary

5 Design
5.1 Project Scope
5.1.1 Operating Environment
5.1.2 User Classes and Characteristics
5.1.3 Design and Implementation Constraints
5.1.4 Assumptions and Dependencies
5.2 System Architecture
5.3 Workflow of the project
5.4 User Interfaces
5.5 Hardware Interfaces
5.6 Communication Interfaces
5.7 Functional Requirements
5.8 Nonfunctional Requirement
5.9 Summary

6 Modeling
6.1 Data Flow Diagrams
6.2 ER Diagrams
6.3 UML Diagram
6.3.1 Activity Diagram
6.3.2 Sequence Diagram
6.4 Component Diagram
6.5 Summary

7 Technical Specifications
7.1 Advantages
7.2 Limitations
7.3 Applications
7.4 Technology used
7.5 System Requirements
7.5.1 Database Requirements
7.5.2 Software Requirements (Platform Choice)
7.5.3 Hardware Requirements
7.6 Summary

8 Conclusion

References


List of Figures

1.1 Existing system

5.1 Architecture diagram

6.1 DFD 0 Diagram
6.2 DFD 1 Diagram
6.3 DFD 2 Diagram
6.4 ER Diagram
6.5 Activity Diagram
6.6 Sequence Diagram
6.7 Component diagram

7.1 NLP Technology
7.2 MySQL Database
7.3 Python software programming language
7.4 XAMPP software
List of Tables

4.1 Planner and Progress Report I for project
Chapter 1

Introduction

Natural Language Processing (NLP) is an area of research and application that explores
how computers can be used to understand and manipulate natural language text or speech
to do useful things. The foundations of NLP lie in a number of disciplines, namely
computer and information sciences, linguistics, mathematics, electrical and electronic
engineering, artificial intelligence, robotics, and psychology. NLP researchers aim to
gather knowledge on how human beings use and manipulate natural languages to perform
desired tasks, so that appropriate tools and techniques can be developed. Applications
of NLP span a number of fields of study, such as multilingual and cross-language
information retrieval (CLIR), machine translation, natural language text processing and
summarization, user interfaces, speech recognition, artificial intelligence, and expert
systems.
Text-to-speech and related read-aloud tools are being widely implemented in an attempt
to assist students’ reading comprehension skills. The PDF to audio system is a screen
reader application designed and constructed for effective audio communication. PDF was
designed to present and exchange documents reliably; it is an open standard document
format used globally and maintained by the International Organization for
Standardization (ISO). The format is one of the most convenient methods for electronic
communication and for the exchange of information. Hence, there is a need to make it
more accessible to on-screen readers through audio. PDF documents can contain links and
buttons, form fields, audio, video, and business logic. The PDF to audio system reads
on-screen text aloud, with support for many languages [2]. The PDF to Audio Converter
project provides an alternative way to access PDF books for blind readers, reluctant
readers, and others. Using this PDF to Audio Converter, users can listen to their
favorite PDF while going about their daily routine. The application converts text from
PDF to audio using predefined Python libraries [1].
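
As a minimal sketch of the extraction half of this conversion, the snippet below reads
the text of each page and tidies it before it would be handed to a speech engine. The
report does not name the library it uses, so the third-party pypdf package is an
assumption here, and the helper names clean_page_text and extract_pages are
hypothetical.

```python
import re


def clean_page_text(raw):
    """Collapse the line breaks and repeated spaces that PDF extraction leaves behind."""
    return re.sub(r"\s+", " ", raw or "").strip()


def extract_pages(pdf_path):
    """Return the cleaned text of every non-empty page in the PDF.

    Assumption: pypdf is installed (pip install pypdf). It is imported inside
    the function so that clean_page_text can be used without it.
    """
    from pypdf import PdfReader  # third-party library, assumed for this sketch

    reader = PdfReader(pdf_path)
    pages = (clean_page_text(page.extract_text()) for page in reader.pages)
    return [text for text in pages if text]
```

Calling extract_pages on a file path yields a list of page strings; scanned PDFs that
contain only images would produce an empty list, since no OCR is attempted in this
sketch.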

Figure 1.1: Existing system

1.1 Objective
• Grasping the idea quickly: Reading an entire article, breaking it down, and
separating the important ideas from the original text takes time and effort; a summary
presents those ideas at once.

• Important facts: Summarization greatly improves productivity by speeding up browsing
without missing important facts.

• Improves quality: Some software summarizes not only documents but also web pages.



Chapter 2

Literature Survey

Many researchers aim to gather knowledge on how human beings understand and use
language, so that appropriate tools and techniques can be developed to make computer
systems understand and manipulate natural languages to perform the desired tasks.
Phonological rules are captured through machine learning on training sets.

2.1 Literature Review papers


• “An approach to sentence-selection-based text summarization”, by Fang Chen, Kesong
Han, and Guilin Chen, published in 2016. This paper introduces a newly developed text
summarization system. It supports both Chinese and English, while the paper focuses on
Chinese processing. The authors apply 6 word-level features and 3 sentence-level
features to weigh each word and sentence. They also describe two new techniques, one
for processing the topic-sensitive word feature and another for processing the sentence
length feature. A primary subjective evaluation shows that these approaches are
effective and efficient, and that the performance of the system is promising.

• “Automatic Text Summarization Using Hybrid Fuzzy GA-GP”, by A. Kiani-B and M.R.
Akbarzadeh-T, 2015. A novel technique is proposed for summarizing text using a
combination of Genetic Algorithms (GA) and Genetic Programming (GP) to optimize the
rule sets and membership functions of fuzzy systems. The novelty of the proposed
algorithm is that the fuzzy system is optimized for extractive text summarization. In
this method, GP is used for the structural part and GA for the string part (membership
functions). The goal is to develop an optimal intelligent system that extracts the
important sentences in texts while reducing the redundancy of data. The method is
applied to 3 test documents and compared with standard fuzzy systems as well as two
commercial summarizers: Microsoft Word and Copernic Summarizer. Simulations demonstrate
several significant improvements with the proposed approach.

• “Generic text summarization using local and global properties of sentences”, by C.
Kruengkrai and C. Jaruskulchai, 2015. With the proliferation of text data on the World
Wide Web, the development of methods for automatically summarizing these data becomes
more important. The paper proposes a practical approach for extracting the most
relevant sentences from the original document to form a summary. The idea of the
approach is to exploit both the local and global properties of sentences. The local
property can be considered as clusters of significant words within each sentence, while
the global property can be thought of as the relations of all sentences in the
document. These two properties are combined to get a single measure reflecting the
informativeness of sentences. Experimental results show that the approach compares
favorably to a commercial text summarizer.

• “A Review on Optical Character Recognition and Text to Speech Conversion”, by Swati
Vikas Kodgire, 2013. An application that works on image and voice in parallel is
suitable for assisting physically challenged people, so that their dependence on others
is reduced. An image-acquisition-based text reader can help visually challenged people
handle handheld objects in day-to-day life. The initial steps involve capturing the
image, separating the text portion of the image from the remaining regions, and
pre-processing the region of interest; after the extraction of characters and words,
the text is converted to speech. To separate text from a document, it is necessary to
find all the possible text regions. Text detection, line detection, character
identification, feature extraction, and training on the extracted features are the
steps executed in sequence.
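
The sentence-selection idea shared by the summarization papers above can be sketched in
a few lines. This is a generic word-frequency scorer written for illustration, not the
method of any surveyed paper; the stop-word list, the naive sentence splitter, and the
function names are all assumptions.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "it", "of", "on", "the", "to"}


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a naive rule, not a real tokenizer)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Averaging the word frequencies is one simple stand-in for a sentence-level feature; the
surveyed systems combine several word-level and sentence-level features instead of a
single score.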


2.2 Summary
In this chapter we discussed the various research works reviewed for our system and
also understood what current users need from the system.



Chapter 3

Problem Definition

A system for the summarization of documents. The system produces multi-document as
well as single-document summaries, using data mining techniques to identify common
terms across the set of documents.
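
One simple way to identify common terms across a set of documents is to count how many
documents each term appears in. The sketch below does that with the standard library
only; the function name common_terms and the min_docs threshold are assumptions made
for illustration, not the project’s actual technique.

```python
import re
from collections import Counter


def common_terms(documents, min_docs=2):
    """Return the terms that occur in at least min_docs of the given documents."""
    doc_freq = Counter()
    for doc in documents:
        # A set counts each term once per document, giving document frequency.
        doc_freq.update(set(re.findall(r"[a-z]+", doc.lower())))
    return sorted(term for term, count in doc_freq.items() if count >= min_docs)
```

For example, given the three short texts "PDF metadata extraction", "Metadata from PDF
files" and "Audio output", only "metadata" and "pdf" occur in two or more of them.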

3.1 Summary
To gain insight into how the field of supply chain management, when integrated with a
blockchain-enabled platform, can give businesses a competitive edge and help overcome
the challenges and problems that organizations face in supply chain operations.

Chapter 4

Analysis

This chapter describes the project plan adopted and presents the requirement analysis.
We have implemented the project on the basis of the Rapid Application Development (RAD)
model and the Model View Controller (MVC) model.

4.1 Project Plan

4.1.1 Project Plan for semester I


The following Table 4.1 describes the project plan for semester I. It describes
the various activities and accountability of the developers for the respective modules.
Following are the major activities carried out in this plan :

• Identifying the functional requirements.

• Designing of the Framework.

• Studying the necessary development tools and technologies.


Phase  Activity                                     Start Date  End Date    Group Members
1      Selection of Project Topic                   22-08-2022  24-08-2022  Team
1      Study literature survey in detail            25-08-2022  28-08-2022  Team
1      Functional Requirement Specification (FRS)   29-08-2022  09-09-2022  Team
1      Design Prototype                             11-09-2022  21-09-2022  Team
1      Set Theory and Math Model                    23-09-2022  06-09-2022  Team
1      UML Diagram Prototype                        23-09-2022  03-10-2022  Team
1      Project Problem Statement using NP Complete  08-10-2022  19-10-2022  Team
1      UML Diagram in StarUML                       05-10-2022  22-10-2022  Team
1      Paper Presentation                           05-11-2022  05-11-2022  Team
1      Software Requirement Specification           06-11-2022  10-11-2022  Team
1      Test Plan                                    11-11-2022  15-11-2022  Team

Table 4.1: Planner and Progress Report I for project


4.2 Requirement Analysis

4.2.1 Necessary Functions


• Provide a way for users to access pages.

• Display Web pages properly.

• Provide technology to check the patient data of the system.

4.2.2 Desirable Functions


• Networking Interface.

• Make and save money.

• Build an online presence.

4.3 Summary
In this chapter we described the implementation details of the project plan for
Semester I and Semester II. We also studied the necessary functions and the desirable
functions of our system.



Chapter 5

Design

5.1 Project Scope


In this new era, a tremendous amount of information is available on the Internet, and
it is important to provide an improved mechanism for extracting information quickly and
efficiently. In the current system, there is corruption occurring in this field. It is
very difficult for human beings to manually extract a summary from a large text
document. There is also the problem of searching for relevant documents among the large
number of documents available.

5.1.1 Operating Environment


Natural language systems are very complicated to design. NLP’s future will be redefined
as it faces new technological challenges in creating more user-friendly systems. These
challenges are also pushing NLP towards open source development. If the NLP community
embraces open source development, NLP systems will become less proprietary and
therefore less expensive.

5.1.2 User Classes and Characteristics


The user who is going to operate the system should have a laptop capable of running
the web application.


5.1.3 Design and Implementation Constraints


Using laptops (with a touch interface) brings a very different set of challenges. The
issue is not whether the screen is larger; the devices are fundamentally different.
Battery life, screen size, form factor, variations in keyboard availability, and
dynamically changing orientation (horizontal or vertical positioning by the user)
present a unique set of issues to be dealt with.
Typical audiobook converters turn PDF text (or images) into speech and offer volume
controls with a single voice (either male or female), so the user has only one choice
of voice. They provide play and pause options, and the speed of the voice is always
fixed. With today’s busy schedules, people do not get time to read a book, or to
convert a PDF file into MP3 using third-party applications or web applications. Many of
us keep a directory of PDF books that we plan to read but never do; turning them into
audiobooks lets us listen while doing something else. In this system, we are developing
a GUI-based web application using Python to convert the PDF file into audio and read it
out to the user. The application is more user-friendly, as it does not require a
separate audio file or MP3 player.
The following are the merits of the design implementation:

• Portability: As the system is web-based, learning on the move is possible anywhere
and anytime.

• Delivery Mechanism: The application is convenient to develop and very easy to use.

• User-friendly: It is user-friendly because it can be used on devices like tablets,
mobiles, and laptops.

5.1.4 Assumptions and Dependencies


The framework allows the developer to build the neural learning application with ease
and deploy it on devices that support web applications. The application developed by
the vendor will offer the user a high degree of interactivity and portability.
Commercialization of the web application may take time. The framework incorporates
best-practice web research into a practical set of web-based design requirements.

5.2 System Architecture

Figure 5.1: Architecture diagram

In today’s busy routine, people do not find time to read a book, or to convert a PDF
file into MP3 using third-party applications or web applications. In this system, we
are developing an application using Python to convert the PDF file into audio format
and read it out to the user. The application is more user-friendly, as it does not
require any audio file or MP3 player. The user has to select the PDF file they want to
listen to.


5.3 Workflow of the project


The workflow of the project is:

• In this PDF to Audio Converter, the user selects a PDF file from the desired
location by pressing the open pdf button.

• After selecting the PDF file, the user selects the type of voice, either a female
voice or a male voice.

• After selecting the PDF file, the user clicks the play button.

• If the PDF file contains page numbers, its text is extracted.

• The extracted text is printed on the console.

• The extracted text is then read aloud.

• After the text has been read, it is displayed on the QtLabel provided in the GUI.

• If the PDF file does not contain page numbers, the above operations are not
performed.

• Once the type of voice has been selected, the program reads out the PDF in the
selected voice.

• The user can tune the speed and volume of the speech.

• To exit the program, the user presses the exit button.
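
The steps above can be sketched as a small reading loop with the speech engine passed
in as a function. The report does not name its text-to-speech library, so the use of
the third-party pyttsx3 package is an assumption, as are the names SpeechSettings,
make_speaker and read_aloud. The sketch also assumes that "contains page numbers" means
the PDF yielded at least one page of text.

```python
from dataclasses import dataclass


@dataclass
class SpeechSettings:
    voice_index: int = 0   # position in the engine's voice list; which entry is male or female varies by platform
    rate: int = 150        # speaking speed in words per minute
    volume: float = 1.0    # 0.0 (silent) to 1.0 (full volume)


def make_speaker(settings):
    """Build a speak(text) function backed by pyttsx3 (assumed to be installed)."""
    import pyttsx3  # third-party library, assumed for this sketch

    engine = pyttsx3.init()
    voices = engine.getProperty("voices")
    if voices:
        # Clamp the index so an out-of-range choice falls back to the last voice.
        engine.setProperty("voice", voices[min(settings.voice_index, len(voices) - 1)].id)
    engine.setProperty("rate", settings.rate)
    engine.setProperty("volume", settings.volume)

    def speak(text):
        engine.say(text)
        engine.runAndWait()

    return speak


def read_aloud(pages, speak, show=print):
    """Show and speak each extracted page; return the number of pages read.

    With no pages, nothing is shown or spoken, mirroring the step where the
    operations are not performed.
    """
    if not pages:
        return 0
    for text in pages:
        show(text)   # console output (a GUI label could be passed instead)
        speak(text)  # the speech engine reads the page out
    return len(pages)
```

Passing speak and show as parameters keeps the loop independent of any GUI toolkit or
speech engine; a call such as read_aloud(pages, make_speaker(SpeechSettings(rate=170)))
would apply the chosen voice, speed, and volume.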


5.4 User Interfaces


• Operating system (OS):
An operating system is a set of programs that manages computer hardware resources and
provides common services for application software. The operating system is the most
important type of system software in a computer system. Without an operating system, a
user cannot run an application program on their computer, unless the application
program is self-booting.

• Application Programming Interface (API):
An API is a particular set of rules (code) and specifications that software programs
can follow to communicate with each other. It serves as an interface between different
software programs and facilitates their interaction, similar to the way the user
interface facilitates interaction between humans and computers.

An API can be created for applications, libraries, operating systems, etc., as a way of
defining their vocabularies and resource request conventions (e.g. function-calling
conventions). It may include specifications for routines, data structures, object
classes, and protocols used to communicate between the consumer program and the
implementer program of the API. An API consists of a core set of packages and classes.
As shown in Figure 4.1, the applications will be built using the M-Learning Framework.
These applications will be built by importing the libraries, include files, and style
sheets developed as a whole framework. The framework is developed from the developer’s
point of view, so that applications can be built with less time and effort. The
developer will thus access the APIs present in the framework and develop applications
by writing a small amount of code.


5.5 Hardware Interfaces


• Devices:
The applications built using Python will be deployed on laptops and tablets supporting
Web application system version 2.2 and above.

5.6 Communication Interfaces


The most important protocols for data transmission across the Internet are TCP
(Transmission Control Protocol) and IP (Internet Protocol). Using these jointly (TCP/IP),
we can link devices that access the network; some other communication protocols asso-
ciated with the Internet are POP, SMTP and HTTP.

5.7 Functional Requirements


• The system should be able to retrieve the results stored in the database using a
quick retrieval process.

• The application modules must be able to encrypt the data and decrypt it whenever
needed.
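
The retrieval requirement can be illustrated with a small keyed lookup. The report
lists MySQL as its database; the sketch below uses Python’s built-in sqlite3 module as
a stand-in so that it runs without a server, and the table name results, its columns,
and the function names are hypothetical. Encryption is not covered by this sketch.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) results table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "pdf_name TEXT PRIMARY KEY, summary TEXT NOT NULL)"
    )
    return conn


def save_result(conn, pdf_name, summary):
    """Store or overwrite the summary generated for one PDF."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO results (pdf_name, summary) VALUES (?, ?)",
            (pdf_name, summary),
        )


def load_result(conn, pdf_name):
    """Return the stored summary, or None if this PDF has not been processed."""
    row = conn.execute(
        "SELECT summary FROM results WHERE pdf_name = ?", (pdf_name,)
    ).fetchone()
    return row[0] if row else None
```

Making pdf_name the primary key lets the database answer each lookup through an index,
which is one simple way to keep retrieval quick as the number of stored results grows.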

5.8 Nonfunctional Requirement


• There should be minimal lag between processing and the result.

• The processing should be as efficient as possible, with maximum accuracy.

• The system should give valid results for positive as well as negative test cases.

• Usability: The ease with which the system can be learned, managed, or used.
Usability is a measure of how user-friendly the system is.

• Reliability: The degree to which the system must work for users. It also covers
the mean time between failures and the maximum acceptable downtime.


• Performance: Performance specifications typically refer to response time,
transaction throughput, and capacity. Response time means the time taken by the system
to load, reload, open screens, refresh, and so on.

• Scalability: The ability of the proposed software application to handle an
increasing number of users or applications associated with the product.

• Open standard: To ensure the viability and future expansion of the system, all
offered development tools, server software, and the application itself are based on
open templates and are available under the terms of the General Public License.

5.9 Summary
In this chapter we studied the operating environment and the user classes and
characteristics, which describe the scope of the project. We have also described the
software system attributes and various nonfunctional requirements.



Chapter 6

Modeling

This chapter includes the various modeling techniques that describe the various users
of the web application. It also describes the functionality of the different NLP
features.

6.1 Data Flow Diagrams


A data flow diagram (DFD) is a graphical representation that uses a standardized set
of symbols and notations to describe a business’s operations through data movement.
DFDs are often elements of a formal methodology such as the Structured Systems Analysis
and Design Method.
The objective of a DFD is to show the scope and boundaries of a system as a whole. It
may be used as a communication tool between a system analyst and any person who plays a
part in the system, and it can act as a starting point for redesigning a system. The
DFD is also called a data flow graph or bubble chart.


DFD 0 is also called the context diagram of the result management system. As the
bubbles are decomposed into less and less abstract bubbles, the corresponding data
flows may also need to be decomposed.

Figure 6.1: DFD 0 Diagram


In DFD 1, the context diagram is decomposed into multiple bubbles/processes. At this
level, we highlight the main objectives of the system and break down the high-level
process of the 0-level DFD into subprocesses.

Figure 6.2: DFD 1 Diagram


DFD 2 goes one step deeper into parts of the 1-level DFD. It can be used to plan or
record the specific, necessary details of the system’s functioning.

Figure 6.3: DFD 2 Diagram


6.2 ER Diagrams
An entity relationship diagram (ERD), also known as an entity relationship model,
is a graphical representation that depicts relationships among people, objects, places,
concepts or events within an information technology (IT) system.
Depending on the scale of change, it can be risky to alter a database structure directly in
a DBMS. To avoid ruining the data in a production database, it is important to plan out
the changes carefully. ERD is a tool that helps. By drawing ER diagrams to visualize
database design ideas, you have a chance to identify the mistakes and design flaws, and
to make corrections before executing the changes in the database.

Figure 6.4: ER Diagram


6.3 UML Diagram

6.3.1 Activity Diagram


Use cases show what your system should do. Activity diagrams allow you to spec-
ify how your system will accomplish its goals. Activity diagrams show high-level actions
chained together to represent a process occurring in your system. An activity diagram
is essentially a flowchart, showing flow of control from activity to activity. Unlike a tra-
ditional flowchart, an activity diagram shows concurrency as well as branches of control.
Activity diagrams focus on the dynamic flow of a system.

Figure 6.5: Activity Diagram


6.3.2 Sequence Diagram


The sequence diagram is used primarily to show the interactions between objects in the
sequential order in which those interactions occur. Developers typically think sequence
diagrams are meant exclusively for them. However, an organization’s business staff can
also find sequence diagrams useful for communicating how the business currently works,
by showing how various business objects interact. Sequence diagrams illustrate how
objects interact with each other. They focus on message sequences, that is, how
messages are sent and received between a number of objects. The main purpose of a
sequence diagram is to show the order of events between the parts of the system that
are involved in a particular interaction.

Figure 6.6: Sequence Diagram


6.4 Component Diagram


Component diagrams are one of the two kinds of diagrams used to model the
physical aspects of object-oriented systems. A component diagram shows the organization
of, and dependencies among, a set of components. It models the static implementation
view of a system, which involves modeling the physical things that reside on a node,
such as executables, libraries, tables, files and documents.
A component diagram shows a set of components and their relationships. Graphically,
a component diagram is a collection of vertices and arcs. Component diagrams
commonly contain:

• Components

• Interfaces

• Dependency, generalization, association and realization relationships.

Figure 6.7: Component diagram


6.5 Summary
Thus we saw the various modeling techniques used in the design of the NLP-based
system.



Chapter 7

Technical Specifications

In this chapter we will discuss the advantages and limitations of the system. We
will also go through the applications of the framework and have a brief study about the
technical requirements.

7.1 Advantages

• Text-to-speech synthesis is a rapidly growing area of computer technology and is
playing an increasingly important role in the way we interact with systems
and interfaces across a variety of platforms.

• We have identified the various operations and processes involved in text-to-speech
synthesis. We have also developed a simple and attractive graphical user
interface that allows the user to type text into the text field provided in
the application.

• It was seen that this code performs really well in reading straightforward PDF text
files.

• The system enables users to select the desired PDF, convert it to audio, and display
the text, so that the user can see which text has been read.

• The system supports students with reading disabilities.
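The read-along behaviour described in the list above (speaking the text while showing which part has been read) could be sketched roughly as follows. This is only an illustration under stated assumptions: the report does not name its speech library, so `speak` is a hypothetical callback standing in for whatever text-to-speech engine is used, and `split_sentences` and `read_aloud` are illustrative names, not functions from the project.

```python
import re
from typing import Callable, Iterator, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Split extracted PDF text into sentences ending in '.', '!' or '?'."""
    # Collapse line breaks and repeated spaces left over from the PDF layout.
    normalized = " ".join(text.split())
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    return [part for part in parts if part]


def read_aloud(text: str, speak: Callable[[str], None]) -> Iterator[Tuple[int, str]]:
    """Speak the text one sentence at a time.

    `speak` is a hypothetical callback wrapping the speech engine. After each
    sentence is spoken, its position and text are yielded so that a user
    interface can mark that sentence as read.
    """
    for index, sentence in enumerate(split_sentences(text), start=1):
        speak(sentence)
        yield index, sentence


if __name__ == "__main__":
    sample = "Metadata is extracted first. The summary is\nthen read aloud."
    # print() stands in for a real speech engine in this sketch.
    for position, sentence in read_aloud(sample, speak=print):
        print(f"[read {position}] {sentence}")
```

Passing the speech engine in as a callback keeps the sketch independent of any particular text-to-speech library.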


7.2 Limitations
• Conversion of some PDF files may fail due to errors.

• The implementation is complex.

7.3 Applications
• This feature mainly helps people with disabilities, such as those who are blind or physically disabled.

• Teachers and school librarians may also use these findings as a rationale for adding
audiobooks to the list of reading strategies used successfully with struggling readers.

• English language learners, who were usually the participants in studies on
audiobook usage, may also benefit.


7.4 Technology used


Natural language processing (NLP)

Figure 7.1: NLP Technology

Natural language processing (NLP) is the ability of a computer program to
understand human language as it is spoken and written, referred to as natural language.
It is a component of artificial intelligence (AI). NLP has existed for more than 50 years
and has roots in the field of linguistics. It has a variety of real-world applications in a
number of fields, including medical research, search engines and business intelligence.
NLP enables computers to understand natural language as humans do. Whether
the language is spoken or written, natural language processing uses artificial intelligence
to take real-world input, process it, and make sense of it in a way a computer can
understand. Just as humans have different sensors – such as ears to hear and eyes to
see – computers have programs to read and microphones to collect audio. And just as
humans have a brain to process that input, computers have a program to process their
respective inputs. At some point in processing, the input is converted to code that the
computer can understand.
There are two main phases in natural language processing: data preprocessing and
algorithm development. Data preprocessing involves preparing and "cleaning" text data
so that machines can analyze it. Preprocessing puts data in a workable form and
highlights features in the text that an algorithm can work with.
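As a rough illustration of the preprocessing phase described above, the sketch below lowercases the text, splits it into word tokens, removes common stop words, and counts the remaining terms. It uses only the Python standard library. The stop-word list and the function names `preprocess` and `term_frequencies` are assumptions made for this example, not part of the project's code.

```python
import re
from collections import Counter
from typing import List

# A deliberately small, illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "for", "in", "is", "it",
    "of", "on", "that", "the", "to", "with",
}


def preprocess(text: str) -> List[str]:
    """Clean raw text: lowercase it, tokenize it, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def term_frequencies(tokens: List[str]) -> Counter:
    """Count how often each cleaned token occurs (a simple text feature)."""
    return Counter(tokens)


if __name__ == "__main__":
    cleaned = preprocess("The extraction of metadata is the goal of the system.")
    print(cleaned)
    print(term_frequencies(cleaned).most_common(3))
```

The token counts are one example of the "features" that a later algorithm, such as a summarizer, can work with.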


7.5 System Requirements

7.5.1 Database Requirements

Figure 7.2: MySQL Database

MySQL is a fast, easy-to-use RDBMS used by many small and large businesses.
MySQL was originally developed by MySQL AB, a Swedish company, and is now developed
and supported by Oracle Corporation. MySQL is popular for several reasons:

• MySQL is released under an open-source license, so you have nothing to pay to
use it.

• MySQL is a very powerful program in its own right. It handles a large subset of
the functionality of the most expensive and powerful database packages.

• MySQL uses a standard form of the well-known SQL data language.

• MySQL works on many operating systems and with many languages including PHP,
PERL, C, C++, JAVA, etc.

• MySQL works very quickly and works well even with large data sets.

• MySQL is very friendly to PHP, the most appreciated language for web develop-
ment.


• MySQL supports large databases, up to 50 million rows or more in a table. The
default file size limit for a table is 4GB, but you can increase this (if your operating
system can handle it) to a theoretical limit of 8 million terabytes (TB).

• MySQL is customizable. The open-source GPL license allows programmers to
modify the MySQL software to fit their own specific environments.
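To show how extracted metadata might be stored with standard SQL, here is a minimal sketch. It makes two assumptions that should be read carefully: the table `document_metadata` and its columns are hypothetical (the report's actual schema is not given in this chapter), and Python's built-in `sqlite3` module is used as a stand-in so that the example runs without a MySQL server. With a MySQL driver the SQL would be similar, although the placeholder style differs between drivers.

```python
import sqlite3
from typing import List, Tuple


def create_store() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical metadata table."""
    conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
    conn.execute(
        """
        CREATE TABLE document_metadata (
            id INTEGER PRIMARY KEY,
            file_name VARCHAR(255) NOT NULL,
            title VARCHAR(500),
            authors VARCHAR(500)
        )
        """
    )
    return conn


def save_metadata(conn: sqlite3.Connection, file_name: str, title: str, authors: str) -> int:
    """Insert one document's metadata and return the new row id."""
    cursor = conn.execute(
        "INSERT INTO document_metadata (file_name, title, authors) VALUES (?, ?, ?)",
        (file_name, title, authors),
    )
    conn.commit()
    return cursor.lastrowid


def find_by_title(conn: sqlite3.Connection, keyword: str) -> List[Tuple[str, str]]:
    """Return (file_name, title) pairs whose title contains the keyword."""
    rows = conn.execute(
        "SELECT file_name, title FROM document_metadata WHERE title LIKE ? ORDER BY id",
        ("%" + keyword + "%",),
    )
    return rows.fetchall()


if __name__ == "__main__":
    store = create_store()
    save_metadata(store, "paper1.pdf", "Text Summarization Survey", "A. Author")
    print(find_by_title(store, "Summarization"))
```

Parameterized queries are used instead of string formatting so that titles containing quotes or other special characters are stored safely.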

7.5.2 Software Requirements (Platform Choice)


• Python

Figure 7.3: Python software programming language

Python is a multi-paradigm programming language. Object-oriented programming
and structured programming are fully supported, and many of its features
support functional programming and aspect-oriented programming (including
metaprogramming and metaobjects).
Python is an interpreted, object-oriented, high-level programming language with
dynamic semantics. Its high-level built-in data structures, combined with dynamic
typing and dynamic binding, make it very attractive for Rapid Application
Development, as well as for use as a scripting or glue language to connect existing
components together. Python’s simple, easy-to-learn syntax emphasizes readability
and therefore reduces the cost of program maintenance. Python supports modules
and packages, which encourages program modularity and code reuse.


• XAMPP

Figure 7.4: XAMPP software

XAMPP is a widely used cross-platform web server stack that helps developers
create and test their programs on a local web server. It was developed by
Apache Friends, and because it is open source, its source code can be revised or
modified by its users. It consists of the Apache HTTP Server, MariaDB, and interpreters
for programming languages such as PHP and Perl. It is available in 11 languages
and is supported on different platforms, such as the IA-32 package for Windows and the
x64 packages for macOS and Linux.
XAMPP lets developers test a website and its clients on a local computer
before releasing it to the main server. It provides a
suitable environment to test and verify the working of projects based on Apache,
Perl, a MySQL-compatible database, and PHP on the host system itself. Among
these technologies, Perl is a general-purpose programming language that can be used
for web development, PHP is a backend scripting language, and MariaDB is a widely
used database that originated as a community-developed fork of MySQL.


• JavaScript
JavaScript is a lightweight, interpreted programming language. It is designed for
creating network-centric applications. Despite its name, it is a separate language from
Java. JavaScript is easy to use in web pages because it integrates directly with HTML. It
is open and cross-platform. JavaScript is one of the most popular programming languages
in the world, which makes it a common choice for programmers. Once learned,
JavaScript can be used to develop both front-end and back-end software
using JavaScript-based libraries and runtimes such as jQuery and Node.js.

The software requirements are as follows:

1. Operating System : Windows XP/7/8/10

2. Programming Language : Python

3. Software Version : Python 4.4

4. Tools : Anaconda/PyCharm

5. Front End : Python

7.5.3 Hardware Requirements


1. Processor - Pentium IV/Intel Core i3

2. Speed - 1.1 GHz

3. RAM - 512 MB(min)

4. Hard disk - 20 GB

5. Keyboard - Standard Keyboard

6. Mouse - Two Or Three Button Mouse

7. Monitor - LED Monitor


7.6 Summary
In this chapter we were made aware of the various advantages of the framework and
also the limitations of the project. We also saw the hardware and software requirements
of the project.



Chapter 8

Conclusion

The conclusion of this project is that the client gets a web application that
executes on the client side and produces a summary of the input document as per the
client’s requirements. The automatically generated summary helps the client understand
the core concept of the document within a few lines instead of reading the whole document.
It was seen that this code performs really well in reading straightforward PDF text files.
The system enables users to select the desired PDF, convert it to audio, and display the
text, so that the user can see which text has been read. It also supports students
with reading disabilities. The success of this research project is significant given the broad
use of audiobooks in literacy and library programs across the United States. Teachers
and school librarians may also use these findings as a rationale for adding audiobooks to
the list of reading strategies used successfully with struggling readers.
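As an illustration of how a short summary of a document can be generated automatically, the sketch below ranks sentences by the average frequency of their words and keeps the top-ranked ones in their original order. This is a simple frequency-based extractive technique chosen for the example; it is an assumption, not necessarily the method implemented in this project, and the stop-word list and the function name `summarize` are hypothetical.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "for", "in", "is", "it",
    "of", "on", "that", "the", "to", "with",
}


def _content_words(text: str) -> List[str]:
    """Lowercase the text and return its word tokens without stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def summarize(text: str, max_sentences: int = 2) -> str:
    """Return the highest-scoring sentences, kept in document order."""
    normalized = " ".join(text.split())
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", normalized) if s]
    frequencies = Counter(_content_words(normalized))

    def score(sentence: str) -> float:
        # Average word frequency, so long sentences are not favoured automatically.
        words = _content_words(sentence)
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    document = (
        "Metadata extraction finds the title. "
        "Metadata extraction finds authors. "
        "The weather is nice."
    )
    print(summarize(document, max_sentences=2))
```

The resulting summary text could then be passed to the text-to-speech step so that the user hears only the core content of the document.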
