Met PDF Extraction System
Project Report on
OF
SUBMITTED BY
May, 2022
Certificate
This is to certify that the project report entitled
are bonafide students of this institute and the work has been carried out
by them under the guidance of Prof. Atul Chaudhary and it is approved
for the partial fulfillment of the requirement of Savitribai Phule Pune University
for the award of the degree of Bachelor of Engineering (Computer Engineering).
Date: / /
Acknowledgement
We have put considerable effort into this project. However, it would not have been
possible without the kind support and help of many individuals and organizations. We
would like to extend our sincere thanks to all of them. It gives us great pride to complete
the project on “Metadata extraction from scientific pdf”. We are highly indebted to our
internal guide, Prof. Atul Chaudhary, for his guidance and constant supervision, for
providing the necessary information regarding the project, and for his support in
completing it.
We are also extremely grateful to our respected H.O.D. (Computer Department),
Dr. M. U. Kharat, and to Dr. P. N. Metange (Project Co-ordinator) for providing all
facilities and every help for the smooth progress of the project work.
Abstract
With the World Wide Web available in every corner of the world, the amount of
information on the internet is growing at an exponential rate. Given people's busy
schedules and the immense amount of information available, the need for information
abstraction or summarization is increasing. Whether browsing through seemingly endless
pages of terms and conditions in an important official document or relaxing with an
intriguing eBook, reading is an inescapable part of everyday life. However, reading
demands our undivided attention, making it nearly impossible to multitask. This Online
PDF to Audio Converter and Translator, built with Python (Django), can instantly convert
any PDF text into audio. Along with reading any PDF document out loud, the application
can translate and vocalize text in up to five languages. Text summarization presents the
user with a shorter version of the text containing only the vital information, and thus
helps the user understand the text in less time. The goal of this project is to condense
documents or reports into a shorter version that preserves the important content, and to
convert the summarized text into audio for better understanding by the user.
Contents
Acknowledgement
1 Introduction
1.1 Objective
2 Literature Survey
2.1 Literature Review papers
2.2 Summary
3 Problem Definition
3.1 Summary
4 Analysis
4.1 Project Plan
4.1.1 Project Plan for semester I
4.2 Requirement Analysis
4.2.1 Necessary Functions
4.2.2 Desirable Functions
4.3 Summary
5 Design
5.1 Project Scope
5.1.1 Operating Environment
5.1.2 User Classes and Characteristics
5.1.3 Design and Implementation Constraints
5.1.4 Assumptions and Dependencies
5.2 System Architecture
Metadata extraction from scientific pdf
6 Modeling
6.1 Data Flow Diagrams
6.2 ER Diagrams
6.3 UML Diagram
6.3.1 Activity Diagram
6.3.2 Sequence Diagram
6.4 Component Diagram
6.5 Summary
7 Technical Specifications
7.1 Advantages
7.2 Limitations
7.3 Applications
7.4 Technology used
7.5 System Requirements
7.5.1 Database Requirements
7.5.2 Software Requirements (Platform Choice)
7.5.3 Hardware Requirements
7.6 Summary
8 Conclusion
References
Chapter 1
Introduction
readers, and others. Using this PDF to Audio Converter, users can listen to their
favorite PDF while going about their daily routine. The application converts text from
a PDF to audio using predefined Python libraries [1].
1.1 Objective
• Makes ideas easy to grasp: Reading an entire article, breaking it down, and
separating the important ideas from the original text takes time and effort.
• Improves quality: Some software summarizes not only documents but also web
pages.
Chapter 2
Literature Survey
Many researchers aim to gather knowledge on how human beings understand and
use language, so that appropriate tools and techniques can be developed to make
computer systems understand and manipulate natural languages to perform the desired
tasks. Phonological rules are captured through machine learning on training sets.
summarizing. In this method, GP is used for the structural part and GA for the string
part (membership functions). The goal is to develop an optimal intelligent system
that extracts important sentences from texts by reducing the redundancy of data.
The method is applied to three test documents and compared with standard fuzzy
systems as well as two commercial summarizers: Microsoft Word and Copernic
Summarizer. Simulations demonstrate several significant improvements with
the proposed approach.
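The surveyed method relies on a GP/GA-tuned fuzzy system, which is not reproduced here.
As a much simpler stand-in for the idea of reducing redundancy when extracting sentences,
the sketch below drops any sentence whose word overlap (Jaccard similarity) with an
already selected sentence exceeds a threshold. The function names and the `max_overlap`
parameter are hypothetical and chosen only for illustration.

```python
import re
from typing import List, Set


def tokenize(sentence: str) -> Set[str]:
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_non_redundant(sentences: List[str], max_overlap: float = 0.5) -> List[str]:
    """Keep sentences in order, skipping those too similar to one already kept.

    Illustrative assumption only: this is not the GP/GA fuzzy method
    described in the surveyed paper.
    """
    kept: List[str] = []
    kept_tokens: List[Set[str]] = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if all(jaccard(tokens, seen) <= max_overlap for seen in kept_tokens):
            kept.append(sentence)
            kept_tokens.append(tokens)
    return kept
```

A lower `max_overlap` removes more near-duplicates at the risk of discarding sentences
that merely share common vocabulary.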
2.2 Summary
In this chapter we discussed the research relevant to our system and identified
the needs of its current users.
Chapter 3
Problem Definition
The problem is to build a system for document summarization. The system produces
multi-document as well as single-document summaries, using data mining techniques to
identify terms that are common across the set of documents.
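One way to read "identifying common terms across the set of documents" is sketched
below, under stated assumptions: a term counts as common when it appears in at least
`min_docs` documents, and sentences are ranked by how many common terms they contain.
The stop-word list, function names, and parameters are hypothetical; the report does
not specify the actual data mining technique.

```python
import re
from collections import Counter
from typing import List, Set

# Small illustrative stop-word list (an assumption, not from the report).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def terms(text: str) -> List[str]:
    """Return lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def common_terms(documents: List[str], min_docs: int = 2) -> Set[str]:
    """Terms that occur in at least `min_docs` of the documents."""
    doc_freq: Counter = Counter()
    for doc in documents:
        doc_freq.update(set(terms(doc)))  # count each term once per document
    return {term for term, count in doc_freq.items() if count >= min_docs}


def summarize(documents: List[str], max_sentences: int = 2, min_docs: int = 2) -> List[str]:
    """Pick the sentences richest in common terms, returned in original order."""
    shared = common_terms(documents, min_docs)
    sentences = [
        s.strip()
        for doc in documents
        for s in re.split(r"(?<=[.!?])\s+", doc)
        if s.strip()
    ]
    scored = [
        (sum(1 for t in set(terms(s)) if t in shared), index, s)
        for index, s in enumerate(sentences)
    ]
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return [s for _, _, s in sorted(top, key=lambda item: item[1])]
```

For a single document, calling `summarize([text], min_docs=1)` treats every term as
common, so sentences are ranked simply by how many distinct non-stop-word terms they contain.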
3.1 Summary
To gain insight into how supply chain management, when integrated with a
blockchain-enabled platform, can give businesses a competitive edge and help overcome
the challenges and problems that organizations face in supply chain operations.
Chapter 4
Analysis
This chapter describes the project plan adopted and presents the requirement
analysis. We have implemented the project on the basis of the Rapid Application
Development (RAD) model and the Model View Controller (MVC) model.
4.3 Summary
In this chapter we described the implementation details of the project plan for
Semester I and Semester II. We also studied the necessary functions and the desirable
functions of our system.
Chapter 5
Design
time. It incorporated best practice web research into a practical framework of web based
design requirements.
In today's busy routine, people do not find time to read a book or to convert a PDF
file to MP3 using third-party applications or web applications. In this system we
develop a Python application that converts a PDF file into audio and reads it out to
the user. The application is more user-friendly because it does not require a separate
audio file or MP3 player. The user only has to select the PDF file to listen to.
• In this PDF to Audio Converter, the user selects a PDF file from the desired
location by pressing Open PDF.
• After selecting the PDF file, the user selects the type of voice, female or male.
• After selecting the PDF file, the user clicks the Play button.
• If the PDF file contains page numbers, its text is extracted.
• After the text is read, it is printed on the QtLabel provided in the GUI.
• If the PDF file does not contain page numbers, the above operations are not
performed.
• After the type of voice is selected, the program reads out the PDF in the
selected voice.
• The system's modules must be able to encrypt the data and decrypt it
whenever needed.
• The system should give valid results for positive as well as negative test cases.
• Usability: The ease with which the system can be learned, managed or used.
Usability measures how user-friendly the system is.
• Reliability: The degree to which the system must work for users. It also covers
the mean time between failures and the maximum acceptable downtime.
• Open standard: To ensure the viability and future expansion of the system,
all development tools, server software, and the application itself are based
on open templates and are available under the terms of the General Public License.
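The conversion steps listed earlier in this chapter (select a PDF, choose a voice,
extract the text, display it, and read it aloud) can be sketched as the small workflow
below. To keep the sketch self-contained, PDF extraction, display, and speech are passed
in as plain callables; `read_pdf_aloud`, `speak`, and `show` are hypothetical names, and
the report does not state which libraries it actually uses.

```python
from typing import Callable, Iterable

VOICES = ("female", "male")


def read_pdf_aloud(
    pages: Iterable[str],
    voice: str,
    speak: Callable[[str, str], None],
    show: Callable[[str], None],
) -> int:
    """Display and vocalize each non-empty page; return the number of pages read.

    `pages` is the text already extracted from the PDF, one string per page.
    In a real application it might come from a PDF library and `speak` might
    wrap a text-to-speech engine; both are assumptions, not the report's code.
    """
    if voice not in VOICES:
        raise ValueError(f"voice must be one of {VOICES}, got {voice!r}")
    pages_read = 0
    for text in pages:
        text = text.strip()
        if not text:
            continue  # nothing extractable on this page, so skip it
        show(text)          # e.g. print the text on the GUI label
        speak(text, voice)  # read the text out in the selected voice
        pages_read += 1
    return pages_read


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without a GUI or audio device.
    count = read_pdf_aloud(
        ["First page text.", "Second page text."],
        "female",
        speak=lambda text, voice: print(f"[{voice}] speaking: {text}"),
        show=lambda text: print(f"display: {text}"),
    )
    print(f"pages read: {count}")
```

Passing the extraction and speech backends as parameters keeps the workflow testable and
lets the GUI swap in a different voice engine without changing the reading loop.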
5.9 Summary
In this chapter we studied the operating environment and the user classes and
characteristics, which describe the scope of the project. We have also described the
software system attributes and various non-functional requirements.
Chapter 6
Modeling
This chapter presents the modeling techniques that describe the various users of the
web application. It also describes the functionality of the different features of
the NLP system.
DFD level 0 is also called the context diagram of the result management system. As
the bubbles are decomposed into less and less abstract bubbles, the corresponding data
flows may also need to be decomposed.
DFD level 2 goes one process deeper into parts of the level 1 DFD. It can be used to
plan or record the specific, necessary detail about the system’s functioning.
6.2 ER Diagrams
An entity relationship diagram (ERD), also known as an entity relationship model,
is a graphical representation that depicts relationships among people, objects, places,
concepts or events within an information technology (IT) system.
Depending on the scale of change, it can be risky to alter a database structure directly in
a DBMS. To avoid ruining the data in a production database, it is important to plan out
the changes carefully. ERD is a tool that helps. By drawing ER diagrams to visualize
database design ideas, you have a chance to identify the mistakes and design flaws, and
to make corrections before executing the changes in the database.
• Components
• Interfaces
6.5 Summary
In this chapter we saw the various modeling techniques used to design the
NLP system.
Chapter 7
Technical Specifications
In this chapter we will discuss the advantages and limitations of the system. We
will also go through the applications of the framework and have a brief study about the
technical requirements.
7.1 Advantages
• We have identified the various operations and processes involved in text-to-speech
synthesis. We have also developed a simple and attractive graphical user
interface that allows the user to type text into the text field provided in
the application.
• The code performs well when reading straightforward PDF text files.
• The system should enable users to select the desired PDF, convert it to audio, and
display the text so that the user can see which text has been read.
7.2 Limitations
• Conversion issue due to some error.
• Programming is complex.
7.3 Applications
• This feature will mostly help people with disabilities, such as blind users.
• Teachers and school librarians may also use these findings as a rationale for adding
audiobooks to the list of reading strategies used successfully with struggling readers.
• Those who participated in the studies on audiobook usage, usually English
Language Learners.
MySQL is a fast, easy-to-use RDBMS used by many small and large businesses.
MySQL was originally developed, marketed and supported by MySQL AB, a Swedish
company, and is now owned by Oracle Corporation. MySQL is popular for several reasons:
• MySQL is a very powerful program in its own right. It handles a large subset of
the functionality of the most expensive and powerful database packages.
• MySQL works on many operating systems and with many languages including PHP,
PERL, C, C++, JAVA, etc.
• MySQL works very quickly and works well even with large data sets.
• MySQL works well with PHP, a widely used language for web development.
• XAMPP
XAMPP is a widely used cross-platform web server package that helps developers
create and test their programs on a local web server. It was developed by
Apache Friends, and its source code is open and can be modified by users.
It consists of the Apache HTTP Server, MariaDB, and interpreters for programming
languages such as PHP and Perl. It is available in 11 languages and supports
different platforms, such as the IA-32 package for Windows and the x64
packages for macOS and Linux.
XAMPP lets developers test a website and its clients on a local host, using
computers and laptops, before releasing it to the main server. It furnishes a
suitable environment to test and verify projects based on Apache, Perl, MySQL
database, and PHP on the host system itself. Among these technologies, Perl is a
programming language used for web development, PHP is a backend scripting
language, and MariaDB is a widely used database forked from MySQL.
• JavaScript
JavaScript is a lightweight, interpreted programming language designed for
creating network-centric applications. Despite the similar name, it is a separate
language from Java. JavaScript is easy to adopt because it integrates directly
with HTML, and it is open and cross-platform. It is one of the most popular
programming languages in the world, which makes it a common choice for
programmers. Once learned, it can be used to develop front-end as well as back-end
software with JavaScript-based libraries and runtimes such as jQuery and Node.js.
4. Tools: Anaconda/PyCharm
4. Hard disk - 20 GB
7.6 Summary
In this chapter we covered the advantages of the framework and the limitations of
the project. We also listed the hardware and software requirements of the project.
Chapter 8
Conclusion
The conclusion of this project is that the client gets a web application that
executes on the client side and produces a summary of the input document as per the
client's requirement. The automatically generated summary helps the client understand
the core concept of the document within a few lines instead of reading the whole document.
The code performs well when reading straightforward PDF text files. The system should
enable users to select the desired PDF, convert it to audio, and display the text so
that the user can see which text has been read. It should also help students with
reading disabilities. The success of this research project is significant given the broad
use of audiobooks in literacy and library programs across the United States. Teachers
and school librarians may also use these findings as a rationale for adding audiobooks to
the list of reading strategies used successfully with struggling readers.
Bibliography
[1] Pankaj Gupta, Vijay Shankar Pendhluri, Ishant Vats, “Summarizing text by ranking
text units according to shallow linguistic features”, ICACT 2011, Feb. 13-16, 2011.
[5] Ranjit Bose, “Natural Language Processing: Current state and future directions”,
International Journal of the Computer, the Internet and Management, Vol. 121, January
– April, 2004.
[7] Pdf. (2021, March 08). Retrieved March 09, 2021, from
https://en.wikipedia.org/wiki/PDF
[8] 7 ways Audio books benefit students who struggle with reading. (n.d.). Retrieved
March 09, 2021, from https://learningally.org/Solutions-for School/7-Ways-Audio
books-Benefit-Students WhoStruggleWith-Reading