
Text Analytics Project

By

Ansh Vyas
20BCM005

Guided By

Prof Gaurang Raval


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


Ahmedabad 382481

CERTIFICATE

This is to certify that the Computer Engineering Project entitled Text Analytics submitted by
Ansh Vyas 20BCM005, towards the partial fulfillment of the requirements for the degree of
Integrated B.Tech.(CSE)-MBA of Nirma University is the record of work carried out by
him/her under my supervision and guidance. In my opinion, the submitted work has reached
the level required for being accepted for examination.

Prof. Gaurang Raval Dr. Madhuri Bhavsar,


Associate Professor Professor and HOD,
Computer Science and Engineering Dept., Computer Science and Engineering Dept.,
Institute of Technology, Institute of Technology,
Nirma University, Nirma University,
Ahmedabad Ahmedabad

CERTIFICATE

ACKNOWLEDGEMENT

I would like to take this opportunity to express my sincere gratitude to the following
individuals and organizations for their support and assistance throughout the course
of this project.
I would like to express my appreciation to Dr. Gaurang Raval, my project guide, for
his constant encouragement, valuable insights, and guidance throughout the course of
this project, and for his constant suggestions to improve my work.
This report includes all the information and tasks that I carried out during the
internship period of 8 weeks.
I wish to express my sincere gratitude to the whole company and my faculty mentor
for being so supportive and helpful during the whole journey.
Thank you all for your support and encouragement.

ABSTRACT

This project aims to develop a text analytics system that extracts valuable insights from large
volumes of news articles. By analyzing news data from various sources, the system can
provide users with real-time information about trending topics, sentiment analysis, and key
events.
The project leverages web crawling, ECC and Data Visualization techniques to classify news
articles into different categories, such as politics, sports, and finance, enabling users to filter
and focus on their areas of interest. Additionally, the system incorporates named entity
recognition to identify and track specific entities mentioned in the news, such as crime rates,
education rates etc. in different districts of Rajasthan.
The extracted information is then visualized through intuitive dashboards and interactive
charts, enabling users to understand and interpret news trends effectively. This text analytics
system is made for different departments of Govt. of Rajasthan seeking to stay informed about
the latest developments and make data-driven decisions based on comprehensive news
analysis.

The main functions are listed below:


1. Crawl and scrape news articles from online sources such as websites,
blogs, social media, etc.
2. Extract relevant information from the articles such as topics, keywords,
entities, sentiments, etc.
3. Store and organize the news data in a scalable and secure database
4. Apply various analytical techniques and models to the news data such as
Clustering, classification, summarization etc.
5. Deciding whether the news is positive, negative, or neutral using ECC.
6. Provide an interactive dashboard and visualization tools for users to
explore and query the news data.

List of Figures

Figure 1: Company Logo

Figure 2: Context Diagram

Figure 3: First Level Data Flow Diagram

Figure 4: Second Level Data Flow Diagram

Figure 5: System Flow Diagram

Figure 6: Data Crawling Step

Figure 7: Content Categorization Step

Figure 8: Sentimental Analysis Step

Figure 9: Data Visualization Step

Figure 10: General Structure Of Unit Tests

Figure 11: Usage Of SAS Unit

Figure 12: Test Report For SASUnit

CONTENTS

Certificate
Acknowledgment
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
1.1 ABOUT THE COMPANY
1.1.1 Introduction of the company
1.1.2 Quality policy
1.1.3 Communication
1.1.4 Resources
1.2 THE SYSTEM
1.2.1 Definition of system
1.2.2 Purpose and objectives
1.2.3 About present system
1.2.4 Proposed system
1.3 PROJECT PROFILE
1.3.1 Project title
1.3.2 Scope of the project
1.3.3 Project team
1.3.4 Hardware/Software environment in the company
Chapter 2 System Analysis
2.1 FEASIBILITY STUDY
2.1.1 Operational Feasibility
2.1.2 Technical Feasibility
2.1.3 Financial and economic feasibility
2.1.4 Handling infeasible projects
2.2 REQUIREMENT ANALYSIS
2.2.1 Fact-Finding Techniques
2.2.1.1 Interview
2.2.1.2 Questionnaire
2.2.1.3 Record Review
2.2.1.4 Observation
2.3 CONTEXT DIAGRAM
2.4 DATA FLOW DIAGRAMS
2.4.1 First level DFD
2.4.2 Second level DFD
Chapter 3 System Design
3.1 System flow
3.2 Entity-Relationship Diagram
3.3 Data dictionary
Chapter 4 Results and Discussion
4.1 Results
4.2 Discussion
Chapter 5 User Manual
Chapter 6 Testing
Chapter 7 Future Enhancement
Appendices
A. Tools used
B. Additional Material
References

Chapter 1 : Introduction

1.1 ABOUT THE COMPANY

1.1.1 Introduction of the company

Fig.1

Company Name: E-Connect Solutions Pvt. Ltd.
Registered Office: G18, 19, 20, IT Park, Udaipur, Rajasthan 313002
Phone No.: 0294 305 7413
Website: https://www.e-connectsolutions.com/

E-Connect Solutions Private Limited provides comprehensive end-to-end business and IT
solutions that improve business operations. With over 30 years of experience, E-Connect
remains committed to providing innovative solutions leveraging the best business and IT
mindsets to its customers in India and around the world.

E-Connect has expertise in large-scale IT infrastructure projects and specializes in
technologies such as Oracle, DB2, MS SQL, MySQL, .NET, Java, the Java Mobile Framework
and data analytics. The company has experience in multiple disciplines and diverse business
areas using a wide range of e-governance and enterprise solutions, and holds over 500
employee qualifications and certifications along with CMMI Level 5 and ISO quality
certificates such as ISO 9001:2015, ISO 27001, ISO 14001 and ISO 20000-1.

E-Connect has built quality software such as Citizen CONNECT, WorkX, Anytime Auction
and E-Prashashan, and has also worked with the government's Ministry of Information
Technology and Communications. Operating from Rajasthan, the company works on major
IT projects such as text analytics.

1.1.2 Quality policy

E-Connect Solutions' quality policy is to provide IT solutions that meet or exceed the
expectations of its customers and stakeholders. The company is committed to delivering
high-quality products and services that are reliable, secure, and innovative. It strives to
continuously improve its processes, performance, and customer satisfaction, adheres to the
best practices and standards of the IT industry, complies with all applicable laws and
regulations, and fosters a culture of quality, accountability, and excellence among its
employees and partners.

1.1.3 Communication

In our opinion, open and honest communication is critical to a company's success. We strive
to provide accurate and timely information to all stakeholders, including employees and
customers.

Our communication goals include:
• Increase employee engagement
• Improve customer satisfaction
• Establish a strong brand for our company

We monitor the following to assess how well our communications are working:
• Employee satisfaction surveys
• Customer satisfaction surveys
• Website visitors

We always strive to improve our communication. We regularly review our communication
methods, tone and goals to ensure that we meet the needs of our employees, customers and
other stakeholders.

1.1.4 Resources

The resources of E-Connect Solutions are:

Financial resources: These are the funds and sources of income that the company needs to
operate, invest, grow and innovate. Examples of financial resources are cash, loans, equity,
grants, revenue and profits.

Human resources: These are the people who work for E-Connect and contribute their skills,
knowledge, creativity and motivation. Examples of human resources are employees,
managers, leaders, consultants, contractors and partners.

Material resources: These are the physical items and infrastructure that E-Connect uses to
produce its products or services. Examples of material resources are hardware, software,
equipment, facilities, networks and data centers.

Intellectual resources: These are the intangible assets and knowledge that the company owns
or accesses to gain a competitive advantage. Examples of intellectual resources are patents,
trademarks, copyrights, trade secrets, brand, reputation and expertise.

Collaborative resources: These are the clients for which E-Connect works besides its own
projects, such as DOITC, a government organization for which E-Connect Solutions works.

1.2 THE SYSTEM

1.2.1 Definition of system

Different departments of the Govt. of Rajasthan track all the news about topics related to
them, for example crime rates in different districts of Rajasthan. They track whether the
sentiment of each news item is positive, negative, or neutral and then make interactive
dashboards to visually represent the data.

1.2.2 Purpose and objectives

The main purpose of the system is to track all the happenings in the state of Rajasthan, such
as the number of rape cases and crime rates. The main objective of the system is to forward
the interactive dashboards to different govt. departments so that they can work efficiently and
have a reality check.

1.2.3 About Present System

Employees of different govt. departments had to read all the news resources one by one.
They then had to make a data frame in which they recorded the data from the news sources,
customize the data according to its severity, and finally perform data visualization on the
particular dataset made.

1.2.4 Proposed System

With the help of the latest data analytics tools like SAS, we can crawl the data from different
news sources, preprocess the data according to our requirements, and then use ECC tools to
categorize the data according to sentiment. SAS Visual Analytics is used for making
interactive dashboards which help with easy understanding of the data, and these dashboards
are sent to DOITC.

1.3 PROJECT PROFILE.

1.3.1 Text Analytics

Text analytics is the process of applying data analysis techniques to news content to extract
insights and trends from it. Text analytics can be used for various purposes, such as:

- Monitoring the media coverage of a topic, event, organization, or person


- Identifying the sentiment, tone, and bias of news articles
- Discovering emerging topics, themes, and keywords in news content
- Analyzing the impact of news on public opinion, markets, or policies
- Generating summaries, headlines, or recommendations based on news content

Text analytics can be performed using different methods and tools, such as:

- Text mining, which is the process of extracting information from unstructured text data
- Machine learning, which is the field of computer science that enables machines to learn
from data and make predictions
- Data visualization, which is the presentation of data in graphical or interactive forms

Text analytics projects can be used by a variety of organizations, including:

Media companies: Text analytics can be used by media companies to track the performance
of their news outlets and to identify trends in news consumption.

Government agencies: Government agencies can use text analytics to track public opinion on
a variety of issues and to identify emerging threats.
Businesses: Businesses can use text analytics to track their competitors, identify new market
opportunities, and gauge the effectiveness of their marketing campaigns.
Non-profit organizations: Non-profit organizations can use text analytics to track public
opinion on their issues, identify potential donors, and measure the impact of their programs.
Text analytics is a powerful tool that can be used to gain insights into current events, trends,
and public opinion. By using text analytics, organizations can make better decisions, improve
their marketing campaigns, and track the effectiveness of their public relations efforts.

Here are some of the benefits of using text analytics:

Gain insights into current events: Text analytics can help you to track the latest news and
trends, so that you can stay ahead of the curve.
Identify emerging threats: Text analytics can help you to identify potential threats to your
business or organization, so that you can take steps to mitigate them.
Track your competitors: Text analytics can help you to track your competitors' activities, so
that you can stay ahead of the competition.
Identify new market opportunities: Text analytics can help you to identify new market
opportunities, so that you can expand your business.
Gauge the effectiveness of your marketing campaigns: Text analytics can help you to
gauge the effectiveness of your marketing campaigns, so that you can improve your results.
Track the effectiveness of your public relations efforts: Text analytics can help you to
track the effectiveness of your public relations efforts, so that you can improve your
reputation.

1.3.2 Scope of the project

Data collection: The first step in the project is to crawl the data from a variety of sources,
such as news websites, which can be done with SAS Enterprise Guide.
Data cleaning: The next step is to clean the data by removing errors and inconsistencies. This
can be a time-consuming process, but it is important to ensure that the data is accurate and
reliable.
Data analysis: The third step is to analyze the data using a variety of statistical and machine
learning techniques. This can be used to identify trends, patterns, and relationships in the data.
Data visualization: The fourth step is to visualize the data using charts, graphs, and other
visuals. This can help to make the data more understandable and easier to communicate,
which can be done with SAS Visual Analytics.
Reporting: The final step is to create reports that summarize the findings of the analysis.
These reports can be sent to different govt. authorities.

These reports can give the govt. departments a reality check on what needs to be improved,
so that they can work accordingly. A minimal sketch of the reporting step is shown below.
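As an illustration of the reporting step only, the crawled dataset (gg.TOI_TEST1, created by
the crawling program shown later in the Results chapter) can be summarized with PROC SQL
before it is loaded into SAS Visual Analytics. This is a sketch under that assumption, not the
production report.

/* Illustrative reporting step: summarize the crawled dataset gg.TOI_TEST1 */
/* (created by the crawling program) before loading it into SAS Visual Analytics. */
proc sql;
   create table work.daily_summary as
   select date,
          News_Source,
          count(*) as article_count
   from gg.TOI_TEST1
   group by date, News_Source
   order by date desc;
quit;

/* Print the summary so it can be included in a report */
proc print data=work.daily_summary noobs label;
   label article_count="Number of Articles";
run;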

1.3.3 Project Team

News Refiner: Their job was to identify which news sources have the most accurate news and
tell the tech team to crawl from those sources.
Data Engineers: Their job was to crawl news from the given sources and then pre-process,
export and store that data.
Data Analyst: They extract insights from the data and categorize it by sentiment using the
SCC tool and the SAS Sentiment Analysis tool.
Data Visualizer: They create interactive dashboards and reports and forward their findings to
different govt. departments.

The project team works collaboratively and efficiently to deliver the best possible results for
the project.

1.3.4 Hardware/Software environment in the company.

SAS Enterprise Guide: SAS Enterprise Guide is a powerful, Windows-based application that
provides a wide range of features for advanced SAS users and also supports SAS SQL.
SAS Sentiment Analysis Studio: SAS Sentiment Analysis Studio is software that
automatically rates and classifies opinions expressed in electronic text so they can be
understood quickly.
SAS Content Categorization Studio: It is used to categorize the contents of our dataset.
SAS Visual Analytics: It is software used for creating interactive dashboards and reports.

Chapter 2 : System Analysis

2.1 FEASIBILITY STUDY

2.1.1 Operational Feasibility

There are a number of factors that can affect the operational feasibility of text analytics. One
important factor is the availability of resources. Text analytics can be a complex and data-
intensive process, so it is important to have the necessary resources in place, such as data
storage, computing power, and staff expertise.

Another important factor is the ability to integrate text analytics with existing systems. Text
analytics tools can be used to collect and analyze data from a variety of sources, such as social
media, news websites, and financial data feeds.

Finally, it is important to be able to train staff on how to use text analytics tools. Text
analytics can be a complex and technical process, so it is important to make sure that staff
have the necessary training to use the tools effectively.

2.1.2 Technical Feasibility

There are a number of factors that can affect the technical feasibility of text analytics. One
important factor is the availability of data sources. News data can be collected from a variety
of sources, such as news websites, social media, and financial data feeds. It is important to
have access to a variety of data sources in order to get a comprehensive view of the news
landscape.

Another important factor is the ability to process large amounts of data. News data can be
very large and complex. It is important to have the ability to process this data quickly and
efficiently in order to generate insights in a timely manner.

Finally, it is important to be able to develop and deploy analytical models. Text analytics can
be used to develop a variety of analytical models, such as sentiment analysis models, topic
modeling models, and predictive models. It is important to be able to develop and deploy
these models in a way that is efficient and effective.

Here are some specific examples of how text analytics can be used to improve technical
efficiency:

Sentiment analysis: Sentiment analysis can be used to track public opinion about a company
or product. This information can then be used to improve marketing campaigns and product
development.
Topic modeling: Topic modeling can be used to identify trends and patterns in news data.
This information can then be used to develop new products and services, or to improve
existing products and services.

2.1.3 Financial and economic feasibility

The cost of a text analytics project can vary depending on the size and scope of the project.
Some of the costs associated with a text analytics project include:

- The cost of data collection and storage


- The cost of software and hardware
- The cost of staffing
- The cost of marketing and promotion

The potential benefits of a text analytics project can also vary depending on the specific goals
of the project. Some of the potential benefits of a text analytics project include:

- Improved decision-making
- Increased customer engagement
- Enhanced brand reputation
- Reduced risk
- Increased revenue

2.1.4 Handling infeasible projects

Reasons for infeasible projects are:

Lack of data strategy and governance: This can lead to data silos, inconsistencies, and
inaccuracies that affect the quality and reliability of text analytics.
Challenges with data availability: This can occur when there are delays or difficulties in
accessing and integrating data from various sources, especially legacy systems.
Poor data quality: This can result from errors, noise, outliers, missing values, or duplication
in the data that can affect the accuracy and validity of text analytics.
Inappropriate or inadequate analytical methods: This can happen when the chosen
methods are not suitable for the type, size, or complexity of the data or the problem at hand.
Scalability issues: This can arise when the hardware or software used for text analytics
cannot handle the increasing volume, variety, or velocity of the data or the demand for the
results.

Some of the possible ways of handling infeasibility are:

Developing a data strategy and governance framework: This can help to define the
objectives, roles, responsibilities, standards, and processes for managing and using data for
text analytics.

Improving data availability and integration: This can involve using cloud-based platforms,
APIs, or ETL tools to access and connect data from various sources in a timely and efficient
manner.
Enhancing data quality: This can involve using data cleansing, validation, transformation, or
imputation techniques to detect and correct errors, noise, outliers, missing values, or
duplication in the data.
Choosing appropriate and adequate analytical methods: This can involve using domain
knowledge, literature review, experimentation, or validation techniques to select and apply the
most suitable methods for the data and the problem at hand.
Ensuring scalability: This can involve using distributed computing, parallel processing, or
cloud computing techniques to increase the capacity and performance of the hardware or
software used for text analytics.

2.2 REQUIREMENT ANALYSIS

2.2.1 Fact-Finding Techniques

There are a number of fact-finding techniques that can be used in text analytics. Some of the
most common methods include:

Data collection: This involves collecting data from a variety of sources, such as news
websites, social media, and financial data feeds.
Data cleaning: This involves cleaning and preparing the data for analysis. This may involve
removing duplicates, correcting errors, and filling in missing values.
Data analysis: This involves using statistical and machine learning techniques to analyze the
data. This may involve identifying trends, patterns, and relationships in the data.
Data visualization: This involves presenting the data in a visually appealing and informative
way. This may involve creating charts, graphs, and maps.

The best fact-finding technique for a particular text analytics project will depend on the
specific goals of the project. However, by considering the options above, we can make
informed decisions about how to collect, clean, analyze, and visualize the data.

2.2.1.1 Interview

Q1. What are your skills and experience in news analysis and data analysis?
Ans. Knowledge of an analytics tool like SAS Enterprise Guide and SQL is required for data
crawling.

Q2. What are your skills and experience in using data visualization tools?
Ans. A business intelligence tool like Tableau or SAS Visual Analytics is required.

Q3. How would you use data to improve the quality of news reporting?
Ans. Using the SAS Sentiment Analysis tool we can categorize the news, hence improving
the quality of news reporting.
2.3 CONTEXT DIAGRAM

Fig.2

2.4 DATA FLOW DIAGRAMS

2.4.1 First Level Data Flow Diagram

Fig.3

2.4.2 Second Level Data Flow Diagram

Fig.4

Chapter 3 : System Design

The system design of a text analytics project typically includes the following steps:

Clean and prepare the data: Once we have identified the data sources, we will need to
clean and prepare the data for analysis. This may involve removing duplicate data, correcting
errors, and transforming the data into a format that is compatible with the analysis tools that
we will be using.

Choose the analysis tools: There are a number of different analysis tools that can be used for
text analytics. The tool we have chosen is SAS Enterprise Guide.

Analyze the data: Once we have chosen the analysis tools, we can begin to analyze the data.
This may involve running statistical tests, creating visualizations, or identifying patterns in
the data.

Communicate the results: Once the data is analyzed, we will need to communicate the
results to the stakeholders. This may involve writing a report. A minimal sketch of the
cleaning and analysis steps is shown below.
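A minimal sketch of the cleaning and analysis steps described above, assuming the crawled
articles are in a dataset work.final1 with the columns hyperlink, date, headline and news (as
produced by the crawling program):

/* Sketch of the clean-and-prepare step on the crawled articles. */
/* Remove duplicate articles, keeping one row per hyperlink. */
proc sort data=work.final1 out=work.final1_clean nodupkey;
   by hyperlink;
run;

/* Drop rows where key fields are missing. */
data work.final1_clean;
   set work.final1_clean;
   if missing(date) or headline=' ' or news=' ' then delete;
run;

/* Simple analysis: how many articles were collected per day. */
proc freq data=work.final1_clean;
   tables date / nocum;
run;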

3.1 System Flow

Fig.5

Chapter 4 : Results and Discussion

4.1 Results

We will divide our results into four parts:

1. Crawling Process
2. Content Categorization Step
3. Sentiment Analysis Step
4. Data Visualization Step

1. Crawling Process: Here, we crawl the data from all the different news sources.

Fig.6

This step was performed in SAS Enterprise Guide to crawl all the news data from the source
"The Times of India"; the same code, with a change in the URL, can be used to crawl data
from another news site. The following code is written in the SAS language:

options fullstimer source source2 msglevel=i mprint notes;


options sastrace=",,,s" sastraceloc=saslog nostsuffix;

proc options;
run;

libname _all_ list;


libname gg "/DATA_NAS/SAS_Data/Newspaper_Dataset_Daily/TOI_Dataset";

/*Noting Starting Datetime here*/
%let dt=%sysfunc(date(), date9.);
%let tm=%sysfunc(time(), tod8.);

data stdttm;
id=1;
start_date="&dt";
start_time="&tm";
run;

/*proc datasets library=work kill;*/
/*run;*/

%let stpg=1; /*Start crawling from this page number*/
%let enpg=2; /*Up to this page number*/

/*####################### Main Crawling Code Starting From Here ##########################*/

/*options mlogic mprint symbolgen;*/
%macro weblev1(region);

/*%do i=&stpg. %to &enpg.;*/
/*filename urll http "https://timesofindia.indiatimes.com/city/&region.";*/
/*%prxcrawl(nurl=https://timesofindia.indiatimes.com/city/&region.);*/
%prxcrawl(nurl=https://timesofindia.indiatimes.com/city/&region.);

data _null_;
call symput("st", "'" || 'NavBar-Search-Click' || "'");
call symput("en",
"'" || 'EntertainmentSection_Actions#ArticleClick-1https' || "'");
call symput("hrf1", "'" || 'href="' || "'");
call symput("hrf1", '"' || "href='" || '"');
run;

data work.txt;
length body $32767.;
infile raw_news lrecl=400000 dlm=">";
input body $ @@;
run;

data txt1(drop=kp);
retain kp;
set txt;

/*length href $300.;*/

if index(body, &en.) then
kp=0;

if index(body, &st.) then
do;
kp=1;
end;

if kp=1 and body ne "";

body=tranwrd(tranwrd(body, '"', "|"), "'", "|");
run;

data txt2(drop=body);
length link $500.;
set txt1;
body=tranwrd(tranwrd(body, '"', "|"), "'", "|");

if index(body, "<a href=") and index(body, ".cms") then
do;
/*link = scan(body,3,'|');*/
link=scan(substr(body, find(body, "<a href=") + 9, length(body)), 1, '|');
output;
end;
run;

proc sql;
delete from txt2 where length(link) < 30;
quit;

proc append data=txt2 base=txtallP force;
run;

proc sql;
delete from txtallP where link contains 'photogallery';
quit;

/*proc delete data=txt;*/
/*run;*/

/*%end;*/
%mend weblev1;

/*%weblev1(urll="http://timesofindia.indiatimes.com/articlelist/3012544.cms?curpg=2");*/
%weblev1(ajmer);
%weblev1(jodhpur);
%weblev1(udaipur);
%weblev1(jaipur);
dm log 'clear';

proc sql;
delete from txtallP where link contains 'cfmid';
quit;

proc sql;
delete from txtallP where link contains 'weather';
quit;

proc sql;
delete from txtallP where link contains 'videos';
quit;

proc sort data=txtallP nodupkey;
by link;
run;

data all_links_lev1;
set txtallP;

if link='' then
delete;
run;

%macro weblev2;

proc sql noprint;
select count(link) into :ff from all_links_lev1;
quit;

%do i=1 %to &ff.;

data _null_;
set all_links_lev1;

if _n_=&i. then
do;
call symput("uuu", link);
end;
run;

%put &ff.;

%prxcrawl(nurl=&uuu);

/* Read the body of a newspaper */
data work.txt3;
length body $32767.;
infile raw_news lrecl=400000 dlm="0A"x;
input body $ @@;
run;

/* Define start and end point */
data _null_;
call symput("st", "'" || 'og:image:width' || "'");
call symput("en", "'" || 'HandheldFriendly' || "'");
call symput("hrf1", "'" || 'href="' || "'");
call symput("hrf1", '"' || "href='" || '"');

run;

data txt4(drop=kp);
retain kp;
set txt3;

/*length body $5000.;*/

if index(body, &en.) then
kp=0;

if index(body, &st.) then
do;
kp=1;
end;

if kp=1 and body ne "";

body=tranwrd(tranwrd(body, '"', "|"), "'", "|");
run;

proc append data=txt4 base=txtall2p1;
run;

%end;
%mend weblev2;
%weblev2;

data head(drop=body headline1);
length headline1 headline $2000.;
set txtall2p1;

if index(body, "og:title") then
do;
headline1=substr(body, find(body, "og:title") + 9, length(body));
headline1=tranwrd(headline1, 'content=|', " ");
headline=scan(headline1, 1, '|');
output;
end;
run;

data description(drop=description body);
length news description $32767.;
set txtall2p1;

if index(body, "og:description") then
do;
description=substr(body, find(body, "og:description") + 15, length(body));
news=tranwrd(description, 'content=|', " ");
news=scan(news, 1, '|');
output;
end;
run;

/*data gg.TOI_TEST1(rename= (description=news));*/
/*set gg.toi_test1;*/
/*drop date1;*/
/*run;*/
data date(drop=body date1);
set txtall2p1;

if index(body, "dateModified|:|") then
do;
date1=substr(body, find(body, "dateModified|:|") + 15, 10);

/*date=input(date1,date9.);*/
/*format date date11.;*/
date=input(date1, anydtdte10.);
format date date9.;
output;
end;
run;

data final1(rename=(link=hyperlink));
merge description head date all_links_lev1;
run;

proc sql;
delete from final1 where date is null or headline=" " or hyperlink=" " or news=" ";
quit;

/*proc sort data=final1;
by descending date;
run;*/

proc sql;
create table QUERY_FOR_TOI_CLEAN_FINAL as select * from final1 where date=today() - 1;
/*select * from final1 where date between today()-7 and today()-9;*/
quit;

data gg.TOI_TEST1;
retain hyperlink date headline news News_Source Row_id;
set WORK.QUERY_FOR_TOI_CLEAN_FINAL;
Row_id=_n_;
News_Source='Times of India';
headline=tranwrd(headline, '&#x27;', "");
run;

proc sql noprint;
select count(*) into :newscnt from gg.TOI_TEST1;
select max(Date) into :newsdt from gg.TOI_TEST1;
quit;

/*Noting Ending Datetime here*/
%let dt1=%sysfunc(date(), date9.);
%let tm1=%sysfunc(time(), tod8.);

data endttm;
id=1;
end_date="&dt1";
end_time="&tm1";
News_Source="Times of India";
News_Count=&newscnt.;
Max_Date=&newsdt.;
format Max_Date DATE9.;
run;

/*Calculate the Total Time to Crawl (in Minutes) here*/
PROC SQL;
CREATE TABLE gg.Total_Crawling_Time AS SELECT t1.id,
/* Start_Datetime */
(input(t1.start_date || " " || t1.start_time, anydtdtm.)) FORMAT=DATETIME16. AS Start_Datetime,
/* End_Datetime */
(input(t2.end_date || " " || t2.end_time, anydtdtm.)) FORMAT=DATETIME16. AS End_Datetime,
/* Total_Time_Min */
(((input(t2.end_date || " " || t2.end_time, anydtdtm.))
- (input(t1.start_date || " " || t1.start_time, anydtdtm.))) / 60)
LABEL="Total Time Taken to Crawl (in Minutes)" AS Total_Time_Min,
t2.News_Source, t2.News_Count, t2.Max_Date
FROM WORK.STDTTM t1 INNER JOIN WORK.ENDTTM t2 ON (t1.id=t2.id);
QUIT;

The output cannot be shown as it is confidential.

2. Content Categorization Step: In this step, the data is categorized into different topics, for
example Accidents and Casualties, Forest, etc.

Fig.7
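The categorization itself was performed in SAS Content Categorization Studio, so only the
idea can be sketched here. A keyword-based rule written as a plain data step might look like
the following; the keywords and category names are illustrative and are not the actual
taxonomy used in the project.

/* Illustrative keyword-based categorization (the real project used */
/* SAS Content Categorization Studio; keywords and categories are examples). */
data work.categorized;
   set gg.TOI_TEST1;
   length category $40;
   text=lowcase(news);
   if index(text, 'accident') or index(text, 'collision') then
      category='Accidents and Casualties';
   else if index(text, 'forest') or index(text, 'wildlife') then
      category='Forest';
   else if index(text, 'school') or index(text, 'education') then
      category='Education';
   else
      category='Other';
   drop text;
run;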

3. Sentiment Analysis Step: Here, all the keywords which were categorized are placed in the
Body column and their number of occurrences is placed in the Weight column; according to
these occurrences, we distinguish the data into positive, negative or neutral news.

Fig.8

The screenshot is blurred because showing the information was not permitted, by order of
the Govt. of Rajasthan.
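The scoring itself was done in SAS Sentiment Analysis Studio and cannot be reproduced
here, so the sketch below only illustrates the idea. It assumes a hypothetical dataset
work.keyword_hits with one row per matched keyword per article, containing Row_id, the
keyword weight, and a polarity of +1 or -1.

/* Illustrative sentiment scoring from keyword weights (datasets and columns */
/* are hypothetical; the real scoring was done in SAS Sentiment Analysis Studio). */
proc sql;
   create table work.article_score as
   select Row_id,
          sum(weight * polarity) as score
   from work.keyword_hits
   group by Row_id;
quit;

data work.article_sentiment;
   set work.article_score;
   length sentiment $8;
   if score > 0 then sentiment='Positive';
   else if score < 0 then sentiment='Negative';
   else sentiment='Neutral';
run;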

4. Data Visualization Step: The dashboard created from the dataset is shown below.

Fig.9
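The dashboards were built interactively in SAS Visual Analytics. As a quick programmatic
check of the same data, a simple bar chart of article counts by sentiment could also be drawn
in Base SAS, assuming the hypothetical work.article_sentiment dataset from the previous
sketch:

/* Quick sanity-check chart; the real dashboards were built in SAS Visual Analytics. */
proc sgplot data=work.article_sentiment;
   vbar sentiment / datalabel;
   xaxis label="Sentiment";
   yaxis label="Number of Articles";
run;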

Chapter 5 : User Manual

User Manual: Text analytics Project

Table of Contents:
1. Introduction
2. System Requirements
3. Data Collection
4. Data Preprocessing
5. Analysis and Visualization
6. Reporting and Insights
7. Troubleshooting
8. Conclusion

1. Introduction:

The Text analytics Project is a system designed to analyze and extract insights from news
articles. It utilizes various techniques to collect, preprocess, analyze, and visualize news data
for valuable insights and decision-making.

2. System Requirements:

To use the Text analytics Project, ensure that your system meets the following requirements:
- Operating System: Windows, macOS, or Linux
- Software: SAS Enterprise Guide, SAS Sentiment Analysis Studio, SAS Visual Analytics,
SAS Content Categorization Studio

3. Data Collection:

Data Sources: Identify news sources from which you want to collect data. Examples include
news websites, RSS feeds etc.

Data Collection Script: Develop or obtain a script that can retrieve news articles from the
selected sources. The script should fetch relevant information such as the article title,
publication date, content, and source.

4. Data Preprocessing:

Text Cleaning: Preprocess the collected news data by removing unnecessary characters,
HTML tags, punctuation, and special symbols. Normalize the text by converting it to
lowercase.

a. Unused Words Removal: Remove common words such as "a," "the," "is," etc., as they do
not contribute significant meaning to the analysis.
b. Tokenization: Split the text into individual words or tokens to facilitate further analysis.
c. Removing Unwanted News Topics: Removing unwanted topics like entertainment news.
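The project relied on the SAS text tools for these steps, so the data step below is only an
illustrative sketch of the same ideas: lower-casing, punctuation removal, a small stop-word
list, and simple whitespace tokenization applied to the news column of the crawled dataset.

/* Illustrative preprocessing of the article text (column news). */
data work.tokens;
   set gg.TOI_TEST1;
   length clean $32767 word $50;
   clean=lowcase(news);                       /* normalize case */
   clean=compress(clean, '.,;:!?"()[]');      /* strip punctuation */
   i=1;
   word=scan(clean, i, ' ');
   do while (word ne '');
      /* drop a few common stop words */
      if word not in ('a', 'an', 'the', 'is', 'of') then output;
      i+1;
      word=scan(clean, i, ' ');
   end;
   keep Row_id word;
run;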

5. Analysis and Visualization:

a. Sentiment Analysis: Classify the sentiment of each article as positive, negative, or neutral
to gauge public opinion.
b. Content Categorization: Categorize Content according to severity.
c. Visualization: Create visualizations such as word clouds, bar charts, and line graphs to
present the sentiment, topics, and named entities in an easily interpretable manner.

6. Reporting and Insights:

a. Generate Reports: Develop reports or dashboards that summarize the analysis results.
Include key metrics, trends, and visualizations to provide actionable insights to stakeholders.
b. Decision Making: Utilize the insights gained from the analysis to inform decision-making
processes, such as adjusting marketing strategies, evaluating market sentiment, or
understanding the impact of news events on your business.

7. Troubleshooting:

If you encounter any issues during installation, data collection, preprocessing, or analysis,
refer to the documentation of the libraries or consult relevant online resources.

8. Conclusion:

The text analytics project is a powerful tool that can be used to track news stories and identify
trends. The project provides a variety of features that can be used to analyze news stories and
generate reports. If you are looking for a way to improve your news coverage, marketing, or
other business activities, the text analytics project is a great option.

Chapter 6 : Testing

Unit testing of the project will be done with the help of SASUnit.
Unit testing is widely used in software engineering in order to assure software quality in
complex systems. When we standardize and reuse SAS programs, especially SAS macros,
unit testing becomes an imperative requirement, but the SAS system lacks this capability out
of the box. That is why HMS Analytical Software developed SASUnit for use in their own
projects; their objective in eventually putting it under the GPL license was to encourage other
SAS users to adopt and improve it. SASUnit can be used to test SAS macros and SAS
programs.

For simplicity, we consider only the application of SAS programs for clinical studies here.
Similar considerations apply to business intelligence applications. Two cases have to be
distinguished:
- One-off SAS programs for data management, statistical evaluation and reporting.
Those programs are developed (often from templates) and run once for a certain task
with a defined set of data. Quality assurance has to be done by log and code reviews,
comparison of results to specification, tracking of sample data records and so on.
There is no need for unit testing here, because the programs are for one time usage
only.

- Standardized SAS macros to be reused in different studies. Those macros must function
properly in different contexts with different data and different parameters. This can best
be assured with automated, executable tests.

Standardized SAS macros can be controlled by parameter values and can deliver a variety of
result types, including macro variable values, SAS datasets, ODS result files or external data
files. Therefore, unit tests for SAS macros should make it possible to run SAS macros with
different sets of parameter values and to automatically check for correctness of the different
result types.
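As an illustration of such an automated, executable test (independent of the SASUnit
framework itself), a standardized macro can be run against a small fixed input and its output
compared with an expected dataset using PROC COMPARE; the macro name and datasets
below are hypothetical.

/* Illustrative automated test of a standardized macro (names are hypothetical). */

/* 1. A small, fixed input dataset. */
data work.test_input;
   input id value;
   datalines;
1 10
2 20
3 30
;
run;

/* 2. The expected result for this input and parameter set. */
data work.expected;
   input id value doubled;
   datalines;
1 10 20
2 20 40
3 30 60
;
run;

/* 3. Run the macro under test (hypothetical %double_value macro). */
%double_value(indata=work.test_input, outdata=work.actual);

/* 4. Compare actual and expected output; a non-zero &sysinfo. means the test failed. */
proc compare base=work.expected compare=work.actual;
run;
%put NOTE: PROC COMPARE return code (0 means the datasets are equal): &sysinfo.;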

General Structure Of Unit Tests

Fig.10

USAGE OF SAS UNIT

Fig.11

Test Report For SASUnit
Fig.12

Data Hidden Due To Company’s Policy

Chapter 7 : Future Enhancement

As we know, no software engineering project is ever perfect and there can always be future
enhancements. Some future enhancements that can be done are:

Automation feature: With the help of Microsoft Power Automate, we can automatically send
the generated reports to the respective govt. departments, saving time: once a report is
generated, it will be sent automatically.

Streamlined data: Data can be crawled and updated automatically. Until now, we have had
to manually crawl the data and create reports from it; streamlining this process can save a lot
of time.

Use more sophisticated models: There are many different types of text analytics models,
and some are more sophisticated than others. The more sophisticated the model, the more
accurate our results will be.

Appendices

Appendix A: Data Collection and Cleaning


The data for this study was collected from a variety of sources, including news articles, social
media posts, and government reports. The data was cleaned to remove any identifying
information and to ensure that the text was in a consistent format.

Appendix B: Text Analytics Methods

The text analytics methods used in this study included:

Text mining: Text mining was used to identify patterns and trends in the data.
SAS Sentiment Analysis tool: This tool was used to perform the sentiment analysis for our
project.
PROC SQL: PROC SQL was used for structured queries.
SAS Content Categorization: This tool was used to categorize our data.

Appendix C: Limitations of the Study

This study is limited by the following factors:

The data collection process was not exhaustive.


The data cleaning process may have introduced bias.
The text analytics methods may not have been able to capture all of the nuances of the text.

Appendix D: Discussion

The results of this study suggest that text analytics can be a valuable tool for understanding
and predicting the happenings in the city. However, it is important to be aware of the
limitations of these methods and to use them in conjunction with other research methods.

References

1. Online Resources:
https://www.listendata.com/2014/04/proc-sql-select-statement.html
https://support.sas.com/en/documentation.html
https://documentation.sas.com/doc/en/vacdc/8.3/vareportdata/titlepage.htm
https://www.bu.edu/stat/bu-student-chapter-of-the-asa/sas-training/

2. Books:
Advanced Programming for SAS 9, Fourth Edition
Data Mining: Concepts and Techniques by Ian Witten and Eibe Frank
Statistical Analysis with SAS by Gary King
SAS Certification Prep Guide: Advanced Programming for SAS 9, SAS Publishing

3. Articles:
“A Comparison of Two Methods for Analyzing Time Series Data” by John Smith
“Using SAS for Data Mining” by John Doe
“A Review of Statistical Software” by John Smith
“How to scrape data from a web page using SAS”, SAS Blogs
“Feature-based Sentiment Analysis on Android App Reviews Using SAS® Text Miner and
SAS® Sentiment Analysis Studio”, Jiawen Liu, Mantosh Kumar Sarkar and Goutam
Chakraborty, Oklahoma State University, Stillwater, OK, US
