IAR TextualAnalysis SeekiNF
Rajendra P. Srivastava
University of Kansas
All content following this page was uploaded by Rajendra P. Srivastava on 29 October 2017.
Rajendra P. Srivastava
School of Business
University of Kansas, USA
E-mail: [email protected]
ABSTRACT
The main objective of this paper is to discuss the role of search engines in
data analytics within the XBRL environment and beyond. It is frequently argued that
having XBRL formatted business reports makes it easy to access information and
hence analyze it. However, one major problem with the XBRL formatted data is
that the analytical tools will capture only those pieces of information that are
tagged. What if the user needs information that is not tagged or not required to
be tagged? Can the analytical tools still be effective to provide the analysis needed?
This is where search engines would become important. The present paper highlights
the importance of textual analysis and demonstrates the value of search engines
that go beyond the XBRL environment, especially in situations where the needed
pieces of information are not tagged. Several examples are presented in the paper
to show the value of a search engine such as Seek iNF (https://www.seekedgar.com)
developed at the University of Kansas, in performing textual analysis and developing
predictive models, especially where no programming skills are required.
Key words: Search engine for SEC Filings, textual analysis, predictive
models, Seek iNF.
I. INTRODUCTION
The main objective of this paper is to discuss the role of search engines
in textual analysis, data analytics, and business intelligence. Traditionally,
researchers have used financial and non-financial data to build models for
analyzing financial risk and financial performance. For example, Altman (1968),
Beaver (1966), Jones and Hensher (2004), Ohlson (1980), Shumway (2001),
Zmijewski (1984) and many others have used financial data to develop models
for assessing financial risk and financial performance using the regression
technique. Emery and Cogger (1982) used cash flow data to develop a
theoretical model to measure the liquidity of a company, which in turn measures
the financial risk or potential for bankruptcy. Several researchers have used
financial data to develop neural network models to forecast bankruptcy
and fraud [e.g., see Boritz and Kennedy (1995), Boritz, Kennedy, and
Albuquerque (1995), Fanning and Cogger (1998), O'Leary (1998)]. Lensberg,
Eilifsen, and McKee (2006) used financial data to develop bankruptcy
prediction models using a genetic algorithm, while McKee (2003) used similar
data to develop bankruptcy prediction models using rough sets. Shumway
(2001) used financial data to develop a simple hazard model for bankruptcy
prediction. Thus, a vast amount of work has been done in using financial
data to develop predictive models and business intelligence for risk
assessments and bankruptcy prediction.

Acknowledgements: The author would like to thank the participants of the 2014 International
Symposium on Accounting Information Systems (organized by the Department of Accounting,
University of Melbourne, Australia, January 2-3, 2014) for their valuable suggestions on an
earlier version of the paper.
Indian Accounting Review

Callen, Khan and Lu (2013) use financial data to measure "accounting
quality" where "the precision with which financial reports convey information
to equity investors about the firm's expected cash flows" defines the accounting
quality. Recently, Lee, Churyk and Clinton (2013) (see also Loughran and
McDonald, 2011) developed a fraud detection model based on the counts of
key words (positive versus negative words) and punctuation in Management
Discussion and Analysis (MD&A) disclosures and found that it performed
better than many other models developed using financial data.
While financial and certain business data are being tagged using XBRL
(Extensible Business Reporting Language), many of the inputs needed to
determine sentiment and readability indices for predictive models such as
that of Lee, Churyk and Clinton (2013) are not tagged, nor would they ever
be tagged by XBRL technology. These inputs include key words, punctuation,
present-tense and past-tense sentences, and counts of words and sentences.
XBRL technology has many advantages, from the ease with which information
can be retrieved, shared, transported across platforms, and used in
automated models, to cost savings and the elimination of human errors in
these processes (e.g., see Srivastava 2009 and Debreceny et al. 2005).
However, it is not practical to tag every word in a report through XBRL.
This is where a text search engine becomes important, and it is the topic
of this paper. Text mining is a large and important area, especially in
this era of big data, and several books have been written on the topic.
This paper focuses on searching for financial and non-financial information
and on exploring the use of text analytics for predictive models in the
context of accounting, auditing, and finance.
The Securities and Exchange Commission (SEC) in the USA and many
similar government agencies in other countries (e.g., the Securities and
Exchange Board of India, SEBI) require public companies to file their annual
and quarterly reports in the XBRL format along with the text/html filings.
Thus, business relevant data, both financial and non-financial, are available
in the XBRL tagged format, and it is therefore easy to create models that
provide timely information to analysts, regulators, and the general public
for business decisions as soon as the financial reports are filed with
regulatory agencies such as the SEC in the USA and SEBI in India. In fact,
several vendors have already developed analytical tools that use XBRL
tagged data. Based on the availability of XBRL tagged business information
from the SEC filings, the SEC has built a "RoboCop" (Novack, Carney, and
Harker 2013) to find irregularities in companies' filings. However, since XBRL
technology does not tag all the words, it is difficult to develop a model that
uses the tagged financial and non-financial data along with certain
key words and the number or percentage of positive versus negative words
in a document such as the 10-K, the annual report filed with the SEC. In fact,
the SEC is looking into integrating text-based information into its "RoboCop"
(Novack, Carney, and Harker 2013), which is based on the tagged data.
The rest of the paper is divided into four sections. The next section provides
a brief discussion of the background research, especially the textual analysis
research. Section III describes the features of the search engine, Seek iNF,
that has been developed at the University of Kansas. Section IV illustrates
how Seek iNF can be used to perform textual analysis without
any programming skills in PERL or Python. Finally, Section V provides a
summary and conclusion.
TABLE 1
TABLE 1 (Contd.)
Seek iNF deals with the following four dimensions: (1) search all or a few
documents in the database for specific issues or concerns, (2) obtain
a piece of information, whether financial or non-financial, (3) perform text
analytics, and (4) download all the searched data as an HTML document and an
Excel spreadsheet file for further analysis. In the next section, I will
describe these dimensions in detail with examples. With these four dimensions
and the Power Features discussed next, Seek iNF gives researchers a unique
opportunity to perform textual analysis and develop predictive models and
business intelligence efficiently and effectively, something that has not
been possible for most researchers because few have the necessary
programming skills.
Seek iNF has several Power Features that make the search and retrieval
of information much more efficient and effective than has been possible
for researchers without knowledge of programming. These features are
discussed next.
Search Documents with Multiple Exact Phrases: This feature allows
users to search all or a few documents in the database for the presence or
absence of multiple phrases, including phrases containing words like 'a',
'in', 'if', 'no', 'of', 'the', etc. Such a feature makes it easier to find
which companies have mentioned a given phrase, say 'wrongful termination',
in their 10-Ks. One can input this phrase in the search engine and, in a few
seconds, get the 4,511 companies that have mentioned the phrase in their
10-Ks. I will elaborate on this feature further through examples in the next
section. The built-in Boolean logic makes the search process much more
effective, especially with multiple exact phrases.
You can search any document with a combination of exact phrases joined by
"AND", "OR", or any combination of these conditions. Seek iNF uses "+" for AND,
"|" for OR, and "-" for negation, with or without a space before and after the
symbol. Here are examples of how one would type three phrases, represented
by A, B, and C, in the exact phrase slot in Step 1 (see Figure 1).
FIGURE 1
Screenshot of the main page of Seek iNF
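To make the operator semantics concrete, the following Python sketch evaluates a query written with "+", "|", and "-" against a single piece of text. It is only an illustration and not Seek iNF's implementation. The function names, the left-to-right evaluation order, the treatment of "A - B" as "A and not B", the case-insensitive whole-word matching, and the sample sentence are all assumptions made for this sketch. Phrases are assumed to contain none of the operator characters.

```python
import re

# Operators and parentheses are single-character tokens; anything else is a phrase.
TOKEN = re.compile(r"\s*([()+|-]|[^()+|-]+)")


def tokenize(query):
    """Split a query into operators, parentheses, and phrases."""
    return [tok.strip() for tok in TOKEN.findall(query) if tok.strip()]


def matches(query, text):
    """Return True if text satisfies the Boolean phrase query (hypothetical helper)."""
    tokens = tokenize(query)
    normalized = " ".join(text.lower().split())
    pos = 0

    def contains(phrase):
        # Exact phrase match on word boundaries, ignoring case and extra spaces.
        pattern = r"\b" + re.escape(" ".join(phrase.lower().split())) + r"\b"
        return re.search(pattern, normalized) is not None

    def term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = expr()
            pos += 1  # skip the closing ")"
            return value
        if tok == "-":  # leading negation
            return not term()
        return contains(tok)

    def expr():
        nonlocal pos
        value = term()
        # Assumption: operators have equal precedence and apply left to right.
        while pos < len(tokens) and tokens[pos] in ("+", "|", "-"):
            op = tokens[pos]
            pos += 1
            right = term()
            if op == "+":
                value = value and right
            elif op == "|":
                value = value or right
            else:  # "A - B" is read as "A and not B"
                value = value and not right
        return value

    return expr()


sample = ("In our opinion, there is substantial doubt about the company's "
          "ability to continue as a going concern.")
print(matches("(in my opinion | in our opinion) + going concern", sample))
print(matches("going concern - substantial doubt about", sample))
```

Parentheses remove any dependence on the assumed evaluation order, so grouping alternatives explicitly is the safer way to write a compound query.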
TABLE 2
Readability Indices
Readability Index             Formula
Gunning-Fog Index             0.4 x [(words/sentences) + 100 x (complex words)/words] (a)
Smog Index                    1.0430 x SQRT[(number of polysyllables) x 30/(number of sentences)] + 3.1291 (b)
Flesch Reading Ease           206.835 - 1.015[(total words)/(total sentences)] - 84.6[(total syllables)/(total words)] (c)
Flesch-Kincaid Grade Level    0.39[(total words)/(total sentences)] + 11.8[(total syllables)/(total words)] - 15.59 (c)
Automatic Readability Index   4.71(characters/words) + 0.5(words/sentences) - 21.43 (d)
Coleman-Liau Index            0.0588L - 0.2965S - 15.8 (e)
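As a worked illustration of four of the formulas in Table 2, the Python sketch below computes the Gunning-Fog, Flesch Reading Ease, Flesch-Kincaid Grade Level, and Automatic Readability indices from raw counts. The function name and the sample counts are hypothetical. The counts themselves (words, sentences, syllables, complex words, characters) are assumed to have been obtained elsewhere, for example from a search engine's count output.

```python
def readability_indices(words, sentences, syllables, complex_words, characters):
    """Compute four readability indices from raw counts (illustrative helper)."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return {
        "gunning_fog": 0.4 * (words_per_sentence + 100 * complex_words / words),
        "flesch_reading_ease": 206.835 - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word - 15.59,
        "automatic_readability": 4.71 * characters / words
        + 0.5 * words_per_sentence - 21.43,
    }


# Made-up counts for a short passage, used only to exercise the formulas.
scores = readability_indices(
    words=100, sentences=5, syllables=150, complex_words=10, characters=470
)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Keeping the counting step separate from the formulas means the same function can be reused whichever tool supplies the counts.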
IV. ILLUSTRATIONS
In this section, I demonstrate the power of the search engine, Seek iNF,
in research, especially in accounting, finance, and other business
disciplines. Given the ease with which one can gather information using Seek
iNF, the knowledge, imagination, and creativity of the researcher will play an
important role in future research.
FIGURE 2
to find out using Seek iNF. Just insert the phrases pertaining to 'we may never
be profitable' and to 'Going Concern opinion' together in the "With exact
phrase" slot in Step 1. More specifically, type the following phrases
"(we may never become profitable | we may never achieve or sustain
profitability | we may never achieve or maintain profitability | we may never
be profitable) + (In my opinion | In our opinion) + (substantial doubt about |
substantial doubts about | substantial doubts regarding | substantial doubt
regarding) + going concern" [1] in the "With exact phrase" slot and the two
words "opinion concern" in the "Proximity Search" slot within 500 words, and
select "all companies", the time period, and 10-Ks. Figure 2 shows an
interesting result: not all such companies received GC opinions. In fact,
fewer than 50% received a going concern opinion. Why is that? What about audit
quality? Many interesting research questions emanate from this result.
[1] Seek iNF uses the '|' symbol for OR logic, the '+' symbol for AND logic, and '-' for negation.
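The proximity condition used above (two words within a given number of words of each other) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a description of how Seek iNF works internally: the function names, the tokenization rule, the case-insensitive comparison, and the definition of distance as the difference in word positions are all choices made for this sketch, and the sample sentence is invented.

```python
import re


def words_of(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within_proximity(text, first, second, max_distance):
    """Return True if `first` and `second` occur within max_distance words.

    Distance is the absolute difference between word positions, so adjacent
    words are at distance 1 (an assumption of this sketch).
    """
    words = words_of(text)
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


sample = ("In our opinion there is substantial doubt about its ability "
          "to continue as a going concern")
print(within_proximity(sample, "opinion", "concern", 500))
print(within_proximity(sample, "opinion", "concern", 5))
```

A wide window such as 500 words ties the two terms to the same passage of a long filing, while a narrow window such as 2 words is better suited to locating column headings that sit side by side in a table.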
FIGURE 3
2(b). Find Executive Bios from DEF 14A. As in the previous example
of executive compensation, executive bios appear in a table in the proxy
statement (DEF 14A). Again, one needs to find a pattern that identifies such a
table. After looking at a few tables that contained names and ages of
executives, I found that 'Name' and 'Age' appeared within two words of each
other in almost all such tables. Next, I used the two words 'Name' and 'Age'
in "Proximity search" within 2 words to get all the tables that contained
executives' bios for all the companies. Figure 4 is a screenshot of the
display window for one of the companies. One can download this table in Excel
by selecting "Download Table in CSV" from the menu bar displayed on the top
right side of the display window. One can download the entire search result in
Excel by submitting a Request Form and filling out the search criteria again.
The system will automatically process the request and send the user a link
when the data are ready to be downloaded.
FIGURE 4
Example 3: Search with a few words before and a few words after a phrase
This feature of searching for information with a few words before and a few
words after a phrase is another powerful tool for getting information that
is not available from any other source unless you know how to program in PERL
or Python to fetch it. Suppose you want to find the "fiscal year
end" of all the public companies filing with the SEC. Companies provide this
information in their 10-Ks, the annual reports filed with the
SEC. A look at a few 10-Ks suggests that this information is listed right after
the phrase "fiscal year end". Thus, if one enters "fiscal year
end" in "With exact phrase" and selects the option to display the searched data
with zero words before and 1 word after, the system will display the desired
result. Figure 5 shows the "fiscal year end" data in an Excel file for a set
of companies, obtained by submitting a "Request Form"
(https://www.seekedgar.com:8443/SeekiNF_search_Engine.pdf).
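For readers who want to see what "zero words before and 1 word after" amounts to, here is a small Python sketch that returns each occurrence of a phrase together with a chosen number of surrounding words. It is only an approximation of the idea and makes no claim about Seek iNF's internals. The function names, the punctuation-stripping rule, and the sample text are assumptions made for illustration.

```python
def normalize(token):
    """Strip common surrounding punctuation and lower-case a token."""
    return token.strip(".,;:()\"'").lower()


def phrase_context(text, phrase, before=0, after=1):
    """Return every occurrence of `phrase` with `before`/`after` extra words."""
    tokens = text.split()
    keys = [normalize(t) for t in tokens]
    target = [normalize(t) for t in phrase.split()]
    hits = []
    for i in range(len(keys) - len(target) + 1):
        if keys[i:i + len(target)] == target:
            start = max(0, i - before)
            end = i + len(target) + after
            hits.append(" ".join(tokens[start:end]))
    return hits


# Invented snippet standing in for the relevant part of a filing.
sample = "Fiscal Year End: 1231 State of incorporation: DE"
print(phrase_context(sample, "fiscal year end", before=0, after=1))
```

The word that follows the phrase is returned verbatim, so any further parsing (for example, turning a four-digit month-day code into a date) is left to the analyst.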
FIGURE 5
They used their own programming skills to get these counts. However, I want
to show here how one can get the required counts easily using the Seek iNF
search capabilities without any knowledge of PERL or Python. As given
earlier, their measure of competition is
PCTCOMP = 1000*NCOMP/NWORDS,
where NCOMP = number of competition-related words in the 10-K as described
above and NWORDS = total number of words, excluding numbers. Seek iNF yields
these counts in no time. We obtain NCOMP in two steps. First, we count the
occurrences of the words competition, competitor, competitive, compete,
competing, competitions, competitors, and competes. Second, we subtract from
this count the "Proximity" count, which counts the occurrences of the
following word pairs within three or fewer words of each other:
not competition, less competition, few competition, limited competition,
not competitor, less competitor, few competitor, limited competitor,
not competitive, less competitive, few competitive, limited competitive,
not compete, less compete, few compete, limited compete, not competing,
less competing, few competing, limited competing, not competitions,
less competitions, few competitions, limited competitions, not competitors,
less competitors, few competitors, limited competitors, not competes,
less competes, few competes, limited competes.
Let us compute PCTCOMP for the following five companies: Qwest Corp,
Verizon Communications Inc, AT&T Inc., Level 3 Communications Inc., and General
Communication Inc., with CIKs 68622, 732712, 732717, 808461, and 794323, for
10 years (2006-2015). Since the "Request Form" allows you to download only
five years of data at a time, you need to submit two separate requests. After
typing your name, email, and University/Company, select the menu item
"Phrase(s)/Word(s) Count" for the word counts, and "Proximity Count" for the
second part of the count, which is subtracted from the first count to
determine NCOMP. Seek iNF provides the total word count by default. Thus, we
can easily calculate the competition metric PCTCOMP.
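The arithmetic behind PCTCOMP can be illustrated with a short Python sketch that works directly on a piece of text. It mirrors the two-step count described above but is not the procedure Seek iNF runs. The function name, the tokenization, the choice to drop purely numeric tokens before measuring proximity, and the reading of "within three or fewer words" as "a qualifier appears among the three preceding words" are all assumptions. The sample text is invented.

```python
import re

COMPETITION_WORDS = {
    "competition", "competitor", "competitive", "compete",
    "competing", "competitions", "competitors", "competes",
}
QUALIFIERS = {"not", "less", "few", "limited"}


def pctcomp(text, window=3):
    """Return (NCOMP, NWORDS, PCTCOMP) for a piece of text (illustrative)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # NWORDS excludes numbers: drop tokens made up only of digits.
    words = [t for t in tokens if not t.isdigit()]
    total_hits = 0
    qualified_hits = 0
    for i, word in enumerate(words):
        if word in COMPETITION_WORDS:
            total_hits += 1
            preceding = words[max(0, i - window):i]
            # Proximity count: a qualifier within `window` words before the hit.
            if any(p in QUALIFIERS for p in preceding):
                qualified_hits += 1
    ncomp = total_hits - qualified_hits
    nwords = len(words)
    return ncomp, nwords, 1000 * ncomp / nwords


sample = ("We face intense competition. Our competitors are large. We do not "
          "compete on price and see limited competition abroad in 2015.")
ncomp, nwords, value = pctcomp(sample)
print(ncomp, nwords, value)
```

With counts downloaded from a search engine instead of computed from raw text, only the last line of the function is needed: subtract the proximity count from the word count and scale by 1000 over the total word count.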
FIGURE 6
FIGURE 7
References
Altman, E. (1968). Financial ratios, Discriminant Analysis and the Prediction of Corporate
Bankruptcy. Journal of Finance 23: 589-609.
Atiya, A. F. (2001). Bankruptcy Prediction for Credit Risk Using Neural Networks: A Survey
and New Results. IEEE Transactions on Neural Networks, Vol. 12, No. 4, July:
929-935.
Beaver, W. (1966). Financial ratios as predictors of failure. In Empirical Research in
Accounting: Selected Studies 1966, Journal of Accounting Research, vol. 4,
pp. 71-111.
Boritz, J. and D. Kennedy. (1995). Effectiveness of neural network types for prediction of
business failure. Expert Systems with Applications, vol. 9: 504-512.
Boritz, J., D. Kennedy, and A. Albuquerque. (1995). Predicting corporate failure using a neural
network approach. Intelligent Systems in Accounting, Finance, and Management,
Vol. 4: 95-111.
Bovee, M., A. Kogan, R. P. Srivastava, M. A. Vasarhelyi, K. M. Nelson, (2005). Financial
Reporting and Auditing Agent with Net Knowledge (FRAANK) and eXtensible Business
Reporting Language (XBRL). Journal of Information Systems, Vol. 19, No. 1 (Spring):
pp. 19-41.
Callen, J. L., M. Khan, and H. Lu. (2013). Accounting Quality, Stock Price Delay, and Future
Stock Returns. Contemporary Accounting Research, Vol. 30, Issue 1, Spring:
269-295.
Debreceny, R. S., A. Chandra, J. J. Cheh, D. Guithues-Amrhein, N. J. Hannon, P. D.
Hutchison, D. Janvrin, R. A. Jones, B. Lamberton, A. Lymer, M. Mascha, R. Nehmer,
S. Roohani, R. P. Srivastava, S. Trabelsi, T. Tribunella, G. Trites, and M. A.
Vasarhelyi. (2005). Financial Reporting in XBRL on the SEC's EDGAR System:
A Critique and Evaluation. Journal of Information Systems 19 (2):191-210.
Emery, G. and K. Cogger. (1982). The Measurement of Liquidity. Journal of Accounting
Research, Vol. 20, No. 2 Pt. I Autumn: 290-303.
Fanning, K. M. and K. O. Cogger. (1998). Neural network detection of management fraud using
published financial data. International Journal of Intelligent Systems in Accounting,
Finance & Management, Vol. 7, No. 1, March: 21-41.
Jones and Hensher. (2004). Predicting Firm Financial Distress: A Mixed Logit Model.
The Accounting Review, Vol. 79, No. 4, 2004, pp. 1011-1038.
Lee, C., N. T. Churyk, and B. D. Clinton. (2013). Detect Fraud Before Catastrophe. Strategic
Finance, March: 33-37.
Lensberg, T., A. Eilifsen, and T. E. McKee. (2006). Bankruptcy theory development and
classification via genetic programming. European Journal of Operational Research,
Vol. 169, Issue 2, March: 677-697.
Li, F. (2010). Textual Analysis of Corporate Disclosures: A Survey of the Literature. Journal
of Accounting Literature, Vol. 29: 143-165.
Li, F., R. Lundholm, and M. Minnis. (2013). A Measure of Competition Based on 10-K Filings.
Journal of Accounting Research, Vol. 51, No. 2 (May): 399-436.
Loughran, T. and B. McDonald. (2011). When is a Liability not a Liability? Textual Analysis,
Dictionaries, and 10-Ks. The Journal of Finance, Vol. 66, Issue 1, February: 35-65.
McKee, T. E. (2003). Rough sets bankruptcy prediction models versus auditor signaling rates.
Journal of Forecasting, Vol. 22, Issue 8, December: 569-586.
Monga, V. and E. Chasan. (2015). The 109,894-Word Annual Report: As regulators require
more disclosures, 10-Ks reach epic lengths; how much is too much? The Wall Street
Journal, Business CFO Journal. Updated June 1.
Novack, J., J. Carney, and F. Harker. (2013). How SEC's New RoboCop Profiles Companies
For Accounting Fraud.
http://www.forbes.com/sites/janetnovack/2013/08/09/how-secs-new-robocop-profiles-companies-for-accounting-fraud/
Ohlson, J. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal
of Accounting Research 18: 109-131.
O'Leary, D. (1998). Using Neural Networks to Predict Corporate Failure. International Journal
of Intelligent Systems in Accounting, Finance & Management, Vol. 7, pp. 187-197.
Shumway, T. (2001). Forecasting Bankruptcy More Accurately: A Simple Hazard Model.
The Journal of Business, Vol. 74, No. 1, January: 101-124.
Srivastava, R. P. (2005). Financial Reporting in XBRL on the SEC's EDGAR System: A Critique
and Evaluation. Working Party of the AAA Information Systems and Artificial
Intelligence/Emerging Technologies Section (with 17 other members, with equal
participation). Journal of Information Systems, Vol. 19, No. 2, Fall: 191-210.
Srivastava, R. P., (2009). XBRL (Extensible Business Reporting Language): A Research
Perspective. Indian Accounting Review, Vol. 13, No. 1, pp. 14-32.
Tetlock, Paul C., M. Saar-Tsechansky, and S. Macskassy, (2008). More than words:
Quantifying language to measure firms' fundamentals, Journal of Finance 63,
1437-1467.
Zmijewski, M. (1984). Methodological issues related to the estimation of financial distress
prediction models. Journal of Accounting Research 22 (Supplement): 59-82.
November 6– 9, 2016