Business AI Data Structures and Analytics
Sean Cao
University of Maryland
Wei Jiang
Emory University
Lijun Lei
University of North Carolina at Greensboro
Data analytics is a broad category encompassing diverse activities that involve collecting,
organizing, and analyzing raw data. Fueled by advances in computing power, mass storage, and
machine learning, the use of data analytics has skyrocketed over the past decade. Data analytics can be applied to any type of information, including both structured and unstructured data.
and risk-sharing among capital market participants. The nature of a company is defined by how its various contractual arrangements with stakeholders are structured. These contractual relations are essential to companies: customers, suppliers, employees, investors, communities, and others who have a stake in the company form a network of interconnected contracts. Throughout history, the demand for information in capital markets has centered on a company’s past and prospective returns and risks. Companies that wish to lower financing costs and/or other costs, such as political, contracting, and labor costs, supply that information.
Managers of a company decide the amount of information to supply by weighing the costs and
benefits of disclosing such information. Although managers can exert control over what
information a company supplies and when, regulatory agencies have consistently intervened in
this process by establishing a baseline level of information that must be released with various
disclosure requirements.
Cao, Jiang, & Lei
etc. Accordingly, the decision-making process has evolved from a model in which managers
primarily rely on their experience to one in which decision-making is based on data analytics.
However, the shift presents an inherent challenge: the need for large-scale data collection from different sources and in various formats. Data analytics techniques can help stakeholders
collect relevant information, organize structured and unstructured data, and then conduct
appropriate analyses to reveal trends and metrics that would otherwise be lost amidst a sea of
information. Given the central role that information plays in capital markets, corporate
stakeholders increasingly recognize the value and the importance of data analytics. As revealed
in Cao, Jiang, Yang, and Zhang (2023), there is clearly an increase in the application of data
analytic tools in analyzing regulatory company filings downloadable from the Securities and
Exchange Commission’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database
system. Specifically, the proportion of automatic machine downloads of annual and quarterly regulatory company filings (i.e., 10-Ks and 10-Qs) surged from under 40 percent in 2003 to a substantial majority by 2016.
This figure plots the annual number of machine downloads (blue bars and left axis) and the annual percentage of
machine downloads over total downloads (red line and right axis) across all 10-K and 10-Q filings from 2003 to
2016. Machine downloads are defined as downloads from an IP address downloading more than 50 unique firms’
filings daily. The number of machine downloads and the number of total downloads for each filing are recorded as
the respective downloads within seven days after the filing becomes available on EDGAR.
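The machine-download rule described in the note above can be sketched in code. A minimal pure-Python version, using a hypothetical download log of (IP, date, firm CIK) records (all values made up):

```python
from collections import defaultdict

def classify_machine_ips(log, threshold=50):
    """Flag (ip, date) pairs that download filings of more than `threshold` unique firms."""
    firms_seen = defaultdict(set)              # (ip, date) -> set of firm CIKs
    for ip, date, cik in log:
        firms_seen[(ip, date)].add(cik)
    return {key for key, firms in firms_seen.items() if len(firms) > threshold}

# Hypothetical download log: one IP hits 60 distinct firms in a day (a crawler),
# another hits only 2 (likely a human reader).
log = [("10.0.0.1", "2016-03-01", f"CIK{i}") for i in range(60)]
log += [("10.0.0.2", "2016-03-01", "CIK1"), ("10.0.0.2", "2016-03-01", "CIK2")]

machines = classify_machine_ips(log)           # flags only ("10.0.0.1", "2016-03-01")
```

The same grouping logic scales directly to real EDGAR server logs, where the grouping key and threshold would follow the definition in the figure note.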
How to leverage data science for corporate stakeholders: the importance of domain knowledge
competitions began in chess. One of the most famous chess computers is Deep Blue, owing to its 1997 match against World Chess Champion Garry Kasparov. Cao, Jiang, Wang, and Yang (2021) build an AI analyst that is able to digest corporate financial information, qualitative disclosure, and macroeconomic indicators. They find that such an AI analyst could beat the majority of human analysts in stock price forecasts. The relative advantage of the AI analyst is stronger when the firm is complex and when information is high-dimensional, and weaker when critical information requires institutional knowledge. More importantly, the edge of the AI
over human analysts declines over time as human analysts gain access to alternative data and to
in-house AI resources. Unsurprisingly, combining AI’s computational power and the human art
of understanding soft information produces the highest potential in generating accurate forecasts.
This figure plots the proportion of AI-assisted analyst recommendations that are more accurate than analyst recommendations alone on an annual basis. The solid blue line in the middle gives the annual AI-assisted analyst beat ratio, the dotted blue lines above and below give the 95% confidence interval of the beat ratio, and the red line gives the best linear approximation of the trend in beat ratios.
powerful, the first step in utilizing them is relatively simple: identifying the required information
and devising a strategy to gather it. Given the dynamic nature of capital markets, analysts must
have in-depth knowledge of institutional backgrounds to effectively apply data analytics in the
corporate world or in capital markets. For instance, analysts must understand the objectives of an analysis and the potential sources of useful information. This creates a pressing demand for business
professionals who possess both domain knowledge in business and a practical understanding of
data analytics techniques. Business professionals with both business expertise and data analytics
skills can play a critical bridging role: deciphering the information needs of decision-makers, conducting preliminary analyses, and leading a team to formalize and implement quantitative models.
Furthermore, the advancement of artificial intelligence (AI) opens the avenue to various business applications, in marketing and other functions. It is equally important to be able to use and manage such AI applications,
including understanding their functions, exploring their roles in improving productivity, and
managing associated legal and security risks. Additionally, some abilities, such as reasoning-
based intelligence, are exclusive to humans and cannot be entirely replaced by AI. Therefore,
professionals who possess domain expertise will be in high demand in this new era.
Figure 3 describes a typical data analytics team, consisting of a customer team and an implementation team. The customer team includes the client-facing product manager, who needs to understand customer demand, perform preliminary analysis, and communicate the demand and a desirable solution to the implementation team. An ideal product manager thus should have business knowledge and be able to apply basic data analytics tools. The implementation team includes a lead team, serving as the bridge to the customer team, and a data team of implementation experts. The lead team requires professionals with strong business knowledge and data analytics skills, since they are responsible for receiving demand from the product manager, conveying the data analytics solution to data scientists, and performing quality control. The data team mostly comprises data scientists, typically computer science or statistics majors with strong programming skills. Business professionals with data analytics skills can work in any role that requires such integrated skills.
The objective of this book is thus to introduce computational tools and AI technologies to business-major students and link these tools to business domain knowledge. The ability to frame business questions and use cases will be the key to staying in the game in an era that has enthusiastically embraced data analytics and AI.
Managers and employees are deeply invested in their company’s current and future
financial well-being. This creates a strong demand for information on the company’s financial condition and performance, as well as on peer companies and business opportunities, which permits them to benchmark their company’s
performance and condition. Managers also use company information to design compensation and
bonus contracts. Here are some examples of questions that information extracted using data analytics could assist managers in addressing:
• What product lines, geographic areas, or other segments are performing well?
• How will current profit levels impact incentive- and share-based compensation?
Shareholders of a company, much like managers and employees, are keenly interested in
predicting its future performance. Expectations of future profitability and cash generation
significantly impact a company’s stock price and ability to borrow money on favorable terms.
Shareholders therefore demand company information to project its gains and losses accurately.
Investors also use company information to evaluate managerial performance. Here are some
examples of questions that information extracted using data analytics could assist investors in
addressing:
• What are the expected future profits, cash flows, and dividends for input into stock-
price models?
• Is the company financially solvent and able to meet its financial obligations?
• How do expectations about the economy, interest rates, and the competitive environment bear on the plans proposed by management?
regarding their financial transactions and business relationships. By using data analytics, they
can gain insights into the company’s financial health, performance, and risks. For example,
creditors can use data analytics to determine loan terms, loan amounts, interest rates, and
required collateral, as well as to forecast earnings, predict bankruptcy, and assign credit scores.
Meanwhile, suppliers can use data analytics to establish credit terms and to evaluate their long-
term commitment to supply-chain relations. Both creditors and suppliers use company
information to monitor and adjust their contracts and commitments. Here are some examples of
questions that creditors and suppliers can address with the help of data analytics:
• Should we extend credit in the form of a loan or line of credit for inventory purchase?
• What interest rate is reasonable given the company’s current debt load and overall
risk profile?
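Screening of this kind is often illustrated with the classic Altman (1968) Z-score, a linear combination of five balance-sheet ratios long used in bankruptcy prediction. The Z-score is not from this chapter; it is a well-known stand-in for the creditor analytics described above, sketched here with hypothetical inputs:

```python
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    """Classic Altman (1968) Z-score for public manufacturing firms."""
    return (1.2 * working_capital / total_assets
            + 1.4 * retained_earnings / total_assets
            + 3.3 * ebit / total_assets
            + 0.6 * market_value_equity / total_liabilities
            + 1.0 * sales / total_assets)

# Hypothetical balance-sheet inputs (in $ millions):
z = altman_z(working_capital=120, retained_earnings=300, ebit=90,
             market_value_equity=800, sales=1000, total_assets=1000,
             total_liabilities=400)
# z evaluates to about 3.06 for these inputs.
# Common rule of thumb: z > 2.99 "safe" zone, z < 1.81 "distress" zone.
```

A creditor would compute such a score across a loan book and combine it with other signals before setting terms.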
or services as well as its staying power and reliability. Auditors rely on company information to assess, among other things, compliance with laws and regulations. All stakeholders can uncover critical insights for their decision-making with the help of data analytics.
Over the past few decades, significant progress has been made in data analytics techniques. These advances have led to new sources of information that enable us to effectively tackle a broader range of problems. Recently, there have also been remarkable innovations in the methods used to create new data. Information collected from these fresh sources, or generated by these new methods, expands what analysts can study.
More and more information users are drawn to the text in firm regulatory filings. Despite containing financial statements and pages of tables and charts, annual reports and other company regulatory filings still consist mostly of text. The recent reduction in computer storage costs and the increase in computer processing capabilities have made textual analysis of these disclosures more feasible. Regulatory filings that are widely available for analysis encompass
annual reports, current reports, proxy statements, initial public offering (IPO) prospectuses, and
more.
Conference call transcripts allow analysts to capture and analyze information disclosed
during corporate conference calls. These calls provide an opportunity for managers to announce
and discuss the firm’s financial performance, while allowing analysts and investors to ask questions.
Corporate social responsibility (CSR) reports are communications detailing a company’s CSR initiatives and their impact on the environment and society. While some countries mandate the annual publication of CSR reports, many companies publish them voluntarily.
Social media has become a vital communication channel for businesses. Social media
platforms enable the rapid dissemination of information to millions of people within seconds.
This evolution in information exchange has opened up a new range of opportunities for
companies to inform and interact with stakeholders. Company information shared on social media platforms, whether by the company itself or by investors, consumers, competitors, and other third parties, can be a rich source of decision-relevant signals.
Audio data pertaining to business activities can also enhance decision-making processes.
For example, in addition to analyzing textual transcripts of conference calls and investment
presentations, the audio recordings of these events can provide valuable nuances to analysts.
Video and image data are more widely used than ever before due to the progress of
video and image capture devices. Computer algorithms are becoming increasingly sophisticated,
enabling the processing and interpretation of static images and the extraction of objective information from them.
examples of potentially valuable image data. Videos of investor presentations, product releases,
and other company events could be valuable for managers, investors, and other decision-makers.
quantitative data that is aggregated and used to prepare financial statements for internal and
external information users. In addition, information intermediaries and other marketplace agents rely on applicable accounting standards to ensure the relevance, reliability, and comparability of firm
information. Companies use four financial statements to periodically report on their business
activities: the balance sheet, income statement, statement of stockholders’ equity, and statement
of cash flows. The balance sheet reports on a company’s financial position at a specific point in
time, while the income statement, statement of stockholders’ equity, and statement of cash flows
report on performance over a period of time. These three statements link the balance sheet from the beginning of the period to the end of the period.
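This articulation can be made concrete: net income from the income statement flows through the statement of stockholders’ equity into retained earnings on the next balance sheet. A toy sketch with hypothetical numbers:

```python
# Hypothetical figures (in $ millions) illustrating how the statements link:
# net income (income statement) and dividends (equity statement) reconcile
# beginning retained earnings to ending retained earnings (balance sheet).
beginning_retained_earnings = 500
net_income = 120          # from the income statement
dividends = 40            # distributions reported in the equity statement

ending_retained_earnings = beginning_retained_earnings + net_income - dividends
assert ending_retained_earnings == 580   # appears on the period-end balance sheet
```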
type of compensation paid to a firm’s chief executive officer, chief financial officer, and the
other three highest-paid executive officers in the annual proxy statement. The company must
also disclose the criteria used to reach executive compensation decisions and the relationship
between the company’s executive compensation practices and corporate performance. The
Summary Compensation Table, included in the proxy statement, is the cornerstone of the compensation disclosure. The
Summary Compensation Table is then followed by other tables and disclosures containing more
specific information on the components of compensation for the last completed fiscal year, for
example, information about grants of stock options and stock appreciation rights, long-term
incentive plan awards, pension plans, and employment contracts and related arrangements.
information from financial experts. Financial analysts provide short-term and long-term forecasts
on earnings, sales, capital expenditures, etc. Financial analysts also regularly issue investment
recommendations.
Loan agreements include information about loan terms, loan amounts, interest rates, and
required collateral. Loan agreements often include contractual requirements, called covenants,
that restrict a company’s behavior in some fashion. Violation of loan covenants can lead to early
repayment or other compensation demanded by the lender. Information in loan agreements thus sheds light on a company’s credit risk and financial constraints.
Patents and citations are highly useful in understanding the innovations a company has developed. The economic importance of patents is well-documented, and, unsurprisingly, patent data has been widely used in analyses of corporate innovation.
In this book, unstructured textual and image data are discussed in chapter 2 to chapter 8;
When analyzing either quantitative or qualitative data, two approaches can be taken. In the theory-driven approach, hypotheses guide the direction of data analytics, whereas the machine-learning approach starts with data being supplied to a computer model that trains itself to identify patterns or make predictions. The
theory-driven approach resembles the human thinking process, making it intuitive and interpretable. The machine-learning approach, on the other hand, leverages the computational power of machines to yield strong predictive capability, but the machine-learning process remains a “black box”.
As an example, the Securities and Exchange Commission (SEC) charged Luckin Coffee
Inc. with material misstatement of financial statements to falsely appear to achieve rapid growth
and increased profitability and to meet the company’s earnings estimates. The fraud was
uncovered by Muddy Waters LLC. Muddy Waters is an investment research firm that conducts
research about financial fraud. The firm received an anonymous tip and mobilized 92
full-time and 1,418 part-time staff on the ground to run surveillance and recorded store traffic for
981 store-days covering 100% of the operating hours of 620 stores. The investigation resulted in
more than 11,200 hours of videotaping and led to the conclusion that the number of items per
store was inflated by at least 69 percent in 2019’s third quarter and 88 percent in the fourth
quarter (Muddy Waters Research, 2020). This is a typical investigation following the theory
(hypothesis)-driven approach where Muddy Waters LLC first formed a hypothesis that Luckin
Coffee Inc. misstated its financial statements and then conducted an investigation to examine the
hypothesis. In contrast, financial statement auditors are required to perform analytical procedures
that aim at detecting potential anomalies in financial reporting. When auditors perform analytical
procedures, they do not necessarily have a hypothesis that a company misstates its financial statements.
Both approaches can be employed to perform three types of data analytics: descriptive, diagnostic, and predictive. Descriptive analytics summarize data and describe observable patterns. These analyses focus on understanding what has happened over a period of time.
However, conventional descriptive analytics rely on structured data mostly in numeric format.
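A toy illustration of descriptive analytics on structured numeric data, summarizing hypothetical quarterly revenue with standard-library statistics:

```python
import statistics

# Hypothetical quarterly revenue (in $ millions) over two years.
revenue = [210, 225, 198, 240, 260, 255, 270, 285]

# Descriptive analytics: summarize what happened over the period.
summary = {
    "mean": statistics.mean(revenue),
    "median": statistics.median(revenue),
    "stdev": statistics.stdev(revenue),
    "growth": (revenue[-1] - revenue[0]) / revenue[0],   # total growth over the window
}
```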
Diagnostic analytics seek to understand what happened and why. This type of data analytics involves more diverse data inputs and a deeper dive into the data. Diagnostic analytic techniques include correlations, regressions, and other statistical methods.
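A minimal diagnostic-analytics sketch: a pure-Python Pearson correlation between advertising spend and sales (all numbers made up), asking why sales moved:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical quarterly figures (in $ millions):
ad_spend = [10, 12, 9, 15, 18, 20]
sales    = [100, 108, 95, 120, 135, 150]

r = pearson(ad_spend, sales)
# r close to 1 suggests ad spend and sales moved together; a starting
# point for a causal story, not proof of one.
```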
Finally, predictive analytics explores what is likely to happen or what will happen “if”
something else happens. To be able to predict what will happen, we first need to understand what
happened, how, and why; hence, predictive analytics builds on descriptive and diagnostic
analytics. Conventional predictive analytic techniques involve building models using past data
and statistical techniques, including regression and a deep understanding of cause and effect.
With machine-learning algorithms, patterns can be learned from a training dataset and used to make predictions, offering several advantages over conventional analysis techniques. First, traditional statistical methods often struggle with large amounts of
data. In contrast, machine learning algorithms such as convolutional neural networks (CNNs) are
capable of selecting the best features to process information effectively. Another advantage of
machine learning is its ability to handle nonlinear relationships in data. Along with the
information explosion, the types of information available to analysts have grown from mostly
numerical data to more complex data that involves both text and images. Machine learning can
effectively identify nonlinear patterns and make predictions using various types of data. This
makes it particularly valuable for tasks such as natural language processing and computer vision.
Machine learning also offers the ability to make out-of-sample predictions, which is particularly
useful in cases where data is limited. In traditional statistical methods, repeated optimization
(learning) is often required to improve the accuracy of the model. However, in machine learning,
this process is streamlined and only requires one-time optimization. For tasks involving time
series data, machine learning algorithms such as long short-term memory (LSTM) networks can
be applied to identify time-series patterns. Finally, machine learning algorithms are designed to
be efficient, which is particularly important given the increasing amounts of data being
generated. By using optimized algorithms, machine learning can process data quickly and
effectively, making it a valuable tool in fields such as healthcare and finance where time is of the
essence.
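In the spirit of the predictive analytics described above, here is a minimal sketch: fit an ordinary least-squares line to past (made-up) quarterly earnings and extrapolate one quarter ahead:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical earnings per share over six quarters:
quarters = [1, 2, 3, 4, 5, 6]
earnings = [2.0, 2.3, 2.2, 2.6, 2.8, 2.9]

slope, intercept = fit_line(quarters, earnings)
forecast_q7 = slope * 7 + intercept     # extrapolate one quarter ahead
```

Real predictive models layer many more features and flexible functional forms on top of this idea, but the fit-then-extrapolate structure is the same.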
The field of machine learning encompasses a wide range of techniques and approaches, including supervised learning, unsupervised learning, self-supervised learning, transfer learning, ensemble learning, and more. Supervised learning involves training a model with labeled data. In contrast, unsupervised learning does not require labeled data and instead focuses on
finding patterns and relationships within the data itself. Self-supervised learning is a variant of
unsupervised learning that uses the data itself to generate labels, such as using stock returns to
label news positivity. Transfer learning is a powerful technique that uses a pre-trained model to
tackle new tasks with less training data. This approach is particularly useful in cases where the
cost of obtaining labeled data is high or where the amount of labeled data available is limited.
These machine learning techniques are discussed in detail in later chapters of the book.
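The self-supervised labeling idea mentioned above, using stock returns to label news positivity, can be sketched as follows (headlines and returns are invented):

```python
# Instead of hand-labeling news sentiment, use the firm's next-day stock
# return as a free label: a positive return labels the headline "positive".
# All headlines and returns below are hypothetical.
news_and_returns = [
    ("Company beats earnings estimates", 0.031),
    ("Regulator opens investigation", -0.045),
    ("New product launch announced", 0.012),
    ("CFO resigns unexpectedly", -0.020),
]

labeled = [(headline, "positive" if ret > 0 else "negative")
           for headline, ret in news_and_returns]
# `labeled` can now serve as training data for a supervised text classifier.
```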
References
Cao, S., Jiang, W., Yang, B., and Zhang, A. 2023. How to Talk When a Machine Is Listening: Corporate Disclosure in the Age of AI. Review of Financial Studies.
the U.S. Securities and Exchange Commission (SEC) within 60 days of the end of the fiscal
year. The SEC’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database system
allows anyone to retrieve a company’s 10-K report. Some companies also post their 10-K reports
on their websites. In addition, SEC rules mandate that companies send an annual report to their
shareholders in advance of annual meetings. While both annual reports and 10-K filings provide
an overview of the company's performance for the given fiscal year, annual reports tend to be
much more visually appealing than 10-K filings. Companies put effort into designing their
annual reports, using graphics and images to communicate data, while 10-K filings report numbers and other required information, devoid of design elements or additional flair.
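Locating a company’s filings programmatically can be sketched as below. The browse-URL pattern is an assumption based on EDGAR’s public interface and should be checked against current SEC documentation; the CIK (Central Index Key) shown is Apple Inc.’s:

```python
from urllib.parse import urlencode

def edgar_filing_index_url(cik, form_type="10-K", count=10):
    """Build a URL for EDGAR's company-browse page listing a firm's filings.

    The query-parameter names below follow EDGAR's public browse interface
    at the time of writing; verify them against SEC documentation.
    """
    params = urlencode({"action": "getcompany", "CIK": cik,
                        "type": form_type, "count": count})
    return f"https://www.sec.gov/cgi-bin/browse-edgar?{params}"

url = edgar_filing_index_url("0000320193")   # Apple Inc.'s CIK
```

The returned page lists the filing index; an actual downloader would fetch it with an identifying User-Agent header, as the SEC requests of automated clients.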
A comprehensive Form 10-K contains four parts and 15 items. Researchers are often
most interested in Item 1, “Business,” Item 1A, “Risk Factors”, and Item 7, “Management’s
Discussion and Analysis of Financial Condition and Results of Operations.” Therefore, we begin with these three items. Item 1, “Business,” describes the company’s business, including its main products and services, its subsidiaries, and in which markets it
operates. To gain an understanding of a company’s operations and its primary products and
services, Item 1 serves as an excellent starting point. Figure 1 shows an excerpt of Item 1 of
Apple Inc.’s 2021 10-K. It introduces the company background and provides information on its products and services.
Item 1A, “Risk Factors,” is also an item in Part I of Form 10-K. It outlines the most significant risks faced by the company or its securities. In practice, this section focuses on the risks themselves, not how the company addresses those risks. Risks outlined may pertain to the entire economy or market, the company’s industry sector or geographic region, or be unique to the company itself. Figure 2 shows an excerpt of Item 1A of Apple Inc.’s 2021 10-K, which discusses business risks arising from the COVID-19 pandemic, such as disruptions in the supply chain.
Item 7, “Management’s Discussion and Analysis of Financial Condition and Results of Operations,” presents the company’s perspective on its financial performance during the prior
fiscal year. This section, commonly referred to as the MD&A, allows company management to
summarize its recent business in its own words. The MD&A presents:
• The company’s operations and financial results, including information about the
company’s liquidity and capital resources and any known trends or uncertainties that
could materially affect the company’s results. This section may also present the
management’s views on key business risks and how they are being addressed.
• Material changes in the company’s results compared to the prior period, as well as off-balance-sheet arrangements.
• Critical accounting estimates; these judgments, and any changes from previous years, can have a significant impact on the numbers in the financial statements, such as assets, costs, and net income.
Figure 3 is an excerpt of Item 7 of Target’s 2021 10-K. It begins with highlights of the fiscal year and a summary of financial outcomes, and then continues to analyze key performance indicators.
Part I of the report comprises four more items in addition to Item 1 and Item 1A. Item 1B, “Unresolved Staff Comments,” requires the company to explain certain comments received
from SEC staff on previously filed reports that have not been resolved after an extended period
of time.
Item 2, “Properties,” covers the company’s principal plants, mines, and other materially important physical properties. Figure 4 displays Item
2 from Apple Inc.’s 2021 10-K. It reveals that Apple Inc. owns and leases facilities and land
Item 3, “Legal Proceedings,” covers pending lawsuits or other legal proceedings, other than ordinary litigation. Figure 5 displays Item 3
from Apple Inc.’s 2021 10-K. It is worth noting that it is not uncommon for companies to be
involved in legal proceedings. Item 4 has no required information and is reserved by the SEC for
future rulemaking.
Part II of Form 10-K comprises seven items in addition to Item 7. Item 5, “Market for
Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity
Securities,” provides information about the company’s equity securities, including market
information, the number of shareholders, dividends, stock repurchases by the company, and other
relevant information. Figure 6 provides an example of Item 5 in Apple Inc.’s 2021 10-K.
Item 6, “Selected Financial Data,” presents selected financial information from the past five years. As shown in Figure 7, Item 6 of Apple Inc.’s 2021 10-K reports selected
financial information from 2016 to 2021. More detailed financial information on the past three years is included in a separate section: Item 8, “Financial Statements and Supplementary Data,”
which includes the company’s balance sheet, income statement, cash flow statement, and notes
Item 7A, “Quantitative and Qualitative Disclosures about Market Risk,” mandates
disclosure of the company’s exposure to market risks arising from, for example, fluctuations in
interest rates, foreign currency exchanges, commodity prices, or equity prices. This section may
also include information on how the company manages these risks. Figure 8 provides an excerpt of this item.
Item 8 presents the company’s audited financial statements, which include the income statement, balance sheet, statement of cash flows, and statement of stockholders’ equity. The financial statements are
accompanied by notes that elucidate the information presented in the financial statements. An
independent accountant audits these financial statements and, for large companies, also reports on the effectiveness of internal control over financial reporting.
Item 9, “Changes in and Disagreements with Accountants on Accounting and Financial Disclosure,” requires companies that have changed accountants to discuss any disagreements
they had with those accountants. Such disclosure is often seen as a red flag by many investors.
Item 9A, “Controls and Procedures,” discloses information about the company’s disclosure
controls and procedures, as well as its internal controls over financial reporting. Item 9B, “Other
Information,” requires companies to provide any information that should have been reported on
another form during the fourth quarter of the year covered by the 10-K, but was not disclosed.
Part III of the 10-K includes five items. Item 10, “Directors, Executive Officers and
Corporate Governance,” requires information about the background and experience of the
company’s directors and executive officers, the company’s code of ethics, and certain
qualifications for directors and committees of the board of directors. Item 11, “Executive Compensation,” requires disclosure of the company’s compensation policies and programs, as well as how much compensation was paid to its top executive officers in the past
fiscal year.
In Item 12, “Security Ownership of Certain Beneficial Owners and Management and
Related Stockholder Matters,” companies provide information about the shares owned by the
company’s directors, officers, and certain large shareholders. This item also includes information about shares authorized for issuance under equity compensation plans.
Item 13, “Certain Relationships and Related Transactions, and Director Independence,”
includes information about relationships and transactions between the company and its directors,
officers, and their family members. It also includes information about whether each director of the company is independent.
Item 14, “Principal Accountant Fees and Services,” requires companies to disclose fees
paid to their accounting firm for various types of services during the year. Although this information is required, it may be provided in a separate document called the proxy statement. Companies distribute the proxy statement among
their shareholders in preparation for annual meetings. If the information was provided in a proxy
statement, Item 14 will include a message from the company directing readers to the proxy
statement document. The proxy statement is typically filed a month or two after the 10-K.
Part IV of 10-K
Part IV contains Item 15, “Exhibits, Financial Statement Schedules,” which outlines the
financial statements and exhibits included as part of the 10-K filing. Many exhibits are
mandatory, including documents such as the company’s bylaws, copies of its material contracts, and a list of subsidiaries.
An annual report reviews a company’s performance and activities throughout a fiscal year. Many companies choose to incorporate a lot
of graphics and images instead of large amounts of text in their annual reports to create more
visually appealing documents than 10-Ks. For example, in Figure 9, Procter & Gamble
provides both numeric and graphic information regarding its financial performance in the 2021
annual report.
The structure of annual reports varies across companies, but they typically include several common sections such as (1) a letter to shareholders, (2) performance and highlights, (3) corporate strategies, (4) non-financial information such as CSR information, (5) financial information, (6) leadership information, and (7) any other pertinent information the company wishes to share.
The “bag of words” technique is a Natural Language Processing (NLP) method used for textual modeling. Text data can be messy and unstructured, making it challenging for machine learning algorithms to analyze. These algorithms prefer structured, well-defined, fixed-
length inputs. A “bag of words” is a textual representation of the occurrence of words within a
document. To create this representation, analysts track the frequency of word occurrences in a
document, disregarding grammatical details and word orders. The term “bag” is used because
information about the order or structure of words in the document is discarded, and all words are
collected en masse as if in a bag. Using this technique, variable-length texts can be converted
into a fixed-length vector. The bag-of-words approach is a simple and flexible way to extract features from documents using a pre-determined set of keywords. Sentiment analysis, for instance, can be conducted by computing the frequency
of pre-determined negative and positive words. By comparing the number of negative words to the number of positive words, the bag-of-words approach can identify the sentiment of a text as negative without the need to interpret the text’s full meaning.
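As a toy illustration, the counting procedure can be sketched in Python. The positive and negative word lists below are illustrative stand-ins for an established dictionary such as the Loughran-McDonald lists.

```python
import re
from collections import Counter

# Illustrative word lists; a real application would use an established
# dictionary such as the Loughran-McDonald sentiment word lists.
POSITIVE = {"growth", "profit", "gain", "improve"}
NEGATIVE = {"loss", "decline", "risk", "impairment"}

def bag_of_words(text):
    """Count word occurrences, discarding grammar and word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

def sentiment(text):
    """Classify text by comparing counts of negative and positive words."""
    bag = bag_of_words(text)
    pos = sum(bag[w] for w in POSITIVE)
    neg = sum(bag[w] for w in NEGATIVE)
    if neg > pos:
        return "negative"
    return "positive" if pos > neg else "neutral"
```

Note that the classifier never parses the sentence; it only compares two counts, which is exactly the strength and the limitation of the bag-of-words representation.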
There is a wide range of well-established keyword lists readily available for textual
analyses with various objectives. In sentiment analysis, for example, the Harvard-IV-4
Dictionary is a general-purpose dictionary that provides a list of positive and negative words developed at Harvard University. The Loughran-McDonald Sentiment Word Lists are widely used for accounting and finance texts. Other researchers have developed similar keyword lists for non-English languages (Du, Huang, Wermers, and Wu 2022) or for purposes other than sentiment analysis, such as forward-looking statements, extreme sentiment, deception, intangible assets, culture, big data and artificial intelligence, litigation, social affiliation, supply chain, etc. (e.g., Cao, Ma, Tucker, and Wan 2018; Hassan, Hollander, van Lent, and Tahoun 2019).
Self-defined keywords
When a suitable keyword list is not readily available for a specific research question, we
can create a customized one by reading a small sample of related texts and selecting the most
relevant keywords. This approach is easy to implement, but it can also be arbitrary. Below we describe more systematic alternatives.
Corpus approach
The corpus approach begins with gathering textual contents relevant to the topic of
interest, from which a set of frequently used words {A} is extracted. This set often includes
noisy keywords unrelated to the topic. To eliminate this noise, we then identify the irrelevant
topics and generate a list of frequently used words for each irrelevant topic {Bi}. A robust
keyword list for the topic of interest is then obtained by subtracting irrelevant topic keywords
from the preliminary high-frequency word list, or {A} − ∪{Bi}. For example, to generate a list of
political keywords {Ap}, one might start with political science textbooks to generate a high-
frequency word list {Ap0}. This preliminary high-frequency word list might contain keywords
relating to economics, law, science, etc. To remove these irrelevant topics, we could use a similar approach to generate a list of high-frequency words for each irrelevant topic: {Beconomics}, {Blaw}, {Bscience}, etc. Finally, we subtract these irrelevant keywords from the preliminary political keyword list, resulting in a clean political keyword list, or {Ap} = {Ap0} − ∪{Bi}. Figure 11 illustrates this process.
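The set subtraction at the heart of the corpus approach can be sketched as follows. The toy corpora and the top-5 cutoff are purely illustrative; a real application would extract high-frequency words from full textbooks or filings.

```python
from collections import Counter

def top_words(texts, k=5):
    """Extract the k most frequent words from a collection of texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w for w, _ in counts.most_common(k)}

# Toy stand-ins for a political corpus and an economics corpus.
political_texts = ["election policy vote government policy",
                   "government election law"]
economics_texts = ["market price policy trade",
                   "price market law"]

A = top_words(political_texts)   # preliminary high-frequency list {A}
B = top_words(economics_texts)   # irrelevant-topic list {B}
political_keywords = A - B       # {A} minus the union of {Bi}
```

With several irrelevant topics, the subtraction becomes `A - (B1 | B2 | ...)`, mirroring {A} − ∪{Bi}.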
Dictionary expansion
The dictionary expansion approach constructs a keyword list by searching for synonyms of key topical words in authoritative dictionaries. For instance, to generate a
keyword list for “risk”, we can begin with the single word “risk” and look up all synonyms of
“risk” in the Merriam-Webster Dictionary. This can give us words like “threat” and “danger”.
We can then look up the synonyms of these synonyms, which could yield words such as
“menace”, “jeopardy”, and “trouble”. The process can be continued until the additional
synonyms are no longer closely related to the original concept of “risk” (Figure 12).
Figure 12. Using the dictionary expansion approach to develop keyword lists
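A minimal sketch of this iterative expansion, using a hard-coded synonym map in place of actual Merriam-Webster lookups:

```python
# Hypothetical synonym lookups standing in for a real dictionary.
SYNONYMS = {
    "risk": ["threat", "danger"],
    "threat": ["menace", "risk"],
    "danger": ["jeopardy", "trouble", "risk"],
}

def expand(seed, depth):
    """Breadth-first expansion: repeatedly add synonyms of synonyms."""
    keywords, frontier = {seed}, [seed]
    for _ in range(depth):
        frontier = [s for w in frontier for s in SYNONYMS.get(w, [])
                    if s not in keywords]
        keywords.update(frontier)
    return keywords
```

In practice the `depth` cutoff implements the stopping rule described above: expansion halts once additional synonyms drift too far from the original concept.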
Machine learning can also be used to automate dictionary expansions (Alba, Gruhl, Ristoski, and Welch 2018). This approach can not only discover new instances from the input text corpus but also predict new “unseen” terms not currently in the corpus. The
approach runs in two phases. Continuing with the political word example, during the explore
phase, the model calculates a similarity score between words in the Merriam-Webster Dictionary
and the single word “politics” to identify instances in the dictionary that are similar to the word
“politics” such as “activism”, “legislature”, or “government”. In the exploit phase, the model
generates new phrases based on a word’s co-occurrence score, or how often words appear
together. For example, “government policy” may not appear in the Merriam-Webster Dictionary
but “political policies” and “science of government” often appear together, so the model can generate “government policy” as a new term.
Latent Dirichlet Allocation (LDA) is a technique often used for topic discovery and dimensionality reduction. Unsupervised LDA is useful for exploring unstructured text data by inferring the latent topics it contains; its most common application is topic modeling. Given a sample of textual data and a pre-determined number of topics, K, an
LDA algorithm can generate K topics that best fit the data. Determining the appropriate number
of topics is somewhat arbitrary. The best practice is to review the textual sample to get a feel for the content, generate a word frequency table to review the high-frequency words,
and then determine the number of topics in an informed manner. Figure 13 illustrates the
keywords for the topic “politics” that were developed using an unsupervised LDA algorithm.
Figure 13. Keywords for the topic “politics” generated using unsupervised LDA
In addition to unsupervised LDA, LDA can also be supervised. Supervised LDA requires
humans to read a small sample of the textual contents and label the topics for each textual input.
The labeled sample is used to train a model that predicts the topics of the remaining texts in the
sample. The self-supervised approach involves using existing labels to label keywords. For
instance, we can use subsequent stock returns to label positive and negative keywords when building a sentiment classifier.
2.3. Empirical examples: Analyzing corporate filings for making business decisions
When companies actively change their 10-Ks, this often provides an important signal about future operations (Brown and Tucker 2011; Cohen, Malloy, and Nguyen 2020), but Cohen
et al. (2020) document that investors tend to neglect the valuable information in the changes. If an investor constructs a portfolio that shorts companies significantly changing their 10-Ks or 10-Qs and buys companies not significantly changing them, the investor could earn a sizable abnormal return.
Item 1 of 10-K
10-K Item 1 discusses companies’ products and services. The textual descriptions in 10-K Item 1 can be used to construct a family of measures based on product similarity. Hoberg and Phillips (2016) extract companies’ product descriptions from 10-K Item 1 and represent the usage of words in each description with a binary vector. The cosine similarity score between a pair of companies’ product descriptions then captures the similarity of the products between the two
companies. The product similarity measure is useful in evaluating the extent of competition a
company faces. If a large number of companies provide products or services highly similar to
those provided by a given company, then this company is likely to face fierce competition in the
product market. The measure can also be used to refine industry classification. Nowadays, many
companies provide products and services covering multiple traditional industries. For example,
Amazon Inc. is a retailer in the retailing industry, a streaming service provider in the
entertainment industry, an electronic device maker in the manufacturing industry, and a software
provider in the computer and business service industry. The product similarity measure provides
an avenue to define an “industry” for Amazon that consists of companies providing a similar set
of products and services rather than arbitrarily classifying Amazon Inc. into one of the traditional industries. In a related vein, measuring the time-series similarity of Item 1 could help analysts detect whether a company launches new products or services and implements new strategies.
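A sketch of this style of similarity computation follows; the five-word vocabulary and product descriptions are hypothetical, and a real application would build the vocabulary from all 10-K Item 1 texts.

```python
import math

def binary_vector(description, vocabulary):
    """Represent word usage in a product description as a 0/1 vector."""
    words = set(description.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical usage."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical vocabulary and descriptions for two companies.
vocab = ["cloud", "retail", "streaming", "software", "grocery"]
a = binary_vector("cloud software streaming retail", vocab)
b = binary_vector("cloud software", vocab)
score = cosine_similarity(a, b)
```

Pairs of companies with scores near 1 would be grouped into the same text-based “industry,” regardless of their traditional industry codes.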
Item 1A of 10-K
Companies discuss risk factors impacting their business in Item 1A. Hanley and Hoberg (2019) develop an emerging risks database for banks based on risk factor disclosures in Item 1A.
They employ topic modeling to obtain a 25-factor Latent Dirichlet Allocation (LDA) model
which is then used to extract 625 bigrams. Figure 15 provides an overview of the 25 emerging
risk topics and five most prevalent words in each topic. The bigrams are then converted into a set
of interpretable risk factors in the form of word vectors using semantic vector analysis. The
cosine similarity between the vocabulary list associated with each risk theme and the raw text of
a bank’s Item 1A disclosure reflects the intensity of the bank’s discussion of each emerging risk.
Using the risk loadings, Hanley and Hoberg (2019) show that risks related to real estate, prepayment, and commercial paper are elevated as early as mid-2005, prior to the 2008 financial
crisis. They also find that individual bank exposure to emerging risk factors is a strong predictor of subsequent stock performance.
Segment Information
Recruiting CEOs whose skills and attributes suit company needs is critical to company
success. The inherent difficulty in external CEO hiring arises from the immense heterogeneity of
both job candidates and companies. Central to recruiting is not only identifying competent
managers, but also maximizing quality of the match between companies and CEOs. Cao, Li, and
Ma (2022) find that segment information in 10-K filings helps companies find CEOs who fit their needs. For instance, Ford hired Alan Mulally from Boeing’s Commercial Airplanes division as its CEO in 2006. At that time, Ford was looking for a leader with experience in turning around a troubled corporate giant. Alan Mulally happened to possess that experience, as revealed by the segment information disclosed in Boeing’s 1999 10-K (Figure 16). The Commercial Airplanes segment of Boeing suffered a loss of $1,589 million in 1997 but earned a profit of $2,016 million two years later. This experience was exactly what Ford valued in CEO candidates.
Figure 15. This figure provides an overview of the 25 risk factors detected using topic modeling from fiscal year 2006 10-Ks of banks. Each box is ranked and sized relative to its importance in the document and contains the five most prevalent words or commongrams in the topic (Hanley and Hoberg 2019).
References
Alba, A., Gruhl, D., Ristoski, P., and Welch, S. 2018. Interactive Dictionary Expansion with Humans-in-the-Loop.
Brown, S., and Tucker, J. 2011. Large-sample Evidence on Firms’ Year-over-Year MD&A Modifications. Journal of Accounting Research, 49(2), 309-346.
Cao, S., Jiang, W., Wang, J., and Yang, B. 2021. From Man vs. Machine to Man + Machine: The Art and AI of Stock Analyses. Working paper.
Cao, S., Jiang, W., Yang, B., and Zhang, A. 2022. How to Talk When a Machine Is Listening? Corporate Disclosure in the Age of AI.
Cao, S., Ma, G., Tucker, J., and Wan, C. 2018. Technological Peer Pressure and Product Disclosure. The Accounting Review, 93(6), 95-126.
Cohen, L., Malloy, C., and Nguyen, Q. 2020. Lazy Prices. Journal of Finance, 75(3), 1371-1415.
Du, Z., Huang, A.G., Wermers, R., and Wu, W. 2022. Language and Domain Specificity: A Chinese Financial Sentiment Dictionary. Review of Finance.
Hanley, K.W., and Hoberg, G. 2019. Dynamic Interpretation of Emerging Risks in the Financial Sector. Review of Financial Studies, 32(12), 4543-4603.
Harris, Z. 1954. Distributional Structure. Word, 10(2-3), 146-162.
Hassan, T., Hollander, S., van Lent, L., and Tahoun, A. 2019. Firm-level Political Risk: Measurement and Effects. Quarterly Journal of Economics, 134(4), 2135-2202.
Hoberg, G., and Phillips, G. 2016. Text-Based Network Industries and Endogenous Product Differentiation. Journal of Political Economy, 124(5), 1423-1465.
Loughran, T., and McDonald, B. 2011. When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. Journal of Finance, 66(1), 35-65.
The initial step in building machine learning models is to preprocess the raw data, or data
cleaning, which is essential for improving data quality. This process involves various tasks such
as eliminating redundant entries and boilerplate text, handling missing data and outliers, and
rectifying data that is improperly formatted. These operations can significantly improve the performance of the final model since they ensure that the model learns from consistent and
relevant data. For instance, when working with analyst reports, it is a common practice to
remove sections such as the analyst disclaimer, as they typically offer minimal or no value for
machine learning tasks. By removing such “boilerplate” text, the machine can better discern the informative content of the reports.
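A minimal preprocessing sketch follows. The disclaimer pattern is a hypothetical example; real analyst reports would need patterns tailored to each broker’s template.

```python
import re

# Hypothetical boilerplate pattern; real reports require broker-specific rules.
DISCLAIMER = re.compile(r"this report is for informational purposes.*",
                        re.IGNORECASE | re.DOTALL)

def clean(documents):
    """Drop boilerplate text and exact-duplicate entries from a corpus."""
    seen, cleaned = set(), []
    for doc in documents:
        doc = DISCLAIMER.sub("", doc).strip()   # strip boilerplate
        if doc and doc not in seen:             # drop empty and duplicate docs
            seen.add(doc)
            cleaned.append(doc)
    return cleaned
```

Handling missing values and outliers would be added analogously, one rule per data-quality problem.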
The next key aspect of the model-building process is data parsing, which involves
converting data from one format to another. The objective is to transform unstructured raw data
into a unified structured representation that machines can easily comprehend and utilize. For
instance, consider an HTML webpage. By parsing the HTML, we can transform it into organized
formats like CSV or JSON, simplifying the extraction of specific details from the data. Regular
expressions are commonly employed to extract specific patterns or sequences of characters from the parsed text.
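As an illustration, Python’s built-in `html.parser` can pull cell values out of a simple HTML table and convert them to JSON, after which a regular expression extracts a dollar amount. The one-row table here is made up for the example.

```python
import json
import re
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collect the text content of every <td> cell in an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_cell, self.cells = False, []
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data.strip())

parser = CellExtractor()
parser.feed("<table><tr><td>Revenue</td><td>$80,462</td></tr></table>")

# Pair up label/value cells into a structured record, then serialize to JSON.
record = dict(zip(parser.cells[::2], parser.cells[1::2]))
structured = json.dumps(record)

# A regular expression then extracts a specific pattern: a dollar amount.
amount = re.search(r"\$[\d,]+", structured).group()
```

The same pattern scales to real filings: parse the markup into records first, then apply regular expressions to the structured values.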
Building machine learning models also involves feature selection. Feature selection is the
process of identifying input variables that are important for building a high-performing model.
The inclusion or exclusion of relevant features has a profound impact on the quality of the
model's output. As the saying goes, "garbage in, garbage out," emphasizing that the output's
reliability is inherently tied to the quality of the input. If a model is trained on a dataset that contains irrelevant or noisy features, its predictions will be unreliable.
Domain knowledge is crucial in feature selection because it provides valuable insights that
inform the selection process. Experts with a deep understanding of the subject matter can
leverage their knowledge and experience to identify key features. For instance, Cao, Jiang, Wang, and Yang (2021) choose several firm-level, industry-level, and macroeconomic variables, as well as textual information from firms’ disclosures, as inputs for their “AI analyst” model. The
researchers base their selection on the knowledge that prior studies have demonstrated a strong
correlation between these variables and earnings forecasts. This informed approach underscores
the importance of leveraging domain expertise when choosing relevant features for machine
learning models.
Once the data has undergone preprocessing, the next step is building machine learning
models. This process is not merely a random selection but rather a systematic approach to
finding the most suitable model for the given data and problem at hand.
Model selection heavily relies on our understanding of the data characteristics and the
relative strengths and limitations of different models. Since various models excel at different
types of tasks, we can make an initial model selection based on our knowledge and experience of
each model’s capabilities, strengths, and weaknesses. For instance, random forest models are
frequently used for classification tasks due to their ability to handle both numerical and
categorical data and their resistance to overfitting. Long short-term memory (LSTM) models are
particularly effective for time series analysis as they can capture long-term dependencies in the
data. Transfer-learning models, on the other hand, are commonly used for tasks where previously
acquired knowledge can be utilized, such as image or natural language processing. For example,
Cao, Jiang, Wang, and Yang (2021) start with two versatile quasi-linear ML models, Elastic-Net
and Support Vector Regressions, that are adept at tasks with a large number of variables. They
then add on three highly nonlinear machine learning models, Random Forest, Gradient Boosting,
and Long Short-Term Memory (LSTM) Neural Networks. Random Forest and Gradient
Boosting can both capture complex and hierarchical interactions among the input variables while
the LSTM model is designed to model time-series patterns in the data. By doing so, the
researchers align the machine learning models with the specific characteristics of the data and the task at hand.
Once we have selected our initial set of models, we proceed to execute each model and
evaluate their performance. This process serves the purpose of quantifying the effectiveness of
each model and validating our initial choices. It is important to note that ensemble models have
the ability to combine knowledge from multiple models, often resulting in improved outcomes
compared to relying on a single model alone. Therefore, leveraging an ensemble model allows us
to integrate predictions from the top-performing models. Cao, Jiang, Wang, and Yang (2021) use
an ensemble comprising the three best-performing models as their primary model. By doing so,
they can harness the collective strengths and insights of these models, enhancing the overall
predictive power and reliability of their analysis. Just like feature selection, model selection in
machine learning also relies on domain knowledge. It is particularly crucial when deciding the
appropriate level of analysis to apply the model, such as individual, industry, or market level. In
this regard, researchers must draw upon their deep understanding of the research question and
the specific domain. By leveraging their expertise, they can make informed decisions on the
scope and granularity of the analysis, ensuring that the selected model aligns with the objectives of the study.
Grid search
Grid search can be employed to find the best combination of hyperparameters for a
machine learning model. Grid search derives its name from the grid-like structure it creates, in which each point represents one combination of hyperparameter values.
Implementing grid search involves first identifying the hyperparameters that require
tuning and defining the range of values to explore for each hyperparameter. As an example, we
can specify a list of learning rates like [0.001, 0.005, 0.01] and a list of batch sizes like [30, 50,
60]. By generating a grid that encompasses all possible hyperparameter combinations, the model
can be trained and evaluated for each combination within a loop. Ultimately, the best model is
determined by identifying the hyperparameter combination that yields the highest performance.
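Using the learning rates and batch sizes above, a grid search reduces to a loop over all combinations. The `evaluate` function below is a toy stand-in for training a model and measuring its validation performance.

```python
from itertools import product

learning_rates = [0.001, 0.005, 0.01]
batch_sizes = [30, 50, 60]

def evaluate(lr, batch):
    """Stand-in for training and validating a model; returns a score.
    A real implementation would fit the model and return validation accuracy.
    This toy objective peaks at lr=0.005, batch=50."""
    return -(lr - 0.005) ** 2 - (batch - 50) ** 2 / 1e6

best_score, best_params = float("-inf"), None
for lr, batch in product(learning_rates, batch_sizes):  # the "grid"
    score = evaluate(lr, batch)
    if score > best_score:
        best_score, best_params = score, (lr, batch)
```

The loop visits all 3 × 3 = 9 combinations and keeps the one with the highest score.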
Cross-validation
Cross-validation is a widely used technique for evaluating machine learning model performance. Generally, the available data is divided into three main subsets: the training set, the
validation set, and the test set. The models are trained on the training set, fine-tuned using the
validation set, and then evaluated for accuracy on the test set. This approach enables us to assess
how effectively the trained model generalizes to unseen data, thus mitigating the risk of
overfitting, where the model performs well on the training data but fails to generalize.
Figure 1 illustrates the process of k-fold cross-validation, which is the most common
form of cross-validation. In k-fold cross-validation, the data is divided into k equal-sized folds.
The model is trained k times, with each iteration using k-1 folds as the training set and a different
fold as the validation set. After training the model k times, we obtain k individual evaluation
scores, and the average of these scores represents the overall performance of the model.
Employing k-fold cross-validation allows for efficient utilization of the available data, since every observation is used for both training and validation.
For example, suppose there are 10 million photos of pets, some of which are photos of dogs. We
would like to identify the photos of dogs. What we have is a set of 10,000 photos already labeled
as photos of dogs or not photos of dogs. To apply a five-fold cross-validation, we could divide
the 10,000 labeled photos into five folds with 2,000 in each subset. We then build a classification
model using four folds as the training set and the fifth fold as the validation set. After training the
model five times, we obtain five individual evaluation scores whose average reflects the overall performance of the model.
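The fold-splitting and averaging logic can be sketched as follows. Here `train_and_score` is a placeholder for fitting a model on the training indices and scoring it on the validation indices.

```python
def k_fold_indices(n, k):
    """Split n sample indices into k roughly equal folds."""
    fold_size, folds, start = n // k, [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < n % k else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

def cross_validate(data, k, train_and_score):
    """Train k times, each time holding out one fold for validation,
    and return the average of the k evaluation scores."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, validation in enumerate(folds):
        training = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(training, validation))
    return sum(scores) / k
```

For the 10,000 labeled photos and k = 5, each fold would contain 2,000 indices and the model would be trained five times.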
There are several metrics to evaluate the performance of classification machine learning
models, including accuracy, precision, recall, and F1 score. To gain a better understanding of
these measures, let us start by exploring the confusion matrix and related terms.
A confusion matrix summarizes a classification model’s predictions and their alignment with the actual labels of a test dataset. The confusion matrix displays the number
of four categories of the classification results: true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN). A true positive (TP) refers to a test result where the
model correctly predicts a positive condition, while a true negative (TN) indicates a test result
where the model correctly predicts a negative condition. On the other hand, a false positive (FP)
occurs when the model incorrectly predicts a positive outcome, but the actual outcome is
negative, and a false negative (FN) occurs when the model incorrectly predicts a negative
outcome, but the actual outcome is positive. Figure 2 depicts a 2x2 confusion matrix that summarizes these four categories.
Using the above components of the confusion matrix, we can calculate four evaluation metrics.
Accuracy
Accuracy measures the overall correctness of the model's predictions. It is defined as the
total number of correct predictions (i.e., true positives and true negatives) divided by the total
number of predictions. The higher the accuracy, the better the model is at making correct
predictions.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
Precision focuses on the proportion of true positive predictions out of all positive predictions made by the model. It is defined as the number of true positives divided by the total number of positive predictions. The higher the precision, the lower the likelihood of a model producing false positives.
Precision = TP / (TP + FP)
Recall
Recall evaluates how well the model identifies the true positive cases. It is defined as the
number of true positives divided by the total number of true positives and false negatives.
Recall = TP / (TP + FN)
F1 Score
The F1 score combines precision and recall into a single metric, showing a model's
ability to handle both false positives and false negatives. It is calculated as follows:
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
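The four metrics can be computed directly from the confusion-matrix counts, as in this short sketch:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1
```

For example, with 150 true positives, 9,800 true negatives, 50 false positives, and 50 false negatives, accuracy is about 99.0% while precision and recall are both 75%.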
When working with imbalanced data, we often encounter a challenge called the
“Accuracy Paradox”. This challenge arises when we rely solely on accuracy as a metric, which
can be misleading and lead to incorrect conclusions. In this case, precision is an important metric
to consider.
Let's take a look at an example where we have an imbalanced dataset for spam detection
in email. In this dataset, 98% of the emails are not spam (negative), while only 2% are identified
as spam (positive). Now, let's say we build a classification model specifically designed to detect spam, and on this dataset it produces 150 true positives, 9,800 true negatives, 50 false positives, and 50 false negatives. Its accuracy is:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (150 + 9800) / (150 + 9800 + 50 + 50) = 99.0%.
At first glance, this high accuracy may appear impressive. However, it's important to take
into account the context of the imbalanced dataset we are dealing with. In this dataset, the
majority of emails (approximately 98%) are non-spam, indicating that even if we were to classify
all emails as non-spam, we would still attain a high accuracy due to the significant number of
true negatives. Therefore, when evaluating the performance of the model, it is important to
consider additional metrics that provide a more comprehensive assessment of its effectiveness,
To gain further insights, let's calculate the precision of the model. We will get:
Precision = TP / (TP + FP) = 150 / (150 + 50) = 75%.
In this case, the precision is 75%: among all the emails predicted as spam, only 75% of them are truly spam. This highlights the significance of computing precision for an imbalanced dataset, as the presence of false positives can incur substantial costs or consequences. Thus, precision serves as a valuable evaluation metric, particularly in situations where the dataset is imbalanced.
The methods for textual analysis discussed thus far have primarily relied on word
frequency. Frequency-based techniques ignore syntax and contextual meaning, which can result in the loss of semantic information. For example, a frequency-based algorithm would treat “person” and “people” as two unrelated textual inputs because it is not able to recognize that “person” and “people” share similar meaning. To address this limitation of frequency-based techniques, word-
embedding was developed to incorporate the meaning in textual analysis. This method represents
a word with a semantic vector, or a new bag of words related to the word of interest. Word-
embeddings are based on Zellig Harris’s “distributional hypothesis,” which posits that words
used in proximity to one another typically have similar meanings (Harris, 1954). To create a
semantic vector for a given phrase, such as “cash flow”, a word-embedding algorithm selects
words from surrounding text inputs that can accurately predict the presence of the phrase. For
example, in the text input “earnings present cash flow, which helps future investment” (Figure 1),
“investment”, “earnings”, “present”, “which”, “help”, and “future” are all adjacent words of
“cash flow”. The algorithm would select “investment” and “earnings” for the semantic vector
representing “cash flow”, but not “which” or “help” as “which” or “help” could be adjacent to a
large number of phrases other than “cash flow”. In other words, when “cash flow” is concealed,
“earnings” and “investment” can relatively accurately predict the presence of “cash flow” but
“which” or “help” cannot. Hence, “cash flow” can be represented by the word vector [earnings, investment].
Word-embedding enables algorithms to recognize relationships between words in the same way as humans do. As discussed earlier, a frequency-
based algorithm would fail to identify that “person” and “people” share similar meanings.
However, word-embedding creates comparable semantic vectors for “person” and “people”.
For example, “person” could be represented by a word vector [human, man, woman, men,
women, they] and “people” might be represented by a word vector [human, men, women, they].
Such word-embedding allows an algorithm to recognize that they share similar meaning through
semantic computation.
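A crude count-based sketch of the distributional hypothesis follows: each word is represented by a vector of its context-window co-occurrence counts, and similarity is measured by cosine. Real word-embeddings (e.g., word2vec) are learned by prediction rather than raw counting, so this is only illustrative.

```python
import math
from collections import Counter

def embeddings(sentences, window=2):
    """Represent each word by counts of the words that co-occur near it."""
    vecs = {}
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            ctx = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in set(u) | set(v))
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = embeddings(["the person walked home", "the people walked home"])
```

Because “person” and “people” appear in the same contexts, their vectors are nearly identical, which is exactly the similarity a frequency-based approach cannot see.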
Word-embedding, however, ignores the other rich information contained in sentences, resulting in potential problems in
textual analysis. First, the same word often has multiple meanings depending on the context it is
used in. Word-embedding cannot distinguish between these different meanings of the same
word, such as “liability”, which can mean a responsibility or burden in general language, but
refers to obligations owed to creditors in financial statements. Therefore, the word “liability” should be represented by different semantic vectors in
the two contexts. Second, the meanings of words often change over time. For example, in the
1850s, “broadcast” was commonly used in agriculture, but today it has more associations with
media. Thus, a semantic vector representing “broadcast” from the 1850s would likely include
words such as “sow” and “seed,” while a semantic vector representing “broadcast” for the
modern usage would include such words as “television”, “radio”, and “newspapers”.
Contextualized word-embedding addresses these problems. It can be achieved based on hidden-layer outputs of transformer models.
One of the most advanced tools for natural language processing (NLP) is Bidirectional
Encoder Representations from Transformers (BERT), which was developed by Google. BERT
has its origins in pre-training contextual representations. BERT was trained on two tasks,
namely, language modeling and next sentence prediction, on the Toronto BookCorpus
and English Wikipedia. In language modeling, 15% of words were selected for prediction, and
the training objective was to predict the selected word given its context. The selected word is
masked with probability of 80%, replaced with a random word with probability of 10%, and not
replaced with probability 10%. For example, the sentence “he is nice” has three words and the
third word “nice” is selected for prediction. The input text would be “he is [MASK]” with
probability of 80%, “he is kind” with probability of 10%, and “he is nice” with probability of
10%.
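The corruption scheme can be sketched as follows; the function is illustrative rather than BERT’s actual implementation, which operates on subword tokens.

```python
import random

def mask_tokens(tokens, vocabulary, select_prob=0.15, seed=None):
    """BERT-style input corruption: each selected token is replaced with
    [MASK] 80% of the time, a random vocabulary word 10% of the time,
    and left unchanged 10% of the time."""
    rng = random.Random(seed)
    out, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < select_prob:
            targets[i] = tok          # the model must predict this token
            r = rng.random()
            if r < 0.8:
                out.append("[MASK]")
            elif r < 0.9:
                out.append(rng.choice(vocabulary))
            else:
                out.append(tok)
        else:
            out.append(tok)
    return out, targets
```

Training then asks the model to recover `targets` from the corrupted token sequence, which forces it to use the surrounding context.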
Next Sentence Prediction (NSP) training enables the model to understand how sentences
relate to each other, that is, whether sentence B should precede or follow sentence A. As previously discussed, traditional word-embedding produces a single static representation for each word in the vocabulary. The main advantage of BERT is its use of bi-
directional learning optimized for predicting masked words to gain context of words from both
left to right context and right to left context simultaneously so as to take into account the context
of each occurrence of a given word. For example, it can understand the semantic meanings of
bank in the following sentences: “I went to the bank to deposit a check” and “We walked along
the river bank.” To understand this, BERT uses the right-to-left clue “deposit a check” and the left-to-right clue “river.”
clues. As a result of this training process, BERT learned contextual embeddings for words. Once
the pre-training is complete, the same model can be fine-tuned for a variety of downstream tasks.
BERT is built on the transformer, a deep-learning architecture widely used in natural language processing (NLP). A unique advantage of the transformer is its ability to rely entirely on self-attention to compute representations of its input and output. There are 12 encoders with 12 bi-directional self-attention heads, and 110 million parameters in its “BERT
base” model. “BERT large” employs 24 encoders, 16 bi-directional attention heads, and 340
million parameters.
The Generative Pre-trained Transformer (GPT) series was developed by OpenAI. The GPT series consists of five major versions: GPT-1.0, GPT-2.0, GPT-3.0, GPT-3.5,
and GPT-4.0. GPT-1.0 was the first model released by OpenAI in 2018, representing a
significant breakthrough in the field of natural language processing. It had 117 million
parameters and was pre-trained on a large corpus of text data, making it highly effective at
understanding and generating natural language. GPT-1.0 was primarily used for language
translation, text completion, and question answering tasks. GPT-2.0 was released in 2019. As a
significant improvement over its predecessor, it had 1.5 billion parameters. This allowed it to
perform much more complex natural language processing tasks, including story generation and text summarization. GPT-3.0, released in 2020, was an autoregressive language model using a transformer architecture with 175 billion parameters, making it one of the largest language models of its time. GPT-4 is the latest and largest model in the series. OpenAI has not disclosed its size, but it is widely estimated to be several times larger than GPT-3, with some estimates around one trillion parameters.
Architecture
BERT and GPT use different machine-learning models. As discussed earlier, BERT is
designed for bidirectional context representation, which means it processes text from both left-
to-right and right-to-left, allowing it to capture context from both directions. This allows BERT
to better understand the context and meaning of a sentence. Unlike BERT models, GPT is an
autoregressive model generating text sequentially from left to right, predicting the next word in a
sentence based on the words that came before it. This allows GPT to generate highly coherent and fluent text.
Training data
Both BERT and GPT are pre-trained on a massive corpus of text. The original BERT was trained on the English Wikipedia and BooksCorpus, a dataset containing approximately 11,000 unpublished books, which amounts to about 800 million words. GPT-3, by contrast, was trained on a much larger mixture of sources, including filtered Common Crawl web data, the WebText2 corpus of web pages, two books corpora, and the English Wikipedia.
Pre-training approach
GPT is a generative model, meaning that it is trained to predict the next word in a
sentence or generate a new sentence from scratch. This pre-training approach allows GPT to
excel at tasks such as language generation and text completion. On the other hand, BERT is a
discriminative model, meaning that it is trained to classify whether a given sentence is coherent
or not. This pre-training approach allows BERT to excel at tasks such as sentiment analysis and
text classification.
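The generative side of this contrast can be sketched with a toy autoregressive loop: at each step the model picks the most likely next word given what has been generated so far. The hand-made bigram counts below stand in for the learned probabilities of a real GPT model:

```python
# Invented bigram counts for illustration only.
bigram_counts = {
    "revenue": {"grew": 3, "fell": 1},
    "grew": {"sharply": 2, "modestly": 1},
}

def generate(start, counts, max_len=5):
    """GPT-style left-to-right generation: repeatedly predict the next word."""
    words, w = [start], start
    for _ in range(max_len):
        options = counts.get(w)
        if not options:
            break  # no continuation learned for this word
        w = max(options, key=options.get)  # greedy choice of next word
        words.append(w)
    return " ".join(words)

sentence = generate("revenue", bigram_counts)
# A discriminative model, BERT-style, would instead score a complete
# sentence (e.g., classify whether it is coherent) rather than generate it.
```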
Usability
To use BERT, you will typically download a published implementation (for example, the original Jupyter Notebook for BERT) and set up a development environment such as Google Colab with TensorFlow. If you do not want to work with a notebook or are less technical, you can instead leverage the GPT model through ChatGPT, which is as simple as logging into a website.
Application of GPT
GPT has been widely incorporated into various business applications, resulting in various
domain-specific generative AI applications. For instance, Salesforce developed Einstein GPT, the world’s first generative AI for customer relationship management. Bloomberg also built its own generative AI, BloombergGPT, because the complexity and unique terminology of the financial domain call for a domain-specific model; BloombergGPT is a large-scale artificial intelligence model trained on a wide range of financial data to support a diverse set of NLP tasks within the financial industry. Here are three examples of future applications of GPT in business.
GPT can be used to build language models that generate new text in a given style or context, making it useful for applications such as chatbots, language translation, and content creation. In finance, GPT can help produce reports and other financial documents in a more efficient and accurate manner.
GPT can be used to summarize large volumes of text, such as news articles or research papers, into shorter summaries that capture the most important information. GPT can also be used to analyze the sentiment of a given piece of text, allowing businesses to monitor customer feedback and sentiment towards their products or services. This could be a useful tool for financial analysts.
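GPT summarizes abstractively, by writing new text. A classical frequency-based extractive baseline, the kind such models are often compared against, can be sketched as follows (a toy illustration with an invented mini-document):

```python
import re
from collections import Counter

def summarize(text, n=1):
    """Toy extractive summary: score each sentence by the average document
    frequency of its words, then keep the top-n sentences in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(s):
        toks = re.findall(r"[a-z']+", s.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    top = sorted(sentences, key=score, reverse=True)[:n]
    return " ".join(s for s in sentences if s in top)

doc = ("Revenue grew this quarter. Revenue growth came from cloud revenue. "
       "The weather was pleasant.")
summary = summarize(doc, n=1)
```

The sentence densest in frequent words wins; GPT-style summarization improves on this by paraphrasing and by using context rather than raw counts.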
GPT can also be used to analyze financial data, make predictions about future trends and outcomes, and identify potential risks associated with investments, loans, and other financial products. This can help companies make informed decisions about investments, budgeting, and other financial matters, mitigate risks, identify potential instances of fraud or financial irregularities, and ensure compliance with regulations and industry standards.
In recent years, the investment management industry has experienced a rapid surge in the
adoption of AI and machine learning technologies. These technologies have found applications in
various areas within the industry, such as identifying trading patterns to generate alpha, streamlining customer support and prospect identification, and managing risk exposure. One example is an AI-driven investment fund (marketed as a kind of “AI Warren Buffett”), which utilizes big data from capital markets and applies machine learning to discover correlations and exploit arbitrage opportunities. Another illustration comes from hedge funds that have embraced AI and algorithmic trading. According to a 2018 survey conducted by BarclayHedge of 2,135 hedge fund professionals, 56% of respondents reported using AI/machine learning in their investment strategies, with the primary application being idea generation. At the same time, concerns about the privacy of the data feeding these models have garnered public attention. In response to privacy concerns, federated learning techniques have emerged. Unlike traditional centralized machine learning methods, federated
learning enables AI algorithms to be trained without sharing or transmitting sensitive data.
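The core idea of federated learning can be sketched with federated averaging on a one-parameter model: each client computes an update on its own private data, and the server only ever sees and averages the model weights, never the data. The clients, data, and learning rate below are invented for illustration:

```python
def local_update(w, data, lr=0.1):
    """One gradient step on a client's private data for the model y = w*x,
    minimizing squared error. Raw data never leaves the client."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def federated_average(updates):
    """Server aggregates only model weights, never the underlying data."""
    return sum(updates) / len(updates)

# Two clients whose private datasets are both drawn from y = 2x.
client_a = [(1.0, 2.0), (2.0, 4.0)]
client_b = [(3.0, 6.0)]
w = 0.0
for _ in range(50):
    w = federated_average([local_update(w, client_a),
                           local_update(w, client_b)])
# w converges to 2.0 even though neither dataset was ever shared.
```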
The rapid advancement of AI technologies makes people wonder: will machines replace humans in the near future? Although AI technologies are increasing the capabilities of machines exponentially, humans and machines each have unique competitive advantages. First, humans excel at reasoning-based intelligence, such as logical deduction, which is inherently challenging for AI. As revealed by Cao et al. (2022), AI underperforms humans in analytical tasks with limited data, where reasoning-based intelligence has to be relied upon. On the other hand, probability-based intelligence, the ability to draw inferences from statistical patterns in large amounts of data, is an area where AI excels. Consistent with this notion, Cao et al. (2022) find that AI models can understand and analyze large amounts of numerical and textual data and make decisions based on these data. Consequently, humans might be replaced by AI in tasks that mostly require probability-based intelligence.
Secondly, spiritual pursuit is unique to humans; machines do not possess it. While AI cannot understand spiritual pursuit, it can assist in creating related artifacts, such as constructing a church. Similarly, humans have emotions, but AI does not have emotions.
This is one of the most significant differences between humans and machines. Machines can
provide objects that help humans generate positive emotions, such as a humanoid robot that can
replace a deceased loved one and provide companionship. However, AI is unable to replace
humans in tasks that involve emotions such as artistic creation and expression.
Thirdly, curiosity is a crucial trait of humans that machines currently do not possess.
Curiosity motivates humans to acquire and accumulate knowledge that allows the development
and advancement of AI technologies. If AI becomes curious one day, it might be able to acquire and advance knowledge on its own.
In conclusion, AI has the potential to replace humans in certain tasks, especially those relying on probability-based intelligence, but it cannot replace humans in tasks that involve unique human traits, such as reasoning-based intelligence, spiritual pursuit, emotions, and curiosity.
References
Cao, S., Jiang, W., Wang, J. L., and Yang, B. 2021. From Man vs. Machine to Man + Machine: The Art and AI of Stock Analyses. Working paper.
Most U.S. public companies host a quarterly conference call, typically within a month
after the end of the fiscal quarter, to discuss their financial performance with investors.
Participants of these calls include key company executives, investors, and financial analysts.
During a conference call, company executives review financial information and discuss major
issues impacting company performance in the previous quarter. They also provide insights into
the company’s expectations for the upcoming quarters. Semi-formal presentations by company
executives are followed by question-and-answer sessions where investors and financial analysts
can ask questions about any area that requires further elaboration.
In the past, earnings conference calls were only available to professional financial
analysts and institutional investors. Nowadays, almost all public companies use online streaming to broadcast conference calls to average investors, or provide audio recordings that can be accessed on demand. Furthermore, various online stock research sites offer access to earnings conference call transcripts. The widespread access to earnings conference call audio and transcripts creates opportunities to employ AI and machine learning methods to perform timely, large-scale analyses of these calls.
As an example, let us take a look at the transcript of Microsoft’s earnings conference call
for the first quarter of fiscal year 2023. The call was held on October 25, 2022. Brett Iversen, the
Vice President of Investor Relations at Microsoft, hosted the call, and other Microsoft
participants included Satya Nadella, the CEO; Amy Hood, the CFO; Alice Jolla, the chief accounting officer; and Keith Dolliver, the deputy general counsel.
Following Brett Iversen’s introduction of the Microsoft participants and overview of the
structure and principles of the earnings conference call, Satya Nadella, the CEO, took over to review the company’s overall performance, the progress of major business units such as Microsoft Cloud, and expectations for the upcoming quarters.
After Satya Nadella’s high-level summary, the CFO, Amy Hood, shared detailed
financial information and provided her interpretation of the data from the company’s perspective.
For example, she explained that increased operating expenses were primarily driven by the
growth in headcount, while a shift in sales mix and foreign exchange impact negatively affected
the operating margin. She also provided an outlook for the second quarter of the fiscal year, both for the company as a whole and for individual business segments.
Finally, the floor was open to questions from the other participants in the earnings
conference call. During this call, eight investors and analysts asked questions, covering both Microsoft’s operating and financial decisions. For instance, an analyst from Morgan Stanley asked for elaboration on how Microsoft formed its outlook guidance; other analysts and
investors asked about future plans for various business segments, such as Microsoft Cloud,
Windows, and advertising, in light of past performance of Microsoft and its competitors.
Chapter 2 and Chapter 3 introduce three textual analysis methods, namely the Bag-of-Words approach, phrase-level word embedding, and sentence-level word embedding. The Bag-of-Words approach is a frequency-based method that summarizes textual data with numeric information but disregards the meanings and contexts of words. Phrase-level word embedding takes word meanings into account and can recognize different words that share similar meanings. However, it cannot handle scenarios where the same word is used with different meanings in different sentences. Sentence-level word embedding, on the other hand, leads to algorithms that can recognize not only different words sharing similar meanings but also different
meanings of the same word in different sentences. Nevertheless, all of these methods consider only the relationships among individual words, disregarding the grammatical structure of sentences.
In linguistics, the words in a sentence are classified into parts of speech (e.g., nouns, verbs, adverbs, and so on) and are connected to each other to form certain dependency relationships. As
adverbs, and so on) and are connected to each other to form certain dependency relationships. As
an example, in the sentence, “Undeterred by the bad weather, we have experienced great sales
growth this quarter,” the word “weather” is a noun modified by the adjective “bad,” while the
noun “growth” is modified by the adjective “great.” Table 1 provides a summary of the most common dependency relationships.
A dependency parser is a data analytics tool used to analyze the grammatical structure of
a sentence. It can identify the “head” word in a sentence and the words that modify it. The Natural Language Processing (NLP) group at Stanford University has developed a widely used neural-network dependency parser.
As discussed previously, a transfer-learning model is pre-trained on one task and then fine-tuned for a different, related task. The idea behind transfer learning is that a model trained on a large and diverse dataset can be re-purposed for other tasks. In contrast, a neural model is a type of machine learning model designed to mimic the structure and function of the human brain. These models are composed of interconnected layers of artificial neurons, and they are trained on large amounts of data to learn the patterns in those data.
The parser builds a parse by performing a linear-time scan over the words of a sentence. At every step, it maintains a partial parse, a stack of words currently being processed, and a buffer of words yet to be processed, and it uses a neural network classifier to determine the grammatical relationships among the words. The classifier is trained using
an oracle. Specifically, the researchers gathered a sample of three million words from various sources, such as Wall Street Journal articles, IBM computer manuals, nursing notes, and transcribed telephone conversations. The oracle takes each sentence in the training data and produces many training examples. The neural network is trained on these examples using adaptive gradient descent (AdaGrad) with hidden-unit dropout. The researchers divided the three million words into ten groups: 90 percent of the sample was used to train a model, and the trained model was then used to predict the dependency relationships of the remaining 10 percent to evaluate the accuracy of the parser. This process is repeated across the ten groups, in the spirit of ten-fold cross-validation, to refine and enhance the robustness of the parser.
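The stack-and-buffer mechanics can be illustrated with a toy arc-standard transition system. The action sequence is scripted here; in the real parser, a neural classifier chooses each action:

```python
def parse(words, actions):
    """Arc-standard transition parsing: maintain a stack and a buffer, and
    apply SHIFT / LEFT / RIGHT actions to build (head, dependent) arcs."""
    stack, buffer, arcs = [], list(words), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))       # move next word onto the stack
        elif act == "LEFT":                   # second-from-top depends on top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))     # (head, dependent)
        elif act == "RIGHT":                  # top depends on second-from-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# "bad weather": the adjective "bad" modifies the head noun "weather".
arcs = parse(["bad", "weather"], ["SHIFT", "SHIFT", "LEFT"])
```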
The standard dependency parser has distinct advantages over other textual analysis methods. For example, consider the sentence “Undeterred by the bad weather, we have experienced great sales growth this quarter.” A frequency-based bag-of-words approach can only identify one positive word and one negative word, whereas the standard dependency parser can discern that “bad,” which modifies “weather,” is unrelated to firm performance, while “great,” which modifies “growth,” directly describes performance. Consider another sentence: “Our goal of expanding in countries and regions that are not exposed to extreme weather events is not unreasonable.” The frequency-based bag-of-words approach would only identify two negative words, “unreasonable” and “exposed.” By contrast, the standard dependency parser can appreciate that “not” and “unreasonable” together form a double negative that modifies “goal,” indicating a positive sentiment related to firm performance. Meanwhile, “not” and “exposed” form another double negative that modifies “countries and regions.” This too is a positive sentiment, but unlike “goal,” “countries and regions” does not refer directly to firm performance.
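The difference between plain word counting and negation-aware scoring can be sketched in a few lines. The word lists are invented for illustration, and the two-token negation window is a crude stand-in for the true dependency scope a parser recovers:

```python
NEGATIVE = {"bad", "unreasonable", "exposed"}
POSITIVE = {"great"}
NEGATORS = {"not", "no", "never"}

def naive_score(tokens):
    """Plain bag-of-words: +1 per positive word, -1 per negative word."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)

def negation_aware_score(tokens):
    """Flip the polarity of a sentiment word when a negator appears in the
    two preceding tokens -- a rough approximation of dependency scope."""
    score = 0
    for i, t in enumerate(tokens):
        if t in POSITIVE or t in NEGATIVE:
            s = 1 if t in POSITIVE else -1
            if any(p in NEGATORS for p in tokens[max(0, i - 2):i]):
                s = -s  # double negative reads as positive
            score += s
    return score

tokens = "our goal is not unreasonable".split()
# naive counting sees only "unreasonable" and scores the phrase negative;
# the negation-aware version recovers the positive reading.
```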
4.3. Empirical example: Contrasting earnings conference calls and expert network calls
Earnings conference calls were long a major source from which professional analysts and institutional investors acquired private information. Regulation Fair Disclosure, adopted in 2000, prohibited public companies from selectively disclosing material information, which encouraged investors to seek out unique information sources to gain an investing edge. Subsequently, regulatory concerns about conflicts of interest at sell-side research departments led to the Global Analyst Research Settlement in 2003, which resulted in reduced analyst coverage. This period also coincided with considerable growth in hedge funds, which possess considerable financial resources and demand valuable firm information to make profitable investment decisions. These forces combined to fuel demand for alternative research services such as expert networks.
The expert network industry consists of firms that work to recruit and connect subject
matter experts with clients seeking to do deep dive research on a company or market segment.
The standard engagement is a 45-60 minute question and answer discussion between the expert
and the client. Expert network firms often generate recordings and transcripts of these calls for
compliance purposes. The availability of call transcripts has allowed expert network firms to offer transcript libraries as an additional research product.
Cao, Green, Lei, and Zhang (2023) compare the content of expert network calls and earnings conference calls. Specifically, they use a Latent Dirichlet Allocation (LDA) model to determine whether the distribution of covered topics significantly differs across call types. Early LDA models identify topics based solely on word co-occurrences, but this approach often generates topics that are difficult to interpret. Seeded LDA leverages both knowledge-based and frequency-based seed words to determine interpretable pre-defined topics and classify textual contents into these topics based on a seed word dictionary. Knowledge-based seed words are selected based on researchers’ knowledge of the field, and frequency-based seed words are chosen based on how frequently they appear in the corpus.
Cao, Green, Lei, and Zhang (2023) identify seven common topics that emerge from expert network calls: Competition, Consumer, Financial, Product, Operation, Strategy, and Technology. They then obtain a list of the 100 most frequent non-stop words in the expert network call transcripts. From this list, 50 knowledge-based root words are selected and classified into the seven topics.
To compare expert calls with earnings calls, they classify earnings calls using the same
topics and seed word dictionary. As shown in Figure 7, they find that Financial discussions
comprise the most common topic for earnings conference calls, present in 33.2% of calls,
whereas this topic is considerably less prevalent in expert network calls (9.3%). Instead, expert calls are more likely to emphasize Technology (20.9% of expert calls vs. 6.6% of earnings calls), suggesting that expert calls are less oriented towards financial statement information that is widely
available and more geared towards understanding industry segments and trends, which require expert insights.
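The seed-word classification step can be approximated with simple counting. The seed lists below are invented for illustration and are not the paper’s actual dictionary, and real seeded LDA also infers topic-word distributions rather than just tallying hits:

```python
import re
from collections import Counter

SEED_WORDS = {  # hypothetical seed dictionary for three of the seven topics
    "Financial": {"revenue", "margin", "earnings", "guidance"},
    "Technology": {"cloud", "software", "platform", "ai"},
    "Competition": {"competitor", "rival", "share"},
}

def topic_shares(text):
    """Classify a transcript into topics by counting seed-word hits and
    normalizing to shares -- a simplified stand-in for seeded LDA."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = Counter()
    for t in tokens:
        for topic, seeds in SEED_WORDS.items():
            if t in seeds:
                hits[topic] += 1
    total = sum(hits.values()) or 1
    return {topic: hits[topic] / total for topic in SEED_WORDS}

call = "Cloud revenue grew; the platform gained share against a rival."
shares = topic_shares(call)
```

Comparing the resulting topic shares across the two transcript types is the intuition behind the Figure 7 comparison above.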
Appendix 4: Applying GPT to analyze conference call transcripts using both API and web
interface
Applying GPT to analyze conference call transcripts
Download five earnings conference calls and extract the CEO’s prepared remarks. Use the OpenAI API to perform LDA-style topic modeling and classify the remarks into an appropriate number of topics. Report the topics, their weights in the CEO’s prepared remarks, and the five most frequent words in each topic. Then use the ChatGPT web interface to perform the same task and compare the outputs.
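A minimal API sketch for this exercise is shown below, assuming the `openai` Python package (v1 interface) and an `OPENAI_API_KEY` environment variable; the prompt wording and model name are assumptions to tune for your transcripts:

```python
import os

def build_prompt(remark, n_topics=5):
    """Assemble the instruction sent to the model (wording is an assumption)."""
    return (
        f"Identify up to {n_topics} topics in the CEO remark below. "
        "For each topic, report its approximate weight and the five most "
        "frequent words.\n\nRemark:\n" + remark
    )

def analyze(remark):
    """Send the prompt to OpenAI's chat completions endpoint."""
    from openai import OpenAI  # imported lazily so the sketch loads without it
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model works
        messages=[{"role": "user", "content": build_prompt(remark)}],
    )
    return resp.choices[0].message.content
```

Note that GPT does not literally fit an LDA model; it identifies topics from the prompt, which is why comparing its output against the web interface (and against a statistical LDA run) is instructive.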
References
Cao, S., Green, C., Lei, L., and Zhang, S. 2023. Expert Network Calls. Working paper.
The SEC requires public companies to report certain material corporate events to their shareholders on a more current basis using Form 8-K, the “current report.” The types of information that trigger Form 8-K filings are generally significant to investors, and companies must report them promptly rather than waiting until the end of a fiscal period to file a Form 10-Q or a Form 10-K.
In March 2004, the SEC adopted sweeping changes to the Form 8-K disclosure
requirements. Companies must make 8-K disclosures within four business days of the triggering
event and, in some cases, even earlier. The revised rules added new items and events to be disclosed in Form 8-K and require that Form 8-K be filed within four business days. The rest of this section reviews each type of information disclosed in Form 8-K and provides examples of 8-K filings.
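The four-business-day deadline can be sketched as a small date calculation. This sketch skips weekends only; a production version must also skip federal holidays, which the SEC excludes from business days:

```python
from datetime import date, timedelta

def form_8k_deadline(trigger: date) -> date:
    """Four business days after the triggering event, skipping weekends
    (federal holidays are ignored in this sketch)."""
    d, remaining = trigger, 4
    while remaining:
        d += timedelta(days=1)
        if d.weekday() < 5:  # Monday-Friday
            remaining -= 1
    return d

# An event on Thursday 2022-09-01: four weekdays later is Wednesday
# 2022-09-07 (a real calculation would also skip the Labor Day holiday).
deadline = form_8k_deadline(date(2022, 9, 1))
```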
Item 1.01 Entry into a Material Definitive Agreement
This item pertains to business agreements that are outside the ordinary course of business.
The item is also triggered by material amendments to those agreements. For instance, if a
company secures a substantial loan from a bank or enters into a long-term lease that is material
to the company, the agreement must be reported under Item 1.01 by filing a current report.
However, if a retailer with an established chain of stores signs a lease for one additional store,
the new lease generally would be in the ordinary course of business and would not be reported
here. The required disclosure includes the date on which the agreement was entered into or
amended, the identity of the parties to the agreement or amendment, and a brief description of
the terms and conditions of the agreement or amendment. Figure 1 provides an example of Item
1.01 filed by Amazon Inc. In this Form 8-K report, Amazon Inc. disclosed that the company
entered into a credit agreement with Bank of America, N.A. on September 5, 2014. The
agreement provided Amazon Inc. with a credit facility with a borrowing capacity of up to $2
billion at an initial interest rate of the London interbank offered rate (LIBOR) plus 0.625%.
Item 1.02 Termination of a Material Definitive Agreement
This item concerns the termination of material business agreements. For example, if a
company procures most of its raw material through a long-term procurement agreement with one
significant supplier, and that supplier terminates the agreement, the termination of the agreement
must be reported under this item. In contrast, if the agreement expires according to its terms, the
termination need not be reported on Form 8-K. The required disclosure includes the date of the
termination of the material definitive agreement, the identity of the parties to the agreement, a
brief description of the terms and conditions of the agreement, a brief description of the material
circumstances causing the termination, and any material early termination penalties.
Item 1.03 Bankruptcy or Receivership
In the event of a potential bankruptcy, a company must disclose this information on Form
8-K along with its plan for reorganization (under Chapter 11) or liquidation (under Chapter 7).
Such information is particularly important for shareholders as they need to evaluate their
potential losses and the likelihood that the company could emerge from bankruptcy.
Item 2.01 Completion of Acquisition or Disposition of Assets
If a company completes a material acquisition or disposition of assets, such as merging with another company or selling a business unit, the company must file an 8-K to report the transaction.
Item 2.02 Results of Operations and Financial Condition
Many companies announce their quarterly and annual results simultaneously in an 8-K
filing. If the company will hold an earnings conference call, it is announced in the 8-K filing as
well. As shown in Figure 2, Amazon Inc. filed a Form 8-K along with its announcement of its
fourth quarter 2020 and year ended December 31, 2020 financial results.
Item 2.03 Creation of a Direct Financial Obligation or an Obligation under an Off-Balance Sheet Arrangement of a Registrant
Companies must report the basic terms of material financial obligations, for example, any
long-term debt, capital or operating lease, and short-term debt outside the ordinary course of
business. The required disclosure includes the date on which the company became obligated on
the direct financial obligation, a brief description of the transaction creating the obligation, the
amount of the direct financial obligation, and a brief description of the other terms and
conditions of the transaction. Figure 3 provides an example of Item 2.03 in a Form 8-K filed by
USHG Acquisition Corp. on March 29, 2022, which reveals a direct financial obligation the company incurred on that date.
Item 2.04 Triggering Events That Accelerate or Increase a Direct Financial Obligation or an Obligation under an Off-Balance Sheet Arrangement
This item refers to such events as loan defaults and any other events that accelerate or
increase financial obligations. For example, if a company defaults on a loan, its creditors can
demand immediate payment of the entire outstanding amount. In such a case, the company must
disclose the date of the triggering event, a brief description of the triggering event, the amount to
be repaid, the repayment terms, and any other financial obligations that might arise from the
initial default.
Item 2.05 Costs Associated with Exit or Disposal Activities
This item requires companies to disclose material charges associated with restructuring
plans. The required disclosure includes the date of the commitment to the exit or disposal
activities, a description of the plan, and an estimate of the total cost expected to be incurred.
Item 2.06 Material Impairments
A company must disclose certain material impairments under this item, including the date
of the conclusion that a material charge is required, a description of the impaired assets, the facts
leading to the conclusion, and an estimate of the amount of the impairment charge.
Item 3.01 Notice of Delisting or Failure to Satisfy a Continued Listing Rule or Standard; Transfer of Listing
This item mandates companies to disclose the delisting of their stock if they receive
notification from the stock exchange that they no longer meet the requirements for continued
listing. A company receiving this type of notice must disclose the date it received the notice, the rule or standard for continued listing that the company fails to satisfy, and any responsive action the company plans to take.
Item 3.02 Unregistered Sales of Equity Securities
Companies are required to disclose private issuances of securities exceeding one percent of their outstanding shares under this item.
Item 3.03 Material Modification to Rights of Security Holders
This item requires firms to disclose material changes to the rights of shareholders, or material limitations on those rights, that result from the issuance or modification of securities.
Item 4.01 Changes in Registrant’s Certifying Accountant
Changes in auditors can raise concerns regarding the integrity of financial statements. As
a result, companies must disclose any changes in their independent auditor regardless of whether
the independent auditor is involuntarily dismissed, voluntarily resigns, or declines to stand for
reappointment. The company also needs to disclose in Form 8-K if it hires a new auditor.
Item 4.02 Non-Reliance on Previously Issued Financial Statements or a Related Audit Report or Completed Interim Review
This item requires companies to inform information users when previously issued
financial statements contain errors or when previously issued audit reports or interim reviews on
financial statements should no longer be relied upon. The item requires the disclosure of the date of the conclusion regarding the non-reliance, identification of the financial statements and years or periods covered, and a brief description of the facts underlying the conclusion.
Item 5.01 Changes in Control of Registrant
Whenever there is a change in control of the registrant, the company must disclose the persons
who have acquired control and any arrangements between the old and new control groups.
Item 5.02 Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers
A company must disclose any changes to the board of directors or high-level executive
officers. In addition, the company must disclose any changes to the compensation of current
high-level officers. The required disclosure includes the date of the director’s resignation, refusal
to stand for re-election or removal, any positions held by the director on any committee of the
board of directors at the time, and a brief description of the circumstances representing the
disagreement that management believes caused the director’s departure. Figure 4 shows an Item 5.02 disclosure filed by WeTrade Group Inc. relating to the resignation of its Chief Executive Officer on September 1, 2020.
Item 5.03 Amendments to Articles of Incorporation or Bylaws; Change in Fiscal Year
If a company amends its articles of incorporation or bylaws or changes its fiscal year, the company must disclose the change under this item.
Item 5.04 Temporary Suspension of Trading Under Registrant’s Employee Benefit Plans
If a company temporarily suspends trading under its employee benefit plans (for example, during a blackout period in which plan participants cannot trade company securities), it must disclose the suspension under this item.
Item 5.05 Amendments to the Registrant’s Code of Ethics, or Waiver of a Provision of the Code
of Ethics
If a company changes the code of ethics that applies to its high-level officers, it must disclose such changes. The company must also disclose any waivers of the code granted to high-level officers.
Item 5.06 Change in Shell Status
Companies must file a Form 8-K under Item 5.06 when the company completes a
transaction that causes it to cease being a shell company. In the example in Figure 5, WeTrade
Group Inc. disclosed that the company ceased to be a shell company as a result of
commencement of regular revenue-generating operations and control of the WePay System.
Item 5.07 Submission of Matters to a Vote of Security Holders
Companies must disclose the results of shareholder votes at annual or special meetings under this item.
Item 7.01 Regulation FD Disclosure
Companies disclose material events under this item to comply with the requirements of
Regulation FD. Regulation FD requires companies to provide material information to the public
at the same time as they provide it to others. If a company discloses certain information to some
institutional investors and financial analysts during an investor event, it can file a Form 8-K
under this item to share the same information with the public. Please see Figure 6 for an example of such a filing.
Item 8.01 Other Events
If a company believes an event is important but does not fall into any other category, the company can disclose the event under Item 8.01; Figure 7 provides an example from Facebook Inc.
Item 9.01 Financial Statements and Exhibits
Under this item, a company lists the financial statements, pro forma financial information, and exhibits that it has filed. For example, if a company discloses in Item 2.01 that it has acquired a
business, Item 9.01 would require the company to provide the financial statements of the
business. In addition, the company must present “pro forma” financial statements that
demonstrate what the company’s financial results might have been if the transaction had
occurred earlier. Similarly, if the company discloses in Item 1.01 that it has entered into a material definitive agreement, Item 9.01 would require the company to file the agreement as an exhibit.
The relation between competition and corporate disclosure is nuanced, due to the multifaceted nature of both competition and disclosure. For example, Amazon
competes with Walmart for customers and distribution channels, but competes with Google in
information technology; Intel competes with ARM in CPU architecture design but competes with
Samsung in mobile CPU sales. In terms of disclosure, companies provide a variety of types of disclosures, such as product announcements and management earnings forecasts. The impact of competition on corporate disclosure may therefore depend on the specific types of competition and disclosure involved.
Cao, Ma, Tucker, and Wan (2018) construct a measure of technological competition, “technological peer pressure” (TPP), that captures the aggregate technological advances of
companies that compete with a given company in the product market relative to the given
company’s own technological preparedness. This type of competition is aligned with product
disclosure which reveals where the firm invests in technology for product development and
improvement, as well as how these investments have progressed. Product disclosure is quantified
with the number of words in product-disclosure press releases issued by a company. Using the
two measures, Cao et al. (2018) find that TPP has a significantly negative and economically strong association with product disclosure: a company that moves from the lowest decile of TPP to the highest decile reduces its product disclosure by 44.7%. In contrast, they find that TPP is not associated with the frequency of management earnings forecasts, a type of disclosure unrelated to technological competition.
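The word-count quantification of product disclosure lends itself to a very simple computation. Below is a minimal sketch, assuming plain-text press releases; the function names and sample texts are illustrative, not taken from Cao et al. (2018):

```python
import re

def count_words(text: str) -> int:
    """Count word tokens (hyphenated terms count as separate words)."""
    return len(re.findall(r"\b\w+\b", text))

def product_disclosure(press_releases: list[str]) -> int:
    """Firm-level product disclosure: total words across product press releases."""
    return sum(count_words(pr) for pr in press_releases)

# Hypothetical press releases for a fictitious firm
releases = [
    "Acme launches its next-generation widget with improved battery life.",
    "Acme announces a software update for the widget platform.",
]
print(product_disclosure(releases))  # → 19
```

In practice such counts would be computed per firm-year and normalized, but the core measure is just this word tally.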
References
Cao, S., Ma, G., Tucker, J., and Wan, C. 2018. Technological Peer Pressure and Product Disclosure. The Accounting Review.
Social media encompasses any digital technology that facilitates the sharing of ideas, thoughts, and
information through virtual networks and communities. Anyone with an internet connection can
make a social media profile and can use that profile to post nearly any content they like. Hence,
personalized profiles and user-generated content are characteristics of social media platforms.
Emerging in the late 1970s, social media originated as a means for people to
interact with friends, family, and shared interest communities. Nowadays, social media serves as
a platform for people to find career opportunities, make romantic connections, and share their
own insights and perspectives online. In addition, businesses use social media for advertising,
customer communication, and increasing brand awareness. As of October 2021, more than 4.5
billion people worldwide use social media. Its capacity to instantly post photographs, share
viewpoints, and record events has revolutionized the way people live and do business.
There are various types of social media platforms offering a variety of services. Social
networks like Facebook and Snapchat allow users to share ideas, opinions, and content with
other users, and hence most content on social networks consists of text, images, or a combination
of the two. Media networks like YouTube and TikTok facilitate the sharing of media assets such
as images, videos, and other content. Review networks like Yelp enable the evaluation of
products and services. Discussion networks such as Reddit and Quora provide a forum for people
to discuss problems, ask questions, and debate issues. Finally, business platforms like LinkedIn,
Glassdoor, and Blind enable professionals to network and collaborate with other professionals or
with potential clients. Table 1 lists the most popular social media platforms by category.
Table 1. Popular social media platforms by category
Type      Social Media Platforms      Data Purpose
Video     YouTube, TikTok, Twitch     Broadcast live video to viewers
Social media data can be used in a variety of ways. Many social media users are aware that platforms collect user data to customize
advertisements; retailers and artisans use social media to market their products to a global
audience. However, there are other less visible ways in which capital market participants, such as
financial analysts and investors, use social media. With the help of AI and machine learning
technologies, they are exploring new ways to transform the vast amounts of data generated by
social media users into financial knowledge. At the same time, companies are leveraging social
media to their own advantage, using it as an alternative channel with significantly less oversight
for disclosing information and communicating with the market. This trend has garnered a lot of
interest among scholars of finance. In this section, we will examine how specific platforms are used by companies and capital market participants.
6.2.1. Twitter
Twitter is a social media platform on which users post and interact with messages, media, and images contained in "tweets." Users who make their
own profiles can post, like, and retweet tweets, while unregistered users are limited to reading
some public tweets. At first, tweets could only be up to 140 characters, but the limit was doubled
to 280 in November 2017. Audio and video tweets remain limited to 140 seconds for most
accounts. Figure 1 shows a tweet from Microsoft on October 14, 2018. The tweet summarizes
Microsoft’s first-quarter financial performance, including revenue, income, and earnings per
share. The tweet has attracted enormous attention from Twitter users as evidenced by 594
Twitter users have produced vast amounts of information in their tweets, leading
researchers to ask whether that user-generated content has informational value in the context of
business and finance. Some studies conducted over the past few years have found that, in the
aggregate, certain types of tweets actually have predictive power. In one such study, Bartov,
Faurel, and Mohanram (2018) find that the aggregate opinion from individual tweets predicts a
firm’s forthcoming quarterly earnings and announcement returns. Relatedly, Tang (2018) shows
that information in tweets predicts a firm's sales. The predictive power is greater for firms whose major customers are consumers rather than businesses.
In recent years, companies have increasingly sought to harness the power of Twitter to
achieve strategic purposes, a phenomenon that has also piqued scholarly interest. Blankespoor,
Miller, and White (2014), for example, examine whether firms can reach more investors and thus
reduce information asymmetry by tweeting news. Using a sample of technology firms, they find
that using Twitter to send market participants links to press releases indeed helps disseminate the news and reduce information asymmetry.
Jung, Naughton, Tahoun, and Wang (2018) find that companies disseminate news via Twitter
strategically. Specifically, they are less likely to tweet negative earnings news than positive
earnings news. The incentives to disseminate information strategically are stronger for firms with
a lower level of investor sophistication, a larger social media audience, and higher litigation risk.
At the same time, the autonomous nature of social media imposes challenges on
companies as well because they have very limited control over what information or opinions
people share on these social media platforms. Lee, Hutton, and Shu (2015) find that a
corporation’s use of social media can help counterbalance negative price reactions to recall
announcements. However, they also observed that, with the arrival of Facebook and Twitter,
firms relinquished a certain amount of control over their social media content, and the
attenuation benefits of corporate social media, while still significant, lessened. In other words,
when a company recalls a product, there is almost always a negative price reaction. The more the
company tweets on its own behalf, the smaller the negative reaction; the more other people tweet about the recall, the larger it becomes.
6.2.2. Glassdoor.com
Glassdoor.com is a large recruiting platform where users can search and apply for jobs. In
addition to information about job postings, Glassdoor also hosts a social media platform through
which current and former employees can anonymously review companies across a variety of
criteria, including internal CEO approval ratings, salary data, interview difficulty and questions,
compensation and benefits assessments, company outlook, etc. Glassdoor thus provides insider
information from employees who voluntarily share their opinions on the companies they work for. By aggregating a company's employees' comments and opinions about the company, can we extract useful signals about its
management? Using data from Glassdoor, Hales, Moon, and Swenson (2018) find that employee
opinions are useful in predicting earnings growth and management forecast news. Huang, Li,
and Markov (2020) document that the average employee outlook is incrementally informative in
predicting future operating performance, particularly when the disclosures are aggregated from a
larger, more diverse, and more knowledgeable employee base. Interestingly, the average outlook
predicts bad news events more strongly than good news events.
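As a stylized sketch of how review-level outlooks might be aggregated into a firm-level signal (the rating scale, tuple layout, and sample data below are hypothetical, not Glassdoor's actual schema):

```python
# Hypothetical review records: (outlook on a -1/0/+1 scale, is_current_employee)
reviews = [(1, True), (0, True), (-1, False), (1, True), (-1, True)]

def average_outlook(reviews, current_only=False):
    """Firm-level average employee outlook; optionally restrict to current employees."""
    vals = [outlook for outlook, current in reviews if current or not current_only]
    return sum(vals) / len(vals) if vals else None

print(average_outlook(reviews))                     # all reviewers → 0.0
print(average_outlook(reviews, current_only=True))  # current employees only → 0.25
```

Studies such as Huang, Li, and Markov (2020) work with far richer data and controls; the point here is only that the firm-level signal is an aggregation over many individual disclosures.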
In addition to this predictive power, Dube and Zhu (2021) show that employee opinions voiced on social media discipline firms into improving workplace practices such as employee relations and diversity. Such improvement is more pronounced among firms with negative initial
reviews and high labor intensity. Those improvements in workplace practices pay off financially.
Green, Huang, Wen, and Zhou (2019) find that companies experiencing improvements in
employee opinions significantly outperform firms with declines. The return effect is concentrated
among reviews from current employees. Furthermore, changes in an employer’s rating are
6.2.3. Stock Message Boards
Message boards have been a feature of digital life since the debut of USENET in 1979.
The main function of message boards is to provide a forum where readers and users can share
their thoughts and interact with people who share similar interests or have specialized
knowledge in a particular field. Stock message boards give investors an opportunity to connect
with other investors at all levels of expertise and learn more about profitable investing
strategies. Many stock message boards focus on a specific topic or group of topics, such
as investing in options, precious metals, exchange traded funds (ETFs), or commodities. Figure
3 shows the stock message board, Seeking Alpha. The topics discussed on Seeking Alpha cover
basic materials, bonds, closed end funds, commodities, cryptocurrency, etc. Seeking Alpha
users can share and exchange opinions with other users by posting or commenting on analysis
articles.
Stock message boards provide investors a platform to exchange ideas and opinions,
potentially generating “the wisdom of crowds,” that is, the idea that the aggregate perspective of
a large group of people is sometimes more accurate than that of a single expert. If that is true,
then “crowd wisdom” from stock message boards might help investors to better predict company
future performance. Drake, Moon, Twedt, and Warren (2022) examine the stock message board Seeking Alpha. They find that the market reaction to sell-side analyst research is substantially
reduced when the analyst research is preceded by reports from “social media analysts”
(SMAs)—individuals posting equity research online via social media investment platforms, and
that this is particularly true of sell-side analysts’ earnings forecasts. They further find that this
effect is more pronounced when SMA reports contain more decision-useful language, are
produced by SMAs with greater expertise, and relate to firms with greater retail investor
ownership. They suggest that the attenuated response to sell-side research is most likely due to SMA reports preempting the information in sell-side analysts' research.
As with many issues discussed on social media, one major concern about stock message
boards relates to the accuracy and credibility of the content producers, which determines the
legitimacy of the content itself. To shed light on this issue, Campbell, DeAngelis, and Moon
(2019) investigate whether stock holding positions by SMAs have a negative effect on analyst
objectivity. They find no evidence that an SMA’s position reduces investor responses to the
SMA’s posting. In fact, they show that holding a position magnifies investor responses to the
SMA's article. Their findings suggest that SMAs' stock positions do not decrease the credibility of their postings.
6.2.4 YouTube
YouTube is a social media platform that allows users to upload, share, store, and play
back videos. Launched in 2005, YouTube has become enormously popular, reporting 2.5 billion
monthly users in June 2022. In fact, YouTube is the second most visited website on the internet,
and its visitors watch more than a billion hours of videos per day. As of May 2019, more than 500 hours of video were uploaded to YouTube every minute.
Given the massive amount of content hosted on YouTube, it makes sense on the one hand
that researchers would be highly interested in exploiting the data therein, but on the other hand
that they would need powerful technological tools to process video content. Hu and Ma (2022)
are among the first to collect startups' self-introductory pitch videos from YouTube and another video-sharing website, Vimeo. Using machine learning algorithms to process these pitch videos, they
are able to measure the persuasiveness of delivery in start-up pitches from visual, vocal, and
verbal dimensions. They find that passionate and warm pitches increase funding probability.
6.2.5 LinkedIn
Some social media platforms are built on real-world relations. While you probably don't know everyone on r/WallStreetBets, there's a
good chance that your Facebook friends are people you have actually met. Networking-based
social media platforms thus allow researchers to infer real-world networks from relations online.
For example, LinkedIn is a social network specifically designed for career and business
professionals to connect. LinkedIn users create their professional profiles by sharing their
educational backgrounds, employment histories, skills, and career interests. Unlike other social
networks in which you might become "friends" with anyone and everyone, LinkedIn is about
building strategic relationships. As of 2020, over 722 million professionals use LinkedIn to
cultivate their careers and businesses. This rich dataset of professional profiles offers a unique
opportunity to peek into capital market participants’ social networks in the real world.
Of all the networks on LinkedIn, researchers are particularly interested in the impact of
connections among financial analysts, fund managers, and corporate executives. Jiang, Wang,
and Wang (2018) use professional profiles posted on LinkedIn to identify revolving rating
analysts with structured finance rating experience. They find that the more companies issuing
debt securities employ such analysts, the more likely the ratings of their debt securities are inflated.
Bradley, Gokkaya, and Liu (2020) examine professional connections among executives
and analysts formed through overlapping historical employment. They search for each analyst on
LinkedIn.com to capture pre-analyst employment history. They find that analysts with
professional connections to coverage firms have more accurate earnings forecasts and issue more
informative buy and sell recommendations. These analysts are more likely to participate, be
chosen first, and ask more questions during earnings conference calls and analyst/investor days.
Brokers attract greater trade commissions on stocks covered by connected analysts. Meanwhile,
firms benefit through securing research coverage and invitations to broker-hosted investor conferences.
In addition to employment histories, many professionals post their photos on social media
networks. Li, Lin, Lu, and Venstra (2020) collect analysts’ photos from their LinkedIn profiles.
They find that, while female analysts are more likely to be voted as All-Star analysts in the
United States, good-looking female analysts are less likely to be voted as All-Stars. On the
contrary, female analysts in China are less likely to be voted as All-Stars, but the likelihood
increases with their facial attractiveness. These findings suggest a beauty penalty for female
analysts in the United States and gender discrimination against female analysts in China.
Most corporate social media posts tend to fall into common categories such as product
announcements, earnings disclosures, industry awards, community engagement, etc. Cao, Fang,
and Lei (2021) uncover an emerging, and entirely different, type of corporate social media
posts: negative peer disclosure (NPD). NPD refers to the phenomenon in which firm A tweets
negative information about competitor firm B without mentioning anything about itself.
Here is an example of what happened between Dropbox/Box and Globalscape, two
companies that compete in the online file storage space. In 2014, news broke of a Dropbox
security flaw that exposed its users’ private data. Globalscape responded by retweeting a news
article with this headline: “Dropbox and Box Leak Files in Security Through Obscurity
Nightmare.”
When the negative news came out about Dropbox and Box, Globalscape could have been
affected in two ways. On one hand, it could have been positive in terms of product market
competition, since Globalscape did not have the security breakdowns its competitors did. At the
same time, it could also have been negative from the technology spillover perspective—the
market might have assumed Globalscape was subject to the same technology vulnerability
(Figure 4). Hence the NPD tweet was a signal to the market that what happened to Dropbox and Box did not apply to Globalscape.
To build a dataset of NPD, Cao, Fang, and Lei (2021) collect tweets that mention a
competitor from corporate Twitter accounts, and employ sentiment analysis to identify negative
peer disclosures. The study finds that NPDs are issued by well-known and successful companies
such as Nvidia, T-Mobile, Symantec, and others. The propensity to issue NPDs increases with the
degree of product market rivalry and technology proximity. The approach appears to work.
Consistent with NPDs being implicit positive self-disclosures, disclosing firms experience a two-
day abnormal return of 1.6–1.7% over the market and industry. Firms using NPDs tend to
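The two-step screen (find corporate tweets that mention a rival, then retain those with negative sentiment) can be sketched as follows; the toy lexicon and bag-of-words scoring are illustrative stand-ins for the paper's full methodology:

```python
NEGATIVE_WORDS = {"leak", "breach", "flaw", "nightmare", "failure"}  # toy lexicon

def is_npd(tweet: str, own_name: str, rival_names: set[str]) -> bool:
    """Flag a negative peer disclosure: the tweet names a rival, does not
    mention the firm itself, and carries negative sentiment."""
    words = set(tweet.lower().split())
    mentions_rival = any(r.lower() in words for r in rival_names)
    mentions_self = own_name.lower() in words
    is_negative = bool(words & NEGATIVE_WORDS)
    return mentions_rival and not mentions_self and is_negative

tweet = "dropbox and box leak files in security nightmare"
print(is_npd(tweet, "globalscape", {"Dropbox", "Box"}))  # → True
```

A production pipeline would use a trained sentiment model rather than a fixed word list, but the classification logic follows this structure.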
References
Bartov, E., Faurel, L., and Mohanram, P. 2018. Can Twitter Help Predict Firm-Level Earnings and Stock Returns? The Accounting Review, 93(3), 25-57.
Blankespoor, E., Miller, G., and White, H. 2014. The role of dissemination in market liquidity:
Evidence from firms’ use of Twitter. The Accounting Review, 89(1), 79-112.
Bradley, D., Gokkaya, S., Liu, X., and Xie, F. 2017. Are all analysts created equal? Industry expertise and monitoring effectiveness of financial analysts. Journal of Accounting and Economics.
Campbell, J., DeAngelis, M., and Moon, J. 2019. Skin in the game: personal stock holdings and
investors’ response to stock analysis on social media. Review of Accounting Studies, 24,
731-779.
Cao, S., Fang, V., and Lei, L. 2021. Negative peer disclosure. Journal of Financial Economics,
140(3), 815-837.
Dube, S., and Zhu, C. 2021. The disciplinary effect of social media: Evidence from firms' responses to Glassdoor reviews. Journal of Accounting Research.
Green, T. C., Huang, R., Wen, Q., and Zhou, D. 2019. Crowdsourced employer reviews and stock returns. Journal of Financial Economics, 134(1), 236-251.
Hales, J., Moon, J., and Swenson, L. 2018. A new era of voluntary disclosure? Empirical evidence on how employee postings on social media relate to future corporate disclosures. Accounting, Organizations and Society.
Huang, K., Li, M., and Markov, S. 2020. What do employees know? Evidence from a social media platform. The Accounting Review.
Jiang, J., Wang, I., and Wang, K. 2018. Revolving rating analysts and ratings of mortgage-backed and asset-backed securities. Management Science, 64(12), 5461-5959.
Jung, M., Naughton, J., Tahoun, A., and Wang C. 2018. Do firms strategically disseminate?
Evidence from corporate use of social media. The Accounting Review, 93(4), 225-252.
Lee, L., Hutton, A., and Shu, S. 2015. The role of social media in the capital market: Evidence from consumer product recalls. Journal of Accounting Research, 53(2), 367-404.
Li, C., Lin, A., Lu, H., and Veenstra, K. 2020. Gender and beauty in the financial analyst
profession: evidence from the United States and China. Review of Accounting Studies, 25,
1230-1262.
Tang, V. W. 2018. Wisdom of crowds: Cross-sectional variation in the informativeness of third-party-generated product information on Twitter. Journal of Accounting Research, 56(3), 989-1034.
Corporate governance refers to the set of processes, customs, policies, laws, and
institutions that impact the way a corporation is directed, administered, or controlled. The need
for corporate governance arises from the separation of ownership and control and information
asymmetry. Prior to the Industrial Revolution, businesses were owned and managed by a handful
of individuals, known as sole proprietors. This simple form gave way to the more complicated
corporate form where individuals invest in a corporation, which is then managed by corporate
managers chosen by those investors. This evolution introduces agency relationships where
corporate managers are agents who control corporate resources, and investors are the principals who supply the capital.
Separation of ownership and control inevitably raises concerns about whether the agent is
acting in the best interests of the principal. It also allows managers to obtain private information
by exercising control. Hence, separation of ownership and control naturally leads to a separation
of ownership and information, where managers enjoy an informational advantage over investors.
Following the Myers and Majluf (1984) paradigm, information asymmetry occurs “…when firms
have information that investors do not have.” To address the agency and the information
asymmetry concerns, in the traditional corporate governance system, shareholders appoint boards
of directors to monitor senior managers. Boards of directors can receive advice from parties
such as auditors and legal counsel. Figure 1 describes this corporate governance environment.
Traditionally, the sole aim of corporate governance has been to maximize shareholder
value and thereby protect the interests of shareholders. In recent years, there has been a growing emphasis on the interests of a broader set of stakeholders.
Specifically, the corporate governance system today should aim to guarantee the interests of all
stakeholders, recognizing the company's responsibilities towards multiple external monitors and stakeholders such as creditors, regulators, etc.
When people seek information about a company, annual reports and earnings conference calls are often the first place they look. However, the proxy statement can be just
as informative, if not more so, as it delves into business relationships, the backgrounds, and compensation of directors and executives. The proxy statement gives shareholders the information they need to
help them understand how to vote at shareholder meetings and make informed decisions about
how to delegate their votes to a proxy. It covers a variety of issues, such as proposals for new
additions to the board of directors, information on directors' salaries, information on bonus and
options plans for directors, corporate actions like proposed mergers or acquisitions, dividend payments, etc.
Below is some of the information you can glean from this important document.
• Important issues to vote on. Figure 2 shows the items of business and board voting recommendations.
Figure 3 lists the items of business, board voting recommendations, and how to vote
in Walmart's 2022 proxy statement. Typical items for voting at annual meetings include the election of directors, ratification of the auditor, and advisory votes on executive compensation.
Inc's 2022 proxy statement. It provides detailed information on the amount and
composition of executive compensation for the five highest-paid executives in the past
three years. Figure 5 reflects Apple's analysis concerning the vesting of the CEO's
Restricted Stock Units (RSUs). The analysis provides detailed information about the
conditions required for RSU vesting, the corresponding performance outcomes, and
• Loans advanced to senior executives. These loans can deprive the company of capital,
are often made on generous terms, and sometimes are forgiven, leaving shareholders footing the bill.
Shareholder voting on important corporate issues sometimes leads to proxy contests, also
known as proxy battles. This occurs when a group of shareholders joins forces in an attempt to
oppose and vote out the current management or board of directors, essentially creating a battle
for control of the company between shareholders and senior management. Proxy fights (Figure
6) are commonly initiated by dissatisfied shareholders who convene with other shareholders to
pressure management and the board of directors to make changes within the company.
Shareholders use their votes to pressure the board of directors by voting against them at shareholder meetings.
Figure 5. CEO RSU vesting analysis in Apple Inc’s 2022 proxy statement
Corporate social responsibility (CSR) refers to business practices and policies designed to ensure a company's operations are ethical and beneficial to all stakeholders in
society. The concept of CSR is broad and varies among companies, but the fundamental
premise is that companies should operate in ways that are socially and environmentally responsible.
Information about CSR can be obtained from both internal and external sources. Internally, many companies voluntarily publish sustainability reports.
These reports contain information about the environmental, social, and governance (ESG)
impacts of a company’s operations. Investors and other stakeholders are increasingly calling for
more transparency in companies' sustainability and ESG strategies, and many legislative efforts are underway. The Global Reporting Initiative (GRI) provides
independent standards for companies to report non-financial information. These standards are
designed to help businesses identify their impacts on climate change, the environment, human
rights, and corporate governance. Although the GRI standards are non-mandatory and non-binding, they serve as the foundation for the proposed Corporate Sustainability Reporting
Directive (CSRD), and the forthcoming mandatory European Sustainability Reporting Standards
(ESRS) are based on the GRI structure. ESRS is a set of standards (analogous to IFRS) that
companies must comply with when reporting sustainability information. A sustainability report is an effective way for a company to answer a wide variety of questions that stakeholders
may raise in a single document. In addition to GRI standards, the Sustainability Accounting
Standards Board ("SASB") also offers guidance on how to prepare informative ESG disclosures. SASB's
mission is to help businesses around the world identify, manage, and report on the sustainability
topics that SASB believes matter most to investors. Other ESG reporting frameworks include the
United Nations Sustainable Development Goals, etc. Figure 7 presents these reporting frameworks.
Companies also disclose CSR information on their websites, in proxy statements, and on social media platforms. CSR information can
also be collected from alternative sources outside of companies, for example, government
Traditionally, corporate information is generated inside firms. The rise of big data and data analytics generates a large amount of
information outside firms by tracking "footprints", for example, satellite images, internet traffic,
credit card transactions, sensors, social media postings, etc. Such alternative information can run ahead of corporate disclosures: studies show that
alternative data are predictive of firm performance and offset insiders' informational advantage. We can imagine
that growing data providing information on ownership structures, leadership quality, shareholder sentiment, governance risks, etc. hold great promise for governance. In addition,
there is increasing demand for and supply of information about environmental, social, and
governance issues, which would help hold managers accountable along these important dimensions.
Not all market participants, however, have the same resources available. The rise of big data creates information asymmetry in accessing traditional
data as well. The SEC estimates that “as much as 85% of the documents visited are by internet
bot”. The ability to process big data has become increasingly critical to establish informational
advantage. Cao, Jiang, Wang, and Yang (2021) find that, when alternative data becomes
available, analysts affiliated with brokerage firms equipped with AI capacity provide more
accurate forecasts. Cao, Jiang, Yang, and Zhang (2021) document that increases in machine
downloads of SEC filings are associated with decreases in time to the first trade; however,
On the other hand, firms are responding to the increasing use of machines in processing
corporate information. Cao, Jiang, Yang, and Zhang (2021) find that the publication of Loughran
and McDonald (2011) prompts firms to reduce their use of Loughran and McDonald (2011)
negative words in corporate filings (Figure 8). Since the way that machines process corporate
filings is largely rule-based, the change in manager behavior would impact the effectiveness of
machine processing. In other words, knowing the power of big data and data analytics, managers are incentivized to change their behavior accordingly.
7.3.2. Governance with distributed ledgers and blockchains: Shareholder voting and smart
contracting
Shareholders exercise their voting rights at the annual or special meeting of stockholders or through action by written consent. Today the most
common types of proxy contests are contests by activist stockholders seeking board
representation or control, generally with the objective of maximizing return on the activist’s
investment in the short-term. The proxy contest serves as a tool to drive change. Brav, Jiang, Li,
and Pinnington (2021) document that approximately one percent of firms are targeted with a proxy contest in a given year.
This figure plots LM – Harvard Sentiment of 10-K and 10-Q filings and compares sentiment of firms with high
machine downloads with that of the low group. LM – Harvard Sentiment is the difference of LM Sentiment and
Harvard Sentiment. LM Sentiment is defined as the number of Loughran-McDonald (LM) finance-related negative
words in a filing divided by the total number of words in the filing. Harvard Sentiment is defined as the number of
Harvard General Inquirer negative words in a filing divided by the total number of words in the filing. Filings are
sorted into top tercile or bottom tercile based on Machine Downloads. LM Sentiment and Harvard Sentiment
are each normalized to one in 2010 within each group, one year before the publication of
Loughran and McDonald (2011). The dotted lines represent the 95% confidence limits.
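The LM Sentiment ratio defined in the figure note is straightforward to compute. In the sketch below, the word set is a tiny illustrative subset standing in for the full Loughran-McDonald dictionary:

```python
import re

# Tiny illustrative subset of the Loughran-McDonald negative word list
LM_NEGATIVE = {"loss", "impairment", "litigation", "adverse", "default"}

def lm_sentiment(filing_text: str) -> float:
    """LM negative words in a filing divided by the filing's total word count."""
    words = re.findall(r"[a-z]+", filing_text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in LM_NEGATIVE) / len(words)

text = "the company recorded an impairment loss due to adverse market conditions"
print(round(lm_sentiment(text), 3))  # 3 negative words out of 11 → 0.273
```

The Harvard Sentiment measure has the same form with a different word list, which is why the two series in Figure 8 can be compared directly.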
The importance of shareholder voting indicates that shareholding records are pivotal to
corporate governance. In this sense, blockchains can help maintain transparent
shareholding records and resolve the problem of "double voting" (Yermack 2017). Blockchains
can also be implemented to add new features to the existing shareholding voting system, for
example, tenure-based voting, a system that awards greater voting power to shares held
for a longer duration (Edelman, Jiang, and Thomas 2019), or the voting power of outside shares
In addition to voting, blockchain technology makes "smart contracts" feasible. Smart
contracts are digital contracts allowing terms contingent on decentralized consensus that are
tamper-proof and typically self-enforcing through automated execution (Cong and He 2019).
Such contracts mitigate traditional moral hazard as "hidden actions" become verifiable and thus contractible.
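As a deliberately simplified illustration of self-enforcement through automated execution (real smart contracts run on a blockchain, typically in languages such as Solidity; every name below is hypothetical):

```python
class EscrowContract:
    """Toy escrow: funds go to the seller only if the agreed condition is
    verifiably met; otherwise the buyer is refunded. Execution is one-shot."""

    def __init__(self, buyer: str, seller: str, amount: int):
        self.buyer, self.seller, self.amount = buyer, seller, amount
        self.settled = False

    def settle(self, delivery_confirmed: bool) -> str:
        # Once the condition's truth value is known, payment is automatic;
        # neither party can intervene or renegotiate after the fact.
        if self.settled:
            raise RuntimeError("contract already executed")
        self.settled = True
        payee = self.seller if delivery_confirmed else self.buyer
        return f"pay {self.amount} to {payee}"

contract = EscrowContract("buyer", "seller", 100)
print(contract.settle(delivery_confirmed=True))  # → pay 100 to seller
```

On an actual blockchain, the contract state and the condition's verification would both live on the decentralized ledger, which is what makes the execution tamper-proof.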
References
Brav, A., Jiang, W., Li, T., and Pinnington, J. 2019. Picking Friends Before Picking (Proxy) Fights: How Mutual Fund Voting Shapes Proxy Contests. Working paper, Columbia Business School.
Cao, S., Jiang, W., Wang, J., and Yang, B. 2021. From Man vs. Machine to Man + Machine: The Art and AI of Stock Analyses. Working paper.
Cao, S., Jiang, W., Yang, B., and Zhang, A. 2022. How to Talk When a Machine Is Listening? Corporate Disclosure in the Age of AI. Working paper.
Cong, L., and He, Z. 2019. Blockchain Disruption and Smart Contracts. Review of Financial Studies, 32(5), 1754-1797.
Edelman, P., Jiang, W., and Thomas, R. 2019. Will Tenure Voting Give Corporate Managers Lifetime Tenure? Texas Law Review, 97.
Loughran, T., and McDonald, B. 2011. When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. Journal of Finance, 66(1), 35-65.
Myers, S., and Majluf, N. 1984. Corporate Financing and Investment Decisions When Firms Have Information that Investors Do Not Have. Journal of Financial Economics, 13, 187-221.
Yermack, D. 2017. Corporate Governance and Blockchains. Review of Finance, 21(1), 7-31.
events include non-deal road shows, which are organized to generate investor interest and
promote the company's image; initial public offering (IPO) road shows, where executives present
to potential investors prior to going public; broker-hosted investor conferences, which allow
CEOs to connect with a broader investor audience; and capital market day events, dedicated to
Corporate executive presentations are characterized by two key features. First, because executives face time constraints when delivering live presentations, these presentations often incorporate a significant amount of visual and graphic information in their slides.
Executives understand the importance of conveying complex ideas succinctly, and visuals help
to convey information quickly and effectively. These visuals can include charts, graphs,
diagrams, images, and videos, all of which aid in capturing the attention of the audience and
enhancing their understanding of the presented material. Figure 1 shows an example of charts on executive compensation used in presentations. Figure 2 presents images of production sites under construction, providing a wealth of visual information about the firm's product designs and operational plans.
Second, while other corporate disclosures primarily focus on quantitative data, such as financial statements, executive presentations stand out by showcasing the visual aspects of the company's offerings. These can include detailed product designs, prototypes, manufacturing processes, supply chain diagrams, and strategic plans. Figure 3 shows two examples of the images of product designs and prototypes used in executive presentations.
view of the company's future direction, growth strategies, and competitive advantage.
high-dimensional nature. An image can contain tens of thousands of pixels, each with millions of possible colors, which together form complex patterns and objects. Recent advances in machine learning
humans. Cao, Cheng, Yang, Xia, and Yang (2023) leverage deep learning to extract key features
As the first step, they manually review and classify (label) a random subsample of images
into several different categories, providing a training sample for the machine learning algorithms.
They classify each image into one of three categories: Operations Summary, Operations
Forward, and Others. To minimize human errors in the labeling process, they cross-validate and
require a consensus on the classification by at least three graduate research assistants. They use a
two-step bootstrapping process to construct the training sample. They first label an initial random
sample of 3,000 images. They then use this initial sample to train the machine learning model and obtain predicted categories for all images. From these predictions, they select a final training sample of 20,000 pre-classified images with
balanced numbers of images in each category. Finally, they manually classify the final training
sample.
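The balanced-sample selection step of this bootstrap can be sketched in a few lines of Python. This is a generic illustration under an assumed input format, a list of (image_id, predicted_label) pairs produced by the initial model, not the authors' code; the category names follow the text and the sample sizes are scaled down.

```python
import random

# Category labels from the text.
CATEGORIES = ["Operations Summary", "Operations Forward", "Others"]

def balanced_sample(predicted, per_category, seed=0):
    """Select an equal number of images from each model-predicted category.

    predicted: list of (image_id, predicted_label) pairs from the model
    trained on the initial hand-labeled sample (an assumed format).
    """
    rng = random.Random(seed)
    sample = []
    for cat in CATEGORIES:
        pool = [img for img, label in predicted if label == cat]
        sample.extend(rng.sample(pool, per_category))
    return sample

# Toy usage: 30 images, 10 predicted per category; draw 5 from each.
predicted = [(i, CATEGORIES[i % 3]) for i in range(30)]
sample = balanced_sample(predicted, per_category=5)
```

The selected, balanced sample would then be hand-labeled to form the final training set.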
A plethora of machine learning models, including random forests, gradient boosting, and
neural networks, have found use in a wide range of applications. Image recognition has been an
important problem for deep learning, and a milestone in image recognition is the development of ImageNet (Li et al., 2009), on which models eventually achieved performance on par with humans. The primary deep learning model employed by leading image recognition algorithms trained on ImageNet is a special class of neural networks, the Convolutional Neural Network (CNN). The CNN is a multiple-layer neural network in which the lower layers capture finer details and the higher layers capture more abstract patterns.
The challenge for recognizing business pictures is that there are no ready-made models
for this purpose, and training a CNN model usually requires a large training dataset. Therefore,
Cao et al. (2023) utilize an advanced machine learning technique called transfer learning (Pratt, 1993; Raina et al., 2006) to build their own deep learning model based on pre-trained CNN
models and train the model with their business image sample. Specifically, they first build a
neural network on top of a pre-trained CNN neural network from a state-of-the-art image
recognition model VGG16 (Simonyan and Zisserman, 2014). They then keep the parameters of the CNN layers fixed and fine-tune the model with the training sample. The resulting model is termed the Transfer CNN model. Transfer learning takes advantage of existing CNN models trained on very large datasets while adapting the model to a specific business problem.
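The freeze-then-fine-tune idea can be illustrated with a deliberately tiny sketch: a frozen "pre-trained" feature map standing in for the VGG16 convolutional layers, and a small trainable head fitted on top. Everything here (the toy 4-pixel images, the frozen weights, the logistic head) is an invented stand-in for illustration, not the authors' model.

```python
import math

def frozen_features(pixels):
    """Stand-in for the frozen pre-trained CNN layers: a fixed map
    whose weights are never updated during fine-tuning."""
    w = [0.5, -0.25, 0.75, 0.1]  # frozen weights
    return sum(p * wi for p, wi in zip(pixels, w))

def train_head(samples, labels, lr=0.1, epochs=200):
    """Fine-tune only the head: a logistic classifier on frozen features."""
    a, b = 0.0, 0.0  # trainable head parameters
    feats = [frozen_features(s) for s in samples]
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = 1.0 / (1.0 + math.exp(-(a * f + b)))
            a += lr * (y - p) * f  # gradient step on the head only
            b += lr * (y - p)
    return a, b

def predict(pixels, a, b):
    return 1 if a * frozen_features(pixels) + b > 0 else 0

# Toy "images": the head learns to separate them on the frozen feature.
samples = [[-1, -1, -1, -1], [0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]]
labels = [0, 0, 1, 1]
a, b = train_head(samples, labels)
```

In a real implementation the frozen part would be the pre-trained convolutional layers and the head a new dense layer, but the division of labor is the same: only the head's parameters are updated.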
Cao et al. (2023) also consider a model that utilizes both image and text information from
They train four different model architectures: 1) a CNN model trained from scratch (CNN); 2) a deep learning model that processes both images and texts (CNN + Text); 3) a transfer learning model that relies on a pre-trained CNN model to process images (Transfer CNN); and 4) a transfer learning model that processes both images and texts (Transfer CNN + Text). For each model, they evaluate its out-of-sample performance using a ten-fold cross-validation approach, in which stratified sampling is used to split samples, and report the results in Table 1. They use four measures to evaluate the out-of-sample performance of the models. Accuracy is the ratio of correct predictions to total observations. Precision is the ratio of true positives to the sum of true positives and false positives. Recall is the ratio of true positives to the sum of true positives and false negatives. They calculate Precision and Recall for each category and then average across the three categories. F1 score is the
harmonic mean of Precision and Recall. Among the four architectures, Transfer CNN and
Transfer CNN + Text have the best performance in terms of F1 score and accuracy. Transfer
CNN + Text outperforms the other models with an accuracy of 80.0% and an F1 score of 79.3%. After fitting the Transfer CNN + Text model, they use the fitted model to obtain a final classification for each image.
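The four evaluation measures (accuracy plus macro-averaged precision, recall, and F1) can be written out directly. A generic sketch, not the authors' implementation:

```python
def macro_metrics(y_true, y_pred, categories):
    """Accuracy plus macro-averaged Precision, Recall, and F1 score,
    mirroring the four evaluation measures described in the text."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precisions, recalls = [], []
    for c in categories:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(categories)
    recall = sum(recalls) / len(categories)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

Macro averaging weights each category equally, which matters when the categories are imbalanced, as slide types typically are.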
Based on the classified categories of each slide page, they aggregate the number of pages under a certain category to the presentation level and scale by the total number of pages. They find that an average presentation slide deck includes 3.6% Operations Forward slides and 11% Operations Summary slides; among presentations containing such slides, the corresponding ratios of Operations Forward and Operations Summary are 5.8% and 12.2%, respectively. Figure 4 shows the time series of the number of
different types of information contained in presentations from 2006 to 2018. Besides a clear increasing trend in the number of all types of presentations, interestingly, the ratio of presentations with Operations Forward images also increases over the years. Specifically, in 2006, only 35% of corporate presentations included Operations Forward images; this ratio rose to 40% in 2010 and 47% in 2018, suggesting that firms are increasingly likely to include Operations Forward images over time.
This figure plots annual number of presentations (bar plot and left axis) and the ratio of presentations with
Operations Forward images over all types of presentations (line plot and right axis) in our sample from 2006 to
2018. Presentations are classified as having Operations Forward images if any slide in the presentation displays figures with Operations Forward information.
References
Cao, S., Cheng, Y., Yang, M., Xia, Y., and Yang, B. 2023. Visual Information in the Age of AI:
It outlines the company’s resources (assets), namely, what the company owns. The balance sheet
also reports the sources of financing for these assets. There are two primary methods through
which a company can finance its assets. First, it can raise funds from stockholders, known as owner financing. It can also acquire capital from banks, creditors, and suppliers, which is known as nonowner financing. This means both owners and nonowners hold claims on the company's assets. Owner claims on assets are referred to as equity, and nonowner claims are referred to as liabilities. Because all financing is directed toward investments, we can establish the fundamental relationship: investing (assets) equals financing (liabilities + equity). This equality is called the accounting equation.
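The accounting equation can be illustrated with a trivial check; the dollar figures below are invented for the example:

```python
def implied_equity(assets, liabilities):
    """Owner claims (equity) are whatever assets nonowner claims do not cover."""
    return assets - liabilities

assets = 500_000        # hypothetical total assets (investing)
liabilities = 300_000   # hypothetical nonowner financing
equity = implied_equity(assets, liabilities)

# investing (assets) equals financing (liabilities + equity)
assert assets == liabilities + equity
```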
Figure 2 illustrates Los Gatos Corporation’s balance sheet as of December 31, 2013.
There are fourteen line items across the categories of assets, liabilities, and owners’ equity. For a
larger company with intricate business structures and models, the balance sheet could encompass
over 250 variables. These variables serve as valuable information for stakeholders seeking to
understand companies’ operating strategies and outcomes. For instance, managers can leverage the balance sheet to assess liquidity and solvency, while investors can use it to evaluate
values of the variables within the balance sheet are determined based on accounting standards
and managerial judgement. Assets and liabilities are measured either at fair value or historical
cost, following relevant accounting standards. Under the historical cost method, the focus lies on the initial price paid by the company during the acquisition of an asset or the incurrence
of a liability. The balance sheet reflects either the purchase price or a reduced value due to
factors such as obsolescence, depreciation or depletion. For financial assets, the price remains
unchanged until the security is liquidated. Historical cost accounting is considered more
conservative and reliable since it is based on a fixed price that is fully known, namely the actual
price paid by the company. While this eliminates uncertainty in the initial valuation decision, it
introduces uncertainty in future periods regarding the true value of assets, which impairs the
An alternative approach is to measure assets and liabilities at fair value, which represents
the price at which knowledgeable and willing parties would exchange or settle them. Fair value
accounting entails adjusting the prices of certain assets on the balance sheet in each reporting
period to reflect changes in market prices. Fair value accounting enhances the relevance of
accounting information. However, determining the fair value of assets and liabilities is not
always straightforward, as it involves subjectivity. Given the pros and cons of both historical cost
and fair value accounting, debates persist on how assets and liabilities on the balance sheet
should be valued. When analyzing a balance sheet, it is essential to consider not only the values
of the assets and liabilities but also how they are measured.
Furthermore, the evolution of business models across various industries has led to a shift
in value creation, with increasing emphasis on intangible assets such as ideas, knowledge,
brands, content, data, and human capital, rather than physical assets like machinery or factories.
However, the accounting framework has not kept pace with this transformation. Existing
accounting standards often fail to recognize the value generated by certain intangible assets, whether in their representation on the balance sheet or their disclosure in footnotes. While tangible
assets like property and equipment are typically included on a company's balance sheet,
investments made in internally-generated intangibles are generally expensed as they are incurred.
Consequently, a company's most valuable assets often remain unaccounted for on its balance
sheet. When examining a balance sheet, it becomes crucial to consider the implicit value of such
intangible assets. Due to the intricate nature of intangibles and the diversity in how companies
manage and investors evaluate them, there is no universally applicable method for their
measurement.
Fairfield, Whisenant, and Yohn (2003) find that, after adjusting for current profitability, asset growth exhibits a negative association with one-year-ahead return on assets. On the other hand, Cooper, Gulen, and Schill (2008) document that asset growth rates are strong predictors of future stock returns.
While stockholders are owners of public companies, the day-to-day control of company
resources lies in the hands of professional managers. This separation of ownership and control
gives rise to what is known as agency problems. One of the typical agency problems is empire
building. Managers are incentivized to grow companies aggressively to fulfill personal and
career ambitions, but reckless expansions could result in inefficient usage of resources and
decreases in shareholder wealth. Therefore, asset growth could suggest healthy growth of
companies’ business but might also indicate empire building. To fully understand the implication
of asset growth, it is important to distinguish growth of assets arising from normal business activities from other sources. A company's activities can be classified as operating and nonoperating. Operating activities encompass the production and sale of company products and services, while nonoperating activities involve items such as investments in marketable securities and debt financing endeavors.
and nonoperating activities. The growth of total assets can be broken down into two components:
growth funded by operating liabilities and growth funded by debt and equity (refer to Figure 3).
For instance, a company may negotiate favorable credit terms with suppliers, which essentially
represents a loan from the suppliers to the company. Alternatively, the company could obtain a
bank loan to finance purchases from suppliers. Both financing activities increase assets and liabilities.
There is no doubt that both suppliers and banks meticulously assess the financial standing
of a company before making financing decisions. However, suppliers may possess a comparative advantage: they interact with the company through repeated business transactions and have access to unbiased private information. Additionally, suppliers
have stronger economic incentives as they are typically less diversified in credit risk compared to
banks. Consequently, it is plausible to consider whether growth financed by these more informed
stakeholders, such as suppliers, may indicate better future performance, while growth financed
by debt and equity could potentially predict worse future performance. This hypothesis can be tested by decomposition. Distinguishing operating from nonoperating liabilities, Cao (2016) decomposes asset growth into non-financing growth and financing growth: non-financing growth in assets arises from increases in operating liabilities, such as accounts payable, while financing growth in assets arises from increases in debt or equity financing, such as bank loans. Non-financing growth can be further decomposed into operating
growth, growth in accounts payable (representing financing from suppliers), and growth in tax
payable (representing financing from tax authorities). The focus is on understanding the distinct implications of operating growth, growth in accounts payable, and growth in tax payable for a company's operating performance and stock market performance, as investors are primarily
Cao (2016) measures operating performance with return on assets (ROA), computed as net income divided by the average of total assets at the beginning and end of the year, which reflects return from the perspective of the entire company. This measure captures both profitability (the numerator) and asset utilization (the denominator). To earn a high return on assets, managers must generate profit while keeping invested assets to the level necessary to achieve that profit. The explanatory variables are various components of asset growth. The regression model relates one-year-ahead ROA (ROAt+1) to current-year ROA (ROAt) and the growth components.
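A generic specification consistent with this description (the exact set of growth components varies across Regressions 1 to 8, and the notation below is a sketch rather than the paper's own equation) would be:

```latex
ROA_{t+1} = \beta_0 + \beta_1\, ROA_t + \textstyle\sum_k \gamma_k\, Growth^{(k)}_t + \varepsilon_{t+1}
```

where $Growth^{(k)}_t$ stands for the asset growth components included in a given regression, such as growth in accounts payable, OAgrowth\_Other, or growth in net operating assets.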
Table 1 tabulates the regression results with various types of asset growth as explanatory variables. Regression 1 includes growth in accounts payable as one of the explanatory variables. Regression 2 includes growth in operating assets other than accounts payable and tax payable (OAgrowth_Other) as one of the explanatory variables. Regression 3 includes both in the same model. The results suggest that growth in accounts
payable is negatively associated with future ROA, while growth in operating assets other than accounts payable and tax payable is positively associated with future ROA. Regressions 4 and 5 include growth in net operating assets as an explanatory variable. The results show that growth in net operating assets is negatively associated with future ROA. Regressions 6 and 7 further indicate that growth in both current and
long-term net operating assets is negatively associated with future ROA. Interestingly, both
growth of accounts payable and growth of operating assets other than accounts payable and tax
payable become positively associated with future ROA when controlling for growth of net
operating assets. This means that growth of net operating assets is a correlated omitted variable in Regressions 1 to 3.
In Regression 8, the various operating asset growth components are included in one regression and “horse raced” against each other. The results confirm that growth of operating assets financed by operating liabilities is positively associated with future ROA; growth of operating assets financed
Cao (2016) measures stock market performance with one-year-ahead stock returns and applies the Fama-MacBeth procedure in this analysis to investigate the association between
operating growth, growth in accounts payable, and growth in tax payable and a company's stock
market performance.
operating assets other than accounts payable and tax payable (OAgrowth_Other) as one of the explanatory variables. Regression 3 includes both in the same model. Similar to the findings for operating performance, growth in accounts payable is negatively associated with future stock returns, while growth in operating assets other than accounts payable and tax payable is positively associated with future stock returns.
As growth in net operating assets could be an omitted correlated variable, Cao (2016) controls for it in the remaining regressions. Regressions 4 to 7 show that growth in net operating assets is negatively associated
with future stock return. Further, growth of operating assets other than accounts payable and tax
payable becomes positively associated with future stock return when controlling for growth of
net operating assets. Regression 8 confirms that growth of operating assets financed by operating
liabilities is positively associated with future stock return; growth of operating assets financed by
A natural follow-up question is whether investors recognize the difference between growth of different types of assets. Growth of operating assets financed by operating liabilities should predict future stock returns only when investors do not recognize the difference among various types of asset growth. To test this possibility, Cao, Wang, and Yeung (2022) regress asset growth on the three-day stock return around earnings announcements. Models (1) to (3) in Table 3 show that asset growth
day earnings announcement return for future two quarters; asset growth financed by
earnings announcement returns for the next three quarters. The results suggest that investors do not initially recognize the difference between asset growth financed by operating liabilities and by debt or equity; it takes investors, on average, six months to figure out the difference.
Since average investors do not immediately recognize the difference between asset growth driven by operating liabilities and by equity or debt, a profitable trading strategy can be constructed by holding stocks with low growth in net operating assets and selling stocks with high growth in net operating assets. The outcome of such a trading strategy can be examined using the two-way portfolio sorting approach.
Specifically, stocks are sorted by year into quintiles based on total asset growth (TAgrowth) and net operating asset growth (NOAgrowth), forming 25 portfolios. Average returns on these portfolios are then computed over a year. The equal-weighted portfolios show that the trading
strategy based on net operating asset growth yields returns ranging from 6.69% to 12.29%
holding total asset growth constant. The value-weighted portfolios show that the trading strategy
yields returns ranging from 7.67% to 9.05% holding total asset growth constant. On the contrary,
a trading strategy based on total asset growth does not yield significant positive returns.
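The double-sort construction can be sketched generically. This is a toy illustration with invented data and a simplified independent sort within one year, not the study's code:

```python
from collections import defaultdict

def quintile(values):
    """Assign each observation to a quintile (0 = bottom, 4 = top) by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * 5 // len(values)
    return bins

def double_sort(ta_growth, noa_growth, returns):
    """Independently sort on the two growth measures, then average
    returns within each of the 25 possible (TAgrowth, NOAgrowth) cells."""
    ta_q, noa_q = quintile(ta_growth), quintile(noa_growth)
    cells = defaultdict(list)
    for t, n, r in zip(ta_q, noa_q, returns):
        cells[(t, n)].append(r)
    return {cell: sum(rs) / len(rs) for cell, rs in cells.items()}

# Toy cross-section of 10 stocks for one year
ta = list(range(10))
noa = list(range(10))
rets = [float(i) for i in range(10)]
portfolios = double_sort(ta, noa, rets)
```

Holding the TAgrowth quintile fixed and comparing returns across NOAgrowth quintiles isolates the effect of net operating asset growth, as in the tests described above.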
The field of equity markets research has experienced significant growth and advancements
due to the utilization of AI and big data. Initially, early studies in this domain focused on
employing new methodologies and data to gain deeper insights into existing research questions,
particularly in the areas of earnings and returns forecasting. Machine learning algorithms have
proven advantageous over traditional regression techniques, as they allow for non-linearity, the
incorporation of high-dimensional and complex time-series data, and the implementation of cross-
validation techniques. Consequently, the adoption of AI and machine learning algorithms holds
the potential to enhance forecasting performance. However, more recent studies have shifted their
focus towards addressing emerging questions that have arisen as a result of AI and big data,
including the comparison between human and machine performance in various tasks. This
transition reflects the evolving landscape of equity markets research driven by technological
advancements.
In a recent study, Chen, Cho, Dou, and Lev (2022) use decision tree methods to predict the sign of future earnings changes, comparing them with a conventional logit model and financial
analyst forecasts. They feed in more than 4,000 financial items identified through XBRL tags in
corporate 10-K filings in current and lagged years, as well as their annual changes, which together
yield over 12,000 input variables. For every three-year period, they assign the first two years as a
training period and the final year as the validation period for model selection, and then conduct
out-of-sample tests. They find that the machine learning approach demonstrates better out-of-
sample predictive power and significant returns to portfolios formed on the basis of the AI-
generated predictions. It should be noted, however, that the logit model is estimated based on a
limited group of selected input variables; therefore, the superiority of the decision tree approach is
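The rolling estimation scheme described above can be sketched as follows. The placement of the out-of-sample test year is an assumption here (the year immediately after each three-year window); the paper's exact layout may differ:

```python
def rolling_windows(years):
    """For each three-year window: the first two years train the model,
    the third year validates model selection, and the following year
    is tested out of sample (assumed layout)."""
    windows = []
    for i in range(len(years) - 3):
        windows.append({
            "train": years[i:i + 2],
            "validate": years[i + 2],
            "test": years[i + 3],
        })
    return windows

windows = rolling_windows([2015, 2016, 2017, 2018, 2019])
```

Rolling windows like this avoid look-ahead bias: every prediction uses only data available before the test year.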
Cao and You (2021) examine the efficacy of machine learning in forecasting corporate earnings. They employ three linear machine learning models, ordinary least squares regression (OLS), least absolute
shrinkage and selection operator (LASSO), and Ridge regression (RIDGE), as well as three
nonlinear machine learning models, random forest (RF), gradient boosting regression (GBR) and
artificial neural networks (ANNs). These they compare with six conventional time-series and
cross-sectional models. Feeding in a selection of 56 features from financial statements, they find
that nonlinear machine learning models generate significantly more accurate and informative
forecasts than the conventional forecasting models found in the literature. Notably, the superior
forecasting capabilities of nonlinear machine learning models are attributable to both the nonlinearities and the predictor interactions that these models capture. Chattopadhyay, Fang, and Mohanram (2022) study machine learning earnings forecasts in an international setting, finding that the GBR and RF models perform the
best in a global setting compared to other simple linear models in the extant literature. The
performance gain is particularly large for international firms with poorer information environments. Binz, Schipper, and Standridge (2022) apply a machine learning algorithm in a DuPont analysis framework to estimate Nissim and Penman (2001)'s structure of
decomposing accounting profitability, and compare its out-of-sample predictability with random
walk and linear regression models. Unlike the previous two studies, the inputs of their machine
learning algorithm are ratios based on Nissim and Penman (2001), and their focus is on the
nonlinear relation between the input factors and the target profitability measures to be forecasted.
They find that machine learning algorithms that incorporate nonlinearity perform better than the
random walk or linear models and that investing strategies based on intrinsic values generated
from those forecasts generate significant abnormal returns. They further find that using a long time
influenced various aspects of our lives, including lifestyle, culture, economy, and environment.
Accounting research has not been immune to this impact. Taking advantage of the advancements
in AI and machine learning, accounting researchers have begun to harness AI technologies and
An overview of regressions
Three examples of regressions
We use regression to develop understanding of relationships between variables. In
regression, and in statistical modeling in general, we want to model the relationship between an
output variable, or a response, and one or more input variables, or factors. Depending on the context, input variables are also called explanatory variables, effects, predictors, or X variables. We can use regression, and the results of regression modeling, to determine which variables have an effect on the response or help explain the response. This is known as explanatory modeling. We can also use regression to predict the values of a response variable based on the values of the important predictors. This is generally known as predictive modeling.
Simple linear regression is used to model the relationship between two continuous
variables. The model below describes how Y changes for given values of X. Because the
individual data values for any given value of X vary randomly about the mean, we need to
account for this random variation, or error, in the regression equation. We add the Greek letter
epsilon to the equation to represent the random error in the individual observations:
Y=β0+β1X1+ε
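The coefficients β0 and β1 are typically estimated by ordinary least squares. A minimal sketch with toy data (not from the chapter):

```python
def ols(x, y):
    """Estimate beta0 (intercept) and beta1 (slope) by ordinary least squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    beta1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

# Toy data generated from Y = 1 + 2X exactly (no error term)
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = ols(x, y)
```

With real data the points scatter around the line, and the same formulas recover the best-fitting β0 and β1.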
Multiple linear regression is used to model the relationship between a continuous response variable and a series of continuous or categorical explanatory variables. When we fit a multiple linear regression model, we add a slope coefficient for each explanatory variable. Each
coefficient represents the average increase in Y for every one-unit increase in that explanatory variable, holding the other variables constant.
Fama-Macbeth regression
The Fama-MacBeth procedure is used to estimate consistent standard errors in the presence of cross-sectional correlation. The first step runs a cross-sectional regression of the model at each date t to obtain estimated coefficients (the beta-factors) over a period of T dates. The second step computes the overall estimate (λ) and its standard error (SE) from the time series of estimated beta-factors, under the assumption that error terms are uncorrelated over time. A more modern
approach is to run a standard panel regression and then cluster on the date variable.
Y_it = β0_t + β1_t X_it + ε_it
λ = (1/T) Σ_t β1_t,  SE(λ) = √( Σ_t (β1_t − λ)² / T² )
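The two-step procedure can be sketched for a single regressor. A generic illustration with toy cross-sections, using the time-series standard error SE(λ) = √(Σ(β1_t − λ)² / T²):

```python
import math

def cross_sectional_slope(x, y):
    """Step 1: OLS slope beta1_t from one cross-section (date t)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))

def fama_macbeth(panels):
    """panels: list of (x, y) cross-sections, one per date t.
    Step 2: average the per-date slopes and compute the
    time-series standard error of the average."""
    betas = [cross_sectional_slope(x, y) for x, y in panels]
    T = len(betas)
    lam = sum(betas) / T
    se = math.sqrt(sum((b - lam) ** 2 for b in betas) / T ** 2)
    return lam, se

# Toy example: two dates whose cross-sectional slopes are 2 and 1
lam, se = fama_macbeth([([1, 2, 3], [2, 4, 6]), ([1, 2, 3], [1, 2, 3])])
```

In practice each cross-section would contain many stocks and multiple regressors, but the averaging logic is identical.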
Two-way sorting
Risk-adjusted return sorting
Economic theory, or empirical conjecture, often yields a prediction that expected returns
should be increasing (or decreasing) in some characteristic or feature. Portfolio sorts are widely used to test the theory or conjecture. One of the appeals of tests of the “top-minus-
bottom” spread in portfolio returns is that they can be interpreted as the expected return on a
trading strategy: short the bottom portfolio and invest in the top portfolio, reaping the difference
• Stocks are sorted into portfolios based on the characteristic of interest;
• Average returns on these portfolios over a subsequent period are then computed;
• The significance of the relationship is judged by whether the “top” and “bottom” portfolio returns differ significantly from each other.
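The top-minus-bottom test can then be computed from the per-period portfolio returns. A generic sketch using a simple t-statistic on the mean spread (toy numbers, not results from the chapter):

```python
import math

def top_minus_bottom(portfolio_returns):
    """portfolio_returns: list of (top, bottom) portfolio returns per period.
    Returns the mean spread and a t-statistic on the mean."""
    spreads = [top - bottom for top, bottom in portfolio_returns]
    T = len(spreads)
    mean = sum(spreads) / T
    var = sum((s - mean) ** 2 for s in spreads) / (T - 1)
    t_stat = mean / math.sqrt(var / T)
    return mean, t_stat

# Toy per-period (top, bottom) portfolio returns
mean, t_stat = top_minus_bottom([(0.10, 0.02), (0.14, 0.04), (0.06, 0.00)])
```

The mean spread is the average return to shorting the bottom portfolio and holding the top portfolio; the t-statistic judges whether that spread is reliably different from zero.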
References
Binz, O., Schipper, K., and Standridge, K. 2022. What Can Analysts Learn from Artificial Intelligence about Fundamental Analysis? Working paper.
Cao, K., and You, H. 2021. Fundamental Analysis via Machine Learning. Working paper.
Cao, S. 2016. Reexamining Growth Effects: Are All Types of Asset Growth the Same? Contemporary Accounting Research, 33(4).
Cao, S., Wang, Z., and Yeung, P. E. 2022. Skin in the Game: Operating Growth, Firm Performance, and Future Stock Returns. Journal of Financial and Quantitative Analysis, 57(7), 2559-2590.
Chattopadhyay, A., Fang, B., and Mohanram, P. 2022. Machine Learning, Earnings Forecasting. Working paper.
Chen, X., Cho, Y., Dou, Y., and Lev, B. 2022. Predicting Future Earnings Changes Using Machine Learning and Detailed Financial Data. Journal of Accounting Research, 60(2), 467-515.
Cooper, M., Gulen, H., and Schill, M. 2008. Asset Growth and the Cross-Section of Stock Returns. Journal of Finance, 63(4), 1609-1651.
Fairfield, P., Whisenant, S., and Yohn, T. 2003. Accrued Earnings and Growth: Implications for Future Profitability and Market Mispricing. The Accounting Review, 78(1), 353-371.
Nissim, D., and Penman, S. 2001. Ratio Analysis and Equity Valuation: From Research to Practice. Review of Accounting Studies, 6, 109-154.
amounts for its top line revenue and its expenses. Revenue less expenses equals the bottom-line
net income. The income statement reports on both operating and nonoperating activities.
Operating activities are those that relate to bringing a company’s products or services to market
and any after-sales support. The income statement captures operating revenues and expenses,
yielding operating profit. Major operating line items in the income statement are revenues, costs
of goods sold (COGS), and selling, general, and administrative expense (SG&A). Nonoperating
activities relate to such items as borrowed money that creates interest expense and nonstrategic investments that generate investment income. Major nonoperating line items on the income statement include interest expense on debt and lease obligations, loss or income relating to discontinued operations, debt issuance and retirement costs, interest and dividend income on investments, and gains or losses on the sale of assets. An example is the income statement of Los Gatos Corporation.
Operating profit less income tax on operating profit results in net operating profit after
tax. This measure of a company’s operating performance warrants special attention because it is
the lifeblood of a company’s value creation and growth. Total profit less total income tax results
in net income. Net income is not equivalent to net operating profit after tax because nonoperating income could be transitory or irrelevant to value. Holding net income constant, a company with more operating income is generally regarded as having more persistent, higher-quality earnings.
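The arithmetic above can be sketched in a few lines; the revenue, cost, and tax-rate figures below are hypothetical, chosen only to illustrate the calculation.

```python
# Hypothetical amounts, for illustration only.
def nopat(operating_profit: float, tax_rate: float) -> float:
    """Net operating profit after tax: operating profit less tax on operating profit."""
    return operating_profit * (1.0 - tax_rate)

revenue, cogs, sga = 1_000.0, 600.0, 250.0    # top line and major operating expenses
operating_profit = revenue - cogs - sga        # 150.0
print(nopat(operating_profit, 0.25))           # 112.5
```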
The proportion of income attributable to the core operating activities of a business is one
important aspect of earnings quality. Thus, if a business reports an increase in profits due to
improved sales or cost reductions, the quality of earnings is considered to be high. Conversely,
an organization can have low-quality earnings if changes in its earnings relate to other issues,
such as the aggressive use of accounting rules, inflation, the sale of assets for a gain, or increases
in business risk. In general, any use of accounting trickery to temporarily bolster earnings
reduces the quality of earnings. A key characteristic of high-quality earnings is that the earnings are readily repeatable over a series of reporting periods, rather than being earnings that are only reported as the result of one-time events.
Data in the income statement is closely related to stock markets. There is a natural
positive relation between expected earnings and stock prices because investors expect dividends,
which are paid out of earnings. Early research by Ball and Brown (1968) confirmed this expected
relation.
We can use regression to confirm the association between earnings and stock prices. In
the equation P=β0+β1X+ε, P represents stock price and X represents earnings per share. β1
reflects the relation between earnings and stock prices, known as the earnings multiple.
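The price–earnings regression can be sketched with ordinary least squares; the price and earnings-per-share figures below are simulated for illustration only.

```python
import numpy as np

# Simulated data: price P generated from EPS X with a small, symmetric disturbance.
X = np.array([2.0, 3.0, 4.0, 5.0, 6.0])                       # earnings per share
P = 10.0 + 3.0 * X + np.array([0.1, -0.2, 0.0, 0.2, -0.1])    # stock price

# Slope beta1 is the earnings multiple; intercept beta0 absorbs the rest.
beta1, beta0 = np.polyfit(X, P, 1)
print(round(beta1, 2), round(beta0, 2))    # 3.0 10.0
```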
Earnings multiples often vary significantly across companies, industries, and business
cycles. This is because stock price reflects the current value of all expected future cash flows, and different components of earnings contribute to future cash flows to different degrees. Earnings can therefore be decomposed into permanent, transitory, and value-irrelevant components:

P = β0 + β11X1 + β12X2 + β13X3 + ε

The permanent component of earnings (X1) has a lasting impact on future cash flows, while the transitory component of earnings (X2) has only a one-time effect on future cash flows. Value-irrelevant components of earnings (X3) have little or no impact on future cash flows. If we regress stock price, P, on the three components of earnings, we obtain the earnings multiple corresponding to each earnings component. The overall earnings multiple, β1, is determined by the weight of each earnings component and the multiple corresponding to each component.
Investors like to see high-quality earnings, since these earnings tend to be repeated in
future periods and provide more cash flows for investors. Thus, entities that have high-quality
earnings are also more likely to have high stock prices. Conversely, those entities reporting
lower-quality earnings will not attract investors, resulting in lower stock prices. For example, in
Table 1, both firm A and firm B report earnings of $10 per share. 60 percent of firm A’s earnings
are permanent, 30 percent are transitory, and only 10 percent are value-irrelevant. In contrast,
only 50 percent of firm B’s earnings are permanent, 20 percent are transitory, and the remaining
30 percent are value-irrelevant. Since firm A has a larger proportion of permanent component
and a smaller proportion of value-irrelevant component than firm B, firm A’s earnings quality is
deemed to be better than that of firm B. As a result, firm A commands a higher earnings multiple, 3.3, and therefore a higher stock price than firm B.
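The overall multiple can be reproduced as a weighted sum of component multiples. Assuming, purely for illustration, component multiples of 5, 1, and 0 for the permanent, transitory, and value-irrelevant components, firm A's weights yield a 3.3 multiple:

```python
def overall_multiple(weights, component_multiples):
    """Overall earnings multiple = weighted sum of the component multiples."""
    return sum(w * m for w, m in zip(weights, component_multiples))

# Assumed multiples for the permanent, transitory, and value-irrelevant components.
multiples = (5.0, 1.0, 0.0)

firm_a = overall_multiple((0.6, 0.3, 0.1), multiples)   # 3.3
firm_b = overall_multiple((0.5, 0.2, 0.3), multiples)   # 2.7
print(firm_a, firm_b)
```

Under these assumed component multiples, firm A's larger permanent weight directly translates into the higher overall multiple.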
Abnormal returns of a stock tend to move in the same direction as the firm's earnings surprise for an extended period of time. The post-earnings-announcement drift (PEAD) anomaly refers to the phenomenon whereby portfolios based on information in past earnings earn abnormal returns. Ball and Brown (1968) were the first to
note that even after earnings are announced, estimated cumulative abnormal returns continue to
drift up for “good news” firms and down for “bad news” firms. Foster, Olsen, and Shevlin
(1984) estimate that over the 60 trading days subsequent to an earnings announcement, a long
position in stocks with unexpected earnings in the highest decile, combined with a short position
in stocks in the lowest decile, yields an annualized abnormal return of about 25 percent before
transactions costs.
One class of explanations for PEAD suggests that at least a portion of the price response
to new information is delayed. The delay might occur either because traders fail to assimilate
available information, or because certain costs (e.g., transaction costs) exceed gains from
immediate exploitation of information for a sufficiently large number of traders. What is less
One possibility is that the market erroneously assuming a seasonal random walk for
expected earnings and ignoring the autocorrelations in earnings. For instance, a company
announces a new long-term contract with a customer that increases earnings at t, resulting in
large positive seasonally adjusted unexpected earnings (SUEt). This contract would also bring in
a stream of future earnings, leading to large positive SUE in subsequent years (SUEt+1, SUEt+2,
SUEt+3, etc.). The value of this stream of earnings, which is the current value of all future SUEs,
should be factored into the stock price at t. If all investors immediately factor the current value of
all future SUEs into the stock price at time t, current SUEt should not be associated with future
abnormal returns. If investors fail to adjust stock price expectations immediately, then the current
SUEt, should predict future abnormal return. Specifically, assuming investors delay response to
the stream of SUEs from t to t+1, current SUEt should predict abnormal return at t+1. As a result,
informed investors can earn abnormal returns by longing stocks with high SUE and short stocks
with low SUE at time t. This prediction can be examined by estimating the following regression
AbRett+1 = α + θ1SUEt + εt+1    (1)
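A sketch of estimating regression (1) on simulated data follows; the coefficient and noise level are illustrative assumptions, not estimates from the literature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
sue = rng.standard_normal(n)                 # simulated SUE_t
theta1_true = 0.02                           # assumed delayed-response effect
# AbRet_{t+1} responds to SUE_t plus noise, mimicking a delayed price reaction.
abret = theta1_true * sue + 0.05 * rng.standard_normal(n)

theta1, alpha = np.polyfit(sue, abret, 1)    # slope recovers theta1
print(round(theta1, 3))
```

A significantly positive slope here is what the delayed-response story predicts: current SUE forecasts next-period abnormal returns.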
For example, suppose Apple Inc. signed a contract with a client that yielded a $1 million
profit at time t, and this contract possesses a 40% likelihood of renewal. Hence, the overall
estimated value of the contract would be $1.4 million. If all investors were aware of the 40%
renewal probability, the current SUE will have no relation to future returns, as the market has
already incorporated all relevant contract information into stock prices in a timely manner.
142
Cao, Jiang, &Lei
_____________________________________________________________________________________
However, if investors undervalued the contract and estimated its value to be $1.2 million, the
current SUE could predict future stock returns. This is because the information about the
remaining $0.2 million has not yet been reflected in stock prices at time t. In such a case,
informed investors can make profits by buying stocks with high SUE and selling stocks with low SUE.
Cao and Narayanamoorthy (2012) verify the existence of the PEAD anomaly. Panel A of
Table 2 shows that SUE at t predicts future SUEs up to the subsequent three quarters. Panel B
and Panel C show the association between current SUE and future three-day earnings announcement abnormal returns up to two quarters ahead and quarter-long abnormal returns up to three quarters ahead, respectively.
Figure 6 illustrates that from 1988 to 2008, implementing a trading strategy based on the
PEAD anomaly consistently resulted in positive abnormal returns. In practice, this strategy is implemented by going long stocks with the highest SUEs and shorting those with the lowest SUEs.
It is intuitive to posit that the level of autocorrelation is lower for companies with more volatile earnings.
To examine whether the effect of earnings volatility on the level of SUE autocorrelation passes through to the association between current SUE and future abnormal returns, we can augment the regression model above to include earnings volatility and its interaction with the earnings surprise:

AbRett+1 = θ0 + θ1DSUEt + θ2EVOLt + θ3EVOLt×DSUEt + εt+1    (2)

where DSUE and EVOL are decile-rank transformations of SUE and earnings volatility, defined below.
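A sketch of estimating such an interaction specification on simulated data follows; the coefficients and noise level are illustrative assumptions, not the paper's estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
dsue = rng.uniform(-0.5, 0.5, n)    # decile-transformed SUE
evol = rng.uniform(-0.5, 0.5, n)    # decile-transformed earnings volatility
# Assumed structure: positive drift in DSUE, dampened by volatility (theta3 < 0).
abret = 0.06 * dsue - 0.04 * evol * dsue + 0.02 * rng.standard_normal(n)

# OLS with intercept, main effects, and the interaction term.
X = np.column_stack([np.ones(n), dsue, evol, dsue * evol])
coef, *_ = np.linalg.lstsq(X, abret, rcond=None)
theta0, theta1, theta2, theta3 = coef
print(round(theta1, 3), round(theta3, 3))
```

A negative estimate of θ3 in this setup corresponds to the drift being weaker for high-volatility stocks.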
The results of this analysis are provided in Table 4. The coefficient of DSUE in Model 1 of Panel A suggests that the abnormal return over the three days surrounding the earnings announcement date at t+1 is approximately 0.7 percent. Model 1 of
Panel B suggests that abnormal stock returns from quarter t to quarter t+1 is about 6 percent. The
negative coefficient of EVOL*DSUE suggests that abnormal returns are higher for the stocks
with less earnings volatility. Model 2 and Model 3 in both panels control for company size and
whether a company suffers losses at t. The effect of earnings volatility is robust to including both
controls.
As noted above, the delay in investor response might occur because transaction costs are
prohibitively high. To exclude this alternative explanation, we include earnings volatility and a
proxy for transaction costs, SPREAD, in the same regression. If θ3 continues to be significantly
negative, then the effect of earnings volatility at least is not completely attributable to transaction
costs. Table 5 shows that θ3 remains significantly negative, indicating a significant effect of
earnings volatility on the association between SUE and abnormal returns in addition to the effect
of transaction costs.
informed investment decisions. However, complexity arises from the substantial number of income-related variables in the data. The question then arises: how can we effectively gain insights from this vast amount of information?
There are two primary approaches to gaining insights from the broad range of income-
related variables present in income statements. The first approach is theory-based, where we first
come up with a hypothesis based on established theories and test its validity. By doing so, we
can determine whether the patterns suggested by the theory can provide valuable guidance to
investors.
The second approach is based on machine learning, which is particularly suited for handling high-dimensional data. In this approach, all income-related
variables are incorporated into machine learning models. Then, the machine processes the data,
continuously learning and adapting to identify patterns and relations between the variables and
subsequent stock returns. The advantage of using machine learning techniques lies in their ability
to handle vast amounts of data and identify complex patterns that might not be apparent through
traditional analysis methods. Specifically, machine learning algorithms have the capacity to process a large number of input variables simultaneously and to capture nonlinear relations among them.
Abnormal returns refer to the unusual profits (or losses) generated by a security or stock. Abnormal returns are measured as the difference between the actual returns that investors earn on an asset and the expected returns. Expected returns are estimated using, for example, the capital asset pricing model (CAPM). Abnormal returns can be positive or negative.
Positive abnormal returns are realized when actual returns are greater than expected returns.
Negative abnormal returns (or losses) occur when the actual return is lower than what was
expected.
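The CAPM-based abnormal-return calculation can be sketched as follows; all inputs are hypothetical.

```python
def capm_expected_return(rf: float, beta: float, market_return: float) -> float:
    """Expected return under the CAPM: risk-free rate plus beta times the market risk premium."""
    return rf + beta * (market_return - rf)

def abnormal_return(actual: float, rf: float, beta: float, market_return: float) -> float:
    """Abnormal return = actual return minus the CAPM expected return."""
    return actual - capm_expected_return(rf, beta, market_return)

# Hypothetical inputs: 2% risk-free rate, beta of 1.2, 8% market return, 12% actual return.
# Expected return = 0.02 + 1.2 * 0.06 = 9.2%, so the abnormal return is positive.
print(abnormal_return(0.12, 0.02, 1.2, 0.08))
```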
For instance, in Cao and Narayanamoorthy (2012), ARq, t+1 is calculated by subtracting
the CRSP value-weighted index from the raw return for the period between two days after the
quarter earnings announcement date and one day before the next announcement date. ARs, t+1 is
calculated by subtracting the CRSP value-weighted index from the raw return during the three-day window surrounding the earnings announcement date. The CAPM builds on the idea of systematic risk (otherwise known as non-diversifiable risk) that investors need to be
compensated for in the form of a risk premium. A risk premium is a rate of return greater than
the risk-free rate. When investing, investors desire a higher risk premium when taking on more
risky investments.
The risk premium required by investors is based on the β of that stock. β is a measure of a
stock’s risk reflected by measuring the fluctuation of its price changes relative to the overall
market. In other words, it is the stock’s sensitivity to market risk. For instance, if a company’s β
is equal to one, the expected return on a stock is equal to the average market return. A β of -1
means a stock has a perfect negative correlation with the market. Cumulative Abnormal Return
(CAR) refers to the sum of abnormal returns over a given period of time. It allows investors to measure the performance of an asset or security over a specific period, especially since the market's reaction to news can unfold over several days.
The Security Market Line (SML) graphically represents the relationship between
expected returns and the associated risk levels. A security or portfolio that is in equilibrium lies
on the SML, indicating that it is fairly priced, as its expected return equals the return required by
the market at that level of risk. Assets lying above the SML are undervalued, as they offer a
higher return than what is required for their level of risk. Conversely, assets lying below the
SML are overvalued, as they offer a lower return than what is required for their level of risk.
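Classifying an asset relative to the SML can be sketched as follows; the function and its inputs are hypothetical illustrations of the comparison just described.

```python
def sml_position(expected: float, rf: float, beta: float, market_return: float) -> str:
    """Compare an asset's expected return with the return required by the SML."""
    required = rf + beta * (market_return - rf)
    if expected > required:
        return "undervalued"    # offers more than the required return for its risk
    if expected < required:
        return "overvalued"     # offers less than the required return for its risk
    return "fairly priced"      # lies exactly on the SML

# Hypothetical: required return = 0.02 + 1.0 * (0.08 - 0.02) = 8%.
print(sml_position(0.10, 0.02, 1.0, 0.08))   # undervalued
```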
SUE is the seasonally adjusted unexpected earnings, based on the difference between the current quarter's earnings and the earnings from the same quarter in the previous year.
DSUE is the SUE decile rank for each quarter transformed by dividing the rank by 9 and subtracting 0.5, resulting in values that range from −0.5 to +0.5.
EVOL is the earnings volatility (VOL) decile rank for each quarter transformed by
dividing the rank by 9 and subtracting 0.5, resulting in values that range from −0.5 to +0.5.
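The decile-rank transformation described above can be sketched as follows; this is a minimal implementation with simplified tie handling.

```python
import numpy as np

def decile_transform(values):
    """Rank values into deciles 0..9, divide by 9, subtract 0.5 -> range [-0.5, +0.5]."""
    values = np.asarray(values, dtype=float)
    order = values.argsort().argsort()              # 0-based rank of each observation
    deciles = np.floor(order * 10 / len(values))    # decile rank 0..9
    return deciles / 9.0 - 0.5

dsue = decile_transform(np.arange(20))   # 20 observations, two per decile
print(dsue.min(), dsue.max())            # -0.5 0.5
```

Centering the transformed rank at zero makes the coefficient on DSUE directly interpretable as the return spread between the top and bottom deciles.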
References
Ball, R., and Brown, P. 1968. An Empirical Evaluation of Accounting Income Numbers. Journal of Accounting Research, 6(2), 159-178.
Cao, S., and Narayanamoorthy, G. 2012. Earnings Volatility, Post-Earnings Announcement Drift, and Trading Frictions. Journal of Accounting Research, 50(1), 41-74.
Foster, G., Olsen, C., and Shevlin, T. 1984. Earnings Releases, Anomalies, and the Behavior of Security Returns. The Accounting Review, 59(4), 574-603.