
Analytics for Finance and Accounting:

Data Structures and Applied AI

Sean Cao
University of Maryland
Wei Jiang
Emory University
Lijun Lei
University of North Carolina at Greensboro
Table of Contents

1. Data Analytics in Finance and Accounting ..............................................................................1


1.1. How to leverage data science for corporate stakeholders ......................................................1
Video The rising use of big data for decision-making
Video How to leverage data science for corporate stakeholders: the importance of domain
knowledge
1.2. Overview of structured and unstructured business data ........................................................9
Video An overview of available business data
Video An overview of structured data analytics
Video An overview of unstructured data analytics
1.3. Theory-driven and machine-learning approaches of data analytics ...................................13
Video Theory-driven and machine-learning approaches of data analytics
1.4. The advantages of applying machine-learning approaches ..................................16
Analyzing Textual Data
2. Analyzing Annual Reports ......................................................................................................19
2.1. Data structure in 10-K filings and annual reports ...............................................................19
2.1.1. Data structure in 10-K filings ..........................................................................................19
Video Data structure of the 10-K filing
2.1.2. Data structure in annual reports .......................................................................................27
2.2. Conventional textual analysis approach ..............................................................................28
2.2.1. Bag of Words ....................................................................................................................28
2.2.2. Latent Dirichlet Allocation ...............................................................................................33
Video Bag of Words and LDA
2.3. Empirical example: Analyzing corporate filings for making business decisions ...............34
Video An empirical example of analyzing 10-K filings
Appendix 2A: How to crawl annual reports ..............................................................................38
Video How to crawl annual reports
Appendix 2B: How to parse unstructured data .........................................................................38
Video How to parse unstructured data
3. Emerging AI technology in textual analysis ...........................................................................40
3.1. How to select and apply machine learning models ..............................................................40
3.1.1 Data cleaning, parsing, and feature selection.....................................................................40
3.1.2 Machine learning model selection .....................................................................................41
3.1.3 Hyperparameter tuning ......................................................................................................43
3.1.4 Machine learning model performance evaluation..............................................................45
3.2. Pre-trained phrase-level word embedding ...........................................................48
Video Pre-trained phrase level analysis
3.3. Pre-trained sentence-level word embedding ........................................................49
Video Pre-trained sentence level analysis
3.3.1 Bidirectional Encoder Representations from Transformers (BERT) ................................50
3.3.2 Generative Pre-trained Transformer (GPT).........................................................52
Appendix 3: Evaluating machine learning models .....................................................................58
4. Analyzing Conference Calls ...................................................................................................60
4.1. Data in conference calls ......................................................................................................60
Video Data structure of conference calls
4.2. Standard dependence parser .................................................................................................63
4.3. Empirical example: Contrasting conference calls and expert calls .....................................66
Appendix 4: Applying GPT to analyze conference call transcripts ............................................69
Video Applying GPT to analyze conference call transcripts using both API and web interface
5. Analyzing Material Firm News ..............................................................................................71
5.1. Types of 8-K news filings ...................................................................................................70
Video Data structure of the 8-K filing
5.2. Empirical example: Technological peer disclosure .............................................................80
Video An empirical example of analyzing material firm news
6. Analyzing Data from Social Media ........................................................................................83
6.1. What is social media ...........................................................................................................83
6.2. Data from social media ........................................................................................................84
Video Data from social media platforms
Video An example of analyzing disclosure on social media platforms
6.2.1. Twitter ...............................................................................................................................85
6.2.2. Glassdoor.com ..................................................................................................................87
6.2.3. Stock Message Boards ......................................................................................................89
6.2.4. YouTube ...........................................................................................................................91
6.2.5. LinkedIn ............................................................................................................................91
6.3. Empirical example: Negative Peer Disclosure ....................................................................93
7. Data Analytics in Environmental, Social, and Governance ...................................................97
7.1. Corporate governance .........................................................................................................97
7.2. Textual data for corporate governance ................................................................................98
7.2.1. Proxy statements ...............................................................................................................98
Video Data in proxy statements
7.2.2. Corporate social responsibility disclosure ......................................................................103
Video Data in corporate social responsibility disclosure
7.3. Emerging technologies as governance mechanisms .........................................................105
7.3.1. Governance with availability of alternative data ...........................................................105
7.3.2. Governance with distributed ledgers and blockchains: Shareholder voting and smart
contracting .....................................................................................................................106
8. Analyzing Image Data ..........................................................................................................110
8.1. Image data in corporate executive presentations ...............................................................110
8.2. Empirical example: Visual information in the age of AI ...................................................112
Optional Chapters
Analyzing Numerical Data
9. Analyzing the balance sheet..................................................................................................118
9.1. Data structure in the balance sheet ....................................................................................118
Video Data structure of the balance sheet
Video Debate about fair value accounting
Video Analyzing the balance sheet
9.2. Empirical example: Analyzing the balance sheet for investors ........................................121
9.3. Machine learning application on balance sheet data ........................................................129
Appendix 9 Regression Methods .............................................................................................113
A9.1. Linear Regression ..........................................................................................................132
Video An overview of regressions
Video Three examples of regressions
A9.2. Fama-MacBeth regression ..............................................................133
Video Fama-MacBeth regressions
A9.3. Portfolio-sorting .............................................................................................................133
Video Two-way sorting
Video Risk-adjusted return sorting
10. Analyzing the income statement .........................................................................................136
10.1. Data structure in the income statement ...........................................................................136
Video Data structure of the income statement
10.2. Earnings and stock prices.................................................................................................138
10.3. Empirical example: Analyzing the income statement for investors ................................141
Video An empirical example of analyzing the income statement
10.4. Machine learning application on income statement data ......................................147
Appendix 10 Key variable explanations ...................................................................................149
A10.1. Abnormal return ............................................................................................................149
A10.2. Expected return .............................................................................................................149
A10.3. Standardized unexpected earnings ................................................................................150
A10.4. Earnings volatility .........................................................................................................150
Cao, Jiang, & Lei
_____________________________________________________________________________________

Chapter 1 Data Analytics in Finance and Accounting

1.1. How to leverage data science for corporate stakeholders

The rising use of big data for decision-making

The rising use of big data for decision-making

Data analytics is a broad category encompassing diverse activities that involve collecting,

organizing, and analyzing raw data. Fueled by advances in computing power, mass storage, and

machine learning, the usage of data analytics has skyrocketed over the past decade. Data

analytics has the capacity to analyze any type of information, including both structured and

unstructured data.

Capital markets produce a plethora of information that is central to efficient contracting

and risk-sharing among capital market participants. The nature of a company is based on how its

various contractual arrangements with stakeholders are structured. These contractual relations

are essential to companies, and stakeholders such as customers, suppliers, employees, investors,

communities, and others with a stake in the company form a network of interconnected

contracts. Throughout history, information supply and demand in capital markets has been used

to facilitate efficient contracting and risk-sharing. Stakeholders demand information on a

company’s past and prospective returns and risks. Companies that wish to lower financing costs

and/or other costs such as political, contracting, and labor costs supply that information.

Managers of a company decide the amount of information to supply by weighing the costs and

benefits of disclosing such information. Although managers can exert control over what

information a company supplies and when, regulatory agencies have consistently intervened in

this process by establishing a baseline level of information that must be released with various

disclosure requirements.


In today’s capital markets, information can be extracted from a multitude of sources,

including mandatory filings required by regulators, companies’ voluntary disclosures,

information produced by financial analysts, shared by competitors, uncovered by news media,

etc. Accordingly, the decision-making process has evolved from a model in which managers

primarily rely on their experience to one in which decision-making is based on data analytics.

However, the shift presents an inherent challenge: the need for large-scale data collection

from different sources and in various formats. Data analytics techniques can help stakeholders

collect relevant information, organize structured and unstructured data, and then conduct

appropriate analyses to reveal trends and metrics that would otherwise be lost amidst a sea of

information. Given the central role that information plays in capital markets, corporate

stakeholders increasingly recognize the value and the importance of data analytics. As revealed

in Cao, Jiang, Yang, and Zhang (2023), there is clearly an increase in the application of data

analytic tools in analyzing regulatory company filings downloadable from the Securities and

Exchange Commission’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database

system. Specifically, the proportion of automatic machine downloads of annual and quarterly

regulatory company filings (i.e., 10-Ks and 10-Qs) surged from under 40 percent in 2003 to over

80 percent after 2015.


Figure 1. Machine downloads of 10-K and 10-Q filings

This figure plots the annual number of machine downloads (blue bars and left axis) and the annual percentage of
machine downloads over total downloads (red line and right axis) across all 10-K and 10-Q filings from 2003 to
2016. Machine downloads are defined as downloads from an IP address downloading more than 50 unique firms’
filings daily. The number of machine downloads and the number of total downloads for each filing are recorded as
the respective downloads within seven days after the filing becomes available on EDGAR.
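The machine-download rule in the caption (an IP address downloading filings of more than 50 unique firms in a day) is simple enough to sketch directly. The `(ip, date, cik)` log-record layout below is an assumption made for illustration, not the actual schema of the EDGAR server logs.

```python
from collections import defaultdict

MACHINE_THRESHOLD = 50  # unique firms per IP per day, per the definition above

def classify_machine_ips(log_records, threshold=MACHINE_THRESHOLD):
    """Return the (ip, date) pairs whose daily downloads span more than
    `threshold` unique firms; these are treated as machine downloads."""
    firms_seen = defaultdict(set)  # (ip, date) -> set of firm identifiers
    for ip, date, cik in log_records:
        firms_seen[(ip, date)].add(cik)
    return {key for key, firms in firms_seen.items() if len(firms) > threshold}

def machine_download_share(log_records, threshold=MACHINE_THRESHOLD):
    """Fraction of all downloads attributable to machine (ip, date) pairs."""
    machine_keys = classify_machine_ips(log_records, threshold)
    machine = sum(1 for ip, date, _ in log_records if (ip, date) in machine_keys)
    return machine / len(log_records) if log_records else 0.0
```

Applied to the full 2003–2016 download log, a ratio like `machine_download_share` computed year by year would trace out the red line in the figure.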

What separates us from computer science and statistics majors: The importance of domain
knowledge

How to leverage data science for corporate stakeholders: the importance of domain knowledge

If computers begin to play an increasingly important role in data analytics, an interesting

question is whether computers could outperform humans. High-profile human-computer

competitions began in chess. One of the most famous chess computers is Deep Blue because of

the chess match between Deep Blue and World Chess Champion Garry Kasparov in 1997. Cao,

Jiang, Wang, and Yang (2021) build an AI analyst that is able to digest corporate financial

information, qualitative disclosure, and macroeconomic indicators. They find that such an AI

analyst could beat the majority of human analysts in stock price forecasts. The relative advantage

of the AI analyst is stronger when the firm is complex, and when information is high-dimensional,

transparent, and voluminous. Nevertheless, human analysts remain competitive

when critical information requires institutional knowledge. More importantly, the edge of the AI

over human analysts declines over time as human analysts gain access to alternative data and to


in-house AI resources. Unsurprisingly, combining AI’s computational power and the human art

of understanding soft information produces the highest potential in generating accurate forecasts.

Figure 2. The performance of AI-assisted analyst vs human analysts

This figure plots the proportion of AI-assisted Analyst recommendations that are more accurate than the Analyst
recommendations alone on an annual basis. The blue line in the middle gives the annual AI-assisted Analyst beat
ratios, the blue-dotted lines above and below are the 95% confidence interval of the beat ratio, and the red line gives
the best linear approximation of the trend in beat ratios.

Although cutting-edge data analytics techniques can be extremely complicated and

powerful, the first step in utilizing them is relatively simple: identifying the required information

and devising a strategy to gather it. Given the dynamic nature of capital markets, analysts must

have in-depth knowledge of institutional backgrounds to effectively apply data analytics in the

corporate world or in capital markets. For instance, analysts must understand the objectives of

the decision-making process, the pertinent information necessary to facilitate decision-making,

and potential sources of useful information. This creates a pressing demand for business

professionals who possess both domain knowledge in business and a practical understanding of

data analytics techniques. Business professionals with both business expertise and data analytics


skills can play a critical bridging role: deciphering the information needs of decision-makers,

conducting preliminary analyses, and leading a team to formalize and implement quantitative models.

Furthermore, the advancement of artificial intelligence (AI) opens the avenue to various

domain-specific AI applications in the future. Future business professionals should be prepared to design

generative AI for domain-specific applications in, for example, investing, compliance,

marketing, etc. It is equally important to be able to use and manage such AI applications,

including understanding their functions, exploring their roles in improving productivity, and

managing associated legal and security risks. Additionally, some abilities, such as reasoning-

based intelligence, are exclusive to humans and cannot be entirely replaced by AI. Therefore,

professionals who possess domain expertise will be in high demand in this new era.

Figure 3 describes a typical data analytics team consisting of a customer team and an

implementation team. The customer team includes the client-facing product manager. The

product manager needs to understand customer demand, perform preliminary analysis, and

communicate customer demand and desirable solution to the implementation team. An ideal

product manager thus should have business knowledge and be able to apply basic data analytics

tools. The implementation team includes a lead team, which serves as the bridge to the customer

team, and a data team of implementation experts. The lead team requires professionals with

strong business knowledge and data analytics skills, since they are responsible for receiving

demands from the product manager, conveying the data analytics solution to data scientists, and

performing quality control. The data team mostly comprises data scientists who are computer

science or statistics majors with strong programming skills. Business professionals with data

analytics skills can work in any role that requires integrated skills, such as product manager or

tech lead.


Figure 3. The importance of data analytics and business domain knowledge

The objective of this book is thus to introduce computational tools and AI technologies to

business major students and link these tools to business domain knowledge before they jump into

learning programming. Understanding the domain-specific data features and domain-specific

questions and use cases will be the key to staying in the game in an era that has enthusiastically

embraced data-driven decision-making.

Tailoring data science to the different needs of each corporate stakeholder

Data analytics for corporate talents

Managers and employees are deeply invested in their company’s current and future

financial well-being. This creates a strong demand for information on the company’s financial

condition, profitability, and future prospects, as well as comparative information on competing

companies and business opportunities, which permits them to benchmark their company’s

performance and condition. Managers also use company information to design compensation and

bonus contracts. Information extracted using data analytics could assist managers in addressing

various questions, including, for example:

• What product lines, geographic areas, or other segments are performing well in

comparison to our peer companies and our own benchmarks?


• Should we consider expanding or contracting our business?

• How will current profit levels impact incentive- and share-based compensation?

• What capital structure is suitable for our business?

• How can we improve cash flow management?

• What is an appropriate dividend payout policy?

• How are we doing compared to competitors?

Data analytics for shareholders

Shareholders of a company, much like managers and employees, are keenly interested in

predicting its future performance. Expectations of future profitability and cash generation

significantly impact a company’s stock price and ability to borrow money on favorable terms.

Shareholders therefore demand company information to project its gains and losses accurately.

Investors also use company information to evaluate managerial performance. Here are some

examples of questions that information extracted using data analytics could assist investors in

addressing:

• What are the expected future profits, cash flows, and dividends for input into stock-

price models?

• Is the company financially solvent and able to meet its financial obligations?

• How do expectations about the economy, interest rates, and the competitive

environment affect the company?

• Is company management demonstrating good stewardship of the resources with which

it has been entrusted?

• Do we have the information we need to critically evaluate strategic initiatives

proposed by management?


Data analytics for creditors

Creditors and suppliers need company information to make important decisions

regarding their financial transactions and business relationships. By using data analytics, they

can gain insights into the company’s financial health, performance, and risks. For example,

creditors can use data analytics to determine loan terms, loan amounts, interest rates, and

required collateral, as well as to forecast earnings, predict bankruptcy, and assign credit scores.

Meanwhile, suppliers can use data analytics to establish credit terms and to evaluate their long-

term commitment to supply-chain relations. Both creditors and suppliers use company

information to monitor and adjust their contracts and commitments. Here are some examples of

questions that creditors and suppliers can address with the help of data analytics:

• Should we extend credit in the form of a loan or line of credit for inventory purchase?

• What interest rate is reasonable given the company’s current debt load and overall

risk profile?

• Is the company in compliance with the existing loan covenants?
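One classic, concrete instance of the bankruptcy-prediction task mentioned above is Altman's (1968) Z-score, which combines five accounting ratios into a single distress indicator. The sketch below uses the original coefficients for public manufacturing firms; it is offered as a standard illustration, not as a model prescribed by this chapter.

```python
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    """Altman (1968) Z-score, original coefficients for public manufacturers."""
    ta = total_assets
    return (1.2 * working_capital / ta
            + 1.4 * retained_earnings / ta
            + 3.3 * ebit / ta
            + 0.6 * market_value_equity / total_liabilities
            + 1.0 * sales / ta)

def z_zone(z):
    """Map a Z-score to Altman's conventional risk zones."""
    if z < 1.81:
        return "distress"
    if z > 2.99:
        return "safe"
    return "grey"
```

For example, a firm with working capital of 100, retained earnings of 200, EBIT of 80, equity market value of 600, sales of 900, total assets of 1,000, and total liabilities of 400 scores roughly 2.46, inside the "grey" zone between the conventional distress and safe cutoffs.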

Data analytics for other stakeholders

Customers seek company information to assess a company’s ability to provide products

or services as well as its staying power and reliability. Auditors rely on company information to

detect potential financial misstatements. Governments demand company information to ensure

compliance with laws and regulations. All stakeholders can uncover critical insights for their

decision-making through the application of data analytics.


Figure 3. Firm as a nexus of contracts

1.2. Overview of structured and unstructured data

An overview of available business data

Unstructured data analytics

An overview of unstructured data analytics

Over the past few decades, significant progress has been made in data analytics

techniques. These advances have led to new sources of information that enable us to effectively

tackle a broader range of problems. Recently, there have also been remarkable innovations in the

methods used to create new data. Information collected from these fresh sources or generated

through these new mechanisms is largely unstructured qualitative data.

More and more information users are drawn to texts in firm regulatory filings. Despite

containing financial statements and pages of tables and charts, annual reports and other company

regulatory filings still consist mostly of text. The recent reduction in computer storage costs

and an increase in computer processing capabilities have made textual analysis of these

disclosures more feasible. Regulatory filings that are widely available for analysis encompass


annual reports, current reports, proxy statements, initial public offering (IPO) prospectuses, and

more.
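A minimal sketch of what textual analysis of such filings can look like: tokenize a passage of filing text and count hits from a negative-tone word list. Both the excerpt and the four-word list below are invented for illustration; serious applications rely on curated, finance-specific dictionaries.

```python
import re
from collections import Counter

# Hypothetical filing excerpt, invented for illustration.
excerpt = """Our operating results may fluctuate. Competition may reduce
our margins, and litigation risk may adversely affect our business."""

# Toy negative-word list (a stand-in for a curated finance dictionary).
NEGATIVE = {"fluctuate", "litigation", "adversely", "risk"}

def tone_stats(text, negative_words=NEGATIVE):
    """Count total tokens and negative-dictionary hits in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    negatives = sum(counts[w] for w in negative_words)
    return {"total_words": len(tokens),
            "negative_words": negatives,
            "negative_share": negatives / len(tokens)}
```

Even a crude negative-share measure like this, computed across thousands of filings, yields a firm-level tone signal that can be compared over time and across companies.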

Conference call transcripts allow analysts to capture and analyze information disclosed

during corporate conference calls. These calls provide an opportunity for managers to announce

and discuss the firm’s financial performance, while allowing analysts and investors to ask

relevant questions about the company.

Corporate social responsibility (CSR) reports serve as internal and external

communications detailing a company’s CSR initiatives and their impact on the environment and

society. While some countries mandate the annual publication of CSR reports, many companies

in regions without such requirements also voluntarily release them.

Social media has become a vital communication channel for businesses. Social media

platforms enable the rapid dissemination of information to millions of people within seconds.

This evolution in information exchange has opened up a new range of opportunities for

companies to inform and interact with stakeholders. Company information shared on social

media platforms, whether by the company itself or by investors, consumers, competitors, and

others, offers an additional perspective on a company’s operations, performance, and risks.

Audio data pertaining to business activities can also enhance decision-making processes.

For example, in addition to analyzing textual transcripts of conference calls and investment

presentations, the audio recordings of these events can provide valuable nuances to analysts.

Video and image data are more widely used than ever before due to the progress of

video and image capture devices. Computer algorithms are becoming increasingly sophisticated,

enabling the processing and interpretation of static images and deriving objective information

from videos. Product-related images provided by companies or shared by customers are


examples of potentially valuable image data. Videos of investor presentations, product releases,

and other company events could be valuable for managers, investors, and other decision-makers.

Structured data analytics

An overview of structured data analytics

Traditionally, company information is financial in nature and comprises structured

quantitative data that is aggregated and used to prepare financial statements for internal and

external information users. In addition, information intermediaries and other marketplace agents

produce company information for capital market participants.

Financial statements provide critical financial information in accordance with

applicable accounting standards to ensure the relevancy, reliability, and comparability of firm

information. Companies use four financial statements to periodically report on their business

activities: the balance sheet, income statement, statement of stockholders’ equity, and statement

of cash flows. The balance sheet reports on a company’s financial position at a specific point in

time, while the income statement, statement of stockholders’ equity, and statement of cash flows

report on performance over a period of time. These three statements link the balance sheet from

the beginning to the end of a period.
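The linkage described above can be written as two one-line roll-forward identities. The equity components shown are a simplified subset chosen for illustration (real statements also carry items such as other comprehensive income):

```python
def ending_equity(beginning_equity, net_income, dividends,
                  stock_issued=0.0, stock_repurchased=0.0):
    """Statement of stockholders' equity: roll beginning equity forward."""
    return (beginning_equity + net_income - dividends
            + stock_issued - stock_repurchased)

def ending_cash(beginning_cash, operating_cf, investing_cf, financing_cf):
    """Statement of cash flows: roll the beginning cash balance forward."""
    return beginning_cash + operating_cf + investing_cf + financing_cf
```

Checking that reported ending balances actually equal these roll-forwards is a common sanity test when assembling balance-sheet panels from raw filing data.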

Executive compensation disclosures provide information concerning the amount and

type of compensation paid to a firm’s chief executive officer, chief financial officer, and the

other three highest-paid executive officers in the annual proxy statement. The company must

also disclose the criteria used to reach executive compensation decisions and the relationship

between the company’s executive compensation practices and corporate performance. The

Summary Compensation Table, included in the proxy statement, is the cornerstone of the

required disclosures on executive compensation. The Summary Compensation Table provides, in

a single chart, a comprehensive overview of a company’s executive compensation practices. The


Summary Compensation Table is then followed by other tables and disclosures containing more

specific information on the components of compensation for the last completed fiscal year, for

example, information about grants of stock options and stock appreciation rights, long-term

incentive plan awards, pension plans, and employment contracts and related arrangements.

Financial analyst forecasts and recommendations provide useful processed

information from financial experts. Financial analysts provide short-term and long-term forecasts

on earnings, sales, capital expenditures, etc. Financial analysts also regularly issue investment

recommendations.

Loan agreements include information about loan terms, loan amounts, interest rates, and

required collateral. Loan agreements often include contractual requirements, called covenants,

that restrict a company’s behavior in some fashion. Violation of loan covenants can lead to early

repayment or other compensation demanded by the lender. Information in loan agreements thus

reflects the creditor’s assessment of a company’s credit risk.
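The covenant monitoring described here reduces to comparing reported ratios against contractual limits. The ratio names and thresholds in the example below are hypothetical, not terms drawn from any actual loan agreement.

```python
def check_covenants(financials, covenants):
    """Return the names of covenants the borrower currently violates.

    `financials` maps ratio names to reported values; `covenants` maps the
    same names to (comparison, limit) pairs, e.g. ("<=", 3.5).
    """
    ops = {"<=": lambda value, limit: value <= limit,
           ">=": lambda value, limit: value >= limit}
    return [name for name, (op, limit) in covenants.items()
            if not ops[op](financials[name], limit)]
```

For a borrower reporting debt/EBITDA of 4.1 against a `("<=", 3.5)` covenant, the function flags `"debt_to_ebitda"` as a violation, which in practice would trigger renegotiation, a waiver, or early repayment.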

Patents and citations are highly useful in understanding the innovations a company has

in development. Patent filings protect most significant discoveries, providing a wealth of

technological, geographical, and industry data. The relationship between an invention’s

economic importance and patent data is well-documented. Unsurprisingly, patent data is

increasingly used in business analytics.

In this book, unstructured textual and image data are discussed in chapter 2 to chapter 8;

structured numerical data are introduced in chapter 9 and chapter 10.


Figure 4. Structured and unstructured business data

Figure 5. Data analytics based on structured and unstructured business data

1.3. Theory-driven and machine-learning approaches of data analytics



Theory-driven approach vs. machine-learning approach

When analyzing either quantitative or qualitative data, two approaches can be taken in

data analytics. The theory (hypothesis)-driven approach relies on theory-based hypotheses to

guide the direction of data analytics, whereas the machine-learning approach starts with data

being supplied to a computer model to train itself to identify patterns or make predictions. The

theory-driven approach resembles the human thinking process, making it intuitive and interpretable.

The machine-learning approach, on the other hand, leverages the computational power of machines

to yield strong predictive capability, but the machine learning process remains a “black box”.

As an example, the Securities and Exchange Commission (SEC) charged Luckin Coffee

Inc. with material misstatement of financial statements to falsely appear to achieve rapid growth

and increased profitability and to meet the company’s earnings estimates. The fraud was

uncovered by Muddy Waters LLC. Muddy Waters is an investment research firm that conducts

research on financial fraud. The firm received an anonymous tip and mobilized 92

full-time and 1,418 part-time staff on the ground to run surveillance and recorded store traffic for

981 store-days covering 100% of the operating hours of 620 stores. The investigation resulted in

more than 11,200 hours of videotaping and led to the conclusion that the number of items per

store was inflated by at least 69 percent in 2019’s third quarter and 88 percent in the fourth

quarter (Muddy Waters Research, 2020). This is a typical investigation following the theory

(hypothesis)-driven approach where Muddy Waters LLC first formed a hypothesis that Luckin

Coffee Inc. had misstated its financial statements and then conducted an investigation to test the

hypothesis. In contrast, financial statement auditors are required to perform analytical procedures

that aim at detecting potential anomalies in financial reporting. When auditors perform analytical

procedures, they do not necessarily have a hypothesis that a company misstates financial

statements. This is an example of the machine-learning approach.

Both approaches can be employed to perform three types of data analytics: descriptive,

diagnostic, and predictive. Descriptive analytics summarize data and describe observable

patterns. These analyses focus on understanding what has happened over a period of time.

Descriptive analytics techniques include descriptive statistics and cross-tabulation of data.


However, conventional descriptive analytics rely on structured data mostly in numeric format.

To offer a more complete picture, artificial intelligence provides necessary technologies to

retrieve novel data from various structured and unstructured sources.

Diagnostic analytics seek to understand what happened and why. This type of data

analytics involves more diverse data inputs and a deeper dive into the data. Diagnostic analytic

techniques involve correlations, regressions, and other statistical methods. For instance,

diagnostic analytics can be applied to investigate the economic implications of artificial intelligence

applications such as virtual currencies, digital payments, and robo-advising.

Finally, predictive analytics explores what is likely to happen or what will happen “if”

something else happens. To be able to predict what will happen, we first need to understand what

happened, how, and why; hence, predictive analytics builds on descriptive and diagnostic

analytics. Conventional predictive analytic techniques involve building models using past data

and statistical techniques, including regression and a deep understanding of cause and effect.

With machine-learning algorithms, patterns can be learned from a training dataset, and predictive

models can be built with limited human intervention.

Figure 6. Theory-driven and machine-learning approaches of data analytics


1.4 The advantages of applying machine-learning approaches


Machine learning has several unique advantages compared with conventional data

analysis techniques. First, traditional statistical methods often struggle with large amounts of

data. In contrast, machine learning algorithms such as convolutional neural networks (CNNs) are

capable of automatically extracting informative features to process information effectively. Another advantage of

machine learning is its ability to handle nonlinear relationships in data. Along with the

information explosion, the types of information available to analysts have grown from mostly

numerical data to more complex data that involves both text and images. Machine learning can

effectively identify nonlinear patterns and make predictions using various types of data. This

makes it particularly valuable for tasks such as natural language processing and computer vision.

Machine learning also offers the ability to make out-of-sample predictions, which is particularly

useful in cases where data is limited. In traditional statistical methods, repeated optimization

(learning) is often required to improve the accuracy of the model. However, in machine learning,

this process is streamlined and only requires one-time optimization. For tasks involving time

series data, machine learning algorithms such as long short-term memory (LSTM) networks can

be applied to identify time-series patterns. Finally, machine learning algorithms are designed to

be efficient, which is particularly important given the increasing amounts of data being

generated. By using optimized algorithms, machine learning can process data quickly and

effectively, making it a valuable tool in fields such as healthcare and finance where time is of the

essence.

The field of machine learning encompasses a wide range of techniques and approaches,

including supervised and unsupervised learning, self-supervised learning, transfer learning,

ensemble learning, and more. Supervised learning involves training a model with labeled data.

On the contrary, unsupervised learning does not require labeled data and instead focuses on


finding patterns and relationships within the data itself. Self-supervised learning is a variant of

unsupervised learning that uses the data itself to generate labels, such as using stock returns to

label news positivity. Transfer learning is a powerful technique that uses a pre-trained model to

tackle new tasks with less training data. This approach is particularly useful in cases where the

cost of obtaining labeled data is high or where the amount of labeled data available is limited.

These machine learning techniques are discussed in detail in later chapters of the book.




Chapter 2 Analyzing Annual Reports

2.1. Data structure in annual reports and 10-K filings

Data structure of the 10-K filing


Each year, U.S. public companies are required to produce a Form 10-K and file it with

the U.S. Securities and Exchange Commission (SEC) within 60 to 90 days of the end of the fiscal

year, depending on the company's filer status. The SEC’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database system

allows anyone to retrieve a company’s 10-K report. Some companies also post their 10-K reports

on their websites. In addition, SEC rules mandate that companies send an annual report to their

shareholders in advance of annual meetings. While both annual reports and 10-K filings provide

an overview of the company's performance for the given fiscal year, annual reports tend to be

much more visually appealing than 10-K filings. Companies put effort into designing their

annual reports, using graphics and images to communicate data, while 10-K filings only report

numbers and qualitative information, devoid of design elements or additional flair.

2.1.1. Data structure in 10-K filings

A comprehensive Form 10-K contains four parts and 15 items. Researchers are often

most interested in Item 1, “Business,” Item 1A, “Risk Factors”, and Item 7, “Management’s

Discussion and Analysis of Financial Condition and Results of Operations.” Therefore, we begin

our discussion with these three items of Form 10-K.

Item 1, “Business,” appears in Part I. It gives a detailed description of the company’s

business, including its main products and services, its subsidiaries, and in which markets it

operates. To gain an understanding of a company’s operations and its primary products and

services, Item 1 serves as an excellent starting point. Figure 1 shows an excerpt of Item 1 of


Apple Inc.’s 2021 10-K. It introduces the company’s background and provides information on

Apple’s main products.

Figure 1. Item 1 of Apple Inc. 2021 Form 10-K

Item 1A, “Risk Factors,” also is an item in Part I of Form 10-K. It outlines the most

significant risks faced by the company or its securities. In practice, this section focuses on the

risks themselves, not how the company addresses those risks. The risks outlined may pertain to

entire economy or market, the company’s industry sector or geographic region, or be unique to

the company itself. Figure 2 shows an excerpt of Item 1A of Apple Inc.’s 2021 10-K, which

discusses business risks arising from the COVID-19 pandemic, such as disruptions in supply

chains and logistics services and store closures.

Item 7, “Management’s Discussion and Analysis of Financial Condition and Results of

Operations” presents the company’s perspective on its financial performance during the prior

fiscal year. This section, commonly referred to as the MD&A, allows company management to

summarize its recent business in its own words. The MD&A presents:


• The company’s operations and financial results, including information about the

company’s liquidity and capital resources and any known trends or uncertainties that

could materially affect the company’s results. This section may also present the

management’s views on key business risks and how they are being addressed.

• Material changes in the company’s results compared to the prior period, as well as off-

balance-sheet arrangements and contractual obligations.

• Critical accounting judgments, such as estimates and assumptions. These accounting

judgmentsand any changes from previous yearscan have a significant impact on the

numbers in the financial statements, such as assets, costs, and net income.

Figure 2. Item 1A of Apple Inc. 2021 Form 10-K


Figure 3 is an excerpt of Item 7 of Target 2021 10-K. It begins with highlights of the fiscal year,

a summary of financial outcomes, and then continues to analyze key performance indicators such

as the gross margin.

Figure 3. Item 7 of Target 2021 Form 10-K

Other Items in Form 10-K

Part I of Form 10-K

Part I of the report comprises another two items in addition to Item 1 and Item 1A. Item

1B, “Unresolved Staff Comments,” requires the company to explain certain comments received


from SEC staff on previously filed reports that have not been resolved after an extended period

of time.

Item 2, “Properties,” describes the company’s significant physical properties, such as

principal plants, mines and other materially important physical properties. Figure 4 displays Item

2 from Apple Inc.’s 2021 10-K. It reveals that Apple Inc. owns and leases facilities and land

throughout the U.S. and outside the U.S.

Figure 4. Item 2 of Apple Inc. 2021 Form 10-K

Item 3 “Legal Proceedings” requires companies to disclose information about significant

pending lawsuits or other legal proceedings, other than ordinary litigation. Figure 5 displays Item 3

from Apple Inc.’s 2021 10-K. It is worth noting that it is not uncommon for companies to be

involved in legal proceedings. Item 4 has no required information and is reserved by the SEC for

future rulemaking.

Figure 5. Item 3 of Apple Inc. 2021 Form 10-K

Part II of Form 10-K

Part II of Form 10-K comprises seven items additional to Item 7. Item 5, “Market for

Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity

Securities,” provides information about the company’s equity securities, including market


information, the number of shareholders, dividends, stock repurchases by the company, and other

relevant information. Figure 6 provides an example of Item 5 in Apple Inc. 2021 10-K.

Item 6, “Selected Financial Data,” provides a summary of certain financial information

from the past five years. As shown in Figure 7, Item 6 of Apple Inc. 2021 10-K reports selected

financial information from 2016 to 2021. More detailed financial information on the past three

years is included in a separate section: Item 8, “Financial Statements and Supplementary Data,”

which includes the company’s balance sheet, income statement, cash flow statement, and notes

to the financial statements.

Figure 6. Item 5 of Apple Inc. 2021 Form 10-K


Figure 7. Item 6 of Apple Inc. 2021 Form 10-K

Item 7A, “Quantitative and Qualitative Disclosures about Market Risk,” mandates

disclosure of the company’s exposure to market risks arising from, for example, fluctuations in

interest rates, foreign currency exchanges, commodity prices, or equity prices. This section may

also include information on how the company manages these risks. Figure 8 provides an excerpt

of Item 7A in Apple Inc.’s 2021 10-K.

Figure 8. Item 7A of Apple Inc. 2021 Form 10-K


Item 8, “Financial Statements and Supplementary Data,” mandates the company’s

audited financial statements, which includes the company’s income statement, balance sheet,

statement of cash flows, and statement of stockholders’ equity. The financial statements are

accompanied by notes that elucidate the information presented in the financial statements. An

independent accountant audits these financial statements, and, for large companies, also reports

on their internal controls over financial reporting.

Item 9, “Changes in and Disagreements with Accountants on Accounting and Financial

Disclosure,” requires companies that have changed accountants to discuss any disagreements

they had with those accountants. Such disclosure is often seen as a red flag by many investors.

Item 9A, “Controls and Procedures,” discloses information about the company’s disclosure

controls and procedures, as well as its internal controls over financial reporting. Item 9B, “Other

Information,” requires companies to provide any information that should have been reported on

another form during the fourth quarter of the year covered by the 10-K, but was not disclosed.

Part III of Form 10-K

Part III of the 10-K includes five items. Item 10, “Directors, Executive Officers and

Corporate Governance,” requires information about the background and experience of the

company’s directors and executive officers, the company’s code of ethics, and certain

qualifications for directors and committees of the board of directors. Item 11, “Executive

Compensation,” requires a detailed disclosure of the company’s compensation policies and

programs, as well as how much compensation was paid to its top executive officers in the past

fiscal year.

In Item 12, “Security Ownership of Certain Beneficial Owners and Management and

Related Stockholder Matters,” companies provide information about the shares owned by the


company’s directors, officers, and certain large shareholders. This item also includes information

about shares covered by equity compensation plans.

Item 13, “Certain Relationships and Related Transactions, and Director Independence,”

includes information about relationships and transactions between the company and its directors,

officers, and their family members. It also includes information about whether each director of

the company is independent.

Item 14, “Principal Accountant Fees and Services,” requires companies to disclose fees

paid to their accounting firm for various types of services during the year. Although this

disclosure is required as part of Form 10-K, most companies provide this information in a

separate document called the proxy statement. Companies distribute the proxy statement among

their shareholders in preparation for annual meetings. If the information was provided in a proxy

statement, Item 14 will include a message from the company directing readers to the proxy

statement document. The proxy statement is typically filed a month or two after the 10-K. Part

III of Apple Inc.’s 2021 10-K is given in Figure 9 as an example.

Part IV of 10-K

Part IV contains Item 15, “Exhibits, Financial Statement Schedules,” which outlines the

financial statements and exhibits included as part of the 10-K filing. Many exhibits are

mandatory, including documents such as the company’s bylaws, copies of its material contracts,

and a roster of the company’s subsidiaries.

2.1.2. Data structure in annual reports

Similar to 10-Ks, annual reports are comprehensive reports detailing companies’

performance and activities throughout a fiscal year. Many companies choose to incorporate a lot

of graphics and images instead of large amounts of text in their annual reports to create more


visually appealing documents than 10-Ks. For example, in Figure 10, Procter & Gamble

provides both numeric and graphic information regarding its financial performance in the 2021

annual report.

Figure 9. Part III of Apple Inc. 2021 Form 10-K

The structure of annual reports varies across companies, but they typically include several

common sections such as (1) letter to shareholders, (2) performance and highlights, (3) corporate

strategies, (4) non-financial information such as CSR information, (5) financial information, (6)

leadership information, and any other pertinent information the company wishes to share.

2.2. Conventional textual analysis approach

2.2.1. Conventional Approach Review (Bag of Words)

Bag of words and LDA

The “bag of words” technique is a Natural Language Processing (NLP) technique used

for textual modelling. Text data can be messy and unstructured, making it challenging for

machine learning algorithms to analyze. These algorithms prefer structured, well-defined, fixed-

length inputs. A “bag of words” is a textual representation of the occurrence of words within a

document. To create this representation, analysts track the frequency of word occurrences in a

document, disregarding grammatical details and word orders. The term “bag” is used because

information about the order or structure of words in the document is discarded, and all words are

collected en masse as if in a bag. Using this technique, variable-length texts can be converted

into a fixed-length vector. The bag-of-words approach is a simple and flexible way to extract

features from documents.

Figure 10. Financial highlights in P&G’s 2021 annual report
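As a minimal illustration, the conversion of variable-length texts into fixed-length count vectors can be sketched in pure Python; the sample documents below are invented:

```python
from collections import Counter

def bag_of_words(documents):
    """Convert variable-length texts into fixed-length count vectors.

    The vocabulary is built from all documents; word order and grammar
    are discarded, as in the bag-of-words representation.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted(set(word for doc in tokenized for word in doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["revenue grew and margins grew", "revenue fell"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # alphabetically sorted vocabulary
print(vectors)  # one fixed-length count vector per document
```

Both texts, despite having different lengths, map to vectors of the same length (one slot per vocabulary word), which is exactly the fixed-length input that downstream algorithms prefer.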

How to build keyword dictionary?

To utilize the bag-of-words approach effectively, data analytics requires a pre-established

set of keywords. Sentiment analysis, for instance, can be conducted by computing the frequency

of pre-determined negative and positive words. By comparing the number of negative words to

positive words, the bag-of-words approach can identify the sentiment of a text without the need

to read the entire document.
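A sketch of this counting procedure, using invented mini word lists in place of an established dictionary such as the Loughran-McDonald lists:

```python
# Invented mini word lists; in practice an established dictionary such
# as the Loughran-McDonald word lists would be used.
POSITIVE = {"growth", "profit", "strong", "gain", "improve"}
NEGATIVE = {"loss", "decline", "weak", "impairment", "litigation"}

def sentiment_score(text):
    """Net sentiment: (positive count - negative count) / word count."""
    words = text.lower().split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

print(sentiment_score("strong growth offset a small impairment loss"))
```

A score above zero indicates net positive tone, below zero net negative tone; scaling by word count keeps scores comparable across documents of different lengths.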

Existing keyword lists

There is a wide range of well-established keyword lists readily available for textual

analyses with various objectives. In sentiment analysis, for example, the Harvard-IV-4

Dictionary is a general-purpose dictionary that provides a list of positive and negative words

developed at Harvard University. The Loughran-McDonald Sentiment Word Lists are

widely used in technical accounting and finance texts. Other researchers have developed similar

keyword lists for non-English languages (Du, Huang, Wermers, and Wu 2022) or for purposes

other than sentiment analysis, such as forward-looking statements, extreme sentiment, deception,

financial constraints, uncertainty, financial performance, research and development, technology,

intangible assets, culture, big data and artificial intelligence, litigation, social affiliation, supply

chain, etc. (Cao, Ma, Tucker, and Wan 2018; Hassan, Hollander, van Lent, and Tahoun 2019, etc.).

Self-defined keywords

When a suitable keyword list is not readily available for a specific research question, we

can create a customized one by reading a small sample of related texts and selecting the most

relevant keywords. This approach is easy to implement, but it can also be arbitrary. Below we

discuss two structured approaches to develop self-defined keyword lists.

Corpus approach

The corpus approach begins with gathering textual contents relevant to the topic of

interest, from which a set of frequently used words {A} is extracted. This set often includes

noisy keywords unrelated to the topic. To eliminate this noise, we then identify the irrelevant


topics and generate a list of frequently used words for each irrelevant topic {Bi}. A robust

keyword list for the topic of interest is then obtained by subtracting irrelevant topic keywords

from the preliminary high-frequency word list, or {A} − ∪{Bi}. For example, to generate a list of

political keywords {Ap}, one might start with political science textbooks to generate a high-

frequency word list {Ap0}. This preliminary high-frequency word list might contain keywords

relating to economics, law, science, etc. To remove these irrelevant topics, we could use a similar

approach to generate a list of high-frequency words for each irrelevant topic: {Beconomics},

{Blaw}, {Bscience}, etc. Finally, we subtract these irrelevant keywords from the preliminary

political keyword list, resulting in a clean political keyword list, or {Ap} = {Ap0} − ∪{Bi}. Figure

11 illustrates the process of using the corpus approach to develop a dictionary.

Figure 11. Using the corpus approach to develop keyword lists
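The cleaning step {A} - U{Bi} reduces to plain set operations; a toy sketch with invented word lists standing in for the words extracted from each corpus:

```python
# Invented high-frequency word lists standing in for words extracted
# from each corpus.
politics_raw = {"election", "senate", "voter", "statute", "inflation", "campaign"}
economics = {"inflation", "market", "trade"}
law = {"statute", "contract", "tort"}

def clean_keywords(preliminary, irrelevant_lists):
    """Implement {A} - U{Bi}: subtract the union of the irrelevant-topic
    word lists from the preliminary high-frequency list."""
    noise = set().union(*irrelevant_lists)
    return preliminary - noise

politics = clean_keywords(politics_raw, [economics, law])
print(sorted(politics))  # the cleaned political keyword list
```

Here "inflation" and "statute" are removed because they also rank high in the economics and law corpora, leaving only words distinctive to the political corpus.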

Dictionary expansion

The “dictionary expansion” approach generates an expanded keyword list by searching

for synonyms of key topical words in authoritative dictionaries. For instance, to generate a

keyword list for “risk”, we can begin with the single word “risk” and look up all synonyms of

“risk” in the Merriam-Webster Dictionary. This can give us words like “threat” and “danger”.

We can then look up the synonyms of these synonyms, which could yield words such as


“menace”, “jeopardy”, and “trouble”. The process can be continued until the additional

synonyms are no longer closely related to the original concept of “risk.” (Figure 12).

Figure 12. Using the dictionary expansion approach to develop keyword lists
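This iterative synonym lookup can be sketched as a breadth-first traversal; the small synonym map below is a stand-in for queries to an authoritative dictionary such as Merriam-Webster:

```python
from collections import deque

# Hypothetical synonym map standing in for lookups in an authoritative
# dictionary; a real implementation would query a thesaurus.
SYNONYMS = {
    "risk": ["threat", "danger"],
    "threat": ["menace"],
    "danger": ["jeopardy", "trouble"],
}

def expand(seed, max_depth=2):
    """Breadth-first expansion: repeatedly look up synonyms of synonyms
    until max_depth rounds of lookups have been performed."""
    found = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        word, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for syn in SYNONYMS.get(word, []):
            if syn not in found:
                found.add(syn)
                frontier.append((syn, depth + 1))
    return found

print(sorted(expand("risk")))
```

The `max_depth` cutoff plays the role of the stopping rule described above: expansion halts before the added synonyms drift too far from the original concept.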

IBM Research-Almaden created a “human-in-the-loop” approach for AI dictionary

expansions (Alba, Gruhl, Ristoski, and Welch 2018). The approach not only discovers new

instances from an input text corpus but also predicts new “unseen” terms not currently in the corpus. The

approach runs in two phases. Continuing with the political word example, during the explore

phase, the model calculates a similarity score between words in the Merriam-Webster Dictionary

and the single word “politics” to identify instances in the dictionary that are similar to the word

“politics” such as “activism”, “legislature”, or “government”. In the exploit phase, the model

generates new phrases based on a word’s co-occurrence score, or how often words appear

together. For example, “government policy” may not appear in the Merriam-Webster Dictionary

but “political policies” and “science of government” appear often together and thus can be used

to build the more complex phrase “government policy”.


2.2.2. Latent Dirichlet Allocation (LDA)

Bag of words and LDA

Latent Dirichlet Allocation (LDA) is often used for dimensionality

reduction. Unsupervised LDA is useful for exploring unstructured text data by inferring

relationships between words in a set of documents. A common application of unsupervised LDA

is topic modeling. Given a sample of textual data and a pre-determined number of topics, K, an

LDA algorithm can generate K topics that best fit the data. Determining the appropriate number

of topics is somewhat arbitrary. The best practice is to review the textual sample to obtain a

feel for the contents, generate a word-frequency table to review the high-frequency words,

and then determine the number of topics in an informed manner. Figure 13 illustrates the

keywords for the topic “politics” that were developed using an unsupervised LDA algorithm.

Figure 13. Keywords for the topic “politics” generated using unsupervised LDA

In addition to unsupervised LDA, LDA can also be supervised. Supervised LDA requires

humans to read a small sample of the textual contents and label the topics for each textual input.

The labeled sample is used to train a model that predicts the topics of the remaining texts in the

sample. The self-supervised approach uses signals already in the data to generate labels. For


instance, we can use subsequent stock returns to label positive and negative keywords when

determining positive and negative keywords in earnings announcements (Figure 14).

Figure 14. Using supervised LDA to develop keyword lists
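The return-based labeling idea can be sketched on an invented sample of announcement snippets paired with subsequent returns; the texts, return values, and simple majority-count rule are all illustrative assumptions:

```python
from collections import Counter

# Hypothetical sample: earnings-announcement snippets paired with the
# stock return observed after the announcement.
announcements = [
    ("record revenue and strong demand", 0.04),
    ("weak demand and restructuring charges", -0.03),
    ("strong margins and record bookings", 0.02),
    ("restructuring and weak guidance", -0.05),
]

def label_keywords(samples):
    """Self-supervised labeling: tag a word positive (negative) if it
    appears more often in announcements followed by positive (negative)
    returns."""
    pos_counts, neg_counts = Counter(), Counter()
    for text, ret in samples:
        target = pos_counts if ret > 0 else neg_counts
        target.update(text.lower().split())
    positive = {w for w in pos_counts if pos_counts[w] > neg_counts[w]}
    negative = {w for w in neg_counts if neg_counts[w] > pos_counts[w]}
    return positive, negative

pos, neg = label_keywords(announcements)
print(sorted(pos))  # words associated with positive returns
print(sorted(neg))  # words associated with negative returns
```

Note that words appearing equally often on both sides, such as "and" here, receive no label, which is the desired behavior for uninformative words.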

2.3. Empirical examples: Analyzing corporate filings for making business decisions

An empirical example of analyzing 10-K filings


10-Ks and 10-Qs

When companies make an active change in 10-Ks, this often provides an important signal

about future operations (Brown and Tucker 2011; Cohen, Malloy, and Nguyen 2020), but Cohen

et al. (2020) document that investors tend to neglect the valuable information in the changes. If

an investor constructs a portfolio that shorts companies with significant changes in their 10-Ks or 10-Qs

and buys companies without significant changes, the investor could earn a return of

30 to 50 basis points per month over the following year.
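One simple way to quantify year-over-year document change (the studies above use several similarity measures; plain Jaccard similarity on word sets is shown here, with invented snippets):

```python
def jaccard_similarity(text_old, text_new):
    """Similarity between consecutive filings; a low score flags a
    large year-over-year change."""
    a = set(text_old.lower().split())
    b = set(text_new.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 1.0

last_year = "we face competition in domestic markets"
this_year = "we face competition and litigation in domestic and foreign markets"
change_score = 1 - jaccard_similarity(last_year, this_year)
print(round(change_score, 3))
```

Ranking companies by `change_score` each period and sorting them into long and short portfolios is the basic mechanic behind the trading strategy described above.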

Item 1 of 10-K

10-K Item 1 discusses companies’ products and services. Textual description in 10-K

Item 1 can be used to construct a stream of measures based on product similarity. Hoberg and

Phillips (2016) extract companies’ product descriptions from 10-K Item 1 and represent the usage

of words in each description with a binary vector. The cosine similarity score between a pair

of companies’ product descriptions then captures the similarity of the products between the two


companies. The product similarity measure is useful in evaluating the extent of competition a

company faces. If a large number of companies provide products or services highly similar to

those provided by a given company, then this company is likely to face fierce competition in the

product market. The measure can also be used to refine industry classification. Nowadays, many

companies provide products and services covering multiple traditional industries. For example,

Amazon Inc. is a retailer in the retailing industry, a streaming service provider in the

entertainment industry, an electronic device maker in the manufacturing industry, and a software

provider in the computer and business service industry. The product similarity measure provides

an avenue to define an “industry” for Amazon that consists of companies providing a similar set

of products and services rather than arbitrarily classifying Amazon Inc. into one of the traditional

industries. In a related vein, measuring the time-series similarity of Item 1 could help analysts

detect whether a company launches new products or services or implements new strategies.
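The binary-vector cosine similarity can be sketched as follows; the one-line product descriptions are invented for illustration:

```python
import math

def binary_vector(text, vocab):
    """Binary representation in the spirit of Hoberg and Phillips:
    1 if the word appears in the product description, 0 otherwise."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors (0 = disjoint, 1 = identical)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented one-line product descriptions for two companies.
desc_a = "online retail streaming devices cloud software"
desc_b = "online retail logistics cloud"
vocab = sorted(set(desc_a.split()) | set(desc_b.split()))
score = cosine(binary_vector(desc_a, vocab), binary_vector(desc_b, vocab))
print(round(score, 3))
```

Computing this score for every pair of companies yields the product-similarity network used to gauge competition and to define text-based industries.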

Item 1A of 10-K

Companies discuss risk factors impacting their business in Item 1A. Hanley and Hoberg

(2019) develop an emerging risks database for banks based on risk factor disclosures in Item 1A.

They employ topic modeling to obtain a 25-factor Latent Dirichlet Allocation (LDA) model

which is then used to extract 625 bigrams. Figure 15 provides an overview of the 25 emerging

risk topics and five most prevalent words in each topic. The bigrams are then converted into a set

of interpretable risk factors in the form of word vectors using semantic vector analysis. The

cosine similarity between the vocabulary list associated with each risk theme and the raw text of

a bank’s Item 1A disclosure reflects the intensity of the bank’s discussion of each emerging risk.

Using the risk loadings, Hanley and Hoberg (2019) show that risks related to real estate,

prepayment, and commercial paper were elevated as early as mid-2005, prior to the 2008 financial


crisis. They also find individual bank exposure to emerging risk factors strongly predicts stock

returns, bank failure, and return volatility.

Segment Information

Recruiting CEOs whose skills and attributes suit company needs is critical to company

success. The inherent difficulty in external CEO hiring arises from the immense heterogeneity of

both job candidates and companies. Central to recruiting is not only identifying competent

managers, but also maximizing quality of the match between companies and CEOs. Cao, Li, and

Ma (2022) find that segment information in 10-K filings helps companies find CEOs who fit

companies’ needs. For instance, Ford hired Alan Mulally as the CEO from Boeing’s

Commercial Airplanes in 2006. At that time, Ford was looking for a leader with experience in

turning around a troubled corporate giant. Alan Mulally happened to possess that experience as

revealed by the segment information disclosed in Boeing’s 1999 10-K (Figure 16). The

Commercial Airplanes segment of Boeing suffered a loss of $1,589 million in 1997 but earned a

profit of $2,016 million two years later. This experience was exactly what Ford valued in CEO

candidates.


Figure 15. Emerging risks using LDA with 25 topics

This figure provides an overview of the 25 risk factors detected using topic modeling from fiscal year 2006 10-Ks of
banks. Each box is ranked and sized relative to its importance in the document and contains the five most prevalent
words or commongrams in the topic (Hanley and Hoberg 2019).

Figure 16. Segment information in Boeing’s 1999 10-K


Appendix 2A: Project 1a How to crawl annual reports


Download ten 10-K filings and twenty 8-K filings of a company of your choice using the SEC
API. Randomly read 10 files that you download and check whether the downloaded filings are
correct and complete. Write a document explaining your code.
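A possible starting point is sketched below. It assumes the SEC's public data.sec.gov submissions endpoint; the CIK shown (Apple's) and the User-Agent string are illustrative placeholders, and the SEC requires a descriptive User-Agent with contact information. The network call is kept under the main guard so the helpers can be studied separately.

```python
import json
import urllib.request

def submissions_url(cik):
    """Build the SEC EDGAR submissions URL for a 10-digit zero-padded CIK."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"

def recent_filings(cik, form_type, limit, user_agent):
    """Return (accession_number, primary_document) pairs for a given form type."""
    req = urllib.request.Request(
        submissions_url(cik),
        headers={"User-Agent": user_agent},  # SEC requires an identifying User-Agent
    )
    with urllib.request.urlopen(req) as resp:
        recent = json.load(resp)["filings"]["recent"]
    rows = zip(recent["form"], recent["accessionNumber"], recent["primaryDocument"])
    return [(acc, doc) for form, acc, doc in rows if form == form_type][:limit]

if __name__ == "__main__":
    # Network call: list the ten most recent 10-K filings for a chosen company.
    print(recent_filings("320193", "10-K", 10, "Your Name your.email@example.com"))
```

Each returned accession number can then be used to download the filing document itself from the EDGAR archives.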
Appendix 2B: Project 1b How to parse unstructured data
Parse Item 1, Item 1A, and Item 7 of the ten 10-K filings you download. Read the output and
check whether the output is complete and accurate by comparing with the original files. Write a
document explaining your code.
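A minimal item-extraction sketch using regular expressions is shown below. The filing text is a synthetic fragment; real 10-Ks vary widely in formatting (tables of contents, HTML markup, inline XBRL), so production code needs more robust patterns than this.

```python
import re

def extract_item(text, start_pat, end_pat):
    """Extract the text between two item headings (case-insensitive)."""
    pattern = re.compile(
        rf"{start_pat}.*?(?={end_pat})",  # non-greedy match up to the next heading
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(0).strip() if match else None

# Synthetic 10-K fragment for illustration.
filing = (
    "ITEM 1. BUSINESS We make widgets. "
    "ITEM 1A. RISK FACTORS Demand may fall. "
    "ITEM 1B. UNRESOLVED STAFF COMMENTS None."
)
risk_factors = extract_item(filing, r"ITEM\s+1A\.", r"ITEM\s+1B\.")
```

Comparing the extracted text against the original file, as the project asks, is the practical check that the patterns handled that filing's formatting correctly.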


References

Alba, A., Gruhl, D., Ristoski, P., and Welch, S. 2018. Interactive Dictionary Expansion Using

Neural Language Models. Second International Workshop on Augmenting Intelligence

with Humans-in-the-Loop.

Brown, S., and Tucker, J. 2011. Large-sample Evidence on Firms’ Year-over-Year MD&A

Modifications. Journal of Accounting Research, 49(2), 309-346.

Cao, S., Ma, G., Tucker, J., and Wan, C. 2018. Technological Peer Pressure and Product

Disclosure. The Accounting Review, 93(6), 95-126.

Cao, S., Jiang, W., Yang, B., and Zhang, A. 2022. How to Talk When a Machine is Listening?

Corporate Disclosure in the Age of AI. Review of Financial Studies, forthcoming.

Cohen, L., Malloy, C., and Nguyen, Q. 2020. Lazy Prices. Journal of Finance, 75(3), 1371-1415.

Du, Z., Huang, A.G., Wermers, R., and Wu, W. 2022. Language and domain specificity: A

Chinese financial statement dictionary. Review of Finance, 26(3), 673-719.

Hassan, T., Hollander, S., Lent, L., and Tahoun, A. 2019. Firm-level Political Risk:

Measurement and effects. Quarterly Journal of Economics, 134(4), 2135-2202.

Harris, Z. 1954. Distributional Structure. Word, 10(2-3), 146-162.

Loughran, T., and McDonald, B. 2011. When Is a Liability Not a Liability? Textual Analysis,

Dictionaries, and 10-Ks. Journal of Finance, 66(1), 35-65.


Chapter 3 Emerging AI technology in textual analysis

3.1. Procedures of applying machine learning models

3.1.1 Data cleaning, parsing, and feature selection

The initial step in building machine learning models is to preprocess the raw data, or data

cleaning, which is essential for improving data quality. This process involves various tasks such

as eliminating redundant entries and boilerplate text, handling missing data and outliers, and

rectifying data that is improperly formatted. These operations can significantly improve the

performance of the final model since they ensure that the model learns from consistent and

relevant data. For instance, when working with analyst reports, it is a common practice to

remove sections such as the analyst disclaimer, as they typically offer minimal or no value for

machine learning tasks. By removing such “boilerplate” text, the machine can better discern

patterns by focusing its attention on the report’s essential content.

The next key aspect of the model-building process is data parsing, which involves

converting data from one format to another. The objective is to transform unstructured raw data

into a unified structured representation that machines can easily comprehend and utilize. For

instance, consider an HTML webpage. By parsing the HTML, we can transform it into organized

formats like CSV or JSON, simplifying the extraction of specific details from the data. Regular

expressions are commonly employed to extract specific patterns or sequences of characters from

the data so as to further enhance the data organization and usability.
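As an illustration of parsing, the standard-library html.parser can turn a small HTML table into a list of dictionaries, a JSON-ready structure. The HTML snippet and its values are made up for this example.

```python
import json
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the cell text of an HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# Made-up HTML table; real filings and web pages are far messier.
page = "<table><tr><th>ticker</th><th>revenue</th></tr><tr><td>AMZN</td><td>1000</td></tr></table>"
parser = TableParser()
parser.feed(page)
header, *records = parser.rows
as_json = json.dumps([dict(zip(header, row)) for row in records])
```

The same pattern scales to extracting specific fields from many pages and writing them out as CSV or JSON for downstream analysis.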

Building machine learning models also involves feature selection. Feature selection is the

process of identifying input variables that are important for building a high-performing model.

The inclusion or exclusion of relevant features has a profound impact on the quality of the

model's output. As the saying goes, "garbage in, garbage out," emphasizing that the output's


reliability is inherently tied to the quality of the input. If a model is trained on a dataset that

contains numerous irrelevant features, it is prone to producing unreliable or erroneous results.

Domain knowledge is crucial in feature selection because it provides valuable insights that

inform the selection process. Experts with a deep understanding of the subject matter can

leverage their knowledge and experience to identify key features. For instance, Cao, Jiang,

Wang, and Yang (2021) choose several firm-level, industry-level, and macroeconomic variables, as well as

textual information from firms’ disclosures as inputs for their “AI analyst” model. The

researchers base their selection on the knowledge that prior studies have demonstrated a strong

correlation between these variables and earnings forecasts. This informed approach underscores

the importance of leveraging domain expertise when choosing relevant features for machine

learning models.
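One simple feature-screening heuristic is to rank candidate features by the absolute Pearson correlation with the target. This is only an illustration of the idea, not the procedure used by Cao, Jiang, Wang, and Yang (2021), and the toy data below are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(features, target, k):
    """Return the k feature names most correlated (in absolute value) with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data: a target series (e.g., earnings) and three candidate features.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "lagged_earnings": [0.9, 2.1, 2.9, 4.2, 4.8],  # strongly related
    "industry_growth": [1.0, 1.0, 2.0, 2.0, 3.0],  # moderately related
    "noise": [5.0, 1.0, 4.0, 2.0, 3.0],            # unrelated
}
top_features = rank_features(features, target, 2)
```

Univariate correlation is a blunt instrument (it misses interactions and nonlinearity), which is exactly why domain expertise remains central to feature selection.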

3.1.2. Machine learning model selection

Once the data has undergone preprocessing, the next step is building machine learning

models. This process is not merely a random selection but rather a systematic approach to

finding the most suitable model for the given data and problem at hand.

Model selection heavily relies on our understanding of the data characteristics and the

relative strengths and limitations of different models. Since various models excel at different

types of tasks, we can make an initial model selection based on our knowledge and experience of

each model’s capabilities, strengths, and weaknesses. For instance, random forest models are

frequently used for classification tasks due to their ability to handle both numerical and

categorical data and their resistance to overfitting. Long short-term memory (LSTM) models are

particularly effective for time series analysis as they can capture long-term dependencies in the

data. Transfer-learning models, on the other hand, are commonly used for tasks where previously


acquired knowledge can be utilized, such as image or natural language processing. For example,

Cao, Jiang, Wang, and Yang (2021) start with two versatile quasi-linear ML models, Elastic-Net

and Support Vector Regressions, that are adept at tasks with a large number of variables. They

then add on three highly nonlinear machine learning models, Random Forest, Gradient Boosting,

and Long Short-Term Memory (LSTM) Neural Networks. Random Forest and Gradient

Boosting can both capture complex and hierarchical interactions among the input variables while

the LSTM model is designed to model time-series patterns in the data. By doing so, the

researchers align the machine learning models with the specific characteristics of the data and

their knowledge of each model's advantages.

Once we have selected our initial set of models, we proceed to execute each model and

evaluate their performance. This process serves the purpose of quantifying the effectiveness of

each model and validating our initial choices. It is important to note that ensemble models have

the ability to combine knowledge from multiple models, often resulting in improved outcomes

compared to relying on a single model alone. Therefore, leveraging an ensemble model allows us

to integrate predictions from the top-performing models. Cao, Jiang, Wang, and Yang (2021) use

an ensemble comprising the three best-performing models as their primary model. By doing so,

they can harness the collective strengths and insights of these models, enhancing the overall

predictive power and reliability of their analysis. Just like feature selection, model selection in

machine learning also relies on domain knowledge. It is particularly crucial when deciding the

appropriate level of analysis to apply the model, such as individual, industry, or market level. In

this regard, researchers must draw upon their deep understanding of the research question and

the specific domain. By leveraging their expertise, they can make informed decisions on the


scope and granularity of the analysis, ensuring that the selected model aligns with the objectives

and requirements of the study.
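A minimal illustration of ensembling is to average the predictions of several models. The numbers below are invented, and this equal-weight average is only one of many combination schemes; it is not necessarily the exact scheme used by Cao, Jiang, Wang, and Yang (2021).

```python
def ensemble_predict(predictions):
    """Equal-weight average of per-model prediction lists."""
    n_models = len(predictions)
    return [sum(p) / n_models for p in zip(*predictions)]

# Hypothetical earnings forecasts from three models for two firms.
random_forest = [1.10, 2.40]
gradient_boosting = [1.00, 2.60]
lstm = [1.20, 2.30]
combined = ensemble_predict([random_forest, gradient_boosting, lstm])
```

Averaging tends to cancel out the idiosyncratic errors of individual models, which is the intuition behind ensembles often outperforming any single member.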

3.1.3 Hyperparameter tuning

Grid search

Grid search can be employed to find the best combination of hyperparameters for a

machine learning model. Grid search derives its name from the grid-like structure it creates,

where each dimension represents a different hyperparameter. It systematically evaluates all

possible combinations by iterating through the grid.

Implementing grid search involves first identifying the hyperparameters that require

tuning and defining the range of values to explore for each hyperparameter. As an example, we

can specify a list of learning rates like [0.001, 0.005, 0.01] and a list of batch sizes like [30, 50,

60]. By generating a grid that encompasses all possible hyperparameter combinations, the model

can be trained and evaluated for each combination within a loop. Ultimately, the best model is

determined by identifying the hyperparameter combination that yields the highest performance.
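The loop described above can be sketched as follows. The evaluate() function here is a stand-in for actually training a model with the given hyperparameters and returning its validation performance; its formula below is artificial, chosen only so the example runs.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every hyperparameter combination; return the best one and its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# The grid from the text: three learning rates x three batch sizes = nine combinations.
param_grid = {"learning_rate": [0.001, 0.005, 0.01], "batch_size": [30, 50, 60]}

def evaluate(params):
    # Artificial stand-in for train-and-validate; peaks at lr=0.005, batch_size=50.
    return -abs(params["learning_rate"] - 0.005) - abs(params["batch_size"] - 50) / 1000

best_params, best_score = grid_search(param_grid, evaluate)
```

Because grid search evaluates every combination, its cost grows multiplicatively with each added hyperparameter, which is why coarse grids are usually refined iteratively.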

Cross-validation

Cross-validation is a widely used technique in machine learning for evaluating model

performance. Generally, the available data is divided into three main subsets: the training set, the

validation set, and the test set. The models are trained on the training set, fine-tuned using the

validation set, and then evaluated for accuracy on the test set. This approach enables us to assess

how effectively the trained model generalizes to unseen data, thus mitigating the risk of

overfitting, where the model performs well on the training data but fails to generalize.

Figure 1 illustrates the process of k-fold cross-validation, which is the most common

form of cross-validation. In k-fold cross-validation, the data is divided into k equal-sized folds.


The model is trained k times, with each iteration using k-1 folds as the training set and a different

fold as the validation set. After training the model k times, we obtain k individual evaluation

scores, and the average of these scores represents the overall performance of the model.

Employing k-fold cross-validation allows for efficient utilization of the available data,

particularly when the dataset size is limited.

For example, suppose we have 10 million photos of pets, some of which are photos of dogs. We

would like to identify the photos of dogs. What we have is a set of 10,000 photos already labeled

as photos of dogs or not photos of dogs. To apply a five-fold cross-validation, we could divide

the 10,000 labeled photos into five folds with 2,000 in each subset. We then build a classification

model using four folds as the training set and the fifth fold as the validation set. After training the

model five times, we obtain five individual evaluation scores whose average reflects the overall

performance of the model.
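The fold construction described above can be sketched as follows; the helper assumes the sample count divides evenly by k, as in the 10,000-photo example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    fold_size = n_samples // k  # assumes n_samples is divisible by k
    indices = list(range(n_samples))
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation

# Five folds of 2,000 photos each, as in the example above.
splits = list(k_fold_indices(10_000, 5))
```

In each of the five iterations the model trains on 8,000 labeled photos and validates on the remaining 2,000, so every photo is used for validation exactly once. In practice the data are usually shuffled (or stratified by label) before splitting.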

Figure 1. Five-Fold cross-validation


3.1.4. Model evaluation

There are several metrics to evaluate the performance of classification machine learning

models, including accuracy, precision, recall, and F1 score. To gain a better understanding of

these measures, let us start by exploring the confusion matrix and related terms.

A confusion matrix provides a comprehensive summary of the model's predictions and

their alignment with the actual labels of a test dataset. The confusion matrix displays the counts

of four categories of classification results: true positives (TP), true negatives (TN), false

positives (FP), and false negatives (FN). A true positive (TP) refers to a test result where the

model correctly predicts a positive condition, while a true negative (TN) indicates a test result

where the model correctly predicts a negative condition. On the other hand, a false positive (FP)

occurs when the model incorrectly predicts a positive outcome, but the actual outcome is

negative, and a false negative (FN) occurs when the model incorrectly predicts a negative

outcome, but the actual outcome is positive. Figure 2 depicts a 2x2 confusion matrix that

represents the four possible outcomes:

Figure 2. A confusion matrix

Using the above components of the confusion matrix, we can calculate four evaluation

metrics: accuracy, precision, recall, and F1 score.


Accuracy

Accuracy measures the overall correctness of the model's predictions. It is defined as the

total number of correct predictions (i.e., true positives and true negatives) divided by the total

number of predictions. The higher the accuracy, the better the model is at making correct

predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision

Precision focuses on the proportion of true positive predictions out of all positive

predictions made by the model. It is defined as the number of true positives divided by the total

number of positive predictions. The higher the precision, the lower the likelihood of a model

falsely labeling negative instances as positive.

Precision = TP / (TP + FP)

Recall

Recall evaluates how well the model identifies the true positive cases. It is defined as the

number of true positives divided by the total number of true positives and false negatives.

Recall = TP / (TP + FN)

F1 Score

The F1 score combines precision and recall into a single metric, showing a model's

ability to handle both false positives and false negatives. It is calculated as follows:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)


When working with imbalanced data, we often encounter a challenge called the

“Accuracy Paradox”. This challenge arises when we rely solely on accuracy as a metric, which

can be misleading and lead to incorrect conclusions. In this case, precision is an important metric

to consider.

Let's take a look at an example where we have an imbalanced dataset for spam detection

in email. In this dataset, 98% of the emails are not spam (negative), while only 2% are identified

as spam (positive). Now, let's say we build a classification model specifically designed to detect

spam. The model yields the following results:

o True Positives (TP): 150

o False Positives (FP): 50

o True Negatives (TN): 9800

o False Negatives (FN): 50

If we calculate the accuracy of the model, we will have:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (150 + 9800) / (150 + 9800 + 50 + 50)

= 99.0%.

At first glance, this high accuracy may appear impressive. However, it's important to take

into account the context of the imbalanced dataset we are dealing with. In this dataset, the

majority of emails (approximately 98%) are non-spam, indicating that even if we were to classify

all emails as non-spam, we would still attain a high accuracy due to the significant number of

true negatives. Therefore, when evaluating the performance of the model, it is important to

consider additional metrics that provide a more comprehensive assessment of its effectiveness,

especially in the presence of imbalanced class distributions.

To gain further insights, let's calculate the precision of the model. We will get:


Precision = TP / (TP + FP) = 150 / (150 + 50) = 75%.

In this case, the precision is 75%: among all the emails predicted as spam, only 75% of

them are truly spam. This highlights the significance of computing precision in an imbalanced

dataset, as the presence of false positives can incur substantial costs or consequences. Thus,

precision serves as a valuable evaluation metric, particularly in situations where the dataset

exhibits class imbalance.
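All four metrics for the spam example can be reproduced with a short helper:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts from the spam-detection example above.
metrics = classification_metrics(tp=150, tn=9800, fp=50, fn=50)
```

Running this confirms the gap the accuracy paradox creates: accuracy is about 99%, while precision and recall are both 75%, a far more honest picture of the spam detector's quality.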

3.2. Pre-trained phrase-level word embedding



Textual representation

The methods for textual analysis discussed thus far have primarily relied on word

frequency. Frequency-based techniques ignore syntax and contextual meaning, which can result

in inaccurate analysis. As an example, a frequency-based algorithm would treat “many persons”

and “people” as two unrelated textual inputs because it is not able to recognize that “person” and

“people” share a similar meaning. To address this limitation of frequency-based techniques, word-

embedding was developed to incorporate meaning into textual analysis. This method represents

a word with a semantic vector, or a new bag of words related to the word of interest. Word-

embeddings are based on Zellig Harris’s “distributional hypothesis,” which posits that words

used in proximity to one another typically have similar meanings (Harris, 1954). To create a

semantic vector for a given phrase, such as “cash flow”, a word-embedding algorithm selects

words from surrounding text inputs that can accurately predict the presence of the phrase. For

example, in the text input “earnings present cash flow, which helps future investment” (Figure 3),

“investment”, “earnings”, “present”, “which”, “help”, and “future” are all adjacent to

“cash flow”. The algorithm would select “investment” and “earnings” for the semantic vector

representing “cash flow”, but not “which” or “help” as “which” or “help” could be adjacent to a


large number of phrases other than “cash flow”. In other words, when “cash flow” is concealed,

“earnings” and “investment” can relatively accurately predict the presence of “cash flow” but

“which” or “help” cannot. Hence, “cash flow” can be represented by the word vector [earnings,

investment].

Figure 3. Textual representation

Advantages of phrase-level word-embedding

Using textual representation, we can train algorithms to understand semantic

relationships between words in the same way as humans do. As discussed earlier, a frequency-

based algorithm would fail to identify that “person” and “people” share similar meanings.

However, word embedding creates comparable semantic vectors for “person” and “people”.

For example, “person” could be represented by a word vector [human, man, woman, men,

women, they] and “people” might be represented by a word vector [human, men, women, they].

Such word embeddings allow an algorithm to recognize that the two words share a similar meaning through

semantic computation.
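This semantic computation can be illustrated by treating the two vocabularies as binary vectors and computing their cosine similarity; the context words for “person” and “people” are the illustrative ones from the text.

```python
import math

def binary_cosine(words_a, words_b):
    """Cosine similarity of two context-word sets treated as binary vectors."""
    overlap = len(set(words_a) & set(words_b))
    return overlap / math.sqrt(len(set(words_a)) * len(set(words_b)))

person = ["human", "man", "woman", "men", "women", "they"]
people = ["human", "men", "women", "they"]
similarity = binary_cosine(person, people)  # high: the vocabularies largely overlap
```

Because four of the context words are shared, the similarity is high, whereas two words with disjoint context vocabularies would score zero, exactly the behavior a frequency-based method cannot deliver.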

3.3. Pre-trained sentence level-word embedding



Although phrase-level word embedding takes into consideration the meaning of words, it

ignores the other rich information contained in sentences, resulting in potential problems in

textual analysis. First, the same word often has multiple meanings depending on the context it is

used in. Word-embedding cannot distinguish between these different meanings of the same

word, such as “liability”, which can mean a responsibility or burden in general language but

denotes a neutral line item representing resources contributed by creditors in financial


statements. Therefore, the word “liability” should be represented by different semantic vectors in

the two contexts. Second, the meanings of words often change over time. For example, in the

1850s, “broadcast” was commonly used in agriculture, but today it has more associations with

media. Thus, a semantic vector representing “broadcast” from the 1850s would likely include

words such as “sow” and “seed,” while a semantic vector representing “broadcast” for the

modern usage would include such words as “television”, “radio”, and “newspapers”.

Sentence embedding solves these problems by mapping sentences to vectors. Sentence

embedding can be achieved based on hidden layer outputs of transformer models. It can also be

achieved via aggregating word embeddings into sentence embeddings.
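The aggregation approach can be sketched as an element-wise average of word vectors. The two-dimensional vectors below are made up for illustration; real embeddings have hundreds of dimensions.

```python
def sentence_embedding(sentence, word_vectors):
    """Average the vectors of a sentence's words (a simple aggregation scheme)."""
    vectors = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Made-up 2-dimensional word vectors.
word_vectors = {
    "cash": [1.0, 0.0],
    "flow": [0.0, 1.0],
    "rises": [0.5, 0.5],
}
embedding = sentence_embedding("cash flow rises", word_vectors)
```

Averaging is the simplest aggregation scheme; it ignores word order, which is one reason transformer-based sentence embeddings generally perform better.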

Figure 4. The changing meaning of “broadcast” over time

3.3.1 Bidirectional Encoder Representations from Transformers

One of the most advanced tools for natural language processing (NLP) is Bidirectional

Encoder Representations from Transformers (BERT), which was developed by Google. BERT

has its origins in pre-training contextual representations. BERT was trained on two tasks,

namely, language modeling and next sentence prediction, on the Toronto BookCorpus

and English Wikipedia. In language modeling, 15% of words were selected for prediction, and

the training objective was to predict the selected word given its context. The selected word is

masked with probability of 80%, replaced with a random word with probability of 10%, and not

replaced with probability 10%. For example, the sentence “he is nice” has three words and the

third word “nice” is selected for prediction. The input text would be “he is [MASK]” with

probability of 80%, “he is kind” with probability of 10%, and “he is nice” with probability of

10%.
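The 80/10/10 corruption rule can be simulated in a few lines. This is a sketch of the training-data masking step only, not of BERT itself, and the vocabulary is a toy one.

```python
import random

def corrupt_token(token, vocab, rng):
    """Apply BERT's masking rule to one selected token: 80% [MASK],
    10% a random vocabulary word, 10% left unchanged."""
    r = rng.random()
    if r < 0.8:
        return "[MASK]"
    elif r < 0.9:
        return rng.choice(vocab)
    return token

vocab = ["he", "is", "nice", "kind", "tall"]
rng = random.Random(0)  # seeded for reproducibility
outcomes = [corrupt_token("nice", vocab, rng) for _ in range(10_000)]
masked_share = outcomes.count("[MASK]") / len(outcomes)  # close to 0.8
```

Over many draws the selected word "nice" becomes "[MASK]" about 80% of the time, a random word about 10% of the time, and stays unchanged otherwise, matching the "he is [MASK]" / "he is kind" / "he is nice" example above.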

Next Sentence Prediction (NSP) training enables the model to understand how sentences

relate to each other, i.e., whether sentence B follows sentence A. As previously discussed,

context-free phrase-level word-embedding models create a single word embedding

representation for each word in the vocabulary. The main advantage of BERT is its use of

bidirectional learning: by predicting masked words from both the left-to-right and the

right-to-left context simultaneously, BERT takes into account the context

of each occurrence of a given word. For example, it can understand the semantic meanings of

bank in the following sentences: “I went to the bank to deposit a check” and “We walked along

the river bank.” To understand this, BERT uses the clue “deposit a check” to the right of “bank”

and the clue “river” to its left. As a result of this training process, BERT learns contextual

embeddings for words. Once

the pre-training is complete, the same model can be fine-tuned for a variety of downstream tasks.

BERT’s architecture is based on “transformer”, a deep learning model primarily used in

natural language processing (NLP). A unique advantage of the transformer is its ability to rely

entirely on self-attention to compute representations of its input and output. There are 12

encoders with 12 bidirectional self-attention heads and 110 million parameters in the “BERT

base” model. “BERT large” employs 24 encoders, 16 bidirectional attention heads, and 340

million parameters.


3.3.2 Generative Pre-trained Transformers

GPT (Generative Pre-trained Transformer) is a series of language models developed by

OpenAI. The GPT series consists of five major versions: GPT-1.0, GPT-2.0, GPT-3.0, GPT-3.5,

and GPT-4.0. GPT-1.0 was the first model released by OpenAI in 2018, representing a

significant breakthrough in the field of natural language processing. It had 117 million

parameters and was pre-trained on a large corpus of text data, making it highly effective at

understanding and generating natural language. GPT-1.0 was primarily used for language

translation, text completion, and question answering tasks. GPT-2.0 was released in 2019. As a

significant improvement over its predecessor, it had 1.5 billion parameters. This allowed it to

perform much more complex natural language processing tasks, including story generation, text

summarization, and even image captioning. GPT-3.0 was released in 2020 and was an

autoregressive language model using a transformer architecture with 175 billion parameters,

making it one of the largest language models ever developed. GPT-4 is the latest and largest in a

series of GPT models. It is reportedly several times larger than GPT-3, with approximately one

trillion parameters.

Differences between GPT and BERT

Architecture

BERT and GPT use different machine-learning models. As discussed earlier, BERT is

designed for bidirectional context representation, which means it processes text from both left-

to-right and right-to-left, allowing it to capture context from both directions. This allows BERT

to better understand the context and meaning of a sentence. Unlike BERT models, GPT is an

autoregressive model generating text sequentially from left to right, predicting the next word in a


sentence based on the words that came before it. This allows GPT to generate highly coherent

and natural-sounding text.

Training data

As discussed earlier, BERT is trained using a masked language model on a large-scale

corpus of text. The original BERT model was trained on the English Wikipedia and BookCorpus, a dataset

containing approximately 11,000 unpublished books, which amounts to about 800 million words.

Conversely, GPT-3 was trained on a much larger and broader mix of corpora, including an

expanded version of the WebText dataset, Wikipedia, books, and articles, as well as Common

Crawl, a publicly available archive of web content.

Pre-training approach

GPT is a generative model, meaning that it is trained to predict the next word in a

sentence or generate a new sentence from scratch. This pre-training approach allows GPT to

excel at tasks such as language generation and text completion. On the other hand, BERT is a

discriminative model, meaning that it is trained to classify whether a given sentence is coherent

or not. This pre-training approach allows BERT to excel at tasks such as sentiment analysis and

text classification.

Usability

To use BERT, you typically have to download a pre-trained model and set up a

development environment, for example in Google Colab with TensorFlow or PyTorch.

don't want to worry about using a Jupyter Notebook or aren't as technical, you could consider

using ChatGPT to leverage the GPT model, which is as simple as just logging into a website.


Application of GPT

GPT has been widely incorporated into various business applications, resulting in various

domain-specific generative AI applications. For instance, Salesforce developed Einstein GPT, the

world’s first generative AI for customer relationship management. Bloomberg also built its

own generative AI, BloombergGPT, because the complexity and unique terminology of the

financial domain warrant a domain-specific model. BloombergGPT is a large-scale generative

artificial intelligence model trained on a wide range of financial data to support a diverse set of

NLP tasks within the financial industry. Here are three examples of future applications of GPT in

the field of accounting and finance:

Language generation and modeling

GPT can be used to build language models that generate new text in a

particular style or genre. It is particularly good at generating natural-sounding language, making

it useful for applications such as chatbots, language translation, and content creation. GPT can be

used to automate financial reporting processes by generating financial statements, analysis

reports, and other financial documents in a more efficient and accurate manner.

Text summarization and analysis

GPT can be used to summarize large volumes of text, such as news articles or research

papers, into shorter summaries that capture the most important information. GPT can be used to

analyze the sentiment of a given piece of text, allowing businesses to monitor customer feedback

and sentiment towards their products or services. This could be a useful tool for financial

analysts.


Quantitative forecasting and analysis

GPT can be used to analyze financial data, make predictions about future trends and

outcomes, and identify potential risks associated with investments, loans, and other financial

products. This can help companies to make informed decisions about investments, budgeting,

and other financial matters. It can also help companies mitigate risks, identify potential

instances of fraud or financial irregularities and ensure compliance with regulations and industry

standards.

Other application of AI in accounting and finance

In recent years, the investment management industry has experienced a rapid surge in the

adoption of AI and machine learning technologies. These technologies have found applications in

various areas within the industry, such as identifying trading patterns for generating alphas,

streamlining customer support and prospect identification, and managing risk exposure. One

notable example of AI and machine learning implementation is Kensho Technologies, a

Massachusetts-based startup. Kensho developed an algorithm named "Warren" (in honor of

Warren Buffett), which utilizes big data from capital markets and applies machine learning to

discover correlations and exploit arbitrage opportunities. Another illustration comes from the

hedge funds that have embraced AI and algorithmic trading. According to a 2018 survey conducted

by BarclayHedge with 2,135 hedge fund professionals, 56% of respondents reported using

AI/machine learning in their investment strategies, with the primary application being idea

generation and portfolio construction.

As AI continues to expand its presence in industrial applications, privacy and ethical

concerns have also garnered public attention. In response to privacy concerns, federated learning

techniques have emerged. Unlike traditional centralized machine learning methods, federated


learning enables AI algorithms to be trained without sharing or transmitting sensitive data. Overall,

the growth of AI in the investment management industry brings significant opportunities.

Competitive advantages of man and machine

The rapid advancement of AI technologies makes people wonder: will machines replace humans in the near future? Although AI technologies are increasing the capabilities of machines exponentially, humans and machines each have unique competitive advantages. These advantages determine the tasks in which humans could be replaced by AI.

Firstly, human intelligence can be divided into reasoning-based and probability-based intelligence. Reasoning-based intelligence involves making correct decisions through logical deduction, which is inherently challenging for AI. As revealed by Cao et al. (2022), AI underperforms humans in analytical tasks with limited data, where reasoning-based intelligence must be relied upon. Probability-based intelligence, on the other hand, refers to the ability to make decisions based on probabilities, and AI unarguably outperforms humans in tasks that require it. Consistent with this notion, Cao et al. (2022) find that AI models can understand and analyze large amounts of numerical and textual data and make decisions based on these data. Consequently, humans might be replaced by AI in tasks that rely mostly on probability-based intelligence.

Secondly, spiritual pursuit is unique to humans; machines do not possess it. While AI cannot understand spiritual pursuit, it can assist in creating related artifacts, such as constructing a church. Similarly, humans have emotions, but AI does not, and this is one of the most significant differences between humans and machines. Machines can provide objects that help humans generate positive emotions, such as humanoid robots that can


replace a deceased loved one and provide companionship. However, AI is unable to replace humans in tasks that involve emotions, such as artistic creation and expression.

Thirdly, curiosity is a crucial trait of humans that machines currently do not possess.

Curiosity motivates humans to acquire and accumulate knowledge that allows the development

and advancement of AI technologies. If AI becomes curious one day, it might be able to

create a new generation of AI.

In conclusion, AI has the potential to replace humans in certain tasks, such as those

requiring probability-based intelligence. However, AI cannot completely replace humans in tasks that

involve unique human traits, such as reasoning-based intelligence, spiritual pursuit, emotions,

and curiosity.


Appendix 3: Evaluating machine learning models


Dataset 1 and dataset 2 contain the validation outcomes of two machine learning models for
detecting spam emails.
(1) Review sample dataset 1, and calculate accuracy, precision, recall, and F1 score.
(2) Review sample dataset 2, and calculate accuracy, precision, recall, and F1 score.
(3) Which model do you think performs better? Explain your reasoning.
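One way to approach the exercise is to express the four metrics as a small function of confusion-matrix counts, treating spam as the positive class. The counts in the usage line are illustrative, not taken from either dataset.

```python
# Sketch: accuracy, precision, recall, and F1 from confusion-matrix counts
# (spam = positive class). tp/fp/fn/tn are true/false positives/negatives.

def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict:
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts (not dataset 1 or 2 from the exercise):
print(evaluate(tp=40, fp=10, fn=20, tn=30))
```

Comparing models on all four numbers, rather than accuracy alone, matters when spam and non-spam are imbalanced.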


References

Cao, S., Jiang, W., Wang, J.L., and Yang, B. 2021. From Man vs. Machine to Man+Machine:

The Art and AI of Stock Analyses. Working paper.


Chapter 4 Analyzing Earnings Conference Calls

4.1. Data structure in earnings conference calls

Data structure of conference calls

Most U.S. public companies host a quarterly conference call, typically within a month

after the end of the fiscal quarter, to discuss their financial performance with investors.

Participants of these calls include key company executives, investors, and financial analysts.

During a conference call, company executives review financial information and discuss major

issues impacting company performance in the previous quarter. They also provide insights into

the company’s expectations for the upcoming quarters. Semi-formal presentations by company

executives are followed by question-and-answer sessions where investors and financial analysts

can ask questions about any area that requires further elaboration.

In the past, earnings conference calls were only available to professional financial

analysts and institutional investors. Nowadays, almost all public companies use online streaming

to broadcast conference calls to average investors, or provide audio recordings that can be

accessed on demand. Furthermore, various online stock research sites offer access to earnings

conference call transcripts. The widespread access to earnings conference call audios and

transcripts creates opportunities to employ AI and machine learning methods to perform timely

and in-depth information processing.
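A first preprocessing step for such analysis is segmenting a transcript by speaker. Below is a minimal stdlib Python sketch, assuming a plain-text transcript in which each speaker turn opens with a "Name -- Title" header line; real transcript formats vary by provider.

```python
import re

# Sketch: split a plain-text call transcript into speaker turns.
# Assumes each turn begins with a "Name -- Title" header line; actual
# transcript layouts differ across data providers.

HEADER = re.compile(r"^(?P<name>[A-Z][\w.' -]+) -- (?P<title>.+)$")

def split_turns(transcript: str) -> list[tuple[str, str]]:
    """Return a list of (speaker, text) pairs in transcript order."""
    turns, speaker, buf = [], None, []
    for line in transcript.splitlines():
        m = HEADER.match(line.strip())
        if m:
            if speaker is not None:
                turns.append((speaker, " ".join(buf).strip()))
            speaker, buf = m.group("name").strip(), []
        elif speaker is not None:
            buf.append(line.strip())
    if speaker is not None:
        turns.append((speaker, " ".join(buf).strip()))
    return turns

sample = """Brett Iversen -- Vice President, Investor Relations
Good afternoon and thank you for joining us.
Satya Nadella -- Chief Executive Officer
Thank you, Brett. We are off to a strong start."""
print(split_turns(sample))
```

Once turns are separated, the prepared remarks and the Q&A session can be analyzed independently.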

As an example, let us take a look at the transcript of Microsoft’s earnings conference call

for the first quarter of fiscal year 2023. The call was held on October 25, 2022. Brett Iversen, the

Vice President of Investor Relations at Microsoft, hosted the call; other Microsoft participants included Satya Nadella, the CEO; Amy Hood, the CFO; Alice Jolla, the chief accounting officer; and Keith Dolliver, the deputy general counsel.


Figure 1. Introduction in Microsoft’s FY 23 Q1 earnings conference call

Following Brett Iversen’s introduction of the Microsoft participants and overview of the

structure and principles of the earnings conference call, Satya Nadella, the CEO, took over to

provide a high-level summary of Microsoft’s strategies, financial performance of major business

units such as Microsoft Cloud, and expectations for the upcoming quarters.

Figure 2. CEO remarks in Microsoft’s FY 23 Q1 earnings conference call

After Satya Nadella’s high-level summary, the CFO, Amy Hood, shared detailed

financial information and provided her interpretation of the data from the company’s perspective.

For example, she explained that increased operating expenses were primarily driven by the

growth in headcount, while a shift in sales mix and foreign exchange impact negatively affected

the operating margin. She also provided an outlook for the second quarter of the fiscal year, both

at the company and the segment level.


Finally, the floor was open to questions from the other participants in the earnings

conference call. During this call, eight investors and analysts asked questions. These questions

covered both Microsoft’s operating and financial decisions. For instance, an analyst from Stanley

asked for elaboration on how Microsoft formed the outlook guidance; other analysts and

investors asked about future plans for various business segments, such as Microsoft Cloud,

Windows, and advertising, in light of past performance of Microsoft and its competitors.

Figure 3. CFO remarks in Microsoft’s FY 23 Q1 earnings conference call

Figure 4. Q&A session in Microsoft’s FY 23 Q1 earnings conference call


4.2. Standard dependency parser

Chapter 2 and Chapter 3 introduce three textual analysis methods: the Bag-of-Words approach, phrase-level word embedding, and sentence-level word embedding. The Bag-of-Words approach is a frequency-based method that summarizes textual data with numeric information but disregards the meanings and contexts of words. Phrase-level word embedding takes word meanings into account and can recognize different words that share similar meanings. However, it cannot handle scenarios where the same word is used with different meanings in different sentences. Sentence-level word embedding, on the other hand, leads to algorithms that can recognize not only different words sharing similar meanings but also different meanings of the same word in different sentences. Nevertheless, all of these methods consider only the relationships among individual words, disregarding the grammatical relationships among them within a sentence.

In linguistics, the words in a sentence are classified as parts of speech (e.g., nouns, verbs,

adverbs, and so on) and are connected to each other to form certain dependency relationships. As

an example, in the sentence, “Undeterred by the bad weather, we have experienced great sales

growth this quarter,” the word “weather” is a noun modified by the adjective “bad,” while the

noun “growth” is modified by the adjective “great.” Table 1 provides a summary of the most

common dependency relationships.

Figure 5. Dependency relationship


A dependency parser is a data analytics tool used to analyze the grammatical structure of

a sentence. It can identify the “head” word in a sentence and the words that modify it. The

Natural Language Processing (NLP) group at Stanford University has developed a neural network

model that trains an algorithm to identify “part-of-speech” and “dependency relationships.”

As discussed previously, a transfer model is a type of machine learning model that is pre-

trained on one task and then fine-tuned for a different, related task. The idea behind transfer

learning is that a model that has been trained on a large and diverse dataset can be re-purposed

for other tasks. In contrast, a neural model is a type of machine learning model that is designed to

mimic the structure and function of the human brain. These models are composed of

interconnected layers of artificial neurons, and they are trained using large amounts of data to

learn patterns and make predictions.

Table 1. Dependency relation table

The parser builds a parse by performing a linear-time scan over the words of a sentence.

At every step, it maintains a partial parse, a stack of words which are currently being processed,

and a buffer of words yet to be processed. At every stage, the parser uses a neural network

classifier to determine grammatical relationships among the words. The classifier is trained using


an oracle. Specifically, the researchers gathered a sample of three million words from various

sources such as Wall Street Journal articles, IBM computer manuals, nursing notes, transcribed

telephone conversations, etc. This oracle takes each sentence in the training data and produces

many training examples. The neural network is trained on these examples using adaptive

gradient descent (AdaGrad) with hidden unit dropout. The researchers divided the three million

words into ten groups: 90 percent of the sample was used to train the model, and the trained model was then used to predict the dependency relationships of the remaining 10 percent, evaluating the accuracy of the parser. This process has been repeated over the years to refine and enhance the robustness of the model. The reported accuracy of the current model is 92 percent.
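The ten-group evaluation scheme described above can be read as ten-fold cross-validation. A minimal sketch of the index-splitting step is shown below; it is illustrative only, and the parser's actual training pipeline is more involved.

```python
import random

# Sketch: the ten-group evaluation described above, read as ten-fold
# cross-validation over example indices.

def ten_fold_splits(n_examples, seed=0):
    """Yield (train_idx, test_idx) pairs, each fold holding out ~10%."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]   # ten disjoint groups
    for k in range(10):
        held_out = folds[k]
        kept = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield kept, held_out

# Every example appears in exactly one held-out fold across the ten runs.
for kept, held_out in ten_fold_splits(100):
    assert len(held_out) == 10 and len(kept) == 90
```

Averaging accuracy across the ten held-out folds gives a more stable estimate than a single 90/10 split.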

Advantages of the human-supervised standard dependency parser

The standard dependency parser has distinct advantages over other textual analysis

methods. For example, consider the sentence “Undeterred by the bad weather, we have

experienced great sales growth this quarter.” While a frequency-based bag-of-words approach can only identify one positive word and one negative word, the standard dependency parser can discern that “bad,” which modifies “weather,” is unrelated to firm performance, while “great,” which modifies “growth,” is related to it.

Here is another example: “a double-digit increase is not an unreasonable goal, especially

in countries and regions that are not exposed to extreme weather events.” The frequency-based

bag-of-words approach would only identify two negative words, “unreasonable” and “exposed.”

By contrast, the standard dependency parser can appreciate that “not” and “unreasonable”

together form a double negative that modifies “goal,” indicating a positive sentiment related to

firm performance. Meanwhile, “not” and “exposed” form another double negative that modifies


“countries and regions.” This too is a positive sentiment, but unlike “goal,” “countries and

regions” is unlikely to be relevant to firm performance.

Figure 6. Using dependency relationship to process double negatives
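The filtering logic in these examples can be sketched in a few lines. The dependency triples below are hand-coded stand-ins for a parser's output, and the sentiment and performance word lists are illustrative assumptions.

```python
# Sketch: using dependency triples to score only sentiment words that
# modify performance-related heads. Triples are hand-coded stand-ins
# for parser output; the word lists are illustrative assumptions.

POSITIVE = {"great"}
NEGATIVE = {"bad"}
PERFORMANCE_HEADS = {"sales", "growth", "revenue", "margin"}  # assumed list

# (modifier, relation, head) triples for:
# "Undeterred by the bad weather, we have experienced great sales growth."
triples = [
    ("bad", "amod", "weather"),
    ("great", "amod", "growth"),
    ("sales", "compound", "growth"),
]

def performance_sentiment(triples):
    """Score only sentiment words whose head is performance-related."""
    score = 0
    for modifier, relation, head in triples:
        if head not in PERFORMANCE_HEADS:
            continue                      # "bad weather" is ignored
        if modifier in POSITIVE:
            score += 1
        elif modifier in NEGATIVE:
            score -= 1
    return score

print(performance_sentiment(triples))  # "great growth" counts; "bad weather" does not
```

A bag-of-words count over the same sentence would net the two sentiment words to zero.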

4.3. Empirical example: Contrasting earnings conference calls and expert network calls

Earnings conference calls were once a major source from which professional analysts and institutional investors acquired private information. Regulation Fair Disclosure in 2000 prohibited public companies from selectively disclosing material information, which encouraged investors to seek out unique information sources to gain an investing edge. Subsequently, regulatory concerns about conflicts of interest at sell-side research departments led to the Global Analyst Research Settlement in 2003, which resulted in reduced analyst coverage. This period also coincided with considerable growth in hedge funds, which possess substantial financial resources and demand valuable firm information to make profitable investment decisions. Together, these factors fueled the boom of the expert network industry.

The expert network industry consists of firms that work to recruit and connect subject

matter experts with clients seeking to do deep dive research on a company or market segment.

The standard engagement is a 45-60 minute question and answer discussion between the expert

and the client. Expert network firms often generate recordings and transcripts of these calls for


compliance purposes. The availability of call transcripts has allowed expert network firms to

create content libraries that can be sold to multiple clients.

Cao, Green, Lei, and Zhang (2023) compare the content of expert network calls and earnings conference calls. Specifically, they use a Latent Dirichlet Allocation (LDA) model to determine whether the distribution of covered topics significantly differs across call types. Early

LDA models identify topics based solely on word cooccurrences, but this approach often

generates topics that are difficult to interpret. Seeded LDA leverages both knowledge-based and

frequency-based seed words to determine interpretable pre-defined topics and classify textual

contents into these topics based on a seed word dictionary. Knowledge-based seed words are

selected based on researchers' knowledge in the field, and frequency-based key words are chosen

from the most frequent words in the documents.

Cao, Green, Lei, and Zhang (2023) identify seven common topics that emerge from expert

network calls: Competition, Consumer, Financial, Product, Operation, Strategy, and Technology.

They then obtain a list of the 100 most frequent non-stop words in the expert network call

transcripts. From this list, 50 knowledge-based root words are selected and classified into

relevant topics to construct the final seed word dictionary.
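In its simplest form, the seed-word idea assigns a passage to the topic whose seed words appear most often. The sketch below uses a tiny invented seed dictionary, not the 50-word dictionary of Cao, Green, Lei, and Zhang (2023), and counting seed hits is a simplification of seeded LDA.

```python
from collections import Counter
import re

# Sketch: seed-word topic assignment in its simplest form -- count hits
# per topic and pick the topic with the most. The dictionary below is a
# tiny illustrative stand-in, not the paper's 50-word dictionary, and
# hit-counting is a simplification of seeded LDA.

SEEDS = {
    "Financial":  {"revenue", "margin", "earnings", "cash"},
    "Technology": {"software", "cloud", "platform", "data"},
    "Strategy":   {"acquisition", "expansion", "partnership", "market"},
}

def classify(text: str) -> tuple[str, Counter]:
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = Counter()
    for topic, seeds in SEEDS.items():
        hits[topic] = sum(1 for t in tokens if t in seeds)
    best = max(hits, key=hits.get)
    return best, hits

topic, hits = classify(
    "The cloud platform migration doubled our data workloads this year."
)
print(topic)
```

Seeded LDA goes further by estimating full topic-word and document-topic distributions, but the seed dictionary plays the same anchoring role.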

To compare expert calls with earnings calls, they classify earnings calls using the same

topics and seed word dictionary. As shown in Table 2, they find that Financial discussions comprise the most common topic for earnings conference calls, present in 33.2% of calls, whereas this topic is considerably less prevalent in expert network calls (9.3%). Instead, expert calls are more likely to emphasize Technology (20.9% of expert calls vs. 6.6% for earnings calls), Strategy (15.4% vs. 8.3%), and Operation (13.1% vs. 10.8%). The different topic emphasis suggests that expert calls are less oriented towards financial statement information that is widely


available and more geared towards understanding industry segments and trends, which require expert

insights.

Table 2. Contrasting earnings conference calls and expert network calls


Appendix 4: Applying GPT to analyze conference call transcripts using both API and web
interface
Applying GPT to analyze conference call transcripts
Download 5 earnings conference calls and extract the CEO’s prepared remarks. Use the OpenAI
API to perform LDA topic modeling and classify the remarks into an appropriate number of
topics. Report the topics, their weights in the CEO’s prepared remarks, and the top five most
frequent words in each topic. Then use the ChatGPT web interface to perform the same task and
compare the outputs.


References

Cao, S., C. Green, L. Lei, and S. Zhang. 2023. Expert Network Calls. Working paper.


Chapter 5 Analyzing Material Company News

5.1. Data structure in 8-K filings

Data structure of the 8-K filing


In addition to filing annual and quarterly reports, public companies are mandated to

report certain material corporate events to their shareholders on a more current basis using Form

8-K, or a “current report.” The types of information that trigger Form 8-K filings are generally considered to be “material.” As such, companies are obligated to disclose such information promptly rather than waiting until the end of a fiscal period to file a Form 10-Q or a Form 10-K.

In March 2004, the SEC adopted sweeping changes to the Form 8-K disclosure requirements. The revised rules added new items and events to be disclosed in Form 8-K and require companies to make 8-K disclosures within four business days of the triggering event, and in some cases even earlier. The rest of this section visits each type of information disclosed in Form 8-K and provides examples of 8-K filings.
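As a first step toward analyzing 8-K filings at scale, the item headings can be pulled out of a plain-text filing with a short script. The sample text below is invented for illustration, and real EDGAR filings require additional cleanup (HTML stripping, header removal) before this pattern applies.

```python
import re

# Sketch: extract "Item N.NN Title" headings from a plain-text 8-K,
# a first step toward classifying current reports by event type.
# The sample filing text is invented for illustration.

ITEM = re.compile(r"^Item\s+(\d+\.\d{2})\s+(.+)$", re.MULTILINE)

def extract_items(filing_text: str) -> list[tuple[str, str]]:
    """Return (item number, item title) pairs in filing order."""
    return [(m.group(1), m.group(2).strip())
            for m in ITEM.finditer(filing_text)]

sample = """Item 1.01 Entry into a Material Definitive Agreement
On September 5, the registrant entered into a credit agreement.
Item 9.01 Financial Statements and Exhibits
(d) Exhibits."""
print(extract_items(sample))
```

Tabulating the extracted item numbers across many filings shows which event types a company reports most often.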

Section 1 Registrant’s Business and Operations


Item Added post-2004?
Item 1.01 Entry into a Material Definitive Agreement Yes
Item 1.02 Termination of a Material Definitive Agreement Yes
Item 1.03 Bankruptcy or Receivership No
Item 1.01 Entry into a Material Definitive Agreement

This item pertains to business agreements that are outside the ordinary course of business.

The item is also triggered by material amendments to those agreements. For instance, if a

company secures a substantial loan from a bank or enters into a long-term lease that is material

to the company, the agreement must be reported under Item 1.01 by filing a current report.

However, if a retailer with an established chain of stores signs a lease for one additional store,


the new lease generally would be in the ordinary course of business and would not be reported

here. The required disclosure includes the date on which the agreement was entered into or

amended, the identity of the parties to the agreement or amendment, and a brief description of

the terms and conditions of the agreement or amendment. Figure 1 provides an example of Item

1.01 filed by Amazon Inc. In this Form 8-K report, Amazon Inc. disclosed that the company

entered into a credit agreement with Bank of America, N.A. on September 5, 2014. The

agreement provided Amazon Inc. with a credit facility with a borrowing capacity of up to $2

billion at an initial interest rate of the London interbank offered rate (LIBOR) plus 0.625%.

Figure 1. Item 1.01 filed by Amazon Inc.

Item 1.02 Termination of a Material Definitive Agreement

This item concerns the termination of material business agreements. For example, if a

company procures most of its raw material through a long-term procurement agreement with one

significant supplier, and that supplier terminates the agreement, the termination of the agreement

must be reported under this item. In contrast, if the agreement expires according to its terms, the

termination need not be reported on Form 8-K. The required disclosure includes the date of the

termination of the material definitive agreement, the identity of the parties to the agreement, a

brief description of the terms and conditions of the agreement, a brief description of the material

circumstances causing the termination, and any material early termination penalties.


Item 1.03 Bankruptcy or Receivership

In the event of a potential bankruptcy, a company must disclose this information on Form

8-K along with its plan for reorganization (under Chapter 11) or liquidation (under Chapter 7).

Such information is particularly important for shareholders as they need to evaluate their

potential losses and the likelihood that the company could emerge from bankruptcy.

Section 2 Financial Information


Item Added post-2004?
Item 2.01 Completion of Acquisition or Disposition of Assets No
Item 2.02 Results of Operations and Financial Condition No

Item 2.03 Creation of a Direct Financial Obligation or an Obligation Yes


under an Off-Balance Sheet Arrangement of a Registrant
Item 2.04 Triggering Events That Accelerate or Increase a Direct Yes
Financial Obligation or an Obligation under an Off-Balance
Sheet Arrangement
Item 2.05 Costs Associated with Exit or Disposal Activities Yes
Item 2.06 Material Impairments Yes
Item 2.01 Completion of Acquisition or Disposition of Assets

If a company acquires or divests a significant amount of assets, including acquiring or

merging with another company or selling a business unit, the company must file an 8-K to

describe the terms of the transaction.

Item 2.02 Results of Operations and Financial Condition

Many companies announce their quarterly and annual results simultaneously in an 8-K

filing. If the company will hold an earnings conference call, it is announced in the 8-K filing as

well. As shown in Figure 2, Amazon Inc. filed a Form 8-K along with its announcement of its

fourth quarter 2020 and year ended December 31, 2020 financial results.

Figure 2. Item 2.02 filed by Amazon Inc.


Item 2.03 Creation of a Direct Financial Obligation or an Obligation under an Off-Balance

Sheet Arrangement of a Registrant

Companies must report the basic terms of material financial obligations, for example, any

long-term debt, capital or operating lease, and short-term debt outside the ordinary course of

business. The required disclosure includes the date on which the company became obligated on

the direct financial obligation, a brief description of the transaction creating the obligation, the

amount of the direct financial obligation, and a brief description of the other terms and

conditions of the transaction. Figure 3 provides an example of Item 2.03 in a Form 8-K filed by

USHG Acquisition Corp. on March 29, 2022. It reveals that USHG Acquisition Corp issued an

unsecured non-interest-bearing promissory note in the principal amount of $500,000 on March

29, 2022.

Figure 3. Item 2.03 filed by USHG Acquisition Corp.

Item 2.04 Triggering Events That Accelerate or Increase a Direct Financial Obligation or an

Obligation under an Off-Balance Sheet Arrangement

This item refers to such events as loan defaults and any other events that accelerate or

increase financial obligations. For example, if a company defaults on a loan, its creditors can

demand immediate payment of the entire outstanding amount. In such case, the company must

disclose the date of the triggering event, a brief description of the triggering event, the amount to


be repaid, the repayment terms, and any other financial obligations that might arise from the

initial default.

Item 2.05 Costs Associated with Exit or Disposal Activities

This item requires companies to disclose material charges associated with restructuring

plans. The required disclosure includes the date of the commitment to the exit or disposal

activities, a description of the plan, and an estimate of the total cost expected to be incurred.

Item 2.06 Material Impairments

A company must disclose certain material impairments under this item, including the date

of the conclusion that a material charge is required, a description of the impaired assets, the facts

leading to the conclusion, and an estimate of the amount of the impairment charge.

Section 3 Securities and Trading Markets


Item Added post-2004?
Item 3.01 Notice of Delisting or Failure to Satisfy a Continued Listing Yes
Rule or Standard; Transfer of Listing
Item 3.02 Unregistered Sales of Equity Securities No
Item 3.03 Material Modification to Rights of Security Holders No
Item 3.01 Notice of Delisting or Failure to Satisfy a Continued Listing Rule or Standard;

Transfer of Listing

This item mandates companies to disclose the delisting of their stock if they receive

notification from the stock exchange that they no longer meet the requirements for continued

listing. A company receiving this type of notice must disclose the date that it received the notice, the rule or standard for continued listing that the company fails to satisfy, and any response the company has determined to make. Delisting due to non-compliance with listing requirements

often signals red flags to investors.


Item 3.02 Unregistered Sales of Equity Securities

Companies are required to disclose private issuance of securities exceeding one percent

of outstanding shares of that class under this item.

Item 3.03 Material Modification to Rights of Security Holders

This item requires firms to disclose material changes to the rights of shareholders or

material limitations on the rights of shareholders that result from the issuance or modification of

another class of securities.

Section 4 Matters Related to Accountants and Financial Statements


Item Added post-2004?
Item 4.01 Changes in Registrant’s Certifying Accountant No
Item 4.02 Non-Reliance on Previously Issued Financial Statements or a Yes
Related Audit Report or Completed Interim Review
Item 4.01 Changes in Registrant’s Certifying Accountant

Changes in auditors can raise concerns regarding the integrity of financial statements. As

a result, companies must disclose any changes in their independent auditor regardless of whether

the independent auditor is involuntarily dismissed, voluntarily resigns, or declines to stand for

reappointment. The company also needs to disclose in Form 8-K if it hires a new auditor.

Item 4.02 Non-Reliance on Previously Issued Financial Statements or a Related Audit Report or

Completed Interim Review

This item requires companies to inform information users when previously issued

financial statements contain errors or when previously issued audit reports or interim reviews on

financial statements should no longer be relied upon. The item requires disclosure of the date of the conclusion regarding the non-reliance, identification of the financial statements and

years or periods covered, and a brief description of the facts underlying the conclusion.


Section 5 Corporate Governance and Management


Item Added post-2004?
Item 5.01 Changes in Control of Registrant No
Item 5.02 Departure of Directors or Certain Officers; Election of No
Directors; Appointment of Certain Officers; Compensatory
Arrangements of Certain Officers
Item 5.03 Amendments to Articles of Incorporation or Bylaws; Change No
in Fiscal Year
Item 5.04 Temporary Suspension of Trading Under Registrant’s No
Employee Benefit Plans
Item 5.05 Amendments to the Registrant’s Code of Ethics, or Waiver No
of a Provision of the Code of Ethics
Item 5.06 Change in Shell Company Status Yes
Item 5.07 Submission of Matters to a Vote of Security Holders Yes
Item 5.01 Changes in Control of Registrant

Whenever there is a change in control of registrant, companies must disclose the persons

who have acquired control and any arrangements between the old and new control groups.

Item 5.02 Departure of Directors or Certain Officers; Election of Directors; Appointment of

Certain Officers; Compensatory Arrangements of Certain Officers

A company must disclose any changes to the board of directors or high-level executive

officers. In addition, the company must disclose any changes to the compensation of current

high-level officers. The required disclosure includes the date of the director’s resignation, refusal

to stand for re-election or removal, any positions held by the director on any committee of the

board of directors at the time, and a brief description of the circumstances representing the

disagreement that management believes caused the director’s departure. Figure 4 shows an item

5.02 filed by WeTrade Group Inc. relating to the resignation of the Chief Executive Officer on

September 1, 2020.

Item 5.03 Amendments to Articles of Incorporation or Bylaws; Change in Fiscal Year

If a company amends its articles of incorporation or bylaws or changes its fiscal year, the

company should disclose the changes under item 5.03.


Figure 4. Item 5.02 filed by WeTrade Group Inc.

Item 5.04 Temporary Suspension of Trading Under Registrant’s Employee Benefit Plans
If a company temporarily suspends trading under its employee benefit plans (a “blackout period”), it must disclose the reason for the blackout period, along with its beginning and expected ending dates, under item 5.04.

Item 5.05 Amendments to the Registrant’s Code of Ethics, or Waiver of a Provision of the Code
of Ethics
If a company makes changes to the code of ethics that applies to its high-level officers, it must disclose such changes. The company must also disclose any waivers granted to the high-level

officers.

Item 5.06 Change in Shell Company Status

Companies must file a Form 8-K under item 5.06 when the company completes a

transaction that causes it to cease being a shell company. In the example in Figure 5, WeTrade

Group Inc. disclosed that the company ceased to be a shell company as a result of

commencement of regular revenue generating operations and controls of the WePay System.

Figure 5. Item 5.06 filed by WeTrade Group Inc.


Item 5.07 Submission of Matters to a Vote of Security Holders

Companies must disclose the results of shareholder votes at annual meetings or special

meetings by filing Form 8-K under this item.

Section 7 Item 7.01 Regulation FD Disclosure

Companies disclose material events under this item to comply with the requirements of

Regulation FD. Regulation FD requires companies to provide material information to the public

at the same time as they provide it to others. If a company discloses certain information to some

institutional investors and financial analysts during an investor event, it can file a Form 8-K

under this item to share the same information with the public. Please see Figure 6 for an example

of Item 7.01 filed by CVS Health.

Figure 6. Item 7.01 filed by CVS Health

Section 8 Item 8.01 Other Events

If a company believes an event is important but does not fall in any other categories, the

company can disclose this event under item 8.01. In Figure 7, Facebook Inc. disclosed that its

board of directors authorized a share repurchase program.

Figure 7. Item 8.01 filed by Facebook Inc.

Section 9 Item 9.01 Financial Statements and Exhibits


Under this item, a company is required to file certain financial statements and list the

exhibits that it has filed. For example, if a company discloses in Item 2.01 that it has acquired a


business, Item 9.01 would require the company to provide the financial statements of the

business. In addition, the company must present “pro forma” financial statements that

demonstrate what the company’s financial results might have been if the transaction had

occurred earlier. Similarly, if the company discloses in Item 1.01 that it has entered into a

material agreement, that agreement may be filed as an exhibit in the 8-K.
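The item structure described above makes 8-K filings easy to segment programmatically. Below is a minimal sketch that pulls item numbers and headings out of raw filing text with a regular expression; the sample text and pattern are illustrative only, not a production parser:

```python
import re

# Pattern for 8-K item headers such as "Item 5.02" or "Item 9.01",
# followed by the heading text on the same line.
ITEM_RE = re.compile(r"Item\s+(\d\.\d{2})[\s.:]+([^\n]*)", re.IGNORECASE)

def extract_items(filing_text):
    """Return a list of (item_number, heading_text) pairs found in an 8-K."""
    return [(num, title.strip()) for num, title in ITEM_RE.findall(filing_text)]

sample = """
Item 5.02 Departure of Directors or Certain Officers
...
Item 9.01 Financial Statements and Exhibits
"""
print(extract_items(sample))
```

Segmenting on these headers lets later analyses route each disclosure to the right category (e.g., executive turnover versus exhibits).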

5.2. Empirical example: Technological peer pressure and product disclosure

An empirical example of analyzing material firm news


The theoretical framework suggests that companies decrease the level of disclosure in

response to increased competition. However, empirical testing of the relationship is challenging

due to the multifaceted nature of both competition and disclosure. For example, Amazon

competes with Walmart for customers and distribution channels, but competes with Google in

information technology; Intel competes with ARM in CPU architecture design but competes with

Samsung in mobile CPU sales. In terms of disclosure, companies provide a variety of types of

disclosure, such as management earnings forecasts, segment reporting, information discussed in

conference calls, risk-related disclosures, compensation-related disclosures, CSR-related

disclosures, etc. The impact of competition on corporate disclosure may depend on the specific

type of disclosure and its alignment with the competitive dimension.

Cao, Ma, Tucker, and Wan (2018) construct a measure of technological competition

“technological peer pressure” (TPP) that captures the aggregate technological advances of

companies that compete with a given company in the product market relative to the given

company’s own technological preparedness. This type of competition is aligned with product

disclosure which reveals where the firm invests in technology for product development and

improvement, as well as how these investments have progressed. Product disclosure is quantified

with the number of words in product-disclosure press releases issued by a company. Using the


two measures, Cao et al. (2018) find that TPP has a significantly negative and strong economic

association with product disclosure: a company that moves from the lowest decile of TPP to the

highest decile reduces its product disclosure by 44.7%. In contrast, they find that TPP is not

associated with the frequency of management earnings forecasts, which is a type of disclosure

supposedly not aligned with technological competition.
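The word-count measure of product disclosure can be sketched in a few lines. The record layout below (firm, year, a product-disclosure flag, and press-release text) is an assumption for illustration, not the authors’ actual data structure:

```python
from collections import defaultdict

def product_disclosure(press_releases):
    """Sum word counts of product-disclosure press releases per (firm, year).

    Assumes each record is pre-classified as product disclosure or not;
    the classification step itself is outside this sketch."""
    totals = defaultdict(int)
    for pr in press_releases:
        if pr["is_product_disclosure"]:
            totals[(pr["firm"], pr["year"])] += len(pr["text"].split())
    return dict(totals)

releases = [
    {"firm": "A", "year": 2018, "is_product_disclosure": True,
     "text": "New product line launched with improved battery technology"},
    {"firm": "A", "year": 2018, "is_product_disclosure": False,
     "text": "Quarterly dividend declared"},
]
print(product_disclosure(releases))  # {('A', 2018): 8}
```

The resulting firm-year totals are the kind of disclosure measure that can then be related to a competition measure such as TPP.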


Reference

Cao, S., Ma, G., Tucker, J., and Wan, C. 2018. Technological Peer Pressure and Product

Disclosure. The Accounting Review, 93(6), 95-126.


Chapter 6 Analyzing Data from Social Media

6.1. What is social media?

Social media has become an incredibly influential aspect of modern-day disclosure. It

encompasses any digital technology that facilitates the sharing of ideas, thoughts, and

information through virtual networks and communities. Anyone with an internet connection can

make a social media profile and can use that profile to post nearly any content they like. Hence,

personalized profiles and user-generated content are characteristics of social media platforms.

Developing from the late 1970s on, social media originated as a means for people to

interact with friends, family, and shared interest communities. Nowadays, social media serves as

a platform for people to find career opportunities, make romantic connections, and share their

own insights and perspectives online. In addition, businesses use social media for advertising,

customer communication, and increasing brand awareness. As of October 2021, more than 4.5

billion people worldwide use social media. Its capacity to instantly post photographs, share

viewpoints, and record events has revolutionized the way people live and do business.

There are various types of social media platforms offering a variety of services. Social

networks like Facebook and Snapchat allow users to share ideas, opinions, and content with

other users, and hence most content on social networks consists of text, images, or a combination

of the two. Media networks like YouTube and TikTok facilitate the sharing of media assets such

as images, videos, and other content. Review networks like Yelp enable the evaluation of

products and services. Discussion networks such as Reddit and Quora provide a forum for people

to discuss problems, ask questions, and debate issues. Finally, business platforms like LinkedIn,

Glassdoor, and Blind enable professionals to network and collaborate with other professionals or

with potential clients. Table 1 lists the most popular social media platforms by category.


Table 1. Popular social media platforms

Type                 Data               Social Media Platforms        Purpose
Social network       Textual or image   Snapchat, Twitter,            Send messages privately or publish
                                        Facebook, WeChat              at-the-moment content
                     Audio              Clubhouse, Spotify            Listen to live conversations on
                                                                      specific topics
Media network        Image              Pinterest, Instagram          Send short messages privately and
                                                                      publish conveniently, at-the-moment
                                                                      content
                     Video              YouTube, TikTok, Twitch       Broadcast live video to viewers
Discussion network   Textual            Reddit, Quora                 Debate and discuss, network, form
                                                                      communities around a subject, and
                                                                      share views on internet-driven topics
Business platforms   Textual/Image      LinkedIn, Glassdoor, Blind    Collaborate with professionals or
                                                                      with potential clients

6.2. Data from Social Media

Data from social media platforms


An example of analyzing disclosure on social media platforms
As discussed in the previous section, businesses have adopted social media in a number

of ways. Many social media users are aware that platforms collect user data to customize

advertisements; retailers and artisans use social media to market their products to a global

audience. However, there are other less visible ways in which capital market participants, such as

financial analysts and investors, use social media. With the help of AI and machine learning


technologies, they are exploring new ways to transform the vast amounts of data generated by

social media users into financial knowledge. At the same time, companies are leveraging social

media to their own advantage, using it as an alternative channel with significantly less oversight

for disclosing information and communicating with the market. This trend has garnered a lot of

interest among scholars of finance. In this section, we will examine how specific platforms are

being used by FinTech researchers in innovative ways.

6.2.1. Twitter

Launched in 2006, Twitter is a social media “microblogging” platform on which users

post and interact with messages, media, and images contained in "tweets." Users who make their

own profiles can post, like, and retweet tweets, while unregistered users are limited to reading

some public tweets. At first, tweets could only be up to 140 characters, but the limit was doubled

to 280 in November 2017. Audio and video tweets remain limited to 140 seconds for most

accounts. Figure 1 shows a tweet from Microsoft on October 14, 2018. The tweet summarizes

Microsoft’s first-quarter financial performance, including revenue, income, and earnings per

share. The tweet has attracted enormous attention from Twitter users as evidenced by 594

“Retweets”, 116 “Quotes”, 1,760 “Likes”, and seven “Bookmarks”.

Using Twitter to predict firm performance

Twitter users have produced vast amounts of information in their tweets, leading

researchers to ask whether that user-generated content has informational value in the context of

business and finance. Some studies conducted over the past few years have found that, in the

aggregate, certain types of tweets actually have predictive power. In one such study, Bartov,

Faurel, and Mohanram (2018) find that the aggregate opinion from individual tweets predicts a

firm’s forthcoming quarterly earnings and announcement returns. Relatedly, Tang (2018) shows


that aggregated third-party-generated product information on Twitter is predictive of firm-level

sales. The predictive power is greater for firms whose major customers are consumers rather than

businesses as well as when advertising is limited.
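Studies in this vein typically score each tweet’s tone and then aggregate scores to the firm level before the event of interest. A minimal lexicon-based sketch follows; the word lists and sample tweets are illustrative, and published studies use much richer classifiers:

```python
# Toy sentiment lexicons; real studies use validated dictionaries or ML models.
POSITIVE = {"beat", "strong", "growth", "bullish"}
NEGATIVE = {"miss", "weak", "decline", "bearish"}

def tweet_score(tweet):
    """Score one tweet: +1 per positive word, -1 per negative word."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def aggregate_opinion(tweets):
    """Average score across a firm's tweets; this cross-tweet aggregate is
    the kind of signal related to forthcoming earnings or sales."""
    return sum(tweet_score(t) for t in tweets) / len(tweets)

tweets = ["Strong quarter expected, growth everywhere",
          "Bearish on this name, sales decline coming"]
print(aggregate_opinion(tweets))  # 0.0
```

The point of aggregation is that individual tweets are noisy; only the pooled opinion carries a usable predictive signal.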

Figure 1. Earnings announcement tweet by Microsoft

Using Twitter for public relations and information management

In recent years, companies have increasingly sought to harness the power of Twitter to

achieve strategic purposes, a phenomenon that has also piqued scholarly interest. Blankespoor,

Miller, and White (2014), for example, examine whether firms can reach more investors and thus

reduce information asymmetry by tweeting news. Using a sample of technology firms, they find

that using Twitter to send market participants links to press releases indeed helps disseminate

earnings information among investors which leads to a reduction in information asymmetry.

Jung, Naughton, Tahoun, and Wang (2018) find that companies disseminate news via Twitter

strategically. Specifically, they are less likely to tweet negative earnings news than positive


earnings news. The incentives to disseminate information strategically are stronger for firms with

a lower level of investor sophistication, a larger social media audience, and higher litigation risk.

At the same time, the autonomous nature of social media imposes challenges on

companies as well because they have very limited control over what information or opinions

people share on these social media platforms. Lee, Hutton, and Shu (2015) find that a

corporation’s use of social media can help counterbalance negative price reactions to recall

announcements. However, they also observed that, with the arrival of Facebook and Twitter,

firms relinquished a certain amount of control over their social media content, and the

attenuation benefits of corporate social media, while still significant, lessened. In other words,

when a company recalls a product, there is almost always a negative price reaction. The more the

company tweets on its own behalf, the smaller the negative reaction; the more other people tweet

about the company, the worse the reaction.

6.2.2. Glassdoor.com

Glassdoor.com is a large recruiting platform where users can search and apply for jobs. In

addition to information about job postings, Glassdoor also hosts a social media platform through

which current and former employees can anonymously review companies across a variety of

criteria, including internal CEO approval ratings, salary data, interview difficulty and questions,

compensation and benefits assessments, company outlook, etc. Glassdoor thus provides insider

information from employees who voluntarily share their opinions on the companies they work

for. Figure 2 shows the company profile page of Microsoft on Glassdoor.com.

Using employee reviews to predict firm outcomes

As Glassdoor.com opens a “window” through which outsiders can easily collect a

company’s employees’ comments and opinions about the company, can we extract from


Glassdoor posts valuable information incremental to the information disclosed by company

management? Using data from Glassdoor, Hale, Moon, and Sweson (2018) find that employee

opinions are useful in predicting earnings growth and management forecast news. Huang, Li,

and Markov (2020) document that the average employee outlook is incrementally informative in

predicting future operating performance, particularly when the disclosures are aggregated from a

larger, more diverse, and more knowledgeable employee base. Interestingly, the average outlook

predicts bad news events more strongly than good news events.
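The aggregation idea behind these findings can be sketched simply: code each employee’s outlook numerically and average it, discarding firms whose review base is too small to aggregate reliably. The coding scheme and threshold below are assumptions for illustration:

```python
def outlook_signal(reviews, min_reviews=3):
    """Average employee outlook coded as -1 (negative), 0 (neutral),
    +1 (positive); return None when the review base is too small
    to form a reliable aggregate."""
    if len(reviews) < min_reviews:
        return None
    return sum(reviews) / len(reviews)

print(outlook_signal([1, 1, 0, -1, 1]))   # 0.4
print(outlook_signal([1, -1]))            # None
```

Conditioning on a larger base mirrors the finding that the signal is stronger when aggregated from a larger, more diverse employee population.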

Figure 2. Glassdoor Company Profile of Microsoft

In addition to the predictive power, Dube and Zhu (2021) show that employee opinions

on Glassdoor.com prompt companies to improve their workplace practices such as employee

relations and diversity. Such improvement is more pronounced among firms with negative initial

reviews and high labor intensity. Those improvements in workplace practices pay off financially.

Green, Huang, Wen, and Zhou (2019) find that companies experiencing improvements in


employee opinions significantly outperform firms with declines. The return effect is concentrated

among reviews from current employees. Furthermore, changes in an employer’s rating are

associated with growth in sales and profitability.

6.2.3 Stock Message Boards

Message boards have been a feature of digital life since the debut of USENET in 1979.

The main function of message boards is to provide a forum where readers and users can share

their thoughts and interact with people who share similar interests or have specialized

knowledge in a particular field. Stock message boards give investors an opportunity to connect

with other investors at all levels of expertise and learn more about profitable investing

strategies. Many stock message boards focus on a specific topic or group of topics, such

as investing in options, precious metals, exchange traded funds (ETFs), or commodities. Figure

3 shows the stock message board, Seeking Alpha. The topics discussed on Seeking Alpha cover

basic materials, bonds, closed end funds, commodities, cryptocurrency, etc. Seeking Alpha

users can share and exchange opinions with other users by posting or commenting on analysis

articles.

Using online “crowd wisdom” to predict stock performance

Stock message boards provide investors a platform to exchange ideas and opinions,

potentially generating “the wisdom of crowds,” that is, the idea that the aggregate perspective of

a large group of people is sometimes more accurate than that of a single expert. If that is true,

then “crowd wisdom” from stock message boards might help investors to better predict company

future performance. Drake, Moon, Twedt, and Warren (2022) examine Seeking Alpha. They find

that market reaction to sell-side analyst research is substantially

reduced when the analyst research is preceded by reports from “social media analysts”


(SMAs)—individuals posting equity research online via social media investment platforms, and

that this is particularly true of sell-side analysts’ earnings forecasts. They further find that this

effect is more pronounced when SMA reports contain more decision-useful language, are

produced by SMAs with greater expertise, and relate to firms with greater retail investor

ownership. They suggest that the attenuated response to sell-side research is most likely

explained by SMA research preempting information in sell-side research.
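The “wisdom of crowds” intuition is easy to demonstrate with a small simulation: average many noisy individual estimates, and the aggregate lands closer to the truth than the typical individual does. The numbers below are illustrative:

```python
import random

random.seed(42)
truth = 100.0
# 500 individuals, each estimating the truth with independent noise.
estimates = [truth + random.gauss(0, 10) for _ in range(500)]

crowd_error = abs(sum(estimates) / len(estimates) - truth)
avg_individual_error = sum(abs(e - truth) for e in estimates) / len(estimates)

# The crowd average is far more accurate than the typical individual.
print(crowd_error < avg_individual_error)  # True
```

The effect relies on errors being independent; when opinions are correlated (everyone repeating the same rumor), averaging helps far less.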

Figure 3. Stock analyses on Seeking Alpha

Credibility of online contributors

As with many issues discussed on social media, one major concern about stock message

boards relates to the accuracy and credibility of the content producers, which determines the

legitimacy of the content itself. To shed light on this issue, Campbell, DeAngelis, and Moon

(2019) investigate whether stock holding positions by SMAs have a negative effect on analyst

objectivity. They find no evidence that an SMA’s position reduces investor responses to the

SMA’s posting. In fact, they show that holding a position magnifies investor responses to the


SMA’s article. Their findings suggest that SMAs’ stock positions do not decrease the credibility

and informativeness of their analyses.

6.2.4 YouTube

YouTube is a social media platform that allows users to upload, share, store, and play

back videos. Launched in 2005, YouTube has become enormously popular, reporting 2.5 billion

monthly users in June 2022. In fact, YouTube is the second most visited website on the internet,

and its visitors watch more than a billion hours of videos per day. As of May 2019, more than

500 hours of content were being uploaded to YouTube every minute.

Using YouTube to extract multi-dimension information

Given the massive amount of content hosted on YouTube, it makes sense on the one hand

that researchers would be highly interested in exploiting the data therein, but on the other hand

that they would need powerful technological tools to process video content. Hu and Ma (2022)

were among the first to collect startups’ self-introductory pitch videos from YouTube and another video-

sharing website, Vimeo. Using machine learning algorithms to process these pitch videos, they

are able to measure the persuasiveness of delivery in start-up pitches from visual, vocal, and

verbal dimensions. They find that passionate and warm pitches increase funding probability;

however, conditional on funding, high-positivity startups underperform.

6.2.5 LinkedIn

Connections and interactions on social media platforms are likely to be extensions of

real-world relations. While you probably don’t know everyone on r/WallStreetBets, there’s a

good chance that your Facebook friends are people you have actually met. Networking-based

social media platforms thus allow researchers to infer real-world networks from relations on

social media platforms.


For example, LinkedIn is a social network specifically designed for career and business

professionals to connect. LinkedIn users create their professional profiles by sharing their

educational backgrounds, employment histories, skills, and career interests. Unlike other social

networks in which you might become "friends" with any and everyone, LinkedIn is about

building strategic relationships. As of 2020, over 722 million professionals use LinkedIn to

cultivate their careers and businesses. This rich dataset of professional profiles offers a unique

opportunity to peek into capital market participants’ social networks in the real world.

Using LinkedIn to uncover personal ties among financial professionals

Of all the networks on LinkedIn, researchers are particularly interested in the impact of

connections among financial analysts, fund managers, and corporate executives. Jiang, Wang,

and Wang (2018) use professional profiles posted on LinkedIn to identify revolving rating

analysts with structured finance rating experience. They find that the more companies issuing

debt securities employ such analysts, the more likely that ratings of their debt securities are

inflated compared with otherwise similarly rated securities.

Bradley, Gokkaya, and Liu (2020) examine professional connections among executives

and analysts formed through overlapping historical employment. They search for each analyst on

LinkedIn.com to capture pre-analyst employment history. They find that analysts with

professional connections to coverage firms have more accurate earnings forecasts and issue more

informative buy and sell recommendations. These analysts are more likely to participate, be

chosen first, and ask more questions during earnings conference calls and analyst/investor days.

Brokers attract greater trade commissions on stocks covered by connected analysts. Meanwhile,

firms benefit through securing research coverage and invitations to broker-hosted investor

conferences originating from these connections.
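Connections inferred from overlapping historical employment reduce to an interval-overlap check on profile data. A minimal sketch follows; the profile layout (employer, start year, end year) and the sample names are assumptions for illustration:

```python
def employment_overlap(history_a, history_b):
    """Return True if two professionals worked at the same employer during
    overlapping years; histories are lists of (employer, start, end)."""
    for emp_a, s_a, e_a in history_a:
        for emp_b, s_b, e_b in history_b:
            # Same employer and the two intervals intersect.
            if emp_a == emp_b and s_a <= e_b and s_b <= e_a:
                return True
    return False

analyst = [("Goldman Sachs", 2005, 2009), ("Citi", 2010, 2015)]
executive = [("Goldman Sachs", 2008, 2012)]
print(employment_overlap(analyst, executive))  # True: Goldman, 2008-2009
```

Running this check over every analyst-executive pair yields the connection network whose effects the studies above measure.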


Using LinkedIn to extract facial information of financial professionals

In addition to employment histories, many professionals post their photos on social media

networks. Li, Lin, Lu, and Veenstra (2020) collect analysts’ photos from their LinkedIn profiles.

They find that, while female analysts are more likely to be voted as All-Star analysts in the

United States, good-looking female analysts are less likely to be voted as All-Stars. On the

contrary, female analysts in China are less likely to be voted as All-Stars, but the likelihood

increases with their facial attractiveness. These findings implicate a beauty penalty for female

analysts in the United States and gender discrimination against female analysts in China.

6.3. Empirical Example: Negative Peer Disclosure

Most corporate social media posts tend to fall into common categories such as product

announcements, earnings disclosures, industry awards, community engagement, etc. Cao, Fang,

and Lei (2021) uncover an emerging and entirely different type of corporate social media

posts: negative peer disclosure (NPD). NPD refers to the phenomenon that firm A discloses

negative information about competitor firm B without mentioning anything about itself in a

tweet. Here is an example of what happened between Dropbox/Box and Globalscape, two

companies that compete in the online file storage space. In 2014, news broke of a Dropbox

security flaw that exposed its users’ private data. Globalscape responded by retweeting a news

article with this headline: “Dropbox and Box Leak Files in Security Through Obscurity

Nightmare.”

When the negative news came out about Dropbox and Box, Globalscape could have been

affected in two ways. On one hand, it could have been positive in terms of product market

competition, since Globalscape didn’t have any security breakdowns as its competitor did. At the

same time, it could also have been negative from the technology spillover perspective—the


market might have assumed Globalscape was subject to the same technology vulnerability

(Figure 4). Hence the NPD tweet was a signal to the market that what happened to Dropbox

doesn’t apply to Globalscape.

Figure 4. The implications of negative peer disclosure

To build a dataset of NPD, Cao, Fang, and Lei (2021) collect tweets that mention a

competitor from corporate Twitter accounts, and employ sentiment analysis to identify negative

peer disclosures. The study finds that NPDs are issued by well-known and successful companies

such as Nvidia, T-Mobile, Symantec, and others. The propensity to issue NPDs increases with the

degree of product market rivalry and technology proximity. The approach appears to work.

Consistent with NPDs being implicit positive self-disclosures, disclosing firms experience a two-

day abnormal return of 1.6–1.7% over the market and industry. Firms using NPDs tend to

outperform their non-NPD-using peers in the product markets.
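The NPD identification in Cao, Fang, and Lei (2021) combines two filters: the tweet names a product-market peer but not the disclosing firm, and its tone is negative. A minimal sketch with a toy lexicon and naive substring matching follows; the actual study’s peer identification and sentiment analysis are far more sophisticated:

```python
# Toy negative-tone lexicon for illustration only.
NEGATIVE_WORDS = {"leak", "breach", "flaw", "nightmare", "outage", "lawsuit"}

def is_npd(tweet, own_name, peer_names):
    """Flag a tweet as negative peer disclosure: it names a peer, does not
    name the disclosing firm, and carries negative tone."""
    text = tweet.lower()
    mentions_peer = any(p.lower() in text for p in peer_names)
    mentions_self = own_name.lower() in text
    negative = any(w in text for w in NEGATIVE_WORDS)
    return mentions_peer and not mentions_self and negative

# The Globalscape retweet headline from the example above.
tweet = "Dropbox and Box Leak Files in Security Through Obscurity Nightmare"
print(is_npd(tweet, "Globalscape", ["Dropbox", "Box"]))  # True
```

Applying such a filter to every tweet from corporate accounts yields the NPD sample whose market reactions the study measures.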


Reference

Bartov, E., Faurel, L., and Mohanram, P. 2018. Can Twitter Help Predict Firm-Level Earnings

and Stock Returns? The Accounting Review, 93, 25-57.

Blankespoor, E., Miller, G., and White, H. 2014. The role of dissemination in market liquidity:

Evidence from firms’ use of Twitter. The Accounting Review, 89(1), 79-112.

Bradley, D., Gokkaya, S., Liu, X., and Xie, F. 2017. Are all analysts created equal? Industry

expertise and monitoring effectiveness of financial analysts. Journal of Accounting and

Economics, 63(2-3), 179-206.

Campbell, J., DeAngelis, M., and Moon, J. 2019. Skin in the game: personal stock holdings and

investors’ response to stock analysis on social media. Review of Accounting Studies, 24,

731-779.

Cao, S., Fang, V., Lei, L. 2021. Negative peer disclosure. Journal of Financial Economics,

140(3), 815-837.

Dube, S., and Zhu, C. 2021. The disciplinary effect of social media: Evidence from firms’

responses to Glassdoor Reviews. Journal of Accounting Research, 59(5), 1783-1825.

Green, T. C., Huang, R., Wen, Q., and Zhou, D. 2019. Crowdsourced employer reviews and stock returns.

Journal of Financial Economics, 134, 236-251.

Hales, J., Moon, J., and Swenson, L. 2018. A new era of voluntary disclosure? Empirical

evidence on how employee postings on social media relate to future corporate

disclosures. Accounting, Organizations and Society, 68-69, 88-108.

Huang, K., Li, M., Markov, S. 2020. What do employees know? Evidence from a social media

platform. The Accounting Review, 95(2), 199-226.


Jiang, J., Wang, I., and Wang, K. 2018. Revolving rating analysts and ratings of mortgage-

backed and asset-backed securities: Evidence from LinkedIn. Management Science,

64(12), 5461-5959.

Jung, M., Naughton, J., Tahoun, A., and Wang C. 2018. Do firms strategically disseminate?

Evidence from corporate use of social media. The Accounting Review, 93(4), 225-252.

Lee, L., Hutton, A., and Shu, S. 2015. The role of social media in the capital market: Evidence

from consumer product recalls. Journal of Accounting Research, 53(2), 367-404.

Li, C., Lin, A., Lu, H., and Veenstra, K. 2020. Gender and beauty in the financial analyst

profession: evidence from the United States and China. Review of Accounting Studies, 25,

1230-1262.

Tang, V. 2018. Wisdom of Crowds: Cross-sectional variation in the informativeness of third-

party-generated product information on Twitter. Journal of Accounting Research, 56(3),

989-1034.


Chapter 7. Data Analytics in Environmental, Social, and Governance

7.1. Corporate Governance

Corporate governance refers to the set of processes, customs, policies, laws, and

institutions that impact the way a corporation is directed, administered, or controlled. The need

for corporate governance arises from the separation of ownership and control and information

asymmetry. Prior to the Industrial Revolution, businesses were owned and managed by a handful

of individuals, known as sole proprietors. This simple form gave way to the more complicated

corporate form where individuals invest in a corporation, which is then managed by corporate

managers chosen by those investors. This evolution introduces agency relationships where

corporate managers are agents who control corporate resources, and investors are the principals

who are not involved in daily corporate operation.

Separation of ownership and control inevitably raises concerns about whether the agent is

acting in the best interests of the principal. It also allows managers to obtain private information

by exercising control. Hence, separation of ownership and control naturally leads to a separation

of ownership and information, where managers enjoy an informational advantage over investors.

Following the Myers and Majluf (1984) paradigm, information asymmetry occurs “…when firms

have information that investors do not have.” To address the agency and the information

asymmetry concerns, in the traditional corporate governance system, shareholders appoint boards

of directors to monitor senior managers. Boards of directors could receive advice from parties

such as auditors and legal counsels. Figure 1 describes this corporate governance environment.

Traditionally, the sole aim of corporate governance has been to maximize shareholder

value to protect the interests of shareholders. In recent years, there has been a growing emphasis

on corporate social responsibility, which broadens the scope of corporate governance.


Specifically, the corporate governance system today should aim to guarantee the interests of all

the firm’s stakeholders. In addition to shareholders, corporate managers also hold

responsibilities towards multiple external monitors and stakeholders such as creditors, regulators,

employees, customers, community, etc.

Figure 1. Participants in the corporate governance system

7.2. Textual data for corporate governance

7.2.1. Proxy statement

Data in proxy statements


When seeking information about a company, financial statements, annual reports, and

conference calls are often the first place people look. However, the proxy statement can be just

as informative, if not more so, as it delves into business relationships and into the backgrounds and

compensation of corporate officers.

A proxy statement is a document that public companies provide to their shareholders to

help them understand how to vote at shareholder meetings and make informed decisions about


how to delegate their votes to a proxy. It covers a variety of issues, such as proposals for new

additions to the board of directors, information on directors' salaries, information on bonus and

options plans for directors, corporate actions like proposed mergers or acquisitions, dividend

payouts, and any other declarations made by the company's management.

Figure 2. Apple’s Notice of 2022 Annual Meeting of Shareholders

Below is some of the information you can glean from this important document.

• Important issues to vote on. Figure 2 shows the items of business and board voting

recommendations in Apple’s Notice of 2022 Annual Meeting of Shareholders.

Figure 3 lists the items of business, board voting recommendations, and how to vote

in Walmart’s 2022 proxy statement. Typical items for voting at annual meetings of

shareholders include election of directors, ratification of appointment of independent


registered public accounting firm, advisory vote to approve executive compensation,

approval of employee stock plan, shareholder proposals, etc.

• Details about management, their experience, and qualifications.

• Management compensation and whether their compensation structure is aligned with

shareholder interests. Figure 4 presents the summary compensation table in Apple

Inc’s 2022 proxy statement. It provides detailed information on the amount and

composition of executive compensation for the five highest-paid executives over the past

three years. Figure 5 reflects Apple’s analysis concerning the vesting of the CEO’s

Restricted Stock Units (RSU). The analysis provides detailed information about the

conditions required for RSU vesting, the corresponding performance outcomes, and

the resulting vesting outcomes.

Figure 3. Walmart’s 2022 proxy statement


• Potential conflicts of interests, such as related-party transactions that may not be

beneficial to the company.

• Loans advanced to senior executives. These loans can deprive the company of capital,

are often made on generous terms, and are sometimes forgiven, leaving shareholders

to foot the bill.

Figure 4. Summary Compensation Table in Apple Inc’s 2022 proxy statement

Shareholder voting on important corporate issues sometimes leads to proxy contests, also

known as proxy battles. This occurs when a group of shareholders joins forces in an attempt to

oppose and vote out the current management or board of directors, essentially creating a battle

for control of the company between shareholders and senior management. Proxy fights (Figure

6) are commonly initiated by dissatisfied shareholders who convene with other shareholders to

pressure management and the board of directors to make changes within the company.

Shareholders use their votes to pressure the board of directors by voting against them at

the annual general meeting (AGM).


Figure 5. CEO RSU vesting analysis in Apple Inc’s 2022 proxy statement


Figure 6. Proxy voting

7.2.2. Corporate social responsibility disclosure

Data in corporate social responsibility disclosure


Corporate social responsibility (CSR) is an integral part of corporate governance

designed to ensure a company’s operations are ethical and beneficial to all stakeholders in

society. The concept of CSR is broad and varies among companies, but the fundamental

principle is to operate in a sustainable manner that is economically, socially, and

environmentally responsible.

Information about CSR can be obtained from both internal and external sources.

Sustainability reports serve as a primary source of CSR information provided by companies.

These reports contain information about the environmental, social, and governance (ESG)

impacts of a company’s operations. Investors and other stakeholders are increasingly calling for


more transparency in companies’ sustainability and ESG strategies, and many legislative

documents require or will mandate non-financial CSR information.

Global Reporting Initiative (GRI) is an international organization that provides

independent standards for companies to report non-financial information. These standards are

designed to help businesses identify their impacts on climate change, the environment, human

rights, and corporate governance. Although the GRI standards are voluntary and non-binding,

they serve as the foundation for the proposed Corporate Sustainability Reporting Directive

(CSRD), and the forthcoming mandatory European Sustainability Reporting Standards (ESRS)

are based on the GRI structure. The ESRS are a set of standards (analogous to IFRS) that

companies must comply with when reporting sustainability information. A sustainability report

is an effective way for companies to answer, in a single document, a wide variety of questions

that stakeholders may raise. In addition to GRI standards, the Sustainability Accounting

Standards Board (“SASB”) also offers guidance on how to prepare informative ESG

information. SASB is an independent, private sector standards-setting organization whose

mission is to help businesses around the world identify, manage and report on the sustainability

topics that SASB believes matter most to investors. Other ESG reporting frameworks include the

frameworks proposed by the Task Force on Climate-related Financial Disclosures (TCFD) and

the United Nations Sustainable Development Goals. Figure 7 presents the reporting frameworks

adopted in Apple’s 2021 ESG report.

In addition to sustainability reports, companies also frequently disclose CSR information

on their websites, in proxy statements, and on social media platforms. CSR information could

also be collected from alternative sources outside of companies, for example, government

agencies or non-governmental watchdog organizations.


Figure 7. Reporting frameworks of Apple’s 2021 ESG Report

7.3. Emerging technologies as governance mechanisms

7.3.1. Governance with availability of alternative data

Regulators have conventionally aimed at ensuring equal access to information

generated inside firms. The rise of big data and data analytics generates a large amount of

information outside firms by tracking “footprints”, for example, satellite images, internet traffic,

credit card transactions, sensors, social media postings, etc. Such alternative information could be ahead

of or incremental to managerial information. Zhu (2019) shows that externally generated

alternative data are predictive of firm performance and offset insiders’ advantage. The

governance potential of growing data on ownership structures, leadership quality, shareholder

sentiment, governance risks, and the like is therefore promising. In addition,

there is an increasing demand and supply of information about environmental, social, and

governance issues, which would help hold managers accountable along these important dimensions.

Access to alternative information is universal but uneven depending on skills and

resources available. The rise of big data creates information asymmetry in accessing traditional

data as well. The SEC estimates that “as much as 85% of the documents visited are by internet

bot”. The ability to process big data has become increasingly critical to establish informational

advantage. Cao, Jiang, Wang, and Yang (2021) find that, when alternative data becomes


available, analysts affiliated with brokerage firms equipped with AI capacity provide more

accurate forecasts. Cao, Jiang, Yang, and Zhang (2022) document that increases in machine

downloads of SEC filings are associated with decreases in time to the first trade; however,

increases in machine downloads of SEC filings widen bid-ask spreads.

On the other hand, firms are responding to the increasing use of machines in processing

corporate information. Cao, Jiang, Yang, and Zhang (2022) find that the publication of Loughran

and McDonald (2011) prompts firms to reduce the use of the negative words listed in Loughran

and McDonald (2011) in corporate filings (Figure 8).

filings is largely rule-based, the change in manager behavior would impact the effectiveness of

machine learning in corporate governance. As a result, it is important to bear in mind that,

knowing the power of big data and data analytics, managers are incentivized to change behavior

to influence and manipulate the outcomes of machine processing.

7.3.2. Governance with distributed ledgers and blockchains: Shareholder voting and smart
contracting

Shareholder voting allows shareholders to be directly involved in corporate governance.

A proxy contest is a campaign to solicit votes (or proxies) in opposition to management at an

annual or special meeting of stockholders or through action by written consent. Today the most

common types of proxy contests are contests by activist stockholders seeking board

representation or control, generally with the objective of maximizing return on the activist’s

investment in the short term. The proxy contest serves as a tool to drive change. Brav, Jiang, Li,

and Pinnington (2019) document that approximately one percent of firms are targeted with

proxy contests in a given year.


Figure 8. Frequency of Loughran and McDonald (2011) negative words

This figure plots LM – Harvard Sentiment of 10-K and 10-Q filings and compares sentiment of firms with high
machine downloads with that of the low group. LM – Harvard Sentiment is the difference of LM Sentiment and
Harvard Sentiment. LM Sentiment is defined as the number of Loughran-McDonald (LM) finance-related negative
words in a filing divided by the total number of words in the filing. Harvard Sentiment is defined as the number of
Harvard General Inquirer negative words in a filing divided by the total number of words in the filing. Filings are
sorted into the top or bottom tercile based on Machine Downloads. LM Sentiment and Harvard Sentiment
are each normalized to one in 2010 within each group, one year before the publication of
Loughran and McDonald (2011). The dotted lines represent the 95% confidence limits.
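The sentiment measures described in the figure can be computed with a simple word-count routine. Below is a minimal sketch; the negative word set is a tiny hypothetical subset of the Loughran-McDonald dictionary (the full list contains roughly 2,300 entries):

```python
import re

# Illustrative subset of the Loughran-McDonald negative word list
# (hypothetical sample; the full dictionary is far larger).
LM_NEGATIVE = {"loss", "losses", "impairment", "litigation", "adverse", "default"}

def lm_sentiment(text: str) -> float:
    """Share of LM negative words among all words in a filing."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    negative = sum(1 for w in words if w in LM_NEGATIVE)
    return negative / len(words)

filing = "The company recorded an impairment loss and faces adverse litigation."
score = lm_sentiment(filing)  # 4 negative words out of 10
```

Harvard Sentiment is computed the same way, with the Harvard General Inquirer negative word list swapped in.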

The importance of shareholder voting indicates that shareholding records are pivotal to

corporate governance. In this sense, blockchains can help with maintaining transparent

shareholding records and resolve the problem of “double voting” (Yermack 2017). Blockchains

can also be implemented to add new features to the existing shareholder voting system: for

example, tenure-based voting, which awards greater voting power to shares held for a longer

duration (Edelman, Jiang, and Thomas 2019); voting power of outside shares contingent on firm

performance; and decentralized autonomous organizations that serve as self-sufficient proxy

advisors empowering retail investors.
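To illustrate the mechanics of tenure-based voting, the sketch below tallies votes with a weight that grows with holding duration. The bonus rate and cap are hypothetical parameters, not drawn from any actual charter or from the cited papers:

```python
from datetime import date

def tenure_weighted_votes(holdings, as_of, bonus_per_year=0.1, cap=2.0):
    """Tally votes where each share's weight grows with holding duration.

    holdings: list of (shares, acquisition_date) tuples.
    Weight = 1 + bonus_per_year * full years held, capped at `cap`.
    Both parameters are illustrative assumptions.
    """
    total = 0.0
    for shares, acquired in holdings:
        years_held = (as_of - acquired).days // 365
        weight = min(1.0 + bonus_per_year * years_held, cap)
        total += shares * weight
    return total

# 100 shares held 8 years (weight 1.8) + 100 shares held <1 year (weight 1.0)
votes = tenure_weighted_votes(
    [(100, date(2015, 1, 1)), (100, date(2022, 6, 1))],
    as_of=date(2023, 1, 1),
)
```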

In addition to voting, the blockchain technology makes “smart contracts” feasible. Smart

contracts are digital contracts allowing terms contingent on decentralized consensus that are

tamper-proof and typically self-enforcing through automated execution (Cong and He 2019).


Such contracts mitigate traditional moral hazard as “hidden actions” become verifiable, thus

saving enforcement costs and deterring strategic behavior.


References

Brav, A., Jiang, W., Li, T., and Pinnington, J. 2019. Picking friends before picking (proxy)

fights: How mutual fund voting shapes proxy contests. Columbia Business School

Research Paper (18-16).

Cao, S., Jiang, W., Wang, J., and Yang, B. 2021. From Man vs. Machine to Man+Machine: The

Art and AI of Stock Analyses. NBER Working Paper.

Cao, S., Jiang, W., Yang, B., and Zhang, A. 2022. How to Talk When a Machine is Listening?

Corporate Disclosure in the Age of AI. Review of Financial Studies, forthcoming.

Cong, L., and He, Z. 2019. Blockchain Disruption and Smart Contracts. Review of Financial Studies,

32(5), 1754-1797.

Edelman, P., Jiang, W., and Thomas, R. 2019. Will Tenure Voting Give Corporate Managers

Lifetime Tenure? Texas Law Review, 97(5), 991-1030.

Loughran, T., and McDonald, B. 2011. When Is a Liability Not a Liability? Textual Analysis,

Dictionaries, and 10-Ks. Journal of Finance, 66(1), 35-65.

Myers, S., and Majluf, N. 1984. Corporate Financing and Investment Decisions When Firms

Have Information that Investors Do Not Have. Journal of Financial Economics, 13, 187-

221.

Yermack, D. 2017. Corporate Governance and Blockchains. Review of Finance, 21(1), 7-31.

Zhu, C. 2019. Big Data as a Governance Mechanism. Review of Financial Studies, 32(5), 2021-2061.


Chapter 8 Analyzing Image Data


8.1 Images in corporate executive presentations
Corporate executive presentations are a distinctive form of corporate disclosures. These

events include non-deal road shows, which are organized to generate investor interest and

promote the company's image; initial public offering (IPO) road shows, where executives present

to potential investors prior to going public; broker-hosted investor conferences, which allow

CEOs to connect with a broader investor audience; and capital market day events, dedicated to

outlining the company's long-term vision and strategy.

Corporate executive presentations are characterized by two key features. Firstly, due to

the time constraints faced by executives when delivering live presentations, these presentations

often incorporate a significant amount of visual and graphic information in their slides.

Executives understand the importance of conveying complex ideas succinctly, and visuals help

to convey information quickly and effectively. These visuals can include charts, graphs,

diagrams, images, and videos, all of which aid in capturing the attention of the audience and

enhancing their understanding of the presented material. Figure 1 shows an example of charts

used in executive presentations. Figure 2 presents images of production sites under

construction in an executive presentation.

Secondly, executive presentations differ from other forms of corporate disclosures by

providing a wealth of visual information about the firm's product designs and operational plans.

While other corporate disclosures primarily focus on quantitative data, such as financial

statements and performance metrics, executive presentations offer a complementary perspective

by showcasing the visual aspects of the company's offerings. This can include detailed product

designs, prototypes, manufacturing processes, supply chain diagrams, and strategic plans. Figure

3 shows two examples of the images of product designs and prototypes used in executive


presentations. By presenting these visual elements, executives aim to provide a comprehensive

view of the company's future direction, growth strategies, and competitive advantage.

Figure 1 An example of charts in executive presentations

Figure 2 An example of images in executive presentations


Figure 3 Two examples of product images in executive presentations

8.2. Empirical example


Visual and graphic information has been difficult to analyze due to its unstructured and

high-dimensional nature. An image can contain tens of thousands of pixels, each with millions

of possible colors, that form complex patterns and objects. Recent advances in machine learning


and AI have endowed image recognition algorithms with capabilities comparable to

humans. Cao, Cheng, Yang, Xia, and Yang (2023) leverage deep learning to extract key features

of firms’ operations from corporate executive presentations.

As the first step, they manually review and classify (label) a random subsample of images

into several different categories, providing a training sample for the machine learning algorithms.

They classify each image into one of three categories: Operations Summary, Operations

Forward, and Others. To minimize human errors in the labeling process, they cross-validate and

require a consensus on the classification by at least three graduate research assistants. They use a

two-step bootstrapping process to construct the training sample. They first label an initial random

sample of 3,000 images. They then use this initial sample to train the machine learning model

and make initial predictions on the potential classifications

of all images. They then select a final training sample of 20,000 pre-classified images with

balanced numbers of images in each category. Finally, they manually classify the final training

sample.

A plethora of machine learning models, including random forests, gradient boosting, and

neural networks, have found use in a wide range of applications. Image recognition has been an

important problem for deep learning, and a milestone in image recognition is the development of

ImagNet by Google (Li et al., 2009), which achieved performance at par with humans. The

primary deep learning model employed by ImageNet and other leading image recognition

algorithms is a special class of neural networks, the Convolutional Neural Network (CNN). The

CNN is a multiple-layer neural network where the lower layers capture finer details, and the

higher layers extract high-level information, such as objects in the image.


The challenge for recognizing business pictures is that there are no ready-made models

for this purpose, and training a CNN model usually requires a large training dataset. Therefore,

Cao et al. (2023) utilize an advanced machine learning technique called transfer learning (Pratt,

1993; Rajat et al., 2006) to build their own deep learning model based on pre-trained CNN

models and train the model with their business image sample. Specifically, they first build a

neural network on top of a pre-trained CNN neural network from a state-of-the-art image

recognition model, VGG16 (Simonyan and Zisserman, 2014). They then keep the parameters of

the CNN layers fixed and fine-tune the model with the training sample. The resulting model is termed

the Transfer CNN model. Transfer learning takes advantage of existing CNN models

trained with very large datasets and also adapts the model to a specific business problem.

Cao et al. (2023) also consider a model that utilizes both image and text information from

presentations, combined with the transfer learning technique (Transfer CNN + Text).
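A minimal sketch of this transfer-learning setup, using the Keras implementation of VGG16, is shown below. The head architecture and optimizer are illustrative choices, not those of Cao et al. (2023), and `weights=None` merely avoids the pre-trained-weight download in this sketch (use `weights="imagenet"` in practice):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pre-trained convolutional base; in practice set weights="imagenet"
# (weights=None here only avoids the download in this sketch).
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the parameters of the CNN layers fixed

# New classification head, to be fine-tuned on the labeled business images
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(3, activation="softmax"),  # Operations Summary / Forward / Others
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # labeled training sample
```

Freezing the base and training only the head is what lets a model trained on a very large general dataset adapt to a small, specialized business-image sample.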

They train four different model architectures: 1) a CNN model trained from scratch

(CNN); 2) a deep learning model that processes both images and texts (CNN + Text); 3) a

transfer learning model that relies on a pre-trained CNN model to process images (Transfer

CNN); 4) a transfer learning model that processes both images and texts (Transfer CNN +

Text). For each model, they evaluate its out-of-sample performance using ten-fold

cross-validation with stratified sampling to split the sample, and they report the results in

Table 1. They use four measures to evaluate the out-of-sample

performance of the models. Accuracy is the ratio of correct predictions to total observations.

Precision is the ratio of true positives to the sum of true positives and false positives. Recall is

the ratio of true positives to the sum of true positives and false negatives. They calculate Precision

and Recall for each category and then average across the three categories. F1 score is the


harmonic mean of Precision and Recall. Among the four architectures, Transfer CNN and

Transfer CNN + Text have the best performance in terms of F1 score and accuracy. Transfer

CNN + Text outperforms the other models with an accuracy of 80.0% and an F1 score of 79.3%. After

fitting the Transfer CNN + Text model, they use the fitted model to obtain a final classification for the

entire image sample.

Table 1 Performance of machine learning models
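The evaluation protocol, ten-fold stratified cross-validation scored by accuracy and macro-averaged Precision, Recall, and F1, can be sketched with scikit-learn. The features and classifier below are synthetic stand-ins for the actual image models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))    # stand-in image features
y = rng.integers(0, 3, size=300)  # three slide categories

# Stratified sampling keeps category proportions similar across folds
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    scores.append({
        "accuracy": accuracy_score(y[test_idx], pred),
        # macro averaging: compute per category, then average over categories
        "precision": precision_score(y[test_idx], pred, average="macro", zero_division=0),
        "recall": recall_score(y[test_idx], pred, average="macro", zero_division=0),
        "f1": f1_score(y[test_idx], pred, average="macro", zero_division=0),
    })
mean_accuracy = float(np.mean([s["accuracy"] for s in scores]))
```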

Based on the classified categories of each slide page, they aggregate the number of pages

under a certain category to the presentation level and scale by the total number of pages. They find a

presentation slide deck includes 3.6% Operations Forward slides and 11% Operations Summary

slides on average. The standard deviations of Operations Forward and Operations

Summary are 5.8% and 12.2%, respectively. Figure 4 shows the time series of the number of

different types of information contained in presentations from 2006 to 2018. Besides a clear

increasing trend in the number of all types of presentations, interestingly, the ratio of

presentations with Operations Forward images also increases over the years. Specifically, in 2006,

only 35% of corporate presentations included Operations Forward images; this number rises to

40% in 2010, and 47% in 2018, suggesting that firms are more likely to include Operations

Forward visual information in recent years.


Figure 4 Time series of corporate presentations

This figure plots annual number of presentations (bar plot and left axis) and the ratio of presentations with
Operations Forward images over all types of presentations (line plot and right axis) in our sample from 2006 to
2018. Presentations are classified as having Operations Forward images if any slide in the presentation displays
figures with Operations Forward information.


References
Cao, S., Cheng, Y., Yang, M., Xia, Y., and Yang, B. 2023. Visual Information in the Age of AI:

Evidence from Corporate Executive Presentations. Working paper.


Chapter 9 Analyzing the Balance Sheet


9.1. Data structure in the balance sheet
Data structure of the balance sheet
A balance sheet provides a snapshot of a company’s financial position at a point in time.

It outlines the company’s resources (assets), namely, what the company owns. The balance sheet

also reports the sources of financing for these assets. There are two primary methods through

which a company can finance its assets. Firstly, it can raise funds from stockholders, known as

owner financing. It can also acquire capital from banks, creditors, and suppliers, which is known

as nonowner financing. This means both owners and nonowners hold claims on the company’s

assets. Owner claims on assets are referred to as equity, and nonowner claims are referred to as

liabilities. As all financing is directed towards investments, we can establish the fundamental

relationship: investing (assets) equals financing (liabilities + equity). This equality is called the

accounting equation (Figure 1).

Figure 1. The accounting equation in the balance sheet
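The accounting equation can be verified mechanically for any balance sheet; the line items and amounts below are hypothetical:

```python
# Toy balance sheet: investing (assets) must equal financing
# (liabilities + equity). All figures are made up for illustration.
balance_sheet = {
    "assets": {"cash": 50, "receivables": 30, "equipment": 120},
    "liabilities": {"accounts_payable": 40, "bank_loan": 60},      # nonowner financing
    "equity": {"common_stock": 70, "retained_earnings": 30},       # owner financing
}

total_assets = sum(balance_sheet["assets"].values())
total_financing = (sum(balance_sheet["liabilities"].values())
                   + sum(balance_sheet["equity"].values()))
balances = (total_assets == total_financing)  # the accounting equation holds
```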

Figure 2 illustrates Los Gatos Corporation’s balance sheet as of December 31, 2013.

There are fourteen line items across the categories of assets, liabilities, and owners’ equity. For a


larger company with intricate business structures and models, the balance sheet could encompass

over 250 variables. These variables serve as valuable information for stakeholders seeking to

understand companies’ operating strategies and outcomes. For instance, managers can leverage

the balance sheet to assess liquidity and solvency, while investors can use it to evaluate

companies’ operating performance.

Figure 2. Balance sheet of Los Gatos Corporation


Debate about fair value accounting


Before delving into the analysis of a balance sheet, it is crucial to acknowledge that the

values of the variables within the balance sheet are determined based on accounting standards

and managerial judgement. Assets and liabilities are measured either at fair value or historical

cost, following relevant accounting standards. Under the historical cost method, the focus

lies on the initial price paid by the company during the acquisition of an asset or the incurrence

of a liability. The balance sheet reflects either the purchase price or a reduced value due to

factors such as obsolescence, depreciation or depletion. For financial assets, the price remains

unchanged until the security is liquidated. Historical cost accounting is considered more

conservative and reliable since it is based on a fixed price that is fully known, namely the actual

price paid by the company. While this eliminates uncertainty in the initial valuation decision, it

introduces uncertainty in future periods regarding the true value of assets, which impairs the

relevance of historical cost.

An alternative approach is to measure assets and liabilities at fair value, which represents

the price at which knowledgeable and willing parties would exchange or settle them. Fair value

accounting entails adjusting the prices of certain assets on the balance sheet in each reporting

period to reflect changes in market prices. Fair value accounting enhances relevance of

accounting information. However, determining the fair value of assets and liabilities is not

always straightforward, as it involves subjectivity. Given the pros and cons of both historical cost

and fair value accounting, debates persist on how assets and liabilities on the balance sheet

should be valued. When analyzing a balance sheet, it is essential to consider not only the values

of the assets and liabilities but also how they are measured.


Furthermore, the evolution of business models across various industries has led to a shift

in value creation, with increasing emphasis on intangible assets such as ideas, knowledge,

brands, content, data, and human capital, rather than physical assets like machinery or factories.

However, the accounting framework has not kept pace with this transformation. Existing

accounting standards often fail to recognize the value generated by certain intangible assets, both

in terms of their representation on the balance sheet or disclosure in footnotes. While tangible

assets like property and equipment are typically included on a company's balance sheet,

investments made in internally-generated intangibles are generally expensed as they are incurred.

Consequently, a company's most valuable assets often remain unaccounted for on its balance

sheet. When examining a balance sheet, it becomes crucial to consider the implicit value of such

intangible assets. Due to the intricate nature of intangibles and the diversity in how companies

manage and investors evaluate them, there is no universally applicable method for their

measurement.

9.2. Empirical example: Analyzing data in the balance sheet

Analyzing the balance sheet


An empirical example of analyzing the balance sheet
Prior studies have investigated the implications of asset growth for shareholders. Fairfield,

Whisenant, and Yohn (2003) find that, after adjusting for current profitability, asset growth

exhibits negative associations with one-year-ahead return on assets. On the other hand, Cooper,

Gulen, and Schill (2008) document that asset growth rates are strong predictors of future stock

returns. Is asset growth good or bad for shareholders?

While stockholders are owners of public companies, the day-to-day control of company

resources lies in the hands of professional managers. This separation of ownership and control

gives rise to what is known as agency problems. One of the typical agency problems is empire


building. Managers are incentivized to grow companies aggressively to fulfill personal and

career ambitions, but reckless expansions could result in inefficient usage of resources and

decreases in shareholder wealth. Therefore, asset growth could suggest healthy growth of

companies’ business but might also indicate empire building. To fully understand the implication

of asset growth, it is important to separate growth of assets from normal business activities and

growth of assets caused by empire building.

Companies engage in a range of activities that can be categorized as either operating or

nonoperating. Operating activities encompass the production and sale of company products and

services to customers. Nonoperating activities involve the nonstrategic investment of cash in

marketable securities and debt financing endeavors. Asset growth can stem from both operating

and nonoperating activities. The growth of total assets can be broken down into two components:

growth funded by operating liabilities and growth funded by debt and equity (refer to Figure 3).

For instance, a company may negotiate favorable credit terms with suppliers, which essentially

represents a loan from the suppliers to the company. Alternatively, the company could obtain a

bank loan to finance purchases from suppliers. Both financing activities increase assets and

liabilities as per the accounting equation.

There is no doubt that both suppliers and banks meticulously assess the financial standing

of a company before making financing decisions. However, suppliers may possess a comparative

advantage in terms of information as they have industry-specific knowledge, engage in daily

business transactions, and have access to unbiased private information. Additionally, suppliers

have stronger economic incentives as they are typically less diversified in credit risk compared to

banks. Consequently, it is plausible to consider whether growth financed by these more informed

stakeholders, such as suppliers, may indicate better future performance, while growth financed


by debt and equity could potentially predict worse future performance. This hypothesis can be

explored by analyzing data from the balance sheet.

Figure 3. Decomposing balance sheet items

In order to examine the consequences of asset growth related to operating and

nonoperating liabilities, Cao (2016) decomposes asset growth into non-financing growth and

financing growth. Non-financing growth in assets is driven by increases in operating liabilities,

such as accounts payable, while financing growth in assets arises from increases in debt or equity

financing, such as bank loans. Non-financing growth can be further decomposed into operating

growth, growth in accounts payable (representing financing from suppliers), and growth in tax

payable (representing financing from tax authorities). The focus is to understand the distinct

implications of operating growth, growth in accounts payable, and growth in tax payable for a

company's operating performance and stock market performance as investors are primarily

concerned with the two performance metrics.
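The decomposition described above reduces to simple balance-sheet arithmetic: total asset growth splits into growth funded by operating liabilities (non-financing growth) and growth funded by debt and equity (financing growth). The field names and figures in this sketch are hypothetical:

```python
def decompose_asset_growth(prev, curr):
    """Split total asset growth into non-financing and financing components."""
    total_growth = curr["total_assets"] - prev["total_assets"]
    # Non-financing growth: increases in operating liabilities such as
    # accounts payable (supplier financing) and taxes payable
    non_financing = ((curr["accounts_payable"] - prev["accounts_payable"])
                     + (curr["tax_payable"] - prev["tax_payable"]))
    # Financing growth: the remainder, funded by debt and equity
    financing = total_growth - non_financing
    return {"total": total_growth,
            "non_financing": non_financing,
            "financing": financing}

prev = {"total_assets": 1000, "accounts_payable": 80, "tax_payable": 20}
curr = {"total_assets": 1150, "accounts_payable": 110, "tax_payable": 25}
growth = decompose_asset_growth(prev, curr)
```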


Figure 4. Decomposing asset growth

Cao (2016) measures operating performance with return on assets (ROA), which is computed as net income divided by the average of total assets at the beginning and end of the year and reflects return from the perspective of the entire company. This measure combines profitability (the numerator) and total assets (the denominator). To earn a high return on assets, managers must generate profit while keeping the assets invested at the level necessary to achieve that profit. The explanatory variables are the various components of asset growth. The regression model is stated below, where ROAt+1 is one-year-ahead ROA; ROAt is current ROA; and ΔROAt is the change in ROA from t-1 to t.

ROAt+1 = β0 + β1Asset growtht + β2ROAt + β3ΔROAt + ε (1)
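A regression of this form can be estimated in a few lines. The sketch below uses simulated firm-year data in place of the actual Compustat sample (all variable values and coefficients are illustrative assumptions, not the study's estimates), fitting Equation (1) by ordinary least squares with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated firm-year data (illustrative stand-ins for Compustat variables)
asset_growth = rng.normal(0.08, 0.2, n)
roa_t = rng.normal(0.05, 0.1, n)
droa_t = rng.normal(0.0, 0.05, n)

# One-year-ahead ROA with an assumed negative loading on asset growth
roa_t1 = 0.01 - 0.05 * asset_growth + 0.6 * roa_t + 0.1 * droa_t \
    + rng.normal(0, 0.02, n)

# Design matrix with an intercept column, matching Equation (1)
X = np.column_stack([np.ones(n), asset_growth, roa_t, droa_t])
betas, *_ = np.linalg.lstsq(X, roa_t1, rcond=None)
b0, b1, b2, b3 = betas
print(f"beta1 (asset growth): {b1:.3f}")
```

On real data the same call would be applied to a dataframe of fundamentals, with standard errors and clustering added via a package such as statsmodels.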

Table 1 tabulates the regression results including various types of asset growth as

explanatory variables. Regression 1 includes growth in accounts payable (OAgrowth_AP) as one

of the explanatory variables. Regression 2 includes growth in operating assets other than

accounts payable and tax payable (OAgrowth_Other) as one of the explanatory variables.

Regression 3 includes both in the same model. The results suggest that growth in accounts


payable is negatively associated with future ROA, while growth in operating assets other than accounts payable and tax payable is positively associated with future ROA.

Regressions 4 and 5 add growth in net operating assets (NOAgrowth) as an additional

explanatory variable. The results show that growth in net operating assets is negatively

associated with future ROA. Regressions 6 and 7 further indicate that growth in both current and

long-term net operating assets is negatively associated with future ROA. Interestingly, both

growth of accounts payable and growth of operating assets other than accounts payable and tax

payable become positively associated with future ROA when controlling for growth of net

operating assets. This means that growth of net operating assets is a correlated omitted variable

in Regression 1 to regression 3.

In Regression 8, the various operating asset growth components are included in one regression and “horse raced” against each other. The results confirm that growth of operating assets financed by

operating liabilities is positively associated with future ROA; growth of operating assets financed

by debt and equity is negatively associated with future ROA.

Table 1. The implications of decomposition of OAgrowth_OL and NOAgrowth for


one-year ahead ROA and stock returns

Cao (2016) measures stock market performance with one-year-ahead stock returns and applies the Fama-MacBeth procedure to investigate the association between


operating growth, growth in accounts payable, and growth in tax payable and a company's stock

market performance.

Table 2 tabulates the regression results. Regression 1 includes growth in accounts

payable (OAgrowth_AP) as one of the explanatory variables. Regression 2 includes growth in

operating assets other than accounts payable and tax payable (OAgrowth_Other) as one of the

explanatory variables. Regression 3 includes both in the same model. Similar to the findings for operating performance, growth in accounts payable is negatively associated with future stock returns, while growth in operating assets other than accounts payable and tax payable is positively associated with future stock returns.

As growth in net operating assets could be an omitted correlated variable, Cao (2016)

includes it as an additional explanatory variable in Regression 4 and Regression 5. The results of

Regression 4 to Regression 7 show that growth in net operating assets is negatively associated

with future stock return. Further, growth of operating assets other than accounts payable and tax

payable becomes positively associated with future stock return when controlling for growth of

net operating assets. Regression 8 confirms that growth of operating assets financed by operating

liabilities is positively associated with future stock return; growth of operating assets financed by

debt and equity is negatively associated with future stock return.

A natural follow-up question is whether investors recognize the difference between the growth of different asset components, and thus whether a profitable trading strategy could be developed. Growth of operating assets financed by operating liabilities should predict future stock returns only if investors do not recognize the difference among the various types of asset growth. To test this possibility, Cao, Wang, and Yeung (2022) regress three-day stock returns around earnings announcements on asset growth. Models (1) to (3) in Table 3 show that asset growth


financed by operating liabilities (OPERATING_GROWTH) is positively associated with three-day earnings announcement returns over the subsequent two quarters, and asset growth financed by nonoperating liabilities (FINANCING_GROWTH) is positively associated with three-day earnings announcement returns over the subsequent three quarters. The results suggest that investors do not initially recognize the difference between asset growth financed by operating liabilities and that financed by debt or equity; on average, it takes investors six months to figure out the difference.

Table 2. Fama-MacBeth regressions of subsequent stock returns on decompositions of


OAgrowth_OL and NOAgrowth

Table 3. Regressing quarterly earnings announcement return on growth variables

Since average investors do not immediately recognize the difference between asset

growth driven by operating liabilities and by equity or debt, a profitable trading strategy can be

developed by buying stocks with low growth in net operating assets and selling stocks with high growth in net operating assets. The outcome of such a trading strategy can be examined using the

portfolio sort method.


Specifically, stocks are sorted based on total asset growth (TAgrowth) and net operating asset growth (NOAgrowth) into quintiles by year, forming 25 portfolios. Then average returns on

these portfolios are computed over a year. The equal-weighted portfolios show that the trading

strategy based on net operating asset growth yields returns ranging from 6.69% to 12.29%

holding total asset growth constant. The value-weighted portfolios show that the trading strategy

yields returns ranging from 7.67% to 9.05% holding total asset growth constant. In contrast, a trading strategy based on total asset growth does not yield significant positive returns.
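A two-way sort of this kind can be sketched with pandas. The data below are simulated and the negative NOAgrowth-return relation is assumed for illustration; the actual study uses CRSP/Compustat data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5000

df = pd.DataFrame({
    "year": rng.integers(2000, 2010, n),
    "TAgrowth": rng.normal(0.10, 0.30, n),
    "NOAgrowth": rng.normal(0.05, 0.25, n),
})
# Assumed negative relation between NOA growth and next-year return
df["ret"] = -0.1 * df["NOAgrowth"] + rng.normal(0, 0.2, n)

# Independent quintile sorts within each year -> 25 portfolios
df["ta_q"] = df.groupby("year")["TAgrowth"].transform(
    lambda x: pd.qcut(x, 5, labels=False))
df["noa_q"] = df.groupby("year")["NOAgrowth"].transform(
    lambda x: pd.qcut(x, 5, labels=False))

# Equal-weighted average return of each of the 25 portfolios
port = df.groupby(["ta_q", "noa_q"])["ret"].mean().unstack("noa_q")

# Hedge return per TAgrowth quintile: long low-NOAgrowth, short high-NOAgrowth
spread = port[0] - port[4]
print(spread)
```

Holding the TAgrowth quintile fixed and differencing across NOAgrowth quintiles is what isolates the net-operating-asset-growth effect from total asset growth.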

Table 4. Comparisons of one-year-ahead abnormal returns of portfolio based on


NOAgrowth and TAgrowth


9.3. Machine learning application on balance sheet data

The field of equity markets research has experienced significant growth and advancements

due to the utilization of AI and big data. Initially, early studies in this domain focused on

employing new methodologies and data to gain deeper insights into existing research questions,

particularly in the areas of earnings and returns forecasting. Machine learning algorithms have

proven advantageous over traditional regression techniques, as they allow for non-linearity, the

incorporation of high-dimensional and complex time-series data, and the implementation of cross-

validation techniques. Consequently, the adoption of AI and machine learning algorithms holds

the potential to enhance forecasting performance. However, more recent studies have shifted their

focus towards addressing emerging questions that have arisen as a result of AI and big data,

including the comparison between human and machine performance in various tasks. This

transition reflects the evolving landscape of equity markets research driven by technological

advancements.

In a recent study, Chen, Cho, Dou and Lev (2022) use decision tree methods to predict

the sign of future earnings changes, comparing them with a conventional logit model and financial

analyst forecasts. They feed in more than 4,000 financial items identified through XBRL tags in

corporate 10-K filings in current and lagged years, as well as their annual changes, which together

yield over 12,000 input variables. For every three-year period, they assign the first two years as a

training period and the final year as the validation period for model selection, and then conduct

out-of-sample tests. They find that the machine learning approach demonstrates better out-of-

sample predictive power and significant returns to portfolios formed on the basis of the AI-

generated predictions. It should be noted, however, that the logit model is estimated based on a


limited group of selected input variables; therefore, the superiority of the decision tree approach is

attributable to both nonlinearity and the large group of inputs.
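The rolling train/validation/test scheme described above can be sketched with scikit-learn's decision tree on synthetic data (the feature construction, sample sizes, and candidate depths are illustrative assumptions, not the paper's specification):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
n, k = 3000, 20  # firm-years and financial-statement features (illustrative)

X = rng.normal(size=(n, k))
# Sign of next year's earnings change, driven nonlinearly by a few features
y = ((X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.5, n)) > 0).astype(int)

# Mimic the rolling scheme: first two "years" train, third validates, then test
train, val, test = slice(0, 1000), slice(1000, 2000), slice(2000, 3000)

# Model selection on the validation year: pick the best max_depth
best_depth, best_acc = None, -1.0
for depth in (2, 4, 6, 8):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X[train], y[train])
    acc = accuracy_score(y[val], clf.predict(X[val]))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Out-of-sample test with the selected model
clf = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
clf.fit(X[train], y[train])
test_acc = accuracy_score(y[test], clf.predict(X[test]))
print(f"selected depth={best_depth}, out-of-sample accuracy={test_acc:.3f}")
```

The validation year serves only for hyperparameter choice; reported performance always comes from the held-out final period.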

Cao and You (2021) examine the efficacy of machine learning in forecasting corporate

earnings as compared to conventional fundamental analysis models. They specifically examine

three linear machine learning models, ordinary least squares regression (OLS), least absolute

shrinkage and selection operator (LASSO), and Ridge regression (RIDGE), as well as three

nonlinear machine learning models, random forest (RF), gradient boosting regression (GBR) and

artificial neural networks (ANNs). These they compare with six conventional time-series and

cross-sectional models. Feeding in a selection of 56 features from financial statements, they find

that nonlinear machine learning models generate significantly more accurate and informative

forecasts than the conventional forecasting models found in the literature. Notably, the superior

forecasting capabilities of nonlinear machine learning models are attributable to both the

nonlinearity and more disaggregated input features.
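The gap between linear and nonlinear learners is easy to reproduce on data containing an interaction effect. The sketch below compares LASSO with gradient boosting out of sample; the data-generating process and tuning values are illustrative assumptions, not the study's setup:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n_train, n_test, k = 2000, 1000, 10

X = rng.normal(size=(n_train + n_test, k))
# Earnings with a nonlinear interaction a linear model cannot capture
y = (0.3 * X[:, 0]
     + 0.5 * np.where(X[:, 1] > 0, X[:, 2], -X[:, 2])
     + rng.normal(0, 0.3, n_train + n_test))

Xtr, ytr = X[:n_train], y[:n_train]
Xte, yte = X[n_train:], y[n_train:]

models = {
    "LASSO": Lasso(alpha=0.01),
    "GBR": GradientBoostingRegressor(random_state=0),
}
mse = {}
for name, m in models.items():
    m.fit(Xtr, ytr)
    mse[name] = np.mean((m.predict(Xte) - yte) ** 2)
print(mse)
```

Because the interaction term is linearly uncorrelated with every individual feature, the LASSO's out-of-sample error stays near the unconditional variance, while tree-based boosting recovers much of the signal.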

Chattopadhyay, Fang and Mohanram (2022) apply machine learning algorithms in

earnings forecasts in an international setting, finding that the GBR and RF models perform the

best in a global setting compared to other simple linear models in the extant literature. The

performance gain is particularly large for international firms with poorer information environments

and more volatile earnings.

Binz, Schipper and Standridge (2022) apply a neural-network-based machine learning

algorithm in a DuPont analysis framework to estimate the Nissim and Penman (2001) structure for

decomposing accounting profitability, and compare its out-of-sample predictability with random

walk and linear regression models. Unlike the previous two studies, the inputs of their machine

learning algorithm are ratios based on Nissim and Penman (2001), and their focus is on the


nonlinear relation between the input factors and the target profitability measures to be forecasted.

They find that machine learning algorithms that incorporate nonlinearity perform better than the

random walk or linear models and that investing strategies based on intrinsic values generated

from those forecasts generate significant abnormal returns. They further find that using a long time

series of past information actually impairs forecasting performance.

Artificial intelligence (AI) and machine learning technologies have significantly

influenced various aspects of our lives, including lifestyle, culture, economy, and environment.

Accounting research has not been immune to this impact. Taking advantage of the advancements

in AI and machine learning, accounting researchers have begun to harness AI technologies and

new data in the realms of asset, liability, and equity.


Appendix 9. Regression Methods

A9.1. Linear Regression

An overview of regressions
Three examples of regressions
We use regression to develop an understanding of the relationships between variables. In

regression, and in statistical modeling in general, we want to model the relationship between an

output variable, or a response, and one or more input variables, or factors. Depending on the

context, output variables might also be referred to as dependent variables, outcomes, or

simply Y variables, and input variables might be referred to as explanatory

variables, effects, predictors or X variables. We can use regression, and the results of regression

modeling, to determine which variables have an effect on the response or help explain the

response. This is known as explanatory modeling. We can also use regression to predict the

values of a response variable based on the values of the important predictors. This is generally

referred to as predictive modeling.

Simple linear regression is used to model the relationship between two continuous

variables. The model below describes how Y changes for given values of X. Because the

individual data values for any given value of X vary randomly about the mean, we need to

account for this random variation, or error, in the regression equation. We add the Greek letter

epsilon to the equation to represent the random error in the individual observations:

Y=β0+β1X1+ε

Multiple linear regression is used to model the relationship between a continuous

response variable and a series of continuous or categorical explanatory variables. When we fit a

multiple linear regression model, we add a slope coefficient for each explanatory variable. Each


coefficient represents the average increase in Y for every one-unit increase in that explanatory

variable, Xi, holding the other explanatory variables constant.

Y=β0+β1X1+ β2X2+…+ βiXi+ε
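Under the hood, the OLS coefficients solve the normal equations, β = (X′X)⁻¹X′y. A minimal numpy sketch with simulated data (the coefficient values are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# True relation (assumed for illustration): intercept 2.0, slopes 1.5 and -0.8
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 0.5, n)

# Add an intercept column and solve the normal equations: beta = (X'X)^-1 X'y
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # approximately [2.0, 1.5, -0.8]
```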

A9.2. Fama-MacBeth Regression

Fama-MacBeth regression
The Fama-MacBeth procedure is used to estimate consistent standard errors in the presence of cross-sectional correlation. In the first step, the model is estimated as a cross-sectional regression at each date t, yielding a time series of T estimates of the coefficient of interest. In the second step, the overall estimate (λ) and its standard error (SE) are computed from this time series of estimates under the assumption that the error terms are uncorrelated over time. A more modern approach is to run a standard panel regression and cluster standard errors on the date variable.

Y = β0 + β1X1 + ε

λ = (1/T)·Σt β1,t,  SE = √( Σt (β1,t − λ)² / T² )
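The two-step procedure can be sketched in a few lines on simulated data (the characteristic and return values are illustrative; here the standard error uses the sample standard deviation of the period-by-period estimates):

```python
import numpy as np

rng = np.random.default_rng(5)
T, N = 60, 200  # time periods and stocks

lambdas = []
for t in range(T):
    x = rng.normal(size=N)               # firm characteristic at date t
    r = 0.5 * x + rng.normal(0, 2.0, N)  # cross-section of returns at date t
    # Step 1: cross-sectional regression at each date t
    X = np.column_stack([np.ones(N), x])
    b = np.linalg.solve(X.T @ X, X.T @ r)
    lambdas.append(b[1])

lambdas = np.array(lambdas)
# Step 2: average the time series of estimates; SE from their variation
lam = lambdas.mean()
se = lambdas.std(ddof=1) / np.sqrt(T)
t_stat = lam / se
print(f"lambda={lam:.3f}, t={t_stat:.2f}")
```

The cross-sectional correlation in returns is absorbed into each period's estimate, so the time-series standard error of the λ's is robust to it.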

A9.3. Portfolio Sorts

Two-way sorting
Risk-adjusted return sorting
Economic theory, or empirical conjecture, often yields a prediction that expected returns

should be increasing (or decreasing) in some characteristic or feature. Portfolio sorts are widely used to test such a theory or conjecture. One of the appeals of tests of the “top-minus-

bottom” spread in portfolio returns is that they can be interpreted as the expected return on a

trading strategy: short the bottom portfolio and invest in the top portfolio, reaping the difference

in expected returns. A test based on a portfolio sort is usually conducted as follows:


• Individual stocks are sorted according to a given characteristic;

• These stocks are then grouped into N portfolios;

• Average returns on these portfolios over a subsequent period are then computed;

• The significance of the relationship is judged by whether the “top” and “bottom”

portfolios have significantly different average returns.
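The four steps above can be sketched directly in pandas on simulated data (the signal and return names are illustrative, and the positive signal-return relation is assumed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 4000

df = pd.DataFrame({"signal": rng.normal(size=n)})
# Assumed positive relation between the characteristic and subsequent return
df["ret"] = 0.05 * df["signal"] + rng.normal(0, 0.3, n)

# Sort on the characteristic, group into N portfolios, average returns
N = 10
df["port"] = pd.qcut(df["signal"], N, labels=False)
avg = df.groupby("port")["ret"].mean()

# Top-minus-bottom spread and a simple two-sample t-test of its significance
top = df.loc[df["port"] == N - 1, "ret"]
bot = df.loc[df["port"] == 0, "ret"]
spread = top.mean() - bot.mean()
t_stat = spread / np.sqrt(top.var(ddof=1) / len(top)
                          + bot.var(ddof=1) / len(bot))
print(f"top-minus-bottom={spread:.4f}, t={t_stat:.2f}")
```

In research settings the spread would instead be computed period by period and tested with the time series of spreads, which guards against cross-sectional correlation.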


References

Binz, O., Schipper, K., and Standridge, K. 2022. What Can Analysts Learn from Artificial Intelligence about Fundamental Analysis? Working paper.

Cao, K., and You, H. 2021. Fundamental Analysis via Machine Learning. Working paper.

Cao, S. 2016. Reexamining Growth Effects: Are All Types of Asset Growth the Same? Contemporary Accounting Research, 33(4), 1518-1548.

Cao, S., Wang, Z., and Yeung, P.E. 2022. Skin in the Game: Operating Growth, Firm Performance, and Future Stock Returns. Journal of Financial and Quantitative Analysis, 57(7), 2559-2590.

Chattopadhyay, A., Fang, B., and Mohanram, P. 2022. Machine Learning, Earnings Forecasting and Implied Cost of Capital – US and International Evidence. Working paper.

Chen, X., Cho, T., Dou, Y., and Lev, B. 2022. Predicting Future Earnings Changes Using Machine Learning and Detailed Financial Data. Journal of Accounting Research, 60(2), 467-515.

Cooper, M., Gulen, H., and Schill, M. 2008. Asset Growth and the Cross-Section of Stock Returns. Journal of Finance, 63(4), 1609-1651.

Fairfield, P., Whisenant, S., and Yohn, T. 2003. Accrued Earnings and Growth: Implications for Future Profitability and Market Mispricing. The Accounting Review, 78(1), 353-371.


Chapter 10 Analyzing the Income Statement


10.1. Data structure in income statement
Data structure of the income statement
The income statement reports on a company’s performance over a period of time and lists

amounts for its top line revenue and its expenses. Revenue less expenses equals the bottom-line

net income. The income statement reports on both operating and nonoperating activities.

Operating activities are those that relate to bringing a company’s products or services to market

and any after-sales support. The income statement captures operating revenues and expenses,

yielding operating profit. Major operating line items in the income statement are revenues, costs

of goods sold (COGS), and selling, general, and administrative expense (SG&A). Nonoperating

activities relate to such items as borrowed money that creates interest expense and nonstrategic

investments in marketable securities that yield interest or dividend revenue. Typical

nonoperating line items on the income statement include interest expense on debt and lease

obligations, loss or income relating to discontinued operations, debt issuance and retirement

costs, interest and dividend income on investments, and gains or losses on the sale of

investments. Figure 1 provides an example of an income statement filed by Microsoft

Corporation.

Operating profit less income tax on operating profit results in net operating profit after

tax. This measure of a company’s operating performance warrants special attention because it is

the lifeblood of a company’s value creation and growth. Total profit less total income tax results

in net income. Net income is not equivalent to net operating profit after tax because nonoperating

income could be transitory or irrelevant. Holding net income constant, a company with more

operating income would have higher-quality earnings.


Figure 1. An income statement of Microsoft Corporation.

The proportion of income attributable to the core operating activities of a business is one

important aspect of earnings quality. Thus, if a business reports an increase in profits due to

improved sales or cost reductions, the quality of earnings is considered to be high. Conversely,

an organization can have low-quality earnings if changes in its earnings relate to other issues,

such as the aggressive use of accounting rules, inflation, the sale of assets for a gain, or increases

in business risk. In general, any use of accounting trickery to temporarily bolster earnings

reduces the quality of earnings. A key characteristic of high-quality earnings is that the earnings

are readily repeatable over a series of reporting periods, rather than being earnings that are only

reported as the result of a one-time event.


Figure 2. Operating income and non-operating income

10.2. Earnings and stock prices

Data in the income statement is closely related to stock markets. There is a natural

positive relation between expected earnings and stock prices because investors expect dividends,

which are paid out of earnings. Early research by Ball and Brown (1968) confirmed this expected

relation.

Figure 3. Relation between expected earnings and stock prices


We can use regression to confirm the association between earnings and stock prices. In

the equation P=β0+β1X+ε, P represents stock price and X represents earnings per share. β1

reflects the relation between earnings and stock prices, or the earnings multiple.

Figure 2. 2002 P/E relationship for 40 restaurant companies

Earnings multiples often vary significantly across companies, industries, and business

cycles. This is because the stock price reflects the present value of all expected future cash flows (CFt), which depend on current and future earnings.

P=β0+β11X1+β12X2+β13X3+ε

The permanent component of earnings (X1) has a lasting impact on future cash flows

while the transitory component of earnings (X2) has only a one-time effect on future cash flows.

Value-irrelevant components of earnings (X3) might have minimal or no impact on future cash

flows. If we regress stock price, P, on the three components of earnings, we would obtain the

earnings multiple corresponding to each earnings component. The overall earnings multiple, β1, is

determined by the weight of each earnings component and the earnings multiple corresponding

to each earnings component.


Figure 4. Components of earnings

Investors like to see high-quality earnings, since these earnings tend to be repeated in

future periods and provide more cash flows for investors. Thus, entities that have high-quality

earnings are also more likely to have high stock prices. Conversely, those entities reporting

lower-quality earnings will not attract investors, resulting in lower stock prices. For example, in

Table 1, both firm A and firm B report earnings of $10 per share. 60 percent of firm A’s earnings

are permanent, 30 percent are transitory, and only 10 percent are value-irrelevant. In contrast,

only 50 percent of firm B’s earnings are permanent, 20 percent are transitory, and the remaining

30 percent are value-irrelevant. Since firm A has a larger proportion of permanent component

and a smaller proportion of value-irrelevant component than firm B, firm A’s earnings quality is

deemed to be better than that of firm B. As a result, firm A has an earnings multiple of 3.3,

higher than firm B’s earnings multiple of 2.7.
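The multiples in this example are consistent with assumed component multiples of 5 for permanent, 1 for transitory, and 0 for value-irrelevant earnings (these component multiples are an illustrative assumption, not given in the text). A quick check of the arithmetic:

```python
# Assumed component multiples consistent with the example in the text
multiples = {"permanent": 5.0, "transitory": 1.0, "irrelevant": 0.0}

# Earnings composition (weights) of the two firms from Table 1
firm_a = {"permanent": 0.6, "transitory": 0.3, "irrelevant": 0.1}
firm_b = {"permanent": 0.5, "transitory": 0.2, "irrelevant": 0.3}

def overall_multiple(weights):
    # Overall earnings multiple = weighted average of component multiples
    return sum(weights[c] * multiples[c] for c in multiples)

print(overall_multiple(firm_a))  # ~3.3
print(overall_multiple(firm_b))  # ~2.7
```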


Table 1. Comparison of high and low earnings quality

10.3. Empirical example: Post-earnings announcement drift (PEAD) anomaly


An empirical example of analyzing the income statement
Post-earnings announcement drift is the phenomenon whereby a stock's cumulative abnormal returns tend to keep moving in the same direction as the firm's earnings surprise for an extended period after the announcement, so that portfolios formed on information in past earnings earn abnormal returns. Ball and Brown (1968) were the first to note that even after earnings are announced, estimated cumulative abnormal returns continue to drift up for “good news” firms and down for “bad news” firms. Foster, Olsen, and Shevlin

(1984) estimate that over the 60 trading days subsequent to an earnings announcement, a long

position in stocks with unexpected earnings in the highest decile, combined with a short position

in stocks in the lowest decile, yields an annualized abnormal return of about 25 percent before

transactions costs.

One class of explanations for PEAD suggests that at least a portion of the price response

to new information is delayed. The delay might occur either because traders fail to assimilate


available information, or because certain costs (e.g., transaction costs) exceed gains from

immediate exploitation of information for a sufficiently large number of traders. What is less

clear is why a delayed price response would occur.

One possibility is that the market erroneously assumes a seasonal random walk for expected earnings and ignores the autocorrelation in earnings. For instance, a company

announces a new long-term contract with a customer that increases earnings at t, resulting in

large positive seasonally adjusted unexpected earnings (SUEt). This contract would also bring in

a stream of future earnings, leading to large positive SUE in subsequent years (SUEt+1, SUEt+2,

SUEt+3, etc.). The value of this stream of earnings, which is the present value of all future SUEs, should be factored into the stock price at t. If all investors immediately factor the present value of all future SUEs into the stock price at time t, current SUEt should not be associated with future abnormal returns. If investors fail to adjust stock price expectations immediately, then current SUEt should predict future abnormal returns. Specifically, assuming investors delay their response to the stream of SUEs from t to t+1, current SUEt should predict the abnormal return at t+1. As a result, informed investors can earn abnormal returns by buying stocks with high SUE and shorting stocks with low SUE at time t. This prediction can be examined by estimating the following regression

model. If SUEt predicts abnormal returns at t+1, θ1 should be significantly positive.

Abrett+1=α+θ1SUEt+εt+1 (1)
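Estimating model (1) first requires constructing SUE. The sketch below shows one common construction under the seasonal random walk assumption, the seasonal difference in quarterly EPS scaled by the volatility of recent seasonal differences; the exact scaling varies across studies, and the EPS series here is simulated:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Simulated quarterly EPS for one firm: a random walk around a 0.25 base
q = 40
eps = pd.Series(np.cumsum(rng.normal(0, 0.02, q)) + 0.25)

# Seasonal difference: this quarter's EPS minus EPS four quarters ago
seas_diff = eps.diff(4)

# SUE: seasonal difference scaled by the rolling volatility of past
# seasonal differences (an 8-quarter window is an illustrative choice)
sue = seas_diff / seas_diff.rolling(8).std()
print(sue.dropna().tail())
```

With SUE in hand, model (1) is an ordinary cross-sectional regression of next-period abnormal returns on current SUE.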

For example, suppose Apple Inc. signed a contract with a client that yielded a $1 million

profit at time t, and this contract possesses a 40% likelihood of renewal. Hence, the overall

estimated value of the contract would be $1.4 million. If all investors were aware of the 40%

renewal probability, the current SUE would have no relation to future returns, as the market has

already incorporated all relevant contract information into stock prices in a timely manner.


However, if investors undervalued the contract and estimated its value to be $1.2 million, the

current SUE could predict future stock returns. This is because the information about the

remaining $0.2 million has not yet been reflected in stock prices at time t. In such a case,

informed investors can make profits by buying stocks with high SUE and selling stocks with low

SUE at a given point in time.

Figure 4. Seasonally adjusted unexpected earnings and announcement returns

Cao and Narayanamoorthy (2012) verify the existence of the PEAD anomaly. Panel A of

Table 2 shows that SUE at t predicts future SUEs up to the subsequent three quarters. Panel B

and Panel C show the association between current SUE and future three-day earnings

announcement abnormal returns up to two quarters and quarter-long abnormal returns up to three

quarters. Variable definitions are listed in Appendix 10.

Figure 6 illustrates that from 1988 to 2008, implementing a trading strategy based on the

PEAD anomaly consistently resulted in positive abnormal returns. In practice, this strategy is

commonly employed by investors.


Table 2. SUE persistence and abnormal returns

Figure 6. Trading strategy based on the PEAD anomaly

It is intuitive to posit that the level of autocorrelation is lower for the companies with

more volatile earnings. As shown in Table 3, the autocorrelation coefficient is significantly

smaller when earnings volatility (EVOL) increases.

To examine whether the effect of earnings volatility on the level of SUE autocorrelation

carries over to the association between current SUE and future abnormal returns, we can augment

the regression model above to include an interaction term between SUEt and EVOLt .

Abrett+1=α+θ1SUEt+θ2EVOLt +θ3SUEt* EVOLt +εt+1 (2)


Table 3. Effect of earnings volatility on SUE persistence

The results of this analysis are provided in Table 4. The coefficient on DSUE in Model 1 of Panel A suggests that the abnormal return in the three days surrounding the earnings announcement at t+1 is approximately 0.7 percent. Model 1 of Panel B suggests that the abnormal stock return from quarter t to quarter t+1 is about 6 percent. The

negative coefficient of EVOL*DSUE suggests that abnormal returns are higher for the stocks

with less earnings volatility. Model 2 and Model 3 in both panels control for company size and

whether a company suffers losses at t. The effect of earnings volatility is robust to including both

controls.

As noted above, the delay in investor response might occur because transaction costs are

prohibitively high. To exclude this alternative explanation, we include earnings volatility and a

proxy for transaction costs, SPREAD, in the same regression. If θ3 continues to be significantly

negative, then the effect of earnings volatility is at least not completely attributable to transaction


costs. Table 5 shows that θ3 remains significantly negative, indicating a significant effect of

earnings volatility on the association between SUE and abnormal returns in addition to the effect

of transaction costs.

Abrett+1 = α + θ1SUEt + θ2EVOLt + θ3SUEt*EVOLt + θ4SPREADt + θ5SUEt*SPREADt + εt+1 (3)

Table 4. Effect of earnings volatility on PEAD returns

Table 5. Effect of earnings volatility on PEAD returns controlling for Spread


10.4. Machine learning application on income statement data


It is important to understand the patterns underlying income statements in order to make

informed investment decisions. However, the complexity arises from the substantial number of

income-related variables. In general, it is challenging to extract valuable insights from extensive

data. Then, the question arises: how can we effectively gain insights from the vast amount of

data present in income statements?

There are two primary approaches to gaining insights from the broad range of income-

related variables present in income statements. The first approach is theory-based, where we first

come up with a hypothesis based on established theories and test its validity. By doing so, we

can determine whether the patterns suggested by the theory can provide valuable guidance to

investors.

Another approach is through the application of machine learning techniques, which is

particularly suited for handling high dimensional data. In this approach, all income-related

variables are incorporated into machine learning models. Then, the machine processes the data,

continuously learning and adapting to identify patterns and relations between the variables and

subsequent stock returns. The advantage of using machine learning techniques lies in their ability

to handle vast amounts of data and identify complex patterns that might not be apparent through

traditional analysis methods. Specifically, machine learning algorithms have the capacity to

analyze the interactions among multiple income-related variables, thereby discovering

hidden patterns and providing valuable insights.
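The idea of letting the machine search for patterns can be illustrated with the simplest possible learner: a single decision stump that scans every income-related variable and split point for the split that best separates subsequent high and low stock returns. All data below are made up for illustration.

```python
# A minimal machine-learning sketch: one decision stump searching all
# income-statement variables for the best return-separating split.
def best_stump(X, y):
    """Return (sse, feature, threshold, left_mean, right_mean) for the
    split of X that minimizes squared error in predicting y."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[:-1]:
            left = [y[i] for i, row in enumerate(X) if row[j] <= t]
            right = [y[i] for i, row in enumerate(X) if row[j] > t]
            m_l = sum(left) / len(left)
            m_r = sum(right) / len(right)
            sse = (sum((v - m_l) ** 2 for v in left)
                   + sum((v - m_r) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, m_l, m_r)
    return best

# Hypothetical firm-quarters: [gross margin, SG&A-to-sales] -> next-quarter return
X = [[0.40, 0.20], [0.45, 0.22], [0.10, 0.30], [0.12, 0.28]]
y = [0.05, 0.06, -0.04, -0.03]
sse, feature, threshold, low_mean, high_mean = best_stump(X, y)
```

Real applications stack thousands of such splits (as in random forests or gradient boosting), which is how these models capture the interactions among income-related variables described above.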


Appendix 10. Key variable explanations


A10.1. Abnormal return
Abnormal return, also known as “excess return,” refers to the unanticipated profits (or

losses) generated by a security/stock. Abnormal returns are measured as the difference between

the actual returns that investors earn on an asset and the expected returns. Expected returns are

estimated using, for example, the Capital Asset Pricing Model (CAPM). Abnormal returns can be positive or negative.

Positive abnormal returns are realized when actual returns are greater than expected returns.

Negative abnormal returns (or losses) occur when the actual return is lower than what was

expected.

Abnormal return = Actual return - Expected return

For instance, in Cao and Narayanamoorthy (2012), ARq,t+1 is calculated by subtracting the return on the CRSP value-weighted index from the raw return over the period from two days after the quarter t earnings announcement date to one day before the next announcement date. ARs,t+1 is calculated by subtracting the return on the CRSP value-weighted index from the raw return over the three-day window (−1, +1) around quarter t+1’s announcement.
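The definition can be sketched numerically. Here the market index return stands in for the expected return, as in the market-adjusted returns above; all numbers are made up for illustration.

```python
# Abnormal return = actual return - expected (here, market index) return.
actual = [0.012, -0.004, 0.021]   # stock's raw daily returns (hypothetical)
market = [0.005, -0.001, 0.010]   # market index daily returns (hypothetical)

abnormal = [a - m for a, m in zip(actual, market)]
```

Day 1 yields a positive abnormal return (0.012 − 0.005), day 2 a negative one (−0.004 − (−0.001)).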

A10.2. Expected return


The expected return is equal to the risk-free return plus a risk premium. It is based on the

idea of systematic risk (otherwise known as non-diversifiable risk) that investors need to be

compensated for in the form of a risk premium. A risk premium is a rate of return greater than

the risk-free rate. Investors demand a higher risk premium when taking on riskier investments.

Expected return = Risk-free rate +β* (Market return - Risk-free rate)
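Plugging illustrative numbers into the formula makes the computation concrete; all inputs below are assumed values.

```python
# Numerical sketch of the CAPM expected-return formula (assumed inputs).
risk_free = 0.02       # risk-free rate (assumed)
beta = 1.2             # stock's sensitivity to market risk (assumed)
market_return = 0.08   # expected market return (assumed)

expected_return = risk_free + beta * (market_return - risk_free)
```

With β = 1.2, the expected return of 9.2% exceeds the 8% market return, reflecting the stock's above-market risk.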


The risk premium required by investors depends on the β of that stock. β measures a stock’s risk by the fluctuation of its price relative to the overall market; in other words, it is the stock’s sensitivity to market risk. For instance, if a company’s β

is equal to one, the expected return on a stock is equal to the average market return. A β of -1

means a stock has a perfect negative correlation with the market.

Cumulative Abnormal Return (CAR) refers to the sum of abnormal returns over a given window. It allows investors to measure the performance of an asset or security over that window, especially since

abnormal returns over short windows tend to be biased.

Figure 1. Capital Asset Pricing Model

The Security Market Line (SML) graphically represents the relationship between

expected returns and the associated risk levels. A security or portfolio that is in equilibrium lies

on the SML, indicating that it is fairly priced, as its expected return equals the return required by

the market at that level of risk. Assets lying above the SML are undervalued, as they offer a

higher return than what is required for their level of risk. Conversely, assets lying below the

SML are overvalued, as they offer a lower return than what is required for their level of risk.


A10.3. Standardized unexpected earnings


The Standardized Unexpected Earnings (SUE) is calculated by taking the difference

between the current earnings and the earnings from the same quarter in the previous year,

divided by the closing market value of the preceding fiscal quarter.

DSUE is the SUE decile rank for each quarter transformed by dividing the rank by 9 and

subtracting 0.5, resulting in values that range from −0.5 to +0.5.
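The two definitions above translate directly into code; the numbers in the note below are illustrative.

```python
# Sketch of SUE and the DSUE decile-rank transform described above.
def sue(earnings_t, earnings_t_minus_4, lagged_market_value):
    """Seasonal change in quarterly earnings, scaled by the closing
    market value of the preceding fiscal quarter."""
    return (earnings_t - earnings_t_minus_4) / lagged_market_value

def scaled_decile(rank):
    """Map a decile rank in 0..9 onto the interval [-0.5, +0.5]."""
    return rank / 9 - 0.5
```

For example, earnings of 120 against 100 a year earlier, on a lagged market value of 1,000, give a SUE of 0.02; decile ranks 0 and 9 map to −0.5 and +0.5.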

A10.4. Earnings volatility


Earnings volatility (VOL) is the variance of the most recent eight quarterly earnings

(including quarter t), scaled by average total assets.

EVOL is the earnings volatility (VOL) decile rank for each quarter transformed by

dividing the rank by 9 and subtracting 0.5, resulting in values that range from −0.5 to +0.5.
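A minimal sketch of the volatility measure follows. One plausible reading of the definition above is applied here: each quarter's earnings are first scaled by average total assets, and the variance is then taken over the most recent eight quarters; the numbers are made up.

```python
# Sketch of earnings volatility: variance of the last eight quarters of
# asset-scaled earnings (one plausible reading of the definition above).
def evol(quarterly_earnings, avg_total_assets):
    scaled = [e / avg_total_assets for e in quarterly_earnings]
    m = sum(scaled) / len(scaled)
    return sum((x - m) ** 2 for x in scaled) / len(scaled)

earnings = [10, 12, 8, 11, 9, 13, 7, 10]  # most recent eight quarters (hypothetical)
vol = evol(earnings, avg_total_assets=500.0)
```

The resulting VOL would then be ranked into deciles each quarter and rescaled to [−0.5, +0.5] to form EVOL, exactly as DSUE is built from SUE.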


References

Ball, R., and Brown, P. 1968. An Empirical Evaluation of Accounting Income Numbers. Journal of Accounting Research, 6(2), 159-178.

Cao, S., and Narayanamoorthy, G. 2012. Earnings Volatility, Post-Earnings Announcement Drift, and Trading Frictions. Journal of Accounting Research, 50(1), 41-74.

Foster, G., Olsen, C., and Shevlin, T. 1984. Earnings Releases, Anomalies, and the Behavior of Security Returns. The Accounting Review, 59(4), 574-603.

