Date of Award




Document Type


Degree Name

Doctor of Philosophy (PhD)


Department of Information Science

Content Description

1 online resource (xiii, 194 pages) : illustrations (some color)

Dissertation/Thesis Chair

Jagdish Gangolly

Committee Members

Sue Faerman, Ozlem Uzuner


Keywords

10-Ks, annual report, detection, fraud, qualitative content, Corporation reports, Accounting fraud, Fraud investigation

Subject Categories

Accounting | Library and Information Science | Linguistics

Abstract


High-profile cases of fraudulent financial reporting, such as those at Enron and WorldCom, have shaken public confidence in the U.S. financial reporting process and raised serious concerns about the roles of auditors, regulators, and analysts in financial reporting. The Sarbanes-Oxley Act (SOX) of 2002 was enacted to address these concerns and restore public confidence. However, SOX has not lived up to its promise: numerous cases of fraudulent financial reporting have surfaced in the post-SOX era. So far, the major thrust of research has been on examining fraud that has already been discovered. This dissertation develops a methodology for proactively detecting fraud by examining the qualitative content of annual reports with natural language processing tools. The methodology is built on Support Vector Machines, a supervised machine learning technique. We examine both the verbal content and the presentation style of the qualitative portion of annual reports and explore linguistic features that distinguish fraudulent annual reports from non-fraudulent ones. Investigating qualitative content is important for fraud detection because the text of annual reports contains richer information than financial ratios, which can be easily camouflaged. This study also creates a classification metric for early prediction of fraud by examining changes in the qualitative content of annual reports across the pre-fraud, fraud, and post-fraud periods of companies that committed fraud. What distinguishes this methodology from earlier research on fraud detection is its use of qualitative textual content rather than quantitative financial information such as ratios, which, as discussed in the literature, have limited ability to predict fraud. Our results indicate that linguistic features are an effective means of detecting fraud.
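
The general idea of classifying report text with a Support Vector Machine over linguistic features can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two features (average sentence length and type-token ratio), the function names, and the plain subgradient-descent solver are hypothetical stand-ins, not the feature set, tools, or training procedure used in the dissertation.

```python
import re

def style_features(text):
    """Return two illustrative style features (hypothetical choices):
    [average sentence length in words, type-token ratio]."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        return [0.0, 0.0]
    return [len(words) / len(sentences), len(set(words)) / len(words)]

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit a linear SVM (weights w, bias b) by subgradient descent on the
    L2-regularized hinge loss. Labels in y must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wj * (1 - lr * lam) for wj in w]
            # Hinge-loss step: only examples inside the margin move w and b.
            if margin < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

In use, each annual report's text would be passed through `style_features` (or a richer extractor), the resulting vectors and their fraud or non-fraud labels (+1 or -1) would be given to `train_linear_svm`, and `predict` would then be applied to the feature vector of an unseen report. In practice the features would usually be scaled first, and an established SVM library would replace the hand-written solver.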