Mining big annual statement datasets to predict highly lucrative companies using classification trees and forests

Master's Thesis from the year 2014 in the subject Economics - Statistics and Methods, grade: 1,0, University of Duisburg-Essen (Wirtschaftswissenschaften), course: Master's Thesis, language: English, abstract: This thesis predicts whether a given firm will grow extraordinarily in the next year and perhaps even become a big company in the medium term. This is crucial information for private investors and fund managers who need to decide whether to invest in a certain firm. Companies like Apple and Amazon have shown that people who recognized the potential of such companies and bought their shares earned a lot of money. The prediction models described in this thesis can also be used by politicians to identify companies that are eligible for funding. Because growing companies often hire many employees, it might be worthwhile to support their development with selective subsidies in order to reduce unemployment. Furthermore, the models make it possible to question the prediction of a financial analyst who has come to a different conclusion than a model. Since annual reports are often publicly available for free, it is reasonable to take advantage of them for such a prediction. Additionally, various information providers maintain huge databases of annual reports. A big data approach promises to further improve the accuracy of predictions. This thesis introduces methods that make it possible to generate knowledge from these huge data sources in order to identify extraordinarily lucrative firms. To generate these prediction models, a data mining approach is used that is based on the established CRISP-DM process model for data mining. CRISP-DM ensures comparability and the consideration of best practices. The prediction models are based on classification trees and forests because they have some very substantial advantages over other methods, such as neural networks, that are frequently used in the literature.
For instance, the underlying algorithms do not require a particular distributional assumption, accept both quantitative and qualitative inputs, and are not sensitive to outliers. The two most important advantages, however, are interpretability and the handling of missing data. A tree can be easily interpreted by its users, which matters to the stakeholders described above because it is hard to trust the results of a model that one does not understand; a lack of understanding might therefore impede the practical adoption of such a model. In addition, the algorithms used can handle missing data, which occur very often in the available dataset. In other analyses, such data entries would have been removed even if only a single value was missing.
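The robustness to outliers and the handling of missing values can be illustrated with a minimal sketch of a single tree split. It is not the thesis's implementation. The data, the feature name, and the helper functions `gini` and `best_split` are hypothetical. Routing missing values to the child node that yields the lower Gini impurity is only one of several possible strategies and is assumed here purely for illustration.

```python
from typing import List, Optional, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(
    values: List[Optional[float]], labels: List[int]
) -> Optional[Tuple[float, bool, float]]:
    """Find the best threshold on one numeric feature.

    Returns (threshold, missing_goes_left, weighted_impurity).
    Rows with a missing value (None) are kept and sent to whichever
    side gives the lower impurity (an illustrative assumption).
    """
    known = sorted({v for v in values if v is not None})
    best = None
    # Candidate thresholds are midpoints between neighbouring known values,
    # so only the ordering of the values matters, not their magnitude.
    for lo, hi in zip(known, known[1:]):
        threshold = (lo + hi) / 2
        for missing_left in (True, False):
            left, right = [], []
            for v, y in zip(values, labels):
                goes_left = missing_left if v is None else v <= threshold
                (left if goes_left else right).append(y)
            n = len(left) + len(right)
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (threshold, missing_left, score)
    return best


# Hypothetical data: revenue growth in percent (None = not reported) and
# a label that is 1 if the firm grew extraordinarily in the following year.
growth = [2.0, 3.5, 4.0, None, 12.0, 15.0, None, 900.0]
labels = [0, 0, 0, 1, 1, 1, 1, 1]

split = best_split(growth, labels)  # (8.0, False, 0.0)

# Replacing the extreme value 900.0 by 20.0 leaves the split unchanged.
growth_no_outlier = [2.0, 3.5, 4.0, None, 12.0, 15.0, None, 20.0]
split_no_outlier = best_split(growth_no_outlier, labels)  # (8.0, False, 0.0)

# Listwise deletion would keep only 6 of the 8 rows.
complete_rows = sum(v is not None for v in growth)
```

In this toy example all eight firms contribute to the split, whereas removing incomplete entries would discard two of them. The threshold is the same with and without the extreme value because only the ordering of the observations enters the search.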