Multi-relational learning for full text document classification
UNIVERSAL IDENTIFIER: http://hdl.handle.net/11093/3740
DOCUMENT TYPE: doctoralThesis
ABSTRACT
Text mining is an area of Artificial Intelligence that has grown in importance in recent years due to the large amount of textual data that must be handled during decision-making processes and in the development of strategies in various sectors of society. Document classification is one of the most important techniques in many applications, such as web page classification, sentiment
analysis of social network users, spam email detection, information sharing and recommendation systems, among others. More specifically, in the field of medicine, the number of documents handled
by medical literature repositories such as the National Center for Biotechnology Information (NCBI) or MEDLINE has grown exponentially in recent years. This has led to the need to develop more
efficient computational techniques and methods for searching and classifying documents to extract relevant knowledge that drives new findings in scientific research. The approaches undertaken to
extract information from scientific literature databases usually depend exclusively on the titles and abstracts of scientific articles for the classification of documents.
However, users who search for full texts are more likely to find relevant articles than those who only search in titles and abstracts. This finding emphasizes the relevance of full text collections for
text retrieval and serves as a foundation for research on algorithms that take advantage of the rapid growth of digital archives. On the other hand, the use of full text documents generates a
higher number of terms, which must be analyzed to verify whether they actually improve the classification process. This raises the need to discard the terms that do not help to discriminate the class that a
document belongs to, and thereby to reduce the huge number of terms passed to the classifier. This step is called preprocessing, and several techniques may be applied to it.
The best preprocessing techniques must be selected to reduce the overwhelming number of terms introduced by the full text and by semantic enrichment (another technique typically used to enrich the
data set with domain specific content) without degrading the accuracy rate. When working with full text, the terms from the different sections (all the MEDLINE scientific
documents have the same section structure: Title, Abstract, Introduction, Methods and materials, Results and Conclusions) are available. Knowing the impact of each document section and which
combination of sections produces better classifications is also part of this research work. Furthermore, it is essential to determine whether handling full text documents is better than simply searching in the Title
and Abstract, and whether it is better than searching a specific combination of document sections. This leads to another interesting research
topic related to the representation of features (terms) for text mining when using the full text, which is discussed in this thesis. To this end, full text documents extracted from the MEDLINE corpus
were divided into their sections, so that it was possible to study the individual impact of each section on the classification process, as well as the impact of combining
the different sections with different weights. This division also makes it possible to apply a different learning algorithm to each individual section, and to study whether
a multi-relational approach could contribute positively to the classification process, specifically to model building.
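
As an illustration of combining sections with different weights, the following minimal Python sketch builds a section-weighted term vector for a single document. It is not the method used in the thesis: the weight values, the stop-word list, the section keys and the function names (tokenize, weighted_term_vector) are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical section weights; the abstract does not state the values used.
SECTION_WEIGHTS = {
    "title": 3.0,
    "abstract": 2.0,
    "introduction": 1.0,
    "methods": 0.5,
    "results": 1.0,
    "conclusions": 1.5,
}

# Tiny illustrative stop-word list (an assumption, not the thesis's list).
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is", "for"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def weighted_term_vector(doc, weights=SECTION_WEIGHTS):
    """Sum term counts over sections, scaling each count by its section weight.

    `doc` maps a section name to its text. Sections with no weight
    (or a weight of zero) are ignored.
    """
    vector = Counter()
    for section, text in doc.items():
        w = weights.get(section, 0.0)
        if w == 0.0:
            continue
        for term, count in Counter(tokenize(text)).items():
            vector[term] += w * count
    return dict(vector)


# Toy document split into sections (invented text, for illustration only).
doc = {
    "title": "Gene expression in cancer",
    "abstract": "We study gene expression.",
    "methods": "Expression was measured in samples.",
}
vec = weighted_term_vector(doc)
```

Passing a different weights dictionary, for example one that only contains "title" and "abstract", restricts the representation to that combination of sections, so the same function can be used to compare section combinations against the full text.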