DSpace Repository

Improving Medical Document Classification via Feature Engineering

dc.contributor.advisor Gao, Xiaoying
dc.contributor.advisor Mei, Yi
dc.contributor.author Abdollahi, Mahdi
dc.date.accessioned 2021-12-14T02:03:12Z
dc.date.available 2021-12-14T02:03:12Z
dc.date.copyright 2021 en_NZ
dc.date.issued 2021 en_NZ
dc.identifier.uri https://ir.wgtn.ac.nz/handle/123456789/17857
dc.identifier.uri https://openaccess.wgtn.ac.nz/articles/thesis/Improving_Medical_Document_Classification_via_Feature_Engineering/25583424
dc.identifier.uri https://doi.org/10.26686/wgtn.25583424
dc.description.abstract Document classification (DC) is the task of assigning predefined labels to unseen documents by using a model trained on the available labeled documents. DC has recently attracted much attention in the medical field because many problems can be formulated as classification problems. For example, categorizing clinical risk factors, automatic disease classification, and electronic health record classification are all applications of text classification. DC is critical for medical document management and analysis. Medical DC can assist doctors in decision making, and correct decisions can reduce medical expenses. Medical documents have special attributes that distinguish them from other texts and make them difficult to analyze. For example, the many acronyms, abbreviations, and short expressions make it more challenging to extract knowledge. The current classification performance on medical documents is not satisfactory. Furthermore, the available data is limited because of patient privacy. This thesis aims to enhance the input feature sets of medical DC methods to improve their classification performance. Additionally, it develops new data augmentation methods to deal with the shortage of data. To achieve these goals, this work develops new feature manipulation methods (such as feature extraction, feature selection, and feature construction) in supervised learning systems to extract new meaningful feature sets. Moreover, it develops ontology- and dictionary-based data augmentation approaches to create new synthetic documents. This thesis uses Evolutionary Computation (EC) techniques such as Particle Swarm Optimisation (PSO) and deep learning methods such as Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Hierarchical Attention Network (HAN) to achieve its objectives.
The main goal of this thesis is to develop new feature engineering approaches to medical document classification that use domain-specific knowledge of the problem to automatically extract prominent features, construct new high-level features, select informative features, and generate new synthetic documents from the original documents. These methods can improve medical document classification performance by enriching the quality of the input data. This thesis develops three feature engineering approaches, namely domain-specific feature extraction and two-stage and three-stage PSO-based methods, to automatically extract, construct, and select new high-level features for classification. The results demonstrate that the two-stage and three-stage approaches outperformed the compared related works. This thesis proposes two novel ontology-based data augmentation approaches that create new synthetic documents from the original training data sets for medical document classification. These approaches employ a domain-specific ontology and a general dictionary to double or triple the size of the training data set and improve the performance of medical document classification. The results show that these approaches successfully improved medical document classification performance. This thesis also develops two dictionary-based data oversampling approaches that create new synthetic documents from the original training data sets for medical document classification problems. The proposed approaches can generate synthetic documents with higher variety than similar methods. They also balance imbalanced data sets, and the results show better classification performance. en_NZ
dc.language.iso en_NZ
dc.publisher Te Herenga Waka—Victoria University of Wellington en_NZ
dc.relation.uri https://www.wgtn.ac.nz/library/about-us/policies-and-strategies/copyright-for-the-researcharchive
dc.subject Natural Language Processing en_NZ
dc.subject Medical Document Classification en_NZ
dc.subject Machine Learning en_NZ
dc.title Improving Medical Document Classification via Feature Engineering en_NZ
dc.type Text en_NZ
vuwschema.contributor.unit School of Engineering and Computer Science en_NZ
vuwschema.type.vuw Awarded Doctoral Thesis en_NZ
thesis.degree.discipline Computer Science en_NZ
thesis.degree.grantor Te Herenga Waka—Victoria University of Wellington en_NZ
thesis.degree.level Doctoral en_NZ
thesis.degree.name Doctor of Philosophy en_NZ
dc.subject.course COMP690 en_NZ
vuwschema.subject.anzsrcforV2 461199 Machine learning not elsewhere classified en_NZ

