Improving Medical Document Classification via Feature Engineering

Abdollahi, Mahdi

dc.contributor.advisor	Gao, Xiaoying
dc.contributor.advisor	Mei, Yi
dc.contributor.author	Abdollahi, Mahdi
dc.date.accessioned	2021-12-14T02:03:12Z
dc.date.available	2021-12-14T02:03:12Z
dc.date.copyright	2021	en_NZ
dc.date.issued	2021	en_NZ
dc.identifier.uri	https://ir.wgtn.ac.nz/handle/123456789/17857
dc.identifier.uri	https://openaccess.wgtn.ac.nz/articles/thesis/Improving_Medical_Document_Classification_via_Feature_Engineering/25583424
dc.identifier.uri	https://doi.org/10.26686/wgtn.25583424
dc.description.abstract	Document classification (DC) is the task of assigning predefined labels to unseen documents by using a model trained on the available labeled documents. DC has recently attracted much attention in the medical field because many issues can be formulated as classification problems. For example, categorizing clinical risk factors, automatic disease classification, and electronic health record classification are all applications of text classification. DC is critical for medical document management and analysis. Medical DC can assist doctors in decision making, and correct decisions can reduce medical expenses. Medical documents have special attributes that distinguish them from other texts and make them difficult to analyze. For example, their many acronyms, abbreviations, and short expressions make it more challenging to extract knowledge. The current classification performance on medical documents is not satisfactory. Furthermore, the available data is limited because of patients’ privacy. This thesis aims to enhance the input feature sets of medical DC methods to improve their classification performance. Additionally, it develops new data augmentation methods to deal with the shortage of data. To achieve these goals, this work develops new feature manipulation methods (such as feature extraction, feature selection, and feature construction) in supervised learning systems to extract new meaningful feature sets. Moreover, it develops ontology-based and dictionary-based data augmentation approaches to create new synthetic documents. This thesis utilizes Evolutionary Computation (EC) techniques such as Particle Swarm Optimisation (PSO), and deep learning methods such as Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Hierarchical Attention Network (HAN), to achieve its objectives.
The main goal of this thesis is to develop new feature engineering approaches to medical document classification that use domain-specific knowledge of the problem to automatically extract prominent features, construct new high-level features, select informative features, and generate new synthetic documents from the original documents. These methods can improve medical document classification performance by enriching the quality of the input data. This thesis develops three feature engineering approaches, namely domain-specific feature extraction and two-stage and three-stage PSO-based methods, to automatically extract, construct, and select new high-level features for classification. The results demonstrate that the two-stage and three-stage approaches outperform the related work they were compared against. This thesis proposes two novel ontology-based data augmentation approaches that generate new synthetic documents from the original training data sets for medical document classification. These approaches employ a domain-specific ontology and a general dictionary to double or triple the size of the training data set. The results show that these approaches successfully improved medical document classification performance. This thesis also develops two dictionary-based data oversampling approaches that generate new synthetic documents from the original training data sets for medical document classification problems. The proposed approaches generate synthetic documents with greater variety than similar methods. They balance an imbalanced data set, and the results show improved classification performance.	en_NZ
dc.language.iso	en_NZ
dc.publisher	Te Herenga Waka—Victoria University of Wellington	en_NZ
dc.relation.uri	https://www.wgtn.ac.nz/library/about-us/policies-and-strategies/copyright-for-the-researcharchive
dc.subject	Natural Language Processing	en_NZ
dc.subject	Medical Document Classification	en_NZ
dc.subject	Machine Learning	en_NZ
dc.title	Improving Medical Document Classification via Feature Engineering	en_NZ
dc.type	Text	en_NZ
vuwschema.contributor.unit	School of Engineering and Computer Science	en_NZ
vuwschema.type.vuw	Awarded Doctoral Thesis	en_NZ
thesis.degree.discipline	Computer Science	en_NZ
thesis.degree.grantor	Te Herenga Waka—Victoria University of Wellington	en_NZ
thesis.degree.level	Doctoral	en_NZ
thesis.degree.name	Doctor of Philosophy	en_NZ
dc.subject.course	COMP690	en_NZ
vuwschema.subject.anzsrcforV2	461199 Machine learning not elsewhere classified	en_NZ