DSpace Repository

Genetic Programming for Binary Classification with High-dimensional Unbalanced Data

Show simple item record

dc.rights.license Author Retains Copyright en_NZ
dc.contributor.advisor Xue , Bing
dc.contributor.advisor Zhang, Mengjie
dc.contributor.author Pei, Wenbin
dc.date.accessioned 2021-10-22T22:19:58Z
dc.date.available 2021-10-23
dc.date.available 2021-10-22T22:19:58Z
dc.date.copyright 2021-10-23
dc.date.issued 2021-10-23
dc.identifier.uri https://ir.wgtn.ac.nz/handle/123456789/11279
dc.identifier.uri https://api.figshare.com/v2/account/articles/16862482
dc.identifier.uri https://doi.org/10.26686/wgtn.16862482
dc.description.abstract Class imbalance and high dimensionality have been acknowledged as two tough issues in classification. Learning from unbalanced data, the constructed classifiers are often biased towards the majority class, and thereby perform poorly on the minority class. Unfortunately, the minority class is often the class of interest in many real-world applications, such as medical diagnosis and fault detection. High dimensionality often makes it more difficult to handle the class imbalance issue. To date, most existing works attempt to address one single issue, without consideration of solving the other. These works could not be effectively applied to some challenging classification tasks that suffer from both of the two issues. Genetic programming (GP) is one of the most popular techniques from evolutionary computation, which has been widely applied to classification tasks. The built-in feature selection ability of GP makes it very powerful for use in classification with high-dimensional data. However, if the class imbalance issue is not well addressed, the constructed GP classifiers are often biased towards the majority class. Accordingly, this thesis aims to address the joint effects of class imbalance and high dimensionality by developing new GP based classification approaches, with the goal of improving classification performance. To effectively and efficiently address the performance bias issue of GP, this thesis develops a fitness function that considers two criteria, namely the approximation of area under the curve (AUC) and classification clarity (i.e. how well a program can separate the two classes). To further improve the efficiency, a new program reuse mechanism is designed to reuse previous effective GP individuals. According to experimental results, GP with the new fitness function and the program reuse mechanism achieves good performance and significantly saves training time. However, this method treats the two criteria equally, which is not always reasonable. To avoid manually weighing the two criteria in the fitness evaluation process, we propose a novel two-criterion fitness evaluation method, where the obtained values on the two criteria are combined in pairs, instead of summing them together. Then, a three-criterion tournament selection is designed to effectively identify and select good programs to be used by genetic operators for generating better offspring during the evolutionary learning process. Experimental results show that the proposed GP method achieves better classification performance than compared methods. Cost-sensitive learning is a popular approach to addressing the problem of class imbalance for many classification algorithms in machine learning. However, cost-sensitive algorithms are dependent on cost matrices that are usually designed manually. Unfortunately, it is often not easy for humans, even experts, to accurately specify misclassification costs for different mistakes due to the lack or incompleteness of domain knowledge related to actual situations in many complex tasks. As a result, these cost-sensitive algorithms cannot be directly applied. This thesis develops new GP based approaches to developing cost-sensitive classifiers without requiring cost matrices from humans. The newly developed cost-sensitive GP methods are able to construct classifiers and learn cost values or intervals automatically and simultaneously. The experimental results show that the new cost-sensitive GP methods outperform compared methods for high-dimensional unbalanced classification in almost all comparisons. Cost-sensitive GP classifiers treat the minority class as being more important than the majority class, but this may cause an accuracy decrease in the overlapping areas where the prior probabilities of the two classes are about the same. In the thesis, we propose a neighborhood method to detect overlapping areas, and then use GP to develop cost-sensitive classifiers that employ different classification strategies to classify instances from the overlapping areas or the non-overlapping areas. en_NZ
dc.language.iso en_NZ
dc.publisher Te Herenga Waka—Victoria University of Wellington
dc.rights Author Retains Copyright en_NZ
dc.rights.uri https://www.wgtn.ac.nz/library/about-us/policies-and-strategies/copyright-for-the-researcharchive
dc.subject Genetic Programming en_NZ
dc.subject Classification en_NZ
dc.subject Class imbalance en_NZ
dc.subject High dimensionality en_NZ
dc.title Genetic Programming for Binary Classification with High-dimensional Unbalanced Data en_NZ
dc.type Text en_NZ
dc.date.updated 2021-10-22T22:19:58Z
vuwschema.contributor.unit School of Engineering and Computer Science en_NZ
vuwschema.subject.anzsrcfor 080108 Neural, Evolutionary and Fuzzy Computation en_NZ
vuwschema.subject.anzsrctoa 1 PURE BASIC RESEARCH en_NZ
vuwschema.type.vuw Awarded Doctoral Thesis en_NZ
thesis.degree.discipline Computer Science en_NZ
thesis.degree.grantor Te Herenga Waka—Victoria University of Wellington en_NZ
thesis.degree.level Doctoral en_NZ
thesis.degree.name Doctor of Philosophy en_NZ
dc.subject.course COMP 690 en_NZ
vuwschema.subject.anzsrcforV2 460203 Evolutionary computation en_NZ

Files in this item

This item appears in the following Collection(s)

Show simple item record

Search DSpace


My Account