Genetic Programming for Binary Classification with High-dimensional Unbalanced Data

Pei, Wenbin

dc.rights.license	Author Retains Copyright	en_NZ
dc.contributor.advisor	Xue , Bing
dc.contributor.advisor	Zhang, Mengjie
dc.contributor.author	Pei, Wenbin
dc.date.accessioned	2021-10-22T22:19:58Z
dc.date.available	2021-10-23
dc.date.available	2021-10-22T22:19:58Z
dc.date.copyright	2021-10-23
dc.date.issued	2021-10-23
dc.identifier.uri	https://ir.wgtn.ac.nz/handle/123456789/11279
dc.identifier.uri	https://api.figshare.com/v2/account/articles/16862482
dc.identifier.uri	https://doi.org/10.26686/wgtn.16862482
dc.description.abstract	Class imbalance and high dimensionality have been acknowledged as two tough issues in classification. Learning from unbalanced data, the constructed classifiers are often biased towards the majority class, and thereby perform poorly on the minority class. Unfortunately, the minority class is often the class of interest in many real-world applications, such as medical diagnosis and fault detection. High dimensionality often makes it more difficult to handle the class imbalance issue. To date, most existing works attempt to address one single issue, without consideration of solving the other. These works could not be effectively applied to some challenging classification tasks that suffer from both of the two issues. Genetic programming (GP) is one of the most popular techniques from evolutionary computation, which has been widely applied to classification tasks. The built-in feature selection ability of GP makes it very powerful for use in classification with high-dimensional data. However, if the class imbalance issue is not well addressed, the constructed GP classifiers are often biased towards the majority class. Accordingly, this thesis aims to address the joint effects of class imbalance and high dimensionality by developing new GP based classification approaches, with the goal of improving classification performance. To effectively and efficiently address the performance bias issue of GP, this thesis develops a fitness function that considers two criteria, namely the approximation of area under the curve (AUC) and classification clarity (i.e. how well a program can separate the two classes). To further improve the efficiency, a new program reuse mechanism is designed to reuse previous effective GP individuals. According to experimental results, GP with the new fitness function and the program reuse mechanism achieves good performance and significantly saves training time. However, this method treats the two criteria equally, which is not always reasonable. To avoid manually weighing the two criteria in the fitness evaluation process, we propose a novel two-criterion fitness evaluation method, where the obtained values on the two criteria are combined in pairs, instead of summing them together. Then, a three-criterion tournament selection is designed to effectively identify and select good programs to be used by genetic operators for generating better offspring during the evolutionary learning process. Experimental results show that the proposed GP method achieves better classification performance than compared methods. Cost-sensitive learning is a popular approach to addressing the problem of class imbalance for many classification algorithms in machine learning. However, cost-sensitive algorithms are dependent on cost matrices that are usually designed manually. Unfortunately, it is often not easy for humans, even experts, to accurately specify misclassification costs for different mistakes due to the lack or incompleteness of domain knowledge related to actual situations in many complex tasks. As a result, these cost-sensitive algorithms cannot be directly applied. This thesis develops new GP based approaches to developing cost-sensitive classifiers without requiring cost matrices from humans. The newly developed cost-sensitive GP methods are able to construct classifiers and learn cost values or intervals automatically and simultaneously. The experimental results show that the new cost-sensitive GP methods outperform compared methods for high-dimensional unbalanced classification in almost all comparisons. Cost-sensitive GP classifiers treat the minority class as being more important than the majority class, but this may cause an accuracy decrease in the overlapping areas where the prior probabilities of the two classes are about the same. In the thesis, we propose a neighborhood method to detect overlapping areas, and then use GP to develop cost-sensitive classifiers that employ different classification strategies to classify instances from the overlapping areas or the non-overlapping areas.	en_NZ
dc.language.iso	en_NZ
dc.publisher	Te Herenga Waka—Victoria University of Wellington
dc.rights	Author Retains Copyright	en_NZ
dc.rights.uri	https://www.wgtn.ac.nz/library/about-us/policies-and-strategies/copyright-for-the-researcharchive
dc.subject	Genetic Programming	en_NZ
dc.subject	Classification	en_NZ
dc.subject	Class imbalance	en_NZ
dc.subject	High dimensionality	en_NZ
dc.title	Genetic Programming for Binary Classification with High-dimensional Unbalanced Data	en_NZ
dc.type	Text	en_NZ
dc.date.updated	2021-10-22T22:19:58Z
vuwschema.contributor.unit	School of Engineering and Computer Science	en_NZ
vuwschema.subject.anzsrcfor	080108 Neural, Evolutionary and Fuzzy Computation	en_NZ
vuwschema.subject.anzsrctoa	1 PURE BASIC RESEARCH	en_NZ
vuwschema.type.vuw	Awarded Doctoral Thesis	en_NZ
thesis.degree.discipline	Computer Science	en_NZ
thesis.degree.grantor	Te Herenga Waka—Victoria University of Wellington	en_NZ
thesis.degree.level	Doctoral	en_NZ
thesis.degree.name	Doctor of Philosophy	en_NZ
dc.subject.course	COMP 690	en_NZ
vuwschema.subject.anzsrcforV2	460203 Evolutionary computation	en_NZ