Scalable and sparse class probability estimation with weighted support vector machines
Abstract: Classification problems have broad applications in many scientific areas such as biology, engineering, finance, and medicine. Support vector machines (SVMs) are popular classification tools due to their flexibility, accuracy, and fast computation for high-dimensional datasets. However, standard SVMs cannot estimate the probability of a data point belonging to a given class and hence fail to provide a measure of uncertainty for label prediction. This drawback can be tackled by weighted SVMs (wSVMs) (Wang, Shen and Liu, 2008; Wu, Zhang and Liu, 2010; Wang, Zhang and Wu, 2019), which estimate class probabilities by aggregating multiple classification rules trained on weighted samples. Despite their promising performance, existing wSVMs have limitations in their theory, computation, and applications. Specifically, they do not perform variable selection, so both their classification accuracy and interpretability may suffer when too many noise features are present in the data. Also, the multiclass wSVM algorithm proposed by Wang, Zhang and Wu (2019) has polynomial time complexity in K (the number of classes), which is suboptimal. To address these issues, I develop new methods that improve existing wSVMs from three different perspectives. The first project develops more efficient and scalable algorithms that reduce the complexity of K-class wSVMs from polynomial time to linear time in K. The second project develops a new robust classification method, called the SIBOW-SVM, for brain MRI image analysis and shows that it can outperform state-of-the-art convolutional neural networks (CNNs). The third project proposes a set of new frameworks that achieve sparse learning in binary wSVMs and perform simultaneous variable selection, classification, and probability estimation in high-dimensional data. For each project, I present new statistical methods and algorithms, together with extensive numerical studies and real-data examples to demonstrate finite-sample performance.
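The probability-estimation idea behind wSVMs can be illustrated with a minimal sketch: train a family of SVMs under a grid of class weights π, and for each point read off the weight at which the weighted classifier's prediction flips, since the weighted hinge minimizer predicts class +1 exactly when P(Y = +1 | x) > π. The sketch below uses scikit-learn's `SVC` with `sample_weight`; the toy data, the weight grid, and the helper `prob_pos` are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy binary data: P(Y = +1 | x) follows a logistic curve in the first feature.
n = 200
X = rng.normal(size=(n, 2))
p_true = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))
y = np.where(rng.uniform(size=n) < p_true, 1, -1)

# Train one weighted SVM per weight pi: class +1 gets weight (1 - pi),
# class -1 gets weight pi, so the classifier targets sign(p(x) - pi).
pis = np.linspace(0.05, 0.95, 19)
clfs = []
for pi in pis:
    w = np.where(y == 1, 1.0 - pi, pi)
    clfs.append(SVC(kernel="rbf", C=1.0).fit(X, y, sample_weight=w))

def prob_pos(x):
    """Estimate P(Y = +1 | x) by locating where predictions flip from +1 to -1."""
    x = np.atleast_2d(x)
    preds = np.array([clf.predict(x)[0] for clf in clfs])
    pos = np.where(preds == 1)[0]
    if len(pos) == 0:          # predicted -1 at every weight
        return pis[0] / 2.0
    k = pos[-1]                # largest pi still predicting +1
    hi = pis[k + 1] if k + 1 < len(pis) else 1.0
    return (pis[k] + hi) / 2.0
```

A multiclass extension (as in Wang, Zhang and Wu, 2019) aggregates such pairwise or weighted rules across classes; the linear-time algorithms summarized above reduce how many of these weighted problems must be solved as K grows.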