Deployable AI: Solutions to Label Sparsity and Classification Model Selection

Date: 18th Mar 2024

Time: 10:30 AM

Venue: MR - I (SSB 233, First Floor)

Details

Deploying artificial intelligence (AI) for practical problem-solving comes with challenges well-recognized by global academic and industrial research communities. While the new generation of AI has generated excitement, laboratory successes have only been translated to applications in narrow domains. These domains typically have modest expectations for reliability from AI systems, small costs of failure, and strong incentives for users to make AI systems succeed. Broadening the deployment of AI is not simply a matter of translational research and engineering. Several fundamental research questions involving algorithmic, systemic, and societal aspects must be answered. Relevant research areas include learning from noisy and incomplete data, incorporating domain knowledge in AI systems, explainable deep learning for medical imaging and attention models, resource-efficient implementations of deep learning, robustness to adversarial attacks on deep networks, social perception of AI in healthcare applications, and enforcing fairness in AI systems.

In this thesis, we provide an overview of the challenges in Deployable AI, grouped into five broad sub-categories: Societal-centric, Organization-centric, Privacy & Trust-centric, System-centric, and Data-centric challenges. We focus solely on the data-centric challenges, which stem from nuances of the data distribution. The data-centric category covers several fundamental issues, including class imbalance, label scarcity, data drift, classification complexity, model selection, feature engineering, and model parameter tuning.

We scope our research to a subset of the data-centric challenges, namely label sparsity and model selection. We address label sparsity in two forms: (a) class imbalance, through directed sampling and boosting, and (b) label scarcity, through semi-supervised classification trees. We also develop a method for predicting the empirical classification complexity of a dataset and extend it to an automatic model selection method that maps dataset characteristics to empirical classification model fitness.

As the first problem, we address the challenge of classifying imbalanced binary datasets. The problem is of broad research significance, as data from most real-world applications follow a non-uniform class distribution. Directed data sampling and data-level cost-sensitive methods use data point importance to sample from the dataset so that the essential data points are retained and possibly oversampled. We propose a novel topic-modeling-based weighting framework that computes the importance of the data points in an imbalanced dataset from the topic-posterior probabilities estimated through topic modeling. We propose TODUS, a topics-oriented directed undersampling algorithm that follows the estimated data distribution when drawing samples from the dataset, thereby minimizing the loss of important information that random undersampling incurs. We also propose TOMBoost, a topic-modeled boosting scheme built on the same weighting framework and tuned particularly for learning under class imbalance. Our empirical study spanning 40 datasets shows that TOMBoost outperforms other boosting and sampling methods on at least 37 of the 40 datasets on average. We also show empirically that TOMBoost minimizes the model bias faster than other popular boosting methods.
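
As an illustration, the following Python sketch shows one way topic-posterior weighting and directed undersampling could be realized in the spirit of TODUS; the exact weighting and sampling scheme in the thesis may differ. It assumes non-negative, count-like features (as latent Dirichlet allocation requires), and the importance measure, the function name directed_undersample, and the parameter n_topics are illustrative assumptions.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def directed_undersample(X, y, majority_label, n_topics=10, seed=0):
    # Estimate per-point topic posteriors P(topic | x) via LDA.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    theta = lda.fit_transform(X)                 # shape: (n_samples, n_topics)

    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)

    # Illustrative importance weight: posterior mass on the dominant topic,
    # so points strongly tied to one topic count as more informative.
    importance = theta[maj].max(axis=1)
    p = importance / importance.sum()

    # Draw majority points without replacement, following the estimated
    # distribution rather than sampling uniformly, to match the minority size.
    keep = rng.choice(maj, size=len(mino), replace=False, p=p)
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]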

As the second problem, we address the label sparsity challenge with a novel semi-supervised classification tree algorithm. Data naturally arrives unlabeled, as labels are task-specific; label sparsity refers to the scarce availability of labeled data. Classification trees are simple, explainable models for data classification, grown by repeatedly partitioning the dataset according to a split criterion. In a classification tree learning task, when the class ratio of the unlabeled part of the dataset is made available, the unlabeled data can be used alongside the labeled data to train the tree in a semi-supervised style. We are thus motivated to use the abundantly available unlabeled data to facilitate building classification trees. We propose a semi-supervised approach to growing classification trees, in which we apply maximum mean discrepancy (MMD) to estimate the class ratio at every node split. Our experiments on several binary and multiclass classification datasets show that our semi-supervised classification tree is statistically better than traditional decision tree algorithms on 31 of 40 datasets. Moreover, our method performs well even on datasets with mild-to-moderate class imbalance.
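
To make the idea concrete, here is a minimal Python sketch of estimating the positive-class ratio of an unlabeled sample by minimizing the MMD between the unlabeled data and a mixture of the labeled class-conditional samples, the kind of estimate applied at every node split; the RBF kernel, the closed-form minimizer, and the function name mmd_class_ratio are illustrative choices, not the thesis's exact formulation.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd_class_ratio(X_pos, X_neg, X_unlab, gamma=1.0):
    # Mean kernel values within and between the three samples; these are the
    # inner products of the empirical kernel mean embeddings.
    k = lambda A, B: rbf_kernel(A, B, gamma=gamma).mean()
    Kpp, Knn, Kpn = k(X_pos, X_pos), k(X_neg, X_neg), k(X_pos, X_neg)
    Kpu, Knu = k(X_pos, X_unlab), k(X_neg, X_unlab)

    # MMD^2(alpha) = ||alpha*mu_pos + (1-alpha)*mu_neg - mu_unlab||^2
    #             = a*alpha^2 + 2*b*alpha + const, a quadratic in alpha.
    a = Kpp + Knn - 2 * Kpn            # quadratic coefficient, >= 0
    b = Kpn - Knn - Kpu + Knu          # half the linear coefficient
    alpha = 0.5 if a <= 0 else float(np.clip(-b / a, 0.0, 1.0))
    return alpha                       # estimated positive-class ratio in X_unlab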

As the third problem, we address the challenges of estimating a dataset's classification complexity and selecting a model class that yields the best empirical model fitness for a given dataset. Traditionally, model selection is based on cross-validation, meta-learning, and user preferences, which are often time-consuming and resource-intensive. Clustering indices measure the ability of a clustering algorithm to induce good-quality neighborhoods with similar data characteristics. We propose a prediction system that estimates the empirical classification complexity of a dataset for a given set of model classes by learning a discriminant function that associates the data characteristics with the classification complexity. We also propose a novel method for automated classification model selection from a set of candidate model classes that determines the empirical model fitness for a dataset based only on its clustering indices. For each candidate model class, we pose a regression task that maps the dataset's clustering indices to the expected classification performance. At test time, we compute the dataset's clustering indices, predict the expected classification performance with each learned regressor, and recommend a suitable model class for classifying the dataset.
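
The following Python sketch illustrates the flavor of such clustering-indices-based model selection: featurize a dataset with a few standard clustering indices, then rank candidate model classes by the performance predicted by per-class regressors trained offline on a corpus of datasets. The particular indices (silhouette, Calinski-Harabasz, Davies-Bouldin), the use of k-means, and the function names are illustrative assumptions, not the thesis's exact pipeline.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

def clustering_index_features(X, n_clusters=2, seed=0):
    # Induce a clustering, then summarize the dataset by how "clusterable"
    # it is under a few standard indices.
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(X)
    return np.array([silhouette_score(X, labels),
                     calinski_harabasz_score(X, labels),
                     davies_bouldin_score(X, labels)])

def recommend_models(X, regressors, top_k=3):
    # regressors: dict mapping model-class name -> fitted regressor that maps
    # clustering-index features to expected classification performance.
    z = clustering_index_features(X).reshape(1, -1)
    scores = {name: reg.predict(z)[0] for name, reg in regressors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]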

We evaluate our classification complexity prediction system and the model selection method through cross-validation on 60 publicly available binary class datasets. Our empirical study confirms that the top-3 model recommendation is accurate for over 75% of the datasets, and the classification complexity prediction is accurate for over 90% of the datasets. We also propose an end-to-end Automated ML system for data classification based on our model selection method. Evaluating our end-to-end system against popular commercial and non-commercial Automated ML systems on a different collection of 25 public domain binary class datasets shows that the proposed system outperforms the others with an excellent average model recommendation rank of 1.68.

Speakers

Mr. Sudarsun Santhiappan, Roll No: CS13D030

Computer Science and Engineering