Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks: conditioned on a question and answer key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models’ ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using large language models, and evaluate them against benchmark prediction tasks. Specifically, the package derives natural language tasks from US Census data products, inspired by popular tabular data benchmarks. A flexible API allows any task to be constructed out of 28 census features whose values are mapped to prompt-completion pairs. We demonstrate the utility of folktexts through a sweep of empirical insights on 16 recent large language models, inspecting risk scores, calibration curves, and diverse evaluation metrics. We find that zero-shot risk scores have high predictive signal while being widely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and generate over-confident risk scores.
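As an illustration of what mapping tabular features to prompt-completion pairs and reading off a risk score from answer-token probabilities can look like, here is a minimal sketch with made-up feature names and log-probabilities (not the folktexts API):

```python
# Illustrative sketch (not the folktexts API): serialize one census record into a
# natural-language question, then turn the model's log-probabilities for the valid
# answer completions into a risk score.
import math

def row_to_prompt(row: dict) -> str:
    """Serialize one census record into a natural-language question."""
    return (
        f"The person is {row['AGEP']} years old, works {row['WKHP']} hours per week, "
        f"and has education level '{row['SCHL']}'. "
        "Does this person earn more than $50,000 per year? Answer with Yes or No.\n"
        "Answer:"
    )

def risk_score(answer_logprobs: dict) -> float:
    """Renormalize the probabilities of the 'Yes'/'No' completions into a risk score."""
    p_yes = math.exp(answer_logprobs[" Yes"])
    p_no = math.exp(answer_logprobs[" No"])
    return p_yes / (p_yes + p_no)

# Example: log-probs a (hypothetical) LLM assigned to each answer token.
print(row_to_prompt({"AGEP": 42, "WKHP": 50, "SCHL": "Bachelor's degree"}))
print(risk_score({" Yes": -0.7, " No": -1.9}))  # ~0.77
```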
Unprocessing Seven Years of Algorithmic Fairness
André F Cruz and Moritz Hardt
In The Twelfth International Conference on Learning Representations (ICLR), 2024
(Oral Presentation)
Seven years ago, researchers proposed a postprocessing method to equalize the error rates of a model across different demographic groups. The work launched hundreds of papers purporting to improve over the postprocessing baseline. We empirically evaluate these claims through thousands of model evaluations on several tabular datasets. We find that the fairness-accuracy Pareto frontier achieved by postprocessing contains all other methods we were feasibly able to evaluate. In doing so, we address two common methodological errors that have confounded previous observations. One relates to the comparison of methods with different unconstrained base models. The other concerns methods achieving different levels of constraint relaxation. At the heart of our study is a simple idea we call unprocessing that roughly corresponds to the inverse of postprocessing. Unprocessing allows for a direct comparison of methods using different underlying models and levels of relaxation. Interpreting our findings, we recall a widely overlooked theoretical argument, presented seven years ago, that accurately predicted what we observe.
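For intuition, the sketch below shows the unprocessing direction in a toy form under simplifying assumptions (synthetic scores, accuracy-only threshold re-optimization); the paper's actual procedure and evaluation are more involved:

```python
# Toy sketch (not the paper's code): postprocessing picks group-specific thresholds
# on a risk score to equalize error rates; "unprocessing" re-optimizes those
# thresholds for accuracy alone, so methods built on different base models and
# relaxation levels can be compared through their underlying scores.
import numpy as np

def unprocess(scores, labels, groups, grid=np.linspace(0.0, 1.0, 101)):
    """Pick, per group, the threshold that maximizes accuracy on validation data."""
    thresholds = {}
    for g in np.unique(groups):
        mask = groups == g
        accs = [((scores[mask] >= t) == labels[mask]).mean() for t in grid]
        thresholds[g] = grid[int(np.argmax(accs))]
    return thresholds

rng = np.random.default_rng(0)
groups = rng.integers(0, 2, size=1000)
labels = rng.integers(0, 2, size=1000)
scores = np.clip(0.5 * labels + 0.3 * rng.random(1000) + 0.1 * groups, 0, 1)
print(unprocess(scores, labels, groups))
```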
2023
FairGBM: Gradient Boosting with Fairness Constraints
André F Cruz, Catarina Belém, Sérgio Jesus, and
3 more authors
In The Eleventh International Conference on Learning Representations (ICLR), 2023
Machine Learning (ML) algorithms based on gradient boosted decision trees (GBDT) are still favored on many tabular data tasks across various mission-critical applications, from healthcare to finance. However, GBDT algorithms are not free of the risk of bias and discriminatory decision-making. Despite GBDT’s popularity and the rapid pace of research in fair ML, existing in-processing fair ML methods are either inapplicable to GBDT, incur significant training-time overhead, or are inadequate for problems with high class imbalance. We present FairGBM, a learning framework for training GBDT under fairness constraints, with little to no impact on predictive performance when compared to unconstrained LightGBM. Since common fairness metrics are non-differentiable, we employ a "proxy-Lagrangian" formulation using smooth convex error rate proxies to enable gradient-based optimization. Additionally, our open-source implementation shows an order of magnitude speedup in training time when compared with related work, a pivotal aspect to foster the widespread adoption of FairGBM by real-world practitioners.
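To make the proxy-Lagrangian idea concrete, here is a toy single-step sketch (illustrative only, not FairGBM's implementation): the boosting gradients combine the cross-entropy loss with a multiplier-weighted smooth proxy of a constraint (here a false-positive-rate gap), and the multiplier follows gradient ascent on the observed violation.

```python
# Toy sketch of one proxy-Lagrangian step (not FairGBM's code).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lagrangian_step(raw_scores, y, group, lam, slack=0.02, step=0.1):
    p = sigmoid(raw_scores)
    grad = p - y                                  # gradient of binary cross-entropy
    neg = y == 0
    g1, g0 = neg & (group == 1), neg & (group == 0)
    # Smooth proxy for FPR_1 - FPR_0: mean positive probability on each group's negatives.
    proxy_gap = p[g1].mean() - p[g0].mean()
    grad[g1] += lam * p[g1] * (1 - p[g1]) / g1.sum()   # d(proxy)/d(raw score)
    grad[g0] -= lam * p[g0] * (1 - p[g0]) / g0.sum()
    lam = max(0.0, lam + step * (proxy_gap - slack))   # ascent on the multiplier
    return grad, lam

rng = np.random.default_rng(0)
y, group = rng.integers(0, 2, 200), rng.integers(0, 2, 200)
grad, lam = lagrangian_step(rng.normal(size=200), y, group, lam=0.0)
print(grad.shape, lam)
```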
2022
Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation
Sérgio Jesus, José Pombal, Duarte Alves, and
5 more authors
In Advances in Neural Information Processing Systems (NeurIPS), 2022
Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase of publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data—which is prevalent in many high-stakes domains—has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques on an anonymized, real-world bank account opening fraud detection dataset. This setting carries a set of challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance. Additionally, to allow practitioners to stress test both performance and fairness of ML methods, each dataset variant of BAF contains specific types of data bias. With this resource, we aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.
2021
Promoting Fairness through Hyperparameter Optimization
André F Cruz, Pedro Saleiro, Catarina Belém, and
2 more authors
In 2021 IEEE International Conference on Data Mining (ICDM), 2021
Considerable research effort has been directed towards algorithmic fairness, but real-world adoption of bias-reduction techniques is still scarce. Existing methods are either metric- or model-specific, require access to sensitive attributes at inference time, or carry high development or deployment costs. This work explores the unfairness that emerges when optimizing ML models solely for predictive performance, and how to mitigate it with a simple and easily deployed intervention: fairness-aware hyperparameter optimization (HO). We propose and evaluate fairness-aware variants of three popular HO algorithms: Fair Random Search, Fair TPE, and Fairband. We validate our approach on a real-world bank account opening fraud case study, as well as on three datasets from the fairness literature. Results show that, without extra training cost, it is feasible to find models with a 111% mean fairness increase and just a 6% decrease in performance when compared with fairness-blind HO.
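The following is a generic sketch of fairness-aware random search (illustrative only; the model, fairness metric, and trade-off weight are arbitrary choices, not the authors' setup): sample configurations at random and rank them by a weighted combination of accuracy and an equal-opportunity gap.

```python
# Generic sketch of fairness-aware random search (not the paper's code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def tpr_gap(y_true, y_pred, group):
    """Equal-opportunity gap: difference in true-positive rates between groups."""
    tprs = [y_pred[(group == g) & (y_true == 1)].mean() for g in np.unique(group)]
    return max(tprs) - min(tprs)

def fair_random_search(X, y, group, n_trials=20, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    Xtr, Xva, ytr, yva, gtr, gva = train_test_split(X, y, group, random_state=seed)
    best = None
    for _ in range(n_trials):
        params = {"n_estimators": int(rng.integers(50, 300)),
                  "max_depth": int(rng.integers(2, 12))}
        model = RandomForestClassifier(**params, random_state=seed).fit(Xtr, ytr)
        pred = model.predict(Xva)
        # Weighted objective: accuracy traded off against the fairness gap.
        score = alpha * (pred == yva).mean() + (1 - alpha) * (1 - tpr_gap(yva, pred, gva))
        if best is None or score > best[0]:
            best = (score, params)
    return best

X = np.random.default_rng(0).random((500, 5))
group = (X[:, 0] > 0.5).astype(int)
y = ((X[:, 1] + 0.2 * group) > 0.6).astype(int)
print(fair_random_search(X, y, group))
```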
TimeSHAP: Explaining Recurrent Models through Sequence Perturbations
João Bento, Pedro Saleiro, André F Cruz, and
2 more authors
In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021
Although recurrent neural networks (RNNs) are state-of-the-art in numerous sequential decision-making tasks, there has been little research on explaining their predictions. In this work, we present TimeSHAP, a model-agnostic recurrent explainer that builds upon KernelSHAP and extends it to the sequential domain. TimeSHAP computes feature-, timestep-, and cell-level attributions. As sequences may be arbitrarily long, we further propose a pruning method that is shown to dramatically decrease both its computational cost and the variance of its attributions. We use TimeSHAP to explain the predictions of a real-world bank account takeover fraud detection RNN model, and draw key insights from its explanations: i) the model identifies important features and events aligned with what fraud analysts consider cues for account takeover; ii) positively predicted sequences can be pruned to only 10% of their original length, as older events have residual attribution values; iii) the most recent input event of positive predictions contributes, on average, only 41% of the model’s score; iv) the client’s age receives notably high attribution, which is reflected in higher false positive rates for older clients.
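A conceptual sketch of perturbation-based event attribution with pruning, inspired by but much simpler than TimeSHAP (the toy model, baseline, and tolerance are placeholders):

```python
# Conceptual sketch (not TimeSHAP's algorithm): older events whose joint contribution
# falls below a tolerance are grouped into one pruned block; remaining events are
# attributed by replacing them with a baseline, one at a time.
import numpy as np

def prune_index(f, seq, baseline, tol=0.05):
    """Smallest suffix whose complement changes the score by less than tol."""
    full = f(seq)
    for k in range(1, len(seq) + 1):
        perturbed = np.vstack([np.tile(baseline, (len(seq) - k, 1)), seq[-k:]])
        if abs(f(perturbed) - full) < tol:
            return len(seq) - k      # events before this index are grouped together
    return 0

def event_attributions(f, seq, baseline, start):
    """Score drop when each recent event is replaced by the baseline."""
    full = f(seq)
    attrs = []
    for t in range(start, len(seq)):
        perturbed = seq.copy()
        perturbed[t] = baseline
        attrs.append(full - f(perturbed))
    return attrs

# Toy recurrent-style score: exponentially decayed sum of the first feature.
f = lambda s: float(np.sum(s[:, 0] * 0.8 ** np.arange(len(s))[::-1]))
seq = np.random.default_rng(1).random((20, 3))
start = prune_index(f, seq, baseline=np.zeros(3))
print(start, event_attributions(f, seq, baseline=np.zeros(3), start=start))
```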
2020
On Document Representations for Detection of Biased News Articles
André Ferreira Cruz, Gil Rocha, and Henrique Lopes Cardoso
In Proceedings of the 35th Annual ACM Symposium on Applied Computing, 2020
Detecting bias in text is an increasingly relevant topic, given the information overload problem. Automating this task is crucial for meeting our need for quality news consumption. With this in mind, we explore modern deep learning approaches, including contextualized word embeddings and attention mechanisms, to compare the effects of different document representation choices. We design token-wise, sentence-wise, and hierarchical document representations. Focusing on hyperpartisan news detection, we show that hierarchical attention mechanisms are able to better capture information at different levels of granularity (including intra- and inter-sentence), which seems to be relevant for this task. With an accuracy of 82.5%, our best-performing system is based on an ensemble of hierarchical attention networks with ELMo embeddings, achieving state-of-the-art performance on the SemEval-2019 Task 4 dataset.
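To make the hierarchical idea concrete, here is a minimal sketch of two-level attention pooling in PyTorch (an illustration of the mechanism only, not the paper's exact architecture): word-level attention summarizes each sentence, and sentence-level attention summarizes the document before classification.

```python
# Minimal sketch of hierarchical attention pooling (illustrative, not the paper's model).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                    # x: (batch, items, dim)
        weights = torch.softmax(self.score(x), dim=1)
        return (weights * x).sum(dim=1)      # attention-weighted sum over items

class HierarchicalClassifier(nn.Module):
    def __init__(self, dim, n_classes=2):
        super().__init__()
        self.word_pool = AttentionPool(dim)  # tokens -> sentence vector
        self.sent_pool = AttentionPool(dim)  # sentences -> document vector
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                    # x: (batch, sentences, tokens, dim) embeddings
        b, s, t, d = x.shape
        sent_vecs = self.word_pool(x.view(b * s, t, d)).view(b, s, d)
        return self.head(self.sent_pool(sent_vecs))

logits = HierarchicalClassifier(dim=128)(torch.randn(4, 10, 25, 128))
print(logits.shape)  # (4, 2)
```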
Fairness-Aware Hyperparameter Optimization
André Miguel Ferreira Cruz
University of Porto, Faculty of Engineering, Jul 2020