Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks: conditioned on a question and answer key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models’ ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using large language models, and evaluate them against benchmark prediction tasks. Specifically, the package derives natural language tasks from US Census data products, inspired by popular tabular data benchmarks. A flexible API allows any task to be constructed out of 28 census features whose values are mapped to prompt-completion pairs. We demonstrate the utility of folktexts through a sweep of empirical insights on 16 recent large language models, inspecting risk scores, calibration curves, and diverse evaluation metrics. We find that zero-shot risk scores have high predictive signal while being widely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and generate over-confident risk scores.
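As an illustration of what mapping tabular features to prompt-completion pairs and reading off a risk score from answer-token probabilities can look like, here is a minimal sketch with made-up feature names and log-probabilities (not the folktexts API):

```python
# Illustrative sketch (not the folktexts API): serialize one census record into a
# natural-language question, then turn the model's log-probabilities for the valid
# answer completions into a risk score.
import math

def row_to_prompt(row: dict) -> str:
    """Serialize one census record into a natural-language question."""
    return (
        f"The person is {row['AGEP']} years old, works {row['WKHP']} hours per week, "
        f"and has education level '{row['SCHL']}'. "
        "Does this person earn more than $50,000 per year? Answer with Yes or No.\n"
        "Answer:"
    )

def risk_score(answer_logprobs: dict) -> float:
    """Renormalize the probabilities of the 'Yes'/'No' completions into a risk score."""
    p_yes = math.exp(answer_logprobs[" Yes"])
    p_no = math.exp(answer_logprobs[" No"])
    return p_yes / (p_yes + p_no)

# Example: log-probs a (hypothetical) LLM assigned to each answer token.
print(row_to_prompt({"AGEP": 42, "WKHP": 50, "SCHL": "Bachelor's degree"}))
print(risk_score({" Yes": -0.7, " No": -1.9}))  # ~0.77
```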
Unprocessing Seven Years of Algorithmic Fairness
André F Cruz and Moritz Hardt
In The Twelfth International Conference on Learning Representations (ICLR), 2024
(Oral Presentation)
Seven years ago, researchers proposed a postprocessing method to equalize the error rates of a model across different demographic groups. The work launched hundreds of papers purporting to improve over the postprocessing baseline. We empirically evaluate these claims through thousands of model evaluations on several tabular datasets. We find that the fairness-accuracy Pareto frontier achieved by postprocessing contains all other methods we were feasibly able to evaluate. In doing so, we address two common methodological errors that have confounded previous observations. One relates to the comparison of methods with different unconstrained base models. The other concerns methods achieving different levels of constraint relaxation. At the heart of our study is a simple idea we call unprocessing that roughly corresponds to the inverse of postprocessing. Unprocessing allows for a direct comparison of methods using different underlying models and levels of relaxation. Interpreting our findings, we recall a widely overlooked theoretical argument, presented seven years ago, that accurately predicted what we observe.
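For intuition, the sketch below shows the unprocessing direction in a toy form under simplifying assumptions (synthetic scores, accuracy-only threshold re-optimization); the paper's actual procedure and evaluation are more involved:

```python
# Toy sketch (not the paper's code): postprocessing picks group-specific thresholds
# on a risk score to equalize error rates; "unprocessing" re-optimizes those
# thresholds for accuracy alone, so methods built on different base models and
# relaxation levels can be compared through their underlying scores.
import numpy as np

def unprocess(scores, labels, groups, grid=np.linspace(0.0, 1.0, 101)):
    """Pick, per group, the threshold that maximizes accuracy on validation data."""
    thresholds = {}
    for g in np.unique(groups):
        mask = groups == g
        accs = [((scores[mask] >= t) == labels[mask]).mean() for t in grid]
        thresholds[g] = grid[int(np.argmax(accs))]
    return thresholds

rng = np.random.default_rng(0)
groups = rng.integers(0, 2, size=1000)
labels = rng.integers(0, 2, size=1000)
scores = np.clip(0.5 * labels + 0.3 * rng.random(1000) + 0.1 * groups, 0, 1)
print(unprocess(scores, labels, groups))
```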
2023
FairGBM: Gradient Boosting with Fairness Constraints
André F Cruz, Catarina Belém, Sérgio Jesus, and
3 more authors
In The Eleventh International Conference on Learning Representations (ICLR), 2023
Machine Learning (ML) algorithms based on gradient boosted decision trees (GBDT) are still favored on many tabular data tasks across various mission-critical applications, from healthcare to finance. However, GBDT algorithms are not free of the risk of bias and discriminatory decision-making. Despite GBDT’s popularity and the rapid pace of research in fair ML, existing in-processing fair ML methods are either inapplicable to GBDT, incur significant training-time overhead, or are inadequate for problems with high class imbalance. We present FairGBM, a learning framework for training GBDT under fairness constraints, with little to no impact on predictive performance when compared to unconstrained LightGBM. Since common fairness metrics are non-differentiable, we employ a "proxy-Lagrangian" formulation using smooth convex error rate proxies to enable gradient-based optimization. Additionally, our open-source implementation shows an order of magnitude speedup in training time when compared with related work, a pivotal aspect to foster the widespread adoption of FairGBM by real-world practitioners.
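To make the proxy-Lagrangian idea concrete, here is a toy single-step sketch (illustrative only, not FairGBM's implementation): the boosting gradients combine the cross-entropy loss with a multiplier-weighted smooth proxy of a constraint (here a false-positive-rate gap), and the multiplier follows gradient ascent on the observed violation.

```python
# Toy sketch of one proxy-Lagrangian step (not FairGBM's code).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lagrangian_step(raw_scores, y, group, lam, slack=0.02, step=0.1):
    p = sigmoid(raw_scores)
    grad = p - y                                  # gradient of binary cross-entropy
    neg = y == 0
    g1, g0 = neg & (group == 1), neg & (group == 0)
    # Smooth proxy for FPR_1 - FPR_0: mean positive probability on each group's negatives.
    proxy_gap = p[g1].mean() - p[g0].mean()
    grad[g1] += lam * p[g1] * (1 - p[g1]) / g1.sum()   # d(proxy)/d(raw score)
    grad[g0] -= lam * p[g0] * (1 - p[g0]) / g0.sum()
    lam = max(0.0, lam + step * (proxy_gap - slack))   # ascent on the multiplier
    return grad, lam

rng = np.random.default_rng(0)
y, group = rng.integers(0, 2, 200), rng.integers(0, 2, 200)
grad, lam = lagrangian_step(rng.normal(size=200), y, group, lam=0.0)
print(grad.shape, lam)
```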
2022
Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation
Sérgio Jesus, José Pombal, Duarte Alves, and
5 more authors
In Advances in Neural Information Processing Systems (NeurIPS), 2022
Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase of publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data—which is prevalent in many high-stakes domains—has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques on an anonymized, real-world bank account opening fraud detection dataset. This setting carries a set of challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance. Additionally, to allow practitioners to stress test both performance and fairness of ML methods, each dataset variant of BAF contains specific types of data bias. With this resource, we aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.
2021
Promoting Fairness through Hyperparameter Optimization
André F Cruz, Pedro Saleiro, Catarina Belém, and
2 more authors
In 2021 IEEE International Conference on Data Mining (ICDM), 2021
Considerable research effort has been directed towards algorithmic fairness, but real-world adoption of bias-reduction techniques is still scarce. Existing methods are either metric- or model-specific, require access to sensitive attributes at inference time, or carry high development or deployment costs. This work explores the unfairness that emerges when optimizing ML models solely for predictive performance, and how to mitigate it with a simple and easily deployed intervention: fairness-aware hyperparameter optimization (HO). We propose and evaluate fairness-aware variants of three popular HO algorithms: Fair Random Search, Fair TPE, and Fairband. We validate our approach on a real-world bank account opening fraud case study, as well as on three datasets from the fairness literature. Results show that, without extra training cost, it is feasible to find models with a 111% mean fairness increase and just a 6% decrease in performance when compared with fairness-blind HO.
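The following is a generic sketch of fairness-aware random search (illustrative only; the model, fairness metric, and trade-off weight are arbitrary choices, not the authors' setup): sample configurations at random and rank them by a weighted combination of accuracy and an equal-opportunity gap.

```python
# Generic sketch of fairness-aware random search (not the paper's code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def tpr_gap(y_true, y_pred, group):
    """Equal-opportunity gap: difference in true-positive rates between groups."""
    tprs = [y_pred[(group == g) & (y_true == 1)].mean() for g in np.unique(group)]
    return max(tprs) - min(tprs)

def fair_random_search(X, y, group, n_trials=20, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    Xtr, Xva, ytr, yva, gtr, gva = train_test_split(X, y, group, random_state=seed)
    best = None
    for _ in range(n_trials):
        params = {"n_estimators": int(rng.integers(50, 300)),
                  "max_depth": int(rng.integers(2, 12))}
        model = RandomForestClassifier(**params, random_state=seed).fit(Xtr, ytr)
        pred = model.predict(Xva)
        # Weighted objective: accuracy traded off against the fairness gap.
        score = alpha * (pred == yva).mean() + (1 - alpha) * (1 - tpr_gap(yva, pred, gva))
        if best is None or score > best[0]:
            best = (score, params)
    return best

X = np.random.default_rng(0).random((500, 5))
group = (X[:, 0] > 0.5).astype(int)
y = ((X[:, 1] + 0.2 * group) > 0.6).astype(int)
print(fair_random_search(X, y, group))
```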
TimeSHAP: Explaining Recurrent Models through Sequence Perturbations
João Bento, Pedro Saleiro, André F Cruz, and
2 more authors
In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021
Although recurrent neural networks (RNNs) are state-of-the-art in numerous sequential decision-making tasks, there has been little research on explaining their predictions. In this work, we present TimeSHAP, a model-agnostic recurrent explainer that builds upon KernelSHAP and extends it to the sequential domain. TimeSHAP computes feature-, timestep-, and cell-level attributions. As sequences may be arbitrarily long, we further propose a pruning method that is shown to dramatically decrease both its computational cost and the variance of its attributions. We use TimeSHAP to explain the predictions of a real-world bank account takeover fraud detection RNN model, and draw key insights from its explanations: i) the model identifies important features and events aligned with what fraud analysts consider cues for account takeover; ii) positively predicted sequences can be pruned to only 10% of their original length, as older events have residual attribution values; iii) the most recent input event of positive predictions contributes, on average, only 41% of the model’s score; iv) the client’s age receives notably high attribution, which is reflected in higher false positive rates for older clients.
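A conceptual sketch of perturbation-based event attribution with pruning, inspired by but much simpler than TimeSHAP (the toy model, baseline, and tolerance are placeholders):

```python
# Conceptual sketch (not TimeSHAP's algorithm): older events whose joint contribution
# falls below a tolerance are grouped into one pruned block; remaining events are
# attributed by replacing them with a baseline, one at a time.
import numpy as np

def prune_index(f, seq, baseline, tol=0.05):
    """Smallest suffix whose complement changes the score by less than tol."""
    full = f(seq)
    for k in range(1, len(seq) + 1):
        perturbed = np.vstack([np.tile(baseline, (len(seq) - k, 1)), seq[-k:]])
        if abs(f(perturbed) - full) < tol:
            return len(seq) - k      # events before this index are grouped together
    return 0

def event_attributions(f, seq, baseline, start):
    """Score drop when each recent event is replaced by the baseline."""
    full = f(seq)
    attrs = []
    for t in range(start, len(seq)):
        perturbed = seq.copy()
        perturbed[t] = baseline
        attrs.append(full - f(perturbed))
    return attrs

# Toy recurrent-style score: exponentially decayed sum of the first feature.
f = lambda s: float(np.sum(s[:, 0] * 0.8 ** np.arange(len(s))[::-1]))
seq = np.random.default_rng(1).random((20, 3))
start = prune_index(f, seq, baseline=np.zeros(3))
print(start, event_attributions(f, seq, baseline=np.zeros(3), start=start))
```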
2020
On Document Representations for Detection of Biased News Articles
André Ferreira Cruz, Gil Rocha, and Henrique Lopes Cardoso
In Proceedings of the 35th Annual ACM Symposium on Applied Computing, 2020
Detecting bias in text is an increasingly relevant topic, given the information overload problem. Automating this task is crucial for meeting our need for quality news consumption. With this in mind, we explore modern deep learning approaches, including contextualized word embeddings and attention mechanisms, to compare the effects of different document representation choices. We design token-wise, sentence-wise, and hierarchical document representations. Focusing on hyperpartisan news detection, we show that hierarchical attention mechanisms are able to better capture information at different levels of granularity (including intra- and inter-sentence), which seems to be relevant for this task. With an accuracy of 82.5%, our best-performing system is based on an ensemble of hierarchical attention networks with ELMo embeddings, achieving state-of-the-art performance on the SemEval-2019 Task 4 dataset.
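To make the hierarchical idea concrete, here is a minimal sketch of two-level attention pooling in PyTorch (an illustration of the mechanism only, not the paper's exact architecture): word-level attention summarizes each sentence, and sentence-level attention summarizes the document before classification.

```python
# Minimal sketch of hierarchical attention pooling (illustrative, not the paper's model).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                    # x: (batch, items, dim)
        weights = torch.softmax(self.score(x), dim=1)
        return (weights * x).sum(dim=1)      # attention-weighted sum over items

class HierarchicalClassifier(nn.Module):
    def __init__(self, dim, n_classes=2):
        super().__init__()
        self.word_pool = AttentionPool(dim)  # tokens -> sentence vector
        self.sent_pool = AttentionPool(dim)  # sentences -> document vector
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                    # x: (batch, sentences, tokens, dim) embeddings
        b, s, t, d = x.shape
        sent_vecs = self.word_pool(x.view(b * s, t, d)).view(b, s, d)
        return self.head(self.sent_pool(sent_vecs))

logits = HierarchicalClassifier(dim=128)(torch.randn(4, 10, 25, 128))
print(logits.shape)  # (4, 2)
```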
Fairness-Aware Hyperparameter Optimization
André Miguel Ferreira Cruz
University of Porto, Faculty of Engineering, Jul 2020