Deep ROC Analysis and AUC as Balanced Average Accuracy, for Improved Classifier Selection, Audit and Explanation

Osmani, Venet;
2022

Abstract

Optimal performance is desired for decision-making in any field with binary classifiers and diagnostic tests; however, common performance measures lack depth in information. The area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve are too general because they evaluate all decision thresholds, including unrealistic ones. Conversely, accuracy, sensitivity, specificity, positive predictive value and the F1 score are too specific: they are measured at a single threshold that is optimal for some instances but not others, which is not equitable. Between these two approaches, we propose deep ROC analysis to measure performance in multiple groups of predicted risk (like calibration), or groups of true positive rate or false positive rate. In each group, we measure the group AUC (properly), the normalized group AUC, and averages of sensitivity, specificity, positive and negative predictive value, and positive and negative likelihood ratio. The measurements can be compared between groups, to whole measures, to point measures and between models. We also provide a new interpretation of AUC, in whole or in part, as balanced average accuracy, relevant to individuals instead of pairs. We evaluate models in three case studies using our method and Python toolkit and confirm its utility.
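
The idea of measuring performance per group of false positive rate, and of reading AUC as an average of sensitivity and specificity, can be illustrated with a minimal sketch. It is not the authors' Python toolkit: the function names (`roc_points`, `trapezoid`, `vertical_partial_auc`), the toy labels and scores, and the choice of two equal FPR groups are all assumptions made for illustration. The partial area computed here is the ordinary vertical partial AUC, which may differ from the group AUC the paper defines.

```python
import numpy as np

def roc_points(y_true, scores):
    """Empirical ROC points (FPR, TPR), ordered from strictest to loosest threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    y_sorted, s_sorted = y_true[order], scores[order]
    tp = np.cumsum(y_sorted)    # true positives at each cut
    fp = np.cumsum(~y_sorted)   # false positives at each cut
    # Keep one point per distinct score so ties are handled together.
    last = np.r_[s_sorted[1:] != s_sorted[:-1], True]
    tpr = np.r_[0.0, tp[last] / y_true.sum()]
    fpr = np.r_[0.0, fp[last] / (~y_true).sum()]
    return fpr, tpr

def trapezoid(x, y):
    """Trapezoidal integral of y over x (x non-decreasing)."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def vertical_partial_auc(fpr, tpr, lo, hi):
    """Area under the ROC curve for FPR in [lo, hi] (standard partial AUC)."""
    inside = (fpr >= lo) & (fpr <= hi)
    # Boundary points are interpolated; duplicates add zero-width segments only.
    x = np.r_[lo, fpr[inside], hi]
    y = np.r_[np.interp(lo, fpr, tpr), tpr[inside], np.interp(hi, fpr, tpr)]
    return trapezoid(x, y)

# Toy data (hypothetical): 4 positives and 4 negatives.
y = [1, 1, 0, 1, 0, 0, 1, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
fpr, tpr = roc_points(y, s)

auc = trapezoid(fpr, tpr)            # average sensitivity over all FPR
avg_spec = trapezoid(tpr, 1 - fpr)   # average specificity over all TPR
balanced_avg_acc = (auc + avg_spec) / 2

# Two FPR groups (an assumed grouping): [0, 0.5] and [0.5, 1].
groups = [(0.0, 0.5), (0.5, 1.0)]
partial = [vertical_partial_auc(fpr, tpr, lo, hi) for lo, hi in groups]
# Dividing by the group width gives the average sensitivity within the group.
avg_sens = [p / (hi - lo) for p, (lo, hi) in zip(partial, groups)]

print(auc, balanced_avg_acc, partial, avg_sens)
```

In this sketch the whole-curve AUC equals both the average sensitivity (integrating over FPR) and the average specificity (integrating over TPR), so their mean is again the AUC. The partial areas of the groups sum to the whole AUC, and the per-group averages show where along the curve the classifier performs better or worse.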
Use this identifier to cite or link to this document: http://hdl.handle.net/11582/329886