The Risk Metric Translation Layer: Why Precision and FPR Aren't Mirror Images
I have often found it hard to communicate the antifraud team's performance metrics, such as precision and false positive rate, to other teams, and sometimes even within the antifraud team, because these terms are also used loosely in everyday conversation. I once said, “Our policy precision is 90%,” and the counterpart replied, “A 10% false positive rate is too high to accept.” That exchange reveals a gap in the understanding of precision and false positive rate (FPR).
When the business side says “a 10% FPR is too high,” they mean: out of all the good users, how many are falsely labelled as bad? They worry about user experience. When the risk side says “precision is 90%,” we mean: out of all the users labelled as bad, how many are actually bad? We care about the performance of the models and rules themselves. By deriving the relationship between the two, we can bridge that gap.
Let’s revisit the two-room analogy for intuition. FPR only cares about the good users’ room (N). If FPR increases, the absolute number of false positives increases, and precision [TP/(TP+FP)] drops because the denominator grows. But to find the relationship between the two, let’s hold precision constant: to keep precision fixed while FP grows, the model must also capture more TP. In other words, recall (TP/P) must increase. Finally, since FPR and recall are just ratios, the relative size of N and P also matters, a.k.a. prevalence [P/(P+N)]. To summarize, precision and FPR are connected through recall and prevalence: 90% precision does not imply a 10% FPR, because we must also account for recall and the natural fraud rate (prevalence).
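As a quick sanity check on the definitions, here is a minimal Python sketch that computes all four quantities from raw confusion-matrix counts; the counts themselves are made-up numbers for illustration only.

```python
# Made-up confusion-matrix counts, for illustration only.
TP, FP, FN, TN = 90, 10, 60, 9840

P = TP + FN          # all truly bad users (the "bad room")
N = FP + TN          # all truly good users (the "good room")

precision  = TP / (TP + FP)   # of the users we flagged, how many are actually bad
recall     = TP / P           # of the truly bad users, how many we caught
fpr        = FP / N           # of the truly good users, how many we wrongly flagged
prevalence = P / (P + N)      # the natural fraud rate

print(f"precision={precision:.1%}, recall={recall:.1%}, "
      f"FPR={fpr:.2%}, prevalence={prevalence:.1%}")
# precision=90.0%, recall=60.0%, FPR=0.10%, prevalence=1.5%
```

Even in this toy example, 90% precision coexists with an FPR of roughly 0.1%, not 10%.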
Mathematically, since FP = TP × (1 − Precision)/Precision = Recall × P × (1 − Precision)/Precision and N = P × (1 − θ)/θ, we can derive the relationship as:

FPR = FP/N = Recall × (1 − Precision)/Precision × θ/(1 − θ)

where θ (theta) is the prevalence.
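For convenience, the identity can be wrapped in a small helper; fpr_from is a hypothetical name used only for the examples that follow.

```python
def fpr_from(precision: float, recall: float, prevalence: float) -> float:
    """FPR implied by precision, recall, and prevalence (theta), per the identity above."""
    theta = prevalence
    return recall * (1 - precision) / precision * theta / (1 - theta)
```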
Consider the following scenarios to consolidate the intuition, assuming the model is reasonably good (both precision and recall are high); a quick numeric check follows below:
When prevalence is low, precision is high and FPR is low.
When prevalence is high, both precision and FPR are high.
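To put rough numbers on the two scenarios, here is the hypothetical fpr_from helper from the sketch above applied to the same model at a low and a high prevalence; the 90% precision and 80% recall figures are assumptions chosen only for illustration.

```python
# Same reasonably good model (assumed: 90% precision, 80% recall) at two prevalence levels.
low  = fpr_from(precision=0.90, recall=0.80, prevalence=0.01)   # 1% fraud rate
high = fpr_from(precision=0.90, recall=0.80, prevalence=0.20)   # 20% fraud rate
print(f"FPR at 1% prevalence:  {low:.2%}")    # ~0.09%
print(f"FPR at 20% prevalence: {high:.2%}")   # ~2.22%
```

With the model held fixed, FPR scales with the fraud rate itself.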
You may already notice that even if model performance is poor, FPR can be low when prevalence is very low. For example, with a prevalence of 0.1%, a recall of 10%, and a precision of 20%, the FPR is only 0.04%.
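The same poor-model example can be verified with absolute counts; the population of one million users is an assumed figure chosen only to make the arithmetic concrete.

```python
total = 1_000_000              # assumed population size, for concreteness
P  = 1_000                     # prevalence 0.1%  -> 1,000 truly bad users
N  = total - P                 # 999,000 truly good users
TP = 100                       # recall 10%       -> 100 of the bad users caught
FP = TP * 4                    # precision 20%    -> 4 false positives per true positive
print(f"FPR = {FP / N:.2%}")   # 0.04%, despite the weak model
```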
Finally, the question is which number we should use. I think both should be reported, separately, since the audiences differ. The risk team should focus on precision and recall, as they are directly linked to model and policy performance, but we should also track FPR because it affects the user experience.
