This is Jessica. In a previous post I discussed methodological issues in studies of AI-assisted decision making, such as those used to evaluate different model explanation strategies. In a typical study, people are given some decision task (e.g., given the characteristics of this defendant, decide whether to convict or release them), asked to make a decision, and then given access to a model's prediction and asked whether they want to change their decision. This kind of AI-advised decision making is of interest because organizations are deploying predictive models to support human decisions in areas like healthcare and criminal justice. Ideally, the human can use the model to perform better than either they would on their own or than the model would if deployed without any human in the loop (referred to as complementarity).
The most commonly used definition of appropriate reliance works like this: if a person follows the model's prediction and it turns out to be wrong, that is labeled overreliance. If they do not follow the model's prediction but the prediction turns out to be correct, that is labeled underreliance. Otherwise, their reliance is labeled appropriate.
This definition is problematic for several reasons. One is that following the AI may be the better bet in expectation and still turn out to be wrong; it makes little sense to say the human made the wrong choice by following the AI in such a case. Because the definition is based on ex-post correctness, it conflates two causes of suboptimal human behavior: not forming accurate beliefs about the probability that the AI is correct, and not making the right choice about whether to follow the AI given one's beliefs.
Because decisions are scored in action space, a person who fails to choose the correct action (i.e., which prediction to follow) is penalized the same amount in a scenario where the human's and the AI's probabilities of being correct are nearly identical as in a scenario where one of them is far more likely to be correct. Nevertheless, there are many papers using this definition, some with hundreds of citations.
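As a concrete illustration of the problem, here is a toy simulation of my own (not from the paper): suppose the AI's prediction is correct 70% of the time and the human's own prediction 60% of the time on every trial, so always following the AI is the rational policy. Under the ex-post definition, a participant who does exactly that is still labeled overreliant on roughly 30% of trials, even though their reliance behavior could not have been better.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical setup: on every trial the AI's prediction is correct with
# probability 0.7, while the human's own prediction would be correct with
# probability 0.6, so always following the AI is the rational policy.
ai_correct = rng.random(n) < 0.7
followed_ai = np.ones(n, dtype=bool)   # the participant always follows the AI

# Ex-post labeling used in many studies:
overreliance = followed_ai & ~ai_correct     # followed the AI and it turned out wrong
underreliance = ~followed_ai & ai_correct    # ignored the AI and it turned out right

print(f"labeled overreliant on {overreliance.mean():.0%} of trials")    # ~30%
print(f"labeled underreliant on {underreliance.mean():.0%} of trials")  # 0%
```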
In A Statistical Framework for Measuring AI Reliance, Ziyang Guo, Yifan Wu, Jason Hartline, and I write:
Humans frequently make decisions with the assistance of artificially intelligent (AI) systems. A common pattern is for the AI to recommend an action to the human, who retains control over the final decision. Researchers have identified ensuring that humans rely appropriately on the AI as a critical component of achieving complementary performance. We argue that the current definition of appropriate reliance used in such research lacks formal statistical grounding and can lead to contradictions. We propose a formal definition of reliance, based on statistical decision theory, which separates the concept of reliance, as the probability that the decision-maker follows the AI's prediction, from challenges a human may face in differentiating the signals and forming accurate beliefs about the situation. Our definition gives rise to a framework that can be used to guide the design and interpretation of studies on human-AI complementarity and reliance. Using recent AI-advised decision-making studies from the literature, we demonstrate how our framework can be used to separate the loss due to mis-reliance from the loss due to not differentiating the signals. We evaluate these losses by comparing them to a baseline and a benchmark for complementary performance, defined by the expected payoff achieved by a rational agent facing the same decision task as the behavioral agents.
This setup is similar to the rational agent framework for data visualization, but here the decision-maker is assumed to receive a signal consisting of the feature values for an instance, the AI's prediction, the human's prediction, and optionally some explanation of the AI's judgment. The decision-maker chooses which prediction to follow.
The upper bound on achievable performance in such a study (the rational benchmark) can be calculated as the expected score of a rational decision-maker on a randomly sampled decision task from the experiment. The rational decision-maker has prior knowledge of the data-generating model (the joint distribution of the signal and the ground-truth state). Upon seeing the instance in a decision trial, they perceive the signal accurately, arrive at Bayesian posterior beliefs over the payoff-relevant state, and choose the action that maximizes their expected utility under the posterior. This is calculated in the payoff space defined by the scoring rule, since errors can vary in how costly they are.
By also defining a rational agent baseline, i.e., the expected performance of a rational decision-maker who does not have access to the signal on a randomly selected decision task from the experiment, we can define the value of rational complementation. Because the baseline represents the score a rational agent could attain relying only on its prior beliefs about the data-generating model, it always corresponds to the expected score of the better of the two fixed strategies: human alone or AI alone.
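To make the benchmark and baseline concrete, here is a minimal sketch in Python. It is my own simplification, not code from the paper: it assumes each decision trial reduces to choosing between the human's prediction and the AI's prediction, that we know the probability each is correct conditional on the signal, and that the scoring rule simply pays 1 for a correct final decision and 0 otherwise.

```python
import numpy as np

# Each entry is a decision trial: the probability that the AI's prediction is
# correct and the probability that the human's own prediction is correct,
# conditional on the signal. (Hypothetical numbers for illustration.)
p_ai_correct    = np.array([0.90, 0.60, 0.55, 0.80, 0.40])
p_human_correct = np.array([0.70, 0.70, 0.50, 0.30, 0.60])

# Scoring rule in payoff space: payoff for a correct vs. an incorrect decision.
payoff_correct, payoff_wrong = 1.0, 0.0

def expected_score(p_correct):
    return p_correct * payoff_correct + (1 - p_correct) * payoff_wrong

# Rational benchmark: on each trial, follow whichever prediction is more likely
# to be correct given the signal, then average over trials.
benchmark = expected_score(np.maximum(p_ai_correct, p_human_correct)).mean()

# Rational baseline: without access to the signal, commit to the better fixed
# strategy -- always follow the AI or always follow the human.
baseline = max(expected_score(p_ai_correct).mean(),
               expected_score(p_human_correct).mean())

print(f"rational benchmark: {benchmark:.3f}")                             # 0.710
print(f"rational baseline:  {baseline:.3f}")                              # 0.650
print(f"value of rational complementation: {benchmark - baseline:.3f}")   # 0.060
```

If the baseline comes out nearly equal to the benchmark, the point of the next paragraph is already visible in a sketch like this: there is little room for a human-AI team to add anything.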
When designing or interpreting an experiment on AI reliance, the first thing to do is check how close the baseline is to the benchmark. We want to make sure there is enough room for the human-AI team to improve performance beyond the baseline. If the baseline is very close to the benchmark, it's probably not worth adding a human to the loop.
Once you run the experiment and observe how well people actually make these decisions, the value of rational complementation can be treated as a unit of comparison for how much value humans add relative to the baseline. This is done by normalizing the observed scores to a range where the rational agent baseline is 0 and the rational agent benchmark is 1, and seeing where the observed human-AI performance falls. This helps in understanding the magnitude of effects when comparing different settings. For example, given two model explanation strategies A and B compared in an experiment, we can calculate the normalized expected human performance on a randomly sampled decision trial under each strategy, and measure the improvement as (score_A − score_B)/score_B.
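Here is a sketch of that normalization, continuing the toy setup above (again my own illustration, with hypothetical observed scores): the observed expected scores under the two conditions are rescaled so that the rational baseline maps to 0 and the rational benchmark maps to 1, and the relative improvement is computed on the rescaled values.

```python
# Values from the sketch above.
baseline, benchmark = 0.65, 0.71

def normalize(observed_score, baseline, benchmark):
    # Rescale so that the rational agent baseline is 0 and the benchmark is 1.
    return (observed_score - baseline) / (benchmark - baseline)

# Hypothetical observed expected scores under two explanation strategies.
observed_A, observed_B = 0.68, 0.66

score_A = normalize(observed_A, baseline, benchmark)   # 0.50
score_B = normalize(observed_B, baseline, benchmark)   # ~0.17
print(f"relative improvement of A over B: {(score_A - score_B) / score_B:.2f}")  # ~2.0
```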
It is also possible to decompose the sources of error in study participants' performance. To do this, we define a benchmark for a mis-reliant rational decision-maker: the expected score of a rational agent constrained to the level of reliance observed in the study participants. In other words, it is the highest score that a decision-maker who relies on the AI the same overall percentage of the time could achieve if they knew exactly how the AI's probability of being correct compared to the human's on every decision task. Because the mis-reliant benchmark and the study participants have the same reliance level (i.e., they accept the AI's predictions at the same rate), any difference in their decisions comes entirely from accepting the AI's predictions on different instances. The mis-reliant rational decision-maker always accepts the AI's predictions on the top X% of instances ranked by the AI's performance advantage over the human, whereas the study participants may not.
By calculating the mis-reliant rational benchmark for the observed reliance level of the study participants, we can distinguish between reliance loss, the loss due to over- or under-relying on the AI (defined as the difference between the rational benchmark and the mis-reliant benchmark, divided by the value of rational complementation), and discrimination loss, the loss from not accurately distinguishing the cases where the AI outperforms the human from the cases where the human outperforms the AI (defined as the difference between the mis-reliant benchmark and the participants' expected score, divided by the value of rational complementation).
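Here is a minimal sketch of that decomposition, continuing the toy binary-choice setup from above (my own simplification; the paper works with general scoring rules, and the observed reliance rate and score below are hypothetical). It constructs the mis-reliant rational benchmark by allocating the participants' observed amount of reliance to the instances where the AI's advantage over the human is largest, then computes the two losses.

```python
import numpy as np

# Same hypothetical trials as in the earlier sketch. With a 1/0 scoring rule,
# the expected score of a strategy is just its probability of being correct.
p_ai_correct    = np.array([0.90, 0.60, 0.55, 0.80, 0.40])
p_human_correct = np.array([0.70, 0.70, 0.50, 0.30, 0.60])

benchmark = np.maximum(p_ai_correct, p_human_correct).mean()   # 0.71
baseline  = max(p_ai_correct.mean(), p_human_correct.mean())   # 0.65
value_of_complementation = benchmark - baseline

# Observed behavior (hypothetical): participants follow the AI on 80% of trials
# and achieve an expected score of 0.66.
observed_reliance = 0.8
observed_score = 0.66

# Mis-reliant rational benchmark: rely on the AI at the observed rate, but on
# the trials where the AI's advantage over the human is largest.
n_rely = round(observed_reliance * len(p_ai_correct))
top_by_advantage = np.argsort(-(p_ai_correct - p_human_correct))[:n_rely]
score_per_trial = p_human_correct.copy()
score_per_trial[top_by_advantage] = p_ai_correct[top_by_advantage]
misreliant_benchmark = score_per_trial.mean()                  # 0.69

reliance_loss = (benchmark - misreliant_benchmark) / value_of_complementation
discrimination_loss = (misreliant_benchmark - observed_score) / value_of_complementation

print(f"reliance loss: {reliance_loss:.2f}")               # ~0.33
print(f"discrimination loss: {discrimination_loss:.2f}")   # ~0.50
```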
We applied this approach to several well-known studies of AI reliance and found, for example, that given how much better the AI was than the humans, there was little possibility of observing complementarity within some of the studies. We also observed researchers making comparisons across conditions with different upper and lower bounds on performance without accounting for those differences.
There is much more in the paper. For example, we discuss how the rational benchmark, the upper bound representing a rational decision-maker's expected score on a randomly selected decision task, can overfit the empirical data. This happens when the signal space is very large (e.g., the instances are text documents) and few human predictions are observed per signal. We explain how to have the rational agent best respond to an optimal coarsening of the empirical distribution, such that the true rational benchmark is bounded by this and the overfit upper bound.
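To see why the benchmark can overfit, here is a toy sketch of my own (the paper's coarsening procedure is more careful than the single crude bin used here). When every signal is unique and observed once, best-responding per signal amounts to picking whichever prediction happened to be correct on that one trial, which inflates the estimated benchmark; pooling signals gives a more conservative estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Toy data: 200 trials, each with a unique signal. In truth the AI is correct
# 70% of the time and the human 65%, independent of the signal, so the true
# rational benchmark is 0.70 (always follow the AI).
ai_correct = rng.random(n) < 0.70
human_correct = rng.random(n) < 0.65

# Overfit estimate: best-respond to each unique signal separately. With one
# observation per signal, this just picks whichever prediction happened to be
# right on that trial, pushing the estimate toward P(at least one correct).
overfit_benchmark = np.maximum(ai_correct, human_correct).mean()

# Coarsened estimate: pool all signals into a single bin (the crudest possible
# coarsening) and best-respond per bin.
coarse_benchmark = max(ai_correct.mean(), human_correct.mean())

print(f"overfit benchmark estimate:   {overfit_benchmark:.2f}")   # ~0.9, well above the true 0.70
print(f"coarsened benchmark estimate: {coarse_benchmark:.2f}")    # close to the true 0.70
```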
While our focus was on showing how to improve studies of human-AI teams, I'm excited about the potential for this framework to help as organizations consider whether introducing AI can improve their human decision processes. We are currently thinking about what practical questions such a framework can be used to answer (beyond whether pairing humans with AI is effective).