
Understanding Black-box Predictions via Influence Functions



Pang Wei Koh, Percy Liang (Stanford University, Stanford, CA). Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

How can we explain the predictions of a black-box model? In this paper, we use influence functions, a classic technique from robust statistics, to trace a model's prediction through the learning algorithm and back to its training data, thereby identifying training points most responsible for a given prediction. To scale up influence functions to modern machine learning settings, we develop a simple, efficient implementation that requires only oracle access to gradients and Hessian-vector products. We show that even on non-convex and non-differentiable models where the theory breaks down, approximations to influence functions can still provide valuable information. On linear models and convolutional neural networks, we demonstrate that influence functions are useful for multiple purposes: understanding model behavior, debugging models, detecting dataset errors, and even creating visually-indistinguishable training-set attacks.

1. Introduction

A key question often asked of machine learning systems is: why did the system make this prediction? We want models that are not just high-performing but also explainable. By understanding why a model does what it does, we can hope to improve the model (Amershi et al., 2015), discover new science (Shrikumar et al., 2016), and provide end-users with explanations of actions that impact them (Goodman & Flaxman, 2016).

However, the best-performing models in many domains, e.g., deep neural networks for image and speech recognition (Krizhevsky et al., 2012), are complicated, black-box models whose predictions seem hard to explain. Work on interpreting these black-box models has focused on understanding how a fixed model leads to particular predictions, e.g., by locally fitting a simpler model around the test point (Ribeiro et al., 2016) or by perturbing the test point to see how the prediction changes (Simonyan et al., 2013; Li et al., 2016b; Datta et al., 2016; Adler et al., 2016). These works explain the predictions in terms of the model, but how can we explain where the model came from?

In this paper, we tackle this question by tracing a model's predictions through its learning algorithm and back to the training data, where the model parameters ultimately derive from. To formalize the impact of a training point on a prediction, we ask the counterfactual: what would happen if we did not have this training point, or if the values of this training point were changed slightly?

Answering this question by perturbing the data and retraining the model can be prohibitively expensive. To overcome this problem, we use influence functions, a classic technique from robust statistics (Cook & Weisberg, 1980) that tells us how the model parameters change as we upweight a training point by an infinitesimal amount. This allows us to differentiate through the training to estimate in closed form the effect of a variety of training perturbations.

Despite their rich history in statistics, influence functions have not seen widespread use in machine learning; to the best of our knowledge, the work closest to ours is Wojnowicz et al. (2016), which introduced a method for approximating a quantity related to influence in generalized linear models. One obstacle to adoption is that influence functions require expensive second-derivative calculations and assume model differentiability and convexity, which limits their applicability in modern contexts where models are often non-differentiable, non-convex, and high-dimensional. We address these challenges by showing that we can efficiently approximate influence functions using second-order optimization techniques (Pearlmutter, 1994; Martens, 2010; Agarwal et al., 2016), and that they remain accurate even as the underlying assumptions of differentiability and convexity degrade (a sketch of such a Hessian-vector-product computation follows this introduction).

Influence functions capture the core idea of studying models through the lens of their training data. We show that they are a versatile tool that can be applied to a wide variety of seemingly disparate tasks: understanding model behavior, debugging models, detecting dataset errors, and creating visually-indistinguishable adversarial training examples that can flip neural network test predictions, the training-set analogue of Goodfellow et al. (2015).
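The efficiency claim above rests on never forming the Hessian explicitly: only Hessian-vector products are needed. The following is a minimal sketch of that idea, assuming a hypothetical L2-regularized logistic regression model rather than the paper's experimental setup; the function name hessian_vector_product and the toy data are illustrative only.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hessian_vector_product(theta, X, v, reg=1e-2):
    """Compute H @ v for the empirical risk of L2-regularized logistic regression.

    H = (1/n) * sum_i s_i * x_i x_i^T + reg * I, with s_i = sigmoid(x_i . theta) * (1 - sigmoid(x_i . theta)).
    The product is formed as (1/n) X^T (s * (X v)) + reg * v, so the d-by-d Hessian
    is never materialized; each product costs O(n d).
    """
    n = X.shape[0]
    s = sigmoid(X @ theta)
    s = s * (1.0 - s)                      # per-example curvature weights
    return X.T @ (s * (X @ v)) / n + reg * v

# Toy usage: compare against the explicit Hessian on a small problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
theta = rng.normal(size=5)
v = rng.normal(size=5)

s = sigmoid(X @ theta)
s = s * (1.0 - s)
H_explicit = (X.T * s) @ X / X.shape[0] + 1e-2 * np.eye(5)
print(np.allclose(H_explicit @ v, hessian_vector_product(theta, X, v)))  # True
```

Each product costs O(nd) for this model, which is the property that the second-order techniques cited above exploit when scaling influence computations to large models.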

2. Approach

Consider a prediction problem from some input space $\mathcal{X}$ (e.g., images) to an output space $\mathcal{Y}$ (e.g., labels). We are given training points $z_1, \dots, z_n$, where $z_i = (x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$. For a point $z$ and parameters $\theta \in \Theta$, let $L(z, \theta)$ be the loss, and let $\frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta)$ be the empirical risk. The empirical risk minimizer is given by $\hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta)$. Assume that the empirical risk is twice-differentiable and strictly convex in $\theta$; in Section 4 we explore relaxing these assumptions.

Upweighting a training point

Our goal is to understand the effect of training points on a model's predictions. We formalize this goal by asking the counterfactual: how would the model's predictions change if we did not have this training point?

Let us begin by studying the change in model parameters due to removing a point $z$ from the training set. Formally, this change is $\hat{\theta}_{-z} - \hat{\theta}$, where $\hat{\theta}_{-z} = \arg\min_{\theta \in \Theta} \sum_{z_i \neq z} L(z_i, \theta)$. However, retraining the model for each removed $z$ is prohibitively slow.

Fortunately, influence functions give us an efficient approximation. The idea is to compute the parameter change if $z$ were upweighted by some small $\epsilon$, giving us new parameters $\hat{\theta}_{\epsilon,z} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta)$. A classic result (Cook & Weisberg, 1982) tells us that the influence of upweighting $z$ on the parameters $\hat{\theta}$ is given by

$\mathcal{I}_{\text{up,params}}(z) = \frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\Big|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$,   (1)

where $H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla^2_\theta L(z_i, \hat{\theta})$ is the Hessian of the empirical risk and is positive definite by assumption. Since removing a point $z$ is the same as upweighting it by $\epsilon = -\frac{1}{n}$, we can linearly approximate the parameter change due to removing $z$ as $\hat{\theta}_{-z} - \hat{\theta} \approx -\frac{1}{n} \mathcal{I}_{\text{up,params}}(z)$ without retraining the model.
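To make equation (1) concrete, here is a hedged sketch on a toy problem: an L2-regularized logistic regression (an assumption made for illustration, not the paper's setup) fit with Newton's method. It computes $\mathcal{I}_{\text{up,params}}(z) = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$ for one training point and compares the predicted leave-one-out parameter change $-\frac{1}{n}\mathcal{I}_{\text{up,params}}(z)$ against actually retraining without that point. The helper names (grad_L, hess_L, fit) and the synthetic data are ours, not from the paper.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_L(theta, x, y, reg):
    """Gradient of L(z, theta) = log(1 + exp(-y * x.theta)) + (reg/2) * ||theta||^2, y in {-1, +1}."""
    return -y * x * sigmoid(-y * (x @ theta)) + reg * theta

def hess_L(theta, x, reg):
    """Hessian of the same per-example loss (it does not depend on y)."""
    p = sigmoid(x @ theta)
    return p * (1.0 - p) * np.outer(x, x) + reg * np.eye(len(theta))

def fit(X, Y, reg, steps=50):
    """Minimize the empirical risk (1/n) sum_i L(z_i, theta) with Newton's method."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        g = np.mean([grad_L(theta, X[i], Y[i], reg) for i in range(n)], axis=0)
        H = np.mean([hess_L(theta, X[i], reg) for i in range(n)], axis=0)
        theta = theta - np.linalg.solve(H, g)
    return theta

# Synthetic data (illustrative): n points, d features, labels in {-1, +1}.
rng = np.random.default_rng(0)
n, d, reg = 200, 5, 1e-2
X = rng.normal(size=(n, d))
true_theta = rng.normal(size=d)
Y = np.where(X @ true_theta + 0.1 * rng.normal(size=n) > 0, 1.0, -1.0)

theta_hat = fit(X, Y, reg)
H_hat = np.mean([hess_L(theta_hat, X[i], reg) for i in range(n)], axis=0)

# Equation (1): influence of upweighting training point z on the parameters.
idx = 0
I_up_params = -np.linalg.solve(H_hat, grad_L(theta_hat, X[idx], Y[idx], reg))

# Removing z is upweighting it by -1/n, so the leave-one-out change is about -(1/n) * I_up_params.
predicted_change = -I_up_params / n

# Compare with actually retraining without z.
mask = np.arange(n) != idx
theta_loo = fit(X[mask], Y[mask], reg)
print("predicted:", predicted_change)
print("actual:   ", theta_loo - theta_hat)
```

On a strictly convex toy problem like this, the two parameter-change vectors should agree closely, which is exactly the substitution for retraining that the text describes.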

Perturbing a training input

Let us develop a finer-grained notion of influence by studying a different counterfactual: how would the model's predictions change if a training input were modified?

For a training point $z = (x, y)$, define $z_\delta = (x + \delta, y)$. Consider the perturbation $z \to z_\delta$, and let $\hat{\theta}_{z_\delta,-z}$ be the empirical risk minimizer on the training points with $z_\delta$ in place of $z$. To approximate its effects, define the parameters resulting from moving $\epsilon$ mass from $z$ onto $z_\delta$: $\hat{\theta}_{\epsilon,z_\delta,-z} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z_\delta, \theta) - \epsilon L(z, \theta)$. An analogous calculation to (1) yields:

$\frac{d\hat{\theta}_{\epsilon,z_\delta,-z}}{d\epsilon}\Big|_{\epsilon=0} = \mathcal{I}_{\text{up,params}}(z_\delta) - \mathcal{I}_{\text{up,params}}(z) = -H_{\hat{\theta}}^{-1} \big( \nabla_\theta L(z_\delta, \hat{\theta}) - \nabla_\theta L(z, \hat{\theta}) \big)$.   (3)

As before, we can make the linear approximation $\hat{\theta}_{z_\delta,-z} - \hat{\theta} \approx \frac{1}{n} \big( \mathcal{I}_{\text{up,params}}(z_\delta) - \mathcal{I}_{\text{up,params}}(z) \big)$, giving us a closed-form estimate of the effect of $z \to z_\delta$ on the model. Analogous equations also apply for changes in $y$. While influence functions might appear to only work for infinitesimal (and therefore continuous) perturbations, it is important to note that this approximation holds for arbitrary $\delta$: the $\epsilon$-upweighting scheme allows us to smoothly interpolate between $z$ and $z_\delta$. This is particularly useful for working with discrete data (e.g., in NLP) or with discrete label changes. If $x$ is continuous and $\delta$ is small, we can further approximate (3).
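Equation (3) can be exercised the same way. The sketch below, again assuming a synthetic L2-regularized logistic regression problem (helpers and data are illustrative, not from the paper), estimates the parameter change caused by replacing $z = (x, y)$ with $z_\delta = (x + \delta, y)$ as $-\frac{1}{n} H_{\hat{\theta}}^{-1} \big( \nabla_\theta L(z_\delta, \hat{\theta}) - \nabla_\theta L(z, \hat{\theta}) \big)$ and compares it to retraining on the perturbed dataset.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_L(theta, x, y, reg):
    """Gradient of L(z, theta) = log(1 + exp(-y * x.theta)) + (reg/2) * ||theta||^2."""
    return -y * x * sigmoid(-y * (x @ theta)) + reg * theta

def fit(X, Y, reg, steps=50):
    """Newton's method for the empirical risk of L2-regularized logistic regression."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        g = np.mean([grad_L(theta, X[i], Y[i], reg) for i in range(n)], axis=0)
        p = sigmoid(X @ theta)
        H = (X.T * (p * (1 - p))) @ X / n + reg * np.eye(d)
        theta = theta - np.linalg.solve(H, g)
    return theta

rng = np.random.default_rng(0)
n, d, reg = 200, 5, 1e-2
X = rng.normal(size=(n, d))
true_theta = rng.normal(size=d)
Y = np.where(X @ true_theta + 0.1 * rng.normal(size=n) > 0, 1.0, -1.0)

theta_hat = fit(X, Y, reg)
p = sigmoid(X @ theta_hat)
H_hat = (X.T * (p * (1 - p))) @ X / n + reg * np.eye(d)

# Perturb training input x_0 -> x_0 + delta while keeping its label fixed.
idx = 0
delta = 0.5 * rng.normal(size=d)
x_pert = X[idx] + delta

# Equation (3): the parameter change is approximately
#   -(1/n) * H^{-1} (grad L(z_delta, theta_hat) - grad L(z, theta_hat)).
grad_diff = grad_L(theta_hat, x_pert, Y[idx], reg) - grad_L(theta_hat, X[idx], Y[idx], reg)
predicted_change = -np.linalg.solve(H_hat, grad_diff) / n

# Compare with actually retraining on the dataset where z has been replaced by z_delta.
X_pert = X.copy()
X_pert[idx] = x_pert
theta_pert = fit(X_pert, Y, reg)
print("predicted:", predicted_change)
print("actual:   ", theta_pert - theta_hat)
```

The same difference-of-influences form works for a label change (replace $y$ instead of $x$), which is what makes the $\epsilon$-upweighting view useful for discrete perturbations as noted above.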