Published by De Gruyter, March 2, 2018

What is Gained from Past Learning

Judea Pearl

Abstract

We consider ways of enabling systems to apply previously learned information to novel situations so as to minimize the need for retraining. We show that theoretical limitations exist on the amount of information that can be transported from previous learning, and that robustness to changing environments depends on a delicate balance between the relations to be learned and the causal structure of the underlying model. We demonstrate by examples how this robustness can be quantified.

1 Introduction

Assume that we have learned a certain relation $R$ in an environment that is governed by a probability function $P$. Now the environment changes and $P$ turns into $P^*$. We would still like to estimate $R$, but in the new environment, $P^*$. Much of the work on “Transfer Learning,” “Robust Learning,” “Domain Adaptation,” and “Life Long Learning” (L2L) hinges on the intuition that it would be a great waste to start learning $R(P^*)$ from scratch, instead of amortizing what we learned in $P$.

This intuition assumes, of course, that the two environments share some features in common, and that the shared features are significant in determining $R$. Surely, if the two environments are totally different, then we might as well start learning things from scratch – there is simply no other choice. Similarly, if the target relation $R$ is defined exclusively on the novel part of the new distribution $P^*$, no advantage would be realized by transferring what was learned in $P$.

To anchor this intuition in a formal setting[1] let us assume that the target relation R can be decomposed into a set S of sub-relations, and that these sub-relations fall into two categories:

$S_A$ – sub-relations on which $P$ and $P^*$ agree, and

$S_D$ – sub-relations on which $P$ and $P^*$ disagree.

One obvious saving that can be realized from knowing $S_A$ is in learning time. If we have trained the learner on 100 cases from each distribution, we can estimate $S_A$ using all 200 samples, and $S_D$ using the 100 samples from the new distribution $P^*$. The net result is that some portions of $R$ receive extra samples, which render them more precise, thus making the estimate of $R$ more precise (i.e., less susceptible to sampling bias). Conversely, if we aspire to achieve a given precision in $R$, fewer samples, or a shorter learning time, will be needed overall.

A simple example can illustrate this logic.

Example 1.

Let $X$ and $Y$ be two sets of variables governed by a joint distribution $P = P(x, y)$. $X$ could represent class labels and $Y$ a set of measurements, or features. If our task is to infer $X$ on the basis of measurements of $Y$, then the relation of interest is $R = P(x|y)$, which can be learned by drawing samples from $P$.

Let us assume that $P$ changes into $P^*$ such that the prior probability remains the same, $P^*(x) = P(x)$, but the conditional probability $P(y|x)$ changes. (This would be the case, for example, when the instruments for measuring $Y$ undergo changes.) We can either learn $P^*(x|y)$ from scratch, by drawing samples from $P^*$, or we can borrow samples drawn previously from $P$, pool them with what we observe in $P^*$, and obtain an improved estimate of $R = P^*(x|y)$. This can be done by decomposing $R$ into a product of prior and conditional probabilities, then capitalizing on the equality $P^*(x) = P(x)$.

We have:

$$R = P^*(x|y) = \frac{P^*(y|x)\,P^*(x)}{P^*(y)} = \frac{P^*(y|x)\,P(x)}{P^*(y)} \tag{1}$$

The last expression permits us to use the more precise estimate of $P(x)$, obtained from the pooled samples, rather than rely solely on the small-sample estimate of $P^*(x|y)$.

This simple example raises a fundamental question: Is it always beneficial to decompose a relation into components, estimate each component individually, some with improved precision, then recombine the results?

A competing intuition might claim that the exercise of decomposing, estimating, and combining introduces new sources of noise, compared to, say, estimating the relation in one shot.

The question is further complicated by the fact that decompositions are not unique. Eq. (1), for example, can also be written as:

$$R = P^*(x|y) = \frac{P^*(y|x)\,P(x)}{\sum_{x'} P^*(y|x')\,P(x')} \tag{2}$$

This calls for refraining from learning $P^*(y)$ directly in the new environment and, instead, estimating $P^*(y|x)$ for all $x$ in the new environment, then averaging the results, weighted by $P(x)$, to get a composite estimate of $P^*(y)$, as shown in the denominator.

It is not at all clear that the refinement offered by the denominator of (2) would improve precision over the estimator defined in (1). Assume, for example, that $Y$ is a single binary variable, whereas $X$ is a vector of continuous variables. Decomposing $P^*(y)$ as in the denominator of Eq. (2) would entail estimating all factors $P^*(y|x)$ and averaging the estimates. Estimating $P^*(y)$ from scratch, in contrast, may offer definite advantages, despite the fact that we have not borrowed any information from $P$.
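The two decompositions can be contrasted in a small sketch. It assumes discrete, integer-coded $X$ and $Y$, that every class of $X$ appears in the target sample, and a hypothetical helper name `posterior`; none of these come from the paper. The prior is estimated from pooled labels, the likelihood from target samples only, and the denominator is either the direct target estimate of $P^*(y)$ as in Eq. (1) or the composite sum of Eq. (2).

```python
import numpy as np

def posterior(x_src, x_tgt, y_tgt, n_x, n_y, denominator="direct"):
    """Plug-in estimate of P*(x|y); rows index x, columns index y.

    Illustrative assumptions: integer-coded data, and every x class
    occurs at least once in the target sample.
    """
    x_src, x_tgt, y_tgt = (np.asarray(v) for v in (x_src, x_tgt, y_tgt))
    # Prior P(x) = P*(x): labels pooled from both environments.
    pooled = np.concatenate([x_src, x_tgt])
    prior = np.bincount(pooled, minlength=n_x) / pooled.size
    # Likelihood P*(y|x): target samples only.
    counts = np.zeros((n_x, n_y))
    np.add.at(counts, (x_tgt, y_tgt), 1.0)
    lik = counts / counts.sum(axis=1, keepdims=True)
    joint = lik * prior[:, None]
    if denominator == "direct":
        # Eq. (1): P*(y) estimated directly from target frequencies.
        p_y = np.bincount(y_tgt, minlength=n_y) / y_tgt.size
    elif denominator == "composite":
        # Eq. (2): P*(y) assembled as sum_x P*(y|x) P(x).
        p_y = joint.sum(axis=0)
    else:
        raise ValueError("denominator must be 'direct' or 'composite'")
    return joint / p_y[None, :]

x_src = [0, 0, 0, 1]          # labels seen in the source environment
x_tgt = [0, 1, 1, 1]          # labels seen in the target environment
y_tgt = [0, 0, 1, 1]          # measurements seen in the target environment
direct = posterior(x_src, x_tgt, y_tgt, 2, 2, "direct")
composite = posterior(x_src, x_tgt, y_tgt, 2, 2, "composite")
print(direct)
print(composite)
```

With finite samples the two variants generally give different numbers, and only the composite form is guaranteed to sum to one over $x$, which is one concrete way in which the choice of decomposition matters.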

We thus ask the following questions:

  1. Given a relation $R$, which of its decompositions gains by borrowing and which does not?

  2. Which relations $R$ have a beneficial decomposition and which do not?

  3. Given that borrowing is beneficial, can we quantify the benefit?

2 The Transfer Benefit Ratio (TBR)

To get a theoretical handle on the problem, let us take the simple problem of estimating the regression coefficient τ of Y on X in the chain model of Fig. 1.

Figure 1: A chain model, $X \to Z \to Y$, where $b$ changes and $a$ remains the same.

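Here $b$ is the regression coefficient of $Y$ on $Z$,

$$b = \frac{\partial}{\partial z} E[Y \mid Z = z],$$

$a$ is the regression coefficient of $Z$ on $X$,

$$a = \frac{\partial}{\partial x} E[Z \mid X = x],$$

and, based on the chain structure:

$$R = \tau = \frac{\partial}{\partial x} E[Y \mid X = x] = ab.$$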

Let us assume that we estimate $a$ and $b$ using Ordinary Least Squares (OLS) on a large number ($N_1$) of cases from $P$. Now $b$ changes to a new value, while $a$ remains the same. How are we to estimate $\tau$ in the new environment $P^*$, if we can draw only a small number of samples ($N_2$) there?

We have two options:

  1. We ignore the estimates obtained in the training environment and estimate $\tau$ from scratch, obtaining

    $$\hat{\tau} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.$$

  2. We estimate $a$ and $b$ separately, and multiply their estimates, with $a$ receiving samples from both environments and $b$ from the new environment $P^*$ only.

Let $\hat{a}$ and $\hat{b}$ be, respectively, the OLS estimators of $a$ and $b$. To measure the benefit of borrowing the estimate $\hat{a}$ from the training environment, we need to compare the efficiency of $\hat{a}\hat{b}$ to that of $\hat{\tau}$, recalling that $\hat{a}$ is estimated using $N_1$ training cases from $P$, and $\hat{\tau}$ and $\hat{b}$ are estimated using $N_2$ training cases from the new environment $P^*$.

The ratio of the asymptotic variances of these two estimators will measure the merit of transferring knowledge from one environment to another, and will be called here the Transfer Benefit Ratio (TBR).
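The ratio just defined can be approximated by simulation. The following is a minimal sketch, assuming zero-mean Gaussian variables with $\mathrm{var}(X) = 1$, regression through the origin, and $N_1 > N_2$; the function names and parameter values are illustrative assumptions, not taken from the paper. The shared coefficient $a$ is estimated from the $N_1$ pooled cases, while $b$ and the one-shot $\tau$ use only the $N_2$ target cases.

```python
import numpy as np

def slope(u, v):
    """OLS slope of v on u through the origin, one estimate per row."""
    return (u * v).sum(axis=-1) / (u * u).sum(axis=-1)

def simulate_tbr(a, b_new, var_e1, var_e2, n1, n2, reps=500, seed=0):
    """Monte Carlo estimate of var(tau_hat) / var(a_hat * b_hat).

    Assumes n1 > n2: n1 - n2 source cases plus n2 target cases are pooled
    to estimate a; b_new is the Z -> Y coefficient in the target environment.
    """
    rng = np.random.default_rng(seed)
    # Source environment: only (X, Z) is needed, since b differs there.
    xs = rng.normal(size=(reps, n1 - n2))
    zs = a * xs + rng.normal(scale=np.sqrt(var_e1), size=xs.shape)
    # Target environment: full (X, Z, Y) chain with the new coefficient.
    xt = rng.normal(size=(reps, n2))
    zt = a * xt + rng.normal(scale=np.sqrt(var_e1), size=xt.shape)
    yt = b_new * zt + rng.normal(scale=np.sqrt(var_e2), size=xt.shape)
    tau_hat = slope(xt, yt)                        # one-shot estimator
    a_hat = slope(np.concatenate([xs, xt], axis=1),
                  np.concatenate([zs, zt], axis=1))  # pooled estimate of a
    b_hat = slope(zt, yt)                          # target-only estimate of b
    return tau_hat.var() / (a_hat * b_hat).var()

# A noisy shared stage and a nearly noiseless novel stage.
print(simulate_tbr(a=0.5, b_new=2.0, var_e1=1.0, var_e2=0.25, n1=2000, n2=50))
```

Because the result is a ratio of sample variances over a finite number of replications, it fluctuates around the asymptotic value rather than reproducing it exactly.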

This measure translates directly to improvement in the learning speed. When TBR is high, a small number of cases (N2) in the novel environment would be sufficient to achieve a given precision, whereas a low TBR would require a high number of cases to achieve such precision.

Intuitively, the benefit of transfer would be more pronounced when the part shared by the two environments is noisy and the novel part is noiseless. Under such conditions, assessments of the target quantity τ are highly vulnerable to inaccuracies in estimating the relation between X and Z, and it is here that the training conducted in P can be most beneficial.

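Exact analysis (see Appendix I) reveals that, for $N_2 \ll N_1$, the TBR is given by the following formula

$$\mathrm{TBR}_{N_2/N_1 \to 0} = \frac{1 - \rho_a^2\rho_b^2}{\rho_a^2\,(1 - \rho_b^2)}, \tag{3}$$

where $\rho_a^2$ and $\rho_b^2$ are the squared correlation coefficients

$$\rho_a^2 = \frac{\mathrm{cov}^2(X, Z)}{\mathrm{var}(X)\,\mathrm{var}(Z)}, \qquad \rho_b^2 = \frac{\mathrm{cov}^2(Y, Z)}{\mathrm{var}(Y)\,\mathrm{var}(Z)}. \tag{4}$$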

Equation (3) quantifies the intuition that transfer learning is more beneficial when the novelty between the two environments is almost deterministic ($\rho_b$ approaches 1), so that the few observations conducted in the new environment would suffice to complete the adaptation.
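As a worked illustration of the limiting formula, with values chosen here purely for arithmetic convenience (they do not appear in the paper), compare a noisy shared stage paired with a nearly deterministic novel stage against the reverse configuration:

```latex
% Noisy shared stage, nearly deterministic novel stage:
\rho_a^2 = 0.2,\ \rho_b^2 = 0.9:\qquad
\mathrm{TBR}_{N_2/N_1 \to 0}
  = \frac{1 - 0.2 \times 0.9}{0.2 \times (1 - 0.9)}
  = \frac{0.82}{0.02} = 41

% Nearly deterministic shared stage, noisy novel stage:
\rho_a^2 = 0.9,\ \rho_b^2 = 0.1:\qquad
\mathrm{TBR}_{N_2/N_1 \to 0}
  = \frac{1 - 0.9 \times 0.1}{0.9 \times (1 - 0.1)}
  = \frac{0.91}{0.81} \approx 1.12
```

In the first configuration the borrowed samples reduce the variance by a factor of about 41; in the second, transfer contributes almost nothing.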

Appendix I generalizes this result to any $N_1/N_2$ ratio and presents 3-dimensional charts of how the TBR varies with both the $N_1/N_2$ ratio and the statistics of $X$, $Y$, and $Z$. Remarkably, it shows that the TBR is greater than unity even for $N_1 = N_2$. This means that there is benefit to the two-step estimation of $\tau$ (using the product $\hat{a}\hat{b}$) over the single-step estimator $\hat{\tau}$, even when the environment does not change and we are faced with the problem of estimating $\tau$ given the chain model of Fig. 1. This phenomenon reflects a more general pattern in estimation: proper utilization of modeling assumptions can improve estimation efficiency, provided those assumptions are valid [3], [4].

Clearly, this exercise is oversimplified in that it assumes just two linear relationships, $X \to Z$ and $Z \to Y$, one invariant and one novel. Yet such rudimentary analysis must be conducted to understand the speed-up provided by prior learning, the factors that determine this speed-up, and how to optimize those factors.

In more realistic situations, it is not at all clear that a speed-up would be achieved regardless of the problem structure. In our example, we capitalized on the chain structure, which rendered $X$ and $Y$ conditionally independent given $Z$. Under such conditions, the product estimator is superior to the one-shot estimator even when no environmental change takes place (i.e., $N_1 = N_2$). On the other hand, when $N_1 > N_2$, the benefit of transfer learning is realized even in the absence of independence constraints.

We have so far not considered the possibility of minimizing the number of variables that need to be measured in the new environment. Cases exist where, despite differences between $P$ and the new distribution $P^*$, $R$ can be estimated entirely in the source environment, without taking any measurements in $P^*$. In other cases, some measurements in the new environment are needed, but the number of variables involved can be minimized by proper design [1].

3 Conclusions

We have demonstrated by simple examples that it is possible to quantify the benefit of borrowing information from previous learning, and that this benefit depends on the structure of the data generating model. This leaves open the general question of deciding, for any given relation R, how can it best benefit from previous learning, and how robust can it be to changes in the target environment? We conjecture that the understanding of such theoretical questions is necessary for designing algorithms that take maximum advantage of previous learning and spend minimum resources on re-learning that which could be borrowed.

Funding source: Defense Advanced Research Projects Agency

Award Identifier / Grant number: #W911NF-16-057

Funding source: National Science Foundation

Award Identifier / Grant number: #IIS-1302448

Award Identifier / Grant number: #IIS-1527490

Award Identifier / Grant number: #IIS-1704932

Funding source: Office of Naval Research

Award Identifier / Grant number: #N00014-17-S-B001

Funding statement: This research was supported in part by grants from Defense Advanced Research Projects Agency #W911NF-16-057, National Science Foundation #IIS-1302448, #IIS-1527490, and #IIS-1704932, and Office of Naval Research #N00014-17-S-B001.

Acknowledgment

I am indebted to Professor Jinyong Hahn for teaching me the secrets of asymptotic variance analysis. The 3-dimensional plots of Fig. 3 were produced by Elias Bareinboim.

Appendix I Composition and transfer in a two-stage process

In experiments involving a two-stage process as in Fig. 1, Cox has shown that the estimated regression coefficient between treatment and response has a reduced variance if computed as a product of two estimates, one for each stage of the process [3]. Below we summarize Cox’s analysis and adapt it to the problem of information transfer across populations.

Figure 2: A two-stage process with intermediate variable $Z$.

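The linear model depicted in Fig. 1 can be represented by the following structural equations:

$$z = ax + \epsilon_1, \qquad y = bz + \epsilon_2 \qquad \text{with } \mathrm{cov}(x, \epsilon_1) = \mathrm{cov}(x, \epsilon_2) = \mathrm{cov}(\epsilon_1, \epsilon_2) = 0. \tag{5}$$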

The process is depicted in Fig. 2. Our target of analysis is the regression coefficient of Y on X, i. e., the coefficient of x in the equation

$$y = \tau x + \epsilon_3 \qquad \text{with } \mathrm{cov}(x, \epsilon_3) = 0. \tag{6}$$

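As before, let $\hat{a}$, $\hat{b}$, and $\hat{\tau}$ be, respectively, the OLS estimators of $a$, $b$, and $\tau$. Cox showed that the asymptotic variance of $\hat{\tau}$ is greater than that of the product $\hat{a}\hat{b}$, or

$$\mathrm{var}(\hat{\tau}) / \mathrm{var}(\hat{a}\hat{b}) \ge 1,$$

with equality holding only in pathological cases of perfect determinism. Specifically, he computed the $n$-sample variances to be:

$$\mathrm{var}(\hat{\tau}) = \frac{\mathrm{var}(\epsilon_2) + b^2\,\mathrm{var}(\epsilon_1)}{n\,\mathrm{var}(X)} \tag{7}$$

$$\mathrm{var}(\hat{b}) = \frac{\mathrm{var}(\epsilon_2)}{n\,[a^2\,\mathrm{var}(X) + \mathrm{var}(\epsilon_1)]} \tag{8}$$

$$\mathrm{var}(\hat{a}) = \frac{\mathrm{var}(\epsilon_1)}{n\,\mathrm{var}(X)} \tag{9}$$

$$\mathrm{var}(\hat{a}\hat{b}) = a^2\,\mathrm{var}(\hat{b}) + b^2\,\mathrm{var}(\hat{a}) = \frac{a^2\,\mathrm{var}(X)\,[\mathrm{var}(\epsilon_2) + b^2\,\mathrm{var}(\epsilon_1)] + b^2\,\mathrm{var}^2(\epsilon_1)}{n\,\mathrm{var}(X)\,[a^2\,\mathrm{var}(X) + \mathrm{var}(\epsilon_1)]}. \tag{10}$$

Thus,

$$\frac{\mathrm{var}(\hat{\tau})}{\mathrm{var}(\hat{a}\hat{b})} = \frac{a^2\,\mathrm{var}(X) + \mathrm{var}(\epsilon_1)}{a^2\,\mathrm{var}(X) + \mathrm{var}(\epsilon_1)\,b^2\,\mathrm{var}(\epsilon_1) / [\mathrm{var}(\epsilon_2) + b^2\,\mathrm{var}(\epsilon_1)]} = \frac{a^2\,\mathrm{var}(X) + \mathrm{var}(\epsilon_1)}{a^2\,\mathrm{var}(X) + \mathrm{var}(\epsilon_1)\,F} \tag{11}$$

which is at least 1 because $F = b^2\,\mathrm{var}(\epsilon_1) / [\mathrm{var}(\epsilon_2) + b^2\,\mathrm{var}(\epsilon_1)] \le 1$.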

The relation to transfer learning surfaces when $a$ and $b$ are estimated from two diverse populations, $\Pi$ and $\Pi^*$. Let us assume that $a$ is the same in the two populations, and is estimated by $\hat{a}$ using $N_1$ samples, pooled from both. $b$ is presumed to be different, and is estimated by $\hat{b}$ using $N_2$ samples from $\Pi^*$ alone. We need to compare the efficiency of estimating $\tau$ using the product $\hat{a}\hat{b}$ to that of estimating $\tau$ directly, using $N_2$ samples from $\Pi^*$. The TBR, or the ratio of the asymptotic variances of these two estimators, can now be calculated as follows:

Keeping track of the number of samples entering each estimator, we have

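$$\mathrm{var}(\hat{\tau}; N_2) = \frac{\mathrm{var}(\epsilon_2) + b^2\,\mathrm{var}(\epsilon_1)}{N_2\,\mathrm{var}(X)} \tag{12}$$

$$\mathrm{var}(\hat{b}; N_2) = \frac{\mathrm{var}(\epsilon_2)}{N_2\,[a^2\,\mathrm{var}(X) + \mathrm{var}(\epsilon_1)]} \tag{13}$$

$$\mathrm{var}(\hat{a}; N_1) = \frac{\mathrm{var}(\epsilon_1)}{N_1\,\mathrm{var}(X)} \tag{14}$$

$$\mathrm{var}(\hat{a}\hat{b}; N_1, N_2) = a^2\,\mathrm{var}(\hat{b}) + b^2\,\mathrm{var}(\hat{a}) = \frac{N_1 a^2\,\mathrm{var}(X)\,\mathrm{var}(\epsilon_2) + N_2 b^2\,\mathrm{var}(\epsilon_1)\,[a^2\,\mathrm{var}(X) + \mathrm{var}(\epsilon_1)]}{N_1 N_2\,\mathrm{var}(X)\,[a^2\,\mathrm{var}(X) + \mathrm{var}(\epsilon_1)]}. \tag{15}$$

Taking the ratio, we have

$$\mathrm{TBR} = \frac{\mathrm{var}(\hat{\tau}; N_2)}{\mathrm{var}(\hat{a}\hat{b}; N_1, N_2)} \tag{16}$$

$$= \frac{N_1\,[a^2\,\mathrm{var}(X) + \mathrm{var}(\epsilon_1)]\,[\mathrm{var}(\epsilon_2) + b^2\,\mathrm{var}(\epsilon_1)]}{a^2\,\mathrm{var}(X)\,[N_1\,\mathrm{var}(\epsilon_2) + N_2 b^2\,\mathrm{var}(\epsilon_1)] + N_2 b^2\,\mathrm{var}^2(\epsilon_1)} \tag{17}$$

$$= \frac{a^2\,\mathrm{var}(X) + \mathrm{var}(\epsilon_1)}{a^2\,\mathrm{var}(X)\,F_1 + \mathrm{var}(\epsilon_1)\,F_2}, \tag{18}$$

where

$$F_1 = \frac{\mathrm{var}(\epsilon_2) + b^2\,\mathrm{var}(\epsilon_1)\,N_2/N_1}{\mathrm{var}(\epsilon_2) + b^2\,\mathrm{var}(\epsilon_1)} \tag{19}$$

and

$$F_2 = \frac{N_2\,b^2\,\mathrm{var}(\epsilon_1)}{N_1\,[\mathrm{var}(\epsilon_2) + b^2\,\mathrm{var}(\epsilon_1)]}. \tag{20}$$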

Figure 3: Illustrating the behavior of the Transfer Benefit Ratio (Eq. (21)) for different values of $N_2/N_1$, with the X and Y axes representing $\rho_a$ and $\rho_b$, respectively. (a) $N_2/N_1 = 1$ (no transfer): TBR represents the benefit of decomposition alone. (c) $N_2/N_1 = 0.5$ represents data sharing between two equi-sampled studies. (d) $N_2/N_1 = 0.1$, showing a more pronounced benefit near the $\rho_b = 1$ region, where the $Z \to Y$ process becomes noiseless. (f) The limit case when $N_2/N_1 \to 0$, showing marked benefit throughout the $\rho_b = 1$ and $\rho_a = 0$ regions, and no benefit near the $\rho_b = 0, \rho_a = 1$ corner.

Since both F1 and F2 are smaller than 1 for N2<N1, we conclude that the TBR is greater than one for N2<N1, which means that it is beneficial to decompose the estimation task into two stages and use a higher number of samples, N1, to estimate the shared component: cov(X,Z).

Expression (17) can be simplified using correlation coefficients, as defined in Eq. (4) and gives:

$$\mathrm{TBR} = \frac{1 - \rho_a^2\rho_b^2}{\rho_a^2\,(1 - \rho_b^2) + \rho_b^2\,(1 - \rho_a^2)\,N_2/N_1} \tag{21}$$
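The passage from the variance expressions to the correlation form can be spelled out as follows. This is a sketch of the intermediate algebra, using only the model of Eq. (5) and the definitions in Eq. (4); the sample-size factor multiplying the second denominator term is $N_2/N_1$, so that letting $N_2/N_1 \to 0$ recovers Eq. (3).

```latex
% From Eq. (5): var(Z) = a^2 var(X) + var(e1),  var(Y) = b^2 var(Z) + var(e2)
\rho_a^2 = \frac{a^2\,\mathrm{var}(X)}{\mathrm{var}(Z)}, \quad
1 - \rho_a^2 = \frac{\mathrm{var}(\epsilon_1)}{\mathrm{var}(Z)}, \quad
\rho_b^2 = \frac{b^2\,\mathrm{var}(Z)}{\mathrm{var}(Y)}, \quad
1 - \rho_b^2 = \frac{\mathrm{var}(\epsilon_2)}{\mathrm{var}(Y)}

% Numerator: N_2 var(X) times the one-shot variance
N_2\,\mathrm{var}(X)\,\mathrm{var}(\hat{\tau}; N_2)
  = \mathrm{var}(\epsilon_2) + b^2\,\mathrm{var}(\epsilon_1)
  = \mathrm{var}(Y)\,\bigl[(1 - \rho_b^2) + \rho_b^2 (1 - \rho_a^2)\bigr]
  = \mathrm{var}(Y)\,(1 - \rho_a^2 \rho_b^2)

% Denominator: N_2 var(X) times the product-estimator variance
N_2\,\mathrm{var}(X)\,\mathrm{var}(\hat{a}\hat{b}; N_1, N_2)
  = \frac{a^2\,\mathrm{var}(X)\,\mathrm{var}(\epsilon_2)}{\mathrm{var}(Z)}
    + \frac{N_2}{N_1}\, b^2\,\mathrm{var}(\epsilon_1)
  = \mathrm{var}(Y)\,\Bigl[\rho_a^2 (1 - \rho_b^2)
    + \frac{N_2}{N_1}\,\rho_b^2 (1 - \rho_a^2)\Bigr]
```

Dividing the first expression by the second cancels $\mathrm{var}(Y)$ and leaves the ratio in terms of $\rho_a^2$, $\rho_b^2$, and $N_2/N_1$ alone.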

The behavior of Eq. (21) for different values of N2/N1 is illustrated in Fig. 3(a, b, c, d).
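The closed form can be evaluated directly to reproduce the qualitative behavior described for the figure panels. This is a minimal sketch: the function name `tbr` and the sample values are illustrative assumptions, and `ratio` stands for $N_2/N_1$.

```python
def tbr(rho_a2, rho_b2, ratio):
    """Transfer Benefit Ratio in correlation form.

    rho_a2, rho_b2: squared correlations of the X-Z and Z-Y stages.
    ratio: N2 / N1, between 0 (limit case) and 1 (no transfer).
    The expression is undefined when the denominator vanishes,
    e.g. rho_a2 == 0 together with ratio == 0.
    """
    numerator = 1.0 - rho_a2 * rho_b2
    denominator = rho_a2 * (1.0 - rho_b2) + ratio * rho_b2 * (1.0 - rho_a2)
    return numerator / denominator

# Noisy shared stage, nearly deterministic novel stage, at several N2/N1.
for ratio in (1.0, 0.5, 0.1, 0.0):
    print(ratio, tbr(0.2, 0.9, ratio))
```

Setting `ratio` to 1 gives the benefit of decomposition alone, and setting it to 0 gives the limiting expression of Eq. (3).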

For $N_2 = N_1$ we obtain Cox’s ratio (11), which quantifies the benefit of decomposition alone, without transfer. The ratio greatly exceeds one when both $\rho_a^2$ and $\rho_b^2$ are small, and approaches one when either or both of $\rho_a^2$ and $\rho_b^2$ are near one. This means that the benefit of decomposition is substantial if and only if both processes are noisy, whereas if either one of them comes close to being deterministic, decomposition has no benefit.

This is reasonable; there is no benefit to decomposition unless Z brings new information which is not already in X or Y.

For $N_2 < N_1$, however, the TBR represents the benefit of both decomposition and transfer. For the ratio to greatly exceed one we now need that both $\rho_a^2$ and $\rho_b^2$ be small. However, the TBR becomes unity (useless transfer) only when $\rho_a$ is unity; $\rho_b = 1$ does not render it useless. This means that transfer is useless only when the process in agreement ($X \to Z$) is deterministic. Having disagreement on a deterministic mechanism does not make the transfer useless, as long as the process in agreement is corrupted by noise and can benefit from the extra samples from $\Pi$.

Indeed, taking the extreme case of a deterministic $Z \to Y$ process ($\rho_b = 1$), there is a definite advantage to borrowing $N_1$ samples from the source population to estimate $a$ and multiply it by $b$, rather than estimating $\tau$ directly with the $N_2$ samples available at the target population. Two such samples can determine $b$ precisely, and can hardly aid in the estimation of $a$.
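The size of this advantage follows from substituting $\rho_b^2 = 1$ into the correlation form of the TBR, with the sample-size factor $N_2/N_1$ multiplying the second denominator term. The sample sizes in the numerical line are illustrative assumptions only:

```latex
\mathrm{TBR}\,\Big|_{\rho_b^2 = 1}
  = \frac{1 - \rho_a^2}{\rho_a^2 \cdot 0 + (1 - \rho_a^2)\, N_2/N_1}
  = \frac{N_1}{N_2}, \qquad \rho_a^2 < 1 .

% Example: N_1 = 1000 pooled cases, N_2 = 10 target cases
\mathrm{TBR} = \frac{1000}{10} = 100
```

In this extreme case the variance reduction equals the sample-size ratio itself, regardless of how noisy the shared $X \to Z$ stage is, as long as it is not deterministic.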

The limit of the TBR as $N_1/N_2$ increases indefinitely represents transfer between a highly explored environment (large $N_1$) and a highly novel one (low $N_2$). The limit of (21) reads:

$$\mathrm{TBR}_{N_2/N_1 \to 0} = \frac{1 - \rho_a^2\rho_b^2}{\rho_a^2\,(1 - \rho_b^2)},$$

which establishes Eq. (3). It reveals that the Transfer Benefit Ratio will be most significant when the populations share noisy components (e. g., low correlation between X and Z) and differ in noiseless components (high correlation between Y and Z). Under such conditions, accurate assessment of the target quantity τ is highly vulnerable to inaccuracies in estimating a, and it is here that the large sample taken from Π can be most beneficial.

Appendix II Extension to saturated models

Figure 4: Saturated model in which $Y$ depends on both $X$ and $Z$.

In Appendix I, the benefit of transfer learning was demonstrated using an “over-identified” model (Fig. 2) which embodied the conditional independence $X \perp Y \mid Z$, and for which the product estimator $\hat{a}\hat{b}$ was consistent. The question we analyze in this Appendix is whether benefit can be demonstrated in “saturated” models as well (also called “just identified”), such as the one depicted in Fig. 4.

This model represents the following regression equations

$$y = bz + cx + \epsilon_2, \qquad z = ax + \epsilon_1$$

and the target quantity is again the total regression coefficient τ in the equation

$$y = \tau x + \epsilon \qquad \text{with } \mathrm{cov}(x, \epsilon) = 0,$$

which is given by $\tau = \mathrm{cov}(X, Y)/\mathrm{var}(X) = c + ab$.

Again, τ can be estimated in two ways:

  1. A one-shot way: compute the OLS regression of $Y$ on $X$; call this estimator $\hat{\tau}$.

  2. A two-shot way: compute the sum $\hat{\theta} = \hat{c} + \hat{a}\hat{b}$, where $\hat{a}$, $\hat{b}$, and $\hat{c}$ are the OLS estimators of $a$, $b$, and $c$, respectively.

We now ask whether the variance of the composite estimator $\hat{\theta}$ will be smaller than that of the one-shot estimator $\hat{\tau}$, as we have seen in the over-identified model of Fig. 1. We further ask whether data sharing would be beneficial in case $a$ is the same in both populations while $b$ and $c$ are different.

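Using an analysis similar to that of Appendix I, one can show that the answer to the first question is negative, while that of the second question is positive. In other words, we lose the intrinsic advantage of decomposition, but we can still draw advantage from data sharing if $a$ is the same in the two populations. Formally, while the efficiency of the composite estimator $\hat{\theta} = \hat{a}\hat{b} + \hat{c}$ is identical to that of the one-shot estimator $\hat{\tau}$,[2] the variance of the former can be reduced if $a$ is estimated using a larger sample than would be available to the one-shot estimator. In particular, assuming that $\hat{a}$ is estimated using $N_1$ samples and $\hat{b}$, $\hat{c}$, and $\hat{\tau}$ using $N_2$ samples, the asymptotic variances of $\hat{\theta}$ and $\hat{\tau}$ can be obtained by the delta method, and read:

$$\mathrm{var}(\hat{c} + \hat{a}\hat{b}) = \frac{\mathrm{var}(\epsilon_2)}{N_2\,\mathrm{var}(X)} + \frac{b^2\,\mathrm{var}(\epsilon_1)}{N_1\,\mathrm{var}(X)} \tag{22}$$

$$\mathrm{var}(\hat{\tau}) = \frac{b^2\,\mathrm{var}(\epsilon_1) + \mathrm{var}(\epsilon_2)}{N_2\,\mathrm{var}(X)} \tag{23}$$

Consequently, the TBR is given by

$$\mathrm{TBR} = \frac{\mathrm{var}(\hat{\tau})}{\mathrm{var}(\hat{c} + \hat{a}\hat{b})} = \left[1 - \left(1 - \frac{N_2}{N_1}\right) \frac{b^2\,\mathrm{var}(\epsilon_1)}{b^2\,\mathrm{var}(\epsilon_1) + \mathrm{var}(\epsilon_2)}\right]^{-1}. \tag{24}$$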

We see that for a single population and $N_1 = N_2$, decomposition in itself carries no benefit ($\mathrm{TBR} = 1$); the one-shot estimator is as good as the two-shot estimator. This stands in contrast to the over-identified model of Fig. 1, for which the TBR was greater than unity (Eq. (21)) except in pathological cases. Moreover, the loss of benefit is not due to the disappearance of over-identification conditions from the model, but due to the composite estimator’s failure to detect and utilize such conditions when they are valid. This can be seen from the fact that Eq. (24) (as well as the equality $\hat{\tau} = \hat{c} + \hat{a}\hat{b}$) remains unaltered even when $c = 0$. In other words, it is not the actual value of $c$ that counts but the structure of the estimator we postulate. If we are ignorant of the fact that $c = 0$ in the actual model and go through the trouble of estimating $\tau$ by the sum $\hat{c} + \hat{a}\hat{b}$, instead of $\hat{a}\hat{b}$, the variance will be greater than what we would have gotten had we detected the model structure correctly and used the estimator $\hat{\tau} = \hat{a}\hat{b}$ to reflect our knowledge.
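The equality between the one-shot and composite estimators on a single sample can be checked numerically. The sketch below assumes zero-mean simulated data and regressions through the origin; the helper name and the coefficient values are illustrative assumptions. The data are generated with $c = 0$, yet the identity still holds, since it follows from the OLS normal equations rather than from the value of $c$.

```python
import numpy as np

def one_shot_and_composite(x, z, y):
    """Return (tau_hat, theta_hat) computed from the same sample.

    tau_hat:   OLS slope of y on x (through the origin).
    theta_hat: c_hat + a_hat * b_hat, where a_hat is the slope of z on x
               and (c_hat, b_hat) come from regressing y on (x, z) jointly.
    """
    tau_hat = (x @ y) / (x @ x)
    a_hat = (x @ z) / (x @ x)
    design = np.column_stack([x, z])
    c_hat, b_hat = np.linalg.lstsq(design, y, rcond=None)[0]
    return tau_hat, c_hat + a_hat * b_hat

rng = np.random.default_rng(1)
x = rng.normal(size=200)
z = 0.7 * x + rng.normal(size=200)
y = 1.5 * z + 0.0 * x + rng.normal(size=200)   # c = 0 in the generating model
tau_hat, theta_hat = one_shot_and_composite(x, z, y)
print(tau_hat, theta_hat, abs(tau_hat - theta_hat))
```

The two numbers agree up to floating-point error, which is why the composite estimator cannot improve on the one-shot estimator unless $\hat{a}$ is computed from a larger sample.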

For $N_2/N_1 < 1$, however, the picture changes dramatically; Eq. (24) demonstrates a definite benefit to composite estimation ($\mathrm{TBR} > 1$), which increases with the noise variance of the $X \to Z$ process ($\mathrm{var}(\epsilon_1)$ in Eq. (24)). The intuition is similar to that given in Appendix I: when the $Z \to Y$ process was almost deterministic, we obtained $\mathrm{TBR} > 1$. Here too, if the $Y$ equation is deterministic, we can estimate it precisely with just a few samples ($N_2$) from $P^*$ and use the additional ($N_1 - N_2$) samples for estimating the noisy $X \to Z$ process, which is common to both populations. The one-shot estimator will suffer from this noise if allowed only $N_2$ samples from $P^*$.
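The saturated-model ratio can be tabulated with a few lines of code. In this sketch the function name and the numbers are illustrative assumptions; `var_e1` denotes the noise variance of the $X \to Z$ equation and `var_e2` that of the $Y$ equation, following the convention of Eqs. (22)–(24), and `ratio` stands for $N_2/N_1$.

```python
def tbr_saturated(b, var_e1, var_e2, ratio):
    """TBR of the saturated model as a function of N2/N1 (= ratio).

    share is the fraction of the one-shot variance that is due to the
    noise of the shared X -> Z stage; only that part shrinks with N1.
    """
    share = b**2 * var_e1 / (b**2 * var_e1 + var_e2)
    return 1.0 / (1.0 - (1.0 - ratio) * share)

print(tbr_saturated(b=2.0, var_e1=1.0, var_e2=0.25, ratio=1.0))  # no transfer
print(tbr_saturated(b=2.0, var_e1=1.0, var_e2=0.25, ratio=0.1))  # N1 = 10 * N2
print(tbr_saturated(b=2.0, var_e1=4.0, var_e2=0.25, ratio=0.1))  # noisier shared stage
```

With `ratio` equal to 1 the function returns exactly 1, and for smaller ratios the benefit grows as the shared stage becomes noisier relative to the $Y$ equation.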

References

1. Pearl J, Bareinboim E. Transportability of causal and statistical relations: A formal approach. In: Burgard W, Roth D, editors. Proceedings of the Twenty-Fifth Conference on Artificial Intelligence (AAAI-11). Menlo Park, CA: AAAI Press; 2011. p. 247–54. Available at: http://ftp.cs.ucla.edu/pub/stat_ser/r372a.pdf.

2. Pearl J, Bareinboim E. External validity: From do-calculus to transportability across populations. Statistical Science. 2014;29:579–95.

3. Cox D. Regression analysis when there is prior information about supplementary variables. The Journal of the Royal Statistical Society, Series B. 1960;22:172–6.

4. Pearl J. Some thoughts concerning transfer learning, with applications to meta-analysis and data-sharing estimation. Tech. Rep. R-387. Los Angeles, CA: Department of Computer Science, University of California; 2012. Working paper. http://ftp.cs.ucla.edu/pub/stat_ser/r387.pdf.

5. Hahn J, Pearl J. Precision of composite estimators. Tech. Rep. R-388. Los Angeles, CA: Department of Computer Science, University of California; 2011. In preparation. http://ftp.cs.ucla.edu/pub/stat_ser/r388.pdf.

Published Online: 2018-3-2
Published in Print: 2018-3-26

© 2018 Walter de Gruyter GmbH, Berlin/Boston