We consider ways of enabling systems to apply previously learned information to novel situations so as to minimize the need for retraining. We show that theoretical limitations exist on the amount of information that can be transported from previous learning, and that robustness to changing environments depends on a delicate balance between the relations to be learned and the causal structure of the underlying model. We demonstrate by examples how this robustness can be quantified.
Assume that we have learned a certain relation R in an environment governed by a probability function P. Now the environment changes and P turns into P*. We would still like to estimate R, but in the new environment, governed by P*. Much of the work on “Transfer Learning,” “Robust Learning,” “Domain Adaptation,” and “Life Long Learning” (L2L) hinges on the intuition that it would be a great waste to start learning from scratch, instead of amortizing what we learned in P.
This intuition assumes, of course, that the two environments share some features in common, and that the shared features are significant in determining R. Surely, if the two environments are totally different, then we might as well start learning from scratch – there is simply no other choice. Similarly, if the target relation R is defined exclusively on the novel part of P*, no advantage would be realized by transferring what was learned in P.
To anchor this intuition in a formal setting let us assume that the target relation R can be decomposed into a set S of sub-relations, and that these sub-relations fall into two categories:
– sub-relations on which P and P* agree, and
– sub-relations on which P and P* disagree.
One obvious saving that can be realized from such a decomposition is in learning time. If we have trained the learner on 100 cases from each distribution, we can estimate the sub-relations in the first category using all 200 samples, and those in the second category using the 100 samples from P*. The net result is that some portions of R receive extra samples, which render them more precise, thus making the estimate of R more precise (i.e., less susceptible to sampling variability). Conversely, if we aspire to achieve a given precision in R, fewer samples, or shorter learning time, would be needed overall.
A simple example can illustrate this logic.
Let X and Y be two sets of variables governed by a joint distribution P(x, y). X could represent class labels and Y a set of measurements, or features. If our task is to infer X on the basis of measurements of Y, then the relation of interest is R = P(x|y), which can be learned by drawing samples from P.
Let us assume that P changes into P* such that the prior probability remains the same, P*(x) = P(x), but the conditional probability P(y|x) changes into P*(y|x). (This would be the case, for example, when the instruments for measuring Y undergo changes.) We can either learn R* = P*(x|y) from scratch, by drawing samples from P*, or we can borrow samples drawn previously from P, pool them with what we observe in P*, and obtain an improved estimate of P*(x|y). This can be done by decomposing R* into a product of prior and conditional probabilities, then capitalizing on the equality P*(x) = P(x):

$$P^*(x \mid y) = \frac{P^*(y \mid x)\, P^*(x)}{P^*(y)} = \frac{P^*(y \mid x)\, P(x)}{P^*(y)} \qquad (1)$$
The last expression permits us to use the more precise estimate of P(x), obtained from the pooled samples, rather than rely solely on the small-sample estimate of P*(x).
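To make the borrowing concrete, here is a minimal sketch of the composite estimator of Eq. (1). Everything here is an illustrative assumption of ours (binary X and Y, the sample sizes, and the synthetic distributions): the invariant prior P(x) is estimated from the pooled samples of both environments, while the conditional P*(y|x) is estimated from the new environment alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth: P(x) is shared across environments; P(y|x) changes.
p_x = np.array([0.3, 0.7])                  # P(x) = P*(x) (invariant prior)
p_y_given_x_new = np.array([[0.9, 0.1],     # P*(y|x): rows indexed by x
                            [0.2, 0.8]])

n_old, n_new = 10000, 100                    # many source samples, few target samples

# Source samples inform the prior; target samples inform P*(y|x).
x_old = rng.choice(2, size=n_old, p=p_x)
x_new = rng.choice(2, size=n_new, p=p_x)
y_new = np.array([rng.choice(2, p=p_y_given_x_new[x]) for x in x_new])

# Small-sample prior (target only) vs. pooled prior (both environments).
p_x_direct = np.bincount(x_new, minlength=2) / n_new
p_x_pooled = np.bincount(np.concatenate([x_old, x_new]), minlength=2) / (n_old + n_new)

# Estimate P*(y|x) from the target environment only.
p_yx = np.zeros((2, 2))
for x in range(2):
    mask = x_new == x
    p_yx[x] = np.bincount(y_new[mask], minlength=2) / mask.sum()

def posterior(prior):
    """Compose P*(x|y) via Bayes rule: P*(x|y) proportional to prior(x) * P*(y|x)."""
    joint = prior[:, None] * p_yx        # joint[x, y]
    return joint / joint.sum(axis=0)     # normalize each column over x

post_pooled = posterior(p_x_pooled)      # Eq. (1): borrows P(x) from the source
post_direct = posterior(p_x_direct)      # learns everything from scratch
```

With many more source samples than target samples, the pooled prior is typically far more precise than the prior estimated from the target environment alone, and that precision propagates to the posterior.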
This simple example raises a fundamental question: Is it always beneficial to decompose a relation into components, estimate each component individually, some with improved precision, then recombine the results?
A competing intuition might claim that the exercise of decomposing, estimating, and combining introduces new sources of noise, compared to, say, estimating the relation in one shot.
The question is further complicated by the fact that decompositions are not unique. Eq. (1), for example, can also be written as:

$$P^*(x \mid y) = \frac{P^*(y \mid x)\, P(x)}{\sum_{x'} P^*(y \mid x')\, P(x')} \qquad (2)$$
This calls for refraining from estimating P*(y) directly in the new environment and, instead, estimating P*(y|x') in the new environment for every value x', then averaging the results, weighted by P(x'), to obtain a composite estimate of P*(y), as shown in the denominator.
It is not at all clear that the refinement offered by the denominator of (2) would improve precision over the estimator defined in (1). Assume, for example, that Y is a single binary variable, whereas X is a vector of continuous variables. Decomposing P*(y) as in the denominator of Eq. (2) would entail estimating all factors P*(y|x') and averaging the estimates. Estimating P*(y) from scratch, in contrast, may offer definite advantages, despite the fact that we have not borrowed any information from P.
We thus ask the following questions:
Given a relation R, which of its decompositions gain by borrowing and which do not?
Which relations R have a beneficial decomposition and which do not?
Given that borrowing is beneficial, can we quantify the benefit?
2 The Transfer Benefit Ratio (TBR)
To get a theoretical handle on the problem, let us take the simple problem of estimating the regression coefficient τ of Y on X in the chain model X → Z → Y of Fig. 1.
Here b is the regression coefficient of Y on Z, a is the regression coefficient of Z on X, and, based on the chain structure:

$$\tau = ab.$$
Let us assume that we estimate a and b using Ordinary Least Squares (OLS) on a large number of cases drawn from P. Now b changes to b*, while a remains the same. How are we to estimate $\tau^* = ab^*$ in the new environment, if we can draw only a small number of samples there?
We have two options:
We ignore the estimates obtained in the training environment and estimate τ* from scratch, obtaining the one-shot estimator $\hat{\tau}^*$ from the OLS regression of Y on X in the new environment.
We estimate a and b* separately and multiply their estimates, with $\hat{a}$ receiving samples from both environments and $\hat{b}^*$ from the new environment only.
The ratio of the asymptotic variances of these two estimators will measure the merit of transferring knowledge from one environment to another, and will be called here the Transfer Benefit Ratio (TBR).
This measure translates directly to improvement in the learning speed. When TBR is high, a small number of cases in the novel environment would be sufficient to achieve a given precision, whereas a low TBR would require a high number of cases to achieve such precision.
Intuitively, the benefit of transfer would be more pronounced when the part shared by the two environments is noisy and the novel part is noiseless. Under such conditions, assessments of the target quantity τ are highly vulnerable to inaccuracies in estimating the relation between X and Z, and it is here that the training conducted in P can be most beneficial.
Exact analysis (see Appendix I) reveals that, for $n/n^* \rightarrow \infty$ (where n and n* are the numbers of samples drawn in the source and target environments, respectively), the TBR is given by the following formula

$$\mathrm{TBR} = \frac{1 - R_{XZ}^2 R_{ZY}^2}{R_{XZ}^2\,(1 - R_{ZY}^2)} \qquad (3)$$
where $R_{XZ}^2$ and $R_{ZY}^2$ are the squared correlation coefficients between X and Z, and between Z and Y, respectively.
Equation (3) quantifies the intuition that transfer learning is more beneficial when the novel part of the two environments is almost deterministic ($R_{ZY}^2$ approaches 1), so that the few observations conducted in the new environment suffice to complete the adaptation.
Appendix I generalizes this result to any ratio $n/n^*$ and presents 3-dimensional charts of how the TBR varies with both the ratio $n/n^*$ and the statistics of X, Y, and Z. Remarkably, it shows that TBR is greater than unity even for $n = 0$. This means that there is benefit to the two-step estimation of τ (using the product $\hat{a}\hat{b}$) over the single-step estimator $\hat{\tau}$, even when the environment does not change and we are faced with the problem of estimating τ given the chain model of Fig. 1. This phenomenon reflects a more general pattern in estimation: proper utilization of modeling assumptions can improve estimation efficiency, provided those assumptions are valid [4], [5].
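The claimed benefit can be checked numerically. Below is a Monte Carlo sketch of ours (all parameter values are illustrative assumptions, not taken from the paper) comparing the one-shot estimator of τ* with the product estimator $\hat{a}\hat{b}^*$ in the chain model, with a noisy shared stage and a nearly deterministic novel stage:

```python
import numpy as np

rng = np.random.default_rng(1)

# Chain model X -> Z -> Y:  z = a*x + e1,  y = b*z + e2.
# Shared stage (a) is noisy; novel stage (b*) is nearly deterministic.
a, b_star = 1.0, 1.0
sigma1, sigma2 = 2.0, 0.1
n, n_star = 2000, 50             # source vs. target sample sizes
reps = 1000

one_shot, product = [], []
for _ in range(reps):
    # Pooled data for the invariant stage X -> Z.
    x = rng.normal(size=n + n_star)
    z = a * x + sigma1 * rng.normal(size=n + n_star)
    a_hat = x @ z / (x @ x)

    # Only the first n_star units are observed in the new environment.
    x_t, z_t = x[:n_star], z[:n_star]
    y_t = b_star * z_t + sigma2 * rng.normal(size=n_star)
    b_hat = z_t @ y_t / (z_t @ z_t)

    one_shot.append(x_t @ y_t / (x_t @ x_t))   # regress Y on X from scratch
    product.append(a_hat * b_hat)              # two-stage transfer estimator

tbr_empirical = np.var(one_shot) / np.var(product)
```

Under these settings the empirical variance ratio comes out well above one, in line with Eq. (3): the shared stage is where the noise is, and the pooled samples absorb it.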
Clearly, this exercise is oversimplified in that it assumes just two linear relationships, one invariant and one novel. Yet such rudimentary analysis must be conducted if we are to understand the speed-up provided by prior learning, the factors that determine this speed-up, and how to optimize those factors.
In more realistic situations, it is not at all clear that a speed-up would be achieved regardless of the problem structure. In our example, we capitalized on the chain structure, which rendered X and Y conditionally independent given Z. Under such conditions, the product estimator is superior to the one-shot estimator even when no environmental change takes place (i.e., $b^* = b$). On the other hand, when the environment does change, the benefit of transfer learning is realized even in the absence of independence constraints, as Appendix II shows.
We have so far not considered the possibility of minimizing the number of variables needed to be measured in the new environment. Cases exist where, despite differences between P and P*, R can be estimated entirely in the source environment, without taking any measurements in P*. In other cases, some measurements in the new environment are needed, but the number of variables involved can be minimized by proper design [1].
We have demonstrated by simple examples that it is possible to quantify the benefit of borrowing information from previous learning, and that this benefit depends on the structure of the data-generating model. This leaves open the general question of deciding, for any given relation R, how it can best benefit from previous learning, and how robust it can be to changes in the target environment. We conjecture that an understanding of such theoretical questions is necessary for designing algorithms that take maximum advantage of previous learning and spend minimum resources on re-learning that which could be borrowed.
Funding statement: This research was supported in parts by grants from Defense Advanced Research Projects Agency #W911NF-16-057, National Science Foundation #IIS-1302448, #IIS-1527490, and #IIS-1704932, and Office of Naval Research #N00014-17-S-B001.
I am indebted to Professor Jinyong Hahn for teaching me the secrets of asymptotic variance analysis. The 3-dimensional plots of Fig. 3 were produced by Elias Bareinboim.
Appendix I Composition and transfer in a two-stage process
In experiments involving a two-stage process as in Fig. 1, Cox has shown that the estimated regression coefficient between treatment and response has a reduced variance if computed as a product of two estimates, one for each stage of the process [3]. Below we summarize Cox’s analysis and adapt it to the problem of information transfer across populations.
The linear model depicted in Fig. 1 can be represented by the following structural equations:

$$z = ax + \varepsilon_1, \qquad y = bz + \varepsilon_2,$$

where $\varepsilon_1$ and $\varepsilon_2$ are independent, zero-mean disturbances with variances $\sigma_1^2$ and $\sigma_2^2$, respectively, and $\sigma_x^2$ denotes the variance of X.
The process is depicted in Fig. 2. Our target of analysis is the regression coefficient of Y on X, i.e., the coefficient τ of x in the equation

$$y = \tau x + \varepsilon.$$
As before, let $\hat{\tau}$, $\hat{a}$, and $\hat{b}$ be respectively the OLS estimators of τ, a, and b. Cox showed that the asymptotic variance of $\hat{\tau}$ is greater than that of the product $\hat{a}\hat{b}$, or

$$\mathrm{Var}(\hat{\tau}) \geq \mathrm{Var}(\hat{a}\hat{b}),$$
with equality holding only in pathological cases of perfect determinism. Specifically, he computed the n-sample variances to be:

$$\mathrm{Var}(\hat{\tau}) = \frac{b^2\sigma_1^2 + \sigma_2^2}{n\,\sigma_x^2}, \qquad \mathrm{Var}(\hat{a}\hat{b}) = \frac{b^2\sigma_1^2 + R_{XZ}^2\,\sigma_2^2}{n\,\sigma_x^2},$$

yielding the ratio

$$\frac{\mathrm{Var}(\hat{\tau})}{\mathrm{Var}(\hat{a}\hat{b})} = \frac{b^2\sigma_1^2 + \sigma_2^2}{b^2\sigma_1^2 + R_{XZ}^2\,\sigma_2^2},$$

which is greater than 1 because $R_{XZ}^2 < 1$.
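In code, Cox’s ratio reads as follows (a minimal sketch; the function and argument names are ours). Note that $R_{XZ}^2$ is not a free parameter but is induced by the first-stage coefficients:

```python
def cox_ratio(a, b, sigma_x_sq, sigma1_sq, sigma2_sq):
    """Ratio Var(tau_hat) / Var(a_hat * b_hat) for the chain z = a*x + e1, y = b*z + e2."""
    sigma_z_sq = a**2 * sigma_x_sq + sigma1_sq
    r2_xz = a**2 * sigma_x_sq / sigma_z_sq      # squared correlation of X and Z
    return (b**2 * sigma1_sq + sigma2_sq) / (b**2 * sigma1_sq + r2_xz * sigma2_sq)
```

For instance, with unit coefficients and unit variances the ratio is 4/3, while making either stage deterministic (σ1² = 0, which forces $R_{XZ}^2 = 1$, or σ2² = 0) collapses it to one, in agreement with the text.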
The relation to transfer learning surfaces when a and b are estimated from two diverse populations, Π and Π*. Let us assume that a is the same in the two populations, and is estimated by $\hat{a}$ using $n + n^*$ samples, pooled from both. b is presumed to be different, and $b^*$ is estimated by $\hat{b}^*$ using the $n^*$ samples from Π* alone. We need to compare the efficiency of estimating $\tau^* = ab^*$ using the product $\hat{a}\hat{b}^*$, to that of estimating $\tau^*$ directly, using the $n^*$ samples from Π*. The TBR, or the ratio of the asymptotic variances of these two estimators, can now be calculated as follows:
Keeping track of the number of samples entering each estimator, we have

$$\mathrm{TBR} = \frac{\mathrm{Var}(\hat{\tau}^*)}{\mathrm{Var}(\hat{a}\hat{b}^*)} = \frac{\left(b^{*2}\sigma_1^2 + \sigma_2^2\right)/n^*}{\dfrac{b^{*2}\sigma_1^2}{n + n^*} + \dfrac{R_{XZ}^2\,\sigma_2^2}{n^*}}.$$
Since both $n^*/(n + n^*)$ and $R_{XZ}^2$ are smaller than 1, we conclude that the TBR is greater than one for $n > 0$, which means that it is beneficial to decompose the estimation task into two stages and use the higher number of samples, $n + n^*$, to estimate the shared component a.
For $n = 0$ we obtain Cox’s ratio, which quantifies the benefit of decomposition alone, without transfer. The ratio greatly exceeds one when both $R_{XZ}^2$ and $R_{ZY}^2$ are small, and approaches one when either or both of $R_{XZ}^2$ and $R_{ZY}^2$ are near one. This means that the benefit of decomposition is substantial if and only if both processes are noisy, whereas if either one of them comes close to being deterministic, decomposition has no benefit.
This is reasonable; there is no benefit to decomposition unless Z brings new information which is not already in X or Y.
For $n > 0$, however, the TBR represents the benefit of both decomposition and transfer. For the ratio to greatly exceed one it now suffices that both $R_{XZ}^2$ and $n^*/(n + n^*)$ be small. However, the TBR becomes unity (useless transfer) only when $R_{XZ}^2$ is unity; $R_{ZY}^2 = 1$ does not render it useless. It means that transfer is useless only when the process on which the two populations agree is deterministic. Having disagreement on a deterministic mechanism does not make the transfer useless, as long as the process in agreement is corrupted by noise and can benefit from the extra samples from Π.
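The limiting behaviors just described can be checked numerically with a small helper (a sketch; the parametrization and names are ours) that evaluates the finite-sample TBR; the common factor $1/\sigma_x^2$ cancels in the ratio and is omitted:

```python
def tbr(b_star, a, sigma_x_sq, sigma1_sq, sigma2_sq, n, n_star):
    """Finite-sample TBR: Var(one-shot) / Var(a_hat * b_star_hat) in the chain model,
    with a_hat pooled over n + n_star samples and b_star_hat using n_star samples."""
    sigma_z_sq = a**2 * sigma_x_sq + sigma1_sq
    r2_xz = a**2 * sigma_x_sq / sigma_z_sq
    var_one_shot = (b_star**2 * sigma1_sq + sigma2_sq) / n_star
    var_product = b_star**2 * sigma1_sq / (n + n_star) + r2_xz * sigma2_sq / n_star
    return var_one_shot / var_product
```

Setting $n = 0$ recovers Cox’s ratio; increasing n raises the TBR; and a deterministic shared stage (σ1² = 0, hence $R_{XZ}^2 = 1$) pins the TBR at one no matter how many source samples are pooled.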
Indeed, taking the extreme case of a deterministic process $Z \rightarrow Y$ and $n^* = 2$, there is a definite advantage to borrowing samples from the source population to estimate a and multiply the estimate by $\hat{b}^*$, rather than estimating $\tau^*$ directly with the two samples available at the target population. Two such samples can determine $b^*$ precisely, but can hardly aid in the estimation of a.
The limit of the TBR as $n/n^*$ increases indefinitely represents transfer between a highly explored environment (large n) and a highly novel one (small $n^*$). The limit reads:

$$\lim_{n/n^* \to \infty} \mathrm{TBR} = \frac{1 - R_{XZ}^2 R_{ZY}^2}{R_{XZ}^2\,(1 - R_{ZY}^2)},$$
which establishes Eq. (3). It reveals that the Transfer Benefit Ratio will be most significant when the populations share noisy components (e.g., low correlation between X and Z) and differ in nearly noiseless components (high correlation between Y and Z). Under such conditions, accurate assessment of the target quantity τ is highly vulnerable to inaccuracies in estimating a, and it is here that the large sample taken from Π can be most beneficial.
Appendix II Extension to saturated models
In Appendix I, the benefit of transfer learning was demonstrated using an “over-identified” model (Fig. 2) which embodied the conditional independence of Y and X given Z, and for which the product estimator $\hat{a}\hat{b}$ was consistent. The questions we analyze in this Appendix are whether, in “saturated” models (also called “just identified”), such as the one depicted in Fig. 4, decomposition alone still carries a benefit, and whether data sharing still does.
This model represents the following regression equations:

$$z = ax + \varepsilon_1, \qquad y = bz + cx + \varepsilon_2,$$
and the target quantity is again the total regression coefficient τ in the equation

$$y = \tau x + \varepsilon,$$

which is given by $\tau = ab + c$.
Again, τ can be estimated in two ways:
A one-shot way: compute the OLS regression of Y on X; call this estimator $\hat{\tau}$.
A two-shot way: compute the sum $\hat{a}\hat{b} + \hat{c}$, where $\hat{a}$, $\hat{b}$, and $\hat{c}$ are the OLS estimators of a, b, and c, respectively.
Using an analysis similar to that of Appendix I, one can show that the answer to the first question is negative, while that of the second question is positive. In other words, we lose the intrinsic advantage of decomposition, but we can still draw an advantage from data sharing if a is the same in the two populations. Formally, while the efficiency of the composite estimator $\hat{a}\hat{b} + \hat{c}$ is identical to that of the one-shot estimator $\hat{\tau}$, the variance of the former can be reduced if a is estimated using a larger sample than would be available to the one-shot estimator. In particular, assuming that $\hat{a}$ is estimated using $n + n^*$ samples and $\hat{b}$ and $\hat{c}$ using $n^*$ samples, the asymptotic variances of $\hat{\tau}$ and $\hat{a}\hat{b} + \hat{c}$ can be obtained by the delta method, and read:

$$\mathrm{Var}(\hat{\tau}) = \frac{b^2\sigma_1^2 + \sigma_2^2}{n^*\,\sigma_x^2}, \qquad \mathrm{Var}(\hat{a}\hat{b} + \hat{c}) = \frac{b^2\sigma_1^2}{(n + n^*)\,\sigma_x^2} + \frac{\sigma_2^2}{n^*\,\sigma_x^2}.$$
We see that for a single population ($n = 0$) decomposition in itself carries no benefit, $\mathrm{TBR} = 1$; the one-shot estimator is as good as the two-shot estimator. This stands in contrast to the over-identified model of Fig. 1, for which the TBR was greater than unity except in pathological cases. Moreover, the loss of benefit is not due to the disappearance of over-identification conditions from the model, but due to the composite estimator’s failure to detect and utilize such conditions when they are valid. This can be seen from the fact that the two variances (as well as their equality at $n = 0$) remain unaltered even when $c = 0$. In other words, it is not the actual value of c that counts but the structure of the estimator we postulate. If we are ignorant of the fact that $c = 0$ in the actual model and go through the trouble of estimating τ by the sum $\hat{a}\hat{b} + \hat{c}$, instead of the product $\hat{a}\hat{b}$, the variance will be greater than what we would have gotten had we detected the model structure correctly and used the estimator $\hat{a}\hat{b}$ to reflect our knowledge.
For $n > 0$, however, the picture changes dramatically; the variances demonstrate a definite benefit to composite estimation ($\mathrm{TBR} > 1$) which increases with $n/n^*$. The intuition is similar to that given in Appendix I. When the process mediating Z and Y was almost deterministic ($\sigma_2^2 \approx 0$), we obtained $\mathrm{TBR} \approx (n + n^*)/n^*$. Here too, if the Y equation is deterministic, we can estimate it precisely with just a few samples from Π* and use the additional samples for estimating the noisy process $z = ax + \varepsilon_1$, which is common to both populations. The one-shot estimator will suffer from this noise if allowed only the $n^*$ samples from Π*.
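These claims can be verified numerically from the delta-method variances above (a sketch; the function names and parameter values are ours, and b denotes the target-population coefficient):

```python
def var_one_shot(b, sigma_x_sq, sigma1_sq, sigma2_sq, n_star):
    """Asymptotic variance of the one-shot OLS estimator of tau in the saturated model."""
    return (b**2 * sigma1_sq + sigma2_sq) / (n_star * sigma_x_sq)

def var_two_shot(b, sigma_x_sq, sigma1_sq, sigma2_sq, n, n_star):
    """Asymptotic variance of a_hat*b_hat + c_hat, with a_hat pooled over n + n_star samples."""
    return (b**2 * sigma1_sq / (n + n_star) + sigma2_sq / n_star) / sigma_x_sq

# n = 0: no transfer, the two estimators are equally precise.
# n > 0: the composite estimator wins, and the gap grows with n / n_star.
```

Note that c appears in neither expression, reflecting the point made above: it is the structure of the estimator, not the actual value of c, that governs the precision.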
1. Pearl J, Bareinboim E. Transportability of causal and statistical relations: A formal approach. In: Burgard W, Roth D, editors. Proceedings of the Twenty-Fifth Conference on Artificial Intelligence (AAAI-11). Menlo Park, CA: AAAI Press; 2011. p. 247–54. Available at: http://ftp.cs.ucla.edu/pub/stat_ser/r372a.pdf. DOI: 10.1109/ICDMW.2011.169.
3. Cox D. Regression analysis when there is prior information about supplementary variables. Journal of the Royal Statistical Society, Series B. 1960;22:172–6. DOI: 10.1111/j.2517-6161.1960.tb00363.x.
4. Pearl J. Some thoughts concerning transfer learning, with applications to meta-analysis and data-sharing estimation. Tech. Rep. R-387. Los Angeles, CA: Department of Computer Science, University of California; 2012. Working paper. Available at: http://ftp.cs.ucla.edu/pub/stat_ser/r387.pdf. DOI: 10.2139/ssrn.2343866.
5. Hahn J, Pearl J. Precision of composite estimators. Tech. Rep. R-388. Los Angeles, CA: Department of Computer Science, University of California; 2011. In preparation. Available at: http://ftp.cs.ucla.edu/pub/stat_ser/r388.pdf.
© 2018 Walter de Gruyter GmbH, Berlin/Boston