Published by De Gruyter April 5, 2014

An Adaptive Learning Model with Foregone Payoff Information

Naoki Funai

Abstract

In this paper, we provide theoretical predictions on the long-run behavior of an adaptive decision maker who has access to foregone payoff information. In the model, the decision maker assigns a subjective payoff assessment to each action based on his past experience and chooses the action with the highest assessment. After receiving a payoff, the decision maker updates his assessments of actions in an adaptive manner, using not only the objective payoff information but also the foregone payoff information, which may be distorted. The distortion may arise from a “the grass is always greener on the other side” effect, pessimism/optimism or envy/gloating; it depends on how the decision maker views the source of the information. We first provide conditions under which the assessment of each action converges; the limit assessment is then expressed as a weighted average of the expected objective payoff and the expected distorted payoff of the action. We then show that the decision maker chooses the optimal action most frequently in the long run if the expected distorted payoff of that action is greater than those of the other actions. Finally, we provide conditions under which this model coincides with the experience-weighted attraction learning, stochastic fictitious play and quantal response equilibrium models, so that it also yields theoretical predictions for those models in decision problems.
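As a concrete illustration of the learning rule described above, the following is a minimal Python sketch of one possible implementation. The logit choice rule, the constant weighting parameter lam, the additive distortion and all function names are illustrative assumptions, not the paper's exact specification; the formal model works with a choice rule C and distortion functions D_i as in the appendix below.

```python
import math
import random

def logit_choice(Q, beta=5.0):
    """Illustrative logit choice rule: each action is chosen with
    probability proportional to exp(beta * assessment)."""
    weights = [math.exp(beta * q) for q in Q]
    total = sum(weights)
    r, cum = random.random(), 0.0
    for i, w in enumerate(weights):
        cum += w / total
        if r <= cum:
            return i
    return len(Q) - 1

def update_assessments(Q, chosen, payoffs, distort, lam=0.1):
    """Adaptive update: the chosen action moves toward its objective payoff,
    each unchosen action moves toward a distorted foregone payoff."""
    for i in range(len(Q)):
        target = payoffs[i] if i == chosen else distort(payoffs[i])
        Q[i] += lam * (target - Q[i])
    return Q

# Example run: two actions with noisy payoffs and an optimistic
# "grass is greener" distortion applied to foregone payoffs (illustrative).
random.seed(0)
Q = [0.0, 0.0]
distort = lambda x: x + 0.5                      # illustrative distortion D_i
payoff_draw = [lambda: random.gauss(1.0, 1.0),   # action 0: objective mean 1.0
               lambda: random.gauss(0.8, 1.0)]   # action 1: objective mean 0.8
for n in range(5000):
    chosen = logit_choice(Q)
    payoffs = [payoff_draw[k]() for k in range(2)]
    Q = update_assessments(Q, chosen, payoffs, distort)
print(Q)  # assessments settle near a mix of objective and distorted expected payoffs
```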

Appendices

Since we know that $\alpha_i^* \geq \alpha_j^*$ if $Q_i^* \geq Q_j^*$, in the following proofs we compare the limit assessments of two actions.

Proof of Lemma 4

For each action $i$, the following equation holds:

$$Q_i^* = C_i(Q^*)\,E[\pi_i] + (1 - C_i(Q^*))\,E[D_i(\pi_i)].$$

Hence, if $\min\{E[D_i(\pi_i)], E[\pi_i]\} \geq \max\{E[D_j(\pi_j)], E[\pi_j]\}$ holds, then

$$Q_i^* = C_i(Q^*)\,E[\pi_i] + (1 - C_i(Q^*))\,E[D_i(\pi_i)] \geq C_j(Q^*)\,E[\pi_j] + (1 - C_j(Q^*))\,E[D_j(\pi_j)] = Q_j^*.$$
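Rearranging this fixed-point equation gives the two equivalent forms used repeatedly in the proofs of Lemma 5 and Proposition 4 below:

$$Q_i^* = E[\pi_i] + (1 - C_i(Q^*))\bigl(E[D_i(\pi_i)] - E[\pi_i]\bigr) = E[D_i(\pi_i)] - C_i(Q^*)\bigl(E[D_i(\pi_i)] - E[\pi_i]\bigr).$$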

Proof of Lemma 5 (i)

Here, we prove by contradiction. First, we consider the case in which the condition $E[D_i(\pi_i)] > E[D_j(\pi_j)] \geq E[\pi_i] \geq E[\pi_j]$ holds. We now assume that $Q_i^* < Q_j^*$. Since $E[\pi_i] \geq E[\pi_j]$, we have

$$E[\pi_i] = Q_i^* - (1 - C_i(Q^*))\bigl(E[D_i(\pi_i)] - E[\pi_i]\bigr) \geq Q_j^* - (1 - C_j(Q^*))\bigl(E[D_j(\pi_j)] - E[\pi_j]\bigr) = E[\pi_j].$$

Note that since $Q_i^* < Q_j^*$, we have

$$[6]\qquad E[D_j(\pi_j)] - E[\pi_j] \geq E[D_i(\pi_i)] - E[\pi_i].$$

Now, since $E[D_i(\pi_i)] > E[D_j(\pi_j)]$, we have

$$E[D_i(\pi_i)] = Q_i^* + C_i(Q^*)\bigl(E[D_i(\pi_i)] - E[\pi_i]\bigr) > Q_j^* + C_j(Q^*)\bigl(E[D_j(\pi_j)] - E[\pi_j]\bigr) = E[D_j(\pi_j)].$$

And by the hypothesis that $Q_i^* < Q_j^*$, we have

$$[7]\qquad E[D_i(\pi_i)] - E[\pi_i] > E[D_j(\pi_j)] - E[\pi_j].$$

However, the inequalities [6] and [7] contradict each other.

Next, we consider the case in which the condition $E[D_i(\pi_i)] = E[D_j(\pi_j)] = E[\pi_i] \geq E[\pi_j]$ holds. Since the limit assessment of each action takes a value between the expected objective payoff and the expected distorted payoff of that action, we should have that $Q_i^* \geq Q_j^*$.

Last, we consider the case in which the condition $E[D_i(\pi_i)] = E[D_j(\pi_j)] > E[\pi_i] \geq E[\pi_j]$ holds. Again, we assume that $Q_i^* < Q_j^*$. Since $E[D_i(\pi_i)] = E[D_j(\pi_j)]$, we have

$$Q_i^* + C_i(Q^*)\bigl(E[D_i(\pi_i)] - E[\pi_i]\bigr) = Q_j^* + C_j(Q^*)\bigl(E[D_j(\pi_j)] - E[\pi_j]\bigr).$$

However, this contradicts that $Q_i^* < Q_j^*$, $C_i(Q^*) < C_j(Q^*)$ and $0 < E[D_i(\pi_i)] - E[\pi_i] \leq E[D_j(\pi_j)] - E[\pi_j]$.

Proof of Lemma 5 (ii)

We assume that one of the inequalities in the chain

$$E[\pi_i] \geq E[\pi_j] \geq E[D_i(\pi_i)] \geq E[D_j(\pi_j)]$$

holds strictly. Also, we assume that $E[D_j(\pi_j)] > 0$. Now consider some $Q_1 \in \mathbb{R}^M$ such that $Q_{1i} \geq Q_{1j}$ for any $j \neq i$. Then

$$C_i(Q_1)\,E[\pi_i] + (1 - C_i(Q_1))\,E[D_i(\pi_i)] \geq C_j(Q_1)\,E[\pi_j] + (1 - C_j(Q_1))\,E[D_j(\pi_j)].$$

What we show here is that the trajectories of ODEs [2] starting from points with $Q_{1i} = Q_{1j}$ never enter the area of $Q$ with $Q_i < Q_j$, so that at the unique rest point $Q^*$, which is globally asymptotically stable, we must have $Q_i^* \geq Q_j^*$.

First, consider an initial point $Q_1$ such that

$$Q_{1i} = Q_{1j} \leq C_j(Q_1)\,E[\pi_j] + (1 - C_j(Q_1))\,E[D_j(\pi_j)] \leq C_i(Q_1)\,E[\pi_i] + (1 - C_i(Q_1))\,E[D_i(\pi_i)].$$

Note that (i) $\dot{Q}_i \geq 0$ and $\dot{Q}_j \geq 0$; (ii) if $\dot{Q}_i = 0$, then $\dot{Q}_j = 0$; and (iii) if $\dot{Q}_i > 0$ and $\dot{Q}_j \geq 0$, then

$$0 \leq \frac{C_j(Q_1)\,E[\pi_j] + (1 - C_j(Q_1))\,E[D_j(\pi_j)] - Q_j}{C_i(Q_1)\,E[\pi_i] + (1 - C_i(Q_1))\,E[D_i(\pi_i)] - Q_i} \leq 1.$$

Therefore, the trajectories starting from $Q_1$ do not enter the area with $Q_i < Q_j$.

Next, we assume that

$$C_j(Q_1)\,E[\pi_j] + (1 - C_j(Q_1))\,E[D_j(\pi_j)] < Q_{1i} = Q_{1j} \leq C_i(Q_1)\,E[\pi_i] + (1 - C_i(Q_1))\,E[D_i(\pi_i)].$$

Then $\dot{Q}_j < 0$ and $\dot{Q}_i \geq 0$, and it is obvious that the trajectories of the ODEs do not enter the area with $Q_i < Q_j$.

Finally, we assume that

$$C_j(Q_1)\,E[\pi_j] + (1 - C_j(Q_1))\,E[D_j(\pi_j)] \leq C_i(Q_1)\,E[\pi_i] + (1 - C_i(Q_1))\,E[D_i(\pi_i)] < Q_{1i} = Q_{1j}.$$

Then $\dot{Q}_i < 0$, $\dot{Q}_j < 0$ and

$$1 < \frac{C_j(Q_1)\,E[\pi_j] + (1 - C_j(Q_1))\,E[D_j(\pi_j)] - Q_{1j}}{C_i(Q_1)\,E[\pi_i] + (1 - C_i(Q_1))\,E[D_i(\pi_i)] - Q_{1i}}.$$

And again, the trajectories of the ODEs do not enter the area with $Q_i < Q_j$.

In sum, the trajectories that start from points on the line with $Q_i = Q_j$ never enter the area of $Q$ with $Q_i < Q_j$, and thus $Q_i^* \geq Q_j^*$. We can apply the same argument to the other cases, in which $E[D_j(\pi_j)] \leq 0$. □
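As a rough numerical illustration of this trajectory argument (not part of the original proof), the following Python sketch integrates a drift of the form $\dot{Q}_k = C_k(Q)\,E[\pi_k] + (1 - C_k(Q))\,E[D_k(\pi_k)] - Q_k$, which is one natural reading of ODEs [2] given the rest-point equation above; the logit choice rule, the payoff numbers and the forward-Euler discretization are illustrative assumptions.

```python
import math

def logit(Q, beta=2.0):
    """Illustrative logit choice probabilities C_k(Q)."""
    w = [math.exp(beta * q) for q in Q]
    s = sum(w)
    return [x / s for x in w]

def drift(Q, E_pi, E_D):
    """Assumed drift: C_k(Q) E[pi_k] + (1 - C_k(Q)) E[D_k(pi_k)] - Q_k."""
    C = logit(Q)
    return [C[k] * E_pi[k] + (1 - C[k]) * E_D[k] - Q[k] for k in range(len(Q))]

# Two actions i=0, j=1 with E[pi_i] >= E[pi_j] >= E[D_i] >= E[D_j] > 0,
# starting on the diagonal Q_i = Q_j (illustrative numbers).
E_pi = [1.0, 0.8]
E_D = [0.6, 0.4]
Q = [0.5, 0.5]
dt = 0.01
for _ in range(10000):          # forward-Euler integration of the ODE
    d = drift(Q, E_pi, E_D)
    Q = [Q[k] + dt * d[k] for k in range(2)]
    assert Q[0] >= Q[1] - 1e-9  # trajectory never enters the region Q_i < Q_j
print(Q)                        # approximate rest point, with Q_i >= Q_j
```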

Proof of Proposition 4

Assume that $Q_i^* < Q_j^*$. By the property of choice rules, we have $C_i(Q^*) < C_j(Q^*)$ and hence $C_i(Q^*) < 1 - C_i(Q^*)$. Since $C_i(Q^*) < \tfrac{1}{2}$, we have

$$Q_i^* = E[\pi_i] + (1 - C_i(Q^*))\bigl(E[D_i(\pi_i)] - E[\pi_i]\bigr) \geq E[\pi_i] + \tfrac{1}{2}\bigl(E[D_i(\pi_i)] - E[\pi_i]\bigr) = \frac{E[D_i(\pi_i)] + E[\pi_i]}{2}.$$

Since $Q_j^* \leq \max\{E[D_j(\pi_j)], E[\pi_j]\}$, we have $Q_i^* \geq Q_j^*$. However, this contradicts the original hypothesis. □

Proof of Proposition 5

We show that if $E[\pi_i] \geq E[\pi_j]$, then $Q_i^* \geq Q_j^*$ and thus $\alpha_i^* \geq \alpha_j^*$. Now we assume that $E[\pi_i] \geq E[\pi_j]$. Then

$$
\begin{aligned}
Q_i^* - Q_j^* &= E[\pi_i] - E[\pi_j] + (1 - \alpha_i^*)\,E[G(\pi_i - \pi_j)] - (1 - \alpha_j^*)\,E[G(\pi_j - \pi_i)] \\
&= (E[\pi_i] - E[\pi_j]) + (1 - \alpha_i^*)\,E[G(\pi_i - \pi_j)] + \alpha_i^*\,E[G(\pi_i - \pi_j)] \\
&= (E[\pi_i] - E[\pi_j]) + E[G(\pi_i - \pi_j)].
\end{aligned}
$$

Here the second equality uses that $1 - \alpha_j^* = \alpha_i^*$ (there are two actions) and that $E[G(\pi_j - \pi_i)] = -E[G(\pi_i - \pi_j)]$. Since $E[G(\pi_i - \pi_j)] \geq G(E[\pi_i] - E[\pi_j]) \geq 0$, we have that $Q_i^* \geq Q_j^*$ and thus $\alpha_i^* \geq \alpha_j^*$. □

References

Beggs, A. W. 2005. “On the Convergence of Reinforcement Learning.” Journal of Economic Theory 122:1–36.

Benaïm, M. 1999. “Dynamics of Stochastic Approximation Algorithms.” In Séminaire de Probabilités XXXIII, Lecture Notes in Mathematics, vol. 1709, edited by J. Azéma, M. Émery, M. Ledoux and M. Yor, 1–68. Berlin: Springer.

Benaïm, M., and M. W. Hirsch. 1999. “Mixed Equilibria and Dynamical Systems Arising from Fictitious Play in Perturbed Games.” Games and Economic Behavior 29:36–72.

Borkar, V. S. 2008. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge, UK: Cambridge University Press.

Brown, G. W. 1951. “Iterative Solution of Games by Fictitious Play.” In Activity Analysis of Production and Allocation, edited by T. C. Koopmans. New York: Wiley.

Camerer, C., and T. H. Ho. 1999. “Experience-Weighted Attraction Learning in Normal Form Games.” Econometrica 67:827–74.

Cominetti, R., E. Melo, and S. Sorin. 2010. “A Payoff-Based Learning Procedure and Its Application to Traffic Games.” Games and Economic Behavior 70:71–83.

Conley, T. G., and C. R. Udry. 2010. “Learning About a New Technology: Pineapple in Ghana.” American Economic Review 100:35–69.

Duffy, J., and N. Feltovich. 1999. “Does Observation of Others Affect Learning in Strategic Environments? An Experimental Study.” International Journal of Game Theory 28:131–52.

Erev, I., and A. E. Roth. 1998. “Predicting How People Play Games: Reinforcement Learning in Experimental Games with Unique, Mixed Strategy Equilibria.” American Economic Review 88:848–81.

Fudenberg, D., and D. M. Kreps. 1993. “Learning Mixed Equilibria.” Games and Economic Behavior 5:320–67.

Grosskopf, B., I. Erev, and E. Yechiam. 2006. “Foregone with the Wind: Indirect Payoff Information and Its Implications for Choice.” International Journal of Game Theory 34:285–302.

Grygolec, J., G. Coricelli, and A. Rustichini. 2012. “Positive Interaction of Social Comparison and Personal Responsibility for Outcomes.” Frontiers in Psychology 3:25.

Hall, P., and C. C. Heyde. 1980. Martingale Limit Theory and Its Application. New York: Academic Press.

Heller, D., and R. Sarin. 2001. “Adaptive Learning with Indirect Payoff Information.” Working Paper.

Hofbauer, J., and W. H. Sandholm. 2002. “On the Global Convergence of Stochastic Fictitious Play.” Econometrica 70:2265–94.

Hopkins, E. 2002. “Two Competing Models of How People Learn in Games.” Econometrica 70:2141–66.

Laslier, J.-F., R. Topol, and B. Walliser. 2001. “A Behavioral Learning Process in Games.” Games and Economic Behavior 37:340–66.

Leslie, D. S., and E. J. Collins. 2005. “Individual Q-Learning in Normal Form Games.” SIAM Journal on Control and Optimization 44:495–514.

McKelvey, R. D., and T. R. Palfrey. 1995. “Quantal Response Equilibria for Normal Form Games.” Games and Economic Behavior 10:6–38.

Roth, A. E., and I. Erev. 1995. “Learning in Extensive-Form Games: Experimental Data and Simple Dynamic Models in the Intermediate Term.” Games and Economic Behavior 8:164–212.

Rustichini, A. 1999. “Optimal Properties of Stimulus-Response Learning Models.” Games and Economic Behavior 29:244–73.

Sarin, R., and F. Vahid. 1999. “Payoff Assessments without Probabilities: A Simple Dynamic Model of Choice.” Games and Economic Behavior 28:294–309.

Tsitsiklis, J. N. 1994. “Asynchronous Stochastic Approximation and Q-Learning.” Machine Learning 16:185–202.

Watkins, C. J. C. H., and P. Dayan. 1992. “Q-Learning.” Machine Learning 8:279–92.

1. Grygolec, Coricelli, and Rustichini (2012) investigate the effect of envy and gloating on the evaluations of unchosen actions. However, they do not provide a theoretical analysis of that case.

2. It is continuous due to the dominated convergence theorem, and it is a function – not a correspondence – because the probability that the shock-affected assessments of two actions coincide is zero.

3.
4. By the result of Tsitsiklis (1994), we can relax the condition and allow the sequence of weighting parameters to be stochastic. In addition, we can allow the weighting parameters to differ across actions in each period; that is, the sequence of weighting parameters is $\{\lambda_n^i\}_{n,i}$, where $\lambda_n^i$ is the weighting parameter of action $i$ in period $n$.

5. For example, see Hall and Heyde (1980, 36).

6. It is worth noting that there exist mean-preserving distortion functions such that the distribution of $D_i(\pi_i)$ is a mean-preserving spread (or contraction) of that of $\pi_i$ for each $i \in A$.
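For instance (an illustrative example, not from the paper), a distortion of the form

$$D_i(\pi_i) = E[\pi_i] + c\,(\pi_i - E[\pi_i]), \qquad c \geq 0,$$

preserves the mean of $\pi_i$, and the distribution of $D_i(\pi_i)$ is a mean-preserving spread of that of $\pi_i$ when $c > 1$ and a mean-preserving contraction when $c < 1$.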

7. They use $\delta$ for the discount factor.

8. See Grygolec, Coricelli, and Rustichini (2012) for this form of distortion function.

9. Boundedness is also satisfied, since we assume that payoffs are bounded.

10. If there are more than two actions, then we will have different dynamics for the assessments. That case is left for future work.

11. Alternatively, each population is large enough that the probability of a decision maker being picked again is close to zero.

Published Online: 2014-4-5
Published in Print: 2014-1-1

©2014 by De Gruyter