Published by De Gruyter Oldenbourg February 13, 2018

Randomization in Online Experiments

Konstantin Golyaev

Abstract

Most scientists consider randomized experiments to be the best method available to establish causality. On the Internet, during the past twenty-five years, randomized experiments have become common, often referred to as A/B testing. For practical reasons, much A/B testing does not use pseudo-random number generators to implement randomization. Instead, hash functions are used to transform the distribution of identifiers of experimental units into a uniform distribution. Using two large, industry data sets, I demonstrate that the success of hash-based quasi-randomization strategies depends greatly on the hash function used: MD5 yielded good results, while SHA512 yielded less impressive ones.
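As a rough illustration of the hash-based assignment described above, the following minimal sketch maps an identifier to an experimental arm by reducing an MD5 digest modulo the number of arms. The function name `assign_variant`, the experiment-name salt, and the choice of the first eight digest bytes are assumptions made for this illustration; they are not taken from the article or from any particular A/B-testing platform.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str, n_variants: int = 2) -> int:
    """Map an identifier to a variant index in [0, n_variants) via an MD5 digest."""
    # Assumption: prefixing a (hypothetical) experiment name keeps assignments
    # in different experiments from being identical for the same unit.
    digest = hashlib.md5(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    # Read the first 8 bytes as an unsigned integer, then reduce modulo n_variants.
    return int.from_bytes(digest[:8], "big") % n_variants


if __name__ == "__main__":
    counts = [0, 0]
    for n in range(100_000):
        counts[assign_variant(f"user-{n}", "demo-experiment")] += 1
    print(counts)  # the two counts should be close to 50,000 each
```

Because the assignment is a deterministic function of the identifier, the same unit always lands in the same arm without any stored state, which is one practical reason such schemes are used in place of a pseudo-random number generator.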

JEL Classification: C1; C8; C9

Acknowledgements:

I thank the James M. Kilts Center for Marketing at the University of Chicago Booth School of Business for making the Dominick’s data available. The views expressed in this article are those of the author and not of Microsoft Corporation. For comments and suggestions on previous drafts of this paper, I thank the co-editor, Harry J. Paarsch, as well as two anonymous referees. All remaining errors are, of course, mine.

References

Gilbert, S.L., Lynch N.A. (2002), Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News 33: 51–59. doi:10.1145/564585.564601

Graham, R.L., Knuth D.E., Patashnik O. (1994), Concrete Mathematics: A Foundation for Computer Science. Reading, MA, USA, Addison-Wesley.

Gueron, S.S.J., Walker J. (2011), SHA-512/256. Proceedings of the 2011 Eighth International Conference on Information Technology: New Generations, pp. 354–358. doi:10.1109/ITNG.2011.69

Harris R.P., M. Helfand, S.H. Woolf, K.N. Lohr, C.D. Mulrow, S.M. Teutsch, D. Atkins, Methods Work Group, Third US Preventive Services Task Force (2001), Current Methods of the US Preventive Services Task Force: A Review of the Process. American Journal of Preventive Medicine 20 (3 Suppl). doi:10.1016/S0749-3797(01)00261-6

Fisher, R.A. (1935), The Design of Experiments. Edinburgh, UK, Oliver and Boyd.

Knight, F.H. (1921), Risk, Uncertainty, and Profit. Boston, MA, Hart, Schaffner and Marx.

Kohavi, R., Longbotham R., Sommerfield D., Henne R.M. (2009), Controlled Experiments on the Web: Survey and Practical Guide. doi:10.1007/s10618-008-0114-1

Matsumoto, M., Nishimura T. (1998), Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator. ACM Transactions on Modeling and Computer Simulation 8: 3–30. doi:10.1145/272991.272995

Paarsch, H.J., Golyaev K. (2016), A Gentle Introduction to Effective Computing in Quantitative Research: What Every Research Assistant Should Know. Cambridge, USA, MIT Press.

Rivest, R. (1992), The MD5 Message-Digest Algorithm. USA, RFC 1321, RFC Editor. doi:10.17487/rfc1321

Schilling, M.F. (2012), The Surprising Predictability of Long Runs. Mathematics Magazine 85 (2): 141–149. doi:10.4169/math.mag.85.2.141

Appendix

A Example of hashing

A wonderful illustration of hashing is provided in Graham et al. (1994). Consider a data structure that consists of a collection of key-value pairs. For example, a retailer might have a database of customers and store a number of data values for each customer. Imagine a key-value store (dictionary) where keys are names of customers and values are arbitrary arrays of customer-specific data. In practice, unique, randomly generated customer identifiers would likely serve as keys; in a large database, the likelihood of naming collisions among customers becomes too high. Using customer names as keys, however, facilitates exposition.

As an example, consider locating a particular customer within the database; specifically, assume that a total of I customers are recorded in the customer database. A naïve implementation of storage and retrieval would involve maintaining a lookup table K(i) that holds the key of each customer i ∈ {1, …, I}. Searching for the data of the customer with key k would then involve the following steps:

  1. Set i=1.

  2. If i>I, stop and return “failure”.

  3. If K(i)=k, stop and return “success”.

  4. Increase i by 1 and go back to step 2.
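The sequential scan just described can be sketched as follows. This is only an illustration: the function name `linear_search` and the argument name `target` (the key being sought) are chosen here and do not appear in the article. The loop advances to the next record without resetting the counter, so the scan terminates after at most one pass over the keys.

```python
from typing import Optional, Sequence


def linear_search(keys: Sequence[str], target: str) -> Optional[int]:
    """Return the 1-based index i with K(i) == target, or None on failure."""
    i = 1                            # step 1
    while i <= len(keys):            # step 2: fail once i exceeds I
        if keys[i - 1] == target:    # step 3: compare K(i) with the sought key
            return i
        i += 1                       # step 4: advance, keeping the current position
    return None


keys = ["Nora", "Glenn", "James"]
print(linear_search(keys, "James"))  # 3
print(linear_search(keys, "Tina"))   # None
```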

In the worst-case scenario, finding the data for a given customer involves (I+1) reads from the list of keys if one decides to traverse it in this fashion. In a real-world application, where I can easily be in the billions, this method is too slow. Hashing was designed to speed up the process by storing the keys in J smaller lists.

Formally, a hash function h maps a key k=K(i) into a list number h(k) between 1 and J. In addition, two auxiliary tables are created, F(j) and N(i), where F(j) points to the first record in list j ∈ {1, …, J}, and N(i) points to the “next” record after record i in its list.

As an illustration, let J=4 and let the hash function be based on the first letter of the customer’s name as follows:

  1. j=1 if first letter is between A and F,

  2. j=2 if first letter is between G and L,

  3. j=3 if first letter is between M and R, and

  4. j=4 if first letter is between S and Z.

Before any data are recorded, set I=0. To formalize the notion of an empty list, set F(1)=F(2)=F(3)=F(4)=−1. In addition, set N(i)=0 to denote that i is the last entry in its list. Armed with these definitions, one can begin inserting data into this new structure.

Assume that the first customer who needs to be recorded is named Nora. The hash function defined above will insert Nora’s record into list j=3, since the first letter of her name, N, is between M and R. Now I=1, F(3)=1, and all other values of F and N are so far unchanged. Let the name of the second customer be Glenn. Inserting him into the database would change I to 2 and set F(2)=2, with no other changes. Now suppose that the third customer is named James. Adding him would result in I=3 and N(2)=3, because James follows Glenn in list j=2. With three records in the database, the entire structure now looks as follows:

  1. Number of records: I=3, and number of hash buckets: J=4.

  2. Available keys: K(1)=Nora, K(2)=Glenn, K(3)=James.

  3. Indices of first records in each hash bucket: F(1)=−1, F(2)=2, F(3)=1, F(4)=−1.

  4. Indices of next records in each hash bucket after the first one: N(1)=0, N(2)=3, N(3)=0.
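The insertion steps above can be reproduced with a short sketch. It assumes that an empty list is marked by F(j)=−1 and that each new record is appended to the end of its list, which is what the worked example implies. The names `first_letter_bucket` and `ChainedTable` are invented for this illustration, and the arrays carry an unused slot at position 0 so that indices match the 1-based notation of the text.

```python
def first_letter_bucket(name: str) -> int:
    """The illustrative hash: A-F -> 1, G-L -> 2, M-R -> 3, S-Z -> 4."""
    c = name[:1].upper()
    if not "A" <= c <= "Z":
        raise ValueError(f"name must start with a letter A-Z: {name!r}")
    if c <= "F":
        return 1
    if c <= "L":
        return 2
    if c <= "R":
        return 3
    return 4


class ChainedTable:
    """Keys K(i), list heads F(j), and successors N(i), all indexed from 1."""

    def __init__(self, n_buckets: int = 4) -> None:
        self.K = [None]                      # K[0] is unused padding
        self.N = [None]                      # N[0] is unused padding
        self.F = [None] + [-1] * n_buckets   # assumption: -1 marks an empty list

    def insert(self, name: str) -> int:
        """Append name to the end of its list and return its record index."""
        j = first_letter_bucket(name)
        self.K.append(name)
        self.N.append(0)                     # the new record is last in its list
        i = len(self.K) - 1
        if self.F[j] <= 0:
            self.F[j] = i                    # first record in list j
        else:
            tail = self.F[j]
            while self.N[tail] != 0:         # walk to the current last record
                tail = self.N[tail]
            self.N[tail] = i
        return i


table = ChainedTable()
for name in ["Nora", "Glenn", "James"]:
    table.insert(name)

print(table.K[1:])  # ['Nora', 'Glenn', 'James']
print(table.F[1:])  # [-1, 2, 1, -1]
print(table.N[1:])  # [0, 3, 0]
```

Walking to the tail on every insert keeps the sketch short; an implementation concerned with insertion speed might instead track the last record of each list.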

Fast-forwarding the example, and assuming that 18 customer records have been inserted into the database, the four lists now look like the following:

List 1    List 2     List 3     List 4
Dianne    Glenn      Nora       Scott
Ariel     James      Michael    Tina
Brian     Jennifer   Nicholas
Francis   Joan       Ray
Douglas   Jeremy     Paula
          Jean

From this example, one can see that, in the worst case, it would take six steps to locate a record using these four lists. With 18 total records, this is a three-fold speed-up in search and retrieval. In the average case, when keys are spread evenly across the lists, the time it takes to locate a record falls from roughly I/2 comparisons to roughly I/(2J), that is, by a factor of about J; when I and J are large, this makes a big difference. A precise search algorithm for a key k would look as follows:

  1. Set j=h(k) and i=F(j).

  2. If i≤0, stop and return “failure”.

  3. If K(i)=k, stop and return “success”.

  4. Set j=i, set i=N(j), and return to step 2.

To locate Jennifer in the above example with 18 records, set j=2, since the first letter of her name, J, is between G and L, and i=2, since Glenn is the first entry in the second list. Because Glenn is not Jennifer, update i to 3, since N(2)=3, that is, James is the next record in the second list after Glenn. Now, James is also not Jennifer, so i is updated to N(3). The exact value would be determined by when Jennifer was added to the database, but the important part is that now K(N(3)) is Jennifer, that is, the search terminates successfully after three iterations.
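The search just traced can be checked with a small sketch over the 18 names. Only the first three insertions (Nora, Glenn, James) are given in the text; the remaining insertion order below is an assumption chosen so that each list matches the table, which means the record index returned for Jennifer is illustrative while the number of comparisons is not. The helper names `h`, `build`, and `search` are hypothetical, and empty lists are assumed to be marked by −1.

```python
from typing import List, Optional, Tuple

# Position 0 is padding so that indices match the 1-based notation K(i).
# Assumption: insertion order after the first three names is invented.
K = [None, "Nora", "Glenn", "James", "Dianne", "Scott", "Ariel", "Michael",
     "Jennifer", "Tina", "Brian", "Nicholas", "Joan", "Francis", "Ray",
     "Jeremy", "Douglas", "Paula", "Jean"]


def h(name: str) -> int:
    """First-letter hash for names starting A-Z: A-F -> 1, ..., S-Z -> 4."""
    return min((ord(name[0].upper()) - ord("A")) // 6 + 1, 4)


def build(keys: List[Optional[str]], n_buckets: int = 4):
    """Construct the head table F and successor table N for the given keys."""
    F = [None] + [-1] * n_buckets        # -1 marks an empty list
    N = [None] + [0] * (len(keys) - 1)   # 0 marks the last record of a list
    last = {}                            # bucket -> index of its current tail
    for i in range(1, len(keys)):
        j = h(keys[i])
        if j in last:
            N[last[j]] = i
        else:
            F[j] = i
        last[j] = i
    return F, N


def search(k: str, K, F, N) -> Tuple[Optional[int], int]:
    """Return (record index or None, number of keys compared)."""
    i = F[h(k)]              # step 1: jump to the head of the key's list
    probes = 0
    while i > 0:             # step 2: i <= 0 means the list is exhausted
        probes += 1
        if K[i] == k:        # step 3: found the key
            return i, probes
        i = N[i]             # step 4: follow the link to the next record
    return None, probes


F, N = build(K)
print(search("Jennifer", K, F, N))  # (8, 3)
print(search("Jean", K, F, N))      # (18, 6)
print(search("Zoe", K, F, N))       # (None, 2)
```

Jean, the last entry of the longest list, needs six comparisons, which is the worst case quoted for these four lists.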


Article note

This article is part of the special issue “Big Data” published in the Journal of Economics and Statistics. Access to further articles of this special issue can be obtained at www.degruyter.com/journals/jbnst.


Received: 2016-10-26
Revised: 2017-04-24
Accepted: 2017-05-15
Published Online: 2018-02-13
Published in Print: 2018-07-26

© 2018 Oldenbourg Wissenschaftsverlag GmbH, Published by De Gruyter Oldenbourg, Berlin/Boston