Most scientists consider randomized experiments to be the best available method for establishing causality. During the past twenty-five years, randomized experiments have become common on the Internet, where they are often referred to as A/B tests. For practical reasons, much A/B testing does not use pseudo-random number generators to implement randomization. Instead, hash functions are used to transform the distribution of identifiers of experimental units into a uniform distribution. Using two large industry data sets, I demonstrate that the success of hash-based quasi-randomization strategies depends greatly on the hash function used: MD5 yielded good results, while SHA512 yielded less impressive ones.
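The hash-based assignment strategy described above can be sketched in a few lines of Python. The salt string, the bucket count, and the 50/50 split below are illustrative assumptions, not details taken from the paper:

```python
import hashlib

def assign_variant(unit_id: str, salt: str = "exp-42", n_buckets: int = 100) -> str:
    """Deterministically assign an experimental unit to a group via MD5.

    The identifier is combined with an experiment-specific salt so that
    different experiments yield (approximately) independent assignments.
    """
    digest = hashlib.md5(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_buckets  # approximately uniform on 0..n_buckets-1
    return "treatment" if bucket < n_buckets // 2 else "control"

# The assignment is stable: the same unit always lands in the same group,
# with no need to store the assignment anywhere.
assert assign_variant("user-123") == assign_variant("user-123")
```

Because the hash output is (approximately) uniform, roughly half of all units fall into each group, without any coordination across servers.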
I thank the James M. Kilts Center for Marketing at the University of Chicago Booth School of Business for making the Dominick’s data available. The views expressed in this article are those of the author and not of Microsoft Corporation. For comments and suggestions on previous drafts of this paper, I thank the co-editor, Harry J. Paarsch, as well as two anonymous referees. All remaining errors are, of course, mine.
Fisher, R.A. (1935), The Design of Experiments. Edinburgh, UK: Oliver and Boyd.
Gilbert, S., Lynch, N.A. (2002), Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News 33: 51–59. doi:10.1145/564585.564601.
Graham, R.L., Knuth, D.E., Patashnik, O. (1994), Concrete Mathematics: A Foundation for Computer Science. Reading, MA, USA: Addison-Wesley.
Gueron, S., Johnson, S., Walker, J. (2011), SHA-512/256. Proceedings of the 2011 Eighth International Conference on Information Technology: New Generations, pp. 354–358. doi:10.1109/ITNG.2011.69.
Harris, R.P., Helfand, M., Woolf, S.H., Lohr, K.N., Mulrow, C.D., Teutsch, S.M., Atkins, D., Methods Work Group, Third US Preventive Services Task Force (2001), Current Methods of the US Preventive Services Task Force: A Review of the Process. American Journal of Preventive Medicine 20 (3 Suppl). doi:10.1016/S0749-3797(01)00261-6.
Knight, F.H. (1921), Risk, Uncertainty, and Profit. Boston, MA: Hart, Schaffner and Marx.
Matsumoto, M., Nishimura, T. (1998), Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator. ACM Transactions on Modeling and Computer Simulation 8: 3–30. doi:10.1145/272991.272995.
Paarsch, H.J., Golyaev, K. (2016), A Gentle Introduction to Effective Computing in Quantitative Research: What Every Research Assistant Should Know. Cambridge, MA, USA: MIT Press.
A An example of hashing
A wonderful illustration of hashing is provided in Graham et al. (1994). Consider a data structure that consists of a collection of key-value pairs. For example, a retailer might have a database of customers and store a number of data values for each customer. Imagine a key-value store (dictionary) in which keys are the names of customers and values are arbitrary arrays of customer-specific data. In practice, unique, randomly generated customer identifiers would likely serve as keys, because in a large database the likelihood of naming collisions among customers becomes too high. Using customer names as keys, however, facilitates exposition.
As an example, consider locating a particular customer within the database; specifically, assume that a total of n customers are recorded in the customer database. A naïve implementation of storage and retrieval would involve maintaining a lookup table KEY[i] for each customer i = 1, …, n. Searching for data concerning a specific customer K would then involve the following steps, starting from i = 1:

1. If i > n, stop and return “failure”.

2. If KEY[i] = K, stop and return “success”.

3. Increase i by 1 and go back to step 1.
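The steps above can be sketched in Python as follows (using 0-based indexing, whereas the text counts records from 1):

```python
def sequential_search(keys, target):
    """Naive lookup: scan the key list front to back until target is found."""
    i = 0  # Python lists are 0-indexed; the text's i = 1 corresponds to index 0
    while True:
        if i >= len(keys):        # step 1: ran past the end -> failure
            return None
        if keys[i] == target:     # step 2: match -> success
            return i
        i += 1                    # step 3: advance and repeat

assert sequential_search(["Nora", "Glenn", "James"], "James") == 2
assert sequential_search(["Nora", "Glenn", "James"], "Jennifer") is None
```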
In the worst-case scenario, finding data concerning customer K involves n reads from the list of keys if one decides to traverse it in this fashion. In a real-world application, where n can easily be in the billions, this method is too slow. Hashing was designed to speed up the process by storing the keys in m smaller lists.
Formally, a hash function h maps a key K into a list number between 1 and m. In addition, two auxiliary tables are created, FIRST and NEXT, where FIRST[j] points to the first record in list j, and NEXT[i] points to the “next” record after record i in its list.
As an illustration, let h(K) be based on the first letter of the customer’s name K as follows:

h(K) = 1 if the first letter is between A and F,

h(K) = 2 if the first letter is between G and L,

h(K) = 3 if the first letter is between M and R, and

h(K) = 4 if the first letter is between S and Z.
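This first-letter hash function translates directly into Python:

```python
def h(key: str) -> int:
    """First-letter hash into one of m = 4 lists, as defined in the text."""
    first = key[0].upper()
    if "A" <= first <= "F":
        return 1
    if "G" <= first <= "L":
        return 2
    if "M" <= first <= "R":
        return 3
    return 4  # S through Z

assert h("Nora") == 3   # N is between M and R
assert h("Glenn") == 2  # G is between G and L
```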
Before any data are recorded, set n = 0. To formalize the notion of an empty list, set FIRST[j] = -1 for j = 1, …, m. In addition, set NEXT[i] = 0 to denote when record i is the last entry in its list. Armed with these definitions, one can begin inserting data into this new structure.
Assume that the first customer who needs to be recorded is named Nora. The hash function defined above will insert Nora’s record into list 3, since the first letter of her name, N, is between M and R. Now n = 1, FIRST[3] = 1, and NEXT[1] = 0, and all other values of FIRST and NEXT are so far unchanged. Let the name of the second customer be Glenn. Inserting him into the database would change n to 2 and set FIRST[2] = 2 and NEXT[2] = 0, with no other changes. Now suppose that the third customer is named James. Adding him would result in n = 3, NEXT[2] = 3, and NEXT[3] = 0. With three records in the database, the entire structure now looks as follows:
[–] Number of records: n = 3, and number of hash buckets: m = 4.
[–] Available keys: KEY[1] = Nora, KEY[2] = Glenn, KEY[3] = James.
[–] Indices of first records in each hash bucket: FIRST[1] = -1, FIRST[2] = 2, FIRST[3] = 1, FIRST[4] = -1.
[–] Indices of next records in each hash bucket after the first one: NEXT[1] = 0, NEXT[2] = 3, NEXT[3] = 0.
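The insertion procedure can be sketched in Python as a check; running it on Nora, Glenn, and James reproduces the FIRST and NEXT tables above. Dictionaries are used so that records can be 1-indexed as in the text, and the sentinel values (-1 for an empty list, 0 for the last record) follow the definitions above:

```python
M = 4                                  # number of hash buckets
KEY = {}                               # record index -> key
FIRST = {j: -1 for j in range(1, M + 1)}  # -1 marks an empty list
NEXT = {}                              # 0 marks the last record in its list
n = 0                                  # number of records so far

def h(key):
    """First-letter hash into one of m = 4 lists."""
    first = key[0].upper()
    if "A" <= first <= "F":
        return 1
    if "G" <= first <= "L":
        return 2
    if "M" <= first <= "R":
        return 3
    return 4

def insert(key):
    """Append a new record to the end of its hash list."""
    global n
    n += 1
    KEY[n] = key
    NEXT[n] = 0               # the new record is last in its list
    j = h(key)
    if FIRST[j] == -1:        # list j was empty: the new record starts it
        FIRST[j] = n
    else:                     # walk list j to its end, then link the new record
        i = FIRST[j]
        while NEXT[i] != 0:
            i = NEXT[i]
        NEXT[i] = n

for name in ["Nora", "Glenn", "James"]:
    insert(name)

assert FIRST == {1: -1, 2: 2, 3: 1, 4: -1}
assert NEXT == {1: 0, 2: 3, 3: 0}
```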
Fast-forwarding the example, assume that 18 customer records have been inserted into the database; things now look like the following:
[Table: the customer records arranged into List 1 through List 4.]
From this example, one can see that, in the worst case, it would take six steps to locate a record using these four lists. With 18 total records, this is a three-fold speed-up in search and retrieval. In the average case, the time it takes to locate a record falls from roughly n/2 to n/(2m); when n and m are large, this makes a big difference. A precise search algorithm for an entry K would look as follows:
1. Set j = h(K) and i = FIRST[j].

2. If i ≤ 0, stop and return “failure”.

3. If KEY[i] = K, stop and return “success”.

4. Set i = NEXT[i] and return to step 2.
To locate Jennifer in the above example with 18 records, set j = 2, since J is between G and L, and i = FIRST[2] = 2, since Glenn is the first entry in the second list. Because Glenn is not Jennifer, update i to 3, since NEXT[2] = 3, that is, James is the next record in the second list after Glenn. Now, James is also not Jennifer, so i is updated to NEXT[3]. The exact value would be determined by when Jennifer was added to the database, but the important part is that now KEY[i] = Jennifer, that is, the search terminates successfully after three iterations.
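The search algorithm, and the Jennifer walk-through in particular, can be sketched in Python. The state below is hand-built for a four-record illustration; Jennifer is assumed to be record 4, though her actual index would depend on insertion order:

```python
def h(key):
    """First-letter hash into one of m = 4 lists."""
    first = key[0].upper()
    if "A" <= first <= "F":
        return 1
    if "G" <= first <= "L":
        return 2
    if "M" <= first <= "R":
        return 3
    return 4

def search(key, KEY, FIRST, NEXT):
    """Hash-list lookup following the four steps in the text."""
    i = FIRST[h(key)]          # step 1: pick the list and its first record
    while True:
        if i <= 0:             # step 2: sentinel (-1 empty list, 0 end of list)
            return None
        if KEY[i] == key:      # step 3: match -> success
            return i
        i = NEXT[i]            # step 4: follow the chain

# Hand-built state after inserting Nora, Glenn, James, and Jennifer (records 1-4):
KEY = {1: "Nora", 2: "Glenn", 3: "James", 4: "Jennifer"}
FIRST = {1: -1, 2: 2, 3: 1, 4: -1}
NEXT = {1: 0, 2: 3, 3: 4, 4: 0}

assert search("Jennifer", KEY, FIRST, NEXT) == 4   # Glenn -> James -> Jennifer
assert search("Alice", KEY, FIRST, NEXT) is None   # list 1 is empty
```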
This article is part of the special issue “Big Data” published in the Journal of Economics and Statistics. Access to further articles of this special issue can be obtained at www.degruyter.com/journals/jbnst.
© 2018 Oldenbourg Wissenschaftsverlag GmbH, Published by De Gruyter Oldenbourg, Berlin/Boston