We describe our digitization of a uniquely detailed study of 19th century production methods assembled by the United States Department of Labor (1899). The staff spent five years collecting and assembling data on the production of hundreds of highly specific products (as well as some services) at the production operations level using traditional artisanal (“hand”) methods and by the (then) most modern “machine” methods, measuring productivity in terms of the time taken to complete a specific task or set of tasks. The data proved too complex and voluminous to use, except as a source of anecdotes, until now. We describe how we have made these invaluable data from the first industrial revolution tractable to modern analysis and how they might be used to provide insight and perspective into the effects of robotics and artificial intelligence on labor during the third industrial revolution.
The rapid diffusion of artificial intelligence (AI) and robotics in modern manufacturing has generated a robust debate over automation and the future of work. An analogous debate occurred in the United States during the second half of the 19th century when manufacturing first underwent widespread mechanization. The disruptive impact of labor-saving machines and the associated adoption of inanimate power sources on existing employment relations became a major flash point. For example, David Ames Wells, a highly influential American economist and political adviser of the time, wrote in 1889 that “the increasing frequency of strikes and industrial revolts [...] have been largely prompted by changes in the conditions of production resulting from prior labor-saving inventions and discoveries” and that “the depression of industry in recent years has been experienced with greatest severity in those countries where machinery has been most extensively adopted [...]”. Indeed, the events described by Wells even inspired a utopian science fiction novel, Looking Backward, which prophesied the eventual adoption of a universal basic income to address widespread technology-driven labor displacement. It quickly became one of the era’s bestsellers. Mechanization, with its associated speed and physical dangers and the specter of inexhaustible inanimate power driving it, also played into an ongoing debate regarding hours of work and worker fatigue.
Politicians paid close attention to the societal debate underway. In the early 1890s, Congress issued a joint resolution directing the Commissioner of Labor to “investigate and report upon the effect of the use of machinery upon labor and the cost of production, the relative productive power of hand and machine labor […] and whether changes in the creative cost of products are due to a lack or surplus of labor or to the introduction of power machinery”.  As a contemporary would observe “such a task would seem sufficient to stagger any ordinary statistician, even with the resources of the department of labor behind him” but the Commissioner of Labor, Carroll Davidson Wright, was not deterred.  The investigation took almost five years to complete and was published as the Commissioner’s thirteenth annual report. It totals a little more than 1,600 pages in two volumes, perhaps two-thirds of which are complex tables that both summarized and detailed hand and machine labor data.  Hereafter we will refer to the published report as the “HML study” – the Hand and Machine Labor study.
Ideally, Wright sought to measure the average time and the labor cost savings by comparing what he called “hand methods” of manufacturing to “machine methods” and calculate the time savings arising from inanimate power – in other words, to measure the quantitative importance of “mechanization”. The hand methods that he had in mind “should not be construed to mean a method whereby a product is made entirely by the unaided hand and absolutely without the use of machines but rather as the primitive method of production which was in vogue before the general use of automatic or power machines” [emphasis added] and adopted in the artisan shops that provided locally produced goods for the domestic market in the early nineteenth century.  The proverbial artisan shop in the United States was very small, often an artisan working alone or with an assistant or two.  The tools employed were simple, operated by human muscle (or, in some cases, using animal power). By contrast, the “machine methods” were those of the state-of-the-art mechanized factory, often with hundreds or thousands of workers, in which machinery powered by inanimate means, especially steam, was used in the production process and where intensive division of labor was practiced. 
Publication of the report satisfied the Congressional mandate, and the tables would provide a rich source of anecdotes about changes in production methods for specific goods. However, Wright was careful to warn the reader “it has not been possible to summarize general results in such a way as to show the displacement and expansion of labor in various directions […]”  – that is, he and his team of agents were unable to answer the two questions that motivated the study. The reasons are simple – they lacked the computational means to organize the enormously complex information along with a statistical framework so that it could be analyzed and interpreted systematically. The HML study is not obscure – it was widely publicized at the time and well-known to previous generations of historians. However, past generations of scholars judged it far too difficult to use, even after the invention of modern computers. 
Recently, we have digitized the printed HML study, coding and restructuring these historical data in ways that make them tractable to systematic investigation using modern econometric techniques. The final product creates a historical lens through which changes in today’s economy can also be viewed in long-term perspective. This paper details our efforts to make the historical data on machine versus hand production usable: implementing the study’s crosswalk and assembling blocks of comparable production tasks, extracting and coding the gerunds that describe task-level activities, and coding the skill levels of the listed occupations and other text. Where possible, we illustrate the different steps with concrete examples from the HML study.
Over the past several decades historians, economic and otherwise, have made impressive gains in digitizing historical documents. In the American case, it is fair to say that the greatest progress has been in digitizing the various historical published census volumes and their underlying micro-data.  However, there are other historical government documents that are more complex than census volumes and which, to this day, remain only in hard copy form. Until recently, the HML study was one of these, perhaps the supreme example. We believe this discussion will therefore be instructive for researchers facing the challenges of digitizing similarly complicated historical data.
2 The HML Study: A Primer
The basic observation in the HML study was called a “unit”. We have retained the HML nomenclature for simplicity, confusing though it might be at times, elaborating and explaining their unique terminology as necessary. Fundamentally, a “unit” in the HML is a quantity (rather than an entity) of a highly specific, precisely defined good. While the term is singular (the product), the reported data are actually for two different establishments, one producing the good using the “hand method” and the other producing the exact same good (so far as precise matching was possible) using the “machine method.” An HML “unit,” thus, is something of a cross between SIC/NAICS (Standard Industrial Classification/North American Industry Classification System) industry codes, which were developed to classify establishments by the activity in which they were engaged, and the now-familiar 12-digit Universal Product Codes (and bar codes) that uniquely identify a product (including its maker). For example, Unit 71 details the production of 100 pairs of “men’s medium grade, calf, welt, lace shoes, with single soles and soft box toes”. The hand producer in this unit was a shoemaker working alone; the machine producer was a factory (or a part thereof), employing 371 different workers, excluding all persons not directly engaged in the manufacture of that good, the supervision of its production, or the inspection of the product regardless of how important such work might be to the overall success of the venture. Online Appendix Figure 1 (https://doi.org/10.1515/jbwg-2023-0002) displays the left- and right-hand pages from the HML study for some of the tasks involved in producing these shoes and provides a snapshot of the raw data that we are digitizing. Note that important data appear not only in the tables themselves but also in the headers to the tables.
Not surprisingly, production by these different methods as originally recorded typically took place at very different scales. Where necessary (and it was in most cases, especially in hand production), production was scaled to industry quantity or scale norms by adjusting the time (and thus the cost) spent on tasks by the appropriate factor, keeping the number of workers unchanged.
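A minimal sketch of this kind of rescaling follows. It assumes the adjustment is strictly proportional to quantity (the study's exact procedure may have differed), and the function name, parameter names, and example figures are hypothetical.

```python
def scale_task_time(observed_minutes: float,
                    observed_quantity: float,
                    standard_quantity: float) -> float:
    """Rescale a task's time from the observed to the standard quantity.

    Assumption of this sketch: time scales in proportion to quantity
    while the number of workers is left unchanged.
    """
    if observed_quantity <= 0 or standard_quantity <= 0:
        raise ValueError("quantities must be positive")
    factor = standard_quantity / observed_quantity
    return observed_minutes * factor


# Hypothetical figures: 600 minutes observed for 10 pairs of shoes,
# rescaled to a standard quantity of 100 pairs.
print(scale_task_time(600.0, 10, 100))  # 6000.0
```

Because labor cost in the study follows from time and the rate of pay, the same factor would carry through to cost under this proportionality assumption.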
A unit was thus a matched pair of production operations or “chains” making the identical product by different methods. All data for the machine half of a unit came from contemporaneous sources. However, only about a quarter of the hand-producing halves of units were contemporaneous, the remaining hand producers having exited the business. For these, the data relied upon historical records or contemporaneous memory.
The data were reported in two volumes by the HML staff. In Volume One, for each unit, the following information was reported separately for hand and machine production: an industry classification, the product, the quantity (actual and standardized), the year in which the production under each method took place, the number of separate tasks of production, the number of different workers employed, the total number of hours of work to produce the given quantity, the total labor costs, and the average daily hours of operation of the producers in the unit. In Volume Two, as illustrated in Online Appendix Figure 1, for each separate operation in the production (hand vs. machine) of each good (= table row), there is a written description of the operation, in the order in which it was performed within machine production (repeated in both the lefthand column on a verso page and the rightmost column of a recto page); a list of capital goods (tools and the like or machines) used in the task; the type of motive power (if used); the number of workers assigned to each machine (if used in the task); the number, age, and gender of the workers employed in the task; the occupational titles of the workers employed in the task; the hours of work by each employee engaged in the task; the labor cost of each employee engaged in the task based on their rate of pay; and miscellaneous information (= table columns). While the same data were reported for hand production and machine production, the hand production tasks were grouped such that the intermediate output entering an operation or group of operations and that emerging at the conclusion were at the same stage of manufacture as in machine production, with a crosswalk between hand and machine labor production being provided based upon the sequential numbering of machine-production operations. Making proper use of the HML crosswalk is essential to using the operations-level data for statistical analysis (see below).
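One way to picture a Volume Two table row is as a typed record. The sketch below is illustrative only: the class name, field names, and example values are hypothetical and do not reproduce the variable names or data in our files.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskRecord:
    """One operation-level row for one production method (hypothetical schema)."""
    unit: int                      # HML unit number
    method: str                    # "hand" or "machine"
    crosswalk_label: str           # marginal label linking hand and machine rows
    description: str               # written description of the operation
    capital_goods: List[str] = field(default_factory=list)  # tools or machines
    motive_power: Optional[str] = None   # e.g. "steam"; None if no power used
    n_workers: int = 1             # workers employed in the task
    occupations: List[str] = field(default_factory=list)
    hours: float = 0.0             # total hours of work in the task
    labor_cost: float = 0.0        # labor cost in dollars


def total_hours(records: List[TaskRecord], unit: int, method: str) -> float:
    """Sum task hours for one unit and one production method."""
    return sum(r.hours for r in records if r.unit == unit and r.method == method)


# Hypothetical rows, not values taken from the study.
rows = [
    TaskRecord(1, "hand", "1", "Cutting", ["knife"], None, 1, ["Cutter"], 2.5, 0.50),
    TaskRecord(1, "hand", "2", "Sewing", ["needle"], None, 1, ["Sewer"], 4.0, 0.80),
    TaskRecord(1, "machine", "1", "Cutting", ["die press"], "steam", 2, ["Cutter"], 0.5, 0.15),
]
print(total_hours(rows, 1, "hand"))  # 6.5
```

Summing the operation-level rows in this way should reproduce the unit-level totals of hours and cost reported in Volume One, which offers a simple consistency check on the data entry.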
The central goal of the HML study was to measure differences between the “most modern machine method[s]” compared with the “old fashioned hand process […] in vogue before the general use of automatic or power machines”. Commissioner of Labor Wright himself had sparked some of this interest when, in his first annual report to Congress, he had drawn attention to the “temporary displacement of labor and to conditions of industry and of society which would exist without the presence of power machinery” and provided numerous examples.  In small arms production, for example, one man using conventional hand tools, turned and fitted one musket stock per ten-hour day whereas using specialized machines and dividing the tasks between them, three men could turn and fit between 125 and 150 musket stocks per day, a 40 to 50-fold productivity gain. Similarly, data from boot and shoe manufacturers suggested an 80 percent savings in labor for machine over handicraft production.  These were quoted frequently and the Wisconsin Bureau of Industrial and Labor Statistics subsequently did its own study of 49 trades reporting the “product per day of hand and machine labor,” but without pairing observations and matching products. 
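The musket-stock comparison can be checked with a few lines of arithmetic. This sketch simply divides daily output by the number of workers, using the figures quoted in the paragraph above; the function name is hypothetical.

```python
def per_worker_output(units_per_day: float, workers: int) -> float:
    """Daily output per worker."""
    return units_per_day / workers


hand = per_worker_output(1, 1)            # one man, one stock per day
machine_low = per_worker_output(125, 3)   # three men, 125 stocks per day
machine_high = per_worker_output(150, 3)  # three men, 150 stocks per day

gain_low = machine_low / hand
gain_high = machine_high / hand
print(round(gain_low, 1), round(gain_high, 1))  # 41.7 50.0
```

The per-worker gain of roughly 42 to 50 times is consistent with the "40 to 50-fold" figure in the text.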
Such anecdotal information, however, offered only a very partial and incomplete picture of what was going on. The HML study provided much more detail. Even so, Wright cautioned from the start that the words “hand” and “machine” were not used in their strictest sense but rather to characterize two different methods of production. Machines were used in “hand” production although these were of the simplest kind – saws, hammers, chisels, knitting needles and the like – what he called “the primitive method of production which was in vogue before the general use of automatic or power machines.” Similarly, under machine production, not every operation was performed by a machine. A key distinction was that in machine production “every workman has his particular work to perform, generally but a very small portion of that which goes to the completion of the article”.  In other words, the division of labor as much as the use of powered or automatic machinery itself was an identifying feature of machine production. This distinction is reinforced by our empirical results to date.
The raw data were collected by trained agents, compiling the information from direct observation or current and historical written records. For the machine production units, most of the observations pertain to production activities occurring in the mid-to-late 1890s (1894-98). However, in many cases, the staff was unable to find matching hand units from the same year, presumably because these were no longer in existence (itself a clue to their relative productivity). In such cases, the agents relentlessly (and successfully) sought out historical records. For each establishment, the year to which the data refer was recorded as part of the table header and in the discussion. Where no US-based hand producer could be found, agents found establishments overseas like those which no longer survived in the United States. However, all machine production data were drawn from domestic producers. 
In preparing their report, the HML staff paid careful attention to quality differences between the hand and machine products noting any differences (generally more favorable to the machine-made item than the hand) in the text describing the production processes in each unit. They also tracked (so far as possible and with the same level of detail) the production of any intermediate purchased inputs, injecting the “external” data into the mix at the appropriate time in the production process. Activities such as “Furnishing Power” (where appropriate) were reported at the end of all tasks that relied upon it. The same was true of supervision. This helped us identify outsourced activities.
Compared with the late 19th century federal census of manufacturing (which Wright also helped oversee), the data collected by the HML study were vastly more detailed and complex. Indeed, the complexity totally overwhelmed statisticians at the time, as Wright himself noted. Such difficulties, however, did not stop Wright from offering summary opinions about the scope of technological unemployment, for example in a lecture at the prominent Chautauqua Institution. Nor was Wright alone in asserting conclusions based on data that were beyond his analytical reach. An anonymous article published by Scientific American in 1900 also made use of the HML study to claim: “Modern machinery, again, has so greatly enlarged the productive power of the workman that it becomes possible to pay him wages far in advance of those earned by his hand-labor predecessor, and the same labor-saving devices, while raising his wage, have increased its purchasing power by lowering the cost of food and clothing and many of the luxuries of life. Hence, the automatic machine is not, as the agitator will even yet suggest, the enemy of labor, but is in every respect its best friend.” Whatever the merits (or demerits) of these commentaries, they were not based on any comprehensive investigation or analysis of the HML study. Given the complexity of the data relative to the information technologies available to researchers at the start of the twentieth century, such an investigation was beyond their powers although, as a contemporary would observe, “on every page the economist will find new and striking illustrations of the advantages of the division of labor, of production on a large scale and of the factory system”.
3 Digitizing the HML Study: PDF Files
As a federal government document, hard copies of the HML study are widely available in libraries, especially those designated as U.S. government depositories.  The Portable Document Format (PDF) files that we used to digitize the HML study were created ca. 2005-06 from the copy held at the University of North Carolina library at Chapel Hill.  These were imported into Excel spreadsheets the “old-fashioned” way – that is through (very laborious) manual entry by Digital Divide Data, a social enterprise employing survivors of war in Cambodia.  The spreadsheets required extensive proofing and correction by the authors despite DDD’s diligent efforts. Indeed, this process of (minor) correction continues almost twenty years later as we discover new typographical errors and other issues when focusing upon specific fields or aspects connected with our research. More significantly, the digitized product from DDD, whether fully proofed or not, still has all the complexity that bedeviled efforts by earlier generations – including those of Wright and the Bureau of Labor – to use the data collected.
PDF files of the HML study are now widely available and readily accessed. For example, the HathiTrust,  a not-for-profit collaborative venture between academic and research institutions serving primarily member institutions has two different copies – one from the University of California at Berkeley library,  the other from Columbia University.  Although we have the DDD-digitized data at our fingertips, we still sometimes need to refer to the original printed text. We have found the Columbia University images (in general) to be better for our purposes, doubtless because it was scanned at a higher resolution (612x792 v. 414x667). However, these files are not always perfect – there are missing pages, out-of-focus images, images that capture the hands of the human operator and obscure part of the document, excessive skew, and so on. Indeed, in our current “master” PDF copy – the HathiTrust copy of Columbia University’s volumes – we swapped out pages 712-721 in Volume Two for those from a superior Google Books version. 
The technical problems involved in producing a usable scanned version of the HML study are numerous. Invariably, the binding on a surviving copy is fragile with rotted signature stitching from contact with the cheap acidic wood-pulp paper commonly used in late nineteenth century publications. The lignin in the paper has caused darkening of the pages reducing the contrast between the background and the type while the paper itself has become brittle – problems familiar to all of us who have photocopied such documents.  Some of this damage is visible, for example, in the darkening of the righthand image in Online Appendix Fig. 1. These issues are exacerbated by the format used by the Government Printing Office to print the data. The tables are set in approximately 6-point type. Moreover, because of the large quantity of data relating to each observation, all important tables – the summary tables in Volume One and the task-level tables in Volume Two – are printed verso-recto with less than perfect registration – that is, horizontal continuity – between the even (verso) and odd (recto) numbered pages thanks to the binding. Opening the book flat certainly damages its already fragile structure. Fortunately (for us, anyway), while the tables extend across both pages of the opened book, the print itself does not span the entire width as a single block so that the binding does not obscure or destroy data. Moreover, each “row” in each table on a page has a reference number at the lefthand edge of the verso page and the righthand edge of the recto page (see Online Appendix Figure 1 above) providing registration.
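The row reference numbers printed at both outer edges make it possible to rejoin the two halves of a table row after each page has been keyed separately. The following is a minimal sketch of that idea, assuming each page has already been entered as a mapping from reference label to a list of cell values; the function name and the example cells are hypothetical.

```python
from typing import Dict, List, Tuple


def join_facing_pages(
    verso: Dict[str, List[str]],
    recto: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Join verso and recto cells that share a row reference label.

    Returns the joined rows plus a list of labels found on only one
    page, which would need to be checked against the printed original.
    """
    joined: Dict[str, List[str]] = {}
    unmatched: List[str] = []
    for ref, cells in verso.items():
        if ref in recto:
            joined[ref] = cells + recto[ref]
        else:
            unmatched.append(ref)
    unmatched.extend(ref for ref in recto if ref not in verso)
    return joined, unmatched


# Hypothetical cell contents for illustration only.
verso_page = {"1": ["Sorting leather", "Hand"], "2": ["Cutting vamps", "Knife"]}
recto_page = {"1": ["1", "0.5"], "3": ["2", "1.0"]}
rows, problems = join_facing_pages(verso_page, recto_page)
print(rows)      # {'1': ['Sorting leather', 'Hand', '1', '0.5']}
print(problems)  # ['2', '3']
```

Labels that appear on only one page are a useful flag for skipped or mis-keyed rows.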
With any printed data, Optical Character Recognition (OCR) always seems like a seductive, alluring technology for data entry. It has evolved substantially from the expensive, slow, and dedicated Kurzweil machines of the 1970s and 1980s though its principles remain unchanged – pattern recognition of graphic images. The “state of the art” (for personal computers) is ABBYY FineReader (currently in version 15). Nevertheless, while OCR is orders of magnitude faster than the human eye and fingers in making decisions and rendering them digitally, it remains an imperfect technology. It makes mistakes – recognizing errant marks as characters, for example – that the human eye and brain instantly reject. Such rejection is usually context-based. ABBYY, to reduce such errors, divides images up into zones such as “text,” “table,” and “picture,” treating each separately. Pictures in ABBYY are simply copied as a pattern to be rendered as accurately as possible with no interpretation. ABBYY interprets text as a series of sequential characters (“read” left-to-right, right-to-left, top-to-bottom, or bottom-to-top, depending on the language) whose pattern (including blank space) is to be matched. “Table” recognizes the text content while preserving the horizontal and vertical structure of the matrix and relative positioning.
In practice, this is not necessarily as straightforward as it sounds. For example, text regions may still have a complex structure. Taking advantage of progress in semi-supervised learning and self-training, Shen and co-authors extend this broad pattern-matching approach and have proposed a framework they call Object-Level Active Learning Annotation – OLALA – to streamline document recognition. Their objective is to select for human inspection and annotation only those areas of an image with the most ambiguous predictions and to learn from that experience. We did none of this and are not convinced (at this stage of development in the technology) that it would have improved our product or reduced its cost.
Recently, we learned of another effort to digitize the HML data by Christophe Combemale, a doctoral student at Carnegie Mellon University in Engineering and Public Policy.  He used a different and novel approach, best described as a blend of traditional human proof-reading by students combined with an initial pass at automated data entry. Combemale used the “Data from Picture” feature found in the Microsoft Excel app on smartphones. The feature appears under the “Insert” menu in the Excel app and the data is accessed directly from the camera (= “Data from Picture”) and the resulting text is rendered to column data. Essentially this same feature also appears in the personal computer versions of Excel although there the expectation is that the data are already in some sort of graphic image (including PDF). The resulting un-proofed OCR version is far from satisfactory (Online Appendix Figure 2) although, in fairness, in the Macintosh version of Excel used for our test runs, the cells are mostly tagged “Low confidence cell” in the OCR window.  Virtually no cell entry was rendered correctly. Interestingly, a photograph made with an iPhone 8 of just a small part of the Macintosh computer monitor screen of this same PDF image was recognized more accurately using the Excel app on the smartphone (Online Appendix Figure 3). However, errors still are present.
The OCR output from ABBYY FineReader is vastly superior to that imported through Excel directly (Online Appendix Figure 4). It can also be cleaned up relatively quickly. For example, the dot leaders used by the US Government Printing Office to help one read the rows correctly are quickly removed (replacing them with nothing using “Find/Replace”). It is also easy to remove the (haphazard) cell borders. Even so, the rendition is still highly imperfect, and we remain unconvinced that developments in machine methods to date would have made data entry for the HML study faster or more cost-efficient.
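The dot-leader cleanup can also be scripted instead of done through a Find/Replace dialog. The sketch below is not the procedure used here; it assumes that leaders appear in the OCR text as runs of three or more periods, possibly separated by spaces, and the function name and sample strings are hypothetical.

```python
import re

# Three or more periods in a row, each optionally followed by spaces or tabs.
# A single period, as in a decimal such as "12.5", is left alone.
_DOT_LEADER = re.compile(r"(?:\.[ \t]*){3,}")


def strip_dot_leaders(cell: str) -> str:
    """Remove dot leaders from an OCR'd cell and tidy the whitespace."""
    without_leaders = _DOT_LEADER.sub(" ", cell)
    return " ".join(without_leaders.split())


print(strip_dot_leaders("Cutting vamps . . . . . . 12.5"))  # Cutting vamps 12.5
print(strip_dot_leaders("Sorting leather......"))           # Sorting leather
```

Requiring at least three periods is a deliberate choice to protect decimal values, at the cost of missing a leader that the OCR engine reduced to one or two dots.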
4 The HML Crosswalk
Digitizing the HML study was only the first step in creating a usable and useful data set. To answer the questions posed by Carroll Wright, it is necessary to construct and implement the HML “crosswalk”. The crosswalk is what permits the researcher to make consistent comparisons between hand and machine labor at the production operation level. However, this construction and implementation was time-consuming due to the crosswalk’s inherent complexity. The HML study is organized by product group; and, within each group, by unit. Data for the hand producer are listed first, followed by the machine producer.  To understand the HML crosswalk, it is useful to follow a particular example. The example we have chosen is Unit 71, “100 pairs of men’s medium grade, laced shoes” (see Online Appendix Figure 5 for machine production and Online Appendix Figure 6 for hand production – the concluding task lines in hand production of this good also appear as part of Online Appendix Figure 1).  The shoe size is not specified but is (implicitly) assumed to be different for each pair (otherwise some economies could have been achieved by not having to make 100 pairs of lasts – the wooden forms around which shoes are shaped and assembled – as these were durable).
In machine production, the HML staff identified 173 separate tasks (only the first 54 tasks are shown in Online Appendix Figure 5), listing (and numbering) in the order in which they were performed, from tanned hides to finished shoe. The numbering appears at the left edge of verso pages and the right edge of recto pages. Operations included not only those directly related to the manufacture of shoes, like sorting leather, cutting out the vamps (the main part of the shoe between the toe and the laces), the quarters (the heel portions), toes, soles, insoles, and heels and sewing these together around the last to form the shoe and punching holes for the laces, but also tasks finishing the shoes like smoothing the welts, waxing and polishing, matching pairs, stamping with the maker’s name and size and boxing for shipment. The data also include the operations to keep the shoe-making machinery in good order and maintain and fire the steam engine that powered the various machines. Some of the operations, like sorting, required nothing more than a good eye. Others, like cutting out the shoe parts, still used basic hand tools (scissors and knives) rather than steam-powered die presses. Eighty of the tasks, however, including trimming, making eyelets, nailing heels, polishing and buffing, made use of steam power driving specialized machines.
Hand production involved 72 operations (44 of which are shown in Online Appendix Figure 6) considerably fewer than machine production. Shoemaking by hand began with the individual shoemaker tracing around each foot to create a cutting pattern and carving a last. These steps were crucial for the fit of the shoe and would be repeated for each foot of each customer served by the shoemaker. Producing lasts was time consuming, taking 54 minutes 24 seconds per pair – almost 92 hours for the production run of 100 different pairs of shoes (see the circled entry in Online Appendix Figure 6). As in machine production, the leather for the shoes was sorted and selected but in hand production this was treated as a single task, presumably so that the uppers for a single pair of shoes could come from the same hide and thus be better matched.
Whereas the lefthand and righthand marginal labeling of tasks in the machine production tables was regular and ordered, the labeling of tasks in hand production was anything but. This is because the HML investigators carefully linked each operation in hand production to the corresponding operation in machine production using the machine operation number. Tasks in hand production, like those in machine production, are listed in the order in which they were performed but that order often differed by production mode. A number alone indicated that the hand operation matched up with the machine operation that had been assigned the same number – that is, the operation(s) began and ended with the product in the same state of completion in both machine and hand labor. Multiple numbers indicated that multiple machine tasks were performed as a single task in hand production. A capital letter denoted a hand labor activity that did not exist in machine production – for example, the making of patterns and lasts by the shoemaker (labelled as operation “A” in Online Appendix Figure 6). The machine producer instead bought the lasts for left and right feet in standard sizes from outside specialist suppliers, leaving customers to buy the shoes that fit them best rather than shoes made to fit them. These patterns and lasts would be used in the fabrication of thousands of pairs of shoes (one of the many advantages of standardizing sizes). Machine tasks that were a part of several hand tasks had lowercase letters appended to the machine task number. The machine operation was represented by the number portion while the position of the appended letter in the alphabet indicated how many preceding sub-operations had already been performed. It is these left- and right-most column entries that provide the crosswalk – created by the HML staff based on their detailed knowledge and observation of processes – between the hand and machine operations within units.
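The labeling rules just described lend themselves to a small parser. The sketch below is an illustration, not our production code: it assumes labels have been keyed as strings such as "12", "12, 13", "A", or "14b", and the class and function names are hypothetical. The printed tables may use other layouts that this does not handle.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CrosswalkRef:
    machine_ops: List[int]       # machine operation numbers referenced
    sub_position: Optional[int]  # 1 for "a", 2 for "b", ...; None if no letter
    hand_only: bool              # True for capital-letter (hand-only) labels


def parse_label(label: str) -> CrosswalkRef:
    """Parse one hand-production crosswalk label (assumed string formats)."""
    text = label.strip()
    # A capital letter: hand activity with no machine counterpart.
    if re.fullmatch(r"[A-Z]", text):
        return CrosswalkRef([], None, True)
    # A number with a lowercase letter: part of a split machine operation.
    match = re.fullmatch(r"(\d+)([a-z])", text)
    if match:
        position = ord(match.group(2)) - ord("a") + 1
        return CrosswalkRef([int(match.group(1))], position, False)
    # One or more plain numbers: one or several machine operations.
    parts = [part.strip() for part in text.split(",")]
    if all(part.isdigit() for part in parts):
        return CrosswalkRef([int(part) for part in parts], None, False)
    raise ValueError(f"unrecognized crosswalk label: {label!r}")


print(parse_label("12, 13"))  # two machine operations done as one hand task
print(parse_label("14b"))     # second sub-operation of machine operation 14
print(parse_label("A"))       # hand-only activity
```

Raising an error on anything unrecognized, rather than guessing, keeps keying mistakes in the margin labels visible during proofing.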
To use these crosswalk entries, however, required an additional step.
We generated a new object, which we call a “block”. A block is a group of tasks identified by the HML staff as “equivalent” between hand and machine production in the sense that the product entering in the block is the same – at the same stage of completion – regardless of the production mode. It is also at the same degree of completion when it exits the block. These blocks are of size H (in hand labor) or M (in machine labor), where H and M are the number of HML staff-identified tasks that were grouped to produce the equivalent product-stage. We then link these blocks within the unit (the paired observation of production facilities producing the good either by “hand” or “machine”). To make the mapping (linkage) complete between the two modes, we also include mappings to zero – that is to those tasks that disappeared in the switch to machine production as well as to those new tasks that emerged in machine production. For example, the “in-house” production of shoe lasts by the custom shoemaker vanished in the switch to the factory, mass-produced product, replaced by purchased forms made in standard sizes by an outside supplier. We refer to these as 1:0 block links. Analogously, the hand producer made no use of steam power to drive nonexistent machinery and so there was no need to maintain and fire a steam engine. Also, since hand producers were small compact operations, there was no need for supervision or monitoring of production by a foreman at physically remote locations. These are identified as 0:1 block links. For all other blocks, there is a mapping of H:M between hand and machine production where H and M are integers greater than zero. These blocks “overlap” between hand and machine production, and thus allow consistent comparisons of production operation characteristics, such as the amount of labor time to complete the activities inherent in the block.
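Given the number of hand tasks (H) and machine tasks (M) in a block, the link type follows mechanically. The sketch below uses the labels from the text; the function names are hypothetical, and labeling any hand-only block "1:0" (and any machine-only block "0:1") regardless of how many tasks it contains is an assumption of the sketch.

```python
def block_link_type(hand_tasks: int, machine_tasks: int) -> str:
    """Classify a block link from its hand and machine task counts."""
    if hand_tasks < 0 or machine_tasks < 0:
        raise ValueError("task counts cannot be negative")
    if hand_tasks == 0 and machine_tasks == 0:
        raise ValueError("a block must contain at least one task")
    if machine_tasks == 0:
        return "1:0"   # hand tasks that vanished under machine production
    if hand_tasks == 0:
        return "0:1"   # tasks that are new in machine production
    if hand_tasks == 1 and machine_tasks == 1:
        return "1:1"
    if hand_tasks == 1:
        return "1:M"   # one hand task split into several machine tasks
    if machine_tasks == 1:
        return "H:1"   # several hand tasks merged into one machine task
    return "H:M"


def is_overlapping(hand_tasks: int, machine_tasks: int) -> bool:
    """True when the block exists in both modes and can be compared."""
    return hand_tasks > 0 and machine_tasks > 0


print(block_link_type(1, 0))  # 1:0, e.g. in-house production of lasts
print(block_link_type(0, 1))  # 0:1, e.g. firing the steam engine
print(block_link_type(1, 3))  # 1:M
```

Only the overlapping links can enter a comparison of labor time, since a 1:0 or 0:1 link has no counterpart on the other side.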
Once the block links are well-defined, it is straightforward, if tedious, to create a dataset for econometric analysis. In our case, we started with an EXCEL spreadsheet that had all of the variables created (and cleaned) from the digitized HML study. We created a STATA dataset from the spreadsheet and then performed operations on it to create the block links and the associated variables (for example, the amount of labor time, or whether steam power is employed). The “final” regression sample consists of block links that will enter into a regression analysis such as Equation 1:

$$\Delta \ln T(a, j) = \beta(j) + \gamma(a) + \lambda \cdot \text{Mechanized}(a, j) + \varepsilon(a, j) \qquad (1)$$
The dependent variable in this regression is the difference (Δ) in the natural logarithm (ln) of the labor time in machine vs. hand labor, in the block link. This is subscripted (j) for units and (a) for block links, and the regression includes dummy variables (a large number of them) for units (β(j)) and block link types (γ(a)). The key independent variable is Mechanized, which is a “one-touch” measure of whether inanimate power (either steam or waterpower) was used in the particular machine labor block.
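A regression of this form can be sketched with ordinary least squares on a design matrix of unit dummies, block link type dummies and the Mechanized indicator. The example below is illustrative only: the function name is hypothetical, the eight observations are made-up, noise-free toy numbers rather than HML values, and no standard errors are computed.

```python
import numpy as np


def ols_mechanized_coefficient(delta_ln_t, mechanized, unit_ids, link_types):
    """OLS estimate of the coefficient on Mechanized with unit and link-type dummies."""
    y = np.asarray(delta_ln_t, dtype=float)
    mech = np.asarray(mechanized, dtype=float)
    units = sorted(set(unit_ids))
    # Drop one link type as the baseline to avoid perfect collinearity
    # with the full set of unit dummies.
    types = sorted(set(link_types))[1:]
    unit_dummies = np.array([[float(u == k) for k in units] for u in unit_ids])
    type_dummies = np.array([[float(t == k) for k in types] for t in link_types])
    X = np.column_stack([unit_dummies, type_dummies, mech])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[-1]  # Mechanized is the last column


# Toy data (hypothetical): three units, three link types, true coefficient -1.0.
unit_ids = [1, 1, 1, 2, 2, 2, 3, 3]
link_types = ["1:1", "1:M", "1:1", "1:1", "H:1", "1:M", "1:1", "H:1"]
mechanized = [0, 1, 1, 0, 1, 0, 1, 0]
unit_effect = {1: -1.5, 2: -2.0, 3: -1.0}
type_effect = {"1:1": 0.0, "1:M": -0.3, "H:1": -0.8}
delta_ln_t = [
    unit_effect[u] + type_effect[t] - 1.0 * d
    for u, t, d in zip(unit_ids, link_types, mechanized)
]

lam_hat = ols_mechanized_coefficient(delta_ln_t, mechanized, unit_ids, link_types)
```

Because the toy outcome is constructed without noise, `lam_hat` recovers the assumed coefficient of -1.0 up to floating-point error; with real data the estimate would of course carry sampling error.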
Table 1 summarizes the block links in the regression sample. The vast majority are 1:1, meaning that the HML staff matched up a single activity in hand production with a single activity in machine production. From a data management point of view, these are (by far) the easiest observations to handle. As Table 1 shows, however, there were other, more complex block links, in which production was reorganized, sometimes dramatically so, in the transition from hand to machine labor. Interestingly, when production was reorganized, it was associated with greater use of inanimate power and, invariably, greater division of labor.
| Block Link Type (Hand:Machine) | Number of Block Links | Mean Fraction Steam, Machine Labor | Mean Fraction Water, Machine Labor | Mean Fraction Mechanized, Machine Labor | Mean Value, Δ ln T |
|---|---|---|---|---|---|
| 1:M, M > 1 | 619 | 0.732 | 0.055 | 0.784 | -1.920 [0.147] |
| H:1, H > 1 | 250 | 0.704 | 0.052 | 0.744 | -2.729 [0.065] |
| H:M, H, M > 1 | 124 | 0.815 | 0.073 | 0.879 | -2.189 [0.112] |
| Total, Regression Sample | 4,405 | 0.522 | 0.032 | 0.552 | -1.761 [0.172] |
Source: J. Atack/R.A. Margo/P. Rhode, ‘Mechanization Takes Command’? Inanimate Power and Production Times in Late Nineteenth Century American Manufacturing, in: Journal of Economic History 82, 2022, p. 672. Computed originally from the digitized HML study, United States Department of Labor, Hand and Machine Labor (Thirteenth Annual Report), Washington DC 1899.
Notes: Block links: 1:1: a single hand labor operation is mapped by the HML crosswalk to a single machine labor operation; 1:M, M > 1: a single hand labor operation is mapped to a block of M machine operations; H:1, H > 1: a block of H hand operations is mapped to a single machine labor operation; H:M: a block of H hand labor operations is mapped to a block of M machine labor operations, H and M > 1. Mechanized = 1 if the machine block used steam or waterpower or both; see text. NA: not applicable. Bracketed figures are the exponentiated means of Δ ln T, that is, the geometric mean of the ratio of machine to hand labor time (for example, 0.172 = exp(-1.761) for the regression sample in the final row).
Equation 1 is the conceptual tool that Wright and his team of agents did not have, along with the relevant inanimately (by electricity) powered machine – the modern computer and its software – to “automate” the processing of the vast amounts of data that were collected and reproduced in the published HML volumes. The mean value of Δ ln T (a, j) is -1.761; exponentiating and multiplying by 100 gives 17.2 percent (= exp(-1.761) x 100). This means that the typical machine labor operation – whether it was part of a simple 1:1 block or a more complex non-1:1 block – took just 17.2 percent of the time needed to complete the “equivalent” hand labor block. “Equivalent” here, we reiterate, holds constant the “output” of the block. It is, thus, a measure of the relative labor productivity of machine production: machine labor was about 6 times (≈ 1/0.172) as productive as hand labor. As noted earlier, this calculation is one that Wright wanted to perform, but he was unable to do so.
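The conversion from a mean log difference to a time ratio is simple enough to check directly. The short sketch below uses only the mean values of Δ ln T reported in Table 1; the function and variable names are hypothetical.

```python
import math


def time_ratio(mean_delta_ln_t: float) -> float:
    """Geometric-mean ratio of machine to hand labor time implied by mean Δ ln T."""
    return math.exp(mean_delta_ln_t)


# Mean Δ ln T by block link type, as reported in Table 1.
TABLE_1_MEANS = {
    "1:M": -1.920,
    "H:1": -2.729,
    "H:M": -2.189,
    "Total": -1.761,
}

# Bracketed figures in Table 1, rounded to three decimals.
ratios = {name: round(time_ratio(v), 3) for name, v in TABLE_1_MEANS.items()}

# Relative labor productivity of machine production for the full sample.
productivity_multiple = 1 / time_ratio(TABLE_1_MEANS["Total"])
```

The rounded ratios reproduce the bracketed entries of Table 1 (0.147, 0.065, 0.112 and 0.172), and `productivity_multiple` is roughly 5.8, the “about 6 times” cited in the text.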
Recall that Wright also wanted to determine how important the use of inanimate power was in raising the productivity of machine labor. This depends on the sign, magnitude (and, implicitly, the statistical significance) of the coefficient λ, and whether (or not) we have an “identification strategy” such that our estimate can be considered “causal” (as opposed to a “correlation”). Assuming for the moment that the OLS estimate is causal, if mechanization raised productivity – as Wright, and everyone else alive at the time, believed – then λ will be negative. Our initial estimate of λ (by OLS) is, indeed, negative and highly significant: λ = -1.037 (s.e. = 0.06). The interpretation of λ is as follows – holding the other variables in Equation 1 constant, the ratio of machine to hand labor time was approximately 65 percent lower (= [1 – exp(-1.037)] x 100) in the blocks that were mechanized under machine labor than in non-mechanized blocks. Even if we are confident that our estimate of λ is causal, the answer to Wright’s question depends on more than just the magnitude of the coefficient; it also depends on how much “mechanization” took place, that is, the mean value of Mechanized. As can be seen from Table 1, this mean value is 0.552; a little more than half of the machine blocks used inanimate power. Multiplying the estimate of λ (-1.037) by the mean value of Mechanized (0.552) and dividing by the mean value of Δ ln T (a, j) gives us the “percent explained” by mechanization = 32.5 percent (= [(-1.037 x 0.552)/-1.761] x 100). Thus, while mechanization clearly mattered, it was far from the sole reason why machine labor was more productive than hand labor.
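Both calculations in this paragraph can be reproduced in a few lines. The sketch below takes the reported OLS coefficient and the Table 1 means as inputs; the function names are hypothetical.

```python
import math


def percent_time_reduction(lam: float) -> float:
    """Percent by which the machine/hand time ratio is lower in mechanized blocks."""
    return (1 - math.exp(lam)) * 100


def percent_explained(lam: float, mean_mechanized: float, mean_delta_ln_t: float) -> float:
    """Share of the mean log time difference attributed to mechanization."""
    return lam * mean_mechanized / mean_delta_ln_t * 100


LAMBDA_OLS = -1.037        # reported OLS estimate
MEAN_MECHANIZED = 0.552    # Table 1, regression sample
MEAN_DELTA_LN_T = -1.761   # Table 1, regression sample

reduction = percent_time_reduction(LAMBDA_OLS)
explained = percent_explained(LAMBDA_OLS, MEAN_MECHANIZED, MEAN_DELTA_LN_T)
```

Here `reduction` comes to about 64.5 (the “approximately 65 percent” in the text) and `explained` to about 32.5.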
Of course, we cannot claim that the OLS estimate is causal, if for no other reason than that inanimate power was not randomly assigned, either by the HML staff or by whomever managed the machine labor establishment that the staff surveyed. This creates the possibility of endogeneity bias. One common form of such bias – measurement error – is almost certainly irrelevant in the HML study, due to the (extreme) care that the agents took in collecting the data. Another common form of bias, omitted variables, could be present but the most likely form this would take would involve worker characteristics and there seems to be little evidence of this, either.  This leaves the third possibility – reverse causality. In this context, reverse causality is highly likely – inanimate power would be used if the manager thought there would be significant time savings, which would bias the OLS coefficient upwards in magnitude.
To correct for such a bias, economists like ourselves would search for an instrumental variable (IV) to use in a two-stage least squares (2SLS) estimation of Equation 1. A valid IV would be correlated with the endogenous variable, Mechanized (the “relevance” condition), but uncorrelated with the error term ε (the “exclusion” restriction). Even in the best of situations, finding such an instrument can be like looking for the proverbial needle in a haystack. In our setting, the problem is rendered more difficult because the IV has to be measured at the production operation level.
Our solution was to engage the digitized HML in another form of computer analysis, that of text processing. The HML study is not just a collection of tables with numbers, but includes vast amounts of text, both in the tables and as descriptive material in the volumes. Economics is now on the verge of a major revolution in “big data” involving the use of digitized text to study economic issues; for the most part, however, the construction of instrumental variables from such text has not (yet) been a major part of the analysis.
Our processing extracted instances of “gerunds” appearing in the HML study. We use gerunds because this is how the HML staff (like many other observers in the period) described production activities. A gerund starts out as an English verb to which “-ing” has been appended. These words then function as nouns in grammatical context. For example, consider the gerund “reading,” derived from the verb “read”. In the sentence, “I enjoy reading,” “reading” is a gerund, a noun that describes an action or activity described by the root verb (in this case “to read”) – that is, gerunds are active. All gerunds end in “-ing” but not all words ending in “-ing” are gerunds. For example, in the sentence “I am reading a book,” “reading” is not a gerund, but rather the present participle of the verb describing a continuous activity. 
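A first pass at this kind of extraction can be sketched with a regular expression. This is an illustration rather than the procedure we applied to the HML text: the task descriptions are hypothetical, the stop list of “-ing” words that are not gerunds is a tiny illustrative sample, taking the first “-ing” word as the principal gerund is an assumption, and a purely lexical rule cannot by itself separate gerunds from present participles.

```python
import re
from collections import Counter

# Illustrative stop list: common "-ing" words that are not gerunds.
NOT_GERUNDS = frozenset({"during", "string", "thing", "nothing", "spring"})

# At least two letters before "ing", so "ring" or "king" are not matched.
ING_WORD = re.compile(r"\b[a-z]{2,}ing\b")


def leading_gerund(description: str):
    """Return the first candidate gerund in a task description, or None."""
    for word in ING_WORD.findall(description.lower()):
        if word not in NOT_GERUNDS:
            return word
    return None


def gerund_counts(descriptions) -> Counter:
    """Tally the leading gerund across a collection of task descriptions."""
    return Counter(g for g in map(leading_gerund, descriptions) if g)


# Hypothetical task descriptions, for illustration only.
sample = [
    "Cutting out uppers",
    "Sewing uppers during fitting",
    "Making patterns",
    "Cutting soles",
    "Thread and string",
]
counts = gerund_counts(sample)
```

In this toy sample “cutting” is tallied twice, “sewing” and “making” once each, and the last description yields no gerund.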
Some gerunds occur frequently in the HML text descriptions, such as “cutting,” which appears 14 times in just the descriptions of the work performed in the hand production of men’s medium grade shoes (Unit 71). It appears in 22 of the 173 distinct operations for the machine production of the same product. Indeed, cutting was the single most common gerund in the HML study and was used in describing approximately 2,400 tasks. It describes the act of parting or dividing of organic materials like leather, paper and textiles as well as metals using an edge tool.
We used these gerunds as the basis for an instrumental variable. A member of our research team (Atack) with expertise in the history of technology was given the list of principal gerunds appearing in the text descriptions of operations represented in the block links in the regression sample and asked to sort them into two bins without consulting the HML study. Based on this expert’s knowledge of the history, gerunds for which there was some technical feasibility of mechanization worldwide by the end of the 19th century were sorted into one bin (bin #2), and those for which there was either very little or none were sorted into the other (bin #1). For gerunds that were inherently vague or too general in describing the underlying activity (for example, “making”), the sorting erred on the side of caution (bin #1).
From this sorting, we created our instrument, called MECHABLE, which equals 1 if the gerund in the hand text description was sorted into bin #2 and 0 if it was sorted into bin #1. In the first stage of the 2SLS estimation, we regress Mechanized on MECHABLE; in the second stage, we regress Δ ln T on the predicted value of Mechanized from the first stage regression. To the best of our knowledge, ours is one of the first uses of text processing in economics to create an IV. The 2SLS estimate of λ is, as we expected, smaller in magnitude (-0.749), although the standard error is larger (0.165), so much so that we cannot reject the hypothesis that there is no significant difference between it and the OLS estimate.
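The two stages can be sketched by hand with least squares. The example is a simplified illustration under loud assumptions: it omits the unit and block link type dummies of Equation 1, the eight observations are made-up toy numbers (not HML data) built so that the confounder is balanced across the instrument, the function names are hypothetical, and standard errors from a manually run second stage would not be valid, so none are computed.

```python
import numpy as np


def ols_slope(y, x) -> float:
    """Slope from an OLS regression of y on a constant and x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


def two_stage_least_squares(y, x, z) -> float:
    """2SLS slope of y on x, using z as the single instrument."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    # First stage: regress the endogenous regressor on the instrument.
    Z = np.column_stack([np.ones(len(z)), z])
    first_stage, *_ = np.linalg.lstsq(Z, x, rcond=None)
    x_hat = Z @ first_stage
    # Second stage: regress the outcome on the first-stage fitted values.
    return ols_slope(y, x_hat)


# Toy data (hypothetical). `confounder` lowers y and raises the chance of
# mechanization, but has the same mean whether mechable is 0 or 1.
mechable = np.array([0, 0, 0, 0, 1, 1, 1, 1])
confounder = np.array([-0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5, 0.5])
mechanized = np.array([1, 0, 0, 0, 1, 1, 1, 0])
delta_ln_t = -1.0 - 0.75 * mechanized + confounder  # assumed true slope: -0.75

lam_ols = ols_slope(delta_ln_t, mechanized)
lam_2sls = two_stage_least_squares(delta_ln_t, mechanized, mechable)
```

In this constructed example the OLS slope (-1.25) overstates the magnitude of the assumed effect, while the 2SLS slope recovers -0.75, mirroring the direction of bias discussed in the text. With a single binary instrument and no other regressors, the 2SLS slope equals the ratio of the difference in mean outcomes to the difference in mean mechanization across the two instrument groups.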
The text processing of the HML study described above does not exhaust the additional programming that was required to turn econometrics-ready data out of the digitized PDF. Currently, for example, we are studying a related question, that of “de-skilling” – the greater use of unskilled and semi-skilled workers in machine production versus skilled blue-collar artisans in hand production. This, too, requires text processing – the extraction of occupation titles from the digitized PDF, which then are classified (by the old-fashioned scholarly method, that is, hand) into occupation categories. Once the occupation categories are created, we perform operations in EXCEL and STATA on the original data set to classify the categories of work – semi-skilled vs. skilled blue collar, for example – involved in the block, and then compare these between machine and hand labor. Our preliminary analysis shows a very substantial amount of de-skilling – a little more than a third of the block links. De-skilling was more likely to occur in conjunction with mechanization, but mechanization per se is not the main explanation – rather, de-skilling was a consequence of the intense division of labor that characterized large-scale factory production in late nineteenth century America.
5 Concluding Remarks
Tremendous advances in computerization and statistical analysis offer great promise in unlocking “big data” from the past and enhancing scholarly understanding of major events in human history, such as the Industrial Revolution. Sometimes, as in the case of the Hand and Machine Labor study, data were collected that, at the time, were far beyond the then-current capabilities of analysis. Rather than letting such data fade into obscurity, advances in computation allow us to probe their secrets and, we hope, shed light on modern developments to which they are linked through historical time.
Supplementary Material: The online version of this article offers supplementary material (https://doi.org/10.1515/jbwg-2023-0002). Online Appendix Fig. 1: Verso and Recto Pages Showing Parts of Two Hand and Machine Labor Tables to be Digitized; Online Appendix Fig. 2: OCR Rendition of Data from HML Study Using Desktop Excel “Insert/Data from Picture” Command from PDF File; Online Appendix Fig. 3: OCR Rendition of Data from HML Study Using iPhone version of Excel “Insert/Data from Picture” Command from Cell Phone Photo of PDF image on Macintosh Screen; Online Appendix Fig. 4: ABBYY FineReader 15 OCR Rendition of a Table from HML Study; Online Appendix Fig. 5: Unit 71, HML Data on the First 54 Tasks in the Machine Production of Men’s Medium Grade Shoes; Online Appendix Fig. 6: Unit 71, Data on Some of the Tasks in the Hand Production of Men’s Medium Grade Shoes.
About the authors
Jeremy Atack is Professor Emeritus of Economics at Vanderbilt University in Nashville where he taught from 1993-2013. Before that he was on the faculty of the University of Illinois at Urbana-Champaign. He is a Research Associate with the National Bureau of Economic Research (NBER), a Fellow of the Cliometrics Society, and has served as editor of leading economic history journals.
Robert Margo is Professor of Economics at Boston University where he has been since 2005. Prior to that he was on the faculty of the University of Pennsylvania, Colgate University and Vanderbilt University. He is a Research Associate with the National Bureau of Economic Research (NBER), a Fellow of the Cliometrics Society, and has served as editor of leading economic history journals.
Paul W. Rhode teaches at the University of Michigan at Ann Arbor where he has been since 2009. Previously, he was on the faculty of the University of North Carolina at Chapel Hill and the University of Arizona. He is a Research Associate with the National Bureau of Economic Research (NBER), a Fellow of the Cliometrics Society, and has served as editor of leading economic history journals.
© 2023 Jeremy Atack/Robert A. Margo/Paul W. Rhode, published by De Gruyter
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.