Data on Digital Transformation in the German Socio-Economic Panel

Public debates and current research on “digitalization” suggest that digital technologies could profoundly transform the world of work. While broad claims are common in these debates, empirical evidence remains scarce. This calls for reliable data for empirical research and evidence-based policymaking. We implemented a data module in the Socio-Economic Panel to gather information on digitalization in three domains: artificial intelligence (AI), platform work, and the digitalized workplace. This paper describes existing approaches to measuring technological exposure, the challenges of operationalizing digital transformation in a household survey, the implemented questionnaire items, and the research potential of these new data.


Introduction
Technological change and its impact on society are a longstanding topic in the social sciences. Its current stage, often summarized under the umbrella term digitalization or digital transformation, changes not only the means of production but also the way humans and machines interact, as it leads to a deeper interconnectedness of the physical and the virtual worlds. Thus, the potential effects of digitalization are expected to be widespread: they include not just more flexible working hours, work practices, and workplaces, but may also spill over to private lives by shifting the boundaries between work and leisure. Digitalization is expected to bring many opportunities, such as new types of work that can open up labor market opportunities for people previously excluded from gainful employment. Digitalization can also change some work activities done by humans, thus changing demand for occupations and creating new sources of income inequality. For these and many other reasons, the digital transformation remains of critical interest to both researchers and policymakers.
Empirical evidence on digital transformation is very limited because, first, digitalization is still a vague term that describes various processes spanning many areas and, second, the phenomenon is just starting to unfold. This results in a lack of appropriate data sources. This paper presents an innovative data module on digitalization with precise definitions of pertinent areas of digitalization. It measures digitalization along three dimensions: a) AI; b) platform work; and c) digitalized workplace. The module was implemented within an established multi-topic longitudinal data infrastructure, the Socio-Economic Panel (SOEP), one of the largest longitudinal household panels in Germany. In 2019, the module was first introduced to the Innovation Sample (SOEP-IS, with about 1,500 respondents). In 2020, a revised version of the module was introduced to the main sample of the survey (SOEP-Core, with about 30,000 respondents). The module complements the existing information available from SOEP respondents on workplaces and other topics, including individual health and well-being, and can easily be linked to the outcomes of individual households as well as aggregated to the regional level using geo-coding. This paper has two purposes. First, we provide a comprehensive description of the innovative data module. Second, we illustrate the methodological considerations behind the development of such a data module by analyzing the existing literature in economics and sociology devoted to technological transformation, the formalization of technological exposure, and its impact channels. We also document existing approaches to quantifying various aspects of digitalization. In particular, we derive several criteria for item design that contribute to the methodology of measuring technological progress and its impacts in household surveys.
Designed for a joint project on digital transformation based at the Socio-Economic Panel and the Technical University of Berlin, these data are related to a joint research agenda in economics (with a focus on employment changes) and sociology (with a focus on job quality). The project was funded by the German Ministry of Labor and Social Affairs. Beyond the topics that our research group primarily addresses, we expect the data module on digitalization to offer considerable additional potential for the research community. As of 2021, the module was fielded in one cross-section of the respective SOEP survey. A repetition of the module in the SOEP in two or three years would create a longitudinal data structure and, thus, provide the opportunity to estimate the dynamics of digital diffusion in society. We discuss the research potential of these data in further detail below.

Data-Driven Research on Technological Change
Technological change and its various impacts are important research fields in many social sciences, including economics and sociology. Economic studies mostly center on the impacts on employment and wages, whereas sociological studies investigate changes in workplace characteristics, job quality, and the role of employees in the production process. In the following, we briefly review the current state of the literature in economics and sociology, focusing on the data sources used.
In economics, the relevant literature developed rapidly once data on exposure to technology became available. An important strand of literature goes back to the ideas of Autor et al. (2003), who propose a task-based approach. Within this approach, each job is a bundle of tasks. Some of these tasks are called routine, which means that they are programmable and can be executed by a computer. This approach historically relies on the expert job descriptions from the Dictionary of Occupational Titles (DOT) and its successor O*NET. In Germany, a similar database called BERUFENET exists (Dengler et al. 2014). Another important source for occupational contents in Germany is the BIBB-BAuA-IAB employment survey, which gathers information on the task contents of respondents' jobs and allows calculation of routine task shares as a proxy for technological exposure. This approach has proved very useful for explaining changes in the employment and wage structures that have taken place since the 1980s. An important advantage of the survey-based task data lies in their variation within occupations, which allows comparative inference for socio-economic groups (Black and Spitz-Oener 2010), nuanced comparisons of occupational similarity (Gathmann and Schönberg 2010), and analyses of changes in job content over time (Fedorets 2019). The cross-sectional structure of the BIBB-BAuA-IAB survey constitutes its main limitation.
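To make the idea of a routine task share concrete, the following minimal Python sketch computes, for each respondent, the share of reported tasks that are classified as routine. The task names, the routine classification, and the example respondents are hypothetical placeholders and are not taken from the BIBB-BAuA-IAB survey; the definition used here (routine tasks performed divided by all tasks performed) is one simple formalization assumed for illustration.

```python
# Hypothetical classification of tasks as routine (illustrative only).
ROUTINE_TASKS = {"operating_machines", "bookkeeping", "measuring"}


def routine_task_share(tasks_performed):
    """Return the share of a respondent's tasks that are routine.

    Returns None when no tasks are reported, so that empty answers
    are not mistaken for zero exposure.
    """
    tasks = set(tasks_performed)
    if not tasks:
        return None
    return len(tasks & ROUTINE_TASKS) / len(tasks)


# Made-up example respondents (not real survey data).
respondents = {
    "r1": ["bookkeeping", "measuring", "advising", "teaching"],
    "r2": ["operating_machines", "measuring"],
    "r3": [],
}

shares = {rid: routine_task_share(t) for rid, t in respondents.items()}
```

Because the share is computed per respondent rather than per occupation, such a measure can vary within occupations, which is the property of survey-based task data highlighted above.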
An alternative approach to assessing technological exposure is taken by Frey and Osborne (2017), who use expert assessments of the current technological advancement and "engineering bottlenecks" to predict the employment prospects of jobs. The main result of this paper, which predicts vast job losses, has however been challenged by multiple studies (Acemoglu and Restrepo 2019; Arntz et al. 2017; Salomons 2018, 2017; Bonin et al. 2015; Dekker et al. 2017; Dengler and Matthes 2018). Other aggregate measurement approaches focus on the number of industrial robots in economic sectors using data from the International Federation of Robotics (IFR) (Acemoglu and Restrepo 2020; Graetz and Michaels 2018) or rely on patent contents as a proxy for technological adoption (Webb 2020).
Similar to economics, existing general sociological studies of technology adoption in the labor market are largely devoted to wages and employment opportunities. Research on job quality, however, emphasizes that focusing only on financial dimensions of inequality oversimplifies the phenomenon by disregarding non-monetary job characteristics such as working conditions, work-life balance, and job satisfaction (Böckerman and Ilmakunnas 2006; Hackman and Oldham 1976; Muñoz de Bustillo et al. 2011). In sociology, the empirical literature on digitalization thus focuses on the diversity and ambiguity of the effect of digital technology on employment (Hirsch-Kreinsen 2015, 2018a). Focusing on job quality, it is argued that digital technology may lead to "digital Taylorism" (Holford 2019; Kirchner et al. 2020; The Economist 2015) in some areas, and to "digital self-determination" in other areas. From this perspective, digital technology is likely linked to direct changes in the organization of work and communication structures, which in turn are central to indirect consequences such as job satisfaction and perceptions of justice at work (Ambrose and Schminke 2009; Bies 2005; Roscigno et al. 2018; Sauer and May 2017). Thus, potential consequences of digitalization are manifold, going far beyond wages and job losses. Empirical evidence from a sociological perspective on the consequences of digitalization is scarce, with the few existing studies documenting complex interrelationships and rather incremental changes (Evers et al. 2018; Hirsch-Kreinsen 2018b; Kuhlmann 2021). Some studies report rather ambiguous effects of digital technology in relation to autonomy or work intensification (Kirchner et al. 2020; Meyer et al. 2019; Ruiner and Klumpp 2020).
Previous research often fails to distinguish the purpose for which digital technology is used (computers enhance communication and creativity, support many work processes, or can monitor work processes), obscuring the differential implications of different uses of technology. Pfeiffer (2012) emphasizes that the interaction between information and communication technology, organizational change, and individual effects on workers is often assumed but remains open for empirical testing. Central to these considerations is the lack of data on the forms of technological use in the work process. Therefore, it is necessary to gather data on both the work tools and their forms of use to gain coherent insights into the consequences of digitalization.

Digital Exposure and its Impact Channels
With respect to the scientific and political debate on digitalization (see, e.g., the agenda of the Policy Lab Digital, Work & Society, https://www.denkfabrik-bmas.de/en/), the following essential aspects of digitalization can be singled out (Figure 1):
a) Artificial Intelligence (AI)
b) Platform work
c) Digitalized workplace
Artificial Intelligence (AI) comprises the current technology of automated content recognition and related feedback. Although AI can be formally seen as an aspect of digitalized workplace, we single it out given its relevance in the scientific and policy debate on digitalization (cf. BMWi 2020; Enquete Kommission Künstliche Intelligenz 2020). Platform work constitutes a new organizational form that is closely connected to the digital technologies (Kirchner 2019; Kirchner and Beyer 2016). An interplay of these aspects shows to what extent employees are exposed to digital technologies, which can then be linked to the various effects in the workplace. The digitalized workplace can be described by gathering data on the digital tools applied (including AI), their usage and on the tasks they help to execute. All the aforementioned aspects have direct as well as indirect effects. Direct effects of technological exposure may include changes of workplace (including spread of work from home), working time and its flexibility, work-life conflict, autonomy and monotony of work, as well as related stress levels. Indirect effects may include changes in well-being, the effort-reward-imbalance (ERI), and justice perceptions. Both direct and indirect effects of the digitalization of the world of labor speak to the overall concept of "job quality." Here it should be noted that the questionnaire items developed based on this conceptual model do not require causal statements presented to respondents, such as "did your stress level increase due to digital exposure," but rather introduce global items on digital exposure, tools, usage, tasks, and the effects, which can then be linked through statistical analysis. This helps to prevent subjective assessments of the causality by the respondents and broadens the applicability of the data in empirical work.
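As a rough illustration of what linking separately measured items through statistical analysis can look like, the following Python sketch compares the mean of an outcome item between respondents with high and low values on an exposure item. The variable names, scales, threshold, and all values are invented for this example and are not SOEP data or SOEP variable names; the comparison is purely descriptive and does not by itself identify a causal effect.

```python
from statistics import mean

# Made-up toy records (not SOEP data): a digital-tool usage frequency
# (0 = never ... 4 = daily) and a self-reported stress level (1-5).
records = [
    {"tool_use": 0, "stress": 2},
    {"tool_use": 1, "stress": 3},
    {"tool_use": 3, "stress": 3},
    {"tool_use": 4, "stress": 4},
]


def mean_outcome_by_exposure(rows, exposure_key, outcome_key, threshold):
    """Compare the mean outcome at/above vs. below an exposure threshold."""
    high = [r[outcome_key] for r in rows if r[exposure_key] >= threshold]
    low = [r[outcome_key] for r in rows if r[exposure_key] < threshold]
    return {
        "high_exposure": mean(high) if high else None,
        "low_exposure": mean(low) if low else None,
    }


# The threshold of 3 is an arbitrary choice for illustration.
result = mean_outcome_by_exposure(records, "tool_use", "stress", threshold=3)
```

In this setup the respondent only reports exposure and outcome separately; any association between them is established by the analyst, in line with the design choice described above.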

Criteria for Measuring Digitalization in Individual-Level Data
Measuring AI, platform work, and the digitalized workplace by means of survey data poses several challenges. The most important one is that among the general population, there is no unified understanding of such phenomena as digitalization and artificial intelligence. Current technological trends are often described by different buzz words that change over time and between real-life applications. Moreover, respondents are often unaware of the technological equipment (such as number and type of robots or interconnectedness of devices) used in their firm. In contrast, employees are knowledgeable about equipment of their workplace, as well as about technologies and tasks that they use. Based on these methodological challenges and the limitations of data commonly used in empirical literature in economics and sociology, we set out the following criteria for the design of the innovative data module on digital exposure.
1. Technology description instead of buzz words. This criterion helps both to rule out differential interpretation of items as well as to ensure its sustainability over time. Exposure to digitalization can be proxied by measuring usage of certain technologies for specific tasks, instead of direct questioning on respondents' perceived exposure to digitalization.
2. Technological usage. In empirical applications, it may be important to differentiate between new technology (e.g. a smartphone) used for pre-existing tasks (e.g. making a call) and for new purposes (e.g. as a modern time stamp clock). Over time, technological solutions may incorporate new innovative usages.
3. Broadness. The items to measure technology and its consequences must be applicable to as many occupations as possible (e.g. not only to tech or manufacturing).
4. Granularity. This criterion ensures adaptability of items to future changes in technology. For instance, the definition of AI in the first wave of the module includes text, language, and image recognition, along with responses to specific questions and data analysis. Over time, new technologies may arise and some of the existing AI features will become so widely used that the subsequent understanding of AI will inherently incorporate the new technological edge.
5. Compatibility. The new items must be compatible with the existing "questionnaire landscape" of the survey that they are introduced to, which are the SOEP-IS and SOEP-Core in our case.

Data Availability
In the following, we describe the data sets SOEP-Core and SOEP-IS. Both data sets are available free of charge for the scientific community from the SOEP department. A 50% random sample of SOEP-Core is also available as a data set for teaching purposes.

SOEP-Core
The SOEP is a longitudinal survey of private households in Germany that has been running since 1984 (Goebel et al. 2019). Over time, the survey grew substantially due to sample refreshments and the inclusion of new projects, slowly becoming a family of related surveys. For example, SOEP was the first survey to include East Germans after reunification and features special sub-samples that allow for a detailed analysis of migrants and refugees in Germany. In 2010, the main survey of the SOEP (often referred to as SOEP-Core) contained information from more than 25,000 individuals who are surveyed annually. The fieldwork is currently carried out in cooperation with infas, a specialized survey firm that also conducts the initial data preparation. Thereafter, the data are transferred to the SOEP, which then validates these data, constructs user-friendly variables, and provides the survey weighting scheme. Due to the field efforts and reputation of the survey, the annual re-survey rates are exceptionally high. The questionnaire includes a broad set of information on individuals in their household context: sociodemographic status, labor market characteristics, income information, psychological indicators, health indicators, political attitudes, worries, expectations, family background, education, and ongoing training (Schröder et al. 2020). While some of these aspects are volatile and are surveyed annually, other topics occur in specialized modules at a lower frequency. This breadth and flexibility of the SOEP is the key to its innovativeness. In 2020, the SOEP survey included the digitalization module that takes center stage in this paper. The elaboration of the data module on digital exposure built on the existing questionnaire landscape of the SOEP, which was screened for items related to digital exposure and its potential effects. Appendix A presents the existing questionnaire landscape and the usage of items to measure single impact channels as described in Figure 1.
The development of new items for digital exposure and some of its impact channels leans on the existing items on working time arrangements, distance to work, work intensity, job satisfaction, and justice perceptions.

SOEP-IS
The SOEP Innovation Sample (SOEP-IS) constitutes one of the sub-samples of the SOEP-Core that was singled out as a free-standing survey product to be used for testing innovative questionnaire items. The SOEP-IS annually gathers proposals for innovative items from the research community, which are then evaluated by an independent committee with regard to their relevance and ethical considerations. Since this procedure ensures better item quality, the innovative module was also introduced to SOEP-IS based on a formal item proposal and is part of the 2019 data release. For the innovation module, the primary aim was to pre-test items related to digitalization, but also to create a first pool of data that can be used to study digitalization in an empirical setting. However, it must be noted that the size of the survey (currently, about 1,500 respondents for this data module) remains a serious limitation for in-depth statistical analysis, especially when sample partitioning is involved.
As with any large-scale survey program, survey time is scarce and questionnaire design aims to keep instruments short. Following this aim, the item pool fielded in SOEP-IS was reduced in SOEP-Core based on the criteria of topical fit and measurement quality. Table 1 summarizes the item pool in SOEP-IS and SOEP-Core.

Item Description

AI Exposure
The main challenge in capturing AI through a household questionnaire is that there is no unified definition of AI that would be widely understood in society and invariant over time (cf. Ganz et al. 2021). Moreover, workers' awareness of their interaction with AI may be limited. In order to measure AI exposure, we asked a series of questions on the frequency of respondents' involvement in tasks like identification and processing of language, images, and written information, processing and evaluation of datasets, as well as answering questions on specialized knowledge. In 2020, these features are commonly understood as AI functionality that may have found practical spread in everyday jobs (Bauer et al. 2019; BMWi 2020; Döbel et al. 2018). In SOEP-IS, we are able to differentiate between respondents who are involved in these activities themselves and those who use digital systems of the same functionality, which may help to address the difference between substitutability and complementarity of humans and digital systems (Giering and Kirchner 2021; Huchler 2020). For reasons of brevity, SOEP-Core contains only the part on the use of digital systems with the contemporary functionality of AI, thus focusing on technological usage rather than potential.

Table 1: Summary of item pool on innovation module "digitalization of the world of labor" in SOEP-IS and SOEP-Core.
Item | SOEP-IS | SOEP-Core
AI exposure
Four items, potential exposure AI specific tasks | ✓ | N/A
Four items, exposure AI specific tasks | ✓ | ✓
One item, exposure AI (yes/no) | ✓ | N/A
Platform work
Four items each on platform work (a) selling goods, (b) rent out property, and (c) selling services [Six items, platform work overall, not asked separately for (a), (b), (c) in SOEP-IS]

Platform Work
Capturing platform work in surveys is challenging due to the lack of a unified understanding of platform work among respondents. For example, it is important that this item focuses on the self-employed activities of the respondent and rules out the work inquiries that dependent employees may receive from their employers via web-based applications used to manage production processes. As Bonin and Rinne (2017) note, it is particularly important in this respect to ask for the name of the application or web site used by the respondent to find platform work. The item that we suggested differentiates between work assignments, selling goods online, and letting property. For work assignments, we also asked whether the assignment was executed online or in the real world. For all types of platform work, we asked about the type of income it generates for the respondents, its magnitude, and the related time effort. In SOEP-IS, the questions on income and time effort relate to all types of platform work, whereas in SOEP-Core this information is gathered separately for work assignments, selling activities, and letting activities.

Digitalized Workplace
The digitalized workplace is primarily described by the technological tools that respondents use and how they use them. Workers in Germany use a wide range of tools. Asking how each tool is used is time-consuming and, thus, not feasible in a multi-topic survey like the SOEP. Therefore, we ask two different types of questions: First, we include items on which digital tools are used. Second, we ask general questions on tasks and usage of digital technologies. We capture the frequency of digital tool usage, covering the use of computers, laptops, tablets, smartphones, robots, scanners, measuring and diagnostic devices, messengers, and resource-determining programs. To assess the extent to which respondents understand what digital tools are, we added an item in SOEP-IS that explicitly asks whether the respondents work with a digital tool that is not on the presented list, providing an open-format response option. Analyses revealed that the presented list seems to capture existing technologies well and, thus, the additional open-format item was dropped from SOEP-Core.
To assess how digital tools are used (differentiating between communicative, supportive, and regulatory functions), we follow two strategies in SOEP-IS: First, building on items proposed by Reimann et al. (2020), we introduced four items on the frequency of receiving automated instructions, receiving help, preserving information, and quality management. Second, leaning on the task-based literature (Autor et al. 2003; Spitz-Oener 2006), we additionally introduced a battery of tasks potentially related to digital exposure that may help identify jobs that are exposed to the current stage of technological progress. The final SOEP module covers only the task-based approach.

Direct and Indirect Consequences
The list of potential direct and indirect consequences of digitalization in the sphere of work is long, ranging from direct consequences such as flexibility with regard to the place of work to indirect consequences for health and well-being. The existing SOEP questionnaire already gathers detailed information on working conditions (e.g. place of work, timing of work) and well-being (e.g. effort-reward-imbalance (ERI), health, and domain-specific satisfaction). Complementing the picture on direct and indirect consequences of digitalization, we introduce new items focused on measuring autonomy at work, organizational fairness, work-life conflicts, and concerns about individual risks of being exposed to digital technologies. The latter two topics highlight the advantage of incorporating the topic of digitalization into a longitudinal household study: First, assessing work-life conflicts provides the link between the two primary units of investigation (the individual and the household) and, second, concerns about technological change are assessed by all individuals, irrespective of their employment situation, and can be traced over time.
The complete questionnaires of SOEP-IS and SOEP-Core are available online: SOEP-IS: German version, English version; SOEP-Core: German version, English version.

Research Potentials and Conclusions
The research potential of the presented data set is manifold, and the data can be used to address various research questions. First, the data modules allow for measurement of technological diffusion in different domains, which can be validated using other data sources in methodological studies. For instance, the diffusion of AI measured in the data can be compared to other individual- and firm-level surveys, as well as to aggregated industry data such as the number of patents related to AI. Second, the data allow for capturing digitalization among the self-employed, a group that remains unreached by more common employer or employee surveys. Third, the data on digital technology at the workplace can be linked to observed family relations, which may help to identify spillover effects of digital exposure within families. Fourth, leaning on the task-based economic literature, the data module can be used to predict the impact of digital diffusion on the labor market, especially on employment opportunities, wage inequality, and other aspects. Fifth, digital exposure can be related to those items measuring job quality, including such non-monetary parameters as autonomy at work or justice perceptions. Sixth, the data can be linked to other topics that are present in the SOEP data, such as training opportunities at work, physical and mental health, and political views. Seventh, the SOEP data allow using the place of residence of the respondents to merge additional regional information; therefore, digital exposure can be related to the existing regional infrastructure. Last, but not least, new research perspectives would arise when the module is repeated, thus offering a longitudinal perspective.
To conclude, we would like to underline that the measurement of digital exposure involves many challenges, be it in household surveys or other data sources. We believe that the best insights for evidence-based policy may be derived from research based on different data sources and stemming from different disciplines. Therefore, we believe that sharing information on designing data modules can help to boost the quality of research data.