Lifecycle models for research data are often abstract and simple. This carries the risk of oversimplifying the complex concepts of research data management. The analysis of 90 different lifecycle models leads to two approaches for assessing the quality of these models. While terminological issues make direct comparisons of models hard, an empirical evaluation seems possible.
Lifecycle models for research data are often abstract and simple. Therein lies the danger of painting too simple a picture of the complex research data landscape. The analysis of 90 of these models leads to two approaches for assessing their quality. The inconsistent terminology makes a direct comparison between the models difficult, whereas an empirical evaluation of the models is within reach.
Advances in science are usually the product of a team rather than individuals. It is obvious that more than one researcher is needed to further science, since new insights are based on the work of others, and scientific publications are reviewed by peers. Maybe less obvious is the necessity for a number of other actors: Research software engineers help to develop state-of-the-art tools, communication specialists disseminate important scientific findings, and data librarians support researchers in data management tasks. These three professions gain importance as the role of digital methods and forms of communication increases.
Both aspects of modern research, its collaborative nature and the fast-evolving technical possibilities, are best exemplified by the task of managing research data. A large number of services, tools, protocols, best practices, and policies have been created and are currently competing for adoption. This state of creolization lends itself to a research question: How can we describe, explain, assess, and maybe even predict phenomena in research data management? Of what nature is the interaction between researchers and other professionals? The most common answer to these questions is to model the phenomena of research data management along a lifecycle.
While the term “research data lifecycle” is used in many books, papers, and blogs, a commonly shared definition is not available. Most of these models break down the phenomena of research data management into a series of tasks or states of data and relate them to different roles or actors. As Perrier et al. (2017) indicate, these models are often not evaluated in a manner that allows the same model to be derived reproducibly for a certain purpose (explaining, educating, etc.). The quality of a model which has not been evaluated is at least doubtful. If it remains unclear how the quality of these models can be assessed, their contribution to a better theoretical understanding of research data management remains an open question.
The contribution of this paper is the analysis of 90 data lifecycles, in order to identify ways to evaluate these models. Two approaches are presented:
One approach focuses on the comparison of data lifecycle models and tries to derive common quality indicators from the literature (and data lifecycle models published in non-classical ways).
The alternative approach abstracts from the usage of the models found in the literature survey, suggests a classification with regard to the purposes the models are developed for and derives empirical evaluation criteria from these purposes.
The rest of the paper is structured as follows: In Section 2 we will examine the related work. Our methodological approach is discussed in Section 3. The results are presented in Section 4 and discussed in Section 5.
2 Related work
Hodge (2000) is one of the first research data lifecycle models in the sense indicated above. Despite the early publication date, few practical aspects have been added to the description of research data management tasks by later lifecycle models. The model is derived from a literature review and interviews with 18 leaders of contemporary “cutting edge” projects. Unfortunately, this publication is rarely cited as a reference.
Möller (2013) shows an approach similar to ours: Based on a survey of lifecycle models, an abstract data lifecycle model is derived and a classification scheme is developed. In contrast to Möller (2013), we do not define a lifecycle model but a common scheme shared by all found lifecycle models. One of the features by which Möller (2013) classifies is the distinction between prescriptive and descriptive models, which comes very close to our proposal to classify along the purpose the model was designed for. Our method is more focused on evaluation, and the resulting classification is therefore more fine-grained in that respect. Möller (2013) provides more classifications of features, some of which are irrelevant for evaluation (e.g. the distinction between homogeneous and heterogeneous lifecycles).
Veenstra and Broek (2015), Sinaeepourfard et al. (2016b) and Pouchard (2016) resemble Möller (2013) in their approach of reviewing existing models and deriving their own lifecycle model based on a gap analysis. None of the three publications offers generic and empirical evaluation criteria or a metamodel for the existing models. Their lifecycle models are designed to supersede the existing approaches for a specific context.
The model of Veenstra and Broek (2015) is not targeted at scientific data per se, but at open data in a governmental context. The authors clearly state the empirical methods by which the model was derived; the evaluation of their model is project-specific and hence iterative.
Sinaeepourfard et al. (2016b) and Pouchard (2016) both propose a lifecycle model for Big Data. Although they model the same phenomena, the models are not similar. While Pouchard (2016) does not describe evaluation criteria for the model, Sinaeepourfard et al. (2016b) propose the 6Vs of Big Data (Value, Volume, Variety, Velocity, Variability, Veracity) as a basis for evaluating data lifecycle models in the context of Big Data. The same criteria are also applied to other data lifecycle models to evaluate their aptness to describe Big Data challenges. This evaluation is the most rigorous we found in the literature, but it is limited to the context of Big Data and is itself based on a theoretical concept instead of empirical evaluation.
Perrier et al. (2017) provide a scoping review of 301 articles and 10 companion documents discussing the practices of research data management in academic institutions between 1995 and 2016. The review is not limited to, but includes, publications discussing data lifecycle models. The discussion includes the observation that, of the papers reviewed, only a few provided empirical evidence for their results, which is in accordance with our findings. The study classifies the papers based on the UK data lifecycle, which fortunately is preserved as an attachment to this paper (its “official” version has changed since the original publication).
3 Method
A survey was carried out to derive a framework for comparing data lifecycle models with each other and to find the purposes for which those models are designed. Since not every research data lifecycle is described in an academic publication, our approach combined the methods of a classical literature review with a “snowball” method (following references from a first set of models to enlarge the number of found models). Starting from a search in May 2017, which used search engines (Google Scholar, BASE), literature databases (ACM Digital Library and IEEE Xplore), and a list of already known articles, a first set of 35 data lifecycles was collected.
The search terms used included any combination of two out of the three following words: “research”, “data”, and “lifecycle”. This deliberately included lifecycle models which are not specifically dedicated to research data (e.g. governmental data, linked open data), but we found no essential differences in either the conceptualization or the evaluation of these models in comparison to research data lifecycle models in the strict sense. The decisive inclusion criterion for a resource was a textual or graphical representation of a set of actions regarding data or states of data. Following the references to other resources (either links or citations), we stopped collecting further models when we had reached 90 data lifecycles and our analyses did not reveal new aspects.
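The collection procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `references` dictionary, and the `describes_lifecycle` check are hypothetical stand-ins for what was manual work in the survey, and the second stop criterion (analyses revealing no new aspects) is not modeled.

```python
from collections import deque
from itertools import combinations

SEARCH_WORDS = ("research", "data", "lifecycle")


def build_queries(words=SEARCH_WORDS):
    """Return every combination of two out of the given search words."""
    return [" ".join(pair) for pair in combinations(words, 2)]


def snowball(seed, references, describes_lifecycle, limit=90):
    """Enlarge a seed set of resources by following their references.

    `references` maps a resource id to the ids it links to or cites.
    `describes_lifecycle` is the inclusion check (hypothetical here).
    Only references of included resources are followed.
    """
    found = []
    seen = set(seed)
    queue = deque(seed)
    while queue and len(found) < limit:
        resource = queue.popleft()
        if not describes_lifecycle(resource):
            continue
        found.append(resource)
        for ref in references.get(resource, ()):
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return found


# Toy data, invented for illustration only.
toy_references = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
toy_result = snowball(["a"], toy_references, lambda r: r != "c")
```

The breadth-first order is a design choice of the sketch, not something the text prescribes; any traversal that visits each referenced resource once would serve the same purpose.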
After an evaluation of 35 lifecycle models, a common pattern emerged, which was successfully applied to the following 55 models, and therefore positively evaluated. All models included at least one of the following characteristics, which are the building blocks of the metamodel:
A set of states in which data are during their scientific processing (such as creation, analysis, preservation, etc.)
A connection between these states (in the sense of edges in a directed graph)
A set of roles in the context of research data management (researchers, data stewards/librarians, funders, etc.)
A set of actions with regard to research data management (collecting, documenting, annotating, etc.)
A mapping of roles, actions, and states to each other (e.g. “in state creation researchers describe their methods”)
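The five building blocks listed above can be pictured as a small data structure. The following Python sketch is an assumption about how such a metamodel could be encoded; the class and field names are hypothetical and do not reproduce the representation actually used in the survey.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class Mapping:
    """Relates a role and an action to a state of the data."""
    state: str
    role: str
    action: str


@dataclass
class LifecycleModel:
    name: str
    states: Set[str] = field(default_factory=set)
    # Directed edges between states, written as (source, target).
    connections: Set[Tuple[str, str]] = field(default_factory=set)
    roles: Set[str] = field(default_factory=set)
    actions: Set[str] = field(default_factory=set)
    mappings: Set[Mapping] = field(default_factory=set)

    def characteristics(self) -> FrozenSet[str]:
        """Names of the building blocks for which this model has entries."""
        blocks = {
            "states": self.states,
            "connections": self.connections,
            "roles": self.roles,
            "actions": self.actions,
            "mappings": self.mappings,
        }
        return frozenset(name for name, items in blocks.items() if items)


# Invented example that encodes "in state creation researchers describe their methods".
example = LifecycleModel(
    name="example",
    states={"creation", "analysis", "preservation"},
    connections={("creation", "analysis"), ("analysis", "preservation")},
    roles={"researcher"},
    actions={"describe methods"},
    mappings={Mapping("creation", "researcher", "describe methods")},
)
```

A model that only lists states would report `frozenset({"states"})` from `characteristics()`, which is the information needed to compare models along the metamodel.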
Since the lifecycle models differ widely with regard to their representation (different graphical and textual representations), a homogeneous processing was not possible at first. To ease the analysis and comparisons between the models, they were transcribed into an XML representation. A schema to validate the XML representations was used to guarantee the quality of the representations. During the processing of the sources for the data lifecycle models, excerpts stating the purpose of the lifecycle models were collected. The classification of purposes is a result of an abstraction from these excerpts and the context in which some of the data lifecycle models were published (e.g. training material, service advertisements).
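To give an impression of what such a transcription might look like, the following Python sketch writes states and connections to XML with the standard library. The element and attribute names are hypothetical; the schema used in the survey is not reproduced here. Because the standard library offers no schema validation, the sketch substitutes a simple consistency check (every connection must refer to declared states) for it.

```python
import xml.etree.ElementTree as ET


def to_xml(name, states, connections):
    """Transcribe states and directed connections into an XML element tree.

    Element and attribute names are assumptions made for this sketch.
    """
    root = ET.Element("lifecycle", {"name": name})
    states_el = ET.SubElement(root, "states")
    for state in states:
        ET.SubElement(states_el, "state", {"id": state})
    connections_el = ET.SubElement(root, "connections")
    for source, target in connections:
        ET.SubElement(connections_el, "connection", {"from": source, "to": target})
    return root


def dangling_connections(root):
    """Return connections that reference a state which was not declared."""
    declared = {state.get("id") for state in root.iter("state")}
    return [
        (conn.get("from"), conn.get("to"))
        for conn in root.iter("connection")
        if conn.get("from") not in declared or conn.get("to") not in declared
    ]


# Invented example with one deliberately inconsistent connection.
root = to_xml(
    "example",
    ["creation", "analysis"],
    [("creation", "analysis"), ("analysis", "archive")],
)
serialized = ET.tostring(root, encoding="unicode")
problems = dangling_connections(root)
```

Here `problems` contains the connection to the undeclared state `archive`, which is the kind of transcription error a schema is meant to catch.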
3.1 Threats to validity
The collection method does not guarantee completeness, which means there might be models of research data lifecycles not captured by our analysis. Since Perrier et al. (2017) already provides a scoping review of the relevant literature, this is an acceptable defect. Our approach was focused on finding criteria to compare data lifecycle models and to evaluate the fitness of lifecycle models in general, which does not necessitate completeness.
The list of purposes a model can be designed for is probably not complete, at least in a generic sense (models could for example also be used to exemplify). However, the list should include all relevant applications of models in the context of research data management.
We only included English and German resources describing data lifecycles. Where models had been described in other languages, they often seemed to be translations. Where German models would have biased our results, we excluded them from the statistics (this is clearly stated in the text).
4 Results
The heterogeneity of the sources for data lifecycle models can be seen in Figure 1. 62 % of the models are published in a medium that is citable in the classical sense (journals, proceedings, or books); 78 % (70) of the found models have a graphical representation.
The remainder of this section is divided into two parts: The first part presents our statistical evaluation results, based on the metamodel presented in Section 3. The presented numbers will be the basis for the discussion of how the “dimensions” of the metamodel could facilitate a comparison between data lifecycle models. The second part proposes a classification of data lifecycle models via their application (including an example) and a derived method of evaluation.
4.1 Comparison of lifecycle models along the metamodel
39 % (35) of the models define actions, 14 % (13) define roles, and only 13 % (12) define both. Some of the models that only include states “encode” an action into the state of the data (e.g. “analyzing” or “data cleansing”), which makes them hard to compare with other models that separate state and actions. 11 % (10) of the models provide a mapping of activities and roles to specific states. For partial mappings, Table 1 can be consulted.
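The percentages above follow from the absolute counts and the total of 90 models. As a quick check, the arithmetic can be reproduced as follows; the dictionary keys are labels chosen for this sketch.

```python
TOTAL_MODELS = 90

# Absolute counts as reported in the text.
counts = {
    "actions": 35,
    "roles": 13,
    "actions and roles": 12,
    "full mapping": 10,
}


def share(count, total=TOTAL_MODELS):
    """Percentage of all models, rounded to the nearest integer."""
    return round(100 * count / total)


shares = {label: share(count) for label, count in counts.items()}
```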
The five characteristics listed in Section 3 allow us to define classes of data lifecycle models. Each class extension is defined by the characteristics fulfilled by its members, i.e. there is a class for all models which define states and actions but no roles, and so on. This classification determines a partial order that allows a partial comparison. The following data lifecycle models provide all five characteristics and are therefore members of the “highest” class with regard to the partial order: Hamish et al. (2003), Knight (2006), Kuberek (2013), Möller (2013), Veenstra and Broek (2015), Pouchard (2016), Peng et al. (2016), Sarmiento Soler et al. (2016).
The number of states in the data lifecycles ranges between three and thirteen, the number of actions between zero and 42, the number of roles between zero and eight. We found 399 different terms for a state, 54 different terms for roles, and 454 different terms for a research data related activity (case-insensitive matching, non-English resources were ignored). All these numbers provide evidence of the obvious heterogeneity in the existing terminology in research data management.
Deriving a total order from the partial order would allow us to compare all data lifecycle models to each other (and not only the classes). To achieve this, we would need criteria of completeness for each characteristic, i.e., we would need to answer the question whether a model includes all essential states, actions, roles, mappings, and connections in the finest resolution. Given the already stated heterogeneity, this task is virtually impossible to accomplish on the collected model descriptions alone: The semantic mapping between two terms is often not possible, since they lack a rigorous definition and the models differ in their granularity.
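One plausible reading of this partial order is set inclusion on the characteristics a class fulfils. The following Python sketch assumes exactly that; the function names are hypothetical, and the text itself does not spell out the ordering relation.

```python
ALL_CHARACTERISTICS = frozenset(
    {"states", "connections", "roles", "actions", "mappings"}
)


def at_most(a, b):
    """True if class `a` is below or equal to class `b` (assumed: subset inclusion)."""
    return a <= b


def comparable(a, b):
    """True if the two classes can be ordered relative to each other."""
    return a <= b or b <= a


# Two example classes: states with actions, and states with roles.
states_actions = frozenset({"states", "actions"})
states_roles = frozenset({"states", "roles"})
```

Under this reading, `states_actions` and `states_roles` are both below the class that fulfils all five characteristics, but they cannot be compared with each other, which is why the order is only partial.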
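Counting distinct terms with case-insensitive matching can be sketched as follows. The example terms are invented for illustration, and stripping surrounding whitespace is an additional assumption of the sketch, not something stated in the text.

```python
def distinct_terms(models):
    """Count distinct terms across models, ignoring case and outer whitespace."""
    return len({term.strip().casefold() for terms in models for term in terms})


# Invented state terms of three hypothetical models.
toy_states = [
    ["Creation", "Analysis"],
    ["creation", "Preservation "],
    ["ANALYSIS", "Archiving"],
]
toy_count = distinct_terms(toy_states)
```

Such a count only merges spelling variants. Synonyms such as "preservation" and "archiving" remain distinct, which is one source of the heterogeneity reported above.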
4.2 Evaluation criteria based on model application
These are the classes abstracted from the 90 data lifecycle models. Each class corresponds to the purpose a model was designed for:
Documentation: Models can be used to describe certain aspects of reality, hence to document it. If a model is used to document the reality of research data management, its main evaluation criterion is its correspondence with actual research data management practices. Since these practices differ widely with regard to tools, standards, protocols, and policies, there is certainly not one model that can claim to be the research data lifecycle. Methodologically speaking, the evaluation of a model designed to document is executed by the same approach by which it is (or should have been) created: Interviewing experts is an appropriate method to test the accuracy of such a model. Examples of models which are used to document the actual state of research data management include the DataOne data lifecycle model and the lifecycle of CENS data.
Explanation: A model explains a set of phenomena if its usage leads to a better understanding of them. Explanatory models are to documenting models as tutorials are to manuals. Data lifecycle models that explain certain aspects of research data management are evaluated by their educational outcome. Evaluating how well a model explains to researchers, for example, how they can make data more reusable is therefore a task that should use the methods of empirical educational theory. The “lifecycle stages of environmental datasets” is an example of this kind of purpose.
Design: Designing a desired state with a model is the (re-)arrangement of components that could also be part of a documenting model. In the context of research data management, a model that arranges states of data items, roles, and actions can be evaluated according to the set of features such a desired state would have. This is comparable to a model that depicts the layout of a house: One can show how this specific layout would facilitate the usage by a family, a bachelor, or elderly people in need of care. This indicates that an evaluation of such a model is only possible if the model is assessed together with a set of objectives (a use case or a set of generic principles). An example of a data lifecycle model that can be subsumed under this category is the data lifecycle of the Inter-University Consortium for Political and Social Research (ICPSR).
Assessment: To assess means to map the actual state to a desired state and qualify or quantify the conformance. The model is used to describe either the actual state or the desired state. Such an assessment is implicitly carried out when statements are made that a certain service “supports the research data lifecycle”. Whether or not a set of models is suitable for assessment depends on the individual evaluations of how well they are equipped to document or to design, respectively. An example of such a usage is the United States Geological Survey science data lifecycle model.
Instruction: Another way to relate documenting and designing models to each other is to use them to steer and execute transitions from the actual state to a desired state. Such a transition typically includes the orchestration of tasks and the allocation of resources as done in classical project management. A prominent example is to use a lifecycle model as a tool to plan a data-intensive project. Whether a pair of models (one documenting, one designing) is suitable for planning and executing such a transition is not only determined by the composite evaluation of the two models, but also by the success of the transition. Examples of research data lifecycles that claim to support this activity are the DCC curation lifecycle model and the community-driven open data lifecycle model.
5 Discussion
This section is structured in the same way as the previous one: First, the model comparisons will be discussed and after that the empirical evaluation criteria.
Models providing all five aspects of research data management should be considered of higher quality than models that only provide them partially. More detailed models are easier to evaluate, since their resolution makes it easier to map their components to real phenomena. While this is a first step towards comparing the quality of data lifecycle models, it does not take into account whether the states, actions, roles, their connections, and mappings are complete. It is obvious from the numbers presented in Section 4 that handling the heterogeneity of the terms for states, actions, and roles is a very complex task. As stated, another problem is handling the different resolutions of the lifecycle models: There is no obvious way to handle mereological relations between states, actions, and roles of two distinct models. A core set of states, actions, and roles that are typically part of a data lifecycle is therefore not deducible objectively with the methods presented in this paper. These “canonical sets” would allow us to answer the question of completeness of a lifecycle model and to define a total order on the set of data lifecycle models, and would therefore provide a means to compare the models with regard to quality.
An option to come to such an evaluation criterion would be to postulate canonical sets for each component of a data lifecycle. If this turns out to be a viable option, it is recommendable to start with the 50 states, 35 roles, and 84 actions that are part of the models in the highest class according to the partial order. A good starting point to converge the terminology would be the ontology produced by the RDA Data Foundation and Terminology Interest Group.
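If canonical sets were postulated, completeness could be expressed as the share of the canonical set that a model covers. The following Python sketch illustrates this idea only; the canonical terms are invented, and the exact term matching it relies on is precisely what the heterogeneous terminology currently prevents.

```python
def coverage(model_terms, canonical_terms):
    """Fraction of a (hypothetical) canonical set that a model covers."""
    canonical = set(canonical_terms)
    if not canonical:
        raise ValueError("canonical set must not be empty")
    return len(set(model_terms) & canonical) / len(canonical)


# Invented canonical set of states and an invented model.
canonical_states = {"creation", "analysis", "preservation", "reuse"}
model_states = {"creation", "analysis", "publication"}
state_coverage = coverage(model_states, canonical_states)
```

Computing such a value per component (states, roles, actions) would give comparable numbers across models, provided the terms have first been mapped to a shared vocabulary.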
The evaluation methods proposed in Section 4, on the other hand, are ready to be used by researchers. It is to be expected that a positive evaluation according to one purpose might conflict with another one. Take the example of documentation and explanation: Typically, good explanations place greater emphasis on certain aspects compared to others, if this helps in grasping central concepts. This might entail simplifications or incompleteness in the model that are not acceptable in the context of documentation.
Examples of objectives against which a design model can be evaluated are Findability, Accessibility, Interoperability, and Reusability (the FAIR principles). Although a convergence on these principles is a goal embraced by many, there is no common agreement with regard to all aspects of these objectives. Whether these principles or maturity models are more apt as a means to assess practices of research data management or to instruct actors in certain data-related tasks is a question that only rigorous empirical evaluation can answer.
Whereas a systematic comparison of data lifecycle models is not easy with the approach proposed, evaluating models against the purpose they were designed for is a viable option. Scientific papers proposing a model for research data management should clearly state the purpose of the model and consequently include an evaluation with regard to this purpose. This would bring evidence-based methods into the field of scientific infrastructure research. Evidence-based statements improve the quality of the research, foster reproducibility of findings, and ease comparison between different theoretical approaches. A more rigorous definition or re-usage of definitions of terms will furthermore ease comparability between different models in the future.
These considerations apply not only to research data models, but could be extended to other tasks of scientific infrastructure research, including, but not limited to, models for research software development or standards with regard to technical scientific infrastructures. The improvement of methods for this research field will have an impact on all disciplines, since they will profit from new insights that lead to improved services of research service providers.
About the authors
Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften, Boltzmannstr. 1, D-85748 Garching bei München
Faundeen, John; Burley, Thomas; Carlino, Jennifer; Govoni, David; Henkel, Heather; Holl, Sally; Hutchison, Vivian; Martìn, Elizabeth; Montgomery, Ellyn; Ladino, Cassandra (2014): The United States geological survey science data lifecycle model. Tech. rep. US Geological Survey. Accessible via DOI:10.3133/ofr20131265, accessed 2018-11-26.
Hodge, Gail M. (2000): Best practices for digital archiving – an information life cycle approach. In: D-Lib Magazine, 6 (1). Accessible via http://www.dlib.org/dlib/january00/01hodge.html. DOI:10.1045/january2000-hodge.
ICPSR (2012): Guide to social science data preparation and archiving: best practice throughout the data life cycle. Accessible via http://www.icpsr.umich.edu/icpsrweb/content/deposit/guide.
Hamish, James; Ruusalepp, Raivo; Anderson, Sheila; Pinfield, Stephen (2003): Feasibility and requirements study on preservation of e-prints. Tech. rep. Joint Information Systems Committees (JISC). Accessible via http://www.sherpa.ac.uk/documents/feasibility_eprint_preservation.pdf.
Knight, Gareth (2006): A lifecycle model for an e-print in the institutional repository. Accessible via http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.132.1916.
Michener, William K.; Jones, Matthew B. (2012): Ecoinformatics: supporting ecology as a data-intensive science. In: Trends in ecology & evolution, 27 (2), 85–93. DOI:10.1016/j.tree.2011.11.016.
Peng, Ge; Ritchey, Nancy; Casey, Kenneth; Kearns, Edward; Privette, Jeffrey; Saunders, Drew; Jones, Philip; Maycock, Tom; Ansari, Steve (2016): Scientific stewardship in the open data and big data era (Roles and responsibilities of stewards and other major product stakeholders). In: D-Lib Magazine, 22 (5/6). DOI:10.1045/may2016-peng.
Perrier, Laure; Blondal, Erik; Ayala, A. Patricia; Dearborn, Dylanne; Kenny, Tim; Lightfoot, David; Reka, Roger; Thuna, Mindy; Trimble, Leanne; MacDonald, Heather (2017): Research data management in academic institutions: A scoping review. In: PLOS ONE, 12 (5), 1–14. DOI:10.1371/journal.pone.0178261.
Sarmiento Soler, Alejandra; Ort, Mara; Steckel, Juliane; Nieschulze, Jens (2016): An introduction to data management. Accessible via DOI:10.5281/zenodo.46715.
Sinaeepourfard, Amir; Garcia, Jordi; Masip-Bruin, Xavier; Marìn-Torder, Eva (2016a): A comprehensive scenario agnostic data lifecycle model for an efficient data complexity management. In: Proceedings of the 2016 IEEE 12th International Conference on e-Science, 276–81. DOI:10.1109/eScience.2016.7870909.
Sinaeepourfard, Amir; Garcia, Jordi; Masip-Bruin, Xavier; Marìn-Torder, Eva (2016b): Towards a Comprehensive Data Lifecycle Model for Big Data Environments. In: Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT ’16), 100–06. DOI:10.1145/3006299.3006311.
Veenstra, Anne Fleur van; van den Broek, Tijs (2015): A community-driven open data lifecycle model based on literature and practice. In: Case Studies in e-Government 2.0: Changing Citizen Relationships, ed. by Imed Boughzala, Marijn Janssen and Saїd Assar. Springer International Publishing, 183–98. DOI:10.1007/978-3-319-08081-9_11.
Wallis, Jillian C.; Pepe, Alberto; Mayernik, Matthew S.; Borgman, Christine L. (2008): An exploration of the life cycle of e-science collaborator data. Accessible via http://hdl.handle.net/2142/15122.
Weber, Tobias; Kranzlmüller, Dieter (2018): How FAIR can you get? Image retrieval as a use case to calculate FAIR metrics. In: Proceedings of the 2018 IEEE 14th International Conference on e-Science, 114–24. DOI:10.1109/eScience.2018.00027.
Wilkinson, Mark D.; Dumontier, Michel; Aalbersberg, IJsbrand Jan; Appleton, Gabrielle; Axton, Myles; Baak, Arie; Blomberg, Niklas; Boiten, Jan-Willem; da Silva Santos, Luiz Bonino; Bourne, Philip E and others (2016): The FAIR guiding principles for scientific data management and stewardship. In: Scientific data, (3). DOI:10.1038/sdata.2016.18.
Wittenburg, Peter; Strawn, George (2018): Common patterns in revolutionary infrastructures and data. Accessible via https://www.rd-alliance.org/sites/default/files/Common_Patterns_in_Revolutionising_Infrastructures-final.pdf.
© 2019 Walter de Gruyter GmbH, Berlin/Boston