Systems Biology Markup Language (SBML) Level 3 Package: Distributions, Version 1, Release 1

Abstract
Biological models often contain elements that have inexact numerical values, since they are based on values that are stochastic in nature or data that contains uncertainty. The Systems Biology Markup Language (SBML) Level 3 Core specification does not include an explicit mechanism to include inexact or stochastic values in a model, but it does provide a mechanism for SBML packages to extend the Core specification and add syntactic constructs. The SBML Distributions package for SBML Level 3 adds the necessary features to allow models to encode information about the distribution and uncertainty of values underlying a quantity.


Revision History
The following table summarizes the history of this document.

1 Introduction and motivation

1.1 What is the Distributions package?
The Distributions package (also known as distrib) provides an extension to SBML Level 3 that extends MathML to allow draws from distributions, and also provides the ability to annotate model elements with information about the distribution their values were drawn from.

The Distributions package adds support to SBML for sampling from a probability distribution. In particular the following are in scope:

workaround within the core SBML language itself, although it is possible to define the necessary information using annotations on SBML elements. Frank Bergmann proposed such an annotation scheme for use with SBML Levels

This required that a library of definitions be maintained as part of the SBML standard and in their proposal they defined an initial small set of commonly used distributions. The proposal was never implemented.

The "distrib" package was discussed at the Seattle SBML Hackathon and this section is an almost verbatim reproduction of Darren Wilkinson's report on the meeting. In the meeting, Darren presented an overview of the problem, building on the old proposal from the Newcastle group (see above: Section 2.2.1). There was broad support at the meeting for development of such a package, and for the proposed feature set. Discussion following the presentation led to consensus on the following points:
■ There is an urgent need for such a package.
■ Random numbers must not be used in rate laws or anywhere else that is continuously evaluated, as then simulation behaviour is not defined
■ Although there is a need for a package for describing extrinsic noise via stochastic differential equations in SBML, such mechanisms should not be included in this package due to the considerable implications for simulator developers
■ We probably don't want to layer on top of UncertML (www.uncertml.org), as this spec is fairly heavy-weight, and somewhat tangential to our requirements

quantities should be considered.

It was suggested that new constructs could be introduced into SBML via user-defined functions by embedding "distrib" constructs in a manner illustrated by the following example:

This approach allows the use of a "default value" by simulators which do not understand the package (but simulators which do will ignore the <math> element). The package would nevertheless be "required", as it will not be simulated correctly by software which does not understand the package.

Informal discussions following the break-out covered topics such as:

Class boxes are also drawn with dashed lines to further distinguish them.

Blue: Items colored blue are new components introduced in this package specification. They have no equivalent in the SBML Level 3 Core specification.

Red lines: Classes with red lines in the corner are fully defined in a different figure.

We also use the following typographical conventions to distinguish the names of objects and data types from other entities; these conventions are identical to the conventions used in the SBML Level 3 Core specification document:

and primitive types drawn from the XML Schema language may or may not start with a capital letter.

[elementName]: In some cases, an element may contain a child of any class inheriting from an abstract base class. In this case, the name of the element is indicated by giving the abstract base class name in brackets, meaning that the actual name of the element depends on whichever subclass is used.

documents, with no semantic changes between the two in any distrib element, due to the addition of id and name to the DistribBase class.

In addition, SBML documents using a given package must indicate whether the package may be used to change the mathematical meaning of SBML Level 3 Core elements. This is done using the attribute required on the <sbml> element in the SBML document. For the Distributions package, the value of this attribute must be "true", as it defines new csymbols that may be used in any MathML. Note that the value of this attribute must always be set to "true", even if the particular model does not contain any of these csymbols.

XML Namespace use
For element names, XML has clear rules about how to declare and use namespaces. In typical SBML documents, the Distributions namespace will be defined as above, and elements will therefore need to be prefixed with "distrib:".

In contrast to element names, XML attribute names are completely defined by the element in which they appear, and never have a "default" namespace defined. The element itself declares whether any attributes should be defined with a namespace prefix.

Following the typical convention used by SBML packages, any attribute that appears in a UML diagram in this specification may either be defined with no namespace prefix, or be defined with the distrib namespace as a prefix.

(No attributes are defined here as extensions of existing core SBML elements, and thus none of them are required to have the distrib namespace as a prefix.)

Primitive data types
The Distributions package uses data types described in Section 3.1 of the SBML Level 3 Core specification, and adds the primitive types described below.

The type ExternalRef is derived from the type string with the additional requirement that it be a valid URI. An

The type UncertKind is derived from the type string and its values are restricted to being one of the following possibilities: "coefficientOfVariation", "kurtosis", "mean", "median", "mode", "sampleSize", "skewness", "standardDeviation", "standardError", "variance", "confidenceInterval", "credibleInterval", "interquartileRange", "range", "externalParameter", and "distribution". Attributes of type UncertKind cannot take on any other values. The meaning of these values is discussed in the context of the UncertParameter class's definition in Section 3.11 on page 19.

The Distributions package has two simple purposes. First, it provides a mechanism for sampling a random value from a probability distribution. This implies that it must define the probability distribution and then must sample a random value from that distribution. Second, it provides a mechanism for describing elements with information about their uncertainty. An example use case for this is to provide the standard deviation for a value. Another might be describing a parameter's distribution so it could be used in a parameter scan experiment.

Sampling from probability distributions is achieved by allowing new MathML elements, and encoding uncertainty by extending SBase, which in turn uses the Uncertainty class. Several distributions and statistics are defined explicitly in this specification, but more can be defined by referencing an external ontology such as ProbOnto through the UncertParameter class.

When a call to a distribution is defined in the extended Math, it is sampled when it is invoked. If a particular sampled value should be used multiple times, that value must be assigned to a parameter first, such as through the use of an InitialAssignment or EventAssignment. When an external distribution is defined, it is not used in the math of the model, but may be used externally where appropriate.

The newly-allowed csymbol elements are defined in Table 1 on the next page.
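As a rough illustration of the sampling semantics described above (each invocation of a distribution is a fresh draw, and a value that must be reused has to be assigned to a parameter first), the following Python sketch mimics that behaviour outside of SBML. The function name normal, the variable names, and the seed are illustrative stand-ins and are not defined by this package.

```python
import random

rng = random.Random(42)  # seeded only so that the sketch is reproducible

def normal(mean, stdev):
    """Stand-in for one invocation of a normal draw: returns one fresh sample."""
    return rng.gauss(mean, stdev)

z = 5.0

# Two invocations are two independent draws, so the values differ.
first_call = normal(z, 10)
second_call = normal(z, 10)

# To reuse a single sampled value, draw once (as an InitialAssignment would)
# and refer to the stored value afterwards.
y = normal(z, 10)
reuse_a = y
reuse_b = y
```

In this sketch, first_call and second_call play the role of two separate uses of the distribution in the model math, while y plays the role of a parameter that received its value from a single draw.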

Many of the distributions take exactly two or four arguments (or exactly one or three arguments). For those functions, the optional last two arguments are min and max, for when the draw from the distribution is constrained to be between those two values. For all functions, the min boundary is inclusive; that is, a value of min may be returned by the function (though this may be very unlikely for draws from a continuous distribution). For all continuous distributions, the max boundary is not inclusive; that is, a value of max will never be returned. The continuous distributions are normal, cauchy, chisquare, exponential, gamma, laplace, lognormal, and rayleigh. For the discrete distributions, the max boundary is inclusive: that is, a value of max may indeed be returned. The discrete distributions are binomial and poisson.

The value of min must be less than the value of max for all continuous distributions, and the value of min must be less than or equal to the value of max for all discrete distributions. Additionally, the min and max values of a discrete distribution must span at least one integer between them, inclusive.

To define a distribution with only one bound, the other bound should be defined as INF or -INF, as appropriate.
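The bound rules above can be sketched in Python as follows. This is only one possible reading: it assumes that a constrained draw can be realised by rejection sampling (redrawing until the value falls inside the bounds), which this specification does not prescribe and which can be slow when the bounds lie far in a tail. The function names are hypothetical.

```python
import math
import random

def truncated_normal(mean, stdev, lo=-math.inf, hi=math.inf, rng=random):
    """Draw from normal(mean, stdev) restricted to lo <= x < hi (rejection sampling)."""
    if not lo < hi:
        raise ValueError("continuous distributions require min < max")
    while True:
        x = rng.gauss(mean, stdev)
        if lo <= x < hi:  # min is inclusive, max is exclusive
            return x

def discrete_bounds_valid(lo, hi):
    """Check that [lo, hi] (both inclusive) contains at least one integer."""
    if lo > hi or lo == math.inf or hi == -math.inf:
        return False
    if lo == -math.inf or hi == math.inf:
        return True  # a one-sided or unbounded range always contains an integer
    return math.ceil(lo) <= math.floor(hi)
```

A one-sided constraint is expressed by leaving the other bound at its infinite default, for example truncated_normal(0.0, 1.0, lo=0.0).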

For those distributions that have an intrinsic lower bound of 0, setting min to 0 or any negative number will have no effect, but is legal.

The versions of cauchy and laplace with one argument draw from the corresponding distribution with that argument as its scale value, and a value of "0" for its location.

If an SBML interpreter is unable to calculate one or more of the above extended MathML functions, it may simply fail, or it might choose to return the mean of the given function instead. In either case, it is a good idea to inform the user that the model cannot be interpreted by the software as intended. Note that the mean of a discrete distribution is not necessarily a legal return value for that function, as it may not be an integer.

The mean values in Table 2 on the following page may be used as a fallback for software that cannot perform draws from a distribution. Note that truncated versions of these functions will have different means. Note also that the cauchy distribution has no mean, by definition.

desirable, and the information is not provided in a separate package, this information may be incorporated into a future version of this specification.

Any other package that defines new contexts for MathML will also be either discrete or continuous. Discrete situations (such as those defined in the SBML Level 3 Qualitative Models package) are, as above, well-defined.

Continuous situations (as might arise within the Spatial Processes package, over space instead of over time) will most likely be ill-defined. Those packages must therefore either define for themselves how to handle distrib-extended MathML elements, or leave it to some other package/annotation scheme to define how to handle the situation.

Using a normal distribution
In this example, the initial value of y is set as a draw from the normal distribution normal(z, 10):

This use would apply a draw from a normal distribution with mean z and standard deviation 10 to the symbol y.

The DistribBase class
The DistribBase class is an abstract base class which is the parent class for every class in this Distributions package.

The id attribute is of type SId, and must be unique among other ids in the SId namespace in the parent Model, and has no mathematical meaning, unless stated otherwise in the definition of that object. The name attribute is of type string, and is provided to allow the user to define a human-readable label for the object. It has no uniqueness restrictions.

The extended SBase class
As can be seen in Figure 2, the SBML base class SBase is extended to include an optional Uncertainty child containing information about the distribution or set of samples from which they were drawn.

A few SBML elements can interact in interesting ways that can confuse the semantics here. A Reaction element and its KineticLaw child, for example, both reference the same mathematical formula, so only one should be extended with an Uncertainty child element. Similarly, the uncertainty of an InitialAssignment will be identical to the uncertainty of the element it assigns to, and therefore only one of those elements should be extended.

Other elements not listed above should probably not be given an Uncertainty child, as it would normally not make sense to talk about the uncertainty of something that doesn't have a corresponding mathematical meaning. However, because packages or annotations can theoretically give new meaning (including mathematical meaning) to elements that previously did not have them, this is not a requirement.

It is important to note that the uncertainty described is defined as being the uncertainty at the moment the element's mathematical meaning is calculated, and does not describe the uncertainty of how that element changes over time.

to change over time due to unknown processes, but which had a known average and standard deviation could be given an AssignmentRule that set that Species amount to the known average, and the AssignmentRule itself could be given an Uncertainty child describing the standard deviation of the variability.

just as they can with other annotation information.

Note that for elements that change in value over time, the described uncertainty applies only to the element's initial state, and not to how it changes in time. For typical simulations, this means the element's initial assignment.

The UncertParameter children are named according to their class, so any UncertSpan child will have the element name uncertSpan, and any UncertParameter base class child will have the element name uncertParameter.

Propagation of error
It may be possible to propagate the error defined in Uncertainty elements through the mathematics defined in a simulation of the model. Be advised that this will be a complicated system, and may involve calculating partial derivatives of equations that are not explicitly encoded. Many simulators choose instead to estimate the error through stochastic simulations. Either approach should be possible with a properly encoded distrib model.
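The two approaches mentioned above can be compared on a small example. The Python sketch below is illustrative only: it assumes independent inputs, uses first-order (partial-derivative) propagation with numerically estimated derivatives, and for the stochastic estimate assumes each input is normally distributed. The function names and the numeric values are hypothetical and are not part of this specification.

```python
import math
import random

def propagate_first_order(f, values, stdevs, h=1e-6):
    """First-order standard deviation of f(values), assuming independent inputs."""
    variance = 0.0
    for i, sigma in enumerate(stdevs):
        up = list(values)
        up[i] += h
        down = list(values)
        down[i] -= h
        partial = (f(*up) - f(*down)) / (2 * h)  # central-difference derivative
        variance += (partial * sigma) ** 2
    return math.sqrt(variance)

def propagate_monte_carlo(f, values, stdevs, n=20000, seed=0):
    """Estimate the same quantity by sampling every input from a normal distribution."""
    rng = random.Random(seed)
    outputs = [f(*(rng.gauss(v, s) for v, s in zip(values, stdevs)))
               for _ in range(n)]
    mean = sum(outputs) / n
    return math.sqrt(sum((o - mean) ** 2 for o in outputs) / (n - 1))

def amount(conc, volume):
    """Hypothetical model formula: an amount computed from a concentration and a size."""
    return conc * volume

analytic = propagate_first_order(amount, [2.0, 3.0], [0.1, 0.2])
sampled = propagate_monte_carlo(amount, [2.0, 3.0], [0.1, 0.2])
```

For this product formula the first-order result is sqrt((3.0 * 0.1)^2 + (2.0 * 0.2)^2) = 0.5, and the sampled estimate should land close to it; larger relative uncertainties or strongly nonlinear formulas would make the two diverge.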

Attributes inherited from SBase
An Uncertainty always inherits the optional metaid and sboTerm attributes, and inherits optional id and name attributes as described in Section 3.8 on page 17. The id of an Uncertainty has no mathematical meaning.

The type attribute defines what the UncertParameter describes. Depending on the type, other attributes will be allowed or not, and the class must either be the base UncertParameter or the UncertSpan class, according to Table 3 on the next page.

3.11.4 The definitionURL attribute
The optional definitionURL attribute (of type ExternalRef) may be used when the type of the UncertParameter is "distribution", and must be used when the type of the UncertParameter is "externalParameter". The

The optional math element contains MathML, and may only be used for an UncertParameter of type "distribution" or "externalParameter". When defined for a "distribution", the MathML should define that distribution, such as by using one of the extended csymbol definitions from this specification.

3.11.7 The child ListOfUncertParameters element
The optional child ListOfUncertParameters element may only be used for an UncertParameter of type "distri-

If x has a continuous probability distribution P, then [a, b] is a 95% confidence interval if $\int_a^b P(x)\,dx = 0.95$. Unless specified otherwise, the confidence interval is usually chosen so that the remaining probability is split equally, that is $P(x < a) = P(x > b)$. If x has a symmetric distribution, then the confidence intervals are usually centered around the mean. However, non-centered confidence intervals are possible and are better described by their lower and upper quantiles or levels. For example, a 50% confidence interval would usually lie between the 25% and 75% quantiles, but could in theory also lie between the 10% and 60% quantiles, although this would be rare in practice. The confidenceInterval allows you the flexibility to specify non-symmetric confidence intervals; however, in practice we would expect the main usage to be for symmetric intervals.
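To make the interval definition above concrete, the following Python sketch computes an equal-tailed 95% interval and a non-centered 50% interval for a standard normal distribution using the standard library. The choice of a standard normal distribution and the helper name equal_tailed_interval are assumptions made for illustration.

```python
from statistics import NormalDist

def equal_tailed_interval(dist, level=0.95):
    """Return (a, b) such that P(x < a) = P(x > b) = (1 - level) / 2."""
    tail = (1.0 - level) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

dist = NormalDist(mu=0.0, sigma=1.0)

# Equal-tailed 95% interval: centered on the mean for a symmetric distribution.
a, b = equal_tailed_interval(dist)
coverage = dist.cdf(b) - dist.cdf(a)

# Non-centered 50% interval between the 10% and 60% quantiles.
lo, hi = dist.inv_cdf(0.10), dist.inv_cdf(0.60)
shifted_coverage = dist.cdf(hi) - dist.cdf(lo)
```

Both intervals satisfy the defining integral for their level; only the first one splits the remaining probability equally between the two tails.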

The confidenceInterval child of an Uncertainty is always the 95% confidence interval. For other confidence intervals, use an UncertParameter of type "externalParameter" instead.

Finally, we have the "distribution" and "externalParameter" types:
■ distribution: If the uncertainty is defined by a known distribution, that distribution may either be defined by using the child math element, or by using the definitionURL. When the math child is used, that math should contain the distribution in question: typically this will be a distribution csymbol but may be something more complicated, like a piecewise function. If the definitionURL is used, many more distributions may be used than are defined in this specification (like an externalParameter, below). To fully define this distribution, it will almost certainly be necessary to further define that distribution with child UncertParameter elements. For example, a Beta distribution takes two parameters (α and β), each of which could be defined by a child UncertParameter of type "externalParameter", with appropriate definitionURL values. A type of value "distribution" is only valid for UncertParameter elements, not UncertSpan elements.

■ externalParameter: This type is uniquely described by an appropriate definitionURL, and is provided to allow a modeler to encode externally-provided parameters not otherwise explicitly handled by this specifi-

That is, it is the expected value of the deviation from the mean to the power k. In particular, $\mu_0 = 1$, $\mu_1 = 0$ and $\mu_2$ is the variance of x.

■ correlation: The correlation between two random variables $x_1$ and $x_2$ is the extent to which these variables vary together in a linear fashion. It is characterized by the coefficient $\rho = \frac{E[(x_1 - \mu_1)(x_2 - \mu_2)]}{\sigma_1 \sigma_2}$, where $\mu_1$ and $\mu_2$ are the means of $x_1$ and $x_2$ respectively, and $\sigma_1$ and $\sigma_2$ are their respective standard deviations.

Note this is strictly not a description of uncertainty, but it can be useful to represent the correlation between two variables. Generally a covariance specification would be preferred since this describes the uncertainty.

■ decile: A decile, d, is any of the nine values that divide the sorted quantities into ten equal parts, so that each part represents 1/10 of the sample, population or distribution. The first decile is equivalent to the 10th percentile.

■ moment: For a given positive natural number k, the kth moment of a random variable x is defined as $\mu_k = E[x^k]$. In particular, $\mu_0 = 1$ and $\mu_1$ is the mean of x. The moments can be defined with respect to some point a, that calculated as

■ probability: Given a random variable x with probability density function f(x), the probability that x lies in some part of its domain X is defined as $P(x \in X) = \int_{x \in X} f(x)\,dx$. X can be defined as a lower- or upper-bounded range, e.g., $P(x < 3.2)$, or as the intersection of several such ranges, e.g., $P(x \geq 1.7 \cap x < 3.2)$.
■ quantile: Given a random variable x, the n-quantiles are the values of x which split the domain into n regions of equal probability. For instance, the kth n-quantile is the value $q_k$ for which $P(x < q_k) = k/n$. For some common values of n, the n-quantiles have additional names, namely quartiles for n = 4, deciles for n = 10 and percentiles for n = 100. More generally, a quantile can be associated to any probability p, so that q is the value of x below which a proportion p of the probability lies, i.e., $P(x < q) = p$. The plot on the right shows the 1st to 9th 10-quantiles (or deciles) for a normal distribution (µ = 4, σ = 1) as orange dots. The blue curve is the cumulative distribution function of x. Note how the quantiles split the probability (y-axis) into 10 equal regions.
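The decile example described above (a normal distribution with µ = 4 and σ = 1) can be reproduced numerically. The short Python sketch below uses the standard library's inverse cumulative distribution function; the variable names are illustrative only.

```python
from statistics import NormalDist

dist = NormalDist(mu=4.0, sigma=1.0)

# The k-th decile q_k satisfies P(x < q_k) = k / 10 for k = 1..9.
deciles = [dist.inv_cdf(k / 10) for k in range(1, 10)]

# Evaluating the cumulative distribution at each decile recovers k / 10.
levels = [dist.cdf(q) for q in deciles]
```

The fifth decile coincides with the median, which for this symmetric distribution equals the mean of 4.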

The uncertainty of a Species
A Species is a unique SBML construct in that its value is either an amount or a concentration, depending on the value of its hasOnlySubstanceUnits attribute ("true" for amount, or "false" for concentration). The value of its uncertainty tracks with this: if the value of hasOnlySubstanceUnits on the parent Species is "true", the uncertainty is in terms of amounts, and if "false", the uncertainty is in terms of concentration.

If a Species is being modeled in SBML in amounts, but was measured in terms of its concentration, or vice versa, an InitialAssignment should be created that explicitly handles this conversion and assigns the appropriate value to the Species, as in the example below.

Here, the uncertainty of the species "S_amt" is not set explicitly, and instead can be derived from the uncertainty of the values in its initial assignment ("S_conc" and "C").

Examples using Uncertainty
Several examples are given to illustrate the use of the Uncertainty class:

3.15.1 Basic Uncertainty example
In this example, a species is given an Uncertainty child to describe its standard deviation:

<species id="s1" compartment="C" initialAmount="3.22" hasOnlySubstanceUnits="true"

Here, the species with an initial amount of 3.22 is described as having a standard deviation of 0.3, a value that might be written as "3.22 ± 0.3". This is probably the simplest way to use the package to introduce facts about the uncertainty of the measurements of the values present in the model.

It is also possible to include additional information about the species, should more be known:

<species id="s1" compartment="C" initialAmount="3.22" hasOnlySubstanceUnits="true"

In this example, the initial amount of 3.22 is noted as having a mean of 3.2, a standard deviation of 0.3, and a variance of 0.09. Note that the standard deviation can be calculated from the variance (or vice versa), but the modeler has chosen to include both here for convenience. Note too that this use of the Uncertainty element does not imply that the species amount comes from a normal distribution with a mean of 3.2 and standard deviation of 0.3, but rather that the species amount comes from an unknown distribution with those qualities. If it is known that the value was drawn from a particular distribution, an UncertParameter of type "distribution" should be used, rather than UncertParameter elements of type "mean" and "standardDeviation".

Note also that 3.22 (the initialAmount) is different from 3.2 (the mean): evidently, this model was constructed as a realization of the underlying uncertainty, instead of simply using the mean.

Defining a random variable
In addition to describing the uncertainty about an experimental observation, one can also use this mechanism to describe a parameter as a random variable. In the example below the parameter, Z, is defined as following a gamma distribution, with a given shape and scale. No value is given for the parameter, so it is then up to the modeler to decide how to use this random variable. For example, they may choose to simulate the model, in which case they may provide values for shape_Z and scale_Z and then sample a random value from the simulation. Alternatively, they may choose to carry out a parameter estimation and use experimental observations to estimate shape_Z and scale_Z.

For added information, the modeler has chosen to include the observed mean and variance of the value. These are close to the expected mean and variance from the given distribution (1.0 and 0.1, respectively, given the shape and scale), but were slightly different due to the sample size.

<parameter id="shape_Z" value="10" constant="true"/>
<parameter id="scale_Z" value="0.1" constant="true"/>
<parameter id="Z" constant="true">
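The expected mean and variance quoted above follow from the standard gamma identities mean = shape × scale and variance = shape × scale². The Python sketch below checks those values for shape_Z = 10 and scale_Z = 0.1, and draws a small sample to show why observed statistics differ slightly from the expected ones. The sample size of 50 and the seed are arbitrary assumptions made for illustration.

```python
import random

shape_Z = 10.0
scale_Z = 0.1

# Expected statistics of a gamma distribution with the given shape and scale.
expected_mean = shape_Z * scale_Z            # approximately 1.0
expected_variance = shape_Z * scale_Z ** 2   # approximately 0.1

# A finite sample gives observed statistics that are close, but not identical.
rng = random.Random(7)  # arbitrary seed, for reproducibility only
samples = [rng.gammavariate(shape_Z, scale_Z) for _ in range(50)]
observed_mean = sum(samples) / len(samples)
observed_variance = (
    sum((s - observed_mean) ** 2 for s in samples) / (len(samples) - 1)
)
```

Python's random.gammavariate takes the shape as its first argument and the scale as its second, which matches the shape and scale parameterization used in this example.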

Defining external distributions
If an SBML value is drawn from a distribution not defined explicitly in this specification, it is necessary to use an UncertParameter of type "externalParameter" to define the distribution's parameters. In this example, the parameter p1 was drawn from a zeta distribution, with a shape parameter of 2.37. An UncertParameter of type "distribution" is created with the 'zeta' URI, with a child UncertParameter of type "externalParameter" with the 'shape' URI for its definitionURL. For readability, 'zeta' and 'shape' were used as the names of these parameters.

It is also possible to create even more complex structures with the UncertParameter scheme. In this example, we define a categorical distribution based on data from three patients. The parent UncertParameter is defined to be the 'categorical' distribution, with three 'category' children, each with two child 'value' and 'probability' parameters.

<uncertParameter type="distribution" definitionURL="http://dist.org/categorical">
  <listOfUncertParameters>
    <uncertParameter type="externalParameter" id="p1" definitionURL="http://dist.org/category">
      <listOfUncertParameters>
        <uncertParameter type="externalParameter" value="1.01"
                         definitionURL="http://dist.org/cat_val"/>
        <uncertParameter type="externalParameter" value="0.5"

When implementing Distributions support, it would be possible to include "backwards" support for this annotation convention by wrapping any calls to a distribution in a FunctionDefinition, and annotating that using this scheme.

Table 4 is taken from the above document by Frank Bergmann, and can be used as a template if translating from that FunctionDefinition system to the Distributions extended Math system. The suggested fallback function returns the mean of the distribution. As an example, here is a complete (if small) model that uses the above "custom annotation" scheme:

The Arrays package
This package is dependent on no other package, but might rely on the Arrays package to provide vector and matrix structures if those are desired/used. Note that currently, the only case where arrays could be used is when an UncertParameter of type "externalParameter" is defined that requires array input or output.

SBML Level 3 Version 2
This package may be used with either SBML Level 3 Version 1 Core, or SBML Level 3 Version 2 Core, and no construct in this package changes as a result: the addition of id and name to DistribBase means that the addition of those attributes to SBase in SBML Level 3 Version 2 Core is redundant.

Another change between SBML Level 3 Version 1 and Version 2 is that in Version 2, core elements and core Math may refer to package ids with mathematical meaning. However, Distributions UncertParameter elements do not have mathematical meaning, and may not be used in this fashion. Instead, the var attribute should be used to connect the element to a core Parameter, rather than using the value attribute. This approach has the advantage of working both in Version 1 and Version 2 of SBML Core.

These constructs can be used in identical ways in other SBML Level 3 packages.

The following examples are more fleshed out than the ones in the main text, and/or illustrate features of this package that were not previously illustrated.

This is a very straightforward use of a log normal distribution. The key point to note is that a value is sampled from the distribution and assigned to a variable when it is invoked in the initialAssignment elements in this example.

For convenience and brevity, we use the shorthand "distrib:x" to stand for an attribute or element name x in the namespace for the Distributions package, using the namespace prefix distrib. In reality, the prefix string may be different from the literal "distrib" used here (and indeed, it can be any valid XML namespace prefix that the modeler or software chooses). We use "distrib:x" because it is shorter than writing a full explanation everywhere we refer to an attribute or element in the Distributions package namespace.

Attributes from this package are listed in these rules as having the "distrib:" prefix, but as is convention for SBML

distrib-10302
The value of a distrib:id must conform to the syntax of the SBML data type SId (Reference: Section A. Validation of SBML documents

distrib-10303
The value of a distrib:name must have a value of data type string.

distrib-20306
The attribute distrib:value on an UncertParameter must have a value of data type double.

distrib-20307
The value of the attribute distrib:var of an UncertParameter object must be the identifier of an existing object derived from the SBase class and defined in the enclosing Model object.

distrib-20507
The attribute distrib:valueUpper on an UncertSpan must have a value of data type double.

editing, and the template upon which this document is based.