Reliability and measurement error of exercise-induced hypoalgesia in pain-free adults and adults with musculoskeletal pain: A systematic review

Objectives ‒ We systematically reviewed the reliability and measurement error of exercise-induced hypoalgesia (EIH) in pain-free adults and in adults with musculoskeletal (MSK) pain. Methods ‒ We searched EMBASE, PUBMED, SCOPUS, CINAHL, and PSYCINFO from inception to November 2021 (updated in February 2024). In addition, manual searches of the grey literature were conducted in March 2022, September 2023, and February 2024. The inclusion criteria were as follows: adults – pain-free and with MSK pain – a single bout of exercise (any type) combined with experimental pre-post pain tests, and assessment of the reliability and/or measurement error of EIH. Two independent reviewers selected the studies, assessed their Risk of Bias (RoB) with the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) RoB tool, and graded the individual results (COSMIN modified Grading of Recommendations Assessment, Development, and Evaluation). Results ‒ We included five studies involving pain-free individuals (n = 168), which were deemed to have an overall “doubtful” RoB. No study including adults with MSK pain was found. The following ranges of parameters of

We believe studying EIH could serve different purposes. First, it could provide a model to understand the mechanisms of the acute effects of exercise on human pain processing, which are not clear yet [1]. Second, it could improve the adherence of pain sufferers to exercise interventions. The relationship between exercise-induced pain and EIH seems to differ between pain-free individuals and those with pain conditions [1]. The former consistently report EIH after somewhat painful exercise [20,23], with evidence even suggesting that painful exercise potentiates EIH [24]. In contrast, exercise-induced pain has been associated with impaired EIH in the latter [25,26]. Therefore, in the context of clinical pain, tailoring exercise prescriptions based on individual levels of exercise-induced hypo- or hyperalgesia could make sessions more tolerable, increasing adherence to exercise programs [27] and ultimately leading to enhanced clinical outcomes [28]. Third, the lack, or absence, of EIH could be a prognostic factor for clinical pain outcomes [29]. Indeed, the discrepancy in EIH between persons with and without pain led some researchers to hypothesize that an alteration in EIH could predict the prognosis of pain sufferers. Vaegter et al. [30] showed that among patients with knee osteoarthritis, those with stronger EIH before a total knee arthroplasty surgery reported less pain 6 months after surgery. However, Woznowski-Vu et al. [31] recently showed that EIH had no prognostic value for either pain or disability at 3 months follow-up in adults with back pain. This use of EIH thus warrants further investigation.
A mandatory step to achieve these objectives is to ensure that the hypoalgesic response to exercise is reliable. However, preliminary evidence in pain-free individuals suggests otherwise: studies on EIH using cycling [4,20], walking [5], and isometric [23] exercises combined with PPT and tolerance assessments have reported low reliability metrics. To our knowledge, there is no published or ongoing systematic review (SR) on this topic, although such a review would make it possible to gather all available evidence, assess its quality, and determine the most reliable ways to assess EIH. Thus, we aim to assess the reliability and measurement error of EIH in pain-free adults and adults with MSK pain, critically appraise the relevant literature, and formulate recommendations for the assessment of EIH.

Methods
Prior to this review, we searched three databases that index SRs in the health literature (EPISTEMONIKOS, PROSPERO, and Cochrane Library; from September 12, 2021 to September 19, 2021). No SR was identified in these databases or by a general web or manual search, and thus, we concluded that the reliability and measurement error of EIH had so far not been systematically reviewed and that no ongoing SR intended to do so. The results of these preliminary searches can be found in Supplementary material: 1. Preliminary searches. We followed the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guideline for SRs [32] and the reporting guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement; the completed PRISMA 2020 items checklist can be found in Supplementary material: 2. Completed PRISMA 2020 item checklist. The protocol for this SR was registered on PROSPERO on May 10, 2022 (CRD42022325298), before starting the data collection process.

Eligibility criteria
We used the framework provided by the COSMIN initiative [33] to determine the following eligibility criteria: Studies that used a single bout of exercise (providing mode, intensity, and duration) combined with an experimental pain induction procedure (pain threshold, pain tolerance, or pain ratings of experimentally induced painful stimuli) measured within 2 h upon exercise cessation. We chose this time-frame, also used by Bonello et al. [34], to target only studies on the effects of acute exercise while excluding those that considered longer-lasting effects of exercise. We excluded studies that did not adequately characterize their exercise intervention in terms of mode, intensity, and duration (unless the authors provided the missing information) and studies that assessed pain ratings of clinical pain (i.e., patients with MSK pain reporting a change in their usual pain level after exercise), used repeated exercise sessions in between measurements, did not include a pain-related measure within 2 h upon exercise cessation, or combined exercise with another intervention (e.g., manual therapy, neuromuscular electrical stimulation, medication, or supplementation).
• Language: No limits.
• Study type: Due to the nature of our research question, we targeted both observational studies and experimental studies (parallel or cross-over design) from any year. Both phases of the cross-over studies were considered. We excluded narrative reviews, SRs, meta-analyses, commentaries, book chapters, and conference reports. However, the references of these publications were searched.
• Measurement properties of interest: Reliability and measurement error of EIH. Details of these metrics are provided below.

Information sources and search strategy
The search strategy was designed and conducted by the principal investigator (VA). Team members provided feedback on key words and search strings (peer-reviewed by SA and DS). Five databases were searched using controlled vocabulary specific to each database, natural language terms related to the different components of the research question, and a validated search filter for finding studies on measurement properties designed by Terwee et al. [35]. The details of the search strings for each database can be found in the Supplementary material. No limits were applied. In total, 4,762 records were exported to a reference manager software (Endnote 20; endnote.com) and were imported, after duplicate removal, into the Covidence software (covidence.org). After completion of the study selection process, VA conducted manual searches examining the grey literature (March 27, 2022, September 04, 2023, and February 20, 2024) and the reference lists of the included studies (March 18, 2022).

Selection process
Two independent reviewers (VA and DS) screened, after a pilot phase on 25 references and using SR software (covidence.org), first the titles and abstracts of the studies considered for inclusion (first screening round) and then the full text of the studies (second screening round), using the criteria set out above. After each screening round, they discussed their decisions until they reached a consensus. A third reviewer (SA) settled any disagreement.

Data collection process
The principal investigator (VA) collected data from the included studies, after a pilot phase on three studies with the study team, using a standardized extraction form (Supplementary material: 4. Data extraction form). The data collected were reviewed and discussed with the study team until a consensus was reached. SA settled any disagreement. When information was missing, VA contacted the authors by email. The details of this correspondence can be found in Supplementary material: 5. Correspondence with the authors.

Data items
We collected data in several categories (the full list of the collected variables and outcomes is in Supplementary material: 4. Data extraction form). We defined reliability and measurement error, according to the COSMIN initiative [32], as "The proportion of the total variance in the measurements which is because of 'true' differences among patients" and "the systematic and random error of a patient's score that is not attributed to true changes in the construct to be measured," respectively. In this definition, "true change" refers to "the average score that would be obtained if the [measurements] were given an infinite number of times" [36]. We collected all the reliability parameters used by the authors of the included studies, but we only reported those that the COSMIN initiative [33] recommends for reliability or measurement error; when the authors calculated both the ICC and Pearson's r, we only reported the ICC. When studies used control and experimental conditions, we collected data related to EIH (e.g., baseline PPT) only from the experimental conditions. A meta-analysis of the effect size of EIH was not the aim of this review, so data from the control conditions were deemed unnecessary.

Study RoB assessment
We assessed the RoB of the included studies with the COSMIN RoB tool [33], which is recognized as the best current tool to assess the RoB of studies investigating the reliability and measurement error of an outcome measurement instrument. The COSMIN RoB tool uses a four-point system ("very good," "adequate," "doubtful," or "inadequate") to rate six common standards for reliability and measurement error (five on design requirements and one on "other flaws") and additional standards for preferred statistical methods: three for reliability and two for measurement error (Supplementary material: 6. COSMIN RoB tool). We summarized the RoB for each standard and determined the overall assessment for each study, as recommended by the COSMIN initiative, with the worst-score-counts method: the lowest rating of all standards determined the study RoB. When a study involved multiple test modalities (e.g., PPT and cuff pain tolerance), the RoB was assessed for each of them, because we considered that RoB could depend on how each test modality was applied and that within-study variations were possible. Two independent reviewers (VA and DS) assessed each study and discussed their results until they reached a consensus. A third reviewer (SA) settled any disagreement. We created plots (using R package "robvis" [37]) to report the RoB assessments.
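The worst-score-counts rule described above can be sketched in a few lines. This is only an illustration: the function name `overall_rob` and the example ratings are hypothetical and are not taken from the included studies or from the COSMIN materials.

```python
# Four-point COSMIN RoB scale, ordered from best to worst.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]


def overall_rob(standard_ratings):
    """Return the worst rating among the rated standards (worst-score-counts)."""
    return max(standard_ratings.values(), key=RATING_ORDER.index)


# Hypothetical ratings for one study and one test modality (illustrative only).
example = {
    "R1": "very good",
    "R2": "very good",
    "R3": "adequate",
    "R4": "doubtful",
    "R7": "adequate",
}
print(overall_rob(example))  # doubtful
```

A single "doubtful" standard is enough to make the overall rating "doubtful", whatever the other standards are rated.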

Data synthesis
We created evidence tables (using R packages "flextable" [38] and "officedown" [39]) to describe the study characteristics, the way of operationalization of EIH, the elements of reliability and measurement error of each individual study, and the outcomes. When studies reported the SEM, we computed the Smallest Detectable Change with a 95% confidence interval (SDC 95; a change score that will be observed in only 5% of the individuals who are "truly unchanged" [40]) as SDC 95 = SEM × √2 × 1.96 [41]. When relevant, we organized the data according to the test modality used and its site of application, which we divided into two categories: local (i.e., the stimulus was applied to a limb involved in the exercise) and remote (i.e., the stimulus was applied to a limb not, or less, involved in the exercise). We also created plots (using R package "ggplot2" [42]) to display graphically the values of ICC, SEM, and SDC 95.
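As a minimal sketch of the SDC 95 computation stated above (SEM multiplied by √2 and by 1.96), the snippet below uses a hypothetical SEM value. The function name `sdc95` is an assumption made for illustration.

```python
import math


def sdc95(sem):
    """Smallest detectable change at the 95% level: 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2) * sem


# Hypothetical SEM of 50 kPa (not a value taken from the included studies).
print(round(sdc95(50.0), 2))  # 138.59
```

The √2 factor reflects that a change score combines the error of two measurements, so the SDC 95 is always about 2.77 times the SEM.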
We rated the results of the statistics used in each study to assess reliability and measurement error as "+," "−," or "?," respectively "sufficient," "insufficient," and "indeterminate," according to the criteria for good reliability and measurement error from the COSMIN initiative [33] (Supplementary material: 7. COSMIN criteria for good reliability).
We assessed inconsistency in the results (i.e., ratings of "+" or "−" account for less than 75% of all ratings) and compared the designs (e.g., way of operationalizing EIH), populations, and statistical computations of the included studies. Although one of our objectives was to pool reliability parameters from included studies, this was not possible because EIH protocols and exercise parameters differed greatly between studies.
Two independent reviewers (VA and DS) graded the level of evidence of each individual result as "high," "moderate," "low," or "very low" with the modified Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach published by the COSMIN initiative [32]. A third reviewer (SA) settled any disagreement. GRADE assessments (details about the criteria used for GRADE are provided in Supplementary material: 8. COSMIN modified GRADE approach) were based on RoB in included studies, inconsistency in the ratings of the results of the included studies, imprecision (based on the total sample size of the summarized studies), and indirectness (i.e., the population of the summarized studies differed from the target population of this review). We reported the assessment of each outcome in an evidence table.
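The rating and inconsistency rules can be illustrated as follows. This sketch rests on stated assumptions: a coefficient of exactly 0.70 is treated as sufficient, a missing coefficient is rated indeterminate, and results are called inconsistent when neither "+" nor "-" ratings reach 75% of all ratings. The function names and the example ratings are hypothetical.

```python
def rate_reliability(coefficient, threshold=0.70):
    """Rate an ICC or kappa as '+', '-' or '?' (assumed rule: >= threshold is '+')."""
    if coefficient is None:
        return "?"  # nothing reported, so the result is indeterminate
    return "+" if coefficient >= threshold else "-"


def is_inconsistent(ratings, cutoff=0.75):
    """Assumed rule: inconsistent if neither '+' nor '-' reaches the cutoff share."""
    share_plus = ratings.count("+") / len(ratings)
    share_minus = ratings.count("-") / len(ratings)
    return max(share_plus, share_minus) < cutoff


# Hypothetical coefficients (illustrative only).
ratings = [rate_reliability(x) for x in [0.15, 0.45, 0.62, 0.05]]
print(ratings)                   # ['-', '-', '-', '-']
print(is_inconsistent(ratings))  # False
```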

Study selection
We identified 4,762 records (i.e., the title and/or abstract of a report, which supplies information about a study [43]) through database searching. After duplicate removal, 2,443 records remained, and 2,383 were excluded after screening titles and abstracts. We reviewed 58 full reports; two potentially eligible reports [44,45] were excluded as only their conference report could be found. The details of the reasons for exclusion after reading the full texts can be found in Supplementary material: 9. Reasons for exclusion of reports. We finally included five studies [4,5,20,23,46]. Then, we searched the references of the included studies and the grey literature, but no new eligible article was found. Figure 1 displays the PRISMA flow diagram of the study selection process.

Study characteristics
The characteristics of the included studies are described in Tables 1-4.

Study characteristics
Table 1 summarizes the information regarding study design, study setting, funding, sample characteristics, and the manifestation of EIH. Four studies [5,20,23,46] used a cross-over design (but only one [5] was randomized), and one used a within-group pre-post design [4]. All studies [4,5,20,23,46] recruited gender-mixed samples (females/males) of healthy adults. Three studies [4,5,46] reported that some combinations of test modality and site did not result in EIH: Gomolka et al. [4] assessed PPT at the hand in session 1; Hviid et al. [5] used PPT and cuff pressure pain threshold (cPPT) at any location in all sessions; and Vaegter et al. [46] assessed PPT at the upper trapezius muscle in all sessions. None of the studies included patients with MSK pain.

Way of operationalization of EIH
Tables 1-3 summarize the information regarding the way of operationalization of EIH. We found methodological similarities across all the included studies [4,5,20,23,46]: EIH was assessed with PPT applied on at least two sites (local and remote); a similar data processing and storage methodology was adopted (the data were first physically recorded, then transferred into an electronic spreadsheet and securely stored); subjects were instructed to refrain from physical activity and analgesic medication on the days of participation and were familiarized with the test modalities; EIH was computed as a continuous variable from the absolute change (post-exercise score of the test modality − baseline score of the test modality) and the relative change (absolute change/baseline score of the test modality × 100); the manifestation of EIH was dichotomized into "responders" and "non-responders" when the absolute change was, respectively, greater and lower than the within-session SEM of the test modality; and statistical analysis involved analysis of variance (ANOVA) to test for EIH occurrence (slight variations across studies are detailed in Table 3). Studies differed regarding their exercise component (see details in Table 2), experimental pain component (see details in Table 2), and preparatory actions (see details in Table 3): three studies [4,20,46] used a cycling exercise, one study [5] used a walking exercise, and one study [23] used an isometric squat exercise; standardized support during the exercise (i.e., motivation of the participants) was implemented in four studies [5,20,23,46]; one study [4] did not use a control condition, whereas all the others [5,20,23,46] compared exercise with rest; one study [5] used, besides the PPT, the cPPT and the cuff pressure pain tolerance (cPTT) to induce experimental pain; on the days of participation, subjects were instructed to refrain from caffeine (four studies [4,20,23,46]), nicotine (three studies [4,23,46]), and alcohol (one study [46]). Adjustments of the exercise apparatus (cycle-ergometer settings) were reported in three studies [4,20,46]; one study [46] conducted a sub-maximal test prior to the EIH sessions to determine the work rate at the lactate threshold (LT). None of the studies reported preparatory actions related to the equipment used to induce pain.
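The way EIH was computed and dichotomized in the included studies can be sketched as below. The function names and the PPT values are hypothetical; only the formulas (absolute change, relative change, and the SEM-based responder rule) follow the description above.

```python
def eih_absolute(baseline, post):
    """Absolute change: post-exercise score minus baseline score."""
    return post - baseline


def eih_relative(baseline, post):
    """Relative change: absolute change as a percentage of the baseline score."""
    return (post - baseline) / baseline * 100


def is_responder(baseline, post, sem):
    """Responder if the absolute change exceeds the within-session SEM."""
    return eih_absolute(baseline, post) > sem


# Hypothetical PPT values in kPa (illustrative only).
print(eih_absolute(400.0, 500.0))        # 100.0
print(eih_relative(400.0, 500.0))        # 25.0
print(is_responder(400.0, 500.0, 60.0))  # True
```

For threshold and tolerance measures, a positive change (higher post-exercise score) indicates hypoalgesia, and a negative change indicates hyperalgesia.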

Reliability and measurement error assessment
Table 4 presents information regarding the assessment of reliability and measurement error. All studies [4,5,20,23,46] investigated the test-retest reliability of EIH and targeted a population of healthy human adults. Also, in all of them, the assessor had received training (either 2 weeks [4] or 3 weeks [5,20,23,46]), the same rater conducted the assessment in all sessions, the procedures were repeated in all sessions and only the occasions varied, the sessions took place at the same time of day separated by 1 (all studies [4,5,20,23,46]) to 3 (one study [4]) weeks, and the reliability of EIH (when computed as a discrete variable) was assessed with unweighted kappa. Two-way mixed-effects, single-measurement intra-class correlation coefficients (ICC 3.1) for consistency (four studies [5,20,23,46]) and agreement (one study [4]) were used to investigate the reliability of EIH computed as a continuous variable. Measurement error was assessed in three studies [4,20,23] using SEM for consistency.
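To make the dichotomous analysis concrete, the following sketch computes an unweighted Cohen's kappa for responder status across two sessions. It is a generic textbook implementation, not the code used by the included studies, and the responder data are hypothetical.

```python
def cohen_kappa(session1, session2):
    """Unweighted Cohen's kappa for two equally long lists of labels.

    Undefined (division by zero) when both lists contain one single,
    identical category, because expected agreement is then 1.
    """
    n = len(session1)
    categories = set(session1) | set(session2)
    p_observed = sum(a == b for a, b in zip(session1, session2)) / n
    p_expected = sum(
        (session1.count(c) / n) * (session2.count(c) / n) for c in categories
    )
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical responder (1) / non-responder (0) status of 10 participants.
s1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
s2 = [1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
print(round(cohen_kappa(s1, s2), 2))  # 0.17
```

In this example, 60% of participants keep the same status, yet kappa is only 0.17 because 52% agreement would already be expected by chance.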

RoB in studies
The RoB assessments from studies investigating the reliability and measurement error of EIH are summarized in Figure 2. In all studies [4,5,20,23,46], we considered that subjects were stable between the sessions as only healthy participants were included; thus, we rated R1 and M1, related to the clinical stability of the population, "very good." The time interval between sessions ranged from 1 (all studies [4,5,20,23,46]) to 3 weeks (one study [4]); we considered that domains R2 and M2 were "very good." Four studies [4,20,23,46] stated that measurement sessions were identical, so the domains R3 and M3 were judged "very good"; in one study [5], we considered that the authors could have provided more evidence that sessions were identical and we deemed the domains R3 and M3 "adequate." We considered all the studies [4,5,20,23,46] to be [...]
Figure 2 legend: R1/M1, Were patients stable in the time between the repeated measurements on the construct to be measured? R2/M2, Was the time interval between the repeated measurements appropriate? R3/M3, Were the measurement conditions similar for the repeated measurements, except for the condition being evaluated as a source of variation? R4/M4, Did the professional(s) administer the measurement without knowledge of scores or values of other repeated measurement(s) in the same patients? R5/M5, Did the professional(s) assign scores or determine values without knowledge of the scores or values of other repeated measurement(s) in the same patients? R6/M6, Were there any other important flaws in the design or statistical methods of the study? R7, For continuous scores: was an intraclass correlation coefficient (ICC) calculated? R8, For ordinal scores: was a (weighted) kappa calculated? R9, For dichotomous/nominal scores: was kappa calculated for each category against the other categories combined? M7, For continuous scores: was the SEM, SDC, limits of agreement or coefficient of variation calculated? M8, For dichotomous/nominal/ordinal scores: was the percentage specific (e.g., positive and negative) agreement calculated? (cPPT, cuff pressure pain threshold; PPT, pressure pain threshold; cPTT, cuff pressure pain tolerance).
We judged the domain R7 as follows: "very good" (one study [4]), the authors used ICC 3.1 agreement to assess the test-retest reliability of EIH as a continuous score; "adequate" (four studies [5,20,23,46]), ICC 3.1 consistency was used to assess test-retest reliability and the authors provided evidence that no systematic errors occurred (similar EIH display in all sessions, no or minor differences between exercise parameters across sessions); "doubtful" (two studies [5,20]), ICC 3.1 consistency was used to assess test-retest reliability but differences across sessions in the magnitude of EIH were observed, reflecting a lack of agreement between sessions, potentially arising from bias. Domain R9 was considered "very good" in all studies [4,5,20,23,46] because unweighted kappa was used to measure the reliability of EIH as a dichotomous variable. The domain M7 was judged as follows: "adequate" in two studies [4,23], SEM consistency was used to determine measurement error and the authors provided evidence that no systematic errors occurred; "doubtful" in one study [20], SEM consistency was used and the manifestation of EIH varied across sessions. The overall RoB was deemed "doubtful" for all studies [4,5,20,23,46].
Table 5 summarizes this information and the reliability of EIH computed as a dichotomous score. Based on the COSMIN criteria for good reliability and measurement error, we rated all the ICCs and kappa values as "insufficient" ("−"; ICCs and kappas < 0.7) and the SDC 95 values as "indeterminate" ("?"; minimal important change [MIC] is not defined).

Certainty of evidence
We could not synthesize the individual results because the exercise conditions differed across studies. However, we graded the certainty of evidence for each individual result using a modified GRADE approach (Table 5). The overall RoB assessment of each study, for both reliability and measurement error, was "doubtful," which corresponds to a "serious" RoB; we downgraded the quality of evidence by 1 level. Imprecision refers to the total sample size, which was below 50 in each study; we downgraded the quality of evidence by 2 levels. Indirectness, which refers to differences between the target populations of the studies and the review, was considered absent because all studies [4,5,20,23,46] recruited gender-mixed samples of healthy adults, our population of interest; we did not downgrade the quality of evidence for indirectness. Inconsistency, which relates to the exclusion of heterogeneous results from a pooled result, was considered not applicable because we did not pool any of the results; we did not downgrade the quality of evidence for inconsistency. We downgraded the quality of evidence by three levels in total; thus, the level of certainty of each individual result was considered "very low." A tabular summary of the modified GRADE results for each study is provided in Supplementary material: 8. COSMIN modified GRADE approach.
Table 5: Results of individual studies, ratings, and GRADE.
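The downgrading arithmetic above can be written out as a short sketch. It assumes that the grading starts at "high" and cannot go below "very low"; the function name `grade_certainty` is hypothetical, while the number of levels per criterion is the one reported in the text.

```python
# Certainty levels of the modified GRADE approach, ordered from best to worst.
LEVELS = ["high", "moderate", "low", "very low"]


def grade_certainty(downgrades):
    """Start at 'high' (assumption) and move down one step per downgraded level."""
    total = sum(downgrades.values())
    return LEVELS[min(total, len(LEVELS) - 1)]


# Levels downgraded per criterion, as reported above.
downgrades = {
    "risk of bias": 1,
    "imprecision": 2,
    "indirectness": 0,
    "inconsistency": 0,
}
print(grade_certainty(downgrades))  # very low
```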

Main results
To our knowledge, this is the first SR on the reliability and measurement error of EIH. We found five studies [4,5,20,23,46] conducted in pain-free human adults but none in adults with MSK pain. Our results suggest, with a very low level of certainty, that the between-session test-retest reliability and the measurement error of EIH are, respectively, "insufficient" (i.e., ICCs and kappas < 0.7) and "indeterminate" (unknown MIC) in pain-free human adults.

Reliability and measurement error
The values of the ICCs of EIH varied across studies, test modalities, and sites and were below the COSMIN criteria for good reliability (ICC > 0.7), suggesting poor reliability of EIH. Overall, they appeared to be the highest (≥0.2) when cPPT and cPTT were used to elicit EIH and when the intensity of the exercise was standardized (e.g., using a percentage of maximal HR). The category of test site (i.e., local vs remote) did not seem to impact the ICC values, suggesting that the site plays a somewhat smaller role in EIH reliability than the test modality and exercise protocol. This is in contrast with the effect of site on the magnitude of EIH, where the hypoalgesic effect of exercise seems to be greater at a limb involved in the exercise [1,4,46].
The COSMIN initiative [33] recommends examining the measurement error parameters in relation to the value of the MIC to determine whether clinically relevant change can be distinguished from measurement error. While the MIC of EIH is currently unknown, we can compare the ranges of the SDC 95 with those of the mean absolute and relative change (used to compute EIH) to estimate the importance of error in the measurements [36]. In the three included studies that assessed measurement error [4,20,23], the ranges of the SDC 95 exceeded the magnitude of EIH. For example, in Gomolka et al. [4], the mean of the means of absolute change in PPT assessed at the biceps femoris muscle in the two sessions was 83.04 kPa and the estimated SDC 95 was 180.81 kPa. Thus, to investigate the effect of a variable/intervention on the EIH displayed by an individual with a similar baseline value of absolute change (83 kPa), one would need to observe more than a twofold change (180 kPa) in EIH to be confident that the observed change is not the mere result of the random fluctuations that 95% of people could display without "truly" changing [40]. Reliability is, in its formal expression, the ratio of the between-subject variability to the total variability (which encompasses the between-subject variability and the measurement error) [36]; therefore, the low estimates of reliability that were observed in the included studies could be explained by a high level of measurement error in EIH.
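The comparison made above for Gomolka et al. [4] amounts to simple arithmetic, sketched here with the two values quoted in the text. The variable and function names are hypothetical.

```python
mean_absolute_change_kpa = 83.04  # mean EIH at the biceps femoris, from the text
sdc95_kpa = 180.81                # estimated SDC 95, from the text


def exceeds_measurement_error(observed_change, sdc95):
    """True if a change is larger than the smallest detectable change."""
    return abs(observed_change) > sdc95


ratio = sdc95_kpa / mean_absolute_change_kpa
print(round(ratio, 2))  # 2.18
print(exceeds_measurement_error(mean_absolute_change_kpa, sdc95_kpa))  # False
```

The SDC 95 is about 2.2 times the mean EIH, so a change in EIH of the same size as the mean effect itself would not be distinguishable from measurement error.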

Limitations of the evidence included in the review
The evidence included [4,5,20,23,46] in this SR came from a single research group. It is mainly limited by its overall RoB, which was deemed "doubtful" in all studies. Our main concerns were the lack of blinding of the assessor, the lack of control of potential sources of error, and the statistical methods.

Sources of error
Vaegter et al. [52] showed that negative pre-exercise information reduces EIH, yet none of the included studies [4,5,20,23,46] assessed the participants' beliefs, expectations, and knowledge regarding exercise, EIH, and pain assessments [53]. Other factors that were not controlled in the included studies could modulate EIH through an impact on exercise performance, exercise metabolism, and/or nociceptive processing: the phase of the menstrual cycle (at least for aerobic exercise, as Hoeger Bement et al. [54] did not find an influence of the phase of the menstrual cycle on EIH after an isometric exercise), the support provided during the exercise, the intake of non-analgesic medications, and the pre-exercise nutritional status. Variations in these elements among participants and across sessions could have contributed to measurement error. A better understanding of the mechanism(s) [1,55,56] that drive EIH will be important to determine the variables to control in future studies.

Statistical methods
ICCs are ratios of the between-subjects variance to the total variance. The total variance includes the between-subjects variance, the residual variance, and the variances due to random and systematic error (i.e., bias) [57]; these components are accounted for differently across ICC models and formulas (e.g., the formula for consistency does not include the variance due to bias) [57].
All the included studies [4,5,20,23,46] used ICC 3.1, which implies that the raters were "fixed" (i.e., not random). This choice does not introduce bias in the reliability estimates but limits the generalization of the results to other raters [57]. Two studies [5,20] reported between-session differences in EIH, potentially caused by systematic errors, and used ICC 3.1 with the formula for consistency, thus leading to a risk of overestimation of the ICCs. Yet, it is unlikely that this had a significant impact on their results as their reliability estimates were similar to those of the other studies.
All studies [4,5,20,23,46] assessed the reliability of EIH as a dichotomous score. To classify participants, another method could have been to use the SDC 95 instead of the SEM as the threshold; it would have given a more precise level of confidence [40]. However, this approach might have proven impractical: the SDC 95 of the test modalities that we computed seems to exceed the post-exercise absolute changes reported in the included studies, which would have led to considering all participants as "non-responders." For example, the SDC 95 of the PPT at the biceps femoris muscle from Gomolka et al. [4] was 133.05 kPa, while the mean absolute change for the same parameters was 85.5 kPa. Furthermore, neither the SEM nor the SDC 95 is recommended for this use: they are metrics of measurement error, which relates to confidence in the absence of change in people who do not change, rather than confidence in correctly identifying change, which relates to responsiveness [40,41] (i.e., "the ability of an [outcome] instrument to detect change over time in the construct to be measured" [58]). Thus, a preliminary step for a classification approach could be to determine a more adequate criterion for "true EIH". A relevant possibility is the MIC, which is a metric of responsiveness and not measurement error and is thus more suitable to identify "responders" than the SDC 95 [40,41]. However, to our knowledge, it has never been investigated for EIH.
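The difference between the consistency and agreement formulas can be illustrated with a small sketch. It uses the usual single-measurement formulas derived from a two-way ANOVA (mean squares for participants, sessions, and residual error); the function name `icc_single` and the data are hypothetical and are not taken from the included studies.

```python
def icc_single(scores):
    """Return (consistency, agreement) single-measurement ICCs.

    scores: list of rows, one row per participant, one column per session.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ms_rows = ss_rows / (n - 1)  # between-participant mean square
    ms_cols = ss_cols / (k - 1)  # between-session mean square (bias)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    return consistency, agreement


# Hypothetical EIH scores: session 2 equals session 1 plus a constant shift.
data = [[1, 2], [2, 3], [3, 4], [4, 5]]
consistency, agreement = icc_single(data)
print(round(consistency, 2), round(agreement, 2))  # 1.0 0.77
```

With a purely systematic shift between sessions, the consistency formula returns a perfect ICC whereas the agreement formula penalizes the shift, which is why consistency can overestimate reliability in the presence of bias.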

Population
The reliability data obtained in pain-free participants cannot be generalized to clinical populations because reliability is a population-dependent construct [36]. The lack of data on the reliability and measurement error of EIH in MSK pain sufferers remains unexplained and stands in stark contrast to the substantial number of publications comparing EIH in patients with pain-free controls [1,59], with recent studies conducted, among others, in people with chronic low back pain [60], osteoarthritis [61], and whiplash-associated disorders (WAD) [62]. We believe that this knowledge gap stands in the way of effective clinical implementation of EIH: its use to guide exercise prescription or to predict clinical outcomes requires developing ways to induce it reliably in clinical populations and defining cut-off values to classify patients (e.g., the MIC).

Limitations of the review process
This review has used a comprehensive search strategy and an in-depth data collection process. Because of the heterogeneity in the methods of the included studies (i.e., clinical heterogeneity), we could not quantitatively pool their results. Also, because we only included five studies, we could not provide a statistical or graphical (e.g., funnel plot) assessment of publication bias, as we would have lacked statistical power [63]. However, we think that our results are not at high risk of publication bias: our search strategy was not limited to electronic databases; we carefully searched references and grey literature [64], and the reliability estimates of all the included studies were below the criteria for good reliability of the COSMIN initiative [33], whereas publication bias tends to be related to "positive" results [65].

Implications for future research
We have provided evidence that EIH might be unreliable, which can weaken the conclusions of the numerous studies on the topic [1]. Despite their limitations, our results demonstrate that the EIH literature must be interpreted with caution, highlight a knowledge gap regarding EIH reliability (especially in MSK pain sufferers), and can be used to provide recommendations for further research in this field. We suggest that future studies prioritize investigating the reliability and measurement error of EIH in patients with MSK pain and the MIC of EIH in these populations. Furthermore, we recommend that future studies tailor the intensity of the exercise conditions, explore means of experimental pain induction other than PPT, blind the assessors/raters to the scores of the participants, and consider the participants' beliefs and expectations regarding pain and exercise. Also, additional studies are needed to shed light on the sources of variation of EIH and its underlying mechanism(s). For reliability studies, we consider that researchers should use two-way random-effects ICC models with the formula for agreement: this would allow them to generalize their results to randomly selected raters (and thus to other research teams) and prevent an overestimation of reliability in the presence of bias. Finally, when more studies on the reliability of EIH using a homogeneous methodology are published, future SRs on the topic could consider a meta-analysis that subgroups studies according to the exercise type, exercise intensity, and testing site to improve the precision of the reliability and measurement error estimates, thus increasing the certainty of the results.
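The recommendation to prefer the agreement formula can be illustrated with a small sketch. It assumes the standard single-measurement definitions of the two-way ICC for agreement, often written ICC(2,1), and for consistency, often written ICC(3,1), computed from the mean squares of a two-way ANOVA. The function name and the EIH scores are hypothetical and were chosen only to show the effect of a systematic shift between sessions.

```python
def icc_single(scores):
    """Return (agreement, consistency) single-measurement ICCs.

    `scores` is a list of rows, one row per participant and one
    column per session (or rater).
    """
    n = len(scores)      # participants
    k = len(scores[0])   # sessions or raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-participants mean square
    msc = ss_cols / (k - 1)              # between-sessions mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    # Agreement penalizes systematic differences between sessions (msc).
    agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    # Consistency ignores them.
    consistency = (msr - mse) / (msr + (k - 1) * mse)
    return agreement, consistency


# Hypothetical EIH scores: session 2 is shifted upward by about 10 units.
eih_scores = [[10, 22], [20, 28], [30, 41], [40, 49], [50, 62]]
icc_agreement, icc_consistency = icc_single(eih_scores)
print(f"agreement: {icc_agreement:.2f}, consistency: {icc_consistency:.2f}")
```

With these made-up data, the consistency estimate stays close to 1 while the agreement estimate is noticeably lower, because only the agreement formula counts the systematic shift between sessions as error.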

Conclusions
We conclude, with a very low level of certainty, that EIH is insufficiently reliable and that its measurement error is indeterminate in pain-free human adults. Furthermore, these measurement properties of EIH have never been investigated in patients with MSK pain, a literature gap that should be prioritized by researchers in the field of EIH. To improve their methodological quality, future studies could consider using exercises with standardized intensity (e.g., using a percentage of maximal HR), experimental pain induction methods other than PPT (e.g., cuff pressure pain tolerance), rater blinding, strict control of the sources of variation (e.g., participants' expectations), appropriate computation of ICCs (i.e., agreement), and averages of multiple measurements.

Figure 2: RoB in studies investigating the (a) reliability and (b) measurement error of EIH. Domains: R1/M1, Were patients stable in the time between the repeated measurements on the construct to be measured? R2/M2, Was the time interval between the repeated measurements appropriate? R3/M3, Were the measurement conditions similar for the repeated measurements, except for the condition being evaluated as a source of variation? R4/M4, Did the professional(s) administer the measurement without knowledge of scores or values of other repeated measurement(s) in the same patients? R5/M5, Did the professional(s) assign scores or determine values without knowledge of the scores or values of other repeated measurement(s) in the same patients? R6/M6, Were there any other important flaws in the design or statistical methods of the study? R7, For continuous scores: was an intraclass correlation coefficient (ICC) calculated? R8, For ordinal scores: was a (weighted) kappa calculated? R9, For dichotomous/nominal scores: was kappa calculated for each category against the other categories combined? M7, For continuous scores: was the SEM, SDC, limits of agreement or coefficient of variation calculated? M8, For dichotomous/nominal/ordinal scores: was the percentage specific (e.g., positive and negative) agreement calculated? (cPPT, cuff pressure pain threshold; PPT, pressure pain threshold; cPTT, cuff pressure pain tolerance).

Figure 3: Reliability of EIH. EIH was computed as (a) absolute change and (b) relative change. The ICCs determined in the included studies are reported with their 95% confidence intervals. They vary according to the site of the test and the test modality applied, which was further categorized as local (i.e., the stimulus was applied to a limb involved in the exercise) and remote (i.e., the stimulus was applied to a limb not, or less, involved in the exercise). All values are below the COSMIN criteria for good reliability (ICC > 0.7). (cPPT, cuff pressure pain threshold; PPT, pressure pain threshold; cPTT, cuff pressure pain tolerance).

Figure 4: Measurement error of EIH. EIH was computed as (a and b) absolute change and (c and d) relative change. The SEMs of (a) absolute change (SEM abs) and (c) relative change (SEM rel) determined in the included studies are reported. We computed the SDC 95 of (b) absolute change (SDC 95 abs) and (d) relative change (SDC 95 rel). They vary according to the site of the test and the test modality applied, which was further categorized as local (i.e., the stimulus was applied to a limb involved in the exercise) and remote (i.e., the stimulus was applied to a limb not, or less, involved in the exercise). To compare the SEM abs and SDC 95 abs across studies, we divided them by the mean of the mean baseline (mean b) values across sessions and multiplied by 100. (cPPT, cuff pressure pain threshold; PPT, pressure pain threshold; cPTT, cuff pressure pain tolerance).

Table 2: Way of operationalization of EIH, data collection, and data processing/storage

Table 3: Way of operationalization of EIH, preparatory actions, and EIH scoring methods

Table 4: Reliability and measurement error assessment. NR, not reported; NA, not applicable; EIH, exercise-induced hypoalgesia; PPT, pressure pain threshold; cPPT, cuff pressure pain threshold; cPTT, cuff pressure pain tolerance; ICC, intraclass correlation coefficient; SEM, standard error of measurement; Model 3.1, two-way mixed-effects, single rater/measurement model.