Published by De Gruyter, February 7, 2020

Applying Virtualized Real-Time Response Measurement on TV-Discussions with Multi-Person Panels

Thomas Waldvogel


Televised debates are major events in electoral campaigns, serving voters as a substantial source of political information and reaching millions of potential voters. Scholars have made use of this potential by applying Real-Time Response Measurement (RTR) to assess reception and perception processes. However, RTR research has so far focused almost exclusively on duel scenarios. In this paper, we argue that this focus on the duel format in political and communication science research is inappropriate. We therefore apply virtualized RTR to TV debates with a multi-person panel. Drawing on data from two field studies (n = 1191/1058) conducted with the Debat-O-Meter, an innovative virtualized RTR measurement toolbox, in the course of the 2017 federal election in Germany, we show that virtualized RTR measurement can indeed produce valid and reliable data for TV discussions with a multi-person podium. Additionally, we find perception processes to be shaped primarily by party identification and prior political preferences such as candidate orientation. Furthermore, our results provide strong evidence that candidate preferences are substantially affected by debate reception. Overall, our data follow a structure well known from research on TV duels. Research on TV discussions with a multi-person panel is thus compatible with the existing repertoire of methods, offers great potential for political communication research, and yields results that can be linked to current findings in empirical debate research while making its own contribution to the field.

1 Introduction

Televised debates are major events in electoral campaigns that serve voters as a substantial source of political information, enabling a direct comparison of opposing policies, parties and their leaders, while reaching thousands or even millions of voters. Attracting a large share of the potential electorate, TV debates concentrate the electoral contest into a one-evening format that covers major topics and showcases the main contenders. As “miniature campaigns” (Maier and Faas 2011b), such debates reach many undecided viewers and serve as a substantial source of direct campaign information for voters who are less politically sophisticated or less attentive to politics.

Due to their prominent role in election campaigns, political debates have attracted major public and academic attention. So far, media coverage and academic research have primarily focused on televised debates in a duel format, in which the two main contenders from the biggest parties, which are most likely to appoint the head of government, discuss current issues. However, the duel format has repeatedly been criticized (see Donsbach 2002): critics argue that it distorts political competition, as it perpetuates structural disadvantages for small parties (Donsbach 2002: p. 22). Guido Westerwelle, then Chancellor candidate of the FDP, used this argument to lodge a constitutional complaint with the Federal Constitutional Court over the public broadcasters’ decision not to include him in their first televised debate of the 2002 federal election campaign. The Constitutional Court eventually rejected Westerwelle’s claim with reference to the principle of “graded equal opportunities” (Federal Constitutional Court 2002, 2 BvR 1332/02); however, whether smaller parties should be allowed to participate in political TV discussions remains highly controversial (Bachl et al. 2013; Wagschal et al. 2017), especially in view of the changing political landscape observed in the 2017 election.

Besides their outstanding role in electoral campaigns and media coverage, the focus on duel scenarios in political and communication research also stems from technical limitations of existing RTR instruments. A large body of work has relied on Real-Time Response Measurement techniques to capture the instantaneous reactions of the audience, with physical devices such as sliders, dials and push-button systems being the standard approach. First, this reliance on physical devices restricts research to lab-based settings, with potentially negative side effects on, e.g., external validity (Reinemann and Maurer 2010: p. 253; Maier et al. 2018: p. 615). Second, due to their design, dial and slider systems are restricted to one evaluation item with two evaluation objects or poles, which makes them well suited to duel scenarios but impedes RTR studies on TV discussions with a multi-person panel (Waldvogel and Metz 2017).

The problems associated with this situation are increasingly being addressed with virtualized, internet-based approaches that free researchers from the need to use physical devices in laboratory settings. Virtualized RTR seems to offer several benefits: (1) higher cost-efficiency, since participants use their own mobile devices; (2) improved spatial representation, since interviewees no longer have to enter a laboratory and can respond to the media stimulus via the internet in natural reception settings at home; (3) larger samples and subgroups, increasing the robustness of the derived findings; and (4) greater flexibility in implementing the instrument, as different configurations of the graphical user interface (GUI) can be realized in software. Furthermore, the scaling and measurement level of the collected data can be customized flexibly, as can the input mode (reset mode vs. latched mode). Most importantly with respect to our research interest, the number of evaluation items and objects can easily be adapted to various discussion scenarios. Virtualized RTR thus enables scholars to move beyond duel scenarios and facilitates studies on TV discussions with a multi-person panel.

So far, only a few studies have investigated perception processes and the effects of receiving TV discussions with a multi-person panel on political preferences (Metz et al. 2016; Faas and Maier 2017). Studies dealing with methodological issues are entirely lacking when it comes to multi-person panel scenarios. Hence, most findings on televised debates are derived from studies on the duel format. How far existing knowledge can be transferred to TV discussions with a multi-person panel is an open question, precisely because such formats have so far hardly been accessible to RTR measurement. Evaluating more than two individuals imposes a higher cognitive demand on recipients than duel situations do, because a simple friend-or-foe logic such as incumbent vs. challenger no longer applies and evaluations might follow more complex patterns. In addition, outside the lab, controls for internal validity are relaxed in order to improve external validity. Thus, while virtualized devices promise to improve accessibility, their reliability and validity in multi-person panel scenarios remain largely unexplored. Filling these research gaps is the main goal of this paper. To this end, four major research questions need to be addressed regarding TV discussions with a multi-person panel:

RQ1: Does virtualized RTR generate reliable data?

RQ2: Is RTR field data internally valid?

RQ3: What shapes perception processes of political TV debates with a multi-person podium?

RQ4: Does the reception of a political TV debate with a multi-person panel affect political preferences?

We specify these research questions in Section 4. The article opens with a general description of the current state of research on televised debates, including a brief review of the literature. We continue by presenting the Debat-O-Meter, a mobile RTR-App we developed for the 2017 federal election in Germany, and describe its features and functions. We then return to our research questions and state our theoretical assumptions. Afterwards, we detail the studies’ procedures and designs and present rich data from diverse samples collected with the Debat-O-Meter. We proceed with a description of our measures, analytical techniques and data, followed by a detailed report of our results. Finally, the article closes with a discussion of the limits and benefits of virtualized RTR for measuring viewers’ instantaneous reactions to TV discussions with a multi-person panel.

2 State of Research

Political debates have drawn increasing international attention – not only in public but also in academia, first and foremost in political and communication science (e.g. Maurer and Reinemann 2003; Maurer et al. 2007; Schill and Kirk 2009, 2014; Wolf 2010; Bachl et al. 2013; Papastefanou 2013; Boydstun et al. 2014b; Faas and Maier 2014; Metz et al. 2016; Schill et al. 2016). There is a vast body of research investigating how recipients perceive political debates broadcast on TV and how these debates affect them (for an overview see e.g. Benoit et al. 2003; McKinney and Carlin 2004). Most of this work has relied on RTR to measure viewers’ instantaneous reactions in real time. Measuring viewers’ immediate reactions to particular elements of audio-visual stimuli such as televised debates transcends monolithic views on the perception and effects of media content. RTR measurement reveals individual information processing and, by continuously surveying respondents throughout the full length of the stimulus, reduces problems faced by ex-post survey designs, such as serial-position effects, failing memory, rationalization, social desirability, retrospection effects or hindsight bias (Schill 2016: p. 14).

The flourishing literature based on RTR can be divided into two branches. The first focuses primarily on information processing and on the effects of debate reception on, e.g., political involvement, candidate preferences and voting choice. The second turns towards methodological issues, investigating the reliability and validity of RTR measurement.

With regard to the first branch, RTR-based research has produced fundamental findings on the perception of televised debates and their effects on potential voters. Existing studies have shown that recipients’ political predispositions play a fundamental role in how debate exposure affects the audience. In general, challengers benefit more from political debates than the incumbents they confront, since they are less well known, especially if the debate takes place early in the campaign, when voters are still largely undecided and uninformed (Benoit et al. 2002; McKinney and Warner 2013). Research has also shown that debate exposure affects viewers’ evaluations of candidates and the parties they represent. Voters with a strong party identification (henceforth PID) tend to reinforce existing voting preferences and candidate evaluations, whereas undecided voters who are attentive to politics in general but show a weak PID are more susceptible to conversion (Maier 2007a; Maier and Faas 2011a; Bachl 2013; McKinney and Warner 2013). Debate exposure can raise the willingness to participate in an upcoming election (Faas and Maier 2004a; Klein 2005; McKinney et al. 2011; Maier et al. 2013) and may affect voting choice, meaning that a good performance in a TV debate is of great importance for a candidate’s electoral success (Benoit and Hansen 2004; Faas and Maier 2004b; Klein and Pötschke 2005; Klein and Rosar 2007; Maier 2007c; Maier et al. 2013; McKinney and Warner 2013). Furthermore, research has found that debate exposure can enhance political knowledge as well as internal and external political efficacy. Less politically sophisticated voters in particular can benefit, since TV debates can reduce disparities in political efficacy and knowledge (Benoit et al. 2003; Benoit and Hansen 2004; Maurer and Reinemann 2006; Maier 2007b; Faas and Maier 2011; McKinney et al. 2011; Maier et al. 2013; McKinney and Warner 2013; Gottfried et al. 2014).

A major feature of RTR-based debate research is the identification of structural processes of perception and of effects on recipients: TV debates have been shown to influence voting choice indirectly by affecting candidate orientations, which are based on pre-existing preferences and shaped by perceived debate performance. The perception of who won a debate is largely a function of the instantaneous evaluations during the discussion, even though partisans usually expect “their” candidate to win.

The second branch focuses on methodological issues by testing tools with different technical configurations (Reinemann et al. 2005; Maier et al. 2007, 2009, 2016a; Papastefanou 2013; Metz et al. 2016). Scholars have assessed the reliability of RTR instruments with varying approaches: (1) test-retest, (2) parallel-test and (3) split-half designs. Since test-retest designs seem inappropriate for measuring an audience’s spontaneous reactions to the same media stimulus, as reactions to a repeated exposure are no longer spontaneous, it is not surprising that studies relying on a test-retest procedure report inconsistent findings: while Fenwick and Rice (1991) and Hughes and Lennox (1990) find high test-retest correlations, Boyd and Hughes (1992) report coefficients of 0.53–0.64, which they assess as “low scores.”

In addition, RTR has been examined for parallel-test reliability (Reinemann et al. 2005; Maier et al. 2007, 2016a; Metz et al. 2016). In their comparative study of a push-button and a slider RTR system, Maier et al. (2007) identify significant correlations of 0.38 for the whole media stimulus and up to 0.69 for certain key sequences. Beyond these findings on physical devices, relatively few studies have examined the reliability of virtualized RTR devices. Metz et al. (2016) provide first insights by comparing a virtualized slider version of the Debat-O-Meter with physical dials, finding correlations of 0.77 between two randomized groups following a televised debate of a 2016 state election in Germany in a controlled lab design. These findings are corroborated by Maier et al. (2016a), who report parallel-test reliability coefficients exceeding 0.51. They compare two non-randomized groups: one followed the 2013 chancellor debate in Germany with physical dials in a laboratory setting, while a second, smaller group of thirty-two students followed the debate with a pre-installed mobile RTR app at home.
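To make the parallel-test logic concrete, here is a minimal Python sketch under stated assumptions: each group’s ratings are first averaged per time interval into one aggregate curve, and the two curves are then correlated with Pearson’s r. The per-interval aggregation, the toy data and the function names (`aggregate_curve`, `parallel_test_reliability`) are illustrative assumptions, not the procedure of the cited studies, which may aggregate or align their data differently.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def aggregate_curve(group):
    """Mean rating per time interval across all participants of one group.

    `group` is a list of participants; each participant is a list of
    ratings, one per time interval, all of equal length.
    """
    return [mean(interval) for interval in zip(*group)]


def parallel_test_reliability(group_a, group_b):
    """Correlate the aggregate curves of two groups that followed the same
    stimulus with different (or differently configured) devices."""
    return pearson(aggregate_curve(group_a), aggregate_curve(group_b))


# Hypothetical toy data: two small groups, four time intervals each.
dials = [[1, 0, -1, 2], [0, 0, -1, 1]]
app = [[1, 1, -1, 2], [0, 0, -2, 1]]
print(round(parallel_test_reliability(dials, app), 3))
```

Aggregating first means the two groups need not have the same size or the same members; only their group-level response curves over the stimulus are compared.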

Studies based on split-half designs have repeatedly found high internal consistency. Hallonquist and Suchman (1944) report inter-correlations between 0.95 and 0.99, Hallonquist and Peatman (1947) find correlations of 0.80 to 0.94, and Schwerin (1940) reports coefficients of 0.89 and 0.93. More recently, Papastefanou (2013) recommended Cronbach’s alpha for calculating the reliability of RTR data; testing his ambulatory RTR device, he reports internal consistency scores above 0.90. Applying this approach, Bachl (2014) finds coefficients from α = 0.92 to α = 0.95.

Research on methodological issues also raises the question of the internal validity of RTR data, which has been addressed in two ways. A common concept for assessing internal validity is construct validity, conceived as a significant association between RTR evaluations and PID, on the assumption that party identification is an individual’s stable, affective tie to a political party (e.g. Campbell et al. 1960) and therefore determines the perception and evaluation of political contenders and issues. A second concept is criterion validity: post-debate verdicts on the contenders’ debate performances are substantially associated with viewers’ spontaneous real-time impressions, and this association remains even after controlling for PID (Biocca et al. 1994; Reinemann et al. 2005; Maier et al. 2007; Maier and Strömbäck 2009; Bachl 2013; Maier 2013; Papastefanou 2013). With respect to virtualized RTR measurement, Maier et al. (2016a) verify construct validity by analyzing the association between real-time responses and PID. They also demonstrate criterion validity, operationalized as a significant correlation between the RTR score and the evaluation of perceived debate performance in the post-survey. These correlations are confirmed by a path analysis yielding a structure well known from previous research on televised debates in Germany (Maier et al. 2007; Bachl 2013). Overall, several studies have confirmed that RTR data correlate with related variables in the expected ways, drawing a positive picture of their internal validity.

Conversely, the external validity of RTR studies is much less assured, considering that most knowledge derives from quasi-experimental laboratory designs tied to physical devices with a limited number of participants (Reinemann and Maurer 2010: p. 253; Maier et al. 2018: p. 615). It should also be noted that the findings presented above are based almost exclusively on studies investigating duel formats such as presidential debates and chancellor duels. Only a few recent studies examine TV discussions with a multi-person panel (Metz et al. 2016; Faas and Maier 2017).

In summary, RTR-based debate research has produced various and profound findings. However, these results are based almost exclusively on studies of TV duels. The specifics of perception processes and of the effects of receiving TV discussions with a multi-person podium have attracted less attention, so the findings described above have not yet been verified for these discussion scenarios. To reduce this gap, we examine the extent to which findings on the duel format also hold for TV debates with a multi-person podium. For this purpose, we present panel data from two field studies collected with the Debat-O-Meter in the course of the 2017 federal election in Germany. This novel tool for collecting survey data and real-time reactions to political TV debates is described in detail in the next section.

3 The Debat-O-Meter

The Debat-O-Meter (see Figure 1) is a web-based application for mobile devices, developed in an interdisciplinary research project of computer and political scientists at the University of Freiburg, Germany. It combines the functionalities of a virtualized RTR tool with survey modules in a modular structure and thus extends beyond a classical RTR input device. It is a virtual platform on which scholars can conduct “conventional” RTR studies online: it provides a “virtual laboratory” by implementing the phasic structure well known from RTR-based study designs in laboratory settings. The modular structure of the Debat-O-Meter follows the classical RTR study concept and ensures internal validity. After anonymous registration, users are introduced to the instrument interface and its measurement instructions in a tutorial. In a pre-survey, participants are then asked about their sociodemographic background, political preferences and patterns of political behavior as well as their expectations for the upcoming debate. The core of the online tool is the RTR module, which allows participants to evaluate the debate in real time in accordance with the measurement instructions. The gathered data are instantly saved on the server, each record containing the rating value, the exact timestamp and the user’s pseudonym. Immediately after the debate, all users are redirected to a post-survey. At the end of the process, the Debat-O-Meter displays the participant’s evaluations from the debate, split by the panel topics, together with the individual overall rating of the candidates. Finally, users receive a summary of all participants’ ratings and the overall perception of the debate.

Figure 1: Module structure and graphical user interfaces of the Debat-O-Meter.

4 Research Questions and Hypotheses

At the outset, we posed four major research questions on applying virtualized RTR to TV discussions with a multi-person panel. In the following, we specify these research questions theoretically.

Firstly, we turn to the internal consistency of RTR evaluations, which is considered an established indicator of reliability in the methodological literature. This refers to our first research question (RQ1): Does virtualized RTR generate reliable data?

Ever since its invention, the reliability of RTR has been scrutinized with different approaches: split-half designs, test-retest procedures and parallel-test designs (Hallonquist and Suchman 1944; Hallonquist and Peatman 1947; Hughes and Lennox 1990; Fenwick and Rice 1991; Boyd and Hughes 1992; Reinemann et al. 2005; Maier et al. 2007, 2016a; Kercher et al. 2012; Papastefanou 2013; Metz et al. 2016). What these approaches share is the aim of uncovering the internal consistency of the evaluations. Only recently, Papastefanou (2013) applied Cronbach’s alpha as a parameter to assess the reliability of RTR measurement (see also Bachl 2014: p. 108–110), in order to monitor how well participants’ ratings intercorrelate across the debate. This approach treats the stream of incoming ratings as a test-retest scenario, assuming that the RTR signal is a longitudinal measurement that continuously poses the same question throughout the debate (Schill 2016: p. 14). Hence, if the data are collected reliably, participants’ reactions should not vary considerably in the short run. Substantial correlations across the distinct quasi-items are thus interpreted as evidence of an internally consistent measurement procedure. The literature regards α-values above 0.9 as indicating high reliability and values above 0.8 as acceptable (Bortz and Döring 2006: p. 199). Assuming that the Debat-O-Meter produces reliable data, we hypothesize that the α-scores in our studies remain above the critical threshold of 0.8.
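The alpha computation described above can be sketched as follows, assuming the ratings have been arranged as a participants × time-intervals matrix in which each interval serves as one quasi-item and intervals without input are coded 0. The formula is the standard one, α = k/(k − 1) · (1 − Σ item variances / variance of the sum score); the function names and the interval-based item definition are illustrative assumptions rather than the exact procedure of Papastefanou (2013) or Bachl (2014).

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha with time intervals treated as quasi-items.

    ratings[p][t] is participant p's rating in interval t; intervals
    without a button press are assumed to be coded as 0 (neutral).
    """
    n_items = len(ratings[0])
    if len(ratings) < 2 or n_items < 2:
        raise ValueError("need at least two participants and two intervals")
    # Sum of the sample variances of each quasi-item across participants.
    item_variance = sum(variance(column) for column in zip(*ratings))
    # Sample variance of each participant's summed rating over all intervals.
    total_variance = variance([sum(row) for row in ratings])
    return n_items / (n_items - 1) * (1 - item_variance / total_variance)


def is_reliable(alpha, threshold=0.8):
    """Check alpha against the 0.8 threshold hypothesized in the text."""
    return alpha > threshold


# Hypothetical toy matrix: three participants, two intervals.
print(cronbach_alpha([[2, 1], [1, 1], [0, -1]]))
```

With real debate data, the matrix would have one column per interval (for example, per second, which is itself an assumption), so k runs into the thousands. Because alpha tends to rise with the number of items, very high values from long debates should be read with that in mind.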

Secondly, we turn to the internal validity of our generated RTR-data relating to our second research question (RQ2): Is RTR field data internally valid?

To address this issue, we draw on a causal model well known in debate research (Maier 2007a). Based on this model, the methodological literature conceives internal validity in two ways: construct validity is defined as a significant association between RTR and PID, which is assumed to determine the perception of political actors and issues. Criterion validity, on the other hand, is conceived as a substantial correlation between the instantaneous evaluations via RTR and the concluding assessment of each discussant’s debate performance after the discussion.

In order to answer the two remaining research questions, we extend this established model in Figure 2 (see Bachl 2013). The expanded model provides information about the structural processes of perception by modelling their context. In our studies, perception processes are conceived in two ways: The RTR data provides insight into the spontaneous real-time reactions of the viewers. The post-survey, on the other hand, provides clues to a more deliberate evaluation of the perceived debate performance. Unveiling these processes refers to our third research question (RQ3): What shapes perception processes of political TV debates with a multi-person podium?

Figure 2: Structural equation model.

The displayed structural equation model also provides insights into the effects on the audience caused by the reception of televised debates, since it incorporates political preferences before and after reception. This addresses our fourth research question (RQ4): Does the reception of a political TV debate with a multi-person panel affect political preferences?

Figure 2 provides an overview of the processes which we expect to see prior to, during, and after the reception of the televised debate. Related models have repeatedly been verified for TV duels, but such a model has never been applied to and validated for TV debates with a multi-person panel. The following subsection explains the individual paths of the model (for the following see also Bachl 2013: p. 171–176).

Party identification is considered the central construct of voting behavior (Campbell et al. 1960). It is therefore reasonable to expect that PID strongly influences the reception of televised debates. This assumption has repeatedly been tested and confirmed in lab-based reception studies of duel discussions (e.g. Maier 2007a). We test whether it remains valid for recipients who watch a TV debate with a multi-person panel and evaluate it at home, in their “natural” surroundings, with a virtualized form of RTR (see also Maier et al. 2016a). We assume that PID significantly influences all relevant constructs: it reinforces the candidate evaluation given prior to the debate (Path A1) and pre-determines the expectations of debate performance (Path A2). In line with the literature, we also assume that PID, as a long-term stable “psychological party membership,” leads to selective perception of the debate (e.g. Maier and Faas 2011b). Viewers follow the discussion through their “political lens,” and we expect their impressions during the televised debate to be decisively shaped by their PID (Path A3). The internal validity of our data collection centers on this assumption, which is defined as construct validity in the methodological literature (Reinemann et al. 2005; Bortz and Döring 2006: p. 201f.; Maier et al. 2007, 2016a; Papastefanou 2013). Construct validity holds for our virtual RTR instrument if it generates data consistent with an empirically tested theory, in this case PID, that incorporates the construct to be measured, in our case the perception of the debate measured via RTR. PID does not only shape viewers’ political predispositions and their reception of the debate: our model further assumes that PID continues to influence the evaluation of the candidates (Path A4) and the evaluation of the discussants’ performance (Path A5) even after the debate.
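A simple bivariate check of the construct-validity assumption (Path A3) can be sketched as follows. It assumes one mean RTR score per participant and candidate and a flag for whether the participant identifies with that candidate’s party; the function returns the mean difference between partisans and other viewers and the point-biserial correlation (Pearson’s r with a 0/1 dummy). This is an illustrative sketch with hypothetical names, not the estimation strategy of the study, whose model is a structural equation model.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def construct_validity(rtr_scores, shares_pid):
    """Relate one candidate's RTR scores to party identification.

    rtr_scores[i]: participant i's mean RTR rating of the candidate (-2..+2).
    shares_pid[i]: True if participant i identifies with the candidate's party.
    Returns (mean difference partisans minus others, point-biserial r).
    """
    partisans = [s for s, pid in zip(rtr_scores, shares_pid) if pid]
    others = [s for s, pid in zip(rtr_scores, shares_pid) if not pid]
    dummy = [1.0 if pid else 0.0 for pid in shares_pid]
    return mean(partisans) - mean(others), pearson(rtr_scores, dummy)


# Hypothetical toy data for one candidate.
diff, r = construct_validity([1.5, 1.0, -0.5, 0.0], [True, True, False, False])
print(diff, round(r, 3))
```

A clearly positive difference and correlation for each candidate would be in line with the expected PID effect; the full model additionally controls for the other predispositions discussed below.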

Alongside PID, other political predispositions of the viewers may strongly influence all subsequent evaluations. It can therefore be assumed that the assessment of a candidate prior to the debate influences how the candidate is rated after the debate (Path B4). Furthermore, due to partisan effects, we presume that individuals consistently evaluate their preferred candidates more positively, leading to higher expectations of debate performance (Path B1), higher ratings throughout the debate (Path B2) and higher ratings in retrospect (Path B3). We further consider the influence of the expected debate performance: anticipating path dependence, we assume a noticeable effect on the measured performance during (Path C1) and after (Path C2) the debate. [1]

By taking these two paths into account, the model incorporates the evaluation of debate performance twice: first, through the RTR evaluation of all single statements the candidates make during the debate, and second, through a final evaluation of each candidate’s performance after the debate. While the post-survey item is of a more concluding character, the live RTR evaluation captures the viewers’ spontaneous reactions. The RTR rating and the post-survey evaluation thus have the same objective but differ in measurement. For that reason, a strong correlation between the two variables can be assumed (Path D2). This assumption is central to the validity of RTR measurement and is discussed in the methodological literature under the term criterion validity (Bortz and Döring 2006: p. 200 f.). Criterion validity refers to empirical correlations in the expected direction between the measuring instrument under test and alternatively measured, external criteria (see Reinemann et al. 2005; Maier et al. 2007). The judgement about debate performance can serve as one such criterion for RTR measurement, based on the consideration that the ratings given during the debate add up to a judgement about how each candidate performed. The more positively a viewer perceives a candidate during the debate, the more likely the viewer is to evaluate the candidate as the winner of the debate afterwards. Criterion validity would thus be established if a strong correlation existed between the viewers’ impressions during the debate (RTR) and their concluding judgement (post-survey). Finally, we assume that in both cases the perceived debate performance affects the evaluation of the candidates after the televised debate (Paths D1 and E1): the better (poorer) a candidate’s performance was perceived, the better (poorer) the candidate should be evaluated after the debate.
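The criterion-validity check (Path D2), including the finding from the literature that the association remains after controlling for PID, can be approximated by a first-order partial correlation. The sketch below assumes per-participant vectors for one candidate: the mean RTR rating during the debate, the post-survey performance verdict and a PID dummy. The toy values, the verdict scale and the names are hypothetical; the formula is the standard partial correlation r_xy·z = (r_xy − r_xz · r_yz) / √((1 − r_xz²)(1 − r_yz²)).

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def partial_correlation(x, y, z):
    """Correlation of x and y with the linear influence of z removed."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5


# Hypothetical data for one candidate, one entry per participant.
rtr_mean = [1.2, 0.8, -0.3, 0.1, -1.0, 0.5]  # mean RTR rating (-2..+2)
post_verdict = [5, 4, 2, 3, 1, 4]            # post-survey performance rating
pid_dummy = [1, 1, 0, 0, 0, 1]               # 1 = identifies with the party

raw_r = pearson(rtr_mean, post_verdict)
controlled_r = partial_correlation(rtr_mean, post_verdict, pid_dummy)
print(f"r = {raw_r:.2f}, partial r (PID controlled) = {controlled_r:.2f}")
```

If the partial correlation stays substantial, the RTR signal carries information about perceived performance beyond what partisanship alone explains, which is the logic behind the criterion-validity argument.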

5 Data and Methods

To address these research questions, we present data from two recent studies based on the Debat-O-Meter. RTR measurement has primarily been implemented in studies covering duel scenarios, with two discussants from the major parties debating opposing policies. As this duel scenario seems inappropriate given Germany’s political and electoral system, we test in the following whether our assumptions hold for televised debates with more than two candidates. We use data from two debates that took place in the run-up to the 2017 federal election in Germany. The studies are detailed below.

Before describing the studies, we detail how participants were recruited. Through cooperation with numerous media partners, the Debat-O-Meter was offered as a second screen, as a participative project and as an innovative tool of political education, notably to motivate less politically sophisticated citizens. As an incentive for media partners to encourage their audiences to participate, we offered resources both for preliminary debate coverage and for next-day analyses at the aggregate data level for post-debate reporting. Individual participants also received an analysis at the individual level within the app.

Since our recruitment strategy was based on “open access” with the aim of engaging as many participants as possible, the data were subsequently filtered to obtain an appropriate sample for analysis. Hence, we focus on participants who completed the pre-survey before the debate started, [2] who rated the candidates via RTR and who filled out the post-survey within a reasonable time span. [3] Additionally, we removed cases whose rating behaviour ruled out sincere, human participation. [4] We are therefore confident that the structure and quality of the resulting data come close to those of a regular lab setting.
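The case-selection logic can be sketched as a filter over participant records. The concrete thresholds used in the studies are given in footnotes [2] to [4], which are not reproduced here, so the values below (a 30-minute window for the post-survey and at most five presses per second) and the record layout are hypothetical placeholders chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Participant:
    """Hypothetical record layout for one participant."""
    pseudonym: str
    pre_survey_done: Optional[datetime]
    post_survey_done: Optional[datetime]
    rating_times: List[datetime] = field(default_factory=list)


def looks_automated(rating_times, max_per_second):
    """Flag rating streams with implausibly many presses within one second."""
    per_second = Counter(t.replace(microsecond=0) for t in rating_times)
    return max(per_second.values(), default=0) > max_per_second


def select_cases(participants, debate_start, debate_end,
                 max_post_delay=timedelta(minutes=30), max_per_second=5):
    """Keep participants meeting all (hypothetical) quality criteria."""
    kept = []
    for p in participants:
        # Pre-survey must be completed before the debate starts.
        if p.pre_survey_done is None or p.pre_survey_done >= debate_start:
            continue
        # At least one real-time rating must have been given.
        if not p.rating_times:
            continue
        # Post-survey must be completed after the debate, within the window.
        if (p.post_survey_done is None
                or p.post_survey_done < debate_end
                or p.post_survey_done - debate_end > max_post_delay):
            continue
        # Exclude rating behaviour that suggests non-human input.
        if looks_automated(p.rating_times, max_per_second):
            continue
        kept.append(p)
    return kept
```

Each criterion is a separate early `continue`, so the order of checks does not affect which cases survive, and individual rules can be tightened or dropped without touching the others.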

Study 1: “Die 10 wichtigsten Fragen der Deutschen” (ProSiebenSat.1)

Stimulus: The foursome debate between Alice Weidel (AfD), Christian Lindner (FDP), Katja Kipping (Die Linke) and Katrin Göring-Eckardt (Greens) was the kick-off debate in the run-up to the 2017 federal election and took place late at night (10.30 p.m.) on 30th August 2017. It lasted 118 minutes and was broadcast by the private channel Sat.1. It was followed by 290,000 viewers on their TV screens.

Sample: About 19,000 people logged into the Debat-O-Meter, and more than 10,000 users passed the tutorial, completed the pre-survey and used the RTR module. After two hours of discussion, past midnight, 2014 users completed the post-survey, which yielded 1191 observations after applying our data quality criteria.

Study 2: “Schlussrunde der Spitzenkandidaten” (ARD/ZDF)

Stimulus: The last televised debate of the run-up took place on 21st September 2017, three days before election day. Starting at 10 p.m., it was broadcast by the two nationwide public channels ARD and ZDF. A total of 4.37 million viewers watched the debate, which lasted 92 minutes and in which all major parties were represented by leading politicians: Ursula von der Leyen (CDU), Manuela Schwesig (SPD), Joachim Herrmann (CSU), Katrin Göring-Eckardt (Grüne), Sahra Wagenknecht (Die Linke), Christian Lindner (FDP) and Alexander Gauland (AfD) discussed the most important problems facing the country.

Sample: During the debate, more than 5000 people logged into the app. Of these, 3623 subjects entered the RTR-module by passing the tutorial, completing the pre-survey and giving at least one real-time evaluation, whereas about 1982 users ran through the whole process, completing the post-survey and receiving individual results in the VAA-module at the end of the debate. Applying our aforementioned criteria leaves us with 1058 cases.

Device: In both studies, the Debat-O-Meter was implemented as a push-button version in reset mode with graded input options, enabling viewers to evaluate the televised debate on their own mobile devices in their private homes. Participants could express their spontaneous reactions to the debate on a five-point scale ranging from double plus for a very good to double minus for a very bad evaluation. If no button is pressed, this is interpreted as a neutral position. For statistical analysis, the data are recoded to a scale ranging from −2 to +2.
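The recoding step can be illustrated with a short sketch. It assumes, purely for illustration, that the app logs each press as a (seconds, label) pair with the labels `"++"`, `"+"`, `"-"` and `"--"`; this log format and these labels are hypothetical. Because the device runs in reset mode, a press is a momentary event, so every second without a press is filled with the neutral value 0.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical button labels mapped onto the analysis scale from -2 to +2.
# "No press" has no label: in reset mode it simply counts as neutral (0).
BUTTON_VALUES: Dict[str, int] = {"++": 2, "+": 1, "-": -1, "--": -2}


def recode_presses(presses: Iterable[Tuple[float, str]]) -> List[Tuple[float, int]]:
    """Translate logged (timestamp, label) presses into (timestamp, value) pairs."""
    return [(t, BUTTON_VALUES[label]) for t, label in presses]


def per_second_series(presses: Iterable[Tuple[float, str]], duration_s: int) -> List[int]:
    """Second-by-second rating series in which seconds without a press are neutral (0).

    If several presses fall into the same second, the chronologically last one is kept
    (an assumption made only for this sketch).
    """
    series = [0] * duration_s
    for t, value in sorted(recode_presses(presses)):
        second = int(t)
        if 0 <= second < duration_s:
            series[second] = value
    return series
```

For example, `per_second_series([(0.4, "++"), (3.2, "--")], 5)` returns `[2, 0, 0, -2, 0]`.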

Recruitment strategy: Due to limited resources, we opted to follow the common approach in empirical debate research and applied a non-probability sampling method to generate our samples. Recent studies have repeatedly demonstrated the applicability of this approach (e.g. Bachl et al. 2013; Boydstun et al. 2014b; Maier et al. 2016a; Wagschal et al. 2017). At the same time, we are aware of the limitations of this convenience sampling strategy. To be clear, our sample is not representative of the general population in Germany. With regard to the samples’ demographics (see Table 1), we observe a participant structure well known from online research: younger, male participants with a high level of education and a substantial interest in politics are over-represented. This pattern differs slightly between the two debates in terms of age: in line with the overall audience of Sat.1, the viewers of the foursome debate are younger. Almost no difference is discernible with respect to the viewers’ interest in politics and their formal education. However, it is remarkable that nearly all relevant demographic groups are contained in the samples, although no special efforts were made to reach particular groups of the overall population. Against this background, it is not the aim of our study to draw universal inferences about the general population of Germany. Rather, we are interested in estimating relationships between variables. Current research shows that non-probability samples can be used to obtain valid estimates, even when the sample differs substantially from the population. Accordingly, unbiased estimates of debate effects require diverse, but not necessarily representative, samples (Boydstun et al. 2014a). We have therefore refrained from any form of weighting, also to avoid the impression of presenting representative results on the basis of our samples. As such, we are confident that a representative sample is not necessary to answer our basic research questions as long as the major social groups appear in our data, although these limitations must be taken into account when interpreting our results below.

Table 1:

Samples’ Demographics and Characteristics (in %).

| | “10 Fragen” | “Schlussrunde” | National |
|---|---|---|---|
| Gender | | | |
| Male | 75.3 | 67.7 | 49.3 |
| Female | 24.7 | 32.3 | 50.7 |
| Age | | | |
| Under 18 | 2.6 | 0.6 | 16.2 |
| 18–20 | 10.1 | 5.7 | 3.2 |
| 21–29 | 29.9 | 14.0 | 11.1 |
| 30–39 | 20.3 | 13.6 | 12.3 |
| 40–49 | 16.0 | 13.4 | 14.0 |
| 50–59 | 13.5 | 20.6 | 15.8 |
| 60–69 | 6.0 | 22.3 | 11.6 |
| 70 and above | 1.6 | 9.8 | 15.8 |
| Education (highest level) | | | |
| No qualification/NA | 0.5 | 1.0 | 3.7 |
| Pupil | 1.6 | 1.0 | 3.6 |
| Secondary general | 2.9 | 5.7 | 32.9 |
| Intermediate secondary | 16.9 | 20.4 | 29.4 |
| University entrance diploma or University/Applied Studies degree | 78.0 | 72.0 | 29.5 |
| Interest in politics | | | |
| Very strong | 51.0 | 52.3 | 9.7 |
| Strong | 36.3 | 37.1 | 27.5 |
| Intermediate | 11.5 | 9.8 | 44.2 |
| Weak | 1.0 | 0.9 | 14.0 |
| Not at all | 0.2 | 0.0 | 4.6 |

6 Results

6.1 Internal Consistency of RTR Data

When assessing the reliability of data obtained by applying virtualized RTR to TV discussions with a multi-person panel, the methodological literature focuses on the internal consistency of the data. As mentioned in Section 4, a suitable approach has been applied by Papastefanou (2013: p. 16), who uses Cronbach’s alpha to indicate test-retest reliability (see also Bachl 2014: p. 108–110). To verify the reliability of mobile RTR and to apply Cronbach’s alpha to our data, we dissected the debates into segments of 30 seconds in order to build the required quasi-items. We then summed the positive and the negative ratings of every single participant separately for each debater within a given interval. These participant-by-item tables form the basis for our computation of Cronbach’s alpha. As political involvement affects perception processes and may thus influence debate reception, it may also affect internal consistency. To address this issue, we made use of the large number of participants in our studies, which enables us to investigate different subgroups. We therefore additionally split the datasets along potentially moderating variables such as “interest in politics,” “strength of voting intention” and “education.”
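The construction of quasi-items and the reliability coefficient can be sketched as follows. The sketch uses the standard formula α = k/(k − 1) · (1 − Σ s²ᵢ / s²ₜ), where k is the number of items, s²ᵢ the variance of item i and s²ₜ the variance of the participants’ total scores. It assumes that each participant’s inputs for one debater are available as (seconds, value) pairs on the −2 to +2 scale, sums positive and negative inputs separately per slice, and drops an incomplete final slice (which matches the 23 five-minute items reported for the 118-minute debate). The helper names are illustrative and not part of the Debat-O-Meter.

```python
from statistics import variance
from typing import List, Sequence, Tuple

Event = Tuple[float, int]  # (seconds since debate start, rating from -2 to +2)


def quasi_items(events: Sequence[Event], duration_s: int, slice_s: int = 30) -> Tuple[List[int], List[int]]:
    """Sum one participant's positive and negative inputs for one debater per time slice.

    Only complete slices are used: a 118-minute debate yields 236 items at 30 s
    and 23 items at 300 s.
    """
    n_items = duration_s // slice_s
    pos = [0] * n_items
    neg = [0] * n_items
    for t, value in events:
        k = int(t // slice_s)
        if k >= n_items:
            continue  # input falls into the incomplete final slice
        if value > 0:
            pos[k] += value
        elif value < 0:
            neg[k] += -value  # store the magnitude of negative inputs
    return pos, neg


def cronbach_alpha(matrix: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a participant-by-item matrix (rows = participants, columns = items)."""
    k = len(matrix[0])
    if k < 2 or len(matrix) < 2:
        raise ValueError("need at least two items and two participants")
    item_vars = [variance(column) for column in zip(*matrix)]
    total_var = variance([sum(row) for row in matrix])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Subgroup coefficients of the kind reported in Table 2 follow by passing only the rows of participants in a given subgroup, and the slice-size robustness check amounts to rerunning `quasi_items` with a larger `slice_s`.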

As mentioned above, the existing literature regards alpha scores above 0.9 as indicating high reliability and α-values above 0.8 as still acceptable (Bortz and Döring 2006: p. 708). The results displayed in Table 2 show high reliability scores: Cronbach’s alpha usually remains above the 0.9 threshold and never drops below 0.8. Comparing both studies, a clear difference in the level of α-scores can be seen. In our first study, α-values do not drop below the 0.9 level in any of the 64 cases, indicating a generally high reliability of our measurement. Looking more closely at the table, however, we can see several patterns in the data structure. While theoretical deliberations would suggest that people less involved in the political world rate with lower internal consistency, our findings in the first study do not support this assumption. Indeed, α-scores are slightly higher for categories indicating less political involvement. This is particularly true for the level of political interest and the certainty of voting intention. Lower motivation thus seems to go along with fewer inconsistencies in RTR ratings and more levelled evaluations. Remarkably, using a virtual RTR-instrument to evaluate a debate at home does not seem to depend crucially on cognitive abilities, as α-scores in both formal education categories remain above the 0.9 threshold. Notably, users with high formal education show a more consistent rating behavior than the group with lower formal education. These ambiguous results on the role of political involvement in debate reception are consistent with previous findings (e.g. Reinemann and Maurer 2010; Maier et al. 2016b). Irrespective of their theoretical interpretation, for the practical application of RTR this means that it is crucial to cover a wide range of demographic groups and to be highly attentive when interpreting results before making inferences about debate perceptions beyond the investigated (sub-)group. Regarding the relation of positive and negative inputs, viewers of the foursome debate clearly show stronger internal consistency in their positive evaluations than in their negative ones. Only for Göring-Eckardt (Grüne) is this finding reversed: seven of eight pairwise comparisons break the pattern, indicating that she seems to be consistently disliked by her adversaries throughout the debate. Looking more closely at the individual debaters, evaluations of Kipping (Die Linke) and Weidel (AfD) are more internally consistent than those of Lindner (FDP) and Göring-Eckardt (Grüne). This could stem from their positions in the political spectrum: Kipping (Die Linke) and Weidel (AfD) represent the outer range of the political left-right scheme, so that viewers’ attitudes towards them might be more static, whereas Lindner (FDP) and Göring-Eckardt (Grüne) are located closer to the center, and evaluations of such moderate positions might be more flexible.
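The pairwise comparison for Göring-Eckardt can be checked directly against Table 2. The snippet below copies her study-1 coefficients from the table and counts the rows in which the positive coefficient is lower than the negative one; it only restates published values and does not re-estimate anything.

```python
# Study 1, Göring-Eckardt (Grüne): (positive, negative) Cronbach's alpha per row of Table 2
GOERING_ECKARDT_STUDY1 = {
    "All": (0.918, 0.934),
    "IP↑": (0.928, 0.933),
    "IP↓": (0.985, 0.990),
    "VI↑": (0.925, 0.928),
    "VI↓": (0.942, 0.970),
    "VI↔": (0.950, 0.959),
    "FE↑": (0.923, 0.940),
    "FE↓": (0.954, 0.938),
}


def count_reversals(rows: dict) -> int:
    """Count rows in which negative ratings are more internally consistent than positive ones."""
    return sum(1 for pos, neg in rows.values() if pos < neg)


print(count_reversals(GOERING_ECKARDT_STUDY1), "of", len(GOERING_ECKARDT_STUDY1))  # 7 of 8
```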

Table 2:

Cronbach’s Alpha Overall and for Different Subgroups.

“Die 10 wichtigsten Fragen der Deutschen”

| | Kipping (Die Linke) Pos | Kipping (Die Linke) Neg | Göring-Eckardt (Grüne) Pos | Göring-Eckardt (Grüne) Neg | Lindner (FDP) Pos | Lindner (FDP) Neg | Weidel (AfD) Pos | Weidel (AfD) Neg |
|---|---|---|---|---|---|---|---|---|
| All | 0.984 | 0.940 | 0.918 | 0.934 | 0.956 | 0.910 | 0.974 | 0.940 |
| IP↑ | 0.986 | 0.940 | 0.928 | 0.933 | 0.959 | 0.916 | 0.975 | 0.943 |
| IP↓ | 0.994 | 0.980 | 0.985 | 0.990 | 0.975 | 0.983 | 0.985 | 0.976 |
| VI↑ | 0.940 | 0.943 | 0.925 | 0.928 | 0.942 | 0.905 | 0.974 | 0.926 |
| VI↓ | 0.994 | 0.963 | 0.942 | 0.970 | 0.977 | 0.960 | 0.979 | 0.968 |
| VI↔ | 0.995 | 0.951 | 0.950 | 0.959 | 0.975 | 0.947 | 0.975 | 0.955 |
| FE↑ | 0.986 | 0.944 | 0.923 | 0.940 | 0.957 | 0.921 | 0.975 | 0.944 |
| FE↓ | 0.962 | 0.935 | 0.954 | 0.938 | 0.944 | 0.928 | 0.968 | 0.932 |

“Schlussrunde der Spitzenkandidaten”

| | von der Leyen (CDU) Pos | von der Leyen (CDU) Neg | Schwesig (SPD) Pos | Schwesig (SPD) Neg | Herrmann (CSU) Pos | Herrmann (CSU) Neg | Göring-Eckardt (Grüne) Pos | Göring-Eckardt (Grüne) Neg | Wagenknecht (Die Linke) Pos | Wagenknecht (Die Linke) Neg | Lindner (FDP) Pos | Lindner (FDP) Neg | Gauland (AfD) Pos | Gauland (AfD) Neg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All | 0.893 | 0.888 | 0.893 | 0.888 | 0.897 | 0.875 | 0.893 | 0.888 | 0.893 | 0.888 | 0.910 | 0.900 | 0.956 | 0.834 |
| IP↑ | 0.902 | 0.891 | 0.902 | 0.891 | 0.906 | 0.881 | 0.902 | 0.891 | 0.902 | 0.891 | 0.909 | 0.905 | 0.958 | 0.841 |
| IP↓ | 0.997 | 0.947 | 0.997 | 0.947 | 0.977 | 0.932 | 0.997 | 0.947 | 0.997 | 0.947 | 0.994 | 0.963 | 0.949 | 0.977 |
| VI↑ | 0.910 | 0.899 | 0.910 | 0.899 | 0.915 | 0.888 | 0.910 | 0.899 | 0.910 | 0.899 | 0.919 | 0.909 | 0.962 | 0.846 |
| VI↓ | 0.932 | 0.891 | 0.932 | 0.891 | 0.929 | 0.931 | 0.932 | 0.891 | 0.932 | 0.891 | 0.958 | 0.933 | 0.975 | 0.926 |
| VI↔ | 0.950 | 0.930 | 0.950 | 0.930 | 0.934 | 0.925 | 0.950 | 0.930 | 0.950 | 0.930 | 0.982 | 0.934 | 0.962 | 0.931 |
| FE↑ | 0.903 | 0.909 | 0.903 | 0.909 | 0.906 | 0.901 | 0.903 | 0.909 | 0.903 | 0.909 | 0.930 | 0.913 | 0.966 | 0.846 |
| FE↓ | 0.964 | 0.876 | 0.964 | 0.876 | 0.963 | 0.876 | 0.964 | 0.876 | 0.964 | 0.876 | 0.923 | 0.930 | 0.957 | 0.902 |

Note: Pos/Neg ≙ positive/negative RTR ratings. IP ≙ Interest in politics: IP↑ = “strong” or “very strong,” IP↓ = all other; VI ≙ Certainty of vote intention: VI↑ = “certain” or “very certain,” VI↔ = “undecided,” VI↓ = all other; FE ≙ Formal education: FE↑ = “University entrance diploma or University/Applied Studies degree,” FE↓ = all other.

Table 3:

Structural Equation Model – Study 2.

| DV | IV [Path] | Gauland (AfD) | Herrmann (CSU) | Lindner (FDP) | von der Leyen (CDU) | Schwesig (SPD) | Göring-Eckardt (Grüne) | Wagenknecht (Linke) |
|---|---|---|---|---|---|---|---|---|
| Pre: Performance | Party identification [A2] | 0.480*** | 0.252*** | 0.303*** | 0.100*** | −0.249*** | −0.276*** | −0.480*** |
| RTR: Performance | Party identification [A3] | 0.536*** | 0.416*** | 0.257*** | 0.247*** | −0.353*** | −0.466*** | −0.632*** |
| Post: Performance | Party identification [A5] | 0.120*** | 0.044 | 0.102*** | 0.001 | −0.205*** | −0.203*** | −0.022 |
| Post: Performance | RTR: Performance [D2] | 0.731*** | 0.638*** | 0.622*** | 0.677*** | 0.576*** | 0.624*** | 0.783*** |
| Pre: Performance | R² | 0.230 | 0.064 | 0.092 | 0.010 | 0.062 | 0.076 | 0.231 |
| RTR: Performance | R² | 0.287 | 0.173 | 0.066 | 0.061 | 0.125 | 0.217 | 0.400 |
| Post: Performance | R² | 0.644 | 0.433 | 0.430 | 0.459 | 0.457 | 0.549 | 0.635 |
| χ² | | 95.380 | 71.509 | 91.523 | 99.899 | 80.814 | 84.990 | 0.400 |
| p | | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.527 |
| RMSEA | | 0.302 | 0.261 | 0.295 | 0.309 | 0.277 | 0.284 | 0.000 |
| SRMR | | 0.126 | 0.117 | 0.141 | 0.154 | 0.130 | 0.130 | 0.008 |
| n | | 1038 | 1038 | 1038 | 1038 | 1038 | 1038 | 1038 |

Note: ***p<0.01; **p<0.05; *p<0.10.

Turning to the second study, α-scores are generally at a lower level. Almost one third of all cases show alpha values below 0.9, though they never come close to the 0.8 threshold. Most values are centered around 0.9, indicating strong internal consistency. Remarkably, the weaker α-values mainly concern negative evaluations, which is in line with our findings from study 1. Only five of the 33 cases below 0.9 relate to positive ratings, all of which appear in the overall ratings (row “All” in Table 2). Only in six out of 56 pairwise comparisons are α-values of positive evaluations lower than those of negative ratings. This might support findings of other studies showing that viewers tend to give more positive than negative evaluations of contestants (Bachl 2014: p. 280). Among the remaining cases below the 0.9 line, all of which relate to negative ratings, no clear pattern can be found, except that the group with lower formal education again shows low α-scores. This might corroborate our finding from the first study that the internal consistency of user ratings is affected by cognitive abilities but still remains at a high level. Looking at the α-values by debater, the coefficients are fairly level. Overall, the picture is quite uniform across both studies.

Critical considerations about Cronbach’s alpha commonly concern the number of items. With a large number of items, even low inter-item correlations can push the coefficient to a high level, at the risk of overestimating measurement reliability (Bortz and Döring 2006: p. 199). To rule out this problem, we successively varied the size of the slices from 30 seconds up to 300 seconds and recalculated α-values for the resulting scales. The resulting Table 4 (see appendix) shows that most coefficients remain above the 0.8 threshold in both studies. As expected, α-scores decline as the slices become larger. In study 1, the lowest value is 0.844, for negative evaluations of Katja Kipping (Die Linke) in the five-minute slices. The lowest scores for the “Schlussrunde” are also found for the largest slices, but here three values fall clearly below the 0.8 threshold; the lowest is 0.712, for negative evaluations of Alexander Gauland (AfD). Considering the methodological literature, we can state that even the α-values of the 300-second slices, which result in 23 (foursome debate) and 18 (“Schlussrunde”) quasi-items, largely remain above the critical value of 0.8. The size of the slices therefore does not seem to be problematic. Summing up the results of this subsection, we can answer our first research question in the affirmative: virtualized RTR can be considered to generate reliable and internally consistent data for TV discussions with a multi-person panel, even when controlling for participants’ characteristics and their level of political involvement.
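Why the number of items matters can be seen from the standardized form of Cronbach’s alpha, which depends only on the number of items k and the mean inter-item correlation r̄. The values below use an arbitrary r̄ = 0.1 purely for illustration; they are not estimates from the two studies. With 236 thirty-second items (a 118-minute debate), a weak average correlation already yields a high coefficient, whereas 23 five-minute items yield a much lower one:

```latex
\alpha_{\text{std}} = \frac{k\,\bar r}{1 + (k-1)\,\bar r},
\qquad
k = 236:\ \frac{236 \cdot 0.1}{1 + 235 \cdot 0.1} = \frac{23.6}{24.5} \approx 0.963,
\qquad
k = 23:\ \frac{23 \cdot 0.1}{1 + 22 \cdot 0.1} = \frac{2.3}{3.2} \approx 0.719.
```

The raw-score coefficient follows the same logic when item variances are similar, which is why recomputing alpha for coarser slices is a sensible robustness check.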

6.2 Structural Equation Modelling: Internal Validity, Perception Processes and the Effects of Debate Reception

To assess the internal validity of our data, perception processes and the effects of debate reception on political preferences, we follow a well-known and firmly established procedure laid out in previous work on lab-based studies of TV duels (Maier 2007a; Bachl 2013; for an exception, see Maier et al. 2016a). These studies investigate to what extent an RTR signal is correlated with other variables involved in the process of debate reception and candidate evaluation. Methodological contributions following this approach rely on structural equation modelling. These models usually incorporate political preferences and predispositions before and after the debate and trace their relation to the RTR signal (see Figure 2). Section 4 already outlined our theoretical expectations; in the following subsection, we test these hypotheses empirically using data from the two studies under investigation. Candidate evaluation is captured as an overall assessment of the candidate ranging from −2 (“very low”) to +2 (“very high”). Accordingly, live debate performance is measured as the mean evaluation during the debate (ranging from −2 for “very bad” to +2 for “very good”). The same applies to the expected and the perceived debate performance. To comply with the methodological requirements of structural equation modelling, PID was operationalized as a metric variable ranging from −3 to +3, reproducing the political left-right spectrum. [5] One may argue that this is inappropriate given the categorical nature of PID. [6] We therefore additionally estimated all our models with a dummy variable reflecting the categorical assumption. As can be seen in the appendix (Tables 5 and 6), the alternative model specifications worsen the model fit, but the results are almost identical to those of the primary models, so that the categorical specification confirms our estimations. Against this backdrop, we start our analysis with the foursome debate, followed by our results for the “Schlussrunde.”

We estimated all models with robust standard errors, separately for each candidate. In Figure 3, we merged the results of the four models calculated for “Die 10 wichtigsten Fragen der Deutschen.” It should be noted that none of the four models passes the likelihood-ratio test. However, as this test has been shown to regularly reject models estimated on a large number of cases, this seems to be of lesser relevance (see Weiber and Mühlhaus 2014: p. 204). To take the model structure into account, we instead rely on the “root mean square error of approximation” (RMSEA) and the “standardized root mean square residual” (SRMR). These statistics indicate a good model fit. The four models presented in Figure 3 display both standardized coefficients (arrows) and the associated R² values (boxes).
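How RMSEA relates to the χ² statistic can be illustrated with the study-2 values in Table 3. For a given degree of misfit, χ² grows roughly in proportion to the sample size, so the likelihood-ratio test rejects almost any model in large samples, whereas RMSEA divides this growth out. The sketch uses RMSEA = √(max(χ² − df, 0) / (df · N)). Two choices are our own inferences rather than statements from the text: the models are assumed to have df = 1 (consistent with the reported p = 0.527 for χ² = 0.400), and the denominator uses N, whereas some programs use N − 1. Under these assumptions, the sketch reproduces the RMSEA and p values reported in Table 3.

```python
import math

# Study-2 chi-square statistics copied from Table 3 (n = 1038 for every model).
TABLE3_CHI2 = {
    "Gauland": 95.380, "Herrmann": 71.509, "Lindner": 91.523, "von der Leyen": 99.899,
    "Schwesig": 80.814, "Göring-Eckardt": 84.990, "Wagenknecht": 0.400,
}
N = 1038
DF = 1  # assumption: inferred from the reported p-value, not stated in the table


def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation, with N (not N - 1) in the denominator."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * n))


def chi2_p_value_df1(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


for name, chi2 in TABLE3_CHI2.items():
    print(f"{name:15s} RMSEA={rmsea(chi2, DF, N):.3f}  p={chi2_p_value_df1(chi2):.3f}")
```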

Figure 3: Structural equation model – study 1.

As mentioned in Section 4, we expect the influence of PID on all subsequent variables to be far-reaching. We hypothesized that PID reinforces the candidate evaluation given prior to the debate (Path A1). All four models indicate that PID has a substantial influence on how the candidates are evaluated before the debate. Noticeably, this influence is less pronounced for Lindner, but still highly significant. While the relation between partisanship and candidate evaluation is strong and significant before the debate, it is less clear for candidate evaluations after the debate (Path A4): for Kipping and Lindner, PID remains substantial, but it loses its relevance for the post-debate evaluation of Göring-Eckardt and Weidel. A similarly ambiguous influence of PID before and after the debate can be observed for the expected and the perceived debate performance, respectively. While PID influences the expected debate performance (albeit to a lesser extent), expectations are mainly shaped by one’s candidate evaluation. PID’s beta coefficients are significant for Kipping, Göring-Eckardt and Lindner but remain at a low level (Path A2), whereas for Weidel, of the newly founded AfD, PID has no substantial effect or, at least, does not yet exist. Conversely, the coefficients of candidate orientation are substantial and significant for all candidates, indicating that political preferences determine expectations of the candidates’ performance. When we examine the relationship between PID and the perceived debate performance after the debate, the picture remains ambiguous (Path A5). Although the coefficients for Kipping and Weidel are significant, all betas are at a low level, and the debate performances of Göring-Eckardt and Lindner do not seem to be affected by PID at all. This twofold pattern of PID’s effect on candidate evaluation and debate performance before and after the debate could be a first indication of a substantial effect of the debate itself, which we have captured with the RTR signal.

In doing so, we address core issues of the methodological discussion about the internal validity of Real-Time Response Measurement. A major concept in this discussion is construct validity, defined in the context of debate research as “pronounced associations between the measured RTR scores and PID” (Maier et al. 2007: p. 66). The results displayed in Figure 3 confirm this expectation (Path A3): although the betas are at a low level, all coefficients are highly significant, confirming pronounced associations between the mobile RTR signal and PID and thus supporting the internal validity of RTR instruments. A second crucial concept in the methodological literature is criterion validity. In the context of RTR-based research, we expect high correlations between RTR scores and the perceived debate performance after the debate (Path D2). This expectation is based on the consideration that the ratings measured during the debate combine to form a judgement about how well each participant performed in the debate. All four models displayed in Figure 3 show strong and significant coefficients. Noticeably, the relative sizes of the coefficients are comparable to those of similar models in the literature for lab-based studies on TV duels (see Maier 2007a: p. 103; Bachl 2013: p. 184; Maier et al. 2016a: p. 550). This is strong evidence that criterion validity is given for virtualized RTR applied to a TV discussion with a multi-person panel. Thus, our first study verifies two major concepts of internal validity in the context of mobile RTR.

In order to assess perception processes (RQ3), we examine in detail the real-time evaluations and the post-debate verdicts on the candidates’ perceived debate performance. Besides the association with PID verified above (Path A3), we assumed in Section 4 that the live evaluation of debate performance (RTR) is affected by one’s candidate preferences (Path B2) and by expectations about the debate performance (Path C1). As can be seen in Figure 3, these assumptions are confirmed: considering that the expected debate performance is mainly shaped by candidate preference (Path B1), the direct effect of the latter variable is more substantial, while all coefficients are highly significant. As such, political preferences like candidate orientation (before the debate) have a major impact on the real-time perception of the debate. Regarding the perceived debate performance after the debate, we presumed perception processes to be determined by PID (Path A5), the instantaneous perception of the debate (Path D2), one’s candidate preferences (Path B3) and expectations (Path C2). Beyond the impact of real-time evaluation, which we already assessed in our report on criterion validity, Figure 3 shows that RTR ratings have the most substantial association with post-debate verdicts on the perceived debate performance, even after controlling for PID and the subsequent variables. While the coefficients for both candidate preference and expectations are significant but less substantial, the picture for PID’s influence remains ambiguous. This partly contradicts findings on TV duels (e.g. Maier et al. 2016a) and could indicate that the higher complexity of multi-person podiums moderates the effect pattern of PID, as the simple friend-foe logic may no longer apply. According to our data, this seems to be even more important if candidates are located closer to the center of the political spectrum. While PID influences the reflective perception of the debate performance for the two candidates from the political periphery, Kipping (Linke) and Weidel (AfD), this does not hold for Lindner (FDP) and Göring-Eckardt (Grüne), who are located closer to the political center.

To complete our investigation, we turn to post-debate candidate preferences, which relate to our fourth research question (RQ4): are political preferences affected by debate reception in multi-person scenarios? In Section 4, we hypothesized that, alongside PID (Path A4), the assessment of a candidate prior to the debate (Path B4), RTR evaluations (Path D1) and post-debate verdicts on the candidates’ performances (Path E1) determine how the candidate is evaluated after the debate. Our models largely validate all four assumptions: while the associations of paths B4 and E1 are significant and substantial, path D1 is less pronounced but remains at a medium level, whereas the betas of path A4 are weak and only significant for Lindner (FDP) and Kipping (Linke). In other words, for TV debates with a multi-person podium, candidate preferences after the debate are mainly shaped by the perception of the debate, whether instantaneous or reflective, while being moderated by prior candidate evaluation, whereas PID seems to be of lesser relevance for post-debate candidate orientation.

Turning to our second study, we face problems reproducing the encompassing model in Figure 2 because of missing data. For the “Schlussrunde,” we could not survey all variables due to restricted resources. Moreover, we could not include all available variables for statistical reasons. [7] Against this backdrop, we follow a minimised model laid out by Maier (2007a: p. 103) to verify the core issues: construct and criterion validity. This reduced model focuses on PID, expected debate performance, RTR evaluations and perceived debate performance and outlines their statistical associations. It therefore condenses the investigation of paths A2, A3, A5 and D2 of the encompassing model in Figure 2. As it incorporates all variables relevant for assessing internal validity, it is a promising approach to clarify current insights and may complement our findings from Study 1 with regard to RQ2. In terms of procedure and operationalization, we follow our first study: candidate evaluations and debate performances are variables ranging from −2 to +2, and PID is operationalized as a metric variable representing the political left-right spectrum and ranging from −3 to +3. [8] Again, we estimated all models with robust standard errors, separately for each candidate, and we must bear in mind that none of the seven models passes the likelihood-ratio test. We therefore once more rely on the “root mean square error of approximation” (RMSEA) and the “standardized root mean square residual” (SRMR). In contrast to the models of the foursome debate, these statistics call a good model fit into question (see Table 3), except for Wagenknecht’s model, which complies with the given thresholds (Weiber and Mühlhaus 2014: p. 205–210). Nevertheless, as will be shown in the following, the resulting structure of all models can be interpreted properly. Due to the large podium of the “Schlussrunde,” we condensed the results of the seven models in Table 3, which depicts both standardized coefficients and the associated R² values.
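The reduced model can be sketched as a recursive path model estimated equation by equation. Judging from Table 3, where each single-predictor R² equals the squared coefficient (e.g. 0.480² ≈ 0.230 for Gauland’s Path A2), expected performance and RTR performance are each regressed on PID alone, and post-debate performance on PID and RTR. Assuming uncorrelated disturbances, equation-wise least squares on standardized variables is a common approximation of the maximum-likelihood SEM point estimates; the sketch computes neither robust standard errors nor fit statistics, and the simulated data and “true” coefficients are purely hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence


def standardize(x: Sequence[float]) -> List[float]:
    """z-scores using the sample standard deviation."""
    m = sum(x) / len(x)
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))
    return [(v - m) / sd for v in x]


def corr(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation."""
    return sum(a * b for a, b in zip(standardize(x), standardize(y))) / (len(x) - 1)


def reduced_model(pid: Sequence[float], pre: Sequence[float],
                  rtr: Sequence[float], post: Sequence[float]) -> Dict[str, float]:
    """Standardized paths A2, A3, A5, D2 and R² values of the reduced model."""
    a2 = corr(pid, pre)  # Pre ~ PID: with a single predictor the standardized beta equals r
    a3 = corr(pid, rtr)  # RTR ~ PID
    r_pid_post, r_rtr_post = corr(pid, post), corr(rtr, post)
    # Post ~ PID + RTR: two-predictor standardized OLS solved from the correlations
    a5 = (r_pid_post - r_rtr_post * a3) / (1 - a3 ** 2)
    d2 = (r_rtr_post - r_pid_post * a3) / (1 - a3 ** 2)
    return {"A2": a2, "A3": a3, "A5": a5, "D2": d2,
            "R2_pre": a2 ** 2, "R2_rtr": a3 ** 2,
            "R2_post": a5 * r_pid_post + d2 * r_rtr_post}


def simulate(n: int, seed: int = 1) -> Dict[str, List[float]]:
    """Hypothetical unit-variance data with known standardized paths (illustrative values only)."""
    rng = random.Random(seed)
    a2, a3, a5, d2 = 0.5, 0.5, 0.1, 0.7
    e_post = math.sqrt(1 - (a5 ** 2 + d2 ** 2 + 2 * a5 * d2 * a3))  # keeps Var(post) = 1
    data: Dict[str, List[float]] = {"pid": [], "pre": [], "rtr": [], "post": []}
    for _ in range(n):
        pid = rng.gauss(0, 1)
        rtr = a3 * pid + math.sqrt(1 - a3 ** 2) * rng.gauss(0, 1)
        data["pid"].append(pid)
        data["pre"].append(a2 * pid + math.sqrt(1 - a2 ** 2) * rng.gauss(0, 1))
        data["rtr"].append(rtr)
        data["post"].append(a5 * pid + d2 * rtr + e_post * rng.gauss(0, 1))
    return data


estimates = reduced_model(**simulate(20000))
```

With 20,000 simulated cases, the estimates should lie close to the illustrative values 0.5, 0.5, 0.1 and 0.7.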

Again, the influence of PID on subsequent variables is in line with our expectations, as its effects are far-reaching but differ before and after the debate: while PID has only a slight influence on post-debate assessments of debate performance (Path A5), it substantially shapes viewers’ expectations before the debate regarding how their “own” candidate will perform (Path A2). Coefficients are significant for all discussants, while betas are more pronounced for Gauland (AfD) and Wagenknecht (Die Linke), who represent the outer range of the political spectrum, where political alliances and partisanship might be more potent, particularly at the end of the election campaign. We find the same pattern for the associations between PID and RTR scores (Path A3). Here we return to the concept of construct validity, and our data show a strong indication of internal validity: the coefficients are even more pronounced than in our first study, and all betas are highly significant. Again, the relative sizes of the coefficients are comparable to those of equivalent models in the literature for lab-based studies on TV duels (see Maier 2007a: p. 103; Bachl 2013: p. 184; Maier et al. 2016a: p. 550). Closing our reflections on the effects of PID, we can further state that the results are consistent, as all beta signs point in the expected direction. Further evidence can be derived from the correlation between RTR scores and the viewers’ ratings of perceived debate performance in the post-survey (Path D2), which the methodological literature condenses into the concept of criterion validity. A closer look at our results for study 2 (see Table 3) shows that these associations are significant and strong and remain in place even after controlling for PID. Again, the coefficients are even more pronounced than in study 1, and the relative sizes of the betas are comparable to those of laboratory studies on TV duels. This is a strong indication of criterion validity.

Taken together, our results from studies 1 and 2 show that both construct and criterion validity are given for data gathered in the natural surroundings of participants watching TV debates with a multi-person panel. Our results are in line with findings from laboratory studies on TV duels. Although control over internal validity is diminished outside the lab, this does not seem to impair data quality. Additionally, evaluating more than two discussants does not seem to overwhelm the study participants. Consequently, we can answer our second research question (RQ2) in the affirmative: our real-time field data are internally valid even when virtualized RTR is applied to TV discussions with a multi-person panel.

Regarding our third research question (RQ3), we have shown that perception processes, whether instantaneous or conclusive, are primarily shaped by PID and prior political preferences such as candidate orientation. Furthermore, our results give strong evidence that post-debate candidate preferences are substantially affected by debate reception. As such, candidate preferences are mainly shaped by the perception of the debate, whether immediate or reflective, while being moderated by prior candidate evaluations. Even when evaluating more than two contestants in a TV discussion, study participants seem able to sharpen their political verdicts and preferences. Consequently, we can answer our fourth research question (RQ4) in the affirmative: the reception of a political TV debate with a multi-person panel indeed affects political preferences.

To complete the picture, the relations displayed in our models replicate a structure established in lab-based RTR research on TV duels, indicating that our results can be linked to current findings in empirical debate research while making their own contribution to the research field.

7 Conclusion

At the outset, we argued that the recent virtualization of RTR makes it possible to apply it to televised debates with a multi-person panel, thereby transcending the focus of debate research on TV duels. To underpin our argument, we developed four research questions examining whether findings from these new scenarios are equivalent to those reported in studies on TV duels. While the first two questions take a methodological perspective by querying the reliability and internal validity of data gathered via virtualized RTR, the latter two address information processing and the effects of the reception of televised debates with a multi-person panel. To answer these questions, we opened with a brief review of the literature on RTR, presented the Debat-O-Meter as an innovative example of virtualized RTR and detailed our theoretical framework before describing our study procedures and presenting in detail the results of two studies conducted in the course of the 2017 federal election in Germany.

Our brief literature review revealed a seemingly inappropriate focus of RTR-based research on TV duels and a lack of studies investigating TV discussions with a multi-person panel, owing to restrictions of the physical measuring instruments currently in use. To overcome these limitations of common RTR devices and to provide access to TV discussions with a multi-person podium, the Debat-O-Meter, a virtualized RTR platform that enables researchers to measure an audience’s spontaneous reactions to political discussions in natural surroundings, may offer appropriate features and functions.

Regarding reliability (RQ1), we showed that the data are internally consistent. Using Cronbach’s alpha, we found that the vast majority of coefficients meet the 0.8 threshold commonly required in the literature. This holds even when examining subgroups defined by participants’ characteristics, or when dividing the debate into slices of increasing length to account for a methodological weakness of the indicator. To assess the internal validity of the RTR signal (RQ2), we drew on an established model (Maier 2007a) and tested construct and criterion validity using structural equation modelling. We presented strong indications that both construct and criterion validity hold for our data, which capture real-time reactions to TV discussions with a multi-person panel via virtualized RTR.
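The slice-based reliability check described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the Debat-O-Meter's code. It assumes that one candidate's positive (or negative) ratings arrive as a participants × seconds matrix of click counts. Each slice of `slice_len` seconds is treated as one item, and an incomplete trailing slice is dropped. The names `cronbach_alpha` and `slice_ratings` are hypothetical, and the data in the usage block are simulated.

```python
import numpy as np


def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variances.sum() / total_variance)


def slice_ratings(per_second: np.ndarray, slice_len: int) -> np.ndarray:
    """Sum per-second RTR clicks into consecutive slices of slice_len seconds.

    Each slice becomes one 'item'; a trailing incomplete slice is dropped
    (an assumption of this sketch).
    """
    if slice_len <= 0:
        raise ValueError("slice_len must be positive")
    n, seconds = per_second.shape
    k = seconds // slice_len
    if k == 0:
        raise ValueError("slice_len is longer than the recording")
    return per_second[:, : k * slice_len].reshape(n, k, slice_len).sum(axis=2)


if __name__ == "__main__":
    # Hypothetical data: 500 viewers, a 90-minute debate, Poisson click counts
    rng = np.random.default_rng(42)
    n_viewers, n_seconds = 500, 90 * 60
    propensity = rng.gamma(2.0, 0.5, size=(n_viewers, 1))  # viewer-level click tendency
    intensity = rng.gamma(0.3, 0.1, size=(1, n_seconds))   # how salient each second is
    clicks = rng.poisson(propensity * intensity)
    for seconds_per_slice in (30, 60, 120, 300):
        alpha = cronbach_alpha(slice_ratings(clicks, seconds_per_slice))
        print(f"{seconds_per_slice:>3} s slices: alpha = {alpha:.3f}")
```

Shorter slices yield more items. All else being equal, alpha rises with the number of items, which is one reason to report it across several slice sizes, as in Table 4.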

With regard to perception processes (RQ3) and the effects of receiving televised debates with a multi-person podium (RQ4), we extended the model to include variables of political predispositions and preferences (Bachl 2013). Within this framework, we showed that perception processes are substantially shaped by PID and prior candidate orientation, while the debate itself has a significant impact on post-debate verdicts of the candidates’ performances. In addition, our models gave strong indications that political preferences such as candidate orientation are significantly affected by debate reception, as is known from research on TV duels. In sum, the relations depicted in our models confirm a structure established in lab-based RTR research on TV duels.
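To make the model structure concrete, the recursive system behind Table 5 can be written down and estimated equation by equation. This is a minimal sketch under stated assumptions, not the author's estimation code. The `PATHS` dictionary is our reading of the row labels of Table 5 (path letters in comments), PID is a 0/1 dummy, and each equation is fitted by ordinary least squares. For a recursive model with uncorrelated disturbances, this is a standard way to obtain path coefficients. It is nevertheless a simplification of the SEM estimation reported in the paper and produces no global fit statistics. The values in `TRUE_COEFS` and the simulated data are purely hypothetical.

```python
import numpy as np

# Recursive path structure (dependent variable -> predictors), read off Table 5.
# Dict order is causal order, so simulate() can fill variables top to bottom.
PATHS = {
    "pre_eval": ["pid"],                                   # A1
    "pre_perf": ["pid", "pre_eval"],                       # A2, B1
    "rtr": ["pid", "pre_eval", "pre_perf"],                # A3, B2, C1
    "post_perf": ["pid", "pre_eval", "pre_perf", "rtr"],   # A5, B3, C2, D2
    "post_eval": ["pid", "pre_eval", "rtr", "post_perf"],  # A4, B4, D1, E1
}

# Hypothetical "true" coefficients, used only to generate synthetic data
TRUE_COEFS = {
    "pre_eval": {"pid": 1.0},
    "pre_perf": {"pid": 0.2, "pre_eval": 0.7},
    "rtr": {"pid": 0.1, "pre_eval": 0.6, "pre_perf": 0.2},
    "post_perf": {"pid": 0.0, "pre_eval": 0.1, "pre_perf": 0.15, "rtr": 0.6},
    "post_eval": {"pid": 0.05, "pre_eval": 0.35, "rtr": 0.2, "post_perf": 0.4},
}


def simulate(n, coefs=TRUE_COEFS, seed=0):
    """Draw synthetic data from the recursive model (PID as a 0/1 dummy)."""
    rng = np.random.default_rng(seed)
    data = {"pid": rng.integers(0, 2, n).astype(float)}
    for dv, ivs in PATHS.items():
        data[dv] = sum(coefs[dv][iv] * data[iv] for iv in ivs) + rng.normal(0.0, 1.0, n)
    return data


def fit_paths(data, paths=PATHS, standardize=True):
    """Estimate every structural equation separately by OLS.

    With standardize=True all variables are z-scored first, so the
    returned coefficients are standardized path coefficients.
    """
    if standardize:
        data = {k: (v - v.mean()) / v.std() for k, v in data.items()}
    estimates = {}
    for dv, ivs in paths.items():
        X = np.column_stack([np.ones(len(data[dv]))] + [data[iv] for iv in ivs])
        beta, *_ = np.linalg.lstsq(X, data[dv], rcond=None)
        estimates[dv] = dict(zip(ivs, beta[1:]))  # drop the intercept
    return estimates


if __name__ == "__main__":
    for dv, eq in fit_paths(simulate(5000)).items():
        print(dv, {iv: round(float(b), 3) for iv, b in eq.items()})
```

Pre: Performance → Post: Evaluation is the only link of a fully recursive six-variable system that is absent from `PATHS`.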

As such, our research on TV discussions with a multi-person podium using virtualized RTR can be linked to previous findings from debate research, which has so far focused mainly on laboratory studies and duel scenarios. At the same time, our complementary approach still faces some limitations. First, it requires a security architecture able to fend off external threats such as bots and DDoS attacks. IT solutions for these general problems already exist and can be implemented in virtual research environments like the Debat-O-Meter. Synchronizing user input with the media stimulus, by contrast, is a challenge specific to virtualized RTR measurement. Outside the laboratory, researchers no longer know which channel (satellite, cable, stream, etc.) participants use to watch the discussion on television. Yet the signals on these channels can be delayed by different amounts depending on the broadcasting path (the so-called “playout delay”). This complicates both the synchronization of the individual measurement series and their linkage with the media stimulus, with major consequences for data analysis. Here, too, concepts such as (1) watermarking, (2) fingerprinting, (3) user feedback and (4) statistical methods like expectation maximization might provide solutions. These approaches have different requirements. With (1) watermarking, an artificial audio or video signal is inserted into the television signal, which the client device can use to identify an exact moment in the broadcast. This approach depends heavily on external conditions, since the TV station must cooperate to insert the artificial signal into the broadcast. (2) Fingerprinting attempts to extract signatures from the original television signal (without a watermark) that are characteristic of a particular moment in time. However, both approaches entail a major intrusion into users’ privacy, since the signal must be recorded. With (3) user feedback, participants are asked to perform an action on the client device at certain times. For example, pressing a “moderator button” could produce a distinct profile from which the playout delay can be extracted. However, this depends on the users’ sustained cooperation throughout the debate. Unlike conventional second-screen offerings, the Debat-O-Meter does not need to detect the playout delay in real time, as no content is displayed to the user in parallel with the television broadcast. Instead, synchronization can be performed a posteriori. This opens up the possibility of synchronizing users’ ratings retrospectively using (4) statistical methods such as expectation maximization. Although these approaches might be promising, they have not yet been systematically tested within the framework of RTR.

A second, more fundamental issue concerns the sampling unit. When the members of a group watch the TV debate together, their mutual influence casts doubt on the individual as a constant unit of investigation and potentially introduces patterns of dependence among units. Grouping responses by IP address and providing specific measurement instructions to enhance standardization might help to mitigate this problem in the future. For instance, participants could be asked not to talk to each other during the debate, or even to watch alone, in order to increase standardization and strengthen the validity of the measurement procedure. Finally, we must acknowledge the limitations of our open sampling technique when drawing inferences. Identifying adequate methods for weighting RTR data is therefore a promising avenue for future research.
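The a posteriori synchronization with statistical methods discussed above can be illustrated with a minimal sketch. This is not a tested or published procedure. It swaps in a simple hard-assignment scheme in the spirit of expectation maximization: each step assigns every participant the single best lag instead of averaging over lags. Participant 0's series serves as the initial reference. Each participant's relative delay is then taken as the lag that maximizes the correlation with the current consensus, and the consensus is rebuilt from the realigned series until the lags stop changing.

The sketch rests on three assumptions:
- ratings form a per-second numeric signal (participants × seconds);
- delays are whole seconds within ±`max_shift`;
- all participants react to the same moments.

Delays are recovered only relative to one another. `align_series` and `simulate` are hypothetical names, and the data are synthetic.

```python
import numpy as np


def _zscore(a: np.ndarray, axis: int = -1) -> np.ndarray:
    """Standardize along an axis (population SD); constant rows become zeros."""
    a = a - a.mean(axis=axis, keepdims=True)
    sd = a.std(axis=axis, keepdims=True)
    return a / np.where(sd == 0, 1.0, sd)


def align_series(Y: np.ndarray, max_shift: int, n_iter: int = 20) -> np.ndarray:
    """Estimate each participant's relative delay (rows of Y, in seconds).

    Alternates between (1) picking, per participant, the lag that maximizes
    the correlation with the current consensus and (2) rebuilding the
    consensus from the realigned series. Delays are identified only up to
    a common offset.
    """
    n, T = Y.shape
    W = T - 2 * max_shift  # length of the comparison window
    if W <= 0:
        raise ValueError("series too short for the requested max_shift")
    lags = np.arange(-max_shift, max_shift + 1)
    consensus = _zscore(Y[0, max_shift : max_shift + W])  # start from participant 0
    shifts = None
    for _ in range(n_iter):
        # Pearson correlation of every participant with the consensus at each lag
        scores = np.stack([
            _zscore(Y[:, max_shift + s : max_shift + s + W], axis=1) @ consensus / W
            for s in lags
        ])
        new_shifts = lags[scores.argmax(axis=0)]
        if shifts is not None and np.array_equal(new_shifts, shifts):
            break  # assignments stable: converged
        shifts = new_shifts
        idx = max_shift + shifts[:, None] + np.arange(W)[None, :]
        consensus = _zscore(np.take_along_axis(Y, idx, axis=1).mean(axis=0))
    return shifts


def simulate(n=200, T=3600, max_delay=20, noise=0.5, seed=1):
    """Hypothetical data: one shared reaction signal, delayed per participant."""
    rng = np.random.default_rng(seed)
    impulses = (rng.random(T) < 0.05) * rng.uniform(0.5, 1.5, T)  # salient moments
    x = np.convolve(impulses, np.exp(-np.arange(5) / 1.5))[:T]    # short decay
    delays = rng.integers(0, max_delay + 1, n)
    Y = np.stack([np.concatenate([np.zeros(d), x[: T - d]]) for d in delays])
    return Y + rng.normal(0.0, noise, Y.shape), delays


if __name__ == "__main__":
    Y, true_delays = simulate()
    est = align_series(Y, max_shift=30)
    _, counts = np.unique(est - true_delays, return_counts=True)
    print("share of participants with the modal offset:", counts.max() / len(est))
```

On this synthetic data, the estimated lags should differ from the true delays by a single common offset for almost every participant. Real RTR clicks are sparser and noisier than this Gaussian stand-in, so such a procedure would need validation, for example against “moderator button” presses.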

This being said, our studies reveal some substantial merits. Because the measuring instrument is virtualized, it can be applied flexibly to different discussion formats while remaining very cost-efficient. This novel approach makes large-N field studies possible, since subjects can follow a televised debate in natural reception situations using their own mobile devices at home. By lowering the barriers to participation, the method may also improve the quality and diversity of the sample. At the same time, the RTR data gathered in our field studies meet the common quality standards of internal consistency and validity known from laboratory studies on TV duels. Our findings are in line with established research and can open up new perspectives. When investigating TV discussions with a multi-person panel, research is no longer confined to examining evaluation behaviour that follows a simple friend-foe logic such as incumbent vs. challenger, as is common for TV duels. Our approach allows the investigation of more complex patterns that better reflect multi-party systems, with their need to build coalitions and the potentially strategic behaviour of voters. Furthermore, there are good democratic reasons to take a closer look at TV discussions with more than two competitors: parliamentary systems thrive on the plurality of political parties, persons and positions. TV duels do not adequately reflect the pluralism of a multi-party system and structurally disadvantage small parties in political competition. This can lead to odd situations. In Baden-Württemberg in 2011, for example, the Greens provided the prime minister after the election, although the future prime minister had not been invited to the TV duel.

Examining TV discussions with a multi-person podium using virtualized RTR can help to overcome the reduction of political competition to two candidates, which is inappropriate for parliamentary systems with proportional representation and multiple parties. It can do so by making reception more attractive through a second-screen feature, by providing basic information for media coverage and by firmly establishing TV debates with a multi-person panel as an object of investigation in the research field. Our novel approach might thus also help to strengthen the democratic process, since it reflects the diversity of positions and argumentative strategies within political discourse.


This paper would not have been possible without the support and advice of all members of the Debat-O-Meter team, especially Thomas Metz, whose help I greatly appreciate. I thank the anonymous reviewers and the editor of Statistics, Politics and Policy for their helpful comments and suggestions on the manuscript.


Table 4:

Cronbach’s Alpha by Different Sizes of Slices.

Candidate, rating / size of slices (in seconds) 30 60 90 120 150 180 210 240 270 300
“Die 10 wichtigsten Fragen der Deutschen” (“The 10 Most Important Questions of the Germans”)
 Kipping (Die Linke), Pos 0.984 0.978 0.970 0.970 0.968 0.967 0.964 0.958 0.954 0.960
 Kipping (Die Linke), Neg 0.940 0.921 0.900 0.890 0.879 0.873 0.866 0.850 0.848 0.844
 Göring-Eckardt (Grüne), Pos 0.918 0.892 0.881 0.885 0.968 0.879 0.868 0.883 0.875 0.864
 Göring-Eckardt (Grüne), Neg 0.934 0.914 0.896 0.884 0.876 0.869 0.865 0.862 0.869 0.857
 Lindner (FDP), Pos 0.956 0.942 0.931 0.927 0.922 0.917 0.913 0.915 0.905 0.907
 Lindner (FDP), Neg 0.910 0.882 0.873 0.862 0.854 0.862 0.850 0.852 0.846 0.847
 Weidel (AfD), Pos 0.974 0.962 0.952 0.947 0.937 0.933 0.923 0.921 0.915 0.912
 Weidel (AfD), Neg 0.940 0.924 0.908 0.906 0.904 0.895 0.888 0.890 0.885 0.886
“Schlussrunde der Spitzenkandidaten” (“Final Round of the Lead Candidates”)
 Von der Leyen (CDU), Pos 0.893 0.853 0.829 0.814 0.826 0.812 0.790 0.806 0.799 0.804
 Von der Leyen (CDU), Neg 0.888 0.870 0.854 0.853 0.846 0.833 0.841 0.836 0.827 0.819
 Herrmann (CSU), Pos 0.897 0.871 0.833 0.822 0.808 0.821 0.800 0.790 0.811 0.797
 Herrmann (CSU), Neg 0.875 0.853 0.838 0.824 0.809 0.827 0.801 0.785 0.781 0.779
 Schwesig (SPD), Pos 0.893 0.853 0.829 0.814 0.826 0.812 0.790 0.806 0.799 0.804
 Schwesig (SPD), Neg 0.888 0.870 0.854 0.853 0.846 0.833 0.841 0.836 0.827 0.819
 Lindner (FDP), Pos 0.910 0.872 0.857 0.844 0.831 0.828 0.834 0.848 0.793 0.847
 Lindner (FDP), Neg 0.900 0.873 0.858 0.842 0.840 0.841 0.823 0.821 0.813 0.816
 Göring-Eckardt (Grüne), Pos 0.893 0.853 0.829 0.814 0.826 0.812 0.790 0.806 0.799 0.804
 Göring-Eckardt (Grüne), Neg 0.888 0.870 0.854 0.853 0.846 0.833 0.841 0.836 0.827 0.819
 Gauland (AfD), Pos 0.956 0.939 0.926 0.919 0.910 0.912 0.905 0.897 0.895 0.876
 Gauland (AfD), Neg 0.834 0.794 0.772 0.759 0.764 0.733 0.740 0.721 0.715 0.712
 Wagenknecht (Linke), Pos 0.893 0.853 0.829 0.814 0.826 0.812 0.790 0.806 0.799 0.804
 Wagenknecht (Linke), Neg 0.888 0.870 0.854 0.853 0.846 0.833 0.841 0.836 0.827 0.819

Table 5:

Structural Equation Model – Study 1 (with PID-Dummy).

DV ← IV [Path] Weidel (AfD) Lindner (FDP) Kipping (Linke) Göring-Eckardt (Grüne)
Pre: Evaluation ← Party identification [A1] 0.657*** 0.508*** 0.523*** 0.423***
Pre: Performance ← Party identification [A2] 0.118*** 0.033 0.042* −0.008
Pre: Performance ← Pre: Evaluation [B1] 0.736*** 0.638*** 0.694*** 0.750
RTR: Performance ← Party identification [A3] 0.008 0.120*** 0.092*** 0.051**
RTR: Performance ← Pre: Evaluation [B2] 0.716*** 0.588*** 0.603*** 0.551***
RTR: Performance ← Pre: Performance [C1] 0.139*** 0.095*** 0.148*** 0.209***
Post: Performance ← Party identification [A5] 0.025 −0.007 −0.004 −0.004
Post: Performance ← Pre: Evaluation [B3] 0.118*** 0.164*** 0.098*** 0.157***
Post: Performance ← Pre: Performance [C2] 0.141*** 0.189*** 0.164*** 0.111***
Post: Performance ← RTR: Performance [D2] 0.631*** 0.521*** 0.655*** 0.619***
Post: Evaluation ← Party identification [A4] 0.054*** 0.042*** 0.048*** 0.015
Post: Evaluation ← Pre: Evaluation [B4] 0.385*** 0.326*** 0.268*** 0.289***
Post: Evaluation ← RTR: Performance [D1] 0.190*** 0.285*** 0.220*** 0.192***
Post: Evaluation ← Post: Performance [E1] 0.384*** 0.324*** 0.443*** 0.482***
R² (Pre: Evaluation) 0.432 0.258 0.273 0.179
R² (Pre: Performance) 0.669 0.430 0.514 0.557
R² (RTR: Performance) 0.704 0.522 0.591 0.553
R² (Post: Performance) 0.741 0.594 0.714 0.672
R² (Post: Evaluation) 0.870 0.736 0.780 0.784
χ² 0.188 1.218 4.543 1.619
p 0.665 0.270 0.033 0.203
RMSEA 0.000 0.014 0.058 0.024
SRMR 0.001 0.003 0.005 0.003
n 1064 1064 1064 1064

  1. ***p<0.01; **p<0.05; *p<0.10.
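As a consistency check on Table 5, the standardized paths should reproduce the reported R² of the first two equations. This minimal sketch assumes, as the row labels suggest, that Pre: Evaluation is regressed on PID alone (path A1), and Pre: Performance on PID (A2) and Pre: Evaluation (B1). Under that reading, corr(PID, Pre: Evaluation) = A1, and the usual two-predictor formula R² = A2² + B1² + 2·A2·B1·A1 applies. The coefficients are copied from Table 5, and the function name `implied_r2` is hypothetical.

```python
# Standardized paths and reported R² from Table 5 (Study 1)
TABLE5 = {
    "Weidel (AfD)": dict(A1=0.657, A2=0.118, B1=0.736, r2_eval=0.432, r2_perf=0.669),
    "Lindner (FDP)": dict(A1=0.508, A2=0.033, B1=0.638, r2_eval=0.258, r2_perf=0.430),
    "Kipping (Linke)": dict(A1=0.523, A2=0.042, B1=0.694, r2_eval=0.273, r2_perf=0.514),
    "Göring-Eckardt (Grüne)": dict(A1=0.423, A2=-0.008, B1=0.750, r2_eval=0.179, r2_perf=0.557),
}


def implied_r2(a1: float, a2: float, b1: float) -> tuple:
    """Implied R² of Pre: Evaluation and Pre: Performance from standardized paths.

    Pre: Evaluation has one predictor (PID), so R² = A1**2 and
    corr(PID, Pre: Evaluation) = A1. For Pre: Performance with predictors
    PID (A2) and Pre: Evaluation (B1): R² = A2**2 + B1**2 + 2*A2*B1*A1.
    """
    return a1 ** 2, a2 ** 2 + b1 ** 2 + 2 * a2 * b1 * a1


if __name__ == "__main__":
    for name, t in TABLE5.items():
        r2_eval, r2_perf = implied_r2(t["A1"], t["A2"], t["B1"])
        print(f"{name:<24} Pre: Evaluation {r2_eval:.3f} (reported {t['r2_eval']:.3f}), "
              f"Pre: Performance {r2_perf:.3f} (reported {t['r2_perf']:.3f})")
```

With these values, the implied and reported R² agree to within about 0.001, which is consistent with reading Table 5 as standardized coefficients.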

Table 6:

Structural Equation Model – Study 2 (with PID-Dummy).

DV ← IV [Path] Gauland (AfD) Herrmann (CSU) Lindner (FDP) Von der Leyen (CDU) Schwesig (SPD) Göring-Eckardt (Grüne) Wagenknecht (Linke)
Pre: Performance ← Party identification [A2] 0.544*** 0.238*** 0.262*** 0.281*** 0.228*** 0.273*** 0.317***
RTR: Performance ← Party identification [A3] 0.517*** 0.346*** 0.275*** 0.403*** 0.227*** 0.189*** 0.548***
Post: Performance ← Party identification [A5] 0.184*** −0.023 0.114*** 0.081*** 0.041* 0.051** 0.030
Post: Performance ← RTR: Performance [D2] 0.705*** 0.663*** 0.616*** 0.655*** 0.634*** 0.703*** 0.782***
R² (Pre: Performance) 0.295 0.057 0.068 0.079 0.052 0.075 0.100
R² (RTR: Performance) 0.267 0.120 0.076 0.163 0.052 0.036 0.301
R² (Post: Performance) 0.666 0.429 0.431 0.479 0.415 0.510 0.639
χ² 158.099 181.264 207.060 184.252 212.393 226.855 24.150
p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
RMSEA 0.389 0.417 0.446 0.420 0.451 0.466 0.149
SRMR 0.116 0.126 0.142 0.124 0.145 0.160 0.052
n 1038 1038 1038 1038 1038 1038 1038

  1. ***p<0.01; **p<0.05; *p<0.10.
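The χ², p and RMSEA values in Tables 5 and 6 can be cross-checked against each other. This sketch assumes that every model has one degree of freedom, which is an inference rather than a reported figure. The reported p-values and RMSEA values are jointly consistent with that assumption. For Table 5, the only path of a fully recursive six-variable model that seems to be missing is Pre: Performance → Post: Evaluation.

The sketch uses the RMSEA point estimate sqrt(max(χ² − df, 0) / (df · n)). Some software uses n − 1 in the denominator instead; with n, the reported values are reproduced here. For df = 1, the χ² tail probability equals erfc(sqrt(χ²/2)). The function names are hypothetical, and the numbers are copied from the two tables.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """RMSEA point estimate with n in the denominator (some programs use n - 1)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * n))


def p_chi2_df1(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# (model, chi2, reported p, reported RMSEA, n) from Tables 5 and 6
REPORTED = [
    ("Study 1: Weidel", 0.188, 0.665, 0.000, 1064),
    ("Study 1: Lindner", 1.218, 0.270, 0.014, 1064),
    ("Study 1: Kipping", 4.543, 0.033, 0.058, 1064),
    ("Study 1: Göring-Eckardt", 1.619, 0.203, 0.024, 1064),
    ("Study 2: Gauland", 158.099, 0.000, 0.389, 1038),
    ("Study 2: Herrmann", 181.264, 0.000, 0.417, 1038),
    ("Study 2: Lindner", 207.060, 0.000, 0.446, 1038),
    ("Study 2: Von der Leyen", 184.252, 0.000, 0.420, 1038),
    ("Study 2: Schwesig", 212.393, 0.000, 0.451, 1038),
    ("Study 2: Göring-Eckardt", 226.855, 0.000, 0.466, 1038),
    ("Study 2: Wagenknecht", 24.150, 0.000, 0.149, 1038),
]

if __name__ == "__main__":
    for name, chi2, p, r, n in REPORTED:
        print(f"{name:<26} RMSEA {rmsea(chi2, 1, n):.3f} (reported {r:.3f})  "
              f"p {p_chi2_df1(chi2):.3f} (reported {p:.3f})")
```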


Bachl, M. (2013) “Die Wirkung des TV-Duells auf die Bewertung der Kandidaten und die Wahlabsicht,” In: (Bachl, M., F. Brettschneider and S. Ottler, eds.) Das TV-Duell in Baden-Württemberg 2011. Inhalte, Wahrnehmungen und Wirkungen. Wiesbaden: Springer VS, pp. 171–198. doi:10.1007/978-3-658-00792-8_8

Bachl, M. (2014) Analyse rezeptionsbegleitend gemessener Kandidatenbewertungen in TV-Duellen. Doctoral dissertation, Universität Hohenheim.

Bachl, M., F. Brettschneider and S. Ottler (eds.) (2013) Das TV-Duell in Baden-Württemberg 2011. Inhalte, Wahrnehmungen und Wirkungen. Wiesbaden: Springer VS. doi:10.1007/978-3-658-00792-8

Benoit, W. L. and G. J. Hansen (2004) “Presidential Debate Watching, Issue Knowledge, Character Evaluation, and Vote Choice,” Human Communication Research, 30:121–144. doi:10.1111/j.1468-2958.2004.tb00727.x

Benoit, W. L., M. S. McKinney and M. T. Stephenson (2002) “Effects of Watching Primary Debates in the 2000 U.S. Presidential Campaign,” Journal of Communication, 52:316–331. doi:10.1111/j.1460-2466.2002.tb02547.x

Benoit, W. L., G. J. Hansen and R. M. Verser (2003) “A Meta-Analysis of the Effects of Viewing U.S. Presidential Debates,” Communication Monographs, 70:335–350. doi:10.1080/0363775032000179133

Biocca, F., P. David and M. West (1994) “Continuous Response Measurement (CRM): A Computerized Tool for Research on the Cognitive Processing of Communication Messages,” In: (Lang, A., ed.) Measuring Psychological Responses to Media Messages. Hillsdale: Erlbaum, pp. 15–64.

Bortz, J. and N. Döring (2006) Forschungsmethoden und Evaluation für Human- und Sozialwissenschaftler. 4th revised ed. Heidelberg: Springer. doi:10.1007/978-3-540-33306-7

Boyd, T. C. and G. D. Hughes (1992) “Validating Realtime Response Measures,” In: (Sherry Jr., J. F. and B. Sternthal, eds.) NA – Advances in Consumer Research Volume 19. Provo, UT: Association for Consumer Research, pp. 649–656.

Boydstun, A. E., J. Feezell, R. A. Glazier, T. P. Jurka and M. T. Pietryka (2014a) “Colleague Crowdsourcing: A Method for Incentivizing National Student Engagement and Large-N Data Collection,” PS: Political Science & Politics, 47(4):829–834. doi:10.2139/ssrn.2210745

Boydstun, A. E., R. A. Glazier, M. T. Pietryka and P. Resnik (2014b) “Real-Time Reactions to a 2012 Presidential Debate: A Method for Understanding Which Messages Matter,” Public Opinion Quarterly, 78:330–343. doi:10.1093/poq/nfu007

Campbell, A., P. Converse, W. Miller and D. Stokes (1960) The American Voter. New York: Wiley.

Donsbach, W. (2002) “Zur Politischen Bewertung Einer Medialen Inszenierung: Sechs Gründe gegen Fernsehduelle,” Die Politische Meinung, 296:19–25.

Faas, T. and J. Maier (2004a) “Chancellor-Candidates in the 2002 Televised Debates,” German Politics, 13:300–316. doi:10.1080/0964400042000248214

Faas, T. and J. Maier (2004b) “Mobilisierung, Verstärkung, Konversion? Ergebnisse eines Experiments zur Wahrnehmung der Fernsehduelle im Vorfeld der Bundestagswahl 2002,” Politische Vierteljahresschrift, 45:55–72. doi:10.1007/s11615-004-0004-0

Faas, T. and J. Maier (2011) “Medienwahlkampf. Sind TV-Duelle nur Show und damit nutzlos?” In: (Bytzek, E. and S. Roßteutscher, eds.) Der unbekannte Wähler? Mythen und Fakten über das Wahlverhalten der Deutschen. Frankfurt am Main: Campus, pp. 99–114.

Faas, T. and J. Maier (2014) “Wahlkämpfe im Miniaturformat: Fernsehdebatten und ihre Wirkung am Beispiel des TV-Duells 2013 zwischen Angela Merkel und Peer Steinbrück,” Information – Wissenschaft & Praxis, 65:163–168. doi:10.1515/iwp-2014-0029

Faas, T. and J. Maier (2017) “TV-Duell und TV-Dreikampf im Vergleich: Wahrnehmungen und Wirkungen.” In: (Faas, T., J. Maier and M. Maier, eds.) Merkel gegen Steinbrück. Analysen zum TV-Duell vor der Bundestagswahl 2013. Berlin, Heidelberg: Springer, pp. 207–217. doi:10.1007/978-3-658-05432-8_13

Federal Constitutional Court (2002) “2 BvR 1332/02 of 30 August 2002, paras. 1–10.”

Fenwick, I. and M. D. Rice (1991) “Reliability of Continuous Measurement Copy-Testing Methods,” Journal of Advertising Research, 31:23–29.

Gottfried, J. A., B. W. Hardy, K. M. Winneg and K. H. Jamieson (2014) “All Knowledge is not Created Equal: Knowledge Effects and the 2012 Presidential Debates,” Presidential Studies Quarterly, 44:389–409. doi:10.1111/psq.12129

Hallonquist, T. and E. E. Suchman (1944) “Listening to the Listener. Experiences with the Lazarsfeld-Stanton Program Analyzer.” In: (Lazarsfeld, P. F. and F. Stanton, eds.) Radio Research 1942–1943. New York: Duell, Sloan and Pearce, pp. 265–334.

Hallonquist, T. and J. G. Peatman (1947) “Diagnosing Your Radio Program or the Program Analyzer at Work.” In: (Institute for Education by Radio, ed.) Education on the Air: Yearbook of the Institute for Education by Radio. Columbus, OH: Ohio State University Press, pp. 463–474.

Hughes, G. D. and R. Lennox (1990) “Realtime Response Research: Construct Validation and Reliability Assessment.” In: (Bearden, W., et al., eds.) Enhancing Knowledge Development in Marketing. Chicago, IL: American Marketing Association, pp. 284–288.

Kercher, J., M. Bachl, C. Vögele and F. Vohle (2012) “The MediaLiveTracker. A New Online-Tool for Real-Time-Response-Measurement.” Paper presented at the 14th Annual Conference of the Deutsche Gesellschaft für Onlineforschung (GOR), Mannheim.

Klein, M. (2005) “Der Einfluss der beiden TV-Duelle im Vorfeld der Bundestagswahl 2002 auf die Wahlbeteiligung und die Wahlentscheidung. Eine log-lineare Pfadanalyse auf der Grundlage von Paneldaten,” Zeitschrift für Soziologie, 34:207–222. doi:10.1515/zfsoz-2005-0303

Klein, M. and M. Pötschke (2005) “Haben die beiden TV-Duelle im Vorfeld der Bundestagswahl 2002 den Wahlausgang beeinflusst? Eine Mehrebenenanalyse auf der Grundlage eines 11-Wellen-Kurzfristpanels.” In: (Falter, J. W., O. W. Gabriel and B. Weßels, eds.) Wahlen und Wähler. Analysen aus Anlass der Bundestagswahl 2002. Wiesbaden: Verlag für Sozialwissenschaften, pp. 357–370. doi:10.1007/978-3-322-80516-4_14

Klein, M. and U. Rosar (2007) “Wirkungen des TV-Duells im Vorfeld der Bundestagswahl 2005 auf die Wahlentscheidung,” KZfSS Kölner Zeitschrift für Soziologie und Sozialpsychologie, 59:81–104. doi:10.1007/s11577-007-0004-3

Maier, J. (2007a) “Erfolgreiche Überzeugungsarbeit. Urteile über den Debattensieger und die Veränderung der Kanzlerpräferenz.” In: (Maurer, M., C. Reinemann, J. Maier and M. Maier, eds.) Schröder gegen Merkel. Wahrnehmung und Wirkung des TV-Duells 2005 im Ost-West-Vergleich. Wiesbaden: VS Verlag, pp. 91–109. doi:10.1007/978-3-531-90709-3_5

Maier, J. (2007b) “Eine Basis für Rationale Wahlentscheidungen? Die Wirkungen des TV-Duells auf Politische Kenntnisse.” In: (Maurer, M., C. Reinemann, J. Maier and M. Maier, eds.) Schröder gegen Merkel. Wahrnehmung und Wirkung des TV-Duells 2005 im Ost-West-Vergleich. Wiesbaden: VS Verlag, pp. 129–143. doi:10.1007/978-3-531-90709-3_7

Maier, M. (2007c) “Verstärkung, Mobilisierung, Konversion. Wirkungen des TV-Duells auf die Wahlabsicht.” In: (Maurer, M., C. Reinemann, J. Maier and M. Maier, eds.) Schröder gegen Merkel. Wahrnehmung und Wirkung des TV-Duells 2005 im Ost-West-Vergleich. Wiesbaden: VS Verlag, pp. 145–165. doi:10.1007/978-3-531-90709-3_8

Maier, J. (2013) “Rezeptionsbegleitende Erfassung Individueller Reaktionen auf Medieninhalte. Bedeutung, Varianten, Qualität und Analyse von Real-Time-Response-Messungen,” ESSACHESS – Journal for Communication Studies, 6:169–184.

Maier, J. and T. Faas (2011a) “Das TV-Duell 2009. Langweilig, Wirkungslos, Nutzlos? Ergebnisse eines Experiments zur Wirkung der Fernsehdebatte zwischen Angela Merkel und Frank-Walter Steinmeier.” In: (Oberreuter, H., ed.) Am Ende der Gewissheiten: Wähler, Parteien und Koalitionen in Bewegung. München: Olzog, pp. 147–166.

Maier, J. and T. Faas (2011b) “‘Miniature Campaigns’ in Comparison: The German Televised Debates, 2002–09,” German Politics, 20:75–91. doi:10.1080/09644008.2011.554102

Maier, M. and J. Strömbäck (2009) “Advantages and Limitations of Comparative Audience Responses to Televised Debates: A Comparative Study of Germany and Sweden.” In: (Maier, J., M. Maier, M. Maurer, C. Reinemann and V. Meyer, eds.) Real-Time Response Measurement in the Social Sciences. Methodological Perspectives and Applications. Frankfurt am Main: Peter Lang, pp. 97–116.

Maier, J., M. Maurer, C. Reinemann and T. Faas (2007) “Reliability and Validity of Real-Time Response Measurement: A Comparison of Two Studies of a Televised Debate in Germany,” International Journal of Public Opinion Research, 19:53–73. doi:10.1093/ijpor/edl002

Maier, J., M. Maier, M. Maurer, C. Reinemann and V. Meyer (eds.) (2009) Real-Time Response Measurement in the Social Sciences. Methodological Perspectives and Applications. Frankfurt am Main: Peter Lang.

Maier, J., T. Faas and M. Maier (2013) “Mobilisierung durch Fernsehdebatten: zum Einfluss des TV-Duells 2009 auf die Politische Involvierung und die Partizipationsbereitschaft.” In: (Weßels, B., H. Schoen and O. W. Gabriel, eds.) Wahlen und Wähler. Wiesbaden: Springer Fachmedien, pp. 79–96.

Maier, J., J. F. Hampe and N. Jahn (2016a) “Breaking Out of the Lab: Measuring Real-Time Responses to Televised Political Content in Real-World Settings,” Public Opinion Quarterly, 80:542–553. doi:10.1093/poq/nfw010

Maier, J., B. Rittberger and T. Faas (2016b) “Debating Europe: Effects of the ‘Eurovision Debate’ on EU Attitudes of Young German Voters and the Moderating Role Played by Political Involvement,” Politics and Governance, 4(1):55–68. doi:10.17645/pag.v4i1.456

Maier, J., T. Faas, B. Rittberger, J. Fortin-Rittberger, K. A. Josifides, S. Banducci, P. Bellucci, M. Blomgren, I. Brikse, K. Chwedczuk-Szulc, M. C. Lobo, M. Cześnik, A. Deligiaouri, T. Deželan, W. deNooy, A. Di Virgilio, F. Fesnic, D. Fink-Hafner, M. Grbeša, C. Greab, A. Henjak, D. N. Hopmann, D. Johann, G. Jelenfi, J. Kavaliauskaite, Z. Kmetty, S. Kritzinger, P. C. Magalhães, V. Meyer, K. Mihailova, M. Mirchev, V. Pitkänen, A. Ramonaite, T. Reidy, M. Rybar, C. Sammut, J. Santana-Pereira, G. Spurava, L.-P. Spyridou, A. Stefanel, V. Štětka, A. Surdej, R. Tardos, D. Trimithiotis, C. Vezzoni, A. Világi and G. Zavecz (2018) “This Time it’s Different? Effects of the Eurovision Debate on Young Citizens and its Consequence for EU Democracy–Evidence from a Quasi-Experiment in 24 Countries,” Journal of European Public Policy, 25(4):606–629. doi:10.1080/13501763.2016.1268643

Maurer, M. and C. Reinemann (2003) Schröder gegen Stoiber: Nutzung, Wahrnehmung und Wirkung der TV-Duelle. Wiesbaden: Westdeutscher Verlag. doi:10.1007/978-3-322-80456-3

Maurer, M. and C. Reinemann (2006) “Learning Versus Knowing. Effects of Misinformation in Televised Debates,” Communication Research, 33:489–506. doi:10.1177/0093650206293252

Maurer, M., C. Reinemann, J. Maier and M. Maier (eds.) (2007) Schröder gegen Merkel. Wahrnehmung und Wirkung des TV-Duells 2005 im Ost-West-Vergleich. Wiesbaden: VS Verlag.

McKinney, M. S. and D. B. Carlin (2004) “Political Campaign Debates.” In: (Kaid, L. L., ed.) Handbook of Political Communication Research. Mahwah: Lawrence Erlbaum Associates, pp. 203–234.

McKinney, M. S. and B. R. Warner (2013) “Do Presidential Debates Matter?,” Argumentation and Advocacy, 49:238–258. doi:10.1080/00028533.2013.11821800

McKinney, M. S., L. A. Rill and D. Gully (2011) “Civic Engagement through Presidential Debates: Young Citizens’ Attitudes of Political Engagement throughout the 2008 Election.” In: (McKinney, M. S. and M. C. Banwart, eds.) Communication in the 2008 U.S. Election: Digital Natives Elect a President. New York: Peter Lang, pp. 121–141.

Metz, T., U. Wagschal, T. Waldvogel, M. Bachl, L. Feiten and B. Becker (2016) “Das Debat-O-Meter: ein neues Instrument zur Analyse von TV-Duellen,” ZSE Zeitschrift für Staats- und Europawissenschaften / Journal for Comparative Government and European Policy, 14:124–149. doi:10.5771/1610-7780-2016-1-124

Papastefanou, G. (2013) Reliability and Validity of RTR Measurement Device. GESIS – Leibniz-Institut für Sozialwissenschaften, Working Paper 2013-27.

Reinemann, C. and M. Maurer (2010) “Leichtgläubig und Manipulierbar? Die Rezeption Persuasiver Wahlkampfbotschaften durch Politisch Interessierte und Desinteressierte.” In: (Faas, T., K. Arzheimer and S. Roßteutscher, eds.) Information – Wahrnehmung – Emotion. Wiesbaden: VS Verlag, pp. 239–257. doi:10.1007/978-3-531-92336-9_12

Reinemann, C., J. Maier, T. Faas and M. Maurer (2005) “Reliabilität und Validität von RTR-Messungen,” Publizistik, 50:56–73. doi:10.1007/s11616-005-0118-4

Schill, D. (2016) “The History, Reliability, Validity, and Utility of Real Time Response.” In: (Schill, D., R. Kirk and A. E. Jasperson, eds.) Political Communication in Real Time: Theoretical and Applied Research Approaches. Abingdon, Oxford, UK: Routledge, pp. 31–56. doi:10.4324/9781315669083

Schill, D. and R. Kirk (2009) “Applied Dial Testing: Using Real-Time Response to Improve Media Coverage of Debates.” In: (Maier, J., M. Maier, M. Maurer, C. Reinemann and V. Meyer, eds.) Real-Time Response Measurement in the Social Sciences. Methodological Perspectives and Applications. Frankfurt am Main: Peter Lang, pp. 155–173.

Schill, D. and R. Kirk (2014) “Courting the Swing Voter: ‘Real Time’ Insights Into the 2008 and 2012 U.S. Presidential Debates,” American Behavioral Scientist, 58(4):536–555. doi:10.1177/0002764213506204

Schill, D., R. Kirk and A. E. Jasperson (2016) Political Communication in Real Time: Theoretical and Applied Research Approaches. Abingdon, Oxford, UK: Routledge.

Schwerin, H. (1940) “An Exploratory Study of the Reliability of the Program Analyzer,” Journal of Applied Psychology, 24(6):742–745. doi:10.1037/h0058363

Wagschal, U., T. Waldvogel, T. Metz, B. Becker, L. Feiten, S. Weishaupt and K. Singh (2017) “Das TV-Duell und die Landtagswahl in Schleswig-Holstein: Das Debat-O-Meter als neues Instrument der politischen Kommunikationsforschung,” ZParl Zeitschrift für Parlamentsfragen, 48(3):594–613. doi:10.5771/0340-1758-2017-3-594

Waldvogel, T. and T. Metz (2017) “Real-Time-Response-Messungen.” In: (Jäckle, S., ed.) Neue Trends in den Sozialwissenschaften. Wiesbaden, Germany: Springer Fachmedien Wiesbaden, pp. 307–331.

Weiber, R. and D. Mühlhaus (2014) Strukturgleichungsmodellierung: Eine Anwendungsorientierte Einführung mit Hilfe von AMOS, SmartPLS und SPSS. 2nd expanded ed. Heidelberg: Springer. doi:10.1007/978-3-642-35012-2

Wolf, B. (2010) Beurteilung Politischer Kandidaten in TV-Duellen: Effekte Rezeptionsbegleitender Fremdmeinungen auf Zuschauerurteile. Angewandte Medienforschung: Bd. 50. Baden-Baden: Nomos. doi:10.5771/9783845227467

Published Online: 2020-02-07
Published in Print: 2020-06-25

©2020 Walter de Gruyter GmbH, Berlin/Boston