
# Data and Information Management

4 Issues per year

Open Access
Online ISSN: 2543-9251
Volume 2, Issue 1

# To Phrase or Not to Phrase – Impact of User versus System Term Dependence upon Retrieval

Christina Lioma
/ Birger Larsen
/ Peter Ingwersen
Published Online: 2018-05-22 | DOI: https://doi.org/10.2478/dim-2018-0001

## Abstract

When submitting queries to information retrieval (IR) systems, users often have the option of specifying which, if any, of the query terms are heavily dependent on each other and should be treated as a fixed phrase, for instance by placing them between quotes. In addition to such cases where users specify term dependence, automatic ways also exist for IR systems to detect dependent terms in queries. Most IR systems use both user and algorithmic approaches. It is, however, not clear whether and to what extent user-defined term dependence agrees with algorithmic estimates of term dependence, nor which of the two may fetch higher performance gains. Simply put, is it better to trust users or the system to detect term dependence in queries? To answer this question, we experiment with 101 crowdsourced search engine users and 334 queries (52 train and 282 test TREC queries) and we record 10 assessments per query. We find that (i) user assessments of term dependence differ significantly from algorithmic assessments of term dependence (their overlap is approximately 30%); (ii) there is little agreement among users about term dependence in queries, and this disagreement increases as queries become longer; (iii) the potential retrieval gain that can be fetched by treating term dependence (both user- and system-defined) over a bag of words baseline is limited to a small subset (approximately 8%) of the queries, and is much higher for low-depth than deep precision measures. Points (ii) and (iii) constitute novel insights into term dependence.

## 1 Introduction

When submitting queries to information retrieval (IR) systems, users may often specify which, if any, among the query terms are heavily dependent on each other and should be treated as a fixed phrase, for instance by placing them between quotes. The IR system then adapts the processing accordingly to retrieve text containing the same terms in the same order as what is inside the quotes. In addition to such cases where users specify term dependence, there also exist automatic ways for IR systems to detect dependent terms in queries (Fagan, 1989; Lioma, Simonsen, Larsen, & Hansen, 2015; Michelbacher, Evert, & Schütze, 2011) (overviewed in Section 2). Most IR systems support both such user and algorithmic approaches to detect term dependence in incoming queries. It is not however clear how much user and algorithmic assessments of term dependence agree, nor which of the two is likely to benefit retrieval performance the most.

To study this, we compare user assessments of term dependence to algorithmic assessments in 334 queries. We collect the user assessments by recruiting 101 search engine users through the CrowdFlower crowdsourcing platform and by examining their selection of term dependence. We produce the algorithmic assessments using four state-of-the-art term dependence ranking models (Lioma et al., 2015; Metzler & Croft, 2005). Given a query, both user and algorithmic approaches decide if the query contains heavily dependent terms that should be treated as a fixed phrase instead of a bag of words. We compare retrieval performance between user and algorithmic methods of deciding term dependence, and also against a bag of words (no term dependence) baseline, using standard TREC datasets. Our findings agree with prior work (Hagen, Potthast, Beyer, & Stein, 2012; Hagen, Potthast, Stein, & Bräutigam, 2011) that users disagree not only with the algorithmic methods, but also among themselves. In addition, we report novel findings, showing for the first time that this disagreement varies across different retrieval aspects such as query length or evaluation rank depth. Specifically, we find that (i) there is little agreement among users about term dependence in queries, and this disagreement increases as queries become longer; (ii) the potential retrieval gain that can be fetched by treating term dependence over a bag of words baseline is limited to a small subset of the queries, and is much higher for low-depth than deep precision measures.

We next overview related work (Section 2), and describe our crowdsourcing (Section 3) and retrieval (Section 4) experiments. We conclude by discussing our findings (Section 5).

## 2 Related Work

The user option of specifying term dependence in queries has existed since the mid 1970s as phrase operators, where a mixture of controlled vocabulary (descriptors), which contained phrases, and free-text searching was applied. Phrase (or proximity) operators have been particularly important in bibliographic IR systems, such as DIALOG or Web of Science. At the time, the users of bibliographic IR systems were mostly professional librarians, trained in using a wide range of operators including phrasing (spanning a range of term nearness options) from adjacent to a distance of n terms in specified search fields, like title or abstract, or in the basic index. Early analyses of retrieval interaction from the late 1970s, e.g. (Fenichel, 1981; Oldroyd & Citroen, 1977), did not publish statistics on the use of phrase operators, but rather focused on the number and nature of query terms, eventually reaching the consensus that phrase and other proximity operators “were scarcely used”. For instance, Fenichel (Fenichel, 1981) reported that novice users on average used 7.9 terms per search session, including descriptor terms, while moderately experienced users used on average 9.6 terms per search session, and experienced users used on average 14.4 terms per search session. Fenichel attributed the use of descriptor terms and phrases to the need for term alternatives and support during search. A decade later, Fidel (Fidel, 1991) measured for the first time the use of phrase operators by professional experienced DIALOG users, and found that (a) each command application, named a “move”, applied 13.3 terms per search, and that (b) phrase operations constituted only 1.45% of all queries ((Fidel, 1991), p. 518, Table 2). 
Other log analyses of the DIALOG system (Hsieh-Yee, 1993; Saracevic & Kantor, 1988; Spink & Saracevic, 1997) also studied the use of phrase operators, among other things, but did not give statistics on their use per search and query cycles, nor on differences between novice and experienced searchers. No statistical evidence on the number of search terms per query statement or cycle was given, nor their nature (single terms, phrases, etc.).

Soon after, Web search was underway. Web search was done in a less structured environment than bibliographic IR: Up to approximately the year 2000, bibliographic IR mainly gave access to metadata in fields and an abstract per record; full records were introduced later. Another major difference between Web Search and bibliographic IR is that, although still fundamentally following Boolean logic, Web search did (and still does) not allow for set manipulation, did not have thesaurus support, and search sessions were overall shorter. As descriptors from a controlled vocabulary or a thesaurus (in bibliographic IR) leave little room for generating meaningful phrases and applying phrase operators, one would expect the use of phrase operators to increase in web search, compared to bibliographic search, practically leading to shorter search sessions than bibliographic retrieval. In addition, shorter search sessions, combined with a mostly layman rather than trained professional user population, is likely to have had an impact on the use of phrasing operators when searching. Indeed, that was the case. The first log-based study of web searching (Jansen, Spink, Bateman, & Saracevic, 1998) studied 51K queries posed by 18,113 Excite users, where fixed phrases could be specified between quotes, and found that 6% of all users used phrase operators. Jansen et al. (Jansen, Spink, Bateman, & Saracevic, 1998) suggest that the users had great difficulty in applying logical and language-based operations on the web from its start. Similar results to (Jansen, Spink, Bateman, & Saracevic, 1998) were reported by Silverstein et al. (Mishne & De Rijke, 2005), who studied 153M AltaVista queries, and by Spink et al. (Spink, Wolfram, Jansen, & Saracevic, 2001), who studied 531K EXCITE queries. Wang et al. 
(Wang, Berry, & Yang, 2003) conducted the first longitudinal Web log study by analyzing 4 years of logged queries (541K) in a university website during 1997–2001, and reported that some users were capable of querying using fixed phrases, but did not give statistics. Slightly later, Jansen, Spink & Pedersen (Jansen, Spink, & Pedersen, 2005) studied 1.5M queries logged from AltaVista during 24 hours in 2002. They found that Boolean language was used for 6% of queries (p. 563), but no specific analysis of phrase operations was done. None of the 25 most frequent queries contained phrases. Similarly, Jansen, Booth & Spink did not analyse the phrase issue in their very large scale web log study carried out in 2005 (Jansen, Booth, & Spink, 2009). They analysed 1,523,793 queries executed on the Dogpile meta search engine and found that the average number of search terms per query was 2.79 (SD = 1.54; p. 1365, Table 4). Almost 71% of the search sessions consisted of only one query.

The above log studies include few direct analyses of how users perceive term dependence in queries through phrase operators. It seems that overall users rarely query using fixed phrases: phrases have been used in queries at a rate of 1.45% in bibliographic retrieval (Fidel, 1991) and 6% in web retrieval (Jansen, Spink, Bateman, & Saracevic, 1998). For the vast majority of queries, users tend to apply single terms as tokens for concepts (Jansen et al., 2005; Wang, Berry, & Yang, 2003). The use of the phrase operator seems to make more sense in free text search, where users must formulate the relevant phrase themselves. The number of terms in queries is difficult to establish, but we know the average number of query terms per bibliographic search session over the period 1980-97, as shown in Figure 1. If a search session on average consists of three iterations, corresponding to query submissions, and 15 search commands, including field codes and other logical operations (Saracevic & Kantor, 1988), then each query on average consists of approximately 5 command operations. The overall small number of terms per query in bibliographic retrieval that we see in Figure 1 somewhat corresponds to the small term quantity later observed in free text retrieval on web engines. In web retrieval, the trend from the mid-90s is a slight growth in the use of multiple-term queries and thus an increase in the average number of terms applied per query, from 2.0 to 2.73 in 2005. However, no descriptors exist, and the searcher, most often inexperienced, must formulate his/her own query statements. One observes a large proportion of errors and scarce use of the phrase operator, 6% over all queries (Jansen, Spink, Bateman, & Saracevic, 1998).

Figure 1

Average number of query terms per search session in bibliographic online systems and terms per query statement in Web retrieval during 1981–2005 based on different transaction log analyses.

The reasons why phrases have been used so infrequently in information retrieval as a whole have not been studied. It is not clear to what extent users do not use phrasing because they think that it does not improve retrieval, or because they do not know of its existence, or because they cannot operate it properly, or because they tend to rarely apply meaningful term phrases when searching. Instead, users tend to apply single terms as tokens for concepts (Jansen et al., 2005; Wang, Berry, & Yang, 2003). Bibliographic searchers as well as web searchers appear to commit many errors, and failure may create uncertainty and lead to very simplistic query structures.

Even though user assessments of term dependence have revealed interesting findings in computational linguistics (Michelbacher, Kothari, Forst, Lioma, & Schütze, 2011), for instance with respect to their lack of symmetry (cf. Section 6 in (Lioma & Hansen, 2017); e.g. the dependence from Pyrrhic to victory is larger than from victory to Pyrrhic (Michelbacher, Evert, et al., 2011)), these advances have not yet been used in IR. In general, the last study recording how users specified term dependence was from 2005 (Jansen, Spink, & Pedersen, 2005).

On the contrary, algorithmic approaches to detect and process term dependence have been explored much more in IR, for instance in ad-hoc retrieval (Lioma & van Rijsbergen, 2008), patent retrieval (Jochim, Lioma, & Schütze, 2011), domain-specific retrieval on physics academic literature (Lioma, Kothari, & Schuetze, 2011), or more formally using logic (Lioma, Larsen, Schütze, & Ingwersen, 2010). A recent comprehensive overview is given in (Lioma et al., 2015). It seems that the most popular methods for automatically detecting heavily dependent terms in queries rely on the co-occurrence frequency of the query terms in some query log or other large enough corpus (this has also resulted in thorough investigations of query term distributions (Petersen, Simonsen, & Lioma, 2016)). The main premise is that the more often some query terms co-occur, the more dependent they are likely to be. This premise has been long applied (Fagan, 1989; Mishne & De Rijke, 2005). Recently, an alternative family of models was proposed (Lioma et al., 2015) to automatically detect heavily dependent terms, which relies not on their frequency, but on their semantic distance when perturbed with synonyms. We use two of the best performing models of (Lioma et al., 2015) to automatically detect heavily dependent terms in queries in Section 4.

Very relevant to our work is also the area of query segmentation or phrase identification, where several studies compare human versus automatic approaches to query segmentation and discuss their impact on TREC data. One such large dataset, for instance, is published by Hagen et al. (Hagen et al., 2011) and contains 50,000+ queries segmented by 10 annotators each. This dataset was subsequently used again by Hagen et al. (Hagen et al., 2012). In both papers, findings indicate low human agreement for some queries. Another dataset is published by Roy et al. (Saha Roy, Ganguly, Choudhury, & Laxman, 2012) and contains 500 queries and 3 annotators. Further studies of human phrase detection have also been published in the NLP/Computational Linguistics community, see e.g. the work by Ramanath et al. (Ramanath, Choudhury, Bali, & Roy, 2013), and the datasets of human segmentation by Bendersky, Croft & Smith (Bendersky, Croft, & Smith, 2011), and Bergsma & Wang (Bergsma & Wang, 2007).

## 3 Crowdsourcing Term Dependence

To obtain human assessments of term dependence, we engaged 101 Web search engine users through the CrowdFlower (CF) crowdsourcing platform. The CF experiment was entitled “To Phrase Or Not To Phrase – Exact Phrases in Search Engine Queries” and included an initial task description phase (Section 3.1), a training session (Section 3.2), and the final assessment session (Section 3.3). We describe these next.

## 3.1 Initial Task Description

Users were introduced to the concept of quotes as exact phrase markers in queries, in order to receive results that contain that exact phrase and are potentially more accurate. They were then informed that they would be presented with queries and would have to select if and how to use quotes to specify exact phrases in those queries. They were asked not to use search engines to assess the results, but instead to decide based on their intuition and experience in web search. Figure 2 (a) shows the example shown to users, which illustrates all possible term dependence combinations for a query. The last option (I do not understand the query) was to be chosen when a query did not make sufficient sense for them to recommend whether to use term dependence or not. Only one option could be chosen per query.

Figure 2

(a) Example query with term dependence options given to CrowdFlower users. Quotes mark dependent terms. (b) 10 most frequent unigrams (with frequencies) extracted from user comments during training.

We showed only queries to users, not any associated context about the underlying information need or search task. On one hand, this may limit how well users understand the query and, by extension, how reliably they can assess if and when to specify fixed phrases in the query. On the other hand, this setup (of providing to users queries without any further information on the information need or search task) is a popular practice (Blanco et al. 2011 (Blanco et al., 2011), Metrikov, Pavlu & Aslam 2015 (Metrikov, Pavlu, & Aslam, 2015), Yilmaz et al. 2012 (Yilmaz, Kazai, Craswell, & Tahaghoghi, 2012)) that facilitates large-scale experimentation at relatively low cost (in IR experimental datasets, there exist significantly fewer queries with context information, than queries without context information). We chose the option of experimenting with a large number of queries, because the larger the query sample, the more generalizable and robust our findings on that sample. However, to address cases where users may not be able to understand the query due to lack of information on the underlying information need or task, we also specify the option “I do not understand the query”. Users were instructed to use this option and simply skip queries they did not feel confident assessing.

## 3.2 Session I: Training

The initial task description was followed by a compulsory training session on 52 train queries. Each of our 101 users was shown a query with all possible term dependence options, as in Figure 2 (a), and had to select one option only. Even though it would have made sense for users to be allowed to make more than one choice, the CF interface did not allow choosing multiple options. There was also a comment box for optionally typing feedback. After making their choice, users could see the answer we thought was correct, with an explanation. The queries used in this training session were not part of the queries used for retrieval later in Section 4, but a random selection from (a) the TREC 2012 Web adhoc track queries and (b) queries that we made up to intentionally include heavily dependent terms in a majority of them. Table 1 displays the 52 train queries, with the most popular user choice of term dependence between quotes. Each of these 52 queries was assessed by 101 users. The scores in brackets in Table 1 show the average user agreement on the most popular user choice for each query, which we computed as the % of users (out of all 101 users) who agree on the most popular term dependence option for each query. For instance, the average agreement of 69% for “rain man” means that 70 out of 101 users (≈69%) selected the option “rain man”. The 52 train queries are sorted in Table 1 by decreasing user agreement.

Table 1

Train queries used on the CrowdFlower training session. Quotes mark the most popular user choice of term dependence for each query. Each query is assessed by 101 users. The percentages in brackets indicate how many out of the 101 users who assessed each query chose the most popular term dependence option (shown in this Table).

Table 2 summarises the statistics of the user assessments of the 52 train queries. User agreement in the last column of Table 2 refers to how many out of the 101 assessments received for each train query agree on the single most popular term dependence option for that query. We report the average of this number across all 52 queries (Table 2, row 5), or the average of this number across query subsets split according to query length (Table 2, rows 1-4).
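The agreement figure described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the toy assessments are assumptions made for the example, not the scripts used in the study. The only figure taken from the text is the 70 out of 101 users who chose “rain man”.

```python
from collections import Counter


def majority_agreement(choices):
    """Return the most popular option and the share of assessors who chose it."""
    counts = Counter(choices)
    option, votes = counts.most_common(1)[0]
    return option, votes / len(choices)


def average_agreement(assessments):
    """Average the per-query agreement over a dict {query: [choices]}."""
    shares = [majority_agreement(c)[1] for c in assessments.values()]
    return sum(shares) / len(shares)


# Toy data (hypothetical split of the remaining votes): 70 of 101 users
# quote the whole query, the other 31 choose bag of words.
rain_man = ['"rain man"'] * 70 + ["rain man"] * 31
option, share = majority_agreement(rain_man)
print(option, f"{share:.0%}")
```

Averaging `majority_agreement` over all queries, or over the queries of a given length, would give the kind of per-row figures reported in the last column of Table 2.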

Table 2

Statistics of train queries.

We see in Table 2 that, overall, users disagreed on the most popular term dependence option for each query a bit more than they agreed (on average they agreed 49% of the time; see Table 2, row 5, last column). Comparing rows 1-4 in Table 2, we see that user disagreement increased with query length, probably because of the increased number of phrasing options (more query terms result in more term dependence combinations). We also see that users chose term dependence instead of bag of words in approximately 53.8% of all 52 train queries on average (Table 2, row 5). This rate is higher than what was reported in the literature in Section 2 because the 52 train queries were chosen to intentionally include term dependence in a majority of them. Note that the CF interface did not allow users to choose multiple options, which might have been suitable for this task.

Initially we planned to exclude users who failed the training session, as a way to combat crowdsourcing misconduct. Failing the training session consisted of disagreeing with the answer we thought was correct for 27 or more out of 52 train queries (i.e. more than half of the training queries). However, we soon observed that most users failed the training; in fact, they disagreed with our ground truth just as much as they disagreed among themselves. This caused strong reactions from users, who described their frustration in the comments box. Figure 2 (b) shows the most frequent unigrams extracted from those user comments. Strongly negative adjectives and expletives prevail. We realised that this variation in user assessments was part of the subjective nature of this task, so we did not exclude users who failed the training. We did however filter users according to the user trust score that CrowdFlower provides, and selected only users with the highest user trust, as follows: CrowdFlower divides all users into three groups according to their trust score. CrowdFlower reports that this score is computed based on the user performance on previous tasks, but no further information on how this score is computed is given. The first group contains users of low user trust, the second of medium user trust, and the third of high user trust. We selected users from the third group only.

## 3.3 Session II: Assessment

After the training session, users proceeded to the assessment session. They were shown 20 queries per page, and had to select one term dependence option per query. We used 282 TREC queries and gathered in total 10 user assessments per query. These 282 queries are all the TREC 6-8 queries (301-450, title only) of the AdHoc track and queries 1-200 of the Web AdHoc tracks of TREC 2009-2012, except those that contain only one term after stopword removal. Users had to spend at least 40 seconds on each page; otherwise they were removed from the job. They were awarded 0.10 USD per page. We did not specify any maximum assessments per user, nor did we use restrictions on the crowdsourcing platforms that CF syndicates from, on geography, or on language. Even though users were asked to assess the queries without inspecting live Web search results, there is no guarantee that they did not do so. However, the time they spent on each assessment, which was overall quite low, may suggest that few did: all 2820 user assessments were completed in under 3 hours.

Table 3 summarises the user assessments of the 282 TREC queries in terms of query length, phrasing or bag of words choice, user agreement and trust. We explain user agreement and user trust next. In Table 3, user agreement refers to how many out of the 10 assessments received for each query agree on the single most popular term dependence option for that query. We report the average of this number across all 282 queries (Table 3, row 5), or the average of this number across query subsets split according to query length (Table 3, rows 1-4). User trust in Table 3 and Figure 3 is the trust score provided by CF as a number between 0 and 1 per user, and is based on user performance on training questions in previously completed jobs. There is no information on how this user trust score was computed. We report the average of this number across all 282 queries (Table 3, row 5), or the average of this number across query subsets split according to query length (Table 3, rows 1-4).

Figure 3

CrowdFlower user trust (x axis) versus % of queries (out of all 282 queries) where users choose term dependence (y axis), binned.

Table 3

Query statistics (TREC queries).

We see in Table 3 that, similarly to Table 2, users again tend to disagree a bit more than they tend to agree (average user agreement is 48.9%; see row 5). Moreover, similarly to Table 2, user agreement also increased as query length decreased (see rows 1-4). However, unlike Table 2, users choose phrasing in approximately 34.9% of all queries. This is lower than the percentage we observed in the training queries because the queries in Table 3 are standard TREC queries, which we did not select according to whether they contained phrases or needed phrasing.

Figure 3 shows that higher trust users are more likely to use bag of words over term dependence, and vice versa (Spearman correlation coefficient ρ: 0.7). Sporadic use of term dependence actually gives better retrieval results, as we will see later in Section 4. So, it looks like more trusted users might be aware of this and might use term dependence more economically than less trusted users.

Finally, we also computed the overall user agreement on all assessments (not only the most popular) to get a collective idea of the general agreement among our assessors. We used Krippendorff’s alpha coefficient, a statistical measure of inter-annotator agreement that is applicable to any number of annotators and to incomplete (missing) data, and that adjusts itself to small sample sizes (Hayes & Krippendorff, 2007). Krippendorff’s alpha coefficient α = 1 indicates perfect agreement, α = 0 indicates absence of agreement, and α < 0 indicates that disagreements are systematic and exceed what should be expected by chance. Krippendorff’s alpha coefficient α is defined as follows:

$\alpha = 1 - \frac{D_o}{D_e}$ (1)

where $D_o$ is the disagreement observed and $D_e$ is the disagreement expected by chance.

$D_o = \frac{1}{n} \sum_{c \in R} \sum_{k \in R} \delta(c,k) \sum_{u \in U} m_u \frac{n_{cku}}{P(m_u, 2)}$ (2)

where $\delta$ is a metric function, $n$ is the total number of pairable elements, $R$ is the set of all possible annotations an annotator can give, $u$ is the annotations of all annotators for a given example, $U$ is the multiset of all $u$ for all examples, $m_u$ is the number of items in $u$, $n_{cku}$ is the number of $(c,k)$ pairs in $u$, and $P$ is the permutation function.

$D_e = \frac{1}{P(n, 2)} \sum_{c \in R} \sum_{k \in R} \delta(c,k) P_{ck}$ (3)

where $P_{ck}$ is the number of ways the pair $(c,k)$ can be made.

$P_{ck} = \begin{cases} n_c n_k & \text{if } c \neq k \\ n_c (n_c - 1) & \text{if } c = k \end{cases}$ (4)

Different metric functions ($\delta$) can be used. Generally, for values $v$ and $w$,

$\delta(v,w) \geq 0, \quad \delta(v,v) = 0, \quad \delta(v,w) = \delta(w,v)$ (5)

We found that the assessments of our users have in general a very low Krippendorff’s alpha coefficient: α <0.09. This value of α <0.09 means that disagreement among user assessments is too systematic to be by chance, hence that our findings are statistically generalizable.
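For illustration, Equations (1) to (4) can be turned into a short Python function. The sketch below assumes a nominal metric (δ is 0 for identical annotations and 1 otherwise), which is our assumption for categorical phrasing options; the text does not state which metric was used. It also reads $n_c$ as the total number of annotations with value $c$, and the function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import permutations


def nominal_delta(c, k):
    """Nominal metric (an assumption): 0 if the annotations match, else 1."""
    return 0.0 if c == k else 1.0


def krippendorff_alpha(units, delta=nominal_delta):
    """units: one list of annotations per example; missing data is simply absent."""
    units = [u for u in units if len(u) >= 2]  # only pairable annotations count
    if not units:
        raise ValueError("need at least one example with two annotations")
    n = sum(len(u) for u in units)

    # Observed disagreement, Eq. (2): ordered pairs within each example.
    d_o = 0.0
    for u in units:
        m_u = len(u)
        pair_sum = sum(delta(c, k) for c, k in permutations(u, 2))
        d_o += m_u * pair_sum / (m_u * (m_u - 1))
    d_o /= n

    # Expected disagreement, Eqs. (3) and (4): pairs over all annotations.
    totals = Counter(v for u in units for v in u)
    d_e = 0.0
    for c, n_c in totals.items():
        for k, n_k in totals.items():
            p_ck = n_c * (n_c - 1) if c == k else n_c * n_k
            d_e += delta(c, k) * p_ck
    d_e /= n * (n - 1)

    if d_e == 0.0:  # only one value ever used: treat as perfect agreement
        return 1.0
    return 1.0 - d_o / d_e


# Toy example: two assessors who always pick opposite options.
print(krippendorff_alpha([["phrase", "bow"], ["phrase", "bow"]]))
```

In the toy example the two assessors disagree on every query, so the coefficient is negative, matching the reading of α < 0 as systematic disagreement.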

## 4 Retrieval with User versus System Term Dependence

We compare the human assessments of term dependence collected in Section 3 (user choice of term dependence) to automatic decisions of term dependence, with respect to the retrieval performance they yield for the 282 TREC queries described in Section 3. Specifically we compare 1 run of user choice of term dependence to 6 runs of automatic decisions of term dependence. We explain these 7 runs below:

User choice of term dependence run:

1. User choice of bag of words or term dependence per query: For each query, we used the most popular user choice (among the 10 CrowdFlower users). This choice can be either bag of words or any combination of term dependence, as the example in Figure 2(a) illustrates.

Automatic decisions of term dependence runs:

2. Bag of words (no term dependence) for all queries.

3. Automatic choice of bag of words or term dependence per query using the Good Turing (GT) model with median smoothing from (Lioma, Kothari, & Schuetze, 2011). (We explain this model in Section 4.1.)

4. Automatic choice of bag of words or term dependence per query using the ATC model from (Lioma, Kothari, & Schuetze, 2011). (We explain this model in Section 4.1.)

5. Treat as dependent only adjacent query terms, for all queries, using the well-known Markov Random Field model of sequentially dependent query terms (MRF_S) (Metzler & Croft, 2005).

6. Treat as dependent all query terms, for all queries, using the well-known Markov Random Field model of fully dependent query terms (MRF_F) (Metzler & Croft, 2005).

7. Choice of bag of words or any combination of term dependence among the query terms (as illustrated in Figure 2(a)), per query, according to what gives the best performance each time. This is an upper bound run, included to show the margin for improvement we can expect to achieve by selecting a bag of words or term dependence each time.
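The upper bound run (7) amounts to an oracle that, for each query, keeps whichever formulation scored best. A minimal sketch of that selection follows. The data layout, function names, and scores are hypothetical, and in practice the scores would come from evaluating every candidate formulation against the relevance judgments.

```python
def upper_bound_run(scores_per_query):
    """Pick, per query, the formulation with the highest effectiveness score.

    scores_per_query: {query_id: {formulation: score}} (hypothetical layout).
    """
    best = {}
    for qid, scores in scores_per_query.items():
        formulation = max(scores, key=scores.get)
        best[qid] = (formulation, scores[formulation])
    return best


def mean_score(run):
    """Average the selected scores over all queries."""
    return sum(score for _, score in run.values()) / len(run)


# Made-up scores for two queries and two candidate formulations each.
toy_scores = {
    "q1": {"white house": 0.2, '"white house"': 0.5},
    "q2": {"rain man": 0.4, '"rain man"': 0.1},
}
oracle = upper_bound_run(toy_scores)
print(oracle, mean_score(oracle))
```

Because the oracle consults the relevance judgments, its score is a ceiling rather than an achievable run; it only shows how much room there is for any per-query decision method.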

The GT and ATC models used in runs (3) and (4), respectively, are explained in Section 4.1.

For all seven runs, the ranking model is a unigram, query likelihood, Dirichlet-smoothed language model. We implement term dependence using the Indri query language operator for ordered windows, #1(). For example, #1(white house) matches white house as an exact phrase. We use no stemming and remove stopwords from the queries only (as in (Metzler & Croft, 2005)). We use Indri 5.8 for indexing and retrieval of at most 1000 documents per query. We evaluate retrieval effectiveness using four standard measures of low-depth and gradually deeper precision: MRR, P@10, NDCG@20, and MAP. We report each measure averaged over all 282 queries, not separately per query. We retrieve documents from Disks 4-5 (minus the Congressional Record for TREC 7-8) for queries 301-450 and from ClueWeb09B for queries 1-200.
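To make the query formulation concrete, the sketch below turns a phrasing decision into an Indri query string. Wrapping the parts in Indri's #combine operator is our assumption; the text only states that dependent terms are placed inside #1(). The function name and the segment representation are hypothetical.

```python
def to_indri(segments):
    """Build an Indri query from a list of segments (each a list of terms).

    Multi-term segments become #1() ordered windows, i.e. exact phrases;
    single terms are left as plain bag-of-words terms.
    """
    parts = []
    for seg in segments:
        if len(seg) > 1:
            parts.append("#1(" + " ".join(seg) + ")")
        else:
            parts.append(seg[0])
    return "#combine(" + " ".join(parts) + ")"


# Bag of words versus a user-chosen phrase for the same query.
print(to_indri([["white"], ["house"], ["tours"]]))
print(to_indri([["white", "house"], ["tours"]]))
```

A bag of words run and a term dependence run therefore differ only in how the segments are grouped before the query is submitted.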

## 4.1 The GT and ATC term dependence models

GT and ATC detect which queries are more likely to be non-compositional. Non-compositional queries are queries whose meaning cannot be deduced from the meaning of their composing terms, such as hot dog or red tape. These queries must be treated as fixed phrases in IR (Lioma et al., 2015). GT works as follows:

• Step 1: It generates “perturbed” queries, where a single query term at a time is replaced by a synonym.

• Step 2: It produces a language model for each term in the original query and in each perturbed query (using distributional semantics of that term, extracted from some large corpus).

• Step 3: It combines the language models of the query terms to produce a language model for each query and for each perturbed query.

• Step 4: It computes the divergence between the language models of (a) the query and (b) its perturbed queries; the higher this divergence, the more non-compositional the query.

Lioma et al. (Lioma, Kothari, & Schuetze, 2011) showed that retrieval performance improves when non-compositional queries (detected in the above way) are submitted to the IR system inside quotes (i.e. are treated as fixed phrases of strong term dependence).
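Steps 1 and 4 above can be sketched as follows. Several choices here are assumptions made only for illustration: the synonym dictionary is hypothetical, the divergence is taken to be the Kullback-Leibler divergence (the text does not name one), and the divergences to the perturbed queries are averaged into a single score.

```python
import math


def perturb(query_terms, synonyms):
    """Step 1: replace one query term at a time by each of its synonyms."""
    perturbed = []
    for i, term in enumerate(query_terms):
        for syn in synonyms.get(term, []):
            perturbed.append(query_terms[:i] + [syn] + query_terms[i + 1:])
    return perturbed


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two language models given as {word: probability}."""
    total = 0.0
    for word, pw in p.items():
        if pw > 0:
            total += pw * math.log(pw / (q.get(word, 0.0) + eps))
    return total


def noncompositionality(query_lm, perturbed_lms):
    """Step 4 (assumed aggregation): mean divergence to the perturbed queries."""
    return sum(kl_divergence(query_lm, lm) for lm in perturbed_lms) / len(perturbed_lms)


# Hypothetical synonyms for a non-compositional query.
print(perturb(["hot", "dog"], {"hot": ["warm"], "dog": ["hound", "puppy"]}))
```

Under this reading, replacing hot by warm should shift the language model of hot dog considerably, which is what marks the query as non-compositional.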

The GT model builds the language model of each query term (step 2 above) as follows:

$PGTq,t=(r+1)S(ffr+1)CqS(ffr)forr>0$(6)

where PGT(q, t) is the probability of a term t with frequency r in query q, ff is a vector with frequencies for term frequencies (also known as double counts), Cq is the count of all terms in the context windows of q, and S is a function fitted through the observed values of ff to get the expected count of these values (see (Gale & Sampson, 1995; Lioma et al., 2015) for more). For zero count values, the probability is calculated as follows:

$P_{GT}(q,t)=\frac{ff_1}{C_q} \quad \text{for } r=0$ (7)

where $ff_1$ is the frequency of frequencies of hapax legomena (events occurring once). We extract the context windows from Disks 4-5 for queries 301-450, and from ClueWeb09B for queries 1-200, exactly as described in Lioma et al. (2015). The above produces a language model for each term per query or perturbation. To produce one language model for the whole query or perturbation, we sort the language models of their terms and use the median of their values. We refer to this as GT median.
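Equations 6 and 7 and the median combination can be illustrated with a short Python sketch. The names below are hypothetical. The smoothing function S is passed in as a callable, and the usage example substitutes the raw (unsmoothed) frequencies of frequencies for it, which is a simplification rather than the fitted function of Gale and Sampson (1995). The median is taken per vocabulary entry across the term models, which is one possible reading of the description above.

```python
from statistics import median


def gt_probability(r, smoothed_ff, ff1, context_size):
    """Good-Turing probability of a term seen r times (Equations 6 and 7).

    smoothed_ff(r) plays the role of S(ff_r); context_size is C_q.
    """
    if r == 0:
        return ff1 / context_size
    return (r + 1) * smoothed_ff(r + 1) / (context_size * smoothed_ff(r))


def combine_median(term_lms):
    """Combine per-term language models into one model for the query.

    Assumption: take the median value per word, with 0.0 for absent words.
    """
    vocab = sorted(set().union(*term_lms))
    return {w: median(lm.get(w, 0.0) for lm in term_lms) for w in vocab}


# Toy frequencies of frequencies: 10 terms seen once, 4 twice, 2 three times.
ff = {1: 10, 2: 4, 3: 2}
p_seen_once = gt_probability(1, lambda r: ff[r], ff1=ff[1], context_size=100)
p_unseen = gt_probability(0, lambda r: ff[r], ff1=ff[1], context_size=100)
```

With these toy counts, a term seen once receives (2 × 4) / (100 × 10) = 0.008, while the unseen case receives 10 / 100 = 0.1.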

The ATC model follows the same high-level methodology as GT, with the difference that it produces vectors instead of language models for each query term in steps 2-3, and it computes the vector distance (instead of language model divergence) in step 4. Specifically, ATC builds a vector for each term (in step 2), where the elements of that vector correspond to that term’s distributional semantics. The weight of each element in the vector is computed as the average of the weights of that term in all its context windows ($w_{it}$) as follows:

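$w_{it}=\frac{\left(0.5+0.5\,\frac{f_{it}}{max_f}\right)\log\left(\frac{N}{n(t)}\right)}{\sqrt{\sum_{i=1}^{N}\left[\left(0.5+0.5\,\frac{f_{it}}{max_f}\right)\log\left(\frac{N}{n(t)}\right)\right]^{2}}}$ (8)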

where $w_{it}$ is the weight of term $t$ in context window $i$, $f_{it}$ is the frequency of $t$ in context window $i$, $max_f$ is the maximum frequency of any term in any context window, $N$ is the total number of context windows, and $n(t)$ is the number of context windows containing $t$. The vectors of all query terms are combined by pointwise multiplication.
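A minimal Python sketch of the weighting in Equation 8 and of the pointwise combination is given below. It assumes the denominator is the usual cosine (square-root) normalisation of ATC weighting, and it represents a term by its list of frequencies across all N context windows. The function names and toy values are illustrative only.

```python
import math


def atc_weights(freqs, max_f):
    """ATC weights of one term across all N context windows (Equation 8).

    freqs[i] is f_it (0 where the term is absent); max_f is the maximum
    frequency of any term in any window. Cosine normalisation is assumed.
    """
    n_windows = len(freqs)
    n_t = sum(1 for f in freqs if f > 0)
    if n_t == 0:
        return [0.0] * n_windows
    idf = math.log(n_windows / n_t)
    raw = [(0.5 + 0.5 * f / max_f) * idf for f in freqs]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm > 0 else 0.0 for x in raw]


def combine_vectors(vectors):
    """Combine the vectors of all query terms by pointwise multiplication."""
    combined = [1.0] * len(vectors[0])
    for vec in vectors:
        combined = [a * b for a, b in zip(combined, vec)]
    return combined


# Toy term occurring twice in window 0 and once in window 2, out of 4 windows.
weights = atc_weights([2, 0, 1, 0], max_f=2)
```

Because of the normalisation, the squared weights of a term sum to 1 whenever the term does not occur in every window; a term occurring in all windows has a zero log factor and therefore zero weights.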

## 4.2 Parameter Tuning

We tune parameter μ of the Dirichlet-smoothed ranking model in this range: {100, 500, 800, 1000, 2000, 3000, 4000, 5000, 8000, 10000}. We also tune the threshold θ of the Good-Turing (GT) and ATC models, which controls how many queries to select as term dependent each time, in this range: {1 … 45} per TREC batch of 50 queries, identically to Lioma et al. (2015). To make sure that our results are not overfitted to the specific queries used in this experiment, we tune each parameter per evaluation measure using 3-fold cross-validation, and we report the average of the three test folds.
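The tuning protocol can be sketched as follows, assuming that a per-query score is already available for every candidate parameter value. The function name and the interleaved fold assignment are assumptions for illustration; the text does not say how queries are assigned to folds.

```python
def three_fold_tune(scores, n_folds=3):
    """Tune a parameter by cross-validation and return the mean test score.

    scores maps each candidate parameter value to a list of per-query scores
    (same query order for every value). Fold assignment is interleaved,
    which is an assumption of this sketch.
    """
    n_queries = len(next(iter(scores.values())))
    folds = [list(range(i, n_queries, n_folds)) for i in range(n_folds)]
    test_means = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        # Pick the value that maximises the summed score on the training folds.
        best = max(scores, key=lambda p: sum(scores[p][i] for i in train_idx))
        test_means.append(sum(scores[best][i] for i in test_idx) / len(test_idx))
    return sum(test_means) / n_folds


# Hypothetical per-query scores for two values of mu over six queries.
toy_scores = {
    100: [0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
    1000: [0.9, 0.1, 0.1, 0.9, 0.1, 0.1],
}
tuned = three_fold_tune(toy_scores)
```

In this toy case the value chosen on the training folds is not always the best one for the held-out fold, which is why the reported score reflects performance on unseen queries rather than the best achievable score.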

## 4.3 Experimental Findings

Table 4 shows the results of our retrieval runs. Comparing user-selected to system-selected term dependence, user selections are better for MRR on ClueWeb09, whereas system selections are better in all other cases. User and system assessments overlap by 30.4% on average, meaning that it is the remaining 69.6% that drives this behavior of MRR on ClueWeb09.

Table 4

Retrieval effectiveness of manual (user-specified) versus automatic (algorithmically decided) term dependence. %UB is the % difference from the upper bound. %TD is the % of queries out of all 282 queries that use term dependence. USER CHOICE uses either bag of words or phrases, as chosen by manual user assessments. (We adopt the most popular user choice on CrowdFlower.) Out of the 6 automatic methods, MRF_S and MRF_F use term dependence for all queries and GT and ATC use term dependence for a subset of the queries. UPPER BOUND uses bag of words or phrasing, whichever fetches the best score. Bold in red boxes marks best scores (excluding the upper bound). N/A denotes not applicable.

Comparing both user- and system-selected term dependence to the upper bound, we see that users choose more term dependence (32% for Disks 4-5 queries and 38% for ClueWeb09 queries) than is actually required for optimal retrieval performance (between 1.6% and 6.6% for Disks 4-5 queries and between 9.3% and 17% for ClueWeb09 queries). The upper bound choice of term dependence is 8% on average across all datasets and evaluation measures. This value is much closer to the 6% reported in the literature for web search (Hayes & Krippendorff, 2007) than the user choice of term dependence, which is on average (32% + 38%)/2 = 35%. In practice, this means that there is clear room for improving the automatic selection of term dependence through stricter selection. Interestingly, the users’ intuitive, and possibly linguistic, interpretation of term dependence is the most damaging of all to retrieval performance.

The bag of words (BOW) run is mostly, but not always, the best method across all datasets and evaluation measures. Note that bag of words was more often the choice of users with higher CF trust (cf. Figure 3). The reason why BOW tends to perform better overall than non-BOW approaches in our experiments may be connected to the distribution of query length in our dataset (shown in Table 5). Because we have removed 1-term queries from the TREC query sets we use, the majority of the queries (161 out of 282) have between 3 and 5 terms. The longer the query, we reason, the more difficult it is to decide which part of the query, if any, has strong term dependence and should be treated as a fixed phrase. This difficulty is evident in the human choice of term dependence reported in Table 3, where user agreement on what part, if any, of the query should be placed between quotes decreases as query length increases, whereas user trust remains approximately the same. In most of the literature, experiments with TREC datasets are reported on the complete batches of queries, where the majority of queries contain 1-2 terms (i.e. they are relatively short). In those batches of queries, BOW is usually not the best performing method, because term dependence can be detected more easily between 2 terms than among 3-5 terms.

Table 5

Query length of the 282 TREC queries.

However, even though BOW performs better overall than our other methods, we cannot conclude that phrasing is unnecessary, because of the upper bound reported in the last row: BOW is quite far, performance-wise, from the upper bound, which sometimes uses phrasing and sometimes BOW (depending on which of the two fetches higher performance). Specifically, BOW is between 1% and 14.5% worse than the upper bound, meaning that using BOW at all times, for all queries, is not the best choice.
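The upper bound and the gap to it can be illustrated with a small Python sketch. It assumes per-query scores are available for the BOW and the phrased version of each query, and it computes the gap as a percentage of the upper bound's mean score, which is one plausible reading of the %UB column. All names and values are hypothetical.

```python
def upper_bound(bow_scores, phrase_scores):
    """Per query, keep whichever of BOW or phrasing scores higher."""
    return [max(b, p) for b, p in zip(bow_scores, phrase_scores)]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def percent_below_upper_bound(run_scores, ub_scores):
    """Gap between a run and the upper bound, as a % of the upper bound mean."""
    return 100.0 * (mean(ub_scores) - mean(run_scores)) / mean(ub_scores)


def percent_phrased(bow_scores, phrase_scores):
    """% of queries for which phrasing strictly beats BOW."""
    wins = sum(1 for b, p in zip(bow_scores, phrase_scores) if p > b)
    return 100.0 * wins / len(bow_scores)


# Hypothetical per-query scores for four queries.
bow = [0.5, 0.2, 0.4, 0.3]
phrased = [0.4, 0.6, 0.4, 0.1]
ub = upper_bound(bow, phrased)
gap = percent_below_upper_bound(bow, ub)
share = percent_phrased(bow, phrased)
```

In this toy case phrasing helps only one query in four, yet always using BOW still leaves it roughly 22% below the upper bound, which mirrors the argument that a small share of phrased queries can account for a sizeable gap.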

We also see in Table 4 that the lower the depth of precision (i.e. measured by NDCG@20, P@10, MRR), the harder it is to approach the upper bound for ClueWeb09. We see this trend also in Figure 4, which shows that the highest gain is obtained for MAP and NDCG@20 when users agree approximately 70% and 90% respectively, whereas the highest P@10 and MRR gain is obtained when users agree approximately 20%. That is, improving low-depth precision is a much tougher task.

Figure 4

Gain of user choice of phrasing or not over bag of words (y axis) versus user agreement (x axis). Binned. For positive y axis values, user choice > bag of words, and vice versa. The straight line marks no difference between user choice and bag of words.

To further understand the above results, we look at the top 5 queries where user choice outperformed the system choice, and vice versa (Tables 6 and 7). Several of the queries where user selection beats system selection tend to contain geographical names (indiana, california, yellowstone, culpeper). In contrast, queries where system choice is best tend to contain more high-level descriptors that are more general and hence less discriminative in their meaning. Another reason why user choice underperforms system choice for several queries could be the intuitive interpretation of a phrase by users, e.g. british chunnel …, schengen agreement, magnetic levitation …, without considering that treating these as fixed phrases may leave out synonymous or alternative phrases that are perhaps equally or even more frequent, such as british channel tunnel, schengen treaty. Maglev may have been misinterpreted by users as a proper noun, when in fact it is an abbreviation of magnetic levitation, and as such an alternative rather than part of the same phrase.

Table 6

Top 5 queries where user choice outperformed system choice. The best performing phrasing options per query are shown under column QUERY.

Table 7

Top 5 queries where system choice outperformed user choice. The best performing phrasing options per query are shown under column QUERY.

## 5 Discussion and Conclusions

The main findings of our study of user-decided vs. system-decided term dependence are that i) there is little consensus among users about when to phrase query terms, ii) user-assessed term dependence differs significantly from algorithmically-assessed term dependence, and iii) the potential retrieval gain that can be fetched by any type of term dependence over a bag of words baseline is fairly low but non-negligible, with improvements possible in 8% of the queries. We also see that improving on low-depth precision is a much harder task, and that user-decided term dependence for low-depth precision measures can outperform other approaches. As low-depth precision is important to users, this may explain why users use phrase operators in a small share of their searches as indicated in related work (Jansen, Spink, & Pedersen, 2005).

There are some limitations to our work. We use TREC queries, not users’ own queries, and we evaluate retrieval with TREC relevance assessments, not by asking users. We do so for the sake of replicating and comparing to existing results. An explicit assumption here is that users can judge query phrasing for a query that is not their own. For 99.9% of the assessed queries, users explicitly stated that they understood the queries they assessed. Even though understanding a query is not synonymous with cognitively formulating an information need and expressing it as a query, this study uses the former to approximate the latter, as is common practice in such studies (Lioma, Larsen, & Schutze, 2011).

Furthermore, the low agreement among users about term dependence, combined with the CrowdFlower setup of only allowing one choice, meant that we could not use the training tasks as a quality filter as initially intended. The question is then if the quality of the crowdsourcing assessments is too low. We believe that most of the collected assessments are genuine, because (i) we chose users with the highest trust score provided by CrowdFlower, (ii) there is some agreement among users, and (iii) many complained about unfairness during training. Fraudulent users, we assume, are unlikely to spend extra time giving feedback (albeit negative) on the task. The fact that very few chose the “I do not understand the query” option indicates that there were no significant language fluency issues; if users were not fluent enough to understand a query, they would have skipped it.

Finally, even though we experiment with two standard TREC datasets containing 282 queries, and even though we make every effort to avoid overfitting by using 3-fold cross-validation, our results may not always generalize to other data. We have chosen one dataset that is more representative of web search (ClueWeb09B) and one that is representative of more curated ad hoc search (Disks 4&5), but there are several other domains and contexts that are not represented in our experimental setup. We thus conclude that our findings are only reasonably valid for the domains represented by our TREC datasets.

In summary, in our experiments user-defined term dependence improves retrieval performance in a minority of queries and mainly for low-depth precision. Some gains are possible for certain queries, and the most promising direction to realise these improvements appears to be to identify these queries automatically, either statistically or by further developing linguistically informed methods such as those in (Lioma et al., 2015). In the future, we plan to investigate the effects of the strength or degree of term dependence. We did not do so in this study, to keep our scenario similar to the real-life search scenario of using quotes to search for phrases. However, the automatic approaches (GT and ATC from (Lioma et al., 2015)) output a degree of term dependence, and we have collected 10 assessments of user-decided term dependence per query, so both sources could support such an analysis.

## Acknowledgements

This work has been partially funded by the following grants: REACT (Responsible Impact), supported by Det Obelske Familiefond Danmark; QUARTZ (Quantum Information Access and Retrieval Theory), supported by Horizon 2020 Marie Skłodowska-Curie Innovative Training Networks.

## References

• Bendersky, M., Croft, W. B., & Smith, D. A. (2011, June). Joint annotation of search queries. Paper presented at the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 102-111). Association for Computational Linguistics.

• Bergsma, S., & Wang, Q. I. (2007). Learning noun phrase query segmentation. Paper presented at the Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic (pp. 819-826). DBLP.

• Blanco, R., Halpin, H., Herzig, D. M., Mika, P., Pound, J., Thompson, H. S., & Tran Duc, T. (2011, July). Repeatable and reliable search system evaluation using crowdsourcing. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (pp. 923-932). ACM. http://dx.doi.org/10.1145/2009916.2010039

• Fagan, J. L. (1989). The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, 40(2), 115-132. http://dx.doi.org/10.1002/(SICI)1097-4571(198903)40:2%3C115::AID-ASI6%3E3.0.CO;2-B

• Fenichel, C. H. (1981). Online searching: Measures that discriminate among users with different types of experiences. Journal of the Association for Information Science and Technology, 32(1), 23-32. http://dx.doi.org/10.1002/asi.4630320104

• Fidel, R. (1991). Searchers’ selection of search keys: III. Searching styles. Journal of the Association for Information Science & Technology, 42(7), 515-527. http://dx.doi.org/10.1002/(SICI)1097-4571(199108)42:7%3C515::AID-ASI6%3E3.0.CO;2-F

• Gale, W. A., & Sampson, G. (1995). Good-turing frequency estimation without tears. Journal of quantitative linguistics, 2(3), 217-237. http://dx.doi.org/10.1080/09296179508590051

• Hagen, M., Potthast, M., Beyer, A., & Stein, B. (2012, October). Towards optimum query segmentation: in doubt without. In Proceedings of the 21st ACM international conference on Information and knowledge management (pp. 1015-1024). ACM. http://dx.doi.org/10.1145/2396761.2398398

• Hagen, M., Potthast, M., Stein, B., & Bräutigam, C. (2011, March). Query segmentation revisited. In Proceedings of the 20th international conference on World wide web (pp. 97-106). ACM. http://dx.doi.org/10.1145/1963405.1963423

• Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication methods and measures, 1(1), 77-89. http://dx.doi.org/10.1080/19312450709336664

• Hsieh-Yee, I. (1993). Effects of search experience and subject knowledge on the search tactics of novice and experienced searchers. Journal of the Association for Information Science & Technology, 44(3), 161-174. http://dx.doi.org/10.1002/(SICI)1097-4571(199304)44:3%3C161::AID-ASI5%3E3.0.CO;2-8

• Jansen, B. J., Booth, D. L., & Spink, A. (2009). Patterns of query reformulation during Web searching. Journal of the American Society for Information Science & Technology, 60(7), 1358–1371. http://dx.doi.org/10.1002/asi.21071

• Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998, April). Real life information retrieval: A study of user queries on the web. In ACM SIGIR Forum (Vol. 32, No. 1, pp. 5-17). ACM. http://dx.doi.org/10.1145/281250.281253

• Jansen, B. J., Spink, A., & Pedersen, J. (2005). A temporal comparison of AltaVista Web searching. Journal of the Association for Information Science and Technology, 56(6), 559-570. http://dx.doi.org/10.1002/asi.20145

• Jochim, C., Lioma, C., & Schütze, H. (2011, June). Expanding queries with term and phrase translations in patent retrieval. In Information Retrieval Facility Conference (pp. 16-29). Springer, Berlin, Heidelberg. http://dx.doi.org/10.1007/978-3-642-21353-3_3

• Lioma, C., & Hansen, N. D. (2017). A study of metrics of distance and correlation between ranked lists for compositionality detection. Cognitive Systems Research, 44, 40-49. http://dx.doi.org/10.1016/j.cogsys.2017.03.001

• Lioma, C., Kothari, A., & Schuetze, H. (2011, July). Sense discrimination for physics retrieval. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (pp. 1101-1102). ACM. http://dx.doi.org/10.1145/2009916.2010069

• Lioma, C., Larsen, B., & Schutze, H. (2011, September). User Perspectives on Query Difficulty. In International Conference on Advances in Information Retrieval Theory (Vol.6931, pp.3-14). Springer-Verlag. Berlin, Heidelberg. http://dx.doi.org/10.1007/978-3-642-23318-0_3

• Lioma, C., Larsen, B., Schütze, H., & Ingwersen, P. (2010, August). A subjective logic formalisation of the principle of polyrepresentation for information needs. In Proceedings of the third symposium on Information interaction in context (pp. 125-134). ACM. http://dx.doi.org/10.1145/1840784.1840804

• Lioma, C., Simonsen, J. G., Larsen, B., & Hansen, N. D. (2015, August). Non-compositional term dependence for information retrieval. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 595-604). ACM. http://dx.doi.org/10.1145/2766462.2767717

• Lioma, C., & van Rijsbergen, C. K. (2008). Part of speech n-grams and information retrieval. Revue française de linguistique appliquée, 13(1), 9-22.

• Metrikov, P., Pavlu, V., & Aslam, J. A. (2015, October). Aggregation of crowdsourced ordinal assessments and integration with learning to rank: A latent trait model. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (pp. 1391-1400). ACM. http://dx.doi.org/10.1145/2806416.2806492

• Metzler, D., & Croft, W. B. (2005, August). A Markov random field model for term dependencies. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 472-479). ACM. http://dx.doi.org/10.1145/1076034.1076115

• Michelbacher, L., Evert, S., & Schütze, H. (2011). Asymmetry in corpus-derived and human word associations. Corpus Linguistics and Linguistic Theory, 7(2), 245-276. http://dx.doi.org/10.1515/cllt.2011.012

• Michelbacher, L., Kothari, A., Forst, M., Lioma, C., & Schütze, H. (2011, July). A cascaded classification approach to semantic head recognition. Paper presented at the Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 793-803). Association for Computational Linguistics.

• Mishne, G., & De Rijke, M. (2005, March). Boosting web retrieval through query operations. In the European Conference on Information Retrieval (pp. 502-516). Springer, Berlin, Heidelberg. http://dx.doi.org/10.1007/978-3-540-31865-1_36

• Oldroyd, B. K., & Citroen, C. L. (1977). Study of strategies used in on-line searching. Online Review, 1(4), 295-310. http://dx.doi.org/10.1108/eb023957

• Petersen, C., Simonsen, J. G., & Lioma, C. (2016). Power law distributions in information retrieval. ACM Transactions on Information Systems (TOIS), 34(2), 1-37. http://dx.doi.org/10.1145/2816815

• Ramanath, R., Choudhury, M., Bali, K., & Roy, R. S. (2013, August). Crowd prefers the middle path: A new iaa metric for crowdsourcing reveals turker biases in query segmentation. Paper presented at the Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 1713-1722).

• Saha Roy, R., Ganguly, N., Choudhury, M., & Laxman, S. (2012, August). An IR-based evaluation framework for web search query segmentation. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (pp. 881-890). ACM. http://dx.doi.org/10.1145/2348283.2348401

• Saracevic, T., & Kantor, P. (1988). A study of information seeking and retrieving. III. Searchers, searches, and overlap. Journal of the American Society for Information Science, 39(3), 161–176. http://dx.doi.org/10.1002/(SICI)1097-4571(198805)39:3%3C197::AID-ASI4%3E3.0.CO;2-A

• Spink, A., & Saracevic, T. (1997). Interaction in information retrieval: Selection and effectiveness of search terms. Journal of the Association for Information Science & Technology, 48(8), 741-761. http://dx.doi.org/10.1002/(SICI)1097-4571(199708)48:8%3C741::AID-ASI7%3E3.0.CO;2-S

• Spink, A., Wolfram, D., Jansen, M. B., & Saracevic, T. (2001). Searching the web: The public and their queries. Journal of the Association for Information Science and Technology, 52(3), 226-234. http://dx.doi.org/10.1002/1097-4571(2000)9999:9999%3C::AID-ASI1591%3E3.0.CO;2-R

• Wang, P., Berry, M. W., & Yang, Y. (2003). Mining longitudinal Web queries: Trends and patterns. Journal of the Association for Information Science and Technology, 54(8), 743-758. http://dx.doi.org/10.1002/asi.10262

• Yilmaz, E., Kazai, G., Craswell, N., & Tahaghoghi, S. M. (2012, August). On judgments obtained from a commercial search engine. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (pp. 1115-1116). ACM. http://dx.doi.org/10.1145/2348283.2348496

## Footnotes

• 1
• 2

We collected 10 assessments per query. This does not imply that each user assessed 10 queries. Individual users assessed a different number of queries. Once we had 10 assessments per query, we removed that query from the pool of queries that were available for assessment in CrowdFlower.

## About the article

Accepted: 2018-01-17

Published Online: 2018-05-22

Citation Information: Data and Information Management, Volume 2, Issue 1, Pages 1–14, ISSN (Online) 2543-9251,
