Published by De Gruyter Mouton, August 18, 2016

Lexeme-based collexeme analysis with DepCluster

  • Xuri Tang

Abstract

This paper introduces a tool for lexeme-based collexeme analysis. The tool uses cluster analysis to generate the typical constructions of a given lexeme and computes the collostruction strength of the constructions. These two functions enable the tool to facilitate efficient studies of lexeme–construction interactions in large-scale data. As a case study, the paper examines the lexeme “cause”. It shows that the tool provides strong statistical evidence that confirms earlier findings about the negative semantic prosody of the lexeme. In addition, the collexeme analyses with the tool show that the lexeme is typically used in attitudinal constructions. The case study demonstrates that the tool can enhance the efficiency, comprehensiveness and granularity of lexeme-based collexeme analysis.

Funding statement: The National Social Science Fund of China (Grant/Award Number: 11CYY030).

References

Blondel, V. D., A. Gajardo, M. Heymans, P. Senellart & P. Van Dooren. 2004. A measure of similarity between graph vertices: Applications to synonym extraction and web searching. SIAM Review 46(4). 647–666. doi:10.1137/S0036144502415960

Boas, H. C. & I. A. Sag. 2012. Sign-based construction grammar. Stanford, CA: CSLI Publications/Center for the Study of Language and Information.

Bybee, J. L. 2013. Usage-based theory and exemplar representations of constructions. In T. Hoffmann & G. Trousdale (eds.), The Oxford handbook of construction grammar, 49–69. New York: Oxford University Press.

Bybee, J. L., R. D. Perkins & W. Pagliuca. 1994. The evolution of grammar: Tense, aspect, and modality in the languages of the world. Chicago: University of Chicago Press.

Depraetere, I. 2012. Time in sentences with modal verbs. In R. I. Binnick (ed.), The Oxford handbook of tense and aspect, 989–1019. New York: Oxford University Press. doi:10.1093/oxfordhb/9780195381979.013.0035

Ellis, N. C. & F. Ferreira-Junior. 2009. Construction learning as a function of frequency, frequency distribution, and function. The Modern Language Journal 93(3). 370–385. doi:10.1111/j.1540-4781.2009.00896.x

Ellis, N. C. & D. Larsen-Freeman. 2009. Constructing a second language: Analyses and computational simulations of the emergence of linguistic constructions from usage. Language Learning 59. 90–125. doi:10.1111/j.1467-9922.2009.00537.x

Ester, M., H.-P. Kriegel, J. Sander & X. Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Paper presented at the Second International Conference on Knowledge Discovery and Data Mining (KDD-96).

Fillmore, C. J. 2013. Berkeley construction grammar. In T. Hoffmann & G. Trousdale (eds.), The Oxford handbook of construction grammar, 111–132. New York: Oxford University Press. doi:10.1093/oxfordhb/9780195396683.013.0007

Fillmore, C. J., P. Kay & M. C. O'Connor. 1988. Regularity and idiomaticity in grammatical constructions: The case of let alone. Language 64. 501–538. doi:10.2307/414531

Goldberg, A. E. 1995. Constructions: A construction grammar approach to argument structure. Chicago: University of Chicago Press.

Goldberg, A. E. 1996. Construction grammar. In K. Brown & J. Miller (eds.), Concise encyclopedia of syntactic theories, 68–71. Oxford: Pergamon.

Goldberg, A. E. 2003. Constructions: A new theoretical approach to language. Trends in Cognitive Sciences 7(5). 219–224. doi:10.1016/S1364-6613(03)00080-9

Goldberg, A. E. 2006. Constructions at work: The nature of generalization in language. Oxford & New York: Oxford University Press.

Goldberg, A. E. 2013. Constructionist approaches. In T. Hoffmann & G. Trousdale (eds.), The Oxford handbook of construction grammar, 15–31. New York: Oxford University Press. doi:10.1093/oxfordhb/9780195396683.013.0002

Gordon, A. S. & R. Swanson. 2007. Generalizing semantic role annotations across syntactically similar verbs. Paper presented at the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic. doi:10.21236/ADA470421

Gries, S. T. 2005. Syntactic priming: A corpus-based approach. Journal of Psycholinguistic Research 34(4). 365–399. doi:10.1007/s10936-005-6139-3

Gries, S. T. 2012. Frequencies, probabilities, and association measures in usage-/exemplar-based linguistics. Studies in Language 36(3). 477–510. doi:10.1075/bct.67.02gri

Gries, S. T. 2013. Data in construction grammar. In T. Hoffmann & G. Trousdale (eds.), The Oxford handbook of construction grammar, 93–110. New York: Oxford University Press. doi:10.1093/oxfordhb/9780195396683.013.0006

Gries, S. T. 2015. More (old and new) misunderstandings of collostructional analysis: On Schmid and Küchenhoff (2013). Cognitive Linguistics 26. 505. doi:10.1515/cog-2014-0092

Gries, S. T. & A. Stefanowitsch. 2004. Extending collostructional analysis: A corpus-based perspective on 'alternations'. International Journal of Corpus Linguistics 9(1). 97–129. doi:10.1075/ijcl.9.1.06gri

Halkidi, M., Y. Batistakis & M. Vazirgiannis. 2001. On clustering validation techniques. Journal of Intelligent Information Systems 17(2–3). 107–145. doi:10.1023/A:1012801612483

Halkidi, M. & M. Vazirgiannis. 2001. Clustering validity assessment: Finding the optimal partitioning of a data set. Paper presented at the 2001 IEEE International Conference on Data Mining.

Hilpert, M. 2007. Germanic future constructions: A usage-based approach to grammaticalization. Houston, TX: Rice University Ph.D. dissertation. http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3257333 doi:10.1075/cal.7

Hunston, S. 2007. Semantic prosody revisited. International Journal of Corpus Linguistics 12(2). 249–268. doi:10.1075/bct.18.06hun

Hunston, S. & G. Francis. 2000. Pattern grammar: A corpus-driven approach to the lexical grammar of English. http://site.ebrary.com/lib/ascc/Doc?id=5000193 doi:10.1075/scl.4

Johannesson, M. 2000. Modelling asymmetric similarity with prominence. British Journal of Mathematical and Statistical Psychology 53(1). 121–139. doi:10.1348/000711000159213

Kawahara, D., D. W. Peterson, O. Popescu & M. Palmer. 2014. Inducing example-based semantic frames from a massive amount of verb uses. Paper presented at the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden. doi:10.3115/v1/E14-1007

Koo, T. & M. Collins. 2010. Efficient third-order dependency parsers. Paper presented at the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden.

Lakoff, G. 1987. Women, fire, and dangerous things: What categories reveal about the mind. Chicago: University of Chicago Press. doi:10.7208/chicago/9780226471013.001.0001

Langacker, R. W. 1987. Foundations of cognitive grammar. Stanford, CA: Stanford University Press.

Liu, Y., Z. Li, H. Xiong, X. Gao & J. Wu. 2010. Understanding of internal clustering validation measures. Paper presented at the 2010 IEEE International Conference on Data Mining. doi:10.1109/ICDM.2010.35

de Marneffe, M.-C. & C. D. Manning. 2008. The Stanford typed dependencies representation. Paper presented at the Coling 2008 workshop on Cross-Framework and Cross-Domain Parser Evaluation, Manchester, United Kingdom. doi:10.3115/1608858.1608859

McEnery, T. & A. Hardie. 2012. Corpus linguistics: Method, theory and practice. New York: Cambridge University Press.

Mortelmans, T. 2007. Modality in cognitive linguistics. In D. Geeraerts & H. Cuyckens (eds.), The Oxford handbook of cognitive linguistics, 869–889. New York: Oxford University Press.

Mukherjee, J. & S. T. Gries. 2009. Collostructional nativisation in New Englishes: Verb-construction associations in the International Corpus of English. English World-Wide 30(1). 27–51. doi:10.1075/eww.30.1.03muk

Nosovskiy, G. V., D. Liu & O. Sourina. 2008. Automatic clustering and boundary detection algorithm based on adaptive influence function. Pattern Recognition 41. 2757–2776. doi:10.1016/j.patcog.2008.01.021

Ortony, A., R. J. Vondruska, M. A. Foss & L. E. Jones. 1985. Salience, similes, and the asymmetry of similarity. Journal of Memory and Language 24(5). 569–594. doi:10.1016/0749-596X(85)90047-6

Palmer, F. R. 2001. Mood and modality, 2nd edn. Cambridge & New York: Cambridge University Press. doi:10.1017/CBO9781139167178

Rosch, E. 1978. Principles of categorization. In E. Rosch & B. B. Lloyd (eds.), Cognition and categorization, 27–48. Hillsdale, NJ: Lawrence Erlbaum.

Russell, S. J., P. Norvig & E. Davis. 2010. Artificial intelligence: A modern approach, 3rd edn. Upper Saddle River, NJ: Pearson.

Sagae, K. & A. S. Gordon. 2009. Clustering words by syntactic similarity improves dependency parsing of predicate-argument structures. Paper presented at the 11th International Conference on Parsing Technologies, Paris, France. doi:10.3115/1697236.1697273

Socher, R., J. Bauer, C. D. Manning & A. Y. Ng. 2013. Parsing with compositional vector grammars. Paper presented at ACL 2013.

Stefanowitsch, A. 2013. Collostructional analysis. In T. Hoffmann & G. Trousdale (eds.), The Oxford handbook of construction grammar, 290–306. New York: Oxford University Press. doi:10.1093/oxfordhb/9780195396683.013.0016

Stefanowitsch, A. & S. T. Gries. 2003. Collostructions: Investigating the interaction between words and constructions. International Journal of Corpus Linguistics 8(2). 209–243. doi:10.1075/ijcl.8.2.03ste

Stefanowitsch, A. & S. T. Gries. 2005. Covarying collexemes. Corpus Linguistics and Linguistic Theory 1(1). 1–43. doi:10.1515/cllt.2005.1.1.1

Stubbs, M. 1995. Collocations and semantic profiles: On the cause of the trouble with quantitative methods. Functions of Language 2(1). 1–22. doi:10.1075/fol.2.1.03stu

Theodoridis, S. & K. Koutroumbas. 2009. Pattern recognition, 4th edn. Burlington, MA & London: Academic Press.

Tullo, C. & J. Hurford. 2003. Modelling Zipfian distributions in language. Paper presented at the Language Evolution and Computation Workshop/Course at ESSLLI, Vienna.

Tversky, A. 1977. Features of similarity. Psychological Review 84. 327–352. doi:10.1016/B978-1-4832-1446-7.50025-X

Van der Auwera, J. & A. Ammann. 2005. Situational possibility; Epistemic possibility; Overlap between epistemic and situational possibility. In M. Haspelmath (ed.), The world atlas of language structures. Oxford & New York: Oxford University Press.

Webelhuth, G. 2012. The distribution of that-clauses in English: An SBCG account. In H. C. Boas & I. A. Sag (eds.), Sign-based construction grammar, 203–228. Stanford, CA: CSLI Publications/Center for the Study of Language and Information.

Wiechmann, D. 2008. On the computation of collostruction strength: Testing measures of association as expressions of lexical bias. Corpus Linguistics and Linguistic Theory 4(2). 253–290. doi:10.1515/CLLT.2008.011

Xiao, R. & T. McEnery. 2006. Collocations, semantic prosody, and near synonymy: A cross-linguistic perspective. Applied Linguistics 27(1). 103–129. doi:10.1093/applin/ami045

Appendix I: Technical issues in DepCluster

This appendix explains six important issues in implementing DepCluster: the retrieval of constructs, the measurement of construct similarity, clustering optimization, the choice of clustering algorithm, the choice of cluster validity index, and construct mergence. The retrieval of constructs relates to the construct retrieval module (discussed in Section 3); the other issues are involved in the construction induction module (also discussed in Section 3). Among them, the measurement of construct similarity, the choice of clustering algorithm and the choice of cluster validity index are the most important in determining clustering quality.

1 Construct retrieval

A construct, denoted by C=<O-Deps, I-Deps>, is automatically retrieved from a dependency tree. A dependency tree is an acyclic graph as shown in Figure 4. It consists of dependencies (directed and typed links) between words. Each dependency is denoted by the form SYNTACTIC-RELATION(head, complement). For instance, in the dependency VMOD(artifact, caused), the SYNTACTIC-RELATION is VMOD, the head is artifact and the complement is caused. DepCluster uses both dependency types and words to represent constructs.

Figure 4: A dependency tree.

It is easy to distinguish the I-Deps from the O-Deps in a dependency tree. Topologically, a dependency tree is a directed acyclic graph in which the relations among words are denoted by directed edges. The I-Deps of a word are all the incoming edges of the word, while the O-Deps are all its outgoing edges.
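For illustration, a minimal Python sketch of this retrieval step is given below. It assumes that a parsed sentence is available as a list of (relation, head, complement) triples; this representation, and the function name, are illustrative assumptions rather than DepCluster's actual interface.

    from typing import List, Tuple

    def retrieve_construct(deps: List[Tuple[str, str, str]], target: str):
        """Return <O-Deps, I-Deps> for the target word: O-Deps are the
        dependencies in which the target is the complement, I-Deps those
        in which it is the head (cf. Section 1 of this appendix)."""
        o_deps = [rel for rel, head, comp in deps if comp == target]
        i_deps = [rel for rel, head, comp in deps if head == target]
        return o_deps, i_deps

    # Example (2)-style data: VMOD(artifact, caused), AUXPASS(caused, being),
    # AGENT(caused, chlorofluorocarbons); the VMOD head word follows the
    # Figure 4 discussion above.
    deps = [("VMOD", "artifact", "caused"),
            ("AUXPASS", "caused", "being"),
            ("AGENT", "caused", "chlorofluorocarbons")]
    print(retrieve_construct(deps, "caused"))  # (['VMOD'], ['AUXPASS', 'AGENT'])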

The present research chooses dependency trees to represent constructions because Dependency Grammar is a well-formed, explicit and easy-to-interpret grammatical formalism. In Examples (1–6), the subject-verb relation between words is denoted by NSUBJ and the verb-object relation by DOBJ. With these terms, the relations between words are easily understood. In some other grammatical formalisms, however, complicated devices are needed to denote these relations. For instance, the parse tree path is used for this purpose in Phrase Structure Grammar, as proposed in Gordon and Swanson (2007) and Sagae and Gordon (2009). A parse tree path is a sequence of transitions from the target word to the governing node in a parse tree; the dependency type VMOD, for example, is expressed by the path “caused↑VBN↑VP↑NP↓NP↓NN↓artifact” (VBN, VP, NP and NN are Penn Treebank tags: VBN stands for past participle verbs, VP for verb phrases, NP for noun phrases and NN for common nouns). The parse tree path is clearly more obscure. Kawahara et al. (2014) also use Dependency Grammar to represent syntactic patterns when they use cluster analysis to build a database of selectional preferences.

The idea of representing a construction with I-Deps and O-Deps is supported by research in computational linguistics and other fields. Koo and Collins (2010) report better precision in dependency parsing when the factored model includes both I-Deps and O-Deps. Blondel et al. (2004) provide mathematical support: they introduce a method for computing the similarity score of two vertices on the basis of ancestor vertices and descendant vertices. In dependency trees, the ancestor vertices correspond to the O-Deps and the descendant vertices to the I-Deps.

2 Measurement of construct similarity

The measurement of similarity between constructs is based on two assumptions. The first assumption is that only the SYNTACTIC-RELATIONs participate in similarity computation, while the lexemes in the constructs and the linear order of dependencies do not. The second assumption is that the outer syntactic functions (denoted by O-Deps) are dependent on the inner syntactic structures (denoted by I-Deps).

With these two assumptions, the similarity between two constructs C1 and C2 is defined in Formula [1]. The similarity between the two constructs is the product of the similarity between the two sets of O-Deps and the similarity between the two sets of I-Deps. Examples (18–19) are two illustrations with constructs from Example (1) and Example (2). The way to compute Sim (O-Deps1, O-Deps2) is the same as the way to compute Sim (I-Deps1, I-Deps2). Therefore, the following discussion is focused on the computation of Sim (I-Deps1, I-Deps2).

[1] Sim(C1, C2) = Sim(<O-Deps1, I-Deps1>, <O-Deps2, I-Deps2>) = Sim(O-Deps1, O-Deps2) × Sim(I-Deps1, I-Deps2)
(18) Sim(O-Deps1, O-Deps2) = Sim({VMOD}, {VMOD})

(19) Sim(I-Deps1, I-Deps2) = Sim({AGENT}, {AUXPASS, AGENT})

The computation of Sim(I-Deps1, I-Deps2) involves two more issues. One is the similarity asymmetry. The similarity between two constructs might be different when the prototype is different, as denoted in Formula [2]. In the formula, Sim(I-Deps1, I-Deps2) denotes how similar I-Deps2 is to I-Deps1 when I-Deps1 is chosen as the prototype, and Sim(I-Deps2, I-Deps1) denotes how similar I-Deps1 is to I-Deps2 when I-Deps2 is chosen as the prototype.

[2] Sim(I-Deps1, I-Deps2) ≠ Sim(I-Deps2, I-Deps1)

Example (19) is a good illustration for the similarity asymmetry between constructs. With information from Example (1) and Example (2), the dependencies involved in the similarity computation are given below:

A=I-Deps1={AGENT(caused, duration)}.

B=I-Deps2={AUXPASS(caused, being), AGENT(caused, chlorofluorocarbons)}.

If A is used as the prototype, B is considered similar to A because the dependency type AGENT is also found in B even though B contains an extra dependency. However, if B is chosen as the prototype, the similarity between them should be lower because A contains only one dependency and is only a subset of B. A is more prototypical because it contains less specific information, while B is less prototypical because it contains more specific information.

The other issue is that dependency types have different weights in similarity estimation. Suppose the I-Deps of Example (4) is the prototype; the tasks are then to measure the similarity between this prototype and the I-Deps of Example (3), and between the prototype and the I-Deps of Example (5). The corresponding data are given in Example (20) as A, B and C, respectively. B differs from A and from C in one dependency type each, yet A and B are intuitively much more different, while B and C are more similar. This is because the XCOMP in A and the ADVMOD in C carry different weights in similarity measurement.

(20)

     A: I-Deps for Example (3)    B: I-Deps for Example (4)    C: I-Deps for Example (5)
     NSUBJ(caused, things)        NSUBJ(cause, it)             NSUBJ(cause, it)
     AUX(caused, have)            AUX(cause, can)              ADVMOD(cause, point)
     XCOMP(caused, drop)          DOBJ(cause, kinds)           DOBJ(cause, lot)

The weight of a dependency type is associated with its prominence in cognition. It is generally agreed that the measurement of proximity is influenced by the prominence of the stimuli involved (Johannesson 2000; Ortony et al. 1985; Tversky 1977). Johannesson (2000) proposes that experience-directed similarity is proportional to the prominence of stimuli and demonstrates with data that better estimates of model parameters can be obtained when the prominence of stimuli is taken into consideration. Ortony et al. (1985) explicitly discuss asymmetric similarity in language and attribute the asymmetry to an imbalance in the salience of the shared attributes: the more salient the shared attributes are, the more similar the pair should be. Accordingly, the similarity between B and C is higher because they share the more salient dependency type DOBJ and differ in one less salient type, while A and B are less similar because they differ in two salient dependency types, XCOMP and DOBJ.

In DepCluster, the salience of a dependency type is determined by its co-occurring frequency with the target lexeme. Table 4 lists the frequency of the dependency types in Examples (1–6). Their weights are computed on the basis of their frequency.

Table 4: Dependency weight computation.

             O-Deps                      I-Deps
Type     VMOD  RCMOD  ROOT  CCOMP   AGENT  AUXPASS  NSUBJ  AUX   XCOMP  DOBJ  ADVMOD
Freq.    2     1      2     1       2      1        4      3     2      2     1
Weight   0.33  0.17   0.33  0.17    0.33   0.17     0.67   0.50  0.33   0.33  0.17

Note: O-Deps, outer dependencies; I-Deps, inner dependencies.
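The weights in Table 4 are consistent with dividing each type's co-occurrence frequency by the number of constructs (six here). A small sketch of this normalization follows; DepCluster's exact normalization routine is not published in this appendix, so treat the sketch as an assumption that merely reproduces Table 4. The I-Deps of Example (6) are not listed here and are reconstructed so that the counts match the table.

    from collections import Counter

    def dependency_weights(constructs):
        """Weight each dependency type by its co-occurrence frequency with
        the target lexeme, normalized by the number of constructs
        (e.g. NSUBJ: 4/6 = 0.67, VMOD: 2/6 = 0.33, as in Table 4)."""
        n = len(constructs)
        counts = Counter(t for deps in constructs for t in deps)
        return {t: round(c / n, 2) for t, c in counts.items()}

    # I-Dep types of Examples (1-6); Example (6) reconstructed from Table 4
    i_deps = [{"AGENT"}, {"AUXPASS", "AGENT"}, {"NSUBJ", "AUX", "XCOMP"},
              {"NSUBJ", "AUX", "DOBJ"}, {"NSUBJ", "ADVMOD", "DOBJ"},
              {"NSUBJ", "AUX", "XCOMP"}]
    print(dependency_weights(i_deps))
    # e.g. {'AGENT': 0.33, 'AUXPASS': 0.17, 'NSUBJ': 0.67, 'AUX': 0.5, ...}
    # (key order may vary)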

With both issues accounted for as above, the similarity between two constructs can be computed with a similarity measure. Various measures have been proposed for similarity computation, including the Tanimoto measure, the Euclidean distance, Pearson's correlation coefficient and the cosine similarity (see Theodoridis and Koutroumbas (2009) for an excellent review of the topic). DepCluster uses the similarity measure proposed by Tversky (1977), given in Formula [3].

[3] Sim(Deps1, Deps2) = |Deps1 ∩ Deps2| / (|Deps1 ∩ Deps2| + α|Deps2 − Deps1| + β|Deps1 − Deps2|), with α + β = 1

In the formula, the set Deps1 is the prototype; “Deps1 ∩ Deps2” denotes the dependencies shared by the two constructs, “Deps2 − Deps1” the dependencies unique to Deps2, and “Deps1 − Deps2” the dependencies unique to Deps1. The two parameters α and β are adjustable, denoting the significance of the differences of Deps2 and Deps1, respectively. Note that the weights of the dependency types need to be normalized. For instance, in computing the similarity between Example (2) and Example (1), “I-Deps1 ∩ I-Deps2” is computed as follows:

I-Deps1 ∩ I-Deps2 = 0.33/(0.33 + 0.17) = 0.66

This is because both AUXPASS and AGENT are included in I-Deps2, so the weight of the shared dependency AGENT (0.33) is normalized by the total weight of I-Deps2 (0.33 + 0.17).

Formula [3] can be used to compute either Sim (O-Deps1, O-Deps2) or Sim (I-Deps1, I-Deps2). Thus, the similarity between Examples (2) and (1) when Example (1) is the prototype can be computed as follows:

Sim(C1, C2) = Sim(O-Deps1, O-Deps2) × Sim(I-Deps1, I-Deps2)
            = [(0.33/0.33) / (0.33/0.33 + 0.7 × 0 + 0.3 × 0)]
              × [(0.33/(0.33 + 0.17)) / (0.33/(0.33 + 0.17) + 0.7 × (0.17/(0.33 + 0.17)) + 0.3 × 0)]
            = 1 × 0.741 = 0.741

(The value 0.741 results from using the unrounded weights 2/6 and 1/6.)

Similarly, the similarity between Examples (1) and (2) when Example (2) is the prototype is computed as follows:

Sim(C2, C1) = Sim(O-Deps2, O-Deps1) × Sim(I-Deps2, I-Deps1)
            = [(0.33/0.33) / (0.33/0.33 + 0.7 × 0 + 0.3 × 0)]
              × [(0.33/0.33) / (0.33/0.33 + 0.7 × 0 + 0.3 × (0.17/(0.33 + 0.17)))]
            = 1 × 0.909 = 0.909
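These computations can be reproduced with a short script. The sketch below implements Formula [3] with the Table 4 weights kept as exact fractions. The normalization scheme is inferred from the two worked examples: shared weights and the non-prototype's unique weights are divided by the non-prototype set's total weight, and the prototype's unique weights by the prototype's total weight. This reading is an inference from the worked values, not an authoritative account of DepCluster's code.

    WEIGHTS = {  # Table 4: co-occurrence frequency / 6, unrounded
        "VMOD": 2/6, "RCMOD": 1/6, "ROOT": 2/6, "CCOMP": 1/6,
        "AGENT": 2/6, "AUXPASS": 1/6, "NSUBJ": 4/6, "AUX": 3/6,
        "XCOMP": 2/6, "DOBJ": 2/6, "ADVMOD": 1/6,
    }

    def w(types):
        """Total weight of a set of dependency types."""
        return sum(WEIGHTS[t] for t in types)

    def tversky(proto, other, alpha=0.7, beta=0.3):
        """Formula [3] with `proto` as the prototype set of dependency types."""
        inter = w(proto & other) / w(other)      # shared types
        a = w(other - proto) / w(other)          # unique to the non-prototype
        b = w(proto - other) / w(proto)          # unique to the prototype
        return inter / (inter + alpha * a + beta * b)

    def sim(c1, c2):
        """Formula [1]: construct similarity, with c = (O-Deps, I-Deps)."""
        return tversky(c1[0], c2[0]) * tversky(c1[1], c2[1])

    ex1 = ({"VMOD"}, {"AGENT"})               # Example (1)
    ex2 = ({"VMOD"}, {"AUXPASS", "AGENT"})    # Example (2)
    print(round(sim(ex1, ex2), 3))  # 0.741, Example (1) as prototype
    print(round(sim(ex2, ex1), 3))  # 0.909, Example (2) as prototype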

Table 5 lists the similarity measurement results for all the constructs in Examples (1–6), with α set to 0.7 and β to 0.3. Note that the table gives distance values (1 − similarity) rather than similarity values, as the distance values are used for illustration in the following sections.

Table 5: Measurement of distances among Examples (1–6).

           Ex. (1)   Ex. (2)   Ex. (3)   Ex. (4)   Ex. (5)   Ex. (6)
Ex. (1)    0         0.091     1         1         1         1
Ex. (2)    0.259     0         1         1         1         1
Ex. (3)    1         1         0         1         1         1
Ex. (4)    1         1         1         0         0.293     0.222
Ex. (5)    1         1         1         0.189     0         0.450
Ex. (6)    1         1         1         0.222     0.538     0

The α/β ratio has an influence on similarity computation. Figure 5 plots the hypothetical change of similarity against the change of the ratio. When the ratio is small, a change in it leads to sharp fluctuations in similarity; when the ratio is comparatively large, the influence of a change is rather insignificant. A smaller α/β ratio indicates that the differences of both constructs are taken into account in the similarity computation, while a bigger ratio biases the computation toward one construct.

Figure 5: The influence of the α/β ratio on similarity (the ratio is given in percentage).

3 Clustering optimization

DepCluster adopts the Hill-Climbing algorithm (Russell et al. 2010: 122) and iteratively performs several rounds of clustering to search for the best clustering result. Each iteration includes two operations: a clustering operation and a validation operation. The algorithm used in the clustering operation is a modified version of DBSCAN (Density-Based Spatial Clustering of Applications with Noise) (Ester et al. 1996); the validity index used in the validation operation is the S_Dbw validity index (Halkidi and Vazirgiannis 2001). The iteration continues until no improvement in the S_Dbw index has been made in M successive iterations, and the best clustering result is selected according to the index. The pseudocode is given as follows:

  1. Sort the list of constructs in descending order of their Epsilon values (explained in Section 4 of this appendix); this order is the current processing order Oi.

  2. Repeat the following steps until no improvement has been made in M successive iterations:

     (a) Apply DBSCAN to the construct list in the order Oi to obtain a clustering result Ri;

     (b) Assess Ri to obtain a validity index value Si;

     (c) If Si is greater than Soptimum: set Soptimum := Si, Roptimum := Ri and Ooptimum := Oi, reset the no-improvement counter, obtain Otemp by randomly swapping six elements in Ooptimum, set Oi := Otemp, and return to (a);

     (d) Otherwise, obtain Otemp by randomly swapping six elements in Ooptimum, set Oi := Otemp, and return to (a).

Note that each time a better validity index is obtained, the search starts again from the new settings and will run for up to M further iterations in pursuit of a better clustering result. Thus, a bigger M indicates a more thorough search. In the case study, M is set to 40 based on experiments with the data.
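A schematic Python rendering of this search loop is given below. It assumes that clustering and validation functions dbscan() and s_dbw_score() are supplied by the caller, and, following the pseudocode above, it treats a higher score as better; these names and the perturbation details are assumptions of the sketch.

    import random

    def hill_climb(constructs, epsilon, dbscan, s_dbw_score, M=40):
        """Search for the best clustering by perturbing the processing order;
        stops after M successive iterations without improvement."""
        order = sorted(constructs, key=epsilon, reverse=True)   # step 1
        best_score, best_result, best_order = float("-inf"), None, order
        stall = 0
        while stall < M:
            result = dbscan(order)            # clustering operation
            score = s_dbw_score(result)       # validation operation
            if score > best_score:            # improvement: restart the window
                best_score, best_result, best_order = score, result, order
                stall = 0
            else:
                stall += 1
            order = best_order[:]             # perturb the best order found
            idx = random.sample(range(len(order)), min(6, len(order)))
            for i, j in zip(idx, random.sample(idx, len(idx))):
                order[i] = best_order[j]      # randomly swap six elements
        return best_result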

4 Clustering algorithm

Cluster analysis has been widely studied in the machine-learning field; a good review can be found in Theodoridis and Koutroumbas (2009). DepCluster uses DBSCAN for the clustering operation. To cluster a list of constructs, DBSCAN begins with the first construct in the list and searches the list for neighbors whose distances to the construct are smaller than Epsilon, a distance threshold defined in advance. The neighbors thus discovered are used in turn to search for new neighbors until no new neighbor is found. If the number of neighbors found in this process surpasses a minimum defined in advance, the construct and its neighbors are grouped into a cluster and taken out of the list. In the present study, the threshold that allows a group of constructs to form a cluster is 1/N, where N is the number of input constructs.
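For comparison, the basic (unmodified) DBSCAN can be run over the Table 5 distances with scikit-learn, as sketched below. Because scikit-learn expects a symmetric precomputed matrix, the asymmetric distances are symmetrized here by taking the larger of the two directed values; the fixed eps of 0.3 and min_samples of 2 are illustrative choices, not DepCluster's settings.

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Distance matrix from Table 5, symmetrized with the elementwise maximum
    D = np.array([
        [0.000, 0.091, 1.000, 1.000, 1.000, 1.000],
        [0.259, 0.000, 1.000, 1.000, 1.000, 1.000],
        [1.000, 1.000, 0.000, 1.000, 1.000, 1.000],
        [1.000, 1.000, 1.000, 0.000, 0.293, 0.222],
        [1.000, 1.000, 1.000, 0.189, 0.000, 0.450],
        [1.000, 1.000, 1.000, 0.222, 0.538, 0.000],
    ])
    D_sym = np.maximum(D, D.T)

    labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(D_sym)
    print(labels)  # [ 0  0 -1  1  1  1]: {Ex1, Ex2}, Ex3 as noise, {Ex4, Ex5, Ex6}

The two modifications described next (per-construct Epsilon values and neighborhood voting) are what separate DepCluster from this baseline.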

DepCluster extends the basic DBSCAN with two modifications. The first concerns Epsilon, the distance threshold used to decide whether one construct is a neighbor of another. In the basic DBSCAN, the value of this parameter is specified empirically after several experiments with the data, and it is fixed for all items involved in the clustering process. This can be problematic: a proper value is difficult to find, because too small a value may exclude true neighbors and too big a value may add noise. To avoid these drawbacks, DepCluster computes an Epsilon for every construct in the input list, so that the value is not fixed but adapted to each construct. DepCluster achieves this by finding a possible cluster boundary for every construct, using the histogram of deviations proposed in ADAptive CLUStering (ADACLUS) (Nosovskiy et al. 2008).

Examples (1) and (4) are used to explain the idea of adaptive Epsilon values. Figure 6 presents the deviation histograms for Examples (1) and (4), calculated from Table 5. Figure 6(a) shows that two constructs (including Example (1) itself) deviate from Example (1) within the span [0, 0.1], while the deviations of the other constructs fall within [0.9, 1.0]. If the Epsilon value for this construct is set at 0.1, a good clustering result can be obtained for Example (1); 0.1 is thus the cluster boundary of Example (1). Similarly, in Figure 6(b), the Epsilon value for Example (4) can be set at 0.3. Examples (1) and (4) therefore have different Epsilon values.

Figure 6: (a) Deviation histogram for Example (1); (b) deviation histogram for Example (4).

Formally, following the ideas proposed in ADACLUS (Nosovskiy et al. 2008), the procedure by which DepCluster obtains the Epsilon values for all constructs is given below. Table 6 lists the resulting Epsilon values for the six constructs in Examples (1–6).

  1. Specify the universal neighborhood searching range R. This requires two sub-steps:

     (a) For each construct, obtain the minimal distance between this construct and the other constructs. The minimal distances of all the constructs are then used to build H(x), the frequency histogram of these minimal distances;

     (b) Define the universal neighborhood searching range R as follows:

         R_glob = min{x : H(MD < x) = 0.9}
         R_loc = min{x : x ≥ R_glob, H(x) = 0}
         [4] R = R_glob + 1 × R_loc

     where R_glob denotes the point covering 90% of the distribution of H(x) and R_loc denotes the first zero greater than R_glob. (Please refer to footnote [9] for the explanation of the value 1 in Formula [4].)

  2. For each construct, its Epsilon value is determined in the following two sub-steps:

     (a) For the construct in question, all the other constructs whose distances to it are less than R form the candidate set a. The distances between the constructs in a and the construct in question are collected to build the frequency histogram H′(x);

     (b) The Epsilon value for the construct is computed as follows:

         d_{i,1} = min{x : x < m, H′(x) = 0}
         a = (d_{i,1} + d_{i,2}) / 2
         b = a + 2 × std_i

     where d_{i,1} is the first zero of H′(x) that does not exceed the mean m, d_{i,2} is the 20% point of the distribution H′(x), and std_i is the standard deviation of the distances between the construct in question and the constructs in a. The Epsilon value of the construct is b.
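The procedure can be sketched in Python as follows. Histogram bin counts, tie-breaking and edge cases are unstated in the text, so the choices below (ten equal-width bins, bin left edges as gap locations) are assumptions, and the sketch will not reproduce Table 6 exactly.

    import numpy as np

    def adaptive_epsilons(D, bins=10):
        """Per-construct Epsilon values from an n-by-n distance matrix D,
        following the two-step procedure above (after ADACLUS)."""
        n = len(D)
        off = ~np.eye(n, dtype=bool)
        min_d = np.where(off, D, np.inf).min(axis=1)   # minimal distances
        r_glob = np.quantile(min_d, 0.9)               # 90% point of H(x)
        h, e = np.histogram(min_d, bins=bins)
        gaps = [e[k] for k in range(bins) if h[k] == 0 and e[k] >= r_glob]
        r_loc = gaps[0] if gaps else e[-1]             # first zero above R_glob
        R = r_glob + 1 * r_loc                         # Formula [4]
        eps = np.zeros(n)                              # empty candidate set -> 0
        for i in range(n):
            cand = D[i][off[i] & (D[i] < R)]           # candidate set a
            if cand.size == 0:
                continue
            h2, e2 = np.histogram(cand, bins=bins)
            m = cand.mean()
            zeros = [e2[k] for k in range(bins) if h2[k] == 0 and e2[k] < m]
            d1 = zeros[0] if zeros else 0.0            # first gap below the mean
            d2 = np.quantile(cand, 0.2)                # 20% point of H'(x)
            eps[i] = (d1 + d2) / 2 + 2 * cand.std()    # b = a + 2 * std_i
        return eps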

Table 6: Adaptive Epsilons for the constructs in Examples (1–6).

          Ex. 1   Ex. 2   Ex. 3   Ex. 4   Ex. 5   Ex. 6
Epsilon   0.19    0.54    0       0.56    0.77    0.92

The second modification of the basic DBSCAN algorithm concerns the way a neighborhood is formed. The basic DBSCAN can discover clusters of arbitrary shape because a data item is included in a neighborhood as long as it is similar to any item already inside the neighborhood. In collexeme analysis, however, the constructs in a cluster are required to be isomorphic. DepCluster therefore adopts a neighborhood-voting approach to ensure the isomorphism inside a cluster: a new construct is not allowed to enter a neighborhood if its distance to any member of the neighborhood is larger than that member's Epsilon value. Consider Examples (4)–(6), and assume that Examples (4) and (5) are already in a neighborhood. To decide whether Example (6) can be included, the neighborhood-voting approach performs not only the comparison between Examples (4) and (6) but also the comparison between Examples (5) and (6). The two comparisons are illustrated in Table 7, where the distance data come from Table 5 and the Epsilon data from Table 6. Because both Examples (4) and (5) accept Example (6) as a neighbor, Example (6) is included in the neighborhood.

Table 7: An illustration of the neighborhood-voting approach.

                                      Distance   Epsilon   Is neighbor
Comparison between Ex. (4) and (6)    0.222      0.56      Yes
Comparison between Ex. (5) and (6)    0.450      0.77      Yes
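The voting rule itself is a one-liner; a sketch with the Table 5 and Table 6 figures follows (the helper names and lookup structures are hypothetical).

    def accepts(neighborhood, candidate, dist, eps):
        """A construct joins a neighborhood only if every current member
        accepts it, i.e. the member-to-candidate distance does not exceed
        that member's own Epsilon value."""
        return all(dist[(m, candidate)] <= eps[m] for m in neighborhood)

    eps = {4: 0.56, 5: 0.77}                 # Epsilons of Ex. (4), (5) (Table 6)
    dist = {(4, 6): 0.222, (5, 6): 0.450}    # distances to Ex. (6) (Table 5)
    print(accepts({4, 5}, 6, dist, eps))     # True: Ex. (6) joins the cluster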

5 Cluster validity index

DBSCAN is not a deterministic clustering algorithm: the result it generates varies with the order in which the items are processed. DepCluster therefore uses the Hill-Climbing algorithm to search for the best clustering result, and a cluster validity index is needed to decide which clustering result is more acceptable. DepCluster adopts S_Dbw (Halkidi et al. 2001; Halkidi and Vazirgiannis 2001) as the cluster validity index. The S_Dbw index performs better than the Silhouette index, the SD index and many others (Halkidi and Vazirgiannis 2001; Liu et al. 2010), and it is robust to several factors that may affect clustering quality, such as noise, variation in density, monotonicity, sub-clusters and skewness.

Given a clustering result obtained by DBSCAN, the S_Dbw index measures both the consistency and the dispersion of the clusters. To measure clustering consistency, the index considers the homogeneity and compactness of the clusters, using the variances of the individual clusters and the variance of the overall data set. Formally, the clustering consistency is measured by Scatt, defined as follows.

Variance of the overall data set:

σ(X) = (1/n) Σ_{k=1}^{n} (x_k − x̄)²

Variance of a cluster c_i:

σ(c_i) = (1/n_i) Σ_{x_k ∈ c_i} (x_k − x̄_i)²

The average scattering for clusters:

Scatt = (1/n_c) Σ_{i=1}^{n_c} ‖σ(c_i)‖ / ‖σ(X)‖

The basic idea behind Scatt is that if the clusters are more coherent, their variances will be smaller, and the Scatt value will also be smaller.

The dispersion of the clusters is measured by what Halkidi and Vazirgiannis (2001) refer to as the inter-cluster density, defined as follows:

Dens_bw = (1/(n_c(n_c − 1))) Σ_{i=1}^{n_c} Σ_{j=1, j≠i}^{n_c} density(u_{i,j}) / max{density(p_i), density(p_j)}

where u_{i,j} is the middle point of the line segment defined by the centers of the two clusters p_i and p_j. The density function density(u_{i,j}) gives the number of constructs that meet the following criterion: the shortest distance of the construct in question (e.g. one construct in p_i) to the other cluster (e.g. p_j) is less than double the average standard deviation of all clusters. The average standard deviation of all clusters is defined as follows:

stdev = (1/n_c) √( Σ_{p=1}^{n_c} ‖σ(c_p)‖ )

The density function density(p_x) denotes the number of constructs inside the cluster p_x that meet the following criterion: the distance between the construct in question and the center of the cluster is less than the stdev defined above.

The S_Dbw validity index is then defined as: S_Dbw = Scatt + Dens_bw.
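A compact Euclidean sketch of the index is given below. It treats the clustered items as points in a vector space, whereas DepCluster works with construct distances, so this is an illustration of the definition rather than DepCluster's implementation.

    import numpy as np

    def s_dbw(X, labels):
        """S_Dbw = Scatt + Dens_bw for points X (n x d) and cluster labels
        (noise points are assumed to have been removed beforehand)."""
        labels = np.asarray(labels)
        ids = list(np.unique(labels))
        nc = len(ids)
        centers = {c: X[labels == c].mean(axis=0) for c in ids}
        var_cl = {c: X[labels == c].var(axis=0) for c in ids}
        var_all = X.var(axis=0)
        # Scatt: average per-cluster variance relative to the overall variance
        scatt = (np.mean([np.linalg.norm(var_cl[c]) for c in ids])
                 / np.linalg.norm(var_all))
        # average standard deviation of all clusters (the density radius)
        stdev = np.sqrt(sum(np.linalg.norm(var_cl[c]) for c in ids)) / nc
        def density(point, members):
            return np.sum(np.linalg.norm(members - point, axis=1) <= stdev)
        dens = 0.0
        for i in ids:
            for j in ids:
                if i == j:
                    continue
                both = X[(labels == i) | (labels == j)]
                u = (centers[i] + centers[j]) / 2          # midpoint u_ij
                denom = max(density(centers[i], both), density(centers[j], both))
                dens += density(u, both) / denom if denom else 0.0
        dens_bw = dens / (nc * (nc - 1)) if nc > 1 else 0.0
        return scatt + dens_bw

Note that in Halkidi and Vazirgiannis's formulation lower S_Dbw values indicate more compact and better-separated clusters, so an implementation of the Section 3 pseudocode, which compares scores as if higher were better, would need to negate or invert the raw index.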

6 Construct mergence

The issue of construct mergence concerns forming a construction from a construct cluster. Because each construct is a directed graph, forming a construction from a cluster of constructs amounts to merging several directed graphs into one. Dependencies with identical dependency types are merged into one; the words associated with the dependencies are ignored, and the frequency of each dependency type is counted in the process. Table 1 illustrates how construct mergence is performed.
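A sketch of the merge step is given below; representing each construct as a pair of dependency-type sets (O-Deps, I-Deps) is an assumption of the sketch, as are the illustrative cluster data.

    from collections import Counter

    def merge_constructs(cluster):
        """Merge a cluster of constructs into one construction: identical
        dependency types are collapsed, words are ignored, and each type's
        relative frequency in the cluster is recorded (cf. Appendix II)."""
        n = len(cluster)
        o_freq = Counter(t for o_deps, _ in cluster for t in o_deps)
        i_freq = Counter(t for _, i_deps in cluster for t in i_deps)
        return ({t: round(c / n, 2) for t, c in o_freq.items()},
                {t: round(c / n, 2) for t, c in i_freq.items()})

    # Hypothetical cluster of three constructs: shared slots get frequency 1.0
    cluster = [({"ROOT"}, {"NSUBJ", "AUX", "DOBJ"}),
               ({"ROOT"}, {"NSUBJ", "ADVMOD", "DOBJ"}),
               ({"CCOMP"}, {"NSUBJ", "AUX", "XCOMP"})]
    print(merge_constructs(cluster))
    # e.g. ({'ROOT': 0.67, 'CCOMP': 0.33}, {'NSUBJ': 1.0, 'AUX': 0.67, ...})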

Appendix II: Typical constructions for cause

This appendix lists the 21 constructions obtained by DepCluster in the case study. For each construction, the table gives the number of constructs in the construction, its percentage of the total data set, the I-Deps and the O-Deps. Note that the syntactic slots given in the table are those whose frequency is higher than 15% of the number of constructs in the corresponding cluster. The names of the slots are explained in de Marneffe and Manning (2008).

No. / Count (Pct.) / I-Deps / O-Deps (relative slot frequencies in parentheses, rounded to three decimals)

1. 51 constructs (2.57%)
   I-Deps: mark (1.000), nsubj (1.000), dobj (1.000), aux_mod (0.804), prep_in (0.510), aux_t_a (0.196), neg (0.157)
   O-Deps: ccomp (0.863)

2. 83 constructs (4.18%)
   I-Deps: nsubj (1.000), dobj (1.000), aux_mod (0.687), prep_in (0.482), aux_t_a (0.301)
   O-Deps: root (0.398), rcmod (0.373)

3. 163 constructs (8.21%)
   I-Deps: nsubj (1.000), dobj (1.000), aux_mod (0.620), aux_t_a (0.380)
   O-Deps: root (0.411), rcmod (0.227), ccomp (0.153)

4. 63 constructs (3.17%)
   I-Deps: nsubj (1.000), xcomp (1.000), aux_mod (0.651), aux_t_a (0.317)
   O-Deps: root (0.413), rcmod (0.286)

5. 169 constructs (8.51%)
   I-Deps: dobj (1.000)
   O-Deps: vmod (0.550)

6. 218 constructs (10.98%)
   I-Deps: nsubj (1.000), dobj (1.000)
   O-Deps: rcmod (0.385), root (0.362)

7. 28 constructs (1.41%)
   I-Deps: agent (1.000), prep_in (0.357), prep_to (0.179)
   O-Deps: vmod (0.929)

8. 248 constructs (12.49%)
   I-Deps: agent (1.000)
   O-Deps: vmod (0.984)

9. 28 constructs (1.41%)
   I-Deps: mark (1.000), nsubj (1.000), xcomp (1.000), aux_mod (0.643), aux_t_a (0.357)
   O-Deps: ccomp (0.679), advcl (0.250)

10. 72 constructs (3.63%)
    I-Deps: mark (1.000), nsubj (1.000), dobj (1.000), aux_mod (0.681), aux_t_a (0.417)
    O-Deps: ccomp (0.806), advcl (0.153)

11. 117 constructs (5.89%)
    I-Deps: aux_to (1.000), xsubj (1.000), dobj (1.000), prep_in (0.162)
    O-Deps: xcomp (0.991)

12. 50 constructs (2.52%)
    I-Deps: xcomp (1.000)
    O-Deps: vmod (0.660)

13. 65 constructs (3.27%)
    I-Deps: nsubj (1.000), dobj (1.000), advmod (1.000), aux_mod (0.585), aux_t_a (0.462)
    O-Deps: root (0.538), rcmod (0.231)

14. 55 constructs (2.77%)
    I-Deps: mark (1.000), nsubj (1.000), dobj (1.000)
    O-Deps: ccomp (0.709), advcl (0.200)

15. 37 constructs (1.86%)
    I-Deps: dobj (1.000), aux_to (0.865), prep_in (0.189)
    O-Deps: xcomp (0.514), vmod (0.270)

16. 49 constructs (2.47%)
    I-Deps: nsubj (1.000), dobj (1.000), advmod (1.000)
    O-Deps: root (0.531), rcmod (0.245)

17. 59 constructs (2.97%)
    I-Deps: nsubjpass (1.000), agent (1.000), auxpass (1.000)
    O-Deps: root (0.610), rcmod (0.169), ccomp (0.153)

18. 43 constructs (2.17%)
    I-Deps: nsubj (1.000), xcomp (1.000)
    O-Deps: rcmod (0.465), root (0.326)

19. 41 constructs (2.06%)
    I-Deps: mark (1.000), nsubjpass (1.000), agent (1.000), auxpass (1.000)
    O-Deps: ccomp (0.732), advcl (0.195)

20. 19 constructs (0.96%)
    I-Deps: nsubjpass (1.000), agent (1.000), auxpass (1.000), aux_mod (0.947), prep_in (0.211)
    O-Deps: root (0.632), rcmod (0.263)

21. 29 constructs (1.46%)
    I-Deps: agent (1.000), advmod (1.000), nsubjpass (0.552), auxpass (0.552)
    O-Deps: vmod (0.448), root (0.414)

Note: “neg” stands for the negation, “aux_mod” for the modal auxiliary, “aux_t_a” for the auxiliary of tense and aspect, “xcomp” for the open clause complement, “nsubj” for the nominal subject, “ccomp” for the clause complement, “rcmod” for the relative clause modifier and “prep_in” for the collapsed prepositional phrase that contains in as the preposition.

Published Online: 2016-8-18
Published in Print: 2017-5-1

© 2017 Walter de Gruyter GmbH, Berlin/Boston
