Hyperbolic Feature-based Sarcasm Detection in Telugu Conversation Sentences

: Recognition of sarcastic statements has been a challenge in the process of sentiment analysis. A sarcastic sentence contains only positive words conveying a negative sentiment. Therefore, it is tough for any automated machine to identify the exact sentiment of the text in the presence of sarcasm. The existing systems for sarcastic sentiment detection are limited to the text scripted in English. Nowadays, researchers have shown greater interest in low resourced languages such as Hindi, Telugu, Tamil, Arabic, Chinese, Dutch, Indonesian, etc. To analyse these low resource languages, the biggest challenge is the lack of available resources, especially in the context of Indian languages. Indian languages are very rich in morphology which pose a greater challenge for the automated machines. Telugu is one of the most popular languages after Hindi among Indian languages. In this article, we have collected and annotated a corpus of Telugu conversation sentences in the form of a question followed by a reply for sarcasm detection. Further, a set of algorithms are proposed for the analysis of sarcasm in the corpus of Telugu conversation sentences. The proposed algorithms are based on hyperbolic features namely, Interjection, Intensifier, Question mark and Exclamation symbol. The achieved accuracy is 94%.


Introduction
Sentiment analysis is a technique that analyses people's opinions, sentiments and emotions towards a target such as products, services, events, organizations, individuals, etc. [1]. The presence of sarcasm in the sentence makes sentiment analysis difficult as sarcasm flips the sentiment value. Therefore, sarcasm is considered as critical to identify the sentiment from a given text.
Sarcasm is a special kind of sentiment that frequently occurs during the communication between people and is mostly intentional. It is a nuanced form of language in which people state the opposite of what is implied. It can also be stated as the turbulent feature that people are often used to convey a negative meaning using only positive words or even compounded, inflated positive words [2]. An example of a simple sarcastic sentence is: "I love being ignored #sarcasm". In this example, the sentiment seems to be positive as "love" is present, but the situation is negative as "no one wants to be ignored". It means the sentence is written in a sarcastic way. It can be easily understood that sarcasm detection in the text is tough due to the lack of intonation or facial expressions. Therefore, identifying sarcasm in the text is a challenging task. Recent works in the direction of sarcasm detection have influenced local native languages. This is mainly due to the usage of regional languages while communicating through social media. Most of the existing algorithms for sarcasm detection are applicable for text data scripted in English [2][3][4][5][6][7]. In the domain of low resourced languages namely, Hindi, Telugu, Tamil, Arabic, Chinese, Spanish, etc., there is very little work done so far. [8][9][10] In Low resourced languages domain, unavailability of the datasets is the biggest challenging task for researchers. So, it gives us a seed idea to work on sarcasm detection in this domain especially on Indian languages. Indian languages such as Hindi, Telugu, Tamil, etc. are very rich in morphology which poses another challenge for the researchers to work on it. Telugu is the second most popular language in India just after Hindi, and it has a lot of importance over other Indian languages. In the 16th century, Italian explorer Niccolo Da Conti who visited the Vijayanagara empire described Telugu as Italian of the east. Rabindranath Tagore, the well-known Bengali writer, has once heard Telugu poetry and said "Is it a language or music?", and he also said that Telugu is the sweetest of all (Indian) languages. The famous Tamil poet Subramania Bharati has sung thus "Sundara Telunginil Pattisaithu" which means "Sing in beautiful Telugu". Srikrishna Devaraya, South Indian king and a non-native speaker of Telugu said "Desabhaashalandu Telugu Lessa (Telugu is the best among all the languages in this country). Telugu being the second most spoken language in India is growing its importance and majority of the Telugu speaking social media users started communicating in their native language. An automated sentiment analyser with sarcasm detection method will enhance the better analysis of the communicated text.
In this article, we collected a corpus of Telugu conversation sentences in the form of the question followed by a reply from Telugu comedy TV shows such as "Jabardasth comedy show", "Extra Jabardasth comedy show", "Comedy Raja Band Baja", "Patas Punch", etc. This corpus comprises of conversation between different comedy actors. It is mostly in the form of a question followed by a sarcastic reply. These replies are considered as sarcastic in the context of the question. Three algorithms have been devised by analysing the corpus for sarcasm detection. These three proposed algorithms are devised based on the occurrences of hyperbolic features in the Telugu conversation sentences.
Rest of the article is organized as follows: Section 2 describes related work. The proposed scheme is discussed in Section 3. Results are shown in Section 4 and conclusion is given in Section 5.

Related Work
This section gives a survey on existing methods for sarcasm detection. Majority of the work in sarcasm detection has been done in English language as it is the most dominating language used in social media for communication. In recent times, sarcasm detection on English scripted domains such as Twitter data, product reviews, website comments, etc., were done tremendously by many researchers [2-5, 7, 11]. In the domain of Low resourced languages such as Hindi, Telugu, Tamil, Chinese, Arabic, etc., very little work has been done [8][9][10]. The following subsections will detail about sarcasm detection in English and low resourced languages.

Sarcasm Detection on English Language
Lexical features play a vital role in detecting irony and sarcasm in text [12]. Lexical features along with syntactic features were used to detect sarcastic tweets. A semi-supervised approach [11] was used to detect sarcasm in tweets and Amazon product reviews. They used two interesting lexical features, namely pattern-based (high-frequency words and content words) and punctuation-based to build a weighted K-Nearest Neighbor (KNN) classification model to perform sarcasm detection. Numerous lexical features derived from linguistic inquiry and word count [13], WordNet affect [14] and pragmatic features such as emoticons, smiles and replies were explored [3] to identify sarcasm in tweets. A well-constructed lexicon-based approach was used to detect sarcasm based on an assumption that sarcastic tweets are a contrast between a positive sentiment and a negative situation [5] and for lexicon generation, they used unigram, bigram and trigram features. The Intensifier is used as hyperbole features to detect sarcasm in tweets as utterance with a hyperbole. For example -'fantastic weather when it rains' is identified as sarcastic with more ease than the utterance without a hyperbole like -'the weather is good when it rains' [4]. The utterance with the hyperbole 'fantastic' may be easier to interpret more sarcastic than the utterance with the non-hyperbolic 'good'. Interjection words are used as hyperbole feature to identify sarcasm in tweets [2]. They also used a parsing technique to divide a tweet into phrases to generate the lexicon file to identify sarcasm in Twitter data. Rather than lexical and linguistic traits, the Twitter user's behavioral trait is used as the feature for sarcasm detection. Behavioral context was used to convey the sarcasm and employed theories from behavioral and psychological studies to construct a behavioral modeling framework tuned for detecting sarcasm [6]. Sarcasm requires some shared knowledge between speaker and audience, and it is a profoundly contextual phenomenon. Most computational approaches to sarcasm detection, however, treat it as a purely linguistic matter, using information such as lexical cues and their corresponding sentiment as predictive features [15]. A system was develpoed that identifies sarcastic tweets to predict the result of an upcoming election by analysing people's opinion on Twitter [16]. They used Exclamation mark (!), Question mark (?), Hashtag sarcasm and Irony, Emoticons, Adjectives and Verbs as features to identify sarcastic polarity in Twitter data using supervised machine learning approach. The author's past tweets can provide an additional context for sarcasm detection. They exploited the author's past sentiment on the entities in a tweet to detect the sarcastic intent [17]. A framework was introduced based on the linguistic theory of context incongruity and inter-sentential incongruity for sarcasm detection by considering the previous post in the discussion thread [18]. A Hadoop based framework that captures a massive amount of real-time tweets and processes it with a set of algorithms that identify sarcastic sentiment efficiently was also proposed [7].

Sarcasm Detection on Low-Resourced Languages
The first work on detecting sarcasm on Low-Resourced languages was done in Indonesian social media data [8]. The dataset was gathered manually from Twitter and proposed two additional features to detect sarcasm after a common sentiment analysis was conducted. The features are the negativity information and the number of Interjection words. They also employed translated SentiWordNet in the sentiment classification. Thelwall et al. [19] have provided a huge number of informal messages posted every day on social network sites, blogs and discussion forums. Till date algorithms are devised to identify sentiment and sentiment strength that help to understand the role of emotion in this informal communication and also to identify inappropriate or anomalous affective utterances, potentially associated with threatening behaviour to the self or others. A set of features specifically for detecting sarcasm in social media were introduced and had deployed a novel Multi Strategy Ensemble Learning Approach (MSELA) to handle imbalance problem in English and Chinese sentences [9]. A system was proposed to detect sarcastic sentences in Hindi language using Support Vector Machines [10]. They focused on features like Emoticons and Punctuation marks for sarcasm detection. As per our best knowledge, no work on sarcasm detection in Telugu is found so far.

Proposed Scheme
This section describes the model for sarcasm detection followed by the process of Telugu data collection and annotation. It also explains the POS tagging followed by tagged data analysis to form the rules for sarcasm detection in Telugu conversation sentences.

Model for Sarcasm Detection
The pipeline process of sarcasm detection in Telugu conversation sentences is shown in Figure 1. It starts with data collection followed by data annotation. In the next step, we identify the appropriate POS tag information of the annotated Telugu sentences. Further, each tagged data was analysed for categorization of sentences from the occurrences of hyperbole features namely, Interjection, Intensifier, Question mark and Exclamation mark. Based on these hyperbole features, a set of algorithms are proposed for sarcasm detection in each category.

Figure 1: Model for Sarcasm Detection
Further, for testing process, the test sentences are initially fed to the process of POS tagger to obtain the tagged sentences. Now, these tagged sentences are applied on the algorithms to classify as either sarcastic or not sarcastic.

Data Collection
Since, Telugu is a low resourced language, the availability of the data is very rare on the Internet. So, we have collected it manually from different sources such as TV series, Web, Internet, etc. The process of data  Figure 2. We have collected around 5500 Telugu sentences. Majority of the conversation sentences are from various Telugu TV comedy series of the ETV plus channel namely, "Jabardasth", "Extra Jabardasth", "Pataas", "Cinema Chupista Mava", "Express Raja", "Naa Show Naa Istam" etc. To collect these sentences, nearly 350 archive episodes were watched all together. As the sentences were taken from video, collected data are in the form of conversation sentences between two or more. Therefore, the structure of the sentences is in the form of the question followed by a reply. The collected dataset has been made publicly available through the GitHub. One can find data on the link https://github.com/sbharti1984/ Telugu-Sarcastic-Sentences.

Data Annotation
The collected dataset was distributed among professionals in the Telugu language who are teachers and practitioners. They gave very good response and annotated these 5500 sentences manually to find the sentence as sarcastic or not. After collecting the annotation results from all the three individuals, we observe that the structure of conversation sentences followed any one of the following patterns: 1. Normal question followed by a Normal reply. 2. Normal question followed by a Sarcastic reply. 3. Sarcastic question followed by a Normal reply. 4. Sarcastic question followed by a Sarcastic reply.
A list of sample annotated conversation sentences (one for each above patterns) that are collected is shown in Figure 3. Based on annotation, we classified the sentences of the dataset and observed that most of the sentences belong to the second pattern: "normal question followed by a sarcastic reply".
To measure the Inter-Annotator Agreement (IAA), there are two coefficients such as Cohen's Kappa [20] and Fleiss Kappa [21]. If the annotators are two or more than two, The Cohen's Kappa or Fleiss Kappa is used In this work, as the annotators are three (more than two), we approached Fleiss Kappa coefficient, and the formula for that is shown in Equation 1. We got the IAA as 0.85, which is said to be perfect agreement.
where, P − Pe : gives the degree of agreement actually achieved above chance, 1 − Pe : gives the degree of agreement that is attainable above chance.
In this article, our assumptions are as follows: 1. The sentences belonging to the pattern "normal question followed by normal reply" are non-sarcastic sentences. 2. The sentences belonging to the pattern "normal question followed by sarcastic reply" are sarcastic sentences. 3. The sentences belonging to the pattern "sarcastic question followed by normal reply" are sarcastic, but the frequency of occurrences is very rare. 4. The sentences belonging to the pattern "sarcastic question followed by sarcastic reply" are sarcastic, but the frequency of occurrences is very rare.
With these assumptions, we observed that out of 5500 sentences, 5200 sentences were sarcastically annotated and rest 300 was not sarcastic. In this work, we considered only sentences that follow "normal question followed by sarcastic reply" pattern as a sarcastic sentences. The occurrences of this pattern are very frequent and comprise of approximately 97% of the sarcastic sentences dataset. The sentences belonging to the patterns "sarcastic question followed by normal reply" and "sarcastic question followed by sarcastic reply" are omitted because of the rarity. Based on this assumption, we have developed the rules to detect the sarcasm in the given sentence. For this experiment, the analysis dataset, training and testing set are as follows: 1. After annotation, 5200 sentences were found sarcastic in a total of around 5500 sentences. 2. 5044 (97% of 5200) sarcastic sentences followed the pattern normal question followed by sarcastic reply. 3. All 5044 sarcastic sentences were used for the analysis to develop the rules which will detect the sarcasm. 4. 188 (94 sarcastic and 94 non-sarcastic) sentences were used as testing set which are not part of the dataset.

POS Tagging
POS tagging is the process of assigning a correct POS tag such as Noun, Verb, Adverb, etc., to each word of the given input sentence. POS taggers are developed by modelling the morphosyntactic structure of NLP . The Telugu tagger is similar to the model 5 described in Table 2 of [22], but with a focus on Telugu. The corpora are downloaded, cleaned and tagged with a high Precision and low Recall tagger. As the tagger is trained on large data, the tagger is expected to handle large vocabulary and also predicting the tags of unknown words using known words. They followed HMM-based approach and the Indian language standard tagset [23] which comprise 21 tags to build the tagger. The available Telugu tagger is based on TnT tagger, which is well known for its robustness and speed. Some of the Telugu tags used in this article are shown in Figure 4. An example of Telugu sentences with corresponding POS tag information is shown in Figure 5.

Proposed Algorithms
In this article, we analysed 4950 sarcastic sentences for training set and observed that these sentences could be classified into three categories: 1. Sentence that start with Telugu negation word. 2. Sentence that start with Telugu interjection word.

Reply of a sentence in the form of a question.
A list of Telugu negation words and Telugu interjection words are given in Figures 6 and 7 respectively. Based on the observation from Telugu conversation sarcastic sentences, a set of three algorithms are proposed to detect sarcasm in each of the categories. The proposed algorithms are as follows:

Telugu_Negation_Word_Start(TNWS) 2. Telugu_Interjection_Word_Start(TIWS)
3. Reply_in_Form_of _Question(RiFoQ) In this article, we proposed three algorithms for identifying sarcasm in Telugu conversation sentences as given in algorithms 1, 2, and 3. Here, we explained the working procedure of all the three algorithms using examples.

Telugu_Negation_Word_Start(TNWS)
This algorithm is based on Telugu negation words i.e. "ledu", "kaadu" and "vaddu". During the analysis of sarcastic sentences, we observed that these negation words frequently appear as a starting word in sarcastic reply of the conversation sentences as shown in Figure 8. Based on this observation, we proposed an algorithm for the sentences whose reply starts with Telugu negation word as shown in Algorithm 1.
Algorithm 1 takes the corpus of Telugu conversation sentences (C) as an input and extracts the first word and last word of every sentence and store it in F_tok and L_tok respectively. Next, it compares F_tok with Telugu negation words, i.e., "ledu", "kaadu", "vaddu" and L_tok with a question mark (?) and exclamation mark (!). If F_tok match with any of the above mentioned Telugu Negation words and L_tok match with either Exclamation or Question mark, then the given sentence is classified as sarcastic, otherwise, check for other two proposed algorithms. Next, we find POS tag information of F_tok and store it in FT. If FT is the Telugu interjection word as shown in Figure 7, then those sentences are fed into Algorithm 2 and rest of the sentences are fed into Algorithm 3.  Figure 8. According to Algorithm 1, any reply starts with one of the negation word given in Table 6 and ends with either (?) or (!) then sentence will be classified as sarcastic. In Figure 8, the given Telugu sentences are in the form of a question followed by a reply. The reply of the first sentence starts with Telugu negation word "ledu" and ends with exclamation symbol (!). Therefore, the given sentence is sarcastic. Similarly, the reply of the second sentence is starts with negation word "kaadu" and ends with exclamation symbol (!). Therefore, the given sentence is sarcastic. In proposed Algorithm 1, we have considered those negation words that occur very frequently in sarcastic Telugu conversation sentences. These negation words act as an intensifier in reply.

Telugu_Interjection_Word_Start(TIWS)
This algorithm is based on Telugu interjection words such as "ayyo", "haa", "alaana", etc. While analysing Telugu sarcastic sentences, we observed that many sarcastic replies start with Telugu interjection words as shown in Figure 9. Based on this observation, we proposed an algorithm for the sentences whose reply starts with Telugu interjection words as shown in Algorithm 2. Algorithm 2 takes a sentence as input that start with interjection word and finds the POS tag information for every sentence and append it to file TF. Next, it finds the first, second and the third tag and stores in FT, ST and TT respectively. If FT is an Interjection (INJ) and ST is an Adjective (JJ), then the sentence is classified as sarcastic. Otherwise, if the FT is an INJ and a bigram tag (NN + VM) sequence is present anywhere in the rest of sentence, the sentence is classified as sarcastic. Finally, if FT is INJ and ST is a pronoun (PRP), and TT is either main verb (VM) or noun (NN), then the sentence is classified as sarcastic. Otherwise, the sentence is not sarcastic.
Algorithm 2 is based on Telugu interjection word as examples shown in Figure 9. According to Algorithm 2, any reply starts with one of the interjection word given in Table 7 and POS tag value either a noun or verb present anywhere in remaining part then sentence will be classified as sarcastic. Similarly, other rules are given. In Figure 9, the given Telugu sentences are in the form of a question followed by a reply. The POS tags sequence for the reply of the first sentence is: "INJ SYM PRP NN NN VM SYM". The reply of the first sentence starts with Telugu interjection tag "INJ", and noun (NN) appears at 4 th and5 th position or verb (VM) appears at 6 th position. Therefore, given sentence is classified as sarcastic. The one condition is sufficient either presence of NN or VM. Similarly, for the second sentence as well. In proposed Algorithm 2, we have considered those interjection words that occur very frequently in sarcastic Telugu conversation sentences.

Reply_in_Form_of _Question(RiFoQ)
We observed that several sarcastic replies were in the form of a question during analysis of Telugu sarcastic sentences as shown in Figure 10. Therefore, we proposed an algorithm for the conversation sentences whose reply was in the form of a question and shown in Algorithm 3.
Algorithm 3 takes rest of the sentences as an input that neither starts with Telugu Negation word nor Telugu interjection words. Next, it finds the POS tag information for every sentence and appends it to file TF.

Algorithm 3: Reply_in_Form_of _Question(RIFOQ)
Input: Telugu conversation sentence (S). Output: Classified into either Sarcastic or not Sarcastic. In the next step, it finds the first, second, third and the second-last-tag of every sentence and stores it in FT, ST, TT and SLT respectively. If 'WQ' tag is present in TF and in the corresponding sentence, if SLT is either NN or VM and the LT is '?', then the sentence is classified as sarcastic. Otherwise, if FT is VM, ST and SLT is either VM or NN and the LT is '?', then the sentence is classified as sarcastic. Finally, if FT is RB, ST is PRP, TT is NN, SLT is either NN or VM and the LT is '?', then the sentence is classified as sarcastic. Otherwise, the sentence is not sarcastic. Algorithm 3 is based on a unique feature as a reply of a conversation sentence is in the form of a question. To identify a reply is in the form of question, one need to check the presence of question mark tag (WQ) and end symbol is question mark (?). Some other rules are given in Algorithm 3. In Figure 10, the given Telugu sentences are in the form of a question followed by a reply. The POS tags sequence for the reply of the first sentence is: "WQ SYM PRP QC NN NN NN VM VM SYM". According to the algorithm, reply contains a question tag (WQ), and last two tags are either a verb (VB) followed by symbol or noun (NN) followed by a symbol. The symbol is a question mark (?). Therefore, given sentence is classified as sarcastic. Similarly, the POS tags sequence for second sentence's reply is: "RB PRP NN VM SYM". According to the other rule, if any reply starts with tag adverb (RB) followed by the second tag is a pronoun (PRP) followed by the third tag is a noun (NN), and last two tags are a verb (VM) and question mark symbol (?) then given sentence is classified as sarcastic.

Results
This section describes the performance of the proposed algorithms to detect sarcasm in Telugu conversation sentences.

Statistical Evaluation Metrics
There are three statistical parameters namely, Precision, Recall and F − score used to evaluate the proposed approaches. Precision shows how much relevant information is identified correctly and Recall shows how much extracted information is relevant. F − score is the harmonic mean of Precision and Recall. Equations 2, 3, and 4 shows the formula to calculate Precision, Recall and F − score.

Analysis with Machine Learning Approaches
In this article, we performed analysis of proposed algorithms using various machine learning approaches as shown in Table 5. The features are extracted by analysing the annotated sentences. The Figure 11 shows the features used to train the classifiers and sample of learned instances of each feature. The experiment was done on all the proposed algorithms individually and combined of all three algorithms by performing 5 trials as well as varying the training and testing split ratio as start with 20 upto 100. The brief introduction of all the used classifiers in this work are as follows: The performance of Algorithm 1 is shown in Figure 12. It is observed that, the highest accuracy of 90% is reported with a train_test split ratio of (80-20) by the all classifiers. After that, for any train_test split ratio, all classifiers reports the same except the Naive Bayes which fluctuates over the variations of ratio. All the 5 trials of (80-20) split by the classifiers is shown in Table 1.
The Figure 13 shows the performance of Algorithm 2. It is observed that, the maximum accuracy of 86.43% is reported with a train_test split ratio of (60-40) by all the classifiers except the Naive Bayes. All the 5 trials of (60-40) split by the classifiers is shown in Table 2.    The Figure 14 depicts the performance of Algorithm 3. It is observed that, the maximum accuracy of 87.65% is reported with a train_test split ratio of (60-40) by all the classifiers. All the 5 trials of (60-40) split by the classifiers is shown in Table 3.  The performance of the combined Algorithm is depicted through Figure 15. It is observed that, the maximum accuracy of 85% is reported with a train_test split ratio of (80-20) by all the classifiers. All the 5 trials of (80-20) split by the classifiers is shown in Table 4.

Experimental Evaluation
Experiments were conducted on the algorithms for sarcasm detection with 188 Telugu conversation sentences as a testing set. The testing set consists of a 50:50 ratio of sarcastic and non-sarcastic conversation sentences i.e. 94 sarcastic sentences and 94 non-sarcastic sentences as ground truth. The experimental result in the form of confusion matrix over 188 testing sentences is given in Table 6. Further, precision, recall, and F-score are given in Table 7.

Conclusion
In the area of sarcasm sentiment detection in low resource domain like Hindi, Telugu, Tamil, Arabic, etc., little work has been done. The reason behind is the scarcity of datasets for analysis and experiment. The collection of the dataset in this domain is the biggest challenging task. In this article, we built a dataset of Telugu conversation sentences manually from videos and annotated as sarcastic sentences. To identify sarcasm in collected dataset, we proposed a set of algorithms. There is no reported work on Telugu sarcasm detection so far. Therefore, these algorithms make an initiation in this direction. The proposed algorithms attain an accuracy of 94.14% with the limited amount of Telugu conversation datasets.