Deep Bidirectional LSTM Network Learning-Based Sentiment Analysis for Arabic Text

Abstract: Sentiment analysis aims to predict the sentiment polarity (positive, negative, or neutral) of a given piece of text. It lies at the intersection of many fields such as Natural Language Processing (NLP), Computational Linguistics, and Data Mining. Sentiments can be expressed explicitly or implicitly. Arabic Sentiment Analysis is a challenging undertaking due to the language's complexity, ambiguity, and various dialects, the scarcity of resources, its morphological richness, the absence of contextual information, and the absence of explicit sentiment words in implicit pieces of text. Recently, deep learning has shown great success in the field of sentiment analysis and is considered the state of the art in Arabic Sentiment Analysis. However, state-of-the-art accuracy for Arabic sentiment analysis still needs improvement regarding contextual information and implicit sentiment expressed in different real cases. In this paper, an efficient Bidirectional LSTM Network (BiLSTM) is investigated to enhance Arabic Sentiment Analysis by applying forward and backward passes that encapsulate contextual information from Arabic feature sequences. The experimental results on six benchmark sentiment analysis datasets demonstrate that our model achieves significant improvements over state-of-the-art deep learning models and baseline traditional machine learning methods.


Introduction
Finding an automatic way to analyse, classify, and determine the attitude of a speaker in social networks is very important. Indeed, it is the most practical way to get direct feedback or information from people [1]. For instance, in business, it allows companies to automatically gather the opinions of their customers about their products or services. In politics, it can help to infer the public's orientation and reaction towards political events, which can help in decision making [2]. As a result, Sentiment Analysis (SA), which aims at extracting people's opinions automatically, has gained interest in recent years in politics, social media, and business.
Sentiment Analysis refers to the use of NLP, computational linguistics, and data mining techniques to identify and retrieve certain sentiment(s) from text [3]. Its goal is to extract the sentiment conveyed in a piece of text (tweet, post, etc.) based on its content and its level of analysis (document level, sentence level, aspect level, and word level). Furthermore, the application of Sentiment Analysis to Arabic text is a timely subject, given the importance of Arabic as a language: it is recognized as the fifth most widely spoken language in the world and is the official or native language of 22 countries, spoken by approximately more than 300 million people [4,5]. Arabic has three varieties. The first is Classical Arabic, found in religious and old scripts. The second is Modern Standard Arabic (MSA), found in today's written scripts and mainly spoken in formal channels. The third is colloquial or dialectal Arabic (DA), the spoken language of informal channels.
Arabic is both morphologically rich and highly ambiguous: it has complex morpho-syntactic agreement rules, many irregular forms, and a large number of dialectal variants with no writing standards. Without proper processing and handling, learning robust general models over Arabic text can be hard. Furthermore, compared to English, there are fewer freely available resources for Arabic sentiment analysis in terms of sentiment lexicons and annotated sentiment corpora. These challenges have fuelled extensive research interest in Arabic Sentiment Analysis (ASA) [6].
Beyond the importance of Arabic Sentiment Analysis itself, state-of-the-art artificial intelligence (AI) systems increasingly rely on deep learning techniques, which have achieved immense success in many domains. Deep Learning (DL), a subfield of machine learning (ML), depends on a set of algorithms that learn multiple levels of representation with the aim of modelling high-level abstractions in data. One of the main benefits of deep learning over traditional machine learning algorithms is its capacity to perform feature engineering on its own: deep models can carry out semantic composition, generating a vector representation for text units by combining their finer-grained constituents or entities, efficiently and in a low-dimensional space.
Recurrent neural networks (RNNs) are deep learning neural networks designed specifically to learn sequences of data and are mainly used for textual data classification. However, RNNs suffer from the vanishing gradient problem when handling long sequences of data. LSTM neural networks [7] were proposed as a solution to this problem and have proven to be efficient for many real-world tasks such as speech recognition [8], image captioning [9], and music composition [10]. However, extracting a sentiment highly depends on the review's contextual information. Direct feedforward neural networks lack, to some extent, the ability to take contextual information into consideration and hence perform poorly on the ASA task. Therefore, the sentiment analysis approach in this paper uses a Bidirectional LSTM Network (BiLSTM) with the ability to extract contextual information from the feature sequences by dealing with both forward and backward dependencies. Moreover, our BiLSTM allows us to look ahead by employing a forward LSTM, which processes the sequence in chronological order, and a backward LSTM, which handles the sequence in reverse order. The output is then the concatenation of the corresponding states of the forward and backward LSTMs.
In our proposed approach, in order to improve the state-of-the-art performance of Arabic Sentiment Analysis, we deploy a deep learning model based on BiLSTM for sentiment classification. Our contributions can be summarized as follows: 1. We investigate the benefits of Arabic pre-processing steps such as tokenization, punctuation removal, Latin character removal, digit removal, normalization, and stemming. 2. We use a deep bidirectional LSTM network with the ability to extract contextual information from the feature sequences of Arabic sentences. 3. Our BiLSTM model significantly outperforms, in terms of accuracy, other deep learning models (CNN and LSTM) on the majority of the benchmark datasets. 4. It achieves the highest accuracy on Arabic sentiment classification compared to the baseline traditional machine learning methods and outperforms the accuracy of the state-of-the-art deep learning models.
The remainder of this paper is organized as follows. We first overview some recent related work on Arabic sentiment analysis approaches and methods. Then, in Section 3, we describe our Arabic Sentiment Analysis system using a deep learning model in detail. Section 4 provides the experimental study and the obtained results. Section 5 presents the discussion. Finally, Section 6 concludes and outlines future work.

Related Work
Sentiment analysis aims to classify subjective texts into two or more categories. The most obvious categories are positive and negative; we refer to this problem as Binary SA (BSA). Some works include a third class for neutral text; we refer to this problem as Ternary SA (TSA). A final option is to consider sentiment based on some ranking or rating system, such as the 5-star rating system. This is known as Multi-way SA (MWSA) [11].
The available research on Arabic Sentiment Analysis approaches can be categorized into Machine Learning, Lexicon-based, and Hybrid or combined approaches.
Machine learning is the most commonly used approach in sentiment analysis. [12] performed SA of tweets written in Modern Standard Arabic (MSA) and Egyptian dialects. They collected 1000 tweets (500 positive and 500 negative), used standard n-gram features, and experimented with several classifiers (SVM and NB). [13] developed a sentiment analysis tool for colloquial Arabic and MSA to evaluate it on social networks. They collected 1080 Arabic reviews from social media and news sites and used K-NN as a classifier. In [14], the authors applied SVM and NB classifiers for sentiment analysis of Arabic reviews and comments from the Yahoo Maktoob website. [15] developed a supervised system for Arabic social media (SAMAR) using an SVM classifier. [16] presented an Arabic sentiment analysis tool with three classifiers: SVM, K-NN, and naïve Bayes.
Another approach for Arabic sentiment analysis is the lexicon-based approach. It is usually applied when the data are unlabeled. Lexicons are sentiment dictionaries pairing each word with its sentiment or sentiment score. [17] proposed a system consisting of two parts: the first part is a free online game whose aim was to build a lexicon of positive and negative words; the second part was the sentiment analysis itself, which classified reviews according to their sentiments. [18] created a lexicon with 120,000 Arabic terms, collecting a large number of articles from Arabic news websites.
[19] developed a hybrid approach that utilized a lexicon-based approach as well as a machine learning-based approach. The lexicon was constructed by translating the SentiStrength English lexicon. The classification step was conducted using Maximum Entropy (ME) and K-Nearest Neighbors (KNN). [20] conducted three experiments to develop an ASA system. The first experiment classified comments using the SVM machine learning-based approach only. The second experiment used the lexicon-based approach alone; they constructed SSWIL in order to classify the comments. The third experiment combined the two approaches, classifying comments using the SSWIL lexicon and then applying the SVM machine learning-based approach. [21] built a new Arabic lexicon by merging two MSA lexicons and two Egyptian Arabic lexicons. They used SVM as a classifier and adapted one of the state-of-the-art Semantic Orientation (SO) algorithms. [22] investigated sentiment analysis with dialectal Arabic words and MSA. The classifiers they used were NB and SVM. To convert dialectal words to MSA, the authors used a dialect lexicon that maps dialectal words to their corresponding MSA words.
Deep learning is a branch of machine learning which aims to model high-level abstractions in data. This is done using model architectures that have complex structures or those composed of multiple nonlinear transformations [23]. Only a few researchers have explored deep learning models in Arabic text.
[24] explored several deep learning models: Deep neural network (DNN): DNN applies the backpropagation to a conventional neural network with several layers; Deep belief networks (DBN): DBN pre-trains phases before feeding it into other steps; Deep autoencoder (DAE): DAE reduces dimensionality to original models; Combined DAE with DBN; and Recursive autoencoder (RAE): The RAE parses raw sentence words in the best order which then minimizes the error of creating the same sentence words in the same order. The experimentation results show that the DAE model gives a better representation of the sparse input vector. The best model was the RAE leading to an accuracy of 74%. Moreover, the RAE model's performance was better than other models by around 9%.
In [25] they proposed enhancements to the RAE model, which was proposed in [24], to adapt to challenges that arise with the Arabic text. Morphological tokenization is proposed to overcome the overfitting and morphological complexity of the Arabic sentences. The model was tested with three different datasets. The same authors [25] participated in Sem-Eval 2017 task 4 with their RAE model and achieved an accuracy equal to 41% [26]. [27] applied the recursive neural tensor network (RNTN) model on the Arabic text. The model was trained by a sentiment treebank called (ARSENTB), which is a morphologically enriched treebank created by the authors. They used word2vec embedding using the CBOW model on the QALB corpus [28].
In [29], the authors tested their model proposed in [27] with ASTD [30] dataset of 10,006 tweets. The RNTN was trained twice, first time using lemmas, where each lemma represents a set of words that have the same meaning and differ by inflectional morphology only, and in the second time using raw words. RNTN, when trained by lemmas, showed better accuracy.
[31] built a system to analyse opinions about health services. They gathered their dataset from Twitter hashtags and ended up with 2026 tweets. They compared two DL models, namely DNN and CNN, with word2vec embeddings; the CNN model had the best accuracy. However, in this study, the CNN model was trained on a very small dataset, and the two models did not address the negation problem. Later, the authors proposed another model in [32] to overcome the limitation of training a CNN on a small dataset. Instead, they trained a combined CNN and lexicon model on top of word2vec embeddings constructed from a large corpus acquired from multiple Arabic journals. The accuracy of their model increased from 90% to 92%. The same authors in [33] applied sentiment analysis to a health dataset using combined deep learning algorithms (CNN-LSTM) and explored the effectiveness of using different levels of sentiment analysis. The experimental results show that the word level and the Ch5-gram level yield better sentiment classification results.
Another work was proposed in [34], using the DNN algorithm. The authors have used eight layers in the model to classify Arabic tweets. The sentiment of each tweet is given by extracting the sentiment words from the tweet using a lexicon and then summing their polarities. Although the model showed good performance, it exhibited sensitivity in its performance towards different datasets, and there was not any consideration for negation.
In [35], the authors examined two word embedding models (CBOW and SG), using a corpus of 3.4 billion Arabic words selected from 10 billion words collected by crawling web pages. Then, to classify sentiments, a CNN-based model was trained on the previously trained word embeddings.
In [36], five architectures were used, including CNN, CNN-LSTM, simple LSTM, stacked LSTM, and combined LSTM to analyse Arabic tweets. They employed dynamic and static CBOW and SG word embeddings to train the models. Experiment results showed that the combined LSTM model trained by dynamic CBOW outperformed the other models.
In [2], the authors used an ensemble model combining Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models to predict the sentiment of Arabic tweets. Their results show a significant increase in the performance of the models, and an F1-score of 64.46% was achieved with their ensemble.
In [37], they have proposed a hybrid incremental learning model for Arabic sentiment analysis. Their model uses two different ML classifiers and one DL classifier, which is an RNN. The input to the network is a set of 16 different weights calculated from three different lexicons that form the feature vector.
[38] addressed the aspect-based sentiment analysis for Arabic Hotels reviews. Their dataset consisted of 2291 Arabic reviews. The dataset was prepared using AraNLP [39] and MADAMIRA [40] tools, which were used to extract semantic, syntactic, and morphological features. They used the RNN approach, and the network consisted of five hidden layers.
[41] proposed a DL model to analyse sentiment expressed at the aspect level. The proposed model is based on the LSTM network, where the input to the model is the text embeddings along with the aspect embedding.
A summary of the DL Arabic sentiment analysis models that have been proposed is presented in Table 1.

Figure 1 shows the overall architecture of the proposed Arabic Sentiment Analysis system using the BiLSTM deep learning model. It contains two main components: "data pre-processing and cleaning" and "sentiment classification". In the next sections, we give the details of each component.

Data Preprocessing and Cleaning
Sentence pre-processing, the first step in our method, converts the Arabic sentences to a form that is suitable for a sentiment analysis system. These pre-processing tasks include punctuation removal, Latin character removal, stop word removal, digit removal, tokenization, normalization, and light stemming. These linguistic techniques are used to reduce the ambiguity of words in order to increase the accuracy and effectiveness of our approach. The pre-processing of Arabic sentences consists of the following steps:

Tokenization
Tokenization is a method for dividing text into tokens. Words are often separated from each other by delimiters (white space, semicolons, commas, quotes, and periods). These tokens can be individual words (nouns, verbs, pronouns, articles, conjunctions, prepositions, punctuation, numbers, and alphanumerics) that are extracted without understanding their meaning or relationships. The list of tokens becomes the input for further processing. In this work, we use the "Tokenizer" from Keras¹, the Python deep learning library.
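As a rough illustration of what this step produces, the sketch below implements whitespace/punctuation splitting and the 1-based word-to-integer index that Keras's Tokenizer builds. It is a simplified stand-in, not the Keras implementation itself (which additionally supports filters, lowercasing, and vocabulary limits):

```python
import re

def tokenize(text):
    # Split on whitespace and the common separators listed above
    # (commas, semicolons, periods, quotes), dropping empty pieces.
    return [t for t in re.split(r"[\s,;.\"']+", text) if t]

def build_index(sentences):
    # Map each distinct token to an integer, first-seen order, starting at 1
    # (index 0 is conventionally reserved for padding).
    index = {}
    for sent in sentences:
        for tok in tokenize(sent):
            index.setdefault(tok, len(index) + 1)
    return index
```

The resulting index is what later turns each sentence into a sequence of integers for the classifier.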

Latin Characters Removal and Digits Removal
In addition, we remove the Latin characters and digits, which appear in the sentences and do not have any meaning or indications in our method.
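This removal can be sketched with a single regular expression; the exact character classes below (ASCII letters, ASCII digits, and Arabic-Indic digits) are an assumption about the filtering the authors apply:

```python
import re

def remove_latin_and_digits(text):
    # Drop ASCII letters, ASCII digits, and Arabic-Indic digits (U+0660-U+0669),
    # keeping Arabic letters and whitespace untouched.
    return re.sub(r"[A-Za-z0-9\u0660-\u0669]+", "", text)
```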

Word Normalization
Normalization aims to reduce letters that have different written forms to a single form. For example, we normalize "ء" (hamza), "آ" (alef madda), "أ" (alef with hamza above), "ؤ" (hamza on waw), "إ" (alef with hamza below), and "ئ" (hamza on ya) to "ا" (alef). Another example is the normalization of the letter "ى" to "ي" and of the letter "ة" to "ه". We remove the diacritics (fatha, damma, kasra, their tanween forms, and sukun), because these diacritics are not used in extracting the Arabic roots and are not useful in the proposed approach. Finally, we duplicate the letters that carry the shadda symbol "ّ", because these letters are used to extract the Arabic roots, and removing them would affect the meaning of the words.
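A compact sketch of this normalization follows. The letter mappings are one common scheme consistent with the description above, and the diacritic range used is the standard Arabic harakat block (U+064B–U+0650 plus sukun U+0652), with the shadda (U+0651) handled separately by letter doubling:

```python
import re

# Hamza-carrying forms normalized to bare alef, per the scheme described above.
ALEF_VARIANTS = re.compile("[\u0621\u0622\u0623\u0624\u0625\u0626]")
# Short-vowel diacritics and tanween (U+064B-U+0650) plus sukun (U+0652).
HARAKAT = re.compile("[\u064B-\u0650\u0652]")

def normalize(text):
    text = re.sub("(.)\u0651", r"\1\1", text)   # shadda: duplicate the letter it marks
    text = ALEF_VARIANTS.sub("\u0627", text)    # hamza forms -> bare alef
    text = text.replace("\u0649", "\u064A")     # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")     # ta marbuta -> ha
    text = HARAKAT.sub("", text)                # strip remaining diacritics
    return text
```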

Light Stemming
Light stemming is an affix removal approach: a process of stripping off a small set of prefixes and/or suffixes to find the root of the word. In this work, we use the Information Science Research Institute's (ISRI) stemmer [44]. It uses an algorithm similar to the word rooting of the Khoja stemmer [45], but it does not employ a root dictionary for lookup. In addition, if a word cannot be rooted, the ISRI stemmer normalizes it (e.g., removing certain determinants and end patterns) instead of leaving the word unchanged. Furthermore, it defines a set of diacritical marks and affix classes.
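To make the idea of affix stripping concrete, here is a toy illustration. It is not the ISRI algorithm (which uses much richer affix classes and length rules); the prefix/suffix lists are illustrative assumptions only:

```python
# Illustrative affix lists (assumed for this sketch): "al-" (the) and "wa-" (and)
# prefixes, and a few plural/feminine suffixes.
PREFIXES = ["\u0627\u0644", "\u0648"]                   # "ال", "و"
SUFFIXES = ["\u0648\u0646", "\u0627\u062A", "\u0629"]   # "ون", "ات", "ة"

def light_stem(word):
    # Strip at most one prefix and one suffix, keeping at least 3 letters,
    # mimicking the "small set of prefixes and/or suffixes" idea.
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word
```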

Sentiment Classification Using Deep Learning Model
In this section, we explain the background of our sentiment classifier architecture based on the BiLSTM model; Figure 2 shows our BiLSTM model. We use Keras's text pre-processing library to convert each sentence to a sequence of integers: it takes each word in the sentence and replaces it with its corresponding integer value from the vocabulary index. An entire sentence is thus mapped to a vector of size s, where s is the number of words in the sentence. We follow the zero-padding strategy of [46] so that all sentences have the same vector dimension X ∈ R^s (we chose s = 100).
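The integer encoding and zero-padding step can be sketched as follows; this plain-Python version stands in for Keras's `texts_to_sequences` and `pad_sequences` utilities, with post-padding assumed:

```python
def texts_to_padded(sentences, s=100):
    """Map each word to an integer (0 reserved for padding) and pad/truncate
    every sentence to a fixed length s."""
    vocab = {}
    seqs = []
    for sent in sentences:
        seq = []
        for w in sent.split():
            if w not in vocab:
                vocab[w] = len(vocab) + 1
            seq.append(vocab[w])
        seqs.append(seq)
    # Post-pad with zeros (or truncate) so every vector has dimension s.
    return [(seq + [0] * s)[:s] for seq in seqs], vocab
```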
LSTMs are part of the recurrent neural network (RNN) family: neural networks constructed to deal with sequential data by sharing their internal weights across the sequence [47]. A simple RNN computes its hidden state as

h_t = f(W_h x_t + U_h h_{t-1} + b_h)

where x_t is the current word embedding, W_h and U_h are weight matrices, b_h is the bias term, f is a non-linear function, usually chosen to be tanh, and h_t is the regular hidden state. LSTM addresses the problem of the vanishing error gradient and captures long-term dependencies by using its gates to manage the error gradient:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where i_t is the input gate, f_t the forget gate, o_t the output gate, c_t the memory cell, σ the sigmoid function, and ⊙ the Hadamard (element-wise) product. Intuitively, the forget gate decides which previous information should be forgotten, while the input gate controls what new information should be stored in the memory cell. Finally, the output gate decides how much information from the internal memory cell should be exposed. These gate units help an LSTM model remember significant information over multiple time steps [48]. A smaller version of the LSTM model is illustrated in Figure 3.
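The gate equations translate directly into code. The NumPy sketch below computes a single LSTM time step; weight shapes and initialization are left to the caller:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b are dicts keyed by gate name: 'i', 'f', 'o', 'c'."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                          # Hadamard products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

With all weights at zero, every gate outputs 0.5, so the new memory cell is simply half the previous one, which is a convenient sanity check.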
Figure 3: LSTM model architecture [48]

One drawback of the LSTM is that it does not sufficiently take post-word information into account, because the sentence is read in only one direction: forward. To solve this problem, we use what is known as a bidirectional LSTM: two LSTMs whose outputs are stacked together. One LSTM reads the sentence forward, and the other reads it backward. We concatenate the hidden states of each LSTM after they have processed their respective final word [47]. Technically, BiLSTM applies two separate LSTM units, one for the forward direction and one for the backward direction. The two hidden states h_t^forward and h_t^backward from these LSTM units are concatenated into a final hidden state h_t^bilstm:

h_t^bilstm = h_t^forward ⊕ h_t^backward

where ⊕ is the concatenation operator. [49] proposed a learning model based on LSTM for semantic relationship classification and found that BiLSTM can discover richer semantic information and make fuller use of contextual information than LSTM. [50] utilized BiLSTM to obtain high-level semantic features from word embeddings and complete sentence-level relationship classification.
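In Keras, the architecture described here can be sketched as below. The vocabulary size, embedding dimension, and number of LSTM units are illustrative assumptions; only the sequence length s = 100 and the 0.5 dropout rate come from this paper:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

# Architecture sketch under assumed hyperparameters (vocab 20000, embedding 128,
# 64 LSTM units per direction); s = 100 and dropout 0.5 follow the paper.
model = Sequential([
    Embedding(input_dim=20000, output_dim=128, input_length=100),
    Bidirectional(LSTM(64)),          # forward/backward states concatenated (128-dim)
    Dropout(0.5),
    Dense(1, activation="sigmoid"),   # binary positive/negative prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```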

Experiments and Results
In this section, we will evaluate the effectiveness of our proposed method using six benchmark sentiment analysis datasets. Section 4.1 presents the data sets. Section 4.2 describes the evaluation metrics. Section 4.3 details the parameters setting. Finally, Section 4.4 presents the experimental results.

Datasets
In this paper, the experiments are conducted using six benchmark sentiment analysis datasets drawn from various domains, which also demonstrates the flexibility of our model. We use only two sentiment classes, i.e., Positive and Negative, and we removed the objective class because the class distribution was highly skewed, and it is more important to focus on opinion classification than on subjectivity classification.

ASTD: Arabic Sentiment Tweets Dataset
The authors of [30] presented a sentiment analysis dataset collected from Twitter; the tweets were grouped into four categories (positive, negative, neutral, and objective).

ArTwitter: Twitter Data set for Arabic Sentiment Analysis
The authors of [43] manually built a labeled sentiment analysis dataset from Twitter. The dataset contains 2000 labeled tweets (1000 positive tweets and 1000 negative ones) collected using a tweet crawler.

LABR: A Large-Scale Arabic Book Reviews Dataset
In [51], they presented a large dataset of Arabic Book reviews. This dataset contains over 63,000 book reviews in Arabic. The book reviews were harvested from the website² during March 2013. Each book review comes with a review ID, the user ID, the book ID, the rating (1 to 5), and the text of the review.

MPQA: Multi-Perspective Question Answering
The authors of [52] presented a news articles dataset. The dataset contains news articles from a wide variety of news sources manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.)

Large Arabic Multi-domain Resources for Sentiment Analysis
[53] proposed a dataset of 33K automatically annotated reviews in the domains of movies, hotels, restaurants, and products. The datasets cover four domains as follows: i. Hotel Reviews (HTL): the hotel reviews were scraped from TripAdvisor³. ii. Restaurant Reviews (RES): the restaurant reviews were scraped from Qaym⁴ and TripAdvisor. iii. Movie Reviews (MOV): the movie-domain dataset was built by scraping reviews from Elcinemas⁵, covering around 1K movies. iv. Product Reviews (PROD): for the products domain, a dataset of reviews was scraped from the Souq⁶ website; it includes reviews from Egypt, Saudi Arabia, and the United Arab Emirates.

Evaluation metrics
To evaluate the performance of the sentiment analysis system, we employ four well-known metrics, namely Accuracy, Precision, Recall, and F-measure. These are widely used to measure sentiment prediction, and they are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 × (Precision × Recall) / (Precision + Recall)

where TP is the number of sentences that are positive and predicted correctly as positive, FP is the number of sentences that are negative and predicted incorrectly as positive, TN is the number of sentences that are negative and predicted correctly as negative, and FN is the number of sentences that are positive and predicted incorrectly as negative. Note that the higher the Precision, the more accurate the prediction of the positive class; a high Recall means a high number of sentences from a class are labeled with their exact class; the F-measure is a weighted average of Precision and Recall; and Accuracy simply reports the ratio of correctly classified sentences regardless of their class.
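These definitions translate directly into code, taking the four counts of the binary confusion matrix as inputs:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, Precision, Recall, and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```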
Besides, a comprehensive evaluation of classifier performance can be obtained from the ROC:

ROC = P(x | Positive) / P(x | Negative)

where P(x | C) denotes the conditional probability that a data entry x has the class label C. A ROC curve plots the classification results from the most positive to the most negative classification [54].
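A common scalar summary of the ROC curve, used as the "ROC score" in the results below, is the area under it (AUC). It equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one; a brute-force sketch (libraries such as scikit-learn compute this more efficiently from sorted scores):

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive scores higher than a random
    negative, counting ties as 0.5 (the Mann-Whitney formulation of AUC)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```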

Parameters setting
There are a number of parameters to tune. In this work, several experiments were conducted to find the optimal parameters, and only the parameters that yield the best results are reported. The training set is 80% of the whole dataset, and the test set contains the remaining 20%. The number of epochs is 10 for all experiments. To regularize the neural networks and avoid over-fitting, we apply Dropout with a dropout rate of 0.5.
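The 80/20 split can be sketched minimally as below; shuffling (which a real protocol would include, e.g. via scikit-learn's `train_test_split`) is omitted here for brevity:

```python
def train_test_split(data, train_frac=0.8):
    """Split a dataset into a train and a test portion by a fixed fraction."""
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]
```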

Experimental Results
This section compares the performance of the proposed model described in Section 3 with the state-of-the-art Arabic Sentiment Analysis methods. Additionally, we compare the results of our deep learning model with two prevalent traditional machine learning baselines, namely the Random Forest classifier and the Support Vector Machine classifier. We also report the impact of light stemming on our proposed approach.

Table 3 presents the comparison between our proposed approach and the state-of-the-art Arabic Sentiment Analysis methods on each dataset. Our deep learning model clearly improves sentiment classification performance: it achieves an accuracy of 72.25% on the ASTD dataset, 91.82% on the ArTwitter dataset, and 92.61% on the Main-AHS dataset, which outperforms the state-of-the-art methods.

Table 4 and Figure 4 show the detailed performance of our deep learning model on each of the six Arabic datasets against the two traditional machine learning baselines, the Random Forest classifier and the Support Vector Machine classifier. The results in Table 4 consistently reveal the superior performance of our BiLSTM model over the traditional machine learning baselines.

Our motivation for choosing the BiLSTM-based classification approach is that its forward and backward passes encapsulate contextual information during the various Arabic target-learning stages. More generally, we conducted further experiments applying two other deep learning methods under the same training settings: we compare our BiLSTM results with two popular deep learning models, the Convolutional Neural Network (CNN) and the Long Short-Term Memory (LSTM) network. [46] defined CNNs to have convolving filters over each input layer in order to generate the best features.
[55] confirmed that CNN is a powerful tool for selecting features in order to improve prediction accuracy. [56] showed the capabilities of LSTMs in learning data series by considering the previous outputs. As demonstrated in Table 5, the best accuracy is obtained by our BiLSTM model on the majority of the benchmark datasets used in the learning stage.
Furthermore, the results in Table 5 show that applying sentiment classification with light stemming gives a significant performance improvement of 3.36% on ASTD, 4.61% on ArTwitter, 1.37% on LABR, 1.95% on MPQA, 1.06% on Multi-Domain, and 1.97% on the Main-AHS dataset.
Additionally, in Figure 5 we report the ROC score for each baseline traditional machine learning method and each deep learning model on every dataset used in the experiments. As shown, the traditional machine learning baselines (SVM and Random Forest) generally have the lowest ROC scores across the different datasets compared with the deep learning models in general and the BiLSTM-based model in particular.
Furthermore, Figure 6, Figure 7, and Figure 8 illustrate the accuracies on the different datasets over 10 epochs, where each line represents a different deep learning model. The accuracy results confirm our findings.

Discussion
The objective of this work is to propose a novel Arabic sentiment analysis approach that overcomes the limited ability of the feed-forward model by extracting contextual information in both directions of the Arabic sentence. From the results presented in Section 4.4, we can highlight that our proposed method yields the best results in terms of sentiment prediction quality. BiLSTM shows better results than the other deep learning models (CNN and LSTM) on the majority of the benchmark datasets. This is due to the fact that BiLSTM can more effectively learn the context of each word in the text: it accesses both preceding and succeeding contextual features by combining a forward hidden layer and a backward hidden layer. Moreover, we found that BiLSTM can discover richer semantic information and make fuller use of contextual information than LSTM.
In addition, our results show the superiority of our BiLSTM model over the traditional machine learning baselines. This is due to the fact that deep learning algorithms achieve better accuracies than traditional machine learning methods; according to [57], large training data makes SVM inefficient and costly, as SVM does not scale to huge datasets. When the training data is noisy and imbalanced, the outcome of SVM can be affected due to its high training cost and low generalization [57]. For the Random Forest algorithm, the complexity grows with the number of trees in the forest and the number of training samples. Furthermore, our model was enriched with morphological features, including stems, in order to overcome the lexical sparsity and ambiguity issue. The results show that applying light stemming gives a significant improvement of 2.39% on average across the six datasets used in our experiments; according to [6], these features achieve significant performance improvements on data containing a mixture of MSA and DA.
The main advantages of this work are its capacity to effectively improve the quality of sentiment predictions by exploiting the benefits of Arabic pre-processing (tokenization, punctuation removal, Latin character removal, digit removal, normalization, and light stemming) and its ability to consider contextual information by dealing with both forward and backward dependencies.

Conclusion and Future work
In this work, we have addressed the Sentiment Analysis problem for Arabic text. This paper exploits the benefit of using a deep learning model to improve the performance of an Arabic Sentiment Analysis system. We used the BiLSTM deep learning model, with its ability to extract contextual information, to predict the sentiment of Arabic text. Experiments were conducted on six benchmark datasets to evaluate the performance of our presented approach. The results show the effectiveness of BiLSTM in modelling sequential data and in extracting contextual information by dealing with both forward and backward dependencies in the feature sequences. Comparisons with state-of-the-art baseline methods demonstrate that in most cases our deep learning model is more effective and efficient in terms of classification quality. Besides, the model achieves significant improvements in Accuracy and F1-measure over the existing models. We hope this model encourages further exploration of deep learning for Arabic sentiment analysis.
In future work, we plan to compare the impact of different recent contextualized word embeddings (e.g., GloVe, ELMo, ULMFiT, BERT, GPT, GPT-2, and XLNet) on the performance of our presented deep learning model using different Arabic sentiment analysis datasets. Furthermore, we plan to work on dialectal Arabic corpora to cover all variations of Arabic words.