Detecting Textual Propaganda Using Machine Learning Techniques

Social networking has come to dominate information dissemination worldwide. People often share information without knowing whether it is true. Social networks are now used to gain influence in many fields, such as elections and advertising, so it is not surprising that social media has become a weapon for manipulating sentiment by spreading disinformation. Propaganda is a systematic and deliberate attempt to influence people for political or religious gain. In this research paper, an effort is made to distinguish propagandist text from non-propagandist text using supervised machine learning algorithms. Data was collected from news sources from July 2018 to August 2018. After the text is annotated, feature engineering is performed using term frequency/inverse document frequency (TF/IDF) and Bag of Words (BOW). The relevant features are supplied to Support Vector Machine (SVM) and Multinomial Naïve Bayesian (MNB) classifiers. The SVM is fine-tuned by trying linear, polynomial, and RBF kernels. SVM showed better results than MNB, with a precision of 70%, recall of 76.5%, F1 score of 69.5%, and overall accuracy of 69.2%.


Introduction:
In this era of technology, computer science plays an essential role in providing solutions in almost all emerging fields. Since the advent of the internet in the 1970s, computer science has improved dramatically and has found its place in multidisciplinary subjects such as remote sensing, technical diagnosis, traffic control systems, criminology, medical imaging, image processing, data mining, and automatic surveillance. Owing to these applications, the number of hardware and software products on the market has grown tremendously (1). Data analytics, which discovers patterns in large data sets, is now a major research area and is integrated with fields such as bioinformatics, Natural Language Processing (NLP), and Machine Learning (2). In data mining, important information is mined from text, images, video, and other sources. Data mining can perform descriptive or predictive tasks: descriptive tasks characterize the data, whereas predictive tasks project the future on the basis of past data (3). Clustering, correlation, and pattern finding are some of the tasks in data mining (4). Analyzing Online Social Networks (OSNs) is challenging because of their enormous usage, variety, volume, and veracity, and because their data arrives in real time. OSNs are computer-mediated communication tools that assist the creation and distribution of information, concepts, business interests, and new practices of communication via virtual communities and links (4). OSNs are increasingly used for purposes such as advertising, brand building, political campaigning, recommender systems, customer response, and sentiment analysis (6,7).
Social network penetration worldwide is increasing day by day. In 2017, about 71 per cent of internet users were using social networks, and the number of social media users was estimated to rise from 2.46 billion in 2017 to around 2.77 billion in 2019; the statistics are shown in Fig. 1. The modern age is the age of technology: advances in information and communication technology (ICT) have changed the classical process of sending and receiving information. Everyone wants to stay updated, irrespective of their behavioural differences. The adversarial use of social media for spreading deceptive or misleading information is common and amounts to social, economic, and political intimidation (8). Social media has been used effectively by fearmongers to trigger mass hysteria and panic (9). Three types of attacks take place in cyber network operations: physical, syntactic, and semantic attacks.

Physical attacks
Physical attacks affect the hardware of the system and occur at the physical level. When the hardware is removed, the system disappears as well. Physical attacks need no further elaboration here because they fall outside the scope of our research problem.

Syntactic attacks
Syntactic attacks exploit the technology itself. They occur at the syntactic level, which contains the instructions that designers and users give to a machine and the protocols through which machines interact with one another. Hacking is one type of syntactic attack, whose main aim is to steal data; a hacker can alter anything in a system arbitrarily. Since our work concerns text analytics, this type of attack is also outside our scope.

Semantic Attacks
In semantic attacks, human-computer interfaces are attacked. The effects of semantic attacks are not as visible as those of physical and syntactic attacks. Semantic attacks are the most dangerous because they change the content or the meaning of information (10). They are divided into categories, namely overt attacks (including phishing, spam, etc.) and covert attacks. Our focus is on covert attacks, i.e. misinformation, disinformation, and propaganda (11). This domain is an emerging field of research and is being critically studied by researchers. Many authors have suggested using the social computing properties of users on online social networks to determine the credibility of information. Social networks are used heavily for spreading fabricated information, and spreading information becomes easier for users as the usage of OSNs increases. Information can be truthful or dubious, and dubious information can be shared deliberately or unintentionally. Dubious information can be categorized as misinformation or disinformation. Misinformation is information whose truthfulness is unknown to the user who spreads it, whereas disinformation is information that the user deliberately shares or spreads, whether false or true (12). Disinformation usually occurs in politics, health, finance, technology, etc. Orchestrated astroturf is used for manipulating political conversations, even during elections (13) (14). Compromised accounts are used for spreading disinformation and may also be used for spreading propaganda. Propaganda is a type of disinformation, defined as the systematic and deliberate process of shaping opinions, influencing thoughts, and directing the behaviour of a person to achieve the desired intention of a propagandist.
Propaganda identification in online social networks is the need of the hour, since propaganda is mainly used to win people's faith in some person, community, or party. Propaganda plays a major role in politics, and political propaganda in particular has attracted much interest from researchers around the world. In the 2016 USA presidential election, political propaganda played a major role in the victory of Donald Trump (15). Radical propaganda can be spread by posting messages related to violence, religion, sectors, and events (16). Because of the semantic nature of propagandistic text, it is difficult to separate text into two classes (propaganda and non-propaganda). Until now, propaganda has been analysed manually. Machine learning has shown promising results in many fields, from biological science to arts and humanities (17). The prevalence of manipulation tactics on social media in 65 countries is shown in Fig. 2. In this work, a framework is used that classifies text into two classes using machine learning algorithms. This system should help researchers to carry out further work in this area. Hybrid feature engineering is used by combining TF/IDF and Bag of Words. The paper is organized as follows: a survey of the existing literature related to the proposed work comes first, followed by the proposed methodology, covering data extraction, pre-processing and feature extraction, and the classification algorithms. The subsequent sections discuss the findings and experimental results and the future scope of research in this area.

Motivation
The modern age is the age of the internet, especially social media, and the increasing use of online social networks brings more chances for this platform to be exploited. Social networks are used to share information, but the information shared or posted can be misinformation or disinformation. Propaganda is shared on social networks on an enormous scale and is usually used for political influence. Detecting propaganda in time can stop a great deal of the false information shared by hatemongers. It can thus help in activities such as law enforcement and cyber-crime analysis. The significant contributions of this work are as follows:
(i) A novel dataset of 6k records is built by manual annotation, informed by 16 different propaganda techniques.
(ii) A hybrid approach to feature engineering is taken by combining two feature extraction techniques.
(iii) To our knowledge, this is the first work in which the accuracy of human-annotated classification of propagandist and non-propagandist text is significantly replicated by machine classification.

Related Work:
The ever-growing appeal of social networks affects our daily life, directly or indirectly, by leading us to trust other people's ideas and suggestions when making small or big decisions, from buying low-cost minor products to voting online in elections for the formation of a new government. It is not surprising that social media has become a weapon for manipulating sentiment by spreading disinformation. False content and propaganda are used extensively on social media and must be identified and countered. H. Gao et al. (18) studied the characteristics of malicious accounts, focusing on messages that contain URLs in text form. As the number of social media users grows, events related to politics and government schemes are widely discussed, and abuse on social media has become common. J. Ratkiewicz et al. (19) proposed a method for detecting and tracking political abuse in social networks. At election time, the structure of the network deviates from its normal structure. A. Halu et al. (20) studied the structure of social networks during an election campaign. They proposed a model for opinion dynamics showing that, in a wide region, parties survive by gathering a finite fraction of the votes at election time. N. Ramakrishnan et al. (21) analyzed and mined data regarding civil unrest in online social networks. They extracted data related to specific events, identified the contributing factors, and analyzed event evolution. J. S. Liu et al. (22) identified the inner circles of [...] (25) identified events in the social network. Three models for detecting unspecified events in social networks are used: LDA for topic modelling, TF/IDF for document clustering, and wavelet analysis for feature clustering. S. Lightfoot (26) studied the effect of social bots on politics (political propaganda through social bots). According to the author, social bots have only a negative effect on politics.
The study found that social bots played a vital role in spreading fake news and that accounts which continuously spread misinformation are significantly more likely to be bots. People with radical views use social media solely for spreading hate and fear. M. Ashcroft et al. (27) proposed a semantic graph-based approach for radicalization detection in social media. Pro-ISIS users tend to discuss religion, historical events, and ethnicity more, while anti-ISIS users focus more on politics, geographical locations, and interventions against ISIS. They also shared dubious information in the social network. K. K. Kumar et al. (28) detected misinformation on online social networks using cognitive psychology. The cognitive process considers the consistency of the message, the coherency of the message, the credibility of the source, and the general acceptability of the message. The data sets were collected through the Twitter API using keywords such as Syria and Egypt. G. Mazzoleni et al. (29) identified a socially mediated type of populist communication that is profoundly affected by the specific nature of social media. They measured the degree of populist communicative ideology on the basis of appeals to the people, attacks on the elite, and ostracizing of others.

Proposed Methodology:
An elaborate analytical and experimental methodology is used, based on an extensive literature survey and discussion of the problem with experts in the field. Taking into account the various factors that influence the performance of the model, real-time data was collected from Twitter and submitted to the model, and the final results were generated. The overall methodology comprises five main steps: (i) corpus/data collection, (ii) data annotation, (iii) pre-processing and feature extraction, (iv) machine classification, and (v) evaluation and validation. The complete methodology used in our work is depicted in Fig. 3. Data collection is the first phase of the proposed methodology; in this phase, the Twitter API is used to extract data from Twitter. Several steps must be followed before accessing the Twitter API:
1. Apply for a developer account.
2. Make an application related to your work.
3. Get the credentials, such as the Access Token and Secret Token.
4. Use these credentials for data extraction with the help of a programming language.
The data is extracted using hashtags such as #propaganda, #Hoaxes, and #Falseflag. The R language is used to extract data from Twitter. Data is also extracted from online news outlets and broken into sentences such that the length of each news sentence does not exceed 300 characters. Figure 4 gives the length distribution of the news/tweets. After extraction, the data must be labelled, because supervised classification requires labelled data. For this task, several annotators labelled the data.
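The 300-character limit on news sentences can be illustrated with a minimal Python sketch. The function name `split_news`, the regex-based sentence boundary, and the choice to wrap over-long sentences at word boundaries are all assumptions made for illustration; the paper does not describe how the splitting was actually implemented.

```python
import re
import textwrap


def split_news(text, max_len=300):
    """Split news text into sentence chunks of at most max_len characters."""
    # Naive sentence boundary (assumption): '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    for sentence in sentences:
        if not sentence:
            continue
        # Assumption: a sentence longer than max_len is wrapped at word boundaries.
        chunks.extend(textwrap.wrap(sentence, width=max_len))
    return chunks


if __name__ == "__main__":
    article = "First sentence. Second one! " + "word " * 100
    for chunk in split_news(article):
        print(len(chunk), chunk[:40])
```

Every returned chunk respects the limit, so the resulting items are comparable in length to tweets.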

Figure 4. News Length Distribution

Annotation
To classify posts as propagandist or not, experts in the field were consulted for their input and suggestions. Two leading journalists and two computer science professionals were given the task of annotating the data. The annotators were asked to classify the data on the basis of semantics and context. Binary classification is used in this work. The annotators judged whether each post was propaganda or non-propaganda by considering techniques such as bandwagon, glittering generalities, card stacking, repetition, transfer, and name-calling.

Corpus Collection
The experiments were performed on Twitter, a social networking site widely used by celebrities, politicians, and others to express their views on current issues around the world. Hashtags such as #Propaganda, #Hoaxes, and #FalseFlag were used to extract data from Twitter. After 20K tweets were extracted, they were annotated and labelled into two categories (0, 1), where 0 means non-propaganda and 1 means propaganda, with the help of researchers working on similar problems. Figure 5 shows a visual representation of the labelled classes with their corresponding length in characters. A supervised dataset of 6K tweets/news items has been developed. To remove bias, equal numbers of labelled tweets/news items are used; that is, if $T$ is the total number of labelled tweets, $X_1$ the number of tweets labelled as propaganda, and $X_2$ the number of tweets labelled as non-propaganda, then

$$T = \sum_{i=1}^{2} X_i, \qquad X_1 = X_2.$$

After annotation, 60% of the tweets/news items were found to be about politics and propagandist in nature.

Feature Engineering:

Data Pre-processing
Given the collected data, pre-processing and feature design are concerned with the source text. Unidecode was used to convert the raw input string, transliterating Unicode characters into ASCII to reduce variation. The data was pre-processed by removing irregular text such as non-English text, hashtags, URLs, and stop words. The next step is tokenization, in which the text is tokenized by splitting punctuation off from words, and part-of-speech tagging assigns a morphosyntactic category to each token. In the last step, lemmatization reduces each token to its base form, as shown in Fig. 6.
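The cleaning and tokenization steps described above can be sketched as follows. This is a simplified, standard-library-only illustration, not the authors' pipeline: `unicodedata` stands in for Unidecode, the stop-word list is a tiny hypothetical sample, and part-of-speech tagging and lemmatization are omitted because they would normally come from an NLP library.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Words (with an optional apostrophe suffix) or single punctuation marks.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?|[^\w\s]")


def to_ascii(text):
    """Stand-in for Unidecode: decompose accents, then drop non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def preprocess(text):
    """Return lower-cased word tokens with URLs, hashtags and stop words removed."""
    text = to_ascii(text)
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    # Tokenize, splitting punctuation off from words.
    tokens = TOKEN_RE.findall(text.lower())
    # Keep word tokens only and drop stop words.
    return [t for t in tokens if t[0].isalnum() and t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("The café is OPEN! Visit https://example.com #Propaganda now."))
```

The sketch removes the whole hashtag, following the description in the text; keeping the hashtag word while dropping only the `#` would be an alternative design.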

Feature Selection
After pre-processing, a set of features was defined for identifying propaganda: TF-IDF, sentence length, readability grade level, emotion, LIWC, emphatic, and semantic features. Term Frequency/Inverse Document Frequency (TF/IDF) is a numeric measure that scores a word/term in a tweet/news item based on how often it appears in that item and in a given collection of tweets. If a term appears frequently in a tweet/news item, it is taken to be important and is assigned a high score. Term frequency is the number of times a word/term appears in each news item/tweet. The following equation calculates TF/IDF for our corpus.
$$\mathrm{tfidf}(t, w, D) = \mathrm{tf}(t, w) \cdot \mathrm{idf}(t, D), \qquad \mathrm{idf}(t, D) = \log \frac{|D|}{1 + |\{w \in D : t \in w\}|}$$
where $t$ is the term used as a feature, $w$ denotes each tweet in the corpus, and $D$ is the collection of all tweets/news in the dataset (the document space), so that $|D|$ is the total number of tweets/news. Bag of Words features consist of word and lemma unigrams, bigrams, and trigrams. Bigrams and trigrams were included so that more information can be extracted from the text. The extracted features are shown in Fig. 7.
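To make the TF/IDF score and the n-gram features concrete, here is a small plain-Python sketch. It assumes raw counts for term frequency and a natural logarithm in the inverse document frequency; the function names and the toy corpus are hypothetical and are not taken from the paper.

```python
import math


def ngrams(tokens, n_max=3):
    """Return all uni-, bi- and trigrams of a token list as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tf(term, doc_terms):
    """Term frequency: raw count of the term in one document (assumption)."""
    return doc_terms.count(term)


def idf(term, corpus):
    """Inverse document frequency with 1 added to the document count."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + containing))


def tfidf(term, doc_terms, corpus):
    return tf(term, doc_terms) * idf(term, corpus)


if __name__ == "__main__":
    docs = [["fake", "news", "spreads"], ["news", "today"],
            ["weather", "today"], ["sports", "update"]]
    corpus = [ngrams(d) for d in docs]
    print(corpus[0])
    print(tfidf("fake", corpus[0], corpus))
```

Note that, because of the added 1 in the denominator, a term that occurs in every document receives a slightly negative IDF under this formula.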

Classification
In the text classification literature, Support Vector Machine (SVM) and Multinomial Naïve Bayes (MNB) were found to be frequently used algorithms for classification tasks. SVM is a supervised machine learning approach used for classification as well as regression problems, whereas MNB applies the classical Bayes algorithm to text classification (30). Here SVM is used for classification because our data size is not too large. In our work, the data is plotted in an n-dimensional space (where n is the number of relevant features; in our work 40 relevant features are selected), and the value of each feature is the value of a particular coordinate. The training set consists of N data points, each representing the words used in a sentence. It has been observed that, when tuned properly, SVM and MNB show higher accuracy than other, more complex machine learning algorithms.
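As an illustration of the Multinomial Naïve Bayes decision rule mentioned above, the following plain-Python sketch trains on token counts and predicts the class with the highest log-probability. The class name `TinyMultinomialNB`, the Laplace smoothing parameter `alpha`, and the toy sentences are assumptions made for this example; the SVM side, which would normally rely on a library implementation, is not shown.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Minimal Multinomial Naive Bayes over token lists (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed value)

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        self.log_prior = {c: math.log(labels.count(c) / len(labels))
                          for c in self.classes}
        return self

    def log_score(self, doc, c):
        """log P(c) plus the sum of log P(token | c) over known tokens."""
        score = self.log_prior[c]
        denom = self.totals[c] + self.alpha * len(self.vocab)
        for tok in doc:
            if tok in self.vocab:  # tokens unseen in training are ignored
                score += math.log((self.counts[c][tok] + self.alpha) / denom)
        return score

    def predict(self, doc):
        return max(self.classes, key=lambda c: self.log_score(doc, c))


if __name__ == "__main__":
    train_docs = [["enemy", "traitors", "destroy"], ["traitors", "everywhere"],
                  ["council", "meeting", "today"], ["meeting", "agenda", "published"]]
    train_labels = [1, 1, 0, 0]  # 1 = propaganda, 0 = non-propaganda
    model = TinyMultinomialNB().fit(train_docs, train_labels)
    print(model.predict(["traitors", "destroy", "nation"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once sentences contain more than a handful of tokens.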

Experimental Results and Discussions:
In our experiment, linear SVM and Multinomial Naïve Bayesian classification algorithms were used. After the SVM was fine-tuned, the results showed better accuracy and performance on the given data set. The dataset was divided in the ratio 70:30: 70 per cent of the labelled data was used for training and 30 per cent for testing. The dataset was not balanced, as more of the textual data was non-propaganda than propaganda. Descriptive statistics such as the mean and standard deviation were computed, and the statistical results are shown in Fig. 8. This statistical analysis was performed to gain knowledge about the data that is useful for classification. The work was implemented in Python using the Spyder IDE. The input is given in the form of a [...] Comparing the results of the algorithms, it was found that Support Vector Machine is better for text classification on our problem. The comparative analysis of these algorithms is shown in Table 3 and graphically in Fig. 10. The analysis revealed that propagandistic text is longer than non-propagandistic text. The majority of the data used in our research was related to politics. More data spanning many fields needs to be collected for better analysis of propaganda. More human effort is needed to label the text into various categories, and as the data grows, manual annotation becomes harder. An automatic annotation program is needed that learns from the semantics of the supplied text. More feature engineering is needed to achieve better accuracy. Long Short-Term Memory (LSTM) networks may be used in future to achieve better results.
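The 70:30 split and the evaluation measures used in this section can be sketched in plain Python as follows. The function names, the fixed random seed, and the toy label vectors are assumptions for illustration only; they do not reproduce the experiment.

```python
import random


def train_test_split(items, labels, test_ratio=0.3, seed=42):
    """Shuffle indices and hold out test_ratio of the data for testing."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(idx) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])


def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def macro_average(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class precision, recall and F1."""
    rows = [per_class_metrics(y_true, y_pred, c) for c in classes]
    return tuple(sum(col) / len(classes) for col in zip(*rows))


if __name__ == "__main__":
    y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
    y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1]
    print(per_class_metrics(y_true, y_pred, 1))
    print(accuracy(y_true, y_pred), macro_average(y_true, y_pred))
```

As a side observation, the unweighted mean of the per-class SVM values reported in the Conclusion (precision 0.69 and 0.71, recall 0.99 and 0.54, F1 0.81 and 0.58) gives 0.70, 0.765, and 0.695, which matches the 70%, 76.5%, and 69.5% quoted in the abstract and suggests that those figures are macro-averages.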

Conclusion:
As sharing data has become easy with the increasing use of the latest social networking platforms, dubious information is being posted at a tremendous rate. In this paper, propagandist text, a form of dubious information that can be true or false, was considered. Propaganda is widely used to gain influence among ordinary people, mostly at election time. The Support Vector Machine and Multinomial Naïve Bayesian algorithms were chosen for classification, and various features were selected for this task. The results show that SVM achieves 69% accuracy, with an F1 score of 0.81 for the Non-Propagandist class and 0.58 for the Propaganda class. In contrast, MNB achieves 68% accuracy, with an F1 score of 0.78 for the Non-Propagandist class and 0.58 for the Propaganda class. It was also found that propagandistic sentences are longer than non-propagandistic ones. SVM outperforms MNB, with recall of 0.99 and 0.54 and precision of 0.69 and 0.71 for the two classes, respectively. In future, other machine learning algorithms can be tried to achieve better accuracy, F1 score, precision, and recall. Ensemble and deep learning techniques may also be used, and more feature extraction methods could be applied to extract relevant features.