Fake News Detection Model Based on Machine Learning Algorithms

Abstract
The rapid growth of the internet and easy communication have made it quick and simple to create and spread news. Social media users now generate and share more information than ever before, but some of it is false and unrelated to reality. Detecting false information in text is challenging, even for experts, who need to consider multiple factors to determine authenticity. Malicious misinformation on social media negatively affects societies, especially during crises like terrorist attacks, riots, and natural disasters. To minimize the harmful impact, it is crucial to identify rumors quickly. This study aims to build a learning model for detecting fake news. This research relies on finding and analyzing the characteristics of the text; the words are then converted into features using the TF-IDF technique, after which the highest-ranking features are identified in order to study and distinguish whether spreading news is real or fake using machine learning techniques. Finally, the Logistic Regression, Decision Tree, Gradient Boosting, and Random Forest algorithms were applied. The accuracy of Logistic Regression is 0.985 and that of Random Forest is 0.989, whereas the accuracy of Decision Tree is 0.994 and that of Gradient Boosting is 0.9949.


Introduction
The internet and communication technologies have revolutionized communication, making it faster and more accessible. Popular social networks like Facebook have played a key role in spreading news rapidly, replacing traditional print media. For example, Facebook referral traffic accounts for 70% of all traffic to news websites 1. In the era of fast and efficient information flow through online platforms, people are susceptible to deception and manipulation, leading to lasting consequences. The popularity of social networking websites has enabled global information sharing and consumption, making news readily accessible. The way that information is disseminated around the world and on the web has changed recently as a result of web-based media 2. The mass distribution of false information has detrimental effects on people and society. It disrupts social authenticity, promotes biased opinions, particularly through political propaganda, and hampers the comprehension of and response to real news 3. Text categorization is the technique of organizing and classifying texts based on their content. It plays a crucial role in natural language processing (NLP), especially in tasks like subject labelling, spam detection, and sentiment analysis. NLP enables the automatic detection of relevant medical research documents, journals, and other sources worldwide by selecting specific labels or categories. However, evaluating the similarity of training dataset inputs remains important even after categorization. By employing NLP, machine learning, and data mining, patterns in electronic texts are automatically discovered and uncovered 4. The technology's main goal is to enable people to handle tasks involving text mining and information extraction from textual tools.

(Published Online First: January 2024. https://doi.org/10.21123/bsj.2024.8710. P-ISSN: 2078-8665, E-ISSN: 2411-7986. Baghdad Science Journal.)
Information extraction (IE) technologies aim to extract exact information from text-based materials. This is the first method showing that the terms "text mining" and "data extraction" can be used synonymously 5. Many studies have concentrated on identifying and categorizing fake news on social media platforms, including in web content and social media 6.
In 7, the authors employed the vectorizers Count-Vectorizer (CV), Hashing-Vectorizer (HV), and Term Frequency-Inverse Document Frequency (TF-IDF) to categorize fake news, and a multi-level voting ensemble model was deployed. The study in 8 addressed the detection of false content in an effective manner. The proposed methodology for identifying fake news was divided into four stages: data pre-processing, feature reduction, feature extraction, and classification. Data pre-processing involves tokenization, stop-word elimination, and stemming. To increase accuracy, the features are reduced in the second stage using PPCA. Many techniques were used for the extraction of text features 9. The authors proposed creating word embeddings for the titles of Reddit submissions using InferSent and BERT, while VGG16, EfficientNet, and ResNet50 were used to extract features from the Reddit submission thumbnails. The research in 10 suggested upgraded deep neural networks for fake news detection: two deep learning methods, an updated network (one to three layers) as well as an improved GRU (one to three layers). The researchers investigated a sizable sample of tweets containing information about COVID-19 in particular. The automated technique in 11 was motivated by natural human behaviour in detecting bogus news: humans naturally double-check new knowledge against reputable sources. A machine learning (ML) framework that automates the act of cross-checking fresh information against a collection of predetermined reputable sources was built using natural language processing (NLP); it checks whether the wording of a tweet is consistent with reliable news using a Random Forest model. Particle swarm optimization (PSO), the genetic algorithm (GA), and the salp swarm algorithm (SSA) were used in the method suggested in 12 to implement three wrapper feature selections for evolutionary classification, in order to limit the number of symmetrical features and achieve high accuracy.
The main objective of this work is to design, implement, and evaluate a proposed system with Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting classifiers, in order to help reduce the harm that fake news causes to the general public and the media ecosystem. It is vital to create tools that detect fake news on social media automatically. Social bots are one of the most common contributors of fake news: they automatically create material and disseminate it to users of social media. Although the technologies used by social media platforms have been carefully updated, they are still not effective in filtering fake news or disseminating critical information. More work is required before the news reaches the world at large.

Fake News Terminology
Fake news is any information that pertains to society and makes statements or descriptions about anything specific in speeches, bulletins, articles, headlines, postings, microblogs, or tweets 13 .
Despite the quantity of fake news, there are mostly two categories: partly false news, in which a crucial piece of information is missing, and totally fake news, in which some or all of the news is made up to give it a completely new meaning or merely to make it more interesting. The world of fake news is broad, fleeting, and its paths somewhat blurred, even if there are occasions when many forms of disinformation are transformed into fake news. To understand this, one must be aware of the many forms of fake news 14. On the basis of distribution channels and content, fake news can be divided into a number of sub-categories, among them:

Content Based Fake News
Based on content, fake news is categorized according to the type of content it contains. People may mistake news that is satirical or contains material meant as a joke for actual news 15. A commentary may express personal thoughts or opinions that are not well informed, but the person or group delivering it may have a large number of followers who value their comments regardless 6. Content-based fake news can be further classified into the following types 15:
 Fabricated stories: Stories that are completely fake and are spread to trick people, frequently for political or economic gain.
 Misleading headlines: Headlines that are intended to attract more attention but are not totally true or representative of the story's content 16.
 Satire or parody: Stories that are meant to be humorous but are often mistaken for real news.
 Propaganda: Information that is disseminated to advance a specific political agenda, frequently by misrepresenting or presenting the facts in a biased way 16.
 Hoaxes: Stories that are circulated with the intention of misleading others, frequently in order to cause harm or sow fear 17.
 Clickbait: Headlines that often offer little real information or content but are intended to tempt readers to click on a story. It is a trap meant to generate advertising revenue or spread malware; when a user interacts with it online, they are sent to an advertisement or a dangerous page 15.

Propagation-Based Fake News
Propagation-based fake news focuses on the origin and spread of false information.

Data Pre-processing
The data needs to be pre-processed before being used in the training, testing, and modelling procedures. The real and fake news are combined before proceeding to subsequent phases. Tokenization describes interpreting and grouping isolated tokens to generate higher-level tokens, in addition to dividing strings into fundamental processing units. In word tokenization, preprocessed raw texts are divided into textual units 18,19. During the dataset cleaning stage, the columns that were not needed for processing were removed from the datasets. Punctuation and stop words were also removed. Stop-words, such as "I," "are," "will," "shall," and "is," are words that often appear in sentences. URLs were removed, because the URLs in the articles have no significance. Uppercase letters were changed into lowercase. Pre-processing is regarded as an essential stage in natural language processing applications such as keyword extraction and search engine optimization, and is required to transform the information collection into the proper organization. All NaN values were removed from the dataset. The dataset appeared fine after cleaning, and the next step was exploration.
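The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact code: the stop-word list here is a small sample (the paper does not give its full list), and `clean_text` is a hypothetical helper name.

```python
import re
import string

# A small illustrative stop-word list; the paper's exact list is not given.
STOP_WORDS = {"i", "are", "will", "shall", "is", "it", "the", "a", "an", "to"}

def clean_text(text):
    """Apply the cleaning steps described above: lowercase, strip URLs,
    remove punctuation, drop stop-words, and tokenize into words."""
    text = text.lower()                                  # uppercase -> lowercase
    text = re.sub(r"https?://\S+|www\.\S+", "", text)    # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                                # word tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("The President WILL visit https://example.com today!"))
```

In a real pipeline, this function would be applied to every article before feature extraction, alongside dropping unused columns and NaN rows (e.g. with pandas `dropna`).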

Feature Extraction
Information extraction and natural language processing employ the statistical measure TF-IDF (Term Frequency-Inverse Document Frequency) to assess the significance of a word in a document. TF-IDF is intended to reflect a word's significance in a text in relation to the corpus of documents as a whole 20. Term Frequency (TF) is the frequency at which a term appears in a document, divided by the total number of words contained in the document. Inverse Document Frequency (IDF), on the other hand, is a value that indicates how uncommon a word is within the corpus of texts. It is determined by taking the logarithm of the total number of documents divided by the number of documents containing the word 21. A word's significance in a document is shown by the resulting TF-IDF value, with higher values reflecting greater significance. In addition to feature extraction and dimensionality reduction in machine learning models, TF-IDF is frequently employed in document classification, information retrieval, and text mining applications. Scikit-learn, a Python library, was utilized; this library is ideal when using the TF-IDF vectorizer model. Tasks involving topic modelling, NLP, and machine learning all profit tremendously from TF-IDF, which makes it easier for algorithms to forecast outcomes using word importance. In Eq. 1, the formula for determining the TF is displayed 20:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)    Eq. 1

The IDF, or inverse document frequency, is the next factor that must be determined to make sure the model functions effectively. It is employed to evaluate how significant a term is across the entire dataset. The IDF formula is displayed in Eq. 2:

IDF(t) = log_e(Total number of documents / Number of documents with term t in them)    Eq. 2
The TF-IDF should then be determined as the next step. Eq. 3 depicts the formula for the inverse document frequency integrated into the term frequency:

TF-IDF(t) = TF(t) × IDF(t)    Eq. 3

The TF-IDF algorithm extracted the features from the real and fake news in the used dataset and identified the most pertinent terms.
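Eqs. 1-3 can be computed directly in Python. The sketch below uses a tiny toy corpus (not the paper's dataset) to show how a rare term receives a higher weight than a common one:

```python
import math

def tf(term, doc):
    # Eq. 1: term count divided by the total number of words in the document
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Eq. 2: natural log of (total documents / documents containing the term)
    n_containing = sum(1 for d in corpus if term in d.split())
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    # Eq. 3: TF multiplied by IDF
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "president signs new bill",
    "fake moon landing story spreads",
    "president denies fake story",
]
# "president" appears in 2 of 3 documents, "bill" in only 1,
# so "bill" gets the higher IDF weight.
print(round(tf_idf("bill", corpus[0], corpus), 3))  # → 0.275
```

Note that scikit-learn's `TfidfVectorizer` applies a smoothed variant of this IDF by default, so its values differ slightly from this textbook definition.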

Classification
In data mining and machine learning, classification involves training a model to identify a given input's class or category based on a collection of attributes. It is a supervised learning technique in which the model is trained on a labelled dataset with the intention of discovering the link between the features and the target class 1. The proposed system uses more than one method of classification, including:

Logistic Regression: In machine learning, logistic regression establishes a link between the features of a given input and the likelihood (probability) of a given outcome 22. A logistic regression classifier is used when a categorical value is being predicted 23. Logistic regression is a supervised learning method employed to ascertain or predict the probability that a binary (yes/no) event will take place. Predicting whether or not news is likely to be fake is an example: the classifier responds with true or false when predicting the value. Considering its ease of use, effectiveness, and interpretability, logistic regression is a popular and frequently applied technique 23.
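A minimal sketch of the TF-IDF-plus-logistic-regression pipeline described above, using scikit-learn. The six texts and their labels are invented for illustration; the paper's actual dataset is not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labelled corpus (1 = fake, 0 = real); purely illustrative.
texts = [
    "shocking miracle cure doctors hate this hoax",
    "aliens secretly control the government hoax",
    "celebrity faked moon landing hoax claims",
    "parliament passes the annual budget bill",
    "city council approves new transport plan",
    "central bank holds interest rates steady",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF features feeding a logistic regression classifier,
# mirroring the pipeline described in the paper.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(texts, labels)

# predict_proba gives the model's estimated probability of the "fake" class
proba_fake = pipe.predict_proba(["miracle hoax cure shocks doctors"])[0][1]
print(f"P(fake) = {proba_fake:.2f}")
```

The probability output is what makes logistic regression interpretable: a threshold of 0.5 turns it into the binary true/false answer described above.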

Random Forest Classifier:
A random forest classifier is a method for creating various decision trees and combining them to provide a more precise and reliable prediction 24. A random forest's hyperparameters are essentially identical to those of a decision tree or bagging classifier. While creating the trees, this method increases the model's randomness 25,26. The diverse random trees each cast a vote, and the value with the most votes is the final result of this classifier.

Decision Tree Classifier:
A decision tree classifier has nearly the same hyperparameters as the random forest, and the method increases the model's randomness through the development of the trees. Decision trees are well-known non-parametric supervised learning techniques that can be applied to tasks like classification and regression 25. Classification trees are tree models in which the target variable can assume a discrete set of values. Based on the Gini index, decision trees perform well and can be built rapidly, which also makes them well suited for detecting false news.
Gradient Boosting Classifier: Gradient boosting is a kind of ensemble learning algorithm applied to classification and regression tasks in supervised machine learning. To build a stronger, more precise model, it pools several weak learners (like decision trees). The fundamental principle behind gradient boosting is to train several models in succession, with each model attempting to fix the mistakes made by the model before it 27. To reduce a loss function that represents the model's error, the gradient boosting approach optimizes the weak learners' parameters 28. Through a weighted sum or voting method, the predictions of all the weak learners are combined to create the final prediction. Gradient boosting classifiers are well known for their high accuracy and robustness.
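The four classifiers described above can be trained and compared side by side with scikit-learn. The sketch below uses a synthetic feature matrix as a stand-in for the TF-IDF features; the scores it prints are for the synthetic data, not the paper's reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the TF-IDF feature matrix; the real news dataset
# and its extracted features are not reproduced here.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                                # train each classifier
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

The same loop structure applies to the real pipeline: only the feature matrix changes.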

Results and Discussion
Various criteria were employed to assess the performance of the algorithms, most of them built on the confusion matrix 29. The classification model's performance on the test set is tabulated as a confusion matrix, which includes the four parameters true positive (TP), false positive (FP), true negative (TN), and false negative (FN), as shown in Table 1.
Accuracy: The most often used measure of correctly predicted observations, whether true or false, is accuracy 30. To assess a model's accuracy, Eq. 4 is used:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    Eq. 4

Table 1. Confusion matrix
In this scenario, a classification model is being created, and unfavourable effects may result from false positives or false negatives. A high accuracy score implies a strong model under most circumstances, but the same could not be said if accurate facts were present in a report that was considered fraudulent. To account for incorrectly categorized observations, three additional measures have been included: precision (the proportion of predicted fake articles that are truly fake, TP / (TP + FP)), recall, and F1-score.
Recall: the proportion of actual positive instances that were correctly identified. In our case, it represents the percentage of fake articles that were accurately predicted out of all the truly fake articles. Recall can be computed using Eq. 5 31:

Recall = TP / (TP + FN)    Eq. 5

F1-Score: The F1-score illustrates the compromise between recall and precision, calculating the harmonic mean of the pair. As a result, it takes into consideration observations that are both falsely positive and falsely negative. The F1-score can be calculated using Eq. 6 31,32:

F1 = 2 × (Precision × Recall) / (Precision + Recall)    Eq. 6

The recall, precision, and F1 scores of each algorithm on the dataset used for the fake news detection model are summarized in Fig. 2.
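All four metrics follow directly from the confusion-matrix counts. A minimal sketch with invented, purely illustrative counts:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # Eq. 4
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # Eq. 5
    f1 = 2 * precision * recall / (precision + recall)  # Eq. 6
    return accuracy, precision, recall, f1

# Illustrative counts for a fake-news test set: 90 fake articles caught,
# 5 real ones wrongly flagged, 95 real ones passed, 10 fake ones missed.
acc, prec, rec, f1 = classification_metrics(tp=90, fp=5, tn=95, fn=10)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

In practice, scikit-learn's `confusion_matrix` and `classification_report` compute the same quantities from predicted and true labels.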

Figure 2. Proposed System Results
The performance results for each method on the used datasets are summarized in Fig. 2. It is clear that the Gradient Boosting classifier achieves the maximum accuracy of 99%. Considering the observations and analyses presented above, it can be concluded that the suggested fake news detection model has succeeded in achieving a notable increase in accuracy, precision, recall, and F-score with a very low false-positive rate.
A recall value of 0.98 or 0.99 signifies that these models have a high ability to identify actual instances of fake news, minimizing the chances of false negatives (misclassifying real fake news as genuine).This high recall implies that the models have a low rate of missing or overlooking fake news articles.
A precision value of 0.99 or 1 implies that these models have a high ability to correctly classify instances as fake news, minimizing the chances of false positives (misclassifying genuine news as fake).This high precision suggests that the models have a low rate of incorrectly labeling real news articles as fake.
The F1 score values of 0.99 for all four models-Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting-suggest that they exhibit strong performance in accurately classifying instances as fake news.This implies that these models have a high ability to correctly detect fake news while minimizing misclassifications of genuine news.
The main factor behind the superior performance of Gradient Boosting is its working principle, which efficiently identifies errors and minimizes them in each iteration. The basic intuition is to build an ensemble that combines multiple weak learners and assigns higher weights to misclassified data points. The gradient boosting algorithm optimizes the parameters of the weak learners to minimize a loss function that represents the error of the model, and the final prediction is made by combining the predictions of all the weak learners through a weighted sum or voting mechanism. Therefore, during each subsequent iteration the model is able to correct the previously misclassified points, while regularization parameters are used to reduce the overfitting problem.
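This iterative error correction can be observed with scikit-learn's `staged_predict`, which reports the ensemble's predictions after each boosting round. The data below is synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary data standing in for the fake/real feature matrix.
X, y = make_classification(n_samples=400, n_features=15, random_state=0)

gb = GradientBoostingClassifier(n_estimators=50, random_state=0)
gb.fit(X, y)

# staged_predict yields the ensemble's predictions after each boosting
# iteration, showing how successive weak learners reduce the training error.
staged_acc = [accuracy_score(y, pred) for pred in gb.staged_predict(X)]
print(f"after 1 tree:   {staged_acc[0]:.3f}")
print(f"after 50 trees: {staged_acc[-1]:.3f}")
```

Plotting `staged_acc` against the iteration number is a common way to visualize this error-correcting behaviour and to spot overfitting on a held-out set.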

A Comparison with Existing Works
In order to prove the effectiveness of the proposed model, it is necessary to compare it with previously used techniques and show its strengths and weaknesses. Table 2 shows a brief comparison between the methods used to diagnose and identify fake news.

Table 2. A comparison with existing works

Method   | Advantages                                   | Disadvantages                               | Accuracy
BERT 10  | Captures contextual information effectively. | Training time can be lengthy.               | High
VGG16 9  | Good at capturing visual features in images. | Limited to image-based fake news detection. | 86

The actual performance and restrictions of these algorithms can differ based on variables including the dataset, the feature representation, the tuning of the hyperparameters, and the needs of the fake news detection task. The choice of method depends on the specific requirements of the fake news detection task, the available dataset, the interpretability desired, computational constraints, and the trade-off between accuracy and efficiency 33,34. It is essential to evaluate and compare these methods using appropriate metrics and techniques to identify the most suitable approach for a given application 35-37.

Conclusion
Fake news spreads incredibly quickly, and it might not be possible to tell it apart from truthful reporting. Nowadays, users download data and share it with others without even verifying whether it is real, and by the end of the day the false information has travelled so far from its origin that it is no longer recognizable. This work suggested a practical method for identifying misleading information and fake news using machine learning algorithms: Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting. Fake news identification has a number of open issues that need to be researched. For instance, a vital first step in limiting the spread of false news is recognizing the essential elements involved in the distribution of news.
Using a supervised classifier, the fake news detection problem has been approached as a classification problem. A significant, labelled dataset was required for supervised classification. It is the job of classification to give a certain unlabelled point a name or class. A classifier is formally defined as a model or function that assigns a class label to an input point. A training data set with precisely class-labelled points is required to build the supervised classifier. After the model has been constructed, it is ready to predict the class or label for any new point and may then be evaluated on a test data set. Following a description of the proposed framework, descriptions of the algorithms, datasets, and performance assessment indicators are provided. The proposed approach employed a binary supervised classifier to detect the bogus news. The association between independent factors and a dependent variable may be established using classification techniques and models; classification models often depict traits, patterns, or principles of categorization that are concealed in the material. The proposed system model for detection is shown in Fig. 1.

Figure 1. Fake news detection model