Detecting Deception Using Natural Language Processing and Machine Learning in Datasets on COVID-19 and Climate Change
In today’s world of fast-growing technology and an inexhaustible amount of data, there is a great need to verify data validity because of the possibility of fraud, and hence a need for a reliable way of detecting fraudulent content. Deception manifests itself on the Internet in many ways: identity deception, mimicking data and processes in order to steal credit card numbers or other private information, charging false invoices for services not performed, hacking sites, offering false excuses and promises, false advertising, spreading propaganda and false information, and other forms of fraud. Detecting deception, whether in face-to-face interaction or in communication through a medium, is therefore of great importance.
The need for a reliable method of deception detection is further emphasized by the fact that people tell approximately two lies per day [1]. Lying is undoubtedly a skill deeply rooted in human existence, and it has been perfected over the years to a level at which it is difficult for even the most experienced professionals to recognize. The question is what distinguishes truth from lies/deception, especially in the verbal aspect, and, if such a distinction exists, what is the best way to determine it. Does most of the information lie in non-verbal behavior, or are there certain linguistic patterns that can serve as sufficiently precise indicators of deception? Is there a difference in deception between face-to-face and computer-mediated communication, whether synchronous or asynchronous, verbal or non-verbal?
Current research covers, among other topics, deception detection in computer-mediated communication [2,3]; detection of fake reviews on social platforms [4,5]; deception detection in data collected from public trials [6,7]; and the use of crowdsourcing platforms, such as Amazon Mechanical Turk, for generating deception datasets [5,8].
This is precisely one of the differences between deception and lying: lying does not necessarily mean that the other person was successfully convinced; it refers only to “making a false statement with the intention of deceiving”.
A slightly broader, generally accepted definition of lying is the following: “A lie is a statement made by one who does not believe it with the intention that someone else shall be led to believe it” [10]. According to this definition, four conditions must be fulfilled for a statement to be identified as a lie: the person must make a statement (statement condition); the person making the statement must believe that it is false (untruthfulness condition); the untrue statement must be made to another person, the recipient of the statement (addressee condition); and the person making the statement must intend the recipient to believe the untruthful statement to be true (intention to deceive the addressee condition) [9]. Here, too, there are debates about the very definition of “lying”. They concern the intention with which a person lies and divide into two opposing groups of theories, Deceptionism and Non-Deceptionism. The former holds that an intention to deceive is necessary for lying, while the latter holds the opposite. Deceptionism is further divided into Simple Deceptionism, Complex Deceptionism, and Moral Deceptionism. Simple Deceptionists believe that lying requires making an untrue statement with the intention of deceiving another person, while Complex Deceptionists additionally believe that the intention to deceive must be manifested in the form of a breach of trust or belief. Moral Deceptionists state that lying requires not only making an untrue statement with the intention of deceiving but also violating the moral rights of another person. Non-Deceptionism, on the other hand, holds that making an untrue statement is a necessary and sufficient condition for lying, and it is further divided into Simple Non-Deceptionism and Complex Non-Deceptionism [9].
Unlike lying, deception itself does not only involve verbal communication, but also manifests itself through various non-verbal signs, such as leading another person to the wrong conclusion by certain behavior, using non-linguistic conventional symbols or symbols that determine similarity (icons), etc. It is also possible to deceive someone with an exclamation, a question, a command, an omission of an important statement, and even silence [9].
Zhou and colleagues [2] believe that the sender in computer-mediated communication (CMC) distances themselves from the message, which “reduces their accountability and responsibility for what they say, and an indication of negative feelings associated with the act of deceiving”. Likewise, one of the important reasons why people are better at detecting deception via computers is precisely that they lack certain information about the other person, so they are much more suspicious and will suspect deception sooner.
Research in CMC examines the influence of Linguistic Style Matching (LSM) and Interpersonal Deception Theory (IDT) on the linguistic characteristics of honest and deceptive conversations [2,12]. LSM theory explains how people in a conversation adapt their linguistic style to match their partner’s. According to LSM theory, because the linguistic styles of the two parties match, deception in a conversation can be detected by analyzing the verbal characteristics of the interlocutor (who is unaware of the deception) and not exclusively those of the speaker (who is lying) [12]. In a study conducted during synchronous computer-mediated communication, a correlation was recorded between the linguistic styles of the interlocutors. More precisely, the correlation was found in the use of first-, second-, and third-person pronouns and of negative emotion words. An interesting conclusion was that the linguistic profiles of the two interlocutors coincided to a greater extent during deceptive communication than during truthful communication, especially when the speaker was motivated to lie [12]. It is possible that speakers deliberately use LSM when trying to deceive in order to appear more credible to their partner, which is what IDT deals with. IDT studies the context of the speaker (who is lying) and the interlocutor (who is not aware of the deception) and the changes in their linguistic styles across honest and deceptive communication, with the difference that it understands these changes as strategic behavior that the speaker uses to facilitate the deception process [13]. “Deceivers will display strategic modifications of behavior in response to a receiver’s suspicions, but may also display non-strategic (inadvertent) behavior, or leakage cues, indicating that deception is occurring” [2].
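The kind of style correlation described above can be illustrated with a minimal sketch. It is not the procedure used in [12]: the abbreviated pronoun lists, the per-message rate, and the function names `category_rates`, `pearson`, and `style_correlation` are all assumptions introduced here for illustration, and a real study would rely on a full lexicon. The sketch computes, for each message, the share of tokens that fall into a pronoun category and then correlates those shares across paired turns of the two interlocutors.

```python
import re
from math import sqrt

# Hypothetical, abbreviated word lists used only for illustration;
# a real analysis would rely on a complete lexicon.
CATEGORIES = {
    "first_person": {"i", "me", "my", "mine", "we", "us", "our"},
    "second_person": {"you", "your", "yours"},
    "third_person": {"he", "she", "they", "him", "her", "them", "his", "their"},
}


def category_rates(text):
    """Return the share of tokens in each category for one message."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty messages
    return {
        name: sum(token in words for token in tokens) / total
        for name, words in CATEGORIES.items()
    }


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def style_correlation(speaker_msgs, partner_msgs, category):
    """Correlate the per-message rate of one category across paired turns."""
    xs = [category_rates(msg)[category] for msg in speaker_msgs]
    ys = [category_rates(msg)[category] for msg in partner_msgs]
    return pearson(xs, ys)
```

Calling `style_correlation(speaker_msgs, partner_msgs, "first_person")` on two equally long lists of messages returns a value between -1 and 1; under the LSM account, values closer to 1 would indicate that the two interlocutors raise and lower their use of that category together.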
On the other hand, when the speaker displays non-strategic behavior, interlocutors may become suspicious and ask more questions, thus forcing the speaker to change their linguistic style and adapt to the interlocutor.
However, the first problem that arose in using such models was the lack of labeled input data containing truthful and deceptive statements. In the majority of prior work, data were collected using crowdsourcing platforms [4,5,8], while other research was done on “real” data collected through social experiments [2,3], by analyzing public trials [6,7], or by some other method.
Another problem was the choice of the text processing methods and machine learning models that would give the best predictions. An analysis of the machine learning models chosen in previous research shows that Naive Bayes and SVM classifiers have proven to be very successful in solving this type of problem [4,5,8], so they are the methods chosen in this study as well.
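To make the classification setting concrete, the following is a minimal sketch of one of the two model families named above, a multinomial Naive Bayes classifier over word counts with add-one smoothing. It is not the pipeline used in this study: the whitespace tokenizer, the class name `NaiveBayes`, and the four toy sentences with their labels are invented here purely for illustration, and the SVM counterpart is not shown.

```python
from collections import Counter, defaultdict
from math import log


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple assumption)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = log(n_docs / total_docs)  # log prior of the class
            for word in tokenize(text):
                if word in self.vocab:  # words unseen in training are ignored
                    score += log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy statements and labels, for illustration only.
TOY_TEXTS = [
    "honestly the best hotel ever amazing",
    "the room was clean and quiet",
    "truly amazing unbelievable best stay ever",
    "breakfast was served at seven",
]
TOY_LABELS = ["deceptive", "truthful", "deceptive", "truthful"]

model = NaiveBayes().fit(TOY_TEXTS, TOY_LABELS)
```

After fitting, `model.predict(text)` returns the label with the highest smoothed log-probability for the given statement. With only four invented training sentences the predictions carry no empirical meaning; the sketch shows only how labeled truthful and deceptive statements feed such a classifier.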