Can ChatGPT Detect DeepFakes?
Figure 1. The overall process of using multimodal LLMs to detect AI-generated face images.

the answer. The text prompt is crucial, as it forms the sole interface between the user and the multimodal LLM chatbot for media forensic tasks. Our study focuses on the forms of text prompts that can effectively elicit meaningful responses from LLMs². On a set of face images, we conduct extensive qualitative and quantitative evaluations of the performance of popular multimodal LLMs on this task. Our initial experiments have yielded several key insights:
• Multimodal LLMs demonstrate a certain capability to distinguish between authentic and AI-generated imagery, drawing on their semantic understanding. This discernment is interpretable by humans, offering a more intuitive and user-friendly option compared to traditional machine learning (ML) detection methods.
• The efficacy of multimodal LLMs in identifying AI-generated images is satisfactory, with an Area Under the Curve (AUC) score of approximately 75%. However, their accuracy in recognizing genuine images is noticeably lower. This discrepancy arises because a lack of semantic inconsistencies does not automatically confirm an image's authenticity from the LLMs' standpoint.
• The semantic detection capabilities of these LLMs cannot be fully harnessed through simple binary prompts, which can lead to their refusal to provide clear answers. Effective prompting techniques are crucial for maximizing the potential of multimodal LLMs in differentiating between real and AI-generated images.
• Presently, multimodal LLMs do not incorporate signal cues or data-driven approaches for this task. While their independence from signal cues enables them to identify AI-created images regardless of the generation model used, their performance still falls short of the latest detection methodologies.
We hope that this study will encourage future exploration of the use and improvement of LLMs for media forensics and DeepFake detection. The remainder of the paper is organized as follows. Section 2 provides an overview of the relevant literature on LLMs and DeepFake face detection. Section 3 presents the methodology of our study. Comprehensive evaluation results and analysis are given in Section 4, and Section 5 concludes the article.

2. Background

2.1. Large Language Models

Large Language Models (LLMs) are large-scale foundational deep neural network models (characterized by billions of parameters) that perform natural language-related tasks. Their basic function is to predict the next words in sentences based on previous words. LLMs typically adopt the transformer architecture [34], distinguished by its attention mechanism that evaluates the importance of different words for understanding the text. This architecture provides a more advanced memory structure for handling long-term dependencies than traditional recurrent neural networks, especially when the model is pre-trained on a large text corpus and later fine-tuned with minimal modifications for specific datasets. LLMs are typically trained on gigantic volumes of unlabeled text from the Internet. The training process for LLMs capitalizes on the statistical patterns of human languages and can be subsequently tuned to other applications.

The popularity of LLMs is largely attributed to the family of generative pretrained transformers (GPTs) developed by OpenAI. The GPT-1 model, which debuted in 2018, has 117 million parameters and was the first practical LLM to achieve human-level language understanding in tasks such as textual entailment and reading comprehension. Subsequently, GPT models have quickly evolved with scaled-up capacity and improved performance on task-agnostic and few-shot learning challenges. The GPT-3 model, introduced by OpenAI in 2020, has a whopping 175 billion parameters. Considering that the total corpus on the Internet up to 2022, which more or less represents all human-generated texts throughout history, is about 500 billion tokens, one can think of the GPT-3 model as a compression model of all human knowledge captured in written texts [8]. In this sense, it is perhaps not so surprising that GPT models can achieve human-level performance in text-understanding tasks. LLMs have recently been extended to cross-modal understanding. In late 2023, OpenAI released the latest GPT-for-vision (GPT4V) model [2], which accepts both images and text prompts as input. This has been followed by other multimodal LLMs from major companies, such as Google's Bard and Gemini [33].

Ordinary users were exposed to the power of LLMs through conversational agents (chatbots) that use LLMs to engage in natural dialogues for question answering, text summarization, recommendations, and assistance with writing and debugging code, etc. The most well-known LLM-based chatbot is OpenAI's ChatGPT. Since its introduction in November 2022, ChatGPT has rapidly become the fastest-growing consumer app ever, gaining over 100 million monthly active users within just two months of its release. Besides providing an intuitive conversational user interface, the chatbots also help improve the underlying LLMs by gathering user feedback for reinforcement learning from human feedback.

2.2. DeepFake Faces: Generation and Detection

AI-generated realistic human face images are the earliest and most well-known examples of DeepFakes. DeepFake faces are created with generative adversarial networks (GANs) and diffusion models. They achieve a high level of realism in fine details of skin and facial hair and challenge the human ability to distinguish them from images of real human faces (Fig. 2). DeepFake faces have been used as profile images for fake social media accounts in disinformation campaigns [1, 5, 6, 28].

Figure 2. Which of these images are real and AI-generated? Answer: (a) Fake, (b) Fake, (c) Fake, (d) Real, (e) Fake, (f) Fake, (g) Real, (h) Real, (i) Fake, (j) Real.

Existing DeepFake face detection methods are mostly formulated as binary classification problems. Based on the features used, these methods fall into three major categories. Methods in the first category (e.g., [4, 14, 23, 25, 40]) are based on inconsistencies exhibited in the physical/physiological aspects of DeepFake images. Methods in the second category (e.g., [13, 21, 22, 26, 37]) use signal-level artifacts introduced during the synthesis process. The majority of current detection methods (e.g., [3, 7, 9, 12, 16, 27, 39]) are data-driven, directly using various DNNs trained on real and DeepFake samples to capture specific artifacts. There also exist several large-scale benchmark datasets to evaluate DeepFake detection performance [7, 17, 30, 35].

Current DeepFake face detection methods are typically developed using programming languages like Python and specialized libraries to construct neural network models or other machine learning algorithms (e.g., Scikit-Learn, PyTorch, TensorFlow). These models are then trained on datasets of labeled data. However, the programming-language interface represents a significant hurdle for both the developers and users of these detection algorithms.

3. Methodology

Our study aims to evaluate the utility and efficacy of multimodal LLMs in media forensics, and we choose the problem of identifying AI-generated images of human faces as the main focus. The rationale is as follows. Firstly, while multimodal LLMs are technically equipped to analyze video and audio content, their optimal performance is observed with images. Secondly, detecting realistic DeepFake face images is one of the most thoroughly studied topics, so it can be used to compare the capabilities of a multimodal LLM with state-of-the-art methods. Thirdly, prior research has identified a wealth of semantic indicators: humans can identify semantic inconsistencies in faces, making the study much more accessible to viewers. We can use these established semantic cues to craft targeted prompts to enhance detection efficacy. We choose OpenAI's GPT4V model (i.e., GPT4V-vision-preview³) as the subject of the study. It provides an API that greatly streamlines experimental procedures, especially for Python-based implementations. This feature is instrumental in simulating conversational contexts on a large scale. We design experiments in which the GPT4V model assesses whether a face image is AI-generated based on the text prompts in Fig. 1. We also consider the Google Gemini 1.0 Pro API for comparison (note that the Gemini web app has restrictions on analyzing images containing human faces).

Data: Our experiments are based on a set of 1,000 real face images from the FFHQ dataset [19] and another 2,000 images created with generative AI models from the DF³ dataset [17]. All images contain a single human face. Two generative AI models are considered, namely StyleGAN2 [20] and Latent Diffusion [29]. We also adopt two evaluation protocols from the DF³ dataset [17]: assessing the basic detection performance on raw data, and evaluating the robustness on post-processed DeepFake data subjected to mixed operations such as JPEG compression, Gaussian blur, face blending, adversarial attacks, and multi-image compression. Detailed information on the data used is given in Table 1. A few examples of the real and AI-generated images are shown in Fig. 3.

Table 1. Detailed information of the evaluation dataset from DF³ [17]. 'SG2' stands for the StyleGAN2 model, 'LD' represents the Latent Diffusion model, and 'PP'ed' means post-processed data.

                       Raw              PP'ed
             Real      SG2     LD       SG2        LD
Number       1,000     1,000   1,000    1,000      1,000
Image Size   512²      512²    512²     256²       256²
Format       PNG       PNG     JPEG     PNG, JPEG  PNG, JPEG

² All text prompts and results used in this study will be available from https://github.com/shanface33/GPT4MF_UB.
³ https://platform.openai.com/docs/guides/vision
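The API-based querying described in this section can be sketched as follows. This is a minimal illustration, not the authors' actual code: it assumes the OpenAI chat-completions vision format, in which an image is passed as a base64 data URL alongside the text prompt; the helper name `build_vision_request` and the sample bytes are ours.

```python
import base64

# Prompt #6 from Section 3 of the paper.
PROMPT_6 = (
    "Tell me if there are synthesis artifacts in the face or not. "
    "Must return with 1) yes or no only; 2) if yes, explain where the "
    "artifacts exist by answering in [region, artifacts] form."
)

def build_vision_request(prompt: str, image_bytes: bytes,
                         model: str = "gpt-4-vision-preview") -> dict:
    """Assemble a chat-completion payload pairing one text prompt with one
    image, following the OpenAI vision API's message format."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

# The actual batch test would send this payload through the API client, e.g.
#   client = openai.OpenAI()
#   reply = client.chat.completions.create(**payload)
payload = build_vision_request(PROMPT_6, b"\x89PNG...")  # placeholder bytes
```

Looping this over the 3,000 evaluation images, with one payload per image and prompt, is all that is needed to simulate conversational queries at scale.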
Figure 3. Examples of evaluation data. 'SG2' stands for the StyleGAN2 model, 'LD' represents the Latent Diffusion model, and 'PP'ed' means post-processed data.

Text Prompts: Text prompts embody the instruction and request to the LLM to detect DeepFake faces. Properly designed prompts can bring forth the power of semantic knowledge in the LLMs for this task. We consider prompts of different levels of context richness and additional information in our experiments:
• Prompt #1: Tell me if this is an AI-generated image. Answer yes or no.
• Prompt #2: Tell me if this is a real image. Answer yes or no.
• Prompt #3: Tell me the probability of this image being AI-generated. Answer a probability score between 0 and 100.
• Prompt #4: Tell me the probability of this image being real. Answer a probability score between 0 and 100.
• Prompt #5: Tell me if this is a real or AI-generated image.
• Prompt #6: Tell me if there are synthesis artifacts in the face or not. Must return with 1) yes or no only; 2) if yes, explain where the artifacts exist by answering in [region, artifacts] form.
• Prompt #7: I want you to work as an image forensic expert for AI-generated faces. Check if the image has the artifact attribute listed in the following list and ONLY return the attribute number in this image. The artifact list is [1-asymmetric eye iris; 2-irregular glasses shape or reflection; 3-irregular teeth shape or texture; 4-irregular ears or earrings; 5-strange hair texture; 6-inconsistent skin texture; 7-inconsistent lighting and shading; 8-strange background; 9-weird hands; 10-unnatural edges].
The first two simple binary prompts ask for straightforward Yes/No answers. The third and fourth go beyond binary answers and also ask for a numerical value of likelihood. The fifth makes the LLM choose between the two alternatives of the image being real or DeepFake. These are simple prompts and would be a user's first attempt at interacting with LLMs for this task. However, our experiments (detailed later) show that such simple prompts are not effective: in many cases, the LLM declines to respond to the requests due to a lack of context or safety concerns, and when the LLM did respond, the responses were not informative. Prompt #6 goes beyond simple binary answers: we ask the LLM to identify signs of synthesis and, in addition, request it to justify the answers. This additional request can guide the LLM, resulting in the lowest rejection rate. Prompt #7 goes even further, including a more detailed list of clues about possible aspects in which DeepFake faces exhibit semantic inconsistencies. Overall, the more context-rich prompts have lower rejection rates. On the other hand, the more detailed prompts may lead to lower accuracies, possibly because they limit the cues for the LLM to consider, so the LLM may not be able to correctly identify DeepFakes with artifacts not included in the list. In addition, Prompt #7 uses more tokens (72) than #6 (31), which increases the cost of running the LLMs. For these reasons, we subsequently conducted our experiments based on Prompt #6.

Performance Metrics: For each text-image prompt, we query the LLM multiple times and calculate a numerical score by averaging the results (No = 0 and Yes = 1). This approach offers two benefits. Firstly, it diminishes the variability in LLM responses to identical queries, attributable to the probabilistic nature of the underlying LLMs. Secondly, using numerical decision scores enables the application of performance metrics beyond mere accuracy, such as the area under the ROC curve (AUC). Compared to classification accuracy, the AUC score is less affected by imbalanced data, provides a more comprehensive performance evaluation, and allows us to compare the LLM's performance with existing programmed detection methods. The AUC score is a real number in [0, 1], with higher values corresponding to better performance. As the LLM may decline to respond to a query, another important performance metric is the rejection rate, which measures the fraction of queries that the LLM declines. We also report the single-class accuracy at the fixed threshold of 0.5.

Model Parameters: All batch tests were performed through API calls. In the evaluation with the GPT4V APIs, we adopted settings similar to those described in [8]. For the Gemini model, we used Gemini-1.0-pro-vision, which is free of charge and supports up to 60 requests per minute. The total cost of this study is approximately $130, and it took around 30 days.

4. Experiment Results

4.1. Qualitative and Quantitative Results

We show several examples of using the GPT4V model with Prompt #6 to determine if an input image contains a DeepFake face in Fig. 4. The left column corresponds to cases where the input images are generated with various AI models, and the right column to cases of real images. Both success (with check marks) and failure (with crosses) cases are shown. These results indicate that the GPT4V model achieved reasonable detection accuracy on this task. We also offer comparison outputs of Gemini 1.0 Pro in Fig. 5, which is less reliable in providing accurate insights for image forensics tasks.

The quantitative results corroborate this observation. Fig. 6 shows the receiver operating characteristic (ROC) curves and the corresponding AUC scores obtained using API calls (as described in Section 3) over the evaluation dataset with the same prompt: GPT4V has a 79.5% AUC on raw Latent Diffusion-generated face images and a 77.2% AUC on StyleGAN2-generated face images. The performance confirms that the GPT4V model clearly did not make random guesses on this task (which would correspond to a diagonal ROC curve and a 50% AUC score). Compared to the GPT4V model, Gemini shows a slight decrease in performance.

Figure 4. Examples of GPT4V for DeepFake face detection. Left: Results for AI-generated images from the DF³ dataset [17]. Right: Results for real faces from the FFHQ dataset [19]. The responses for AI-generated faces are labeled in pink, while those for the real faces are labeled in green. Both success (w/ ✓) and failure (w/ ✗) cases are shown.

Figure 5. Examples of Gemini 1.0 Pro for DeepFake face detection. Left: Results for AI-generated images from the DF³ dataset [17]. Right: Results for real faces from the FFHQ dataset [19]. The responses for AI-generated faces are labeled in pink, while those for real faces are labeled in green. Both success (w/ ✓) and failure (w/ ✗) cases are shown. We can see that even though some yes/no results are accurate, the supporting evidence is not.
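The scoring scheme described under Performance Metrics can be made concrete with a small sketch. The helper names are ours, the answer parsing is a simplified stand-in for reading the model's "yes/no" replies, and the AUC is computed with the standard rank-based formulation rather than the authors' exact implementation.

```python
def parse_answer(text):
    """Map a raw model reply to 'yes'/'no', or None for a refusal."""
    t = text.strip().lower()
    if t.startswith("yes"):
        return "yes"
    if t.startswith("no"):
        return "no"
    return None  # counted as a rejected query

def decision_score(answers):
    """Average repeated yes/no answers into a [0, 1] score (Yes = 1, No = 0);
    rejected queries (None) are excluded from the average."""
    votes = [1.0 if a == "yes" else 0.0 for a in answers if a is not None]
    return sum(votes) / len(votes) if votes else None

def rejection_rate(all_answers):
    """Fraction of individual queries the model declined."""
    flat = [a for answers in all_answers for a in answers]
    return sum(a is None for a in flat) / len(flat)

def auc(scores_fake, scores_real):
    """Rank-based AUC: probability that a fake image's score exceeds a real
    image's score, counting ties as 0.5."""
    total = 0.0
    for f in scores_fake:
        for r in scores_real:
            total += 1.0 if f > r else 0.5 if f == r else 0.0
    return total / (len(scores_fake) * len(scores_real))
```

For example, five queries answered yes/no/yes/refused/yes yield a decision score of 0.75, which is then thresholded at 0.5 for single-class accuracy or fed into the AUC computation.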
Table 2. Comparison of AUC (%) in detecting DeepFake faces. 'SG2' stands for the StyleGAN2 model, and 'LD' represents the Latent Diffusion model.

                          Raw data        Post-processed
Method                    SG2     LD      SG2     LD
CNN-aug [35]              96.5    58.6    53.2    52.4
GAN-DCT [10]              53.4    75.4    44.4    56.0
Nodown [11]               99.6    97.1    47.4    44.9
BeyondtheSpectrum [13]    98.1    77.3    45.4    46.9
PSM [16]                  99.2    82.5    73.1    71.3
GLFF [17]                 97.5    86.7    80.6    79.4
Gemini 1.0 (zero-shot)    76.6    75.1    77.5    81.5
GPT4V (zero-shot)         77.2    79.5    88.7    89.8

Table 3. Comparison of single-class accuracy (%) in detecting DeepFake faces. 'SG2' stands for the StyleGAN2 model, and 'LD' represents the Latent Diffusion model.

                                Raw data        Post-processed
Method                  Real    SG2     LD      SG2     LD
CNN-aug [35]            89.8    71.9    0.3     38.3    5.5
GAN-DCT [10]            92.5    3.7     7.0     20.8    29.4
Nodown [11]             81.3    96.3    0.1     3.3     4.5
BeyondtheSpectrum [13]  67.6    42.0    8.0     11.9    15.1
PSM [16]                78.0    89.8    0.1     4.4     3.3
GLFF [17]               89.9    82.9    0.2     7.6     8.1
Gemini 1.0 (zero-shot)  83.3    45.1    48.2    53.2    61.2
GPT4V (zero-shot)       51.2    86.5    90.3    98.3    99.2

Figure 6. ROC curves of GPT4V and Gemini 1.0 Pro on DeepFake detection based on averaging the predictions of five rounds of queries: (a) on raw data, (b) on post-processed DeepFake data.

To put these performances into the context of state-of-the-art DeepFake face detection methods, we compare them with existing methods in Table 2 for AUC scores and Table 3 for classification accuracies. Note that all these baseline detectors were trained on an image forensics dataset [35] with 360K ProGAN-generated images [18] and 360K real images [42]. As the tables show, the performance of GPT4V and Gemini 1.0 is on par with or slightly better than the early methods [10, 35], but is not competitive with more recent detection methods [11, 16, 17]. This may be attributed to some fundamental differences between the two approaches. Existing effective DeepFake detection methods capture signal-level statistical differences between real and AI-generated training images. In contrast, a multimodal LLM's decision is mostly based on semantic-level abnormalities, reflected in the additional natural-language explanations in the responses. Therefore, even though the LLM is not specifically designed and trained for DeepFake face detection, the world knowledge encapsulated in the LLM can be transferred to this task. The semantic reasoning leads to results that are more comprehensible to humans, and the detection is less susceptible to post-processing operations that disrupt signal-level features. This is confirmed by the changes in performance when post-processing is included in Tables 2 and 3, where classification accuracies on DeepFake faces even increase for post-processed images. Another factor contributing to this performance enhancement is the inclusion of post-processing operations such as face blending and adversarial attacks, which introduce more distinctive visual artifacts to the images.

On the other hand, we note that most errors of GPT4V occur in detecting real images: per Table 3, the classification accuracy on real images is around 50%, drastically different from those on AI-generated images, which are around 90% or above. Some intuition can be obtained by examining the real face images for these error cases, as shown in Fig. 4. These cases include semantic features unusual for "typical" face images, for instance, a different age group (the baby in the first case), or unique hair color (second case) or style (third case). This suggests that the semantic abnormalities identified by GPT4V may not be specific to DeepFake faces. This problem may be solved by refining the model. In contrast, the Gemini model achieves a classification accuracy of 83.3% on real images, dropping to around 50% on generated faces. The examples in Fig. 5 show that the Gemini model's responses lack rationality in analyzing the synthesis artifacts.

4.2. Ablation Studies

The quality of the prompt plays a central role in performance. In addition to the prompts used in the experiments, we have also studied other prompts with simpler structures and compared their performance. Firstly, we quantitatively compare different text prompts in detecting 1,000 raw StyleGAN2 faces. Table 4 reports the rejection rate and accuracy of GPT4V with all seven prompts described in Section 3. The findings indicate that prompts related to direct image forensics result in high rejection rates, particularly those based on likelihood assessments and prompts requiring a choice between real or fake. Prompts #6 and #7 result in fewer rejections with comparable prediction accuracies because they extend beyond mere yes-or-no responses by asking the model to identify signs of synthesis. Fig. 7 shows four examples predicted by GPT4V using different prompts. GPT4V misclassifies visually realistic fake faces and interprets unusual semantic features in real faces as synthesis artifacts.

Table 4. Comparison results (%) of using different prompts for GPT4V in detecting 1,000 StyleGAN2 faces. Note that the accuracy is measured by comparing the number of correct predictions to the total number of samples that were not rejected.

Metric          Prompt #1  Prompt #2  Prompt #3  Prompt #4  Prompt #5  Prompt #6  Prompt #7
Rejection Rate  60.2       66.9       100        100        95.8       4.7        33.1
Accuracy        97.49      94.86      -          -          88.10      83.83      86.54

Figure 7. Examples of GPT4V for DeepFake face detection. We show success (w/ ✓), failure (w/ ✗), and rejected cases (shown in dark cyan). The responses for AI-generated faces are labeled in pink, while those for the real faces are labeled in green. The figure is best viewed in color; zoom in and refer to the text for details.

Next, we show the influence of the number of query attempts on detection performance. Fig. 8 demonstrates that increasing query attempts correlates with higher AUC scores. This indicates that repeated querying might serve as an ensemble method for enhancing performance. Finally, we explore how the dataset size affects the detection performance of GPT4V. Fig. 9 presents the comparison results

Figure 8. Comparative analysis of AUC scores (%) across different query rounds of GPT4V in DeepFake detection.

Figure 9. Comparative analysis of AUC scores (%) using different data sizes of GPT4V in DeepFake detection.
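The observation that repeated querying acts like an ensemble can be illustrated with a toy simulation, separate from the paper's actual experiments: if each query independently answers "yes" with some fixed probability, averaging more rounds shrinks the spread of the resulting decision score roughly as 1/√rounds, which is the mechanism by which more query rounds can sharpen the AUC.

```python
import random

def score_spread(p_yes: float, rounds: int, trials: int = 2000,
                 seed: int = 0) -> float:
    """Monte-Carlo sketch: average `rounds` independent yes(1)/no(0) answers
    into a decision score, repeat over many trials, and return the standard
    deviation of that averaged score. More rounds -> smaller spread."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        votes = [1 if rng.random() < p_yes else 0 for _ in range(rounds)]
        scores.append(sum(votes) / rounds)
    mean = sum(scores) / trials
    var = sum((s - mean) ** 2 for s in scores) / trials
    return var ** 0.5

# e.g. with p_yes = 0.8, the spread for 25 rounds is noticeably
# smaller than for 5 rounds, mirroring the trend in Fig. 8.
```

This is only a statistical caricature of the LLM's stochastic responses, but it shows why averaging five query rounds (as used for the ROC curves in Fig. 6) already stabilizes the decision scores considerably.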
Figure 10. Potential improvement in detecting DeepFake images. The responses for AI-generated faces are labeled in pink, while those for the real faces are labeled in green. A success case (w/ ✓) and a failure case (w/ ✗) are shown.
References

[1] Experts: Spy used ai-generated face to connect with targets. https://www.theverge.com/2019/6/13/18677341/ai-generated-fake-faces-spy-linked-in-contacts-associated-press.
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[3] Junyi Cao, Chao Ma, Taiping Yao, Shen Chen, Shouhong Ding, and Xiaokang Yang. End-to-end reconstruction-classification learning for face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4113–4122, 2022.
[4] Umur Aybars Ciftci, Ilke Demir, and Lijun Yin. How do the hearts of deep fakes beat? Deep fake source detection via interpreting residuals with biological signals. In IEEE/IAPR International Joint Conference on Biometrics (IJCB), 2020.
[5] CNN. A high school student created a fake 2020 US candidate. Twitter verified it. https://www.cnn.com/2020/02/28/tech/fake-twitter-candidate-2020/index.html.
[6] CNN. How fake faces are being weaponized online. https://www.cnn.com/2020/02/20/tech/fake-faces-deepfake/index.html.
[7] Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. On the detection of synthetic images generated by diffusion models. In ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[8] Ivan DeAndres-Tame, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, and Javier Ortega-Garcia. How good is chatgpt at face biometrics? A first look into recognition, soft biometrics, and explainability, 2024.
[9] Shichao Dong, Jin Wang, Jiajun Liang, Haoqiang Fan, and Renhe Ji. Explaining deepfake detection by analysing image matching. In European Conference on Computer Vision, pages 18–35. Springer, 2022.
[10] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. arXiv preprint arXiv:2003.08685, 2020.
[11] Diego Gragnaniello, Davide Cozzolino, Francesco Marra, Giovanni Poggi, and Luisa Verdoliva. Are gan generated images easy to detect? A critical analysis of the state-of-the-art. In ICME, pages 1–6. IEEE, 2021.
[12] Ruidong Han, Xiaofeng Wang, Ningning Bai, Qin Wang, Zinian Liu, and Jianru Xue. Fcd-net: Learning to detect multiple types of homologous deepfake face images. IEEE Transactions on Information Forensics and Security, 2023.
[13] Yang He, Ning Yu, Margret Keuper, and Mario Fritz. Beyond the spectrum: Detecting deepfakes via re-synthesis. In 30th International Joint Conference on Artificial Intelligence (IJCAI), 2021.
[14] Shu Hu, Yuezun Li, and Siwei Lyu. Exposing GAN-generated faces using inconsistent corneal specular highlights. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, 2021.
[15] Mingzhen Huang, Shan Jia, Zhou Zhou, Yan Ju, Jialing Cai, and Siwei Lyu. Exposing text-image inconsistency using diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
[16] Yan Ju, Shan Jia, Lipeng Ke, Hongfei Xue, Koki Nagano, and Siwei Lyu. Fusing global and local features for generalized ai-synthesized image detection. In 2022 IEEE International Conference on Image Processing (ICIP), pages 3465–3469. IEEE, 2022.
[17] Yan Ju, Shan Jia, Jialing Cai, Haiying Guan, and Siwei Lyu. Glff: Global and local feature fusion for ai-synthesized image detection. IEEE Transactions on Multimedia, 2023.
[18] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[19] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[20] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
[21] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. Face x-ray for more general face forgery detection. In CVPR, 2020.
[22] Yuezun Li and Siwei Lyu. Exposing deepfake videos by detecting face warping artifacts. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
[23] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In IEEE Workshop on Information Forensics and Security (WIFS), Hong Kong, 2018.
[24] Yuezun Li, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, United States, 2020.
[25] Falko Matern, Christian Riess, and Marc Stamminger. Exploiting visual artifacts to expose deepfakes and face manipulations. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), pages 83–92, 2019.
[26] Scott McCloskey and Michael Albright. Detecting GAN-generated imagery using color cues. arXiv preprint arXiv:1812.08247, 2018.
[27] Huy H Nguyen, Junichi Yamagishi, and Isao Echizen. Capsule-forensics: Using capsule networks to detect forged images and videos. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2307–2311. IEEE, 2019.
[28] Reuters. These faces are not real. https://graphics.reuters.com/CYBER-DEEPFAKE/ACTIVIST/nmovajgnxpa/index.html.
[29] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[30] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. In ICCV, 2019.
[31] Mark Scanlon, Frank Breitinger, Christopher Hargreaves, Jan-Niclas Hilgert, and John Sheppard. Chatgpt for digital forensic investigation: The good, the bad, and the unknown. Forensic Science International: Digital Investigation, 46:301609, 2023.
[32] Yichen Shi, Yuhao Gao, Yingxin Lai, Hongyang Wang, Jun Feng, Lei He, Jun Wan, Changsheng Chen, Zitong Yu, and Xiaochun Cao. Shield: An evaluation benchmark for face spoofing and forgery detection with multimodal large language models. arXiv preprint arXiv:2402.04178, 2024.
[33] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[35] Shengyu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. arXiv: Computer Vision and Pattern Recognition, 2019.
[36] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[37] Moritz Wolter, Felix Blanke, Raoul Heese, and Jochen Garcke. Wavelet-packets for deepfake image analysis and detection. Machine Learning, 111(11):4295–4327, 2022.
[38] Chaoyi Wu, Jiayu Lei, Qiaoyu Zheng, Weike Zhao, Weixiong Lin, Xiaoman Zhang, Xiao Zhou, Ziheng Zhao, Ya Zhang, Yanfeng Wang, et al. Can gpt-4v(ision) serve medical applications? Case studies on gpt-4v for multimodal medical diagnosis. arXiv preprint arXiv:2310.09909, 2023.
[39] Qiang Xu, Shan Jia, Xinghao Jiang, Tanfeng Sun, Zhe Wang, and Hong Yan. Mdtl-net: Computer-generated image detection based on multi-scale deep texture learning. Expert Systems with Applications, 248:123368, 2024.
[40] Xin Yang, Yuezun Li, Honggang Qi, and Siwei Lyu. Exposing GAN-synthesized faces using landmark locations. In International Workshop on Information Hiding and Multimedia Security, Paris, France, 2019.
[41] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421, 9(1):1, 2023.
[42] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.