Every Person on the Video Call Was Fake: The $25.6 Million Deepfake Heist
In 2024, a Hong Kong finance worker wired $25.6 million after a deepfake video call with his company's CFO. Social engineering has entered a new era. Here is what incident response and security awareness training look like in the age of deepfakes.
The Arup Hong Kong heist
In late January 2024, a finance worker at Arup's Hong Kong office received a message that appeared to come from the company's UK-based Chief Financial Officer. The message requested a confidential financial transaction and invited the worker to join a video conference call to discuss the details [1].
The worker joined the call. On screen were the CFO and several other colleagues, all of whom the worker recognized. They discussed the transaction, the CFO confirmed the instructions, and the worker proceeded to execute 15 separate wire transfers totaling HK$200 million, approximately $25.6 million USD [2].
Every person on the call was a deepfake.
The attackers had obtained publicly available video and audio of the real executives from conference presentations, earnings calls, and social media. They used this material to generate real-time deepfake video and cloned voices that were convincing enough to fool a trained finance professional during a live, interactive video call.
The fraud was discovered a week later when the worker followed up with the real CFO's office about the transactions. By then, the money had been moved through multiple accounts and was largely unrecoverable.
How the deepfakes were generated
Creating a convincing real-time deepfake for a video call requires three components [3]:
Face synthesis. The attacker needs 3 to 10 minutes of video of the target's face from multiple angles. Conference recordings, YouTube talks, and LinkedIn videos provide ample material. Modern face-swap tools (SimSwap, FaceFusion, and proprietary equivalents) can generate photorealistic face replacements in real time on consumer-grade GPUs. An NVIDIA RTX 4090 can render deepfake video at 30 frames per second with sub-100ms latency.
Voice cloning. Current voice cloning technology requires as little as 3 to 10 seconds of clean audio to generate a usable voice model [4]. With 30 to 60 seconds of audio, the clone is nearly indistinguishable from the real person, capturing accent, cadence, pitch, and speech patterns. Services like ElevenLabs, Resemble.AI, and open-source tools like OpenVoice can produce real-time voice output with latency under 200ms.
Behavior modeling. The most sophisticated attacks incorporate behavioral mimicry. The deepfake puppeteer studies how the target speaks (filler words, pauses, hand gestures) and replicates these patterns. On a video call with typical compression artifacts and mediocre lighting, the result is nearly undetectable.
Deepfake-as-a-Service
The tools needed for this attack are now available as commercial services on both the open web and dark web marketplaces [5]:
- Consumer tools ($0 to $50/month): Apps like FaceFusion and DeepFaceLab are free and open source. Commercial tools like Synthesia and HeyGen are designed for legitimate video production but can be repurposed
- Underground services ($1,000 to $10,000): Full-service deepfake creation, including real-time puppeteering for video calls, document forgery with deepfake ID photos, and voice clone development
- Custom operations ($10,000+): Targeted attacks against specific individuals with rehearsed scenarios, multiple deepfake participants, and professional social engineering scripts
The Arup attack likely fell in the $5,000 to $15,000 range for the deepfake production. Even at the top of that range, the return on investment was roughly 1,700x ($25.6 million against $15,000).
Voice cloning: 3 seconds is enough
The speed at which voice cloning has advanced is staggering. In 2020, creating a convincing voice clone required hours of clean audio recordings. By 2023, that had dropped to 5 minutes. By 2025, state-of-the-art models can produce a functional clone from 3 to 10 seconds of audio [4].
Sources of target audio are everywhere:
- Voicemail greetings (call the target's phone, let it go to voicemail)
- Conference recordings (YouTube, Vimeo, corporate event archives)
- Podcast appearances
- Earnings calls (publicly available for executives of public companies)
- Social media videos (Instagram, TikTok, LinkedIn)
- Customer service recordings ("this call may be recorded for quality purposes" provides the attacker with material too)
The practical implication: if your voice has ever been recorded and is accessible online or over the phone, someone can clone it well enough to fool your colleagues, your family, and your bank.
Biometric fraud is exploding
The surge in deepfake capability has driven a corresponding surge in biometric fraud. According to Gartner, biometric fraud attacks involving deepfakes increased by 340% between 2023 and 2025 [6].
The most targeted biometric systems:
- Video-based identity verification (the "take a selfie to verify your identity" flow used by banks, crypto exchanges, and government services). Deepfake videos bypass these checks at an alarming rate
- Voice authentication (banking by phone, voice-activated smart assistants). Cloned voices pass voice biometric checks with increasing reliability
- Facial recognition access control (building entry, device unlock). Printed or screen-displayed deepfake images can defeat some systems, while real-time deepfake video defeats most
Gartner predicts that by 2026, 30% of enterprises will no longer trust facial or voice biometrics as standalone identity verification methods due to deepfake capabilities [6]. This is not a distant future prediction. It is a reflection of attacks happening right now.
The TAKE IT DOWN Act
In response to the surge in deepfake abuse (including non-consensual intimate imagery), the US Congress passed the TAKE IT DOWN Act, signed into law in 2025 [7]. The law:
- Criminalizes the knowing publication and distribution of non-consensual deepfake intimate imagery
- Requires social media platforms and hosting providers to remove reported deepfake content within 48 hours
- Establishes penalties of up to 2 years in prison (up to 3 years when the victim is a minor), plus fines
- Creates a reporting mechanism through the FTC for victims
The law is a step forward for protecting individuals from deepfake abuse, but it does not address the corporate fraud vector demonstrated in the Arup attack. Business email compromise (BEC) and video-call deepfakes fall under existing wire fraud statutes, which carry penalties of up to 20 years but depend on federal investigation (typically by the FBI) and prosecution by the Department of Justice.
How to verify identity on video calls
Given that real-time deepfakes are now practical, organizations need new verification protocols for high-value decisions [8]:
Callback verification. Before executing any financial transaction discussed on a video call, hang up and call the requester back on a known phone number (not one provided in the meeting invite or chat). This simple step would have prevented the Arup attack entirely.
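The core rule is mechanical enough to encode. Below is a minimal Python sketch, where KNOWN_NUMBERS is a hypothetical stand-in for a directory you maintain out of band (HR records, a vendor-management system); the names are illustrative, not a real API.

```python
# Callback verification: the number you dial back must come from your
# own records, never from the request itself.

KNOWN_NUMBERS = {
    "cfo@example.com": "+44 20 7946 0000",  # maintained out of band
}

def callback_number(requester_email: str, number_in_request: str) -> str:
    """Return the number to dial back, always from our own records."""
    # The number offered in the invite or chat is attacker-controlled,
    # so it is deliberately never used.
    trusted = KNOWN_NUMBERS.get(requester_email)
    if trusted is None:
        raise LookupError(f"no trusted number on file for {requester_email}")
    return trusted
```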
Shared secrets. Establish code words or phrases with key personnel that must be spoken during any call authorizing financial transactions. Change these periodically. A deepfake puppeteer cannot reproduce a secret they do not know.
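If the code phrase is ever checked by software in the approval workflow (a bot, a ticketing system), store only a salted hash and compare it in constant time. A minimal sketch; the salt, iteration count, and phrase are placeholders:

```python
import hashlib
import hmac

SALT = b"rotate-with-the-phrase"  # placeholder; use per-secret salts

def hash_phrase(phrase: str) -> bytes:
    # PBKDF2 makes brute-forcing a leaked hash expensive.
    return hashlib.pbkdf2_hmac("sha256", phrase.encode(), SALT, 600_000)

STORED = hash_phrase("correct horse battery staple")  # set at enrollment

def phrase_matches(candidate: str) -> bool:
    # compare_digest runs in constant time, so response timing does not
    # leak how many leading bytes of the hash matched.
    return hmac.compare_digest(hash_phrase(candidate), STORED)
```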
Multi-channel confirmation. Require authorization through a separate channel. If the request comes via video call, require confirmation via a signed email, an authenticated Slack message, or an in-person signature.
Challenge questions. Ask something only the real person would know. Not publicly available information (birthday, alma mater) but operational details ("What was the final number in the Q3 forecast we reviewed yesterday?").
Liveness testing. Ask the person on the call to do something unexpected. Turn sideways and show their profile. Hold up a specific number of fingers. Pick up a nearby object. Current real-time deepfakes struggle with abrupt pose changes, hand interactions, and object occlusion.
Transaction limits and delays. Implement mandatory cooling-off periods for large transactions. No transfer over a certain threshold should execute on the same day it is requested. This buys time for verification and blunts the pressure tactics that social engineering relies on.
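Two of these controls reduce naturally to code. The sketch below is a single policy gate with hypothetical limits: nothing executes without second-channel confirmation, and anything over the threshold waits out a cooling-off period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

SAME_DAY_LIMIT = 50_000          # USD; illustrative threshold
COOLING_OFF = timedelta(days=1)  # illustrative minimum delay

@dataclass
class TransferRequest:
    amount_usd: float
    requested_at: datetime
    confirmed_second_channel: bool  # e.g. signed email or in-person sign-off

def may_execute(req: TransferRequest, now: datetime) -> bool:
    """Apply the cooling-off policy; a video call alone never suffices."""
    if not req.confirmed_second_channel:
        return False
    if req.amount_usd <= SAME_DAY_LIMIT:
        return True
    # Large transfers wait, even if "the CFO" confirmed them on the call.
    return now - req.requested_at >= COOLING_OFF
```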
Detection tools and their limits
Several companies and research groups offer deepfake detection tools [9]:
- Microsoft Video Authenticator analyzes videos for subtle artifacts at the blending boundaries where the deepfake face meets the real background
- Intel FakeCatcher detects deepfakes by analyzing blood flow patterns in facial video (real faces show subtle color changes as blood pulses; deepfakes do not)
- Sensity AI offers a commercial deepfake detection API
- Deepware Scanner is a free tool for checking uploaded videos
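These products expose different interfaces, so the sketch below assumes a hypothetical REST endpoint that accepts an uploaded video and returns a confidence score; the URL, field names, and threshold are invented for illustration.

```python
import requests

DETECT_URL = "https://detector.example.com/v1/analyze"  # hypothetical
API_KEY = "..."  # load from a secrets manager, never from source

def deepfake_score(video_path: str) -> float:
    """Upload a recording and return the service's deepfake confidence."""
    with open(video_path, "rb") as f:
        resp = requests.post(
            DETECT_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"video": f},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()["deepfake_probability"]  # illustrative field name

if deepfake_score("suspicious_call.mp4") > 0.8:
    print("High deepfake probability: escalate to the security team")
```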
However, detection has fundamental limitations:
- Adversarial arms race. Every detection technique can be countered. When detectors learn to spot blending artifacts, generators learn to eliminate them
- Real-time detection is hard. Analyzing a recorded video for deepfake artifacts is feasible. Detecting a deepfake in real time during a live video call, with compressed video and variable lighting, is significantly harder
- False positives. Poor lighting, low bandwidth, and bad webcams create artifacts that resemble deepfake artifacts. Detection tools tuned for high sensitivity will flag legitimate video calls
- Compression destroys evidence. Video calls compress video heavily (Zoom, Teams, and Google Meet all use lossy compression). This compression destroys many of the subtle artifacts that detection tools rely on
The uncomfortable truth is that detection will always lag behind generation. Prevention through verification protocols is more reliable than attempting to detect deepfakes in real time.
The trajectory
The cost of producing a convincing deepfake drops by roughly half every 12 months. The quality improves on a similar curve. Within two to three years, real-time deepfakes will be indistinguishable from real video even under expert analysis [10].
This means:
- Video evidence will become unreliable in legal proceedings without cryptographic provenance (a sketch of the idea follows this list)
- Video-based identity verification will require fundamental redesign
- Remote work introduces new trust challenges when you cannot physically verify who you are talking to
- Financial controls must evolve to assume that any remote communication could be fabricated
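Cryptographic provenance means binding a signature to the media at capture time, so any later modification is detectable. Standards like C2PA formalize this; the sketch below is a much-simplified illustration that signs a file's SHA-256 hash with an Ed25519 key via the Python cryptography package (the filename is a placeholder).

```python
import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def file_digest(path: str) -> bytes:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

# At capture time, the recording device signs the video's hash.
signing_key = Ed25519PrivateKey.generate()
signature = signing_key.sign(file_digest("board_call.mp4"))

# Later, anyone holding the public key can verify the file is unmodified.
try:
    signing_key.public_key().verify(signature, file_digest("board_call.mp4"))
    print("Provenance intact: file matches the signed hash")
except InvalidSignature:
    print("File was altered after signing")
```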
The Arup attack was $25.6 million. It will not be the largest. The technology that made it possible is cheaper and better every month.
Sources
- CNN, "Finance Worker Pays Out $25 Million After Video Call With Deepfake CFO," February 2024
- South China Morning Post, "Hong Kong Police Report $25.6M Deepfake Video Call Fraud at Multinational Firm," February 2024
- MIT Technology Review, "The Technology Behind Real-Time Deepfake Video Calls," 2024
- ElevenLabs, "Voice Cloning: Technical Documentation," 2025; OpenVoice, "Instant Voice Cloning," arXiv:2312.01479
- Recorded Future, "Deepfake-as-a-Service: The Commoditization of Synthetic Media Fraud," 2025
- Gartner, "Predicts 2025: AI Will Disrupt Identity Verification and Biometric Security," 2024
- US Congress, "TAKE IT DOWN Act," Public Law, 2025
- FBI, "Public Service Announcement: Deepfake Audio and Video Used in Business Email Compromise," IC3, 2024
- IEEE, "A Survey of Deepfake Detection Methods," Transactions on Information Forensics and Security, 2024
- RAND Corporation, "The Future of Deepfakes: Implications for National Security," 2025
