Deepfake Audio Detection Model

Fine-tuned Wav2Vec2 model for detecting AI-generated speech. It determines whether audio was spoken by a human or created by AI text-to-speech or voice-cloning software.

Model Details

Model Description

Fine-tuned Wav2Vec2 transformer for binary audio classification (real vs AI-generated speech). Trained to distinguish authentic human speech from synthetic audio generated by AI text-to-speech and voice cloning services including:

  • ElevenLabs
  • Amazon Polly
  • Hexgrad Kokoro
  • Hume AI
  • Speechify
  • Luvvoice

Developed by: Gary A. Stafford

Note: This model uses transfer learning from a base model already trained for deepfake detection. Fast convergence is expected due to task similarity and TTS engine overlap with the base model's training data.

How to Use

Installation

Install the required dependencies:

pip install transformers torch librosa

Optional: For GPU acceleration (recommended):

# For CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
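
After installing, a quick sanity check (optional, not part of the model API) can confirm that PyTorch sees your GPU:

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")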

Quick Start

import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load model and feature extractor
model_name = "garystafford/wav2vec2-deepfake-voice-detector"
model = AutoModelForAudioClassification.from_pretrained(model_name)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Load and preprocess audio (automatically resamples to 16kHz)
audio, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = torch.nn.functional.softmax(logits, dim=-1)

# Get prediction
prob_real = probs[0][0].item()
prob_fake = probs[0][1].item()
prediction = "fake" if prob_fake > 0.5 else "real"

print(f"Prediction: {prediction}")
print(f"Confidence: {max(prob_real, prob_fake):.2%}")
print(f"Probabilities - Real: {prob_real:.2%}, Fake: {prob_fake:.2%}")

Expected Input

  • Audio format: WAV, MP3, FLAC, or any format supported by librosa
  • Sample rate: Automatically resampled to 16kHz
  • Channels: Converted to mono
  • Duration: Optimal performance on 2.5-13 second clips (model training range)
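
If your recordings run longer than the 2.5-13 second training range, one option is to split them into fixed-length windows before inference. A minimal sketch (the 10-second window size and the file path are illustrative choices, not requirements of the model):

import librosa

def chunk_audio(path, window_s=10.0, sr=16000):
    """Split a long recording into windows that fall inside the model's training range."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    window = int(window_s * sr)
    chunks = [audio[i:i + window] for i in range(0, len(audio), window)]
    # Drop a trailing chunk shorter than ~2.5 s, which falls below the training range
    return [c for c in chunks if len(c) >= int(2.5 * sr)]

chunks = chunk_audio("path/to/long_recording.wav")
print(f"{len(chunks)} chunks ready for inference")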

Output

The model outputs logits (raw, unnormalized scores) for two classes:

  • Class 0: Real (human) audio
  • Class 1: Fake (AI-generated) audio

Converting Logits to Probabilities:

Apply softmax to convert raw logits into interpretable probability scores:

probs = torch.nn.functional.softmax(logits, dim=-1)
  • Single sample: logits.shape = (1, 2) → probs.shape = (1, 2), where probs[0] contains [prob_real, prob_fake] summing to 1.0
  • Batch processing: logits.shape = (N, 2) → probs.shape = (N, 2), where each sample's probabilities sum to 1.0 independently
  • dim=-1: Applies softmax across classes for each sample, not across samples
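
Continuing from the Quick Start snippet, the predicted class can also be taken as an argmax over the class dimension; whether model.config.id2label maps indices to human-readable names (rather than generic LABEL_0 / LABEL_1) depends on how the checkpoint was saved:

# Argmax over the class dimension gives the predicted class index for each sample
pred_ids = torch.argmax(probs, dim=-1)  # shape (N,)

# id2label may contain generic names (e.g., "LABEL_0"/"LABEL_1") depending on the checkpoint config
labels = [model.config.id2label[i.item()] for i in pred_ids]
print(labels)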

Batch Processing Example

import glob

audio_files = glob.glob("audio_folder/*.wav")

for audio_path in audio_files:
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)
    inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

    prediction = "fake" if probs[0][1] > 0.5 else "real"
    print(f"{audio_path}: {prediction} ({probs[0][1]:.2%} fake)")

Training Details

Dataset

Source: garystafford/deepfake-audio-detection

Composition:

  • Real audio: human speech samples drawn from YouTube recordings (14 source videos)
  • Synthetic audio: Generated using 6 TTS platforms (ElevenLabs, Amazon Polly, Hexgrad Kokoro, Hume AI, Speechify, Luvvoice)
  • Format: FLAC, 16kHz mono, 2.5-13 second chunks
  • Total samples: 1,866 (balanced: 933 real, 933 fake)
  • Processing: Two-pass audio splitting with silence detection, concatenation of short segments, and VAD-based sub-chunking
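
The exact preprocessing code ships with the dataset, but the silence-detection and merging steps can be approximated with librosa roughly as follows (the file name and the top_db threshold are illustrative, not the values used to build the dataset):

import numpy as np
import librosa

audio, sr = librosa.load("source_recording.wav", sr=16000, mono=True)

# Pass 1: split on silence (top_db is an illustrative threshold)
intervals = librosa.effects.split(audio, top_db=30)
segments = [audio[start:end] for start, end in intervals]

# Pass 2: merge consecutive segments until each chunk reaches at least ~2.5 s
chunks, current = [], np.array([], dtype=audio.dtype)
for seg in segments:
    current = np.concatenate([current, seg])
    if len(current) >= int(2.5 * sr):
        chunks.append(current)
        current = np.array([], dtype=audio.dtype)

print(f"{len(chunks)} chunks of at least 2.5 s")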

Split:

| Split      | Real | Fake | Total | Percentage |
|------------|------|------|-------|------------|
| Train      | 746  | 746  | 1,492 | 80%        |
| Validation | 93   | 94   | 187   | 10%        |
| Test       | 94   | 93   | 187   | 10%        |

Stratified splitting was applied to ensure a balanced class distribution across all splits.
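
An 80/10/10 stratified split like the one above can be reproduced with scikit-learn (a sketch; file_paths and labels are placeholders for the dataset's audio paths and 0/1 class labels, the random seed is arbitrary, and scikit-learn is not among the dependencies listed earlier):

from sklearn.model_selection import train_test_split

# First split off 20% for validation + test, preserving the class ratio
train_files, temp_files, train_labels, temp_labels = train_test_split(
    file_paths, labels, test_size=0.20, stratify=labels, random_state=42
)
# Then split that 20% evenly into validation and test sets
val_files, test_files, val_labels, test_labels = train_test_split(
    temp_files, temp_labels, test_size=0.50, stratify=temp_labels, random_state=42
)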

Training Approach

Base Model: Gustking/wav2vec2-large-xlsr-deepfake-audio-classification, a Wav2Vec2-XLSR model pre-trained on 53 languages and already fine-tuned for deepfake audio detection.

Method: Transfer learning with selective layer freezing:

  • Frozen:
    • Wav2Vec2 feature extractor (convolutional layers)
    • Bottom 12 transformer encoder layers
  • Trained:
    • Top 12 transformer encoder layers (upper half)
    • Classification head (256-dimensional projection + linear classifier)
    • ~160M trainable parameters (approximately half the model)
  • Rationale: Freezing the low-level acoustic layers and training only the higher, more semantic layers lets the model adapt to this dataset's specific TTS characteristics and speaker patterns without losing its general audio understanding.
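
In code, the selective freezing described above might look roughly like the following for a Wav2Vec2 classification model loaded with transformers (a sketch, not the exact training script; attribute names follow the Wav2Vec2ForSequenceClassification implementation):

from transformers import AutoModelForAudioClassification

model = AutoModelForAudioClassification.from_pretrained(
    "Gustking/wav2vec2-large-xlsr-deepfake-audio-classification", num_labels=2
)

# Freeze the convolutional feature extractor
model.freeze_feature_encoder()

# Freeze the bottom 12 of the 24 transformer encoder layers
for layer in model.wav2vec2.encoder.layers[:12]:
    for param in layer.parameters():
        param.requires_grad = False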

Hyperparameters

| Parameter                   | Value             |
|-----------------------------|-------------------|
| Learning rate               | 3e-5              |
| Epochs (max)                | 5                 |
| Early stopping patience     | 3 evaluations     |
| Evaluation frequency        | Every 30 steps    |
| Per-device batch size       | 4                 |
| Gradient accumulation steps | 4                 |
| Effective batch size        | 16                |
| Optimizer                   | AdamW             |
| Warmup ratio                | 0.1 (10%)         |
| Weight decay                | 0.01              |
| Save strategy               | Every 30 steps    |
| Metric for best model       | ROC-AUC           |
| Precision                   | FP16              |
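
With the Hugging Face Trainer, these hyperparameters map onto TrainingArguments roughly as follows (a sketch; the output directory, dataset objects, and compute_metrics function are placeholders, and older transformers releases spell eval_strategy as evaluation_strategy). AdamW is the Trainer's default optimizer, so it needs no explicit setting:

from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="wav2vec2-deepfake-finetune",  # placeholder output directory
    learning_rate=3e-5,
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16
    eval_strategy="steps",
    eval_steps=30,
    save_strategy="steps",
    save_steps=30,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",  # assumes compute_metrics returns a "roc_auc" key
    greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder dataset objects
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)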

Training Statistics:

  • Training samples: 1,492 (746 real, 746 fake)
  • Validation samples: 187 (93 real, 94 fake)
  • Trainable parameters: 160,336,770 (~160M parameters, approximately 50% of full model)
  • Training approach: Freeze feature extractor and bottom 12 transformer layers; train top 12 transformer layers + classification head
  • Convergence: typically within ~3-4 epochs, thanks to the base model's existing deepfake detection capabilities
  • Why the high performance? Transfer learning from a specialist deepfake detector enables rapid adaptation to this dataset, while training a substantial portion of the model captures dataset-specific patterns

Architecture

The model uses AutoModelForAudioClassification with a two-class output (0=real, 1=fake):

  • Feature Extractor (Frozen): 7 convolutional layers extract acoustic features from raw audio
  • Transformer Encoder:
    • Layers 0-11 (Frozen): Preserve low-level acoustic and phonetic representations
    • Layers 12-23 (Trained): Adapt high-level semantic features to deepfake patterns
  • Classification Head (Trained): 256-dimensional projection + linear classifier

This architecture balances efficiency with adaptability: frozen layers preserve general audio understanding, while the trained layers (~160M parameters) learn dataset-specific deepfake detection patterns.
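
The frozen/trained split can be verified by counting parameters (continuing from the freezing sketch above; the count should land near the ~160M figure quoted earlier):

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,} / {total:,} ({trainable / total:.1%})")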

Model Performance

⚠️ IMPORTANT CONTEXT: These high-performance metrics reflect fine-tuning a specialist model on its own domain. The base model (Gustking/wav2vec2-large-xlsr-deepfake-audio-classification) was already trained for deepfake detection, likely on similar TTS engines. These results demonstrate successful adaptation to this specific dataset of 1,866 samples, NOT general deepfake detection capability from scratch. The excellent ROC-AUC (0.998) indicates near-perfect class separation, though 4 samples (2.1%) are still misclassified at the default 0.5 threshold.

Validation Set Performance

The model performs well on the validation set of 187 audio clips (94 real, 93 fake):

Validation Results (at threshold 0.5):

  • Accuracy: 97.9% (183 out of 187 samples correctly classified)
  • ROC-AUC: 0.998 (near-perfect class separation)
  • Balanced Accuracy: 97.9%

Per-Class Metrics (threshold 0.5):

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| Real  | 1.00      | 0.96   | 0.98     | 94      |
| Fake  | 0.96      | 1.00   | 0.98     | 93      |

Confusion Matrix (threshold 0.5):

|           | Pred Real | Pred Fake |
|-----------|-----------|-----------|
| True Real | 90        | 4         |
| True Fake | 0         | 93        |

Note: The best balanced accuracy, 98.4%, is achieved at a threshold of 0.9 (96.8% real recall, 100% fake recall).
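
Applying a non-default threshold simply means comparing the fake-class probability against a different cutoff; balanced accuracy at a given threshold can be computed with scikit-learn (a sketch; y_true and y_prob_fake are placeholders for your ground-truth labels and the model's fake-class probabilities):

import numpy as np
from sklearn.metrics import balanced_accuracy_score

def balanced_accuracy_at(y_true, y_prob_fake, threshold):
    """Predict 'fake' (class 1) when the fake-class probability meets the threshold."""
    y_pred = (np.asarray(y_prob_fake) >= threshold).astype(int)
    return balanced_accuracy_score(y_true, y_pred)

for t in (0.5, 0.7, 0.9):
    print(f"threshold={t}: balanced accuracy={balanced_accuracy_at(y_true, y_prob_fake, t):.3f}")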

Important Notes on Performance

Context for High Performance:

  1. Moderate validation set: 187 samples provide a reasonable evaluation, though larger test sets are recommended before production use
  2. Transfer learning: the base model was already trained for deepfake detection on similar TTS engines; fine-tuning adapts that existing knowledge
  3. Dataset characteristics: TTS-generated audio has distinctive artifacts (prosody patterns, spectral signatures) that differentiate it from human speech
  4. ROC-AUC of 0.998: indicates near-perfect ranking/separation of the classes; 4 real samples are misclassified as fake at threshold 0.5, while all fake samples are correctly identified
  5. Recommended validation: test on TTS engines NOT in the training data (e.g., OpenAI TTS, Azure Neural, advanced voice cloning systems) for a true assessment of generalization

Generalization Limitations:

  • The model may not generalize well to:
    • Novel TTS engines not represented in training data
    • Advanced voice cloning/conversion systems
    • Real-time voice manipulation
    • Low-quality recordings with significant noise

Inference Performance

Estimates based on the model architecture:

  • Latency: ~50-100ms per sample (varies by hardware)
  • Recommended use: Batch processing for efficiency
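
Actual latency depends on hardware, clip length, and batch size; it can be measured on your own setup with a simple timing loop (a sketch, continuing from the Quick Start snippet):

import time

# Warm-up pass so one-time setup costs (e.g., CUDA initialization) are not counted
with torch.no_grad():
    model(**inputs)
if device == "cuda":
    torch.cuda.synchronize()

runs = 20
start = time.perf_counter()
with torch.no_grad():
    for _ in range(runs):
        model(**inputs)
if device == "cuda":
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
elapsed = (time.perf_counter() - start) / runs
print(f"Average latency: {elapsed * 1000:.1f} ms per forward pass")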