Deepfake Audio Detection Model

Fine-tuned Wav2Vec2 model for detecting AI-generated speech. It determines whether audio was spoken by a human or created by AI text-to-speech or voice-cloning software.

Model Details

Model Description

Fine-tuned Wav2Vec2 transformer for binary audio classification (real vs AI-generated speech). Trained to distinguish authentic human speech from synthetic audio generated by AI text-to-speech and voice cloning services including:

  • ElevenLabs
  • Amazon Polly
  • Hexgrad Kokoro
  • Hume AI
  • Speechify
  • Luvvoice

Developed by: Gary A. Stafford

Note: This model uses transfer learning from a base model already trained for deepfake detection. Fast convergence is expected due to task similarity and TTS engine overlap with the base model's training data.

How to Use

Installation

Install the required dependencies:

pip install transformers torch librosa

Optional: For GPU acceleration (recommended):

# For CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
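
After installing, a quick sanity check (optional, not part of the model API) can confirm that PyTorch sees your GPU:

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")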

Quick Start

import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load model and feature extractor
model_name = "garystafford/wav2vec2-deepfake-voice-detector"
model = AutoModelForAudioClassification.from_pretrained(model_name)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Load and preprocess audio (automatically resamples to 16kHz)
audio, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = torch.nn.functional.softmax(logits, dim=-1)

# Get prediction
prob_real = probs[0][0].item()
prob_fake = probs[0][1].item()
prediction = "fake" if prob_fake > 0.5 else "real"

print(f"Prediction: {prediction}")
print(f"Confidence: {max(prob_real, prob_fake):.2%}")
print(f"Probabilities - Real: {prob_real:.2%}, Fake: {prob_fake:.2%}")

Expected Input

  • Audio format: WAV, MP3, FLAC, or any format supported by librosa
  • Sample rate: Automatically resampled to 16kHz
  • Channels: Converted to mono
  • Duration: Optimal performance on 2.5-13 second clips (model training range)
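
If your recordings run longer than the 2.5-13 second training range, one option is to split them into fixed-length windows before inference. A minimal sketch (the 10-second window size and the file path are illustrative choices, not requirements of the model):

import librosa

def chunk_audio(path, window_s=10.0, sr=16000):
    """Split a long recording into windows that fall inside the model's training range."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    window = int(window_s * sr)
    chunks = [audio[i:i + window] for i in range(0, len(audio), window)]
    # Drop a trailing chunk shorter than ~2.5 s, which falls below the training range
    return [c for c in chunks if len(c) >= int(2.5 * sr)]

chunks = chunk_audio("path/to/long_recording.wav")
print(f"{len(chunks)} chunks ready for inference")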

Output

The model outputs logits (raw, unnormalized scores) for two classes:

  • Class 0: Real (human) audio
  • Class 1: Fake (AI-generated) audio

Converting Logits to Probabilities:

Apply softmax to convert raw logits into interpretable probability scores:

probs = torch.nn.functional.softmax(logits, dim=-1)
  • Single sample: logits.shape = (1, 2) → probs.shape = (1, 2), where probs[0] contains [prob_real, prob_fake] summing to 1.0
  • Batch processing: logits.shape = (N, 2) → probs.shape = (N, 2), where each sample's probabilities sum to 1.0 independently
  • dim=-1: Applies softmax across classes for each sample, not across samples
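
Continuing from the Quick Start snippet, the predicted class can also be taken as an argmax over the class dimension; whether model.config.id2label maps indices to human-readable names (rather than generic LABEL_0 / LABEL_1) depends on how the checkpoint was saved:

# Argmax over the class dimension gives the predicted class index for each sample
pred_ids = torch.argmax(probs, dim=-1)  # shape (N,)

# id2label may contain generic names (e.g., "LABEL_0"/"LABEL_1") depending on the checkpoint config
labels = [model.config.id2label[i.item()] for i in pred_ids]
print(labels)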

Batch Processing Example

import glob

audio_files = glob.glob("audio_folder/*.wav")

for audio_path in audio_files:
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)
    inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

    prediction = "fake" if probs[0][1] > 0.5 else "real"
    print(f"{audio_path}: {prediction} ({probs[0][1]:.2%} fake)")

Training Details

Dataset

Source: garystafford/deepfake-audio-detection

Composition:

  • Real audio: human speech samples drawn from YouTube recordings (14 source videos)
  • Synthetic audio: Generated using 6 TTS platforms (ElevenLabs, Amazon Polly, Hexgrad Kokoro, Hume AI, Speechify, Luvvoice)
  • Format: FLAC, 16kHz mono, 2.5-13 second chunks
  • Total samples: 1,866 (balanced: 933 real, 933 fake)
  • Processing: Two-pass audio splitting with silence detection, concatenation of short segments, and VAD-based sub-chunking
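
The exact preprocessing code ships with the dataset, but the silence-detection and merging steps can be approximated with librosa roughly as follows (the file name and the top_db threshold are illustrative, not the values used to build the dataset):

import numpy as np
import librosa

audio, sr = librosa.load("source_recording.wav", sr=16000, mono=True)

# Pass 1: split on silence (top_db is an illustrative threshold)
intervals = librosa.effects.split(audio, top_db=30)
segments = [audio[start:end] for start, end in intervals]

# Pass 2: merge consecutive segments until each chunk reaches at least ~2.5 s
chunks, current = [], np.array([], dtype=audio.dtype)
for seg in segments:
    current = np.concatenate([current, seg])
    if len(current) >= int(2.5 * sr):
        chunks.append(current)
        current = np.array([], dtype=audio.dtype)

print(f"{len(chunks)} chunks of at least 2.5 s")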

Split:

| Split      | Real | Fake | Total | Percentage |
|------------|------|------|-------|------------|
| Train      | 746  | 746  | 1,492 | 80%        |
| Validation | 93   | 94   | 187   | 10%        |
| Test       | 94   | 93   | 187   | 10%        |

Stratified splitting was applied to ensure a balanced class distribution across all splits.
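
An 80/10/10 stratified split like the one above can be reproduced with scikit-learn (a sketch; file_paths and labels are placeholders for the dataset's audio paths and 0/1 class labels, the random seed is arbitrary, and scikit-learn is not among the dependencies listed earlier):

from sklearn.model_selection import train_test_split

# First split off 20% for validation + test, preserving the class ratio
train_files, temp_files, train_labels, temp_labels = train_test_split(
    file_paths, labels, test_size=0.20, stratify=labels, random_state=42
)
# Then split that 20% evenly into validation and test sets
val_files, test_files, val_labels, test_labels = train_test_split(
    temp_files, temp_labels, test_size=0.50, stratify=temp_labels, random_state=42
)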

Training Approach

Base Model: Gustking/wav2vec2-large-xlsr-deepfake-audio-classification, a Wav2Vec2-XLSR model pre-trained on 53 languages and already fine-tuned for deepfake audio detection.

Method: Transfer learning with selective layer freezing:

  • Frozen:
    • Wav2Vec2 feature extractor (convolutional layers)
    • Bottom 12 transformer encoder layers
  • Trained:
    • Top 12 transformer encoder layers (upper half)
    • Classification head (256-dimensional projection + linear classifier)
    • ~160M trainable parameters (approximately half the model)
  • Rationale: Freezing the low-level acoustic layers and training only the higher, more semantic layers lets the model adapt to this dataset's specific TTS characteristics and speaker patterns without losing its general audio understanding.
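
In code, the selective freezing described above might look roughly like the following for a Wav2Vec2 classification model loaded with transformers (a sketch, not the exact training script; attribute names follow the Wav2Vec2ForSequenceClassification implementation):

from transformers import AutoModelForAudioClassification

model = AutoModelForAudioClassification.from_pretrained(
    "Gustking/wav2vec2-large-xlsr-deepfake-audio-classification", num_labels=2
)

# Freeze the convolutional feature extractor
model.freeze_feature_encoder()

# Freeze the bottom 12 of the 24 transformer encoder layers
for layer in model.wav2vec2.encoder.layers[:12]:
    for param in layer.parameters():
        param.requires_grad = False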

Hyperparameters

| Parameter                   | Value             |
|-----------------------------|-------------------|
| Learning rate               | 3e-5              |
| Epochs (max)                | 5                 |
| Early stopping patience     | 3 evaluations     |
| Evaluation frequency        | Every 30 steps    |
| Per-device batch size       | 4                 |
| Gradient accumulation steps | 4                 |
| Effective batch size        | 16                |
| Optimizer                   | AdamW             |
| Warmup ratio                | 0.1 (10%)         |
| Weight decay                | 0.01              |
| Save strategy               | Every 30 steps    |
| Metric for best model       | ROC-AUC           |
| Precision                   | FP16              |
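
With the Hugging Face Trainer, these hyperparameters map onto TrainingArguments roughly as follows (a sketch; the output directory, dataset objects, and compute_metrics function are placeholders, and older transformers releases spell eval_strategy as evaluation_strategy). AdamW is the Trainer's default optimizer, so it needs no explicit setting:

from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="wav2vec2-deepfake-finetune",  # placeholder output directory
    learning_rate=3e-5,
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16
    eval_strategy="steps",
    eval_steps=30,
    save_strategy="steps",
    save_steps=30,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",  # assumes compute_metrics returns a "roc_auc" key
    greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder dataset objects
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)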

Training Statistics:

  • Training samples: 1,492 (746 real, 746 fake)
  • Validation samples: 187 (93 real, 94 fake)
  • Trainable parameters: 160,336,770 (~160M parameters, approximately 50% of full model)
  • Training approach: Freeze feature extractor and bottom 12 transformer layers; train top 12 transformer layers + classification head
  • Convergence: typically within ~3-4 epochs, thanks to the base model's existing deepfake detection capabilities
  • Why the high performance? Transfer learning from a specialist deepfake detector enables rapid adaptation to this dataset, while training a substantial portion of the model captures dataset-specific patterns

Architecture

The model uses AutoModelForAudioClassification with a two-class output (0=real, 1=fake):

  • Feature Extractor (Frozen): 7 convolutional layers extract acoustic features from raw audio
  • Transformer Encoder:
    • Layers 0-11 (Frozen): Preserve low-level acoustic and phonetic representations
    • Layers 12-23 (Trained): Adapt high-level semantic features to deepfake patterns
  • Classification Head (Trained): 256-dimensional projection + linear classifier

This architecture balances efficiency with adaptability: frozen layers preserve general audio understanding, while the trained layers (~160M parameters) learn dataset-specific deepfake detection patterns.
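
The frozen/trained split can be verified by counting parameters (continuing from the freezing sketch above; the count should land near the ~160M figure quoted earlier):

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,} / {total:,} ({trainable / total:.1%})")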

Model Performance

⚠️ IMPORTANT CONTEXT: These high-performance metrics reflect fine-tuning a specialist model on its own domain. The base model (Gustking/wav2vec2-large-xlsr-deepfake-audio-classification) was already trained for deepfake detection, likely on similar TTS engines. These results demonstrate successful adaptation to this specific dataset of 1,866 samples, NOT general deepfake detection capability from scratch. The excellent ROC-AUC (0.998) indicates near-perfect class separation, though 4 samples (2.1%) are still misclassified at the default 0.5 threshold.

Validation Set Performance

The model performs well on the validation set of 187 audio clips (94 real, 93 fake):

Validation Results (at threshold 0.5):

  • Accuracy: 97.9% (183 out of 187 samples correctly classified)
  • ROC-AUC: 0.998 (near-perfect class separation)
  • Balanced Accuracy: 97.9%

Per-Class Metrics (threshold 0.5):

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| Real  | 1.00      | 0.96   | 0.98     | 94      |
| Fake  | 0.96      | 1.00   | 0.98     | 93      |

Confusion Matrix (threshold 0.5):

|           | Pred Real | Pred Fake |
|-----------|-----------|-----------|
| True Real | 90        | 4         |
| True Fake | 0         | 93        |

Note: The best balanced accuracy, 98.4%, is achieved at a threshold of 0.9 (96.8% real recall, 100% fake recall).
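
Applying a non-default threshold simply means comparing the fake-class probability against a different cutoff; balanced accuracy at a given threshold can be computed with scikit-learn (a sketch; y_true and y_prob_fake are placeholders for your ground-truth labels and the model's fake-class probabilities):

import numpy as np
from sklearn.metrics import balanced_accuracy_score

def balanced_accuracy_at(y_true, y_prob_fake, threshold):
    """Predict 'fake' (class 1) when the fake-class probability meets the threshold."""
    y_pred = (np.asarray(y_prob_fake) >= threshold).astype(int)
    return balanced_accuracy_score(y_true, y_pred)

for t in (0.5, 0.7, 0.9):
    print(f"threshold={t}: balanced accuracy={balanced_accuracy_at(y_true, y_prob_fake, t):.3f}")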

Important Notes on Performance

Context for High Performance:

  1. Moderate validation set: 187 samples provide a reasonable evaluation, though larger test sets are recommended before production use
  2. Transfer learning: the base model was already trained for deepfake detection on similar TTS engines; fine-tuning adapts that existing knowledge
  3. Dataset characteristics: TTS-generated audio has distinctive artifacts (prosody patterns, spectral signatures) that differentiate it from human speech
  4. ROC-AUC of 0.998: indicates near-perfect ranking/separation of the classes; 4 real samples are misclassified as fake at threshold 0.5, while all fake samples are correctly identified
  5. Recommended validation: test on TTS engines NOT in the training data (e.g., OpenAI TTS, Azure Neural, advanced voice cloning systems) for a true assessment of generalization

Generalization Limitations:

  • The model may not generalize well to:
    • Novel TTS engines not represented in training data
    • Advanced voice cloning/conversion systems
    • Real-time voice manipulation
    • Low-quality recordings with significant noise

Inference Performance

Estimates based on the model architecture:

  • Latency: ~50-100ms per sample (varies by hardware)
  • Recommended use: Batch processing for efficiency
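
Actual latency depends on hardware, clip length, and batch size; it can be measured on your own setup with a simple timing loop (a sketch, continuing from the Quick Start snippet):

import time

# Warm-up pass so one-time setup costs (e.g., CUDA initialization) are not counted
with torch.no_grad():
    model(**inputs)
if device == "cuda":
    torch.cuda.synchronize()

runs = 20
start = time.perf_counter()
with torch.no_grad():
    for _ in range(runs):
        model(**inputs)
if device == "cuda":
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
elapsed = (time.perf_counter() - start) / runs
print(f"Average latency: {elapsed * 1000:.1f} ms per forward pass")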