Deepfake Audio Detection Model
Fine-tuned Wav2Vec2 model for detecting AI-generated speech. Determines if audio was spoken by a human or created by AI text-to-speech/voice cloning software.
Model Details
Model Description
Fine-tuned Wav2Vec2 transformer for binary audio classification (real vs AI-generated speech). Trained to distinguish authentic human speech from synthetic audio generated by AI text-to-speech and voice cloning services including:
- ElevenLabs
- Amazon Polly
- Hexgrad Kokoro
- Hume AI
- Speechify
- Luvvoice
Developed by: Gary A. Stafford
Note: This model uses transfer learning from a base model already trained for deepfake detection. Fast convergence is expected due to task similarity and TTS engine overlap with the base model's training data.
How to Use
Installation
Install the required dependencies:
```bash
pip install transformers torch librosa
```
Optional: For GPU acceleration (recommended):
```bash
# For CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
```
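To confirm that PyTorch can see the GPU after installation, a quick standalone check (not required for CPU-only use) is:

```python
import torch

# Report the installed PyTorch version and whether a CUDA device is visible
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```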
Quick Start
```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load model and feature extractor
model_name = "garystafford/wav2vec2-deepfake-voice-detector"
model = AutoModelForAudioClassification.from_pretrained(model_name)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Load and preprocess audio (automatically resamples to 16kHz)
audio, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = torch.nn.functional.softmax(logits, dim=-1)

# Get prediction
prob_real = probs[0][0].item()
prob_fake = probs[0][1].item()
prediction = "fake" if prob_fake > 0.5 else "real"

print(f"Prediction: {prediction}")
print(f"Confidence: {max(prob_real, prob_fake):.2%}")
print(f"Probabilities - Real: {prob_real:.2%}, Fake: {prob_fake:.2%}")
```
Expected Input
- Audio format: WAV, MP3, FLAC, or any format supported by librosa
- Sample rate: Automatically resampled to 16kHz
- Channels: Converted to mono
- Duration: Optimal performance on 2.5-13 second clips (model training range)
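For recordings longer than the 2.5-13 second training range, this card does not prescribe a handling strategy. One option, sketched below, is to score fixed-length chunks and aggregate the per-chunk fake probabilities; the 10-second chunk length and simple averaging are illustrative choices, not part of the released model, and the sketch reuses `model`, `feature_extractor`, and `device` from the Quick Start example.

```python
import librosa
import numpy as np
import torch


def predict_long_audio(path, chunk_seconds=10.0, sr=16000):
    """Split a long recording into ~10 s chunks (within the 2.5-13 s training range)
    and average the per-chunk fake probabilities."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    chunk_len = int(chunk_seconds * sr)
    fake_probs = []
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start:start + chunk_len]
        if len(chunk) < int(2.5 * sr):  # skip fragments shorter than the training minimum
            continue
        inputs = feature_extractor(chunk, sampling_rate=sr, return_tensors="pt", padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            probs = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)
        fake_probs.append(probs[0][1].item())
    # Returns None if the recording yields no usable chunks
    return float(np.mean(fake_probs)) if fake_probs else None
```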
Output
The model outputs logits (raw, unnormalized scores) for two classes:
- Class 0: Real (human) audio
- Class 1: Fake (AI-generated) audio
Converting Logits to Probabilities:
Apply softmax to convert raw logits into interpretable probability scores:
```python
probs = torch.nn.functional.softmax(logits, dim=-1)
```

- Single sample: `logits.shape = (1, 2)` → `probs.shape = (1, 2)`, where `probs[0]` contains `[prob_real, prob_fake]` summing to 1.0
- Batch processing: `logits.shape = (N, 2)` → `probs.shape = (N, 2)`, where each sample's probabilities sum to 1.0 independently
- `dim=-1`: Applies softmax across classes for each sample, not across samples
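As a quick illustration of the shapes above (standalone toy tensors, not model output):

```python
import torch

# Toy batch of three logit pairs, shape (3, 2)
logits = torch.tensor([[2.0, -1.0], [0.3, 0.5], [-1.5, 3.0]])
probs = torch.nn.functional.softmax(logits, dim=-1)

print(probs.shape)        # torch.Size([3, 2])
print(probs.sum(dim=-1))  # each row sums to 1.0
```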
Batch Processing Example
```python
import glob

audio_files = glob.glob("audio_folder/*.wav")

for audio_path in audio_files:
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)
    inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

    prediction = "fake" if probs[0][1] > 0.5 else "real"
    print(f"{audio_path}: {prediction} ({probs[0][1]:.2%} fake)")
```
Training Details
Dataset
Source: garystafford/deepfake-audio-detection
Composition:
- Real audio: human speech samples drawn from recordings of 14 source YouTube videos
- Synthetic audio: Generated using 6 TTS platforms (ElevenLabs, Amazon Polly, Hexgrad Kokoro, Hume AI, Speechify, Luvvoice)
- Format: FLAC, 16kHz mono, 2.5-13 second chunks
- Total samples: 1,866 (balanced: 933 real, 933 fake)
- Processing: Two-pass audio splitting with silence detection, concatenation of short segments, and VAD-based sub-chunking
Split:
| Split | Real | Fake | Total | Percentage |
|---|---|---|---|---|
| Train | 746 | 746 | 1,492 | 80% |
| Validation | 93 | 94 | 187 | 10% |
| Test | 94 | 93 | 187 | 10% |
Stratified splitting applied to ensure balanced class distribution across all splits.
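For illustration only, a stratified 80/10/10 split like the one described above can be produced with scikit-learn (an assumed dependency not listed in this card; the dataset card, not this snippet, defines the actual split):

```python
from sklearn.model_selection import train_test_split


def stratified_80_10_10(samples, labels, seed=42):
    """Split samples/labels 80/10/10 while preserving the real/fake class balance."""
    train_x, rest_x, train_y, rest_y = train_test_split(
        samples, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    val_x, test_x, val_y, test_y = train_test_split(
        rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=seed
    )
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)
```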
Training Approach
Base Model: Gustking/wav2vec2-large-xlsr-deepfake-audio-classification - A Wav2Vec2-XLSR model pre-trained on 53 languages and already fine-tuned for deepfake audio detection.
Method: Transfer learning with selective layer freezing (a code sketch of this setup follows the list):
- Frozen:
  - Wav2Vec2 feature extractor (convolutional layers)
  - Bottom 12 transformer encoder layers
- Trained:
  - Top 12 transformer encoder layers (upper half)
  - Classification head (256-dimensional projection + linear classifier)
  - ~160M trainable parameters (approximately half the model)
- Rationale: Freezing low-level acoustic features while training high-level semantic layers allows the model to adapt to this dataset's specific TTS characteristics and speaker patterns while preserving general audio understanding.
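A minimal sketch of this freezing scheme (not the author's actual training code), assuming the standard Hugging Face Wav2Vec2 module layout (`model.wav2vec2.encoder.layers`, `freeze_feature_encoder()`); verify attribute names against your installed transformers version:

```python
from transformers import AutoModelForAudioClassification

# Start from the specialist base model named above
model = AutoModelForAudioClassification.from_pretrained(
    "Gustking/wav2vec2-large-xlsr-deepfake-audio-classification"
)

# Freeze the convolutional feature extractor
model.freeze_feature_encoder()

# Freeze the bottom 12 of the 24 transformer encoder layers
for layer in model.wav2vec2.encoder.layers[:12]:
    for param in layer.parameters():
        param.requires_grad = False

# Everything else (top 12 encoder layers + projection/classifier head) stays trainable
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```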
Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 3e-5 |
| Epochs (max) | 5 |
| Early stopping patience | 3 evaluations |
| Evaluation frequency | Every 30 steps |
| Per-device batch size | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| Optimizer | AdamW |
| Warmup ratio | 0.1 (10%) |
| Weight decay | 0.01 |
| Save strategy | Every 30 steps |
| Metric for best model | ROC-AUC |
| Precision | FP16 |
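For reference, a minimal sketch of how these settings might map onto `transformers.TrainingArguments` plus an early-stopping callback (not the author's actual training script; `eval_strategy` follows recent transformers releases, where older versions use `evaluation_strategy`, and `metric_for_best_model="roc_auc"` assumes a `compute_metrics` function that reports that key):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="wav2vec2-deepfake-voice-detector",
    learning_rate=3e-5,
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    warmup_ratio=0.1,
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=30,
    save_strategy="steps",
    save_steps=30,
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",
    greater_is_better=True,
    fp16=True,
)

# Stop training after 3 evaluations without ROC-AUC improvement
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```

AdamW is the Trainer's default optimizer, matching the table above.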
Training Statistics:
- Training samples: 1,492 (746 real, 746 fake)
- Validation samples: 187 (93 real, 94 fake)
- Trainable parameters: 160,336,770 (~160M parameters, approximately 50% of full model)
- Training approach: Freeze feature extractor and bottom 12 transformer layers; train top 12 transformer layers + classification head
- Convergence: typically within ~3-4 epochs, due to the base model's existing deepfake detection capabilities
- Why high performance? Transfer learning from a specialist deepfake detector allows rapid adaptation to this dataset while training substantial portions of the model to capture dataset-specific patterns
Architecture
The model uses AutoModelForAudioClassification with a two-class output (0=real, 1=fake):
- Feature Extractor (Frozen): 7 convolutional layers extract acoustic features from raw audio
- Transformer Encoder:
  - Layers 0-11 (Frozen): Preserve low-level acoustic and phonetic representations
  - Layers 12-23 (Trained): Adapt high-level semantic features to deepfake patterns
- Classification Head (Trained): 256-dimensional projection + linear classifier
This architecture balances efficiency with adaptability—frozen layers preserve general audio understanding while trained layers (~160M parameters) learn dataset-specific deepfake detection patterns.
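The two-class setup can be confirmed from the checkpoint configuration (label names stored in the checkpoint may be generic, so the index convention above remains the reference):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("garystafford/wav2vec2-deepfake-voice-detector")
print(config.num_labels)  # expected: 2
print(config.id2label)    # mapping from class index to the label name stored in the checkpoint
```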
Model Performance
⚠️ IMPORTANT CONTEXT: These high-performance metrics reflect fine-tuning a specialist model on its own domain. The base model (Gustking/wav2vec2-large-xlsr-deepfake-audio-classification) was already trained for deepfake detection, likely on similar TTS engines. These results demonstrate successful adaptation to this specific dataset of 1,866 samples, NOT general deepfake detection capability from scratch. The excellent ROC-AUC (0.998) indicates near-perfect class separation, though 4 samples (2.1%) are still misclassified at the default 0.5 threshold.
Validation Set Performance
The model performs well on the validation set of 187 audio clips (94 real, 93 fake):
Validation Results (at threshold 0.5):
- Accuracy: 97.9% (183 out of 187 samples correctly classified)
- ROC-AUC: 0.998 (near-perfect class separation)
- Balanced Accuracy: 97.9%
Per-Class Metrics (threshold 0.5):
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Real | 1.00 | 0.96 | 0.98 | 94 |
| Fake | 0.96 | 1.00 | 0.98 | 93 |
Confusion Matrix (threshold 0.5):
| | Pred Real | Pred Fake |
|---|---|---|
| True Real | 90 | 4 |
| True Fake | 0 | 93 |
Note: Best balanced accuracy of 98.4% achieved at threshold 0.9 (96.8% real recall, 100% fake recall).
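To reproduce that operating point, the threshold can be applied directly to the fake-class probability computed in the Quick Start example; a minimal sketch, assuming the 0.9 threshold refers to `prob_fake`:

```python
def classify(prob_fake: float, threshold: float = 0.9) -> str:
    """Label a clip as fake only when the fake-class probability clears the threshold.
    threshold=0.5 reproduces the default behaviour used elsewhere in this card."""
    return "fake" if prob_fake >= threshold else "real"


print(classify(0.95))  # fake
print(classify(0.70))  # real at threshold 0.9 (would be fake at the default 0.5)
```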
Important Notes on Performance
Context for High Performance:
- Moderate validation set: 187 samples provide a reasonable evaluation, though larger test sets are recommended for production validation
- Transfer learning: Base model already trained for deepfake detection on similar TTS engines - fine-tuning adapts existing knowledge
- Dataset characteristics: TTS-generated audio has distinctive artifacts (prosody patterns, spectral signatures) that differentiate it from human speech
- ROC-AUC of 0.998: Indicates near-perfect ranking/separation of classes; 4 real samples misclassified as fake at threshold 0.5, while all fake samples correctly identified
- Recommended validation: Test on TTS engines NOT in training data (e.g., OpenAI TTS, Azure Neural, advanced voice cloning systems) for true generalization assessment
Generalization Limitations:
- Model may not generalize well to:
  - Novel TTS engines not represented in training data
  - Advanced voice cloning/conversion systems
  - Real-time voice manipulation
  - Low-quality recordings with significant noise
Inference Performance
Estimated based on model architecture:
- Latency: ~50-100ms per sample (varies by hardware)
- Recommended use: Batch processing for efficiency
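Actual latency depends heavily on hardware and clip length; a simple way to measure it on your own setup (reusing `model`, `feature_extractor`, and `device` from the Quick Start) is sketched below:

```python
import time
import numpy as np
import torch

# 10 seconds of silence as a stand-in input; substitute real audio for realistic numbers
dummy_audio = np.zeros(16000 * 10, dtype=np.float32)
inputs = feature_extractor(dummy_audio, sampling_rate=16000, return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Warm-up run so one-time initialization does not skew the timing
with torch.no_grad():
    model(**inputs)

if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    for _ in range(20):
        model(**inputs)
if device == "cuda":
    torch.cuda.synchronize()

print(f"Average latency: {(time.perf_counter() - start) / 20 * 1000:.1f} ms per sample")
```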