MuRIL Loksabha Speech Classifier
A multi-task classifier for analyzing bilingual (English/Hindi) political speeches, built on Google's MuRIL (Multilingual Representations for Indian Languages) base model.
Model Description
This model performs two simultaneous classification tasks on Indian political speeches:
Speech Type Classification (3 classes):
- national - National-level speeches
- regional - Regional-level speeches
- unclear - Speeches with unclear classification
Topic Classification (variable classes):
- Only predicted when speech type ≠ unclear
- Covers various political topics specific to the speech corpus
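The type-gated topic output described above can be sketched in a few lines. This is an illustrative sketch, not the repository's actual inference code; the function name `predict` and the list-of-scores input are assumptions.

```python
def predict(type_label, topic_logits):
    """Emit a topic only when the type head did not say 'unclear';
    for unclear speeches the topic prediction is skipped entirely.
    topic_logits is a plain list of per-class scores."""
    if type_label == "unclear":
        return type_label, None
    # argmax over the topic scores
    best_topic = max(range(len(topic_logits)), key=topic_logits.__getitem__)
    return type_label, best_topic
```

For example, `predict("unclear", scores)` returns no topic, while `predict("national", scores)` returns the highest-scoring topic index.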
Key Features
- Bilingual Support: Handles both English and Hindi text using MuRIL tokenizer
- Smart Truncation: Head+Tail truncation (512 tokens) preserves both beginning and ending of long speeches
- Balanced Training: Class weighting to handle imbalanced datasets
- Optimized Performance: Mixed precision (fp16) training for efficient GPU usage
- Robust Evaluation: Stratified splitting on speech type with macro-F1 selection
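The Head+Tail truncation feature above can be sketched as follows. The exact head/tail split used in training is not documented; the 128/384 split below is an illustrative assumption, and the function name is hypothetical.

```python
def head_tail_truncate(token_ids, max_len=512, head=128):
    """Head+Tail truncation: keep the first `head` tokens and fill the
    remaining budget from the end of the sequence, so both the opening
    and the closing of a long speech survive truncation."""
    if len(token_ids) <= max_len:
        return token_ids  # short sequences pass through unchanged
    tail = max_len - head
    return token_ids[:head] + token_ids[-tail:]

ids = list(range(1000))        # stand-in for an over-length token sequence
kept = head_tail_truncate(ids)
```

Compared with plain head-only truncation, this keeps a speech's concluding remarks, which often carry the strongest type/topic signal.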
Training Data
The model was trained on a corpus of Indian political speeches containing:
- Columns: Serial number, page, date, English name, Hindi name, party during election, party during speech, speech text, language, type, topic
- Languages: English and Hindi (code-mixed and monolingual)
- Source: Multiple CSV files from political speech databases
Model Architecture
- Base Model: google/muril-base-cased
- Task Heads:
- Type classifier: 3-class output (national, regional, unclear)
- Topic classifier: 7-class output
- Max Sequence Length: 512 tokens
- Truncation Strategy: Head+Tail (preserves start and end context)
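The two task heads can be sketched as independent linear classifiers over a shared encoder representation. This is a minimal sketch, not the repository's actual module: the encoder is stubbed out with a random tensor so the example runs without downloading weights, and the class/attribute names are assumptions. Hidden size 768 does match muril-base-cased.

```python
import torch
import torch.nn as nn

class TwoHeadClassifier(nn.Module):
    """Multi-task head layout: one shared pooled representation feeds
    two independent linear classifiers (3 types, 7 topics)."""
    def __init__(self, hidden_size=768, n_types=3, n_topics=7):
        super().__init__()
        self.type_head = nn.Linear(hidden_size, n_types)
        self.topic_head = nn.Linear(hidden_size, n_topics)

    def forward(self, pooled):  # pooled: (batch, hidden_size)
        return self.type_head(pooled), self.topic_head(pooled)

model = TwoHeadClassifier()
pooled = torch.randn(4, 768)   # stand-in for MuRIL's pooled [CLS] output
type_logits, topic_logits = model(pooled)
```

In training, each head gets its own cross-entropy loss, combined with the topic loss weight listed below under hyperparameters.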
Intended Use
Primary Use Cases
- Analyzing political speech content and themes
- Categorizing speeches by scope (national vs. regional)
- Topic extraction from Indian political discourse
- Research on multilingual political communication
Out-of-Scope Use Cases
- General-purpose text classification on non-political content
- Languages other than English and Hindi
- Real-time critical decision-making systems
Training Procedure
Training Hyperparameters
- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: 4 (per device)
- Gradient Accumulation Steps: 4
- Effective Batch Size: 16
- Epochs: 4
- Total Training Steps: 7,184
- Mixed Precision: fp16 enabled
- Topic Loss Weight: 1.0
- Evaluation Strategy: Best model selection on Type macro-F1
- Data Split: Stratified train/test split on Type labels
- Train samples: 28,740
- Test samples: 7,186
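The stratified split on Type labels can be sketched without external dependencies. This mirrors what scikit-learn's `train_test_split(..., stratify=labels)` does; the function below is an illustrative stand-in, not the training script's actual code.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=42):
    """Split indices so each label keeps roughly the same proportion
    in train and test, preventing a rare class (e.g. 'unclear') from
    being under-represented in either split."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train, test = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

labels = ["national"] * 6 + ["regional"] * 3 + ["unclear"] * 1
train, test = stratified_split(labels, test_frac=0.2)
```

The 28,740 / 7,186 split above corresponds to holding out roughly 20% of the corpus.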
Training Hardware
- GPU: NVIDIA RTX 3060 (or similar)
- Mixed precision training: fp16 for memory efficiency
- Training Runtime: ~2.24 hours (8,073 seconds)
- Training Speed: 14.24 samples/second
Preprocessing
- Tokenizer: MuRIL tokenizer (google/muril-base-cased)
- Max length: 512 tokens
- Truncation: Head+Tail strategy
- Padding: To max length
- Class weighting applied for imbalanced classes
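The class weighting mentioned above can be sketched with inverse-frequency weights. The exact scheme used in training is not specified; the formula below is the common "balanced" weighting (as in scikit-learn's `compute_class_weight`), shown here as an assumption.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights n / (k * count_c): rare classes get
    weight > 1 so their loss terms are not drowned out by frequent
    ones. The count-weighted average over the dataset is 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

weights = class_weights(["national"] * 6 + ["regional"] * 3 + ["unclear"] * 1)
```

These weights are typically passed to the cross-entropy loss so misclassifying a rare class costs more.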
Evaluation Results
Evaluated on 7,186 test samples after 4 epochs of training.
Type Classification (3 classes: national, regional, unclear)
| Metric | Score |
|---|---|
| Accuracy | 87.14% |
| Macro F1 | 86.68% |
| Final Training Loss | 1.92 |
Performance Highlights:
- Strong performance on speech type classification
- Balanced macro-F1 indicates good performance across all three type classes
- Model effectively distinguishes between national, regional, and unclear speeches
Topic Classification (7 classes)
| Metric | Score |
|---|---|
| Accuracy | 76.82% |
| Macro F1 | 75.40% |
Performance Highlights:
- Good topic identification when type is clear
- Topic predictions only made for speeches where type ≠ 'unclear'
- Macro F1 of 75.4% shows reasonable balance across topic classes despite potential class imbalance
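For reference, the macro-F1 used as the selection metric averages per-class F1 with equal weight, so small classes count as much as large ones. A minimal pure-Python sketch:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class independently, then
    take the unweighted mean over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["national", "national", "regional", "unclear"]
y_pred = ["national", "regional", "regional", "unclear"]
score = macro_f1(y_true, y_pred)
```

This is why a high macro-F1 alongside a similar accuracy (as in the tables above) indicates balanced performance across classes rather than dominance by the majority class.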
Training Curve
- Total Training Steps: 7,184
- Final Training Loss: 1.92
- Training Samples per Second: 14.24
- Total Training Time: ~2.24 hours
Citation
If you use this model, please cite:
```bibtex
@misc{muril-multitask-speech,
  author = {Prabhanjana Ghuriki},
  title = {MuRIL Multitask Speech Classifier},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/GPrabhanjana/loksabhatypetopic}
}
```
Model Card Authors
- Prabhanjana Ghuriki
- Yashas Rajesh Shetty
- Daksh Vats
Model Card Contact
Additional Files
This repository includes:
- label_maps.json - Mapping of label indices to class names
- metrics.json - Full training and evaluation metrics
- special_tokens_map.json - Tokenizer special tokens
- tokenizer_config.json - Tokenizer configuration
- vocab.txt - MuRIL vocabulary
- Checkpoint directories with model weights
Acknowledgments
- Built on Google's MuRIL model: google/muril-base-cased
- Thanks to the Hugging Face Transformers library
License: MIT
Language(s): English, Hindi
Tags: multi-task-learning, muril, political-speeches, hindi, english, indian-languages, text-classification