MuRIL Loksabha Speech Classifier

A multi-task classifier for analyzing bilingual (English/Hindi) political speeches, built on Google's MuRIL (Multilingual Representations for Indian Languages) base model.

Model Description

This model performs two simultaneous classification tasks on Indian political speeches:

  1. Speech Type Classification (3 classes):

    • national - National-level speeches
    • regional - Regional-level speeches
    • unclear - Speeches whose national/regional scope cannot be determined
  2. Topic Classification (7 classes):

    • Only predicted when speech type ≠ unclear
    • Covers various political topics specific to the speech corpus

Key Features

  • Bilingual Support: Handles both English and Hindi text using MuRIL tokenizer
  • Smart Truncation: Head+Tail truncation (512 tokens) preserves both beginning and ending of long speeches
  • Balanced Training: Class weighting to handle imbalanced datasets
  • Optimized Performance: Mixed precision (fp16) training for efficient GPU usage
  • Robust Evaluation: Stratified splitting on speech type with macro-F1 selection
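
The Head+Tail truncation above can be sketched as follows. The 128-token head is an assumption (a common choice for this strategy), not a value confirmed by the training code, and special tokens ([CLS]/[SEP]) are omitted for simplicity:

```python
def head_tail_truncate(token_ids, max_len=512, head=128):
    """Keep the first `head` tokens and the last `max_len - head` tokens
    of an over-long sequence; shorter sequences pass through unchanged."""
    if len(token_ids) <= max_len:
        return token_ids
    tail = max_len - head
    return token_ids[:head] + token_ids[-tail:]

# A 1,000-token speech is reduced to 512 tokens, preserving
# both its opening and its closing remarks.
truncated = head_tail_truncate(list(range(1000)))
```

Plain head-only truncation would discard the closing remarks entirely, which often carry the speech's main demand or conclusion.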

Training Data

The model was trained on a corpus of Indian political speeches containing:

  • Columns: Serial number, page, date, English name, Hindi name, party during election, party during speech, speech text, language, type, topic
  • Languages: English and Hindi (code-mixed and monolingual)
  • Source: Multiple CSV files from political speech databases

Model Architecture

  • Base Model: google/muril-base-cased
  • Task Heads:
    • Type classifier: 3-class output (national, regional, unclear)
    • Topic classifier: 7-class output
  • Max Sequence Length: 512 tokens
  • Truncation Strategy: Head+Tail (preserves start and end context)
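
At inference time the topic head is gated on the type head, as described above. A minimal sketch of that decoding step, assuming the type label order shown and using hypothetical topic names (the real mapping ships in label_maps.json):

```python
TYPE_LABELS = ["national", "regional", "unclear"]  # assumed index order
TOPIC_LABELS = [f"topic_{i}" for i in range(7)]    # hypothetical names

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def decode(type_logits, topic_logits):
    """Gate the topic head on the type head: a topic is only
    reported when the predicted type is not 'unclear'."""
    speech_type = TYPE_LABELS[argmax(type_logits)]
    if speech_type == "unclear":
        return speech_type, None
    return speech_type, TOPIC_LABELS[argmax(topic_logits)]
```

For example, `decode([5.0, 0.1, 0.2], topic_logits)` returns a `("national", <topic>)` pair, while a type prediction of "unclear" returns `("unclear", None)` regardless of the topic logits.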

Intended Use

Primary Use Cases

  • Analyzing political speech content and themes
  • Categorizing speeches by scope (national vs. regional)
  • Topic extraction from Indian political discourse
  • Research on multilingual political communication

Out-of-Scope Use Cases

  • General-purpose text classification on non-political content
  • Languages other than English and Hindi
  • Real-time critical decision-making systems

Training Procedure

Training Hyperparameters

  • Optimizer: AdamW
  • Learning Rate: 2e-5
  • Batch Size: 4 (per device)
  • Gradient Accumulation Steps: 4
  • Effective Batch Size: 16
  • Epochs: 4
  • Total Training Steps: 7,184
  • Mixed Precision: fp16 enabled
  • Topic Loss Weight: 1.0
  • Evaluation Strategy: Best model selection on Type macro-F1
  • Data Split: Stratified train/test split on Type labels
    • Train samples: 28,740
    • Test samples: 7,186
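
The reported step count follows from the hyperparameters above, assuming the usual convention of flooring the number of optimizer updates per epoch over the accumulation steps:

```python
import math

train_samples = 28_740
per_device_batch = 4
grad_accum = 4
epochs = 4

batches_per_epoch = math.ceil(train_samples / per_device_batch)  # 7,185 batches
updates_per_epoch = batches_per_epoch // grad_accum              # 1,796 optimizer steps
total_steps = updates_per_epoch * epochs                         # 7,184 total steps
```

This reproduces the 7,184 total training steps listed above exactly.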

Training Hardware

  • GPU: NVIDIA RTX 3060 (or similar)
  • Mixed precision training: fp16 for memory efficiency
  • Training Runtime: ~2.24 hours (8,073 seconds)
  • Training Speed: 14.24 samples/second

Preprocessing

  • Tokenizer: MuRIL tokenizer (google/muril-base-cased)
  • Max length: 512 tokens
  • Truncation: Head+Tail strategy
  • Padding: To max length
  • Class weighting applied for imbalanced classes
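
The class weighting can be sketched as inverse-frequency weights over the label counts. The exact scheme used in training is not specified, so this shows one common choice (equivalent to scikit-learn's "balanced" heuristic), with illustrative counts:

```python
from collections import Counter

def inverse_freq_weights(labels):
    """Weight each class by total / (n_classes * count), so rare
    classes contribute proportionally more to the loss."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {cls: total / (n_classes * c) for cls, c in counts.items()}

# Illustrative 6:3:1 imbalance; 'unclear' (rarest) gets the largest weight.
weights = inverse_freq_weights(
    ["national"] * 6 + ["regional"] * 3 + ["unclear"] * 1)
```

These per-class weights would then scale the cross-entropy loss so the model is not dominated by the majority class.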

Evaluation Results

Evaluated on 7,186 test samples after 4 epochs of training.

Type Classification (3 classes: national, regional, unclear)

  • Accuracy: 87.14%
  • Macro F1: 86.68%
  • Final training loss: 1.92

Performance Highlights:

  • Strong performance on speech type classification
  • Balanced macro-F1 indicates good performance across all three type classes
  • Model effectively distinguishes between national, regional, and unclear speeches
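
Macro F1 averages the per-class F1 scores with equal weight, which is why it is used for model selection here: unlike accuracy, a majority class cannot mask poor performance on a rare one. A stdlib sketch of the metric:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes in y_true."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

In practice this matches `sklearn.metrics.f1_score(..., average="macro")` when every class appears in the reference labels.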

Topic Classification (7 classes)

  • Accuracy: 76.82%
  • Macro F1: 75.40%

Performance Highlights:

  • Good topic identification when type is clear
  • Topic predictions only made for speeches where type ≠ 'unclear'
  • Macro F1 of 75.4% shows reasonable balance across topic classes despite potential class imbalance

Training Curve

  • Total Training Steps: 7,184
  • Final Training Loss: 1.92
  • Training Samples per Second: 14.24
  • Total Training Time: ~2.24 hours

Citation

If you use this model, please cite:

@misc{muril-multitask-speech,
  author = {Prabhanjana Ghuriki},
  title = {MuRIL Multitask Speech Classifier},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/GPrabhanjana/loksabhatypetopic}
}

Model Card Authors

  • Prabhanjana Ghuriki
  • Yashas Rajesh Shetty
  • Daksh Vats

Model Card Contact

ghurikiprabhanjana@gmail.com


Additional Files

This repository includes:

  • label_maps.json - Mapping of label indices to class names
  • metrics.json - Full training and evaluation metrics
  • special_tokens_map.json - Tokenizer special tokens
  • tokenizer_config.json - Tokenizer configuration
  • vocab.txt - MuRIL vocabulary
  • Checkpoint directories with model weights

Acknowledgments

  • Built on Google's MuRIL model: google/muril-base-cased
  • Built with the Hugging Face Transformers library

License: MIT

Language(s): English, Hindi

Tags: multi-task-learning, muril, political-speeches, hindi, english, indian-languages, text-classification

Model Size: ~0.2B parameters (F32, safetensors format)