MuRIL Loksabha Speech Classifier

A multi-task classifier for analyzing bilingual (English/Hindi) political speeches, built on Google's MuRIL (Multilingual Representations for Indian Languages) base model.

Model Description

This model performs two simultaneous classification tasks on Indian political speeches:

  1. Speech Type Classification (3 classes):

    • national - National-level speeches
    • regional - Regional-level speeches
    • unclear - Speeches whose national/regional scope cannot be determined
  2. Topic Classification (7 classes):

    • Only predicted when speech type ≠ unclear
    • Covers various political topics specific to the speech corpus

Key Features

  • Bilingual Support: Handles both English and Hindi text using MuRIL tokenizer
  • Smart Truncation: Head+Tail truncation (512 tokens) preserves both beginning and ending of long speeches
  • Balanced Training: Class weighting to handle imbalanced datasets
  • Optimized Performance: Mixed precision (fp16) training for efficient GPU usage
  • Robust Evaluation: Stratified splitting on speech type with macro-F1 selection
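
The Head+Tail truncation above can be sketched as follows. The 128-token head is an assumption (a common choice for this strategy), not a value confirmed by the training code, and special tokens ([CLS]/[SEP]) are omitted for simplicity:

```python
def head_tail_truncate(token_ids, max_len=512, head=128):
    """Keep the first `head` tokens and the last `max_len - head` tokens
    of an over-long sequence; shorter sequences pass through unchanged."""
    if len(token_ids) <= max_len:
        return token_ids
    tail = max_len - head
    return token_ids[:head] + token_ids[-tail:]

# A 1,000-token speech is reduced to 512 tokens, preserving
# both its opening and its closing remarks.
truncated = head_tail_truncate(list(range(1000)))
```

Plain head-only truncation would discard the closing remarks entirely, which often carry the speech's main demand or conclusion.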

Training Data

The model was trained on a corpus of Indian political speeches containing:

  • Columns: Serial number, page, date, English name, Hindi name, party during election, party during speech, speech text, language, type, topic
  • Languages: English and Hindi (code-mixed and monolingual)
  • Source: Multiple CSV files from political speech databases

Model Architecture

  • Base Model: google/muril-base-cased
  • Task Heads:
    • Type classifier: 3-class output (national, regional, unclear)
    • Topic classifier: 7-class output
  • Max Sequence Length: 512 tokens
  • Truncation Strategy: Head+Tail (preserves start and end context)
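
At inference time the topic head is gated on the type head, as described above. A minimal sketch of that decoding step, assuming the type label order shown and using hypothetical topic names (the real mapping ships in label_maps.json):

```python
TYPE_LABELS = ["national", "regional", "unclear"]  # assumed index order
TOPIC_LABELS = [f"topic_{i}" for i in range(7)]    # hypothetical names

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def decode(type_logits, topic_logits):
    """Gate the topic head on the type head: a topic is only
    reported when the predicted type is not 'unclear'."""
    speech_type = TYPE_LABELS[argmax(type_logits)]
    if speech_type == "unclear":
        return speech_type, None
    return speech_type, TOPIC_LABELS[argmax(topic_logits)]
```

For example, `decode([5.0, 0.1, 0.2], topic_logits)` returns a `("national", <topic>)` pair, while a type prediction of "unclear" returns `("unclear", None)` regardless of the topic logits.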

Intended Use

Primary Use Cases

  • Analyzing political speech content and themes
  • Categorizing speeches by scope (national vs. regional)
  • Topic extraction from Indian political discourse
  • Research on multilingual political communication

Out-of-Scope Use Cases

  • General-purpose text classification on non-political content
  • Languages other than English and Hindi
  • Real-time critical decision-making systems

Training Procedure

Training Hyperparameters

  • Optimizer: AdamW
  • Learning Rate: 2e-5
  • Batch Size: 4 (per device)
  • Gradient Accumulation Steps: 4
  • Effective Batch Size: 16
  • Epochs: 4
  • Total Training Steps: 7,184
  • Mixed Precision: fp16 enabled
  • Topic Loss Weight: 1.0
  • Evaluation Strategy: Best model selection on Type macro-F1
  • Data Split: Stratified train/test split on Type labels
    • Train samples: 28,740
    • Test samples: 7,186
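
The reported step count follows from the hyperparameters above, assuming the usual convention of flooring the number of optimizer updates per epoch over the accumulation steps:

```python
import math

train_samples = 28_740
per_device_batch = 4
grad_accum = 4
epochs = 4

batches_per_epoch = math.ceil(train_samples / per_device_batch)  # 7,185 batches
updates_per_epoch = batches_per_epoch // grad_accum              # 1,796 optimizer steps
total_steps = updates_per_epoch * epochs                         # 7,184 total steps
```

This reproduces the 7,184 total training steps listed above exactly.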

Training Hardware

  • GPU: NVIDIA RTX 3060 (or similar)
  • Mixed precision training: fp16 for memory efficiency
  • Training Runtime: ~2.24 hours (8,073 seconds)
  • Training Speed: 14.24 samples/second

Preprocessing

  • Tokenizer: MuRIL tokenizer (google/muril-base-cased)
  • Max length: 512 tokens
  • Truncation: Head+Tail strategy
  • Padding: To max length
  • Class weighting applied for imbalanced classes
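
The class weighting can be sketched as inverse-frequency weights over the label counts. The exact scheme used in training is not specified, so this shows one common choice (equivalent to scikit-learn's "balanced" heuristic), with illustrative counts:

```python
from collections import Counter

def inverse_freq_weights(labels):
    """Weight each class by total / (n_classes * count), so rare
    classes contribute proportionally more to the loss."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {cls: total / (n_classes * c) for cls, c in counts.items()}

# Illustrative 6:3:1 imbalance; 'unclear' (rarest) gets the largest weight.
weights = inverse_freq_weights(
    ["national"] * 6 + ["regional"] * 3 + ["unclear"] * 1)
```

These per-class weights would then scale the cross-entropy loss so the model is not dominated by the majority class.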

Evaluation Results

Evaluated on 7,186 test samples after 4 epochs of training.

Type Classification (3 classes: national, regional, unclear)

  • Accuracy: 87.14%
  • Macro F1: 86.68%
  • Final training loss: 1.92

Performance Highlights:

  • Strong performance on speech type classification
  • Balanced macro-F1 indicates good performance across all three type classes
  • Model effectively distinguishes between national, regional, and unclear speeches
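
Macro F1 averages the per-class F1 scores with equal weight, which is why it is used for model selection here: unlike accuracy, a majority class cannot mask poor performance on a rare one. A stdlib sketch of the metric:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes in y_true."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

In practice this matches `sklearn.metrics.f1_score(..., average="macro")` when every class appears in the reference labels.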

Topic Classification (7 classes)

  • Accuracy: 76.82%
  • Macro F1: 75.40%

Performance Highlights:

  • Good topic identification when type is clear
  • Topic predictions only made for speeches where type ≠ 'unclear'
  • Macro F1 of 75.4% shows reasonable balance across topic classes despite potential class imbalance

Training Curve

  • Total Training Steps: 7,184
  • Final Training Loss: 1.92
  • Training Samples per Second: 14.24
  • Total Training Time: ~2.24 hours

Citation

If you use this model, please cite:

@misc{muril-multitask-speech,
  author = {Prabhanjana Ghuriki},
  title = {MuRIL Multitask Speech Classifier},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/GPrabhanjana/loksabhatypetopic}
}

Model Card Authors

  • Prabhanjana Ghuriki
  • Yashas Rajesh Shetty
  • Daksh Vats

Model Card Contact

ghurikiprabhanjana@gmail.com


Additional Files

This repository includes:

  • label_maps.json - Mapping of label indices to class names
  • metrics.json - Full training and evaluation metrics
  • special_tokens_map.json - Tokenizer special tokens
  • tokenizer_config.json - Tokenizer configuration
  • vocab.txt - MuRIL vocabulary
  • Checkpoint directories with model weights

Acknowledgments

  • Built on Google's MuRIL model: google/muril-base-cased
  • Built with the Hugging Face Transformers library

License: MIT

Language(s): English, Hindi

Tags: multi-task-learning, muril, political-speeches, hindi, english, indian-languages, text-classification

Model Size: ~0.2B parameters (F32, safetensors format)