ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations
Paper: arXiv:2511.12249
This repository is the official implementation of the paper: ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations
Main architecture
Install transformers with pip (`pip install transformers`), or install transformers from source. A fast tokenizer for PhoBERT is still being merged into transformers, as mentioned in this pull request. If you would like to use the fast tokenizer, install transformers from the following branch:

```
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .
```
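A quick sanity check that the fast tokenizer is available (assuming the branch exposes `PhobertTokenizerFast`, the class imported in the usage example below):

```python
from transformers import PhobertTokenizerFast

# Loads the fast tokenizer for the PhoBERT backbone; this import fails
# if the fast-tokenizer branch above has not been installed.
tokenizer = PhobertTokenizerFast.from_pretrained("vinai/phobert-base")
print(tokenizer("Tôi đang khoan ."))
```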
Install the remaining requirements: `pip3 install -r requirements.txt`
| Model | #params | Arch. | Max length | Backbone | Training data |
|---|---|---|---|---|---|
| `tkhangg0910/viconbert-base` | 135M | base | 256 | PhoBERT-base | ViConWSD |
| `tkhangg0910/viconbert-large` | 370M | large | 256 | PhoBERT-large | ViConWSD |
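Both checkpoints are loaded the same way; for example, the large variant (same `from_pretrained` arguments as in the usage example below):

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code is required because the model class is defined in
# the repository rather than in transformers itself.
model = AutoModel.from_pretrained(
    "tkhangg0910/viconbert-large",
    trust_remote_code=True,
    ignore_mismatched_sizes=True,
)
tokenizer = AutoTokenizer.from_pretrained("tkhangg0910/viconbert-large", use_fast=True)
```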
`SpanExtractor` and `text_normalize` are implemented in this repository (`utils/span_extractor.py` and `utils/process_data.py`):
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

from utils.span_extractor import SpanExtractor
from utils.process_data import text_normalize

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    "tkhangg0910/viconbert-base",
    trust_remote_code=True,
    ignore_mismatched_sizes=True,
).to(device)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("tkhangg0910/viconbert-base", use_fast=True)
span_ex = SpanExtractor(tokenizer)

def pipeline(query, target):
    # Normalize and tokenize the query sentence
    query_norm = text_normalize(query)
    tokenized_query = tokenizer(query_norm, return_tensors="pt").to(device)
    # Locate the subword span of the target word inside the query
    span_idx = span_ex.get_span_indices(query_norm, target)
    span = torch.tensor(span_idx).unsqueeze(0).to(device)
    with torch.no_grad():
        query_vec = model(tokenized_query, span)
    return query_vec

# Example: homonyms of "khoan" ("to drill" / "a drill")
query_1 = "Tôi đang khoan."  # "I am drilling."
target_1 = "Khoan"
query_vec_1 = pipeline(query_1, target_1)

query_2 = "khoan này bị mất mũi khoan."  # "This drill has lost its drill bit."
target_2 = "khoan"
query_vec_2 = pipeline(query_2, target_2)

query_3 = "Khoan là việc rất tiện lợi."  # "Drilling is very convenient."
target_3 = "Khoan"
query_vec_3 = pipeline(query_3, target_3)

def cosine_similarity(vec1, vec2):
    return F.cosine_similarity(vec1, vec2, dim=1).item()

sim_1 = cosine_similarity(query_vec_1, query_vec_3)
sim_2 = cosine_similarity(query_vec_2, query_vec_3)
print(f"Similarity between 1: {target_1} and 3: {target_3}: {sim_1:.4f}")
print(f"Similarity between 2: {target_2} and 3: {target_3}: {sim_2:.4f}")
```
"Khoan"
"chạy"
Zero-shot
Contextual separation of "Khoan", "chạy", and zero-shot ability for unseen words
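The same `pipeline` can probe other polysemous words, such as "chạy" ("to run" physically vs. software running) from the figure above. The sentences below are illustrative examples, not from the paper:

```python
# "chạy" as physical running vs. software running
run_1 = pipeline("Tôi chạy bộ mỗi sáng.", "chạy")             # "I go jogging every morning."
run_2 = pipeline("Phần mềm đang chạy trên máy chủ.", "chạy")  # "The software is running on the server."
run_3 = pipeline("Anh ấy chạy rất nhanh.", "chạy")            # "He runs very fast."

print(f"jogging vs. fast running : {cosine_similarity(run_1, run_3):.4f}")
print(f"software vs. fast running: {cosine_similarity(run_2, run_3):.4f}")
```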
If you find ViConBERT useful for your research and applications, please cite using this BibTeX:
PhoBERT: ViConBERT uses PhoBERT as its backbone model (base model: vinai/phobert-base).
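For reference, the backbone can be loaded on its own with the standard transformers API:

```python
from transformers import AutoModel, AutoTokenizer

# PhoBERT expects word-segmented Vietnamese input
phobert = AutoModel.from_pretrained("vinai/phobert-base")
phobert_tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
```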