URL: https://huggingface.co/dhintech/marian-ted2020-id-en-lg

Enhanced MarianMT Indonesian-English Translation (Meeting Domain Adaptation)

This model is an enhanced fine-tuned version of Helsinki-NLP/opus-mt-id-en with domain-specific adaptation for meeting and business contexts.

🎯 Model Highlights

  • Domain Adaptation: Specialized for meeting and business translation
  • Enhanced Dataset: TED2020 + 2000+ meeting-specific sentence pairs
  • Improved Performance: Better BLEU scores on meeting contexts
  • Robust Training: 80% dataset usage with domain mixing
  • Production Ready: Optimized for real-world meeting scenarios

📊 Performance Metrics

| Metric | Base Model | This Model | Improvement |
| --- | --- | --- | --- |
| BLEU Score | 1.467 | 3.736 | +154.6% |
| Translation Speed | 1.2 s | 0.14 s | -88.2% |
| Meeting Context | Standard | Enhanced | Domain Adapted |
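
The Improvement column is a relative change against the base model. A minimal sketch of that arithmetic, using the rounded values from the table above. The helper name `pct_change` is introduced here for illustration and is not part of the model card. Because the inputs are rounded, the results agree with the listed percentages only to within about 0.2 percentage points.

```python
def pct_change(base: float, new: float) -> float:
    """Relative change from `base` to `new`, expressed in percent."""
    return (new - base) / base * 100.0


# Values copied from the Performance Metrics table (already rounded).
bleu_gain = pct_change(1.467, 3.736)    # BLEU: base model vs. this model
latency_change = pct_change(1.2, 0.14)  # seconds per translation

print(f"BLEU: {bleu_gain:+.1f}%")
print(f"Translation time: {latency_change:+.1f}%")
```

A negative value for translation time means the fine-tuned model was faster in that measurement.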

🚀 Model Details

  • Base Model: Helsinki-NLP/opus-mt-id-en
  • Training Dataset: TED2020 (80%) + Meeting Domain (10%)
  • Training Strategy: Domain adaptation with enhanced learning
  • Specialization: Business meetings, technical discussions, formal conversations
  • Training Date: 2025-05-27
  • Languages: Indonesian (id) → English (en)
  • License: Apache 2.0

๐Ÿ› ๏ธ Usage

from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "dhintech/marian-ted2020-id-en-lg"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate Indonesian to English
def translate(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=3,
        early_stopping=True,
        do_sample=False
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
indonesian_text = "Tim marketing akan bertanggung jawab untuk strategi ini."
english_translation = translate(indonesian_text)
print(english_translation)
# Output: "The marketing team will be responsible for this strategy."
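
For meeting notes with many sentences, translating in batches is usually more efficient than calling the model once per sentence. The following is a sketch under assumptions, not part of the original card: the function name `translate_batch` and the default `batch_size=8` are illustrative choices, and the generation settings simply mirror the usage code above. It expects the `tokenizer` and `model` objects loaded there and relies on the standard `batch_decode` method of Hugging Face tokenizers.

```python
from typing import List, Sequence


def translate_batch(
    texts: Sequence[str],
    tokenizer,
    model,
    batch_size: int = 8,   # illustrative default, not from the model card
    max_length: int = 128, # matches the card's max input/output length
) -> List[str]:
    """Translate a list of Indonesian sentences in fixed-size batches."""
    translations: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # Pad to the longest sentence in the batch; truncate at max_length tokens.
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        )
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_beams=3,
            early_stopping=True,
            do_sample=False,
        )
        # Decode every sequence in the batch, preserving input order.
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

A call such as `translate_batch(sentences, tokenizer, model)` returns one English string per input sentence, in the same order. Inputs longer than 128 tokens are truncated, so long paragraphs should be split into sentences first.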

๐Ÿ“ Example Translations

Meeting Context Examples

| Indonesian | English | Context |
| --- | --- | --- |
| Selamat pagi semuanya, mari kita mulai rapat hari ini. | Good morning everyone, let's start today's meeting. | Meeting Opening |
| Tim marketing akan bertanggung jawab untuk strategi ini. | The marketing team will be responsible for this strategy. | Task Assignment |
| Database migration sudah selesai dan berjalan dengan lancar. | Database migration is complete and running smoothly. | Technical Update |
| Budget yang disetujui adalah 500 juta rupiah. | The approved budget is 500 million rupiah. | Financial Discussion |

🎯 Intended Use Cases

  • Business Meeting Translation: Real-time translation during meetings
  • Technical Documentation: Translating technical meeting notes
  • Corporate Communication: Formal business correspondence
  • Project Management: Translating project updates and reports
  • Training Materials: Educational and training content translation

📊 Training Configuration

  • Dataset Size: 118,626 sentence pairs
  • TED2020 Data: 80% of cleaned dataset
  • Meeting Domain Data: 10% specialized meeting content
  • Max Sequence Length: 128 tokens
  • Training Epochs: 12
  • Learning Rate: 1e-05
  • Batch Size: 12 (effective)
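
To make the scale of this configuration concrete, the sketch below derives the number of optimizer steps it implies. Several things here are assumptions rather than facts from the card: the split of the effective batch size into a per-device batch of 4 with 3 gradient-accumulation steps is hypothetical (the card only states the effective value of 12), and the calculation assumes a single device with all 118,626 pairs used for training and no held-out split.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Values stated in the model card.
    dataset_size: int = 118_626   # sentence pairs
    max_seq_length: int = 128     # tokens
    epochs: int = 12
    learning_rate: float = 1e-5
    # ASSUMPTION: the card gives only the effective batch size (12);
    # this 4 x 3 split is one hypothetical way to reach it.
    per_device_batch_size: int = 4
    grad_accum_steps: int = 3

    @property
    def effective_batch_size(self) -> int:
        return self.per_device_batch_size * self.grad_accum_steps

    @property
    def steps_per_epoch(self) -> int:
        # ASSUMPTION: every pair is a training example (no validation split).
        return math.ceil(self.dataset_size / self.effective_batch_size)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs


cfg = TrainingConfig()
print("effective batch size:", cfg.effective_batch_size)
print("optimizer steps per epoch:", cfg.steps_per_epoch)
print("total optimizer steps:", cfg.total_steps)
```

If part of the data was held out for evaluation, the real step counts would be correspondingly lower.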

🔧 Technical Specifications

  • Model Architecture: MarianMT (Transformer-based)
  • Parameters: ~74M (with selective fine-tuning)
  • Max Input/Output Length: 128 tokens
  • Inference Time: ~0.14s per sentence
  • Memory Requirements:
    • GPU: 3GB VRAM minimum
    • CPU: 4GB RAM minimum
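
As a rough sanity check on the memory figures, the weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch: the helper `weight_memory_mb` is illustrative, the ~74M count is taken from the list above, 4 bytes per parameter corresponds to 32-bit floats, and the half-precision line is a hypothetical comparison rather than a configuration the card describes.

```python
def weight_memory_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate size of the model weights in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


APPROX_PARAMS = 74_000_000  # "~74M" from the Technical Specifications list

print(f"float32 weights: ~{weight_memory_mb(APPROX_PARAMS):.0f} MB")
# Hypothetical: the same weights stored in 16-bit precision.
print(f"float16 weights: ~{weight_memory_mb(APPROX_PARAMS, bytes_per_param=2):.0f} MB")
```

The weights account for only a fraction of the stated minimums. The remainder presumably covers activations, beam-search state, and framework overhead, although the card does not break this down.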

🚨 Limitations

  • Domain Specificity: Optimized for formal business/meeting contexts
  • Informal Language: May not perform optimally on very casual Indonesian
  • Regional Dialects: Trained primarily on standard Indonesian
  • Cultural Context: Some cultural nuances may be lost in translation

📚 Citation

@misc{enhanced-marian-id-en-2025,
 title={Enhanced MarianMT Indonesian-English Translation (Meeting Domain Adaptation)},
 author={DhinTech},
 year={2025},
 publisher={Hugging Face},
 journal={Hugging Face Model Hub},
 howpublished={\url{https://huggingface.co/dhintech/marian-id-en-enhanced}},
 note={Enhanced with TED2020 and meeting-specific domain adaptation}
}

๐Ÿ™ Acknowledgments

  • Base Model: Helsinki-NLP team for the original opus-mt-id-en model
  • Dataset: TED2020 corpus and custom meeting domain data
  • Framework: Hugging Face Transformers team

This model is specifically enhanced for Indonesian business meeting translation scenarios with domain adaptation techniques.

Model size (Safetensors): 72.2M params, tensor type F32
