`rinna/nekomata-7b`

Overview

We conduct continual pre-training of qwen-7b on 30B tokens from a mixture of Japanese and English datasets. The continual pre-training significantly improves the model's performance on Japanese tasks. The model also inherits the following features from the original Qwen model:

- The inclusive Qwen vocabulary (vocab size > 150k) enables the model to process Japanese text much more efficiently than the previously released youri series.
- The model supports a maximum sequence length of 32768.

The name nekomata comes from the Japanese word 猫又/ねこまた/Nekomata, which is a kind of Japanese mythical creature (妖怪/ようかい/Youkai).

Library

The model was trained using code based on EleutherAI/gpt-neox.

Model architecture

A 32-layer, 4096-hidden-size transformer-based language model. Please refer to the Qwen paper for architecture details.

Continual pre-training

The model was initialized with the qwen-7b model and continually trained on around 30B tokens from a mixture of the following corpora:
- Japanese CC-100
- Japanese C4
- Japanese OSCAR
- The Pile
- Wikipedia
- rinna curated Japanese dataset

Contributors

Release date

December 21, 2023

Benchmarking

Please refer to rinna's LM benchmark page (Sheet 20231221).

How to use the model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rinna/nekomata-7b", trust_remote_code=True)

# Use GPU with bf16
# model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-7b", device_map="auto", trust_remote_code=True, bf16=True)

# Use GPU with fp16
# model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-7b", device_map="auto", trust_remote_code=True, fp16=True)

# Use CPU
# model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-7b", device_map="cpu", trust_remote_code=True)

# Automatically select device and precision
model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-7b", device_map="auto", trust_remote_code=True)

text = "西田幾多郎は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=200,
        min_new_tokens=200,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
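Because the model supports a maximum sequence length of 32768, it can help to check that a prompt leaves room for the requested number of new tokens before calling `generate`. The following is a minimal sketch, assuming the limit applies to prompt tokens and generated tokens combined. The helper names `fits_context` and `max_prompt_tokens` are hypothetical and are not part of the model's or the library's API.

```python
MAX_SEQUENCE_LENGTH = 32768  # maximum sequence length stated in the overview


def fits_context(num_prompt_tokens: int, max_new_tokens: int,
                 max_sequence_length: int = MAX_SEQUENCE_LENGTH) -> bool:
    """Return True if prompt plus generated tokens stay within the limit.

    Assumption: the limit counts prompt and generated tokens together.
    """
    if num_prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return num_prompt_tokens + max_new_tokens <= max_sequence_length


def max_prompt_tokens(max_new_tokens: int,
                      max_sequence_length: int = MAX_SEQUENCE_LENGTH) -> int:
    """Largest prompt length that still leaves room for max_new_tokens."""
    return max(0, max_sequence_length - max_new_tokens)
```

With the example above, the check could be written as `fits_context(token_ids.shape[-1], 200)`, where `token_ids` is the tensor returned by `tokenizer.encode`.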

Tokenization

The model uses the original Qwen tokenizer. It augments the cl100k tiktoken tokenizer and has a vocabulary size of 151,936. The inclusive vocabulary helps the model achieve better tokenization efficiency, especially for Japanese text.

We compared the Qwen tokenizer (as used in nekomata) and the llama-2 tokenizer (as used in youri) on different text collections and found that the Qwen tokenizer achieves a much better byte2token rate (i.e., the average number of tokens produced from 1 byte of text), as shown in the table below. A lower byte2token rate indicates better tokenization efficiency.

Tokenizer	Japanese	English	Multilingual
Qwen	0.24	0.27	0.27
llama-2	0.40	0.29	0.36
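
The byte2token rate described above can be sketched as follows. This is an illustrative computation, not the script used to produce the table: it assumes the rate is pooled over a collection (total tokens divided by total UTF-8 bytes), and `byte2token_rate` and `toy_char_encode` are hypothetical names. The toy encoder emits one token per character so that the example runs without downloading a tokenizer.

```python
from typing import Callable, Iterable, List


def byte2token_rate(texts: Iterable[str],
                    encode: Callable[[str], List[int]]) -> float:
    """Average number of tokens produced per UTF-8 byte of text.

    Assumption: tokens and bytes are summed over the whole collection
    before dividing, rather than averaged per document.
    """
    total_tokens = 0
    total_bytes = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_bytes += len(text.encode("utf-8"))
    if total_bytes == 0:
        raise ValueError("cannot compute a rate over empty input")
    return total_tokens / total_bytes


def toy_char_encode(text: str) -> List[int]:
    """Stand-in tokenizer (one token per character), for illustration only."""
    return [ord(ch) for ch in text]


# Each character in this sample occupies 3 bytes in UTF-8, so the toy
# encoder produces 1 token for every 3 bytes.
sample = ["西田幾多郎は、"]
rate = byte2token_rate(sample, toy_char_encode)
print(f"{rate:.2f}")
```

To measure a real tokenizer, pass something like `lambda t: tokenizer.encode(t, add_special_tokens=False)` as `encode`, using the `tokenizer` loaded in the usage section. Read this way, the Japanese column of the table (0.24 versus 0.40) corresponds to roughly 40% fewer tokens for the same number of bytes.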

How to cite

@misc{rinna-nekomata-7b,
 title = {rinna/nekomata-7b},
 author = {Zhao, Tianyu and Kaga, Akio and Sawada, Kei},
 url = {https://huggingface.co/rinna/nekomata-7b}
}

@inproceedings{sawada2024release,
 title = {Release of Pre-Trained Models for the {J}apanese Language},
 author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
 booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
 month = {5},
 year = {2024},
 pages = {13898--13905},
 url = {https://aclanthology.org/2024.lrec-main.1213},
 note = {\url{https://arxiv.org/abs/2404.01657}}
}

References

@software{gpt-neox-library,
 title = {{GPT}-{N}eo{X}: Large Scale Autoregressive Language Modeling in {P}y{T}orch},
 author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
 doi = {10.5281/zenodo.5879544},
 month = {8},
 year = {2021},
 version = {0.0.1},
 url = {https://www.github.com/eleutherai/gpt-neox}
}