japanese-gpt-neox-3.6b-instruction-sft-v2
Overview
This repository provides a Japanese GPT-NeoX model with 3.6 billion parameters. The model is based on rinna/japanese-gpt-neox-3.6b and has been finetuned to serve as an instruction-following conversational agent.
This model differs slightly from the previous SFT model, rinna/japanese-gpt-neox-3.6b-instruction-sft, in that a different data split was used for training.
Model architecture
A 36-layer, 2816-hidden-size transformer-based language model.
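As a rough sanity check on the 3.6 billion figure, the parameter count can be estimated from the layer count and hidden size. The sketch below assumes a standard GPT-NeoX layout (a 4x feed-forward expansion and separate input and output embedding matrices) and uses the tokenizer's vocabulary size of 32,000. Biases and layer norms are ignored, so the result is an approximation, not the exact count.

```python
# Back-of-the-envelope parameter estimate (an approximation, not the exact count).
num_layers = 36
hidden_size = 2816
vocab_size = 32_000  # tokenizer vocabulary size

# Assumption: each layer holds about 4*h^2 attention weights (Q, K, V, output)
# plus about 8*h^2 feed-forward weights (h -> 4h -> h); biases and layer norms ignored.
params_per_layer = 12 * hidden_size ** 2

# Assumption: input and output embeddings are separate (untied) matrices.
embedding_params = 2 * vocab_size * hidden_size

total_params = num_layers * params_per_layer + embedding_params
print(f"{total_params / 1e9:.2f}B parameters")  # prints "3.61B parameters"
```

Under these assumptions the estimate lands close to the advertised 3.6 billion, with the transformer layers accounting for most of the total.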
SFT vs. previous SFT evaluation
We conducted a ChatGPT-based automated evaluation on 100 prompts to compare the performance of this SFT model with that of the previous SFT model.
| this SFT vs. previous SFT | win | tie | loss |
| --- | --- | --- | --- |
| ChatGPT auto. evaluation | 55% | 0% | 45% |
Finetuning
The finetuning data is a subset of the following datasets and has been translated into Japanese.
The data will not be released.
Model Series
Contributors
Release date
March 31, 2023
I/O Format
A special format has been adopted to construct inputs.
- An input prompt is formatted as a conversation between "ユーザー" and "システム".
- Each input utterance consists of (1) its speaker ("ユーザー" or "システム"), (2) a colon (":"), (3) a whitespace (" "), and (4) utterance text (e.g. "世界で一番高い山は？").
- The input prompt should end with "システム: " to signal the model to generate a response.
- Since the model's tokenizer does not recognize "\n", a special newline symbol "<NL>" is used instead.
- All the newlines in input and output utterances should be replaced with "<NL>".
- All the utterances in the input prompt should be separated by "<NL>".
The following example constructs an input from a conversation.
prompt = [
    {
        "speaker": "ユーザー",
        "text": "コンタクトレンズを慣れるにはどうすればよいですか？"
    },
    {
        "speaker": "システム",
        "text": "これについて具体的に説明していただけますか？何が難しいのでしょうか？"
    },
    {
        "speaker": "ユーザー",
        "text": "目が痛いのです。"
    },
    {
        "speaker": "システム",
        "text": "分かりました、コンタクトレンズをつけると目がかゆくなるということですね。思った以上にレンズを外す必要があるでしょうか？"
    },
    {
        "speaker": "ユーザー",
        "text": "いえ、レンズは外しませんが、目が赤くなるんです。"
    }
]
prompt = [
    f"{uttr['speaker']}: {uttr['text']}"
    for uttr in prompt
]
prompt = "<NL>".join(prompt)
prompt = (
    prompt
    + "<NL>"
    + "システム: "
)
print(prompt)
# "ユーザー: コンタクトレンズを慣れるにはどうすればよいですか？<NL>システム: これについて具体的に説明していただけますか？何が難しいのでしょうか？<NL>ユーザー: 目が痛いのです。<NL>システム: 分かりました、コンタクトレンズをつけると目がかゆくなるということですね。思った以上にレンズを外す必要があるでしょうか？<NL>ユーザー: いえ、レンズは外しませんが、目が赤くなるんです。<NL>システム: "
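The construction steps above can be wrapped in a small helper. This is a minimal sketch: `build_prompt`, `NEWLINE_SYMBOL`, and `SYSTEM_SPEAKER` are hypothetical names, not part of the released code. Unlike the example above, it also replaces newlines inside each utterance with "<NL>", as the format rules require.

```python
from typing import Dict, List

NEWLINE_SYMBOL = "<NL>"      # stands in for "\n", which the tokenizer does not recognize
SYSTEM_SPEAKER = "システム"  # the speaker whose turn the model should generate


def build_prompt(conversation: List[Dict[str, str]]) -> str:
    """Format a list of {"speaker", "text"} turns into a model input string."""
    utterances = []
    for turn in conversation:
        # Newlines inside an utterance must become "<NL>".
        text = turn["text"].replace("\n", NEWLINE_SYMBOL)
        utterances.append(f"{turn['speaker']}: {text}")
    # Join utterances with "<NL>" and end with "システム: " so the model answers next.
    return NEWLINE_SYMBOL.join(utterances) + NEWLINE_SYMBOL + SYSTEM_SPEAKER + ": "


example = build_prompt([{"speaker": "ユーザー", "text": "世界で一番高い山は？"}])
print(example)
```

The returned string can be passed to the tokenizer in the same way as `prompt` in the usage example.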
How to use the model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# use_fast=False is needed for the tokenizer features described under Tokenization.
tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft-v2", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft-v2")

if torch.cuda.is_available():
    model = model.to("cuda")

# `prompt` is the input string constructed in the I/O Format section.
token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        do_sample=True,
        max_new_tokens=128,
        temperature=0.7,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens, skipping the prompt tokens.
output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1):])
output = output.replace("<NL>", "\n")
print(output)
"""わかりました。まずは、コンタクトレンズを長時間着用することによる目の乾燥を防ぐことができます。また、毎日同じ時間帯にコンタクトレンズを着用してみることもできます。そして、コンタクトレンズが目に合わないような場合は、新しいものを試してみる必要があります。</s>"""
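The sample output above still ends with the end-of-sequence marker "</s>". The following is a minimal sketch of a cleanup step, assuming plain display text is wanted; `clean_response` is a hypothetical helper and not part of the released code.

```python
def clean_response(raw_output: str) -> str:
    """Turn decoded model output into display text."""
    # Keep only the text before the first end-of-sequence marker, if any.
    text = raw_output.split("</s>", 1)[0]
    # Restore real newlines from the model's "<NL>" symbol and trim the edges.
    return text.replace("<NL>", "\n").strip()


print(clean_response("はい。<NL>わかりました。</s>"))
```

Applying the helper to text whose "<NL>" symbols were already replaced is harmless, since the replacement then finds nothing to change.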
Tokenization
The model uses a sentencepiece-based tokenizer.
- The tokenizer has a vocabulary size of 32,000.
- It uses sentencepiece's byte fallback feature to decompose unknown text pieces into UTF-8 byte pieces and to avoid producing <UNK> tokens.
- sentencepiece's --add_dummy_prefix option was turned off so that a leading whitespace will not be prepended automatically.
  print(tokenizer.tokenize("吾輩は猫である"))
  # ['吾', '輩', 'は', '猫', 'である']
  # instead of ['▁', '吾', '輩', 'は', '猫', 'である'] as in rinna/japanese-gpt-1b
- sentencepiece's --remove_extra_whitespaces option was turned off so that leading, trailing, and duplicate whitespaces are preserved.
  print(tokenizer.tokenize("  吾輩は  猫である   "))
  # ['▁', '▁', '吾', '輩', 'は', '▁', '▁', '猫', 'である', '▁', '▁', '▁']
  # instead of ['▁', '吾', '輩', 'は', '▁猫', 'である'] as in rinna/japanese-gpt-1b
- Don't forget to set use_fast=False to make the above features function correctly.
  good_tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b", use_fast=False)
  bad_tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b")
  print(good_tokenizer.decode(good_tokenizer.encode("გამარჯობა  吾輩は  猫である   ")))
  # 'გამარჯობა  吾輩は  猫である   </s>'
  print(bad_tokenizer.decode(bad_tokenizer.encode("გამარჯობა  吾輩は  猫である   ")))
  # 'გამარ[UNK]ობა 吾輩は 猫である </s>'
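To make the byte fallback behaviour more concrete, the sketch below shows how a single character maps to UTF-8 byte pieces. It uses only the Python standard library and mimics sentencepiece's <0xNN> naming for byte pieces. It does not call the tokenizer, so the actual tokenization of a given string may differ. `to_byte_pieces` is a hypothetical helper name.

```python
from typing import List


def to_byte_pieces(text: str) -> List[str]:
    """Spell out text as sentencepiece-style byte pieces such as '<0xE1>'."""
    return [f"<0x{byte:02X}>" for byte in text.encode("utf-8")]


# The Georgian letter "ჯ" (U+10EF) occupies three bytes in UTF-8.
print(to_byte_pieces("ჯ"))  # ['<0xE1>', '<0x83>', '<0xAF>']
```

With byte fallback enabled, a character missing from the vocabulary can be represented by pieces of this kind instead of collapsing into a single unknown token.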
How to cite
@misc{rinna-japanese-gpt-neox-3.6b-instruction-sft-v2,
title = {rinna/japanese-gpt-neox-3.6b-instruction-sft-v2},
author = {Zhao, Tianyu and Sawada, Kei},
url = {https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2}
}
@inproceedings{sawada2024release,
title = {Release of Pre-Trained Models for the {J}apanese Language},
author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
month = {5},
year = {2024},
pages = {13898--13905},
url = {https://aclanthology.org/2024.lrec-main.1213},
note = {\url{https://arxiv.org/abs/2404.01657}}
}
License