Model files: https://huggingface.co/google/gemma-2-2b-it/tree/main
Method 1: local-gemma (recommended)
Project homepage: https://github.com/huggingface/local-gemma
Installation:
- CUDA: pip install local-gemma"[cuda]"
- MPS: pip install local-gemma"[mps]"
- CPU: pip install local-gemma"[cpu]"
Notes ⚠️:
- This command uses the default model settings (e.g. google/gemma-2-it), which means it downloads the model from huggingface.co in the background or uses whatever is in the default cache directory. To use local files instead, pass --model with an absolute path and also set --token to an arbitrary string.
- The command keeps accumulating everything entered so far as conversation history, which eventually exhausts GPU memory, so the program needs to be patched to keep only a fixed number of history entries (a generic sketch of that kind of trimming follows these notes).
- The --preset=speed option makes the model compile, which speeds up generation; the project homepage claims up to a 6x speedup. However, loading the model and compiling make startup noticeably slower. A generic sketch of the compile pattern this relies on appears after the launch command below.
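A minimal, hypothetical sketch of the history trimming mentioned above; this is not the actual local-gemma code, and the names (trim_history, MAX_TURNS) are illustrative only:
# Hypothetical sketch: cap the conversation history before each generation step
# so the prompt (and hence the KV cache) stays bounded. Names are illustrative,
# not taken from local_gemma/cli.py.
MAX_TURNS = 5  # keep only the last 5 user/assistant exchanges

def trim_history(history, max_turns=MAX_TURNS):
    # history is a list of {"role": "user"/"assistant", "content": ...} dicts;
    # one user + one assistant message per turn -> keep the last 2 * max_turns entries
    return history[-2 * max_turns:]

history = []
history.append({"role": "user", "content": "Hello"})
history.append({"role": "assistant", "content": "Hi! How can I help?"})
history = trim_history(history)  # apply before building the next prompt
Where exactly to hook this in depends on how cli.py stores the conversation; the point is just to bound the prompt length fed to generate().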
Code location: /usr/local/lib/python3.10/dist-packages/local_gemma/cli.py
local-gemma --preset=speed --model=/root/workspace/gemma-2-2b-it --token x
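For reference, the compilation that the speed preset triggers can also be done by hand with plain transformers. The sketch below assumes a recent transformers version, a CUDA GPU, and the same local model path; the static-cache detail is an assumption (Gemma 2 may use its own hybrid cache depending on the version), not something confirmed by the local-gemma docs:
# Minimal sketch of compiled generation with transformers (an assumption about
# what the speed preset roughly does; this is not local-gemma's code).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/root/workspace/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="cuda"
)

# torch.compile pays a one-time compilation cost; later calls run faster
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello, Gemma!", return_tensors="pt").to(model.device)
# a fixed-shape ("static") KV cache gives torch.compile stable shapes to specialize on
outputs = model.generate(**inputs, max_new_tokens=64, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The first generate() call is slow because of compilation; only the subsequent calls are the fast ones, which matches the startup cost noted above.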
Method 2: run it directly with Python
Sources:
- Code: https://github.com/nonoesp/live/blob/main/0113/google-gemma/run_gpu.py
- Video: https://www.youtube.com/watch?v=qFULISWcjQc
GPU code
# https://huggingface.co/google/gemma-2b-it#running-the-model-on-a-single--multi-gpu
from transformers import AutoTokenizer, AutoModelForCausalLM
model_file = "google/gemma-2b-it" # 如果在本地缓存找不到模型,就会去 huggingface 官网下载
model_file = '/root/workspace/gemma-2-2b-it' # 本地下载好了文件
tokenizer = AutoTokenizer.from_pretrained(model_file)
model = AutoModelForCausalLM.from_pretrained(model_file, device_map="auto")
input_text = "Things to eat in Napoli, Italy."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_length=512)
print(tokenizer.decode(outputs[0]))
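Since gemma-2-2b-it is an instruction-tuned checkpoint, it usually responds better when the prompt is wrapped in its chat format. A minimal variant of the GPU code above using tokenizer.apply_chat_template (the message content is just an example):
# Hedged variant: wrap the question in Gemma's chat format instead of raw text.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_file = '/root/workspace/gemma-2-2b-it'
tokenizer = AutoTokenizer.from_pretrained(model_file)
model = AutoModelForCausalLM.from_pretrained(model_file, device_map="auto")

messages = [{"role": "user", "content": "Things to eat in Napoli, Italy."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)  # apply_chat_template emits Gemma's <start_of_turn> markers
outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))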
CPU code
# https://huggingface.co/google/gemma-2b-it#running-the-model-on-a-cpu
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
input_text = "Places to visit in Malaga, Spain."
input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, max_length=512)
print(tokenizer.decode(outputs[0]))
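If CPU RAM is tight, the model can also be loaded in bfloat16 to roughly halve memory. This is a common option rather than something from the linked source, and it can be slower on CPUs without native bfloat16 support:
# Hedged variant of the CPU code: load the weights in bfloat16 to save RAM.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_file = "/root/workspace/gemma-2-2b-it"  # or "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_file)
model = AutoModelForCausalLM.from_pretrained(model_file, torch_dtype=torch.bfloat16)

input_ids = tokenizer("Places to visit in Malaga, Spain.", return_tensors="pt")
outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))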
Method 3: streaming output
Source: https://github.com/AutoGPTQ/AutoGPTQ/issues/448
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, GenerationConfig
import argparse
#parser = argparse.ArgumentParser(description="Test GPTQ model inference with streaming")
#parser.add_argument("model_name_or_path", type=str, help="Model location, local dir or HF repo")
#args = parser.parse_args()
#model_name_or_path = args.model_name_or_path
model_name_or_path = "/root/workspace/gemma-2-2b-it"
print(f"Loading model and tokenizer for: {model_name_or_path}")
tokenizer = AutoTokenizer.from_pretrained(
model_name_or_path,
use_fast=True,
trust_remote_code=True,
legacy=False
)
model = AutoModelForCausalLM.from_pretrained(
model_name_or_path,
trust_remote_code=True,
device_map="auto"
)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
prompt = "What are three possible advantages for teaching llamas to use AI?"
prompt_template=f'''<|user|>
{prompt}
<|assistant|>
'''
# Convert prompt to tokens
tokens = tokenizer(
prompt_template,
return_tensors='pt'
).input_ids.cuda()
print(f"\nPrompt: {prompt}")
print("Response: ", end="")
generation_params = {
#"eos_token_id": tokenizer.eos_token_id, # 这两行会导致警告
#"pad_token_id": tokenizer.eos_token_id,
"do_sample": True,
"temperature": 0.7,
"top_p": 0.95,
"top_k": 40,
"max_new_tokens": 512,
"repetition_penalty": 1.1
}
generation_config = GenerationConfig(**generation_params)
# Generate streamed output, visible one token at a time
generation_output = model.generate(
tokens,
streamer=streamer,
generation_config=generation_config
)
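TextStreamer prints directly to stdout; if the streamed text is needed inside Python (for example to push to a web UI), transformers also provides TextIteratorStreamer. A hedged sketch that reuses model, tokenizer, tokens, and generation_config from the script above:
# Hedged sketch: TextIteratorStreamer yields the generated text chunk by chunk
# while generate() runs in a background thread.
from threading import Thread
from transformers import TextIteratorStreamer

iter_streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
thread = Thread(
    target=model.generate,
    kwargs={
        "inputs": tokens,
        "streamer": iter_streamer,
        "generation_config": generation_config,
    },
)
thread.start()
for text_chunk in iter_streamer:  # blocks until the next chunk is ready
    print(text_chunk, end="", flush=True)
thread.join()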