vLLM

The vLLM frontend is still in an experimental state!

How to use SOL and vLLM?

To enable SOL, you first need to edit your model's config.json and prefix every architecture name with "SOL/", e.g. "LlamaForCausalLM" becomes "SOL/LlamaForCausalLM".

{
	"architectures": [
		"SOL/LlamaForCausalLM"
	],
	...
}
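
If you prefer to patch config.json programmatically, the following is a minimal sketch; the model path is hypothetical, and the standard Hugging Face config.json layout is assumed:

import json
from pathlib import Path

# Hypothetical path to a local model directory containing config.json.
config_path = Path("my_company/my_model/config.json")
config = json.loads(config_path.read_text())

# Prefix every architecture name with "SOL/" (skipping entries already prefixed).
config["architectures"] = [
	arch if arch.startswith("SOL/") else f"SOL/{arch}"
	for arch in config["architectures"]
]
config_path.write_text(json.dumps(config, indent=2))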

How to use vLLM Offline Inference?

Next, import sol.vllm before initializing your model.

# Adapted from: https://docs.vllm.ai/en/latest/getting_started/examples/offline_inference.html
import vllm
import sol.vllm # registers SOL/* models to vLLM 

# Sample prompts.
prompts = [
	"Hello, my name is",
	"The president of the United States is",
	"The capital of France is",
	"The future of AI is",
]
# Create a sampling params object.
sampling_params = vllm.SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = vllm.LLM(model="my_company/my_model", enforce_eager=True)

# Generate texts from the prompts and print the results.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
	prompt = output.prompt
	generated_text = output.outputs[0].text
	print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

How to use vLLM OpenAI Server?

To launch the API server with SOL, use python3 -m sol.vllm.entrypoints.openai.api_server --enforce-eager --model path_to_your_model ....
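
Once the server is running, it can be queried with any OpenAI-compatible client. Below is a minimal sketch using the openai Python package; the default host/port (http://localhost:8000) and the model name passed via --model are assumptions:

from openai import OpenAI

# Assumes the server listens on the default host/port and that "path_to_your_model"
# matches the value passed to --model when starting the server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
	model="path_to_your_model",
	prompt="The capital of France is",
	max_tokens=32,
	temperature=0.8,
)
print(completion.choices[0].text)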

Limitations

  1. Some models might not be supported by SOL yet. Please open an issue if you need support for a specific model.
  2. You need to use --enforce-eager (or enforce_eager=True in the Python API), as CUDA Graphs are not yet supported.