Optimized Inference Engines
Inference engines are designed to deliver optimal LLM inference for their respective use cases. Each engine has access to a carefully curated model selection and intelligently routes queries to the best-suited LLM for each prompt, maximizing response quality while optimizing for cost and latency.
Supported Engines
- Chat Preview
- Code Preview
Chat Engine
The chat engine is optimized for general-purpose chat interactions such as chatbots and support assistants. It intelligently routes each query to one of the models below:
Model selection:
- GPT-4-Turbo
- Claude 3 Sonnet
- Claude 3 Haiku
Code Engine
The code engine is optimized for coding-related use cases such as code generation, coding copilots, and code explanation. It intelligently routes each query to one of the models below:
Model selection:
- GPT-4-Turbo
- Claude 3 Sonnet
- Claude 3 Haiku
Usage
An engine is a collection of LLMs paired with a routing function that identifies the optimal model for each query; you can treat an engine as a kind of ‘meta LLM’. Engines are called through the OpenAI SDK by pointing the client at the Neutrino base URL and passing the engine name in place of a model name:
from openai import OpenAI
client = OpenAI(
base_url="https://router.neutrinoapp.com/api/engines",
api_key="<Neutrino-API-key>"
)
response = client.chat.completions.create(
# Instead of a specific model, set this to the Neutrino engine of choice
model="chat-preview", # options: "chat-preview", "code-preview"
    messages=[
{"role": "system", "content": "You are a helpful AI assistant. Your job is to be helpful and respond to user requests."},
{"role": "user", "content": "What is a Neutrino?"},
],
)
print(f"Optimal model: {response.model}")
print(response.choices[0].message.content)
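Since the engine, rather than the caller, picks the model, it can be convenient to return the routed model alongside the response text. Below is a minimal sketch of such a helper; the ask function and the NEUTRINO_API_KEY environment variable are illustrative conventions, not part of the Neutrino API:

import os
from openai import OpenAI

# Assumes the API key is exported as NEUTRINO_API_KEY (an illustrative convention)
client = OpenAI(
    base_url="https://router.neutrinoapp.com/api/engines",
    api_key=os.environ["NEUTRINO_API_KEY"],
)

def ask(engine: str, prompt: str) -> tuple[str, str]:
    """Send a single-turn prompt to a Neutrino engine and return (routed_model, text)."""
    response = client.chat.completions.create(
        model=engine,  # "chat-preview" or "code-preview"
        messages=[{"role": "user", "content": prompt}],
    )
    return response.model, response.choices[0].message.content

model, answer = ask("code-preview", "Write a Python function that reverses a string.")
print(f"Optimal model: {model}")
print(answer)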
Streaming Responses
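Engines also support streaming via the standard stream=True parameter: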
from openai import OpenAI
client = OpenAI(
base_url="https://router.neutrinoapp.com/api/engines",
api_key="<Neutrino-API-key>"
)
response = client.chat.completions.create(
# Instead of a specific model, set this to the Neutrino engine of choice
model="chat-preview", # options: "chat-preview", "code-preview"
    messages=[
{"role": "system", "content": "You are a helpful AI assistant. Your job is to be helpful and respond to user requests."},
{"role": "user", "content": "Does a Neutrino have mass?"},
],
stream=True
)
for i, chunk in enumerate(response):
    if i == 0:
        print(f"Optimal model: {chunk.model}")
    # The final chunk carries a finish reason but no content, so guard against None
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
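If your application is asynchronous, the same streaming call should work with the SDK's async client. The following is a sketch under the assumption that the router behaves like the standard OpenAI API when used with AsyncOpenAI, which the examples above do not explicitly confirm:

import asyncio
from openai import AsyncOpenAI

async def main() -> None:
    client = AsyncOpenAI(
        base_url="https://router.neutrinoapp.com/api/engines",
        api_key="<Neutrino-API-key>",
    )
    stream = await client.chat.completions.create(
        model="chat-preview",
        messages=[{"role": "user", "content": "Does a Neutrino have mass?"}],
        stream=True,
    )
    async for chunk in stream:
        # Skip chunks without content (e.g. the final chunk, which only carries a finish reason)
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

asyncio.run(main())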