While creating a customized agent, you will need to select an LLM (Large Language Model): the model that will carry out a given task, such as retrieving information from the web or a data source to generate an answer relevant to the user’s query, analyzing data, summarizing content, creating content, and much more.


| Model name | Provider | Context window (max tokens) |
| --- | --- | --- |
| GPT 4o (EU) | Azure OpenAI | 128k |
| GPT 4o (East US) | Azure OpenAI | 128k |
| GPT 4o (South US) | Azure OpenAI | 128k |
| GPT 4o mini (EU) | Azure OpenAI | 128k |
| GPT 4o mini (East US) | Azure OpenAI | 128k |
| GPT 4o mini (South US) | Azure OpenAI | 128k |
| Mistral Mini (EU) | Mistral | 128k |
| Mistral Large (EU) | Mistral | 128k |
| Mistral Small (EU) | Mistral | 128k |
| Azure - Mistral Mini (US) | Azure | 128k |
| Azure - Mistral Large (US) | Azure | 128k |
| Groq - Llama 3.2 11B | Groq | 128k |
| Groq - Llama 3.2 90B | Groq | 128k |
| Llama 3.1 70B (US) | AWS | 128k |
| Llama 3.1 8B (US) | AWS | 128k |
| Claude Haiku (EU) | AWS | 200k |
| Claude Sonnet (EU) | AWS | 200k |
| Claude Haiku (US) | AWS | 200k |
| Claude Sonnet (US) | AWS | 200k |
| Gemini Flash (US) | Google | 250k |
| Gemini Pro (US) | Google | 250k |

Selection criteria


Selecting an LLM for an agent depends on the following criteria:

  • the type of task performed by the agent. Some LLMs excel at specific tasks, such as image analysis or creative writing.

  • the context window of the LLM. This corresponds to the total number of input and output tokens that the LLM can process.

  • the price for the processed tokens. Some LLMs are more expensive than others.

  • the response time of the LLM. Some LLMs answer faster than others.

  • the stability of the LLM: whether, for the same question, the model produces answers that are very similar in both content and structure.

  • the verbosity/conciseness of the answers: Some LLMs provide more concise answers while others are more verbose.
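As a rough illustration, the criteria above can be combined into a simple programmatic filter for shortlisting candidate models. The model names come from the table above, but the context, cost, and speed values below are hypothetical placeholders, not measured figures.

```python
# Sketch: shortlisting models by selection criteria.
# The cost and speed values are illustrative placeholders, not real benchmarks.

models = [
    # name, context window (tokens), relative cost (1 = cheapest), fast?
    {"name": "GPT 4o",        "context": 128_000, "cost": 3, "fast": False},
    {"name": "GPT 4o mini",   "context": 128_000, "cost": 1, "fast": True},
    {"name": "Claude Sonnet", "context": 200_000, "cost": 3, "fast": False},
    {"name": "Claude Haiku",  "context": 200_000, "cost": 1, "fast": False},
    {"name": "Gemini Flash",  "context": 250_000, "cost": 1, "fast": True},
]

def shortlist(models, min_context=0, max_cost=3, must_be_fast=False):
    """Return the names of models matching the given criteria."""
    return [
        m["name"]
        for m in models
        if m["context"] >= min_context
        and m["cost"] <= max_cost
        and (m["fast"] or not must_be_fast)
    ]

# Example: cheap models with at least a 200k context window.
print(shortlist(models, min_context=200_000, max_cost=1))
# → ['Claude Haiku', 'Gemini Flash']
```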

Best performing models

Below are our insights on the best-performing models. Overall:

  • GPT 4o is the best-performing model.

  • GPT 4o mini offers good performance and is very fast and cheap.

  • Mistral Large performs slightly worse than GPT 4o but is slightly cheaper.

Type of task

Image analysis

If you intend for your agents to analyze images, we recommend choosing one of the following models, which tend to perform better for this task:

  • Claude Sonnet

  • Claude Haiku

  • GPT 4o

  • Gemini Flash

  • Gemini Pro

Creative writing

For creative writing tasks, we recommend choosing one of the smaller models, as they tend to perform better for these tasks:

  • GPT 4o mini

  • Claude Haiku

  • Gemini Flash

  • Llama 3.1 8B

  • Mistral Mini

  • Mistral Small

Knowledge-based tasks (RAG)

For tasks needing to retrieve information from a data source (RAG), we recommend using one of the larger models as they provide better answers and source their answers:

  • Claude Sonnet

  • GPT 4o

  • Gemini Pro

Context window

Some LLMs have larger context windows than others, which matters if you intend to process substantial documents.


Here is the list of models with larger context windows:

  • Claude Haiku - 200k tokens maximum

  • Claude Sonnet - 200k tokens maximum

  • Gemini Pro - 250k tokens maximum

  • Gemini Flash - 250k tokens maximum

Conversely, if you want to process smaller documents, you might want to choose an LLM with a smaller context window, such as:

  • GPT 4o mini - 128k tokens maximum

  • GPT 4o - 128k tokens maximum

  • Llama 3.1 70B - 128k tokens maximum

  • Llama 3.1 8B - 128k tokens maximum

  • Mistral Mini - 128k tokens maximum

  • Mistral Small - 128k tokens maximum


  • Mistral Large - 128k tokens maximum

Info

The context window size also matters if you want your agent to retain more information from earlier in the conversation.
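To make the context window criterion concrete, here is a minimal sketch that estimates whether a document fits in a given model's window. It uses the common rough heuristic of ~4 characters per token for English text; exact counts would require the tokenizer of the model in question.

```python
# Sketch: rough check of whether a document fits in a context window.
# Uses the ~4 characters per token rule of thumb for English text;
# real token counts require the model's own tokenizer.

CONTEXT_WINDOWS = {
    "GPT 4o": 128_000,
    "Claude Sonnet": 200_000,
    "Gemini Pro": 250_000,
}

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits(text: str, model: str, reserved_for_output: int = 4_000) -> bool:
    # The context window covers input AND output tokens,
    # so reserve room for the expected answer.
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOWS[model]

doc = "word " * 150_000          # ~750k characters, ~187k estimated tokens
print(fits(doc, "GPT 4o"))        # → False (too large for a 128k window)
print(fits(doc, "Claude Sonnet")) # → True
```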

Price

If price is one of your selection criteria, you can choose one of the models that are less expensive per request, such as:

  • Claude Haiku

  • GPT 4o mini

  • Gemini Flash

  • Llama 3.1 8B

  • Mistral Mini

  • Mistral Small
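As an illustration, per-request cost is typically billed per input and output token. The sketch below shows the arithmetic; the per-million-token prices used are made-up placeholders, not the providers' actual rates.

```python
# Sketch: estimating the cost of one request.
# Prices are hypothetical placeholders (USD per million tokens),
# NOT actual provider rates.

PRICES = {
    "GPT 4o mini": {"input": 0.15, "output": 0.60},
    "GPT 4o":      {"input": 2.50, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: tokens in each direction times the per-token rate."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = request_cost("GPT 4o mini", input_tokens=10_000, output_tokens=1_000)
print(f"${cost:.4f}")  # → $0.0021
```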

Response time

If you want your agents to respond quickly, we recommend you choose one of the models with the best response times:

  • Gemini Flash

  • GPT 4o mini

Note that the differences in response time are very small for the other models.

Info

Out of all the models, Mistral Large has the slowest response time.
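Response time is also easy to measure for your own workload: time a few identical requests per model and compare medians. A minimal sketch, where `ask_model` is a hypothetical stand-in for whatever client call your platform exposes:

```python
# Sketch: comparing model response times.
# `ask_model` is a placeholder for your platform's real client call.

import time
from statistics import median

def ask_model(model: str, prompt: str) -> str:
    # Placeholder: simulate a call; replace with the real client.
    time.sleep(0.01)
    return "answer"

def median_latency(model: str, prompt: str, runs: int = 5) -> float:
    """Median wall-clock time of `runs` identical requests, in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        ask_model(model, prompt)
        timings.append(time.perf_counter() - start)
    return median(timings)

print(f"{median_latency('GPT 4o mini', 'Hello') * 1000:.0f} ms")
```

Using the median rather than the mean keeps a single slow outlier request from skewing the comparison.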

Stability

The models that provide the most stable answers are:

  • Claude Haiku

  • Claude Sonnet

  • GPT 4o

  • GPT 4o mini

  • Gemini Pro
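Stability can also be checked empirically: ask the model the same question several times and compare the answers. A minimal sketch using word-level Jaccard similarity as a crude proxy (embedding-based similarity would be more robust):

```python
# Sketch: scoring answer stability as mean pairwise word-overlap similarity.
# Jaccard over word sets is a crude proxy for "similar in content and structure".

import string
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    strip = str.maketrans("", "", string.punctuation)
    wa = set(a.lower().translate(strip).split())
    wb = set(b.lower().translate(strip).split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def stability(answers: list[str]) -> float:
    """Mean pairwise similarity across repeated answers to one question."""
    pairs = list(combinations(answers, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

answers = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
]
print(round(stability(answers), 2))  # → 0.9
```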

Verbosity/Conciseness

If you want your agents to provide longer, more detailed answers, we recommend you choose one of the following models:

  • Claude Haiku

  • Claude Sonnet

  • GPT 4o mini

If you want your agents to provide more concise answers, you can choose one of the following models:

  • GPT 4o

  • Llama 3.1 70B

  • Gemini Flash

  • Mistral Mini

  • Mistral Small

  • Mistral Large

If you want the answers to be neither too verbose nor too concise, you can choose either Gemini Pro or Llama 3.1 8B.