Home BlogBlog Detail

Anyscale Endpoints: Embedding endpoint, Llama-2 70B fine-tuning and improved sign-up experience

By The Anyscale Team | November 30, 2023

Update June 2024: Anyscale Endpoints (Anyscale's LLM API Offering) and Private Endpoints (self-hosted LLMs) are now available as part of the Anyscale Platform. Click here to get started on the Anyscale platform.

LinkEmbedding Endpoints

Retrieval-augmented generation, or RAG applications are among the most popular applications built with LLMs. Embedding endpoints enables developers to use open-source embedding models. Today, we are starting with gte-large, and developers can access it at $0.05/MTokens. We plan to add more models in the future, and users can request newer embedding models by filling out this google form. For more info visit here.

Example usage:

1import openai
2client = openai.OpenAI(
3    base_url = "https://api.endpoints.anyscale.com/v1",
4    api_key = "esecret_YOUR_API_KEY"
5)
6embedding = client.embeddings.create(
7    model="thenlper/gte-large",
8    input="Your text string goes here",
9)
10print(embedding.model_dump())
11

The output:

1{
2    'data': [
3        {'embedding': [...],
4         'index': 0,
5         'object': 'embedding'
6         }
7     ],
8     'model': 'thenlper/gte-large'
9   ...
10}

LinkLlama-2 70B fine tuning

Fine tuning is a popular technique to allow for model personalization and optimization, making it possible to improve model quality for specific uses, while also reducing costs and improving performance.

We have seen good traction on Llama-2 7B and 13B fine-tuning API. Today we are extending the fine-tuning functionality to the Llama-2 70B model. Llama-2 70B is the largest model in the Llama 2 series of models, and starting today, you can fine-tune it on Anyscale Endpoints with a $5 fixed cost per job run and $4/M tokens of data. You can start inference on the fine-tuned model at $1/M tokens. For more info visit here.

LinkFine-tuning Pricing

Model	Fixed Cost/Run	Price ($/M tokens)
Llama-2-7b-chat-hf	5	1
Llama-2-13b-chat-hf	5	2
Llama-2-70b-chat-hf	5	4

LinkFine-tuned model inference Pricing

Model	Price ($/M tokens)
Llama-2-7b-chat-hf	0.25
Llama-2-13b-chat-hf	0.50
Llama-2-70b-chat-hf	1.00

Example usage:

1import openai
2
3client = openai.OpenAI(
4    base_url = "https://api.endpoints.anyscale.com/v1",
5    api_key = "esecret_yourAuthTokenHere"
6)
7
8# Upload the file
9file_name = "train.jsonl"
10file = client.files.create(
11  file=open(file_name, "rb"),
12  purpose="fine-tune",
13  user_provided_filename=file_name,
14)
15
16# Launch the finetuning job
17client.fine_tuning.jobs.create(
18    model="meta-llama/Llama-2-70b-chat-hf",
19    training_file="file_123",
20)

LinkImprovements to user experience

Users can now get started with Anyscale Endpoints without a credit card. Get started with free credits and add payment information on the account later.

Embedding Endpoints
Llama-2 70B fine tuning
Fine-tuning Pricing
Fine-tuned model inference Pricing
Improvements to user experience

Sharing

Sign up for product updates

Introducing KubeRay v1.4

Deploy DeepSeek‑R1 with vLLM and Ray Serve on Kubernetes

The architecture of a Reinforcement Learning (RL) library is split into two primary components: Generation and Training. During the generation phase, an LLM Engine performs multi-turn rollouts within an environment to produce data and reward signals. This output is then fed into the training phase to update the model's parameters. This process forms a feedback loop, where the progressively improved model generates the next iteration of data for continuous refinement.

Open Source RL Libraries for LLMs

Ready to try Anyscale?

Access Anyscale today to see how companies using Anyscale and Ray benefit from rapid time-to-market and faster iterations across the entire AI lifecycle.