Thinking Machines just released Tinker, an LLM training API for researchers and hackers. The API offers low-level control while abstracting away model deployment challenges.
We built a simple example showcasing how to use Ray along with Tinker to build and run a text-to-SQL model.
There are two primary parts to this use case: data generation and model fine-tuning. We first show how to generate a dataset for supervised fine-tuning using Ray, and then how to use that dataset to fine-tune an LLM with Tinker.
The first step is to generate data for supervised fine-tuning, which has two components: generation and evaluation. To generate queries, we deploy Qwen3-8B with vLLM and Ray Serve as an Anyscale service to scale LLM inference and use Ray Core to run a large number of parallel tasks that produce candidate SQL queries. We then evaluate each candidate query in a SQL environment and compute rewards using skyrl-gym.
Here is the application code for running Qwen3-8B as a service. It uses Ray Serve's built-in integration with vLLM to deploy the model.
# deploy_qwen.py

from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-qwen-8B",
        model_source="Qwen/Qwen3-8B",
    ),
    accelerator_type="L40S",
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=4, max_replicas=8,
        )
    ),
    engine_kwargs=dict(
        max_model_len=8192,
        tensor_parallel_size=1,
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
This service can be deployed by running

anyscale service deploy -f service.yaml

The service.yaml file is provided in the appendix.
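Once the service is running, you can sanity-check it with a single chat completion request. This is a minimal sketch; the token and base URL are placeholders for the values reported when the service is deployed, and the prompt is just an example.

# query_service.py
# A minimal sanity check against the deployed service. The token and base URL
# below are placeholders for the values shown when the service is deployed.

from urllib.parse import urljoin
from openai import OpenAI

token = # <FILL IN APPROPRIATE TOKEN>
base_url = # <FILL IN APPROPRIATE BASE URL>

client = OpenAI(api_key=token, base_url=urljoin(base_url, "v1"))
response = client.chat.completions.create(
    model="my-qwen-8B",
    messages=[{"role": "user", "content": "Write a SQL query that counts the rows in a table called users."}],
)
print(response.choices[0].message.content)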
Here is the code for querying the model at scale, evaluating the generated queries, and filtering out the unsuccessful ones. This code can naturally be extended in a multi-turn fashion by feeding the feedback from an unsuccessful query back into the model to generate new candidate queries (a sketch of this extension appears after the job submission instructions below).
# data_generation.py

from urllib.parse import urljoin
from datasets import load_dataset
from skyrl_gym.envs.sql.env import SQLEnv
from omegaconf import DictConfig
from openai import OpenAI
import json
import ray

dataset = load_dataset("NovaSky-AI/SkyRL-SQL-653-data-newfmt", split="train").to_list()

token = # <FILL IN APPROPRIATE TOKEN>
base_url = # <FILL IN APPROPRIATE BASE URL>

@ray.remote(num_cpus=0.1)
def generate_sql(messages):
    client = OpenAI(api_key=token, base_url=urljoin(base_url, "v1"))
    response = client.chat.completions.create(
        model="my-qwen-8B",
        messages=messages,
    )
    return response.choices[0].message.content

# Generate SQL queries in parallel
object_refs = [generate_sql.remote(record["prompt"]) for record in dataset]

# Fetch the results and filter out the unsuccessful ones
object_refs_and_records = dict(zip(object_refs, dataset))
successful = []
remaining = object_refs
while remaining:
    [ready_ref], remaining = ray.wait(remaining, num_returns=1)
    record = object_refs_and_records[ready_ref]
    messages = record["prompt"]

    try:
        assistant_response = ray.get(ready_ref)
    except Exception:
        continue

    conf = DictConfig({"db_path": "/home/ray/data"})
    env = SQLEnv(conf, record)
    env.init(messages)
    try:
        output = env.step(assistant_response)
    except AssertionError:
        continue

    print("Reward: ", output["reward"])

    if output["reward"] > 0:
        successful.append((record, assistant_response))

examples = []
for record, assistant_response in successful:
    examples.append(record["prompt"] + [{"role": "assistant", "content": assistant_response}])

with open("/mnt/shared_storage/successful.json", "w") as f:
    json.dump(examples, f)
This job can be submitted by running

anyscale job submit -f job.yaml --env HF_TOKEN=$HF_TOKEN

The job.yaml file is provided in the appendix. The examples will be stored in a shared filesystem, though you can store them wherever you want.
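As mentioned above, the single-turn loop can be extended to multiple turns by feeding the environment's feedback on a failed query back to the model and asking for a new candidate. Below is a rough sketch of that retry loop. It reuses generate_sql and the SQLEnv setup from data_generation.py; the assumption that env.step returns feedback messages under an "observations" key is ours, so check the skyrl-gym SQLEnv implementation for the exact output format.

# multi_turn_sketch.py -- a rough sketch, not a drop-in replacement.
# Reuses generate_sql and the SQLEnv setup from data_generation.py. The
# "observations" key used below is an assumption about skyrl-gym's step
# output; only "reward" is relied on in the single-turn loop above.

import ray
from omegaconf import DictConfig
from skyrl_gym.envs.sql.env import SQLEnv

MAX_TURNS = 3

def generate_with_feedback(record):
    messages = list(record["prompt"])
    conf = DictConfig({"db_path": "/home/ray/data"})
    env = SQLEnv(conf, record)
    env.init(messages)

    for _ in range(MAX_TURNS):
        assistant_response = ray.get(generate_sql.remote(messages))
        try:
            output = env.step(assistant_response)
        except AssertionError:
            return None
        if output["reward"] > 0:
            # Success: return the full conversation as a training example.
            return messages + [{"role": "assistant", "content": assistant_response}]
        # Otherwise, append the failed attempt and the environment's feedback
        # and let the model try again on the next turn.
        messages = messages + [{"role": "assistant", "content": assistant_response}]
        messages = messages + list(output.get("observations", []))
    return None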
We use the Tinker API to tokenize the data and fine-tune the model. Tinker offers a high level of control over training and fine-tuning LLMs. The following can be run in an Anyscale workspace that has the tinker package installed.
import tinker
from tinker import types
import json
import numpy as np

service_client = tinker.ServiceClient()

training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-8B", rank=32
)
tokenizer = training_client.get_tokenizer()

def process_example(messages: list[dict], tokenizer) -> types.Datum:
    tokens = tokenizer.apply_chat_template(messages)
    weights = [1] * len(tokens)
    # Shift by one so the model predicts each token from the preceding context.
    input_tokens = tokens[:-1]
    target_tokens = tokens[1:]
    weights = weights[1:]
    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs=dict(weights=weights, target_tokens=target_tokens),
    )

examples = json.load(open("/mnt/shared_storage/successful.json", "r"))
processed_examples = [process_example(ex, tokenizer) for ex in examples]

# Note: If you are going to train on a larger dataset, you should implement
# proper minibatch training (see the sketch after this block).
for _ in range(6):
    fwdbwd_future = training_client.forward_backward(processed_examples, "cross_entropy")
    optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))
    # Wait for the results
    fwdbwd_result = fwdbwd_future.result()
    optim_result = optim_future.result()
    # fwdbwd_result contains the logprobs of all the tokens we put in. Now we can
    # compute the weighted average log loss per token.
    logprobs = np.concatenate([output["logprobs"].tolist() for output in fwdbwd_result.loss_fn_outputs])
    weights = np.concatenate([example.loss_fn_inputs["weights"].tolist() for example in processed_examples])
    print(f"Loss per token: {-np.dot(logprobs, weights) / weights.sum():.4f}")

# Save the weights
sampling_client = training_client.save_weights_and_get_sampling_client(name="sql_model")
print(f"model path: {sampling_client.model_path}")
We now want to check how well the model performs. Let's first download the model checkpoint (make sure to fill in the model path that was printed by the code above).
import tinker
from urllib.parse import urlparse

MODEL_PATH = # <FILL IN THE MODEL PATH PRINTED ABOVE>

parsed_url = urlparse(MODEL_PATH)

service_client = tinker.ServiceClient()
rest_client = service_client.create_rest_client()
data = rest_client.download_checkpoint_archive(parsed_url.netloc, parsed_url.path.lstrip('/')).result()

with open('output.tar.gz', 'wb') as f:
    f.write(data)
We then extract the LoRA weights with

mkdir -p /home/ray/sql_lora && tar xvfz output.tar.gz -C /home/ray/sql_lora

and merge them with the base model. We do this because the Tinker LoRA weights are currently not compatible with vLLM and can't be served directly; this will be fixed going forward.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(base_model, "/home/ray/sql_lora")
merged_model = model.merge_and_unload()

save_path = "/home/ray/merged_sql_model"
merged_model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
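To check the fine-tuned model's quality, one option is to serve the merged weights with the same Ray Serve LLM pattern used in deploy_qwen.py and then rerun the evaluation loop from data_generation.py against the new endpoint. Here is a minimal sketch; it assumes you copy the merged weights to storage the Serve replicas can read, and the /mnt/shared_storage path and model_id below are our placeholders.

# deploy_sql_model.py -- a minimal sketch, mirroring deploy_qwen.py.
# Assumes the merged weights were copied to shared storage, e.g.
# /mnt/shared_storage/merged_sql_model, so that Serve replicas can load them.

from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-sql-model",
        model_source="/mnt/shared_storage/merged_sql_model",
    ),
    accelerator_type="L40S",
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
    engine_kwargs=dict(
        max_model_len=8192,
        tensor_parallel_size=1,
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})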
To run the above code, a few additional setup steps are required.
We define our base image using the following Dockerfile.
# Dockerfile

FROM anyscale/ray:2.48.0-slim-py312-cu128

RUN sudo apt-get update -y \
    && sudo apt-get install --no-install-recommends -y build-essential libnuma-dev \
    && sudo rm -f /etc/apt/sources.list.d/*

RUN curl -LsSf https://astral.sh/uv/install.sh | sh

RUN git clone https://github.com/novasky-ai/SkyRL.git
WORKDIR /home/ray/SkyRL/skyrl-gym/
RUN uv pip install --system .
RUN uv pip install --system "huggingface_hub[cli]" "datasets" "openai" "transformers" "torch" "vllm==0.10.0" "pydantic"
We define the service config as follows.
# service.yaml

name: deploy-qwen
containerfile: ./Dockerfile

compute_config:
  auto_select_worker_config: true

working_dir: .

applications:
- import_path: deploy_qwen:app
We define the job config as follows.
# job.yaml

name: data-generation
containerfile: ./Dockerfile

compute_config:
  head_node:
    instance_type: c6a.12xlarge
  auto_select_worker_config: true

working_dir: .

entrypoint: |
  uv run --with huggingface_hub huggingface-cli download seeklhy/OmniSQL-datasets data.zip --repo-type dataset --local-dir $HOME && \
  unzip $HOME/data.zip -d $HOME && \
  python data_generation.py

max_retries: 0