
Large-Scale Deployment of Ray in Tencent’s Weixin AI Infrastructure

By Weixin Astra Team   |   June 3, 2025

This blog was originally published by Tencent's Weixin backend team. It has been translated and re-published with their permission. Weixin (微信) and WeChat are sister apps run by separate Tencent entities. Weixin serves mainland-China users, while WeChat is the international version.

Summary

In recent years, as the open source project Ray has gradually matured and improved, its adoption within Chinese internet companies has grown explosively. The Weixin team, through its implementation of Ray along with Kubernetes, has overcome many technical challenges, offering valuable insights for building ultra-large-scale distributed systems with Ray.

1. Background

Weixin has a large number of AI computing scenarios, which are mainly divided into three types: 

  1. Content recommendation and search: In this use case, AI is mainly used for the production of core features in search, advertising, and recommendation systems.

  2. Product operations: AI is used to support product functionality and content management, including identifying low-quality or high-quality content and maintaining a healthy content ecosystem.

  3. Content creation: With the rise of large models, Weixin has also implemented AI-generated content (AIGC) applications such as text-to-image generation, image-to-image generation, and AI-powered special effects.

At present, AI computing covers almost all business scenarios of Weixin.

Figure 1: AI computing application scenarios in Weixin

However, we encountered various problems when implementing AI applications using Weixin’s existing backend infrastructure. 

  • Resource utilization: AI applications are computationally intensive, requiring significant computing resources. Directly using online resources for these tasks can result in prohibitively high costs. 

  • Deployment complexity: Weixin’s backend deployment platforms are primarily designed for deploying I/O-intensive, high-concurrency, and high-throughput microservices. AI applications need support for a large number of heterogeneous hardware accelerators and heterogeneous resource platforms, introducing far greater deployment complexity.

  • Application orchestration complexity: Directly using basic infrastructure components like message queues to handle complex feature dependencies and related asynchronous processes results in low developer efficiency, high risk from code changes, and poor observability. 

  • Lack of platform support: Due to insufficient platform support, algorithm iteration is slow and the barrier to using model capabilities is high. 

Therefore, Weixin urgently needed a low-cost, high-efficiency, and easy-to-use AI compute engine to solve the above problems.

Figure 2: Original infrastructure within Weixin

For example, OCR is a critical feature for Weixin’s recommendation and search functionality. OCR computations are extremely resource-intensive, requiring over one million CPU cores, and they have stringent latency and reliability requirements: feature generation must complete within one minute.

Our existing P6n platform is designed for highly responsive online tasks with millisecond-level latency. While it meets the latency requirement, its static resource allocation makes it costly, and it cannot handle the complexity of multi-model deployment.

On the other hand, the Gemini platform is optimized for large-scale offline batch processing tasks, and thus cannot fulfill the OCR scenario’s needs for real-time performance and reliability.

Therefore, we needed a near-line task processing framework that supports high responsiveness (on the order of ten seconds), large-scale heterogeneous resource deployment, low cost, and high reliability.

2. Why Choose Ray for AI Computing?

Figure 3: Enterprises choosing Ray as their AI compute engine [1]

Ray is a general-purpose distributed computing engine, open-sourced in 2016 by UC Berkeley’s RISELab, and is one of the fastest-growing computing engines. It’s widely used by companies like OpenAI, Ant Group, ByteDance, and Huawei, making it a leading next-generation computational framework.

Simplicity

First, writing distributed applications with Ray is simple and intuitive. Developers don’t need to understand or reason about low-level communication or scheduling details. Using Ray’s simple programming primitives, any Python function or class can be run in a cluster by simply adding a decorator. Ray’s distributed APIs handle all complexity internally—functions are scheduled as stateless tasks, and classes become stateful remote actors.

For example, consider this local Python script that performs OCR on an image using multiple models.

def detect(image_data):
    model = load_detect_model()
    return model(image_data)

def recognize(detect_result):
    model = load_recognize_model()
    return model(detect_result)

def ocr(image_data):
    detect_result = detect(image_data)
    return recognize(detect_result)

image_data = load_image_data()
ocr_result = ocr(image_data)

If you ran this application as microservices, the models would be too large for a single machine to hold in memory, so you would need to deploy three separate services: detect, recognize, and OCR. This adds significant deployment complexity.

import ray

@ray.remote(num_gpus=1, num_cpus=16)
def detect(image_data):
    model = load_detect_model()
    return model(image_data)

@ray.remote(num_gpus=2, num_cpus=16)
def recognize(detect_result):
    model = load_recognize_model()
    return model(detect_result)

@ray.remote(num_cpus=4)
def ocr(image_data):
    detect_result = detect.remote(image_data)
    return recognize.remote(detect_result)

image_data = load_image_data()
ocr_result = ocr.remote(image_data)

With Ray, you only need to add the @ray.remote decorator and specify the CPU and GPU resources each model requires. The entire OCR application can then be deployed from a single Python script, improving deployment efficiency by at least an order of magnitude.

Figure 4: How Ray AIR unifies ML libraries in a simple way [2]

The ML Ecosystem

Second, most popular ML libraries have strong integration with Ray, and Ray’s native libraries also integrate with these libraries. For example, developers can easily use XGBoost with Ray Train, and Hugging Face with Ray Serve. Similarly, PyTorch and TensorFlow can be easily used with Ray Train. In short, Ray has a rich integration ecosystem, not only with ML libraries, but also with other tools and frameworks.
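As a small illustration of this ecosystem (a minimal sketch, not our production setup; the model choice and resource settings are placeholders), a Hugging Face pipeline can be served with Ray Serve in a few lines:

import ray
from ray import serve
from transformers import pipeline

@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 2})
class SentimentClassifier:
    def __init__(self):
        # Load the Hugging Face pipeline once per replica.
        self.classifier = pipeline("sentiment-analysis")

    async def __call__(self, request):
        # Expect a JSON body like {"text": "..."}.
        payload = await request.json()
        return self.classifier(payload["text"])

# Bind and run the deployment on the local or connected Ray cluster.
serve.run(SentimentClassifier.bind())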

Develop Locally

Third, developers can scale from their local laptops to large-scale Ray clusters, often without changing any code at all.
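For example (a minimal sketch; the cluster address is a placeholder), the same script runs locally or against a remote cluster depending only on how Ray is initialized:

import ray

# Local development: starts a single-machine Ray instance.
ray.init()

# Production: connect the same script to an existing Ray cluster instead
# (placeholder address, using Ray Client).
# ray.init(address="ray://head-node.example.com:10001")

@ray.remote
def square(x):
    return x * x

print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]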

Overall, Ray offers a high-performance distributed framework with simple primitives, providing a unified foundation for a wide range of distributed workloads. By integrating diverse computational paradigms, Ray significantly enhances workflow efficiency. Additionally, Ray’s comprehensive ecosystem makes it straightforward to integrate mainstream frameworks needed for AI workloads like data processing, training, inference, and serving.

With many industry leaders already choosing Ray for AI computing, Weixin selected Ray as the foundational distributed compute engine for its AI infrastructure.

3. Ray’s Large-Scale Deployment in Weixin’s AI Infrastructure


| | P6n | Gemini | Astra |
| --- | --- | --- | --- |
| Scenarios | Highly responsive online backend services | Large-scale offline data processing | Online backend services and large-scale offline data processing |
| Scheduling latency | Minutes | Minutes | Seconds |
| Throughput performance | High | Medium | High |
| Fault tolerance | High | Medium | High |
| Large-scale batch processing | Not supported | Supported | Supported |
| Compute costs | High | Medium | Low |
| Deployment complexity for multiple models | High | High | Low |

The existing P6n platform is a microservice deployment framework based on Kubernetes. With automated orchestration and elastic scaling mechanisms, P6n efficiently supports highly responsive online backend services. However, it is not suitable for large-scale batch processing services due to its complexity when deploying multiple models within a single application, as well as its high infrastructure costs. Thus, it doesn’t adequately support scenarios requiring integrated online and offline AI computing.

The Gemini platform, also Kubernetes-based, is tailored for large-scale offline data processing and model training. However, Gemini’s scheduling lacks real-time responsiveness, rendering it unsuitable for AI computation scenarios requiring high responsiveness, high throughput, and high reliability.

To build Astra, a new AI compute platform characterized by high responsiveness, high throughput, high reliability, and low cost, we faced several core technical challenges:

  1. Cost efficiency: Astra must efficiently utilize and scale across diverse, heterogeneous computing resources.

  2. High throughput: It needs to support ultra-large-scale resource scheduling.

  3. Reduced deployment complexity: It should simplify the deployment complexity associated with managing multiple models within a single application.

With Ray as our computing foundation, we solved these three core problems and built AstraRay, an AI compute engine specifically optimized for large-scale AI workloads within Weixin.

Compared to KubeRay, the community Ray Kubernetes operator, AstraRay offers several enhancements.

| Feature | AstraRay | KubeRay |
| --- | --- | --- |
| Scale for a single application | Millions of nodes | Up to 1,000 nodes |
| Support for unstable resources | Supported | Not supported (high failure rate) |
| Scaling on heterogeneous resources (multi-cluster) | Supported | Limited (single Kubernetes cluster only) |
| Resource utilization | High | Relatively low |

3.1 AstraRay Architecture

Figure 5: KubeRay architecture [3]

Figure 6: KubeRay task submission process [4]

The widely used community solution KubeRay, which integrates Ray with Kubernetes, provides an easy-to-use, highly available, and scalable cloud-native Ray cluster service that meets the needs of small and medium-sized AI applications. However, KubeRay has several limitations: relatively small cluster sizes (thousands of nodes at most), difficulty scaling across heterogeneous resources (a single Ray cluster can only be deployed within one Kubernetes cluster, and federated Kubernetes clusters are not supported), and slow scaling (limited by Kubernetes autoscaling performance). These issues make it unsuitable for Weixin’s ultra-large-scale AI applications.

Figure 7: AstraRay overall architecture

During our deployment of Ray, we encountered three core technical challenges:

  1. Cluster management with millions of pods: Certain applications within Weixin require more than one million CPU cores, significantly exceeding Kubernetes’ limits. We needed each Ray application to support scaling to millions of pods.

  2. Building stable services on unstable resources: Since AI consumes a lot of resources, in order to reduce costs, we use a large amount of low-cost, idle, but less stable computing resources. We hope to provide reliable and stable services atop these unstable resources.

  3. Application deployment complexity: AI applications within Weixin face heterogeneity along three dimensions: model, hardware, and module, resulting in high deployment complexity. We sought a unified, simplified deployment approach to reduce deployment complexity from O(n³) to O(1).

Astra’s deployment architecture builds upon multiple existing internal platforms such as Poseidon, Computing Power Platform, TaiJi, and Gemini, integrating additional Tencent Kubernetes Engine modules to create a super-large cluster with millions of CPU cores and tens of thousands of GPUs. We solved the problem of managing clusters with millions of pods through a service discovery architecture design, addressed the challenge of building stable services with unstable resources through load balancing and disaster recovery scheduling, and solved the problem of multi-model application deployment complexity through application-level scheduling.

The following is a detailed introduction to how we handled these three technical challenges.

3.2. Challenges Supporting Million-node Clusters

3.2.1 Architectural Choices


| | Scheduling Scale | Systems | Resource Utilization | Scheduling Concurrency | Optimal Scheduling | Concurrency Control |
| --- | --- | --- | --- | --- | --- | --- |
| Monolithic scheduling | Small | Kubernetes, Borg | High | Low | No | None |
| Two-level scheduling | Medium | YARN, Mesos | Low | Medium | No | Pessimistic concurrency control |
| Shared-state scheduling | Large | Omega, Astra-Starlink | High | High | Yes | Optimistic concurrency control |

Figure 8: Classification of cluster scheduling architectures [5]

Scheduling architectures typically fall into four categories: monolithic scheduling, two-level scheduling, shared-state scheduling, and hybrid scheduling. The fundamental differences among these scheduling architectures revolve around two key factors:

  1. Whether the scheduler has a global view of all resources during scheduling.

  2. Whether each application has its own dedicated scheduler.

Monolithic Scheduling, as the name suggests, involves a single scheduler with a complete global view of all resources. Google’s Borg and Kubernetes adopt this approach. The advantage of this model is that all tasks are handled by a single scheduler, allowing it to optimize scheduling decisions based on a comprehensive understanding of the entire cluster’s resource usage. However, the performance of the monolithic scheduler can become the bottleneck, limiting its scalability for extremely large clusters.

Two-level Scheduling involves multiple schedulers and is adopted by frameworks like Apache Mesos and Hadoop YARN. In this model, each application’s scheduler first obtains resources from a centralized resource manager, then allocates these resources internally to its specific tasks. While two-level scheduling solves the performance bottlenecks associated with monolithic scheduling, each application’s scheduler only has a partial (local) view of resources, limiting the ability to achieve globally optimal scheduling.

Shared-state Scheduling also uses multiple schedulers, but each scheduler maintains a global view of all resources. Google’s Omega system exemplifies this approach: all schedulers share the same global resource state and concurrently request resources from the global pool, which removes the performance bottleneck while still allowing globally optimal allocation decisions, and thus supports larger-scale clusters.

Due to these advantages, AstraRay adopts a shared-state scheduling architecture to support ultra-large-scale resource management. In this architecture, conflicts among resource requests from multiple schedulers are managed using either pessimistic or optimistic concurrency control. AstraRay implements optimistic concurrency control: conflicts, which occur infrequently, are resolved after detection, eliminating the need for a central coordination node and enabling higher scheduling concurrency.
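As a conceptual sketch of optimistic concurrency control (not AstraRay’s actual implementation), each scheduler reads a possibly stale copy of the global state, attempts to commit a placement, and simply retries when another scheduler has already claimed the conflicting resources:

import threading

class SharedClusterState:
    """Toy global resource view shared by all schedulers."""
    def __init__(self, free_cores_per_node):
        self.free = dict(free_cores_per_node)        # node -> free CPU cores
        self.version = {n: 0 for n in self.free}     # per-node version for conflict detection
        self.lock = threading.Lock()

    def snapshot(self):
        with self.lock:
            return dict(self.free), dict(self.version)

    def try_commit(self, node, cores, seen_version):
        """Commit only if the node is unchanged since we read it (optimistic)."""
        with self.lock:
            if self.version[node] != seen_version or self.free[node] < cores:
                return False                         # conflict: another scheduler won
            self.free[node] -= cores
            self.version[node] += 1
            return True

def schedule(state, cores_needed, max_retries=10):
    for _ in range(max_retries):
        free, versions = state.snapshot()            # global (possibly stale) view
        candidates = [n for n, c in free.items() if c >= cores_needed]
        if not candidates:
            return None
        node = max(candidates, key=lambda n: free[n])  # prefer the least-loaded node
        if state.try_commit(node, cores_needed, versions[node]):
            return node                              # committed without a central coordinator
    return None

state = SharedClusterState({"node-a": 16, "node-b": 64})
print(schedule(state, cores_needed=8))               # e.g. "node-b"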

3.2.2 Starlink Scheduling

We designed a new scheduling system called Starlink to better handle heterogeneous computing resources and hardware. Starlink adopts a shared-state scheduling architecture and uses optimistic concurrency control to handle scheduling conflicts. It supports deployment on top of various underlying resource platforms (such as Kubernetes, Tencent’s internal Yard platform, or Tencent Cloud’s CVM instances) and allows individual applications to run seamlessly across heterogeneous resource nodes.

Figure 9: Starlink scheduling architecture

The Starlink system consists of four primary components:

  • Node: Any node running a Starlink Agent becomes a Node. Each Node reports its state to the Resource component every second and processes application deployment tasks.

  • Resource: The Resource component collects heartbeat signals from Nodes, aggregates these signals, and broadcasts this information to other Resource nodes. Resources maintain a comprehensive online node list and can scale horizontally as a stateless service. To ensure isolation between different business applications and reduce broadcast overhead, multiple Resource clusters can be established.

  • App: Apps are applications running on Starlink. Each App has its own dedicated resource scheduler, which obtains a global resource view from Resource nodes and allocates resources through optimistic concurrency-based scheduling.

  • Scheduler: The Scheduler handles application load balancing and disaster recovery. It dynamically adjusts node weights based on node performance and health, distributing requests using a weighted routing algorithm.

Within Weixin’s backend infrastructure, each microservice typically operates as an independent module. However, due to scalability limits of a single Kubernetes cluster, large-scale AI applications often need to be split across multiple clusters. This fragmentation introduces significant deployment and scaling complexity.

Unlike Kubernetes, Starlink utilizes pre-created pods, accelerating scaling operations and simplifying resource migration. Thanks to its efficient design, Starlink supports scaling a single application to millions of nodes. Additionally, optimistic concurrency control allows extremely rapid scheduling, completing scheduling operations for tens of thousands of nodes per minute. Starlink can also schedule resources across multiple underlying resource platforms, supporting heterogeneous hardware types. This eliminates the need for splitting applications into multiple sub-workloads solely due to infrastructure constraints, significantly improving internal resource utilization and operational efficiency.

3.3. Challenges Building Stable Services with Unstable Resources

Figure 10: Challenges Building Stable Services with Unstable Resources

AstraRay extensively utilizes low-cost or idle computing resources, which typically exhibit instability. Pods running on these resources frequently face eviction or degraded performance (“unhealthy” states), which can result in high service failure rates and increased latency if used directly. Additionally, traditional scheduling methods often cause computational imbalances, leading to poor overall resource utilization.

We addressed the high service failure rates through faster disaster-recovery scheduling. Furthermore, we solved issues related to high latency and inefficient resource utilization by employing improved scheduling algorithms.

Figure 11: Starlink scheduling process

3.3.1 Rapid Disaster Recovery Scheduling

Figure 12: Kubernetes PreStop Hook mechanism [6]

To accelerate disaster recovery, we implemented two key strategies:

  1. Before a resource platform (such as Kubernetes) actually evicts a pod, we gracefully terminate the running service using Kubernetes’ PreStop Hook mechanism. Simultaneously, the affected node marks itself as offline and immediately reports this state via heartbeat signals to the Resource nodes.

  2. The Resource nodes aggregate and quickly broadcast these status changes throughout the Resource cluster. Every three seconds, the Scheduler retrieves the latest node statuses from the Resource nodes, dynamically recalculates node weights, and updates its routing table.

By employing this strategy, pod eviction and recovery can be completed within approximately four seconds, significantly reducing application failure rates.
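The agent-side half of this flow can be sketched as follows (a conceptual sketch, not the real Starlink agent; the heartbeat call and node ID are placeholders). The PreStop hook ultimately results in a termination signal, at which point the node marks itself offline and reports immediately rather than waiting for the next heartbeat cycle:

import signal
import sys
import time

NODE_ID = "node-123"          # placeholder node identity
HEARTBEAT_INTERVAL_S = 1      # Nodes report their state every second

def send_heartbeat(node_id, status):
    """Placeholder for the real heartbeat RPC to the Resource component."""
    print(f"heartbeat: node={node_id} status={status}")

def handle_termination(signum, frame):
    # The platform's PreStop hook leads to a SIGTERM; mark the node offline
    # and push one final heartbeat immediately so the Scheduler can drop this
    # node from its routing table within seconds.
    send_heartbeat(NODE_ID, "offline")
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_termination)

while True:
    send_heartbeat(NODE_ID, "online")
    time.sleep(HEARTBEAT_INTERVAL_S)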

Figure 13: Online-video OCR failures

3.3.2 Dynamic Weighted SWRR Routing Algorithm

AI applications typically have large computational workloads per request, resulting in relatively low single-node QPS (queries per second). Under these conditions, traditional load-balancing techniques commonly used in Weixin’s backend, such as consistent hashing, struggle to evenly distribute requests across available nodes. Moreover, because low-priority or free resources are frequently reclaimed by higher-priority online tasks, node performance can vary significantly.

To address this, we adapted and optimized the Smooth Weighted Round-Robin (SWRR) algorithm, applying it to low-QPS workloads for the first time. This allows requests to be redistributed quickly and evenly across nodes. The adapted algorithm operates as follows:

Step 1: Update node weights (every 3 seconds)

For each node: Node Weight = (number of CPU cores or GPUs) * log(remaining utilization) * (current utilization / current concurrency)

This formula provides a simplified model predicting optimal request distribution. The node weight indicates the node’s current capability to handle additional requests, thus allocating more requests to nodes with higher capacity. Specifically:

  • Number of CPU cores or GPUs: Represents total available resources on the node; performance is proportional to available resources. GPU counts are adjusted according to relative GPU performance.

  • log(remaining utilization): Represents the node’s currently available resource capacity. The logarithmic function is used based on empirical observations, particularly effective under high-load conditions.

  • (current utilization / current concurrency): Represents actual node performance. Assuming similar workload characteristics across tasks, this ratio effectively indicates each node’s processing efficiency.

Step 2: Node selection (SWRR standard process)

To optimize performance (since SWRR has O(n) complexity), we apply optimizations such as partitioning nodes into blocks or running multiple algorithm instances. The basic SWRR steps are:

  1. For each node: Node Routing Weight = Node Routing Weight + Node Weight

  2. Select the node with the highest current Routing Weight.

  3. For the selected node: Node Routing Weight = Node Routing Weight – (sum of all node weights)

Example SWRR execution:

Suppose three nodes {A, B, C} with weights {5, 1, 1}:
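This execution can be reproduced with a minimal sketch of the standard SWRR selection step (fixed weights here for clarity; in AstraRay the weights are recomputed every three seconds from the formula in Step 1, and ties are broken by node order):

def swrr_select(weights, routing):
    """One round of Smooth Weighted Round-Robin; mutates the routing weights."""
    total = sum(weights.values())
    for node, w in weights.items():
        routing[node] += w                        # step 1: add each node's weight
    selected = max(routing, key=routing.get)      # step 2: pick the highest routing weight
    routing[selected] -= total                    # step 3: subtract the sum of all weights
    return selected

weights = {"A": 5, "B": 1, "C": 1}
routing = {node: 0 for node in weights}

picks = [swrr_select(weights, routing) for _ in range(7)]
print(picks)  # ['A', 'A', 'B', 'A', 'C', 'A', 'A'] -- a 5:1:1 split, smoothly interleaved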

Figure 14: Example SWRR execution

Figure 15: Average latency of online-video OCR (milliseconds)

Figure 16: CPU utilization (request)

Ultimately, we used an adaptive weighted SWRR algorithm to dynamically balance request distribution, which not only equalized resource utilization but also reduced request latency.

3.4. Reducing the Complexity of AI Application Deployment

Figure 17: Deployment complexity of AI applications

Deploying AI applications involves three major dimensions of complexity:

  1. Scaling across multiple models

  2. Scaling across different GPU types

  3. Scaling an application across multiple Kubernetes clusters (especially when a single application exceeds Kubernetes deployment limits)

For large-scale “super applications,” the overall deployment complexity grows as O(n³). AstraRay introduces a novel approach that enables scaling across all three dimensions within a single application, reducing deployment complexity from O(n³) to O(1) and dramatically improving deployment efficiency.

3.4.1. Challenges Scaling Across Multiple Models

At the core, scaling across multiple models requires dynamic switching of model runtimes. This presents two main challenges:

  1. Runtime switching at execution time

  2. Fast distribution of models

Dynamic Runtime Switching

Figure 18: Ray dynamic runtime [7]

We first addressed the challenge of runtime environments. While Ray offers a RuntimeEnv mechanism for managing execution environments, it has some limitations:

  • Ray’s native RuntimeEnv does not support using multiple Python versions.

  • Dependencies outside the Python environment (e.g., system packages) must be pre-installed on the base Docker image used by Ray nodes, making it less flexible.

To overcome this, we introduced Conda-based runtime environment management, allowing isolation and packaging of custom Python environments. Unlike Ray’s default Conda support—which requires Ray to start first and forces worker nodes to use the same Python version as the head node—we initialize the runtime before launching Ray. This lets each application use its own custom Python version.

How it works:

Before application code is packaged and uploaded, we use the user-provided requirements.txt to build a Conda environment using conda-pack. This environment is then distributed to the target nodes before Ray starts. This pre-packaging approach also avoids overloading software repositories during large-scale cluster scaling. We also support user-defined packaging for environments like TensorRT, offering powerful customization.
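A minimal sketch of this packaging step is shown below (the environment name, Python version, and paths are placeholders; the real pipeline adds caching and P2P distribution on top):

import subprocess
import conda_pack

ENV_NAME = "astra-app-env"          # placeholder environment name
REQUIREMENTS = "requirements.txt"   # user-provided dependency list

# 1. Create an isolated Conda environment with the application's Python version.
subprocess.run(["conda", "create", "-y", "-n", ENV_NAME, "python=3.10"], check=True)

# 2. Install the user's Python dependencies into that environment.
subprocess.run(["conda", "run", "-n", ENV_NAME, "pip", "install", "-r", REQUIREMENTS], check=True)

# 3. Pack the environment into a relocatable archive that can be shipped
#    to target nodes and unpacked before Ray is started there.
conda_pack.pack(name=ENV_NAME, output=f"{ENV_NAME}.tar.gz")

On the target node, the archive is unpacked and activated (conda-pack ships a conda-unpack script to fix up path prefixes) before the Ray process is launched with that environment’s Python.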

Figure 19: AstraRay Runtime Environment

Fast Model Distribution

As large models (LLMs, etc.) become more common, model files have grown dramatically—often tens of gigabytes—making distribution a bottleneck. Ray supports specifying a working_dir to distribute code and models, but:

  • Ray relies on a single GCS (Global Control Store) node, and

  • The default size limit is only 500MB, making it unsuitable for production-scale model distribution.

To solve this, we embedded a P2P (peer-to-peer) network into Ray nodes.

The P2P system includes a server-side and an SDK client:

  • The server manages P2P nodes via heartbeats and tracks seed node information.

  • Each P2P node can cache and transfer file shards.

Figure 20: P2P Server-side architecture

Figure 21: P2P SDK-side architecture

We further optimized the P2P system for performance and network stability:

  1. NAT traversal: Supports hole punching to establish connections in complex network environments.

  2. Auto-throttling: To prevent P2P from consuming all node bandwidth/CPU, each node is bandwidth-tested and given usage limits.

  3. Global throttling: Even with per-node limits, congestion can still occur at switches or upstream networks. The server can issue global throttling policies to protect other services.

  4. Cold start & hotspot acceleration: When a new file is being distributed and isn’t cached anywhere, sequential downloads can overwhelm specific seed nodes. We randomize the download order of shards to distribute the load more evenly.

Figure 22: P2P Download Acceleration
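The cold-start optimization (point 4 above) can be illustrated with a small sketch; the shard download function is a placeholder, not our actual P2P client:

import random

def download_shard(file_id, shard_index):
    """Placeholder for fetching one shard from a peer or seed node."""
    return f"<{file_id}:shard-{shard_index}>".encode()

def download_file(file_id, num_shards):
    order = list(range(num_shards))
    # Randomize the shard order so early requests for a new file are spread
    # across seed nodes instead of all hitting the first shards' holders.
    random.shuffle(order)
    shards = {idx: download_shard(file_id, idx) for idx in order}
    # Reassemble in the original order once all shards have arrived.
    return b"".join(shards[i] for i in range(num_shards))

print(len(download_file("model-weights", num_shards=8)))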

3.4.2. Challenges Scaling Across Multiple Kubernetes Clusters

Figure 23: Ray federated cluster architecture

To improve Ray’s scalability, we implemented a Ray federated cluster architecture via Starlink. In this design, a single Ray application can span multiple Ray clusters, with each cluster being fully functional on its own.

The benefits of this approach are:

  • Vertical scalability: You can scale up an individual Ray cluster (e.g., for more actors or heavier workloads).

  • Horizontal scalability: You can add more Ray clusters to scale out capacity.

We also enhanced Ray’s fault tolerance under this federated setup:

  • If a head node goes offline, the system automatically provisions a new Ray cluster.

  • If a worker node fails, a new worker is restarted within the same Ray cluster.

This strategy allows Ray applications to operate reliably even on low-priority, unstable computing resources.
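A conceptual sketch of this fault-tolerance policy (not AstraRay’s actual code; the provisioning and restart calls are placeholders):

def provision_ray_cluster(cluster_id):
    """Placeholder: allocate nodes via Starlink and start a fresh Ray cluster."""
    print(f"provisioning replacement cluster for {cluster_id}")

def restart_worker(cluster_id, worker_id):
    """Placeholder: restart a failed worker inside the same Ray cluster."""
    print(f"restarting worker {worker_id} in cluster {cluster_id}")

def handle_node_failure(cluster_id, node_role, worker_id=None):
    if node_role == "head":
        # Losing the head node takes the whole Ray cluster down, so a new
        # cluster is provisioned and the application's traffic shifts to it.
        provision_ray_cluster(cluster_id)
    else:
        # Worker failures are local: restart the worker within the same cluster.
        restart_worker(cluster_id, worker_id)

handle_node_failure("ray-cluster-7", "head")
handle_node_failure("ray-cluster-7", "worker", worker_id="w-42")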

3.4.3. Challenges Scaling Across Multiple GPU Types

Figure 24: TFCC inference runtime

Scaling inference workloads across different GPU types introduces three major challenges:

  1. Diverse inference workloads: Various inference engines and model formats (PyTorch, ONNX, TensorRT, etc.)

  2. Tedious and repetitive adaptation to heterogeneous hardware: Supporting GPUs from different vendors (e.g., NVIDIA, or Tencent’s in-house chips like Zixiao)

  3. High switching costs across engines and models

To address these, we built on top of the TFCC framework to provide a standardized inference service layer.

The benefits of this approach are:

  • Unified interface for model deployment

  • Engine abstraction—no need for developers to write engine-specific code

  • Built-in hardware adaptation—developers write one version of the application, which can run across multiple GPU types

This enables developers to focus on defining models, without worrying about inference engine details or hardware differences, dramatically simplifying deployment across a wide range of AI application scenarios.
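To give a feel for what such a standardized layer looks like from the developer’s side, here is a hypothetical interface sketch; it is not TFCC’s actual API, and the model spec fields and names are illustrative only:

from dataclasses import dataclass

@dataclass
class ModelSpec:
    name: str      # logical model name, e.g. "ocr-recognize"
    path: str      # model artifact location (placeholder path)
    engine: str    # "onnx", "tensorrt", "torch", ... chosen by config, not code

class InferenceService:
    """Hypothetical unified inference wrapper: one code path for all engines/GPUs."""
    def __init__(self, spec: ModelSpec):
        self.spec = spec
        self.runtime = self._load_runtime(spec)

    def _load_runtime(self, spec):
        # The real framework would select the engine backend and hardware
        # adapter here; this placeholder just records the choice.
        return f"{spec.engine}-runtime"

    def predict(self, inputs):
        # Developers call predict() regardless of engine or GPU vendor.
        return {"model": self.spec.name, "runtime": self.runtime, "inputs": inputs}

service = InferenceService(ModelSpec(name="ocr-recognize", path="/models/ocr.onnx", engine="onnx"))
print(service.predict({"image": "..."}))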

4. Conclusion

The arrival of the AI era has brought many challenges to Weixin’s backend infrastructure. We adopted Ray, an industry standard, as our foundation, adapted it to Weixin’s existing infrastructure, and provided a convenient and fast AI application development paradigm. At the same time, we reduced Ray’s cluster management complexity and saved considerable machine costs by using low-cost idle resources.

We started the AstraRay project one year ago. Since then, AstraRay has provided a solid foundation for engineering Weixin’s AI applications and continues to be optimized in preparation for more AI applications within Weixin.

5. References

[1] https://www.infoq.cn/article/vt4lewlrgumufibrulhz 

[2] https://www.anyscale.com/blog/four-reasons-why-leading-companies-are-betting-on-ray 

[3] https://docs.ray.io/en/latest/cluster/kubernetes/index.html 

[4] https://juejin.cn/post/7313601254365691941 

[5] https://www.cl.cam.ac.uk/research/srg/netos/camsas/blog/2016-03-09-scheduler-architectures.html 

[6] https://api7.ai/blog/api7-cloud-integrates-kubernetes-service-discovery 

[7] https://github.com/nginx/nginx/commit/52327e0627f49dbda1e8db695e63a4b0af4448b1 

[8] https://www.anyscale.com/blog/handling-files-and-packages-on-your-cluster-with-ray-runtime-environments
