Efficient and portable AI / LLM inference on the edge cloud [Workshop]

Michael Yuan - Second State

Inference accounts for 95% of AI computing workloads. As AI adoption increases, the cloud native community is looking for more efficient ways to provide and manage AI inference services in the cloud. This tutorial ties together and showcases open source projects from several large communities.

As AI applications gain popularity, we increasingly see requirements to run AI or even LLM workloads on the edge cloud with heterogeneous hardware (e.g., GPU accelerators). However, simplistic approaches are too heavyweight, too slow, and not portable. For example, the PyTorch container image is 3GB, and a container image for a C++ native toolchain is 300MB. Python apps also require complex dependency packages and can be very slow. These container images depend on the underlying host's CPU and GPU, making them difficult to manage.

Wasm has emerged as a lightweight runtime for cloud native applications. For an AI app, the entire Wasm runtime and app can be under 20MB. The Wasm binary runs at native speed, integrates with Kubernetes, and is portable across CPUs and GPUs. In this tutorial, we will demonstrate how to create and run Wasm-based AI applications on an edge server or your local host. We will showcase AI models and libraries for media processing (MediaPipe), vision (YOLO, OpenCV, and FFmpeg), and language (Llama 2). You will be able to run all examples on your own laptop during the session.
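To give a flavor of what such an app looks like, here is a minimal Rust sketch of LLM inference through the WASI-NN interface with a GGML (Llama 2) backend, in the style of the WasmEdge examples. The crate name wasmedge_wasi_nn, the model alias "default", and the buffer sizes are assumptions for illustration and may differ from the exact code used in the workshop.

use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // Load a Llama 2 model in GGML/GGUF format that the host runtime has
    // preloaded under the alias "default" (assumed alias).
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to load the preloaded model");
    let mut ctx = graph
        .init_execution_context()
        .expect("failed to create an execution context");

    // Pass the prompt as a UTF-8 byte tensor and run inference.
    let prompt = "Explain WebAssembly in one sentence.";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
        .expect("failed to set the input tensor");
    ctx.compute().expect("inference failed");

    // Read back the generated text (4KB output buffer is an assumption).
    let mut out = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut out).expect("failed to read the output");
    println!("{}", String::from_utf8_lossy(&out[..n]));
}

Compiled for the wasm32-wasi target, a program like this produces a .wasm binary of only a few megabytes. A WASI-NN-capable runtime such as WasmEdge maps the model alias to a local GGUF file at startup (for example via its --nn-preload option), so the same binary runs unchanged on CPU-only and GPU-accelerated hosts.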
