Pretrain Your Own AI Models with Fast-LLM
Created by ServiceNow, the framework provides the key building blocks for pretraining AI models.
I recently started an AI-focused educational newsletter that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
Pretraining foundation models is often perceived as a capability reserved for big AI labs. Compute coordination, data orchestration, constant experimentation, and steep AI talent requirements are among the challenges that put pretraining out of reach for most organizations. However, the emergence of trends such as small language models (SLMs) and sovereign AI has pushed the idea that many companies are going to be building proprietary AI models, and with it the need to lower the bar for pretraining foundation models. And yet, very few frameworks streamline those processes for companies.
Fast-LLM is an open-source library specifically designed for training Large Language Models (LLMs) with a focus on speed, scalability, and cost-efficiency. Developed by ServiceNow Research’s Foundation Models Lab, Fast-LLM aims to empower AI professionals, researchers, and enterprises in pushing the boundaries of generative AI. This essay provides a deep dive into Fast-LLM, highlighting its key capabilities and core architectural components.
Key Capabilities
Fast-LLM distinguishes itself from other libraries through its unique capabilities, enabling faster training, reduced costs, and enhanced scalability.
- Speed:
- Fast-LLM achieves record-breaking training throughput by employing optimized kernels, advanced parallelism, and memory-efficient techniques.
- For example, Fast-LLM can train Mistral-7B at an impressive rate of 10,350 tokens per second per GPU on a cluster with 32 H100 GPUs.
- These optimizations drastically reduce training time and associated costs.
- Scalability:
- Fast-LLM seamlessly scales from single-GPU setups to large compute clusters.
- It supports 3D parallelism (data, tensor, and pipeline), sequence length parallelism, and ZeRO-1, 2, and 3 techniques for optimal memory efficiency.
- This allows users to scale their training infrastructure without sacrificing performance (a brief launch sketch after this list illustrates the idea).
- Flexibility:
- Fast-LLM is compatible with a variety of language model architectures, including Llama, Mistral, StarCoder, and Mixtral.
- It features a modular design that provides users with full control over their training workflows.
- Cost-Efficiency:
- Fast-LLM’s higher throughput per GPU leads to reduced training time, resulting in lower training costs compared to other frameworks.
- Its efficient memory management enables users to train on more tokens with the same budget, ultimately leading to better-trained models without exceeding financial constraints.
- Openness:
- As an open-source library, Fast-LLM allows for full customization and extensibility, unlike proprietary software.
- Developed transparently on GitHub, Fast-LLM fosters trust and collaboration within its community, encouraging contributions and shaping the future of large-scale AI training.
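To make the scalability claim concrete, here is a rough sketch of how the same training command goes from one machine to many. This is an illustration rather than an excerpt from the Fast-LLM documentation: the torchrun flags are standard PyTorch launcher options, the fast-llm train gpt --config invocation mirrors the Slurm example later in this article, and the config path and node counts are placeholders.
# Single machine, 8 GPUs: one torchrun process group on localhost.
torchrun --nproc_per_node=8 --no_python \
  fast-llm train gpt --config my-experiment/train-config.yaml
# Several machines: the same command plus a rendezvous endpoint, a node count,
# and a per-node rank. Run once per node (srun does this in the Slurm example below).
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=$NODE_RANK \
  --rdzv_backend=static --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT --no_python \
  fast-llm train gpt --config my-experiment/train-config.yaml
The training entry point never changes; only the launcher arguments describing the cluster do, which is what makes scaling from a workstation to a cluster largely a configuration exercise.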
Core Architecture
The underlying architecture of Fast-LLM is carefully crafted to maximize performance and efficiency, employing several key components.
- Unified Support for GPT-Like Architectures:
- Fast-LLM simplifies the implementation of GPT-like models by consolidating them into a single, unified module.
- This reduces redundancy and simplifies adaptation to custom architectures, ensuring consistency and flexibility while minimizing development overhead.
- Optimized Kernels:
- Fast-LLM utilizes highly optimized kernels tailored for various model sizes, from small models of around 1 billion parameters up to massive models with 70+ billion parameters.
- These kernels are fine-tuned for maximum throughput across the entire range of model sizes.
- Advanced Parallelism Techniques:
- Fast-LLM leverages advanced parallelism techniques, including 3D parallelism (data, tensor, and pipeline) and sequence length parallelism, to distribute the training workload across multiple GPUs and nodes.
- This enables efficient scaling and accelerates training speeds.
- Memory Efficiency:
- Fast-LLM incorporates memory optimization techniques, such as ZeRO and activation recomputation, to reduce memory usage and enable training of larger models.
- At the 10-billion-parameter scale, memory optimization lets it avoid costly 3D parallelism altogether, while at the 100-billion-parameter scale it supports 3D parallelism efficiently (a back-of-envelope memory sketch follows this list).
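As a back-of-envelope illustration of the memory point above (my own rough numbers, not figures from the Fast-LLM documentation): mixed-precision training with Adam keeps roughly 16 bytes of persistent state per parameter before activations, so sharding that state across data-parallel GPUs is what lets a model in the 10-billion-parameter range train without splitting the model itself.
# Illustrative arithmetic only: assumes bf16 weights and gradients plus fp32
# master weights and two Adam moments (~16 bytes of persistent state per
# parameter, activations excluded).
PARAMS=10000000000      # a 10B-parameter model
BYTES_PER_PARAM=16
GPUS=8                  # data-parallel ranks sharing the sharded state
echo "Unsharded training state: $(( PARAMS * BYTES_PER_PARAM / 10**9 )) GB"
echo "Per GPU with ZeRO-style sharding across $GPUS GPUs: $(( PARAMS * BYTES_PER_PARAM / GPUS / 10**9 )) GB"
Roughly 160 GB of state shrinks to about 20 GB per GPU once sharded, which fits comfortably on current accelerators; at the 100-billion-parameter scale that is no longer enough on its own, which is where the 3D parallelism described above takes over.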
Getting Started
Fast-LLM provides a simple and intuitive command-line interface, coupled with pre-built Docker images and YAML configuration files, making it easy for users to set up and run training experiments. The “Quick Start” section of the Fast-LLM documentation offers a detailed guide on how to get started with the library. Here are some code samples from the “Getting Started” article that demonstrate Fast-LLM’s usage:
1) Installing Fast-LLM on a Slurm Cluster
sbatch <<EOF
#!/bin/bash
#SBATCH --nodes=$(scontrol show node | grep -c NodeName)
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=$(scontrol show node | grep -c NodeName)
#SBATCH --exclusive
srun bash -c 'pip install --no-cache-dir -e "git+https://github.com/ServiceNow/Fast-LLM.git#egg=llm[CORE,OPTIONAL,DEV]"'
EOF
This code snippet demonstrates how to install Fast-LLM on all nodes of a Slurm cluster using a batch script.
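For a single machine, or inside the prebuilt Docker image mentioned earlier, the sbatch/srun wrapper is unnecessary and the install line from the script above can be run directly:
pip install --no-cache-dir -e "git+https://github.com/ServiceNow/Fast-LLM.git#egg=llm[CORE,OPTIONAL,DEV]"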
2) Launching Training on a Slurm Cluster
sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=fast-llm-train
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --output=/app/fast-llm-tutorial/train-output.log
#SBATCH --error=/app/fast-llm-tutorial/train-error.log

export PYTHONHASHSEED=0
export WANDB_API_KEY_PATH=/app/fast-llm-tutorial/.wandb_api_key
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_DEBUG=INFO

# MASTER_ADDR and MASTER_PORT must be set in the shell that submits this job;
# they are substituted into the script when the heredoc is expanded.
srun \
  --container-image="ghcr.io/servicenow/fast-llm:latest" \
  --container-mounts="$(pwd)/fast-llm-tutorial:/app/fast-llm-tutorial" \
  --container-env="PYTHONHASHSEED,WANDB_API_KEY_PATH,TORCH_NCCL_ASYNC_ERROR_HANDLING,NCCL_DEBUG" \
  --gpus-per-node=\$SLURM_GPUS_PER_NODE \
  --ntasks-per-node=\$SLURM_NTASKS_PER_NODE \
  bash -c "torchrun --rdzv_backend=static \
    --rdzv_id=0 \
    --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    --node_rank=\$SLURM_NODEID \
    --nproc_per_node=\$SLURM_GPUS_PER_NODE \
    --nnodes=\$SLURM_NNODES \
    --max_restarts=0 \
    --rdzv_conf=timeout=3600 \
    --no_python \
    fast-llm train gpt \
    --config fast-llm-tutorial/train-config.yaml"
EOF
This script shows how to launch a training job using Fast-LLM on a Slurm cluster, configuring various parameters such as the number of nodes, GPUs per node, and environment variables.
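Once the job is submitted, progress can be followed with standard Slurm and shell tooling; in the sketch below, the job name and log paths come from the batch script above, and everything else is generic usage.
# Confirm the job is queued or running.
squeue --name=fast-llm-train
# Follow training output and errors as they are written.
tail -f /app/fast-llm-tutorial/train-output.log /app/fast-llm-tutorial/train-error.log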
Fast-LLM — A Game Changer for LLM Training
Fast-LLM emerges as a powerful and versatile open-source library for training large language models. Its unique blend of speed, scalability, flexibility, and cost-efficiency makes it a valuable tool for both research and production environments. Fast-LLM’s focus on performance, combined with its ease of use and commitment to community-driven development, positions it as a potential game changer in the field of large-scale AI training. By empowering AI practitioners to train sophisticated language models more efficiently, Fast-LLM facilitates further advancements in the exciting world of generative AI.