Google Just Built a Foundation Model for Zero-Shot Time Series Forecasting

A decoder-only transformer for predictions in time series data.

5 min read · Feb 5, 2024

Time series forecasting has been one of the classic scenarios in machine learning (ML) since its early days. The ability to output predictions on time series data is relevant in many domains, including retail, finance, manufacturing, healthcare, the natural sciences and, yes, stock market predictions. Despite its relevance, progress in those scenarios pales relative to the rapid developments we are seeing in LLMs, computer vision, and other areas of generative AI. Is the pretrained-model paradigm applicable to time series forecasting? Google seems to believe so, with a recent research paper that outlines a decoder-only pretrained model for time series forecasting. Google also announced that the new model will be available in Vertex AI in the near future.

Parallel to this, the realm of LLMs is advancing rapidly, particularly with the development of large foundation models. These models are gaining attention for their versatility in generating text, translating languages, creating diverse forms of content, and providing answers to queries in a detailed manner. Their training on extensive datasets enables them to grasp the nuances of human language, making them highly effective for a range of tasks, often without the need for additional, task-specific training.

This progress raises an intriguing question: could a large model, trained on a vast array of time-series data, identify and learn temporal patterns well enough to forecast future events in data it hasn’t seen before? Specifically, the idea is to create a foundational time-series model capable of delivering accurate forecasts right out of the box for new datasets, which could significantly reduce the need for extensive training data and computational resources in forecasting tasks.

However, creating such a model presents unique challenges. Unlike language, time-series data doesn’t have a set vocabulary or syntax, and a model would need to accommodate different lengths of historical data, forecast periods, and data frequencies. Moreover, unlike the abundant availability of text data for training language models, a similarly vast repository of time-series data is not as readily accessible.

Despite these obstacles, Google presents evidence suggesting that developing a foundational time-series forecasting model is indeed feasible, pointing towards a future where accurate forecasting could become more accessible and efficient for a wide range of applications.

The Architecture

Google’s foundation model for time series forecasting is a decoder-only model. The architecture choice is guided by a series of key principles.

1. Patching: The time series is segmented into patches during training, analogous to the way tokens function in language models. This not only enhances model performance but also accelerates inference, because the transformer processes one token per patch rather than one per time-point. The patch length cannot grow without limit, though: Google balances it carefully to avoid moving too far from decoder-only training, which is known for its efficiency.

2. Decoder-Only Model: Unlike some existing models that combine an encoder and a decoder, Google’s model operates in a decoder-only mode. It predicts each future patch based solely on the sequence of past patches, which allows training to proceed in parallel over the entire context window. This approach mirrors the auto-regressive nature of large language models (LLMs) but is specially adapted for long-horizon forecasting.

3. Longer Output Patches: Google has also proposed a novel solution to improve long-horizon forecasting accuracy. By allowing output patches to be longer than input patches, the model covers a long horizon in fewer auto-regressive steps, which limits the errors that accumulate when each forecast is fed back in as input. This contrasts with traditional step-by-step forecasting and offers a more efficient and accurate approach to predicting future time-points.

4. Patch Masking: The introduction of patch masking during training ensures the model can handle context lengths of any size, not just those that are multiples of the input patch length. This strategy involves randomly masking parts of or entire patches, allowing the model to learn from a diverse range of context lengths.
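The patching ideas above can be made concrete with a small sketch. This is not Google’s implementation: the function names `make_patches` and `decode_steps`, the choice to pad on the left, and the patch lengths (32 and 128) are all assumptions picked for illustration. It shows how a context whose length is not a multiple of the patch length can still be patched with the help of a mask, and how longer output patches reduce the number of auto-regressive steps.

```python
import math
import numpy as np

def make_patches(series, patch_len):
    """Split a 1-D series into fixed-length patches (hypothetical helper).

    The series is left-padded with zeros so its length becomes a multiple
    of patch_len. The returned mask is 1.0 at padded positions and 0.0 at
    real observations, so a model could ignore the padding.
    """
    series = np.asarray(series, dtype=float)
    n_patches = math.ceil(len(series) / patch_len)
    pad = n_patches * patch_len - len(series)
    padded = np.concatenate([np.zeros(pad), series])
    mask = np.concatenate([np.ones(pad), np.zeros(len(series))])
    return padded.reshape(n_patches, patch_len), mask.reshape(n_patches, patch_len)

def decode_steps(horizon, output_patch_len):
    """Number of auto-regressive steps needed to cover the horizon."""
    return math.ceil(horizon / output_patch_len)

# A context of 70 points is not a multiple of the (assumed) patch length 32.
context = np.arange(70.0)
patches, mask = make_patches(context, patch_len=32)

print(patches.shape)    # (3, 32): three tokens instead of 70 time-points
print(int(mask.sum()))  # 26 padded positions in the first patch

# Covering 256 future points: output patches of 32 vs. 128 points.
print(decode_steps(256, 32), decode_steps(256, 128))  # 8 2
```

With these illustrative sizes, the transformer sees 3 tokens rather than 70 values, and a 256-point forecast takes 2 decoding steps instead of 8 when the output patch is four times longer than the input patch.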

The architectural backbone of Google’s model closely mirrors the structure of LLMs, utilizing stacked transformer layers that process input patches (treated as tokens) through self-attention and feedforward layers. A unique aspect of their model is the conversion of time-series patches into tokens using a multilayer perceptron block with residual connections, a technique that has proven effective in prior long-horizon forecasting efforts. Additionally, the model is designed to predict longer sequences of future time-points than what was input, enabling more comprehensive forecasting capabilities.
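A rough NumPy sketch of the two building blocks described here may help: a residual MLP block that turns each patch into a token, followed by causal self-attention so that each token only attends to earlier patches. Everything in it is an assumption made for illustration: the dimensions, the random weights, the single attention head, and the omission of learned query/key/value projections, layer normalization, and the feedforward layers. It is not the model’s actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, w_hidden, w_out, w_skip):
    """One-hidden-layer MLP plus a linear skip connection (illustrative)."""
    hidden = np.maximum(x @ w_hidden, 0.0)  # ReLU activation
    return hidden @ w_out + x @ w_skip

def causal_self_attention(tokens):
    """Single-head attention in which token i only sees tokens 0..i.

    Simplification: queries, keys, and values are the tokens themselves.
    """
    n, d = tokens.shape
    scores = tokens @ tokens.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly above diagonal
    scores = np.where(future, -np.inf, scores)          # hide future patches
    scores -= scores.max(axis=1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens, weights

# Assumed sizes: 4 input patches of 32 points, mapped to 16-dimensional tokens.
patch_len, hidden_dim, model_dim = 32, 64, 16
patches = rng.normal(size=(4, patch_len))
w_hidden = rng.normal(size=(patch_len, hidden_dim)) * 0.1
w_out = rng.normal(size=(hidden_dim, model_dim)) * 0.1
w_skip = rng.normal(size=(patch_len, model_dim)) * 0.1

tokens = residual_block(patches, w_hidden, w_out, w_skip)
mixed, weights = causal_self_attention(tokens)

print(tokens.shape)  # (4, 16)
print(mixed.shape)   # (4, 16)
```

The causal mask is what makes the stack decoder-only: the representation of a patch never depends on patches that come after it, so every position can serve as a training example for predicting the next patch.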

This approach marks a notable step forward in time-series forecasting: it adapts transformer-based architectures to the particular challenges of predicting future values from past data, blending innovation with practicality.

The Results

Google carried out an extensive zero-shot evaluation of its time series forecasting model, that is, without prior exposure to the evaluation datasets, choosing three key collections for this assessment:

The first collection, known as the Monash archive, encompasses a diverse set of 30 datasets that vary in training and prediction lengths. These datasets span a wide array of granularities, from minutes to years, and cover several domains such as finance, demand forecasting, weather, and traffic, providing a comprehensive testing ground for the model’s versatility.

Next, they examined the Darts collection, which consists of 8 univariate datasets. These particular datasets are notable for their distinct seasonal patterns and trends, both additive and multiplicative, offering a focused challenge on the model’s ability to capture and forecast recurring patterns.

Lastly, the Informer datasets, a collection well-regarded for its use in testing supervised long-horizon forecasting methods, were also included in the evaluation. However, Google concentrated on a subset of these datasets that were not part of the model’s pretraining phase. Specifically, they focused on datasets of electricity transformer temperatures over two years, recorded at fifteen-minute intervals (ETTm1 and ETTm2) and one-hour intervals (ETTh1 and ETTh2).

This selection of datasets provided Google with a broad and challenging spectrum of data, ensuring a rigorous test of the model’s forecasting capabilities across various time series scenarios.
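To illustrate what a zero-shot evaluation of this kind looks like in practice, here is a minimal sketch. It holds out the last points of a series, asks a forecaster that has never been fitted to that series for a prediction, and scores the result. The metric (mean absolute error), the synthetic sine-wave series, and the seasonal-naive stand-in forecaster are all assumptions for illustration; the text above does not state which metrics or baselines Google used.

```python
import numpy as np

def seasonal_naive(context, horizon, season):
    """Stand-in forecaster: repeat the last full season of the context."""
    last_season = np.asarray(context, dtype=float)[-season:]
    reps = -(-horizon // season)  # ceiling division
    return np.tile(last_season, reps)[:horizon]

def zero_shot_mae(series, horizon, forecaster):
    """Hold out the last `horizon` points and score a forecast of them.

    The forecaster only receives the context; it is never trained on
    this series, which is what makes the evaluation zero-shot.
    """
    series = np.asarray(series, dtype=float)
    context, target = series[:-horizon], series[-horizon:]
    forecast = forecaster(context, horizon)
    return float(np.mean(np.abs(forecast - target)))

# Synthetic hourly-style series with a period of 24 steps.
t = np.arange(96)
series = np.sin(2 * np.pi * t / 24)

mae = zero_shot_mae(series, 24, lambda ctx, h: seasonal_naive(ctx, h, season=24))
print(mae < 1e-9)  # True: the series repeats exactly every 24 steps
```

In a real benchmark the lambda would be replaced by a call to the pretrained model, and the same loop would run over every dataset in a collection.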

Google is pushing the boundaries of time series forecasting with its pretrained decoder-only model. It seems like a bold idea, but hopefully one that inspires more research in the space.