In time series forecasting, critical information often doesn't reside in the "most recent step." It might be a specific phase within a cycle, a recovery after a sudden spike, or similar patterns separated by long intervals. Traditional recurrent neural networks (RNNs) and their variants like LSTM struggle with these long-range dependencies because they must sequentially propagate information through time, leading to vanishing gradients and computational bottlenecks.
Attention mechanisms change this picture. Instead of forcing information to flow step by step through time, attention lets the model directly learn "which segments of history to look at, and with what weight." This direct access to any position in the sequence makes attention particularly effective at capturing the long-distance dependencies and irregular correlations that are common in time series data.
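The idea of "looking at every position in history with a learned weight" can be sketched as scaled dot-product attention. The following is a minimal NumPy illustration (not an optimized or trainable implementation); the function name and the choice of using the same matrix for queries, keys, and values are illustrative assumptions:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight every time step in V by the similarity between Q and K."""
    d_k = Q.shape[-1]
    # Pairwise similarity between each query step and every key step: (T, T)
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the history axis, so each row is a distribution over time steps
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output step is a weighted average of all value steps
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 6, 4                       # 6 time steps, 4 features per step
X = rng.standard_normal((T, d))
out, w = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V
print(out.shape, w.shape)         # (6, 4) (6, 6)
print(w.sum(axis=-1))             # each row of attention weights sums to 1
```

Note that the weight matrix `w` is computed for all pairs of time steps at once; no information has to be carried sequentially from one step to the next, which is exactly what lets attention reach distant positions without vanishing gradients.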
This article breaks down the self-attention computation step by step through its formulas.