When tackling time series forecasting, many problems don't require
jumping straight to deep learning models. Structures like trends,
seasonality, autocorrelation, and volatility clustering can be cleanly
expressed by traditional models — and they're interpretable,
diagnosable, and make excellent baselines. This article places
ARIMA/SARIMA (stationarity and seasonality), VAR (multivariate
dynamics), GARCH (volatility modeling), Exponential Smoothing and
Prophet (trend decomposition), and Kalman Filtering (state-space
perspective) within a unified framework: what structure each model
captures, what its core assumptions are, how to interpret parameters,
and when each approach is more stable and worth trying first on real
data.
ARIMA
(AutoRegressive Integrated Moving Average)
ARIMA models consist of three components:
AR (Autoregressive): The autoregressive component
represents linear relationships between the series and its past
values.
I (Integrated): The integration component handles
non-stationarity through differencing operations to achieve
stationarity.
MA (Moving Average): The moving average component
uses past residuals (white noise) to predict current values.
ARIMA models are denoted by three parameters, ARIMA(p, d, q):
p: Order of the autoregressive term, indicating how many time steps to look back.
d: Order of differencing, indicating how many differences are needed to achieve stationarity.
q: Order of the moving average term, indicating how many past white noise terms to use.
The core formula is:

y_t = c + φ₁·y_{t−1} + φ₂·y_{t−2} + ... + φ_p·y_{t−p} + ε_t + θ₁·ε_{t−1} + θ₂·ε_{t−2} + ... + θ_q·ε_{t−q}

where:
- y_t is the value at time t;
- c is a constant term;
- φ₁, ..., φ_p are autoregressive coefficients;
- ε_t is the error term (white noise);
- θ₁, ..., θ_q are moving average coefficients.
ARIMA
Implementation: Complete Pipeline from Differencing to Prediction
Problem Context: Real-world time series often
contain trends and seasonality, making direct modeling problematic due
to non-stationarity. ARIMA eliminates trends through differencing, then
builds an ARMA model on the stationarized series. The core challenge is
correctly combining the differencing, autoregressive, and moving average
components.
Solution Approach: Use a "difference first, model
second" two-stage strategy. First, apply d-order differencing to eliminate trends
and obtain a stationary series; then build an AR(p) and MA(q) combination model on that series.
During prediction, the AR component uses linear combinations of
historical values, while the MA component leverages information from
historical prediction errors.
Design Considerations:
1. Differencing Strategy: First-order differencing usually suffices to eliminate linear trends; higher-order differencing may over-difference and cause information loss
2. Parameter Initialization: AR and MA coefficients must satisfy stationarity and invertibility conditions; in practice, they're obtained through maximum likelihood estimation
3. Error Accumulation: MA terms depend on historical errors, requiring sequential computation and maintenance of the error sequence
4. Boundary Handling: Initial time steps lack sufficient historical data; use zero prediction or mean imputation
def difference(series, order):
    """
    Perform d-order differencing to eliminate trend components

    Problem: Original series may contain trends (e.g., linear growth), causing non-stationarity
    Solution: Transform non-stationary series to stationary through differencing
    Design: order=1 eliminates linear trends, order=2 eliminates quadratic trends

    Parameters:
    -----------
    series : array-like
        Original time series, shape (n,)
    order : int
        Differencing order, typically 1 or 2.
        order=1 means first-order differencing: y'_t = y_t - y_{t-1}

    Returns:
    --------
    diff_series : list
        Differenced series, length len(series) - order

    Notes:
    ------
    - Differencing reduces series length, losing the first 'order' data points
    - Over-differencing (order too large) increases variance and causes information loss
    """
    diff_series = []
    # Start from position 'order' since earlier points cannot compute differences
    for i in range(order, len(series)):
        # First-order difference: current value minus value 'order' steps back
        # For order=1: diff[i] = series[i] - series[i-1]
        value = series[i] - series[i - order]
        diff_series.append(value)
    return diff_series
def ar_model(series, ar_order, ar_coeffs):
    """
    Compute the autoregressive (AR) component: predict current value using weighted sum of historical values

    Core idea: Current value can be expressed as a linear combination of past p values
    Mathematical expression: AR_part = φ₁·y_{t-1} + φ₂·y_{t-2} + ... + φₚ·y_{t-p}

    Parameters:
    -----------
    series : array-like
        Input series (typically the differenced stationary series)
    ar_order : int
        Autoregressive order p, indicating how many time steps to use
    ar_coeffs : array-like
        Autoregressive coefficients [φ₁, φ₂, ..., φₚ], length must equal ar_order

    Returns:
    --------
    ar_part : float
        AR component prediction value

    Example:
    --------
    >>> series = [1.0, 2.0, 3.0, 4.0]
    >>> ar_model(series, ar_order=2, ar_coeffs=[0.5, 0.3])
    # Computes: 0.5*4.0 + 0.3*3.0 = 2.0 + 0.9 = 2.9
    """
    ar_part = 0.0
    # Start from most recent past value, working backward ar_order steps
    for i in range(ar_order):
        # series[-(i+1)] is the (i+1)th element from the end (most recent past value)
        # ar_coeffs[i] is the corresponding coefficient
        ar_part += ar_coeffs[i] * series[-(i + 1)]
    return ar_part
def ma_model(errors, ma_order, ma_coeffs):
    """
    Compute the moving average (MA) component: correct prediction using weighted sum of historical errors

    Core idea: Past prediction errors contain useful information to improve current prediction
    Mathematical expression: MA_part = θ₁·ε_{t-1} + θ₂·ε_{t-2} + ... + θ_q·ε_{t-q}

    Parameters:
    -----------
    errors : list
        Historical prediction error sequence [ε₁, ε₂, ..., εₜ]
    ma_order : int
        Moving average order q, indicating how many past error terms to use
    ma_coeffs : array-like
        Moving average coefficients [θ₁, θ₂, ..., θ_q], length must equal ma_order

    Returns:
    --------
    ma_part : float
        MA component correction value

    Notes:
    ------
    - MA terms help the model capture "error patterns," like consecutive over/under-estimation
    - In initial stages when errors is empty or insufficient, MA part is 0
    """
    ma_part = 0.0
    # Start from most recent error, working backward ma_order steps
    for i in range(ma_order):
        # errors[-(i+1)] is the (i+1)th error from the end (most recent past error)
        ma_part += ma_coeffs[i] * errors[-(i + 1)]
    return ma_part
def arima(series, ar_order, ma_order, diff_order, ar_coeffs, ma_coeffs):
    """
    Manual implementation of ARIMA(p, d, q) model's complete prediction pipeline

    Workflow:
    1. Differencing: Apply d-order differencing to get stationary series
    2. Modeling: Build ARMA(p, q) model on stationary series
    3. Prediction: Combine AR and MA components for prediction

    Parameters:
    -----------
    series : array-like
        Original time series, shape (n,)
    ar_order : int
        AR component order p
    ma_order : int
        MA component order q
    diff_order : int
        Differencing order d, typically 1 (first-order differencing)
    ar_coeffs : array-like
        AR coefficients [φ₁, φ₂, ..., φₚ]
    ma_coeffs : array-like
        MA coefficients [θ₁, θ₂, ..., θ_q]

    Returns:
    --------
    predictions : list
        Predictions for the differenced series, length len(series_diff)

    Notes:
    ------
    - This is a simplified implementation; actual ARIMA requires MLE for parameters
    - Initial stages (t < max(p, q)) cannot compute full ARMA, returning 0
    - To get original series predictions, inverse differencing is needed
    """
    # Step 1: Differencing to eliminate trends
    series_diff = difference(series, diff_order)

    # Step 2: Initialize error and prediction sequences
    errors = []       # Store prediction errors at each time step
    predictions = []  # Store predictions

    # Step 3: Predict at each time step
    for t in range(len(series_diff)):
        # Only compute ARMA prediction when sufficient historical data exists
        if t >= max(ar_order, ma_order):
            # AR component: linear combination based on historical values
            ar_part = ar_model(series_diff[:t], ar_order, ar_coeffs)
            # MA component: linear combination based on historical errors
            ma_part = ma_model(errors[:t], ma_order, ma_coeffs)
            # Final prediction = AR part + MA part (note: no constant term c here)
            predicted_value = ar_part + ma_part
        else:
            # Initial stage: insufficient historical data, use 0 as initial prediction
            predicted_value = 0.0

        # Compute actual value and prediction error for current time step
        actual_value = series_diff[t]
        error = actual_value - predicted_value

        # Update error and prediction sequences
        errors.append(error)
        predictions.append(predicted_value)

    return predictions
# Usage example: Simulate a time series with trend
import numpy as np

# Generate random walk series (cumulative random noise, with clear trend)
data = np.random.randn(100).cumsum()

# Run the simplified ARIMA(2,1,1) with illustrative (not estimated) coefficients
predictions = arima(data, ar_order=2, ma_order=1, diff_order=1,
                    ar_coeffs=[0.5, 0.3], ma_coeffs=[0.2])
Role of Differencing: The differencing
operation y'_t = y_t − y_{t−1} eliminates linear trends. If the original series is
y_t = a + b·t + ε_t (linear trend + noise), first-order differencing yields
y'_t = b + ε_t − ε_{t−1}, removing the trend term.
AR and MA Synergy: The AR component captures
"influence of historical values on current value," while the MA
component captures "patterns in historical prediction errors." Together
they leverage both the series' autocorrelation and error
autocorrelation.
Recursive Nature of Errors: MA terms depend on
historical errors, which depend on historical predictions, forming a
recursive relationship. This requires sequential computation —
parallelization is not possible.
Common Questions:
Q: Why is the initial prediction 0?
A: When t < max(p, q),
there's insufficient historical data to compute AR or MA terms. In
practice, use the series mean or first few observations for
imputation.
Q: How to choose the p, d, q parameters?
A: Typically through ACF/PACF plots: PACF cuts off at lag p → AR(p); ACF cuts off at lag q → MA(q). Differencing order d is determined by ADF tests until the
series is stationary.
Q: Why are predictions for the differenced series? How to
recover original values?
A: The model operates on the differenced series; predictions need
inverse differencing: ŷ_t = ŷ'_t + y_{t−1}, where y'_t is the differenced series and y_t is the original.
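As a sketch of this inverse step, first-order differences can be undone by cumulatively adding them back onto the retained original values (invert_difference is an illustrative helper, not part of the pipeline above):

```python
def invert_difference(original_head, diff_series, order=1):
    """Undo lag-`order` differencing: y_t = y'_t + y_{t-order}.

    original_head supplies the first `order` original values that
    differencing discarded.
    """
    restored = list(original_head[:order])
    for i, d in enumerate(diff_series):
        # Each restored value is the difference plus the value `order` steps back
        restored.append(d + restored[i])
    return restored

series = [10.0, 12.0, 15.0, 19.0, 24.0]
diff = [series[i] - series[i - 1] for i in range(1, len(series))]
print(invert_difference(series[:1], diff))  # recovers the original series
```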
SARIMA extends ARIMA to handle time series with
seasonal periodicity. Its basic form is SARIMA(p, d, q)(P, D, Q,
m), where (p, d, q) represents the non-seasonal component and (P, D, Q, m) represents the seasonal
component.
- P: Seasonal autoregressive order.
- D: Seasonal differencing order.
- Q: Seasonal moving average order.
- m: Seasonal period length.
The SARIMA formula is:

Φ_P(B^m)·φ_p(B)·(1−B)^d·(1−B^m)^D·y_t = Θ_Q(B^m)·θ_q(B)·ε_t

where:
- B is the lag operator;
- Φ_i and Θ_i are seasonal autoregressive and moving average coefficients respectively.
SARIMA
Implementation: ARIMA Extension for Seasonal Patterns
Problem Context: Many time series contain not only
trends but also clear seasonal patterns (e.g., annual cycles in monthly
data, weekly cycles in daily data). Standard ARIMA only handles trends
and cannot capture "the influence of last year's same period on the current
value", i.e., this type of seasonal dependency.
Solution Approach: SARIMA adds seasonal components
to ARIMA. It first applies seasonal differencing (y'_t = y_t − y_{t−m}, where m is the
period length) to eliminate seasonality, then simultaneously builds
non-seasonal ARMA and seasonal ARMA models. The non-seasonal component
captures short-term dependencies, while the seasonal component captures
long-term cross-cycle dependencies.
Design Considerations:
1. Double Differencing: First apply regular differencing to eliminate trends, then seasonal differencing to eliminate seasonality; order matters
2. Lag Selection: Seasonal lag m must equal the data's period length (e.g., m = 12 for monthly data, m = 7 for daily data with a weekly cycle)
3. Parameter Separation: Non-seasonal and seasonal parameters are estimated independently to avoid interference
4. Data Requirements: At least 2m observations are needed for seasonal differencing and modeling
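The double-differencing order above can be sanity-checked on a synthetic, noise-free series (hypothetical data: a linear trend plus a period-4 seasonal pattern). After regular then seasonal differencing, nothing remains:

```python
# Synthetic series: linear trend + repeating seasonal pattern (period m=4), no noise
m = 4
pattern = [0.0, 3.0, -2.0, 1.0]
series = [0.5 * t + pattern[t % m] for t in range(24)]

# Step 1: regular (first-order) differencing removes the trend
d1 = [series[i] - series[i - 1] for i in range(1, len(series))]

# Step 2: seasonal differencing at lag m removes the repeating pattern
d1m = [d1[i] - d1[i - m] for i in range(m, len(d1))]

# With no noise, the double-differenced series is identically zero
print(all(abs(x) < 1e-9 for x in d1m))  # True
```

Note how the series shrinks at each step (24 → 23 → 19 points), which is exactly the data-requirement concern in point 4.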
def seasonal_difference(series, season_lag):
    """
    Perform seasonal differencing: eliminate seasonal components

    Problem: Series may contain periodic patterns (e.g., December sales always high), causing non-stationarity
    Solution: Eliminate seasonality through seasonal differencing y'_t = y_t - y_{t-m}
    Design: season_lag=m represents period length, e.g., monthly data m=12, weekly data m=7

    Parameters:
    -----------
    series : array-like
        Input series (typically already differenced for trends)
    season_lag : int
        Seasonal period length m. Examples:
        - Monthly data: m=12 (12 months per year)
        - Weekly data: m=52 (52 weeks per year)
        - Daily data: m=7 (7 days per week)

    Returns:
    --------
    diff_series : list
        Seasonally differenced series, length len(series) - season_lag

    Example:
    --------
    >>> series = [10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35, 38]
    >>> seasonal_difference(series, season_lag=12)
    # For monthly data, computes difference between current month and same month last year
    """
    diff_series = []
    # Start from season_lag position since earlier points cannot compute seasonal differences
    for i in range(season_lag, len(series)):
        # Seasonal differencing: current value minus value m periods ago
        # For monthly data m=12: diff[i] = series[i] - series[i-12] (this Jan - last Jan)
        value = series[i] - series[i - season_lag]
        diff_series.append(value)
    return diff_series
def sarima(series, ar_order, ma_order, diff_order, seasonal_order, seasonal_lag,
           ar_coeffs, ma_coeffs, seasonal_ar_coeffs, seasonal_ma_coeffs):
    """
    Manual implementation of SARIMA(p, d, q)(P, D, Q, m) model

    Model structure: SARIMA = Non-seasonal ARIMA + Seasonal ARIMA
    - Non-seasonal component: (p, d, q) handles trends and short-term dependencies
    - Seasonal component: (P, D, Q, m) handles periodic patterns and cross-cycle dependencies

    Workflow:
    1. Regular differencing: eliminate trends (d orders)
    2. Seasonal differencing: eliminate seasonality (D orders, here D=1)
    3. Modeling: build ARMA(p,q) and seasonal ARMA(P,Q) on double-differenced series

    Parameters:
    -----------
    series : array-like
        Original time series
    ar_order : int
        Non-seasonal AR order p
    ma_order : int
        Non-seasonal MA order q
    diff_order : int
        Non-seasonal differencing order d
    seasonal_order : int
        Seasonal AR/MA order P=Q (simplified; can differ in practice)
    seasonal_lag : int
        Seasonal period length m
    ar_coeffs : array-like
        Non-seasonal AR coefficients [φ₁, φ₂, ..., φₚ]
    ma_coeffs : array-like
        Non-seasonal MA coefficients [θ₁, θ₂, ..., θ_q]
    seasonal_ar_coeffs : array-like
        Seasonal AR coefficients [Φ₁, Φ₂, ..., Φ_P]
    seasonal_ma_coeffs : array-like
        Seasonal MA coefficients [Θ₁, Θ₂, ..., Θ_Q]

    Returns:
    --------
    predictions : list
        Predictions for double-differenced series

    Notes:
    ------
    - Simplified implementation; actual SARIMA requires complex parameter estimation
    - Seasonal AR/MA terms use lags that are multiples of m (e.g., t-m, t-2m, etc.)
    - Requires at least 2m data points for seasonal differencing
    """
    # Step 1: Regular differencing to eliminate trends
    series_diff = difference(series, diff_order)

    # Step 2: Seasonal differencing to eliminate seasonality (on already-differenced series)
    series_season_diff = seasonal_difference(series_diff, seasonal_lag)

    # Step 3: Initialize error and prediction sequences
    errors = []
    predictions = []

    # Step 4: Predict at each time step
    for t in range(len(series_season_diff)):
        # Need sufficient history: non-seasonal needs p/q steps, seasonal needs P/Q cycles
        if t >= max(ar_order, ma_order, seasonal_order * seasonal_lag):
            # Non-seasonal AR: based on most recent p time steps
            ar_part = ar_model(series_season_diff[:t], ar_order, ar_coeffs)
            # Non-seasonal MA: based on most recent q errors
            ma_part = ma_model(errors[:t], ma_order, ma_coeffs)
            # Seasonal AR: based on values m, 2m, ..., Pm cycles ago
            # Note: simplified here; actual should use lag multiples of m
            seasonal_ar_part = ar_model(series_season_diff[:t], seasonal_order, seasonal_ar_coeffs)
            # Seasonal MA: based on errors m, 2m, ..., Qm cycles ago
            seasonal_ma_part = ma_model(errors[:t], seasonal_order, seasonal_ma_coeffs)
            # Final prediction = non-seasonal AR + non-seasonal MA + seasonal AR + seasonal MA
            predicted_value = ar_part + ma_part + seasonal_ar_part + seasonal_ma_part
        else:
            # Initial stage: insufficient historical data
            predicted_value = 0.0

        # Compute error and update
        actual_value = series_season_diff[t]
        error = actual_value - predicted_value
        errors.append(error)
        predictions.append(predicted_value)

    return predictions
# Usage example: Simulate time series with seasonality
import numpy as np

# Generate example data (a random walk here; real data would contain a 12-month cycle)
data = np.random.randn(100).cumsum()

# Run the simplified SARIMA(1,1,1)(1,1,1,12) with illustrative coefficients
predictions = sarima(data, ar_order=1, ma_order=1, diff_order=1,
                     seasonal_order=1, seasonal_lag=12,
                     ar_coeffs=[0.5], ma_coeffs=[0.2],
                     seasonal_ar_coeffs=[0.3], seasonal_ma_coeffs=[0.1])
Necessity of Double Differencing: Regular
differencing eliminates trends, seasonal differencing eliminates
periodicity. Doing only seasonal differencing leaves trends; doing only
regular differencing leaves seasonal patterns. Both are needed for true
stationarity.
Seasonal Lag Selection: m must equal the actual period length of
your data. A wrong choice of m renders
seasonal differencing ineffective or introduces spurious patterns.
Identify m through ACF plots showing
peaks at lags m, 2m, 3m, ...
Non-seasonal and Seasonal Interaction: Both
components are modeled independently but act jointly. The non-seasonal
component captures "yesterday's influence on today," while the seasonal
component captures "last year same day's influence on today"; both
combine for complete prediction.
Common Questions:
Q: How to determine seasonal period length?
A: Determine by data frequency (monthly → 12, weekly → 7, daily → 24 or
7). Also verify through ACF plots showing peaks at lags m, 2m, ...
Q: How much data does SARIMA need?
A: At least 2m observations, and preferably several full seasonal cycles. For example, monthly data (m = 12) with SARIMA(1,1,1)(1,1,1,12) needs at
least 24 months of
data.
Q: What if series length drops significantly after seasonal
differencing?
A: This is normal. If data is insufficient, consider other methods
(like Prophet) or reduce seasonal order.
# Visualization: compare actual vs fitted values
# (assumes ts_data is a pandas Series, fitted_model a fitted SARIMA result,
#  and forecast a 12-step-ahead forecast)
import matplotlib.pyplot as plt
import pandas as pd

plt.figure(figsize=(12, 6))
plt.plot(ts_data, label='Actual')
plt.plot(fitted_model.fittedvalues, label='Fitted', linestyle='--')
plt.plot(pd.date_range(ts_data.index[-1], periods=13, freq='M')[1:],
         forecast, label='Forecast', marker='o')
plt.legend()
plt.title('SARIMA Model Fitting and Forecasting')
plt.show()
VAR (Vector AutoRegressive
Model)
VAR models are used for multivariate time series
modeling. They assume each time series depends not only on its own past
values but also on past values of other series.
The VAR model's mathematical expression is:

Y_t = A₁·Y_{t−1} + A₂·Y_{t−2} + ... + A_p·Y_{t−p} + ε_t

where:
- Y_t is a vector containing multiple time series;
- A₁, ..., A_p are lag coefficient matrices;
- ε_t is the error vector.
Each Y_t vector represents current
values of multiple time series, and A₁, ..., A_p are coefficient matrices describing
relationships among the series.
VAR
Implementation: Multivariate Time Series Modeling
Problem Context: In practice, multiple time series
often influence each other (e.g., GDP and unemployment, stock price and
volume, temperature and humidity). Univariate ARIMA models cannot
capture these cross-series dependencies. VAR models express each
variable's current value as a linear combination of all variables'
(including its own) historical values, simultaneously modeling multiple
series and their interactions.
Solution Approach: The core of VAR is "vector
autoregression": each variable's current value depends not only on its
own history but also on other variables' histories. For a VAR(p) model with k variables, we need to estimate k²·p coefficients,
describing "variable j's influence
on variable i at lag ℓ." The model solves for coefficient
matrices via least squares or maximum likelihood estimation.
Design Considerations:
1. Variable Selection: All variables must be stationary (or cointegrated), otherwise differencing is needed first
2. Lag Order: Select the optimal p via information criteria (AIC/BIC), typically p between 1 and 4
3. Coefficient Matrix Interpretation: element (i, j) of A_ℓ represents variable j's marginal effect on variable i at lag ℓ
4. Granger Causality Tests: Test whether one variable "Granger-causes" another (i.e., whether its historical values help prediction)
def var_model(data, lag):
    """
    Manual implementation of VAR(p) model: vector autoregression for multivariate time series

    Core idea: Each variable's current value = linear combination of all variables' historical values
    Mathematical expression: Y_t = A_1·Y_{t-1} + A_2·Y_{t-2} + ... + A_p·Y_{t-p} + ε_t
    where Y_t is a k-dimensional vector (k variables), A_i is a k×k coefficient matrix

    Parameters:
    -----------
    data : array-like, shape (n, k)
        Multivariate time series data
        - n: number of time steps
        - k: number of variables (here k=2, handling 2 time series)
        Example: [[y1_t0, y2_t0], [y1_t1, y2_t1], ..., [y1_tn, y2_tn]]
    lag : int
        Lag order p, indicating how many past time steps to use
        - p=1: use only previous 1 step (VAR(1))
        - p=2: use previous 2 steps (VAR(2))
        - Typical values: 1-4, selected via AIC/BIC

    Returns:
    --------
    predictions : list
        List of predictions, each element is a tuple of shape (k,)
        Length is n - lag (first lag time steps cannot be predicted)

    Notes:
    ------
    - Simplified implementation; actual VAR requires estimating coefficient matrices A_i (via OLS or MLE)
    - Here coefficients are assumed known (in practice, estimation is needed first)
    - All variables must be stationary (or cointegrated), otherwise differencing is needed
    - VAR models capture inter-variable influences and dynamic relationships

    Example:
    --------
    >>> data = np.column_stack((np.random.randn(100).cumsum(),
    ...                         np.random.randn(100).cumsum()))
    >>> predictions = var_model(data, lag=2)
    >>> # Each prediction is a (y1_pred, y2_pred) tuple
    """
    n = len(data)  # Number of time steps
    k = data.shape[1] if len(data.shape) > 1 else 1  # Number of variables

    # Initialize coefficient matrices (simplified: coefficients are 0 here; actual needs estimation)
    # coefficients[i][j]: variable i's coefficient on its own j-th lag (simplified here)
    coefficients = np.zeros((k, lag))  # (k, lag): k variables, lag lags

    # Store prediction errors and predictions
    errors = []       # Store prediction errors at each time step
    predictions = []  # Store predictions at each time step

    # Predict starting from position lag (first lag time steps lack sufficient history)
    for t in range(lag, n):
        # Extract actual values at current time step
        y1 = data[t][0]  # Variable 1's current value
        y2 = data[t][1]  # Variable 2's current value

        # VAR prediction: each variable's prediction = linear combination of all variables' histories
        # Variable 1's prediction: pred_y1 = Σ(A1_i * y1_{t-i}) + Σ(A2_i * y2_{t-i})
        # Simplified here: only using own historical values (actual VAR should include all variables' histories)
        pred_y1 = sum([coefficients[0][i] * data[t - i - 1][0] for i in range(lag)])
        pred_y2 = sum([coefficients[1][i] * data[t - i - 1][1] for i in range(lag)])

        # Compute prediction errors
        error_y1 = y1 - pred_y1  # Variable 1's prediction error
        error_y2 = y2 - pred_y2  # Variable 2's prediction error

        # Save errors and predictions
        errors.append((error_y1, error_y2))
        predictions.append((pred_y1, pred_y2))

    return predictions
# Usage example: Simulate two related multivariate time series
import numpy as np

# Generate example data: two random walk series (in practice should be stationary)
data = np.column_stack((np.random.randn(100).cumsum(),
                        np.random.randn(100).cumsum()))
lag = 2  # VAR(2) model: use previous 2 time steps

# Execute VAR prediction
predictions = var_model(data, lag)
print(f"Number of predictions: {len(predictions)}")
print(f"First 5 predictions: {predictions[:5]}")
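Since the simplified var_model above drops the cross-variable terms, the sketch below shows what a full VAR(1) one-step prediction looks like, where each variable's forecast also uses the other variable's lagged value (A1 is an assumed, illustrative coefficient matrix, not an estimated one):

```python
# Assumed VAR(1) coefficient matrix for 2 variables (illustrative values only)
# A1[i][j] = effect of variable j's previous value on variable i's current value
A1 = [[0.5, 0.1],
      [0.2, 0.4]]

y_prev = [1.0, 2.0]  # [y1_{t-1}, y2_{t-1}]

# Full VAR(1) prediction: each variable uses BOTH variables' histories
pred = [A1[i][0] * y_prev[0] + A1[i][1] * y_prev[1] for i in range(2)]
print(pred)  # approximately [0.7, 1.0]
```

The off-diagonal entries (0.1 and 0.2) are exactly what distinguishes VAR from running two independent AR models.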
Key Points:
Meaning of Vector Autoregression: The core of
VAR is "vector"— all variables are modeled simultaneously, with each
variable's equation containing lag terms of all variables. This differs
from univariate ARIMA: ARIMA only considers its own history; VAR
considers all variables' histories. For example, GDP's equation includes
not only GDP's historical values but also unemployment, inflation,
etc.
Economic Interpretation of Coefficient Matrices:
Element (i, j) of coefficient
matrix A_ℓ represents "variable j's marginal effect on variable i at lag ℓ."
If, for example, the coefficient of unemployment in GDP's equation is −0.3, a
1-unit increase in unemployment leads to a 0.3-unit decrease in
GDP next period. This interpretability makes VAR widely used in
economics.
Granger Causality: An important application of
VAR is testing Granger causality. If variable X's historical values help predict
variable Y (controlling for Y's own history), we say "X
Granger-causes Y." This doesn't equal true causality but
reveals predictive relationships.
Common Questions:
Q: How much data does VAR need?
A: At least k·p + 50 time
steps, where k is the number of variables
and p is the lag order. For example, a
2-variable VAR(2) needs at least 54 time steps.
Q: How to choose lag order?
A: Use information criteria (AIC/BIC) or likelihood ratio tests.
Typically start from, increase
gradually, and select thethat
minimizes AIC/BIC.
Q: Must variables be stationary?
A: Yes, all variables must be stationary (or cointegrated). If
non-stationary, difference them first before fitting the VAR model; if
cointegrated, use a VECM (Vector Error Correction Model).
Q: How many variables can VAR handle?
A: Theoretically unlimited, but as k increases, the parameter count k²·p grows rapidly. Typically k ≤ 10; for more than 10 variables,
consider dimensionality reduction or factor models.
# Practical application: Complete VAR modeling with statsmodels
from statsmodels.tsa.vector_ar.var_model import VAR
import pandas as pd

# Prepare multivariate data (e.g., GDP, unemployment rate, inflation rate)
data_dict = {
    'GDP': gdp_data,                    # GDP series
    'Unemployment': unemployment_data,  # Unemployment series
    'Inflation': inflation_data         # Inflation series
}
data_df = pd.DataFrame(data_dict)

# Fit VAR model
model = VAR(data_df)

# Automatically select optimal lag order
lag_order = model.select_order(maxlags=10)
print(f"Optimal lag orders: {lag_order.selected_orders}")

# Fit using optimal lag order
results = model.fit(maxlags=lag_order.selected_orders['aic'])
print(results.summary())
GARCH (Generalized AutoRegressive Conditional
Heteroskedasticity)
GARCH models are used for modeling conditional
heteroskedasticity in time series, particularly suited for volatility
forecasting in financial data. GARCH models use past residuals and past
variance to predict future variance.
The core GARCH(1, 1) formula is:

σ²_t = ω + α·ε²_{t−1} + β·σ²_{t−1}

where:
- σ²_t is the conditional variance at time t;
- ω, α, β are model parameters;
- ε_t is the residual sequence.

The model recursively computes σ²_t to predict series volatility.
GARCH
Implementation: Volatility Modeling and Risk Prediction
Problem Context: A key characteristic of financial
time series (like stock prices, exchange rates) is "volatility
clustering"— large fluctuations tend to follow large fluctuations, and
small fluctuations tend to follow small fluctuations. Traditional ARIMA
assumes constant variance (homoskedasticity) and cannot capture this
time-varying volatility. GARCH models the conditional variance as a
function of historical squared residuals and conditional variance,
enabling prediction of "how volatile tomorrow will be"— critical for
risk management.
Solution Approach: The core idea of GARCH is
"conditional heteroskedasticity": variance is not constant but changes
over time. GARCH(1,1) uses three components to predict conditional
variance: 1) constant term ω (long-run average variance), 2)
ARCH term α·ε²_{t−1} (captures "yesterday's shock's effect on
today's volatility"), 3) GARCH term β·σ²_{t−1} (captures "yesterday's volatility's effect on
today's volatility"). Parameters must satisfy α + β < 1 (stationarity
condition).
Design Considerations:
1. Mean Model: GARCH is typically combined with ARMA; first model the mean (e.g., ARMA-GARCH), then model the residual variance
2. Parameter Constraints: ω > 0, α ≥ 0, β ≥ 0, α + β < 1 (ensures positive and stationary variance)
3. Initial Variance: Usually initialized with sample variance or unconditional variance
4. Application Scenarios: Primarily for financial data (stocks, forex, returns); limited effectiveness for non-financial data
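As a minimal sketch of the recursion under the design points above (assuming zero-mean returns so the residual equals the raw return, and with illustrative, not estimated, coefficients):

```python
def garch_11_variance(returns, omega, alpha, beta):
    """Recursively compute sigma^2_t = omega + alpha*eps^2_{t-1} + beta*sigma^2_{t-1}."""
    n = len(returns)
    mean = sum(returns) / n
    # Initial variance: sample variance (a common initialization)
    sigma2 = [sum((r - mean) ** 2 for r in returns) / n]
    for t in range(1, n):
        # Zero-mean assumption: the residual eps_{t-1} is the raw return
        sigma2.append(omega + alpha * returns[t - 1] ** 2 + beta * sigma2[t - 1])
    return sigma2

# Illustrative parameters satisfying omega > 0, alpha, beta >= 0, alpha + beta < 1
sigma2 = garch_11_variance([1.0, 2.0, 0.5], omega=0.1, alpha=0.1, beta=0.8)
print([round(s, 4) for s in sigma2])  # [0.3889, 0.5111, 0.9089]
```

Note how the large return 2.0 pushes the next conditional variance up through the ARCH term, which is the volatility-clustering mechanism in code form.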
Mathematical Expression of Volatility
Clustering: GARCH captures volatility clustering through
the β·σ²_{t−1} term.
If β is close to 1 (e.g., 0.9),
yesterday's volatility σ²_{t−1} strongly influences today's
volatility σ²_t, leading to
"large fluctuations following large fluctuations." This is typical of
financial data: during market panic, volatility keeps amplifying; during
calm periods, volatility stays subdued.
ARCH vs GARCH: ARCH models only use squared
residuals ε²_{t−i} to predict
variance; GARCH also uses lagged conditional variance σ²_{t−j}. GARCH's advantage is
fewer parameters (GARCH(1,1) is equivalent to an infinite-order ARCH) and
better capture of volatility persistence. In practice, GARCH(1,1) usually
suffices; one rarely needs GARCH(p,q) with p > 1 or q > 1.
Economic Interpretation of Parameters: α + β measures the "decay rate of
volatility shocks." If α + β = 0.95, volatility shocks decay at 5% per period (half-life about
14 days). If α + β approaches 1, volatility has long memory (approaches
IGARCH), suitable for modeling extreme events like financial
crises.
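The half-life quoted above follows from solving (α + β)^h = 0.5; a quick check under the assumption α + β = 0.95:

```python
import math

persistence = 0.95  # alpha + beta
# Solve persistence**h == 0.5 for h
half_life = math.log(0.5) / math.log(persistence)
print(round(half_life, 1))  # 13.5, i.e., roughly 14 periods
```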
Common Questions:
Q: What does GARCH predict?
A: Predicts conditional variance (squared
volatility), not returns themselves. To predict returns, combine with a
mean model (e.g., ARMA-GARCH).
Q: How to choose GARCH order (p,q)?
A: Usually GARCH(1,1) suffices. If residual tests indicate higher
order needed, try GARCH(2,1) or GARCH(1,2), but parameters grow
quickly.
Q: Does GARCH apply to non-financial data?
A: Usually not. GARCH assumes "volatility clustering," common in
financial data but rare in sales, temperature, etc. Non-financial data
typically has relatively stable variance.
Q: How to estimate GARCH parameters?
A: Use maximum likelihood estimation (MLE). Requires assuming
residual distribution (usually normal or t-distribution). In practice,
use the arch library or statsmodels.
Q: How far ahead can GARCH predict?
A: Any number of steps, but long-term forecasts converge to
unconditional variance. Typically
used for 1-5 step short-term volatility forecasting.
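The unconditional (long-run) variance that long-horizon forecasts converge to is ω / (1 − α − β) for a stationary GARCH(1,1); a quick check with illustrative parameters:

```python
omega, alpha, beta = 0.1, 0.1, 0.8  # illustrative, stationary (alpha + beta < 1)

# Long-run variance of a stationary GARCH(1,1)
long_run_var = omega / (1 - alpha - beta)
print(round(long_run_var, 4))  # 1.0
```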
# Practical application: Complete GARCH modeling with the arch library
from arch import arch_model
import pandas as pd

# Prepare return data (returns is assumed to be an array of daily returns)
returns_data = pd.Series(returns,
                         index=pd.date_range('2020-01-01', periods=len(returns), freq='D'))

# Fit GARCH(1,1) model
# mean='Zero': assume mean is 0 (can also use 'AR' or 'Constant')
# vol='GARCH': use GARCH model for volatility
# p=1, q=1: GARCH(1,1)
model = arch_model(returns_data, mean='Zero', vol='GARCH', p=1, q=1)
fitted_model = model.fit(disp='off')

# View model summary
print(fitted_model.summary())
Exponential Smoothing
Exponential Smoothing is a smoothing technique for
time series data that assigns higher weights to more recent data points,
commonly used for forecasting stationary series. Exponential smoothing
has multiple forms, including single, double, and triple exponential
smoothing (the Holt-Winters method).
Single Exponential Smoothing
The formula is:

S_t = α·Y_t + (1−α)·S_{t−1}

where:
- S_t is the smoothed value;
- α is the smoothing coefficient, 0 ≤ α ≤ 1;
- Y_t is the current observation.
Single
Exponential Smoothing Implementation: Recursive Form of Weighted Moving
Average
Problem Context: A simple moving average assigns equal
weights to all historical data, but in practice "more recent data is
more important." Single exponential smoothing achieves this through
exponentially decaying weights: the current observation has weight α, and the observation k steps back has weight α(1−α)^k, forming exponential
decay.
Solution Approach: Use a recursive update strategy,
maintaining only one smoothed value S_t at a time. New values are computed as
weighted averages of the current observation and the previous smoothed value. Weight α controls sensitivity to new
information: α close to 1
responds quickly to changes; α close to 0 is smoother but slower
to respond.
Design Considerations:
1. Initial Value Choice: Typically use the first observation, or the mean of the first few observations
2. Choosing α: Select via cross-validation or grid search, typically in range 0.1-0.3
3. Forecast Formula: The future h-step forecast is Ŷ_{t+h} = S_t (constant forecast)
4. Applicable Scenarios: Only for trend-free, seasonality-free stationary series
def single_exponential_smoothing(data, alpha):
    """
    Manual implementation of Single Exponential Smoothing (Simple Exponential Smoothing)

    Core idea: Assign exponentially decaying weights to historical data; more recent = higher weight
    Mathematical expression: S_t = α·Y_t + (1-α)·S_{t-1}
    Weight distribution: current observation weight α, k steps ago weight α(1-α)^k (exponential decay)

    Parameters:
    -----------
    data : array-like
        Input time series, shape (n,)
    alpha : float
        Smoothing coefficient, range [0, 1]
        - α close to 1: fast response to new data, but higher fluctuation (for rapidly changing series)
        - α close to 0: better smoothing, but slower response (for stable series)
        - Typical values: 0.1-0.3

    Returns:
    --------
    smoothed_data : list
        Smoothed series, same length as input

    Notes:
    ------
    - Initial value uses first observation: S_0 = Y_0
    - Forecast formula: future h-step forecast = S_t (constant forecast)
    - Only for trend-free, seasonality-free stationary series

    Example:
    --------
    >>> data = [10, 12, 11, 13, 12, 14, 13]
    >>> smoothed = single_exponential_smoothing(data, alpha=0.3)
    >>> # S_1 = 0.3*12 + 0.7*10 = 10.6
    >>> # S_2 = 0.3*11 + 0.7*10.6 = 10.72
    """
    # Initialize: use first observation as initial smoothed value
    smoothed_data = [data[0]]

    # Recursive update: compute new smoothed value at each time step
    for t in range(1, len(data)):
        # Core formula: new smoothed value = α·current observation + (1-α)·previous smoothed value
        # This is equivalent to an exponentially weighted average of all historical observations
        smoothed_value = alpha * data[t] + (1 - alpha) * smoothed_data[-1]
        smoothed_data.append(smoothed_value)

    return smoothed_data
# Usage example
data = np.random.randn(100).cumsum()
alpha = 0.2  # Smaller α produces smoother results
smoothed_data = single_exponential_smoothing(data, alpha)

print(f"Original data std: {np.std(data):.4f}")
print(f"Smoothed data std: {np.std(smoothed_data):.4f}")
Key Points:
Mathematical Nature of Exponential Decay:
Expanding the recursion gives S_t = α·Σ_{k=0}^{t-1} (1-α)^k·Y_{t-k} + (1-α)^t·Y_0, with
weights α(1-α)^k decaying
exponentially. For example, with α = 0.2, the current weight is 0.2, 1 step
back is 0.16, 2 steps back is 0.128: the weights decay quickly.
The α Trade-off: Large α makes the model adapt quickly to
changes but is noise-sensitive; small α gives better smoothing but may lag.
Practical choice balances "tracking ability" and "smoothness."
Forecast Limitations: Single exponential
smoothing can only make constant forecasts (future value = current
smoothed value), unable to capture trends or seasonality — its biggest
limitation.
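The exponential weight decay described above is easy to check numerically. A short sketch (α = 0.2, example values are my own):

```python
alpha = 0.2

# Weight of the observation k steps back under single exponential smoothing:
# w_k = alpha * (1 - alpha)**k, an exponential decay
weights = [alpha * (1 - alpha) ** k for k in range(5)]
print(weights)  # approximately [0.2, 0.16, 0.128, 0.1024, 0.08192]
```

Together with the initial value's weight (1-α)^t, these weights sum to 1, so the smoothed value is a genuine weighted average of the history.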
Common Questions:
Q: How to choose the optimal α?
A: Use time series cross-validation, selecting the α that minimizes validation set MSE.
Or use statsmodels' automatic optimization.
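As a concrete sketch of the grid-search route (the helper name `ses_one_step_sse` is my own), each candidate α can be scored by its one-step-ahead forecast errors on the training series:

```python
def ses_one_step_sse(data, alpha):
    """Sum of squared one-step-ahead forecast errors for a given alpha.

    Under single exponential smoothing the forecast for Y_t is S_{t-1},
    so the one-step error at time t is Y_t - S_{t-1}.
    """
    s = data[0]  # S_0 = Y_0
    sse = 0.0
    for t in range(1, len(data)):
        sse += (data[t] - s) ** 2              # error of forecasting Y_t before seeing it
        s = alpha * data[t] + (1 - alpha) * s  # then update the smoothed value
    return sse

# Grid search: pick the alpha with the smallest one-step-ahead SSE
data = [10, 12, 11, 13, 12, 14, 13, 15, 14, 16]
best_alpha = min((a / 10 for a in range(1, 10)),
                 key=lambda a: ses_one_step_sse(data, a))
print(best_alpha)
```

For a proper out-of-sample estimate, compute the errors only over a held-out tail of the series rather than the whole history.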
Q: When is single exponential smoothing
appropriate?
A: Only for trend-free, seasonality-free stationary
series. For obvious trends, use double exponential smoothing; for
seasonality, use Holt-Winters.
Q: Does initial value choice matter?
A: For long series, impact is minimal since the initial value's weight (1-α)^t decays quickly. For short
series, use the mean of the first few observations for initialization.
Double exponential smoothing predicts series with linear trends by
introducing a trend term. The formulas are:

$$L_t = \alpha Y_t + (1-\alpha)(L_{t-1} + T_{t-1})$$
$$T_t = \beta (L_t - L_{t-1}) + (1-\beta) T_{t-1}$$

where:
- L_t is the level term and T_t is the trend term;
- β is the trend smoothing
coefficient.
Double
Exponential Smoothing Implementation: Holt Method for Capturing Linear
Trends
Problem Context: Single exponential smoothing
assumes no trend, but many real series show clear upward or downward
trends (e.g., year-over-year sales growth, long-term stock
appreciation). Using single exponential smoothing on trending series
causes systematic lag in forecasts.
Solution Approach: Holt double exponential smoothing
introduces an independent trend term, separately smoothing "level" and
"trend" components. Level L_t represents the series' current value;
trend T_t represents the rate of
change (slope). For forecasting, the h-step-ahead prediction is ŷ_{t+h} = L_t + h·T_t:
current level plus linear extrapolation of the trend.
Design Considerations: 1. Level
Update: New level = α ·
current observation + (1-α) · (previous level + previous
trend), embodying the "trend continuation" assumption 2. Trend
Update: New trend = β ·
level change + (1-β) · previous
trend, smoothing trend changes 3. Initial Values: Level
uses the first observation; trend uses the difference of the first two observations
4. Parameter Selection: α controls level update speed, β controls trend update speed; both are
typically small (0.1-0.3)
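Combining these four points, a minimal Holt implementation might look like the following sketch (the function name and return signature are my own):

```python
def double_exponential_smoothing(data, alpha, beta):
    """Holt's linear method: separately smooth level and trend.

    L_t = alpha*Y_t + (1-alpha)*(L_{t-1} + T_{t-1})
    T_t = beta*(L_t - L_{t-1}) + (1-beta)*T_{t-1}
    """
    # Initial values: level = first observation, trend = first difference
    level, trend = data[0], data[1] - data[0]
    result = [data[0]]
    for t in range(1, len(data)):
        prev_level = level
        # Level update: blend observation with the trend-continued previous level
        level = alpha * data[t] + (1 - alpha) * (prev_level + trend)
        # Trend update: blend the observed level change with the previous trend
        trend = beta * (level - prev_level) + (1 - beta) * trend
        result.append(level)
    return result, level, trend

# h-step forecast: linear extrapolation  y_hat(t+h) = L_t + h*T_t
smoothed, last_level, last_trend = double_exponential_smoothing(
    [10, 12, 14, 16, 18, 20], alpha=0.5, beta=0.3)
forecast = [last_level + h * last_trend for h in range(1, 4)]
print(forecast)  # approximately [22, 24, 26] for this perfectly linear series
```

On a perfectly linear input the level tracks the data exactly and the trend stays at the slope, so the forecasts continue the line.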
# Forecast next 5 steps
last_level = smoothed_data[-1] - (smoothed_data[-1] - smoothed_data[-2])  # Approximate current level
last_trend = smoothed_data[-1] - smoothed_data[-2]  # Approximate current trend
future_forecast = [last_level + h * last_trend for h in range(1, 6)]
print(f"Next 5 steps forecast: {future_forecast}")
Key Points:
Intuition Behind Level Update: In the level
formula, L_{t-1} + T_{t-1} represents "what the current value should be if the trend
continues." If the actual observation Y_t is close to this, the trend is stable; if
different, the trend may be changing.
Smooth Trend Updates: Trend itself also changes
(accelerates or decelerates), but smoothly. β controls trend change response
speed: large β adapts quickly to
trend changes; small β makes the
trend more stable but may lag.
Linear Extrapolation Assumption: The forecast
formula ŷ_{t+h} = L_t + h·T_t assumes the trend remains constant in the future. This is
usually reasonable for short-term forecasts (small h) but may be inaccurate for
the long term.
Common Questions:
Q: How to choose α and β?
A: Select via time series cross-validation the combination
minimizing validation set MSE. Usually β is slightly smaller than α, since trend changes are typically
smoother than level changes.
Q: Can double exponential smoothing handle nonlinear
trends?
A: No. Double exponential smoothing assumes linear trends. For
nonlinear trends (e.g., exponential growth), log-transform first then
apply, or use other methods (like Prophet).
Q: How to compute forecast confidence intervals?
A: Double exponential smoothing forecast error variance grows with
time. Use state-space model framework for theoretical confidence
intervals, or Bootstrap methods.
# Practical application: Double exponential smoothing with statsmodels
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Prepare data
ts_data = pd.Series(trend_data, index=pd.date_range('2020-01-01', periods=100, freq='D'))

# Fit double exponential smoothing model (trend='add' means additive trend)
model = ExponentialSmoothing(ts_data, trend='add', seasonal=None)
fitted_model = model.fit(optimized=True)
Triple exponential smoothing (Holt-Winters method) is an extended
exponential smoothing technique specifically for handling time series
with both trends and seasonal fluctuations. It considers not only the
series' current level and trend but also captures periodic seasonal
variations. This method is commonly used in finance, retail, and energy
sectors for forecasting time series with repeating patterns.
Model Components
The Holt-Winters method consists of three parts:
Level: Represents the current estimated level
value.
Trend: Represents the linear trend of the time
series.
Seasonality: Represents the seasonal periodic
fluctuations in the time series.
Each component updates based on the previous time step's estimates,
similar to single and double exponential smoothing. Triple exponential
smoothing aims to capture dynamic changes in these components and use
them for future predictions.
Formulas
In the Holt-Winters method, we have three core formulas to update
level, trend, and seasonality components.
Level Update Formula (Level equation):

$$L_t = \alpha \frac{Y_t}{S_{t-m}} + (1-\alpha)(L_{t-1} + T_{t-1})$$

where:
- L_t is the level estimate at
time t;
- Y_t is the actual observation at time t;
- S_{t-m} is the seasonal factor from m periods ago (m is the seasonal period length);
- α is the level smoothing
coefficient, 0 ≤ α ≤ 1.
Trend Update Formula (Trend equation):

$$T_t = \beta (L_t - L_{t-1}) + (1-\beta) T_{t-1}$$

where:
- T_t is the trend estimate at
time t;
- β is the trend smoothing
coefficient, 0 ≤ β ≤ 1.
Seasonality Update Formula (Seasonal equation):

$$S_t = \gamma \frac{Y_t}{L_t} + (1-\gamma) S_{t-m}$$

where:
- S_t is the seasonal factor at
time t;
- γ is the seasonal smoothing
coefficient, 0 ≤ γ ≤ 1.
Forecast Formula
Once we have the level, trend, and seasonality components at time t via the above formulas, we can forecast
future values.
For forecasting h steps
ahead (multiplicative model):

$$\hat{y}_{t+h} = (L_t + h T_t) \cdot S_{t-m+1+((h-1) \bmod m)}$$

where:
- ŷ_{t+h} is the forecast
for time t+h;
- L_t is the level estimate at time t;
- T_t is the trend estimate at time t;
- S_{t-m+1+((h-1) mod m)} is the seasonal factor
corresponding to step t+h.
Holt-Winters
Triple Exponential Smoothing Implementation: Handling Both Trends and
Seasonality
Problem Context: Real time series often contain both
trends and seasonality (e.g., retail sales with both long-term growth
trends and annual cyclical fluctuations). Double exponential smoothing
only handles trends and cannot capture "December sales are always high"
type seasonal patterns. Holt-Winters introduces seasonal factors, decomposing the series into three
independent components: level, trend, and seasonality.
Solution Approach: Use a "deseasonalize → smooth →
add back seasonality" strategy. When updating level, first divide by
seasonal factor to remove seasonal influence, getting "deseasonalized"
values; then update level and trend as in double exponential smoothing;
finally update seasonal factors separately. When forecasting, multiply
the level + trend linear extrapolation by corresponding seasonal
factors.
Design Considerations: 1. Periodicity of
Seasonal Factors: Need to maintain m seasonal factors (m is the seasonal period length), corresponding to
each position within one complete cycle 2.
Deseasonalization: The Y_t / S_{t-m} in the level update formula
represents "value after removing seasonality"; this correctly estimates
level 3. Initial Value Estimation: Need at least one full cycle (m observations) to initialize level and
trend; seasonal factors are estimated from the first cycle's data 4.
Additive vs Multiplicative Models: This implements the
multiplicative model (seasonality appears as ratios), suitable when
seasonal amplitude grows with level
Necessity of Deseasonalization: The Y_t / S_{t-m} in the level update is key.
Without dividing by the seasonal factor, seasonal fluctuations would be
mistaken for level changes, causing biased level estimates. For example,
if December sales are always high (seasonality), not dividing by the
seasonal factor would overestimate December's level.
Periodic Storage of Seasonal Factors: Need to
maintain m seasonal factors
corresponding to each position within the cycle. When forecasting, use
the modulo operation (h-1) mod m to find the
corresponding seasonal factor. For example, with m = 12, forecasting month 13 uses
month 1's seasonal factor.
Multiplicative vs Additive Models: This
implements the multiplicative model (ŷ = (L + hT) · S), suitable when seasonal amplitude grows with level.
For constant seasonal amplitude (like temperature: summer is always 20°
higher than winter), use the additive model (ŷ = L + hT + S).
Common Questions:
Q: How to decide between multiplicative and additive
models?
A: If seasonal fluctuation amplitude grows with series level (e.g.,
sales: higher sales, larger fluctuations), use multiplicative; if
seasonal fluctuation is constant (e.g., temperature), use additive.
Determine through visualization.
Q: Does initial value choice significantly impact
results?
A: For long series (>3 cycles), impact is minimal since
exponential smoothing quickly "forgets" initial values. For short
series, initial value choice matters — consider more robust
initialization methods.
Q: How to choose α, β, γ?
A: Use time series cross-validation to select the combination minimizing
validation set MSE. Or use statsmodels' automatic
optimization. Usually β < α < γ, since trend changes most slowly and seasonality changes
fastest.
# Original data
axes[0].plot(ts_data, 'o-', label='Original Data', linewidth=2)
axes[0].set_title('Original Time Series')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
Prophet is a time series forecasting model developed
by Facebook, capable of handling complex time series data with
seasonality, trends, and holiday effects.
The Prophet model can be decomposed into three parts: trend,
seasonality, and holiday effects. The model form is:

$$y(t) = g(t) + s(t) + h(t) + \epsilon_t$$

where:
- g(t) is the trend component,
capturing long-term changes;
- s(t) is the seasonality component,
representing periodic variations;
- h(t) is the holiday effect;
- ε_t is the error term.
Prophet
Implementation: Automated Time Series Decomposition and Forecasting
Problem Context: Traditional time series models
(like ARIMA, Holt-Winters) require extensive manual tuning and domain
knowledge: choosing differencing order, identifying seasonal periods,
handling outliers, etc. Prophet's design goal is "out of the box"— users
only need to provide time series data, and the model automatically
handles trend change points, multiple seasonal periods
(daily/weekly/yearly), holiday effects, and outliers. It's especially
suitable for rapid prototyping and automated forecasting in business
scenarios.
Solution Approach: Prophet adopts an "additive
decomposition" approach, decomposing the time series into three
interpretable components: 1) trend g(t) (using piecewise linear or logistic
growth models with automatic change point detection), 2)
seasonality s(t) (using Fourier
series to model multiple periods like daily/weekly/yearly), 3) holiday
effects h(t) (the user provides a holiday
list and the model learns its impact). For forecasting, the three components
are added together. The model uses Bayesian methods for parameter
estimation and provides uncertainty intervals.
Design Considerations: 1. Trend Change Point
Detection: Prophet automatically detects trend change points
(e.g., product launches, policy changes), allowing trend to have
different growth rates in different segments 2. Multiple
Seasonality Support: Can simultaneously model multiple seasonal
periods (e.g., daily + weekly + yearly cycles for daily data) without
manual specification 3. Robustness: Robust to outliers
and missing values, using M-estimator instead of least squares 4.
Interpretability: Decomposition results are visualized,
users can understand contributions from trend, seasonality, and
holidays
Automatic Change Point Detection: Prophet's core
advantage is automatic trend change point detection. If the series
suddenly changes trend at some point (e.g., product launch causing sales
growth acceleration), Prophet automatically identifies and uses
different growth rates before and after that point. Users can specify
prior change points or let the model auto-detect. This is more flexible
than ARIMA, which assumes fixed trend (eliminated through
differencing).
Fourier Series for Seasonality: Prophet uses
a Fourier series, s(t) = Σ_{n=1}^{N} (a_n cos(2πnt/P) + b_n sin(2πnt/P)), for seasonality
modeling, where P is the period length
and N is the number of terms (controls
smoothness). This is more flexible than Holt-Winters: it can model multiple
periods (daily/weekly/yearly) without needing complete cycle data. For
example, Prophet can estimate yearly seasonality even with only 3 months
of data (though inaccurately).
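The Fourier expansion can be sketched directly (an illustration of the idea, not Prophet's internal code; the helper name is my own):

```python
import math

def fourier_features(t, period, n_terms):
    """Fourier basis for seasonality: [sin(2*pi*n*t/P), cos(2*pi*n*t/P)] for n=1..N.

    Prophet-style seasonality is a learned linear combination of such terms;
    a larger n_terms allows wigglier (less smooth) seasonal shapes.
    """
    feats = []
    for n in range(1, n_terms + 1):
        angle = 2 * math.pi * n * t / period
        feats.append(math.sin(angle))
        feats.append(math.cos(angle))
    return feats

# Weekly seasonality (period=7) with 3 Fourier terms -> 6 features per time point
x = fourier_features(t=3, period=7, n_terms=3)
print(len(x))  # 6
```

A regression of the (detrended) series on these features then yields the coefficients a_n, b_n that define the seasonal curve.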
Uncertainty Quantification: Prophet provides not
just point forecasts but also uncertainty intervals. The model assumes
trend change rates and seasonality parameters come from prior
distributions, obtaining posterior distributions through MCMC sampling
to quantify forecast uncertainty. This is important for business
decisions: can assess risk range of "prediction ± uncertainty."
Common Questions:
Q: How much data does Prophet need?
A: At least 2 complete cycles (e.g., at least 24 months for monthly
data). But Prophet's data requirements are minimal — can work with just
dozens of data points (though accuracy is limited).
Q: Can Prophet handle missing values?
A: Yes. Prophet automatically handles missing values, but recommend
manual handling (e.g., interpolation) before fit() for better
results.
# 2. Add holiday effects (optional)
# Example: Chinese New Year, National Day, etc.
holidays = pd.DataFrame({
    'holiday': 'Chinese_New_Year',
    'ds': pd.to_datetime(['2022-02-01', '2023-01-22']),  # CNY dates
    'lower_window': -2,  # 2 days before also affected
    'upper_window': 2    # 2 days after also affected
})

# 3. Create and configure model
model = Prophet(
    yearly_seasonality=True,      # Yearly seasonality
    weekly_seasonality=True,      # Weekly seasonality
    daily_seasonality=False,      # Daily seasonality (usually not needed for daily data)
    holidays=holidays,            # Holiday effects
    seasonality_mode='additive',  # Additive model
    changepoint_prior_scale=0.05  # Change point prior (smaller = fewer change points)
)

# 4. Fit model
model.fit(data)

# 5. Create future time frame and forecast
future = model.make_future_dataframe(periods=90, freq='D')  # Forecast next 90 days
forecast = model.predict(future)
Kalman Filter is a recursive estimation algorithm
for time-varying systems, widely used in noisy systems for smoothing and
prediction. The key to Kalman filtering is updating estimates through
the difference between state variable estimates and observations.
The Kalman filter state update (prediction) formulas are:

$$\hat{x}_{t|t-1} = F \hat{x}_{t-1|t-1} + B u_t$$
$$P_{t|t-1} = F P_{t-1|t-1} F^T + Q$$

The observation update formulas are:

$$K_t = P_{t|t-1} H^T (H P_{t|t-1} H^T + R)^{-1}$$
$$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t (y_t - H \hat{x}_{t|t-1})$$
$$P_{t|t} = (I - K_t H) P_{t|t-1}$$

where:
- x̂_{t|t-1} is the predicted state
at time t;
- P_{t|t-1} is the prediction error
covariance matrix;
- K_t is the
Kalman gain;
- y_t is the
observation;
- H is the observation
model matrix.
The advantage of Kalman filter is its ability to recursively update
state estimates, making it suitable for real-time systems.
Kalman
Filter Implementation: Recursive Estimation for State-Space Models
Problem Context: In many real systems (like GPS
positioning, robot navigation, sensor fusion), we cannot directly
observe the true system state — only infer it from noisy observations.
For example, GPS positions have errors, temperature sensors have noise,
and a stock's true value cannot be directly observed. Kalman filtering
uses a "predict-update" recursive framework, combining system dynamics
models with observation models to make optimal state estimates under
uncertainty.
Solution Approach: The core of Kalman filtering is
"Bayesian updating": each time step divides into two phases: 1)
Prediction step: predict the next state based on the system
dynamics model (state transition matrix F), and update the uncertainty
(covariance P); 2) Update
step: when a new observation arrives, compute the Kalman gain K_t (balancing prediction and observation
reliability), then fuse prediction and observation into the optimal estimate. The
Kalman gain formula K_t = P_{t|t-1} H^T (H P_{t|t-1} H^T + R)^{-1} automatically balances "prediction
uncertainty" and "observation uncertainty."
Design Considerations: 1. State-Space
Model: Need to define the state transition equation (x_t = F x_{t-1} + B u_t + w_t) and
observation equation (y_t = H x_t + v_t) 2. Noise Covariances: Process noise Q (system uncertainty) and observation
noise R (sensor uncertainty) need
appropriate settings 3. Initial State: The initial
state x_0 and initial covariance P_0 affect early estimates; typically P_0 is set large, indicating high
uncertainty 4. Real-time Performance: Kalman filtering
is recursive; each time step needs only O(d³) computation (d is the state dimension), suitable for
real-time applications
def kalman_filter(data, F, B, H, Q, R, initial_state, initial_covariance):
    """
    Manual implementation of 1D Kalman filter: recursive estimation for state-space models

    Core idea: Through a "predict-update" cycle, fuse system dynamics and observation
    information for an optimal state estimate

    Mathematical framework:
    - State transition: x_t = F·x_{t-1} + B·u_t + w_t   (w_t ~ N(0, Q))
    - Observation equation: y_t = H·x_t + v_t           (v_t ~ N(0, R))
    - Prediction step:
        x_{t|t-1} = F·x_{t-1|t-1} + B·u_t
        P_{t|t-1} = F·P_{t-1|t-1}·F^T + Q
    - Update step:
        K_t = P_{t|t-1}·H^T·(H·P_{t|t-1}·H^T + R)^{-1}
        x_{t|t} = x_{t|t-1} + K_t·(y_t - H·x_{t|t-1})
        P_{t|t} = (I - K_t·H)·P_{t|t-1}

    Parameters:
    -----------
    data : array-like, shape (n,)
        Observation sequence (noisy measurements)
        Example: GPS positions, sensor readings, stock price observations
    F : float
        State transition matrix (scalar for 1D)
        - F=1: state unchanged (e.g., position tracking)
        - F>1: state growing (e.g., cumulative process)
        - F<1: state decaying (e.g., decay process)
    B : float
        Control input matrix (scalar for 1D)
        - B=0: no external control input
        - B≠0: has external control (e.g., acceleration, thrust)
    H : float
        Observation matrix (scalar for 1D, usually H=1)
        - H=1: directly observing state (e.g., position observing position)
        - H≠1: indirect observation (e.g., velocity observing position)
    Q : float
        Process noise covariance (system uncertainty)
        - Small Q: system model highly reliable, prediction more trusted
        - Large Q: system model uncertain, observation more trusted
        - Typical values: 0.001-0.1 (depends on state scale)
    R : float
        Observation noise covariance (sensor uncertainty)
        - Small R: observation highly reliable, observation more trusted
        - Large R: observation noisy, prediction more trusted
        - Typical values: 0.01-1.0 (depends on observation scale)
    initial_state : float
        Initial state estimate
        - Can be prior knowledge or the first observation value
    initial_covariance : float
        Initial state covariance (uncertainty)
        - Usually set large (e.g., 1.0) indicating high initial uncertainty
        - Decreases as observations accumulate

    Returns:
    --------
    estimates : list
        State estimate sequence at each time step, length n (same as input data)

    Notes:
    ------
    - Kalman filter assumes Gaussian white noise (mean 0, covariances Q and R)
    - For non-Gaussian noise, use Extended Kalman Filter (EKF) or Unscented Kalman Filter (UKF)
    - 1D implementation simplifies matrix operations; higher dimensions need matrix multiplication
    - Kalman gain K_t automatically balances weights between prediction and observation

    Example:
    --------
    >>> # Track a random walk process (true state unobservable)
    >>> true_state = np.random.randn(100).cumsum()           # True state
    >>> noisy_obs = true_state + np.random.randn(100) * 0.5  # Noisy observation
    >>> estimates = kalman_filter(noisy_obs, F=1, B=0, H=1, Q=0.01, R=0.25,
    ...                           initial_state=0, initial_covariance=1)
    >>> # estimates should be close to true_state (smoothed estimates)
    """
    n = len(data)

    # Initialize: current state estimate and covariance
    state_estimate = initial_state            # x_{t-1|t-1}: previous state estimate
    covariance_estimate = initial_covariance  # P_{t-1|t-1}: previous covariance
    estimates = []                            # Store state estimate at each time step

    # Recursive filtering: "predict-update" cycle at each time step
    for t in range(n):
        # ========== Prediction Step ==========
        # Predict next state based on system dynamics model
        # State prediction: x_{t|t-1} = F·x_{t-1|t-1} + B·u_t
        # Here B=0 (no control input), so this simplifies to: x_{t|t-1} = F·x_{t-1|t-1}
        state_predict = F * state_estimate + B

        # Covariance prediction: P_{t|t-1} = F·P_{t-1|t-1}·F^T + Q
        # 1D case: P_{t|t-1} = F²·P_{t-1|t-1} + Q
        # Meaning: prediction uncertainty = previous uncertainty × state transition² + process noise
        covariance_predict = F * covariance_estimate * F + Q

        # ========== Update Step ==========
        # When a new observation arrives, fuse prediction and observation for the optimal estimate
        # Compute Kalman gain: K_t = P_{t|t-1}·H^T·(H·P_{t|t-1}·H^T + R)^{-1}
        # 1D case: K_t = P_{t|t-1}·H / (H²·P_{t|t-1} + R)
        # Kalman gain meaning:
        # - If prediction uncertainty P large, observation uncertainty R small → K large → trust observation more
        # - If prediction uncertainty P small, observation uncertainty R large → K small → trust prediction more
        # - K range: [0, 1] (1D case)
        kalman_gain = covariance_predict * H / (H * covariance_predict * H + R)

        # Compute observation residual (innovation): y_t - H·x_{t|t-1}
        # This is the difference between observation and prediction, containing new information
        innovation = data[t] - H * state_predict

        # State update: x_{t|t} = x_{t|t-1} + K_t·(y_t - H·x_{t|t-1})
        # Optimal estimate = prediction + Kalman gain × observation residual
        state_estimate = state_predict + kalman_gain * innovation

        # Covariance update: P_{t|t} = (I - K_t·H)·P_{t|t-1}
        # 1D case: P_{t|t} = (1 - K_t·H)·P_{t|t-1}
        # Meaning: after fusing observation, uncertainty decreases (gained new information)
        covariance_estimate = (1 - kalman_gain * H) * covariance_predict

        # Save current time step's state estimate
        estimates.append(state_estimate)

    return estimates
# Usage example: Track a random walk process (like GPS position tracking)
# Generate simulated data: true state + observation noise
np.random.seed(42)
true_state = np.random.randn(100).cumsum()           # True state (random walk)
noisy_obs = true_state + np.random.randn(100) * 0.5  # Noisy observation (std=0.5)

# Kalman filter parameter settings
F = 1                     # State transition: assume state unchanged (random walk)
B = 0                     # Control input: no external control
H = 1                     # Observation matrix: directly observing state (position observing position)
Q = 0.01                  # Process noise: system uncertainty (small, indicating reliable system model)
R = 0.25                  # Observation noise: sensor uncertainty (observation noise variance = 0.5² = 0.25)
initial_state = 0         # Initial state estimate
initial_covariance = 1.0  # Initial uncertainty (large, indicating high initial uncertainty)
Automatic Balancing of Kalman Gain: The Kalman
gain K_t is the algorithm's core,
automatically balancing prediction and observation weights. If
prediction uncertainty P_{t|t-1} is
large (model unreliable) and observation uncertainty R is small (sensor accurate), then K_t approaches 1, trusting the observation
more; conversely, if prediction uncertainty is small and observation
uncertainty is large, K_t approaches
0, trusting the prediction more. This adaptive weighting enables Kalman
filtering to make optimal estimates under uncertainty.
Advantages of Recursive Updates: Kalman
filtering is recursive: each time step needs only the current observation
and the previous state estimate, with no need to store all historical data. This
makes it suitable for real-time applications (like GPS navigation, robot
control), with O(d³) per-step computational
complexity (d is the state dimension),
much lower than batch methods (like least squares, which must reprocess
all n observations).
Uncertainty Quantification: Kalman filtering
provides not just state estimates but also the covariance P_{t|t} (uncertainty quantification). This
allows computing confidence intervals: x̂_{t|t} ± 1.96·√P_{t|t} (95%
confidence interval). This is crucial for risk assessment and decision
making.
Common Questions:
Q: How to set Q and R?
A: Q (process noise) reflects system model uncertainty; R
(observation noise) reflects sensor precision. Usually set through
experience or maximum likelihood estimation. If observation is very
accurate, set R small; if system model is uncertain, set Q large.
Q: Does Kalman filtering apply to nonlinear
systems?
A: Standard Kalman filtering only applies to linear systems. For
nonlinear systems, use Extended Kalman Filter (EKF, linearization) or
Unscented Kalman Filter (UKF, using sigma points).
Q: How to choose initial state and covariance?
A: Initial state can use first observation or prior knowledge.
Initial covariance is usually set large (e.g., 1.0 or 10.0) indicating
high initial uncertainty; it automatically decreases as observations
accumulate.
Q: Can Kalman filtering predict the future?
A: Yes. The prediction step can be applied recursively to predict h steps
ahead: x̂_{t+h|t} = F^h x̂_{t|t}. But prediction
uncertainty grows with time.
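The h-step recursion can be sketched as follows (scalar case; the helper name is my own):

```python
def kalman_predict_ahead(state, covariance, F, Q, h):
    """Apply only the prediction step h times: x_{t+h|t} = F^h * x_{t|t}.

    With no new observations, each step also inflates the covariance:
    P <- F*P*F + Q, so uncertainty grows with the horizon.
    """
    preds = []
    for _ in range(h):
        state = F * state                    # state extrapolation
        covariance = F * covariance * F + Q  # uncertainty grows each step
        preds.append((state, covariance))
    return preds

# Random-walk model (F=1): point forecast stays flat, variance grows by Q per step
preds = kalman_predict_ahead(state=5.0, covariance=0.2, F=1.0, Q=0.01, h=3)
print(preds)  # point forecasts stay at 5.0; covariance ≈ 0.21, 0.22, 0.23
```

For a random walk this reproduces the "constant forecast with widening intervals" behavior familiar from ARIMA(0,1,0).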
Q: How to handle missing observations?
A: If some time step has no observation, skip the update step and only
execute the prediction step. The state estimate continues based on prediction,
but uncertainty increases (no new information).
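Handling a missing observation amounts to skipping the update step, e.g. (a sketch with my own function and variable names, scalar case):

```python
def kalman_step(state, P, y, F, H, Q, R):
    """One Kalman step; y=None means the observation is missing.

    Missing case: run only the prediction step, so the estimate follows
    the dynamics model and the covariance P (uncertainty) increases.
    """
    # Prediction step (always runs)
    state = F * state
    P = F * P * F + Q
    if y is not None:
        # Update step (only when an observation is available)
        K = P * H / (H * P * H + R)
        state = state + K * (y - H * state)
        P = (1 - K * H) * P
    return state, P

state, P = 0.0, 1.0
for y in [1.2, None, 1.4]:  # middle observation is missing
    state, P = kalman_step(state, P, y, F=1.0, H=1.0, Q=0.01, R=0.25)
print(state, P)
```

Note that during the gap the covariance grows by Q each step, so the filter automatically trusts the next real observation more.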
Suitable for: periodic series (summer sales always high)
Analogy:
Single: only remembers "what it is now"
Double: also remembers "how fast it's growing"
Triple: also remembers "what it was like this time last year"
Model Selection Decision
Tree
What does my data look like?
│
├─ Only one variable?
│   ├─ Stationary (no trend/seasonality) → ARIMA(p,0,q)
│   ├─ Has trend (gradually rising/falling) → ARIMA(p,1,q) or Double Exponential Smoothing
│   ├─ Has seasonality (periodic repetition) → SARIMA or Holt-Winters
│   └─ Volatility changes (financial data) → GARCH
│
└─ Multiple related variables?
    ├─ Variables influence each other → VAR
    └─ Need to fuse prior knowledge → Kalman Filter
Q&A: Common Beginner
Questions
Q1: How to choose
ARIMA's (p,d,q) parameters?
Quick Rules of Thumb:
Parameter
Meaning
How to Choose
p
AR order
PACF cuts off after lag p → use that p; usually p=1~3
d
Differencing order
Look at trend: no trend d=0, linear trend d=1, quadratic trend
d=2
q
MA order
ACF cuts off after lag q → use that q; usually q=1~3
Practical Tips: 1. First plot the data (has trend?
has seasonality?) 2. Use auto_arima for automatic parameter
selection (Python's pmdarima package) 3. Start simple: try
ARIMA(1,1,1) first 4. Look at AIC/BIC metrics: smaller is better
(balances fit vs complexity)
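To make the AIC point concrete, here is a hedged sketch that fits AR(p) models by ordinary least squares and compares a rough AIC (an illustration of "smaller is better," not a replacement for auto_arima; the helper name and simulation are my own):

```python
import numpy as np

def ar_aic(y, p):
    """Fit AR(p) by least squares and return a rough AIC = n*ln(RSS/n) + 2k.

    Smaller AIC is better: it rewards fit (low RSS) but penalizes
    extra parameters (k = p lag coefficients + intercept).
    """
    n = len(y) - p
    # Lag matrix: column i holds y_{t-i-1} for targets y_t, t = p..len(y)-1
    X = np.column_stack([y[p - i - 1:n + p - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(n), X])  # intercept + p lags
    target = y[p:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ coef) ** 2)
    return n * np.log(rss / n) + 2 * (p + 1)

rng = np.random.default_rng(0)
# Simulate an AR(1) process: y_t = 0.8*y_{t-1} + noise
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.8 * y[t - 1] + rng.normal()

for p in (1, 2, 3):
    print(p, round(ar_aic(y, p), 1))
```

On a true AR(1) series the extra lags of AR(2)/AR(3) barely reduce the residual sum of squares, so the +2-per-parameter penalty typically makes AR(1) the AIC winner.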
Q2: When to use ARIMA vs
deep learning?
ARIMA is better when:
✅ Small data (<1000 time points)
✅ Need interpretability ("why this prediction?")
✅ Need confidence intervals ("how uncertain is the
prediction?")
✅ Clear seasonality/trends
Deep learning (LSTM/Transformer) is better when:
✅ Large data (>10000 time points)
✅ Multivariate (>10 features)
✅ Complex nonlinear relationships
✅ Can incorporate external features (like news, weather)
Hybrid Strategy:
Use ARIMA for baseline (quick data quality check)
Use deep learning for peak performance
Ensemble multiple models (ARIMA + LSTM + Prophet)
Q3:
Why does my ARIMA forecast always become "a straight line"?
Common Reasons:
Reason 1: Data too stationary, model learned "no
change"
Symptom: forecast = last observed value
Solution: check if over-differenced (d too large)
Reason 2: Parameters p, q too small
Symptom: model too simple, can't capture changes
Solution: increase p or q (e.g., ARIMA(1,1,1) → ARIMA(3,1,3))
Reason 3: Training data too short
Symptom: model didn't learn enough patterns
Solution: need at least 50-100 time points
Reason 4: Forecast horizon too long
Symptom: 1-step forecast accurate, 10-step forecast is a line
Solution: ARIMA suits short-term forecasts (1-5 steps), use other
methods for long-term
Q4: What's the
difference between GARCH and ARIMA?