Regression-Based Financial Modeling and Forecasting

Objective: Determine whether a small, carefully chosen set of monthly macroeconomic variables can predict the next month’s direction (up vs. down) of the S&P 500, DJIA, and NASDAQ.

Data & Pre-processing:

  • Time frame: 2000 – 2024, monthly frequency

  • Targets: Month‑over‑month returns for S&P 500, DJIA, NASDAQ (Yahoo Finance)

  • Candidate predictors (FRED): 2‑ & 10‑year Treasury yields, fed‑funds rate, CPI, unemployment, M2 money supply, NFCI, industrial production, UMich consumer sentiment, WTI oil prices, mortgage rates, OECD composite leading indicator

  • Series resampled to monthly frequency and expressed as percent changes, differenced for stationarity; NFCI enters with a one‑month lag

  • Five most informative features selected via Pearson correlation heat map:

    1. Δ WTI oil price (DCOILWTICO)

    2. Δ OECD composite leading indicator

    3. Δ 2‑year Treasury yield (GS2)

    4. Δ UMich consumer sentiment

    5. NFCI (lag 1)
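The pre-processing and feature-screening steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the column names (`oil`, `gs2`, `nfci`), the choice of percent change for oil versus a first difference for the yield, and ranking features by absolute Pearson correlation with the next-month return are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic daily stand-ins for the FRED series and an index level (assumption:
# real data would be downloaded from FRED / Yahoo Finance instead).
rng = np.random.default_rng(0)
days = pd.date_range("2000-01-01", "2003-12-31", freq="D")
raw = pd.DataFrame(
    {
        "oil": 30.0 * np.exp(rng.normal(0.0, 0.01, len(days)).cumsum()),
        "gs2": 4.0 + rng.normal(0.0, 0.02, len(days)).cumsum(),
        "nfci": rng.normal(0.0, 0.3, len(days)),
    },
    index=days,
)
index_level = pd.Series(
    1000.0 * np.exp(rng.normal(0.0003, 0.01, len(days)).cumsum()), index=days
)

# Resample to monthly frequency by keeping the last observation of each month.
monthly = raw.groupby(raw.index.to_period("M")).last()
monthly_px = index_level.groupby(index_level.index.to_period("M")).last()

features = pd.DataFrame(
    {
        "d_oil": monthly["oil"].pct_change(),   # monthly percent change
        "d_gs2": monthly["gs2"].diff(),         # first difference of the yield
        "nfci_lag1": monthly["nfci"].shift(1),  # one-month lag
    }
)

# Target: the return realized in the month AFTER the features are observed.
next_ret = monthly_px.pct_change().shift(-1).rename("next_ret")

data = features.join(next_ret).dropna()

# Rank features by absolute Pearson correlation with the next-month return.
corr = (
    data.drop(columns="next_ret")
    .corrwith(data["next_ret"])
    .abs()
    .sort_values(ascending=False)
)
top_features = corr.head(2).index.tolist()
print(corr)
```

Shifting the return by one month, rather than the features, keeps each row's predictors strictly earlier than the outcome they are paired with, which avoids look-ahead leakage.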

Modeling Path:

  1. Regression trial (OLS, Lasso, Elastic‑Net, tree & gradient boosting):

    • Low explanatory power (R² ≈ 0.06–0.18); oil‑price changes dominated the fit, and generalization was weak.

  2. Pivot to classification: positive vs. negative next‑month return.

    • Logistic Regression and Random Forest trained on 2000–2017, tested on 2018–2023. 

| Index   | Model               | Accuracy | Precision (Up) | Recall (Up) | F1 (Up) | Notes |
|---------|---------------------|----------|----------------|-------------|---------|-------|
| S&P 500 | Logistic Regression | 65.3 %   | 0.69           | 0.85        | 0.76    | Out‑performed buy‑and‑hold (2.14× vs. 1.78× ending capital). |
| S&P 500 | Random Forest       | 65.3 %   | 0.72           | 0.77        | 0.74    | Comparable accuracy, less interpretable. |
| DJIA    | Logistic Regression | 59.7 %   | 0.64           | 0.80        | 0.71    | Beat buy‑and‑hold (1.86× vs. 1.52×). |
| DJIA    | Random Forest       | 55.6 %   | 0.65           | 0.59        | 0.62    | Under‑performed logistic on this index. |
| NASDAQ  | Logistic Regression | 58.3 %   | 0.68           | 0.68        | 0.68    | Slightly trailed buy‑and‑hold (2.05× vs. 2.17×). |
| NASDAQ  | Random Forest       | 59.7 %   | 0.75           | 0.57        | 0.65    | Marginal accuracy gain but lower recall. |
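The classification setup behind the results table (chronological train/test split, Logistic Regression, and the Up-class metrics) might look roughly like the sketch below. The data are synthetic, the feature column names are hypothetical, and the scaling step and default hyperparameters are assumptions; the printed numbers will therefore not match the table.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic monthly data (assumption): five hypothetical feature columns and a
# next-month return that depends only weakly on the first one.
rng = np.random.default_rng(42)
months = pd.period_range("2000-01", "2023-12", freq="M")
cols = ["d_oil", "d_cli", "d_gs2", "d_sentiment", "nfci_lag1"]
X = pd.DataFrame(rng.normal(size=(len(months), len(cols))), index=months, columns=cols)
next_ret = 0.005 + 0.01 * X["d_oil"] + rng.normal(0.0, 0.04, len(months))
y = (next_ret > 0).astype(int)  # 1 = up month, 0 = down month

# Chronological split: no shuffling, so the test period lies strictly after training.
train_mask = X.index.year <= 2017
test_mask = X.index.year >= 2018
X_train, y_train = X.loc[train_mask], y.loc[train_mask]
X_test, y_test = X.loc[test_mask], y.loc[test_mask]

# Standardize the features, then fit a plain logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Metrics reported for the "Up" class (label 1), as in the table columns.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision_up": precision_score(y_test, y_pred, pos_label=1, zero_division=0),
    "recall_up": recall_score(y_test, y_pred, pos_label=1, zero_division=0),
    "f1_up": f1_score(y_test, y_pred, pos_label=1, zero_division=0),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A 2018–2023 test window contains 72 months, so an accuracy of 65.3 % would correspond to 47 correct calls out of 72. Splitting by date rather than at random matters here because shuffled splits would let the model train on months that follow the ones it is tested on.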

Interpretation & Notes:

  • Even coarse macro data can give a directional edge on broad‑market indices: up to ~15 pp above a 50 % coin‑flip baseline on the S&P 500, with smaller margins on DJIA and NASDAQ.

  • Oil‑price shocks and leading‑indicator shifts carry the strongest signal; consumer sentiment and short‑rate moves matter but lag.

  • Simple, transparent Logistic Regression yielded the best risk‑adjusted improvement and is easy to explain to stakeholders.

  • Limitations: monthly granularity misses intramonth events; structural breaks (e.g., 2008, COVID) challenge stationarity; bullish bias in 2018‑23 sample.
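The ending-capital multiples quoted in the results table compare a signal-driven strategy against buy-and-hold. The write-up does not state the trading rule, so the sketch below assumes the simplest one: hold the index in months predicted "up" and sit in cash (zero return) otherwise, with no transaction costs. The helper name `ending_capital` and the sample numbers are illustrative only.

```python
import numpy as np

def ending_capital(returns, signals=None):
    """Growth of 1 unit of capital over a sequence of monthly returns.

    signals: 1 = invested that month, 0 = in cash (assumed to earn 0 %).
    If signals is None, the position is always invested (buy-and-hold).
    """
    returns = np.asarray(returns, dtype=float)
    if signals is None:
        signals = np.ones_like(returns)
    signals = np.asarray(signals, dtype=float)
    # Compound only the months in which the strategy is invested.
    return float(np.prod(1.0 + signals * returns))

# Illustrative numbers only, not taken from the project.
monthly_returns = [0.02, -0.05, 0.03, -0.01, 0.04]
predicted_up = [1, 0, 1, 1, 1]  # the model sidesteps the -5 % month

buy_hold = ending_capital(monthly_returns)
strategy = ending_capital(monthly_returns, predicted_up)
print(f"buy-and-hold: {buy_hold:.4f}x, strategy: {strategy:.4f}x")
```

Each signal must be generated from information available before the month it applies to; otherwise the comparison overstates the strategy. Given the bullish 2018–23 sample noted above, buy-and-hold is a demanding benchmark, since every missed up month compounds against the strategy.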

Key Figures: