Gradient Descent

Polynomial Surface Regression via Iterative Optimization

The Mathematics

1. The Problem

We have a set of data points in 3D space: input coordinates (x₁, x₂) and an output value y. The goal is to find a surface that best fits this data. We assume the underlying relationship is a quadratic polynomial with noise.

True Underlying Function y = w₀ + w₁x₁ + w₂x₂ + w₃x₁² + w₄x₂² + w₅x₁x₂ + ε

Where ε represents random Gaussian noise (σ = 2.0)
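The data-generation step can be sketched in NumPy. The true weights and σ = 2.0 come from the sections below; the input range [-3, 3] for the 10×10 grid is an assumption made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# True weights from the "Convergence" section below
w_true = np.array([1.0, -2.0, 3.0, 0.5, -1.5, 2.0])

# 10x10 grid of input coordinates; the range [-3, 3] is an assumed choice
g = np.linspace(-3.0, 3.0, 10)
x1, x2 = (a.ravel() for a in np.meshgrid(g, g))

# Evaluate the quadratic and add Gaussian noise with sigma = 2.0
y = (w_true[0] + w_true[1] * x1 + w_true[2] * x2
     + w_true[3] * x1**2 + w_true[4] * x2**2 + w_true[5] * x1 * x2)
y = y + rng.normal(0.0, 2.0, size=x1.shape)
```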

2. Feature Engineering

To fit a polynomial model using linear algebra, we construct a feature matrix Φ (Phi). Each row represents one data point, and each column represents a feature: the constant term, linear terms, and quadratic terms.

Feature Matrix Construction Φ = [1, x₁, x₂, x₁², x₂², x₁x₂]

For n data points, Φ is an n×6 matrix. This transforms our polynomial regression into a linear regression problem in the feature space.
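The feature map can be written as one `column_stack` call; the helper name `make_features` is hypothetical:

```python
import numpy as np

def make_features(x1, x2):
    """Build Phi with columns [1, x1, x2, x1^2, x2^2, x1*x2]."""
    return np.column_stack([np.ones_like(x1), x1, x2,
                            x1**2, x2**2, x1 * x2])

x1 = np.array([0.0, 1.0, 2.0])
x2 = np.array([0.0, -1.0, 0.5])
Phi = make_features(x1, x2)   # shape (3, 6): one row per point, one column per feature
```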

3. Model Prediction

The model's prediction is a simple matrix-vector multiplication. The weight vector w contains 6 parameters that we need to learn.

Prediction ŷ = Φw

Initially, all weights are set to zero, producing a flat plane at y = 0. As training progresses, the surface morphs to fit the data.
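The prediction is a single matrix-vector product; with zero initial weights it reproduces the flat plane at y = 0 (the two feature rows below are arbitrary examples):

```python
import numpy as np

Phi = np.array([[1.0, 1.0, -1.0, 1.0, 1.0, -1.0],
                [1.0, 2.0, 0.5, 4.0, 0.25, 1.0]])

w = np.zeros(6)      # all six weights start at zero
y_hat = Phi @ w      # prediction: y_hat = Phi w
# Every prediction is 0.0 — the initial flat plane
```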

4. Loss Function

We measure how wrong our predictions are using Mean Squared Error (MSE). This function is smooth and convex, making it ideal for gradient-based optimization.

Mean Squared Error L(w) = (1/n) Σᵢ (yᵢ - ŷᵢ)² = (1/n) ||y - Φw||²

Key properties of MSE:

  • Always non-negative (squared terms)
  • Penalizes large errors more heavily than small ones
  • Has a unique global minimum for this problem
  • Differentiable everywhere, enabling gradient descent
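The loss translates directly from the formula into code; a minimal sketch with a toy 2×6 feature matrix:

```python
import numpy as np

def mse(w, Phi, y):
    """L(w) = (1/n) * ||y - Phi @ w||^2."""
    r = y - Phi @ w
    return (r @ r) / len(y)

Phi = np.column_stack([np.ones(2), np.zeros((2, 5))])  # toy 2x6 features
y = np.array([3.0, -1.0])
loss0 = mse(np.zeros(6), Phi, y)   # with w = 0, loss = mean(y^2) = 5.0
```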

5. Computing the Gradient

The gradient tells us the direction of steepest increase of the loss function. We move in the opposite direction to minimize loss.

Gradient of MSE ∇L(w) = -(2/n) Φᵀ(y - Φw)

Derivation steps:

  • Expand: L = (1/n)(y - Φw)ᵀ(y - Φw)
  • Differentiate with respect to w
  • Apply chain rule: ∂L/∂w = -(2/n)Φᵀ(y - Φw)

The term (y - Φw) is the residual vector — the difference between actual and predicted values. Φᵀ projects these errors back into the weight space.
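The gradient formula can be checked numerically: a central finite difference on one coordinate should agree with the analytic expression (the random problem below is synthetic, for the check only):

```python
import numpy as np

def grad_mse(w, Phi, y):
    """Gradient of MSE: -(2/n) * Phi^T @ (y - Phi @ w)."""
    return -(2.0 / len(y)) * (Phi.T @ (y - Phi @ w))

# Synthetic problem for a sanity check
rng = np.random.default_rng(1)
Phi = rng.normal(size=(8, 6))
y = rng.normal(size=8)
w = rng.normal(size=6)

def loss(w):
    r = y - Phi @ w
    return (r @ r) / len(y)

# Central finite difference along the first coordinate
eps = 1e-6
e0 = np.zeros(6)
e0[0] = eps
fd = (loss(w + e0) - loss(w - e0)) / (2 * eps)   # numeric dL/dw0
analytic = grad_mse(w, Phi, y)[0]
```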

6. The Update Rule

Gradient descent iteratively updates weights by taking small steps opposite to the gradient direction.

Weight Update w ← w - α · ∇L(w)

The learning rate α controls step size. Too large and we overshoot; too small and convergence is slow.
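A single update step, with a check that a small step does reduce the loss (the random data is illustrative):

```python
import numpy as np

def gd_step(w, Phi, y, alpha):
    """One gradient-descent update: w <- w - alpha * grad L(w)."""
    grad = -(2.0 / len(y)) * (Phi.T @ (y - Phi @ w))
    return w - alpha * grad

rng = np.random.default_rng(2)
Phi = rng.normal(size=(20, 6))
y = rng.normal(size=20)

w0 = np.zeros(6)
w1 = gd_step(w0, Phi, y, alpha=0.001)

r0, r1 = y - Phi @ w0, y - Phi @ w1
loss_before = (r0 @ r0) / 20
loss_after = (r1 @ r1) / 20   # lower, since alpha is small enough here
```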

Hyperparameters used in the visualization:

  • Learning rate: α = 0.001
  • Iterations: 2000
  • Parameters: 6 weights
  • Data points: 100 (10×10 grid)
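Putting the pieces together with these settings gives the full training loop. The input grid range [-3, 3] is an assumption; the true weights and noise level come from the surrounding sections:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 3.0, 0.5, -1.5, 2.0])

# 10x10 grid of inputs; the range [-3, 3] is an assumed choice
g = np.linspace(-3.0, 3.0, 10)
x1, x2 = (a.ravel() for a in np.meshgrid(g, g))
Phi = np.column_stack([np.ones(100), x1, x2, x1**2, x2**2, x1 * x2])
y = Phi @ w_true + rng.normal(0.0, 2.0, size=100)

# Train: 2000 iterations at alpha = 0.001, starting from w = 0
w = np.zeros(6)
alpha = 0.001
for _ in range(2000):
    w -= alpha * (-(2.0 / 100) * (Phi.T @ (y - Phi @ w)))

residual = y - Phi @ w
final_loss = (residual @ residual) / 100   # near the noise floor sigma^2 = 4
```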

7. Convergence

As iterations progress, the loss decreases and weights converge to values close to the true parameters.

True Weights w* = [1.0, -2.0, 3.0, 0.5, -1.5, 2.0]

The learned weights won't exactly match due to noise in the data, but they approximate the true underlying function. The visualization shows the surface transforming from a flat plane to a curved quadratic surface that fits the scattered data points.

8. Why This Works

Gradient descent succeeds here because:

  • The loss landscape is convex — there's only one minimum
  • The gradient provides exact direction information
  • With a sufficiently small learning rate, each step decreases the loss, so the iterates converge
  • The feature matrix Φ is full rank (columns are linearly independent)

For this linear-in-parameters problem, a closed-form solution exists: w = (ΦᵀΦ)⁻¹Φᵀy. However, gradient descent scales better to large datasets and generalizes to non-convex problems like neural networks.
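The closed-form answer can be computed on the same kind of data for comparison; solving the normal equations (Φᵀ Φ) w = Φᵀ y with `np.linalg.solve` is numerically preferable to forming the inverse explicitly. The grid range is again an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 3.0, 0.5, -1.5, 2.0])

g = np.linspace(-3.0, 3.0, 10)   # assumed input range
x1, x2 = (a.ravel() for a in np.meshgrid(g, g))
Phi = np.column_stack([np.ones(100), x1, x2, x1**2, x2**2, x1 * x2])
y = Phi @ w_true + rng.normal(0.0, 2.0, size=100)

# Normal equations: (Phi^T Phi) w = Phi^T y
w_closed = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```

At the solution the gradient is (numerically) zero, so `Phi.T @ (y - Phi @ w_closed)` vanishes up to floating-point error.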
