Polynomial Surface Regression via Iterative Optimization
We have a set of data points in 3D space: input coordinates (x₁, x₂) and an output value y. The goal is to find a surface that best fits this data. We assume the underlying relationship is a quadratic polynomial with additive noise:

y = w₀ + w₁x₁ + w₂x₂ + w₃x₁² + w₄x₂² + w₅x₁x₂ + ε

where ε represents random Gaussian noise (σ = 2.0).
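As a concrete sketch, the noisy quadratic data can be generated as below. The specific true weights, the sampling range, and the number of points are illustrative assumptions; only the noise level σ = 2.0 comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical true parameters (not specified in the text).
w_true = np.array([1.0, -2.0, 0.5, 1.5, -1.0, 0.8])
sigma = 2.0  # noise standard deviation, per the text

# Sample input coordinates and evaluate the quadratic plus Gaussian noise.
x1 = rng.uniform(-3, 3, n)
x2 = rng.uniform(-3, 3, n)
y = (w_true[0] + w_true[1] * x1 + w_true[2] * x2
     + w_true[3] * x1**2 + w_true[4] * x2**2 + w_true[5] * x1 * x2
     + rng.normal(0.0, sigma, n))
```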
To fit a polynomial model using linear algebra, we construct a feature matrix Φ (Phi). Each row represents one data point, and each column represents a feature: the constant term, the linear terms, and the quadratic terms, so each row is [1, x₁, x₂, x₁², x₂², x₁x₂].
For n data points, Φ is an n×6 matrix. This transforms our polynomial regression into a linear regression problem in the feature space.
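A minimal sketch of the feature-matrix construction, assuming the column order 1, x₁, x₂, x₁², x₂², x₁x₂:

```python
import numpy as np

def build_phi(x1, x2):
    """Stack the six polynomial features column-wise: 1, x1, x2, x1^2, x2^2, x1*x2."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

# Three example points give a 3x6 feature matrix.
x1 = np.array([0.0, 1.0, 2.0])
x2 = np.array([1.0, -1.0, 0.5])
Phi = build_phi(x1, x2)
print(Phi.shape)  # (3, 6)
```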
The model's prediction is a single matrix-vector multiplication, ŷ = Φw. The weight vector w contains the 6 parameters we need to learn.
Initially, all weights are set to zero, producing a flat plane at y = 0. As training progresses, the surface morphs to fit the data.
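In code, the prediction and the zero initialization look like this (the feature matrix values here are arbitrary placeholders):

```python
import numpy as np

# Two example rows of a 6-column feature matrix (placeholder values).
Phi = np.array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
                [1.0, 1.0, -1.0, 1.0, 1.0, -1.0]])

w = np.zeros(6)   # all weights start at zero
y_hat = Phi @ w   # prediction is a single matrix-vector product

# With w = 0 the predicted surface is the flat plane y = 0.
print(y_hat)
```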
We measure how wrong our predictions are using Mean Squared Error (MSE). This function is smooth and convex, making it ideal for gradient-based optimization.
Key properties of MSE:
- It is differentiable everywhere, so the gradient is defined at every point.
- For a model that is linear in its parameters, it is convex in w, so there is a single global minimum.
- Squaring the residuals penalizes large errors much more heavily than small ones.
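The loss itself is one line of NumPy. A small hand-checkable sketch:

```python
import numpy as np

def mse(Phi, w, y):
    """Mean squared error between predictions Phi @ w and targets y."""
    residual = y - Phi @ w
    return np.mean(residual**2)

# With w = 0 the residuals equal y, so the loss is (3^2 + 4^2) / 2.
Phi = np.array([[1.0, 1.0],
                [1.0, 2.0]])
w = np.zeros(2)
y = np.array([3.0, 4.0])
print(mse(Phi, w, y))  # 12.5
```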
The gradient tells us the direction of steepest increase of the loss function. We move in the opposite direction to minimize loss.
Derivation steps:
1. Write the loss in vector form: MSE(w) = (1/n)‖y − Φw‖².
2. Expand it as (1/n)(y − Φw)ᵀ(y − Φw).
3. Differentiate with respect to w: ∇MSE(w) = −(2/n)Φᵀ(y − Φw).
The term (y - Φw) is the residual vector — the difference between actual and predicted values. Φᵀ projects these errors back into the weight space.
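One way to gain confidence in the analytic gradient −(2/n)Φᵀ(y − Φw) is to compare it against a central finite difference on random data (the data here is random and purely illustrative):

```python
import numpy as np

def mse_grad(Phi, w, y):
    """Analytic gradient of the MSE loss: -(2/n) * Phi^T (y - Phi w)."""
    n = len(y)
    return -(2.0 / n) * Phi.T @ (y - Phi @ w)

def mse(Phi, w, y):
    residual = y - Phi @ w
    return np.mean(residual**2)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 6))
y = rng.normal(size=50)
w = rng.normal(size=6)

# Central finite difference along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (mse(Phi, w + eps * e, y) - mse(Phi, w - eps * e, y)) / (2 * eps)
    for e in np.eye(6)
])

print(np.max(np.abs(numeric - mse_grad(Phi, w, y))))  # tiny: the two agree
```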
Gradient descent iteratively updates the weights by taking small steps opposite to the gradient: w ← w − α∇MSE(w).
The learning rate α controls step size. Too large and we overshoot; too small and convergence is slow.
As iterations progress, the loss decreases and weights converge to values close to the true parameters.
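The full training loop can be sketched as follows. The learning rate, iteration count, true weights, and noise-free data are illustrative choices made so that convergence is easy to check, not values from the text:

```python
import numpy as np

def gradient_descent(Phi, y, alpha=0.02, iters=20000):
    """Plain batch gradient descent on the MSE loss."""
    n, d = Phi.shape
    w = np.zeros(d)                                   # start from the flat plane
    for _ in range(iters):
        grad = -(2.0 / n) * Phi.T @ (y - Phi @ w)     # gradient of MSE
        w -= alpha * grad                             # step against the gradient
    return w

# Noise-free quadratic data with hypothetical true weights.
rng = np.random.default_rng(2)
x1 = rng.uniform(-2, 2, 300)
x2 = rng.uniform(-2, 2, 300)
Phi = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])
w_true = np.array([1.0, -2.0, 0.5, 1.5, -1.0, 0.8])
y = Phi @ w_true

w_fit = gradient_descent(Phi, y)
print(np.max(np.abs(w_fit - w_true)))  # near zero: the weights are recovered
```

With noisy targets, as in the actual dataset, the recovered weights would only approximate w_true, which matches the behavior described above.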
The learned weights won't exactly match due to noise in the data, but they approximate the true underlying function. The visualization shows the surface transforming from a flat plane to a curved quadratic surface that fits the scattered data points.
Gradient descent succeeds here because:
- The MSE loss is convex in w, so there is exactly one minimum and no local traps.
- The gradient −(2/n)Φᵀ(y − Φw) is cheap to compute: one matrix-vector product per step.
- With a suitably small learning rate α, each step decreases the loss.
For this linear-in-parameters problem, a closed-form solution exists: w = (ΦᵀΦ)⁻¹Φᵀy. However, gradient descent scales better to large datasets and generalizes to non-convex problems like neural networks.
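The closed-form solution is a few lines of NumPy. In practice `np.linalg.lstsq` is preferred over explicitly inverting ΦᵀΦ, since the inverse can be numerically unstable; the sketch below (on arbitrary random data) shows both forms agreeing:

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(100, 6))
y = rng.normal(size=100)

# Preferred: least-squares solver, numerically stable.
w_closed, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Normal-equations form from the text: w = (Phi^T Phi)^{-1} Phi^T y,
# computed via solve() rather than an explicit inverse.
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

print(np.allclose(w_closed, w_normal))  # True
```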