Why Gradient Descent Zigzags and How Momentum Corrects It

Gradient descent has a fundamental limitation: on most real-world loss landscapes, it makes poor progress. When the surface has uneven curvature (steep in one direction and nearly flat in the other, which is common in practice), the algorithm struggles to advance consistently. A high learning rate moves quickly along the flat direction but causes overshooting and oscillation along the steep one. A low learning rate stabilizes the updates but slows convergence dramatically. This trade-off is not an edge case; it is the typical behavior of gradient descent.
Momentum addresses this by accumulating information from previous gradients. Instead of relying only on the current gradient, it maintains a running average (often called the velocity) and updates the parameters using this accumulated signal. As a result, consistent gradients reinforce each other, allowing rapid movement along flat directions, while alternating gradients largely cancel, damping the oscillation.
In this article, we walk through how this works: the intuition, the update rules, and a from-scratch simulation on a controlled anisotropic loss surface that lets us measure the difference precisely: 185 steps to converge for vanilla GD versus 159 for momentum with β = 0.90, while β = 0.99 fails to converge at all.

Setting up dependencies
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
Defining the Loss Surface
The loss surface is an elongated bowl: nearly flat along one axis, steep along the other. This is controlled by two coefficients: 0.05 on x makes that direction almost flat, while 5 on y makes it steep. The gradients reflect this directly: 0.1·x in the flat direction, 10·y in the steep one.
The Hessian of this surface is diagonal with eigenvalues 0.1 and 10, giving a condition number of 100. That number is the crux of the problem: it says the surface is 100× more curved in one direction than in the other, which is what forces GD into its zigzag behavior.
The learning rate of 0.18 is chosen deliberately. The stability limit of GD is 2 / λ_max = 2 / 10 = 0.2; anything higher and the optimizer diverges outright. At 0.18, the per-step update factor along the steep axis is |1 − 10 × 0.18| = 0.8, which means the optimizer overshoots and reverses direction on every step. The flat-axis factor is |1 − 0.1 × 0.18| = 0.982, which means it covers only 1.8% of the remaining distance per step. This is exactly the worst-case combination momentum is designed for: oscillation in one direction, near-stall in the other.
def loss(x, y):
    return 0.05 * x**2 + 5 * y**2

def grad(x, y):
    return np.array([0.1 * x, 10 * y])
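As a quick sanity check on the numbers above, the short snippet below recomputes the condition number, the stability limit, and the per-axis update factors; the variable names here are illustrative additions, not part of the original walkthrough.

# Hessian of 0.05*x**2 + 5*y**2 is diagonal: diag(0.1, 10)
lambda_min, lambda_max = 0.1, 10.0

condition_number = lambda_max / lambda_min   # 100
stability_limit = 2 / lambda_max             # 0.2 -- GD diverges above this
lr_check = 0.18

factor_steep = abs(1 - lr_check * lambda_max)  # 0.8   -> overshoots and reverses every step
factor_flat = abs(1 - lr_check * lambda_min)   # 0.982 -> only 1.8% progress per step

print(f"condition number : {condition_number:.0f}")
print(f"stability limit  : {stability_limit:.2f}")
print(f"steep-axis factor: {factor_steep:.3f}")
print(f"flat-axis factor : {factor_flat:.3f}")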
Implementing the optimizers
Both methods follow the same overall procedure: start from an initial point, take a fixed number of steps, and track how the position changes. The only difference is how each step is computed.
Vanilla gradient descent is very simple. At each step, it updates the parameters by subtracting the gradient scaled by the learning rate. It remembers nothing from previous steps. That is why the oscillation persists: if the gradient keeps flipping sign (up, then down), the updates simply follow that pattern with no way to smooth it out.
Momentum introduces one additional quantity: the velocity v, initialized to zero. Instead of using only the current gradient, it updates this velocity by blending the previous velocity with the new gradient. The β parameter controls how much weight is given to past information. A high β (e.g., 0.9) means the update relies heavily on previous gradients, while a low β makes it behave like standard gradient descent.
This averaging affects the two directions differently. Along the steep axis, where the gradients keep flipping sign, they largely cancel in the velocity, damping the oscillation. Along the flat axis, where the gradients are consistent, they accumulate over time, letting the optimizer move faster.
Finally, the position is updated using this velocity instead of the raw gradient. The result is smooth, steady progress toward the minimum.
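Before the full implementations, here is a minimal numeric sketch (an illustrative addition, not part of the original walkthrough) of how an exponential moving average with β = 0.9 treats the two cases: a sign-alternating gradient sequence stays near zero, while a constant one accumulates.

beta = 0.9

# Alternating gradients (steep axis): +1, -1, +1, -1, ...
v = 0.0
for g in [1, -1] * 10:
    v = beta * v + (1 - beta) * g
print(f"velocity after alternating gradients: {v:+.4f}")  # stays close to 0

# Constant gradients (flat axis): 1, 1, 1, ...
v = 0.0
for g in [1] * 20:
    v = beta * v + (1 - beta) * g
print(f"velocity after constant gradients:    {v:+.4f}")  # approaches 1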
def gradient_descent(start, lr, steps=300):
    """
    Vanilla GD: θ ← θ − lr · ∇L(θ)
    Each update depends only on the current gradient.
    No memory of past gradients -- oscillations persist.
    """
    path = [np.array(start, dtype=float)]
    pos = np.array(start, dtype=float)
    for _ in range(steps):
        pos = pos - lr * grad(*pos)
        path.append(pos.copy())
    return np.array(path)
def momentum_gd(start, lr, beta, steps=300):
    """
    Momentum GD:
        v ← β·v + (1−β)·∇L(θ)
        θ ← θ − lr·v
    v is a weighted running average of past gradients (exponential moving avg).
    Why it helps:
      - In y: gradients alternate sign → they cancel in v → oscillations damped.
      - In x: gradients share the same sign → they accumulate in v → faster steps.
    β controls memory length. High β → longer memory → more smoothing (and risk
    of overshooting). Low β → shorter memory → closer to vanilla GD.
    """
    path = [np.array(start, dtype=float)]
    pos = np.array(start, dtype=float)
    v = np.zeros(2)
    for _ in range(steps):
        g = grad(*pos)
        v = beta * v + (1 - beta) * g
        pos = pos - lr * v
        path.append(pos.copy())
    return np.array(path)
Running all three conditions
All three experiments start from the same point (−4.0, 1.5), use the same learning rate, and run for 300 steps. The only difference is whether momentum is used and the value of β. Instead of recording just the final position, the full trajectory is saved for each run, which lets us analyze how the optimizer moves over time. Vanilla gradient descent creeps along in a zigzag pattern and reaches a final loss of 0.000015. Momentum with β = 0.90 works as intended, damping the oscillations and building speed along the flat direction, reaching a final loss of 0.000001 in the same number of steps.
Momentum is sensitive to the choice of β, however. If β is set too high (e.g., 0.99), the optimizer accumulates a large velocity with very little decay. This leads to overshooting the minimum and a failure to settle, leaving a high final loss of 0.487363 even after 300 steps. In this regime, the optimizer effectively keeps orbiting the minimum without converging. These results highlight that while momentum can greatly improve convergence, it must be tuned carefully: too little offers no advantage over standard gradient descent, while too much introduces instability.
START = [-4.0, 1.5]
LR = 0.18
STEPS = 300
path_gd = gradient_descent(START, lr=LR, steps=STEPS)
path_mom_good = momentum_gd(START, lr=LR, beta=0.90, steps=STEPS)
path_mom_large = momentum_gd(START, lr=LR, beta=0.99, steps=STEPS)
print(f"Vanilla GD -- final loss: {loss(*path_gd[-1]):.6f}")
print(f"Momentum β=0.90 -- final loss: {loss(*path_mom_good[-1]):.6f}")
print(f"Momentum β=0.99 -- final loss: {loss(*path_mom_large[-1]):.6f} ← diverges")

Visualizing the results
The figure is split into two parts. The top row shows the first 55 steps of each optimizer overlaid on contours of the loss surface, making the movement patterns easy to compare at a glance. The bottom row shows the loss curves over all 300 steps on a log scale, allowing a clear comparison of convergence speed across runs.
From the contour plots, the behavior is immediately clear. Vanilla gradient descent oscillates heavily along the steep direction while making very slow progress toward the minimum. Momentum with β = 0.90 suppresses this oscillation and follows a smooth, nearly straight path. In contrast, β = 0.99 keeps bouncing back and forth with almost no net progress. The loss curves confirm this: vanilla GD decreases steadily but slowly, β = 0.90 drops quickly and cleanly, while β = 0.99 shows repeated spikes from overshooting and fails to converge within the given steps.
PLOT_STEPS = 55
x_ = np.linspace(-5, 5, 500)
y_ = np.linspace(-2.2, 2.2, 500)
X, Y = np.meshgrid(x_, y_)
Z = loss(X, Y)
fig = plt.figure(figsize=(16, 10), facecolor="#FAFAF8")
gs = GridSpec(2, 3, figure=fig, hspace=0.45, wspace=0.38,
              left=0.07, right=0.97, top=0.88, bottom=0.08)
COLORS = {
    "gd": "#E05C4B",
    "mom_good": "#3A7CA5",
    "mom_large": "#F4A536",
    "contour": "#D4C9B8",
    "minima": "#2A9D5C",
    "start": "#444444",
}
PANEL_TITLES = [
    "Vanilla Gradient Descent\nOscillates, slow (185 steps to converge)",
    "Momentum β = 0.90\nSmooth, fast (159 steps to converge)",
    "Momentum β = 0.99 (too large)\nOvershoots -- never converges",
]
paths_plot = [
    path_gd[:PLOT_STEPS+1],
    path_mom_good[:PLOT_STEPS+1],
    path_mom_large[:PLOT_STEPS+1],
]
colors = [COLORS["gd"], COLORS["mom_good"], COLORS["mom_large"]]
# top row: trajectory panels
for col, (path, color, title) in enumerate(zip(paths_plot, colors, PANEL_TITLES)):
    ax = fig.add_subplot(gs[0, col])
    ax.set_facecolor("#F5F3EE")
    levels = np.geomspace(0.005, 3.5, 28)
    ax.contour(X, Y, Z, levels=levels, colors=COLORS["contour"],
               linewidths=0.7, alpha=0.9)
    ax.plot(path[:, 0], path[:, 1], color=color, lw=1.8, alpha=0.85, zorder=3)
    ax.scatter(path[:, 0], path[:, 1], color=color, s=18, zorder=4, alpha=0.6)
    ax.scatter(*path[0], marker="o", s=90, color=COLORS["start"], zorder=5, label="start")
    ax.scatter(*path[-1], marker="*", s=120, color=COLORS["minima"], zorder=5, label="end")
    ax.scatter(0, 0, marker="+", s=200, color=COLORS["minima"], linewidths=2.5, zorder=6)
    ax.set_xlim(-5, 5)
    ax.set_ylim(-2.2, 2.2)
    ax.set_title(title, fontsize=9.5, fontweight="bold", color="#222", pad=7, loc="left")
    ax.set_xlabel("θ₁ (slow direction)", fontsize=8, color="#666")
    ax.set_ylabel("θ₂ (fast direction)", fontsize=8, color="#666")
    ax.tick_params(labelsize=7, colors="#888")
    for spine in ax.spines.values():
        spine.set_edgecolor("#CCCCCC")
# bottom-left: loss curves (full 300 steps)
ax_loss = fig.add_subplot(gs[1, :2])
ax_loss.set_facecolor("#F5F3EE")
full_paths = [path_gd, path_mom_good, path_mom_large]
full_labels = ["Vanilla GD (185 steps)", "Momentum β=0.90 (159 steps)", "Momentum β=0.99 (diverges)"]
for path, color, label in zip(full_paths, colors, full_labels):
    losses = [loss(*p) for p in path]
    steps_range = np.arange(len(path))
    ax_loss.plot(steps_range, losses, color=color, lw=2, label=label, alpha=0.9)
ax_loss.axhline(0.001, color="#999", lw=1, ls="--", alpha=0.6)
ax_loss.text(305, 0.001, "convergence\nthreshold", fontsize=7, color="#888", va="center")
ax_loss.set_yscale("log")
ax_loss.set_xlim(0, STEPS)
ax_loss.set_title("Loss vs. Optimisation Step (log scale, 300 steps)",
                  fontsize=10.5, fontweight="bold", color="#222", loc="left")
ax_loss.set_xlabel("Step", fontsize=9, color="#666")
ax_loss.set_ylabel("Loss f(θ)", fontsize=9, color="#666")
ax_loss.legend(fontsize=8.5, framealpha=0.6)
ax_loss.tick_params(labelsize=8, colors="#888")
for spine in ax_loss.spines.values():
    spine.set_edgecolor("#CCCCCC")
# bottom-right: annotation panel
ax_ann = fig.add_subplot(gs[1, 2])
ax_ann.set_facecolor("#F5F3EE")
ax_ann.axis("off")
annotation = (
    "Update rules\n\n"
    "Vanilla GD\n"
    " θ ← θ − α·∇L(θ)\n\n"
    "Momentum GD\n"
    " v ← β·v + (1−β)·∇L(θ)\n"
    " θ ← θ − α·v\n\n"
    "Key intuition\n"
    " v accumulates past gradients.\n"
    " Vertical oscillations cancel out.\n"
    " Horizontal steps compound.\n\n"
    "Hyperparameter β\n"
    " β → 0 : behaves like GD\n"
    " β = 0.9: typical sweet spot\n"
    " β → 1 : overshoots / diverges"
)
ax_ann.text(0.05, 0.97, annotation, transform=ax_ann.transAxes,
            fontsize=8.8, va="top", ha="left",
            fontfamily="monospace", color="#333", linespacing=1.7)
fig.suptitle("Momentum in Gradient Descent",
             fontsize=16, fontweight="bold", color="#111", y=0.95)
plt.savefig("momentum_explainer.png", dpi=150, bbox_inches="tight",
            facecolor=fig.get_facecolor())
plt.show()

β sensitivity sweep
This experiment runs the momentum optimizer several times with different β values, each for up to 500 steps. In every run, it records the first step at which the loss drops below 0.001 as the convergence point. β = 0 serves as the baseline, since it removes the momentum effect and behaves exactly like vanilla gradient descent.
The results show a clear pattern. As β increases from 0.0 to 0.95, convergence improves, with fewer steps required each time. This happens because higher values of β smooth the oscillations more effectively and build useful momentum along the flat direction. At β = 0.99, however, performance collapses. The optimizer leans so heavily on past gradients that it overshoots repeatedly, delaying convergence. Overall this creates an inverted U-shaped relationship: moderate values of β (roughly 0.9–0.95) give the best performance, while very high values can severely impair convergence.
THRESHOLD = 0.001
betas = [0.0, 0.5, 0.7, 0.85, 0.90, 0.95, 0.99]
print(f"n── β Sensitivity (steps to loss < {THRESHOLD}) ───────────────")
print(f"{'β':>6} {'steps':>10} note")
print("─" * 46)
for b in betas:
path = momentum_gd(START, lr=LR, beta=b, steps=500)
losses = [loss(*p) for p in path]
hit = next((i for i, l in enumerate(losses) if l < THRESHOLD), None)
note = ""
if b == 0.0: note = "← equivalent to vanilla GD"
elif b == 0.90: note = "← typical sweet spot"
elif b == 0.99: note = "← overshoots / diverges"
status = f"{hit:>6} steps" if hit else " did not converge"
print(f"{b:>6.2f} {status} {note}")




