When you train a neural network, the order of training examples matters more than we'd like to admit. A recent analysis treats each training example's gradient update as a vector field on parameter space and computes their Lie brackets, a tool from differential geometry that measures how much swapping the order of two examples changes where your parameters end up. The result is a difference of order ε² in the learning rate, normally invisible but structurally real.
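A minimal numerical sketch of the effect, using two synthetic quadratic per-example losses (the losses, matrices, and learning rates here are illustrative stand-ins, not the paper's setup): taking an SGD step on example 1 then example 2 lands somewhere different than the reverse order, and the gap shrinks like ε².

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5

# Two synthetic "examples", each a quadratic loss L_i(t) = 0.5 (t - c_i)^T A_i (t - c_i)
A1 = np.diag(rng.uniform(0.5, 2.0, size=dim)); c1 = rng.normal(size=dim)
A2 = np.diag(rng.uniform(0.5, 2.0, size=dim)); c2 = rng.normal(size=dim)

def step(theta, A, c, eps):
    # One SGD step on a single example's loss: theta <- theta - eps * grad L
    return theta - eps * A @ (theta - c)

def order_gap(theta0, eps):
    # Train on example 1 then 2, versus 2 then 1, from the same start point
    ab = step(step(theta0, A1, c1, eps), A2, c2, eps)
    ba = step(step(theta0, A2, c2, eps), A1, c1, eps)
    return np.linalg.norm(ab - ba)

theta0 = rng.normal(size=dim)
gap_big = order_gap(theta0, 1e-2)
gap_small = order_gap(theta0, 5e-3)
print(gap_big / gap_small)  # ~4.0: halving eps quarters the gap, i.e. O(eps^2)
```

For quadratic losses the two-step maps are affine, so the gap is exactly ε² times a fixed vector and the ratio comes out exactly 4; for a real network the scaling holds only to leading order.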
The experiment ran an MXResNet on CelebA's 40 binary attributes with Adam. At various checkpoints, the author computed Lie brackets between pairs of test examples and found something clean: bracket magnitudes track gradient magnitudes closely. Bigger gradients mean bigger non-commutativity. The structure also factorizes: how strongly a pair of examples fails to commute is largely independent of which parameter tensor you measure it in.
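For gradient-flow fields X_i(θ) = −∇L_i(θ), the Lie bracket has the closed form [X_i, X_j] = H_j∇L_i − H_i∇L_j, where H_i is the Hessian of L_i. A hedged sketch with synthetic quadratic losses (the matrices and names are illustrative, not the paper's computation) makes the gradient-magnitude link concrete: scaling one example's loss scales its gradient, and the bracket magnitude scales right along with it.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 5

def rand_spd():
    # Random symmetric positive-definite Hessian for a quadratic loss
    M = rng.normal(size=(dim, dim))
    return M @ M.T / dim + np.eye(dim)

# Per-example quadratic losses L_i(t) = 0.5 (t - c_i)^T H_i (t - c_i)
H1, H2 = rand_spd(), rand_spd()
c1, c2 = rng.normal(size=dim), rng.normal(size=dim)

def bracket(Ha, ca, Hb, cb, theta):
    # Lie bracket of the two gradient-flow fields: [X_a, X_b] = H_b g_a - H_a g_b
    ga = Ha @ (theta - ca)
    gb = Hb @ (theta - cb)
    return Hb @ ga - Ha @ gb

theta = rng.normal(size=dim)
b = bracket(H1, c1, H2, c2, theta)

# Doubling example 1's loss doubles its gradient and its Hessian,
# and the bracket magnitude doubles with it:
b_scaled = bracket(2 * H1, c1, H2, c2, theta)
ratio = np.linalg.norm(b_scaled) / np.linalg.norm(b)
print(ratio)  # 2.0
```

The linearity of the bracket in each field is what ties "bigger gradients" to "bigger non-commutativity" here; the paper's empirical correlation is the network-scale version of this relationship.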
Two attributes kept showing up as problematic: Black_Hair and Brown_Hair. They're mutually exclusive in CelebA, but the model predicts each with an independent sigmoid. When it's uncertain between the two (lighting makes this common), the model can't express a 50-50 split over only the valid combinations: independent 50% predictions assign 25% to each of the four joint outcomes, including the impossible one where both are true. The Lie brackets caught this loss-function inadequacy by showing large logit perturbations for these two features past checkpoint 600.
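The failure mode is easy to see numerically. A toy sketch (the zero logits are illustrative of a maximally uncertain model, not taken from the paper) of what independent sigmoid heads imply about the joint distribution:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A model torn between the two hair colors emits near-zero logits for both,
# so each independent sigmoid head outputs ~0.5.
p_black = sigmoid(0.0)
p_brown = sigmoid(0.0)

# Joint probabilities implied by the independence assumption:
joint = {
    ("black", "brown"): p_black * p_brown,  # the impossible combination
    ("black", "not_brown"): p_black * (1 - p_brown),
    ("not_black", "brown"): (1 - p_black) * p_brown,
    ("not_black", "not_brown"): (1 - p_black) * (1 - p_brown),
}
for combo, p in joint.items():
    print(combo, p)  # every combination gets 0.25, including black+brown
```

A softmax over mutually exclusive hair colors would let the model put 50% on each valid option and 0% on the contradiction; independent per-attribute sigmoids structurally cannot.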
This extends Baptiste Dherin's 2023 work at Google DeepMind connecting Lie brackets to the implicit biases of SGD. The most immediate application is diagnostic: persistently high non-commutativity in certain features during training signals that your problem setup has issues. Some community discussion suggested batch-filtering strategies based on this measurement, though that remains speculative.