Previous answers, while technically correct, are inefficient performance-wise and are not very modular (hard to apply on a per-layer basis, as provided by, say, Keras layers).
Let's take a look at the torch.optim.SGD source code (currently implemented as a functional optimization procedure), especially this part:
for i, param in enumerate(params):
    d_p = d_p_list[i]
    # L2 weight decay specified HERE!
    if weight_decay != 0:
        d_p = d_p.add(param, alpha=weight_decay)
One can see that d_p (the derivative of the parameter, i.e. its gradient) is modified and re-assigned for faster computation (without saving temporary variables).
It has O(N) complexity without any complicated math like pow.
It does not involve autograd, so the graph is not extended without any need.
Compare that to the loss-based approach, where the extra pow, summation and addition operations are recorded by autograd and also take part in backpropagation (see the sketch below).
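For illustration, a minimal sketch of both approaches (the model, data and hyperparameter values are made up; only the mechanism matters):

import torch

model = torch.nn.Linear(10, 1)
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()
alpha = 1e-4

# Loss-based L2: pow, summation and addition become part of the autograd graph
# and are differentiated during backward()
loss = loss_fn(model(inputs), targets)
loss = loss + (alpha / 2) * sum(p.pow(2).sum() for p in model.parameters())
loss.backward()

# Optimizer-based L2: alpha * param is added directly to the gradient,
# with no extra nodes in the graph (this is the SGD code shown above)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=alpha)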
Math
Let's see the L2 equation with the alpha regularization factor (the same could be done for L1, of course):
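In my notation (E is the original loss, w the parameters, alpha the regularization factor; the 1/2 is just a convenience so the derivative comes out clean), the regularized objective reads:

$$E_{\text{total}}(w) = E(w) + \frac{\alpha}{2} \sum_i w_i^2$$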
If we take the derivative of any loss with L2 regularization w.r.t. the parameters w (the regularization term's derivative is independent of the loss), we get:
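Using the notation above:

$$\frac{\partial E_{\text{total}}}{\partial w_i} = \frac{\partial E}{\partial w_i} + \alpha w_i$$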
So it is simply an addition of alpha * weight to the gradient of every weight, and this is exactly what PyTorch does above!
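A quick sanity check of that claim (a minimal sketch; the tensor shape and alpha value are arbitrary):

import torch

alpha = 0.01
w = torch.randn(5, requires_grad=True)

# Gradient of the (alpha / 2) * ||w||^2 penalty alone
penalty = (alpha / 2) * w.pow(2).sum()
penalty.backward()

# Matches alpha * w, i.e. exactly what weight_decay=alpha adds to the gradient
assert torch.allclose(w.grad, alpha * w.detach())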
L1 Regularization layer
Using this (and some PyTorch magic), we can come up with a quite generic L1 regularization layer, but let's first look at the derivative of L1 (sgn is the signum function, returning 1 for positive input, -1 for negative and 0 for 0):
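In the same notation, for the L1 penalty:

$$\frac{\partial}{\partial w_i}\left(\alpha \sum_j |w_j|\right) = \alpha \, \mathrm{sgn}(w_i)$$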
Full code with a WeightDecay interface is located in the torchlayers third-party library, which provides things like regularizing only weights/biases/specifically named parameters (disclaimer: I'm the author), but the essence of the idea is outlined below (see comments):
import torch


class L1(torch.nn.Module):
    def __init__(self, module, weight_decay):
        super().__init__()
        self.module = module
        self.weight_decay = weight_decay

        # Backward hook is registered on the specified module
        self.hook = self.module.register_full_backward_hook(self._weight_decay_hook)

    # Not dependent on backprop incoming values, placeholder
    def _weight_decay_hook(self, *_):
        for param in self.module.parameters():
            # If there is no gradient or it was zeroed out
            # (zeroed out usually via optimizer.zero_grad()),
            # turn this check on if needed, e.g. with grad accumulation / for a safer way:
            # if param.grad is None or torch.all(param.grad == 0.0):

            # Apply regularization on it
            param.grad = self.regularize(param)

    def regularize(self, parameter):
        # L1 regularization formula
        return self.weight_decay * torch.sign(parameter.data)

    def forward(self, *args, **kwargs):
        # Simply forward args and kwargs to the wrapped module
        return self.module(*args, **kwargs)
Read more about hooks in this answer or in the respective PyTorch docs if needed.
And usage is also pretty simple (it should work with gradient accumulation and with PyTorch layers):
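For example, a minimal sketch (the wrapped layer type and the weight_decay value are arbitrary choices of mine):

import torch

# Wrap any module; the backward hook takes care of the regularization
layer = L1(torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3),
           weight_decay=1e-5)

inputs = torch.randn(1, 3, 28, 28, requires_grad=True)
layer(inputs).sum().backward()  # triggers _weight_decay_hook on the wrapped Conv2d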
Also, as a side note, L1 regularization is not implemented in PyTorch, as it does not actually induce sparsity (lost citation; it was some GitHub issue on the PyTorch repo, I think; if anyone has it, please edit), where sparsity is understood as weights being equal to zero.
More often, weight values are thresholded (simply assigned a value of zero) if their magnitude falls below some small predefined value (say 0.001).
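A minimal sketch of such thresholding (the helper name and the 0.001 cutoff are just illustrative):

import torch

def threshold_weights(model: torch.nn.Module, cutoff: float = 0.001):
    # Set weights whose magnitude is below the cutoff to exactly zero
    with torch.no_grad():
        for param in model.parameters():
            param[param.abs() < cutoff] = 0.0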