# Adding L1/L2 regularization in PyTorch?

Is there any way I can add simple L1/L2 regularization in PyTorch? We can probably compute the regularized loss by simply adding the `data_loss` to the `reg_loss`, but is there any explicit way, any support from the PyTorch library, to do it more easily without doing it manually?

The following should help for L2 regularization:

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
```

This is presented in the PyTorch documentation: you can add an L2 penalty via the `weight_decay` parameter of the optimizer.
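As a hedged sanity check (not from the PyTorch docs, just an illustration), plain SGD with `weight_decay` behaves like adding `weight_decay * w` to the gradient before the update:

```python
import torch

torch.manual_seed(0)
w = torch.nn.Parameter(torch.randn(3))
opt = torch.optim.SGD([w], lr=0.1, weight_decay=0.01)

w_before = w.detach().clone()
loss = (w ** 2).sum()            # stand-in data loss
loss.backward()
grad = w.grad.detach().clone()   # gradient of the data loss alone
opt.step()

# Plain SGD with weight decay updates w to: w - lr * (grad + weight_decay * w)
expected = w_before - 0.1 * (grad + 0.01 * w_before)
print(torch.allclose(w.detach(), expected))  # True
```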

Previous answers, while technically correct, are inefficient performance-wise and are not very modular (hard to apply on a per-layer basis, as provided by, say, `keras` layers).

## PyTorch L2 implementation

Why did PyTorch implement `L2` inside `torch.optim.Optimizer` instances?

Let’s take a look at the `torch.optim.SGD` source code (currently implemented as a functional optimization procedure), especially this part:

```python
for i, param in enumerate(params):
    d_p = d_p_list[i]
    # L2 weight decay specified HERE!
    if weight_decay != 0:
        d_p = d_p.add(param, alpha=weight_decay)
```
• One can see that `d_p` (the derivative of the parameter, i.e. its gradient) is modified and re-assigned for faster computation (no temporary variables are saved)
• It has `O(N)` complexity without any complicated math like `pow`
• It does not involve `autograd`, so the graph is not extended unnecessarily

Compare that with the manual approach of adding the penalty to the loss: `O(N)` `** 2` operations plus additions, all of which also take part in backpropagation.

## Math

Let’s see the `L2` equation with `alpha` regularization factor (the same could be done for L1, of course):

```
loss_total(w) = loss(w) + alpha * ||w||^2
```

If we take the derivative of any loss with `L2` regularization w.r.t. the parameters `w` (it is independent of the loss itself), we get:

```
d(loss_total)/dw = d(loss)/dw + 2 * alpha * w
```

So it is simply an addition of `alpha * weight` (with the constant factor `2` absorbed into `alpha`) to the gradient of every weight, and this is exactly what PyTorch does above!
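A hedged autograd sanity check of that derivative (using the `0.5 * alpha` convention so the gradient contribution is exactly `alpha * w`):

```python
import torch

# With penalty 0.5 * alpha * ||w||^2, the gradient contribution is alpha * w
alpha = 0.1
w = torch.randn(5, requires_grad=True)
penalty = 0.5 * alpha * (w ** 2).sum()
penalty.backward()
print(torch.allclose(w.grad, alpha * w.detach()))  # True
```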

## L1 Regularization layer

Using this (and some PyTorch magic), we can come up with quite a generic L1 regularization layer. But first, let’s look at the derivative of `L1` (`sgn` is the signum function, returning `1` for positive input, `-1` for negative and `0` for `0`):

```
d(|w|)/dw = sgn(w)
```

Full code with a `WeightDecay` interface is located in the torchlayers third-party library, which provides things like regularizing only weights/biases/specifically named parameters (disclaimer: I’m the author), but the essence of the idea is outlined below (see comments):

```python
class L1(torch.nn.Module):
    def __init__(self, module, weight_decay):
        super().__init__()
        self.module = module
        self.weight_decay = weight_decay

        # Backward hook is registered on the specified module
        self.hook = self.module.register_full_backward_hook(self._weight_decay_hook)

    # Not dependent on backprop incoming values, placeholder
    def _weight_decay_hook(self, *_):
        for param in self.module.parameters():
            # If there is no gradient or it was zeroed out
            # Zeroed out using optimizer.zero_grad() usually
            # Turn on if needed with grad accumulation/a safer way
            if param.grad is None or torch.all(param.grad == 0.0):
                # Apply regularization on it
                param.grad = self.regularize(param)

    def regularize(self, parameter):
        # L1 regularization formula
        return self.weight_decay * torch.sign(parameter.data)

    def forward(self, *args, **kwargs):
        # Simply forward args and kwargs to the wrapped module
        return self.module(*args, **kwargs)
```

And usage is also pretty simple (should work with gradient accumulation and PyTorch layers):

```python
layer = L1(torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3))
```

## Side note

As a side note, `L1` regularization is not implemented in PyTorch optimizers, as it does not actually induce sparsity (lost citation — it was some GitHub issue on the PyTorch repo, I think; if anyone has it, please edit), understood as weights being exactly equal to zero.

More often, weight values are thresholded (simply assigned zero) if they fall below some small predefined magnitude (say `0.001`).
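A short sketch of that thresholding idea (the layer and the `1e-3` cutoff are just illustrative values, not from any particular library API):

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 4)

# Zero out every weight/bias entry whose magnitude is below the threshold
threshold = 1e-3
with torch.no_grad():
    for param in layer.parameters():
        param[param.abs() < threshold] = 0.0
```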

For L2 regularization:

```python
l2_lambda = 0.01
l2_reg = torch.tensor(0.)
for param in model.parameters():
    l2_reg += torch.norm(param)
loss += l2_lambda * l2_reg
```

Note that `torch.norm(param)` is the unsquared norm, so this penalizes `||w||` rather than the classic squared `||w||^2`; use `param.pow(2).sum()` if you want the latter.

For L1 regularization, including `weight` parameters only:

```python
L1_reg = torch.tensor(0., requires_grad=True)
for name, param in model.named_parameters():
    if 'weight' in name:
        L1_reg = L1_reg + torch.norm(param, 1)

total_loss = total_loss + 10e-4 * L1_reg
```

## L2 regularization out-of-the-box

Yes, PyTorch optimizers have a parameter called `weight_decay` which corresponds to the L2 regularization factor:

```python
sgd = torch.optim.SGD(model.parameters(), weight_decay=weight_decay)
```

## L1 regularization implementation

There is no analogous argument for L1; however, it is straightforward to implement manually:

```python
loss = loss_fn(outputs, labels)
l1_lambda = 0.001
# Elementwise L1: sum of absolute values of every parameter
# (torch.linalg.norm(p, 1) on a 2-D weight would be the matrix 1-norm instead)
l1_norm = sum(p.abs().sum() for p in model.parameters())

loss = loss + l1_lambda * l1_norm
```

The equivalent manual implementation of L2 would be:

```python
# Squared L2 penalty; torch.linalg.norm(p, 2) on a 2-D weight would be
# the spectral norm, which is not what is wanted here
l2_norm = sum(p.pow(2.0).sum() for p in model.parameters())
```

Source: Deep Learning with PyTorch (8.5.2)

Interestingly, `torch.norm` is slower on CPU and faster on GPU compared with the direct approach.

```python
import torch
x = torch.randn(1024, 100)
y = torch.randn(1024, 100)

%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)
```

Out:

```
1000 loops, best of 3: 910 µs per loop
1000 loops, best of 3: 1.76 ms per loop
```

On the other hand:

```python
import torch
x = torch.randn(1024, 100).cuda()
y = torch.randn(1024, 100).cuda()

%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)
```

Out:

```
10000 loops, best of 3: 50 µs per loop
10000 loops, best of 3: 26 µs per loop
```

The answers/resolutions are collected from Stack Overflow and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0.