
When running on Apple GPU (MPS), the loss is always nan. #199

Open
CaSiOFT opened this issue May 15, 2024 · 26 comments

Comments

@CaSiOFT

CaSiOFT commented May 15, 2024

I am using an M1 Pro Mac with the latest official PyTorch build that supports MPS. I have not had similar problems running other models on this GPU.
When I run the example from the beginning of the official documentation, the results are normal on the CPU, but on MPS the loss is always nan.

import kan
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))
# device = torch.device("cpu")

# create a KAN: 2D inputs, 1D output, and 5 hidden neurons. cubic spline (k=3), 5 grid intervals (grid=5).
model = kan.KAN(width=[2, 5, 1], grid=5, k=3, seed=0, device=device)

# create dataset f(x,y) = exp(sin(pi*x)+y^2)
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = kan.create_dataset(f, n_var=2, device=device)
print(dataset['train_input'].shape, dataset['train_label'].shape)
# plot KAN at initialization
model(dataset['train_input'])
model.plot(beta=100)
# train the model
model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10., device=device)
model.plot()

The above is my code. When I run it, exceptions sometimes occur during training, and plotting the model structure can also fail with ValueError: alpha (nan) is outside 0-1 range. Even when no error is raised, the plotted graph differs significantly from the CPU run: the lines are noticeably thinner and the function curves inside the nodes fluctuate abnormally.
[attached image: output of model.plot() on MPS]
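A quick way to confirm where the nan first appears is to check the forward output and each KANLayer's spline coefficients (a diagnostic sketch, assuming the act_fun and coef attribute names of the pykan version used above):

import torch

# Diagnostic sketch: look for non-finite values in the forward output and in
# each layer's spline coefficients (attribute names assumed from this pykan version).
out = model(dataset['train_input'])
print("non-finite values in output:", (~torch.isfinite(out)).any().item())

for i, layer in enumerate(model.act_fun):
    bad = (~torch.isfinite(layer.coef)).any().item()
    print(f"layer {i}: non-finite values in coef -> {bad}")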

I found a similar report in the issues:
I don't know why, but if I use MPS (Apple Silicon), the loss is nan.

model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10., device=device.type);
train loss: nan | test loss: nan | reg: nan : 100%|█████████████████| 20/20 [00:03<00:00,  5.11it/s]

Originally posted by @brainer3220 in #98 (comment)

@CaSiOFT
Author

CaSiOFT commented May 16, 2024

I ran the test using the same code on another CUDA machine, and the results were completely normal.

@Stealeristaken

I ran the test using the same code on another CUDA machine, and the results were completely normal.

Well, that's weird.

@daguo7

daguo7 commented May 18, 2024

I have the same problem. Did you solve it, bro?

@CaSiOFT
Author

CaSiOFT commented May 18, 2024

I have the same problem. Did you solve it, bro?

No, I don't have any ideas on how to handle this issue. Are you experiencing the exact same problem?

@daguo7

daguo7 commented May 18, 2024

Yes. When I run the last part of hellokan, the train loss and test loss are nan, and the symbolic formula gives the same result.

@Stealeristaken

I have the same problem; I mentioned it in #179.

@CaSiOFT
Author

CaSiOFT commented May 18, 2024

I have the same problem; I mentioned it in #179.

Are you using MPS or CUDA?

@Stealeristaken

I have the same problem; I mentioned it in #179.

Are you using MPS or CUDA?

I'm on plain Apple silicon, so I can't use CUDA.

@daguo7

daguo7 commented May 18, 2024

I'm in the same situation as you.

@Stealeristaken

Can you guys try pruning your model with threshold=2e-1?
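With the pykan API used in this thread, that suggestion would look roughly like this (a sketch; prune() returning a pruned copy of the model is assumed from the hellokan example):

model(dataset['train_input'])          # forward pass so prune() has activation scales
model = model.prune(threshold=2e-1)    # prune with a larger threshold than the default
model(dataset['train_input'])
model.plot()
model.train(dataset, opt="LBFGS", steps=20, device=device)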

@daguo7

daguo7 commented May 18, 2024

Bro, I tried it. The problem seems to have been solved. Thank you!


@Stealeristaken

The problem is caused by the loss function, as the author mentioned, so a lower threshold gives better results but a worse prune.

@wkqian06

There is something wrong with torch.linalg.lstsq in spline.py; it can produce nan when we are not using the CPU. I have added an alternative method for calculating the coef in my branch, with the following setting:

model = KAN(width=[2,5,1], grid=5, k=3, seed=0, device=device_set,
            coef_method='svd',
            )

Hope it works on Apple GPU.
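For context, the idea of replacing torch.linalg.lstsq with an SVD-based solve could look roughly like the sketch below. This is only an illustration of the technique, not the actual code in that branch; the batched shapes of mat and y are assumed from curve2coef in spline.py.

import torch

def lstsq_via_svd(mat, y, rcond=1e-8):
    # Solve min ||mat @ coef - y|| per batch entry with an SVD pseudo-inverse.
    # mat: (size, batch, n_coef), y: (size, batch, 1) -- shapes assumed from spline.py.
    U, S, Vh = torch.linalg.svd(mat, full_matrices=False)
    # Zero out tiny singular values instead of dividing by them.
    cutoff = rcond * S.max(dim=-1, keepdim=True).values
    S_inv = torch.where(S > cutoff, 1.0 / S, torch.zeros_like(S))
    coef = Vh.transpose(-2, -1) @ (S_inv.unsqueeze(-1) * (U.transpose(-2, -1) @ y))
    return coef.squeeze(-1)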

@brent-halen

There is something wrong with torch.linalg.lstsq in spline.py; it can produce nan when we are not using the CPU. I have added an alternative method for calculating the coef in my branch, with the following setting:

model = KAN(width=[2,5,1], grid=5, k=3, seed=0, device=device_set,
            coef_method='svd',
            )

Hope it works on Apple GPU.

I tried running this on a couple of the examples in my CUDA setup, but I got the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[9], line 1
----> 1 model.train(dataset, opt="LBFGS", steps=20);

Cell In[7], line 888, in KAN.train(self, dataset, opt, steps, log, lamb, lamb_l1, lamb_entropy, lamb_coef, lamb_coefdiff, update_grid, grid_update_num, loss_fn, lr, stop_grid_update_step, batch, small_mag_threshold, small_reg_factor, metrics, sglr_avoid, save_fig, in_vars, out_vars, beta, save_fig_freq, img_folder, device)
    885 test_id = np.random.choice(dataset['test_input'].shape[0], batch_size_test, replace=False)
    887 if _ % grid_update_freq == 0 and _ < stop_grid_update_step and update_grid:
--> 888     self.update_grid_from_samples(dataset['train_input'][train_id].to(device))
    890 if opt == "LBFGS":
    891     optimizer.step(closure)

Cell In[7], line 233, in KAN.update_grid_from_samples(self, x)
    210 '''
    211 update grid from samples
    212 
   (...)
    230 tensor([0.0128, 1.0064, 2.0000, 2.9937, 3.9873, 4.9809])
    231 '''
    232 for l in range(self.depth):
--> 233     self.forward(x)
    234     self.act_fun[l].update_grid_from_samples(self.acts[l])

Cell In[7], line 301, in KAN.forward(self, x)
    297 self.acts.append(x)  # acts shape: (batch, width[l])
    299 for l in range(self.depth):
--> 301     x_numerical, preacts, postacts_numerical, postspline = self.act_fun[l](x)
    303     if self.symbolic_enabled == True:
    304         x_symbolic, postacts_symbolic = self.symbolic_fun[l](x)

File /usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

Cell In[6], line 167, in KANLayer.forward(self, x)
    165 batch = x.shape[0]
    166 # x: shape (batch, in_dim) => shape (size, batch) (size = out_dim * in_dim)
--> 167 x = torch.einsum('ij,k->ikj', x, torch.ones(self.out_dim, device=self.device)).reshape(batch, self.size).permute(1, 0)
    168 preacts = x.permute(1, 0).clone().reshape(batch, self.out_dim, self.in_dim)
    169 base = self.base_fun(x).permute(1, 0)  # shape (batch, size)

File /usr/local/lib/python3.11/dist-packages/torch/functional.py:385, in einsum(*args)
    380     return einsum(equation, *_operands)
    382 if len(operands) <= 2 or not opt_einsum.enabled:
    383     # the path for contracting 0 or 1 time(s) is already optimized
    384     # or the user has disabled using opt_einsum
--> 385     return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
    387 path = None
    388 if opt_einsum.is_available():

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

@wkqian06

I tried running this on a couple of the examples in my CUDA setup, but I got the following error:

You may want to pass a device parameter to train; the default device for train is 'cpu':
model.train(dataset, opt="LBFGS", steps=20, device='cuda')

@CaSiOFT
Author

CaSiOFT commented May 19, 2024

Can you guys try pruning your model with threshold=2e-1?

I tried, but there are still problems. Can you share the code? Thanks.

@CaSiOFT
Author

CaSiOFT commented May 19, 2024

There is something wrong with torch.linalg.lstsq in spline.py; it can produce nan when we are not using the CPU. I have added an alternative method for calculating the coef in my branch, with the following setting:

model = KAN(width=[2,5,1], grid=5, k=3, seed=0, device=device_set,
            coef_method='svd',
            )

Hope it works on Apple GPU.

I tried your code, but the following error occurred:
NotImplementedError: The operator 'aten::linalg_lstsq.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
Unfortunately, it seems that the MPS build of PyTorch does not implement this operation. I followed the instructions and set the environment variable, but it didn't work. Thank you for sharing, though.
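For reference, PYTORCH_ENABLE_MPS_FALLBACK is read when torch initializes, so it generally has to be set before torch is imported. A minimal sketch (a general PyTorch note, not something from this repo):

import os

# Must be set before `import torch`, otherwise the MPS fallback is not enabled.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
import kan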

@Stealeristaken

Stealeristaken commented May 19, 2024

There is something wrong with torch.linalg.lstsq in spline.py; it can produce nan when we are not using the CPU. I have added an alternative method for calculating the coef in my branch, with the following setting:

model = KAN(width=[2,5,1], grid=5, k=3, seed=0, device=device_set,
            coef_method='svd',
            )

Hope it works on Apple GPU.

I tried your code, but the following error occurred: NotImplementedError: The operator 'aten::linalg_lstsq.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS. Unfortunately, it seems that the MPS build of PyTorch does not implement this operation. I followed the instructions and set the environment variable, but it didn't work. Thank you for sharing, though.

Maybe updating your torch version could help; try pip install -U torch and run again.

@Stealeristaken

Can you guys try pruning your model with threshold=2e-1?

I tried, but there are still problems. Can you share the code? Thanks.

I don't have specific code. I do believe this is a problem with the loss function outputs, so pruning with bigger thresholds could help; try more specific values. But it's so weird that this only happens on MPS devices. It scratches my brain, to be honest.

@wkqian06

I tried your code, but the following error occurred: NotImplementedError: The operator 'aten::linalg_lstsq.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS. Unfortunately, it seems that the MPS build of PyTorch does not implement this operation. I followed the instructions and set the environment variable, but it didn't work. Thank you for sharing, though.

It's weird though. I used torch.linalg.svd instead of torch.linalg.lstsq in my code. Not sure why this happens in your case.

I have made some updates in my branch, avoiding any nan, inf, and -inf in coef results, which, at least in my case, works and avoids nan in the loss. This time, just use the default settings for training.
model.train(dataset, opt="LBFGS", steps=20)
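A rough sketch of that kind of guard on the coef results (the branch's actual implementation may differ):

import torch

def sanitize_coef(coef):
    # Replace nan / inf / -inf in the fitted coefficients with finite values.
    return torch.nan_to_num(coef, nan=0.0, posinf=0.0, neginf=0.0)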

@CaSiOFT
Author

CaSiOFT commented May 20, 2024

Can you guys try pruning your model with threshold=2e-1?

I tried, but there are still problems. Can you share the code? Thanks.

I don't have specific code. I do believe this is a problem with the loss function outputs, so pruning with bigger thresholds could help; try more specific values. But it's so weird that this only happens on MPS devices. It scratches my brain, to be honest.

I updated to PyTorch 2.3.0, but unfortunately it didn't help. As for pruning, none of the units in the model produced qualifying results, so pruning had no effect either.
I can only assume there is a problem with PyTorch's MPS implementation, but I have never run into platform issues before, so I am very confused.

@CaSiOFT
Author

CaSiOFT commented May 20, 2024

I tried your code, but the following error occurred: NotImplementedError: The operator 'aten::linalg_lstsq.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS. Unfortunately, it seems that the MPS build of PyTorch does not implement this operation. I followed the instructions and set the environment variable, but it didn't work. Thank you for sharing, though.

It's weird though. I used torch.linalg.svd instead of torch.linalg.lstsq in my code. Not sure why this happens in your case.

I have made some updates in my branch, avoiding any nan, inf, and -inf in coef results, which, at least in my case, works and avoids nan in the loss. This time, just use the default settings for training. model.train(dataset, opt="LBFGS", steps=20)

I re-set PYTORCH_ENABLE_MPS_FALLBACK=1 in the appropriate place, and this time it took effect. Unfortunately, it didn't help much.
I tried your latest code. With coef_method='lstsq' the training result is still nan. With coef_method='svd' it directly reports an error: [MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[1] (10). Remarkably, when set to CPU, both coef_method settings work properly, and SVD performs slightly better.

@wkqian06

I re-set PYTORCH_ENABLE_MPS_FALLBACK=1 in the appropriate place, and this time it took effect. Unfortunately, it didn't help much. I tried your latest code. With coef_method='lstsq' the training result is still nan. With coef_method='svd' it directly reports an error: [MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[1] (10). Remarkably, when set to CPU, both coef_method settings work properly, and SVD performs slightly better.

KAN.grid or KANLayer.coef may be the two main reasons why the loss is nan during training. I found that KAN.grid can become nan during backpropagation even though it is not trainable. Deleting the initialization self.grid = torch.nn.Parameter(...) might help. As for KANLayer.coef, sometimes the initialization itself can be nan, which makes the subsequent training fail. One potential attempt is to replace the initialization self.coef = torch.nn.Parameter(curve2coef(...)) with torch.ones of the same size and monitor the learning of coef to check whether it ever becomes nan.
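To monitor coef the way described above, a check like the following after initialization and after each training step would work (the parameter names come from the pykan version in this thread):

import torch

def report_nonfinite(model, step):
    # Print any parameter (e.g. a KANLayer's coef or grid) that contains nan or inf.
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            print(f"step {step}: non-finite values in {name}")

Calling this right after constructing the KAN shows whether coef is already nan at initialization, before any training happens.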

@CaSiOFT
Author

CaSiOFT commented May 21, 2024

I re-set PYTORCH_ENABLE_MPS_FALLBACK=1 in the appropriate place, and this time it took effect. Unfortunately, it didn't help much. I tried your latest code. With coef_method='lstsq' the training result is still nan. With coef_method='svd' it directly reports an error: [MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[1] (10). Remarkably, when set to CPU, both coef_method settings work properly, and SVD performs slightly better.

KAN.grid or KANLayer.coef may be the two main reasons why the loss is nan during training. I found that KAN.grid can become nan during backpropagation even though it is not trainable. Deleting the initialization self.grid = torch.nn.Parameter(...) might help. As for KANLayer.coef, sometimes the initialization itself can be nan, which makes the subsequent training fail. One potential attempt is to replace the initialization self.coef = torch.nn.Parameter(curve2coef(...)) with torch.ones of the same size and monitor the learning of coef to check whether it ever becomes nan.

As for the error, I found that it consistently occurs during what should be a perfectly normal matrix multiplication in one of the loops of the SVD calculation. A quick search turned up many similar issues, such as pytorch/pytorch#113586 and pytorch/pytorch#96153. It seems to be purely an implementation issue with MPS.

If a deep copy is used for the variables computed around that point, bypassing the MPS error, the error instead appears in svdestimator: torch._C._LinAlgError: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 7). The reason is that the parameters passed to the function are all nan. There is no such problem when using the CPU.

Initializing with a matrix of ones did not help either.

My interim conclusion is that PyTorch's MPS implementation is quite unstable, and all kinds of problems can occur.
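Given that both coef_method settings work on the CPU, one pragmatic workaround (just a sketch on my side, not something from the library) would be to route only the coefficient solve through the CPU and move the result back to MPS:

import torch

def lstsq_on_cpu(mat, y):
    # Run torch.linalg.lstsq on the CPU and return the solution on mat's original device.
    sol = torch.linalg.lstsq(mat.cpu(), y.cpu()).solution
    return sol.to(mat.device)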

@link24tech

Can you guys try pruning your model with threshold=2e-1?

It works for me on my Mac setup.
