
Loss=nan when training transformertrainer #68

Open · HenryLhc opened this issue Mar 12, 2024 · 5 comments

@HenryLhc

I used the code in the Jupyter notebook provided by @MarcusLoppe in the discussion section, and successfully trained the autoencoder to a loss of 0.6. However, when I tried to proceed to the next section, the training loss remained high, and a few steps later it became nan. Is this due to a problem in the data augmentation, or is the data itself not suitable for this method? I'm using scanned meshes of concrete aggregates instead of artificially built meshes.
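A quick way to narrow this down is to catch the first non-finite loss and inspect the batch that produced it. This is only a minimal sketch assuming a plain PyTorch loop rather than the library's trainer; model, optimizer and dataloader are placeholders for your own setup.

import math
import torch

# Minimal sketch: stop at the first non-finite loss so the offending batch can be inspected.
for step, batch in enumerate(dataloader):
    loss = model(**batch)  # however your setup computes the loss
    if not math.isfinite(loss.item()):
        print(f"non-finite loss at step {step}: {loss.item()}")
        break  # inspect `batch` here
    optimizer.zero_grad()
    loss.backward()
    # clipping gradients often helps when the loss is unstable rather than already nan
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()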

@MarcusLoppe
Contributor

@HenryLhc

I've only encountered this issue when training on a very small dataset, i.e. too few models.
It's probably something to do with the nature of the probabilities in the transformer: if the letter A always comes before B in a given context, the probability is 100%, but the tensors don't store percentages, they store float values. So somewhere in the model a probability-related value probably gets too big and ends up above some kind of max value.
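A minimal illustration of how an over-confident prediction can blow up numerically: exponentiating very large logits overflows float32 to inf, and dividing inf by inf gives nan, whereas the numerically stable softmax does not.

import torch

# Naive softmax on very large logits overflows float32 and produces nan,
# while the stable version (which subtracts the max logit first) does not.
logits = torch.tensor([1000.0, 999.0, 0.0])

naive = torch.exp(logits) / torch.exp(logits).sum()
stable = torch.softmax(logits, dim=-1)

print(naive)   # tensor([nan, nan, 0.]) -- exp(1000) is inf in float32
print(stable)  # tensor([0.7311, 0.2689, 0.0000])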

So how many different models are you using in the dataset?

@HenryLhc
Author

I have 100 original meshes, each augmented 20 times in the file-reading process. I'm looking for a reconstruction technique for concrete aggregates, which are fundamentally random rocks with varied shapes, as you can see from the example here. I think the key difference is that aggregates are more sphere-like and therefore have fewer sharp angles, which are generally considered more representative and to contain more characteristics; the average number of faces is around 150-200.

If the lack of models is a potential reason, would you suggest that I increase the volume of the dataset under these conditions? Or do you think that this kind of data/shape is not very suitable for meshgpt to operate with?
[image: example scanned aggregate mesh]

@MarcusLoppe
Contributor

MarcusLoppe commented Mar 12, 2024

> I have 100 original meshes, each augmented 20 times in the file-reading process. I'm looking for a reconstruction technique for concrete aggregates, which are fundamentally random rocks with varied shapes, as you can see from the example here. I think the key difference is that aggregates are more sphere-like and therefore have fewer sharp angles, which are generally considered more representative and to contain more characteristics; the average number of faces is around 150-200.
>
> If the lack of models is a potential reason, would you suggest that I increase the volume of the dataset under these conditions? Or do you think that this kind of data/shape is not very suitable for meshgpt to operate with?

I don't think it's about the shapes or the nature of the shapes; the dataset just isn't very big, so consider augmenting them at least x50 or x100.
I encountered this type of issue when I was training on 50 models x 50 augments, but when I scaled up to 300-600 models the issue resolved itself.
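If you do augment more aggressively, a simple per-copy transform is usually enough. Below is a minimal sketch of what x50 augmentation could look like on raw vertex arrays; the rotation/scale/jitter ranges are only illustrative, not values from the notebook.

import numpy as np

def augment_mesh(vertices: np.ndarray, copies: int = 50) -> list:
    # Return `copies` augmented versions of a (V, 3) vertex array.
    # Faces stay unchanged; vertices are rotated around the up-axis,
    # uniformly scaled and slightly jittered. Ranges are illustrative.
    out = []
    for _ in range(copies):
        angle = np.random.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(angle), np.sin(angle)
        rot = np.array([[c, 0.0, -s],
                        [0.0, 1.0, 0.0],
                        [s, 0.0, c]])
        scale = np.random.uniform(0.9, 1.1)
        jitter = np.random.normal(0.0, 0.005, size=vertices.shape)
        out.append(vertices @ rot.T * scale + jitter)
    return out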

I've seen some surprising results from scaling up the dataset size and the model size.
For example, it took 30+ hrs of training for the autoencoder to reach 0.48 loss, but after I added 2 encoder & 4 decoder local attention layers it reached the same loss in 12 hrs.

The same goes for the training data: comparing a "small" dataset of 800 meshes to 14k meshes, the autoencoder only took a few more hours to reach the same loss on the larger dataset.

So consider feeding it more models. I've just started training the transformer on a large dataset, since the transformer takes considerably longer to train, e.g. 218k meshes take 4.5 hrs per epoch vs 1 hr with the autoencoder.
But I believe the transformer will benefit from the same scaling law.

At least the autoencoder can deal with many types of shapes; as you can see from the reconstructed meshes below, it was able to reconstruct the petals of the flower and the diamond-shaped blobs.

Try checking how well the autoencoder can reconstruct the meshes before you re-train it. You can find the render function in my mesh_render.py.

[images: example reconstructions, including flower petals and diamond-shaped blobs]

import torch
import random
from tqdm import tqdm

min_mse, max_mse = float('inf'), float('-inf')
min_coords, min_orgs, max_coords, max_orgs = None, None, None, None
random_samples, random_samples_pred, all_random_samples = [], [], []
total_mse = 0.0

random.shuffle(dataset.data)
sample_size = 200

for item in tqdm(dataset.data[:sample_size]):
    # Encode the mesh to codes, then decode it back to face coordinates
    codes = autoencoder.tokenize(
        vertices = item['vertices'],
        faces = item['faces'],
        face_edges = item['face_edges']
    )
    codes = codes.flatten().unsqueeze(0)
    codes = codes[:, :codes.shape[-1] // 2 * 2]  # keep an even number of codes

    coords, mask = autoencoder.decode_from_codes_to_faces(codes)
    orgs = item['vertices'][item['faces']].unsqueeze(0)  # ground-truth face coordinates

    # Per-mesh reconstruction error
    mse = torch.mean((orgs.view(-1, 3).cpu() - coords.view(-1, 3).cpu()) ** 2)
    total_mse += mse

    if mse < min_mse:
        min_mse, min_coords, min_orgs = mse, coords, orgs

    if mse > max_mse:
        max_mse, max_coords, max_orgs = mse, coords, orgs

    # Collect rows of up to 30 ground-truth/reconstruction pairs for rendering
    if len(random_samples) <= 30:
        random_samples.append(coords)
        random_samples_pred.append(orgs)
    else:
        all_random_samples.append(random_samples_pred)
        all_random_samples.append(random_samples)
        random_samples = []
        random_samples_pred = []

print(f'MSE Min: {min_mse:.10f}, Max: {max_mse:.10f}, Avg: {total_mse / sample_size:.10f}')
combind_mesh_with_rows('/kaggle/working/mse_rows3.obj', all_random_samples)

@MarcusLoppe
Contributor

Hi again, to avoid retraining from scratch, here is a pre-trained model that will give you a low loss after just a couple of fine-tuning epochs.

Use "mesh-autoencoder_encoder_4_decoder_8_0.36", which is in:
https://drive.google.com/drive/folders/1C1l5QrCtg9UulMJE5n_on4A9O9Gn0CC5

from meshgpt_pytorch import MeshAutoencoder

num_layers = 23

autoencoder = MeshAutoencoder(
    decoder_dims_through_depth = (128,) * 3 + (192,) * 4 + (256,) * num_layers + (384,) * 3,
    dim_codebook = 192,
    codebook_size = 16384,
    dim_area_embed = 16,
    dim_coor_embed = 16,
    dim_normal_embed = 16,
    dim_angle_embed = 8,
    attn_decoder_depth = 8,
    attn_encoder_depth = 4
).to("cuda")

@HenryLhc
Author

Thanks for the advice! After some modification, the encoder and transformer can produce a satisfactory outcome compared to the data in the training dataset. However, when I try to generate meshes from codes, I find that regardless of the percentage of codes I provide, the generated meshes all have roughly the same shape. Is this normal?
Could you please share some outcomes from the part where you prompt with 0-80% of the codes? Since I only had one type of mesh, I didn't use the text embedding to categorize the meshes, and all the meshes generated with a 0% prompt are exactly the same.
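For reference, prompting with a fraction of a mesh's codes could look roughly like the sketch below; the generate signature (prompt, temperature, return format) varies between meshgpt-pytorch versions, so check it against your install before relying on it.

# Sketch: prompt the transformer with the first N% of one mesh's codes.
item = dataset.data[0]
codes = autoencoder.tokenize(
    vertices = item['vertices'],
    faces = item['faces'],
    face_edges = item['face_edges']
).flatten().unsqueeze(0)

for pct in (0.0, 0.2, 0.5, 0.8):
    n = int(codes.shape[-1] * pct)
    prompt = codes[:, :n] if n > 0 else None
    # verify the argument names and return format against your version of the library
    output = transformer.generate(prompt = prompt, temperature = 0.0)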
