Trying to implement PSMNet using your instructions #51

Open
passion3394 opened this issue Jun 17, 2020 · 19 comments

Comments

@passion3394

stereo_net = Nets.get_stereo_net(args.modelName, net_args)

The code above runs fine, but when I start training on scene_flow, I get the error below.
I have tried the following to debug it:
(1) checked the scene_flow input, which looks fine
(2) checked the network, but I could not find any errors

Could you help me find the bug? Thanks very much. If you need more files, I will upload them.

Traceback (most recent call last):
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
[[{{node model/CNN_2/conv0/conv0_1/Conv2D}}]]
(1) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
[[{{node model/CNN_2/conv0/conv0_1/Conv2D}}]]
[[validation_error/truediv_1/_23]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "Train.py", line 191, in <module>
main(args)
File "Train.py", line 140, in main
fetches = sess.run(tf_fetches,options=run_options)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
[[node model/CNN_2/conv0/conv0_1/Conv2D (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:59) ]]
(1) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
[[node model/CNN_2/conv0/conv0_1/Conv2D (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:59) ]]
[[validation_error/truediv_1/_23]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node model/CNN_2/conv0/conv0_1/Conv2D:
model/MirrorPad_2 (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Data_utils/preprocessing.py:28)
model/CNN/conv0/conv0_1/weights/read (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

Input Source operations connected to node model/CNN_2/conv0/conv0_1/Conv2D:
model/MirrorPad_2 (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Data_utils/preprocessing.py:28)
model/CNN/conv0/conv0_1/weights/read (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

Original stack trace for 'model/CNN_2/conv0/conv0_1/Conv2D':
File "Train.py", line 191, in <module>
main(args)
File "Train.py", line 71, in main
val_stereo_net = Nets.get_stereo_net(args.modelName, net_args)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/__init__.py", line 15, in get_stereo_net
return STEREO_FACTORYname
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/PSMNet.py", line 22, in __init__
super(PSMNet, self).__init__(**kwargs)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/Stereo_net.py", line 44, in __init__
self._build_network(args)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/PSMNet.py", line 54, in _build_network
conv4_left = self.CNN(self._left_input_batch)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/PSMNet.py", line 71, in CNN
activation=tf.nn.leaky_relu, batch_norm=True, apply_relu=True,strides=2, name='conv0_1',reuse=reuse))
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py", line 59, in conv2d
x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding=padding,dilations=dilation_rate)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
name=name)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()

@AlessioTonioni
Member

Without seeing the code I cannot provide much help, can you create a pull request with your code?

@passion3394
Author

@AlessioTonioni Sure, I will create a pull request soon.

@passion3394
Author

I have created a pull request, thanks for your help.

@passion3394
Author

@AlessioTonioni Any updates?

@AlessioTonioni
Member

Do the images you are using have 4 channels for some reason, e.g. RGB + alpha?
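One quick way to answer this question without loading anything into TensorFlow is to read the colour type from the image file header. The error message reports an input depth of 4 against a filter depth of 3, which is what an RGBA image fed to a 3-channel first convolution would produce. The sketch below assumes the images are PNG files (the thread does not say which format this copy of the dataset uses); `png_channel_count` and `file_channel_count` are hypothetical helper names, not part of this repository.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Samples per pixel for each PNG colour type, as defined by the PNG specification.
# Type 3 is a palette index; decoders usually expand it to RGB.
PNG_CHANNELS = {0: 1, 2: 3, 3: 1, 4: 2, 6: 4}


def png_channel_count(header: bytes) -> int:
    """Return the number of channels declared in a PNG file's IHDR chunk."""
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG file, or header shorter than 26 bytes")
    # IHDR data starts at byte 16: width (4), height (4), bit depth (1), colour type (1).
    _width, _height, _bit_depth, color_type = struct.unpack(">IIBB", header[16:26])
    if color_type not in PNG_CHANNELS:
        raise ValueError("unknown PNG colour type: %d" % color_type)
    return PNG_CHANNELS[color_type]


def file_channel_count(path: str) -> int:
    """Read only the first 26 bytes of a file, which is enough to reach the colour type."""
    with open(path, "rb") as f:
        return png_channel_count(f.read(26))
```

Calling `file_channel_count` on one left image from the training list should be enough to tell whether the files carry an alpha channel (a return value of 4) or plain RGB (3).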

@passion3394
Author

passion3394 commented Jun 23, 2020

@AlessioTonioni Yes, I use the Scene Flow dataset, and the images contain 4 channels.
I did two things:
(1) In Nets/PSMNet.py, I printed the shape of left_input_batch in the function _preprocess_inputs. It outputs [1,256,512,3], so I think the input is OK.

(2) I tried to train MADNet on this dataset and got the same error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
[[node model/gc-read-pyramid_1/conv1/Conv2D (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:59) ]]
[[training_error/Sum_12/_57]]
(1) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
[[node model/gc-read-pyramid_1/conv1/Conv2D (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:59) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node model/gc-read-pyramid_1/conv1/Conv2D:
model/MirrorPad_1 (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Data_utils/preprocessing.py:28)
model/gc-read-pyramid/conv1/weights/read (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

Input Source operations connected to node model/gc-read-pyramid_1/conv1/Conv2D:
model/MirrorPad_1 (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Data_utils/preprocessing.py:28)
model/gc-read-pyramid/conv1/weights/read (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

Original stack trace for 'model/gc-read-pyramid_1/conv1/Conv2D':
File "Train.py", line 191, in <module>
main(args)
File "Train.py", line 63, in main
stereo_net = Nets.get_stereo_net(args.modelName, net_args)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/__init__.py", line 15, in get_stereo_net
return STEREO_FACTORYname
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/MadNet.py", line 24, in __init__
super(MadNet, self).__init__(**kwargs)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/Stereo_net.py", line 44, in __init__
self._build_network(args)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/MadNet.py", line 260, in _build_network
self._pyramid_features(self._right_input_batch, scope='gc-read-pyramid', reuse=True, layer_prefix='right')
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/MadNet.py", line 183, in _pyramid_features
)[-1].value, 16], strides=2, name='conv1', bName='biases', activation=activation))
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py", line 59, in conv2d
x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding=padding,dilations=dilation_rate)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
name=name)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()

So how can I fix this error?

@AlessioTonioni
Member

If you get the same error with MADNet and (I guess) Dispnet then I think you have either a problem with your data or with the version of tensorflow you are using.
Can you share a stereo couple + gt from the dataset that you are trying to use?
Can you run Stereo_Online_Adaptation.py? Or do you get an error there as well?

@passion3394
Author

passion3394 commented Jun 24, 2020

Firstly, a sample of the stereo couple + gt has been attached. sceneflow_samples.tar.gz

Secondly, I tried to run Stereo_Online_Adaptation.py with the following script:

LIST="/host/nfs/hs/scene_flow/val.list" #the one described at step (1)
OUTPUT="_sf_output/"
WEIGHTS="pretrained/MADNet/synthetic/weights.ckpt"
MODELNAME="MADNet"
BLOCKCONFIG="block_config/MadNet_full.json"

python3 Stereo_Online_Adaptation.py \
    -l ${LIST} \
    -o ${OUTPUT} \
    --weights ${WEIGHTS} \
    --modelName ${MODELNAME} \
    --blockConfig ${BLOCKCONFIG} \
    --mode FULL \
    --imageShape 256 512 \
    --sampleMode PROBABILITY \
    --logDispStep 1

This gives the following error, which is different from the previous one.

WARNING:tensorflow:From /root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists(from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Traceback (most recent call last):
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]
[[{{node save/Assign_75}}]]
[[save/RestoreV2/_42]]
(1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]
[[{{node save/Assign_75}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]
[[node save/Assign_75 (defined at Stereo_Online_Adaptation.py:152) ]]
[[save/RestoreV2/_42]]
(1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]
[[node save/Assign_75 (defined at Stereo_Online_Adaptation.py:152) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node save/Assign_75:
model/gc-read-pyramid/conv1/weights (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

Input Source operations connected to node save/Assign_75:
model/gc-read-pyramid/conv1/weights (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

@AlessioTonioni
Member

"Firstly, a sample of the stereo couple + gt has been attached. sceneflow_samples.tar.gz"

Your images have 4 channels, RGB + alpha. I just pushed a commit that explicitly removes the extra channel after reading the images. Let me know if it fixes your problems.

The second error you are getting seems to be related to the operation that restores the weights of the first conv. I think it is still an issue with the number of channels in the input image (4 with your images rather than 3), so it might be fixed as well. If it still does not work, let me know.
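The commit mentioned above is not shown in this thread, so the following is only a sketch of the idea of removing the extra channel, not the repository's actual change. It uses plain nested lists in height x width x channel layout; `drop_alpha` is a hypothetical name. On an array or tensor with the same layout, the equivalent operation is the slice `image[:, :, :3]`.

```python
def drop_alpha(image):
    """Keep only the first three channels (RGB) of every pixel.

    `image` is a nested list indexed as image[row][column][channel].
    Images that already have three channels come back unchanged.
    """
    return [[pixel[:3] for pixel in row] for row in image]


# A 1 x 2 RGBA image: one opaque red pixel, one half-transparent green pixel.
rgba = [[[255, 0, 0, 255], [0, 255, 0, 128]]]
rgb = drop_alpha(rgba)
print(len(rgb[0][0]))  # number of channels per pixel after the conversion
```

Doing the conversion right after decoding, before any padding or cropping, keeps every later stage of the input pipeline working on 3-channel data.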

@passion3394
Author

The first error, in Train.py, is fixed.

The second error still occurs, and it is the same as before.

@AlessioTonioni
Member

I cannot replicate the error on my side. Which version of TensorFlow are you using?

The error seems to be related to how weights are restored.
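In the restore error, the "lhs shape" is the variable built in the current graph and the "rhs shape" is the value read from the checkpoint, so comparing the two sets of shapes by name shows which variables cannot be restored. A minimal sketch follows, assuming the shapes have already been collected into two dictionaries; `find_shape_mismatches` and both dictionaries are hypothetical. In TensorFlow 1.x the checkpoint side can, to my knowledge, be obtained with `tf.train.list_variables`, but that call is not used here so the example stays self-contained.

```python
def find_shape_mismatches(graph_shapes, checkpoint_shapes):
    """Return {name: (graph_shape, checkpoint_shape)} for variables whose shapes differ.

    Variables missing from the checkpoint are skipped; only shape conflicts are reported.
    """
    mismatches = {}
    for name, graph_shape in graph_shapes.items():
        ckpt_shape = checkpoint_shapes.get(name)
        if ckpt_shape is not None and list(ckpt_shape) != list(graph_shape):
            mismatches[name] = (list(graph_shape), list(ckpt_shape))
    return mismatches


# Shapes taken from the error message in this thread.
graph_shapes = {"model/gc-read-pyramid/conv1/weights": [3, 3, 2, 16]}
checkpoint_shapes = {"model/gc-read-pyramid/conv1/weights": [3, 3, 3, 16]}

for name, (in_graph, in_ckpt) in find_shape_mismatches(graph_shapes, checkpoint_shapes).items():
    print(name, "graph:", in_graph, "checkpoint:", in_ckpt)
```

For a convolution kernel stored as [height, width, in_channels, out_channels], a difference in the third entry points at the number of input channels reaching that layer rather than at the checkpoint file itself.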

@passion3394
Author

I installed tensorflow-gpu with pip. I couldn't install tf1.12, so I installed tf1.14.0.

One more piece of information:
I ran Stereo_Online_Adaptation.py on another dataset whose images have 3 channels, and the result is OK.

@AlessioTonioni
Member

I'm testing with 1.12, the images you sent, and the weights available online, and everything seems to work.

Any other insight into what is happening?

@passion3394
Author

So far, the strange thing about this question is:
(1) Running Stereo_Online_Adaptation.py with 4-channel images fails with:
Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]

(2) Running Stereo_Online_Adaptation.py with 3-channel images works fine.

I suspect the cause is that the checkpoint was saved from training on 3-channel images. I will convert the 4-channel images to 3-channel images to verify this.
==============================split line=====================
Meanwhile, I continued training PSMNet on the Scene Flow dataset and got NaN losses. I suspect the NaN issue lies in the PSMNet network. Do you have any suggestions?

Step:22800 Loss:17.37 f/b time:0.603486 Missing time:1 day, 1:53:40.462395
Step:22900 Loss:33.37 f/b time:0.602048 Missing time:1 day, 1:48:58.175722
Step:23000 Loss:36.77 f/b time:0.601132 Missing time:1 day, 1:45:36.700135
Step:23100 Loss:28.94 f/b time:0.596681 Missing time:1 day, 1:33:10.308483
Step:23200 Loss:18.60 f/b time:0.598654 Missing time:1 day, 1:37:14.649814
Step:23300 Loss:17.36 f/b time:0.595730 Missing time:1 day, 1:28:44.552802
Step:23400 Loss:24.45 f/b time:0.596634 Missing time:1 day, 1:30:04.066876
Step:23500 Loss:28.62 f/b time:0.598686 Missing time:1 day, 1:34:19.910125
Step:23600 Loss:63.38 f/b time:0.599219 Missing time:1 day, 1:34:41.985614
Step:23700 Loss:66.96 f/b time:0.599685 Missing time:1 day, 1:34:53.615409
Step:23800 Loss:104.30 f/b time:0.599060 Missing time:1 day, 1:32:17.753413
Step:23900 Loss:51.45 f/b time:0.595160 Missing time:1 day, 1:21:19.704973
Step:24000 Loss:31.38 f/b time:0.598833 Missing time:1 day, 1:29:43.103316
Step:24100 Loss:18.41 f/b time:0.601884 Missing time:1 day, 1:36:30.574517
Step:24200 Loss:17.72 f/b time:0.596067 Missing time:1 day, 1:20:39.928160
Step:24300 Loss:64.83 f/b time:0.591668 Missing time:1 day, 1:08:27.451441
Step:24400 Loss:94.23 f/b time:0.594829 Missing time:1 day, 1:15:31.437283
Step:24500 Loss:20.95 f/b time:0.599308 Missing time:1 day, 1:25:56.208433
Step:24600 Loss:15.28 f/b time:0.598947 Missing time:1 day, 1:24:01.202895
Step:24700 Loss:21.73 f/b time:0.596817 Missing time:1 day, 1:17:36.314750
Step:24800 Loss:17.40 f/b time:0.590698 Missing time:1 day, 1:01:03.701733
Step:24900 Loss:nan f/b time:0.589800 Missing time:1 day, 0:57:47.887786
Step:25000 Loss:nan f/b time:0.585932 Missing time:1 day, 0:46:59.878661
Step:25100 Loss:nan f/b time:0.589257 Missing time:1 day, 0:54:27.297112
Step:25200 Loss:nan f/b time:0.588775 Missing time:1 day, 0:52:15.065096
Step:25300 Loss:nan f/b time:0.585430 Missing time:1 day, 0:42:47.741276
Step:25400 Loss:nan f/b time:0.581155 Missing time:1 day, 0:31:00.059053
Step:25500 Loss:nan f/b time:0.587691 Missing time:1 day, 0:46:33.836580
Step:25600 Loss:nan f/b time:0.586745 Missing time:1 day, 0:43:11.544342
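To pin down where the loss first becomes non-finite, the log above can be scanned automatically. The sketch below assumes the `Step:<n> Loss:<value>` line format shown in this thread; `first_non_finite_step` is a hypothetical helper, not something Train.py provides.

```python
import math
import re

# Matches the "Step:<int> Loss:<value>" prefix of each log line.
LOG_LINE = re.compile(r"Step:(\d+)\s+Loss:(\S+)")


def first_non_finite_step(lines):
    """Return the first logged step whose loss is NaN or infinite, or None if all are finite."""
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # ignore lines that are not training-progress lines
        step = int(match.group(1))
        loss = float(match.group(2))  # float() accepts "nan" and "inf"
        if not math.isfinite(loss):
            return step
    return None


log = [
    "Step:24700 Loss:21.73 f/b time:0.596817 Missing time:1 day, 1:17:36.314750",
    "Step:24800 Loss:17.40 f/b time:0.590698 Missing time:1 day, 1:01:03.701733",
    "Step:24900 Loss:nan f/b time:0.589800 Missing time:1 day, 0:57:47.887786",
]
print(first_non_finite_step(log))
```

Because the loss is only logged every 100 steps here, the divergence happened somewhere between the last finite entry (step 24800) and the first NaN entry (step 24900). Restarting from a checkpoint saved before that range with more frequent logging would narrow it down further.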

@AlessioTonioni
Member

AlessioTonioni commented Jul 5, 2020

As for 1, which images are you using? The ones you sent me? I'm able to run Stereo_Online_Adaptation.py on them without any issue.

As for PSMNet, the loss seems quite high after 25K steps. Is it going down? What does the plot in TensorBoard look like?
Do the predictions start to look like anything reasonable? Otherwise there might be some issue with the architecture.

@passion3394
Author

For 1, I use the same images that you tested.

I cannot view TensorBoard yet because the GPU is in the cloud, but I will try to visualize the training in TensorBoard.

@passion3394
Author

passion3394 commented Jul 9, 2020

I have run TensorBoard on the remote machine. Could you help analyze the error?
I think the error is in the network.

PSMNet_error
PSMNet_error2

@AlessioTonioni
Member

Are you using the reprojection loss to train the network from scratch?
I would advise you to use the supervised loss as done in train.py.

In general, from these plots you can see that the loss is not going down at all, so there is definitely some implementation problem in the network, or you still have some trouble with your data. If you train DispNet or MADNet, are you able to see the loss going down?
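When TensorBoard is not at hand, a rough numeric check of whether the loss is going down can be run on the logged values directly. This is a minimal sketch, not a substitute for looking at the curve: it compares the mean of the first few logged losses with the mean of the last few. The name `loss_is_decreasing`, the window size and the 10% threshold are all arbitrary assumptions.

```python
def loss_is_decreasing(losses, window=5, min_drop=0.1):
    """Return True if the mean of the last `window` losses is at least
    `min_drop` (as a fraction) below the mean of the first `window` losses."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    first_mean = sum(losses[:window]) / window
    last_mean = sum(losses[-window:]) / window
    return last_mean <= first_mean * (1.0 - min_drop)


# First five and last five finite losses from the PSMNet log in this thread
# (steps 22800-23200 and 24400-24800).
logged = [17.37, 33.37, 36.77, 28.94, 18.60, 94.23, 20.95, 15.28, 21.73, 17.40]
print(loss_is_decreasing(logged))
```

On these values the first window averages about 27.0 and the last about 33.9, so the check reports no improvement, which agrees with the reading of the plots above.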

@AkshatVashisht

Does it do depth estimation in real time?
Does it produce depth continuously from the cameras?
