try to implement PSMNet use yours instructions #51

passion3394 · 2020-06-17T12:16:10Z

stereo_net = Nets.get_stereo_net(args.modelName, net_args)

The code above runs fine, but when I start training on scene_flow I get the following error.
I have tried some debugging:
(1) checked the scene_flow input, which looks OK
(2) checked the network, but did not find any errors

Could you help me find the bug? Thanks very much. If more files are needed, I will upload them.

Traceback (most recent call last):
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
[[{{node model/CNN_2/conv0/conv0_1/Conv2D}}]]
(1) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
[[{{node model/CNN_2/conv0/conv0_1/Conv2D}}]]
[[validation_error/truediv_1/_23]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "Train.py", line 191, in <module>
main(args)
File "Train.py", line 140, in main
fetches = sess.run(tf_fetches,options=run_options)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
[[node model/CNN_2/conv0/conv0_1/Conv2D (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:59) ]]
(1) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
[[node model/CNN_2/conv0/conv0_1/Conv2D (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:59) ]]
[[validation_error/truediv_1/_23]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node model/CNN_2/conv0/conv0_1/Conv2D:
model/MirrorPad_2 (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Data_utils/preprocessing.py:28)
model/CNN/conv0/conv0_1/weights/read (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

Input Source operations connected to node model/CNN_2/conv0/conv0_1/Conv2D:
model/MirrorPad_2 (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Data_utils/preprocessing.py:28)
model/CNN/conv0/conv0_1/weights/read (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

Original stack trace for 'model/CNN_2/conv0/conv0_1/Conv2D':
File "Train.py", line 191, in <module>
main(args)
File "Train.py", line 71, in main
val_stereo_net = Nets.get_stereo_net(args.modelName, net_args)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/__init__.py", line 15, in get_stereo_net
return STEREO_FACTORY[name](**kwargs)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/PSMNet.py", line 22, in __init__
super(PSMNet, self).__init__(**kwargs)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/Stereo_net.py", line 44, in __init__
self._build_network(args)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/PSMNet.py", line 54, in _build_network
conv4_left = self.CNN(self._left_input_batch)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/PSMNet.py", line 71, in CNN
activation=tf.nn.leaky_relu, batch_norm=True, apply_relu=True,strides=2, name='conv0_1',reuse=reuse))
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py", line 59, in conv2d
x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding=padding,dilations=dilation_rate)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
name=name)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
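
For readers following along, the message "input depth must be evenly divisible by filter depth: 4 vs 3" compares the channel dimension of the input tensor with the input-channel dimension of the convolution filter. The sketch below mimics that check in plain Python. It is an illustration only, not TensorFlow's implementation; the function name `check_conv2d_depths` is hypothetical, and the filter shape [3,3,3,16] is borrowed from the MADNet checkpoint shape reported later in this thread rather than from PSMNet.

```python
def check_conv2d_depths(input_shape, filter_shape):
    """Illustrative re-creation of the channel check behind the error message.

    input_shape:  [batch, height, width, in_channels]              (NHWC layout)
    filter_shape: [kernel_h, kernel_w, filter_in_channels, out_channels]
    Returns the number of channel groups when the shapes are compatible.
    """
    input_depth = input_shape[3]
    filter_depth = filter_shape[2]
    if input_depth % filter_depth != 0:
        raise ValueError(
            "input depth must be evenly divisible by filter depth: "
            f"{input_depth} vs {filter_depth}"
        )
    return input_depth // filter_depth


# A 3-channel batch is compatible with a filter built for 3 input channels.
print(check_conv2d_depths([1, 256, 512, 3], [3, 3, 3, 16]))

# A 4-channel batch (for example RGB + alpha) is not.
try:
    check_conv2d_depths([1, 256, 512, 4], [3, 3, 3, 16])
except ValueError as err:
    print(err)
```

Read this way, the "4" in the error is the channel count of the tensor that actually reaches the convolution at run time, and the "3" is what the filter was built for.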

AlessioTonioni · 2020-06-18T16:21:46Z

Without seeing the code I cannot provide much help. Can you create a pull request with your code?

passion3394 · 2020-06-19T08:14:09Z

@AlessioTonioni Sure, I will create a pull request soon.

passion3394 · 2020-06-19T09:21:56Z

I have created a pull request, thanks for your help.

passion3394 · 2020-06-22T02:38:29Z

@AlessioTonioni Any updates?

AlessioTonioni · 2020-06-22T21:36:47Z

Do the images that you are using have 4 channels for some reason, e.g. RGB + alpha?
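
One way to answer this question without loading TensorFlow is to read the channel count straight from the image header. The sketch below assumes the images are PNG files, which the thread does not state, and uses only the standard library. The helper names `png_channels` and `png_channels_from_file` are hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# PNG color type -> number of values stored per pixel
CHANNELS_BY_COLOR_TYPE = {0: 1, 2: 3, 3: 1, 4: 2, 6: 4}


def png_channels(header_bytes):
    """Return the channel count declared in a PNG file's IHDR chunk."""
    if header_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # Bytes 8..11 hold the chunk length, bytes 12..15 the chunk type.
    if header_bytes[12:16] != b"IHDR":
        raise ValueError("IHDR chunk not found")
    # IHDR data starts at byte 16: width (4), height (4), bit depth (1), color type (1).
    _width, _height, _bit_depth, color_type = struct.unpack(">IIBB", header_bytes[16:26])
    return CHANNELS_BY_COLOR_TYPE[color_type]


def png_channels_from_file(path):
    """Read just enough of the file to inspect the IHDR chunk."""
    with open(path, "rb") as f:
        return png_channels(f.read(26))
```

A color type of 6 (truecolor with alpha) corresponds to 4 channels, while color type 2 (plain RGB) corresponds to 3.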

passion3394 · 2020-06-23T06:41:45Z

@AlessioTonioni Yes, I use the Scene Flow dataset, and the images contain 4 channels.
I did two things:
(1) In Nets/PSMNet.py, I printed the shape of left_input_batch in the function _preprocess_inputs. It outputs [1,256,512,3], so I think the input is OK.

(2) I tried to train MADNet on this dataset and got the same error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
[[node model/gc-read-pyramid_1/conv1/Conv2D (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:59) ]]
[[training_error/Sum_12/_57]]
(1) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
[[node model/gc-read-pyramid_1/conv1/Conv2D (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:59) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node model/gc-read-pyramid_1/conv1/Conv2D:
model/MirrorPad_1 (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Data_utils/preprocessing.py:28)
model/gc-read-pyramid/conv1/weights/read (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

Input Source operations connected to node model/gc-read-pyramid_1/conv1/Conv2D:
model/MirrorPad_1 (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Data_utils/preprocessing.py:28)
model/gc-read-pyramid/conv1/weights/read (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

Original stack trace for 'model/gc-read-pyramid_1/conv1/Conv2D':
File "Train.py", line 191, in <module>
main(args)
File "Train.py", line 63, in main
stereo_net = Nets.get_stereo_net(args.modelName, net_args)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/__init__.py", line 15, in get_stereo_net
return STEREO_FACTORY[name](**kwargs)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/MadNet.py", line 24, in __init__
super(MadNet, self).__init__(**kwargs)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/Stereo_net.py", line 44, in __init__
self._build_network(args)
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/MadNet.py", line 260, in _build_network
self._pyramid_features(self._right_input_batch, scope='gc-read-pyramid', reuse=True, layer_prefix='right')
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/MadNet.py", line 183, in _pyramid_features
)[-1].value, 16], strides=2, name='conv1', bName='biases', activation=activation))
File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py", line 59, in conv2d
x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding=padding,dilations=dilation_rate)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
name=name)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()

So how can I fix this error?

AlessioTonioni · 2020-06-23T16:58:59Z

If you get the same error with MADNet and (I guess) Dispnet, then I think you have a problem either with your data or with the version of TensorFlow you are using.
Can you share a stereo couple + gt from the dataset that you are trying to use?
Can you run Stereo_Online_Adaptation.py? Or do you get an error there as well?

passion3394 · 2020-06-24T15:44:04Z

Firstly, a sample of the stereo couple + gt has been attached. sceneflow_samples.tar.gz

Secondly, I tried to run Stereo_Online_Adaptation.py with the following script:

LIST="/host/nfs/hs/scene_flow/val.list" #the one described at step (1)
OUTPUT="_sf_output/"
WEIGHTS="pretrained/MADNet/synthetic/weights.ckpt"
MODELNAME="MADNet"
BLOCKCONFIG="block_config/MadNet_full.json"

python3 Stereo_Online_Adaptation.py \
-l ${LIST} \
-o ${OUTPUT} \
--weights ${WEIGHTS} \
--modelName ${MODELNAME} \
--blockConfig ${BLOCKCONFIG} \
--mode FULL \
--imageShape 256 512 \
--sampleMode PROBABILITY \
--logDispStep 1

I got the following error, which is different from the previous one.

WARNING:tensorflow:From /root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists(from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Traceback (most recent call last):
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]
[[{{node save/Assign_75}}]]
[[save/RestoreV2/_42]]
(1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]
[[{{node save/Assign_75}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]
[[node save/Assign_75 (defined at Stereo_Online_Adaptation.py:152) ]]
[[save/RestoreV2/_42]]
(1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]
[[node save/Assign_75 (defined at Stereo_Online_Adaptation.py:152) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node save/Assign_75:
model/gc-read-pyramid/conv1/weights (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

Input Source operations connected to node save/Assign_75:
model/gc-read-pyramid/conv1/weights (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

AlessioTonioni · 2020-06-28T08:38:08Z

> Firstly, a sample of the stereo couple + gt has been attached. sceneflow_samples.tar.gz

Your images have 4 channels: RGB + alpha. I just pushed a commit that explicitly removes the extra channel after the images are read. Let me know if it fixes your problems.

The second error you are getting seems to be related to the operation that restores the weights of the first conv. I think it is still an issue with the number of channels in the input image (4 with your images rather than 3), so it might be fixed as well. If it still does not work, let me know.
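
The commit itself is not shown in this thread, so the following is only a minimal, framework-free sketch of what "removing the extra channel" means for an image stored in height x width x channels order. The function name `drop_alpha` is hypothetical. On a TensorFlow tensor or NumPy array with the same layout, the equivalent operation would be the slice `image[:, :, :3]`.

```python
def drop_alpha(image):
    """Keep only the first three channels of an H x W x C image given as nested lists."""
    return [[pixel[:3] for pixel in row] for row in image]


# A tiny 1 x 2 RGBA image: each pixel is [R, G, B, A].
rgba_image = [[[10, 20, 30, 255], [40, 50, 60, 0]]]
rgb_image = drop_alpha(rgba_image)
print(rgb_image)  # each pixel now has three values
```

Applying the same function to an image that already has 3 channels leaves it unchanged, so it can be used unconditionally after loading.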

passion3394 · 2020-06-28T10:32:10Z

The first error, in Train.py, is fixed.

The second error still occurs, with the same message.

AlessioTonioni · 2020-06-28T12:39:46Z

I cannot replicate the error on my side. Which version of TensorFlow are you using?

The error seems to be related to how weights are restored.
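
A restore failure of this kind can be narrowed down by comparing the shape of each variable in the graph with the shape stored in the checkpoint. The sketch below does this on plain dictionaries; the function name `find_shape_mismatches` is hypothetical. How the two dictionaries are filled is left open: in TensorFlow 1.x the checkpoint side could probably be obtained with `tf.train.list_variables`, but that is an assumption to verify against the installed version. The example values come from the error above, reading "lhs" as the graph variable and "rhs" as the checkpoint value.

```python
def find_shape_mismatches(graph_shapes, checkpoint_shapes):
    """Return {name: (graph_shape, checkpoint_shape)} for variables whose shapes differ.

    Variables that are missing from the checkpoint are ignored here.
    """
    mismatches = {}
    for name, graph_shape in graph_shapes.items():
        ckpt_shape = checkpoint_shapes.get(name)
        if ckpt_shape is not None and list(ckpt_shape) != list(graph_shape):
            mismatches[name] = (list(graph_shape), list(ckpt_shape))
    return mismatches


# Shapes taken from the "Assign requires shapes of both tensors to match" message.
graph = {"model/gc-read-pyramid/conv1/weights": [3, 3, 2, 16]}
checkpoint = {"model/gc-read-pyramid/conv1/weights": [3, 3, 3, 16]}
for name, (g, c) in find_shape_mismatches(graph, checkpoint).items():
    print(f"{name}: graph {g} vs checkpoint {c}")
```

Only the third dimension (input channels) differs here, which suggests the first convolution was built for a different number of input channels than the checkpoint expects.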

passion3394 · 2020-06-28T12:43:43Z

I installed tensorflow-gpu with pip. I could not install TF 1.12, so I installed TF 1.14.0.

One more piece of information:
I ran Stereo_Online_Adaptation.py with another dataset whose images have 3 channels, and the result is OK.

AlessioTonioni · 2020-07-01T19:30:37Z

I'm testing with 1.12, the images you sent, and the weights available online, and everything seems to work.

Any other insight on what is happening?

passion3394 · 2020-07-03T02:12:21Z

So far, this issue looks very strange:
(1) Using 4-channel images to run Stereo_Online_Adaptation.py, the following error occurs:
Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]

(2) Using 3-channel images to run Stereo_Online_Adaptation.py, everything runs OK.

I suspect it is related to the checkpoint having been saved from training on 3-channel images. I will convert the 4-channel images to 3-channel images to verify this.
==============================split line=====================
Meanwhile, I continued to train PSMNet on the Scene Flow dataset and got NaN losses. I suspect the NaN issue lies in the PSMNet network. Do you have any suggestions?

Step:22800 Loss:17.37
Step:22900 Loss:33.37
Step:23000 Loss:36.77
Step:23100 Loss:28.94
Step:23200 Loss:18.60
Step:23300 Loss:17.36
Step:23400 Loss:24.45
Step:23500 Loss:28.62
Step:23600 Loss:63.38
Step:23700 Loss:66.96
Step:23800 Loss:104.30
Step:23900 Loss:51.45
Step:24000 Loss:31.38
Step:24100 Loss:18.41
Step:24200 Loss:17.72
Step:24300 Loss:64.83
Step:24400 Loss:94.23
Step:24500 Loss:20.95
Step:24600 Loss:15.28
Step:24700 Loss:21.73
Step:24800 Loss:17.40
Step:24900 Loss:nan
Step:25000 Loss:nan
Step:25100 Loss:nan
Step:25200 Loss:nan
Step:25300 Loss:nan
Step:25400 Loss:nan
Step:25500 Loss:nan
Step:25600 Loss:nan
f/b time:0.603486 Missing time:1 day, 1:53:40.462395
f/b time:0.602048 Missing time:1 day, 1:48:58.175722
f/b time:0.601132 Missing time:1 day, 1:45:36.700135
f/b time:0.596681 Missing time:1 day, 1:33:10.308483
f/b time:0.598654 Missing time:1 day, 1:37:14.649814
f/b time:0.595730 Missing time:1 day, 1:28:44.552802
f/b time:0.596634 Missing time:1 day, 1:30:04.066876
f/b time:0.598686 Missing time:1 day, 1:34:19.910125
f/b time:0.599219 Missing time:1 day, 1:34:41.985614
f/b time:0.599685 Missing time:1 day, 1:34:53.615409
f/b time:0.599060 Missing time:1 day, 1:32:17.753413
f/b time:0.595160 Missing time:1 day, 1:21:19.704973
f/b time:0.598833 Missing time:1 day, 1:29:43.103316
f/b time:0.601884 Missing time:1 day, 1:36:30.574517
f/b time:0.596067 Missing time:1 day, 1:20:39.928160
f/b time:0.591668 Missing time:1 day, 1:08:27.451441
f/b time:0.594829 Missing time:1 day, 1:15:31.437283
f/b time:0.599308 Missing time:1 day, 1:25:56.208433
f/b time:0.598947 Missing time:1 day, 1:24:01.202895
f/b time:0.596817 Missing time:1 day, 1:17:36.314750
f/b time:0.590698 Missing time:1 day, 1:01:03.701733
f/b time:0.589800 Missing time:1 day, 0:57:47.887786
f/b time:0.585932 Missing time:1 day, 0:46:59.878661
f/b time:0.589257 Missing time:1 day, 0:54:27.297112
f/b time:0.588775 Missing time:1 day, 0:52:15.065096
f/b time:0.585430 Missing time:1 day, 0:42:47.741276
f/b time:0.581155 Missing time:1 day, 0:31:00.059053
f/b time:0.587691 Missing time:1 day, 0:46:33.836580
f/b time:0.586745 Missing time:1 day, 0:43:11.544342

AlessioTonioni · 2020-07-05T16:59:17Z

As for (1), which images are you using? The ones you sent me? I am able to run Stereo_Online_Adaptation.py on them without any issue.

As for PSMNet, the loss seems quite high after 25K steps. Is it going down? What does the plot in TensorBoard look like?
Do the predictions start to look like anything reasonable? Otherwise there might be some issue with the architecture.

passion3394 · 2020-07-06T02:37:31Z

For (1), I use the same images that you tested.

I cannot view TensorBoard yet because the GPU is in the cloud, but I will try to visualize the training in TensorBoard.

passion3394 · 2020-07-09T03:36:57Z

I have run TensorBoard on the remote terminal. Could you help me analyze the error?
I think the error is in the network.

AlessioTonioni · 2020-07-13T09:28:57Z

Are you using the reprojection loss to train the network from scratch?
I would advise you to use the supervised loss, as done in Train.py.

In general, from these plots you can see that the loss is not going down at all, so there is definitely some implementation problem in the network, or you still have some trouble with your data. If you train Dispnet or MADNet, are you able to see the loss going down?
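
When TensorBoard plots are not at hand, a rough numeric check of whether the loss is going down is to compare the average of the earliest logged values with the average of the latest ones. The sketch below is one simple heuristic, not a substitute for looking at the full curve. The function name `loss_is_decreasing` and its `window` and `min_drop` parameters are hypothetical choices.

```python
import math
from statistics import mean


def loss_is_decreasing(losses, window=5, min_drop=0.1):
    """Heuristic trend check on a sequence of logged loss values.

    Non-finite values (NaN, inf) are ignored. Returns True when the mean of the
    last `window` values is at least `min_drop` (as a fraction) below the mean
    of the first `window` values.
    """
    finite = [x for x in losses if math.isfinite(x)]
    if len(finite) < 2 * window:
        raise ValueError("not enough finite loss values for the chosen window")
    start = mean(finite[:window])
    end = mean(finite[-window:])
    return (start - end) / start >= min_drop


# Finite loss values copied from the PSMNet log posted earlier in this thread.
psmnet_losses = [17.37, 33.37, 36.77, 28.94, 18.60, 17.36, 24.45, 28.62, 63.38,
                 66.96, 104.30, 51.45, 31.38, 18.41, 17.72, 64.83, 94.23, 20.95,
                 15.28, 21.73, 17.40]
print(loss_is_decreasing(psmnet_losses))  # False
```

Running the same check on a Dispnet or MADNet log would give a quick comparison of whether those networks make progress on the same data.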

AkshatVashisht · 2021-09-02T11:37:56Z

Does it do depth estimation in real time?
Does it produce depth continuously from the cameras?
