Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSL Radio::ReceiveAt failed in some cases #9025

Open
Irving-cl opened this issue May 5, 2023 · 7 comments
Open

CSL Radio::ReceiveAt failed in some cases #9025

Irving-cl opened this issue May 5, 2023 · 7 comments

Comments

@Irving-cl
Copy link
Contributor

Describe the bug
Recently I'm doing some more strict tests for CSL transmission. The CSL transmitter is a Thread product (a border router using host+RCP mode) and the CSL receiver is ot-cli-ftd on nRF52840DK. I can sometimes see Radio::ReceiveAt failed with an error and as a result, CSL transmission failed. This doesn't happen all the time. When CSL period is set to 5s, this happens a lot. When CSL period is 1,2,3s, it can work well.

Debug and Analysis
I have debugged this issue by adding logs on the CSL receiver and checking the logs from JLink. And it was indeed caused by the failure of calling Radio::ReceiveAt. I added logs here (SubMac::HandleCslTimer):

        // Schedule reception window for any state except RX - so that CSL RX Window has lower priority
        // than scanning or RX after the data poll.
        if (RadioSupportsReceiveTiming() && (mState != kStateDisabled) && (mState != kStateReceive))
        {
            IgnoreError(Get<Radio>().ReceiveAt(mCslChannel, mCslSampleTime.GetValue() - periodUs - timeAhead,
                                               timeAhead + timeAfter));
======>     Change the above line, get the error code and log it
        }
        else if (mState == kStateCslSample)
        {
            IgnoreError(Get<Radio>().Receive(mCslChannel));
            LogDebg("CSL sample %lu, duration %lu", ToUlong(mCslTimer.GetNow().GetValue()),
                    ToUlong(timeAhead + timeAfter));
        }
    }

For the failed cases, I can see such output:

[0000007855] SubMac--------: HandleCslTimer                                                                                                                                                                                                                                         
[0000007856] SubMac--------: Right to call ReceiveAt, SampleTime:12863617, param:7857176                                                                                                                                                                                            
[0000007856] SubMac--------: Error: 0                                                                                                                                                                                                                                               
[0000007868] MeshForwarder-: Received IPv6 UDP msg, len:102, chksum:5857, ecn:no, from:0645845e0a9d9307, sec:no, [0000007874] Mac-----------: Sent data poll, fp:yes                                                                                                                
[0000007887] Mac-----------: Sent data poll, fp:yes                                                                                                                                                                                                                                 
[0000007894] Mac-----------: Frame rx failed, error:Duplicated, len:126, seqnum:192, type:Data, src:0645845e0a9d9[0000007904] MeshForwarder-: Received IPv6 UDP msg, len:172, chksum:fb8a, ecn:no, from:0645845e0a9d9307, sec:yes,[0000007912] Mac-----------: Frame tx attempt 1/16
 failed, error:NoAck, len:80, seqnum:119, type:Data, src:f26ef1[0000008006] Mle-----------: Send Child Update Request to parent (fe80:0:0:0:445:845e:a9d:9307)                                                                                                                      
[0000008015] MeshForwarder-: Sent IPv6 UDP msg, len:122, chksum:f3c1, ecn:no, to:0645845e0a9d9307, sec:yes, prio:[0000012855] SubMac--------: HandleCslTimer                                                                                                                        
[0000012855] SubMac--------: Right to call ReceiveAt, SampleTime:17863617, param:12857186                                                                                                                                                                                           
[0000012855] SubMac--------: Error: 0                                                                                                                                                                                                                                               
[0000012868] MeshForwarder-: Received IPv6 UDP msg, len:102, chksum:7e87, ecn:no, from:0645845e0a9d9307, sec:no, [0000012868] MeshForwarder-:     src:[fe80:0:0:0:445:845e:a9d:9307]:19788                                                                                          
[0000012868] MeshForwarder-:     dst:[fe80:0:0:0:f06e:f12e:38c3:333e]:19788                                                                                                                                                                                                         
[0000012869] Mle-----------: Receive Child Update Response from parent (fe80:0:0:0:445:845e:a9d:9307)                                                                                                                                                                               
[0000017855] SubMac--------: HandleCslTimer                                                                                                                                                                                                                                         
[0000017855] SubMac--------: Right to call ReceiveAt, SampleTime:22863617, param:17855811                                                                                                                                                                                           
[0000017855] SubMac--------: Error: 1                                                                                                                                                                                                                                               
[0000022853] SubMac--------: HandleCslTimer                                                                                                                                                                                                                                         
[0000022854] SubMac--------: Right to call ReceiveAt, SampleTime:27863617, param:22854437                                                                                                                                                                                           
[0000022854] SubMac--------: Error: 1                                                                                                                                                                                                                                               
[0000027852] SubMac--------: HandleCslTimer                                                                                                                                                                                                                                         
[0000027852] SubMac--------: Right to call ReceiveAt, SampleTime:32863617, param:27853062                                                                                                                                                                                           
[0000027852] SubMac--------: Error: 1                                                                                                                                                                                                                                               
[0000032851] SubMac--------: HandleCslTimer                                                                                                                                                                                                                                         
[0000032851] SubMac--------: Right to call ReceiveAt, SampleTime:37863617, param:32851687                                                                                                                                                                                           
[0000032851] SubMac--------: Error: 1                                                                                                                                                                                                                                               
[0000037019] Mac-----------: Sent data poll, fp:yes   

param is the value of mCslSampleTime.GetValue() - periodUs - timeAhead, which is the parameter of ReceiveAt. We can see that in the failed cases, when ReceiveAt is called, param is close to (or even later than) the timestamp of the logs. For example:

[0000017855] SubMac--------: Right to call ReceiveAt, SampleTime:22863617, param:17855811  

The log timestamp (when ReceiveAt is called) is 17855, param is also 17855(811).

That means, when ReceiveAt is called, the expected time has passed. I guess that's the reason ReceiveAt failed.

Proposed Solution
I think we can let the timer fires a little bit earlier than CslSampleTime - Ahead. This advance is not for the window, it's for the accuracy of the SubMac timer. If the timer is set to fire exactly at a time and ReceiveAt is also called with that time, sometimes the timer fires later than the expected time and ReceiveAt failed.

So I think we can fire the timer 1-2 ms earlier. Like:

        if (RadioSupportsReceiveTiming())
        {
            mCslSampleTime += periodUs;
            mCslTimer.FireAt(mCslSampleTime - timeAhead - timerMargin);  // Add some margin here, maybe 2ms
            timeAhead -= kCslReceiveTimeAhead;
        }

As I understand, this advance time won't cause any negative effect since Receive will still start at the time specified.

I made some tests with 2ms as the timer margin and get some good results:

[0000022332] MeshForwarder-:     src:[fd47:cb97:3e1f:f634[0000022345] MeshForwarder-: Received IPv6 ICMP6 msg, len:98, chksum:fe2a, ecn:no, from:0x4000, sec:yes, prio:nor[0000022346] Icmp6---------: Received Echo Request
[0000022346] Icmp6---------: Sent Echo Reply (seq = 5)               
[0000022352] MeshForwarder-: Sent IPv6 ICMP6 msg, len:98, chksum:90d9, ecn:no, to:0x4000, sec:yes, prio:normal                            
[0000022352] MeshForwarder-:     src:[fd47:cb97:3e1f:f634:2f27:287d:352c:27e]                                                             
[0000022353] MeshForwarder-:     dst:[fd2b:3f5:eea6:1:8e82:961f:62de:afb1]                                                                
[0000022360] MeshForwarder-: Sent IPv6 ICMP6 msg, len:98, chksum:fd2a, ecn:no, to:0x4000, sec:yes, prio:normal                            
[0000022360] MeshForwarder-:     src:[fd47:cb97:3e1f:f634:2f27:287d:352c:27e]                                                             
[0000022360] MeshForwarder-:     dst:[fd2b:3f5:eea6:1:8e82:961f:62de:afb1]                                                                
[0000027241] SubMac--------: HandleCslTimer                          
[0000027241] SubMac--------: Next FireTime: 32244251                 
[0000027241] SubMac--------: Right to call ReceiveAt, SampleTime:32253693, param:27247251                                                 
[0000027242] SubMac--------: Error: 0                                
[0000032244] SubMac--------: HandleCslTimer                          
[0000032244] SubMac--------: Next FireTime: 37242875                 
[0000032244] SubMac--------: Right to call ReceiveAt, SampleTime:37253693, param:32245875                                                 
[0000032244] SubMac--------: Error: 0                                
[0000037242] SubMac--------: HandleCslTimer                          
[0000037243] SubMac--------: Next FireTime: 42241501                 
[0000037243] SubMac--------: Right to call ReceiveAt, SampleTime:42253693, param:37244501                                                 
[0000037243] SubMac--------: Error: 0                                
[0000042241] SubMac--------: HandleCslTimer                          
[0000042241] SubMac--------: Next FireTime: 47240126                 
[0000042241] SubMac--------: Right to call ReceiveAt, SampleTime:47253693, param:42243126                                                 
[0000042242] SubMac--------: Error: 0                                
[0000047240] SubMac--------: HandleCslTimer                          
[0000047240] SubMac--------: Next FireTime: 52238751                 
[0000047240] SubMac--------: Right to call ReceiveAt, SampleTime:52253693, param:47241751                                                 
[0000047240] SubMac--------: Error: 0                                
[0000051366] Mac-----------: Sent data poll, fp:no                   
[0000052238] SubMac--------: HandleCslTimer

In this case, we can see param is 1~2 ms after the log timestamp.

Any thoughts? @EskoDijk @edmont

If you think it's fine I'll create a PR.

@edmont
Copy link
Member

edmont commented May 5, 2023

Hi @Irving-cl, thanks for the investigation. In the past I've seen cases in which HandleCslTimer is triggered a few milliseconds after the expected time (due to tasklets processing, I guess), with the only consequence of a missed receive slot.

This seems to be more severe, with all subsequent receive slots missed as well. Do you think the 2 ms margin will be enough or will the times eventually overlap again? On the other hand, if the problem is harder to reproduce with shorter CSL periods, does it make sense to apply the same margin? Could the margin be proportional to the CSL period value?

@Irving-cl
Copy link
Contributor Author

@edmont

Thanks! Setting the margin proportional to the CSL period sounds good to me. I'll make more tests and see what is the proper ratio. Just want to check with you: If the margin is fixed, does it have some impact for short CSL periods? Because as I understand, during the margin time, the radio will still keep sleeping.

@edmont
Copy link
Member

edmont commented May 5, 2023

It shouldn't have a big impact, just doesn't seem natural to increase the margin unnecessarily. Since this is added to OPENTHREAD_CONFIG_CSL_RECEIVE_TIME_AHEAD care must be taken if somehow very short periods are desired.

@Irving-cl
Copy link
Contributor Author

@edmont Make sense to me.

@Irving-cl Irving-cl self-assigned this May 6, 2023
@EskoDijk
Copy link
Contributor

EskoDijk commented May 6, 2023

@Irving-cl Thanks for looking further into this. It could be useful to include the build/commit id for both openthread and the Nordic platform code used for the experiments.

Looking at the latest Nordic platform code, it has a setting for OPENTHREAD_CONFIG_CSL_RECEIVE_TIME_AHEAD set to 2 ms ahead already. So does it mean this value is not enough, for certain CSL periods such as 5 sec ?

Maybe it would be useful to find out or describe here why the task processor calls tasks too late in this particular case; and/or which tasks are taking so long to execute in this case. And why these tasks don't take long with the shorter sleep periods :-)
Maybe it could be reproduced in simulation also.

@Irving-cl
Copy link
Contributor Author

@EskoDijk

Thanks! You're right. It's weird that OPENTHREAD_CONFIG_CSL_RECEIVE_TIME_AHEAD seems doesn't make the timer fire early enough. I'll take further look at this.

Note that this issue only happens with the ReceiveAt case. I have done some tests with non ReceiveAt case (remove the OT_RADIO_CAPS_RECEIVE_TIMING in platform code) and this issue doesn't happen. So it may be hard to reproduce this issue in simulation.

Regarding the build/commit, I used the latest commit for the CSL receiver. However, as I mentioned, the CSL Transmitter is a Thread product. When I tested with two DKs, the issue doesn't occur. This is kind of interesting. But as I checked, I'm sure that the issue is on the receiver because the tx time of the transmitter is correct and I can see the errors of Radio::ReceiveAt on Receiver.

@EskoDijk
Copy link
Contributor

@Irving-cl Maybe it's worth trying to measure this again after #9322 has been completed (on CSL timestamp reference points). Also the CSL code has meanwhile changed; it now uses winStart in the calculation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants