Use data source label in distributed HeadNode factory #15554

vepadulano · 2024-05-17T10:13:11Z

Introduce a new method to get a label for the data source that the current RDataFrame is processing. There are three major types:

* The dataframe will process a TTree dataset
* The dataframe will process an empty dataset
* The dataframe will process data from an RDataSource

The function also returns a label with the suffix "DS" for the first two cases, to align as closely as possible with the RDataSource infrastructure.
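The three cases above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `FakeDataFrame`, its fields, and `get_data_source_label` are hypothetical stand-ins rather than ROOT API, and the exact strings `"TTreeDS"` and `"EmptyDS"` are guesses consistent only with the "DS" suffix mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeDataFrame:
    """Hypothetical stand-in for a dataframe; this is not a ROOT class."""
    tree_name: Optional[str] = None          # set when a TTree dataset is processed
    data_source_label: Optional[str] = None  # set when an RDataSource is attached


def get_data_source_label(df: FakeDataFrame) -> str:
    """Return one label per kind of input, always ending in "DS"."""
    if df.data_source_label is not None:
        # An attached data source reports its own label.
        return df.data_source_label
    if df.tree_name is not None:
        # Assumed label for the TTree case.
        return "TTreeDS"
    # Assumed label for the empty-dataset case.
    return "EmptyDS"
```

With a single string per input kind, callers can branch on the label instead of inspecting the input arguments themselves.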

Make use of this function in distributed RDataFrame to create the headnode of the Python computation graph. This also avoids extra parsing in the factory function, which included opening the first input file once more to distinguish between TTree and RNTuple input (when the first input argument is a string).
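A label-driven factory might look roughly like the sketch below. It is not the code from this pull request: the class names, the label strings, and `make_head_node` are all hypothetical, chosen only to show how dispatching on a label removes the need to open an input file.

```python
from typing import Dict, Type


class HeadNode:
    """Hypothetical base class for the head of a computation graph."""


class TreeHeadNode(HeadNode):
    """Hypothetical head node for TTree input."""


class EmptySourceHeadNode(HeadNode):
    """Hypothetical head node for an empty dataset."""


class RNTupleHeadNode(HeadNode):
    """Hypothetical head node for RNTuple input."""


# Assumed label strings; only the "DS" suffix is stated in the description.
_HEAD_NODE_TYPES: Dict[str, Type[HeadNode]] = {
    "TTreeDS": TreeHeadNode,
    "EmptyDS": EmptySourceHeadNode,
    "RNTupleDS": RNTupleHeadNode,
}


def make_head_node(label: str) -> HeadNode:
    """Pick the head node type from the label alone, without touching any file."""
    try:
        node_type = _HEAD_NODE_TYPES[label]
    except KeyError:
        raise RuntimeError(f"Unsupported data source label: {label!r}") from None
    return node_type()
```

The lookup table keeps the factory free of input parsing: supporting another kind of data source would only require one more entry.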

github-actions · 2024-05-17T13:58:48Z

Test Results

12 files 12 suites 2d 21h 39m 15s ⏱️
2 637 tests 2 636 ✅ 0 💤 1 ❌
29 949 runs 29 948 ✅ 0 💤 1 ❌

For more details on these failures, see this check.

Results for commit 5717ec1.

♻️ This comment has been updated with latest results.

martamaja10

Thanks Vincenzo! It is much neater now and we avoid opening the file.

vepadulano added the in:RDataFrame label May 17, 2024

vepadulano self-assigned this May 17, 2024

vepadulano requested review from martamaja10 and dpiparo as code owners May 17, 2024 10:13

vepadulano force-pushed the distrdf-headnode-from-ds-label branch from 06cebe3 to 20966e8 Compare May 17, 2024 11:57

vepadulano mentioned this pull request May 17, 2024

[df] Add function to check if dataframe origin is TTree or TChain #15500

Closed

vepadulano added 2 commits May 22, 2024 12:10

vepadulano force-pushed the distrdf-headnode-from-ds-label branch from 20966e8 to 5717ec1 Compare May 22, 2024 10:12

martamaja10 approved these changes May 28, 2024
