Use data source label in distributed HeadNode factory #15554

Open
wants to merge 2 commits into base: master


vepadulano
Member

Introduce a new method to get a label for the data source that the current RDataFrame is processing. There are three major types:

  • The dataframe will process a TTree dataset
  • The dataframe will process an empty dataset
  • The dataframe will process data from an RDataSource

For consistency with the RDataSource infrastructure, the function also returns a label with the suffix "DS" for the first two cases.

Distributed RDataFrame now uses this function to create the headnode of the Python computation graph. This also avoids extra parsing in the factory function, which included opening the first input file a second time to distinguish between TTree and RNTuple input when the first input argument is a string.
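For illustration, the label-based dispatch can be sketched roughly as follows. This is a minimal standalone sketch, not the code in this PR: the `HeadNode` class, the `make_headnode` function, and the exact label strings (`"TTreeDS"`, `"EmptyDS"`, `"RNTupleDS"`) are assumptions chosen to match the "DS" suffix convention described above.

```python
from dataclasses import dataclass


@dataclass
class HeadNode:
    """Hypothetical stand-in for the headnode of the computation graph."""
    kind: str


# Hypothetical mapping from data source label to headnode constructor.
# The label strings are assumed, following the "DS" suffix convention.
_HEADNODE_BY_LABEL = {
    "TTreeDS": lambda: HeadNode("ttree"),
    "EmptyDS": lambda: HeadNode("empty"),
    "RNTupleDS": lambda: HeadNode("rntuple"),
}


def make_headnode(label: str) -> HeadNode:
    """Pick the headnode type from the label alone, without opening any file."""
    try:
        return _HEADNODE_BY_LABEL[label]()
    except KeyError:
        raise ValueError(f"Unsupported data source label: {label!r}") from None
```

The point of the design is that the factory only needs a string the dataframe already knows, so it no longer has to inspect the input arguments or reopen the first file.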


github-actions bot commented May 17, 2024

Test Results

    12 files      12 suites   2d 21h 39m 15s ⏱️
 2 637 tests  2 636 ✅ 0 💤 1 ❌
29 949 runs  29 948 ✅ 0 💤 1 ❌

For more details on these failures, see this check.

Results for commit 5717ec1.

♻️ This comment has been updated with latest results.

@vepadulano force-pushed the distrdf-headnode-from-ds-label branch from 20966e8 to 5717ec1 on May 22, 2024 10:12
Contributor

@martamaja10 martamaja10 left a comment

Thanks Vincenzo! It is much neater now and we avoid opening the file.


2 participants