Allow reading of files from remote file stores such as S3 [willing to contribute a PR!] #6049

Open
krishanbhasin-px opened this issue May 7, 2024 · 5 comments
Labels
feature-request Please add this cool feature!

Comments

@krishanbhasin-px

Describe the feature you would like to be added.

I would like to be able to easily read files directly from a remote file store, such as S3.

Links to VTK Documentation, Examples, or Class Definitions.

Currently, the definition of read() makes strong assumptions that the objects to be loaded live on a local filesystem.

Pseudocode or Screenshots

I'd love to be able to either:

  1. pass in a string that starts with s3:// and have pyvista know to use fsspec/s3fs
  2. construct an fsspec filesystem object, reference a file from it, and then pass that object directly to the read method
  3. Any other suggestions for making this work!
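For option 1, the first step would be telling a remote URI apart from a local path. A minimal sketch using only the standard library is below; `split_protocol` is a hypothetical helper name, not an existing PyVista function, and it assumes that any multi-letter scheme other than `file` means "remote".

```python
from urllib.parse import urlsplit


def split_protocol(path):
    """Return (protocol, is_remote) for a path or URI string.

    Hypothetical helper: not part of PyVista or fsspec.
    """
    scheme = urlsplit(str(path)).scheme.lower()
    # A single-letter "scheme" is a Windows drive such as C:\data, not a protocol
    if len(scheme) <= 1 or scheme == "file":
        return "file", False
    return scheme, True


print(split_protocol("s3://pyvista/examples/nefertiti.vtp"))
print(split_protocol("examples/nefertiti.vtp"))
```

A check like this would let read() keep its current behaviour for local paths and only reach for fsspec when a remote protocol is detected.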

I'd be super happy to work on a PR for this, if you would be open to merging this kind of change.

@krishanbhasin-px krishanbhasin-px added the feature-request Please add this cool feature! label May 7, 2024
@krishanbhasin-px
Author

Hey @tkoyama010, I saw your 👍 on the issue and just wanted to check; can I take that as an endorsement of the idea/you being open to merging a PR that implements this?

Sorry for the direct tag; I just want to be sure before I spend any time working on making this happen.

Thanks!

@MatthewFlamm
Contributor

This seems like a good idea, but my experience is that some (many?) VTK readers are not happy with non-string based paths or direct data being passed in binary/string form. If PyVista can transform user input to what VTK expects, it makes sense to me, particularly if we do not have to add any dependencies.
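One way to avoid a new hard dependency is to import fsspec lazily, only when a remote path is actually requested. The sketch below shows that pattern with the standard library; `import_optional` is a hypothetical helper name and the error message wording is an assumption.

```python
import importlib
import importlib.util


def import_optional(module_name, feature):
    """Import a top-level module lazily, or explain how to enable the feature.

    Hypothetical helper: not an existing PyVista function.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            f"Install it with:\n\npip install {module_name}"
        )
    return importlib.import_module(module_name)


# Usage: only called on the remote-path branch, so local reads never need fsspec
# fsspec = import_optional("fsspec", "Reading remote files")
```

With this approach users who only read local files would see no change in what they have to install.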

@banesullivan
Member

This is high up there on my wish list and I'm happy to help you make this happen in PyVista!

@MatthewFlamm makes a great point that we are mostly limited by what the upstream VTK readers can handle.

Some native VTK readers support the ReadFromInputStringOn option, specifically the XML VTK formats. Here is a routine that will read those files from S3 by fetching the file contents and passing along to the reader directly:

def read_xml_from_s3(uri):
    import pyvista as pv
    import fsspec, s3fs  # s3fs provides fsspec's 's3' filesystem
    from vtkmodules import vtkIOXML

    # XML-based VTK readers, keyed by file extension
    readers = {
        "vti": vtkIOXML.vtkXMLImageDataReader,
        "vts": vtkIOXML.vtkXMLStructuredGridReader,
        "vtr": vtkIOXML.vtkXMLRectilinearGridReader,
        "vtu": vtkIOXML.vtkXMLUnstructuredGridReader,
        "vtp": vtkIOXML.vtkXMLPolyDataReader,
    }
    fs = fsspec.filesystem('s3')
    ext = uri.split('.')[-1]
    try:
        reader = readers[ext]()
    except KeyError:
        raise KeyError(f"Extension {ext} is not supported for reading from S3") from None
    # Parse the file contents from memory instead of from a path on disk
    reader.ReadFromInputStringOn()
    with fs.open(uri, 'rb') as f:
        reader.SetInputString(f.read())
    reader.Update()
    return pv.wrap(reader.GetOutput())

import pyvista as pv
mesh = read_xml_from_s3("s3://pyvista/examples/nefertiti.vtp")

However, we can't do this for any other VTK readers as far as I am aware, leaving us with needing to write to a temporary file for formats like OBJ. Generally in my experience this is fine (just maybe don't do this for massive datasets). So perhaps a full solution is just some sort of helper routine like the following if the data path/URI is an s3:// path or non-local path:

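def read_from_s3(uri):
    """Read any mesh file from S3 by copying it to a local temporary file first."""
    import os
    import shutil
    import tempfile
    import pyvista as pv
    import fsspec, s3fs  # s3fs provides fsspec's 's3' filesystem

    fs = fsspec.filesystem('s3')
    # A temporary directory avoids re-opening an already open NamedTemporaryFile,
    # which is not possible on Windows
    with tempfile.TemporaryDirectory() as tmpdir:
        # Keep the original file name so pv.read can pick a reader from the extension
        local_path = os.path.join(tmpdir, os.path.basename(uri))
        with fs.open(uri, 'rb') as rf, open(local_path, 'wb') as wf:
            shutil.copyfileobj(rf, wf)  # copy in chunks rather than one big read
        return pv.read(local_path)

import pyvista as pv
mesh = read_from_s3("s3://pyvista/examples/nefertiti.obj")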

@krishanbhasin-px
Author

Hey @banesullivan, thank you for the detailed write up!

I’m new to pyvista and 3D data like this in general, but given I had a need to read data from S3 I thought I’d use this as an opportunity to learn more about it.

I thought I’d write up a short summary of what I’ve found so far this morning, and if you have the capacity I’d love some guidance on what to look at next.

I'm not trying to put any obligation on you here; please feel free to totally ignore this comment.
At the very least, writing this up will help clarify my own thoughts.

Naive summary of Pyvista

Pyvista is a Pythonic interface to VTK.

Under the hood it makes use of many readers written in the core VTK project. e.g. this CGNSReader class is "just" a wrapper around this class. Very few of these (as you listed) support being passed the file contents directly, and instead want a filepath that they themselves load from.

Pyvista also makes use of meshio to read formats that VTK doesn’t natively support. Meshio does appear to support being passed a buffer, which could then make use of fsspec's OpenFile objects.

Approach for introducing fsspec/remote file reading

Based on the structure of fileio.py's read method, I first looked at whether read_meshio can take a file handle, as an 'easy' first step. As mentioned above, meshio contains a _read_buffer() method which in theory should support this.

When trying this diff:

def read_meshio(filename, file_format=None):
# ...
    try:
        import meshio
    except ImportError:  # pragma: no cover
        raise ImportError("To use this feature install meshio with:\n\npip install meshio")

-    # Make sure relative paths will work
-    filename = str(Path(str(filename)).expanduser().resolve())
-    # Read mesh file
-    mesh = meshio.read(filename, file_format)
+    with fsspec.open(filename, 'rb') as f:
+        mesh = meshio.read(f, filename.ext[1:] if file_format is None else file_format)
    return from_meshio(mesh)

Running tests/test_meshio.py::test_meshio fails with [Errno 2] No such file or directory: '<fsspec.implementations.local.LocalFileOpener object at 0x167bf3d90>'.

Investigating this shows that meshio's VTUReader in _vtu.py stringifies the filename passed in to the xml tree reader, despite it being happy taking a filename or file object.

From my uninformed perspective this looks like a bug, but I'm aware of how little context I have of this domain and use case.
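The failure mode can be reproduced with the standard library alone. The sketch below is not meshio's code; it only assumes, as described above, that a file object is passed through str() before reaching an XML parser that would otherwise accept it. The XML payload is a made-up stand-in for a real .vtu file.

```python
import io
import xml.etree.ElementTree as ET

# Made-up stand-in for the contents of a .vtu file
xml_bytes = b"<VTKFile type='UnstructuredGrid'><UnstructuredGrid/></VTKFile>"

# ET.parse accepts either a path or a file object, so this works
root = ET.parse(io.BytesIO(xml_bytes)).getroot()


def parse_stringified(file_obj):
    """Mimic a reader that calls str() on its input before parsing."""
    try:
        # str(file_obj) is the object's repr, which is then treated as a path
        ET.parse(str(file_obj))
    except OSError as exc:
        return exc
    return None


error = parse_stringified(io.BytesIO(xml_bytes))
print(root.tag)
print(type(error).__name__)
```

The path in the resulting error is the repr of the file object, which matches the shape of the message in the failing test.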

It also made me doubt the feasibility of me making a "simple" change that would facilitate transparent reading of s3:// and other remote URIs.

Thinking of how to continue

Given your comment about how only a subset of readers would support being passed through and your provided snippets, would you prefer:

  • updating the read() method to handle this internally, entirely transparent to the user
    • this appears doable but would be non-trivial and potentially messy
  • introducing a new method to fileio.py similar to the one(s) you shared, which the user has to expressly call if the data is on a remote source, something like:
def read_remote_data(remote_uri):
    if remote_uri.file_extension in LIST_OF_SUPPORTED_READERS:
        ... # fsspec.open(), reader.SetInputString() etc.
    else:
        ... # copy file to local tmpdir and read in from there
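A more complete sketch of that second option, combining the two snippets shared earlier in the thread, might look like the following. The names `IN_MEMORY_READERS`, `remote_extension`, `choose_strategy` and `read_remote_data` are hypothetical, and it assumes fsspec can infer the protocol from the URI and that the matching backend (e.g. s3fs) is installed.

```python
import os
import posixpath
import shutil
import tempfile

# Extensions whose VTK XML readers support ReadFromInputStringOn
IN_MEMORY_READERS = {
    "vti": "vtkXMLImageDataReader",
    "vts": "vtkXMLStructuredGridReader",
    "vtr": "vtkXMLRectilinearGridReader",
    "vtu": "vtkXMLUnstructuredGridReader",
    "vtp": "vtkXMLPolyDataReader",
}


def remote_extension(uri):
    """Return the lower-cased file extension of a URI, without the dot."""
    return posixpath.splitext(uri)[1].lstrip(".").lower()


def choose_strategy(uri):
    """Pick 'in_memory' for XML formats, 'temp_file' for everything else."""
    return "in_memory" if remote_extension(uri) in IN_MEMORY_READERS else "temp_file"


def read_remote_data(uri):
    """Read a mesh from a remote URI (sketch; imports are deferred on purpose)."""
    import fsspec
    import pyvista as pv
    from vtkmodules import vtkIOXML

    with fsspec.open(uri, "rb") as remote:
        if choose_strategy(uri) == "in_memory":
            reader = getattr(vtkIOXML, IN_MEMORY_READERS[remote_extension(uri)])()
            reader.ReadFromInputStringOn()
            reader.SetInputString(remote.read())
            reader.Update()
            return pv.wrap(reader.GetOutput())
        # Fallback: copy to a local temporary file and let pv.read handle it
        with tempfile.TemporaryDirectory() as tmpdir:
            local_path = os.path.join(tmpdir, posixpath.basename(uri))
            with open(local_path, "wb") as local:
                shutil.copyfileobj(remote, local)
            return pv.read(local_path)
```

Keeping the strategy choice in its own small function means it can be unit tested without network access or VTK installed.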

@user27182
Contributor

The intern package has an API which may be a helpful reference for implementing this feature in PyVista. The intern package is used for working with really big datasets. For example, this remote dataset from bossdb
https://bossdb.org/project/maher_briegel2023, is read with the following API:

# Import intern (pip install intern)
from intern import array

# Save a cutout to a numpy array in ZYX order:
channel = array("bossdb://MaherBriegel2023/Lgn200/sbem")
data = channel[30:36, 1024:2048, 1024:2048]

See the implementation code for intern.array here:
https://github.com/jhuapl-boss/intern/blob/15073c6eed12e1372e2d0448ed1e874df827b3ba/intern/convenience/array.py#L936

4 participants