add a api load dataset from [huggingface datasets] #11126
Comments
Thanks for the issue @simplew2011. We already support loading datasets from Hugging Face via fsspec:
import dask.dataframe as dd
df = dd.read_parquet("hf://datasets/wikimedia/wikipedia/20231101.en")
Can you say more about what you're looking for? It could be that things already work.
See this related Stackoverflow question and answer https://stackoverflow.com/questions/44889526/dask-bag-jsondecodeerror-when-reading-multiline-json-arrays. In short, the
Maybe you'll be better off just using Dask DataFrame's JSON reader? With data files like this:
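You can read the data like this:
import dask.dataframe as dd
# A glob pattern picks up every file; an explicit list such as
# ["data/0.json", "data/1.json"] can be passed instead.
# lines=False parses each file as one JSON document rather than line-delimited JSON.
df = dd.read_json("data/*.json", lines=False)
print(f"{df.compute() = }")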
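For anyone wondering why the line-delimited mode trips over these files, here is a minimal stdlib-only sketch of the failure described in the linked question. The file contents are an assumption (a pretty-printed JSON array of records, since the sample data is not shown in this thread), and the helper names `parse_line_by_line`, `parse_whole_file`, and `demo` are hypothetical. It does not use Dask at all; it only illustrates the difference between parsing each line as its own document and parsing the whole file at once.

```python
import json
import tempfile
from pathlib import Path


def parse_line_by_line(path):
    """Treat every non-empty line as a separate JSON document (line-delimited JSON)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def parse_whole_file(path):
    """Treat the entire file as a single JSON document."""
    with open(path) as f:
        return json.load(f)


def demo():
    # Assumed file layout: one JSON array spread over several lines.
    records = [{"id": 0, "name": "a"}, {"id": 1, "name": "b"}]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "0.json"
        path.write_text(json.dumps(records, indent=2))

        # The first line is just "[", which is not valid JSON on its own.
        try:
            parse_line_by_line(path)
            line_mode_failed = False
        except json.JSONDecodeError:
            line_mode_failed = True

        whole = parse_whole_file(path)
    return line_mode_failed, whole


if __name__ == "__main__":
    failed, whole = demo()
    print("line-by-line parsing failed:", failed)
    print("whole-file parsing returned", len(whole), "records")
```

If your files really are one JSON object per line, the line-by-line approach is the right one and `lines=False` would not be needed.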
Thanks.