
Segmentation fault when running a join between pyarrow.RecordBatchReader and pyarrow.Table #12133

Closed
2 tasks done
rz-vastdata opened this issue May 19, 2024 · 5 comments

Comments

@rz-vastdata
rz-vastdata commented May 19, 2024

What happens?

DuckDB fails with a segmentation fault when running a join between pyarrow.RecordBatchReader and pyarrow.Table.
Joining pyarrow.Table and pyarrow.Table does work.

To Reproduce

$ docker run -it ubuntu:24.04 bash
Unable to find image 'ubuntu:24.04' locally
24.04: Pulling from library/ubuntu
49b384cc7b4a: Pull complete
Digest: sha256:3f85b7caad41a95462cf5b787d8a04604c8262cdcdf9a472b8c52ef83375fe15
Status: Downloaded newer image for ubuntu:24.04

root@48a30a637cb3:/# apt update && apt install python3-pip python3-venv
<snip>
root@48a30a637cb3:/# python3 -m venv env
root@48a30a637cb3:/# . env/bin/activate
(env) root@48a30a637cb3:/# pip install pyarrow duckdb
Collecting pyarrow
  Downloading pyarrow-16.1.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Collecting duckdb
  Downloading duckdb-0.10.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (763 bytes)
Collecting numpy>=1.16.6 (from pyarrow)
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.0/61.0 kB 3.1 MB/s eta 0:00:00
Downloading pyarrow-16.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (40.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.8/40.8 MB 37.0 MB/s eta 0:00:00
Downloading duckdb-0.10.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 47.0 MB/s eta 0:00:00
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.0/18.0 MB 32.7 MB/s eta 0:00:00
Installing collected packages: numpy, duckdb, pyarrow
Successfully installed duckdb-0.10.2 numpy-1.26.4 pyarrow-16.1.0

(env) root@48a30a637cb3:/# cat > test.py
import duckdb
import pyarrow as pa

d = duckdb.connect()

t = pa.RecordBatchReader(pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]}).to_batches())
s = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})

print(d.execute("SELECT * FROM s, t WHERE t.x = s.x").arrow())
(env) root@48a30a637cb3:/# python test.py
Segmentation fault

OS:

Ubuntu 24.04 x64

DuckDB Version:

0.10.2

DuckDB Client:

Python 3.12

Full Name:

Roman Zeyde

Affiliation:

VAST Data

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@Tishj
Contributor

Tishj commented May 19, 2024

I suspect this is because the RecordBatchReader is destructive: it gets read twice, and is empty the second time around.

This is a known issue that is currently hard to detect, but we're working on a fix. At first, that fix will simply throw a pre-emptive error.

@Mytherin
Collaborator

This might actually be a problem in pyarrow; the following snippet crashes for me as well:

import pyarrow as pa

tbl = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})
t = pa.RecordBatchReader(tbl.to_batches())

print(t.read_all())

@Mytherin
Collaborator

Mytherin commented May 21, 2024

I've opened a bug report in the arrow repository here - apache/arrow#41758

Closing this as it does not seem to be caused by DuckDB itself.

@Mytherin
Collaborator

Mytherin commented May 21, 2024

The reproduction uses the wrong syntax for creating a record batch reader; here is the correct one:

import duckdb
import pyarrow as pa

d = duckdb.connect()

tbl = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})
t = pa.RecordBatchReader.from_batches(tbl.schema, tbl.to_batches())
s = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})

print(d.execute("SELECT * FROM s, t WHERE t.x = s.x").arrow())
# pyarrow.Table
# x: int64
# y: string
# x: int64
# y: string
# ----
# x: [[11,12]]
# y: [["c","d"]]
# x: [[11,12]]
# y: [["c","d"]]

@rz-vastdata
Author

Many thanks @Mytherin!
