
Segmentation fault when running a join between pyarrow.RecordBatchReader and pyarrow.Table #12133

Closed
2 tasks done
rz-vastdata opened this issue May 19, 2024 · 5 comments

Comments

@rz-vastdata
rz-vastdata commented May 19, 2024

What happens?

DuckDB fails with a segmentation fault when running a join between pyarrow.RecordBatchReader and pyarrow.Table.
Joining pyarrow.Table and pyarrow.Table does work.

To Reproduce

$ docker run -it ubuntu:24.04 bash
Unable to find image 'ubuntu:24.04' locally
24.04: Pulling from library/ubuntu
49b384cc7b4a: Pull complete
Digest: sha256:3f85b7caad41a95462cf5b787d8a04604c8262cdcdf9a472b8c52ef83375fe15
Status: Downloaded newer image for ubuntu:24.04

root@48a30a637cb3:/# apt update && apt install python3-pip python3-venv
<snip>
root@48a30a637cb3:/# python3 -m venv env
root@48a30a637cb3:/# . env/bin/activate
(env) root@48a30a637cb3:/# pip install pyarrow duckdb
Collecting pyarrow
  Downloading pyarrow-16.1.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Collecting duckdb
  Downloading duckdb-0.10.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (763 bytes)
Collecting numpy>=1.16.6 (from pyarrow)
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.0/61.0 kB 3.1 MB/s eta 0:00:00
Downloading pyarrow-16.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (40.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.8/40.8 MB 37.0 MB/s eta 0:00:00
Downloading duckdb-0.10.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 47.0 MB/s eta 0:00:00
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.0/18.0 MB 32.7 MB/s eta 0:00:00
Installing collected packages: numpy, duckdb, pyarrow
Successfully installed duckdb-0.10.2 numpy-1.26.4 pyarrow-16.1.0

(env) root@48a30a637cb3:/# cat > test.py
import duckdb
import pyarrow as pa

d = duckdb.connect()

t = pa.RecordBatchReader(pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]}).to_batches())
s = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})

print(d.execute("SELECT * FROM s, t WHERE t.x = s.x").arrow())
(env) root@48a30a637cb3:/# python test.py
Segmentation fault

OS:

Ubuntu 24.04 x64

DuckDB Version:

0.10.2

DuckDB Client:

Python 3.12

Full Name:

Roman Zeyde

Affiliation:

VAST Data

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@Tishj
Contributor

Tishj commented May 19, 2024

I suspect this is because the RecordBatchReader is destructive: it gets read twice, and is empty the second time around.

This is a known issue that is currently hard to detect, but we're working on a fix. At first, that fix will simply throw a pre-emptive error.

@Mytherin
Collaborator

This might actually be a problem in pyarrow; the following snippet crashes for me as well:

import pyarrow as pa

tbl = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})
t = pa.RecordBatchReader(tbl.to_batches())

print(t.read_all())

@Mytherin
Collaborator

Mytherin commented May 21, 2024

I've opened a bug report in the arrow repository here - apache/arrow#41758

Closing this as it does not seem to be caused by DuckDB itself.

@Mytherin
Collaborator

Mytherin commented May 21, 2024

The reproduction uses the wrong syntax for creating a record batch reader; here is the correct one:

import duckdb
import pyarrow as pa

d = duckdb.connect()

tbl = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})
t = pa.RecordBatchReader.from_batches(tbl.schema, tbl.to_batches())
s = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})

print(d.execute("SELECT * FROM s, t WHERE t.x = s.x").arrow())
# pyarrow.Table
# x: int64
# y: string
# x: int64
# y: string
# ----
# x: [[11,12]]
# y: [["c","d"]]
# x: [[11,12]]
# y: [["c","d"]]

@rz-vastdata
Author

Many thanks @Mytherin!
