Commit d27d4a4

Added support for fetching VECTORs in Arrow arrays.
1 parent 44d64c5 commit d27d4a4

21 files changed: 1217 additions and 33 deletions

doc/src/release_notes.rst

Lines changed: 2 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -63,8 +63,9 @@ Common Changes
6363

6464
#) Added Instance Principal authentication support when using
6565
:ref:`OCI Cloud Native Authentication <cloudnativeauthoci>`.
66-
#) Improvements to :ref:`data frames <dataframeformat>`:
66+
#) Improvements to :ref:`data frame <dataframeformat>` support:
6767

68+
- Added support for VECTOR columns when fetching data frames.
6869
- Fixed date handling to match PyArrow's and avoid localization issues
6970
(`issue 499 <https://github.com/oracle/python-oracledb/issues/499>`__).
7071
- Fixed bug on Windows when fetching dates prior to 1970 and after 2038

doc/src/user_guide/dataframes.rst

Lines changed: 198 additions & 2 deletions
@@ -51,6 +51,8 @@ To fetch in batches, use an iterator:
5151

5252
.. code-block:: python
5353
54+
import pyarrow
55+
5456
sql = "select * from departments where department_id < 80"
5557
# Adjust "size" to tune the query fetch performance
5658
# Here it is small to show iteration
@@ -144,6 +146,10 @@ Oracle Database will result in an exception. :ref:`Output type handlers
144146
- TIMESTAMP
145147
* - DB_TYPE_VARCHAR
146148
- STRING
149+
* - DB_TYPE_VECTOR
150+
- List or struct with DOUBLE, FLOAT, INT8, or UINT8 values
151+
152+
**Numbers**
147153

148154
When converting Oracle Database NUMBERs:
149155

@@ -158,10 +164,51 @@ When converting Oracle Database NUMBERs:
158164

159165
- In all other cases, the Arrow data type is DOUBLE.
160166

167+
**Vectors**
168+
169+
When converting Oracle Database VECTORs:
170+
171+
- Dense vectors are fetched as lists.
172+
173+
- Sparse vectors are fetched as structs with fields ``num_dimensions``,
174+
``indices`` and ``values`` similar to :ref:`SparseVector objects
175+
<sparsevectorsobj>`.
176+
177+
- VECTOR columns with flexible dimensions are supported.
178+
179+
- VECTOR columns with flexible formats are not supported. Each vector value
180+
must have the same storage format data type.
181+
182+
- Vector values are fetched as the following types:
183+
184+
.. list-table-with-summary::
185+
:header-rows: 1
186+
:class: wy-table-responsive
187+
:widths: 1 1
188+
:align: left
189+
:summary: The first column is the Oracle Database VECTOR format. The second column is the resulting Arrow data type in the list.
190+
191+
* - Oracle Database VECTOR format
192+
- Arrow data type
193+
* - FLOAT64
194+
- DOUBLE
195+
* - FLOAT32
196+
- FLOAT
197+
* - INT8
198+
- INT8
199+
* - BINARY
200+
- UINT8
201+
202+
See :ref:`dfvector` for more information.
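
As a rough illustration of the format table above, application code that needs to reason about the expected Arrow value type could keep a small lookup like the one below. This is only a sketch: the dictionary ``VECTOR_FORMAT_TO_ARROW`` and the helper ``arrow_value_type()`` are hypothetical names and are not part of python-oracledb. The type names are simply copied from the table.

```python
# Hypothetical lookup mirroring the table above; not a python-oracledb API.
VECTOR_FORMAT_TO_ARROW = {
    "FLOAT64": "DOUBLE",
    "FLOAT32": "FLOAT",
    "INT8": "INT8",
    "BINARY": "UINT8",
}


def arrow_value_type(vector_format: str) -> str:
    """Return the Arrow value type name for an Oracle VECTOR storage format."""
    try:
        return VECTOR_FORMAT_TO_ARROW[vector_format.upper()]
    except KeyError:
        # Formats outside the table are rejected rather than guessed.
        raise ValueError(
            f"unsupported VECTOR format: {vector_format!r}"
        ) from None


print(arrow_value_type("float32"))
```

Raising ``ValueError`` for anything outside the table keeps unsupported formats from silently falling through.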
203+
204+
**LOBs**
205+
161206
When converting Oracle Database CLOBs and BLOBs:
162207

163208
- The LOBs must be no more than 1 GB in length.
164209

210+
**Dates and Timestamps**
211+
165212
When converting Oracle Database DATEs and TIMESTAMPs:
166213

167214
- Arrow TIMESTAMPs will not have timezone data.
@@ -236,6 +283,8 @@ An example that creates and uses a `PyArrow Table
236283

237284
.. code-block:: python
238285
286+
import pyarrow
287+
239288
# Get an OracleDataFrame
240289
# Adjust arraysize to tune the query fetch performance
241290
sql = "select id, name from SampleQueryTab order by id"
@@ -303,8 +352,8 @@ An example that creates and uses a `Polars DataFrame
303352

304353
.. code-block:: python
305354
306-
import pyarrow
307355
import polars
356+
import pyarrow
308357
309358
# Get an OracleDataFrame
310359
# Adjust arraysize to tune the query fetch performance
@@ -377,8 +426,8 @@ For example, to convert to `NumPy <https://numpy.org/>`__ ``ndarray`` format:
377426

378427
.. code-block:: python
379428
380-
import pyarrow
381429
import numpy
430+
import pyarrow
382431
383432
SQL = "select id from SampleQueryTab order by id"
384433
@@ -426,3 +475,150 @@ An example of working with data as a `Torch tensor
426475
427476
See `samples/dataframe_torch.py <https://github.com/oracle/python-oracledb/
428477
blob/main/samples/dataframe_torch.py>`__ for a runnable example.
478+
479+
.. _dfvector:
480+
481+
Using VECTOR data with Data Frames
482+
----------------------------------
483+
484+
Columns of the `VECTOR <https://www.oracle.com/pls/topic/lookup?ctx=dblatest&
485+
id=GUID-746EAA47-9ADA-4A77-82BB-64E8EF5309BE>`__ data type can be fetched with
486+
the methods :meth:`Connection.fetch_df_all()` and
487+
:meth:`Connection.fetch_df_batches()`. VECTOR columns can have flexible
488+
dimensions, but flexible storage formats are not supported: each vector value
489+
must have the same format data type. Vectors can be dense or sparse.
490+
491+
See :ref:`dftypemapping` for the type mapping for VECTORs.
492+
493+
**Dense Vectors**
494+
495+
By default, Oracle Database vectors are "dense". These are fetched in
496+
python-oracledb as Arrow lists. For example, if the table::
497+
498+
create table myvec (v64 vector(3, float64));
499+
500+
contains these two vectors::
501+
502+
[4.1, 5.2, 6.3]
503+
[7.1, 8.2, 9.3]
504+
505+
then the code:
506+
507+
.. code-block:: python
508+
509+
odf = connection.fetch_df_all("select v64 from myvec")
510+
pyarrow_table = pyarrow.Table.from_arrays(
511+
odf.column_arrays(), names=odf.column_names()
512+
)
513+
514+
will result in a PyArrow table containing lists of doubles. The table can be
515+
converted to a data frame of your chosen library using functionality described
516+
earlier in this chapter. For example, to convert to Pandas:
517+
518+
.. code-block:: python
519+
520+
pdf = pyarrow_table.to_pandas()
521+
print(pdf)
522+
523+
The output will be::
524+
525+
V64
526+
0 [4.1, 5.2, 6.3]
527+
1 [7.1, 8.2, 9.3]
528+
529+
**Sparse Vectors**
530+
531+
Sparse vectors (where many of the values are 0) are fetched as structs with
532+
fields ``num_dimensions``, ``indices``, and ``values`` similar to
533+
:ref:`SparseVector objects <sparsevectorsobj>` which are discussed in a
534+
non-data frame context in :ref:`sparsevectors`.
535+
536+
If the table::
537+
538+
create table myvec (v64 vector(3, float64, sparse));
539+
540+
contains these two vectors::
541+
542+
[3, [1,2], [4.1, 5.2]]
543+
[3, [0], [9.3]]
544+
545+
then the code to fetch as data frames:
546+
547+
.. code-block:: python
548+
549+
import pyarrow
550+
551+
odf = connection.fetch_df_all("select v64 from myvec")
552+
pdf = pyarrow.Table.from_arrays(
553+
odf.column_arrays(), names=odf.column_names()
554+
).to_pandas()
555+
556+
print(pdf)
557+
558+
print("First row:")
559+
560+
num_dimensions = pdf.iloc[0].V64['num_dimensions']
561+
print(f"num_dimensions={num_dimensions}")
562+
563+
indices = pdf.iloc[0].V64['indices']
564+
print(f"indices={indices}")
565+
566+
values = pdf.iloc[0].V64['values']
567+
print(f"values={values}")
568+
569+
will display::
570+
571+
V64
572+
0 {'num_dimensions': 3, 'indices': [1, 2], 'valu...
573+
1 {'num_dimensions': 3, 'indices': [0], 'values'...
574+
575+
First row:
576+
num_dimensions=3
577+
indices=[1 2]
578+
values=[4.1 5.2]
579+
580+
You can convert each struct as needed. One way to convert into `Pandas
581+
dataframes with sparse values
582+
<https://pandas.pydata.org/docs/user_guide/sparse.html>`__ is via a `SciPy
583+
coordinate format matrix <https://docs.scipy.org/doc/scipy/reference/generated/
584+
scipy.sparse.coo_array.html#scipy.sparse.coo_array>`__. The Pandas method
585+
`from_spmatrix() <https://pandas.pydata.org/docs/reference/api/
586+
pandas.DataFrame.sparse.from_spmatrix.html>`__ can then be used to create the
587+
final sparse dataframe:
588+
589+
.. code-block:: python
590+
591+
import numpy
592+
import pandas
593+
import pyarrow
594+
import scipy
595+
596+
def convert_to_sparse_array(val):
597+
    dimensions = val["num_dimensions"]
598+
    col_indices = val["indices"]
599+
    row_indices = numpy.zeros(len(col_indices))
600+
    values = val["values"]
601+
    sparse_matrix = scipy.sparse.coo_matrix(
602+
        (values, (col_indices, row_indices)), shape=(dimensions, 1))
603+
    return pandas.arrays.SparseArray.from_spmatrix(sparse_matrix)
604+
605+
odf = connection.fetch_df_all("select v64 from myvec")
606+
pdf = pyarrow.Table.from_arrays(
607+
odf.column_arrays(), odf.column_names()
608+
).to_pandas()
609+
610+
pdf["SPARSE_ARRAY_V64"] = pdf["V64"].apply(convert_to_sparse_array)
611+
612+
print(pdf.SPARSE_ARRAY_V64)
613+
614+
The code will print::
615+
616+
0 [0.0, 4.1, 5.2]
617+
Fill: 0.0
618+
IntIndex
619+
Indices: ar...
620+
1 [9.3, 0.0, 0.0]
621+
Fill: 0.0
622+
IntIndex
623+
Indices: ar...
624+
Name: SPARSE_ARRAY_V64, dtype: object
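
When SciPy and Pandas sparse types are not needed, the same expansion can be sketched in plain Python. The example below is a minimal illustration, not part of the commit: plain dictionaries stand in for the fetched structs (same ``num_dimensions``, ``indices``, and ``values`` fields), and the helper name ``sparse_struct_to_dense()`` is hypothetical. No database connection is used.

```python
def sparse_struct_to_dense(val):
    """Expand a sparse vector struct into a plain dense list of floats."""
    # Start with zeros for every dimension, then fill in the stored values.
    dense = [0.0] * int(val["num_dimensions"])
    for index, value in zip(val["indices"], val["values"]):
        dense[int(index)] = float(value)
    return dense


# Dictionaries shaped like the two sparse rows used in the example above
rows = [
    {"num_dimensions": 3, "indices": [1, 2], "values": [4.1, 5.2]},
    {"num_dimensions": 3, "indices": [0], "values": [9.3]},
]

for row in rows:
    print(sparse_struct_to_dense(row))
```

For the two sample rows this yields ``[0.0, 4.1, 5.2]`` and ``[9.3, 0.0, 0.0]``, matching the dense values shown in the sparse array output above.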

doc/src/user_guide/vector_data_type.rst

Lines changed: 3 additions & 0 deletions
@@ -32,6 +32,9 @@ various storage formats mentioned above. For example:
3232
vec_data vector
3333
)
3434
35+
If you are interested in using VECTOR data with data frames, see
36+
:ref:`dfvector`.
37+
3538
.. _intfloatformat:
3639

3740
Using FLOAT32, FLOAT64, and INT8 Vectors

samples/dataframe_numpy.py

Lines changed: 57 additions & 6 deletions
@@ -25,12 +25,14 @@
2525
# -----------------------------------------------------------------------------
2626
# dataframe_numpy.py
2727
#
28-
# Shows how to use connection.fetch_df_all() to efficiently put data into a
29-
# NumPy ndarray via the DLPack standard memory layout.
28+
# Shows how to use connection.fetch_df_all() to put data into a NumPy ndarray
3029
# -----------------------------------------------------------------------------
3130

32-
import pyarrow
31+
import array
32+
import sys
33+
3334
import numpy
35+
import pyarrow
3436

3537
import oracledb
3638
import sample_env
@@ -46,11 +48,14 @@
4648
params=sample_env.get_connect_params(),
4749
)
4850

49-
SQL = "select id from SampleQueryTab order by id"
51+
# -----------------------------------------------------------------------------
52+
#
53+
# Fetching all records
5054

5155
# Get an OracleDataFrame
5256
# Adjust arraysize to tune the query fetch performance
53-
odf = connection.fetch_df_all(statement=SQL, arraysize=100)
57+
sql = "select id from SampleQueryTab order by id"
58+
odf = connection.fetch_df_all(statement=sql, arraysize=100)
5459

5560
# Convert to an ndarray via the Python DLPack specification
5661
pyarrow_array = pyarrow.array(odf.get_column_by_name("ID"))
@@ -62,10 +67,56 @@
6267
print("Type:")
6368
print(type(np)) # <class 'numpy.ndarray'>
6469

65-
# Perform various numpy operations on the ndarray
70+
print("Values:")
71+
print(np)
72+
73+
# Perform various NumPy operations on the ndarray
6674

6775
print("\nSum:")
6876
print(numpy.sum(np))
6977

7078
print("\nLog10:")
7179
print(numpy.log10(np))
80+
81+
# -----------------------------------------------------------------------------
82+
#
83+
# Fetching VECTORs
84+
85+
# The VECTOR example only works with Oracle Database 23.4 or later
86+
if sample_env.get_server_version() < (23, 4):
87+
    sys.exit("This example requires Oracle Database 23.4 or later.")
88+
89+
# The VECTOR example works with thin mode, or with thick mode using Oracle
90+
# Client 23.4 or later
91+
if not connection.thin and oracledb.clientversion()[:2] < (23, 4):
92+
    sys.exit(
93+
        "This example requires python-oracledb thin mode, or Oracle Client"
94+
        " 23.4 or later"
95+
    )
96+
97+
# Insert sample data
98+
rows = [
99+
    (array.array("d", [11.25, 11.75, 11.5]),),
100+
    (array.array("d", [12.25, 12.75, 12.5]),),
101+
]
102+
103+
with connection.cursor() as cursor:
104+
    cursor.executemany("insert into SampleVectorTab (v64) values (:1)", rows)
105+
106+
# Get an OracleDataFrame
107+
# Adjust arraysize to tune the query fetch performance
108+
sql = "select v64 from SampleVectorTab order by id"
109+
odf = connection.fetch_df_all(statement=sql, arraysize=100)
110+
111+
# Convert to a NumPy ndarray
112+
pyarrow_array = pyarrow.array(odf.get_column_by_name("V64"))
113+
np = pyarrow_array.to_numpy(zero_copy_only=False)
114+
115+
print("Type:")
116+
print(type(np)) # <class 'numpy.ndarray'>
117+
118+
print("Values:")
119+
print(np)
120+
121+
print("\nSum:")
122+
print(numpy.sum(np))
