How to copy a CSR matrix into a sparsevec column? #127

Closed
TopCoder2K opened this issue Apr 21, 2025 · 2 comments

I have a sparse vector, the result of applying sklearn's TfidfVectorizer:

<Compressed Sparse Row sparse matrix of dtype 'float64'
        with 4 stored elements and shape (1, 157541)>
  Coords        Values
  (0, 5051)     0.35521903059198523
  (0, 14956)    0.5566306658037382
  (0, 45152)    0.7328483894186835
  (0, 60738)    0.1640578566196061

which I want to copy into a table with a sparsevec column. As far as I understand from the documentation, the correct way to do this is the following:

        with cur.copy(
            "COPY my_table FROM STDIN WITH (FORMAT BINARY)"
        ) as copy:
            copy.set_types(["sparsevec"])
            copy.write_row((SparseVector(the_sparse_vector),))

but this produces an error:

psycopg.errors.DataException: sparsevec indices must not contain duplicates

I investigated a bit and found that this line uses value.coords[0] rather than value.coords[1], even for two-dimensional input. Is this a bug? What should I do?
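
One possible workaround on affected versions is to build the index-to-value mapping from the CSR arrays directly, so that only column indices are ever used. This is a minimal sketch, not the library's method: `csr_row_to_mapping` is a hypothetical helper, and the arrays are hand-written to match the matrix shown above (with scipy they would come from `m.indptr`, `m.indices`, `m.data` and `m.shape[1]`).

```python
def csr_row_to_mapping(indptr, indices, data, ncols, row=0):
    """Return ({column_index: value}, ncols) for one row of a CSR matrix.

    Hypothetical helper: works on plain lists or NumPy arrays.
    """
    # In CSR, the stored elements of `row` live in the slice indptr[row]:indptr[row + 1].
    start, end = indptr[row], indptr[row + 1]
    mapping = {
        int(col): float(val)
        for col, val in zip(indices[start:end], data[start:end])
    }
    return mapping, ncols


# Hand-written CSR arrays equivalent to the 1 x 157541 matrix above.
indptr = [0, 4]
indices = [5051, 14956, 45152, 60738]
data = [0.35521903059198523, 0.5566306658037382,
        0.7328483894186835, 0.1640578566196061]

mapping, dim = csr_row_to_mapping(indptr, indices, data, 157541)
print(mapping, dim)

# Assumption: the pair can then be passed as SparseVector(mapping, dim),
# the same dict-plus-dimension form that the SparseVector repr prints.
```

Because the mapping is keyed by column index, a single-row matrix cannot produce duplicate indices this way.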

Additional information about the example:

  1. The code
print(the_sparse_vector)
the_sparse_vector = the_sparse_vector.tocoo()
print(the_sparse_vector.ndim, the_sparse_vector.shape)
print(the_sparse_vector.coords)
print(the_sparse_vector.data)
print(SparseVector(the_sparse_vector))

outputs:

<Compressed Sparse Row sparse matrix of dtype 'float64'
        with 4 stored elements and shape (1, 157541)>
  Coords        Values
  (0, 5051)     0.35521903059198523
  (0, 14956)    0.5566306658037382
  (0, 45152)    0.7328483894186835
  (0, 60738)    0.1640578566196061
2 (1, 157541)
(array([0, 0, 0, 0], dtype=int32), array([ 5051, 14956, 45152, 60738], dtype=int32))
[0.35521903 0.55663067 0.73284839 0.16405786]
SparseVector({0: 0.1640578566196061}, 157541)
  2. I have the following package versions:
psycopg           3.2.6
psycopg-binary    3.2.6
pgvector          0.4.0
scipy             1.15.2
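
The collapse to a single element in the output above can be reproduced with plain Python. This is a minimal sketch that uses only the arrays printed above; it assumes the indices and values are paired into an index-to-value mapping for display, which the thread does not show directly.

```python
# Arrays copied from the printed COO output above.
rows = [0, 0, 0, 0]                  # coords[0]: row index of each stored element
cols = [5051, 14956, 45152, 60738]   # coords[1]: column index of each stored element
data = [0.35521903059198523, 0.5566306658037382,
        0.7328483894186835, 0.1640578566196061]

# Using the row indices as vector indices: every key is 0, so each value
# overwrites the previous one and only the last element survives.
by_row = dict(zip(rows, data))

# Using the column indices keeps all four stored elements.
by_col = dict(zip(cols, data))

print(by_row)
print(by_col)

# The row indices repeat, which is what the server rejects as duplicates.
has_duplicates = len(set(rows)) < len(rows)
print(has_duplicates)
```

The single surviving entry matches the `{0: 0.1640578566196061}` shown in the SparseVector repr, and the repeated zeros are consistent with the "must not contain duplicates" error.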

TopCoder2K commented Apr 21, 2025

I've implemented a quick fix:

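from pgvector import SparseVector  # assumed import path; adjust to your pgvector version


class SparseVectorFixed(SparseVector):
    def _from_sparse(self, value):
        # Normalize any scipy sparse input to COO so the coordinates are explicit.
        value = value.tocoo()

        if value.ndim == 1:
            self._dim = value.shape[0]
        elif value.ndim == 2 and value.shape[0] == 1:
            # A single-row matrix: the vector dimension is the column count.
            self._dim = value.shape[1]
        else:
            raise ValueError('expected a 1-D array or a single-row 2-D matrix')

        if hasattr(value, 'coords'):
            # scipy 1.13+
            ### Start of changes ###
            if value.ndim == 1:
                self._indices = value.coords[0].tolist()
            else:
                # For 2-D input, coords[0] holds row indices (all zeros here);
                # the vector indices are the column indices in coords[1].
                self._indices = value.coords[1].tolist()
            ### End of changes ###
        else:
            self._indices = value.col.tolist()
        self._values = value.data.tolist()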

and it seems to work fine. A SELECT from the table returns:

                                pos_vector                                 
---------------------------------------------------------------------------
 {5052:0.35521904,14957:0.5566307,45153:0.7328484,60739:0.16405785}/157541
(1 row)

though I want the changes to be confirmed by experts.
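
One way to double-check the stored row against the scipy values is to parse the text output and compare. The sparsevec text form is `{index:value,...}/dimensions` with indices starting at 1, which is why each index in the SELECT output is one greater than the scipy column index; the values are shown at single precision, so the comparison uses a tolerance. A minimal sketch, with `parse_sparsevec` as a hypothetical helper:

```python
import math


def parse_sparsevec(text):
    """Parse '{i:v,...}/dim' into a 0-based {index: value} dict and the dimension."""
    body, dim = text.strip().rsplit("/", 1)
    pairs = [p.split(":") for p in body.strip("{}").split(",") if p]
    # Shift from the 1-based text indices to 0-based column indices.
    return {int(i) - 1: float(v) for i, v in pairs}, int(dim)


# Text copied from the SELECT output above.
stored, dim = parse_sparsevec(
    "{5052:0.35521904,14957:0.5566307,45153:0.7328484,60739:0.16405785}/157541"
)

# Values from the original scipy matrix.
expected = {
    5051: 0.35521903059198523,
    14956: 0.5566306658037382,
    45152: 0.7328483894186835,
    60738: 0.1640578566196061,
}

same_indices = sorted(stored) == sorted(expected)
# Compare with a relative tolerance because the stored values are rounded.
same_values = all(
    math.isclose(stored[i], expected[i], rel_tol=1e-6) for i in expected
)
print(dim, same_indices, same_values)
```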

@ankane ankane closed this as completed in 713590a Apr 21, 2025

ankane commented Apr 21, 2025

Hi @TopCoder2K, thanks for reporting! Pushed a fix in the commit above.
