
joblib.hash Produces Different Hashes for Equivalent Pandas DataFrames #1611

Open
LouisJalouzot opened this issue Sep 25, 2024 · 3 comments

@LouisJalouzot

Description

joblib.hash produces different hash values for DataFrames that are equivalent in content, columns, and index, which can cause cache misses. The problem shows up in particular when the DataFrames' _is_copy attributes differ, e.g. when one holds a weakref to another DataFrame and the other is None.

To Reproduce

import pandas as pd
import joblib

# Create two equivalent DataFrames
df = pd.DataFrame({"a": [1, 1, 2]})
df_1 = df.drop_duplicates(ignore_index=True)
df_2 = pd.DataFrame({"a": [1, 2]})

# They are identical in content, columns, and index
print(df_1)
# Output:
#    a
# 0  1
# 1  2

print(df_2)
# Output:
#    a
# 0  1
# 1  2

print(df_1.columns, df_2.columns)
# Output:
# Index(['a'], dtype='object') Index(['a'], dtype='object')

print(df_1.index, df_2.index)
# Output:
# RangeIndex(start=0, stop=2, step=1) RangeIndex(start=0, stop=2, step=1)

# However, their joblib hash values are different
print(joblib.hash(df_1) == joblib.hash(df_2))
# Output:
# False

# In this case, their _is_copy attributes are different
print(str(df_1._is_copy), str(df_2._is_copy))
# Output:
# <weakref at 0x7bd97103e160; to 'DataFrame' at 0x7bd82c647810> None
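
A possible workaround (my note, not part of the original report): build the cache key from the DataFrame's content only, e.g. with pd.util.hash_pandas_object, so that internal attributes such as _is_copy cannot influence the hash. df_content_hash is a placeholder name; the sketch reuses df_1 and df_2 from the snippet above.

import joblib
import pandas as pd

def df_content_hash(df: pd.DataFrame) -> str:
    """Hash a DataFrame by its values, index, columns and dtypes only."""
    # hash_pandas_object returns one deterministic uint64 per row (index included)
    row_hashes = pd.util.hash_pandas_object(df, index=True).to_numpy()
    return joblib.hash((row_hashes, tuple(df.columns), tuple(str(t) for t in df.dtypes)))

print(df_content_hash(df_1) == df_content_hash(df_2))
# Expected output: True, since both keys are derived from content alone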

Environment

Python: 3.11.10
Pandas: 2.2.2
Joblib: 1.4.2

@gabrielgrant

This Stack Overflow question seems to be about the same issue and has some additional details.

@gabrielgrant

This seems to be another instance of the last of the gotchas listed in the docs:

Cache-miss with objects that have non-reproducible pickle representations

pytorch.Tensor is called out as an example, but if a pandas DataFrame is an expected failure, it seems that should be called out explicitly as well.
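
For reference (my own check, not from the linked docs): joblib.hash feeds objects through a pickler and hashes the resulting byte stream, so comparing pickle bytes directly is a rough proxy for what the hasher sees. Reusing the DataFrames from the report above:

import pickle

# Any difference in the serialized representation (internal block layout,
# flags, copy-tracking state, ...) changes the joblib hash, even when the
# visible data, columns and index are identical.
print(pickle.dumps(df_1) == pickle.dumps(df_2))
# May print False on affected pandas/joblib versions, matching the hash mismatch above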

@lesteve
Member

lesteve commented Feb 14, 2025

I think this has been discussed in the past; some related issues can probably be found. From what I remember, the rationale is that we would rather be conservative, i.e. recompute the result, rather than use a cached result that may be wrong.

It would be nice to be more explicit about this in the documentation if possible ...
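
Until then, a user-side mitigation (a sketch of mine, not an official joblib recommendation) is to canonicalize DataFrame arguments before they reach a Memory-cached function, so that equivalent inputs are rebuilt the same way and should map to the same cache entry. The cache directory, canonicalize, and expensive_step are placeholder names.

import joblib
import pandas as pd

memory = joblib.Memory("./cachedir", verbose=0)

def canonicalize(df: pd.DataFrame) -> pd.DataFrame:
    # Rebuild the frame from plain Python data so hidden internal state
    # does not leak into the hash of the cached call's arguments.
    return pd.DataFrame(df.to_dict(orient="list"), index=df.index.copy())

@memory.cache
def expensive_step(df: pd.DataFrame) -> int:
    return int(df["a"].sum())

expensive_step(canonicalize(df_1))
expensive_step(canonicalize(df_2))  # expected to hit the cache entry from the first call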
