
joblib.hash Produces Different Hashes for Equivalent Pandas DataFrames #1611

Open
LouisJalouzot opened this issue Sep 25, 2024 · 3 comments

@LouisJalouzot

Description

joblib.hash produces different hash values for DataFrames that are equivalent in content, columns, and index, which can cause cache misses. The problem shows up in particular when the DataFrames' _is_copy attributes differ, e.g. when one holds a weakref to another DataFrame and the other is None.

To Reproduce

import pandas as pd
import joblib

# Create two equivalent DataFrames
df = pd.DataFrame({"a": [1, 1, 2]})
df_1 = df.drop_duplicates(ignore_index=True)
df_2 = pd.DataFrame({"a": [1, 2]})

# They are identical in content, columns, and index
print(df_1)
# Output:
#    a
# 0  1
# 1  2

print(df_2)
# Output:
#    a
# 0  1
# 1  2

print(df_1.columns, df_2.columns)
# Output:
# Index(['a'], dtype='object') Index(['a'], dtype='object')

print(df_1.index, df_2.index)
# Output:
# RangeIndex(start=0, stop=2, step=1) RangeIndex(start=0, stop=2, step=1)

# However, their joblib hash values are different
print(joblib.hash(df_1) == joblib.hash(df_2))
# Output:
# False

# In this case, their _is_copy attributes are different
print(str(df_1._is_copy), str(df_2._is_copy))
# Output:
# <weakref at 0x7bd97103e160; to 'DataFrame' at 0x7bd82c647810> None
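
A possible workaround (my note, not part of the original report): build the cache key from the DataFrame's content only, e.g. with pd.util.hash_pandas_object, so that internal attributes such as _is_copy cannot influence the hash. df_content_hash is a placeholder name; the sketch reuses df_1 and df_2 from the snippet above.

import joblib
import pandas as pd

def df_content_hash(df: pd.DataFrame) -> str:
    """Hash a DataFrame by its values, index, columns and dtypes only."""
    # hash_pandas_object returns one deterministic uint64 per row (index included)
    row_hashes = pd.util.hash_pandas_object(df, index=True).to_numpy()
    return joblib.hash((row_hashes, tuple(df.columns), tuple(str(t) for t in df.dtypes)))

print(df_content_hash(df_1) == df_content_hash(df_2))
# Expected output: True, since both keys are derived from content alone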

Environment

Python: 3.11.10
Pandas: 2.2.2
Joblib: 1.4.2

@gabrielgrant

This Stack Overflow question seems to be about the same issue and has some additional details.

@gabrielgrant

This seems to be another instance of the last of the gotchas listed in the docs:

Cache-miss with objects that have non-reproducible pickle representations

pytorch.Tensor is called out as an example, but if a pandas DataFrame is an expected failure, it seems that should be called out explicitly as well.
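
For reference (my own check, not from the linked docs): joblib.hash feeds objects through a pickler and hashes the resulting byte stream, so comparing pickle bytes directly is a rough proxy for what the hasher sees. Reusing the DataFrames from the report above:

import pickle

# Any difference in the serialized representation (internal block layout,
# flags, copy-tracking state, ...) changes the joblib hash, even when the
# visible data, columns and index are identical.
print(pickle.dumps(df_1) == pickle.dumps(df_2))
# May print False on affected pandas/joblib versions, matching the hash mismatch above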

@lesteve
Member

lesteve commented Feb 14, 2025

I think this has been discussed in the past; some related issues can probably be found. From what I remember, the rationale is that we would rather be conservative, i.e. recompute the result, rather than use a cached result that may be wrong.

It would be nice to be more explicit about this in the documentation if possible ...
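
Until then, a user-side mitigation (a sketch of mine, not an official joblib recommendation) is to canonicalize DataFrame arguments before they reach a Memory-cached function, so that equivalent inputs are rebuilt the same way and should map to the same cache entry. The cache directory, canonicalize, and expensive_step are placeholder names.

import joblib
import pandas as pd

memory = joblib.Memory("./cachedir", verbose=0)

def canonicalize(df: pd.DataFrame) -> pd.DataFrame:
    # Rebuild the frame from plain Python data so hidden internal state
    # does not leak into the hash of the cached call's arguments.
    return pd.DataFrame(df.to_dict(orient="list"), index=df.index.copy())

@memory.cache
def expensive_step(df: pd.DataFrame) -> int:
    return int(df["a"].sum())

expensive_step(canonicalize(df_1))
expensive_step(canonicalize(df_2))  # expected to hit the cache entry from the first call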
