ENH: Add force_suffixes boolean argument to pd.merge #61498

Open
wants to merge 3 commits into
base: main
kopytjuk
Contributor

@kopytjuk kopytjuk commented May 26, 2025

Motivation

Often, when working with wide dataframes (i.e. with many columns) in exploratory settings, merging them leads to an even wider dataframe. Currently, the suffixes mechanism is applied only to columns that have the same name in both dataframes.

However, often developers alter the column names beforehand, or use solutions similar to the one suggested here.
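To illustrate the current behavior described above, here is a minimal sketch. The `Extra` column is an assumption added for illustration and is not part of the PR's example; it shows that a non-overlapping column keeps its name, while the join key `ID` is not suffixed either.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Value": ["A", "B", "C"]})
# "Extra" is a made-up column that exists only in df2.
df2 = pd.DataFrame({"ID": [2, 3, 4], "Value": ["D", "E", "F"], "Extra": [10, 20, 30]})

# Only the overlapping non-key column ("Value") receives the suffixes.
merged = pd.merge(df1, df2, on="ID", how="inner", suffixes=("_left", "_right"))
print(list(merged.columns))
```

After the merge it is no longer visible from the column names alone that `Extra` came from the right-hand dataframe.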

Changes

This PR adds a force_suffixes boolean argument to pd.merge which applies the suffixes to all columns, whether or not they are equally named.

The goal is to have the following:

import pandas as pd

df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Value': ['A', 'B', 'C'],
})

df2 = pd.DataFrame({
    'ID': [2, 3, 4],
    'Value': ['D', 'E', 'F'],
})

# force_suffixes is the argument proposed in this PR
merged_df = pd.merge(df1, df2, on='ID', how="inner", suffixes=('_left', '_right'), force_suffixes=True)

# Goal (row values ordered to match the column names):
expected = pd.DataFrame(
    [[2, "B", 2, "D"], [3, "C", 3, "E"]],
    columns=["ID_left", "Value_left", "ID_right", "Value_right"],
)

@kopytjuk kopytjuk changed the title Add force_suffixes boolean argument to pd.merge ENH: Add force_suffixes boolean argument to pd.merge May 26, 2025
@kopytjuk
Contributor Author

kopytjuk commented May 26, 2025

Hey @mroeschke, can you please take a look at my PR and tell me if the direction is right for you (i.e. you are OK with an additional argument) before I fix the failing tests and linting errors and adjust the documentation? Thanks in advance!

Member

@mroeschke mroeschke left a comment

Thanks for opening the PR, but I would say this feature needs more discussion and agreement from the core team before moving forward with a PR.

@datapythonista
Member

merge already has quite a complex signature, and what you are trying to solve here can be easily done in pandas with:

pd.merge(df1.add_suffix("_left"), df2.add_suffix("_right"))

Let me know if I'm missing something, but while it seems some people would appreciate that, it doesn't seem any core dev is excited about this, and I understand why, since it makes a tricky method even trickier.

@TomAugspurger, in the issue it seems you were a bit more positive than others about adding this when it was discussed some time ago. Would you move forward with this PR? Otherwise let's close it.

@kopytjuk
Contributor Author

kopytjuk commented Jun 2, 2025

merge already has quite a complex signature, and what you are trying to solve here can be easily done in pandas with:

pd.merge(df1.add_suffix("_left"), df2.add_suffix("_right"))

Let me know if I'm missing something, but while it seems some people would appreciate that, it doesn't seem any core dev is excited about this, and I understand why, since it makes a tricky method even trickier.

Thanks for your feedback!

Let me motivate the additional flag approach. I agree with you that using add_suffix is a valid approach. However, it adds complexity to the users' code, forcing them to pass additional arguments like left_on="uuid_left", right_on="id_right", which makes it even more complicated.
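For reference, a minimal sketch of what the add_suffix route looks like on the df1/df2 example from the PR description. Because add_suffix also renames the join key, the two frames no longer share a column name, so the keys have to be spelled out with left_on and right_on. As far as I know, calling pd.merge on the suffixed frames without them raises a MergeError because there are no common columns.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Value": ["A", "B", "C"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "Value": ["D", "E", "F"]})

# add_suffix renames every column, including the join key "ID".
left = df1.add_suffix("_left")
right = df2.add_suffix("_right")

# The renamed keys must now be passed explicitly.
merged = pd.merge(left, right, left_on="ID_left", right_on="ID_right", how="inner")
print(merged)
```

This gives the columns ID_left, Value_left, ID_right and Value_right, at the cost of the caller having to track the renamed key names.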

Using force_suffixes would make joins of wide data frames in exploratory settings very easy, because people cannot remember 10 different column names for each of the participants.

However, I also agree about the additional complexity of the upcoming implementation. The internal logic of renaming and returning columns is already quite complex, and it is not easy to grasp, maintain and test.

I will wait for your decision.

@datapythonista
Member

Thanks for the clarification. I didn't realize the suffix would also be added to the join columns, which makes things more complex than just calling add_suffix, which otherwise feels like quite a clean approach. There is clearly a trade-off here. I'm personally fine with the change even if it adds some complexity, since there seem to be many users who would appreciate it.

@pandas-dev/pandas-core any opinion on adding a flag to pandas.merge to make the suffixes be added also to non-conflicting names?

@Dr-Irv
Contributor

Dr-Irv commented Jun 3, 2025

@pandas-dev/pandas-core any opinion on adding a flag to pandas.merge to make the suffixes be added also to non-conflicting names?

I think this is a nice idea. If we default it to False, then the current behavior is preserved.

@rhshadrach
Member

This is a situation I've run into occasionally. It's a few lines of user code, and yes, you need to track what you're joining on. I don't think it's unreasonable for the onus to be on users here, but no objection to adding a flag.
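As a rough sketch of the "few lines of user code" route: the helper below is hypothetical (the name merge_all_suffixed is made up and is not part of pandas or of this PR's implementation), it assumes string column names and a shared key name on both sides, and it defaults force_suffixes to False so the usual pd.merge behavior is kept.

```python
import pandas as pd


def merge_all_suffixed(left, right, on, how="inner",
                       suffixes=("_left", "_right"), force_suffixes=False):
    """Hypothetical user-side helper, not a pandas API."""
    if not force_suffixes:
        # Default path: plain pd.merge, suffixes only on overlapping columns.
        return pd.merge(left, right, on=on, how=how, suffixes=suffixes)

    keys = [on] if isinstance(on, str) else list(on)
    lsuffix, rsuffix = suffixes
    # Suffix every column, then join on the renamed key columns.
    return pd.merge(
        left.add_suffix(lsuffix),
        right.add_suffix(rsuffix),
        left_on=[key + lsuffix for key in keys],
        right_on=[key + rsuffix for key in keys],
        how=how,
    )


df1 = pd.DataFrame({"ID": [1, 2, 3], "Value": ["A", "B", "C"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "Value": ["D", "E", "F"]})

default = merge_all_suffixed(df1, df2, on="ID")
forced = merge_all_suffixed(df1, df2, on="ID", force_suffixes=True)
print(list(default.columns))
print(list(forced.columns))
```

The helper keeps the key bookkeeping in one place, which is the part users otherwise have to track by hand.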
