Skip to content

gh-134706: Return bytes written from codecs.Stream(Reader)Writer.write() #134708

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

srittau
Copy link
Contributor

@srittau srittau commented May 26, 2025

Copy link
Member

@sobolevn sobolevn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider this a bug fix and propose to backport this change.

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small suggestion for the NEWS

@picnixz
Copy link
Member

picnixz commented May 26, 2025

Actually, there is also https://docs.python.org/3/library/asyncio-stream.html#asyncio.StreamWriter which doesn't return the number of bytes as well.

@sobolevn sobolevn added needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels May 26, 2025
@srittau
Copy link
Contributor Author

srittau commented May 26, 2025

Actually, there is also https://docs.python.org/3/library/asyncio-stream.html#asyncio.StreamWriter which doesn't return the number of bytes as well.

I think this is a different case, as that seems to write asynchronously, so the number of bytes written is not known at the time write() returns.

Copy link
Contributor

@cmaloney cmaloney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

--
There are a number of cases in test mocks which do similarly I have in my backlog to make small patches for (Have a patch locally which makes a warning if TextIOWrapper gets something other than a number of bytes which finds a lot).

@sobolevn
Copy link
Member

btw, can we add / modify a test for this?

@srittau
Copy link
Contributor Author

srittau commented May 27, 2025

I was originally looking into this, but noticed that there were no tests for codecs at all. I can start a test suite, of course.

@sobolevn
Copy link
Member

I am not a fan of adding big amount of test code in the same trivial PR. So, please, let's create a new one with codecs tests 🙏

@picnixz
Copy link
Member

picnixz commented May 27, 2025

The codecs tests are in test_capi IIRC and/or in another place that I forgot

@srittau
Copy link
Contributor Author

srittau commented May 27, 2025

There are actually two tests: One for the capi and one for the Python wrapper. They are in the obvious place, so I have no idea, how I could overlook them ... Will add tests.

I am not a fan of adding big amount of test code in the same trivial PR. So, please, let's create a new one with codecs tests 🙏

Well, I was only going to add a test for this specific feature anyway, not a whole suite of tests.

@srittau
Copy link
Contributor Author

srittau commented May 28, 2025

Could this be merged so the backport gets into the next 3.13 release which is scheduled for 2025-06-03?

@AA-Turner
Copy link
Member

cc @malemburg as codecs expert and original author of this code.

Copy link
Contributor

@cmaloney cmaloney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this change .writelines as well?

@srittau
Copy link
Contributor Author

srittau commented May 30, 2025

Should this change .writelines as well?

Looking at the documentation for IOBase.writelines and the existing stub annotations, it seems unusual for writelines() to return anything but None. In typeshed, all writelines methods are annotated to return either None or object. (The latter probably to allow sub-classes to return something, although none seem to want to do that.) So I would recommend not to change it for codecs.

@serhiy-storchaka
Copy link
Member

It returns the number of bytes written, but io.TextIOWrapper.write() returns the number of characters written.

@srittau
Copy link
Contributor Author

srittau commented Jun 2, 2025

I don't want to be a pest, but 3.13.4 is expected tomorrow, and I would like to update typeshed after that release while this is still fresh in my mind.

@AA-Turner
Copy link
Member

I'd prefer to wait for a review from MAL, better to delay than rush a merge.

@malemburg
Copy link
Member

I'm not sure whether it's a good idea to return a value from .write(). The reason is that not the StreamWriter defines what kind of count is returned, but instead the .write() method of the underlying stream. Most of the time, this will be a bytes count, but not always.

The io module also is not consistent with this: most objects return number of bytes, but TextIOBase uses number of characters (or rather: code points) and always returns len(text) - not really useful information.

Overall, I'm not convinced that returning a value provides much benefit. You can use .tell() to get a defined stream position, if needed (and available).

In any case: This is a new feature, so I don't think it should go into a patch level release.

@srittau
Copy link
Contributor Author

srittau commented Jun 2, 2025

Well, the current behavior has the downside that it may break code in unexpected ways, as it breaks the write() contract. This is especially true when a StreamWriter is wrapped by one of the many I/O wrappers. This was caught in typeshed, so this is a real-world problem.

@picnixz picnixz removed needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels Jun 2, 2025
@picnixz
Copy link
Member

picnixz commented Jun 2, 2025

For now, I'll remove the backport labels; we'll add them back if needed. The fact that the contract is broken is a convincing argument though.

@malemburg
Copy link
Member

Well, the current behavior has the downside that it may break code in unexpected ways, as it breaks the write() contract. This is especially true when a StreamWriter is wrapped by one of the many I/O wrappers. This was caught in typeshed, so this is a real-world problem.

I'm not sure where you get this idea from that there is some .write() contract. As I tried to explain, different I/O implementations return different things for .write(). Some return None, others number of bytes written and yet others number of code points.

We could start to use the same approach as is done for the StreamReaderWriter, in that we simply return whatever the underlying stream returns for .write(), but this should then be documented as such and also made consistent with .writelines().

@srittau
Copy link
Contributor Author

srittau commented Jun 2, 2025

I'm not sure where you get this idea from that there is some .write() contract. As I tried to explain, different I/O implementations return different things for .write(). Some return None, others number of bytes written and yet others number of code points.

The contract to return the number of items written (characters or bytes) for synchronous I/O is found throughout the standard library, both in docs in implementation. For example, all types returned by open() return the number of items written, as does the (virtual) base class typing.IO, which tries to be a base for "file-like" types. There are very few exceptions to file-like classes that don't return the number of items written. Apart from codecs classes, I could only find sqlite3.Blob.write() and wsgiref.validate.ErrorWrapper.write(). These should probably be fixed as well.

We could start to use the same approach as is done for the StreamReaderWriter, in that we simply return whatever the underlying stream returns for .write(), but this should then be documented as such and also made consistent with .writelines().

This is exactly what this PR does, although the documentation was removed when it was determined that this should be a bug fix. writelines() never returns anything but None in the stdlib. Changing that would be a bigger effort.

@serhiy-storchaka
Copy link
Member

If returning the number of characters written is the part of the contract, then returning the number of bytes written breaks the contract.

The codecs streams were implemented before adding the io module. StreamWriter is not a part of the io class hierarchy.

Of course, been more compatible with TextIOWriter would be good. But we should not introduce a new subtle discrepancy.

@serhiy-storchaka
Copy link
Member

BTW, returning the number of bytes written is essential for unbuffered bytes stream, but I do not see a use case for text streams.

@srittau
Copy link
Contributor Author

srittau commented Jun 2, 2025

If returning the number of characters written is the part of the contract, then returning the number of bytes written breaks the contract.

That's actually a good point. The writers behave like a "character writer", not a "bytes writer" to the user. The fact that they write bytes behind the scenes does not matter to the user. So instead of returning the number of bytes written, we should actually return the number of characters written (consistently, without looking at the actual codec). This would also be consistent with other text writers:

>>> with open("foo", "w") as f:
...     print(f.write("Föo"))
... 
3

@cmaloney
Copy link
Contributor

cmaloney commented Jun 2, 2025

For character writers and non-blocking what number of characters should be returned if only part of a multi-byte character is written? (Non-blocking I/O is used with various pieces and usually doesn't guarantee whole "characters" are written. Going from None -> int returned is a safeish API change, int -> slightly different meaning int a lot less safe).

@srittau
Copy link
Contributor Author

srittau commented Jun 2, 2025

For character writers and non-blocking what number of characters should be returned if only part of a multi-byte character is written?

Non-blocking I/O generally returns None for write(), but that is a special case that's not supported by the stream writers as far as I can tell.

@cmaloney
Copy link
Contributor

cmaloney commented Jun 2, 2025

None when no data is written and writing would block, they return number of bytes written / can do "partial" writes generally. Some (ex. BufferedIO) retry, but not all (ex. FileIO / "Raw IO" and TextIO). As far as I know there's no reliable way to know "buffers writes" and/or "will retry partial writes" for a "file-like" at the moment... For "Raw IO" it is well specified and tested what happens, and BufferedIO has a consistent write behavior (write raises BlockingIOError with a characters_written member set to the number of bytes written, read returns number of bytes). "Text IO" behavior wasn't specified in the initial I/O spec (PEP-3116 "Non-Blocking I/O"). Text I/O can be used directly with Raw I/O sometimes (and is), which has multiple open long-standing bugs (gh-57531 is probably the most up to date discussion one)

Also note: the implementation and docs differ on some cases although I'm actively working on getting the docs and implementation to match.

@malemburg
Copy link
Member

I have thought a bit more about this PR and "bug" report:

  • First, the contract you are seeing is really just an implementation detail of the io module in Python 3. In Python 2, file.write() returned None, just like the StreamWriter does. Remember that StreamWriter was designed and added in the Python 2 days. There is no general concept of having write methods return the number of bytes written in Python. What you describe as a bug, is really not a bug, but a feature request.

  • Since write methods can be had on many stream-like objects, it is not clear whether they should return number of bytes written, number of code points written, number of items written, etc. This depends a lot on context. You can see that the io module chose to return number of code points for TextIOBase, but number of bytes for everything else. In some context, having access to the number of bytes written may make sense (e.g. if buffering is applied or a retry operation is underway), but in most cases, such information is better left with the stream and you can access this via stream APIs.

  • Returning the number of code points passed in to the method is not useful information, since the codec's encoder may not really have written all of them to the stream. If at all, the write methods should always return number of bytes written or None if that information is not available, to make the return value useful.

  • Finally, you are proposing a change to a base class that is being subclassed by lots of codecs out there. We don't control all codecs using it (only the ones in the stdlib), so a change to a base class method will not propogate as you probably expect. Subclasses can continue to return None or return something else (e.g. number of code points, bytes or items). The return value of the StreamWriter.write() method is not defined. If we now do define a value, code will break, since other code will start assuming that such a return value is always to be expected.

  • Aside: Please also recognize that StreamWriter instances can work on many different stream objects, not just the ones provided by the io module. They are using the codec encoders to write data to the stream and only rely on a minimal stream API for this purpose.

Given all this, I'm not in favor of the change and will close the PR and ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants