-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[cxx-interop] Copy lazily bridged Cocoa strings correctly without hanging #83701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
04eef7c
to
3c23f30
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the contribution!
Could you add a test that fails before the fix and passes after?
Somewhere in test/Interop/SwiftToCxx/stdlib
might be a good location for it.
f95e460
to
36abda7
Compare
36abda7
to
54f3153
Compare
using IndexType = decltype(u.getStartIndex()); | ||
for (auto s = u.getStartIndex().getEncodedOffset(), | ||
e = u.getEndIndex().getEncodedOffset(); | ||
s != e; s = u.indexOffsetBy(IndexType::init(s), 1).getEncodedOffset()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming s
never ends up being equal to e
. For someone unfamiliar with Cocoa strings, could you elaborate on why that is the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the easiest way to explain this is to look at my patch, along with some debug prints
SWIFT_INLINE_THUNK String::operator std::string() const {
auto u = getUtf8();
auto size = u.getCount();
std::string result;
result.reserve(size);
auto end_offset = u.getEndIndex().getEncodedOffset();
printf("end offset = %d\n", end_offset);
auto idx = u.getStartIndex();
for (; idx.getEncodedOffset() < end_offset;
idx = u.indexAfter(idx)) {
result.push_back(static_cast<char>(u[idx]));
printf("u[idx] = %c, idx offset = %d\n", u[idx], idx.getEncodedOffset());
}
printf("final idx offset = %d\n", idx.getEncodedOffset());
return result;
}
When the Swift::String originates from an NSString containing Unicode chars (👨💻👩💻), we get
end offset = 10
u[idx] = \360, idx offset = 0
u[idx] = \237, idx offset = 0
u[idx] = \221, idx offset = 0
u[idx] = \250, idx offset = 0
u[idx] = \342, idx offset = 2
u[idx] = \200, idx offset = 2
u[idx] = \215, idx offset = 2
u[idx] = \360, idx offset = 3
u[idx] = \237, idx offset = 3
u[idx] = \222, idx offset = 3
u[idx] = \273, idx offset = 3
u[idx] = \360, idx offset = 5
u[idx] = \237, idx offset = 5
u[idx] = \221, idx offset = 5
u[idx] = \251, idx offset = 5
u[idx] = \342, idx offset = 7
u[idx] = \200, idx offset = 7
u[idx] = \215, idx offset = 7
u[idx] = \360, idx offset = 8
u[idx] = \237, idx offset = 8
u[idx] = \222, idx offset = 8
u[idx] = \273, idx offset = 8
final idx offset = 10
As can be seen, idx offset does not update correctly.
If we do string.makeContiguousUTF8()
on the Swift side before the conversion, we instead get
end offset = 22
u[idx] = \360, idx offset = 0
u[idx] = \237, idx offset = 1
u[idx] = \221, idx offset = 2
u[idx] = \250, idx offset = 3
u[idx] = \342, idx offset = 4
u[idx] = \200, idx offset = 5
u[idx] = \215, idx offset = 6
u[idx] = \360, idx offset = 7
u[idx] = \237, idx offset = 8
u[idx] = \222, idx offset = 9
u[idx] = \273, idx offset = 10
u[idx] = \360, idx offset = 11
u[idx] = \237, idx offset = 12
u[idx] = \221, idx offset = 13
u[idx] = \251, idx offset = 14
u[idx] = \342, idx offset = 15
u[idx] = \200, idx offset = 16
u[idx] = \215, idx offset = 17
u[idx] = \360, idx offset = 18
u[idx] = \237, idx offset = 19
u[idx] = \222, idx offset = 20
u[idx] = \273, idx offset = 21
final idx offset = 22
So basically, we can't rely on the offset for the first case. Using .indexAfter
however seems to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, it looks like the UTF-8 code units within each unicode code point share the same encoded offset. So when the for-loop eagerly calls getEncodedOffset()
and then needs to reconstruct the index with IndexType::init(s)
to advance the loop, it probably constructs the index for the first code unit within that code point. This causes it to jump back to an earlier code unit and looping indefinitely. Thanks for sharing the logs and for your contribution!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem! 👍🏻
Hi @Xazax-hun Thanks for the quick reply! I added a very simple test which seems to hang on main. I run
With this patch, the test finishes fine 🎉 Please excuse my affection for the Swedish letters... I can change the string if anyone has any objections 😁 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
@swift-ci please test |
@@ -70,13 +70,14 @@ static_assert(sizeof(_impl::_impl_String) >= 0, | |||
SWIFT_INLINE_THUNK String::operator std::string() const { | |||
auto u = getUtf8(); | |||
std::string result; | |||
result.reserve(u.getCount() + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that reserving the extra byte isn't necessary to avoid reallocation? I think it makes sense, since I believe the null terminator in std::string
doesn't count as an element in the string, but I want to highlight it so we don't accidentally make a mistake here.
CC @hyp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://en.cppreference.com/w/cpp/string/basic_string/capacity.html:
Note that the null terminator is not an element of the std::basic_string
So the behavior here looks correct to me.
This sort of confusion is why the |
@Catfish-Man Perhaps something like this could work? I'm building right now... SWIFT_INLINE_THUNK String::operator std::string() const {
auto ptr = getUtf8().baseAddress();
return std::string(ptr, ptr + u.getCount());
} |
IFF the String is small (15 or fewer bytes of UTF8) that won't work, because there's no contiguous storage to point to. We use closure-taking accessors to get the pointer out reliably: https://developer.apple.com/documentation/swift/string/withcstring(_:) |
I don't think we currently expose Swift closures to C++ at the moment, so we might have to keep using the non-closure-taking APIs here. |
At least this patch solves #69870, albeit not in the most efficient way. Any clues to what's going on with the Windows test @egorzhdan? |
hmmm let's try to run that again |
@hnrklssn Looks like everything passes. Can we merge this now? 🤔 |
Fixes #69870