Skip to content

Conversation

ludwwwig
Copy link
Contributor

Fixes #69870

Copy link
Contributor

@Xazax-hun Xazax-hun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the contribution!

Could you add a test that fails before the fix and passes after?

Somewhere in test/Interop/SwiftToCxx/stdlib might be a good location for it.

using IndexType = decltype(u.getStartIndex());
for (auto s = u.getStartIndex().getEncodedOffset(),
e = u.getEndIndex().getEncodedOffset();
s != e; s = u.indexOffsetBy(IndexType::init(s), 1).getEncodedOffset()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming s never ends up being equal to e. For someone unfamiliar with Cocoa strings, could you elaborate on why that is the case?

Copy link
Contributor Author

@ludwwwig ludwwwig Aug 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the easiest way to explain this is to look at my patch, along with some debug prints

SWIFT_INLINE_THUNK String::operator std::string() const {
  auto u = getUtf8();
  auto size = u.getCount();

  std::string result;
  result.reserve(size);

  auto end_offset = u.getEndIndex().getEncodedOffset();

  printf("end offset = %d\n", end_offset);

  auto idx = u.getStartIndex();

  for (; idx.getEncodedOffset() < end_offset;
        idx = u.indexAfter(idx)) {
    result.push_back(static_cast<char>(u[idx]));
    printf("u[idx] = %c, idx offset = %d\n", u[idx], idx.getEncodedOffset());
  }
  printf("final idx offset = %d\n", idx.getEncodedOffset());

  return result;
}

When the Swift::String originates from an NSString containing Unicode chars (👨‍💻👩‍💻), we get

end offset = 10
u[idx] = \360, idx offset = 0
u[idx] = \237, idx offset = 0
u[idx] = \221, idx offset = 0
u[idx] = \250, idx offset = 0
u[idx] = \342, idx offset = 2
u[idx] = \200, idx offset = 2
u[idx] = \215, idx offset = 2
u[idx] = \360, idx offset = 3
u[idx] = \237, idx offset = 3
u[idx] = \222, idx offset = 3
u[idx] = \273, idx offset = 3
u[idx] = \360, idx offset = 5
u[idx] = \237, idx offset = 5
u[idx] = \221, idx offset = 5
u[idx] = \251, idx offset = 5
u[idx] = \342, idx offset = 7
u[idx] = \200, idx offset = 7
u[idx] = \215, idx offset = 7
u[idx] = \360, idx offset = 8
u[idx] = \237, idx offset = 8
u[idx] = \222, idx offset = 8
u[idx] = \273, idx offset = 8
final idx offset = 10

As can be seen, idx offset does not update correctly.

If we do string.makeContiguousUTF8() on the Swift side before the conversion, we instead get

end offset = 22
u[idx] = \360, idx offset = 0
u[idx] = \237, idx offset = 1
u[idx] = \221, idx offset = 2
u[idx] = \250, idx offset = 3
u[idx] = \342, idx offset = 4
u[idx] = \200, idx offset = 5
u[idx] = \215, idx offset = 6
u[idx] = \360, idx offset = 7
u[idx] = \237, idx offset = 8
u[idx] = \222, idx offset = 9
u[idx] = \273, idx offset = 10
u[idx] = \360, idx offset = 11
u[idx] = \237, idx offset = 12
u[idx] = \221, idx offset = 13
u[idx] = \251, idx offset = 14
u[idx] = \342, idx offset = 15
u[idx] = \200, idx offset = 16
u[idx] = \215, idx offset = 17
u[idx] = \360, idx offset = 18
u[idx] = \237, idx offset = 19
u[idx] = \222, idx offset = 20
u[idx] = \273, idx offset = 21
final idx offset = 22

So basically, we can't rely on the offset for the first case. Using .indexAfter however seems to work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, it looks like the UTF-8 code units within each unicode code point share the same encoded offset. So when the for-loop eagerly calls getEncodedOffset() and then needs to reconstruct the index with IndexType::init(s) to advance the loop, it probably constructs the index for the first code unit within that code point. This causes it to jump back to an earlier code unit and looping indefinitely. Thanks for sharing the logs and for your contribution!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem! 👍🏻

@ludwwwig
Copy link
Contributor Author

Hi @Xazax-hun

Thanks for the quick reply!

I added a very simple test which seems to hang on main. I run

utils/run-test –lit ../llvm-project/llvm/utils/lit/lit.py ../build/Ninja-RelWithDebInfoAssert/swift-macosx-$(uname -m)/test-macosx-$(uname -m) –filter=“to-nsstring”

With this patch, the test finishes fine 🎉

Please excuse my affection for the Swedish letters... I can change the string if anyone has any objections 😁

Copy link
Contributor

@egorzhdan egorzhdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@egorzhdan
Copy link
Contributor

@swift-ci please test

@@ -70,13 +70,14 @@ static_assert(sizeof(_impl::_impl_String) >= 0,
SWIFT_INLINE_THUNK String::operator std::string() const {
auto u = getUtf8();
std::string result;
result.reserve(u.getCount() + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that reserving the extra byte isn't necessary to avoid reallocation? I think it makes sense, since I believe the null terminator in std::string doesn't count as an element in the string, but I want to highlight it so we don't accidentally make a mistake here.

CC @hyp

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to https://en.cppreference.com/w/cpp/string/basic_string/capacity.html:

Note that the null terminator is not an element of the std::basic_string

So the behavior here looks correct to me.

@parkera parkera requested a review from Catfish-Man August 15, 2025 16:09
@Catfish-Man
Copy link
Contributor

This sort of confusion is why the encodedOffset property of String.Index is deprecated; I'm not familiar with which aspects of the Swift String API are available from this layer of C++ bridging, but perhaps there's a way to structure it that doesn't rely on encodedOffset at all? For example, getting a pointer-length pair out of the String and just initializing directly from that.

@ludwwwig
Copy link
Contributor Author

ludwwwig commented Aug 16, 2025

@Catfish-Man Perhaps something like this could work? I'm building right now...

SWIFT_INLINE_THUNK String::operator std::string() const {
    auto ptr = getUtf8().baseAddress();
    return std::string(ptr, ptr + u.getCount());
}

@Catfish-Man
Copy link
Contributor

IFF the String is small (15 or fewer bytes of UTF8) that won't work, because there's no contiguous storage to point to. We use closure-taking accessors to get the pointer out reliably: https://developer.apple.com/documentation/swift/string/withcstring(_:)

@egorzhdan
Copy link
Contributor

I don't think we currently expose Swift closures to C++ at the moment, so we might have to keep using the non-closure-taking APIs here.

@ludwwwig
Copy link
Contributor Author

ludwwwig commented Aug 20, 2025

At least this patch solves #69870, albeit not in the most efficient way.

Any clues to what's going on with the Windows test @egorzhdan?

@hnrklssn
Copy link
Contributor

hmmm let's try to run that again
@swift-ci please test windows platform

@ludwwwig
Copy link
Contributor Author

@hnrklssn Looks like everything passes. Can we merge this now? 🤔

@egorzhdan egorzhdan merged commit 0e1be77 into swiftlang:main Aug 27, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ interop Feature: Interoperability with C++
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Grabbing Mac computer name through C++ Interop hangs
5 participants