Skip to content

Add the Uri\Rfc3986\Uri class to ext/uri without wither support #18836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

kocsismate
Copy link
Member

@kocsismate kocsismate commented Jun 11, 2025

There are a few WIP parts but the PR can already be reviewed.

Relates to #14461 and https://wiki.php.net/rfc/url_parsing_api

@kocsismate kocsismate changed the title Add Uri\Rfc3986\Uri class to ext/uri Add the Uri\Rfc3986\Uri class to ext/uri without wither support Jun 11, 2025
Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursory glance, I see there's some todos wrt memory management as well so maybe I'll wait a bit on that

Copy link
Member

@TimWolla TimWolla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some first remarks from just looking at the diff on GitHub.

UriUriA *uri, zend_string *uri_str,
UriUriA *normalized_uri, zend_string *normalized_uri_str
) {
uriparser_uris_t *uriparser_uris = emalloc(sizeof(uriparser_uris_t));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uriparser_uris_t *uriparser_uris = emalloc(sizeof(uriparser_uris_t));
uriparser_uris_t *uriparser_uris = emalloc(sizeof(*uriparser_uris));

Prevents the two types from going out of sync (same elsewhere).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I'm only noticing now: I think this should rather be called ext/uri/parser_rfc3986.c and php_lexbor.c would be ext/uri/parser_whatwg.c. The php_uriparser.c name is confusing to me, because uriparser is an extremely generic term.

To give another comparison with ext/random, since it is architecturally similar: Each engine has its own engine_enginename.c file, e.g. engine_xoshiro256starstar.c.

efree(uriparser_uris);
}

const uri_handler_t uriparser_uri_handler = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And similarly the public symbols in the file should also be prefixed with php_uri_. Just uriparser_ can conflict with the uriparser library itself.

Here I would suggest php_uri_rfc3986_handler.

But I'm also happy to leave this to a follow-up to not introduce too much churn.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this and the file renaming is something that I'm happy to do after most parts are merged.

Comment on lines 849 to 851
if (uri_handler_register(&uriparser_uri_handler) == FAILURE) {
return FAILURE;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Registering the handler can be part of PHP_MINIT(uri_uriparser) then.

uriparser_copy_text_range(&uriparser_uri->query, &new_uriparser_uri->query, false);
uriparser_copy_text_range(&uriparser_uri->fragment, &new_uriparser_uri->fragment, false);
new_uriparser_uri->absolutePath = uriparser_uri->absolutePath;
new_uriparser_uri->owner = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
new_uriparser_uri->owner = true;
new_uriparser_uri->owner = URI_TRUE;

}
}

static UriUriA *uriparser_copy_uri(UriUriA *uriparser_uri) // TODO add to uriparser
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can just be:

UriUriA *copy = emalloc(sizeof(*copy));
memcpy(copy, uriparser_uri, sizeof(*copy));
uriMakeOwnerA(copy);

return copy;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to add this deep copy method back then because this method didn't work, it caused use-after-frees. I would have to investigate the root cause again (I forgot the exact problem over the course of a few months), but it's because of the path component for sure.

READ of size 8 at 0x6030000502a0 thread T0
    #0 0x00010376f4d8 in uriFreeUriMembersMmA UriParse.c:2338
    #1 0x0001037709e4 in uriFreeUriMembersA UriParse.c:2256
    #2 0x000103712f1c in uriparser_free_uri php_uriparser.c:526
...

I may try to find out the exact problem again sometime soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue appears that the contents of each individual PathSegment are copied, but not the path segments themselves.

Comment on lines 26 to 31
typedef struct uriparser_uris_t {
UriUriA *uri;
zend_string *uri_str;
UriUriA *normalized_uri;
zend_string *normalized_uri_str;
} uriparser_uris_t;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
typedef struct uriparser_uris_t {
UriUriA *uri;
zend_string *uri_str;
UriUriA *normalized_uri;
zend_string *normalized_uri_str;
} uriparser_uris_t;
typedef struct uriparser_uris_t {
UriUriA uri;
zend_string *uri_str;
UriUriA normalized_uri;
zend_string *normalized_uri_str;
} uriparser_uris_t;

From what I see, the uriparser library does not allocate UriUriA structs itself, but requires you to pass in an OUT-pointer. Thus we can just stack-allocate it / embed it into another struct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uri field is a good candidate indeed, but the normalized_uri field is initialized on demand (by uriparser_read_uri). So we would have to track its initialization status by a separate bool field because UriUriA doesn't provide any indication. Is it still desirable to allocate on stack even if it causes some maintainability degradation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it still desirable to allocate on stack even if it causes some maintainability degradation?

I feel that a separate bool would also be simpler on the maintenance, since you would have less allocs / frees to deal with, but use whatever “feels comfortable”.

Comment on lines 369 to 416
UriUriA *uriparser_uri = emalloc(sizeof(UriUriA));

/* uriparser keeps the originally passed in string, while lexbor may allocate a new one. */
zend_string *original_uri_str = zend_string_init(ZSTR_VAL(uri_str), ZSTR_LEN(uri_str), false);
if (ZSTR_LEN(original_uri_str) == 0 ||
uriParseSingleUriExA(uriparser_uri, ZSTR_VAL(original_uri_str), ZSTR_VAL(original_uri_str) + ZSTR_LEN(original_uri_str), NULL) != URI_SUCCESS
) {
efree(uriparser_uri);
zend_string_release_ex(original_uri_str, false);
if (!silent) {
throw_invalid_uri_exception();
}

return NULL;
}

if (uriparser_base_urls == NULL) {
return uriparser_create_uris(uriparser_uri, original_uri_str, NULL, NULL);
}

UriUriA *uriparser_base_url = uriparser_copy_uri(uriparser_base_urls->uri);

UriUriA *absolute_uri = emalloc(sizeof(UriUriA));

if (uriAddBaseUriExA(absolute_uri, uriparser_uri, uriparser_base_url, URI_RESOLVE_STRICTLY) != URI_SUCCESS) {
zend_string_release_ex(original_uri_str, false);
uriFreeUriMembersA(uriparser_uri);
efree(uriparser_uri);
uriFreeUriMembersA(uriparser_base_url);
efree(uriparser_base_url);
efree(absolute_uri);

if (!silent) {
throw_invalid_uri_exception();
}

return NULL;
}

/* TODO fix freeing: if the following code runs, then we'll have use-after-free-s because uriparser doesn't
copy the input. If we don't run the following code, then we'll have memory leaks...
uriFreeUriMembersA(uriparser_base_url);
efree(uriparser_base_url);
uriFreeUriMembersA(uriparser_uri);
efree(uriparser_uri);
*/

return uriparser_create_uris(absolute_uri, original_uri_str, NULL, NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rough untested suggestion. Basically: Keep the UriUriA object stack-allocated for as long as possible.

Suggested change
UriUriA *uriparser_uri = emalloc(sizeof(UriUriA));
/* uriparser keeps the originally passed in string, while lexbor may allocate a new one. */
zend_string *original_uri_str = zend_string_init(ZSTR_VAL(uri_str), ZSTR_LEN(uri_str), false);
if (ZSTR_LEN(original_uri_str) == 0 ||
uriParseSingleUriExA(uriparser_uri, ZSTR_VAL(original_uri_str), ZSTR_VAL(original_uri_str) + ZSTR_LEN(original_uri_str), NULL) != URI_SUCCESS
) {
efree(uriparser_uri);
zend_string_release_ex(original_uri_str, false);
if (!silent) {
throw_invalid_uri_exception();
}
return NULL;
}
if (uriparser_base_urls == NULL) {
return uriparser_create_uris(uriparser_uri, original_uri_str, NULL, NULL);
}
UriUriA *uriparser_base_url = uriparser_copy_uri(uriparser_base_urls->uri);
UriUriA *absolute_uri = emalloc(sizeof(UriUriA));
if (uriAddBaseUriExA(absolute_uri, uriparser_uri, uriparser_base_url, URI_RESOLVE_STRICTLY) != URI_SUCCESS) {
zend_string_release_ex(original_uri_str, false);
uriFreeUriMembersA(uriparser_uri);
efree(uriparser_uri);
uriFreeUriMembersA(uriparser_base_url);
efree(uriparser_base_url);
efree(absolute_uri);
if (!silent) {
throw_invalid_uri_exception();
}
return NULL;
}
/* TODO fix freeing: if the following code runs, then we'll have use-after-free-s because uriparser doesn't
copy the input. If we don't run the following code, then we'll have memory leaks...
uriFreeUriMembersA(uriparser_base_url);
efree(uriparser_base_url);
uriFreeUriMembersA(uriparser_uri);
efree(uriparser_uri);
*/
return uriparser_create_uris(absolute_uri, original_uri_str, NULL, NULL);
UriUriA uriparser_uri;
/* uriparser keeps the originally passed in string, while lexbor may allocate a new one. */
zend_string *original_uri_str = zend_string_init(ZSTR_VAL(uri_str), ZSTR_LEN(uri_str), false);
if (ZSTR_LEN(original_uri_str) == 0 ||
uriParseSingleUriExA(&uriparser_uri, ZSTR_VAL(original_uri_str), ZSTR_VAL(original_uri_str) + ZSTR_LEN(original_uri_str), NULL) != URI_SUCCESS
) {
zend_string_release_ex(original_uri_str, false);
if (!silent) {
throw_invalid_uri_exception();
}
return NULL;
}
if (uriparser_base_urls == NULL) {
return uriparser_create_uris(&uriparser_uri, original_uri_str, NULL, NULL);
}
UriUriA absolute_uri;
if (uriAddBaseUriExA(&absolute_uri, uriparser_uri, &uriparser_base_urls->uri, URI_RESOLVE_STRICTLY) != URI_SUCCESS) {
zend_string_release_ex(original_uri_str, false);
uriFreeUriMembersA(uriparser_uri);
uriFreeUriMembersA(uriparser_base_url);
if (!silent) {
throw_invalid_uri_exception();
}
return NULL;
}
/* TODO fix freeing: if the following code runs, then we'll have use-after-free-s because uriparser doesn't
copy the input. If we don't run the following code, then we'll have memory leaks...
uriFreeUriMembersA(uriparser_base_url);
efree(uriparser_base_url);
uriFreeUriMembersA(uriparser_uri);
efree(uriparser_uri);
*/
return uriparser_create_uris(&absolute_uri, original_uri_str, NULL, NULL);

URI_ASSERT_INITIALIZATION(internal_uri);

if (UNEXPECTED(uriparser_read_userinfo(internal_uri, read_mode, return_value) == FAILURE)) {
zend_throw_error(NULL, "%s::$%s property cannot be retrieved", ZSTR_VAL(Z_OBJ_P(ZEND_THIS)->ce->name),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is only used for userinfo, why go through the indirection of using a zend_string? I assume the class needs to use ce->name because of subclasses, but you can do

Suggested change
zend_throw_error(NULL, "%s::$%s property cannot be retrieved", ZSTR_VAL(Z_OBJ_P(ZEND_THIS)->ce->name),
zend_throw_error(NULL, "%s::$userinfo property cannot be retrieved", ZSTR_VAL(Z_OBJ_P(ZEND_THIS)->ce->name));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code uses ce->name because most getters have an implementation-alias (but surely, if the classes once become non-final, this is at least something that won't be changing). Generally, I pretty much try to avoid hardcoding symbol names into the error messages, because they make copy-pasting/generalization more difficult.

I think using the userinfo known string is a nice thing to do because then the property name doesn't have to be duplicated.

Copy link
Member Author

@kocsismate kocsismate Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, you have just made me aware that these messages are completely nonsense, since there are no such properties anymore... 😅 so the message should rather be related to the userinfo component and/or the getUserInfo/getRawUserInfo() methods. Something similar to

Uri\Rfc3986\Uri::getUserInfo(): The userinfo component cannot be retrieved

but any good suggestions are welcome :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uri\Rfc3986\Uri::getUserInfo(): The userinfo component cannot be retrieved

Would leave out the method name. That is implied by the stack trace. Also I think this error should effectively be unreachable anyways, no?

@TimWolla

This comment was marked as resolved.

kocsismate and others added 2 commits June 14, 2025 13:55
Co-authored-by: Tim Düsterhus <tim@tideways-gmbh.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants