Add the Uri\Rfc3986\Uri class to ext/uri without wither support #18836

kocsismate · 2025-06-11T17:57:49Z

There are a few WIP parts but the PR can already be reviewed.

Relates to #14461 and https://wiki.php.net/rfc/url_parsing_api

ext/uri/php_uriparser.c

ext/uri/uriparser/include/uriparser/Uri.h

nielsdos

Cursory glance, I see there's some todos wrt memory management as well so maybe I'll wait a bit on that

Zend/zend_string.h

ext/uri/php_uri.c

ext/uri/php_uriparser.c

TimWolla

Some first remarks from just looking at the diff on GitHub.

TimWolla · 2025-06-11T18:16:46Z

ext/uri/php_uriparser.c

+	UriUriA *uri, zend_string *uri_str,
+	UriUriA *normalized_uri, zend_string *normalized_uri_str
+) {
+	uriparser_uris_t *uriparser_uris = emalloc(sizeof(uriparser_uris_t));


Suggested change

-	uriparser_uris_t *uriparser_uris = emalloc(sizeof(uriparser_uris_t));
+	uriparser_uris_t *uriparser_uris = emalloc(sizeof(*uriparser_uris));

Prevents the two types from going out of sync (same elsewhere).
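
To illustrate the point outside of php-src, here is a minimal sketch of the `sizeof(*ptr)` idiom in plain C. The `example_uris` struct and `example_uris_alloc()` are hypothetical stand-ins, and `malloc` is used in place of `emalloc` so the snippet compiles on its own.

```c
#include <stdlib.h>

/* Hypothetical stand-in for uriparser_uris_t; the field names are illustrative only. */
typedef struct example_uris {
	const char *uri_str;
	const char *normalized_uri_str;
} example_uris;

/* Allocate one object. sizeof(*uris) is derived from the variable itself,
 * so changing the declared type cannot leave a stale type name in the call. */
static example_uris *example_uris_alloc(void)
{
	example_uris *uris = malloc(sizeof(*uris));
	if (uris == NULL) {
		return NULL; /* allocation failure is reported to the caller */
	}
	uris->uri_str = NULL;
	uris->normalized_uri_str = NULL;
	return uris;
}
```

If the variable were later redeclared with a different struct type, the allocation size would follow automatically, which is the "out of sync" case the comment refers to.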

TimWolla · 2025-06-11T18:26:41Z

ext/uri/php_uriparser.c

+	URIPARSER_READ_URI(uriparser_uri, uriparser_uris, read_mode);
+
+	if (uriparser_uri->userInfo.first != NULL && uriparser_uri->userInfo.afterLast != NULL) {
+		ZVAL_STRINGL(retval, uriparser_uri->userInfo.first, uriparser_uri->userInfo.afterLast - uriparser_uri->userInfo.first);


This is very verbose; it would probably make sense to add:

static inline zend_string *text_range_to_zend_string(const UriTextRangeA *range)
{
	return zend_string_init(range->first, range->afterLast - range->first, false);
}

and then:

ZVAL_NEW_STR(retval, text_range_to_zend_string(&uriparser_uri->userInfo));

Alternatively (seeing the read_user() function that needs to access substrings), perhaps a text_range_len() function that just calculates the correct length to avoid the repetition.
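
A rough sketch of what such a length helper could look like, written in plain C so it can be compiled without the uriparser headers. The `text_range` struct is a hypothetical stand-in that only mirrors the `first` / `afterLast` fields used in the diff above, and the NULL handling is an assumption about how an absent component should be treated.

```c
#include <stddef.h>

/* Hypothetical stand-in mirroring the first/afterLast fields used in the diff. */
typedef struct text_range {
	const char *first;     /* first character of the component */
	const char *afterLast; /* one past the last character */
} text_range;

/* Length of the range in bytes; 0 when the component is absent (assumed convention). */
static size_t text_range_len(const text_range *range)
{
	if (range->first == NULL || range->afterLast == NULL) {
		return 0;
	}
	return (size_t)(range->afterLast - range->first);
}
```

With a helper like this, each call site only names the range once instead of repeating the pointer subtraction.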

TimWolla · 2025-06-12T07:19:33Z

ext/uri/php_uriparser.c

Something I'm only noticing now: I think this should rather be called ext/uri/parser_rfc3986.c and php_lexbor.c would be ext/uri/parser_whatwg.c. The php_uriparser.c name is confusing to me, because uriparser is an extremely generic term.

To give another comparison with ext/random, since it is architecturally similar: Each engine has its own engine_enginename.c file, e.g. engine_xoshiro256starstar.c.

TimWolla · 2025-06-12T07:22:45Z

ext/uri/php_uriparser.c

+	efree(uriparser_uris);
+}
+
+const uri_handler_t uriparser_uri_handler = {


And similarly the public symbols in the file should also be prefixed with php_uri_. Just uriparser_ can conflict with the uriparser library itself.

Here I would suggest php_uri_rfc3986_handler.

But I'm also happy to leave this to a follow-up to not introduce too much churn.

Yes, this and the file renaming is something that I'm happy to do after most parts are merged.

TimWolla · 2025-06-12T07:24:50Z

ext/uri/php_uriparser.h

+	zend_string *normalized_uri_str;
+} uriparser_uris_t;
+
+void uriparser_module_init(void);


Suggested change

-void uriparser_module_init(void);
+PHP_MINIT_FUNCTION(uri_uriparser);

TimWolla · 2025-06-12T07:25:34Z

ext/uri/php_uri.c

+		return FAILURE;
+	}
+
+	uriparser_module_init();


Suggested change

-	uriparser_module_init();
+	if (FAILURE == PHP_MINIT(uri_uriparser)(INIT_FUNC_ARGS_PASSTHRU)) {
+		return FAILURE;
+	}

For consistency with e.g. spl or sodium.

TimWolla · 2025-06-12T07:26:43Z

ext/uri/php_uri.c

+	if (uri_handler_register(&uriparser_uri_handler) == FAILURE) {
+		return FAILURE;
+	}


Registering the handler can be part of PHP_MINIT(uri_uriparser) then.
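
The shape suggested in the last three comments (each backend exposes a single init function that also registers its own handler, and the parent init only propagates failure) can be sketched in plain C without the PHP macros. Every name below (`ex_handler`, `ex_handler_register`, `ex_rfc3986_minit`, `ex_uri_minit`) is hypothetical and only stands in for the real symbols.

```c
#include <stddef.h>

#define EX_SUCCESS 0
#define EX_FAILURE (-1)
#define EX_MAX_HANDLERS 4

/* Hypothetical handler type; the real uri_handler_t has more members. */
typedef struct ex_handler {
	const char *name;
} ex_handler;

static const ex_handler *ex_registry[EX_MAX_HANDLERS];
static size_t ex_registry_len;

/* Register a handler; fails once the fixed-size registry is full. */
static int ex_handler_register(const ex_handler *handler)
{
	if (ex_registry_len == EX_MAX_HANDLERS) {
		return EX_FAILURE;
	}
	ex_registry[ex_registry_len++] = handler;
	return EX_SUCCESS;
}

static const ex_handler ex_rfc3986_handler = { "rfc3986" };

/* Backend init: performs its own setup and registers its own handler. */
static int ex_rfc3986_minit(void)
{
	return ex_handler_register(&ex_rfc3986_handler);
}

/* Parent init: calls each backend init and propagates the first failure. */
static int ex_uri_minit(void)
{
	if (ex_rfc3986_minit() == EX_FAILURE) {
		return EX_FAILURE;
	}
	return EX_SUCCESS;
}
```

The parent no longer needs to know which handler a backend registers, so adding a second backend would only add one more call in the parent init.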

TimWolla · 2025-06-12T07:39:42Z

ext/uri/php_uriparser.c

+	uriparser_copy_text_range(&uriparser_uri->query, &new_uriparser_uri->query, false);
+	uriparser_copy_text_range(&uriparser_uri->fragment, &new_uriparser_uri->fragment, false);
+	new_uriparser_uri->absolutePath = uriparser_uri->absolutePath;
+	new_uriparser_uri->owner = true;


Suggested change

-	new_uriparser_uri->owner = true;
+	new_uriparser_uri->owner = URI_TRUE;

TimWolla · 2025-06-12T07:42:36Z

ext/uri/php_uriparser.c

+	}
+}
+
+static UriUriA *uriparser_copy_uri(UriUriA *uriparser_uri) // TODO add to uriparser


I think this can just be:

UriUriA *copy = emalloc(sizeof(*copy));
memcpy(copy, uriparser_uri, sizeof(*copy));
uriMakeOwnerA(copy);
return copy;
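
The idea behind that snippet is "shallow-copy the struct, then turn the borrowed members into owned copies". A self-contained analogue in plain C might look like the following. The `ex_uri` type and both functions are hypothetical and much simpler than `UriUriA`; they only demonstrate the two-step copy, not how uriparser implements it.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: a struct that either borrows or owns its text. */
typedef struct ex_uri {
	char *text;
	bool owner; /* true if text must be freed together with this struct */
} ex_uri;

/* Replace borrowed text with an owned duplicate. Returns 0 on success. */
static int ex_make_owner(ex_uri *uri)
{
	if (uri->owner) {
		return 0;
	}
	size_t len = strlen(uri->text) + 1;
	char *dup = malloc(len);
	if (dup == NULL) {
		return -1;
	}
	memcpy(dup, uri->text, len);
	uri->text = dup;
	uri->owner = true;
	return 0;
}

/* Shallow-copy the struct, then make the copy own its data. */
static ex_uri *ex_copy(const ex_uri *src)
{
	ex_uri *copy = malloc(sizeof(*copy));
	if (copy == NULL) {
		return NULL;
	}
	memcpy(copy, src, sizeof(*copy));
	copy->owner = false; /* the bytes still belong to src at this point */
	if (ex_make_owner(copy) != 0) {
		free(copy);
		return NULL;
	}
	return copy;
}
```

After `ex_copy()` returns, the copy stays valid even if the source buffer is modified or released, which is the property the hand-written field-by-field copy in the diff is trying to achieve.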

TimWolla · 2025-06-12T07:45:36Z

ext/uri/php_uriparser.h

+typedef struct uriparser_uris_t {
+	UriUriA *uri;
+	zend_string *uri_str;
+	UriUriA *normalized_uri;
+	zend_string *normalized_uri_str;
+} uriparser_uris_t;


Suggested change

-typedef struct uriparser_uris_t {
-	UriUriA *uri;
-	zend_string *uri_str;
-	UriUriA *normalized_uri;
-	zend_string *normalized_uri_str;
-} uriparser_uris_t;
+typedef struct uriparser_uris_t {
+	UriUriA uri;
+	zend_string *uri_str;
+	UriUriA normalized_uri;
+	zend_string *normalized_uri_str;
+} uriparser_uris_t;

From what I see, the uriparser library does not allocate UriUriA structs itself, but requires you to pass in an OUT-pointer. Thus we can just stack-allocate it / embed it into another struct.

The uri field is a good candidate indeed, but the normalized_uri field is initialized on demand (by uriparser_read_uri). So we would have to track its initialization status by a separate bool field because UriUriA doesn't provide any indication. Is it still desirable to allocate on stack even if it causes some maintainability degradation?

> Is it still desirable to allocate on stack even if it causes some maintainability degradation?

I feel that a separate bool would also be simpler to maintain, since you would have fewer allocs / frees to deal with, but use whatever “feels comfortable”.
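
For reference, the "embedded struct plus a separate bool" variant discussed here could be sketched as follows. The `parsed_uri` type is a hypothetical stand-in for `UriUriA`, the flag name is made up, and lowercasing the scheme is only a placeholder for real normalization.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for UriUriA: a value type with no "is set" marker. */
typedef struct parsed_uri {
	char scheme[16];
} parsed_uri;

typedef struct uris {
	parsed_uri uri;            /* always initialized */
	parsed_uri normalized_uri; /* valid only if normalized_uri_initialized is true */
	bool normalized_uri_initialized;
} uris;

/* Zero the whole struct (which also clears the flag) and store the scheme. */
static void uris_init(uris *u, const char *scheme)
{
	memset(u, 0, sizeof(*u));
	strncpy(u->uri.scheme, scheme, sizeof(u->uri.scheme) - 1);
}

/* Return the normalized URI, computing it on first use only. */
static const parsed_uri *uris_get_normalized(uris *u)
{
	if (!u->normalized_uri_initialized) {
		u->normalized_uri = u->uri; /* struct copy, no allocation */
		for (char *p = u->normalized_uri.scheme; *p != '\0'; p++) {
			if (*p >= 'A' && *p <= 'Z') {
				*p = (char)(*p - 'A' + 'a');
			}
		}
		u->normalized_uri_initialized = true;
	}
	return &u->normalized_uri;
}
```

The trade-off is the one described in the thread: one extra field to keep in sync, in exchange for no separate allocation and no separate free for either URI.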

TimWolla · 2025-06-12T07:49:11Z

ext/uri/php_uriparser.c

+	UriUriA *uriparser_uri = emalloc(sizeof(UriUriA));
+
+	/* uriparser keeps the originally passed in string, while lexbor may allocate a new one. */
+	zend_string *original_uri_str = zend_string_init(ZSTR_VAL(uri_str), ZSTR_LEN(uri_str), false);
+	if (ZSTR_LEN(original_uri_str) == 0 ||
+		uriParseSingleUriExA(uriparser_uri, ZSTR_VAL(original_uri_str), ZSTR_VAL(original_uri_str) + ZSTR_LEN(original_uri_str), NULL) != URI_SUCCESS
+	) {
+		efree(uriparser_uri);
+		zend_string_release_ex(original_uri_str, false);
+		if (!silent) {
+			throw_invalid_uri_exception();
+		}
+
+		return NULL;
+	}
+
+	if (uriparser_base_urls == NULL) {
+		return uriparser_create_uris(uriparser_uri, original_uri_str, NULL, NULL);
+	}
+
+	UriUriA *uriparser_base_url = uriparser_copy_uri(uriparser_base_urls->uri);
+
+	UriUriA *absolute_uri = emalloc(sizeof(UriUriA));
+
+	if (uriAddBaseUriExA(absolute_uri, uriparser_uri, uriparser_base_url, URI_RESOLVE_STRICTLY) != URI_SUCCESS) {
+		zend_string_release_ex(original_uri_str, false);
+		uriFreeUriMembersA(uriparser_uri);
+		efree(uriparser_uri);
+		uriFreeUriMembersA(uriparser_base_url);
+		efree(uriparser_base_url);
+		efree(absolute_uri);
+
+		if (!silent) {
+			throw_invalid_uri_exception();
+		}
+
+		return NULL;
+	}
+
+	/* TODO fix freeing: if the following code runs, then we'll have use-after-free-s because uriparser doesn't
+	   copy the input. If we don't run the following code, then we'll have memory leaks...
+	uriFreeUriMembersA(uriparser_base_url);
+	efree(uriparser_base_url);
+	uriFreeUriMembersA(uriparser_uri);
+	efree(uriparser_uri);
+	 */
+
+	return uriparser_create_uris(absolute_uri, original_uri_str, NULL, NULL);


Rough untested suggestion. Basically: Keep the UriUriA object stack-allocated for as long as possible.

Suggested change

-	UriUriA *uriparser_uri = emalloc(sizeof(UriUriA));
-
-	/* uriparser keeps the originally passed in string, while lexbor may allocate a new one. */
-	zend_string *original_uri_str = zend_string_init(ZSTR_VAL(uri_str), ZSTR_LEN(uri_str), false);
-	if (ZSTR_LEN(original_uri_str) == 0 ||
-		uriParseSingleUriExA(uriparser_uri, ZSTR_VAL(original_uri_str), ZSTR_VAL(original_uri_str) + ZSTR_LEN(original_uri_str), NULL) != URI_SUCCESS
-	) {
-		efree(uriparser_uri);
-		zend_string_release_ex(original_uri_str, false);
-		if (!silent) {
-			throw_invalid_uri_exception();
-		}
-
-		return NULL;
-	}
-
-	if (uriparser_base_urls == NULL) {
-		return uriparser_create_uris(uriparser_uri, original_uri_str, NULL, NULL);
-	}
-
-	UriUriA *uriparser_base_url = uriparser_copy_uri(uriparser_base_urls->uri);
-
-	UriUriA *absolute_uri = emalloc(sizeof(UriUriA));
-
-	if (uriAddBaseUriExA(absolute_uri, uriparser_uri, uriparser_base_url, URI_RESOLVE_STRICTLY) != URI_SUCCESS) {
-		zend_string_release_ex(original_uri_str, false);
-		uriFreeUriMembersA(uriparser_uri);
-		efree(uriparser_uri);
-		uriFreeUriMembersA(uriparser_base_url);
-		efree(uriparser_base_url);
-		efree(absolute_uri);
-
-		if (!silent) {
-			throw_invalid_uri_exception();
-		}
-
-		return NULL;
-	}
-
-	/* TODO fix freeing: if the following code runs, then we'll have use-after-free-s because uriparser doesn't
-	   copy the input. If we don't run the following code, then we'll have memory leaks...
-	uriFreeUriMembersA(uriparser_base_url);
-	efree(uriparser_base_url);
-	uriFreeUriMembersA(uriparser_uri);
-	efree(uriparser_uri);
-	 */
-
-	return uriparser_create_uris(absolute_uri, original_uri_str, NULL, NULL);
+	UriUriA uriparser_uri;
+
+	/* uriparser keeps the originally passed in string, while lexbor may allocate a new one. */
+	zend_string *original_uri_str = zend_string_init(ZSTR_VAL(uri_str), ZSTR_LEN(uri_str), false);
+	if (ZSTR_LEN(original_uri_str) == 0 ||
+		uriParseSingleUriExA(&uriparser_uri, ZSTR_VAL(original_uri_str), ZSTR_VAL(original_uri_str) + ZSTR_LEN(original_uri_str), NULL) != URI_SUCCESS
+	) {
+		zend_string_release_ex(original_uri_str, false);
+		if (!silent) {
+			throw_invalid_uri_exception();
+		}
+
+		return NULL;
+	}
+
+	if (uriparser_base_urls == NULL) {
+		return uriparser_create_uris(&uriparser_uri, original_uri_str, NULL, NULL);
+	}
+
+	UriUriA absolute_uri;
+
+	if (uriAddBaseUriExA(&absolute_uri, &uriparser_uri, &uriparser_base_urls->uri, URI_RESOLVE_STRICTLY) != URI_SUCCESS) {
+		zend_string_release_ex(original_uri_str, false);
+		/* Only the locally parsed URI is freed; the base URI is borrowed from uriparser_base_urls. */
+		uriFreeUriMembersA(&uriparser_uri);
+
+		if (!silent) {
+			throw_invalid_uri_exception();
+		}
+
+		return NULL;
+	}
+
+	/* TODO fix freeing: if the following code runs, then we'll have use-after-free-s because uriparser doesn't
+	   copy the input. If we don't run the following code, then we'll have memory leaks...
+	uriFreeUriMembersA(uriparser_base_url);
+	efree(uriparser_base_url);
+	uriFreeUriMembersA(uriparser_uri);
+	efree(uriparser_uri);
+	 */
+
+	return uriparser_create_uris(&absolute_uri, original_uri_str, NULL, NULL);

DanielEScherzer · 2025-06-12T22:33:25Z

ext/uri/php_uri.c

+	URI_ASSERT_INITIALIZATION(internal_uri);
+
+	if (UNEXPECTED(uriparser_read_userinfo(internal_uri, read_mode, return_value) == FAILURE)) {
+		zend_throw_error(NULL, "%s::$%s property cannot be retrieved", ZSTR_VAL(Z_OBJ_P(ZEND_THIS)->ce->name),


If this is only used for userinfo, why go through the indirection of using a zend_string? I assume the class needs to use ce->name because of subclasses, but you can do:

Suggested change

-		zend_throw_error(NULL, "%s::$%s property cannot be retrieved", ZSTR_VAL(Z_OBJ_P(ZEND_THIS)->ce->name),
+		zend_throw_error(NULL, "%s::$userinfo property cannot be retrieved", ZSTR_VAL(Z_OBJ_P(ZEND_THIS)->ce->name));

DanielEScherzer · 2025-06-12T22:38:47Z

ext/uri/php_uriparser.c

+		uriparser_uri = uriparser_uris->normalized_uri;
+	}
+
+	int charsRequired;


suggest size_t since you cannot have a negative number here

Unfortunately, uriparser needs an int here:

php-src/ext/uri/uriparser/src/UriRecompose.c

Line 77 in d585a56

int * charsRequired) {
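
When a library reports a length through an `int` out-parameter, as in the signature quoted above, one way to still work with `size_t` in the caller is to keep the `int` local to the call and convert once after a sign check. The sketch below uses a hypothetical `chars_required()` function as a stand-in for the library call; it is not the uriparser API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical API that reports a length through an int out-parameter.
 * Returns 0 on success. Assumes the input is short enough to fit in an int. */
static int chars_required(const char *s, int *out)
{
	if (s == NULL || out == NULL) {
		return -1;
	}
	*out = (int)strlen(s);
	return 0;
}

/* Wrapper: the int never leaves this function, and the conversion to size_t
 * happens only after the value is known to be non-negative. */
static int chars_required_size(const char *s, size_t *out)
{
	int n;
	if (chars_required(s, &n) != 0 || n < 0) {
		return -1;
	}
	*out = (size_t)n;
	return 0;
}
```

This keeps the rest of the caller free of signed/unsigned mixing while still passing the type the library expects.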

kocsismate requested a review from dstogov as a code owner June 11, 2025 17:57

github-actions bot added the Category: Build System, Category: Engine, ABI break, and Extension: uri labels Jun 11, 2025

kocsismate requested review from TimWolla and nielsdos and removed request for dstogov June 11, 2025 17:58

kocsismate marked this pull request as draft June 11, 2025 17:58

kocsismate force-pushed the ext-url6 branch from ff940a4 to 43c1d9f Compare June 11, 2025 18:04

Add Uri\Rfc3986\Uri class to ext/uri

155e070

Relates to php#14461 and https://wiki.php.net/rfc/url_parsing_api

kocsismate force-pushed the ext-url6 branch from 43c1d9f to 155e070 Compare June 11, 2025 18:07

kocsismate changed the title ~~Add Uri\Rfc3986\Uri class to ext/uri~~ Add the Uri\Rfc3986\Uri class to ext/uri without wither support Jun 11, 2025

nielsdos requested changes Jun 11, 2025

TimWolla reviewed Jun 11, 2025

kocsismate mentioned this pull request Jun 11, 2025

[RFC] Add RFC 3986 and WHATWG compliant URL parsing support #14461

Open

Review round 1 fixes

85bca21

TimWolla reviewed Jun 12, 2025

DanielEScherzer reviewed Jun 12, 2025
