From 89050b604eae26062dd94fecd314c0651524b889 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Fri, 13 Sep 2024 10:36:38 -0700 Subject: [PATCH 1/8] Address name and literal equality This change defines equality as discussed in the 2024-09-09 teleconference in the following ways: - It defines _name_ equality as being under NFC - It defines _literal_ equality as explicitly **not** under NFC - It moves _name_ before _identifier_ in that section of text to avoid a forward definition. Note that this deviates from discussion in 2024-09-09's call in that we didn't discuss literals at length. It also doesn't discuss non-name/non-literal values, which I'll point out are limited to ASCII sequences such as keywords. --- spec/syntax.md | 46 +++++++++++++++++++++++++++++++--------------- 1 file changed, 31 insertions(+), 15 deletions(-) diff --git a/spec/syntax.md b/spec/syntax.md index aef6720684..9006fed016 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -684,6 +684,17 @@ except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF. All code points are preserved. +Two literals are considered equal if the consist of the same sequence of Unicode +code points. + +> [!IMPORTANT] +> _Literal_ equality is different from _name_ equality in that +> Unicode Normalization is not applied to _literal_ values before comparison. +> Users are cautioned to ensure that they use the same character sequences +> for equivalent values. +> The use of [Normalization Form C]((https://unicode.org/reports/tr15/) for all +> _literal_ values is RECOMMENDED. + A **_quoted literal_** begins and ends with U+005E VERTICAL BAR `|`. The characters `\` and `|` within a _quoted literal_ MUST be escaped as `\\` and `\|`. @@ -708,25 +719,15 @@ number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / " ### Names and Identifiers -An **_identifier_** is a character sequence that -identifies a _function_, _markup_, or _option_. 
-Each _identifier_ consists of a _name_ optionally preceeded by -a _namespace_. -When present, the _namespace_ is separated from the _name_ by a -U+003A COLON `:`. -Built-in _functions_ and their _options_ do not have a _namespace_ identifier. - -The _namespace_ `u` (U+0075 LATIN SMALL LETTER U) -is reserved for future standardization. - -_Function_ _identifiers_ are prefixed with `:`. -_Markup_ _identifiers_ are prefixed with `#` or `/`. -_Option_ _identifiers_ have no prefix. - A **_name_** is a character sequence used in an _identifier_ or as the name for a _variable_ or the value of an _unquoted literal_. +A _name_ is identical to another name if both consist of the same sequence of +Unicode code points after +[Unicode Normalization Form C](https://unicode.org/reports/tr15/) (NFC) +has been applied to both. + _Variable_ names are prefixed with `$`. Valid content for _names_ is based on Namespaces in XML 1.0's @@ -740,6 +741,21 @@ Otherwise, the set of characters allowed in a _name_ is large. > Such variables cannot be referenced in a _message_, > but are not otherwise errors. +An **_identifier_** is a character sequence that +identifies a _function_, _markup_, or _option_. +Each _identifier_ consists of a _name_ optionally preceeded by +a _namespace_. +When present, the _namespace_ is separated from the _name_ by a +U+003A COLON `:`. +Built-in _functions_ and their _options_ do not have a _namespace_ identifier. + +The _namespace_ `u` (U+0075 LATIN SMALL LETTER U) +is reserved for future standardization. + +_Function_ _identifiers_ are prefixed with `:`. +_Markup_ _identifiers_ are prefixed with `#` or `/`. +_Option_ _identifiers_ have no prefix. 
+ Examples: > A variable: >``` From 8d26f7f81ffa100efbbf1e476646acf12efa7d71 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Fri, 13 Sep 2024 10:38:04 -0700 Subject: [PATCH 2/8] Typo fix --- spec/syntax.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spec/syntax.md b/spec/syntax.md index 9006fed016..6afcf7da2d 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -684,7 +684,7 @@ except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF. All code points are preserved. -Two literals are considered equal if the consist of the same sequence of Unicode +Two _literals_ are considered equal if they consist of the same sequence of Unicode code points. > [!IMPORTANT] From 46c80bfb2ac9fa7ff97c0b3fb9946cce67701889 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Fri, 13 Sep 2024 10:43:39 -0700 Subject: [PATCH 3/8] Add a note about not requiring implementations to actually normalize --- spec/syntax.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/spec/syntax.md b/spec/syntax.md index 6afcf7da2d..2362e0d503 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -723,12 +723,21 @@ A **_name_** is a character sequence used in an _identifier_ or as the name for a _variable_ or the value of an _unquoted literal_. +_Variable_ names are prefixed with `$`. + A _name_ is identical to another name if both consist of the same sequence of Unicode code points after [Unicode Normalization Form C](https://unicode.org/reports/tr15/) (NFC) has been applied to both. -_Variable_ names are prefixed with `$`. +> [!NOTE] +> Implementations are not required to normalize _names_. +> Comparisons of _name_ values only need be done "as-if" normalization +> has occured. +> Since most text in the wild is already in NFC +> and since checking for NFC is fast and efficient, +> implementations can often substitute checking for actually applying normalization +> to _name_ values. 
Valid content for _names_ is based on Namespaces in XML 1.0's [NCName](https://www.w3.org/TR/xml-names/#NT-NCName). From c1e4982897c713dd3d24ebe6d746b904875ea9ab Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Mon, 16 Sep 2024 15:17:38 -0700 Subject: [PATCH 4/8] Implement changes discussed in 2024-09-16 call. - Make _key_ require NFC for uniqueness/comparison - Add a note about NFC - Make _literal_ **_not_** define equality - Make text in _name_ identical to that in _key_ for consistency --- spec/syntax.md | 35 +++++++++++++++++++++++------------ 1 file changed, 23 insertions(+), 12 deletions(-) diff --git a/spec/syntax.md b/spec/syntax.md index 2362e0d503..b62bf270bd 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -438,6 +438,14 @@ A _key_ can be either a _literal_ value or the "catch-all" key `*`. The **_catch-all key_** is a special key, represented by `*`, that matches all values for a given _selector_. +The value of each _key_ MUST be treated as if it were in +[Unicode Normalization Form C](https://unicode.org/reports/tr15/) ("NFC"). +When _keys_ are passed during _pattern selection_, the _key_ values MUST +be normalized into NFC. +Two _keys_ are considered equal if they are canonically equivalent strings, +that is, if they consist of the same sequence of Unicode code points after +Unicode Normalization Form C has been applied to both. + ## Expressions An **_expression_** is a part of a _message_ that will be determined @@ -684,16 +692,19 @@ except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF. All code points are preserved. -Two _literals_ are considered equal if they consist of the same sequence of Unicode -code points. - > [!IMPORTANT] -> _Literal_ equality is different from _name_ equality in that -> Unicode Normalization is not applied to _literal_ values before comparison. -> Users are cautioned to ensure that they use the same character sequences -> for equivalent values. 
-> The use of [Normalization Form C]((https://unicode.org/reports/tr15/) for all -> _literal_ values is RECOMMENDED. +> Most text, including that produced by common keyboards and input methods, +> is already encoded in the canonical form known as +> [Unicode Normalization Form C](https://unicode.org/reports/tr15) ("NFC"). +> A few languages, legacy character encoding conversions, or operating environments +> can result in _literal_ values that are not in this form. +> Some uses of _literals_ in MessageFormat, +> notably as the value of _keys_, +> apply NFC to the _literal_ value during processing or comparison. +> While there is no requirement that the _literal_ value actually be entered +> in a normalized form, +> users are cautioned to employ the same character sequences +> for equivalent values and, whenever possible, ensure _literals_ are in NFC. A **_quoted literal_** begins and ends with U+005E VERTICAL BAR `|`. The characters `\` and `|` within a _quoted literal_ MUST be @@ -725,9 +736,9 @@ or the value of an _unquoted literal_. _Variable_ names are prefixed with `$`. -A _name_ is identical to another name if both consist of the same sequence of -Unicode code points after -[Unicode Normalization Form C](https://unicode.org/reports/tr15/) (NFC) +Two _names_ are considered equal if they are canonically equivalent strings, +that is, if they consist of the same sequence of Unicode code points after +[Unicode Normalization Form C](https://unicode.org/reports/tr15/) ("NFC") has been applied to both. 
> [!NOTE] From 20cbbe77cb535cb02ffc3c44a84b10fd4db84316 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Tue, 17 Sep 2024 09:27:24 -0700 Subject: [PATCH 5/8] Update formatting.md to include keys in NFC --- spec/formatting.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/spec/formatting.md b/spec/formatting.md index c141451217..bb18d1043a 100644 --- a/spec/formatting.md +++ b/spec/formatting.md @@ -474,6 +474,11 @@ This selection method is defined in more detail below. An implementation MAY use any pattern selection method, as long as its observable behavior matches the results of the method defined here. +The resolved value of each _key_ MUST be in Unicode Normalization Form C ("NFC"), +even if the _literal_ for the _key_ is not. +All comparisons of _keys_ MUST be done on the canonical, normalized values +and the normalized value MUST be the value that is passed in the steps below. + ### Resolve Selectors First, resolve the values of each _selector_: @@ -502,7 +507,7 @@ Next, using `res`, resolve the preferential order for all message keys: 1. Let `key` be the `var` key at position `i`. 1. If `key` is not the catch-all key `'*'`: 1. Assert that `key` is a _literal_. - 1. Let `ks` be the resolved value of `key`. + 1. Let `ks` be the resolved value of `key` in Unicode Normalization Form C. 1. Append `ks` as the last element of the list `keys`. 1. Let `rv` be the resolved value at index `i` of `res`. 1. 
Let `matches` be the result of calling the method MatchSelectorKeys(`rv`, `keys`) From b5eec2a3345010344789cf8b41be0ad78607871e Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Tue, 17 Sep 2024 11:12:55 -0700 Subject: [PATCH 6/8] Address comments --- spec/formatting.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/spec/formatting.md b/spec/formatting.md index bb18d1043a..f048975659 100644 --- a/spec/formatting.md +++ b/spec/formatting.md @@ -474,11 +474,6 @@ This selection method is defined in more detail below. An implementation MAY use any pattern selection method, as long as its observable behavior matches the results of the method defined here. -The resolved value of each _key_ MUST be in Unicode Normalization Form C ("NFC"), -even if the _literal_ for the _key_ is not. -All comparisons of _keys_ MUST be done on the canonical, normalized values -and the normalized value MUST be the value that is passed in the steps below. - ### Resolve Selectors First, resolve the values of each _selector_: @@ -521,6 +516,9 @@ The returned list MAY be empty. The most-preferred key is first, with each successive key appearing in order by decreasing preference. +The resolved value of each _key_ MUST be in Unicode Normalization Form C ("NFC"), +even if the _literal_ for the _key_ is not. + If calling MatchSelectorKeys encounters any error, a _Bad Selector_ error is emitted and an empty list is returned. From eb09a95819006b089ddcee80769937dd1c65b4b6 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Tue, 17 Sep 2024 11:17:27 -0700 Subject: [PATCH 7/8] Update spec/syntax.md Co-authored-by: Eemeli Aro --- spec/syntax.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spec/syntax.md b/spec/syntax.md index f8b0b2d654..c1b7a327e9 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -753,7 +753,7 @@ that is, if they consist of the same sequence of Unicode code points after has been applied to both. 
> [!NOTE] -> Implementations are not required to normalize _names_. +> Implementations are not required to normalize all _names_. > Comparisons of _name_ values only need be done "as-if" normalization > has occured. > Since most text in the wild is already in NFC From 94e124619e75234b8f43ccd00f24e66c86966a6a Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Tue, 17 Sep 2024 16:15:39 -0700 Subject: [PATCH 8/8] Update spec/syntax.md Co-authored-by: Eemeli Aro --- spec/syntax.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/spec/syntax.md b/spec/syntax.md index c1b7a327e9..24ea52318f 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -446,8 +446,6 @@ that matches all values for a given _selector_. The value of each _key_ MUST be treated as if it were in [Unicode Normalization Form C](https://unicode.org/reports/tr15/) ("NFC"). -When _keys_ are passed during _pattern selection_, the _key_ values MUST -be normalized into NFC. Two _keys_ are considered equal if they are canonically equivalent strings, that is, if they consist of the same sequence of Unicode code points after Unicode Normalization Form C has been applied to both.
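
Reviewer note: the "as-if" comparison described for _names_ and _keys_ in this series could be sketched as follows. This is only an illustration under stated assumptions, not part of the patch or of any implementation: the helper names `nfc_equal` and `resolve_key` are hypothetical, and the sketch assumes Python 3.8 or later for `unicodedata.is_normalized`.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings "as if" both had been normalized to NFC.

    Hypothetical helper: the spec text only requires this observable
    behavior, not this particular strategy.
    """
    # Fast path: identical code point sequences are always equal.
    if a == b:
        return True
    # Check first, and normalize only the strings that are not already NFC.
    if not unicodedata.is_normalized("NFC", a):
        a = unicodedata.normalize("NFC", a)
    if not unicodedata.is_normalized("NFC", b):
        b = unicodedata.normalize("NFC", b)
    return a == b


def resolve_key(literal: str) -> str:
    """Return a key's value in NFC, even if the literal was not (hypothetical)."""
    return unicodedata.normalize("NFC", literal)


composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The raw code point sequences differ, but the two are canonically equivalent.
print(composed == decomposed)
print(nfc_equal(composed, decomposed))
print(resolve_key(decomposed) == composed)
```

The check-before-normalize path mirrors the note in PATCH 3/8 that most text is already in NFC, so an implementation can often avoid allocating a normalized copy.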