Skip to content

Replace built-in MPH with a dependency on the 'phf' crate #21

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

progval
Copy link
Owner

@progval progval commented Aug 6, 2023

Before:

$ cargo bench
test tests::character_10000    ... bench:      95,820 ns/iter (+/- 3,313)
test tests::character_basic    ... bench:         188 ns/iter (+/- 4)
test tests::name_10000_invalid ... bench:      35,322 ns/iter (+/- 237)
test tests::name_all_valid     ... bench:   7,546,306 ns/iter (+/- 24,243)
test tests::name_basic         ... bench:          78 ns/iter (+/- 12)
$ ll target/release/deps/unicode_names2-a53ae0f01b2261d9
-rwxr-xr-x 1 dev-rust dev-rust 21M  6 août  12:21 target/release/deps/unicode_names2-a53ae0f01b2261d9

$ RUSTFLAGS="-C target-cpu=native" cargo bench
test tests::character_10000    ... bench:      88,172 ns/iter (+/- 2,905)
test tests::character_basic    ... bench:         189 ns/iter (+/- 4)
test tests::name_10000_invalid ... bench:      33,011 ns/iter (+/- 589)
test tests::name_all_valid     ... bench:   7,561,671 ns/iter (+/- 47,556)
test tests::name_basic         ... bench:          89 ns/iter (+/- 6)
$ ll target/release/deps/unicode_names2-a53ae0f01b2261d9
-rwxr-xr-x 1 dev-rust dev-rust 21M  6 août  12:24 target/release/deps/unicode_names2-a53ae0f01b2261d9

After:

$ cargo bench
test tests::character_10000    ... bench:     100,275 ns/iter (+/- 1,521)
test tests::character_basic    ... bench:         162 ns/iter (+/- 7)
test tests::name_10000_invalid ... bench:      35,341 ns/iter (+/- 258)
test tests::name_all_valid     ... bench:   7,480,702 ns/iter (+/- 79,935)
test tests::name_basic         ... bench:          84 ns/iter (+/- 13)
$ ll target/release/deps/unicode_names2-a7967b9cdf0f0db1
-rwxr-xr-x 1 dev-rust dev-rust 26M  6 août  12:21 target/release/deps/unicode_names2-a7967b9cdf0f0db1

$ RUSTFLAGS="-C target-cpu=native" cargo bench
test tests::character_10000    ... bench:      97,400 ns/iter (+/- 3,124)
test tests::character_basic    ... bench:         169 ns/iter (+/- 4)
test tests::name_10000_invalid ... bench:      34,721 ns/iter (+/- 5,037)
test tests::name_all_valid     ... bench:   7,590,151 ns/iter (+/- 83,504)
test tests::name_basic         ... bench:          92 ns/iter (+/- 10)
$ ll target/release/deps/unicode_names2-a7967b9cdf0f0db1
-rwxr-xr-x 1 dev-rust dev-rust 26M  6 août  12:23 target/release/deps/unicode_names2-a7967b9cdf0f0db1

while running on i5-1240P (downclocked to 2GHz)

so performance is similar, but character_10000 is noticeably slower, and the binary size grew by 5MB, which is not negligeable.

@youknowone Could you try rewriting #12 based on this? I'd like to know the performance and size difference between your current code (keeps the custom MPH, and adds a phf MPH just for the aliases) and using a single phf MPH for both tables.

@progval progval changed the title Remove built-in MPH with a dependency on the 'phf' crate Replace built-in MPH with a dependency on the 'phf' crate Aug 6, 2023
@youknowone
Copy link

I am sorry. I was totally forgetting about this patch.
I will look in it and I will try to rewrite #12 too.
Thank you for reminding!

@youknowone
Copy link

#12 is done. Could you rebase this onto the current master? Then I will create a new PR with this patch + #12

@youknowone
Copy link

I created #24

@progval
Copy link
Owner Author

progval commented Sep 1, 2023

#12:

$ cargo bench
test tests::character_10000    ... bench:      81,375 ns/iter (+/- 2,454)
test tests::character_basic    ... bench:         183 ns/iter (+/- 2)
test tests::name_10000_invalid ... bench:      35,331 ns/iter (+/- 204)
test tests::name_all_valid     ... bench:   7,448,226 ns/iter (+/- 103,458)
test tests::name_basic         ... bench:          79 ns/iter (+/- 3)
$ ll target/release/deps/unicode_names2-45fd9b0e52768896
-rwxr-xr-x 1 dev-rust dev-rust 21M  1 sept. 08:49 target/release/deps/unicode_names2-45fd9b0e52768896

$ RUSTFLAGS="-C target-cpu=native" cargo bench
test tests::character_10000    ... bench:      84,814 ns/iter (+/- 3,904)
test tests::character_basic    ... bench:         188 ns/iter (+/- 3)
test tests::name_10000_invalid ... bench:      33,015 ns/iter (+/- 1,965)
test tests::name_all_valid     ... bench:   7,576,462 ns/iter (+/- 97,365)
test tests::name_basic         ... bench:          89 ns/iter (+/- 6)
$ ll target/release/deps/unicode_names2-45fd9b0e52768896
-rwxr-xr-x 1 dev-rust dev-rust 21M  1 sept. 08:50 target/release/deps/unicode_names2-45fd9b0e52768896

#24:

$ cargo bench
test tests::character_10000    ... bench:      78,114 ns/iter (+/- 2,547)
test tests::character_basic    ... bench:         167 ns/iter (+/- 2)
test tests::name_10000_invalid ... bench:      35,358 ns/iter (+/- 130)
test tests::name_all_valid     ... bench:   7,463,370 ns/iter (+/- 293,371)
test tests::name_basic         ... bench:          89 ns/iter (+/- 5)
$ ll target/release/deps/unicode_names2-6bc8607c5cb04625
-rwxr-xr-x 1 dev-rust dev-rust 26M  1 sept. 08:46 target/release/deps/unicode_names2-6bc8607c5cb04625

$ RUSTFLAGS="-C target-cpu=native" cargo bench
test tests::character_10000    ... bench:     102,357 ns/iter (+/- 1,786)
test tests::character_basic    ... bench:         162 ns/iter (+/- 2)
test tests::name_10000_invalid ... bench:      33,056 ns/iter (+/- 243)
test tests::name_all_valid     ... bench:   7,545,929 ns/iter (+/- 31,639)
test tests::name_basic         ... bench:          90 ns/iter (+/- 10)
$ ll target/release/deps/unicode_names2-6bc8607c5cb04625
-rwxr-xr-x 1 dev-rust dev-rust 26M  1 sept. 08:47 target/release/deps/unicode_names2-6bc8607c5cb04625

(character_10000 is repeatably slower with -C target-cpu=native???)

#24 with a hacky patch to merge the PHFs
diff --git a/build.rs b/build.rs
index 8caa8c1..dedbffe 100644
--- a/build.rs
+++ b/build.rs
@@ -19,11 +19,6 @@ fn main() {
     {
         let mut generated_phf_path = out_dir.clone();
         generated_phf_path.push("generated_phf.rs");
-        generator::generate_phf(UNICODE_DATA, Some(&generated_phf_path), None);
-    }
-    {
-        let mut generated_alias_path = out_dir;
-        generated_alias_path.push("generated_alias.rs");
-        generator::generate_aliases(NAME_ALIASES, &generated_alias_path);
+        generator::generate_phf(UNICODE_DATA, NAME_ALIASES, Some(&generated_phf_path), None);
     }
 }
diff --git a/generator/src/lib.rs b/generator/src/lib.rs
index 11b8103..bfad690 100644
--- a/generator/src/lib.rs
+++ b/generator/src/lib.rs
@@ -400,7 +400,7 @@ fn get_truncated_table_data(
     (codepoint_names, cjk)
 }
 
-pub fn generate_phf(unicode_data: &'static str, path: Option<&Path>, truncate: Option<usize>) {
+pub fn generate_phf(unicode_data: &'static str, name_aliases: &'static str, path: Option<&Path>, truncate: Option<usize>) {
     let (codepoint_names, _) = get_truncated_table_data(unicode_data, truncate);
 
     let mut ctxt = make_context(path);
@@ -408,6 +408,10 @@ pub fn generate_phf(unicode_data: &'static str, path: Option<&Path>, truncate: O
     for &(character, name) in &codepoint_names {
         builder.entry(name.as_bytes(), &format!("{:?}", character));
     }
+    for Alias { code, alias, .. } in get_aliases(name_aliases).into_iter() {
+        let formatted = format!("'\\u{{{code}}}'");
+        builder.entry(alias.as_bytes(), &formatted);
+    }
     write!(
         &mut ctxt.out,
         "pub static NAMES: phf::Map<&'static [u8], char> = {};",
@@ -432,13 +436,3 @@ pub fn generate(unicode_data: &'static str, path: Option<&Path>, truncate: Optio
         fs::rename(path.with_extension("tmp"), path).unwrap()
     }
 }
-
-pub fn generate_aliases(name_aliases: &'static str, path: &Path) {
-    let mut aliases = phf_codegen::Map::new();
-    for Alias { code, alias, .. } in get_aliases(name_aliases).into_iter() {
-        let formatted = format!("'\\u{{{code}}}'");
-        aliases.entry(alias, &formatted);
-    }
-    let aliases = aliases.build().to_string().replace("(\"", "(b\"");
-    writeln!(BufWriter::new(File::create(path).unwrap()), "{aliases}",).unwrap();
-}
diff --git a/src/lib.rs b/src/lib.rs
index 5768386..05e00dd 100755
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -86,11 +86,6 @@ mod generated_phf {
 #[allow(dead_code)]
 mod jamo;
 
-/// A map of unicode aliases to their corresponding values.
-/// Generated in generator
-#[allow(dead_code)]
-static ALIASES: phf::Map<&'static [u8], char> = include!(concat!(env!("OUT_DIR"), "/generated_alias.rs"));
-
 mod iter_str;
 
 static HANGUL_SYLLABLE_PREFIX: &str = "HANGUL SYLLABLE ";
@@ -291,16 +286,6 @@ pub fn name(c: char) -> Option<Name> {
     }
 }
 
-/// Get alias value from alias name, returns `None` if the alias is not found.
-fn character_by_alias(name: &[u8]) -> Option<char> {
-    println!(
-        "name {name}, {alias:?}",
-        name = std::str::from_utf8(name).unwrap(),
-        alias = ALIASES.get(name)
-    );
-    ALIASES.get(name).copied()
-}
-
 /// Find the character called `name`, or `None` if no such character
 /// exists.
 ///
@@ -378,7 +363,7 @@ pub fn character(search_name: &str) -> Option<char> {
 
     let codepoint = match generated_phf::NAMES.get(search_name) {
         Some(&codepoint) => codepoint,
-        None => return character_by_alias(search_name),
+        None => return None,
     };
 
     // Now check that this is actually correct. Since this is a
@@ -587,9 +572,9 @@ mod tests {
 
     #[test]
     fn character_by_alias() {
-        assert_eq!(super::character_by_alias(b"NEW LINE"), Some('\n'));
-        assert_eq!(super::character_by_alias(b"BACKSPACE"), Some('\u{8}'));
-        assert_eq!(super::character_by_alias(b"NOT AN ALIAS"), None);
+        assert_eq!(super::character("NEW LINE"), Some('\n'));
+        assert_eq!(super::character("BACKSPACE"), Some('\u{8}'));
+        assert_eq!(super::character("NOT AN ALIAS"), None);
     }
 
     #[bench]
$ cargo bench
test tests::character_10000    ... bench:      98,142 ns/iter (+/- 16,253)
test tests::character_basic    ... bench:         168 ns/iter (+/- 5)
test tests::name_10000_invalid ... bench:      35,412 ns/iter (+/- 413)
test tests::name_all_valid     ... bench:   7,589,785 ns/iter (+/- 23,819)
test tests::name_basic         ... bench:          95 ns/iter (+/- 4)
$ ll target/release/deps/unicode_names2-6bc8607c5cb04625
-rwxr-xr-x 1 dev-rust dev-rust 26M  1 sept. 08:49 target/release/deps/unicode_names2-6bc8607c5cb04625

$ RUSTFLAGS="-C target-cpu=native" cargo bench
test tests::character_10000    ... bench:      98,602 ns/iter (+/- 1,924)
test tests::character_basic    ... bench:         164 ns/iter (+/- 6)
test tests::name_10000_invalid ... bench:      33,016 ns/iter (+/- 104)
test tests::name_all_valid     ... bench:   7,631,792 ns/iter (+/- 24,104)
test tests::name_basic         ... bench:          97 ns/iter (+/- 6)
$ ll target/release/deps/unicode_names2-6bc8607c5cb04625
-rwxr-xr-x 1 dev-rust dev-rust 26M  1 sept. 08:48 target/release/deps/unicode_names2-6bc8607c5cb04625

@progval
Copy link
Owner Author

progval commented Sep 1, 2023

So it seems pretty clear the built-in PHF is (predictably) still better than generic PHF even on aliases.

Too bad.

@progval progval closed this Sep 1, 2023
@progval progval mentioned this pull request Sep 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants