
Find keywords using perfect hashing #1590


Closed
wants to merge 1 commit

Conversation

davisp
Member

@davisp davisp commented Dec 11, 2024

This updates the keyword lookup algorithm to use the phf crate.

This is one possible approach to solving #1588, at the cost of pulling in a new dependency on phf. I will note that phf is rather small and is used in a no_std configuration, but it is a new dependency, which may be reason enough not to go this route.

Local benchmarking with sqlparser_bench shows roughly a 5% speedup. I'll open a second PR for comparison with my other approach of scoping the binary search to words starting with the same first letter.
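
For illustration, the phf approach amounts to a static perfect-hash map built at compile time and keyed by the uppercase keyword. A minimal sketch of the idea (not the generated code from this PR; it assumes phf with its macros feature and a Copy Keyword enum with a NoKeyword variant, as this crate has):

use phf::phf_map;

// Compile-time perfect-hash map: lookups take constant time with no collisions.
static KEYWORDS: phf::Map<&'static str, Keyword> = phf_map! {
    "SELECT" => Keyword::SELECT,
    "FROM" => Keyword::FROM,
    // ... one entry per keyword ...
};

pub fn lookup(word: &str) -> Keyword {
    // The map is keyed by the canonical uppercase spelling.
    let upper = word.to_ascii_uppercase();
    KEYWORDS.get(upper.as_str()).copied().unwrap_or(Keyword::NoKeyword)
}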

This updates the keyword lookup algorithm to use the phf crate.
Contributor

@alamb alamb left a comment


Thank you for this @davisp

];
};
pub fn lookup(keyword: &str) -> Keyword {
    let keyword = keyword.to_ascii_uppercase();

If we are going to optimize lookup, I think we should also avoid this call to to_ascii_uppercase, as it allocates / copies the string, which I believe is significantly more costly than some comparisons:

https://doc.rust-lang.org/std/ascii/trait.AsciiExt.html#tymethod.to_ascii_uppercase
https://doc.rust-lang.org/std/primitive.str.html#method.to_uppercase

Since all keywords are ASCII, what if we did the binary search character by character?

Here is some brain 🤮 :

let first_char = string.first().to_upper();
// find the range of keywords starting with first_char
let range = binary_search(keywords, first_char, 0);
// if the range isn't a single match, proceed to the second character, etc.
...
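
A slightly more concrete sketch of that idea, for reference (hypothetical names; it assumes the keyword table is stored uppercase and sorted, so a byte-wise case-folding comparison gives a valid ordering for binary search):

use std::cmp::Ordering;

// Compare a stored (uppercase) keyword against the raw input, folding the
// input to ASCII uppercase byte by byte so no String is ever allocated.
fn cmp_folded(stored: &str, input: &str) -> Ordering {
    stored
        .bytes()
        .cmp(input.bytes().map(|b| b.to_ascii_uppercase()))
}

fn lookup_no_alloc(input: &str, keywords: &[&str]) -> Option<usize> {
    keywords.binary_search_by(|kw| cmp_folded(kw, input)).ok()
}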

use std::path::Path;

fn read_keywords() -> Vec<(String, Option<String>)> {
    let path = Path::new("src").join("keywords.txt");

this is quite clever
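
For context: if the map is generated from keywords.txt in a build script, the usual companion to phf is the phf_codegen crate, along the lines of the rough sketch below (illustrative only, not this PR's actual build.rs; the entries would come from something like read_keywords()):

// build.rs (illustrative)
use std::env;
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::Path;

fn main() {
    let out = Path::new(&env::var("OUT_DIR").unwrap()).join("keyword_map.rs");
    let mut file = BufWriter::new(File::create(&out).unwrap());

    let mut map = phf_codegen::Map::new();
    // Values are Rust expressions written out as strings.
    map.entry("SELECT", "Keyword::SELECT");
    map.entry("FROM", "Keyword::FROM");

    write!(
        &mut file,
        "static KEYWORD_MAP: phf::Map<&'static str, Keyword> = {};",
        map.build()
    )
    .unwrap();
}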

@alamb
Contributor

alamb commented Dec 12, 2024

As much as I love not rewriting code, I am quite loath to take on a new dependency in this crate. My concerns are:

  1. The binary / dependency size
  2. (mostly) that every additional crate used results in another place we have to help maintain (see for example CI Failure in json writer tests after upgrade from lexical-core 1.0.2 to lexical-core 1.0.3: range end index 20 out of range for slice of length 19 arrow-rs#6858)

That being said, phf appears to be widely used and not changing: https://crates.io/crates/phf/reverse_dependencies

Are you willing to try the "compare without allocating" approach described in #1591 (comment)?

@tobyhede
Contributor

Also, an open question regarding maintenance of the phf crate: rust-phf/rust-phf#318
It does appear to be relatively stable; most of the open issues appear to be feature requests and enhancements rather than fixes.

@alamb alamb marked this pull request as draft December 24, 2024 15:00
@alamb
Contributor

alamb commented Dec 24, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look


Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Feb 23, 2025
@github-actions github-actions bot closed this Mar 2, 2025