Skip to content

Implement Set as a core class #13074

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 26, 2025
Merged

Implement Set as a core class #13074

merged 1 commit into from
Apr 26, 2025

Conversation

jeremyevans
Copy link
Contributor

Set has been an autoloaded standard library since Ruby 3.2. The standard library Set is less efficient than it could be, as it uses Hash for storage, which stores unnecessary values for each key.

Implementation details:

  • Core Set uses a modified version of st_table, named set_table. Other than s/st_/set_/, the main difference is that the stored records do not have values, making them 1/3 smaller. st_table_entry stores hash, key, and record (value), while set_table_entry only stores hash and key. This results in large sets using ~33% less memory compared to stdlib Set. For small sets, core Set uses 20% less memory (160 bytes, while stdlib set uses 40 for Set and 160 for Hash).

  • All methods are implemented as cfuncs, except the pretty_print methods, which were moved to lib/pp.rb (which is where the pretty_print methods for other core classes are defined). As is typical for core classes, internal calls call C functions and not Ruby methods. For example, to check if something is a Set, rb_obj_is_kind_of is used, instead of calling is_a?(Set) on the related object.

  • Almost all methods use the same algorithm that the pure-Ruby implementation used. The exception is when calling Set#divide with a block with 2-arity. The pure-Ruby method used tsort to implement this. I developed an algorithm that only allocates a single intermediate hash and does not need tsort.

  • The flatten_merge protected method is no longer necessary, so it is not implemented (it could be).

  • Similar to Hash/Array, subclasses of Set are no longer reflected in inspect output.

  • RDoc from stdlib Set was moved to core Set, with minor updates.

This includes a comprehensive benchmark suite for all public Set methods. As you would expect, the native version is faster in the vast majority of cases, and multiple times faster in many cases. There are a few cases where it is significantly slower:

  • Set.new with no arguments (~1.6x)
  • Set#compare_by_identity for small sets (~1.3x)
  • Set#clone for small sets (~1.5x)
  • Set#dup for small sets (~1.7x)

These are slower as Set does not currently use the AR table optimization that Hash does, so a new set_table is initialized for each call. I'm not sure it's worth the complexity to have an AR table-like optimization for small sets (for hashes it makes sense, as small hashes are used everywhere in Ruby).

The rbs and repl_type_completor bundled gems will need updates to support core Set. The pull request marks them as allowed failures.

This passes all set tests with no changes. The following specs needed modification:

  • Modifying frozen set error message (changed for the better)
  • Set#divide when passed a 2-arity block no longer yields the same object as both the first and second argument (this seems like an issue with the previous implementation).
  • Set-like objects that override is_a? such that is_a?(Set) return true are no longer treated as Set instances.
  • Set.allocate.hash is no longer the same as nil.hash
  • Set#join no longer calls Set#to_a (it calls the underlying C function).
  • Set#flatten_merge protected method is not implemented.

Previously, set.rb added a SortedSet autoload, which loads set/sorted_set.rb. This replaces the Set autoload in prelude.rb with a SortedSet autoload, but I recommend removing it and set/sorted_set.rb.

This moves test/set/test_set.rb to test/ruby/test_set.rb, reflecting that switch to a core class. This does not move the spec files, as I'm not sure how they should be handled.

This comment has been minimized.

@jeremyevans
Copy link
Contributor Author

update_bundled_gems CI failure is due to release of test-unit 3.6.8 (which was today), which moved the file that the CI workflow is trying to use. Related test-unit commit: test-unit/test-unit@b7d3c32

set.c Outdated
# endif
#endif

typedef struct set_table set_table;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to split the set_table internal implementation and the Set type, because there are other places in the VM that could benefit from using a set_table instead of a st_table, one example being the fstring_table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I'll work on that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another place that could use a set_stable is the constant cache tracker:

st_insert(ics, (st_data_t) ic, (st_data_t) Qtrue);

set.c Outdated
Comment on lines 144 to 147
struct set_table_entry {
set_hash_t hash;
set_data_t key;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really worth having set_hash_t / set_data_t / set_index_t? I might be off, but I think re-using st_* types would be easier and would make it trivial to covert code using an st_table into using a set_table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the types deliberately different to avoid ambiguity. However, there is no technical reason we couldn't use the same types, and I agree that it could make conversion from st_table easier.

If we want to reuse st_* types, one option is to move the set_table code into st.c. There are some functions that are the same between the two implementations, so it would eliminate some duplicate code. What are your thoughts on that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts on that?

That would be fine by me.

For the record there's an in-flight PR that would also benefit from a set_table: #13077

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giving some more thought to this, before making changes, I think we should discuss this at the developer meeting next week, as other developers may have thoughts about where the code should go. I'll add this to the meeting agenda.

@hsbt
Copy link
Member

hsbt commented Apr 7, 2025

@jeremyevans

update_bundled_gems CI failure is due to release of test-unit 3.6.8 (which was today),

I fixed this at #13078

@ParadoxV5
Copy link
Contributor

🆙


Here’s 2¢ from my recent experience:
Instead of HashSet[E] implemented as HashMap[E, truthy_dummy], it could be the converse, with HashMap[K, V] implemented as HashSet[KVPair[K, V]].

KVPair = Struct.new(:key, :value) do
  def hash = ~key.hash
  def eql?(other) = key.eql?(other.key) rescue nil
end

@jeremyevans
Copy link
Contributor Author

🆙

Here’s 2¢ from my recent experience: Instead of HashSet[E] implemented as HashMap[E, truthy_dummy], it could be the converse, with HashMap[K, V] implemented as HashSet[KVPair[K, V]].

KVPair = Struct.new(:key, :value) do
  def hash = ~key.hash
  def eql?(other) = key.eql?(other.key) rescue nil
end

Can you explain the advantage of doing that? Seems to use more objects and is probably slower, plus it would require fairly invasive changes. Additionally, HashSet does not support updating elements already in the set, while updating values for keys already in the map is a requirement of HashMap. Supporting both HashSet[E] and HashMap[K,V] (as this pull request does) seems better to me, there is no requirement to implement one in terms of the other.

set.c Outdated
static int
mark_key(set_data_t key, set_data_t data)
{
rb_gc_mark((VALUE)key);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rb_gc_mark((VALUE)key);
rb_gc_mark_movable((VALUE)key);

I think you need rb_gc_mark_movable otherwise there is little point implementing dcompact.

set.c Outdated
Comment on lines 1640 to 120
struct set_object {
set_table *table;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
struct set_object {
set_table *table;
};
struct set_object {
set_table table;
};

Would probably be better to embed the set_table and save a pointer chasing in most places. You can even use RUBY_TYPED_EMBEDDABLE so it's all in the object slot.

set.c Outdated
.dsize = set_size,
.dcompact = set_compact,
},
.flags = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed last night, you can set TYPED_FREE_IMMEDIATELY

@byroot
Copy link
Member

byroot commented Apr 16, 2025

Shower thought: Probably doesn't matter, or at least largely offseted by the more efficient implementation, but we're losing the AR_TABLE optimization for small sets.

@jeremyevans jeremyevans force-pushed the core-set branch 2 times, most recently from 1cc9fcc to 9f9821a Compare April 23, 2025 06:42
@jeremyevans
Copy link
Contributor Author

This PR now implements what was agreed to at the dev meeting:

  • set_table now uses st_* types/functions where possible
  • set_table implementation lives in st.c
  • internal/set_table.h is added, allowing use of set_table
    by places other than set.c
  • set.c only contains the Set-specific logic

Additionally, it implements the following improvements:

  • Embeds set_table into object slot (currently results in
    a significant amount of wasted memory per Set, since
    struct set_table is 56 bytes, so the object is 88 bytes
    when embedded, just over the 80 byte limit, so it
    occupies a 160 byte slot)

  • GC.compact support

  • Write barrier support

  • Optimizes some common Set methods when used with arrays,
    by using the array ptr to access elements instead of
    calling methods on the array.

I think this is now ready to be merged.

@jeremyevans jeremyevans requested review from byroot and knu April 24, 2025 05:45
Comment on lines +105 to +108
#define RSET_LEV_MASK (FL_USER13 | FL_USER14 | FL_USER15 | /* FL 13..19 */ \
FL_USER16 | FL_USER17 | FL_USER18 | FL_USER19)
#define RSET_LEV_SHIFT (FL_USHIFT + 13)
#define RSET_LEV_MAX 127 /* 7 bits */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but given we have plenty of space to spare inside the object slot, we could skip all that iter-level in flags complexity and just use a long or something inside struct set_object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an idea for fitting Set in an 80-byte slot, which would not leave enough spare space. I think there would be a single padding byte in struct set_object that could be used, but I don't see much advantage to that over the flags approach.

You are correct that there is a significant amount of complexity here, but it was just copied from hash.c, so I assume there shouldn't be any issues (or at least, no issues that do not also affect Hash).

If I'm unable to get Set to fit in 80 bytes, I'll come back to this and simplify it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If your idea is what is mentioned in 9f9821a (getting rid of typed_flag), it's not as simple.

The GC need to store one bit of information to know if the object was actually embedded or not (since allocation size is dynamic, if you ask for more than the GC can provide, you won't be embeded).

I have an idea how to reclaim that in the future, but it's a lot of work.

At the end of the day, I don't think Set being size 160B is a big deal. It's a bit of a shame we're wasting 72B, but that's still smaller than what the previous implementation was.

set.c Outdated
Comment on lines 411 to 412
RB_OBJ_WRITTEN(set, Qundef, k);
set_insert(table, (st_data_t)k);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RB_OBJ_WRITTEN(set, Qundef, k);
set_insert(table, (st_data_t)k);
set_insert(table, (st_data_t)k);
RB_OBJ_WRITTEN(set, Qundef, k);

Not sure if it has an incidence, but in theory you are supposed to call _WRITTEN after having created the reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Also, this allows us to only call RB_OBJ_WRITTEN if the insertion was successful, as you mentioned in another comment. If the insertion was not successful, we did not write to the set_table, and therefore should not call RB_OBJ_WRITTEN. I've added a commit that adds a couple functions to DRY up this code and ensure that set_insert is only called in a single place in set.c, as you recommended in another comment.

Copy link
Contributor Author

@jeremyevans jeremyevans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@byroot Thank you very much for your detailed review. I just pushed commits to address all of the issues you pointed out.

Comment on lines +105 to +108
#define RSET_LEV_MASK (FL_USER13 | FL_USER14 | FL_USER15 | /* FL 13..19 */ \
FL_USER16 | FL_USER17 | FL_USER18 | FL_USER19)
#define RSET_LEV_SHIFT (FL_USHIFT + 13)
#define RSET_LEV_MAX 127 /* 7 bits */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an idea for fitting Set in an 80-byte slot, which would not leave enough spare space. I think there would be a single padding byte in struct set_object that could be used, but I don't see much advantage to that over the flags approach.

You are correct that there is a significant amount of complexity here, but it was just copied from hash.c, so I assume there shouldn't be any issues (or at least, no issues that do not also affect Hash).

If I'm unable to get Set to fit in 80 bytes, I'll come back to this and simplify it.

set.c Outdated
Comment on lines 411 to 412
RB_OBJ_WRITTEN(set, Qundef, k);
set_insert(table, (st_data_t)k);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Also, this allows us to only call RB_OBJ_WRITTEN if the insertion was successful, as you mentioned in another comment. If the insertion was not successful, we did not write to the set_table, and therefore should not call RB_OBJ_WRITTEN. I've added a commit that adds a couple functions to DRY up this code and ensure that set_insert is only called in a single place in set.c, as you recommended in another comment.

};
set_iter(set, set_merge_i, (st_data_t)&args);
set_free_embedded(sobj);
memcpy(&sobj->table, new, sizeof(*new));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed in this function as the write barrier are added in set_merge_i (called a couple lines above).

Copy link
Member

@byroot byroot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

However I don't know if you cherry-picked the commits from jeremyevans#2, but it seems like @mrnoname1000 authorship was lost in the process. Would be nice to add him back.

Set has been an autoloaded standard library since Ruby 3.2.
The standard library Set is less efficient than it could be, as it
uses Hash for storage, which stores unnecessary values for each key.

Implementation details:

* Core Set uses a modified version of `st_table`, named `set_table`.
  than `s/st_/set_/`, the main difference is that the stored records
  do not have values, making them 1/3 smaller. `st_table_entry` stores
  `hash`, `key`, and `record` (value), while `set_table_entry` only
  stores `hash` and `key`.  This results in large sets using ~33% less
  memory compared to stdlib Set.  For small sets, core Set uses 12% more
  memory (160 byte object slot and 64 malloc bytes, while stdlib set
  uses 40 for Set and 160 for Hash).  More memory is used because
  the set_table is embedded and 72 bytes in the object slot are
  currently wasted. Hopefully we can make this more efficient and have
  it stored in an 80 byte object slot in the future.

* All methods are implemented as cfuncs, except the pretty_print
  methods, which were moved to `lib/pp.rb` (which is where the
  pretty_print methods for other core classes are defined).  As is
  typical for core classes, internal calls call C functions and
  not Ruby methods.  For example, to check if something is a Set,
  `rb_obj_is_kind_of` is used, instead of calling `is_a?(Set)` on the
  related object.

* Almost all methods use the same algorithm that the pure-Ruby
  implementation used.  The exception is when calling `Set#divide` with a
  block with 2-arity.  The pure-Ruby method used tsort to implement this.
  I developed an algorithm that only allocates a single intermediate
  hash and does not need tsort.

* The `flatten_merge` protected method is no longer necessary, so it
  is not implemented (it could be).

* Similar to Hash/Array, subclasses of Set are no longer reflected in
  `inspect` output.

* RDoc from stdlib Set was moved to core Set, with minor updates.

This includes a comprehensive benchmark suite for all public Set
methods.  As you would expect, the native version is faster in the
vast majority of cases, and multiple times faster in many cases.
There are a few cases where it is significantly slower:

* Set.new with no arguments (~1.6x)
* Set#compare_by_identity for small sets (~1.3x)
* Set#clone for small sets (~1.5x)
* Set#dup for small sets (~1.7x)

These are slower as Set does not currently use the AR table
optimization that Hash does, so a new set_table is initialized for
each call.  I'm not sure it's worth the complexity to have an AR
table-like optimization for small sets (for hashes it makes sense,
as small hashes are used everywhere in Ruby).

The rbs and repl_type_completor bundled gems will need updates to
support core Set.  The pull request marks them as allowed failures.

This passes all set tests with no changes.  The following specs
needed modification:

* Modifying frozen set error message (changed for the better)
* `Set#divide` when passed a 2-arity block no longer yields the same
  object as both the first and second argument (this seems like an issue
  with the previous implementation).
* Set-like objects that override `is_a?` such that `is_a?(Set)` return
  `true` are no longer treated as Set instances.
* `Set.allocate.hash` is no longer the same as `nil.hash`
* `Set#join` no longer calls `Set#to_a` (it calls the underlying C
   function).
* `Set#flatten_merge` protected method is not implemented.

Previously, `set.rb` added a `SortedSet` autoload, which loads
`set/sorted_set.rb`.  This replaces the `Set` autoload in `prelude.rb`
with a `SortedSet` autoload, but I recommend removing it and
`set/sorted_set.rb`.

This moves `test/set/test_set.rb` to `test/ruby/test_set.rb`,
reflecting that switch to a core class.  This does not move the spec
files, as I'm not sure how they should be handled.

Internally, this uses the st_* types and functions as much as
possible, and only adds set_* types and functions as needed.
The underlying set_table implementation is stored in st.c, but
there is no public C-API for it, nor is there one planned, in
order to keep the ability to change the internals going forward.

For internal uses of st_table with Qtrue values, those can
probably be replaced with set_table.  To do that, include
internal/set_table.h.  To handle symbol visibility (rb_ prefix),
internal/set_table.h uses the same macro approach that
include/ruby/st.h uses.

The Set class (rb_cSet) and all methods are defined in set.c.
There isn't currently a C-API for the Set class, though C-API
functions can be added as needed going forward.

Implements [Feature #21216]

Co-authored-by: Jean Boussier <jean.boussier@gmail.com>
Co-authored-by: Oliver Nutter <mrnoname1000@riseup.net>
@jeremyevans
Copy link
Contributor Author

Looks good to me.

However I don't know if you cherry-picked the commits from jeremyevans#2, but it seems like @mrnoname1000 authorship was lost in the process. Would be nice to add him back.

Certainly. I squashed this into a single commit and added him as a co-author. I'll merge after this passes CI.

@jeremyevans
Copy link
Contributor Author

CI is only failing on the same VM_DEBUG ractor btest that is also failing on master, so I am going to merge this.

@jeremyevans jeremyevans merged commit e4f85bf into ruby:master Apr 26, 2025
88 of 90 checks passed
@MSP-Greg
Copy link
Contributor

@jeremyevans

Haven't had time to look into this, but while running GHA tests today, the following patch worked:

ruby 3.5.0dev (2025-04-24T18:33:11Z :detached: 71166f60d9) +PRISM [x86_64-linux]

A few hours later, the following:

ruby 3.5.0dev (2025-04-26T11:56:42Z :detached: 687bd83724) +PRISM [x86_64-linux]

The logged error was:

ActionView::Template::Error: no _dump_data is defined for class Set
    app/views/layouts/application.html.erb:7

This was running tests against turbo-rails in Puma...

@byroot
Copy link
Member

byroot commented Apr 26, 2025

Thanks for pointing it out, that is indeed an oversight, won't be too hard to fix.

@jeremyevans
Copy link
Contributor Author

Yep, I'll work on a patch for Marshal support for Set.

@jeremyevans
Copy link
Contributor Author

PR submitted to support marshalling: #13185

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants