Implement Set as a core class #13074

jeremyevans · 2025-04-06T03:43:45Z

Set has been an autoloaded standard library since Ruby 3.2. The standard library Set is less efficient than it could be, as it uses Hash for storage, which stores unnecessary values for each key.

Implementation details:

Core Set uses a modified version of st_table, named set_table. Other than s/st_/set_/, the main difference is that the stored records do not have values, making them 1/3 smaller. st_table_entry stores hash, key, and record (value), while set_table_entry only stores hash and key. This results in large sets using ~33% less memory compared to stdlib Set. For small sets, core Set uses 20% less memory (160 bytes, while stdlib set uses 40 for Set and 160 for Hash).
All methods are implemented as cfuncs, except the pretty_print methods, which were moved to lib/pp.rb (which is where the pretty_print methods for other core classes are defined). As is typical for core classes, internal calls call C functions and not Ruby methods. For example, to check if something is a Set, rb_obj_is_kind_of is used, instead of calling is_a?(Set) on the related object.
Almost all methods use the same algorithm that the pure-Ruby implementation used. The exception is when calling Set#divide with a block with 2-arity. The pure-Ruby method used tsort to implement this. I developed an algorithm that only allocates a single intermediate hash and does not need tsort.
The flatten_merge protected method is no longer necessary, so it is not implemented (it could be).
Similar to Hash/Array, subclasses of Set are no longer reflected in inspect output.
RDoc from stdlib Set was moved to core Set, with minor updates.

This includes a comprehensive benchmark suite for all public Set methods. As you would expect, the native version is faster in the vast majority of cases, and multiple times faster in many cases. There are a few cases where it is significantly slower:

Set.new with no arguments (~1.6x)
Set#compare_by_identity for small sets (~1.3x)
Set#clone for small sets (~1.5x)
Set#dup for small sets (~1.7x)

These are slower as Set does not currently use the AR table optimization that Hash does, so a new set_table is initialized for each call. I'm not sure it's worth the complexity to have an AR table-like optimization for small sets (for hashes it makes sense, as small hashes are used everywhere in Ruby).

The rbs and repl_type_completor bundled gems will need updates to support core Set. The pull request marks them as allowed failures.

This passes all set tests with no changes. The following specs needed modification:

Modifying frozen set error message (changed for the better)
Set#divide when passed a 2-arity block no longer yields the same object as both the first and second argument (this seems like an issue with the previous implementation).
Set-like objects that override is_a? such that is_a?(Set) return true are no longer treated as Set instances.
Set.allocate.hash is no longer the same as nil.hash
Set#join no longer calls Set#to_a (it calls the underlying C function).
Set#flatten_merge protected method is not implemented.

Previously, set.rb added a SortedSet autoload, which loads set/sorted_set.rb. This replaces the Set autoload in prelude.rb with a SortedSet autoload, but I recommend removing it and set/sorted_set.rb.

This moves test/set/test_set.rb to test/ruby/test_set.rb, reflecting that switch to a core class. This does not move the spec files, as I'm not sure how they should be handled.

jeremyevans · 2025-04-06T04:25:53Z

update_bundled_gems CI failure is due to release of test-unit 3.6.8 (which was today), which moved the file that the CI workflow is trying to use. Related test-unit commit: test-unit/test-unit@b7d3c32

byroot · 2025-04-07T07:46:56Z

set.c

+# endif
+#endif
+
+typedef struct set_table set_table;


It would be good to split the set_table internal implementation and the Set type, because there are other places in the VM that could benefit from using a set_table instead of a st_table, one example being the fstring_table.

That makes sense. I'll work on that.

Another place that could use a set_stable is the constant cache tracker:

ruby/vm_insnhelper.c

Line 6355 in 7564793

st_insert(ics, (st_data_t) ic, (st_data_t) Qtrue);

byroot · 2025-04-07T07:52:13Z

set.c

+struct set_table_entry {
+    set_hash_t hash;
+    set_data_t key;
+};


Is it really worth having set_hash_t / set_data_t / set_index_t? I might be off, but I think re-using st_* types would be easier and would make it trivial to covert code using an st_table into using a set_table?

I made the types deliberately different to avoid ambiguity. However, there is no technical reason we couldn't use the same types, and I agree that it could make conversion from st_table easier.

If we want to reuse st_* types, one option is to move the set_table code into st.c. There are some functions that are the same between the two implementations, so it would eliminate some duplicate code. What are your thoughts on that?

What are your thoughts on that?

That would be fine by me.

For the record there's an in-flight PR that would also benefit from a set_table: #13077

Giving some more thought to this, before making changes, I think we should discuss this at the developer meeting next week, as other developers may have thoughts about where the code should go. I'll add this to the meeting agenda.

hsbt · 2025-04-07T07:58:47Z

@jeremyevans

update_bundled_gems CI failure is due to release of test-unit 3.6.8 (which was today),

I fixed this at #13078

ParadoxV5 · 2025-04-12T18:13:44Z

🆙

Here’s 2¢ from my recent experience:
Instead of HashSet[E] implemented as HashMap[E, truthy_dummy], it could be the converse, with HashMap[K, V] implemented as HashSet[KVPair[K, V]].

KVPair = Struct.new(:key, :value) do
  def hash = ~key.hash
  def eql?(other) = key.eql?(other.key) rescue nil
end

jeremyevans · 2025-04-13T11:16:23Z

🆙

Here’s 2¢ from my recent experience: Instead of HashSet[E] implemented as HashMap[E, truthy_dummy], it could be the converse, with HashMap[K, V] implemented as HashSet[KVPair[K, V]].
KVPair = Struct.new(:key, :value) do
  def hash = ~key.hash
  def eql?(other) = key.eql?(other.key) rescue nil
end

Can you explain the advantage of doing that? Seems to use more objects and is probably slower, plus it would require fairly invasive changes. Additionally, HashSet does not support updating elements already in the set, while updating values for keys already in the map is a requirement of HashMap. Supporting both HashSet[E] and HashMap[K,V] (as this pull request does) seems better to me, there is no requirement to implement one in terms of the other.

byroot · 2025-04-15T07:53:57Z

set.c

+static int
+mark_key(set_data_t key, set_data_t data)
+{
+    rb_gc_mark((VALUE)key);


Suggested change

rb_gc_mark((VALUE)key);

rb_gc_mark_movable((VALUE)key);

I think you need rb_gc_mark_movable otherwise there is little point implementing dcompact.

byroot · 2025-04-15T07:55:33Z

set.c

+struct set_object {
+    set_table *table;
+};


Suggested change

struct set_object {

set_table *table;

};

struct set_object {

set_table table;

};

Would probably be better to embed the set_table and save a pointer chasing in most places. You can even use RUBY_TYPED_EMBEDDABLE so it's all in the object slot.

byroot · 2025-04-15T23:52:15Z

set.c

+        .dsize = set_size,
+        .dcompact = set_compact,
+    },
+    .flags = 0


As discussed last night, you can set TYPED_FREE_IMMEDIATELY

byroot · 2025-04-16T00:09:15Z

Shower thought: Probably doesn't matter, or at least largely offseted by the more efficient implementation, but we're losing the AR_TABLE optimization for small sets.

jeremyevans · 2025-04-24T05:44:52Z

This PR now implements what was agreed to at the dev meeting:

set_table now uses st_* types/functions where possible
set_table implementation lives in st.c
internal/set_table.h is added, allowing use of set_table
by places other than set.c
set.c only contains the Set-specific logic

Additionally, it implements the following improvements:

Embeds set_table into object slot (currently results in
a significant amount of wasted memory per Set, since
struct set_table is 56 bytes, so the object is 88 bytes
when embedded, just over the 80 byte limit, so it
occupies a 160 byte slot)
GC.compact support
Write barrier support
Optimizes some common Set methods when used with arrays,
by using the array ptr to access elements instead of
calling methods on the array.

I think this is now ready to be merged.

byroot · 2025-04-24T05:48:48Z

set.c

+#define RSET_LEV_MASK (FL_USER13 | FL_USER14 | FL_USER15 |                /* FL 13..19 */ \
+                        FL_USER16 | FL_USER17 | FL_USER18 | FL_USER19)
+#define RSET_LEV_SHIFT (FL_USHIFT + 13)
+#define RSET_LEV_MAX 127 /* 7 bits */


Not a big deal, but given we have plenty of space to spare inside the object slot, we could skip all that iter-level in flags complexity and just use a long or something inside struct set_object.

I have an idea for fitting Set in an 80-byte slot, which would not leave enough spare space. I think there would be a single padding byte in struct set_object that could be used, but I don't see much advantage to that over the flags approach.

You are correct that there is a significant amount of complexity here, but it was just copied from hash.c, so I assume there shouldn't be any issues (or at least, no issues that do not also affect Hash).

If I'm unable to get Set to fit in 80 bytes, I'll come back to this and simplify it.

If your idea is what is mentioned in 9f9821a (getting rid of typed_flag), it's not as simple.

The GC need to store one bit of information to know if the object was actually embedded or not (since allocation size is dynamic, if you ask for more than the GC can provide, you won't be embeded).

I have an idea how to reclaim that in the future, but it's a lot of work.

At the end of the day, I don't think Set being size 160B is a big deal. It's a bit of a shame we're wasting 72B, but that's still smaller than what the previous implementation was.

set.c

byroot · 2025-04-24T05:58:38Z

set.c

+        RB_OBJ_WRITTEN(set, Qundef, k);
+        set_insert(table, (st_data_t)k);


Suggested change

RB_OBJ_WRITTEN(set, Qundef, k);

set_insert(table, (st_data_t)k);

set_insert(table, (st_data_t)k);

RB_OBJ_WRITTEN(set, Qundef, k);

Not sure if it has an incidence, but in theory you are supposed to call _WRITTEN after having created the reference.

Makes sense. Also, this allows us to only call RB_OBJ_WRITTEN if the insertion was successful, as you mentioned in another comment. If the insertion was not successful, we did not write to the set_table, and therefore should not call RB_OBJ_WRITTEN. I've added a commit that adds a couple functions to DRY up this code and ensure that set_insert is only called in a single place in set.c, as you recommended in another comment.

set.c

st.c

jeremyevans

@byroot Thank you very much for your detailed review. I just pushed commits to address all of the issues you pointed out.

jeremyevans · 2025-04-25T01:22:22Z

set.c

+#define RSET_LEV_MASK (FL_USER13 | FL_USER14 | FL_USER15 |                /* FL 13..19 */ \
+                        FL_USER16 | FL_USER17 | FL_USER18 | FL_USER19)
+#define RSET_LEV_SHIFT (FL_USHIFT + 13)
+#define RSET_LEV_MAX 127 /* 7 bits */


I have an idea for fitting Set in an 80-byte slot, which would not leave enough spare space. I think there would be a single padding byte in struct set_object that could be used, but I don't see much advantage to that over the flags approach.

You are correct that there is a significant amount of complexity here, but it was just copied from hash.c, so I assume there shouldn't be any issues (or at least, no issues that do not also affect Hash).

If I'm unable to get Set to fit in 80 bytes, I'll come back to this and simplify it.

set.c

jeremyevans · 2025-04-25T01:57:13Z

set.c

+        RB_OBJ_WRITTEN(set, Qundef, k);
+        set_insert(table, (st_data_t)k);


Makes sense. Also, this allows us to only call RB_OBJ_WRITTEN if the insertion was successful, as you mentioned in another comment. If the insertion was not successful, we did not write to the set_table, and therefore should not call RB_OBJ_WRITTEN. I've added a commit that adds a couple functions to DRY up this code and ensure that set_insert is only called in a single place in set.c, as you recommended in another comment.

jeremyevans · 2025-04-25T02:00:14Z

set.c

+        };
+        set_iter(set, set_merge_i, (st_data_t)&args);
+        set_free_embedded(sobj);
+        memcpy(&sobj->table, new, sizeof(*new));


Not needed in this function as the write barrier are added in set_merge_i (called a couple lines above).

set.c

st.c

byroot

Looks good to me.

However I don't know if you cherry-picked the commits from jeremyevans#2, but it seems like @mrnoname1000 authorship was lost in the process. Would be nice to add him back.

Set has been an autoloaded standard library since Ruby 3.2. The standard library Set is less efficient than it could be, as it uses Hash for storage, which stores unnecessary values for each key. Implementation details: * Core Set uses a modified version of `st_table`, named `set_table`. than `s/st_/set_/`, the main difference is that the stored records do not have values, making them 1/3 smaller. `st_table_entry` stores `hash`, `key`, and `record` (value), while `set_table_entry` only stores `hash` and `key`. This results in large sets using ~33% less memory compared to stdlib Set. For small sets, core Set uses 12% more memory (160 byte object slot and 64 malloc bytes, while stdlib set uses 40 for Set and 160 for Hash). More memory is used because the set_table is embedded and 72 bytes in the object slot are currently wasted. Hopefully we can make this more efficient and have it stored in an 80 byte object slot in the future. * All methods are implemented as cfuncs, except the pretty_print methods, which were moved to `lib/pp.rb` (which is where the pretty_print methods for other core classes are defined). As is typical for core classes, internal calls call C functions and not Ruby methods. For example, to check if something is a Set, `rb_obj_is_kind_of` is used, instead of calling `is_a?(Set)` on the related object. * Almost all methods use the same algorithm that the pure-Ruby implementation used. The exception is when calling `Set#divide` with a block with 2-arity. The pure-Ruby method used tsort to implement this. I developed an algorithm that only allocates a single intermediate hash and does not need tsort. * The `flatten_merge` protected method is no longer necessary, so it is not implemented (it could be). * Similar to Hash/Array, subclasses of Set are no longer reflected in `inspect` output. * RDoc from stdlib Set was moved to core Set, with minor updates. This includes a comprehensive benchmark suite for all public Set methods. As you would expect, the native version is faster in the vast majority of cases, and multiple times faster in many cases. There are a few cases where it is significantly slower: * Set.new with no arguments (~1.6x) * Set#compare_by_identity for small sets (~1.3x) * Set#clone for small sets (~1.5x) * Set#dup for small sets (~1.7x) These are slower as Set does not currently use the AR table optimization that Hash does, so a new set_table is initialized for each call. I'm not sure it's worth the complexity to have an AR table-like optimization for small sets (for hashes it makes sense, as small hashes are used everywhere in Ruby). The rbs and repl_type_completor bundled gems will need updates to support core Set. The pull request marks them as allowed failures. This passes all set tests with no changes. The following specs needed modification: * Modifying frozen set error message (changed for the better) * `Set#divide` when passed a 2-arity block no longer yields the same object as both the first and second argument (this seems like an issue with the previous implementation). * Set-like objects that override `is_a?` such that `is_a?(Set)` return `true` are no longer treated as Set instances. * `Set.allocate.hash` is no longer the same as `nil.hash` * `Set#join` no longer calls `Set#to_a` (it calls the underlying C function). * `Set#flatten_merge` protected method is not implemented. Previously, `set.rb` added a `SortedSet` autoload, which loads `set/sorted_set.rb`. This replaces the `Set` autoload in `prelude.rb` with a `SortedSet` autoload, but I recommend removing it and `set/sorted_set.rb`. This moves `test/set/test_set.rb` to `test/ruby/test_set.rb`, reflecting that switch to a core class. This does not move the spec files, as I'm not sure how they should be handled. Internally, this uses the st_* types and functions as much as possible, and only adds set_* types and functions as needed. The underlying set_table implementation is stored in st.c, but there is no public C-API for it, nor is there one planned, in order to keep the ability to change the internals going forward. For internal uses of st_table with Qtrue values, those can probably be replaced with set_table. To do that, include internal/set_table.h. To handle symbol visibility (rb_ prefix), internal/set_table.h uses the same macro approach that include/ruby/st.h uses. The Set class (rb_cSet) and all methods are defined in set.c. There isn't currently a C-API for the Set class, though C-API functions can be added as needed going forward. Implements [Feature #21216] Co-authored-by: Jean Boussier <jean.boussier@gmail.com> Co-authored-by: Oliver Nutter <mrnoname1000@riseup.net>

jeremyevans · 2025-04-26T00:31:06Z

Looks good to me.

However I don't know if you cherry-picked the commits from jeremyevans#2, but it seems like @mrnoname1000 authorship was lost in the process. Would be nice to add him back.

Certainly. I squashed this into a single commit and added him as a co-author. I'll merge after this passes CI.

jeremyevans · 2025-04-26T01:31:04Z

CI is only failing on the same VM_DEBUG ractor btest that is also failing on master, so I am going to merge this.

MSP-Greg · 2025-04-26T20:59:45Z

@jeremyevans

Haven't had time to look into this, but while running GHA tests today, the following patch worked:

ruby 3.5.0dev (2025-04-24T18:33:11Z :detached: 71166f60d9) +PRISM [x86_64-linux]

A few hours later, the following:

ruby 3.5.0dev (2025-04-26T11:56:42Z :detached: 687bd83724) +PRISM [x86_64-linux]

The logged error was:

ActionView::Template::Error: no _dump_data is defined for class Set
    app/views/layouts/application.html.erb:7

This was running tests against turbo-rails in Puma...

byroot · 2025-04-26T21:01:30Z

Thanks for pointing it out, that is indeed an oversight, won't be too hard to fix.

jeremyevans · 2025-04-26T23:43:27Z

Yep, I'll work on a patch for Marshal support for Set.

jeremyevans · 2025-04-27T02:54:34Z

PR submitted to support marshalling: #13185

This comment has been minimized.

Sign in to view

byroot reviewed Apr 7, 2025

View reviewed changes

alpaca-tc mentioned this pull request Apr 8, 2025

Optimize callcache invalidation for refinements #13077

Open

byroot reviewed Apr 15, 2025

View reviewed changes

jeremyevans force-pushed the core-set branch 2 times, most recently from 1cc9fcc to 9f9821a Compare April 23, 2025 06:42

jeremyevans requested review from byroot and knu April 24, 2025 05:45

byroot requested changes Apr 24, 2025

View reviewed changes

jeremyevans commented Apr 25, 2025

View reviewed changes

jeremyevans force-pushed the core-set branch from 56fda76 to 66a0b2e Compare April 25, 2025 04:06

byroot approved these changes Apr 25, 2025

View reviewed changes

jeremyevans force-pushed the core-set branch from 66a0b2e to 56d9fcf Compare April 26, 2025 00:29

jeremyevans mentioned this pull request Apr 26, 2025

Make built-in set object embeddable and instantly freeable jeremyevans/ruby#2

Closed

jeremyevans merged commit e4f85bf into ruby:master Apr 26, 2025
88 of 90 checks passed

MSP-Greg mentioned this pull request Apr 28, 2025

ci: Add Ubuntu and Windows arm jobs puma/puma#3651

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Set as a core class #13074

Implement Set as a core class #13074

jeremyevans commented Apr 6, 2025

This comment has been minimized.

jeremyevans commented Apr 6, 2025

byroot Apr 7, 2025

jeremyevans Apr 7, 2025

byroot Apr 10, 2025

byroot Apr 7, 2025

jeremyevans Apr 7, 2025

byroot Apr 7, 2025

jeremyevans Apr 7, 2025

hsbt commented Apr 7, 2025

ParadoxV5 commented Apr 12, 2025

jeremyevans commented Apr 13, 2025

byroot Apr 15, 2025

byroot Apr 15, 2025

byroot Apr 15, 2025

byroot commented Apr 16, 2025

jeremyevans commented Apr 24, 2025

byroot Apr 24, 2025

jeremyevans Apr 25, 2025

byroot Apr 25, 2025

byroot Apr 24, 2025

jeremyevans Apr 25, 2025

jeremyevans left a comment

jeremyevans Apr 25, 2025

jeremyevans Apr 25, 2025

jeremyevans Apr 25, 2025

byroot left a comment

jeremyevans commented Apr 26, 2025

jeremyevans commented Apr 26, 2025

MSP-Greg commented Apr 26, 2025

byroot commented Apr 26, 2025

jeremyevans commented Apr 26, 2025

jeremyevans commented Apr 27, 2025

		RB_OBJ_WRITTEN(set, Qundef, k);
		set_insert(table, (st_data_t)k);

Implement Set as a core class #13074

Implement Set as a core class #13074

Conversation

jeremyevans commented Apr 6, 2025

This comment has been minimized.

jeremyevans commented Apr 6, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hsbt commented Apr 7, 2025

ParadoxV5 commented Apr 12, 2025

jeremyevans commented Apr 13, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

byroot commented Apr 16, 2025

jeremyevans commented Apr 24, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeremyevans left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

byroot left a comment

Choose a reason for hiding this comment

jeremyevans commented Apr 26, 2025

jeremyevans commented Apr 26, 2025

MSP-Greg commented Apr 26, 2025

byroot commented Apr 26, 2025

jeremyevans commented Apr 26, 2025

jeremyevans commented Apr 27, 2025