Skip to content

py/map: Make map rehashing an atomic operation. #11604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dpgeorge
Copy link
Member

This simple change to map rehashing makes it so the old map is kept intact during the rehash, and only after the rehash is complete is the old map updated to the new one. And this update is a very quick memory copy, which is done inside an atomic section.

This means that hard IRQs handlers can access dict members (eg globals) even during the update (add/del) of that dict.

See related #10620.

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label May 24, 2023
@github-actions
Copy link

github-actions bot commented May 24, 2023

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:   +72 +0.039% 
   unix x64:   +96 +0.012% standard
      stm32:    +8 +0.002% PYBV10
     mimxrt:   +16 +0.004% TEENSY40
        rp2:   +16 +0.005% RPI_PICO
       samd:    +8 +0.003% ADAFRUIT_ITSYBITSY_M4_EXPRESS

@codecov-commenter
Copy link

codecov-commenter commented May 24, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2962e24) 98.36% compared to head (d5a023b) 98.36%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #11604   +/-   ##
=======================================
  Coverage   98.36%   98.36%           
=======================================
  Files         161      161           
  Lines       21089    21089           
=======================================
  Hits        20744    20744           
  Misses        345      345           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jimmo
Copy link
Member

jimmo commented May 24, 2023

Great idea!

Worth noting that this isn't just important for hard IRQs but also GIL-free multi-core (on e.g. rp2) where the problem is much more likely to occur.

@jimmo
Copy link
Member

jimmo commented May 26, 2023

Unfortunately from further discussion with @dpgeorge about this, I'm not sure this solves the problem. Here are two scenarios:

  1. The main thread triggers map rehash, during which a hard IRQ fires which sets the value of an existing global. This is the scenario in Docs: Detail issue with hard ISRs and globals. #10620 (comment) . With this PR now that won't cause corruption, but the write will go to the old map, and be lost when the rehash finishes and switches over to the new map. I think some sort of per-map flag to detect a write during a re-hash, and then re-run the re-hash might be able to fix this? (This only works because the IRQ could not have triggered the re-hash)

  2. Two non-GIL threads (e.g. rp2040), one triggers a re-hash while the other reads from the same map. The second thread will see the old table, but ff the first thread causes an allocation before the second thread has time to access the table, then the table will be overwritten by the new allocation. This is due to the m_del at the end of rehash. This is an optimisation to early reclaim memory, we could disable this when non-GIL threading is enabled.

Also worth pointing out -- if two threads (or thread + hard IRQ) do simultaneous modifications to a map, it is fine only if they are both updating existing keys. In other words, any operation that mutates the set of keys (insert, remove) is not concurrent safe. For the insert at least, you probably shouldn't be allowed to do this in a hard IRQ because there's a chance it might trigger a re-hash anyway (which would fail to allocate). So for the hard IRQ case at least, we could consider preventing any mutation that isn't an update to an existing key.

For the non-GIL threading case, it seems reasonable that two threads shouldn't be allowed to concurrently mutate a dictionary's keys. Enforcing that without a mutex seems hard though. I feel that conceptually the globals dict and, say, the locals dict of a class or an actual dict object in users code should have different rules. Maybe in non-GIL threading mode, we need to do special protection of just globals dicts?

CC @peterhinch

@dpgeorge
Copy link
Member Author

The test tests/extmod/select_poll_eintr.py is failing on the unix port due to the bug that's being addressed in this PR. The spawned thread tries to read the global variable lock but that sometimes happens while the main thread is rehashing the globals dict because it's adding the global variable s or poller.

I was able to fix that unix case by adding a mutex that is obtained during all map operations. That's obviously not efficient but at least shows the problem and how it could be fixed.

Maybe in non-GIL threading mode, we need to do special protection of just globals dicts?

Yes, I think that would be a reasonable way to solve this problem, to somehow make globals dicts (the dict of the top level of a module) special and support concurrent access. Maybe the map can internally have both a read-only and write-only table and then the rehash scheme is:

  1. read-only and write-only starts by pointing to the same table
  2. rehash begins and creates a new table and populates the keys only (not their values)
  3. rehash then points the write-only table to this new table with only keys (so now anyone writing to this map will update the new table, but reads get existing values)
  4. rehash then copies across values from the read-only table but only if the value in the write-only table is empty
  5. rehash then points the read-only pointer to the new table
  6. rehash frees the old table

@dpgeorge dpgeorge force-pushed the py-map-atomic-rehash branch from e25fe94 to 3102154 Compare February 20, 2024 23:56
@dpgeorge
Copy link
Member Author

I added a test which does concurrent globals access and fails easily on the unix port with non-GIL threading.

Signed-off-by: Damien George <damien@micropython.org>
Signed-off-by: Damien George <damien@micropython.org>
@dpgeorge
Copy link
Member Author

Alternative idea for a fix:

  • Define a new mutex which is used to protect the set of functions that can access the globals dict: mp_load_global(), mp_load_name(), mp_store_name() etc. That won't cover all cases of globals access (eg globals()['abc'] = 123) but should be enough to fix normal use of globals.
  • Would need to think how to release the mutex in case the lookup raises an exception while the mutex is held.
  • Could optimise the mutex obtain/release (and also the GC and qstr mutex's) by having them dynamic function pointers that are the identity when no other thread is running, and proper mutex functions when other threads are running.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
py-core Relates to py/ directory in source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants