Introducing pyscript.fs namespace/module #2289

Merged
merged 7 commits into pyscript:main from the pyscript-fs branch
Feb 17, 2025

Conversation

WebReflection
Contributor

@WebReflection WebReflection commented Feb 13, 2025

Description

We have recently discovered the pyodide ability to mount users' local folders via mountNativeFS API and we would like to provide that experiment via our own pyscript.fs module.

This MR provides all the basic "bricks" to have the ability to await fs.mount("/path") and later on, if changes are made to such path, the ability to await fs.sync("/path") so that changes become persistent and available for any further visit of the same page/app/tab/domain.

Works only in Chrome/ium ⚠️

I wouldn't know how to best reflect the fact this feature currently works only in chrome as it requires showDirectoryPicker which is supported even on Android but nowhere around Firefox or Safari.

No Polyfill ⚠️

Due to the inability to persist the directory handler via the only polyfill I could find, it's unclear how users could choose a different strategy when it comes to storing files ... although we do provide a pyscript.store API so they should be covered.

The check on their side would be as simple as:

// JS, main thread
if ('showDirectoryPicker' in globalThis)
  console.log("yeah, persistent directory handlers available!");

# Python
from pyscript import window
if hasattr(window, "showDirectoryPicker"):
  print("yeah, persistent directory handlers available!")
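
The same check can gate the mount call itself. A minimal sketch, assuming a hypothetical helper named `supports_directory_picker` (not part of this MR); the global object is passed in so the helper can be exercised outside the browser with a stand-in:

```python
def supports_directory_picker(window) -> bool:
    """Return True when the given global object exposes showDirectoryPicker."""
    return hasattr(window, "showDirectoryPicker")


class FakeWindow:
    """Stand-in for pyscript.window, used only to try the helper locally."""

    def showDirectoryPicker(self):
        raise NotImplementedError("only the attribute's presence matters here")


# In PyScript the call would look like this (not runnable outside the browser):
#   from pyscript import fs, window
#   if supports_directory_picker(window):
#       await fs.mount("/local")
#   else:
#       ...  # fall back to another storage strategy
```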

Changes

  • update polyscript to its latest, where the IDBMap module and its Sync counterpart get exported, as it's the best/easiest way to actually persist data across main and worker threads. The latest version also monkey-patches the MicroPython interpreter, providing the exact same functionality Pyodide provides around mountNativeFS
  • provide a minimalistic, dialog based, transient user activation logic that always runs on the main thread, even if it's a worker asking for persistent files and even if there is no SharedArrayBuffer
  • provide the logic from either main or worker threads to deal with such transient activation once and never again, or at least never until the user clears the cache or erases the IndexedDB associated to that space
  • provide a manual test (this kind of permission is a bit awkward to provide via playwright) that works with and without SharedArrayBuffer, and either on the main or the worker thread
  • allow users to attach events themselves without forcing our own modal approach, and tested that this works too

Checklist

  • I have checked make build works locally.
  • I have created / updated documentation for this change (if applicable).

@WebReflection WebReflection force-pushed the pyscript-fs branch 5 times, most recently from 8a58df8 to 2aabb6e Compare February 13, 2025 10:03
@WebReflection WebReflection force-pushed the pyscript-fs branch 3 times, most recently from 99f7beb to 2a36f3e Compare February 13, 2025 12:19
@WebReflection
Contributor Author

WebReflection commented Feb 13, 2025

OK, while things might (and likely will) change or be discussed, I've published the current state on npm as https://cdn.jsdelivr.net/npm/@pyscript/core@0.6.28/dist/core.js and https://cdn.jsdelivr.net/npm/@pyscript/core@0.6.28/dist/core.css for anyone willing to give it a try out there.

@ntoll I will demo this during today's PyScript Fun Community call but feel free to chime in with thoughts or comments, if you have any. I do like the current state of the MR:

  • it's explicitly experimental with its own caveats and/or limitations
  • it works like a charm when it's usable
  • it's pyodide only but if people are happy with it, and we are happy with it, we can ask MicroPython folks to provide a similar API to hook a dir handler within Emscripten FS the same way pyodide did already

@WebReflection
Contributor Author

WebReflection commented Feb 13, 2025

FYI I've pinged @dpgeorge in Discord around this issue as I think some work needs to be addressed on the MicroPython side of affairs and we all would love to have this API work in there too without further thinking from our users.

@WebReflection
Contributor Author

Update: this now works in MicroPython too, through the exact same logic Pyodide implemented ... it's up and running as 0.6.28 via CDN

@ntoll
Member

ntoll commented Feb 15, 2025

This is outstanding work. Bravo.

Some thoughts:

  • Is it possible to UNmount..? Just thinking about the symmetry of the API. While I understand that most folks, most of the time probably won't need it... I imagine that someone may find it helpful.
  • As @fpliger mentioned in the call, this feels like a great case for a Pythonic context manager. Used something like this:
from pyscript import fs

async with fs.MountPoint("foo"):
    ...  # do stuff

# At this point, now out of scope of the `MountPoint` context handler, the fs is synced
# and unmounted.

An example implementation (off the top of my head) might be:

# Somewhere in the fs module...


class MountPoint:
    """
    An async context handler for working with a mounted local filesystem. Automatically
    syncs and unmounts the referenced mount-point once out of scope.
    """

    def __init__(self, path, mode="readwrite", root="", id="pyscript"):
        """
        Expects exactly the same args as the fs.mount function.
        """
        self.path = path
        self.mode = mode
        self.root = root
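        self.id = id

    async def __aenter__(self):
        # Keyword arguments avoid depending on the positional order of mount's parameters.
        await mount(self.path, mode=self.mode, root=self.root, id=self.id)
        return self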

    async def __aexit__(self, *args):
        await sync(self.path)
        # If unmount is also implemented
        await unmount(self.path)

I actually prefer your (@WebReflection's) more explicit way of working with the fs... but Pythonistas (like @fpliger) would really appreciate the expected "Pythonic" idiom of context handlers. This feels like a classic PyScript case of trying to integrate aspects of both the browser and Python in a coherent and aesthetic manner.

Once again, really great work. Bravo.

@WebReflection
Contributor Author

WebReflection commented Feb 15, 2025

@ntoll what does unmount mean?

  • it removes the stored handler so that every single time one has to re-pick/choose/authorize the local folder? imho, that's bad DX
  • it removes the mounted folder from the MEMFS so that the rest of the logic can't reach it? current Pyodide doesn't provide that and I think nobody needs that too ... the mounted point that is not reachable anymore can also cause tons of thrown errors in case another part of the logic mounts the same FS and no context manager can take over
  • it does nothing for the time being? (that's doable)

I would like to understand use cases for unmounting because Pyodide didn't consider it and I could not think of any use case for desiring that operation plus it's problematic and it brings nothing to the plate (unless use cases are explicit indeed).

If we remove the handler from IndexedDB, all my effort to make it work the wonderful way it does now through Workers, with or without SharedArrayBuffer, would be in vain, and that's unfortunate because we have an opportunity to provide something nobody has done before, even on the JS side of affairs.

Last, but not least, having that with statement suddenly showing a modal which is mandatory to have access to the folder would be extremely awkward code-flow experience to me while right now it's clear when you can mount or where and how that operation works.

All demos around this feature in Pyodide are also based on my approach for a reason: it cannot be seamlessly integrated like a context manager would, so your sketched class idea breaks in unexpected ways if operations are not performed after the modal granted access or after a click handler (transient explicit user event) has been triggered.

So, as soon as all these questions are answered by you or @fpliger I can say if it's worth it or not ... so far I think what's offered is exactly what our users explicitly stated (in Discord) would be awesome to have and it's extremely simple (imho) to reason about.

edit on top of all this, a transient user action is not something you want to spread around your logic ... it's something by Web design and specs you should ask once and never again ... having that randomly asked every single time that mounted file is needed, in case we want to unmount and erase the access right each time, feels counter productive and some added friction for no reason ... pythonista or not, here we're still dealing with Web related security flows, not just ideal APIs we could have ignoring these Web related constraints so we should really be careful here in changing the current approach, imho, but happy to work on the best outcome we can and still I would like to understand use cases or concrete examples that are better, thanks.

@WebReflection
Contributor Author

WebReflection commented Feb 15, 2025

To expand further, please read pyodide mount ability and realize there is not a single unmount operation in there.

On top of that, with the current logic provided by this PR it means that no Python code needs to know, learn, or understand how the underlying path has been mounted or provided, it's fully transparent, meaning that if matplotlib based library would like to read or write files it just does that in the Pythonic way it knows already and anyone can pass a persistent file.

Requiring a different code/logic/class no other Python env knows or understand out there means breaking compatibility and portability across already working code.

Here one could just add a single mount on top of the logic without changing anything else around the code ... and that would work out of a persistent folder.

Providing a new API nobody (so far) asked for, would require changes all over previously working code written in pure Python.

That being said, calling sync per each mounted operation is also deadly slow, so what we can do there is wait for the time to be right, when Emscripten will implement its own auto sync FS feature which is still not there yet, so that we won't diverge in logic and capabilities from what Emscripten, hence MicroPython and Pyodide, will provide too.

@ntoll
Member

ntoll commented Feb 15, 2025

Hey hey @WebReflection - thanks for the feedback. I was, as I know you realise, thinking out loud given the Pythonic with context handling @fpliger mentioned in the meeting.

However, I'd failed to take into account the transient activation - which means using with feels clunky IMHO.

Regarding unmount. 😉 OK... I had a "thought experiment" use case in mind. As always feedback most welcome!

Say you have some sort of data science tool delivered via PyScript. Because of size of download the datasets to be processed by the tool are to be loaded from the user's local filesystem. I imagine that different datasets for such analysis may be in different folders on the user's filesystem, and so being able to unmount and remount different local folders to the same location in the browser's virtual FS might be useful should you wish to change datasets being analysed by the tool. The same probably also applies for LLM models too - different models mounted to the same virtual location for use with a PyScript delivered LLM library would be rather useful. Also, this may apply to spreadsheet data for PySheets? @laffra, is this just a silly thought experiment or would this be useful for you?

However, I TOTALLY get this is a thought experiment - and I'm mentioning it here merely to promote a discussion. I wonder what @fpliger thinks?

In any case, as a first step, the totality of this PR is magnificent. Perhaps we can keep this open for feedback from Fabio, but then I say merge and release asap..? This is a big deal, as the reaction on discord shows.

@laffra

laffra commented Feb 15, 2025

@ntoll With access to the local filesystem, PySheets can load local Excel or CSV files, process them in Pyodide in the browser, and then save the result in JSON, CSV, PDF, HTML, or PNG. This would make PySheets a lot more practical as a research tool.

@WebReflection
Contributor Author

WebReflection commented Feb 15, 2025

being able to unmount and remount different local folders to the same location in the browser's virtual FS might be useful

that's not possible though so it will be rather confusing.


edit amend, there is a possible workaround to mount different folders over the same path, see #2289 (comment) ... we still need to dig further to see if that's beneficial or practical, but there is hope with a great default and an even greater extra parameter to have one Python reachable folder able to reach different paths (one per time, of course) !


The stored handler with previously granted rights works only for the selected folder ... the name you give or use in Python code for that folder is irrelevant to the Web FileSystem Directory Picker API, so your idea of swapping folders at the Python level out of a granted access fails in practice due to the way things work ... on the other hand, you have a perfect use case to keep the current API exactly as it is, assuming you grant access to your Downloads/data folder and in there you have multiple sub-folders you can crawl whenever you need ... mount that as /data ( or ./data ) and work seamlessly like you would locally.

These are the reasons I wanted this landed, because there are technical limitations we need to consider:

  • the Web specification allows granted access only to one folder per time ... we can mount multiple folders too but those will be multiple modals or clicks ... if you could select multiple folders and grant access simultaneously in one go to all of them, the story would be different and having multiple targets behind the same authorization would work ... I am afraid that's not the case though
  • the Pyodide implementation doesn't allow unmounting, and it throws out of the box if the target folder was already mounted or not empty
    • this is why the mounting path is bigger than a reference with a ref.syncfs() ability ... the path matters more than any reference
    • swapping multiple handlers for the same path on the OS/System is not allowed and AFAIK there is no way to know which folder you asked for on the JS side, due to security concerns around revealing undesired paths on the OS once any folder has been granted ... that detail is kept behind the browser, multiple modals can be stored but if you ask multiple times for the same folder, or a different one every single time, we don't get to know or disambiguate that ... meaning also that using the same path for multiple different folders in your system leads to errors and confusion on the browser, IndexedDB, and users' expectations sides
  • Pyodide uses Emscripten current FS abilities mostly as is ... and that might change once Emscripten finishes WasmFS and/or once it implements an autoPersist option while mounting, so that we will all benefit from that, including MicroPython. It's a risk to move away from "native" Emscripten capabilities, and due to Web constraints we need to understand what we want to land with the best possible DX, not the most awkward one because we had a great idea (contextual with) which is unfortunately impractical

Last, but not least, it's true that data could be huge and the current Pyodide implementation (hence its MicroPython port) uses MEMFS, so if you mount 5 folders with 10GB of models each, stuff won't work ... however, there's literally nothing we can do (right now) about it, we need to wait for WasmFS (it did land already) to be complete enough to be adopted by Pyodide ... we'll follow up after with MicroPython and that should give us "infinite access without RAM constraints", unlike the RAM-bound situation we currently have.

Perhaps we can keep this open for feedback from Fabio, but then I say merge and release asap..? This is a big deal, as the reaction on discord shows.

It is a big deal but most importantly there's really no other way around ... we're delivering the best approach Pyodide and Emscripten offer to date to deal with that API while other parts are still moving (WasmFS) so that primitives are all there and it's through those primitive we can eventually refine, improve, or offer a contextual with class in the near future but what works in here will remain for a while out there as "the only way" due constraints around the stack and feature which is also experimental on the Web, on Pyodide, and so on.

To clarify, I never meant to say "this is our new fs namespace and it's frozen", this MR was meant to rather state "we're working on edge experimental capabilities as soon as these are available to provide most basic bricks to build even more awesome flows on top of that right after" which I hope we share as a sentiment in the PyScript team and I hope @fpliger agrees on that too.

@WebReflection
Contributor Author

@laffra that's already possible with this MR and the currently published npm module: you select the folder with CSV files, you work with those without needing to list all of them in the config and you can edit, save, transform, and save again, in that very same folder after (or any subfolder, you are allowed to use any Python operation over its filesystem starting from that root folder). What else would you need, considering limitations I've previously explained so that it has to be something doable, not just ideal to have? Thanks!
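
As an illustration of that point, the processing code can stay plain Python with no knowledge of how the folder was provided. A minimal sketch, assuming a hypothetical `csv_to_json` helper and a folder of small CSV files; only the commented lines at the end touch `pyscript.fs`:

```python
import csv
import json
from pathlib import Path


def csv_to_json(folder: str) -> list[str]:
    """Write a .json file next to every .csv file found directly in `folder`."""
    written = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        with csv_path.open(newline="") as source:
            rows = list(csv.DictReader(source))
        json_path = csv_path.with_suffix(".json")
        json_path.write_text(json.dumps(rows, indent=2))
        written.append(json_path.name)
    return written


# In PyScript the same function would run against the mounted folder
# (not runnable outside the browser):
#   from pyscript import fs
#   await fs.mount("/local")
#   csv_to_json("/local")
#   await fs.sync("/local")  # persist the new .json files to the local folder
```

The function is identical whether `folder` is a regular directory or a mounted one, which is the portability argument made earlier in this thread.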

@WebReflection
Contributor Author

WebReflection commented Feb 15, 2025

@ntoll just because I've looked at the code (then I'll be likely off 'til Monday) the only possibility to mount multiple handlers into the same path is to use an id property (native in browsers but limited in possible chars), or a name one so that one could:

fs.mount("/same-path", name="project2")

that would ask for a new transient activation flow in case /same-path@project2 (note: this is the IndexedDB key, not the Python path, the Python path will still be /same-path) was not previously required, allowing users to swap folders seamlessly, but bear in mind each different name, when provided, will require a new transient action flow if it wasn't known already ... would this satisfy at least your request around multiple folders, same entry point on the VFS ???

This is doable right now where the default name can be just an empty string and it feels like something extra we can provide.
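
To make the key idea concrete, here is a minimal sketch. Both function names are hypothetical, and the behaviour for an empty name (falling back to the bare path) is an assumption, since the comment above only shows the `/same-path@project2` form:

```python
def handler_key(path: str, name: str = "") -> str:
    """Build the storage key for a mounted path, e.g. "/same-path@project2".

    Assumption: an empty name maps to the bare path.
    """
    return f"{path}@{name}" if name else path


def needs_activation(known_keys: set[str], path: str, name: str = "") -> bool:
    """A new transient activation flow is needed only for keys never stored before."""
    return handler_key(path, name) not in known_keys
```

With `known_keys = {"/same-path"}`, asking for `name="project2"` would require a new activation, while repeating the default mount would not.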

However the fact we don't get to unmount it remains, but I can try to figure out out of the Emscripten filesystem if that's a possibility so we'll have that extra utility with actually a meaning, without revoking access to that previously mounted folder (the directory handler is not so heavy in terms of used bytes but it's gold in terms of potentials).

Could this be a compromise that let us all move forward happier than now? I have no objection around this compromise but happy to hear back from you if that feels reasonable and it is a decent DX too (which I think it is, it just requires extra documentation around this part of the API).

If you are OK, on Monday:

  • I will investigate direct Emscripten unmount capabilities (with freed RAM too or it's pointless)
  • I will check that capability works in both Pyodide and MicroPython
  • if that's the case, I will provide a way to map an access right to not just a path name, also a "version" or "name" for that specific path, so that multiple paths can transparently result into same path for the code running around that logic

Last thing is that eventually I would like to also investigate if an open(path, "w" | "wb") could be hacked internally to sync automatically once any content has been written and, if possible, put that behind a flag we can drop/remove once Emscripten auto sync happens, but I don't find that particular extra magic a show-stopper right now and it's also possible we can't really hook ourselves that much into the FS ... all speculative though, looking forward to hearing your thoughts around that.


edit on a late second thought, this would allow PyScript to read from a folder and store into another through the same path ... which I think is an extra awesome capability ... I hope my findings on Monday will make me scream Hooooraaay!

@WebReflection
Contributor Author

this is a quick one for @fpliger: if my latest idea works I think it's superior to the contextual with because it will work with any open operation out there, not just those reaching out to our API, so that portability and seamless integration would be hopefully preferred for both new code that will land and existing code where all logic around previous Python regular FS operations will "just work". Of course the syncing part needs to be improved (not just by us, it's Emscripten, pyodide, then MicroPython) but once that's done we can offer a great opportunity to work from local files, reuse the same path out of multiple entry points with granted access, and require zero extra knowledge from whoever writes Python or uses PyScript around those topics + legacy code that was reaching static files or not using our API can work out of this new approach based, behind the scenes, on our FileSystem API.

That's the summary, if things go smoothly, and I hope to hear you OK around this summary 👋

@fpliger
Contributor

fpliger commented Feb 15, 2025

Excellent work @WebReflection ! This is really awesome.

Honestly, I think there's a bit of disconnect between possible use cases and flows on how this can be used....

IIRC, the context manager case was to have a pythonic way to make sure .sync() is called after N files operations, and not to unmount. I'd expect that once you mount a path, you'd want to execute multiple operations and leave it mounted for a while. If pyodide doesn't even provide a way to unmount, well, that's a non starter anyway :)

I do think that having a context manager that calls sync after a block of operations is a more pythonic way to manage a transactional group of ops. I don't think it's a blocker for this PR though. I think we can add this feature already and discuss/add a context manager as a quick follow up, if we are all in agreement.

I'm +1 on merging and discussing an additional follow up :)
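
A sync-only context manager of the kind described here might be sketched as follows. The class name `Synced` is hypothetical and nothing like it is part of this PR; the `sync` callable is injected so the sketch runs outside the browser, where `pyscript.fs.sync` would be passed instead:

```python
import asyncio


class Synced:
    """Async context manager that calls `sync(path)` once the block exits."""

    def __init__(self, path, sync):
        self.path = path
        self._sync = sync  # in PyScript this would be pyscript.fs.sync

    async def __aenter__(self):
        return self.path

    async def __aexit__(self, exc_type, exc, tb):
        # Design choice: sync even if the block raised, so completed writes persist.
        await self._sync(self.path)
        return False  # never swallow exceptions from the block


async def demo():
    calls = []

    async def fake_sync(path):
        calls.append(path)

    async with Synced("/local", fake_sync) as path:
        pass  # several file operations under `path` would go here
    return calls
```

The mount itself stays outside the block (for example in a click handler), so the transient activation concern raised earlier does not apply to this variant.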

@WebReflection
Contributor Author

@fpliger awesome, then we all agree ... my idea is that sync after operations should not be a concern, and I believe we, pyodide folks, and Emscripten folks too are all aligned ... once that's not needed anymore, we can decide what to do but until then, we can surely offer a better mechanism in the near future out of the very same pyscript.fs module.

Like I've said, ideally any w or wb or a operation that is meant to mutate a file should automatically persist such mutation after, with or without our helpers, so the goal here is to move toward that day when the stack will offer that and be aligned around it, and extra helpers will just keep working like they did before, as those will be all about abstracting!

Updates on Monday, or after, you all enjoy your weekend 👋

Member

@ntoll ntoll left a comment

:shipit:

  * verified that RAM gets freed
  * allowed to mount different handlers within the same path through different `id` as that's the Web best way to do so
@WebReflection
Contributor Author

WebReflection commented Feb 17, 2025

Thanks @ntoll for approval + I have added other changes and here are both changes and findings:

  • the current MEMFS implementation has a maximum of 4GB of data ... this must be mentioned in related docs because when that buffer overflows, the current Pyodide or Emscripten sync logic results in the folder content being erased if the access right was readwrite instead of just read
  • the current mount is provided by the chosen path to use via Python code and an optional id option as specified by current Web specifications. The id must have just a-zA-Z0-9 chars and it also helps browsers to remember where that folder was when selected, so that in the future it would just propose that folder, assuming the same id has been used. This allows users to mount different folders on the same path, as long as there is only one mounted path at a time (or both Pyodide and MicroPython fail at mounting more in there)
  • the current unmount (via path) automatically syncs and unmounts it, so that either async call would not result in lost data
  • I have tested and fixed all worker/sab/no-worker/no-sab scenarios and now it always produces the correct error if no granted authorization is provided
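
The id constraint above can be checked before calling mount. A minimal sketch, assuming a hypothetical `is_valid_mount_id` helper that applies only the rule as stated in this comment (a-zA-Z0-9); browsers may enforce additional or different limits:

```python
import re

# Only the character rule stated above; an empty id is rejected as well.
_ID_PATTERN = re.compile(r"[a-zA-Z0-9]+")


def is_valid_mount_id(value: str) -> bool:
    """Return True when `value` contains only ASCII letters and digits."""
    return _ID_PATTERN.fullmatch(value) is not None
```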

Quick examples of the current basic API:

from pyscript import fs

# ask once for permission to mount any local folder
# into the Virtual FileSystem handled by Pyodide/MicroPython
await fs.mount("/local")

# if changes were made, ensure these are persistent in the local system folder
await fs.sync("/local")

# if needed to free RAM or that specific path, sync and unmount
await fs.unmount("/local")

variants

# mount a local folder specifying a different handler
# this requires user explicit transient action (once)
await fs.mount("/local", id="v1")
# ... operate on that folder ...
await fs.unmount("/local")


# mount a local folder specifying a different handler
# this requires user explicit transient action (once)
await fs.mount("/local", id="v2")
# ... operate on that folder ...
await fs.unmount("/local")

# go back to the original handler or previous one
# no transient action required now
await fs.mount("/local", id="v1")
# ... operate again on that folder ...
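
The unmount-then-mount sequence in these variants can be wrapped in a small helper. A minimal sketch, assuming a hypothetical `switch_folder` function and that `path` is currently mounted; the `fs` object is passed in, and `FakeFS` exists only to exercise the helper outside the browser:

```python
import asyncio


async def switch_folder(fs, path, id):
    """Unmount `path` (which also syncs it) and mount the handler stored under `id`."""
    await fs.unmount(path)
    await fs.mount(path, id=id)


class FakeFS:
    """Stand-in for pyscript.fs that records calls instead of touching a filesystem."""

    def __init__(self):
        self.calls = []

    async def mount(self, path, mode="readwrite", id="pyscript", root=""):
        self.calls.append(("mount", path, id))

    async def unmount(self, path):
        self.calls.append(("unmount", path))
```

In PyScript the call would be `await switch_folder(fs, "/local", "v2")` with `from pyscript import fs`.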

The root string field is to hint the browser where to start picking the path that should be mounted in Python: desktop, documents, downloads, music, pictures or videos are the currently available root hints as per Web specifications.

The mode field is by default readwrite but it could be forced to read only, as per Web specifications.

This is the entirety of the current mount(path, mode="readwrite", id="pyscript", root=""), while sync(path) and unmount(path) currently accept only the path used without needing a reference so that it is possible to be sure unmount and mount operations can be performed cross tab, page, domain, worker, forgetting about multiple references to orchestrate.
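
The mode and root values listed above can be validated up front. A minimal sketch, assuming a hypothetical `check_mount_options` helper; the accepted values are only those named in this comment, and treating the empty string as "no hint" follows the `root=""` default:

```python
# Values as listed in the comment above; "" means no root hint.
ROOT_HINTS = {"", "desktop", "documents", "downloads", "music", "pictures", "videos"}
MODES = {"read", "readwrite"}


def check_mount_options(mode="readwrite", root=""):
    """Return the options as a dict, or raise ValueError for an unknown value."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if root not in ROOT_HINTS:
        raise ValueError(f"unknown root hint: {root!r}")
    return {"mode": mode, "root": root}
```

The returned dict could then be expanded into the call, as in `await fs.mount("/local", **check_mount_options("read", "downloads"))`.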

Last, but not least, we should declare in our docs that fs.mount(...) implicitly uses a minimalistic dialog to provide the transient user activation flow, but that if it's directly invoked while another transient user action is ongoing (for example an onclick event through its async handler) such dialog won't bother or show up, so it's up to users to decide whether to use our minimalistic fallback or not.

Happy to expand or clarify further but right now I am super happy about the current state and, like we stated already, we can also improve later on with ease.

@WebReflection WebReflection merged commit 0366e48 into pyscript:main Feb 17, 2025
2 checks passed
WebReflection added a commit to WebReflection/pyscript that referenced this pull request Feb 17, 2025
* introducing pyscript.fs namespace/module

* Added proper rejection when showDirectoryPicker is not supported

* Improved exports to make explicit import in 3rd party modules easier

* implemented `fs.unmount(path)`:

  * verified that RAM gets freed
  * allowed to mount different handlers within the same path through different `id` as that's the Web best way to do so
WebReflection added a commit to WebReflection/pyscript that referenced this pull request Feb 18, 2025
* introducing pyscript.fs namespace/module

* Added proper rejection when showDirectoryPicker is not supported

* Improved exports to make explicit import in 3rd party modules easier

* implemented `fs.unmount(path)`:

  * verified that RAM gets freed
  * allowed to mount different handlers within the same path through different `id` as that's the Web best way to do so

4 participants