Skip to content

File system problems and soft reboot hanging #1643

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
kevinjwalters opened this issue Mar 13, 2019 · 20 comments · Fixed by #1649
Closed

File system problems and soft reboot hanging #1643

kevinjwalters opened this issue Mar 13, 2019 · 20 comments · Fixed by #1649

Comments

@kevinjwalters
Copy link

kevinjwalters commented Mar 13, 2019

I was using a variant of CircuitPython 4.0.0 beta 3 (4.0.0-beta.3-25-g2169a6240-dirty) which initially looked good and then started seriously misbehaving with a lot of things disappearing from the file system and it failing to soft reboot, i.e. control-D from the REPL over serial over USB.

The first thing I noticed from a Win8.1 host was that the application on the Circuit Playground Express board I was using (it reads MIDI over USB and changes PWMOut properties) which had worked earlier no longer worked. I then wanted to check its health so tried to connect over COM9 but surprisingly that would not let me connect. I thought I'd check the CIRCUITPY drive associated with it and that was misbehaving too.

The following is from my notes which I initially put into comments in #1626.

The drive (on Windows) was no longer showing up with the typical spaced used but just marked as FAT and opening it in explorer showed it as empty. It responded to reset button but that's left me with an CIRCUITPY drive that's missing my code! I've only got some wavs, a README, .fseventsd and .Trashes. I did a power cycle and file system still looks the same, no sign of my code.py or massive lib directory etc.

I think we can rule out user error here because the whole lib directory has gone AWOL and that takes ages to delete so I would have noticed if I'd done this accidentally.

After that the wavs disappeared and a .metadata_never_index has appeared. (Minor chance this is related to Patch Tuesday but that's even more frightening.)

My boot_out currently has the contents:

Adafruit CircuitPython 4.0.0-beta.3-25-g2169a6240-dirty on 2019-03-12; Adafruit CircuitPlayground Express with samd21g18

but earlier it had a large hex number with some hypens with braces on either end.

I've also noted that if i leave the CPX doing not much in REPL (I had a for loop printing the time every 10 seconds) then if I control-c that ten minute later then control-D for a soft reset the soft reset doesn't complete.

On going back to 4.0.0 beta 3, this popped out on the serial console (over USB):

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.

You are running in safe mode which means something unanticipated happened.
Looks like our core CircuitPython code crashed hard. Whoops!
Please file an issue at https://github.com/adafruit/circuitpython/issues
 with the contents of your CIRCUITPY drive and this message:
Crash into the HardFault_Handler.

Press any key to enter the REPL. Use CTRL-D to reload.
Adafruit CircuitPython 4.0.0-beta.3 on 2019-03-07; Adafruit CircuitPlayground Express with samd21g18
>>>

boot_out currently contains:

Adafruit CircuitPython 4.0.0-beta.3-25-g2169a6240-dirty on 2019-03-12; Adafruit CircuitPlayground Express with samd21g18

and has a date of 1-Jan-2000 which is a bit odd (date may not be off, these things dont have battery backed clocks on them, of course). I'm going to storage.erase_filesystem() to ensure I've got a clean slate.

@kevinjwalters kevinjwalters changed the title File system problems and soft reboot failing File system problems and soft reboot hanging (on CPX) Mar 13, 2019
@uhrheber
Copy link

This sounds familiar. may be linked to #1627.

@kevinjwalters
Copy link
Author

kevinjwalters commented Mar 14, 2019

Does sound similar but you mention "virtual drive fills up" in the ticket's title. I think people will read that as problems occur if and only if the CIRCUITPY is filled up, is that what you mean?

In my case I don't believe disk fills were part of cause as I have a lot of spare space on CPX. I have been doing a certain amout of power off / power on and have not been diligent about unmounting the file system. I've caused a mess with Windows delayed writes in the past (https://forums.adafruit.com/viewtopic.php?f=52&t=125927 and #111 ) but that wasn't as bad or as weird as the mess I had the other day.

@dhalbert dhalbert added this to the 4.0.0 - Bluetooth milestone Mar 14, 2019
@dhalbert dhalbert changed the title File system problems and soft reboot hanging (on CPX) File system problems and soft reboot hanging Mar 15, 2019
@jerryneedell
Copy link
Collaborator

Some additional observations:
The symptom I am running into with the PyPortal FS is that I am running some tests of new code (fixing TileGrid calls) and if the it often fails with at TypeError if I have not updated it yet. When I copy over the new code and do a control-D to reboot, - it display "soft reboot" then hangs. If it press the RESET button it comes up in Safe mode (HardFault) and the FS is hosed. I erase and reload and it works OK, but this has happened several times.
Not sure if the initial error is part of the issue or not.

@dhalbert
Copy link
Collaborator

I can confirm I saw something similar to what @jerryneedell saw. Not sure if it was on a CPX or a Metro M4 (I was using both today). I had just plugged it in.

@siddacious
Copy link

I also saw something similar; I believe it was on my feather m4. I'll do some more testing this weekend.

@uhrheber
Copy link

@kevinjwalters What I wanted to say is, the fuller the drive gets, the more likely those errors appear.
With just a single main.py file, and maybe even a library or two, the code may run for some time without an error (or not), but if you add additional files (may be random test data), the error becomes more likely.

@uhrheber
Copy link

uhrheber commented Mar 15, 2019

@kevinjwalters What I wanted to say is, the fuller the drive gets, the more likely those errors appear.
With just a single main.py file, and maybe even a library or two, the code may run for some time without an error (or not), but if you add additional files (may be random test data), the error becomes more likely.

To repeat this behaviour:

  • Put a single main.py file on the drive, that does a simple task, like blinking a LED, and/or printing something to the REPL.
  • Let it run
  • Copy some random files to the drive (maybe python libs), until it is nearly full
  • Watch the behaviour of your program, it may crash sooner or later
  • If it still runs, try unplugging and replugging the board several times. Sooner or later, the program will crash, and/or the drive becomes read only and/or some files disappear and/or the whole disk gets wiped.

@kevinjwalters
Copy link
Author

Based on @uhrheber comments it suggests the file system code is one canidate here, has it changed? Also interesting to note that it doesn't appear to depend on what the application is doing as such a trivial tiny main.py was used.

Does the whole code base go through static code/dynamic analysis checks for buffer overflows and the like? I'm just wondering if this could be some bug that's been around for a while but one hasn't trashed anything important in the past.

I've been problem free over last few days on 4.0.0-beta.3 but given weirdness of this I could just be lucky in some way.

@uhrheber
Copy link

I tested all official releases since 4.0.0 Beta 1, and they all behave like that.
I think that the older releases are somewhat more stable, but this might as well be coincidence.

The thing is, it's somewhat of a hit and miss, I have one pca10059 dongle, that I use as a USB password typer, and it uses both the USB HID and the TouchIO library. It was originally on Beta 1, and I later updated it to Beta 2. I'm using it on my PC since about 2 months, plugging it several times a day, and it runs rock stable.

When I put code on this very stick, that uses BLE, the drive gets wiped after a few plug cycles.
When I put the USB password typer code on it again, it keeps running stable forever.

Weird.

@dhalbert
Copy link
Collaborator

@uhrheber Thanks for that observation. That's interesting, because it's on nrf as well as atmel-samd, and, because BLE does tasks at a higher priority than anything else. The combination suggests that the problem may be in the common code, not the chip-specific code, and that it's some kind of timing or interrupt problem.

@uhrheber
Copy link

uhrheber commented Mar 15, 2019

If you want to test it yourself, here's my code (needs the libs adafruit_ble and adafruit_bluefruit_connect).
It can be used with the Bluefruit app, to use the colourpicker to change the colour of the RGB LED on the pca10059 dongle.

import board

from adafruit_ble.uart import UARTServer
import board, digitalio, pulseio
from adafruit_bluefruit_connect.packet import Packet
from adafruit_bluefruit_connect.color_packet import ColorPacket
from time import sleep

ledr = pulseio.PWMOut(board.LED2_R, frequency=5000, duty_cycle=65535)
ledg = pulseio.PWMOut(board.LED2_G, frequency=5000, duty_cycle=65535)
ledb = pulseio.PWMOut(board.LED2_B, frequency=5000, duty_cycle=65535)

ledy = digitalio.DigitalInOut(board.LED1)
ledy.direction = digitalio.Direction.OUTPUT
ledy.value = True

uart_server = UARTServer()

def rgbled(colors):
    ledr.duty_cycle = (255 - colors[0]) << 8
    ledg.duty_cycle = (255 - colors[1]) << 8
    ledb.duty_cycle = (255 - colors[2]) << 8

while True:
    # Advertise when not connected.
    uart_server.start_advertising()
    ledy.value = False
    while not uart_server.connected:
        pass

    while uart_server.connected:
        ledy.value = True
        packet = Packet.from_stream(uart_server)
        if isinstance(packet, ColorPacket):
            print(packet.color) 
            rgbled(packet.color) 

@uhrheber
Copy link

What I also observed:
Sometimes code runs perfectly, but when you add/modify code (or even a comment), it starts failing.

@jerryneedell
Copy link
Collaborator

Another observation:
I have not had a FS failure if I first eject CircuitPY then do a hard RESET (press RESET button)
It fails frequently if I do a Soft Reboot (Control -D)
all on PyPortal

@makermelissa
Copy link
Collaborator

An observation I’ve noticed that may or may not be related is sometimes just going into a folder in finder causes the PyPortal to reset. I’ve not had corruption yet. I had noticed similar symptoms on other boards previously, but it seems to be fine now. I’ve only noticed this lately on PyPortal.

@kevinjwalters
Copy link
Author

In the #1639 comments there's mention of a change in a USB library for 4.x. Could that be doing something different at soft reboot time compared to the previous releases? Given that writing files to the CPX triggers automatic reboots if that new USB stack does something different which ends up preventing the final writes (#111) to the file system from being applied that might explain this?

@dhalbert
Copy link
Collaborator

Yes, this is definitely the kind of thing we'd be looking at, but also whether filesystem operations before that were not completed or were incorrect.

@makermelissa
Copy link
Collaborator

makermelissa commented Mar 16, 2019

I'm definitely noticing File System problems much more on a Feather M4 Express with CircuitPython 4.0-Beta4. The RGB LED goes red and it can't read files. Resetting puts it in safe mode. I had a file disappear as well and had to recopy.

@jerryneedell
Copy link
Collaborator

jerryneedell commented Mar 16, 2019

On my feather_nrf52840_express with beta4 doing a Soft Reboot after copying a file to the board results in a disconnect of the screen session followed by a boot to safe mode (HardFaultHandler).
Resettng remounts CIRCUITPY, but all files have been erased. Only boot_out.txt remains.

This is reproducible:
Boot -- open screen session
copy a file to CIRCUITPY
issuescontrol-D
Screen disconnects - board boots to Safe Mode
RESET board
Files system mounts but it is empty (except for boot_out.txt)

If I instead always Eject the CIRCUITPY Drive and then RESET the board - is seems to be OK

@dhalbert
Copy link
Collaborator

dhalbert commented Mar 16, 2019

I have reproduced @jerryneedell 's crash on Feather 840 with a simple main.py just printing out something every second, and then I do touch /media/halbert/CIRCUITPY/foo.txt;sync.

Stack is messed up:

121	    reset_into_safe_mode(HARD_CRASH);
(gdb) bt
#0  HardFault_Handler () at supervisor/port.c:121
#1  <signal handler called>
#2  0x00027e42 in gc_free (ptr=<optimized out>) at ../../py/gc.c:671
#3  0x08029000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

@dhalbert
Copy link
Collaborator

Scenario is:

  1. run_code_py() in main.c is called the first time. filesystem_flush() is called, which calls spi_flash_flush_keep_cache() indirectly. It does not call flush_ram_cache() in external_flash.c because there's no RAM cache.
  2. I write a file in CIRCUITPY.
  3. Reload is requested, and happens. run_code_py() gets run a second time, calls filesystem_flush() again.
  4. This time flush_ram_cache() is called. It gets a hard fault at line 400, calling:
                m_free(MP_STATE_VM(flash_ram_cache)[i * pages_per_block + j]);

Sorry for the blow-by-blow, but I might not be able to finish the debugging session, and want to record what I found so far.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants