extmod/network_wiznet5k.c: Five small bug fixes. #8974

robert-hh · 2022-07-27T19:29:46Z

Five small commits fixing bugs for WIZNET5K boards:

Register the nic with the network module. That was missing, causing the list returned by network.route() to be empty.
Drop and obsolete argument check in wiznet5k_make_new(). That prevented specifying a pin for the interrupt signal.
Deinit the nic before (re-)initialsation. Without that, calling nic.active(True) more than once in a row could lock up the device. If it was not initialized before, the deinit does not hurt.
Schedule clearing of the Interrupt. I was in the IRQ handler, but that caused a problem when a SPI transfer was ongoing at the time of the interrupt, because IRQ clearing starts an SPI transfer itself.
Check for the absence of keyword arguments when the constructor is called with default settings, as suggested by @dpgeorge.

Commits 2 though 4 are done in behalf of @omogenot, who asked me to do so.

Commit 3 effectively adds only one line, but looks larger since the deinit() function was moved avoiding a forward declaration.

The changes were tested by @omogenot on a W5500_EVB_PICO board and by me with a RP2040 pico + W5500 breakout combo. Ping times are ~0.7s for "pinging" the board and ~3 second for ping originated from the board, when interrupts are enabled. The MP test suite for Network shows a good result:

./run-multitests.py -i pyb:a0 multi_net/*

20 tests performed
17 tests passed
3 tests failed: multi_net/ssl_cert_rsa.py multi_net/tcp_client_rst.py multi_net/uasyncio_tcp_client_rst.py

./run-multitests.py -i micropython -i pyb:a0 multi_net/*

20 tests performed
16 tests passed
2 tests skipped: multi_net/tcp_client_rst.py multi_net/uasyncio_tcp_client_rst.py
2 tests failed: multi_net/uasyncio_tcp_readall.py multi_net/udp_data.py

python run-tests.py --target pyboard net_inet/*.py net_hosted/*.py

17 tests performed (115 individual testcases)
15 tests passed
2 tests failed: connect_nonblock_xfer test_tls_nonblock

That was missing, and network.route() returned an empty list.

Drop an obsolete and wrong argument check. The proper checks are done further down in the code.

If nic.active(True) is called several times in a row, the device may lock up. Even if that is bad coding practice, calling wiznet5k_deinit() in wiznet5k_init() prevents the lock.

codecov-commenter · 2022-07-27T19:52:10Z

Codecov Report

Merging #8974 (2fffb18) into master (45ab801) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #8974   +/-   ##
=======================================
  Coverage   98.37%   98.37%           
=======================================
  Files         156      156           
  Lines       20281    20281           
=======================================
  Hits        19951    19951           
  Misses        330      330

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 45ab801...2fffb18. Read the comment docs.

extmod/network_wiznet5k.c

Using keyword arguments raises an error now.

omogenot · 2022-07-28T20:18:31Z

@robert-hh ,

I think I found an issue with the IRQ mode. If the nic is heavily solicited (a web server for instance), I guess it can happen that an IRQ occurs whilst the program was performing an SPI transfer already. Since the IRQ handler is accessing the WIZNET chip as well, there might be some cases where a SPI access conflict could arise => system locked-up.
I propose to move the WIZNET chip SPI access out of the IRQ handler:

the mpy_wiznet_read_int function becomes:

STATIC mp_obj_t mpy_wiznet_read_int(mp_obj_t none_in) {
    (void)none_in;
//    wizchip_clrinterrupt(IK_SOCK_0);
//    setSn_IR(0, Sn_IR_RECV);

    // Handle incoming data
    wiznet5k_try_poll();
    return mp_const_none;
}

and the wiznet5k_poll function becomes:

void wiznet5k_poll(void) {
    wizchip_clrinterrupt(IK_SOCK_0);    // setSn_IR(0, Sn_IR_RECV); is already done in this function call
    wiznet5k_obj_t *self = &wiznet5k_obj;
    if (!(self->netif.flags & NETIF_FLAG_UP) ||
        !(self->netif.flags & NETIF_FLAG_LINK_UP)) {
        return;
    }
    uint16_t len;
    while ((len = wiznet5k_recv_ethernet(self)) > 0) {
        if (self->trace_flags & TRACE_ETH_RX) {
            netutils_ethernet_trace(MP_PYTHON_PRINTER, len, self->eth_frame, NETUTILS_TRACE_NEWLINE);
        }
        struct pbuf *p = pbuf_alloc(PBUF_RAW, len, PBUF_POOL);
        if (p != NULL) {
            pbuf_take(p, self->eth_frame, len);
            if (self->netif.input(p, &self->netif) != ERR_OK) {
                pbuf_free(p);
            }
        }
    }
}

With these fixes, the web server seems to work without lock-ups. To be continued. I have a test running 24/7 to verify that.

robert-hh · 2022-07-28T21:01:44Z

That change looks good. 100MByte bulk transfer w/o problems. I'll run a 10GByte test over night. Transfer speed ~400kByte/s. Will take ~7 hours

robert-hh · 2022-07-29T06:13:06Z

The 10GByte test succeeded fine. All data was transferred, no lock-up. So the change improves the reliability substantially.

omogenot · 2022-07-29T06:23:44Z

Cool ! I'm glad to hear that.
On my side, the current test is so far successful. The web server is still up and running, at the same time the mDNS responder is working, a ping request to the pico board is running since yesterday without problems, and I have a Modbus TCP client pulling data from the embedded Modbus TCP server on the pico every 500 ms without major issues (some timeouts from time to time as the poor pico cannot handle everything (http request, ftp client, modbus TCP request, etc.) at the same time :-) )

robert-hh · 2022-07-29T06:42:25Z

That's great. The standard test are still running the same way as before (as expected). You should make a PR for that, or I could add that to the current one.

omogenot · 2022-07-29T06:52:33Z

I think that you can add it to the current one as it is more or less related.
In the meantime, I am writing mDNS support when LWIP is used. I added this to WIZNET5K class, but I think it shall be available for all types of nic attached to the LWIP. I had a look at the modnetwork.c file and found strange that the network_wiznet5k.c never calls the mod_network_register_nic function if using LWIP. Is it normal? If not, this would be another 'little' bug that shall be corrected in this PR as well, no?

robert-hh · 2022-07-29T06:59:48Z

Registering the nic with the network module is already part of this PR. It's done in wiznet5k_init().

omogenot · 2022-07-29T07:00:21Z

Sorry, my bad. That's your first commit.

However, I saw that all other nic drivers are doing this in the object new function. Shan't you move it to this function as well rather than in the init to be consistent? (I know I can be a pain...)

robert-hh · 2022-07-29T07:05:21Z

It was already in the init function for the non-LWIP case, so I left it there. init() is called with nic.active(True). So actually it makes already a small difference, in that the route does not exists before the interface is active.

omogenot · 2022-07-29T07:12:44Z

Fine by me then. :-) Good job.

omogenot · 2022-07-29T13:01:40Z

@robert-hh The pain in the neck is back. I kept on making extra tests, and found some cases again where the Ethernet link is dead (without locking up the device). It seemed to be related to timing issues. I therefore moved the clear IRQ AFTER the while wiznet5k_recv_ethernet to avoid false calls when we already read all incoming packets. The function becomes:

void wiznet5k_poll(void) {
    wiznet5k_obj_t *self = &wiznet5k_obj;
    if ((self->netif.flags & (NETIF_FLAG_UP | NETIF_FLAG_LINK_UP)) == (NETIF_FLAG_UP | NETIF_FLAG_LINK_UP)) {
        uint16_t len;
        while ((len = wiznet5k_recv_ethernet(self)) > 0) {
            if (self->trace_flags & TRACE_ETH_RX) {
                netutils_ethernet_trace(MP_PYTHON_PRINTER, len, self->eth_frame, NETUTILS_TRACE_NEWLINE);
            }
            struct pbuf *p = pbuf_alloc(PBUF_RAW, len, PBUF_POOL);
            if (p != NULL) {
                pbuf_take(p, self->eth_frame, len);
                if (self->netif.input(p, &self->netif) != ERR_OK) {
                    pbuf_free(p);
                }
            }
        }
    }
    wizchip_clrinterrupt(IK_SOCK_0);
}

I reversed the flag test so that there is only one call to the wizchip_clrinterrupt function.
I keep on trying, using different timings for the Modbus TCP client, and ping, whilst refreshing randomly the web page sent by the http server. I am terribly sorry to come back with recurrent modification, but it looks like a long shot. I kind of understand why the previous contributor(s) have inhibited the IRQ mode :-)
Thank you for your patience and for bearing with me.

Edit:
Unfortunately, a new block arose on a socket.accept() statement in the http server. I try again, without the IRQ pin to verify that this is what creates the blocking situation or not.

omogenot · 2022-07-29T13:56:01Z

Well, the non IRQ version seems to work OK. I keep on thinking that the problem is due to a concurrent access to the SPI bus. Therefore, I added a test for the CS signal level to enable the call to wiznet5k_try_poll. The function mpy_wiznet_read_int becomes:

STATIC mp_obj_t mpy_wiznet_read_int(mp_obj_t none_in) {
    (void)none_in;
    // Handle incoming data
    if (mp_hal_pin_read(wiznet5k_obj.cs))
        wiznet5k_try_poll();
    return mp_const_none;
}

I pursue my tests and let you know.

robert-hh · 2022-07-29T14:36:45Z

Doesn't the Interrupt have to be cleared at the RP2? Otherwise you would fall back to polling mode after the first interrupt has been ignored.
Edit: I tried your change with my "send burst" test, and it does not slow down the transmission. Looking into the code, the irq is acknowledged before the handler is called.

omogenot · 2022-07-29T15:34:02Z

Doesn't the Interrupt have to be cleared at the RP2? Otherwise you would fall back to polling mode after the first interrupt has been ignored.
Edit: I tried your change with my "send burst" test, and it does not slow down the transmission. Looking into the code, the irq is acknowledged before the handler is called.

We cannot ack the IRQ whilst the SPI bus is used. It will be acknowledged during the next poll anyway. The poll function has a while loop that would read multiple packets, therefore the poll could have read the packet that triggered a new IRQ since last time we acknowledged it, making a new IRQ to be trigger for nothing as the packet has already been read. So I think that the last IRQ shall/can be ignored by acknowledging the IRQ at the end of the poll.
Edit: My current test is still OK. 6500 pings @ about 1ms, 25000 Modbus TCP requests (1 request every 200ms) and many (I did not count since it's manual clicks I do myself from time to time) web page requests.

robert-hh · 2022-07-29T15:52:18Z

That's clear. I was concerned about the Pin IRQ of the rp2040, which has to be acknowledged to enable it again. But that is done in the hardware Pin IRQ interrupt handler before calling mpy_wiznet_read_int() as the registered callback.

omogenot · 2022-07-30T08:37:50Z

Well, after more than 12 hours of test, no lock-ups. 👍
If I have a look at the ping results, I can see a few round times of 64 ms, instead of the regular 1 ms, showing that the IRQ was probably discarded and the polling took place. This tend to show that the CS signal test used as a kind of semaphore to regulate the concurrency SPI bus access is working.

robert-hh · 2022-07-30T08:58:41Z

That looks good. My bulk transfer test works fine. With the bulk transfer running, ping times vary between 0.6 and 5.7 ms. Still pretty good. The IRQ handler now is:

STATIC mp_obj_t mpy_wiznet_read_int(mp_obj_t none_in) {
    (void)none_in;
    // Handle incoming data, unless the SPI bus is busy
    if (mp_hal_pin_read(wiznet5k_obj.cs)) {
        wiznet5k_try_poll();
    }
    return mp_const_none;
}
STATIC MP_DEFINE_CONST_FUN_OBJ_1(mpy_wiznet_read_int_obj, mpy_wiznet_read_int);

Do you still have the call to wizchip_clrinterrupt(IK_SOCK_0); at the start of wiznet5k_poll()? I tried both at the start and the end and see no difference.

omogenot · 2022-07-30T09:33:00Z

Well, I kept it at the end because, in my mind, it makes more sense to clear it after having executed the loop and avoid "false alarms". But you're right the impact is really small (a few nano-seconds). It would require a lot of time to statistically prove it :-).
Even if I prefer my flag test which is done in one test statement instead of two :-). However, it's not a big code optimisation.

robert-hh · 2022-07-30T09:56:50Z

Would you please show me your version of the wiznet5k_poll() function? Besides the flag and style issues (one entry and one exit from a function), a difference could be, whether the interrupt is always cleared or only when the flag test passes.
Edit: Maybe not. because when the netif is not UP, then there are no interrupts to be cleared.

omogenot · 2022-07-30T10:02:13Z

Here you go:

void wiznet5k_poll(void) {
    wiznet5k_obj_t *self = &wiznet5k_obj;
    if ((self->netif.flags & (NETIF_FLAG_UP | NETIF_FLAG_LINK_UP)) == (NETIF_FLAG_UP | NETIF_FLAG_LINK_UP)) {
        uint16_t len;
        while ((len = wiznet5k_recv_ethernet(self)) > 0) {
            if (self->trace_flags & TRACE_ETH_RX) {
                netutils_ethernet_trace(MP_PYTHON_PRINTER, len, self->eth_frame, NETUTILS_TRACE_NEWLINE);
            }
            struct pbuf *p = pbuf_alloc(PBUF_RAW, len, PBUF_POOL);
            if (p != NULL) {
                pbuf_take(p, self->eth_frame, len);
                if (self->netif.input(p, &self->netif) != ERR_OK) {
                    pbuf_free(p);
                }
            }
        }
    }
    wizchip_clrinterrupt(IK_SOCK_0);
}

If we get there, it's either because of the regular polling every 64 ms, or because of an IRQ. If the LINK is not UP we have to clear the IRQ anyway. That was already the case before because the call to the wizchip_clrinterrupt function was done at the very beginning if I remember well.

robert-hh · 2022-07-30T10:22:16Z

Interestingly I see a difference in the ping times depending, on how the flag test is made. All ping'ing is done during the bulk transfer. With the 'old' test, the range is 0.654/1.985/5.734 ms, with the 'new' flag test, it is 0.639/1.857/10.657 ms (all min/avg/max).

@omogenot

Avoiding conflicts between the IRQ and an active transfers. Before the change, the device could lock up in heavy traffic situations. Fix found and code supplied by @omogenot.

To have just one exit and a more compact flag test. This is just a style change without impact to the function.

robert-hh · 2022-07-30T10:33:43Z

The style of wiznet5k_poll() is changed, but in a separate commit, to keep it apart from the bug fix.

omogenot · 2022-07-30T10:55:22Z

Interestingly I see a difference in the ping times depending, on how the flag test is made. All ping'ing is done during the bulk transfer. With the 'old' test, the range is 0.654/1.985/5.734 ms, with the 'new' flag test, it is 0.639/1.857/10.657 ms (all min/avg/max).

My overnight test (with new flag):

66629 packets transmitted, 66629 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.643/1.011/64.404/0.855 ms

robert-hh · 2022-07-30T13:41:05Z

@dpgeorge I think this is now ready to go, after having eliminated a rare lock-up state. These changes should affect the W5100 board as well, but there is nothing directly chip type related in the commits. So it will work, I assume.

omogenot · 2022-07-30T14:09:28Z

@dpgeorge , @robert-hh
If you wish, you could wait until monday evening (French time), I will do the same testing with the W5100S_EVB_PICO board.

robert-hh · 2022-07-30T14:17:06Z

Sure. Nothing easier than that (almost). Thanks for testing it with that board. I considered getting one, but they seem out of stock at the EU based shops, I usually use.

omogenot · 2022-07-31T09:06:04Z

I started the tests with the W5100S board (actually I use a standard pico with a Wiznet Hat that uses a W5100S chip).
There is a bug in the Wiznet library that does not reset the IRQ flag. Reading the W5100S data sheet, the IRQ signal is not reset by writing to the IR register, but by writing to the SN_IR register. I now understand why we had that extra line that was useless for the W5500 ;-)
Therefore I modified the wiznet5k_poll function (again) to add this little patch and make the IRQ mode work.

void wiznet5k_poll(void) {
    wiznet5k_obj_t *self = &wiznet5k_obj;
    if ((self->netif.flags & (NETIF_FLAG_UP | NETIF_FLAG_LINK_UP)) == (NETIF_FLAG_UP | NETIF_FLAG_LINK_UP)) {
        uint16_t len;
        while ((len = wiznet5k_recv_ethernet(self)) > 0) {
            if (self->trace_flags & TRACE_ETH_RX) {
                netutils_ethernet_trace(MP_PYTHON_PRINTER, len, self->eth_frame, NETUTILS_TRACE_NEWLINE);
            }
            struct pbuf *p = pbuf_alloc(PBUF_RAW, len, PBUF_POOL);
            if (p != NULL) {
                pbuf_take(p, self->eth_frame, len);
                if (self->netif.input(p, &self->netif) != ERR_OK) {
                    pbuf_free(p);
                }
            }
        }
    }
    wizchip_clrinterrupt(IK_SOCK_0);
    #if _WIZCHIP_ == W5100S
    setSn_IR(0, Sn_IR_RECV);  // WZ5100S driver bug.
    #endif
}

I added a small comment for future readers and a conditional compilation to avoid this useless call for other Wiznet chips.
Now, I start the long haul tests... To be continued.

robert-hh · 2022-07-31T09:28:08Z

OK. I changed the code accordingly. So it shows again, that testing is required after code changes.

omogenot · 2022-08-01T08:28:10Z

Here are the results after almost 24 hours of tests. The test was consisting in having an Ajax HTTP request every 2 seconds to the embedded http server, 1 request every 200 ms from a Modbus TCP client, 1 ping every second.
No error, nor lock up of the pico device. I don't have the number of http request but every 2 seconds for 24 hours, that's a lot => no error. About 280000 Modbus request (some timeout errors, but this is expected. The poor pico cannot handle everything at the same time).

--- 192.168.0.105 ping statistics ---
81563 packets transmitted, 81563 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.681/1.100/64.301/0.458 ms

The maximum round-trip time of 64 ms tends to show that the SPI bus congestion was handled by ignoring the IRQ and letting the polling do the work.
I think that the fixes are OK and it's good to go.

robert-hh · 2022-08-01T08:35:50Z

Thanks. Then I'll make the commit for the most recent change.

dpgeorge · 2022-08-09T06:46:37Z

Thanks for all your efforts on this, to thoroughly test it.

Rebased and squashed into 5 commits and merged in 736b427 through f000ac9

robert-hh added 3 commits July 27, 2022 10:49

network_wiznet5k.c: Register nic when the LWIP stack is used.

23c15b9

That was missing, and network.route() returned an empty list.

network_wiznet5k.c: Drop an obsolete argument check.

34c45b9

Drop an obsolete and wrong argument check. The proper checks are done further down in the code.

extmod/network_wiznet5k.c: Deinit the nic before (re-)initialisation.

1f3f89c

If nic.active(True) is called several times in a row, the device may lock up. Even if that is bad coding practice, calling wiznet5k_deinit() in wiznet5k_init() prevents the lock.

dpgeorge added the extmod Relates to extmod/ directory in source label Jul 28, 2022

dpgeorge reviewed Jul 28, 2022

View reviewed changes

extmod/network_wiznet5k.c Outdated Show resolved Hide resolved

network_wiznet5k.c: Check for the absence of keyword arguments.

b4bb29c

Using keyword arguments raises an error now.

robert-hh changed the title ~~extmod/network_wiznet5k.c: Three small bug fixes.~~ extmod/network_wiznet5k.c: Four small bug fixes. Jul 29, 2022

robert-hh changed the title ~~extmod/network_wiznet5k.c: Four small bug fixes.~~ extmod/network_wiznet5k.c: Five small bug fixes. Jul 29, 2022

network_wiznet5k.c: Schedule clearing of interrupt flags.

8bc13a5

Avoiding conflicts between the IRQ and an active transfers. Before the change, the device could lock up in heavy traffic situations. Fix found and code supplied by @omogenot.

robert-hh force-pushed the rp2_wiznet5k branch from f2b781f to 8bc13a5 Compare July 30, 2022 10:27

network_wiznet5k.c: Rearrange the function wiznet5k_poll().

2fffb18

To have just one exit and a more compact flag test. This is just a style change without impact to the function.

network_wiznet5k.c: Clear interrupt for W5100S.

f0af40c

This was referenced Aug 5, 2022

stm32: Use lib/wiznet5k instead of drivers/wiznet5k. #9019

Closed

stm32: Use lib/wiznet5k instead of drivers/wiznet5k. #9021

Closed

dpgeorge closed this Aug 9, 2022

robert-hh deleted the rp2_wiznet5k branch August 9, 2022 13:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

extmod/network_wiznet5k.c: Five small bug fixes. #8974

extmod/network_wiznet5k.c: Five small bug fixes. #8974

robert-hh commented Jul 27, 2022 •

edited

Loading

codecov-commenter commented Jul 27, 2022 •

edited

Loading

omogenot commented Jul 28, 2022 •

edited

Loading

robert-hh commented Jul 28, 2022

robert-hh commented Jul 29, 2022

omogenot commented Jul 29, 2022

robert-hh commented Jul 29, 2022

omogenot commented Jul 29, 2022

robert-hh commented Jul 29, 2022

omogenot commented Jul 29, 2022

robert-hh commented Jul 29, 2022

omogenot commented Jul 29, 2022

omogenot commented Jul 29, 2022 •

edited

Loading

omogenot commented Jul 29, 2022

robert-hh commented Jul 29, 2022 •

edited

Loading

omogenot commented Jul 29, 2022 •

edited

Loading

robert-hh commented Jul 29, 2022

omogenot commented Jul 30, 2022

robert-hh commented Jul 30, 2022

omogenot commented Jul 30, 2022

robert-hh commented Jul 30, 2022 •

edited

Loading

omogenot commented Jul 30, 2022

robert-hh commented Jul 30, 2022

robert-hh commented Jul 30, 2022

omogenot commented Jul 30, 2022

robert-hh commented Jul 30, 2022

omogenot commented Jul 30, 2022

robert-hh commented Jul 30, 2022

omogenot commented Jul 31, 2022

robert-hh commented Jul 31, 2022 •

edited

Loading

omogenot commented Aug 1, 2022

robert-hh commented Aug 1, 2022

dpgeorge commented Aug 9, 2022

extmod/network_wiznet5k.c: Five small bug fixes. #8974

extmod/network_wiznet5k.c: Five small bug fixes. #8974

Conversation

robert-hh commented Jul 27, 2022 • edited Loading

codecov-commenter commented Jul 27, 2022 • edited Loading

Codecov Report

omogenot commented Jul 28, 2022 • edited Loading

robert-hh commented Jul 28, 2022

robert-hh commented Jul 29, 2022

omogenot commented Jul 29, 2022

robert-hh commented Jul 29, 2022

omogenot commented Jul 29, 2022

robert-hh commented Jul 29, 2022

omogenot commented Jul 29, 2022

robert-hh commented Jul 29, 2022

omogenot commented Jul 29, 2022

omogenot commented Jul 29, 2022 • edited Loading

omogenot commented Jul 29, 2022

robert-hh commented Jul 29, 2022 • edited Loading

omogenot commented Jul 29, 2022 • edited Loading

robert-hh commented Jul 29, 2022

omogenot commented Jul 30, 2022

robert-hh commented Jul 30, 2022

omogenot commented Jul 30, 2022

robert-hh commented Jul 30, 2022 • edited Loading

omogenot commented Jul 30, 2022

robert-hh commented Jul 30, 2022

robert-hh commented Jul 30, 2022

omogenot commented Jul 30, 2022

robert-hh commented Jul 30, 2022

omogenot commented Jul 30, 2022

robert-hh commented Jul 30, 2022

omogenot commented Jul 31, 2022

robert-hh commented Jul 31, 2022 • edited Loading

omogenot commented Aug 1, 2022

robert-hh commented Aug 1, 2022

dpgeorge commented Aug 9, 2022

robert-hh commented Jul 27, 2022 •

edited

Loading

codecov-commenter commented Jul 27, 2022 •

edited

Loading

omogenot commented Jul 28, 2022 •

edited

Loading

omogenot commented Jul 29, 2022 •

edited

Loading

robert-hh commented Jul 29, 2022 •

edited

Loading

omogenot commented Jul 29, 2022 •

edited

Loading

robert-hh commented Jul 30, 2022 •

edited

Loading

robert-hh commented Jul 31, 2022 •

edited

Loading