Skip to content

extmod/network_wiznet5k.c: Five small bug fixes. #8974

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

robert-hh
Copy link
Contributor

@robert-hh robert-hh commented Jul 27, 2022

Five small commits fixing bugs for WIZNET5K boards:

  1. Register the nic with the network module. That was missing, causing the list returned by network.route() to be empty.
  2. Drop and obsolete argument check in wiznet5k_make_new(). That prevented specifying a pin for the interrupt signal.
  3. Deinit the nic before (re-)initialsation. Without that, calling nic.active(True) more than once in a row could lock up the device. If it was not initialized before, the deinit does not hurt.
  4. Schedule clearing of the Interrupt. I was in the IRQ handler, but that caused a problem when a SPI transfer was ongoing at the time of the interrupt, because IRQ clearing starts an SPI transfer itself.
  5. Check for the absence of keyword arguments when the constructor is called with default settings, as suggested by @dpgeorge.

Commits 2 though 4 are done in behalf of @omogenot, who asked me to do so.

Commit 3 effectively adds only one line, but looks larger since the deinit() function was moved avoiding a forward declaration.

The changes were tested by @omogenot on a W5500_EVB_PICO board and by me with a RP2040 pico + W5500 breakout combo. Ping times are ~0.7s for "pinging" the board and ~3 second for ping originated from the board, when interrupts are enabled. The MP test suite for Network shows a good result:

./run-multitests.py -i pyb:a0 multi_net/*

20 tests performed
17 tests passed
3 tests failed: multi_net/ssl_cert_rsa.py multi_net/tcp_client_rst.py multi_net/uasyncio_tcp_client_rst.py

./run-multitests.py -i micropython -i pyb:a0 multi_net/*

20 tests performed
16 tests passed
2 tests skipped: multi_net/tcp_client_rst.py multi_net/uasyncio_tcp_client_rst.py
2 tests failed: multi_net/uasyncio_tcp_readall.py multi_net/udp_data.py

python run-tests.py --target pyboard net_inet/*.py net_hosted/*.py

17 tests performed (115 individual testcases)
15 tests passed
2 tests failed: connect_nonblock_xfer test_tls_nonblock

That was missing, and network.route() returned an empty list.
Drop an obsolete and wrong argument check. The proper checks are done
further down in the code.
If nic.active(True) is called several times in a row, the device may
lock up. Even if that is bad coding practice, calling wiznet5k_deinit()
in wiznet5k_init() prevents the lock.
@codecov-commenter
Copy link

codecov-commenter commented Jul 27, 2022

Codecov Report

Merging #8974 (2fffb18) into master (45ab801) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #8974   +/-   ##
=======================================
  Coverage   98.37%   98.37%           
=======================================
  Files         156      156           
  Lines       20281    20281           
=======================================
  Hits        19951    19951           
  Misses        330      330           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 45ab801...2fffb18. Read the comment docs.

@dpgeorge dpgeorge added the extmod Relates to extmod/ directory in source label Jul 28, 2022
Using keyword arguments raises an error now.
@omogenot
Copy link
Contributor

omogenot commented Jul 28, 2022

@robert-hh ,

I think I found an issue with the IRQ mode. If the nic is heavily solicited (a web server for instance), I guess it can happen that an IRQ occurs whilst the program was performing an SPI transfer already. Since the IRQ handler is accessing the WIZNET chip as well, there might be some cases where a SPI access conflict could arise => system locked-up.
I propose to move the WIZNET chip SPI access out of the IRQ handler:

the mpy_wiznet_read_int function becomes:

STATIC mp_obj_t mpy_wiznet_read_int(mp_obj_t none_in) {
    (void)none_in;
//    wizchip_clrinterrupt(IK_SOCK_0);
//    setSn_IR(0, Sn_IR_RECV);

    // Handle incoming data
    wiznet5k_try_poll();
    return mp_const_none;
}

and the wiznet5k_poll function becomes:

void wiznet5k_poll(void) {
    wizchip_clrinterrupt(IK_SOCK_0);    // setSn_IR(0, Sn_IR_RECV); is already done in this function call
    wiznet5k_obj_t *self = &wiznet5k_obj;
    if (!(self->netif.flags & NETIF_FLAG_UP) ||
        !(self->netif.flags & NETIF_FLAG_LINK_UP)) {
        return;
    }
    uint16_t len;
    while ((len = wiznet5k_recv_ethernet(self)) > 0) {
        if (self->trace_flags & TRACE_ETH_RX) {
            netutils_ethernet_trace(MP_PYTHON_PRINTER, len, self->eth_frame, NETUTILS_TRACE_NEWLINE);
        }
        struct pbuf *p = pbuf_alloc(PBUF_RAW, len, PBUF_POOL);
        if (p != NULL) {
            pbuf_take(p, self->eth_frame, len);
            if (self->netif.input(p, &self->netif) != ERR_OK) {
                pbuf_free(p);
            }
        }
    }
}

With these fixes, the web server seems to work without lock-ups. To be continued. I have a test running 24/7 to verify that.

@robert-hh
Copy link
Contributor Author

That change looks good. 100MByte bulk transfer w/o problems. I'll run a 10GByte test over night. Transfer speed ~400kByte/s. Will take ~7 hours

@robert-hh
Copy link
Contributor Author

The 10GByte test succeeded fine. All data was transferred, no lock-up. So the change improves the reliability substantially.

@omogenot
Copy link
Contributor

Cool ! I'm glad to hear that.
On my side, the current test is so far successful. The web server is still up and running, at the same time the mDNS responder is working, a ping request to the pico board is running since yesterday without problems, and I have a Modbus TCP client pulling data from the embedded Modbus TCP server on the pico every 500 ms without major issues (some timeouts from time to time as the poor pico cannot handle everything (http request, ftp client, modbus TCP request, etc.) at the same time :-) )

@robert-hh
Copy link
Contributor Author

That's great. The standard test are still running the same way as before (as expected). You should make a PR for that, or I could add that to the current one.

@omogenot
Copy link
Contributor

I think that you can add it to the current one as it is more or less related.
In the meantime, I am writing mDNS support when LWIP is used. I added this to WIZNET5K class, but I think it shall be available for all types of nic attached to the LWIP. I had a look at the modnetwork.c file and found strange that the network_wiznet5k.c never calls the mod_network_register_nic function if using LWIP. Is it normal? If not, this would be another 'little' bug that shall be corrected in this PR as well, no?

@robert-hh
Copy link
Contributor Author

Registering the nic with the network module is already part of this PR. It's done in wiznet5k_init().

@omogenot
Copy link
Contributor

Sorry, my bad. That's your first commit.

However, I saw that all other nic drivers are doing this in the object new function. Shan't you move it to this function as well rather than in the init to be consistent? (I know I can be a pain...)

@robert-hh
Copy link
Contributor Author

It was already in the init function for the non-LWIP case, so I left it there. init() is called with nic.active(True). So actually it makes already a small difference, in that the route does not exists before the interface is active.

@omogenot
Copy link
Contributor

Fine by me then. :-) Good job.

@robert-hh robert-hh changed the title extmod/network_wiznet5k.c: Three small bug fixes. extmod/network_wiznet5k.c: Four small bug fixes. Jul 29, 2022
@robert-hh robert-hh changed the title extmod/network_wiznet5k.c: Four small bug fixes. extmod/network_wiznet5k.c: Five small bug fixes. Jul 29, 2022
@omogenot
Copy link
Contributor

omogenot commented Jul 29, 2022

@robert-hh The pain in the neck is back. I kept on making extra tests, and found some cases again where the Ethernet link is dead (without locking up the device). It seemed to be related to timing issues. I therefore moved the clear IRQ AFTER the while wiznet5k_recv_ethernet to avoid false calls when we already read all incoming packets. The function becomes:

void wiznet5k_poll(void) {
    wiznet5k_obj_t *self = &wiznet5k_obj;
    if ((self->netif.flags & (NETIF_FLAG_UP | NETIF_FLAG_LINK_UP)) == (NETIF_FLAG_UP | NETIF_FLAG_LINK_UP)) {
        uint16_t len;
        while ((len = wiznet5k_recv_ethernet(self)) > 0) {
            if (self->trace_flags & TRACE_ETH_RX) {
                netutils_ethernet_trace(MP_PYTHON_PRINTER, len, self->eth_frame, NETUTILS_TRACE_NEWLINE);
            }
            struct pbuf *p = pbuf_alloc(PBUF_RAW, len, PBUF_POOL);
            if (p != NULL) {
                pbuf_take(p, self->eth_frame, len);
                if (self->netif.input(p, &self->netif) != ERR_OK) {
                    pbuf_free(p);
                }
            }
        }
    }
    wizchip_clrinterrupt(IK_SOCK_0);
}

I reversed the flag test so that there is only one call to the wizchip_clrinterrupt function.
I keep on trying, using different timings for the Modbus TCP client, and ping, whilst refreshing randomly the web page sent by the http server. I am terribly sorry to come back with recurrent modification, but it looks like a long shot. I kind of understand why the previous contributor(s) have inhibited the IRQ mode :-)
Thank you for your patience and for bearing with me.

Edit:
Unfortunately, a new block arose on a socket.accept() statement in the http server. I try again, without the IRQ pin to verify that this is what creates the blocking situation or not.

@omogenot
Copy link
Contributor

Well, the non IRQ version seems to work OK. I keep on thinking that the problem is due to a concurrent access to the SPI bus. Therefore, I added a test for the CS signal level to enable the call to wiznet5k_try_poll. The function mpy_wiznet_read_int becomes:

STATIC mp_obj_t mpy_wiznet_read_int(mp_obj_t none_in) {
    (void)none_in;
    // Handle incoming data
    if (mp_hal_pin_read(wiznet5k_obj.cs))
        wiznet5k_try_poll();
    return mp_const_none;
}

I pursue my tests and let you know.

@robert-hh
Copy link
Contributor Author

robert-hh commented Jul 29, 2022

Doesn't the Interrupt have to be cleared at the RP2? Otherwise you would fall back to polling mode after the first interrupt has been ignored.
Edit: I tried your change with my "send burst" test, and it does not slow down the transmission. Looking into the code, the irq is acknowledged before the handler is called.

@omogenot
Copy link
Contributor

omogenot commented Jul 29, 2022

Doesn't the Interrupt have to be cleared at the RP2? Otherwise you would fall back to polling mode after the first interrupt has been ignored.
Edit: I tried your change with my "send burst" test, and it does not slow down the transmission. Looking into the code, the irq is acknowledged before the handler is called.

We cannot ack the IRQ whilst the SPI bus is used. It will be acknowledged during the next poll anyway. The poll function has a while loop that would read multiple packets, therefore the poll could have read the packet that triggered a new IRQ since last time we acknowledged it, making a new IRQ to be trigger for nothing as the packet has already been read. So I think that the last IRQ shall/can be ignored by acknowledging the IRQ at the end of the poll.
Edit: My current test is still OK. 6500 pings @ about 1ms, 25000 Modbus TCP requests (1 request every 200ms) and many (I did not count since it's manual clicks I do myself from time to time) web page requests.

@robert-hh
Copy link
Contributor Author

That's clear. I was concerned about the Pin IRQ of the rp2040, which has to be acknowledged to enable it again. But that is done in the hardware Pin IRQ interrupt handler before calling mpy_wiznet_read_int() as the registered callback.

@omogenot
Copy link
Contributor

Well, after more than 12 hours of test, no lock-ups. 👍
If I have a look at the ping results, I can see a few round times of 64 ms, instead of the regular 1 ms, showing that the IRQ was probably discarded and the polling took place. This tend to show that the CS signal test used as a kind of semaphore to regulate the concurrency SPI bus access is working.

@robert-hh
Copy link
Contributor Author

That looks good. My bulk transfer test works fine. With the bulk transfer running, ping times vary between 0.6 and 5.7 ms. Still pretty good. The IRQ handler now is:

STATIC mp_obj_t mpy_wiznet_read_int(mp_obj_t none_in) {
    (void)none_in;
    // Handle incoming data, unless the SPI bus is busy
    if (mp_hal_pin_read(wiznet5k_obj.cs)) {
        wiznet5k_try_poll();
    }
    return mp_const_none;
}
STATIC MP_DEFINE_CONST_FUN_OBJ_1(mpy_wiznet_read_int_obj, mpy_wiznet_read_int);

Do you still have the call to wizchip_clrinterrupt(IK_SOCK_0); at the start of wiznet5k_poll()? I tried both at the start and the end and see no difference.

@omogenot
Copy link
Contributor

Well, I kept it at the end because, in my mind, it makes more sense to clear it after having executed the loop and avoid "false alarms". But you're right the impact is really small (a few nano-seconds). It would require a lot of time to statistically prove it :-).
Even if I prefer my flag test which is done in one test statement instead of two :-). However, it's not a big code optimisation.

@robert-hh
Copy link
Contributor Author

robert-hh commented Jul 30, 2022

Would you please show me your version of the wiznet5k_poll() function? Besides the flag and style issues (one entry and one exit from a function), a difference could be, whether the interrupt is always cleared or only when the flag test passes.
Edit: Maybe not. because when the netif is not UP, then there are no interrupts to be cleared.

@omogenot
Copy link
Contributor

Here you go:

void wiznet5k_poll(void) {
    wiznet5k_obj_t *self = &wiznet5k_obj;
    if ((self->netif.flags & (NETIF_FLAG_UP | NETIF_FLAG_LINK_UP)) == (NETIF_FLAG_UP | NETIF_FLAG_LINK_UP)) {
        uint16_t len;
        while ((len = wiznet5k_recv_ethernet(self)) > 0) {
            if (self->trace_flags & TRACE_ETH_RX) {
                netutils_ethernet_trace(MP_PYTHON_PRINTER, len, self->eth_frame, NETUTILS_TRACE_NEWLINE);
            }
            struct pbuf *p = pbuf_alloc(PBUF_RAW, len, PBUF_POOL);
            if (p != NULL) {
                pbuf_take(p, self->eth_frame, len);
                if (self->netif.input(p, &self->netif) != ERR_OK) {
                    pbuf_free(p);
                }
            }
        }
    }
    wizchip_clrinterrupt(IK_SOCK_0);
}

If we get there, it's either because of the regular polling every 64 ms, or because of an IRQ. If the LINK is not UP we have to clear the IRQ anyway. That was already the case before because the call to the wizchip_clrinterrupt function was done at the very beginning if I remember well.

@robert-hh
Copy link
Contributor Author

Interestingly I see a difference in the ping times depending, on how the flag test is made. All ping'ing is done during the bulk transfer. With the 'old' test, the range is 0.654/1.985/5.734 ms, with the 'new' flag test, it is 0.639/1.857/10.657 ms (all min/avg/max).

Avoiding conflicts between the IRQ and an active transfers.
Before the change, the device could lock up in heavy traffic situations.
Fix found and code supplied by @omogenot.
To have just one exit and a more compact flag test. This is just a
style change without impact to the function.
@robert-hh
Copy link
Contributor Author

The style of wiznet5k_poll() is changed, but in a separate commit, to keep it apart from the bug fix.

@omogenot
Copy link
Contributor

Interestingly I see a difference in the ping times depending, on how the flag test is made. All ping'ing is done during the bulk transfer. With the 'old' test, the range is 0.654/1.985/5.734 ms, with the 'new' flag test, it is 0.639/1.857/10.657 ms (all min/avg/max).

My overnight test (with new flag):

66629 packets transmitted, 66629 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.643/1.011/64.404/0.855 ms

@robert-hh
Copy link
Contributor Author

@dpgeorge I think this is now ready to go, after having eliminated a rare lock-up state. These changes should affect the W5100 board as well, but there is nothing directly chip type related in the commits. So it will work, I assume.

@omogenot
Copy link
Contributor

@dpgeorge , @robert-hh
If you wish, you could wait until monday evening (French time), I will do the same testing with the W5100S_EVB_PICO board.

@robert-hh
Copy link
Contributor Author

Sure. Nothing easier than that (almost). Thanks for testing it with that board. I considered getting one, but they seem out of stock at the EU based shops, I usually use.

@omogenot
Copy link
Contributor

I started the tests with the W5100S board (actually I use a standard pico with a Wiznet Hat that uses a W5100S chip).
There is a bug in the Wiznet library that does not reset the IRQ flag. Reading the W5100S data sheet, the IRQ signal is not reset by writing to the IR register, but by writing to the SN_IR register. I now understand why we had that extra line that was useless for the W5500 ;-)
Therefore I modified the wiznet5k_poll function (again) to add this little patch and make the IRQ mode work.

void wiznet5k_poll(void) {
    wiznet5k_obj_t *self = &wiznet5k_obj;
    if ((self->netif.flags & (NETIF_FLAG_UP | NETIF_FLAG_LINK_UP)) == (NETIF_FLAG_UP | NETIF_FLAG_LINK_UP)) {
        uint16_t len;
        while ((len = wiznet5k_recv_ethernet(self)) > 0) {
            if (self->trace_flags & TRACE_ETH_RX) {
                netutils_ethernet_trace(MP_PYTHON_PRINTER, len, self->eth_frame, NETUTILS_TRACE_NEWLINE);
            }
            struct pbuf *p = pbuf_alloc(PBUF_RAW, len, PBUF_POOL);
            if (p != NULL) {
                pbuf_take(p, self->eth_frame, len);
                if (self->netif.input(p, &self->netif) != ERR_OK) {
                    pbuf_free(p);
                }
            }
        }
    }
    wizchip_clrinterrupt(IK_SOCK_0);
    #if _WIZCHIP_ == W5100S
    setSn_IR(0, Sn_IR_RECV);  // WZ5100S driver bug.
    #endif
}

I added a small comment for future readers and a conditional compilation to avoid this useless call for other Wiznet chips.
Now, I start the long haul tests... To be continued.

@robert-hh
Copy link
Contributor Author

robert-hh commented Jul 31, 2022

OK. I changed the code accordingly. So it shows again, that testing is required after code changes.

@omogenot
Copy link
Contributor

omogenot commented Aug 1, 2022

Here are the results after almost 24 hours of tests. The test was consisting in having an Ajax HTTP request every 2 seconds to the embedded http server, 1 request every 200 ms from a Modbus TCP client, 1 ping every second.
No error, nor lock up of the pico device. I don't have the number of http request but every 2 seconds for 24 hours, that's a lot => no error. About 280000 Modbus request (some timeout errors, but this is expected. The poor pico cannot handle everything at the same time).

--- 192.168.0.105 ping statistics ---
81563 packets transmitted, 81563 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.681/1.100/64.301/0.458 ms

The maximum round-trip time of 64 ms tends to show that the SPI bus congestion was handled by ignoring the IRQ and letting the polling do the work.
I think that the fixes are OK and it's good to go.

@robert-hh
Copy link
Contributor Author

Thanks. Then I'll make the commit for the most recent change.

@dpgeorge
Copy link
Member

dpgeorge commented Aug 9, 2022

Thanks for all your efforts on this, to thoroughly test it.

Rebased and squashed into 5 commits and merged in 736b427 through f000ac9

@dpgeorge dpgeorge closed this Aug 9, 2022
@robert-hh robert-hh deleted the rp2_wiznet5k branch August 9, 2022 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extmod Relates to extmod/ directory in source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants