Skip to content

USB transmission gradually slowing down (in sawtooth pattern) #7479

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
FilipDominec opened this issue Jul 1, 2021 · 47 comments
Closed

USB transmission gradually slowing down (in sawtooth pattern) #7479

FilipDominec opened this issue Jul 1, 2021 · 47 comments
Labels

Comments

@FilipDominec
Copy link

FilipDominec commented Jul 1, 2021

I am trying to (mis)use Rpi Pico as a scientific data acquisition platform, imagine a 500 kSpS cheap oscilloscope. It is easy to set up communication with μPython: writing data to stdout on the Rpi Pico, and reading data from /dev/ttyACM0 (or other quasi-serial port) in the computer.

Its direct USB support would promise decent speed up to 1 MB/s, which would be just fine for the task. In practice, there are tighter bottlenecks described below. I made a little script for Rpi Pico, flooding the USB with data, and measuring the speed.

My code for μPython in Rpi Pico:

    from machine import Pin
    import time, machine, usys, gc, micropython 
    machine.freq(125000000*2)

    LED = machine.Pin(25, machine.Pin.OUT)
    LED.value(1); time.sleep(.5); LED.value(0) # no resets detected

    while True:
        for a in range(1000):
            try:
                usys.stdout.buffer.write('x'*10000)  
            except:
                LED.value(1); time.sleep(.2); LED.value(0) # no errors detected
        time.sleep(1) # try allowing USB stack to flush (no effect on later speed)

My code to receive the data in the computer:

    #!/usr/bin/python3  
    import sys, serial, time
    port = serial.Serial('/dev/ttyACM0') 
    c = 0
    while True: 
        t0=time.time()
        raw = port.read(10000)
        t1= time.time()
        tooktime = t1-t0
        print(f'{c:5d} receiving 10 kB now took', tooktime )
        c+=1

First observation is that on Rpi Pico clocked at 250 MHz, one has to avoid using print() as it is slow (some 30−50 kB/s). Calling usys.stdout.write() is better (up to 200 kB/s), but usys.stdout.buffer.write() can initially transmit 10kB in 0.015 s, getting close to the USB "Full Speed" throughput limit (i.e. ~600 kB/s). So I happily use the latter function for data.

Second observation: The trouble is that this speed drops continously - after few seconds of transmission, transmitting 10kB takes 0.050 s (i.e. only 200 kB/s). Then suddenly the transmit time returns to its initial value and starts growing linearly again. This repeats with the transmit times forming a sawtooth pattern.

Here I paste part of the printout on my computer where the drop in transmit time happened:

...
116 receiving 10 kB now took 0.051924705505371094
117 receiving 10 kB now took 0.05239605903625488
118 receiving 10 kB now took 0.05241680145263672
119 receiving 10 kB now took 0.05560755729675293
120 receiving 10 kB now took 0.04527783393859863
121 receiving 10 kB now took 0.015183687210083008
122 receiving 10 kB now took 0.015230655670166016
123 receiving 10 kB now took 0.01515960693359375
124 receiving 10 kB now took 0.015021562576293945
125 receiving 10 kB now took 0.015263795852661133
...

Complementary observations:

2a) Without overclocking, Rpi Pico running at 125 MHz, the transmit speeds are roughly half of what I wrote above, so the Rpi Pico is the bottleneck here.

2b) When transmit speed goes high, I thought the chip resets, but it is not the case (LED does not blink).

2c) If I put a random time.sleep(1) after some hundreds of transmission cycles, the transmit speed does not change much after the delay and continues to grow steadily when delay expires (so it is probably not a USB buffer error).

2d) Between transmitting cycles, I observed no significant changes in the micropython.mem_info() printout. So it is probably not a memory issue either.

2e) The period of main transmission loop (as measured on oscilloscope) is proportional to the time required to receive data in computer. So no packets are probably dropped.

Achieving fast USB transmission is a generally useful task and micropython on Rpi Pico does surprisingly well. At least initially. Can anybody give me a hint to get constantly high USB throughput, please?

@veni-vidi-code
Copy link

Have you measured the temperatur of your pico while doing so? Is it more steady when not overclocked?

@FilipDominec
Copy link
Author

@veni-vidi-code Thanks for the tip, but surface temperature is well below 35 °C all the time. No throttling expected, even when overclocking.

@FilipDominec
Copy link
Author

I could also plot the sawtooth pattern...

image

@robert-hh
Copy link
Contributor

One critical line slowing down the process is this one:

                usys.stdout.buffer.write('x'*10000)  

It creates the data object to be sent every time it is called. That will lead to garbage collectikon after a few iterations. It is more effective to create that data buffer beforehand. like:

    data = b'x'*10000
    while True:
        for a in range(1000):
            try:
                usys.stdout.buffer.write(data)  
            except:
                LED.value(1); time.sleep(.2); LED.value(0) # no errors detected
        time.sleep(1) # try allowing USB stack to flush (no effect on later speed)

@FilipDominec
Copy link
Author

@robert-hh Thanks, this improved speed by few %.

Interestingly, the speed cycling occurs at the same period with regards to amount of data. I am now quite confident the cycling occurs at each 418 chunks, i.e. 4180 kB transmitted.

beforehand_image_by chunk

@robert-hh
Copy link
Contributor

At least there seems not much difference between sys.stdout and sys.stdout.buffer any more, unless you changed the plotting style. Can you test, whether these delays are caused by the RP2 or by your PC?

@FilipDominec
Copy link
Author

FilipDominec commented Jul 1, 2021

@robert-hh Sorry for confusing you. In the second picture above, both datasets used usys.stdout.buffer.write(), but the red was with a string prepared beforehand as you suggested. usys.stdout.write() is rather slow and I don't try it any more.

Now I compared clock speeds. The delays are roughly doubled when Rpi runs at 125 MHz:

clockspeed_by-chunk

So I guess the bottleneck is somewhere in Rpi - not sure whether it is micropython or the USB interface.

What puzzles me the most is that we can see a repeated, almost fractal-like pattern on the speed curves. Like a function of data transmitted modulo 4180 kB. But this pattern does not scale exactly with the clock speed, though some features are maintained between 125 and 250 MHz.

@FilipDominec
Copy link
Author

I also noticed that the transmission speed does not depend on the size of data, be it 100 bytes or 10 kB. (Naturally there is some limit that entirely clogs USB of Rpi Pico and requires flash_nuke.uf2, so I didn't test larger chunks than 50 kB.)

@dpgeorge
Copy link
Member

dpgeorge commented Jul 2, 2021

It might be worth trying this with the latest (stable) pico-sdk and tinyusb, see #7474.

@FilipDominec
Copy link
Author

FilipDominec commented Jul 2, 2021

@dpgeorge I agree - either I could report the bug persists in the latest μP version (which is good), or I would find that it is no more there (which is even better).

So far I used rp2-pico-20210618-v1.16.uf2, but on the download page I can see there is a fresh version. Cannot wait to try it out!

@robert-hh
Copy link
Contributor

No change here with MicroPython v1.16-48-g0b3332c8e. Times between 0.0126 and 0.084 seconds.

@FilipDominec
Copy link
Author

@robert-hh I can confirm, no change whatsoever.

image

Why is the sawtooth repeated each 418 or 419 chunks? Why are there 10 steps?

Note these curves were measured today on a different computer, and they sligtly differ in shape from the previous ones.

@FilipDominec
Copy link
Author

FilipDominec commented Jul 2, 2021

I also changed the computer script so that it receives 1kB chunks instead of 10kB. To make it comparable, I normalized the time to 10kB also in this case.

We can see finer structure of the delays, perhaps some mismatch of RP2 and computer loops. But the general trend of transmission slowing down in still obvious.

finer

@lurch
Copy link
Contributor

lurch commented Jul 2, 2021

I am trying to (mis)use Rpi Pico as a scientific data acquisition platform, imagine a 500 kSpS cheap oscilloscope.

Have you looked at https://github.com/raspberrypi/pico-examples/tree/master/pio/logic_analyser ?
Using a compiled language (like C) instead of an interpreted language (like MicroPython) will be faster, and you should also see more "consistent" performance, as there's no memory garbage-collection being done in the background.

@FilipDominec
Copy link
Author

FilipDominec commented Jul 2, 2021

@lurch Sure I did. It would be interesting to see if the problem persists - AFAIK the logic analyser uses the tinyusb for communication, doesn't it? And it uses printf() for transmission of results over USB, so perhaps it could be burdened by similar trouble as the (C-compiled) micropython code.

There are many options to do the job. Every project is a tradeoff between spending time to try out different ways, and between spending time to fix one chosen way.

For example, this week I thought micropython on RP2 could be the right choice. Maybe it is not.

The previous two weeks I wasted time almost finishing my project in compiled C for ESP32 -- only to find out its USB communication is slow (limited to 921600 baud or less on some modules) and not reliable (spitting random 0xFF bytes in the data stream). Too bad I found it so late, now I believe that ESP32 is not suitable for serious projects involving USB connection.

@lurch
Copy link
Contributor

lurch commented Jul 2, 2021

I'm sure @kilograham and @hathach would be very appreciative of any stress-testing you can give to the USB stack 😀

And as a bonus, you get to narrow down whether the problem is with MicroPython or with TinyUSB 😉
Hopefully at least some of your C code that you wrote for ESP32 will be usable (with tweaks) on RP2040? Good luck.

@FilipDominec
Copy link
Author

Now I must admit that the problem seems to be beyond my hacking skills, and beyond my time dedicated for making the RP2/computer communication fast.

Anyway, initial data rate is 10kB in 0.012 s, that is, 830 kB/s. This is very promising, but after transmitting 320 kB, the speed starts dropping as shown above. This must be some software trouble and I still believe more skilled coders will be able to find its cause.

While currently I cannot dig through micropython + TinyUSB codebase, I have few ideas that might inspire somebody:

  • Since we observe some unknown event that makes the speed high again every 4180 kB transmitted, couldn't this mysterious event be activated more often? For instance, is there an equivalent of sys.stdout.flush() in micropython? (I observed sys.stdout.close() after write() changed nothing.)
  • Similar issue is mentioned here: High-speed streaming capacity of CDC class hathach/tinyusb#920 Is the call to usys.stdout.buffer.write() making any data copy?
  • Does the slowdown manifest itself in the TinyUSB speed tests proposed here: CDC speed test hathach/tinyusb#189 ?

@dhalbert
Copy link
Contributor

dhalbert commented Jul 2, 2021

I have seen things run slower due to the heap filling up and finding a free block taking longer. Once garbage collection happens, the cycle starts over. Here is a non-USB example of an analogous situation: adafruit#2602. Also see a PR that ameliorated this to some extent: adafruit#2614, with comments by @jimmo.

@FilipDominec
Copy link
Author

FilipDominec commented Jul 2, 2021

@dhalbert I also guessed this will be the correct explanation, but ...

  • it is not clear to me why speed cycling occurs each 4180 kB transmitted (i.e. 20× more data than the RAM in RP2 is), rather than being dependent on the number of calls instead,
  • why calling gc.collect() in each cycle has no effect on the speed cycling (only slows the transmission down by a constant, as I checked now),
  • why would any extra object be accumulated on heap, even in the following code which still shows the trouble:
data = b'x'*1000
buf = usys.stdout.buffer
while True:
    buf.write(data)  
    gc.collect()

Still, while this example allocates no extra object on heap explicitly, this garbage would have to be created somewhere in the implementation of usys.stdout.buffer.write().

Is it correct? Can anybody more knowledgeable than me review the code for garbage-prone allocation?

@FilipDominec
Copy link
Author

FilipDominec commented Jul 3, 2021

I experimented further, changing the size of data down to few bytes. Like this.

from machine import Pin
import time, machine, usys, gc, micropython 
machine.freq(125000000*2)

data = b'x'*14
buf = usys.stdout.buffer
while True:
    buf.write(data)  

Naturally transmitting in such short bursts slows the transmission down due to overhead of the program loop. What interested me more is its impact on the sawtooth function.

  1. Transmitting byte arrays of 13bytes or shorter makes the sawtooth repeat more frequently (w r t the total amount of bytes transmitted).
  2. Transmitting byte arrays between 14 and some 30 bytes shows a very similar sawtooth as previously used large chunks of 10kB, but there are no more steps.

image-1

Can this inspire somebody?

@lurch
Copy link
Contributor

lurch commented Jul 3, 2021

@FilipDominec Do you have an FTDI232 or PL2303 or similar USB-UART adaptor? Might be interesting to repeat your experiment, but sending the data over the RP2040's UART instead of USB? Obviously the maximum data-rate will be much reduced, but it would be interesting to see if it stays constant or shows a similar slowdown over time?

@robert-hh
Copy link
Contributor

robert-hh commented Jul 3, 2021

Using UART, the pattern looks different. For a 10000 byte block at 921600 baud it oscillates between 0,10767 and 0,1131 seconds. Doubling the baud rate reduces the time by two, but the pattern stays.
Edit. The peak appears at every 6 transmissions.

uart_send

Edit2: Changing the data block size to 100 or 10 and removing the sleep(1) does not change the times or pattern at all.

@FilipDominec
Copy link
Author

Using online tips and getting inspired from the "Adafruit TinyUSB/CDC/cdc_multi" example in Arduino studio, it is not hard to flood TinyUSB from C, either:

#include <Adafruit_TinyUSB.h>
#include "pico/stdlib.h"

void setup() {
  set_sys_clock_khz(250000, false);
  Serial.begin(115200);
}

void loop() {
  Serial.printf("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx");
}

And the result is beatiful, sustained throughput at 770 kB/s for this simple C program:

image-3

So there is either a bug in micropython, its interface to TinyUSB, or in the way I am using it to transmit data to computer.

@tjvr
Copy link
Contributor

tjvr commented Jul 3, 2021

Does MicroPython have any profiling tools we could use to figure out where it’s spending time?

@hathach
Copy link

hathach commented Jul 4, 2021

Could you try to increase CFG_TUD_CDC_EP_BUFSIZE to larger value such as 128/256 to see if it improves the speed. The latest tinyusb has also implemented USB double buffered for rp2040 (IN only) so it may help as well hathach/tinyusb#891 . There is more approach mentioned in my github issue, though, it may need to take a bit of time to write it in portable way to work with all mcus (therefore takes time).

Note CFG_TUD_CDC_RX_BUFSIZE/CFG_TUD_CDC_TX_BUFSIZE` may also need to be increased as well.

@robert-hh
Copy link
Contributor

CFG_TUD_CDC_RX_BUFSIZE/CFG_TUD_CDC_TX_BUFSIZE were already at 256. Setting CFG_TUD_CDC_EP_BUFSIZEto 256 as well improves the speed, in that the 10kB times then vary between 0.09 and 0.19 seconds. Setting all buffer sizes to 512 moves the range further down to 0.08 to 0.13 seconds.
The PR you mention seems not merged into MicroPyhton.

@dpgeorge
Copy link
Member

dpgeorge commented Jul 5, 2021

Using the Pico & PC code in the first post of this issue, I was able to reproduce the "saw tooth" plots.

Using PR #7474 seems to fix the problem completely, I can get a sustained ~7Mbits/sec transfer rate from Pico to PC. I'll merge that PR shortly.

@hathach
Copy link

hathach commented Jul 5, 2021

Using the Pico & PC code in the first post of this issue, I was able to reproduce the "saw tooth" plots.

Using PR #7474 seems to fix the problem completely, I can get a sustained ~7Mbits/sec transfer rate from Pico to PC. I'll merge that PR shortly.

That is great to hear 🎉 🎉

@dpgeorge
Copy link
Member

dpgeorge commented Jul 5, 2021

That is great to hear

I'm not sure where the issue was and what the fix was but it should be somewhere in the pico-sdk and-or tinyusb. I don't have time to do a git-bisect on these to find out 🙂

@hathach
Copy link

hathach commented Jul 5, 2021

yeah, to be honest, tinyusb isn't speed optimized at the moment. Since I am more focused on the portability. I am glad that micropython could pull off a great throughput 7Mbps which will give me a lot more time to work on the speed feature :)

@robert-hh
Copy link
Contributor

Data rate confirmed. Setting CFG_TUD_CDC_EP_BUFSIZE to 256 as suggested by @hathach increases the data rate further by 35% to about 9.5 Mbit/sec.

@dpgeorge
Copy link
Member

dpgeorge commented Jul 5, 2021

Setting CFG_TUD_CDC_EP_BUFSIZE to 256 as suggested by @hathach increases the data rate further by 35% to about 9.5 Mbit/sec.

Since the rp2040 has quite a bit of RAM it's probably worth making this the default (or at least bump it to 128).

@dpgeorge
Copy link
Member

dpgeorge commented Jul 5, 2021

Setting CFG_TUD_CDC_EP_BUFSIZE to 256 as suggested by @hathach increases the data rate further by 35% to about 9.5 Mbit/sec.

Since the rp2040 has quite a bit of RAM it's probably worth making this the default (or at least bump it to 128).

Actually, testing this (using 256 for CFG_TUD_CDC_EP_BUFSIZE) shows no difference in data rate (for Pico -> PC) for me. @robert-hh what test script to you use to get 9.5Mbit/sec?

@robert-hh
Copy link
Contributor

I'm using stripped down versions of the initial script:
Send:

import time, machine, usys
machine.freq(250_000_000)

data = b'x'*10000
while True:
    for a in range(1000):
        usys.stdout.buffer.write(data)  

Receive:

#!/usr/bin/python3  
import sys, serial, time
port = serial.Serial('/dev/ttyACM0')
c = 0
while True: 
    t0=time.time()
    raw = port.read(10000)
    t1= time.time()
    tooktime = t1-t0
    print(f'{c:5d} receiving 10 kB now took', tooktime )
    c+=1

@FilipDominec
Copy link
Author

FilipDominec commented Jul 5, 2021

Good job, everybody! The remaining task is to provide micropython newbies (like me) an official tutorial for interfacing with the computer over USB. As a matter of fact, the popular and seemingly preferred way, print(), is lamentably slow.

Remember the words of a prophet:

There should be one-- and preferably only one --obvious way to do it.
Explicit is better than implicit.
In the face of ambiguity, refuse the temptation to guess.

@dpgeorge
Copy link
Member

dpgeorge commented Jul 6, 2021

I'm using stripped down versions of the initial script:

Ok, thanks. Using that I can also achieve around 9MBit/sec (with 256 endpoint size). And I get a similar rate using the same test on a PYBD-SF2.

@dpgeorge
Copy link
Member

dpgeorge commented Jul 6, 2021

As a matter of fact, the popular and seemingly preferred way, printf(), is lamentably slow.

print() was slow because it was not optimised. #7491 makes it much faster, almost as fast as sys.stdout.buffer.write.

But note that print() is only going to be fast for printing strings (ascii and/or utf8). If you need to print raw byte data then sys.stdout.buffer.write is the way to do it (even in CPython).

dpgeorge added a commit to dpgeorge/micropython that referenced this issue Jul 6, 2021
This improves the speed of data going out of the MCU and in to the USB
host.

See related micropython#7479.

Signed-off-by: Damien George <damien@micropython.org>
@dpgeorge
Copy link
Member

dpgeorge commented Jul 6, 2021

Since the rp2040 has quite a bit of RAM it's probably worth making this the default

See #7492 for this.

@lurch
Copy link
Contributor

lurch commented Jul 6, 2021

But note that print() is only going to be fast for printing strings (ascii and/or utf8). If you need to print raw byte data then sys.stdout.buffer.write is the way to do it (even in CPython).

There's also the obvious difference that print has to do any variable interpolations that might be necessary (e.g. print("Value is %d" % value)) whereas sys.stdout.buffer.write can just squirt out the raw data as fast as possible without having to check for special-characters first.

But good to hear that you've managed to speed up print() anyway 🙂

@dpgeorge
Copy link
Member

dpgeorge commented Jul 7, 2021

There's also the obvious difference that print has to do any variable interpolations that might be necessary

That's not technically true: print() itself doesn't do the interpolations, it's just passed a pre-formatted string. And one could actually do sys.stdout.buffer.write(b"%d" % 123). The main difference is that print() renders objects (which for strings is converting \n to \r\n) while sys.stdout.buffer.write() just writes out a bytes object.

@lurch
Copy link
Contributor

lurch commented Jul 7, 2021

Ahh, I must have been muddling up Python's print with C's printf 😉

dpgeorge added a commit to dpgeorge/micropython that referenced this issue Jul 8, 2021
This improves the speed of data going out of the MCU.

See related micropython#7479.

Signed-off-by: Damien George <damien@micropython.org>
@dpgeorge
Copy link
Member

dpgeorge commented Jul 8, 2021

This issues should be fully resolved now. See related commits 20a6044 and 0613d3e

@dpgeorge dpgeorge closed this as completed Jul 8, 2021
@derdudee
Copy link

Is there also a possibility to use this in C?
printf is too slow for me

@lurch
Copy link
Contributor

lurch commented Aug 16, 2022

@derdudee For questions regarding the C SDK, you ought to be asking at https://github.com/raspberrypi/pico-sdk/issues

@FilipDominec
Copy link
Author

@derdudee Without questioning the note of lurch, I would like to point you to https://github.com/FilipDominec/rp2daq/blob/main/rp2daq.c#L157 where I commented the issue. Do not forget to set stdio_set_translate_crlf(&stdio_usb, false); to transmit binary data correctly.

@0i9j
Copy link

0i9j commented Aug 18, 2022

@FilipDominec Are you able to read data from you COM Port? Or where is fwrite printing to?

@matsobdev
Copy link

matsobdev commented Feb 20, 2024

@derdudee

Is there also a possibility to use this in C? printf is too slow for me

Use:

#include <unistd.h>

write(1, &image, 108800);

and:

read(0, &image, 1600);

to read. This:

stdio_set_translate_crlf(&stdio_usb, false);

will keet binary data safe from changing 0x0D to 0x0D0A. But you need make change in https://github.com/raspberrypi/pico-sdk/blob/master/src/rp2_common/pico_stdio_usb/include/tusb_config.h bike presented above here in the thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests