Skip to content

Conversation

ccooper21
Copy link
Contributor

For a project that I'm working on, I need to be able to extract some information from HTTP response headers. This capability is provided by the CPython requests library, but is not presently provided by the MicroPython urequests library. Hence, I have added it in a consistent way.

This change is benficial from a functional standpoint, but there are some downsides with respect to the MicroPython project's goals. In particular, this increases the Python byte code size for the urequests module by about 2x (i.e. 4966 vs. 2498 bytes). Of course, this also increases the amount of memory used by the urequests module at runtime since the HTTP header response headers are now held in memory. For my need this is acceptable and necessary, but it is disappointing to see that the memory consumption is about 5x the raw byte count for the header text. In one example that I looked at, this was 6816 vs. 1383 bytes.

Given the above, I'm not necessarily expecting this pull request to be accepted, but I would appreciate feedback and discussion on how to work towards that. Are there any tricks known from working on other modules that I might be able to leverage here?

Here are a few observations that I have:

  • Since RFC 7230 states that HTTP headers are case-insensitive and do not require unique keys, I implemented a specialized dict class to handle this. This is certainly the most consequential decision with respect to byte code size and runtime memory usage. Most every HTTP library has some variation of this. It appears that often base classes in the CPython collections module are used to simplify the implementation somewhat, but those are not available in MicroPython.

  • Initially the HTTP header strings are read via the network connection a line at a time. Each string is then split to extract the key and value that are ultimately added to the header dict. An opportunity for improvement would be to some how reuse the initial strings instead of generating new ones along the way to building the dict. Maybe this could be achieved using memoryview instances, but it seems like it would quickly get complex and require a lot of code.

  • Has any approach to turning on and off features been found with respect to other modules? The best solution might be to have a way to turn on or off HTTP response header handling based on the client's needs. As the micropython-lib project continues to get built out I'm guessing such an approach will become relevant in numerous places. Even with respect to the urequests module, this could be relevant to adding capabilities like redirect or cookie handling.

Section 3.2 of RFC 7230 states that HTTP header names are case-insensitive.
Further, section 3.2.2 states that HTTP header names do not need to be unique
as long as the semantics of the message are retained when concatenating the
corresponding values into a comma separated list.  When non-unique header
names are present, their order is to be preserved in the concatenation.  This
implementation attempts to comply with these requirements.
@dpgeorge
Copy link
Member

You are right that this PR add too much, in both the HttpHeaderDict class and the fact that all headers are stored in this dict.

One way around the need for HttpHeaderDict is to just use a normal dict(), store all the keys as lowercase, then always remember to query with lowercase keys (that will be compatibly with normal requests headers).

And then there could be an optional argument to the request() function to specify whether headers are stored in a dict or not. Alternatively there could be an argument which passes in a callback that is called for each header. Then you can just interpret those headers you are interested in for your particular application. The downside of these options is that neither is compatible with normal requests because they introduce additional paramaters.

@pfalcon
Copy link
Contributor

pfalcon commented Nov 9, 2017

Also #194

@SpotlightKid
Copy link
Contributor

SpotlightKid commented Aug 7, 2018

Alternatively there could be an argument which passes in a callback that is called for each header.

That's maybe the easiest and a memory efficient solution. But how about instead letting the user provide a custom Response class which must provide a set_header method (and which would be a no-op in the default urequests.Response class? A user, which needed access to the response headers could then easily wrap the urequests.request function e.g. like this (assuming urequests.request would take a new response_class keyword arg):

import urequests


class MyResponse(urequests.Response):
    def set_header(self, header):
        name, value = header.rstrip(b'\r\n').split(b':', 1)
        self.headers[name.decode('utf-8').lower()] = value.strip()

    @property
    def headers(self):
        if getattr(self, '_headers', None ) is None:
            self._headers = {}
        return self._headers


def request(*args, **kw):
    kw.setdefault('response_class', MyResponse)
    return urequests.request(*args, **kw)

def head(url, **kw):
    return request('HEAD', url, **kw)

# same for 'get', 'post', etc.
# alternatively use functools.partial

This would require only very minimal changes to urequests.py and allow further response handling customizations like in #278 without needing to complicate the module further:

diff --git a/urequests/urequests.py b/urequests/urequests.py
index acb220e..3ed8e28 100644
--- a/urequests/urequests.py
+++ b/urequests/urequests.py
@@ -13,6 +13,9 @@ class Response:
             self.raw = None
         self._cached = None
 
+    def set_header(self, header):
+        pass
+
     @property
     def content(self):
         if self._cached is None:
@@ -32,7 +35,7 @@ class Response:
         return ujson.loads(self.content)
 
 
-def request(method, url, data=None, json=None, headers={}, stream=None):
+def request(method, url, data=None, json=None, headers={}, stream=None, response_class=Response):
     try:
         proto, dummy, host, path = url.split("/", 3)
     except ValueError:
@@ -85,6 +88,8 @@ def request(method, url, data=None, json=None, headers={}, stream=None):
         reason = ""
         if len(l) > 2:
             reason = l[2].rstrip()
+
+        resp = response_class(s)
         while True:
             l = s.readline()
             if not l or l == b"\r\n":
@@ -95,11 +100,12 @@ def request(method, url, data=None, json=None, headers={}, stream=None):
                     raise ValueError("Unsupported " + l)
             elif l.startswith(b"Location:") and not 200 <= status <= 299:
                 raise NotImplementedError("Redirects not yet supported")
+
+            resp.set_header(l)
     except OSError:
         s.close()
         raise
 
-    resp = Response(s)
     resp.status_code = status
     resp.reason = reason
     return resp

@andrewleech
Copy link
Contributor

A simpler headers access change was merged in 5854ae1 as part of #500. While it doesn't meet all the desired behavior here, hopefully it's enough for most users. At this stage it's still case-sensitive but if anyone has suggestions for a light-weight case-insensitive change a new PR would be very welcome.

@andrewleech andrewleech closed this Jul 4, 2022
@crusy
Copy link

crusy commented Jan 14, 2023

A simpler headers access change was merged in 5854ae1 as part of #500. While it doesn't meet all the desired behavior here, hopefully it's enough for most users. At this stage it's still case-sensitive but if anyone has suggestions for a light-weight case-insensitive change a new PR would be very welcome.

Besides lower vs. upper case: The said PR does not allow multiple headers of the same name, such as Cache-Control or Set-Cookie. I think this is important, bc if one is reading request.headers there should be all of them or nothing, not partial data.

Anyways: I'm new to MicroPython, so I will not add a PR, nor am I sure about the following suggestion: Is there something like defaultdict available in MicroPython? But even if not, I guess it should be possible to add something similar in line 110 in 5854ae1? CC @pi-anl

crusy referenced this pull request Jan 14, 2023
This is controlled by parse_headers param to request(), which defaults to
True for compatibility with upstream requests. In this case, headers are
available as .headers of Response objects. They are however normal (not
case-insensitive) dict.

If parse_headers=False, old behavior of ignore response headers is used,
which saves memory on the dict.

Finally, parse_headers can be a custom function which can e.g. parse only
subset of headers (again, to save memory).
@andrewleech
Copy link
Contributor

Besides lower vs. upper case: The said PR does not allow multiple headers of the same name, such as Cache-Control or Set-Cookie. I think this is important, bc if one is reading request.headers there should be all of them or nothing, not partial data.

If a parse_header function is passed in it should receive any duplicated headers that may have been sent by the server. In fact, even with the fallback dict, if two copies of a header with different case are sent you'll be able to see both of them - this is different to cpython requests library where headers are exposed in a caseless dict - no duplicates will be visible there.

Is there a standard compliant use case for duplicate headers you can describe where the current implementation doesn't work?

I'm getting mixed information from a Google search on the topic, though it looks like RFC 6265 states you should not have multiple set-cookie headers, there should be one header with string separator between multiple cookies.
RFC 7230 states:

A sender MUST NOT generate multiple header fields with the same field
name in a message unless either the entire field value for that
header field is defined as a comma-separated list [i.e., #(values)]
or the header field is a well-known exception.

It does go on to note that in practice set-cookie is sometimes duplicated in violation of the spec and should be handled separately to cover this. In light of this it could make sense to explicitly look for duplicates of set-cookie, then perhaps combine them properly to expose to the user in the dict?

Either way the existing parse_header callback should be usable to handle that already if needed.

What are you proposing defaultdict to be used for? I don't see how that would help for duplicates or key case?

@crusy
Copy link

crusy commented Jan 15, 2023

Last in first out: defaultdict. The defaultdict Examples show how to append multiple items for the same key, which is what I expected here.

parse_header callback: I will have a look at it. Sorry I didn't do it before, but as I said I'm new to MicroPython, and I struggle in finding docs. In fact, I didn't even find something about getting the headers at all before this PR here.

Most important, my "standard compliant use case for duplicate headers". The "standard compliant" is where I most probably will be beaten :-) There's a discussion about that under the Cache-Control-answer on Stack Overflow I linked, btw. But anyways: PHP not only allows to set headers of the same type, it even has a replace parameter to enable that, see here. PHP's setcookie's examples pretty much contain only examples with multiple cookies, each of which result in a dedicated set-cookie-header. Getting a response's cookies is what I wanted to achieve, but I can get only the last one.

Edit: See the MDN web docs I linked in my first answer:

To send multiple cookies, multiple Set-Cookie headers should be sent in the same response.

@andrewleech
Copy link
Contributor

Were you referring to the "Using list as the default_factory" example? Having that means you no longer have a "key: value" set of headers, you have a "key: [list of values]" which would be a break of convention for with other python http libraries.

Speaking of which, the usage of this library generally tries to match (a subset of) the usage of the requests library, which also exposes headers in a dict, albeit a caseless one. We're always going to aim to match the usage of that library more than PHP or other platform, with the standards being the fallback :-)

There's also only so much "working around poorly behaving servers" we can efficiently add to a library that needs to be lean for embedded use on a range of devices.

@crusy
Copy link

crusy commented Jan 15, 2023

Well, any web dev will look for sources like that: PHP if it's a backend developer, MDN for frontend devs (or similar, this is not about PHP or MDN, they're just examples). Both state that multiple headers are not only possible (PHP), they are even a "should" (MDN). No mention of concatenating them into one header. You asked for use cases, here you are. No user of the standards look up RFCs, they look up their respective languages' documentation.

That being said, I see your point, specially in being somewhat "limited" by compatibility. That's one valid policy, bc why would the expectations of web dev be more important than the expectations of regular Python's devs? ;-)

So, as defaultdict and its "list of values" approach isn't a valid return value in line 122 of 5854ae1, maybe it still is handy in gathering the data in line 110? And before returning it, each entry of that defaultdict is join()ed into a single string. We would have both: a "key: value" set return value and no loss of multiple headers.

In other words: If multiple headers of the same type are supposed to come merged into one, why not merging them into one?

@andrewleech
Copy link
Contributor

I didn't see any links to where these other languages are promoting usage of duplicate headers? It's a real shame if they are, as this is why incompatibilities arise and developers are led astray, when anti-patterns that break spec are shown to people who don't know better.

In other words: If multiple headers of the same type are supposed to come merged into one, why not merging them into one?

It turns out the built-in python urllib3 (which underpins most python web libraries) used to merge keys this way 15 years ago, but found this caused more problems: https://bugs.python.org/issue1660009

psf/requests#1415

Regardless of what individual language docs / examples might show, a quick Google shows many instances of the controversy over duplicate headers in many languages. Regardless of what servers should be doing, there are clearly some that don't follow spec already out there. I'm not sure what the best way to deal with these arguable broken servers here is...

The existing Response.headers is "correct", in that it follows spec and is consistent with cpython requests.

The optional headers callback function is non-standard, but is a workaround that should allow developers to deal with these broken servers - perhaps an example of this usage would be good to add to the module here?

Alternatively, the multiple headers could be stored internally in a list and exposed some other way to users that need to deal with broken servers; I'm just not sure how common these are and whether it's worth adding code bulk to all users to benefit the few?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants