Design Decisions¶
Note
This section is not necessary for users of the library to understand.
If you are looking to contribute, or curious about why things work the way they do, read on.
Design Goals¶
In order of importance:
- Work as a drop-in replacement for `httpx.Client`, so existing code can gain the benefits of the library without being changed.
- Preserve the 100% typed interface of `httpx`.
- Make writing & testing new augmentations easier.
- Organize code in a way that'll make augmenting future libraries (or `httpx.AsyncClient`) easier. (Based on limitations in porting scrapelib that made a partial rewrite easier.)
How it works¶
Each "suite" of behavior (throttling, retries, caching) consists of a function that monkey-patches `httpx.Client` to add that behavior to the all-important `request` method.
(The other common methods like `get` and `post` call this method.)
If you'd like to follow along, throttle.py is the simplest of them, at around 50 lines long.
You'll see two functions (ignore the class for now):
- `_throttle_request` - this acts as a sort of decorator for `httpx.Client.request`
- `make_throttled_client` - a pseudo-constructor for our monkey-patched client
Each feature is implemented in a similar way.
The recommended `make_careful_client` entrypoint is just a convenient combination of these `make_ZZZ_client` functions.
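The composition amounts to plain function chaining. A minimal sketch with hypothetical stand-in patch functions (the names follow the `make_ZZZ_client` pattern, but the bodies and signatures are illustrative, not the library's actual API):

```python
# hypothetical stand-ins for the real make_ZZZ_client functions;
# each patches the client in place and returns it
def make_throttled_client(client, requests_per_minute=60):
    client.throttled = True  # the real version patches client.request
    return client

def make_retrying_client(client, retries=2):
    client.retrying = True  # likewise illustrative
    return client

def make_careful_client(client):
    # just a convenient composition of the individual patch functions
    return make_retrying_client(make_throttled_client(client))

class Client:  # stand-in for httpx.Client
    pass

careful_client = make_careful_client(Client())
```

Because each function returns the same (now patched) instance, composing them is just nesting calls.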
If you don't need to be convinced that monkey patching was a reasonable choice here, you can skip to Our patch pattern
Why not inheritance?¶
For 15 years, scrapelib used a class hierarchy to a similar end; it'd certainly work here too.
The equivalent to a careful client in scrapelib is a scrapelib.Scraper.
It has a long inheritance hierarchy:
scrapelib.Scraper -> CachingSession -> ThrottledSession -> RetrySession -> requests.Session
This hierarchy means that there is no such thing as a CachingSession that doesn't use throttling,
and adding new behavior means considering exactly where it works best in the chain and then setting that in stone.
There's arguably no benefit derived from having things set up this way. It's too annoying to mix & match behaviors or add new ones; a single class would have been easier to maintain over the years.
Why not mixins?¶
We don't want to just give up and go with a single monolithic class. (See design goals 3 and 4.)
It is worth revisiting why those classes weren't mixins.
It seems like we could have ThrottledMixin, RetryMixin, DevCacheMixin, etc.
Then to use retry & cache together someone would:
```python
import httpx
from careful.hypothetical import ThrottledMixin, RetryMixin, DevCacheMixin

class CustomClient(RetryMixin, DevCacheMixin, httpx.Client):
    pass

client = CustomClient()
```
Honestly, not a great start: having to declare an empty class, and to carefully think about method resolution order rules in ordering them.
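The ordering concern is concrete: Python's method resolution order decides which mixin's override of a method runs first. A small sketch with stand-in classes (no real httpx involved):

```python
class Client:  # stand-in for httpx.Client
    pass

class RetryMixin:
    pass

class DevCacheMixin:
    pass

class CustomClient(RetryMixin, DevCacheMixin, Client):
    pass

# swapping RetryMixin and DevCacheMixin changes which override runs first
print([c.__name__ for c in CustomClient.__mro__])
# ['CustomClient', 'RetryMixin', 'DevCacheMixin', 'Client', 'object']
```

Whether retries should happen inside or outside the cache check is now decided by base-class ordering, which is easy to get wrong silently.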
But there's a bigger problem lurking here... configuration.
Assume each mixin is configured through its constructor:
```python
class RetryMixin:
    def __init__(self, num_retries=2, retry_delay_seconds=10, **kwargs):
        ...

class DevCacheMixin:
    def __init__(self, cache_backend=..., should_cache=..., **kwargs):
        ...
```
To make this work properly, our custom class needs a constructor too. It'd wind up looking like:

```python
def __init__(self, num_retries, retry_delay_seconds,
             cache_backend, should_cache, **kwargs):
    # Initialize mixins explicitly
    RetryMixin.__init__(self, num_retries=num_retries,
                        retry_delay_seconds=retry_delay_seconds)
    DevCacheMixin.__init__(self, cache_backend=cache_backend,
                           should_cache=should_cache)
    Client.__init__(self, **kwargs)
```
This makes working with the mixins frustrating, since any new combination requires modifying a repetitive constructor.
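Assembled into a runnable sketch (stand-in `Client`, attribute names illustrative), the repetition is easy to see:

```python
class Client:  # stand-in for httpx.Client
    def __init__(self, **kwargs):
        self.kwargs = kwargs

class RetryMixin:
    def __init__(self, num_retries=2, retry_delay_seconds=10, **kwargs):
        self.num_retries = num_retries
        self.retry_delay_seconds = retry_delay_seconds

class DevCacheMixin:
    def __init__(self, cache_backend=None, should_cache=None, **kwargs):
        self.cache_backend = cache_backend
        self.should_cache = should_cache

class CustomClient(RetryMixin, DevCacheMixin, Client):
    def __init__(self, num_retries=2, retry_delay_seconds=10,
                 cache_backend=None, should_cache=None, **kwargs):
        # every mixin parameter has to be repeated and forwarded by hand
        RetryMixin.__init__(self, num_retries=num_retries,
                            retry_delay_seconds=retry_delay_seconds)
        DevCacheMixin.__init__(self, cache_backend=cache_backend,
                               should_cache=should_cache)
        Client.__init__(self, **kwargs)

client = CustomClient(num_retries=5)
```

Every new combination of mixins means writing another constructor like this one.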
Preserving type signatures¶
One of the most annoying parts of maintaining scrapelib has been keeping its function signatures in sync with small requests changes.
Design goal #1 is that someone's existing usage of httpx.Client is unimpeded.
The most important method, and where we need to hook in our overrides, is Client.request.
The method takes a whopping 13 parameters, and of course httpx is also fully type-annotated.
To replace `request` we have three options:

1. Give our new class a `request` method which takes `*args` and `**kwargs` and passes them up the chain.
2. Give each class a `request` method that takes the exact same 13 parameters, and be careful to keep them in sync.
3. Use `functools.wraps` to replace the function but leave existing annotations & docstrings in place.
#1 reduces type safety and leads to a worse DX overall since language servers can no longer offer suggestions. This won't work for us.
#2 is the approach that scrapelib took. It was annoying and conflicts with goals 3 and 4.
#3 is the approach taken by careful, our actual monkey patch. Each `make_ZZZ_client` winds up with code resembling:

```python
tclient._no_throttle_request = tclient.request
tclient.request = types.MethodType(
    functools.wraps(client.request)(_throttle_request), client
)
```
The first line tucks away the pre-patch request method for use within the decorated function. It uses a unique name since it'll be sharing a namespace with other patches.
The second line does two neat things:
- `_throttle_request` is given the signature of `client.request` via `functools.wraps`
- `types.MethodType` rebinds the member function (so `self` is correctly handled as the first parameter)
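Both effects can be checked in a few lines. A self-contained demonstration with a stand-in client (mirroring the pattern, not careful's actual code):

```python
import functools
import types

class Client:  # stand-in for httpx.Client
    def request(self, method: str, url: str) -> str:
        """Send a request."""
        return f"{method} {url}"

def _throttle_request(self, *args, **kwargs):
    # the real version would sleep here before delegating
    return self._no_throttle_request(*args, **kwargs)

client = Client()
client._no_throttle_request = client.request
client.request = types.MethodType(
    functools.wraps(client.request)(_throttle_request), client
)

# metadata is preserved, and self is bound correctly
print(client.request.__doc__)  # Send a request.
print(client.request("GET", "https://example.com"))  # GET https://example.com
```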
Our patch pattern¶
Augmenting Client is done in two steps:
- a request function that acts as a decorator for `Client.request`; this is where the actual logic for the augmentation lives
- a patch function that:
    - writes any private state needed for the new behavior to a `Client` instance
    - replaces `Client.request` with our patched request
All of the files in `careful.httpx` follow this structure.
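As a generic skeleton of the two steps (a hypothetical request-counting augmentation, not one of careful's real ones):

```python
import functools
import types

class Client:  # stand-in for httpx.Client
    def request(self, method, url):
        return f"{method} {url}"

def _counting_request(self, *args, **kwargs):
    # the augmentation's actual logic lives in this decorator
    self._request_count += 1
    return self._no_counting_request(*args, **kwargs)

def make_counting_client(client):
    # step 1: write any private state the new behavior needs
    client._request_count = 0
    # step 2: replace request, tucking the original under a unique name
    client._no_counting_request = client.request
    client.request = types.MethodType(
        functools.wraps(client.request)(_counting_request), client
    )
    return client

client = make_counting_client(Client())
client.request("GET", "https://example.com")
```

Swap the counting logic for throttling, retries, or caching and you have the shape of each file.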
Protocol-typing the internal interface¶
After running into some of the issues above, I assumed I'd probably need type-ignore comments everywhere to get the monkey patching to work with a type checker.
While this might have been an option (after all, the end user experience is the priority), it'd be nice to keep the benefits of type checking for myself and other authors of extensions.
It turns out, as long as we consider the patches fully internal to the Client, there is a way to make this work.
The key challenge presented is that we add new variables to the Client during augmentation:
```python
# this is that internal state we store on a client
tclient._last_request = 0.0
tclient._requests_per_minute = requests_per_minute
tclient._request_frequency = 60.0 / requests_per_minute
```
These set off type checker alarms, and if we simply ignore them, they'll set off alarms again when they're used in the request wrapper!
It'd be nice to at least have a consistency check between the two, so the wrapped request doesn't accidentally use the wrong name; I typed `_requests_per_second` at least once while writing.
The answer here is a Protocol and a cast.
Each augmentation now comes with a `typing.Protocol`:

```python
class Throttled(Protocol):
    _last_request: float
    _requests_per_minute: float
    _request_frequency: float
    _no_throttle_request: Callable
    request: Callable
```
This defines all of the hidden state for the augmentation, as well as a placeholder for our overridden `request`.
Then, our decorator's `self` parameter is typed against this protocol, which satisfies the type checker when it comes to internal use of those new attributes.
The final change comes in where we initialize the attributes in the wrapper functions:

```python
# a cast is made to the new type, allowing assignment
tclient = cast(ThrottledClient, client)
tclient._last_request = 0.0
tclient._requests_per_minute = requests_per_minute
tclient._request_frequency = 60.0 / requests_per_minute
tclient._no_throttle_request = client.request
tclient.request = types.MethodType(
    functools.wraps(client.request)(_throttle_request), client
)
# the original client can be returned, of type `Client`
return client
```
Closing thoughts¶
With this approach, users never know at any point that they have a `ThrottledClient` or a `CachedClient`, etc.
Not having the final type change isn't ideal, but it's the compromise made for today.
It would be nice to be able to expose an extra method or two, but this approach leans on having only private attributes, and therefore being able to safely treat an augmented client as a `Client`.
There's almost certainly room for improvement, but I'm fairly happy with the trade-offs for now.