---
layout: pattern
title: Retry
folder: retry
permalink: /patterns/retry/
categories: other
tags:
 - java
 - difficulty-expert
 - performance
---

## Retry / resiliency
Enables an application to handle transient failures from external resources.

## Intent
Transparently retry certain operations that involve communication with external
resources, particularly over the network, isolating the calling code from the
retry implementation details.

## Explanation
The `Retry` pattern consists of retrying operations on remote resources over
the network a set number of times. The right configuration depends on both
business and technical requirements: how much time will the business allow the
end user to wait while the operation finishes? What are the performance
characteristics of the remote resource during peak loads, and of our own
application as more threads wait for the remote resource's availability? Among
the errors returned by the remote service, which can be safely ignored in
order to retry? Is the operation
[idempotent](https://en.wikipedia.org/wiki/Idempotence)?

Another concern is the impact the retry mechanism has on the calling code.
The retry mechanics should ideally be completely transparent to the calling
code (the service interface remains unaltered). There are two general
approaches to this problem: an enterprise architecture standpoint
(**strategic**) and a shared library standpoint (**tactical**).

*(As an aside, one interesting property is that, since implementations tend to
be configurable at runtime, daily monitoring and operation of this capability
is shifted over to operations support instead of the developers themselves.)*

From a strategic point of view, requests would be redirected to a separate
intermediary system, traditionally an
[ESB](https://en.wikipedia.org/wiki/Enterprise_service_bus), but more recently
a [Service Mesh](https://medium.com/microservices-in-practice/service-mesh-for-microservices-2953109a3c9a).

From a tactical point of view, the problem would be solved by reusing shared
libraries like [Hystrix](https://github.com/Netflix/Hystrix)[1]. This is the
type of solution showcased in the simple example that accompanies this
*README*.

In our hypothetical application, we have a generic interface for all
operations on remote interfaces:

```java
public interface BusinessOperation<T> {
  T perform() throws BusinessException;
}
```

And we have an implementation of this interface that finds our customers
by looking up a database:

```java
public final class FindCustomer implements BusinessOperation<String> {
  @Override
  public String perform() throws BusinessException {
    ...
  }
}
```

Our `FindCustomer` implementation can be configured to throw
`BusinessException`s before returning the customer's ID, thereby simulating a
'flaky' service that intermittently fails. Some exceptions, like the
`CustomerNotFoundException`, are deemed to be recoverable after some
hypothetical analysis because the root cause of the error stems from "some
database locking issue". However, the `DatabaseNotAvailableException` is
considered to be a definite showstopper - the application should not attempt
to recover from this error.
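
One possible shape for such a configurable flaky operation is sketched below. This is an illustrative, hypothetical implementation - the varargs constructor and the internal error queue are assumptions, and the class in the accompanying example code may differ:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Minimal exception hierarchy, assumed for this self-contained sketch.
class BusinessException extends Exception {
  BusinessException(String message) { super(message); }
}

class CustomerNotFoundException extends BusinessException {
  CustomerNotFoundException(String message) { super(message); }
}

interface BusinessOperation<T> {
  T perform() throws BusinessException;
}

// Throws each configured error once, in order, then returns the customer ID
// on every subsequent call - simulating a flaky remote lookup.
final class FindCustomer implements BusinessOperation<String> {
  private final String customerId;
  private final Deque<BusinessException> errors;

  FindCustomer(String customerId, BusinessException... errors) {
    this.customerId = customerId;
    this.errors = new ArrayDeque<>(List.of(errors));
  }

  @Override
  public String perform() throws BusinessException {
    if (!errors.isEmpty()) {
      throw errors.pop(); // fail with the next scripted error
    }
    return customerId;
  }
}
```

Once the scripted errors are exhausted, every further call succeeds, which is exactly the "recoverable after a few tries" behavior the text describes.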

We can model a 'recoverable' scenario by instantiating `FindCustomer` like this:

```java
final BusinessOperation<String> op = new FindCustomer(
    "12345",
    new CustomerNotFoundException("not found"),
    new CustomerNotFoundException("still not found"),
    new CustomerNotFoundException("don't give up yet!")
);
```

In this configuration, `FindCustomer` will throw `CustomerNotFoundException`
three times, after which it will consistently return the customer's ID
(`12345`).

In our hypothetical scenario, our analysts indicate that this operation
typically fails 2-4 times for a given input during peak hours, and that each
worker thread in the database subsystem typically needs 50ms to
"recover from an error". Applying these policies would yield something like
this:

```java
final BusinessOperation<String> op = new Retry<>(
    new FindCustomer(
        "12345",
        new CustomerNotFoundException("not found"),
        new CustomerNotFoundException("still not found"),
        new CustomerNotFoundException("don't give up yet!")
    ),
    5,
    100,
    e -> CustomerNotFoundException.class.isAssignableFrom(e.getClass())
);
```

Executing `op` *once* would automatically trigger at most 5 retry attempts,
with a 100 millisecond delay between attempts, ignoring any
`CustomerNotFoundException` thrown while trying. In this particular scenario,
due to the configuration of `FindCustomer`, there will be 1 initial attempt
and 3 additional retries before finally returning the desired result `12345`.

If our `FindCustomer` operation were instead to throw the fatal
`DatabaseNotAvailableException`, which we were instructed not to ignore (and,
more importantly, did *not* instruct our `Retry` to ignore), then the operation
would fail immediately upon receiving the error, no matter how many attempts
were left.
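
To make both behaviors concrete, here is a minimal, hypothetical sketch of a `Retry` wrapper consistent with the usage above - the constructor shape (delegate, max attempts, delay in milliseconds, recoverability predicate) matches the snippet, but the internals are an assumption, and the `BusinessOperation` interface is repeated so the sketch stands alone:

```java
import java.util.function.Predicate;

// Repeated from earlier so the sketch is self-contained.
class BusinessException extends Exception {
  BusinessException(String message) { super(message); }
}

interface BusinessOperation<T> {
  T perform() throws BusinessException;
}

// Hypothetical Retry decorator: performs the wrapped operation up to
// maxAttempts times in total, sleeping delayMs between attempts, and only
// retries errors that the recoverability test accepts.
final class Retry<T> implements BusinessOperation<T> {
  private final BusinessOperation<T> op;
  private final int maxAttempts;
  private final long delayMs;
  private final Predicate<Exception> test;

  Retry(BusinessOperation<T> op, int maxAttempts, long delayMs,
        Predicate<Exception> test) {
    this.op = op;
    this.maxAttempts = maxAttempts;
    this.delayMs = delayMs;
    this.test = test;
  }

  @Override
  public T perform() throws BusinessException {
    for (int attempt = 1; ; attempt++) {
      try {
        return op.perform();
      } catch (BusinessException e) {
        // Fail fast on non-recoverable errors, or when attempts run out.
        if (!test.test(e) || attempt >= maxAttempts) {
          throw e;
        }
        try {
          Thread.sleep(delayMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw e; // propagate the last business error on interruption
        }
      }
    }
  }
}
```

Because the predicate is consulted before sleeping, an error it rejects is rethrown on the very first attempt, which is how the fatal case above short-circuits.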

<br/><br/>

[1] Please note that *Hystrix* is a complete implementation of the *Circuit
Breaker* pattern, of which the *Retry* pattern can be considered a subset.

## Applicability
Whenever an application needs to communicate with an external resource,
particularly in a cloud environment, and if the business requirements allow it.

## Presentations
You can view Microsoft's article [here](https://docs.microsoft.com/en-us/azure/architecture/patterns/retry).

## Consequences
**Pros:**

* Resiliency
* Provides hard data on external failures

**Cons:**

* Complexity
* Operations maintenance

## Related Patterns
* [Circuit Breaker](https://martinfowler.com/bliki/CircuitBreaker.html)