What is Mashery's Fail-Safe and how does it protect the customer backend?

Products	Versions
TIBCO Cloud API Management	-

Description

Mashery's Fail-Safe: why Mashery returns "503 Service Unavailable" and "504 Gateway Timeout" errors.

Environment

Production

Resolution

Mashery's Fail-Safe triggers when the Mashery Traffic Manager encounters a predefined number of 50x errors for the same target host within a preconfigured time frame, measured in seconds. The count applies across all services and endpoints in the customer's area, because the trigger uses the API request's target host as its primary criterion. The reasoning is that when a high number of calls to a customer backend host end in "504 Gateway Timeout", the host is usually underperforming or overwhelmed and needs some time to recover. In this case, the Mashery proxy responds to calls for that specific host with "503 Service Unavailable", using the Error template defined in the Dashboard for this response. After one minute, the Mashery proxy resets the Fail-Safe for that host and lets traffic go through to that backend host again. The Fail-Safe is triggered again if the situation persists.

For example, assume endpoint 'A' and endpoint 'B' are configured with the same Target Host value, test.customer.com. Another endpoint, 'C', is configured with a different Target Host, prod.customer.com. If 49 requests through endpoint 'A' time out within a 10-second time frame, and within the same 10-second time frame 1 request through endpoint 'B' also times out, the Fail-Safe is triggered. This is because a total of 50 requests timed out against the same Target Host within the 10-second time frame. All subsequent requests intended for that Target Host, test.customer.com, then get a "503 Service Unavailable" error for one minute. After one minute the Fail-Safe trigger resets. During this time, traffic to prod.customer.com is not affected.
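The behavior in this example can be sketched in Python as a per-host error counter over a sliding window. This is only an illustration, not Mashery's implementation: the FailSafe class, its method names, and the default values (50 errors, 10 seconds, 60 seconds) are assumptions taken from the example above, and the real trigger values are set by the Mashery team.

```python
from collections import defaultdict, deque


class FailSafe:
    """Toy per-host fail-safe. Names and defaults are illustrative assumptions."""

    def __init__(self, max_errors=50, window_seconds=10, cooldown_seconds=60):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self._errors = defaultdict(deque)  # target host -> timestamps of recent 50x errors
        self._tripped_at = {}              # target host -> time the fail-safe was triggered

    def is_open(self, host, now):
        """Return True while calls to `host` should be answered with 503."""
        tripped = self._tripped_at.get(host)
        if tripped is None:
            return False
        if now - tripped >= self.cooldown_seconds:
            # Cooldown elapsed: reset and let traffic through again.
            del self._tripped_at[host]
            self._errors[host].clear()
            return False
        return True

    def record_error(self, host, now):
        """Record one 50x error for `host`, whichever endpoint the call used."""
        errors = self._errors[host]
        errors.append(now)
        # Drop errors that have fallen out of the sliding window.
        while errors and now - errors[0] > self.window_seconds:
            errors.popleft()
        if len(errors) >= self.max_errors:
            self._tripped_at[host] = now


fail_safe = FailSafe()
for i in range(49):
    # 49 timeouts through endpoint 'A', spread over seconds 0 to 9.
    fail_safe.record_error("test.customer.com", now=i // 5)
# 1 timeout through endpoint 'B' to the same target host.
fail_safe.record_error("test.customer.com", now=9)

print(fail_safe.is_open("test.customer.com", now=10))  # True: calls get 503
print(fail_safe.is_open("prod.customer.com", now=10))  # False: other host unaffected
print(fail_safe.is_open("test.customer.com", now=69))  # False: reset after one minute
```

The counter is keyed by target host rather than by endpoint, which is why the single timeout through endpoint 'B' completes the count started by endpoint 'A'.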

Who sets the Fail-Safe trigger values?
The Fail-Safe triggers (x 50x errors in y seconds) are set by the Mashery team based on the customer's traffic volume. Mashery cannot set these values extremely high, as that might obscure genuine backend issues and, in high-volume cases, negatively affect the customer's traffic, because worker capacity stays engaged waiting for the timeouts to occur.

What can a customer do to ensure that the Fail-Safe doesn't get triggered in their area?
"504 Gateway Timeout" errors are usually the primary reason the Fail-Safe is triggered in a production environment. Two settings at the endpoint level specify how long Mashery waits for a connection to, or a response from, the customer's Target Host. If either of those configured wait times is exceeded, Mashery returns a "504 Gateway Timeout" error. Set the Connect Time and Response Time parameters to an appropriate level, neither extremely high nor too low, based on the average response time of your backend.
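As a rough aid for choosing a Response Time value, the sketch below counts how many calls in a sample of backend response times would exceed a candidate timeout and therefore end in "504 Gateway Timeout". The function name and the sample timings are hypothetical; real measurements from your own backend should be used instead.

```python
def count_gateway_timeouts(response_times, response_timeout):
    """Count calls whose backend response time (in seconds) exceeds the timeout."""
    return sum(1 for t in response_times if t > response_timeout)


# Hypothetical backend response times in seconds, for illustration only.
sample_response_times = [0.8, 1.2, 0.9, 4.5, 1.1, 6.0]

for candidate_timeout in (1, 2, 5, 10):
    timeouts = count_gateway_timeouts(sample_response_times, candidate_timeout)
    print(f"timeout={candidate_timeout}s -> {timeouts} of {len(sample_response_times)} calls would return 504")
```

A value that is too low turns slow but healthy responses into 504 errors that count toward the Fail-Safe trigger, while a value that is too high keeps worker capacity waiting on a backend that is not going to answer.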

Issue/Introduction

Fail-Safe Mechanism
