About message queues
Message queues are a major part of any distributed system. They act as buffers between the application components, allowing services to communicate asynchronously without being tightly coupled. When a web service receives a request that triggers background processing—like sending emails, generating reports, or processing payments—message queues ensure these tasks happen reliably without blocking your user-facing operations.
We've been using SQS since 2015, before we adopted Elixir, to hand work off to a Ruby app without losing data whenever that app went down.
SQS is cost-effective and easy to maintain, but there's a common issue: the noisy neighbor problem.
It goes like this: one tenant sends a massive surge of messages to the queue, and suddenly every other tenant is experiencing delays:
Tenant 1: 10 messages
Tenant 2: 20 messages
Tenant 3: 1000 messages
Now all the messages get mixed together, and the messages from the first two tenants have to wait behind most or all of the messages from the third. The carefully balanced system becomes a bottleneck.
AWS just released a solution that tackles this: Amazon SQS Fair Queues.
The problem
In traditional message queues, messages are processed on a first-in, first-out basis. When tenant A floods your queue with thousands of messages, tenants B, C, and D have to wait. Their messages pile up behind the backlog, and dwell time grows.
The typical solutions? Over-provision resources (expensive), build complex custom load balancing (time-consuming), or implement separate queues per tenant (operational nightmare). None of these are ideal.
How fair queues change that
Amazon SQS Fair Queues automatically detect when one tenant is consuming disproportionate resources and adjust message delivery to maintain fairness. Here's the nice part: it happens transparently without changing your existing message processing logic.
The system continuously monitors message distribution across tenants. When it detects an imbalance, it:
- Identifies the noisy neighbor
- Automatically prioritizes messages from quiet tenants
- Maintains overall queue throughput
The noisy tenant doesn't get throttled. They just don't impact everyone else.
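AWS doesn't publish the internal algorithm, but the observable effect is similar to round-robin delivery across message groups. Here's a purely conceptual sketch in Elixir (the module and function are our own illustration, not SQS internals): given each tenant's pending messages, it produces a delivery order that cycles through tenants one message at a time, so a large backlog from one tenant no longer delays the others.

```elixir
defmodule FairnessSketch do
  @moduledoc """
  Conceptual illustration only: interleaves pending messages
  round-robin across tenants. This is NOT how SQS is implemented,
  just a way to picture the delivery behavior.
  """

  # Takes a map of tenant => list of pending messages and returns
  # a delivery order that cycles through tenants one message at a time.
  def interleave(queues) when map_size(queues) == 0, do: []

  def interleave(queues) do
    {round, rest} =
      Enum.reduce(queues, {[], %{}}, fn
        # Tenants with nothing left drop out of the rotation.
        {_tenant, []}, acc -> acc
        # Each tenant with pending work contributes one message per round.
        {tenant, [msg | more]}, {round, rest} ->
          {[msg | round], Map.put(rest, tenant, more)}
      end)

    Enum.reverse(round) ++ interleave(rest)
  end
end
```

With one quiet tenant and one noisy tenant, the quiet tenant's single message is delivered in the first round instead of waiting behind the noisy tenant's entire backlog.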
Implementation
Getting started requires just one change to your message producers: add a `MessageGroupId` to identify tenants.
Here's what that looks like in an Elixir app:
```elixir
# In your producer module
defmodule MyApp.MessageProducer do
  alias ExAws.SQS

  def send_tenant_message(queue_url, message_body, tenant_id) do
    queue_url
    |> SQS.send_message(message_body,
      # Setting a per-tenant message group id enables fair queuing
      message_group_id: "tenant-#{tenant_id}"
    )
    |> ExAws.request()
  end
end

# Usage
MyApp.MessageProducer.send_tenant_message(
  "your-queue-url",
  Jason.encode!(%{data: "your message"}),
  "123"
)
```
That's it. No consumer changes are needed, and there's no impact on API latency.
Monitoring fair queues
AWS provides new CloudWatch metrics specifically for fair queues:
- `ApproximateNumberOfNoisyGroups`: how many tenants are currently being noisy
- `ApproximateNumberOfMessagesVisibleInQuietGroups`: the backlog for well-behaved tenants
- `ApproximateAgeOfOldestMessageInQuietGroups`: message age excluding noisy neighbors
The real power comes from comparing these new metrics with standard queue metrics. During a traffic surge, your general queue metrics might show concerning backlogs, but the quiet group metrics will reveal that most tenants aren't actually affected.
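As a sketch of that comparison, here's a small helper that decides whether a surge is isolated to noisy tenants by comparing the overall visible backlog (the standard `ApproximateNumberOfMessagesVisible` metric) against the quiet-group backlog. The module, function, and 10% threshold are our own example values, not anything AWS prescribes.

```elixir
defmodule MyApp.QueueHealth do
  @doc """
  Compares the overall visible backlog with the backlog in quiet
  groups. If nearly all waiting messages belong to noisy tenants,
  the surge is isolated and probably doesn't need to page anyone.

  The 10% threshold is an arbitrary example value; tune it for
  your own alerting.
  """
  def surge_isolated?(total_visible, visible_in_quiet_groups)
      when total_visible > 0 do
    visible_in_quiet_groups / total_visible < 0.10
  end

  # No backlog at all means there's no surge to classify.
  def surge_isolated?(_total, _quiet), do: false
end
```

Feed it the two CloudWatch values from the same period: a 10,000-message backlog with only 120 messages in quiet groups is an isolated noisy-neighbor spike, while a 1,000-message backlog with 500 quiet-group messages is a genuine system-wide slowdown.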
Real-world impact
A real use case from a system we're actively developing: products are fetched from 100 supplier APIs and enriched with AI before being sent for human review. Since results are cached and only new products are processed, message volume is fairly consistent and predictable. However, adding a new supplier causes a spike of never-before-seen products, which clogs the queue and delays all 99 other suppliers.
With fair queues, the other 99 tenants can maintain normal processing times while the new supplier's excess messages are processed when resources are available. This way, the system remains responsive while still handling the spike.
Getting visibility
While the new metrics show you when fairness is being applied, you might also want to know which tenant is causing issues. Use CloudWatch Contributor Insights to identify the top contributors by `MessageGroupId`. This is especially valuable when you're managing thousands of tenants.
Simply log message processing in your consumer with the corresponding `MessageGroupId`, and Contributor Insights will show you the top N noisy neighbors without creating expensive high-cardinality metrics.
When to use fair queues
Fair queues are ideal when you have:
- Multi-tenant applications sharing queue resources
- Unpredictable traffic patterns across tenants
- Requirements for consistent quality of service
- Limited ability to over-provision resources
They're particularly valuable in SaaS platforms, API gateways, and any system where tenant isolation is important but complete separation isn't feasible.
Conclusion
AWS SQS Fair Queues solve a real problem that every multi-tenant system faces. The implementation is straightforward, the operational overhead is minimal, and the impact on system resilience is significant. They're available now in all regions where Amazon SQS operates.