About message queues
Message queues are a major part of any distributed system. They act as buffers between the application components, allowing services to communicate asynchronously without being tightly coupled. When a web service receives a request that triggers background processing—like sending emails, generating reports, or processing payments—message queues ensure these tasks happen reliably without blocking your user-facing operations.
We've been using SQS since 2015, before we adopted Elixir, to hand work off to a Ruby app without losing data whenever that app went down.
SQS is cost-effective and easy to maintain, but there's a common issue: the noisy neighbor problem.
It goes like this: one tenant sends a massive surge of messages to the queue, and suddenly every other tenant is experiencing delays:
Tenant 1: 10 messages
Tenant 2: 20 messages
Tenant 3: 1000 messages
Now all the messages get mixed together, and the messages from the first two tenants have to wait behind most or all of the messages from the third. The carefully balanced system becomes a bottleneck.
AWS just released a solution that tackles this: Amazon SQS Fair Queues.
The problem
In traditional message queues, messages are processed on a first-in, first-out basis. When tenant A floods your queue with thousands of messages, tenants B, C, and D have to wait. Their messages pile up behind the backlog, and dwell time grows.
The typical solutions? Over-provision resources (expensive), build complex custom load balancing (time-consuming), or implement separate queues per tenant (operational nightmare). None of these are ideal.
How fair queues change that
Amazon SQS Fair Queues automatically detect when one tenant is consuming disproportionate resources and adjust message delivery to maintain fairness. Here's the nice part: it happens transparently without changing your existing message processing logic.
The system continuously monitors message distribution across tenants. When it detects an imbalance, it:
- Identifies the noisy neighbor
- Automatically prioritizes messages from quiet tenants
- Maintains overall queue throughput
The noisy tenant doesn't get throttled. They just don't impact everyone else.
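AWS doesn't publish the internal algorithm, but the observable effect is similar to round-robin delivery across message groups. Here's a purely conceptual sketch in Elixir (the module and function are our own illustration, not SQS internals): given each tenant's pending messages, it produces a delivery order that cycles through tenants one message at a time, so a large backlog from one tenant no longer delays the others.

```elixir
defmodule FairnessSketch do
  @moduledoc """
  Conceptual illustration only: interleaves pending messages
  round-robin across tenants. This is NOT how SQS is implemented,
  just a way to picture the delivery behavior.
  """

  # Takes a map of tenant => list of pending messages and returns
  # a delivery order that cycles through tenants one message at a time.
  def interleave(queues) when map_size(queues) == 0, do: []

  def interleave(queues) do
    {round, rest} =
      Enum.reduce(queues, {[], %{}}, fn
        # Tenants with nothing left drop out of the rotation.
        {_tenant, []}, acc -> acc
        # Each tenant with pending work contributes one message per round.
        {tenant, [msg | more]}, {round, rest} ->
          {[msg | round], Map.put(rest, tenant, more)}
      end)

    Enum.reverse(round) ++ interleave(rest)
  end
end
```

With one quiet tenant and one noisy tenant, the quiet tenant's single message is delivered in the first round instead of waiting behind the noisy tenant's entire backlog.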
Implementation
Getting started requires just one change to your message producers: add a `MessageGroupId` to identify tenants.
Here's what that looks like in an Elixir app:
```elixir
# In your producer module
defmodule MyApp.MessageProducer do
  alias ExAws.SQS

  def send_tenant_message(queue_url, message_body, tenant_id) do
    queue_url
    |> SQS.send_message(message_body,
      # Setting a per-tenant message group id enables fair queuing
      message_group_id: "tenant-#{tenant_id}"
    )
    |> ExAws.request()
  end
end

# Usage
MyApp.MessageProducer.send_tenant_message(
  "your-queue-url",
  Jason.encode!(%{data: "your message"}),
  "123"
)
```
That's it. No consumer changes are needed, and there's no impact on API latency.
Monitoring fair queues
AWS provides new CloudWatch metrics specifically for fair queues:
- `ApproximateNumberOfNoisyGroups`: how many tenants are currently being noisy
- `ApproximateNumberOfMessagesVisibleInQuietGroups`: the backlog for well-behaved tenants
- `ApproximateAgeOfOldestMessageInQuietGroups`: message age excluding noisy neighbors
The real power comes from comparing these new metrics with standard queue metrics. During a traffic surge, your general queue metrics might show concerning backlogs, but the quiet group metrics will reveal that most tenants aren't actually affected.
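As a sketch of that comparison, here's a small helper that decides whether a surge is isolated to noisy tenants by comparing the overall visible backlog (the standard `ApproximateNumberOfMessagesVisible` metric) against the quiet-group backlog. The module, function, and 10% threshold are our own example values, not anything AWS prescribes.

```elixir
defmodule MyApp.QueueHealth do
  @doc """
  Compares the overall visible backlog with the backlog in quiet
  groups. If nearly all waiting messages belong to noisy tenants,
  the surge is isolated and probably doesn't need to page anyone.

  The 10% threshold is an arbitrary example value; tune it for
  your own alerting.
  """
  def surge_isolated?(total_visible, visible_in_quiet_groups)
      when total_visible > 0 do
    visible_in_quiet_groups / total_visible < 0.10
  end

  # No backlog at all means there's no surge to classify.
  def surge_isolated?(_total, _quiet), do: false
end
```

Feed it the two CloudWatch values from the same period: a 10,000-message backlog with only 120 messages in quiet groups is an isolated noisy-neighbor spike, while a 1,000-message backlog with 500 quiet-group messages is a genuine system-wide slowdown.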
Real-world impact
A real use case from a system we're actively developing: products are fetched from 100 supplier APIs and enriched with AI before being sent for human review. Since results are cached and only new products are processed, message volume is fairly consistent and predictable. However, adding a new supplier causes a spike of never-before-seen products, which clogs the queue and delays all 99 other suppliers.
With fair queues, the other 99 tenants can maintain normal processing times while the new supplier's excess messages are processed when resources are available. This way, the system remains responsive while still handling the spike.
Getting visibility
While the new metrics show you when fairness is being applied, you might also want to know which tenant is causing issues. Use CloudWatch Contributor Insights to identify the top contributors by `MessageGroupId`. This is especially valuable when you're managing thousands of tenants.
Simply log message processing in your consumer with the corresponding `MessageGroupId`, and Contributor Insights will show you the top N noisy neighbors without creating expensive high-cardinality metrics.
When to use fair queues
Fair queues are ideal when you have:
- Multi-tenant applications sharing queue resources
- Unpredictable traffic patterns across tenants
- Requirements for consistent quality of service
- Limited ability to over-provision resources
They're particularly valuable in SaaS platforms, API gateways, and any system where tenant isolation is important but complete separation isn't feasible.
Conclusion
AWS SQS Fair Queues solve a real problem that every multi-tenant system faces. The implementation is straightforward, the operational overhead is minimal, and the impact on system resilience is significant. They're available now in all regions where Amazon SQS operates.