How to prevent duplicate SQS Messages?

小开

There is no API level way of preventing duplicate messages to be posted to a SQS queue. You would need to handle this at application level I am afraid.

You can use a DynamoDB table to store your Domain Names waiting to be crawled and only add them to the queue if they are not in DynamoDB for example.

小开

最佳答案

As the other answers mentioned, you can't prevent duplicate messages coming through from SQS.

Most of the time your messages will be handed to one of your consumers once, but you will run into duplicates at some stage.

I don't think there is an easy answer to this question, because it entails coming up with a proper architecture that can cope with duplicates, meaning it's idempotent in nature.

If all the workers in your distributed architecture were idempotent, it would be easy, because you wouldn't need to worry about duplicates. But in reality, that sort of environment does not exist, somewhere along the way something will not be able to handle it.

I am currently working on a project where it's required of me to solve this, and come up with an approach to handle it. I thought it might benefit others to share my thinking here. And it might be a good place to get some feedback on my thinking.

Fact store

It's a pretty good idea to develop services so that they collect facts which can theoretically be replayed to reproduce the same state in all the affected downstream systems.

For example, let's say you are building a message broker for a stock trading platform. (I have actually worked on a project like this before, it was horrible, but also a good learning experience.)

Now let's say that that trades come in, and there are 3 systems interested in it:

An old school mainframe which needs to stay updated
A system that collates all the trades and share it with partners on a FTP server
The service that records the trade, and reallocates shares to the new owner

It's a bit convoluted, I know, but the idea is that one message (fact) coming in, has various distributed downstream effects.

Now let's imagine that we maintain a fact store, a recording of all the trades coming into our broker. And that all 3 downstream service owners calls us to tell us that they have lost all of their data from the last 3 days. The FTP download is 3 days behind, the mainframe is 3 days behind, and all the trades are 3 days behind.

Because we have the fact store, we could theoretically replay all these messages from a certain time to a certain time. In our example that would be from 3 days ago until now. And the downstream services could be caught up.

This example might seem a bit over the top, but I'm trying to convey something very particular: the facts are the important things to keep track of, because that's where we are going to use in our architecture to battle duplicates.

How the Fact store helps us with duplicate messages

Provided you implement your fact store on a persistence tier that gives you the CA parts of the CAP theorem, consistency and availability, you can do the following:

As soon as a message is received from a queue, you check in your fact store whether you've already seen this message before, and if you have, whether it's locked at the moment, and in a pending state. In my case, I will be using MongoDB to implement my fact store, as I am very comfortable with it, but various other DB technologies should be able to handle this.

If the fact does not exist yet, it gets inserted into the fact store, with a pending state, and a lock expiration time. This should be done using atomic operations, because you do not want this to happen twice! This is where you ensure your service's idempotence.

Happy case - happens most of the time

When the Fact store comes back to your service telling it that the fact did not exist, and that a lock was created, the service attempts to do it's work. Once it's done, it deletes the SQS message, and marks the fact as completed.

Duplicate message

So that's what happens when a message comes through and it's not a duplicate. But let's look at when a duplicate message comes in. The service picks it up, and asks the fact store to record it with a lock. The fact store tells it that it already exists, and that it's locked. The service ignores the message and skips over it! Once the message processing is done, by the other worker, it will delete this message from the queue, and we won't see it again.

Disaster case - happens rarely

So what happens when a service records the fact for the first time in the store, then get a lock for a certain period, but falls over? Well SQS will present a message to you again, if it was picked up, but not deleted within a certain period after it was served from the queue. Thats why we code up our fact store such that a service maintains a lock for a limited time. Because if it falls over, we want SQS to present the message to the service, or another instance thereof at a later time, allowing that service to assume that the fact should be incorporated into state (executed) again.

小开

According to AWS Docs, Exactly-Once Processing provides a way to avoid duplicate messages.

Unlike standard queues, FIFO queues don't introduce duplicate messages. FIFO queues help you avoid sending duplicates to a queue. If you retry the SendMessage action within the 5-minute deduplication interval, Amazon SQS doesn't introduce any duplicates into the queue.

If your queue is an FIFO queue and has enabled content based duplication, this function can be utilized to avoid duplicate messages during the deduplication interval. For more, read this section, and the below link.

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-sqs-queues.html#cfn-sqs-queue-contentbaseddeduplication

小开

Amazon SQS Introduces FIFO Queues with Exactly-Once Processing and Lower Prices for Standard Queues

Using the Amazon SQS Message Deduplication ID The message deduplication ID is the token used for deduplication of sent messages. If a message with a particular message deduplication ID is sent successfully, any messages sent with the same message deduplication ID are accepted successfully but aren't delivered during the 5-minute deduplication interval.

Amazon SQS Introduces FIFO Queues

Using the Amazon SQS Message Deduplication ID

小开

You can use the VisibilityTimeout parameter:

var params = {
VisibilityTimeout: 20,
...
};


sqs.receiveMessage(params, function(err, data) {});

reference: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html

VisibilityTimeout — (Integer)
The duration (in seconds) that the received messages are hidden from subsequent retrieve requests after being retrieved by a ReceiveMessage request.

小开

Only if AWS had not brought in the update to detect duplicate messages, I'd have done something like this.. Just thinking aloud, Push everything to Redis with a TTL of let's say 15 mins (or whatever period that's suitable for the use case) as soon as the request reaches your gateway/your environment and make your producer check Redis for any message coming from user with same amount (whatever logic makes it duplicate as per the business logic) before sending/putting the message in the queue, if message exists then return the response back to client stating duplicate found, if not allow it to be pushed to the queue. Redis needs to be really HA.