What is the best way to prevent duplicate messages in Amazon SQS? I have an SQS queue of domains waiting to be crawled. Before I add a new domain to the queue, I can check it against my saved data to see whether it has been crawled recently, which prevents duplicates of already-crawled domains.
The problem is with the domains that have not been crawled yet. For example, if there are 1,000 domains in the queue that have not been crawled, any of them could be added again and again, which swells my queue to hundreds of thousands of messages that are mostly duplicates.
How do I prevent this? Is there a way to remove all duplicates from a queue? Or is there a way to search a queue for a message before I add it? I feel this is a problem that anyone using SQS must have run into.
One option I can see is to store a record of each domain before it is added to the queue. But if I have to store the data twice, that kind of defeats the point of using SQS in the first place.
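
For illustration, the option above could look roughly like the following. This is a minimal sketch, not a real SQS integration: `DedupingEnqueuer`, `send`, and `mark_crawled` are hypothetical names, and the in-memory set is only a stand-in for whatever shared store (a database table or cache) would be needed if more than one producer adds domains.

```python
from typing import Callable, Set


class DedupingEnqueuer:
    """Remembers which domains are already queued so each is sent only once."""

    def __init__(self, send: Callable[[str], None]) -> None:
        # Hypothetical: `send` would wrap the real call that puts a message on the queue.
        self._send = send
        # Assumption: an in-memory set stands in for a shared, persistent store.
        self._pending: Set[str] = set()

    def enqueue(self, domain: str) -> bool:
        """Send the domain unless it is already pending; return True if it was sent."""
        key = domain.strip().lower()  # normalize so "Example.com " matches "example.com"
        if key in self._pending:
            return False
        self._pending.add(key)
        self._send(key)
        return True

    def mark_crawled(self, domain: str) -> None:
        """Forget the domain once the crawler has processed it."""
        self._pending.discard(domain.strip().lower())


if __name__ == "__main__":
    sent = []  # stands in for the queue
    queue = DedupingEnqueuer(sent.append)
    queue.enqueue("example.com")
    queue.enqueue("Example.com ")  # skipped: already pending
    queue.mark_crawled("example.com")
    queue.enqueue("example.com")   # allowed again after the crawl
    print(sent)
```

The extra store only has to hold the keys of domains that are currently queued, not the crawl data itself, so it stays small compared with a queue full of duplicates.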