Is it better to have many small Azure Storage blob containers (each with a few blobs) or one really large container with tons of blobs?

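Here is the scenario:

I have multiple instances of a web service, each of which writes blobs of data to Azure Storage. I need to be able to group the blobs into containers (or virtual directories) based on when they were received. Every so often (daily at worst), older blobs are processed and then deleted.

I have two options:

Option 1

I make one container called "blobs" (for example) and store all the blobs in that container. Each blob uses a directory-style name, where the directory name is the time the blob was received (for example "hr0min0/data.bin", "hr0min0/data2.bin", "hr0min30/data3.bin", "hr1min45/data.bin", ..., "hr23min0/dataN.bin", and so on; a new directory every X minutes). The process that handles these blobs processes the hr0min0 blobs first, then hr0minX, and so on (and blobs are still being written while it runs).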
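To make the option 1 naming scheme concrete, here is a minimal sketch of how a writer might derive the virtual-directory prefix from the receive time. The function names and the 15-minute default for X are assumptions for illustration only; the question does not specify them.

```python
from datetime import datetime


def bucket_prefix(received: datetime, bucket_minutes: int = 15) -> str:
    """Return a prefix such as 'hr0min30' for the time bucket containing `received`."""
    # Assumption: X divides an hour evenly, so buckets never straddle an hour.
    if not 1 <= bucket_minutes <= 60 or 60 % bucket_minutes != 0:
        raise ValueError("bucket_minutes must divide 60 evenly")
    minute = (received.minute // bucket_minutes) * bucket_minutes
    return f"hr{received.hour}min{minute}"


def blob_name(received: datetime, filename: str, bucket_minutes: int = 15) -> str:
    """Build the directory-style blob name used inside the single 'blobs' container."""
    return f"{bucket_prefix(received, bucket_minutes)}/{filename}"


# A blob received at 00:37 falls into the 00:30 bucket.
print(blob_name(datetime(2024, 1, 1, 0, 37), "data3.bin"))  # hr0min30/data3.bin
```

Note that these unpadded names do not sort chronologically as plain strings ("hr10..." sorts before "hr2..."), which only matters if the processor relies on the order in which names are listed.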

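Option 2

I have many containers, each named after the arrival time (so first a container called blobs_hr0min0, then blobs_hr0minX, and so on), and every blob in a container arrived during that time window. The process that handles these blobs processes one container at a time.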
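A sketch of the option 2 naming, for comparison. One detail worth checking before adopting it: Azure container names allow only lowercase letters, digits, and single hyphens (3 to 63 characters, starting and ending with a letter or digit), so a name with an underscore such as blobs_hr0min0 would be rejected. The helper names and the hyphenated format below are assumptions, not something the question prescribes.

```python
import re
from datetime import datetime

# Documented container naming rules: lowercase letters, digits, and hyphens;
# every hyphen must sit between two letters/digits; length 3-63.
_CONTAINER_RE = re.compile(r"^[a-z0-9](?:-?[a-z0-9])*$")


def is_valid_container_name(name: str) -> bool:
    return 3 <= len(name) <= 63 and _CONTAINER_RE.match(name) is not None


def container_name(received: datetime, bucket_minutes: int = 15, base: str = "blobs") -> str:
    """Return a per-bucket container name such as 'blobs-hr1min45' (hypothetical format)."""
    minute = (received.minute // bucket_minutes) * bucket_minutes
    name = f"{base}-hr{received.hour}min{minute}"
    if not is_valid_container_name(name):
        raise ValueError(f"not a valid container name: {name!r}")
    return name


print(container_name(datetime(2024, 1, 1, 1, 50)))   # blobs-hr1min45
print(is_valid_container_name("blobs_hr0min0"))      # False
```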

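So my question is: which option is better? Does option 2 give better parallelization (since the containers can live on different servers), or is option 1 better because many containers might cause other, unknown issues?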


I don't think it really matters (from a scalability/parallelization perspective), because partitioning in Windows Azure blob storage is done at the blob level, not the container level. Reasons to spread blobs across different containers have more to do with access control (e.g. SAS) or total storage size.

See here for more details: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx

(Scroll down to "Partitions").

Quoting:

Blobs – Since the partition key is down to the blob name, we can load balance access to different blobs across as many servers in order to scale out access to them. This allows the containers to grow as large as you need them to (within the storage account space limit). The tradeoff is that we don’t provide the ability to do atomic transactions across multiple blobs.

Theoretically speaking, there should be no difference between lots of containers or fewer containers with more blobs. The extra containers can be nice as additional security boundaries (for public anonymous access or different SAS signatures for instance). Extra containers can also make housekeeping a bit easier when pruning (deleting a single container versus targeting each blob). I tend to use more containers for these reasons (not for performance).

In theory, there should be no performance impact. The blob itself (full URL) is the partition key in Windows Azure (and has been for a long time). That is the smallest unit that a partition server will load-balance. So you could (and often will) have two different blobs in the same container being served by different servers.

Jeremy indicates there is a performance difference between more and fewer containers. I have not dug into those benchmarks enough to explain why that might be the case, but I would suspect other factors (like size, duration of test, etc.) to explain any discrepancies.

Everyone has given you excellent answers around accessing blobs directly. However, if you need to list the blobs in a container, you will likely see better performance with the many-container model. I just talked with a company that has been storing a massive number of blobs in a single container. They frequently list the objects in the container and then perform actions against a subset of those blobs. They're seeing a performance hit, because the time to retrieve a full listing keeps growing.

This might not apply to your scenario, but it's something to consider...
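If you go with a single container, one way to avoid full listings might be to list by prefix, so only one virtual directory is enumerated at a time. The sketch below assumes the `list_blobs(name_starts_with=...)` method of `ContainerClient` in the azure-storage-blob v12 Python SDK; to keep it runnable without an Azure account, it uses a hypothetical in-memory `FakeContainer` that mimics just that one method. Whether this fully removes the listing cost described above is not something I have measured.

```python
from types import SimpleNamespace


def names_in_bucket(container_client, prefix):
    """Return the names of blobs under one virtual directory, e.g. 'hr0min0'."""
    # The trailing slash keeps 'hr0min0' from also matching a sibling
    # directory whose name merely starts with the same characters.
    return [b.name for b in container_client.list_blobs(name_starts_with=prefix + "/")]


class FakeContainer:
    """Hypothetical stand-in for ContainerClient so the sketch runs offline."""

    def __init__(self, names):
        self._names = sorted(names)

    def list_blobs(self, name_starts_with=None):
        prefix = name_starts_with or ""
        return (SimpleNamespace(name=n) for n in self._names if n.startswith(prefix))


fake = FakeContainer(["hr0min0/data.bin", "hr0min0/data2.bin", "hr0min30/data3.bin"])
print(names_in_bucket(fake, "hr0min0"))
```

With the real SDK, `container_client` would be a `ContainerClient` for the "blobs" container; the function body stays the same.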

There is one more factor that comes into this: price!

Currently the List and Create Container operations cost the same: US$0.054 per 10,000 calls.

Writing a blob actually costs the same.

So in the extreme case you can pay a lot more if you create and delete many containers.

  • delete is free

You can see the calculator here: https://azure.microsoft.com/en-us/pricing/calculator/
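To get a feel for the numbers, here is a rough back-of-the-envelope sketch using the price quoted above. The 15-minute bucket size is an assumption, and the price may well have changed, so treat the result as an order-of-magnitude estimate only.

```python
# US$ per 10,000 operations, as quoted in this answer (may be outdated).
PRICE_PER_10K = 0.054


def ops_cost(operations: int, price_per_10k: float = PRICE_PER_10K) -> float:
    """Cost in US$ of a number of billable operations."""
    return operations / 10_000 * price_per_10k


def daily_container_creates(bucket_minutes: int) -> int:
    """How many containers option 2 would create per day, one per time bucket."""
    return (24 * 60) // bucket_minutes


creates = daily_container_creates(15)  # 96 containers per day
print(f"{creates} creates/day cost about ${ops_cost(creates):.6f}")
```

At one container per time bucket the create cost is tiny; it only becomes noticeable when containers are created at a far higher rate, or when each processing pass also issues many List calls.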

https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist#partitioning

Understanding how Azure Storage partitions your blob data is useful for enhancing performance. Azure Storage can serve data in a single partition more quickly than data that spans multiple partitions. By naming your blobs appropriately, you can improve the efficiency of read requests.

Blob storage uses a range-based partitioning scheme for scaling and load balancing. Each blob has a partition key comprised of the full blob name (account+container+blob). The partition key is used to partition blob data into ranges. The ranges are then load-balanced across Blob storage.
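As a toy illustration of range-based partitioning (not Azure's actual implementation), the sketch below maps full blob names onto key ranges defined by sorted split points. The split points, the "/" separator in the key, and the account name "acct" are all made up; the real service picks and moves its ranges itself and does not expose them.

```python
from bisect import bisect_right


def partition_key(account: str, container: str, blob: str) -> str:
    """Full blob name used as the key (account + container + blob)."""
    return f"{account}/{container}/{blob}"


def range_index(key: str, boundaries: list) -> int:
    """Index of the key range that contains `key`, given sorted split points."""
    return bisect_right(boundaries, key)


# Hypothetical split points dividing the key space into three ranges.
boundaries = ["acct/blobs/hr1", "acct/blobs/hr2"]

for blob in ["hr0min0/data.bin", "hr0min30/data3.bin",
             "hr1min45/data.bin", "hr23min0/dataN.bin"]:
    key = partition_key("acct", "blobs", blob)
    print(range_index(key, boundaries), key)
```

In this toy model, blobs in the same container land in different ranges, while names sharing a prefix stay together. That matches the point made in the answers: the container is not the unit of load balancing, the blob name is.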