To handle billions of documents in CosmosDB, and especially given the hot partitions you are seeing with a monotonically increasing partition key, you need to re-evaluate your partitioning strategy. I'll address each point from your question in turn:
Is hashedContainerID a good partition key given that for a key I have a range of values up to 10,000 entries?
Yes, partitioning on containerID (hashed or not) could be a good approach. It moves you away from a key where all current writes share the same value, which is the primary reason behind the hot partition you're experiencing. CosmosDB already hashes the partition key value to decide which physical partition holds each logical partition, so what matters is that containerID has many distinct values and that writes are spread across them; a custom hash on top does not by itself make the distribution more even. With at most about 10,000 documents per key, each logical partition stays modest in size (provided the documents themselves are not large), so load is spread over many logical partitions instead of all writes being concentrated on a single one.
Does this solution avoid hot partitions while writing documents?
Yes, partitioning on containerID should prevent hot partitions during document writes, provided the writes arriving at any given time are spread over many containerID values. One of the major problems with your current partitioning strategy is that writes for a given day all target the same partition, creating a bottleneck. Distributing documents by containerID breaks up that concentration, so writes draw on the RU/s (Request Units per second) of more of your physical partitions instead of exhausting the share of one.
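The difference between the two keys can be seen in a small simulation. This is only a sketch under stated assumptions: the hash (MD5), the modulo placement, the count of 10 physical partitions, and the container-N IDs are stand-ins chosen for illustration, not CosmosDB's actual hash function or range-based placement.

```python
import hashlib
from collections import Counter

PHYSICAL_PARTITIONS = 10  # assumed count, for illustration only


def physical_partition(partition_key: str, n: int = PHYSICAL_PARTITIONS) -> int:
    """Map a partition key value to a simulated physical partition.

    MD5 plus modulo is a stand-in for the service's internal hashing.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def writes_per_partition(keys):
    """Count how many writes land on each simulated physical partition."""
    return Counter(physical_partition(k) for k in keys)


# One day of writes: 100,000 documents over 1,000 hypothetical container IDs.
writes = [("2024-01-15", f"container-{i % 1000}") for i in range(100_000)]

by_day = writes_per_partition(day for day, _ in writes)
by_container = writes_per_partition(cid for _, cid in writes)

print("partitions hit, keyed by day:        ", len(by_day))
print("partitions hit, keyed by containerID:", len(by_container))
print("busiest share, keyed by day:        ", max(by_day.values()) / len(writes))
print("busiest share, keyed by containerID:", max(by_container.values()) / len(writes))
```

Keyed by day, every write hashes to the same partition because every write carries the same key value. Keyed by containerID, the same writes spread over all simulated partitions without any extra hashing on your side.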
Can this shard key scale up to billions of documents?
Yes, this partition key should scale effectively to billions of documents. CosmosDB adds physical partitions as data and throughput grow, and a key with many distinct values keeps spreading load across them as the dataset gets larger. However, the choice of partition key should always be based on the query patterns you expect. Since your queries frequently filter on containerID, partitioning by containerID lets them target a single partition instead of fanning out to all of them. If you store a hash of containerID as the key, queries have to compute the same hash to get that benefit. Just make sure that any future queries can also take advantage of this partition key.
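A quick back-of-the-envelope check helps confirm that a key fits CosmosDB's limits at this scale. A logical partition (all documents sharing one partition key value) is capped at 20 GB. The sketch below uses assumed figures, a 5 KB average document and 2 billion documents in total; substitute your own numbers.

```python
# 20 GB cap per partition key value (treated as binary gigabytes here).
LOGICAL_PARTITION_LIMIT_BYTES = 20 * 1024**3


def logical_partition_bytes(docs_per_key: int, avg_doc_bytes: int) -> int:
    """Approximate storage used by one partition key value."""
    return docs_per_key * avg_doc_bytes


def fits_in_logical_partition(docs_per_key: int, avg_doc_bytes: int) -> bool:
    """True if one key's documents stay within the logical partition cap."""
    return logical_partition_bytes(docs_per_key, avg_doc_bytes) <= LOGICAL_PARTITION_LIMIT_BYTES


def distinct_keys_needed(total_docs: int, docs_per_key: int) -> int:
    """Minimum number of distinct key values to hold total_docs (ceiling division)."""
    return -(-total_docs // docs_per_key)


docs_per_key = 10_000            # from the question: up to 10,000 entries per key
avg_doc_bytes = 5 * 1024         # assumed average document size (5 KB)
total_docs = 2_000_000_000       # assumed total, "billions of documents"

size = logical_partition_bytes(docs_per_key, avg_doc_bytes)
print(f"per-key storage: {size / 1024**2:.1f} MiB")
print("within 20 GB cap:", fits_in_logical_partition(docs_per_key, avg_doc_bytes))
print("distinct keys needed:", distinct_keys_needed(total_docs, docs_per_key))
```

Under these assumptions each containerID uses a small fraction of the cap, and billions of documents imply hundreds of thousands of distinct key values, which gives the service plenty of logical partitions to spread across physical ones.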
Is it necessary to manually create a hash, or can I even use containerID?
In CosmosDB, you don't need to hash containerID manually. CosmosDB automatically hashes partition key values behind the scenes, so you can simply use containerID as the partition key and CosmosDB will handle distributing the data. The important consideration is whether containerID has enough distinct values and whether writes are spread evenly among them. Hashing changes neither property: a monotonically increasing containerID is not a problem in itself, because the service hashes it anyway, while a single containerID that receives a disproportionate share of writes stays hot after hashing. If you do run into that, a synthetic key (for example, containerID combined with a suffix) can give you finer control over distribution, but this step is not strictly necessary.
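If a single containerID ever did become too hot or too large, one common way to get that finer control is a synthetic partition key that splits it into a few buckets. The following is a minimal sketch, not something your current data appears to need: the function names, the bucket count of 4, and the use of the document ID to pick a bucket are all hypothetical choices for illustration.

```python
import hashlib
from typing import List


def synthetic_partition_key(container_id: str, document_id: str, buckets: int = 4) -> str:
    """Build '<containerID>-<bucket>' with a bucket derived from the document ID.

    The bucket is deterministic, so a point read that knows both IDs
    can recompute the same partition key.
    """
    if buckets < 1:
        raise ValueError("buckets must be at least 1")
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return f"{container_id}-{bucket}"


def all_partition_keys(container_id: str, buckets: int = 4) -> List[str]:
    """List every partition key a query for one containerID has to cover."""
    return [f"{container_id}-{b}" for b in range(buckets)]


key = synthetic_partition_key("container-42", "doc-0001")
print(key)
print(all_partition_keys("container-42"))
```

The trade-off is on the read side: a query for everything in one containerID now has to cover every bucket for that ID instead of a single partition key value, so the bucket count should stay as small as the write load allows.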