Find duplicate values of an indexed field across a collection & is it possible to restore a backup to a new DB account with a unique key constraint

Srinivas54 20 Reputation points
2024-12-19T17:21:31.7333333+00:00

We have a Cosmos DB database using the MongoDB API. It currently has more than 50 million records, a few indexed fields, and is sharded with a shard key. Due to a technical issue, our application has inserted duplicate transactions. We want to query the database (at least in monthly timeframes) to find whether there are any duplicate transactions.

We tried the query below for a one-day timespan:

db.collection.aggregate([
  {
    $match: {
      TimestampField: {
        $gt: "20241207000000000000",
        $lte: "20241208000000000000"
      }
    }
  },
  {
    $group: {
      _id: "$TransID",
      count: { $sum: 1 },
      docs: { $push: "$_id" }   // collect the _ids in each group so duplicates can be inspected
    }
  },
  {
    $match: {
      count: { $gt: 1 }
    }
  },
  {
    $project: {
      TransID: "$_id",
      count: 1,
      docs: 1
    }
  }
])

For a one-day timeframe it consumes around 14,000 RUs. We plan to increase our maximum throughput to 20,000 RU/s and try monthly timeframes. Is it possible to optimize the query? We are using MongoDB server version 4.0; we tried the allowDiskUse: true option but saw no drop in the RUs consumed. The TransID field is already indexed. Is there a solution to find duplicate TransIDs across the database?

In future, we don't want to create a transaction with a duplicate TransID in the database. We have not implemented a unique key policy for the current database. Is there any possibility of adding the policy now, after removing the records with duplicate TransID values? We really need a primary key field; we couldn't use _id because the collection is sharded. We now plan to implement controls so that consumers of the data have a reliable primary key.

If we cannot apply a unique key policy for the TransID field to an existing database, is it possible to copy the existing data to a new database after applying the policy to it? What would be the best option?


1 answer

  1. Sina Salam 14,551 Reputation points
    2024-12-19T21:01:51.96+00:00

    Hello Srinivas54,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you would like to find duplicate values of an indexed field across a collection, and you are also asking whether it is possible to restore a backup to a new database account with a unique key constraint.

    Regarding your explanation and questions, this is a fairly complex scenario, so I will outline steps for efficient database operations, short-term resolution of the duplicate issue, and long-term prevention of duplicate transactions while maintaining system performance and data consistency.

    Three points for the main challenges:

    1. To improve the efficiency of the query and minimize the high RU consumption, focus on shard-aware queries and payload reduction. Including the shard key in the $match stage ensures the query targets only the relevant shards, significantly reducing the computational load. For example:
         db.collection.aggregate([
           {
             $match: {
               ShardKeyField: "value",
               TimestampField: {
                 $gt: "20241207000000000000",
                 $lte: "20241208000000000000"
               }
             }
           },
           ...
         ]);
      
      Additionally, use $project early in the pipeline to restrict the fields processed in subsequent stages. To further refine your query, analyze its execution plan and check which features your account supports in the Azure Cosmos DB for MongoDB compatibility documentation - https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/mongodb-compatibility. These optimizations are practical for reducing RU consumption; a fuller pipeline is shown in the first sketch after this list.
    2. To avoid duplicate transactions in the future, consider implementing unique constraints and improving your sharding strategy. Cosmos DB with the MongoDB API does not support adding unique keys to existing, populated collections directly, but you can enforce uniqueness on a new, still-empty collection. A unique index can be created on the TransID field with: db.collection.createIndex({ "TransID": 1 }, { unique: true }); (the second sketch after this list shows this being applied to the empty target collection). For better distribution, you may also use a composite shard key, such as combining TransID with a timestamp or user ID, to enhance the uniqueness of keys across shards. Refer to the Cosmos DB indexing guide - https://learn.microsoft.com/en-us/azure/cosmos-db/indexing-policies for further information.
    3. To migrate data efficiently while removing duplicates, use tools and SDKs designed for high-throughput operations, such as the Azure Cosmos DB .NET or Python SDKs. A recommended approach is to deduplicate during migration with an aggregation that groups by TransID and keeps a single record per value: db.oldCollection.aggregate([ { $group: { _id: "$TransID", latestRecord: { $first: "$$ROOT" } } } ]).forEach(doc => { db.newCollection.insert(doc.latestRecord); }); Note that $first only returns the latest record if the documents are sorted by timestamp beforehand (see the second sketch after this list). For large datasets, consider using Cosmos DB's bulk executor library to process records in batches and reduce the risk of throttling. During migration, adopt a dual-write strategy by writing new transactions to both the old and new collections simultaneously; this minimizes downtime and ensures a seamless transition. For more on bulk operations, see https://learn.microsoft.com/en-us/azure/cosmos-db/sql/bulk-executor-overview
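
    As referenced in point 1 above, here is a minimal sketch of the shard-aware, projection-first duplicate scan, with allowDiskUse enabled for the $group stage. ShardKeyField, TimestampField, and TransID are placeholder field names, and the window values assume your existing timestamp string format; adjust both to your schema.

      // Sketch only: shard-aware duplicate scan with an early $project.
      db.collection.aggregate(
        [
          {
            $match: {
              ShardKeyField: "value",                     // target the relevant shard where possible
              TimestampField: {
                $gt: "20241207000000000000",
                $lte: "20241208000000000000"
              }
            }
          },
          { $project: { TransID: 1 } },                   // carry only what later stages need (_id is kept by default)
          { $group: { _id: "$TransID", count: { $sum: 1 }, docs: { $push: "$_id" } } },
          { $match: { count: { $gt: 1 } } },
          { $project: { _id: 0, TransID: "$_id", count: 1, docs: 1 } }
        ],
        { allowDiskUse: true }                            // let the $group spill to disk for large windows
      );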
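
    And for point 3, a sketch of deduplicating while copying to a new collection, under these assumptions: newCollection already exists and is still empty so the unique index can be created, TimestampField orders the records (a $sort is required for $first to really mean "latest"), and inserts are batched with insertMany to limit round trips. If the target collection is sharded, the unique index will typically also need to include the shard key, as in native MongoDB.

      // Sketch only: create the unique index first (collection must still be empty),
      // then copy the latest record per TransID in small batches.
      db.newCollection.createIndex({ "TransID": 1 }, { unique: true });

      var batch = [];
      db.oldCollection.aggregate(
        [
          { $sort: { TimestampField: -1 } },              // newest first, so $first picks the latest record
          { $group: { _id: "$TransID", latestRecord: { $first: "$$ROOT" } } }
        ],
        { allowDiskUse: true }
      ).forEach(function (doc) {
        batch.push(doc.latestRecord);
        if (batch.length === 100) {                       // small batches reduce the chance of throttling
          db.newCollection.insertMany(batch, { ordered: false });
          batch = [];
        }
      });
      if (batch.length > 0) {
        db.newCollection.insertMany(batch, { ordered: false });
      }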

    Regarding your major questions:

    1. To find duplicates in an indexed field (TransID) across the collection, including large datasets, you can use the aggregation pipeline as you initially attempted. For collections with an indexed TransID, this query identifies duplicates: db.collection.aggregate([ { $group: { _id: "$TransID", count: { $sum: 1 } } }, { $match: { count: { $gt: 1 } } }, { $project: { TransID: "$_id", count: 1 } } ]); It groups and counts records by TransID, then filters out groups with only one record. To further optimize, limit the search range with a shard key or date filter, for example: $match: { TimestampField: { $gte: "start_date", $lt: "end_date" } } For massive datasets, consider breaking the query into smaller timeframes (a month-by-month sketch is included after the steps below) or querying different shards in parallel. Monitor the Request Units (RUs) consumed using Azure Cosmos DB's performance analysis guidance - https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/analyze-performance
    2. Yes, it is possible to restore a backup to a new Cosmos DB account while applying a unique key constraint. You can follow these steps:
      1. Use the Data Migration Tool - https://learn.microsoft.com/en-us/azure/cosmos-db/import-data or an Azure Function to export your collection data to a storage account or a file.
      2. Set up a new Cosmos DB account with the MongoDB API. During collection creation, while the collection is still empty, define the unique constraint: db.newCollection.createIndex({ "TransID": 1 }, { unique: true });
      3. Before importing data, run the deduplication query (see above) to ensure no duplicate records exist.
      4. Use tools such as the Data Migration Tool, Azure Data Factory, or a custom script to import the deduplicated data into the new collection. For efficient bulk inserts, leverage the Cosmos DB bulk executor library - https://learn.microsoft.com/en-us/azure/cosmos-db/sql/bulk-executor-overview
      5. Perform a query to verify the absence of duplicates and confirm the integrity of your data (see the second sketch below).
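
    As mentioned in answer 1 above, splitting the duplicate scan into smaller windows keeps the RU charge of each aggregation bounded. A minimal month-by-month sketch, assuming the same placeholder field names and your existing timestamp string format:

      // Sketch only: scan one month at a time and collect duplicate TransIDs.
      var windows = [
        ["20241101000000000000", "20241201000000000000"],
        ["20241201000000000000", "20250101000000000000"]
      ];
      var duplicates = [];
      windows.forEach(function (w) {
        db.collection.aggregate(
          [
            { $match: { TimestampField: { $gt: w[0], $lte: w[1] } } },
            { $project: { TransID: 1 } },
            { $group: { _id: "$TransID", count: { $sum: 1 } } },
            { $match: { count: { $gt: 1 } } }
          ],
          { allowDiskUse: true }
        ).forEach(function (d) { duplicates.push(d._id); });
      });
      // Caveat: a TransID that occurs once in each of two different windows is not flagged here;
      // if that can happen, a final cross-window check on TransID alone is still required.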
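
    And for step 5, a quick verification sketch: after the import, the aggregation below should return no documents, and with the unique index in place a repeated insert of an existing TransID should be rejected with a duplicate key error.

      // Sketch only: confirm the new collection contains no duplicate TransID values.
      db.newCollection.aggregate(
        [
          { $group: { _id: "$TransID", count: { $sum: 1 } } },
          { $match: { count: { $gt: 1 } } }
        ],
        { allowDiskUse: true }
      );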

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close the thread here by upvoting and accepting the answer if it is helpful.

