API source data and intermediary storage

azure_learner 240 Reputation points
2024-08-22T04:04:42.67+00:00

I am moving data from a highly secure, compliance-bound system that exposes it as report as a service (RaaS), as discussed in my earlier questions below.

https://learn.microsoft.com/en-us/answers/questions/1858373/pre-data-validation-in-azure

https://learn.microsoft.com/en-us/answers/questions/1859125/repot-services-as-source-in-adf

My question is: when ingesting data through RaaS, what is the best practice? Should the data first be stored in Blob Storage, transformed with Databricks, and then pushed into the ADLS Gen2 landing zone? In that case the data would be duplicated, stored once in Blob Storage and again in ADLS. Or should it be pushed directly to the gold layer, skipping bronze and silver? Please suggest which approach I should follow.

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
1,466 questions
Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.
2,795 questions
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,164 questions

Accepted answer
  1. Nehruji R 7,556 Reputation points Microsoft Vendor
    2024-08-22T14:23:39.6033333+00:00

    Hello azure_learner,

    Greetings! Welcome to Microsoft Q&A Platform.

    Azure Data Lake Storage Gen2 isn't a dedicated service or account type. It's a set of capabilities, built on Azure Blob Storage, that support high-throughput analytic workloads. The Data Lake Storage Gen2 documentation provides best practices and guidance for using these capabilities; see https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices.

    It’s common to use Blob Storage as the initial landing zone for raw data. This allows for scalable and cost-effective storage. Using Databricks for data transformations is a good practice. It provides a powerful platform for processing large datasets and performing complex transformations. Alternatively, you can ingest data directly into ADLS Gen2 and use Databricks to read and transform data from there.
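
As a rough illustration of the direct-to-ADLS option, the Python sketch below builds the storage locations for each layer in a single ADLS Gen2 account. The account name `mydatalake`, the one-container-per-layer layout, and the `raas/reports` folder are hypothetical choices made for this example; only the `abfss://<container>@<account>.dfs.core.windows.net/<path>` URI shape is standard.

```python
from datetime import date

# Hypothetical names, for illustration only.
ACCOUNT = "mydatalake"
LAYERS = ("bronze", "silver", "gold")


def layer_uri(layer: str, dataset: str, load_date: date, account: str = ACCOUNT) -> str:
    """Build an abfss:// URI for one dataset partition in a medallion layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    # Assumed layout: one container per layer, date-partitioned folders inside it.
    partition = f"{load_date:%Y/%m/%d}"
    return f"abfss://{layer}@{account}.dfs.core.windows.net/{dataset}/{partition}"


raw_path = layer_uri("bronze", "raas/reports", date(2024, 8, 22))
print(raw_path)
```

With a layout like this, a Databricks job could read from the bronze URI and write to the silver one, so the raw extract lands only once and no separate Blob Storage copy is needed.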

    If you need to preserve raw data for reprocessing, you can store it in Blob Storage first and then move it to ADLS Gen2. This might seem like duplication, but it can be managed as follows:

    Deleting Raw Data: After transformation, you can delete the raw data from Blob Storage to avoid duplication.

    Silver Layer: Allows for intermediate transformations and quality checks.

    Gold Layer: Provides final, ready-to-use data for reporting.
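
The raw-data cleanup mentioned above could be sketched along these lines. This is a minimal example under stated assumptions: the `RawFile` record, the `processed` flag, and the 30-day retention window are all hypothetical, and a real pipeline would read this state from its own metadata store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RawFile:
    path: str
    landed_at: datetime
    processed: bool  # True once the downstream (silver) load has succeeded


def files_to_purge(files, now, retention=timedelta(days=30)):
    """Return raw files that are both processed and older than the retention window."""
    cutoff = now - retention
    return [f for f in files if f.processed and f.landed_at < cutoff]


now = datetime(2024, 9, 30, tzinfo=timezone.utc)
files = [
    RawFile("raas/2024/08/01/report.json", datetime(2024, 8, 1, tzinfo=timezone.utc), True),
    RawFile("raas/2024/08/01/failed.json", datetime(2024, 8, 1, tzinfo=timezone.utc), False),
    RawFile("raas/2024/09/20/report.json", datetime(2024, 9, 20, tzinfo=timezone.utc), True),
]
for f in files_to_purge(files, now):
    print("eligible for archive/delete:", f.path)
```

Requiring the `processed` flag as well as the age check keeps unprocessed raw files available for reprocessing. The purely age-based part may also be handled by an Azure Storage lifecycle management policy rather than custom code.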

    Perform transformations in Databricks, moving data from Bronze to Silver and then to Gold layers. Ensure raw data in the bronze layer is archived or deleted after processing to avoid unnecessary storage costs. This approach balances the need for data integrity, transformation flexibility, and storage efficiency.
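
To make the Bronze to Silver to Gold flow concrete, here is a small plain-Python sketch of the three stages. It stands in for what a Databricks job would normally do with Spark DataFrames; the sample rows, the column names, and the validation rule are hypothetical and only meant to show the shape of each step.

```python
from collections import defaultdict

# Hypothetical raw report rows as they might arrive from the source API.
bronze = [
    {"employee_id": "E1", "dept": "Finance", "hours": "8"},
    {"employee_id": "E1", "dept": "Finance", "hours": "8"},    # exact duplicate
    {"employee_id": "E2", "dept": "Finance", "hours": "7.5"},
    {"employee_id": "E3", "dept": "IT", "hours": "n/a"},       # fails validation
    {"employee_id": "E4", "dept": "IT", "hours": "6"},
]


def to_silver(rows):
    """Deduplicate rows and keep only those whose hours parse as a number."""
    seen, clean = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        try:
            hours = float(row["hours"])
        except ValueError:
            continue  # a real pipeline would quarantine rejected rows
        clean.append({**row, "hours": hours})
    return clean


def to_gold(rows):
    """Aggregate cleaned rows into total hours per department."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["dept"]] += row["hours"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'Finance': 15.5, 'IT': 6.0}
```

Keeping the untouched rows in bronze means the silver and gold steps can be rerun if a validation rule changes, which is the main argument against writing straight to the gold layer.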

    For more details, refer to: https://learn.microsoft.com/en-us/azure/data-explorer/ingest-data-overview, https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-ingestion, https://learn.microsoft.com/en-us/azure/databricks/ingestion/lakeflow-connect/workday/workday-reports/.

    Hope this answer helps! Please let us know if you have any further queries. I’m happy to assist you further.

    Please "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.
