Designing end-to-end solutions

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

Automation enables you to avoid some or all of the manual operations required to perform your specific big data processing tasks. Unless you are simply experimenting with some data, you will probably want to create a completely automated end-to-end solution. For example, you may want to make a solution repeatable without requiring manual interaction every time, perhaps incorporate a workflow, and even execute the entire solution automatically on a schedule. HDInsight supports a range of technologies and techniques to help you achieve this, several of which are used in the example scenario you’ve already seen in this guide.

You can think of an end-to-end big data solution as being a process that encompasses multiple discrete sub-processes. Throughout this guide you have seen how to automate these individual sub-processes using a range of tools such as Windows PowerShell, the .NET SDK for HDInsight, SQL Server Integration Services, Oozie, and command line tools.
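As a simple illustration, a single sub-process such as running a Hive job can be automated with the Azure HDInsight PowerShell cmdlets used elsewhere in this guide. The following is a minimal sketch only; the cluster name and query are hypothetical placeholders you would replace with your own values.

```powershell
# Sketch: automate one sub-process (running a Hive job) using the
# classic Azure HDInsight PowerShell cmdlets. The cluster name and
# query below are placeholder values for illustration.
$clusterName = "my-hdinsight-cluster"

# Define the Hive job to run.
$hiveJob = New-AzureHDInsightHiveJobDefinition `
    -Query "SELECT country, COUNT(*) FROM visitors GROUP BY country;"

# Submit the job and wait for it to complete.
$job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $hiveJob
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600

# Retrieve the job output so it can be logged or passed to the next sub-process.
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardOutput
```

Scripts such as this can then be chained together, or invoked from a scheduler or an SSIS package, to build up the complete end-to-end process.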

A typical big data process might consist of the following sub-processes:

However, before beginning to design an automated solution, it is sensible to start by identifying the dependencies and constraints in your specific data processing scenario, and considering the requirements for each stage in the overall solution. For example, you must consider how to coordinate the automation of these operations as a whole, as well as planning the scheduling of each discrete task.

This section includes the following topics related to designing automated end-to-end solutions:

Considerations

Consider the following points when designing and implementing end-to-end solutions around HDInsight:

  • Analyze the requirements for the solution before you start to implement automation. Consider factors such as how the data will be collected, the rate at which it arrives, the timeliness of the results, the need for quick access to aggregated results, and the consequent impact of the speed of processing each batch. All of these factors will influence the processes and technologies you choose, the batch size for each process, and the overall scheduling for the solution.
  • Automating a solution can help to minimize errors for tasks that are repeated regularly. By setting permissions on the client-side applications that initiate jobs and access the data, you can also ensure that only authorized users can execute them. Automation is likely to be necessary for all types of solutions except those where you are just experimenting with data and processes.
  • The individual tasks in your solutions will have specific dependencies and constraints that you must accommodate to achieve the best overall data processing workflow. Typically these dependencies are time based and affect how you orchestrate and schedule the tasks and processes. Not only must they execute in the correct order, but you may also need to ensure that specific tasks will be completed before the next one begins. See Workflow dependencies and constraints for more information.
  • Consider if you need to automate the creation of storage accounts to hold the cluster data, and decide when this should occur. HDInsight can automatically create one or more linked storage accounts for the data as part of the cluster provisioning process. Alternatively, you can automate the creation of linked storage accounts before you create a cluster, and non-linked storage accounts before or after you create a cluster. For example, you might automate creating a new storage account, loading the data, creating a cluster that uses the new storage account, and then executing a job. For more information about linked and non-linked storage accounts see Cluster and storage initialization in the section Collecting and loading data into HDInsight.
  • Consider the end-to-end security of your solution. You must protect the data from unauthorized access and tampering when it is in storage and on the wire, and secure the cluster as a whole to prevent unauthorized access. See Security for more details.
  • As with any complex multi-step solution, it is important to make monitoring and troubleshooting as easy as possible by maintaining detailed logs of the individual stages of the overall process. This typically requires comprehensive exception handling, as well as planning how to log the information. See Monitoring and logging for more information.
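The provisioning sequence mentioned in the considerations above (create a storage account, load the data, create a cluster that uses the new account, and then execute a job) can be sketched with the classic Azure PowerShell cmdlets. All of the names, paths, and sizes here are hypothetical placeholders, and the sketch assumes your Azure subscription is already configured for the current PowerShell session.

```powershell
# Sketch: automate storage creation, data upload, cluster provisioning,
# and job execution as one sequence. All names and paths are placeholders.
$storageAccount = "mydatastore"
$containerName  = "data"
$clusterName    = "my-hdinsight-cluster"
$location       = "North Europe"

# 1. Create the storage account that will hold the cluster data.
New-AzureStorageAccount -StorageAccountName $storageAccount -Location $location
$key = (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary
$context = New-AzureStorageContext -StorageAccountName $storageAccount `
    -StorageAccountKey $key

# 2. Load the source data into a blob container.
New-AzureStorageContainer -Name $containerName -Context $context
Set-AzureStorageBlobContent -File "C:\data\visitors.csv" `
    -Container $containerName -Blob "visitors/visitors.csv" -Context $context

# 3. Create a cluster that uses the new storage account as its default store.
$creds = Get-Credential
New-AzureHDInsightCluster -Name $clusterName -Location $location `
    -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
    -DefaultStorageAccountKey $key `
    -DefaultStorageContainerName $containerName `
    -ClusterSizeInNodes 4 -Credential $creds

# 4. Execute a job on the new cluster.
$hiveJob = New-AzureHDInsightHiveJobDefinition -Query "SHOW TABLES;"
$job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $hiveJob
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
```

Because the storage account is created before the cluster in this sketch, the account becomes a linked (default) storage account when the cluster is provisioned; non-linked accounts could equally be created at step 1 or after the cluster exists, as described above.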
