Microsoft Azure Data Engineer Associate (DP-203) Cert Prep by Microsoft Press

9h 27m | Intermediate | 2024-09-17

Authors

Microsoft Press, Microsoft

Tim Warner, Technical Trainer and Content Developer

Course details

In this course, Microsoft MVP Tim Warner walks you through what to expect on the DP-203 Data Engineering on Microsoft Azure exam, covering every Exam DP-203 objective in a friendly and logical way. Tim dives into the intricacies of data engineering on Microsoft Azure, focusing on deploying efficient, secure, and robust data processing solutions. Learn how to design and implement diverse data storage strategies, including leveraging Azure Synapse Analytics for managing massive datasets efficiently. Discover techniques for data compression, partitioning, and sharding to optimize storage and access speed. Investigate table geometries, data redundancy, and archival methods to ensure data is both accessible and protected. Ideal for IT professionals, data scientists, and anyone interested in the data engineering capabilities of Azure, this course empowers you to build scalable data solutions and ensure that your data-driven applications perform seamlessly.

Skills covered

Cloud Storage, Cloud Administration, Data Engineering, Azure, Cloud Platforms, Cert Prep, Cloud Computing, Data Science, Microsoft

Concepts

0. Introduction

01 - Introduction

1. Design and Implement Data Storage

02 - Learning objectives
03 - Design an Azure Data Lake solution
04 - Recommend file types for storage
05 - Recommend file types for analytical queries
06 - Design for efficient querying

2. Design for Data Pruning

07 - Learning objectives
08 - Design a folder structure that represents levels of data transformation
09 - Design a distribution strategy
10 - Design a data archiving solution

3. Design a Partition Strategy

11 - Learning objectives
12 - Design a partition strategy for files
13 - Design a partition strategy for analytical workloads
14 - Design a partition strategy for efficiency and performance
15 - Design a partition strategy for Azure Synapse Analytics
16 - Identify when partitioning is needed in Azure Data Lake Storage Gen2

4. Design the Serving Layer

17 - Learning objectives
18 - Design star schemas
19 - Design slowly changing dimensions
20 - Design a dimensional hierarchy
21 - Design a solution for temporal data
22 - Design for incremental loading
23 - Design analytical stores
24 - Design metastores in Azure Synapse Analytics and Azure Databricks

5. Implement Physical Data Storage Structures

25 - Learning objectives
26 - Implement compression
27 - Implement partitioning
28 - Implement sharding
29 - Implement different table geometries with Azure Synapse Analytics pools
30 - Implement data redundancy
31 - Implement distributions
32 - Implement data archiving

6. Implement Logical Data Structures

33 - Learning objectives
34 - Build a temporal data solution
35 - Build a slowly changing dimension
36 - Build a logical folder structure
37 - Build external tables
38 - Implement file and folder structures for efficient querying and data pruning

7. Implement the Serving Layer

39 - Learning objectives
40 - Deliver data in a relational star schema
41 - Deliver data in Parquet files
42 - Maintain metadata
43 - Implement a dimensional hierarchy

8. Ingest and Transform Data

44 - Learning objectives
45 - Transform data by using Apache Spark
46 - Transform data by using Transact-SQL
47 - Transform data by using Data Factory
48 - Transform data by using Azure Synapse pipelines
49 - Transform data by using Stream Analytics

9. Work with Transformed Data

50 - Learning objectives
51 - Cleanse data
52 - Split data
53 - Shred JSON
54 - Encode and decode data

10. Troubleshoot Data Transformations

55 - Learning objectives
56 - Configure error handling for the transformation
57 - Normalize and denormalize values
58 - Transform data by using Scala
59 - Perform data exploratory analysis

11. Design a Batch Processing Solution

60 - Learning objectives
61 - Develop batch processing solutions by using Data Factory, Data Lake, Spark, Azure Synapse pipelines, PolyBase, and Azure Databricks
62 - Create data pipelines
63 - Design and implement incremental data loads
64 - Design and develop slowly changing dimensions
65 - Handle security and compliance requirements
66 - Scale resources

12. Develop a Batch Processing Solution

67 - Learning objectives
68 - Configure the batch size
69 - Design and create tests for data pipelines
70 - Integrate Jupyter and Python Notebooks into a data pipeline
71 - Handle duplicate data
72 - Handle missing data
73 - Handle late-arriving data

13. Configure a Batch Processing Solution

74 - Learning objectives
75 - Upsert data
76 - Regress to a previous state
77 - Design and configure exception handling
78 - Configure batch retention
79 - Revisit batch processing solution design
80 - Debug Spark jobs by using the Spark UI

14. Design a Stream Processing Solution

81 - Learning objectives
82 - Develop a stream processing solution by using Stream Analytics, Azure Databricks, and Azure Event Hubs
83 - Process data by using Spark structured streaming
84 - Monitor for performance and functional regressions
85 - Design and create windowed aggregates
86 - Handle schema drift

15. Process Data in a Stream Processing Solution

87 - Learning objectives
88 - Process time series data
89 - Process across partitions
90 - Process within one partition
91 - Configure checkpoints and watermarking during processing
92 - Scale resources
93 - Design and create tests for data pipelines
94 - Optimize pipelines for analytical or transactional purposes

16. Troubleshoot a Stream Processing Solution

95 - Learning objectives
96 - Handle interruptions
97 - Design and configure exception handling
98 - Upsert data
99 - Replay archived stream data
100 - Design a stream processing solution

17. Manage Batches and Pipelines

101 - Learning objectives
102 - Trigger batches
103 - Handle failed batch loads
104 - Validate batch loads
105 - Manage data pipelines in Data Factory and Synapse pipelines
106 - Schedule data pipelines in Data Factory and Synapse pipelines
107 - Implement version control for pipeline artifacts
108 - Manage Spark jobs in a pipeline

18. Design Security for Data Policies

109 - Learning objectives
110 - Design data encryption for data at rest and in transit
111 - Design a data auditing strategy
112 - Design a data masking strategy
113 - Design for data privacy

19. Design Security for Data Standards

114 - Learning objectives
115 - Design a data retention policy
116 - Design to purge data based on business requirements
117 - Design Azure RBAC and POSIX-like ACL for Data Lake Storage Gen2
118 - Design row-level and column-level security

20. Implement Data Security Protection

119 - Learning objectives
120 - Implement data masking
121 - Encrypt data at rest and in motion
122 - Implement row-level and column-level security
123 - Implement Azure RBAC
124 - Implement POSIX-like ACLs for Data Lake Storage Gen2
125 - Implement a data retention policy
126 - Implement a data auditing strategy

21. Implement Data Security Access

127 - Learning objectives
128 - Manage identities, keys, and secrets across different data platforms
129 - Implement secure endpoints - Private and public
130 - Implement resource tokens in Azure Databricks
131 - Load a DataFrame with sensitive information
132 - Write encrypted data to tables or Parquet files
133 - Manage sensitive information

22. Monitor Data Storage

134 - Learning objectives
135 - Implement logging used by Azure Monitor
136 - Configure monitoring services
137 - Measure performance of data movement
138 - Monitor and update statistics about data across a system
139 - Monitor data pipeline performance
140 - Measure query performance

23. Monitor Data Processing

141 - Learning objectives
142 - Monitor cluster performance
143 - Understand custom logging options
144 - Schedule and monitor pipeline tests
145 - Interpret Azure Monitor metrics and logs
146 - Interpret a Spark Directed Acyclic Graph (DAG)

24. Tune Data Storage

147 - Learning objectives
148 - Compact small files
149 - Rewrite user-defined functions (UDFs)
150 - Handle skew in data
151 - Handle data spill
152 - Tune shuffle partitions
153 - Find shuffling in a pipeline
154 - Optimize resource management

25. Optimize and Troubleshoot Data Processing

155 - Learning objectives
156 - Tune queries by using indexers
157 - Tune queries by using cache
158 - Optimize pipelines for analytical or transactional purposes
159 - Optimize pipeline for descriptive versus analytical workloads
160 - Troubleshoot failed Spark jobs
161 - Troubleshoot failed pipeline runs

Conclusion

162 - Summary