Project Overview
As a seasoned growth engineer, I spearheaded the development of a comprehensive Data Lakehouse on AWS: a centralized repository designed to streamline data ingestion, improve storage efficiency, and enable fast, effective data analytics.
Challenge Addressed:
Faced with growing data volumes and the need for advanced analytics, I set out to establish a robust, scalable solution that could overcome historical data fragmentation and inconsistent naming.
Key Features:
- Engineered a two-tier data structure comprising Bronze (raw data) and Silver (processed data) layers, leaving room for future expansion while forgoing a Gold layer for initial simplicity.
- Streamlined data ingestion by using Deepnotes, improving the timeliness and accuracy of data transfers into the Bronze layer.
- Established precise, descriptive naming conventions across datasets to ensure clarity and uniformity, and standardized on Parquet with Snappy compression for its balance of compression ratio and read/write performance, while allowing other formats where specific use cases demand them (see the sketch after this list).
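
To make the layer layout and format choice concrete, here is a minimal sketch of a Bronze-layer write, assuming a single S3 bucket with bronze/ and silver/ prefixes and pandas with s3fs installed; the bucket name, dataset name, partition scheme, and columns are all illustrative, not the project's actual values.

```python
import pandas as pd

# Illustrative single-bucket layout with layer prefixes; the bucket name,
# dataset names, and date-partition scheme are hypothetical.
BUCKET = "acme-lakehouse"

def bronze_path(dataset: str, ingest_date: str) -> str:
    # Descriptive snake_case dataset names keep the convention uniform, e.g.
    # s3://acme-lakehouse/bronze/user_signups/ingest_date=2024-01-15/data.snappy.parquet
    return (f"s3://{BUCKET}/bronze/{dataset}/"
            f"ingest_date={ingest_date}/data.snappy.parquet")

df = pd.DataFrame({"user_id": [1, 2], "signed_up_at": ["2024-01-15", "2024-01-15"]})

# pandas writes directly to S3 when s3fs is installed; Snappy is already the
# Parquet default codec, spelled out here for clarity.
df.to_parquet(bronze_path("user_signups", "2024-01-15"),
              compression="snappy", index=False)
```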
Technical Contributions:
- Implemented data ingestion from Deepnotes into the AWS S3 Bronze bucket and built the Bronze-to-Silver transformation flows, supporting efficient data lifecycle management (sketched after this list).
- Adapted existing data room transformation functions to fit the new architectural framework.
- Enforced stringent data protection controls, including server-side encryption with Amazon S3-Managed Keys (SSE-S3) and minimal-permission IAM policies (see the final sketch below).
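
To illustrate the Bronze-to-Silver flow, here is a minimal sketch under the same assumptions as above (hypothetical bucket, dataset, and column names); the project's actual transformations were adapted from the existing data room functions, so generic cleaning steps stand in for them here.

```python
import pandas as pd

BUCKET = "acme-lakehouse"  # hypothetical, as above

def bronze_to_silver(dataset: str, ingest_date: str) -> None:
    """Read a raw Bronze partition, clean it, and write the Silver copy."""
    raw = pd.read_parquet(
        f"s3://{BUCKET}/bronze/{dataset}/ingest_date={ingest_date}/"
    )
    cleaned = (
        raw.drop_duplicates()
           .rename(columns=str.lower)   # enforce the naming convention
           .dropna(subset=["user_id"])  # hypothetical required column
    )
    cleaned.to_parquet(
        f"s3://{BUCKET}/silver/{dataset}/ingest_date={ingest_date}/data.snappy.parquet",
        compression="snappy",
        index=False,
    )

bronze_to_silver("user_signups", "2024-01-15")
```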
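
Finally, a sketch of the data protection controls using boto3, with illustrative bucket, key, and policy names: SSE-S3 maps to ServerSideEncryption="AES256" on upload, and the minimal-permission policy grants object read/write only on the lakehouse prefixes.

```python
import json
import boto3

s3 = boto3.client("s3")

# SSE-S3: S3 encrypts the object at rest with Amazon-managed keys (AES-256).
with open("data.snappy.parquet", "rb") as body:
    s3.put_object(
        Bucket="acme-lakehouse",  # hypothetical bucket
        Key="bronze/user_signups/ingest_date=2024-01-15/data.snappy.parquet",
        Body=body,
        ServerSideEncryption="AES256",
    )

# Minimal-permission policy: object read/write on the lakehouse prefixes only.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": [
            "arn:aws:s3:::acme-lakehouse/bronze/*",
            "arn:aws:s3:::acme-lakehouse/silver/*",
        ],
    }],
}
boto3.client("iam").create_policy(
    PolicyName="lakehouse-rw-minimal",  # hypothetical policy name
    PolicyDocument=json.dumps(policy),
)
```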