Architecting distributed data systems that process datasets at 100TB to 6PB scale with PySpark, AWS, and deep learning.
Data Engineer with 7+ years of experience building large-scale distributed data systems and real-time pipelines. Strong Python and SQL fundamentals paired with hands-on PyTorch experience implementing CNNs and other deep learning architectures.
Deep expertise in PySpark, AWS, and distributed computing, with a strong focus on data quality, validation, idempotency, and reproducible pipelines that feed analytics and ML consumers.
Currently designing data platforms that power analytics, marketing insights, and AI-driven decision systems at Smart Energy Water.
A constellation of tools I use to build large-scale data systems.
Highlights from data engineering and ML projects: pipelines, platforms, and research.
Architected a PySpark ETL/ELT data platform on AWS, scaling ingestion from 100TB to 6PB datasets. Built a Snowflake + S3 + Delta Lake lakehouse with dbt-powered aggregates, delivering a 53% pipeline efficiency gain and a 40%+ improvement in query performance.
Built and evaluated 10+ ML/DL architectures in PyTorch for fMRI neuroimaging classification under strict privacy controls. Top CNN model achieved 0.85 AUC, with 50+ ablation iterations tracked for reproducibility.
Three real-world data engineering scenarios. Write SQL, get instant feedback. Can you ace all three?
| order_id | customer_id | amount | status |
|---|---|---|---|
| 1042 | 88 | $1,240.00 | delivered |
| 987 | 54 | $890.50 | delivered |
| 763 | 21 | $714.00 | delivered |
| 501 | 67 | $512.75 | delivered |
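
One query shape that could produce a result like the table above is a status filter plus a descending sort on amount. This is only a sketch: the challenge's prompt and schema are not shown here, so the `orders` table, its columns, and the extra non-delivered rows are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical in-memory table; the real challenge schema is not shown in the page.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (763, 21, 714.00, "delivered"),
        (1042, 88, 1240.00, "delivered"),
        (330, 12, 2000.00, "cancelled"),  # illustrative row, filtered out by status
        (501, 67, 512.75, "delivered"),
        (415, 40, 95.00, "pending"),      # illustrative row, filtered out by status
        (987, 54, 890.50, "delivered"),
    ],
)

# Keep only delivered orders and list the largest amounts first.
top_delivered = conn.execute(
    """
    SELECT order_id, customer_id, amount, status
    FROM orders
    WHERE status = 'delivered'
    ORDER BY amount DESC
    """
).fetchall()

for row in top_delivered:
    print(row)
```
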
| country | total_revenue |
|---|---|
| United States | $48,320.00 |
| Germany | $31,100.50 |
| Japan | $22,750.00 |
| Canada | $18,900.25 |
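
A per-country revenue result like the one above is typically a `GROUP BY` with `SUM`, sorted by the aggregate. The sketch below assumes a single hypothetical `orders` table that already carries a `country` column; the real challenge may require a join to a customers table, and the individual order amounts are invented so that they add up to the totals shown.

```python
import sqlite3

# Hypothetical schema and rows, chosen only so the sums match the totals above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (order_id, country, amount) VALUES (?, ?, ?)",
    [
        (1, "United States", 40000.00),
        (2, "Germany", 31000.25),
        (3, "Japan", 22750.00),
        (4, "United States", 8320.00),
        (5, "Canada", 18900.25),
        (6, "Germany", 100.25),
    ],
)

# Aggregate revenue per country and rank countries by total, highest first.
revenue_by_country = conn.execute(
    """
    SELECT country, SUM(amount) AS total_revenue
    FROM orders
    GROUP BY country
    ORDER BY total_revenue DESC
    """
).fetchall()

for country, total in revenue_by_country:
    print(f"{country}: ${total:,.2f}")
```

In a production warehouse, monetary amounts would more likely be stored as a decimal type or integer cents than as floating point; `REAL` is used here only to keep the sketch short.
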
| pipeline_name | failure_count |
|---|---|
| ingest_clickstream | 7 |
| sync_crm_daily | 5 |
| load_warehouse_facts | 4 |
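
Failure counts per pipeline, as in the table above, can be sketched as a filtered `COUNT(*)` grouped by pipeline name. The `pipeline_runs` table, its `status` values, and the successful runs below are assumptions for illustration; the challenge's actual schema is not shown.

```python
import sqlite3

# Hypothetical run log: one row per pipeline execution with its final status.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pipeline_runs (pipeline_name TEXT, status TEXT)")
runs = (
    [("ingest_clickstream", "failed")] * 7
    + [("sync_crm_daily", "failed")] * 5
    + [("load_warehouse_facts", "failed")] * 4
    + [("ingest_clickstream", "success")] * 20  # successes must not be counted
    + [("export_reports", "success")] * 10      # no failures, so absent from the result
)
conn.executemany("INSERT INTO pipeline_runs VALUES (?, ?)", runs)

# Count failed runs per pipeline and list the most failure-prone first.
failures = conn.execute(
    """
    SELECT pipeline_name, COUNT(*) AS failure_count
    FROM pipeline_runs
    WHERE status = 'failed'
    GROUP BY pipeline_name
    ORDER BY failure_count DESC
    """
).fetchall()

for name, count in failures:
    print(name, count)
```
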
Open to data engineering, ML engineering, and platform roles. Reach out anytime.