AVAILABLE FOR OPPORTUNITIES

RACHIT YADAV

DATA ENGINEER · ML ENGINEER

Architecting distributed data systems that process datasets from 100TB to 6PB with PySpark, AWS, and deep learning.


Building Data Platforms at Scale

Data Engineer with 7+ years of experience building large-scale distributed data systems and real-time pipelines. Strong Python and SQL fundamentals, paired with hands-on PyTorch experience implementing CNNs and other deep learning architectures.

Deep expertise in PySpark, AWS, and distributed computing, with a strong focus on data quality, validation, idempotency, and reproducible pipelines that feed analytics and ML consumers.

Currently designing data platforms that power analytics, marketing insights, and AI-driven decision systems at Smart Energy Water.

Location: Irvine, CA
Education: M.S. Data Science · USF
Pipeline Boost: +53% efficiency
Status: ● Active

Tech Stack in 3D

A constellation of tools I use to build large-scale data systems.

Languages & ML
Python · SQL · PyTorch · Pandas · CNNs
Distributed Processing
Apache Spark · PySpark · Databricks
Storage & Lakehouse
S3 · Delta Lake · Parquet · ORC
Orchestration
Airflow · AWS Step Functions
Cloud & Infra
AWS EMR · EC2 · Lambda · Kinesis · Terraform · Docker
Data Warehousing
Snowflake · Redshift · BigQuery · PostgreSQL · SQL Server

Career Timeline

Jan 2025 — Present
Smart Energy Water
Data Engineer · Irvine, CA
  • Designed PySpark ETL/ELT on AWS, processing 100TB→6PB; +53% pipeline efficiency.
  • Architected Snowflake + S3 + Delta Lake lakehouse for low-latency analytics.
  • Built dbt-powered aggregates improving downstream queries by 40%+.
Nov 2022 — Nov 2024
Meta (via SGS Consulting)
Data Engineer · San Francisco, CA
  • Scalable Spark pipelines for structured and semi-structured datasets.
  • Reduced data processing latency by 30%+ via near real-time ingestion.
  • Improved data accuracy & consistency by ~35% with quality frameworks.
Oct 2021 — Jun 2022
Cerenetics
Data Scientist · San Francisco, CA
  • Processed large-scale fMRI datasets under strict privacy controls.
  • Built 10+ ML/DL architectures in PyTorch; 0.85 AUC CNN on fMRI inputs.
  • Tracked 50+ model iterations enabling clean ablations.
Sep 2020 — Jul 2021
KPMG Global Services
Data Engineer · Bengaluru, India
  • Distributed Spark pipelines on 150TB+ regulatory datasets.
  • Modernized Hadoop ETL → PySpark, 25–30% performance gain.
  • Automated EMR provisioning with boto3, –60% operational overhead.
Jan 2020 — May 2020
Cognizant
Data Engineer · Pune, India
  • EDA on high-volume transactional & financial KPIs for pharma client.
  • Delivered actionable SQL-driven insights to stakeholders.
Jan 2017 — Jan 2020
Infosys
ML / Data Engineer · Pune, India
  • T-SQL ETL workflows supporting 10TB–50TB daily financial analytics.
  • Pandas EDA reduced manual effort by 60%.

Featured Work

Highlights from data engineering and ML projects — pipelines, platforms, and research.

Petabyte Lakehouse Platform

Architected a PySpark ETL/ELT data platform on AWS, scaling ingestion from 100TB to 6PB datasets. Built a Snowflake + S3 + Delta Lake lakehouse with dbt-powered aggregates, delivering a 53% pipeline efficiency gain and 40%+ query improvement.

PySpark · AWS EMR · Snowflake · Delta Lake · dbt · Airflow

fMRI Neural Activity Classifier

Built and evaluated 10+ ML/DL architectures in PyTorch for fMRI neuroimaging classification under strict privacy controls. Top CNN model achieved 0.85 AUC, with 50+ ablation iterations tracked for reproducibility.

PyTorch · CNN · Python · fMRI · Deep Learning

SQL Query Lab

An interactive browser-based SQL challenge engine with real-time query validation. Three progressively difficult data engineering scenarios — from basic filtering to complex HAVING clauses — with simulated result sets and instant hint feedback.

JavaScript · SQL · HTML · CSS

SQL Query Lab

Three real-world data engineering scenarios. Write SQL, get instant feedback. Can you ace all three?

EASY
Your ops team needs a report of all delivered orders with amount > $500, sorted highest first. Write the query.

Schema

orders
PK  order_id     INT
FK  customer_id  INT
    amount       DECIMAL
    status       VARCHAR
    created_at   TIMESTAMP
Your Query
order_id  customer_id  amount     status
1042      88           $1,240.00  delivered
987       54           $890.50    delivered
763       21           $714.00    delivered
501       67           $512.75    delivered
4 rows · simulated result
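
One possible answer to the EASY scenario, sketched with Python's built-in sqlite3 module so it can be run locally. The sample rows are made up for illustration, and the lowercase 'delivered' status value is an assumption taken from the simulated result; the Lab's own validator may accept other forms.

```python
import sqlite3

# Candidate answer: delivered orders above $500, highest amount first.
QUERY = """
SELECT order_id, customer_id, amount, status
FROM orders
WHERE status = 'delivered'   -- assumed spelling of the status value
  AND amount > 500           -- strictly greater than 500
ORDER BY amount DESC;
"""

def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE orders (
               order_id    INTEGER PRIMARY KEY,
               customer_id INTEGER,
               amount      DECIMAL,
               status      VARCHAR,
               created_at  TIMESTAMP)"""
    )
    # Illustrative rows only (not the Lab's data).
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
        [
            (1, 10, 1240.00, "delivered", "2024-01-05 10:00:00"),
            (2, 11, 512.75, "delivered", "2024-01-06 11:30:00"),
            (3, 12, 950.00, "pending", "2024-01-07 09:15:00"),    # wrong status
            (4, 13, 120.00, "delivered", "2024-01-08 14:45:00"),  # too small
            (5, 14, 500.00, "delivered", "2024-01-09 16:20:00"),  # boundary, excluded
        ],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows

rows = run_demo()
for row in rows:
    print(row)
```

Only orders 1 and 2 pass both filters in this sample; the $500.00 order is excluded because the scenario asks for amounts greater than $500.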
MEDIUM
Marketing wants total revenue by country for Gold-tier customers only. Join the tables and aggregate.

Schema

customers
PK  customer_id  INT
    name         VARCHAR
    country      VARCHAR
    tier         VARCHAR
orders
PK  order_id     INT
FK  customer_id  INT
    amount       DECIMAL
    status       VARCHAR
Your Query
country        total_revenue
United States  $48,320.00
Germany        $31,100.50
Japan          $22,750.00
Canada         $18,900.25
4 rows · simulated result
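
A sketch of one way to answer the MEDIUM scenario, again run through Python's sqlite3 module. The tier value 'Gold', the sample customers and orders, and the descending sort are assumptions for illustration. The scenario does not mention order status, so this version sums every order for Gold-tier customers.

```python
import sqlite3

# Candidate answer: revenue per country, Gold-tier customers only.
QUERY = """
SELECT c.country, SUM(o.amount) AS total_revenue
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE c.tier = 'Gold'          -- assumed spelling of the tier value
GROUP BY c.country
ORDER BY total_revenue DESC;   -- sort order is an assumption
"""

def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE customers (
               customer_id INTEGER PRIMARY KEY,
               name        VARCHAR,
               country     VARCHAR,
               tier        VARCHAR)"""
    )
    conn.execute(
        """CREATE TABLE orders (
               order_id    INTEGER PRIMARY KEY,
               customer_id INTEGER REFERENCES customers(customer_id),
               amount      DECIMAL,
               status      VARCHAR)"""
    )
    # Illustrative rows only (not the Lab's data).
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?, ?)",
        [
            (1, "Ana", "United States", "Gold"),
            (2, "Ben", "Germany", "Gold"),
            (3, "Chi", "Japan", "Silver"),  # filtered out by tier
            (4, "Dee", "United States", "Gold"),
        ],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [
            (1, 1, 100.00, "delivered"),
            (2, 4, 50.50, "delivered"),
            (3, 2, 75.00, "delivered"),
            (4, 3, 999.00, "delivered"),  # Silver customer, ignored
        ],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows

rows = run_demo()
for country, total in rows:
    print(country, total)
```

The inner join drops customers with no orders, which matches a revenue report. If only completed sales should count, a status filter could be added to the WHERE clause.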
HARD
Your data platform is flaky. Find pipeline names that failed more than 3 times in the last 30 days, showing the failure count.

Schema

pipeline_runs
PK  run_id         INT
    pipeline_name  VARCHAR
    status         VARCHAR
    run_date       DATE
Your Query
pipeline_name         failure_count
ingest_clickstream    7
sync_crm_daily        5
load_warehouse_facts  4
3 rows · simulated result

Let's Connect

Open to data engineering, ML engineering, and platform roles. Reach out anytime.