AVAILABLE FOR OPPORTUNITIES

RACHIT YADAV

DATA ENGINEER · ML ENGINEER

Architecting distributed data systems that process datasets from 100 TB to 6 PB with PySpark, AWS, and deep learning.

01 / ABOUT

Building Data Platforms at Scale

Data Engineer with 7+ years of experience building large-scale distributed data systems and real-time pipelines. Strong Python and SQL fundamentals paired with hands-on PyTorch experience implementing CNNs and other deep learning architectures.

Deep expertise in PySpark, AWS, and distributed computing, with a strong focus on data quality, validation, idempotency, and reproducible pipelines that feed analytics and ML consumers.

Currently designing data platforms that power analytics, marketing insights, and AI-driven decision systems at Smart Energy Water.

Location: Irvine, CA

Education: M.S. Data Science · USF

Pipeline Boost: +53% efficiency

Status: ● Active

02 / SKILLS

Tech Stack in 3D

A constellation of tools I use to build large-scale data systems.

Languages & ML

Python · SQL · PyTorch · Pandas · CNNs

Distributed Processing

Apache Spark · PySpark · Databricks

Storage & Lakehouse

S3 · Delta Lake · Parquet · ORC

Orchestration

Airflow · AWS Step Functions

Cloud & Infra

AWS EMR · EC2 · Lambda · Kinesis · Terraform · Docker

Data Warehousing

Snowflake · Redshift · BigQuery · PostgreSQL · SQL Server

03 / EXPERIENCE

Career Timeline

Jan 2025 — Present

Smart Energy Water

Data Engineer · Irvine, CA

Designed PySpark ETL/ELT pipelines on AWS processing datasets from 100 TB to 6 PB, improving pipeline efficiency by 53%.
Architected Snowflake + S3 + Delta Lake lakehouse for low-latency analytics.
Built dbt-powered aggregates improving downstream queries by 40%+.

Nov 2022 — Nov 2024

Meta (via SGS Consulting)

Data Engineer · San Francisco, CA

Built scalable Spark pipelines for structured and semi-structured datasets.
Reduced data processing latency by 30%+ via near real-time ingestion.
Improved data accuracy & consistency by ~35% with quality frameworks.

Oct 2021 — Jun 2022

Cerenetics

Data Scientist · San Francisco, CA

Processed large-scale fMRI datasets under strict privacy controls.
Built 10+ ML/DL architectures in PyTorch; 0.85 AUC CNN on fMRI inputs.
Tracked 50+ model iterations enabling clean ablations.

Sep 2020 — Jul 2021

KPMG Global Services

Data Engineer · Bengaluru, India

Built distributed Spark pipelines on 150TB+ regulatory datasets.
Migrated Hadoop ETL to PySpark for a 25–30% performance gain.
Automated EMR provisioning with boto3, cutting operational overhead by 60%.

Jan 2020 — May 2020

Cognizant

Data Engineer · Pune, India

Performed EDA on high-volume transactional and financial KPIs for a pharma client.
Delivered actionable SQL-driven insights to stakeholders.

Jan 2017 — Jan 2020

Infosys

ML / Data Engineer · Pune, India

Built T-SQL ETL workflows supporting 10TB–50TB daily financial analytics.
Pandas EDA reduced manual effort by 60%.

04 / PROJECTS

Featured Work

Highlights from data engineering and ML projects — pipelines, platforms, and research.

Petabyte Lakehouse Platform

Architected a PySpark ETL/ELT data platform on AWS, scaling ingestion from 100TB to 6PB datasets. Built a Snowflake + S3 + Delta Lake lakehouse with dbt-powered aggregates, delivering a 53% pipeline efficiency gain and 40%+ query improvement.

PySpark · AWS EMR · Snowflake · Delta Lake · dbt · Airflow

fMRI Neural Activity Classifier

Built and evaluated 10+ ML/DL architectures in PyTorch for fMRI neuroimaging classification under strict privacy controls. Top CNN model achieved 0.85 AUC, with 50+ ablation iterations tracked for reproducibility.

PyTorch · CNN · Python · fMRI · Deep Learning

SQL Query Lab

An interactive browser-based SQL challenge engine with real-time query validation. Three progressively difficult data engineering scenarios — from basic filtering to complex HAVING clauses — with simulated result sets and instant hint feedback.

JavaScript · SQL · HTML · CSS

05 / CHALLENGE

SQL Query Lab

Three real-world data engineering scenarios. Write SQL, get instant feedback. Can you ace all three?

EASY

Your ops team needs a report of all delivered orders with amount > $500, sorted highest first. Write the query.

Schema

orders
column	type	key
order_id	INT	PK
customer_id	INT	FK
amount	DECIMAL
status	VARCHAR
created_at	TIMESTAMP

Your Query

order_id	customer_id	amount	status
1042	88	$1,240.00	delivered
987	54	$890.50	delivered
763	21	$714.00	delivered
501	67	$512.75	delivered

4 rows · simulated result
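
One possible answer to the EASY scenario, sketched in Python with the standard-library sqlite3 module so it can be run locally. The extra rows, the timestamps, and the 'delivered' / 'pending' status literals are assumptions for illustration; they are not the lab's actual data or grading logic.

```python
import sqlite3

# In-memory database mirroring the orders schema shown above.
easy_conn = sqlite3.connect(":memory:")
easy_conn.execute(
    """CREATE TABLE orders (
           order_id    INTEGER PRIMARY KEY,
           customer_id INTEGER,
           amount      DECIMAL,
           status      VARCHAR,
           created_at  TIMESTAMP
       )"""
)

# Hypothetical sample rows: four that should match, two that should not.
easy_conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1042, 88, 1240.00, "delivered", "2024-01-05 10:00:00"),
        (987, 54, 890.50, "delivered", "2024-01-04 09:30:00"),
        (763, 21, 714.00, "delivered", "2024-01-03 14:15:00"),
        (501, 67, 512.75, "delivered", "2024-01-02 08:45:00"),
        (622, 30, 950.00, "pending", "2024-01-06 11:00:00"),    # not delivered
        (410, 12, 120.00, "delivered", "2024-01-01 16:20:00"),  # amount too low
    ],
)

EASY_QUERY = """
SELECT order_id, customer_id, amount, status
FROM orders
WHERE status = 'delivered'
  AND amount > 500
ORDER BY amount DESC
"""

easy_result = easy_conn.execute(EASY_QUERY).fetchall()
for row in easy_result:
    print(row)
```

The filter goes in WHERE because it applies to individual rows, and ORDER BY amount DESC puts the largest order first.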

MEDIUM

Marketing wants total revenue by country for Gold-tier customers only. Join the tables and aggregate.

Schema

customers
column	type	key
customer_id	INT	PK
name	VARCHAR
country	VARCHAR
tier	VARCHAR

orders
column	type	key
order_id	INT	PK
customer_id	INT	FK
amount	DECIMAL
status	VARCHAR

Your Query

country	total_revenue
United States	$48,320.00
Germany	$31,100.50
Japan	$22,750.00
Canada	$18,900.25

4 rows · simulated result
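
One possible answer to the MEDIUM scenario, again sketched with Python's sqlite3 module. The customer names, the individual order amounts, and the 'Gold' / 'Silver' tier literals are assumptions chosen so the totals line up with the simulated result; the challenge text does not ask for a status filter, so none is applied here.

```python
import sqlite3

# In-memory database mirroring the customers and orders schemas shown above.
medium_conn = sqlite3.connect(":memory:")
medium_conn.executescript(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        VARCHAR,
        country     VARCHAR,
        tier        VARCHAR
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      DECIMAL,
        status      VARCHAR
    );
    """
)

# Hypothetical customers: four Gold-tier, one Silver-tier to be excluded.
medium_conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Customer A", "United States", "Gold"),
        (2, "Customer B", "Germany", "Gold"),
        (3, "Customer C", "Japan", "Gold"),
        (4, "Customer D", "Canada", "Gold"),
        (5, "Customer E", "United States", "Silver"),
    ],
)

# Hypothetical orders; the Silver customer's order must not be counted.
medium_conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (101, 1, 30000.00, "delivered"),
        (102, 1, 18320.00, "delivered"),
        (103, 2, 31100.50, "delivered"),
        (104, 3, 12750.00, "delivered"),
        (105, 3, 10000.00, "delivered"),
        (106, 4, 18900.25, "delivered"),
        (107, 5, 99999.00, "delivered"),
    ],
)

MEDIUM_QUERY = """
SELECT c.country, SUM(o.amount) AS total_revenue
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE c.tier = 'Gold'
GROUP BY c.country
ORDER BY total_revenue DESC
"""

medium_result = medium_conn.execute(MEDIUM_QUERY).fetchall()
for country, total in medium_result:
    print(country, total)
```

Filtering on tier before grouping keeps non-Gold orders out of every country's total.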

HARD

Your data platform is flaky. Find pipeline names that failed more than 3 times in the last 30 days, showing the failure count.

Schema

pipeline_runs
column	type	key
run_id	INT	PK
pipeline_name	VARCHAR
status	VARCHAR
run_date	DATE

Your Query

pipeline_name	failure_count
ingest_clickstream	7
sync_crm_daily	5
load_warehouse_facts	4

3 rows · simulated result
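
One possible answer to the HARD scenario, sketched with Python's sqlite3 module. The 'failed' / 'success' status literals, the export_reports and legacy_backfill pipelines, and all run dates are assumptions for illustration. The 30-day cutoff is computed in Python and passed as a parameter so the sketch does not depend on any one database's date functions.

```python
import sqlite3
from datetime import date, timedelta

today = date.today()
cutoff = (today - timedelta(days=30)).isoformat()


def days_ago(n):
    """ISO date string for n days before today."""
    return (today - timedelta(days=n)).isoformat()


# In-memory database mirroring the pipeline_runs schema shown above.
hard_conn = sqlite3.connect(":memory:")
hard_conn.execute(
    """CREATE TABLE pipeline_runs (
           run_id        INTEGER PRIMARY KEY,
           pipeline_name VARCHAR,
           status        VARCHAR,
           run_date      DATE
       )"""
)

# Hypothetical runs: recent failures per pipeline, all within the last 10 days.
runs = []
for name, recent_failures in [
    ("ingest_clickstream", 7),
    ("sync_crm_daily", 5),
    ("load_warehouse_facts", 4),
    ("export_reports", 3),  # exactly 3 failures: not "more than 3"
]:
    runs += [(name, "failed", days_ago(d + 1)) for d in range(recent_failures)]

# Five failures older than 30 days, plus one recent success: both excluded.
runs += [("legacy_backfill", "failed", days_ago(45 + d)) for d in range(5)]
runs.append(("ingest_clickstream", "success", days_ago(2)))

hard_conn.executemany(
    "INSERT INTO pipeline_runs (pipeline_name, status, run_date) VALUES (?, ?, ?)",
    runs,
)

HARD_QUERY = """
SELECT pipeline_name, COUNT(*) AS failure_count
FROM pipeline_runs
WHERE status = 'failed'
  AND run_date >= ?
GROUP BY pipeline_name
HAVING COUNT(*) > 3
ORDER BY failure_count DESC
"""

hard_result = hard_conn.execute(HARD_QUERY, (cutoff,)).fetchall()
for name, failures in hard_result:
    print(name, failures)
```

WHERE removes non-failed and stale runs before grouping, while HAVING applies the "more than 3" threshold to the per-pipeline counts after grouping.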

06 / CONTACT

Let's Connect

Open to data engineering, ML engineering, and platform roles. Reach out anytime.

LinkedIn

linkedin.com/in/rachity

GitHub

github.com/rachiteagles