AVAILABLE FOR OPPORTUNITIES

RACHIT YADAV

DATA ENGINEER · ML ENGINEER

Architecting distributed data systems that process datasets from 100 TB to 6 PB, using PySpark, AWS, and deep learning.


Building Data Platforms at Scale

Data Engineer with 7+ years of experience building large-scale distributed data systems and real-time pipelines. Strong Python and SQL fundamentals paired with hands-on PyTorch experience implementing CNNs and other deep learning architectures.

Deep expertise in PySpark, AWS, and distributed computing, with a strong focus on data quality, validation, idempotency, and reproducible pipelines that feed analytics and ML consumers.

Currently designing data platforms that power analytics, marketing insights, and AI-driven decision systems at Smart Energy Water.

Location: Irvine, CA
Education: M.S. Data Science · USF
Pipeline Boost: +53% efficiency
Status: Active

Tech Stack in 3D

A constellation of tools I use to build large-scale data systems.

Languages & ML
Python, SQL, PyTorch, Pandas, CNNs
Distributed Processing
Apache Spark, PySpark, Databricks
Storage & Lakehouse
S3, Delta Lake, Parquet, ORC
Orchestration
Airflow, AWS Step Functions
Cloud & Infra
AWS EMR, EC2, Lambda, Kinesis, Terraform, Docker
Data Warehousing
Snowflake, Redshift, BigQuery, PostgreSQL, SQL Server

Career Timeline

Jan 2025 — Present
Smart Energy Water
Data Engineer · Irvine, CA
  • Designed PySpark ETL/ELT pipelines on AWS processing datasets from 100 TB to 6 PB; improved pipeline efficiency by 53%.
  • Architected Snowflake + S3 + Delta Lake lakehouse for low-latency analytics.
  • Built dbt-powered aggregates improving downstream queries by 40%+.
Nov 2022 — Nov 2024
Meta (via SGS Consulting)
Data Engineer · San Francisco, CA
  • Scalable Spark pipelines for structured and semi-structured datasets.
  • Reduced data processing latency by 30%+ via near real-time ingestion.
  • Improved data accuracy & consistency by ~35% with quality frameworks.
Oct 2021 — Jun 2022
Cerenetics
Data Scientist · San Francisco, CA
  • Processed large-scale fMRI datasets under strict privacy controls.
  • Built 10+ ML/DL architectures in PyTorch; 0.85 AUC CNN on fMRI inputs.
  • Tracked 50+ model iterations enabling clean ablations.
Sep 2020 — Jul 2021
KPMG Global Services
Data Engineer · Bengaluru, India
  • Distributed Spark pipelines on 150TB+ regulatory datasets.
  • Modernized Hadoop ETL → PySpark, 25–30% performance gain.
  • Automated EMR provisioning with boto3, reducing operational overhead by 60%.
Jan 2020 — May 2020
Cognizant
Data Engineer · Pune, India
  • EDA on high-volume transactional & financial KPIs for pharma client.
  • Delivered actionable SQL-driven insights to stakeholders.
Jan 2017 — Jan 2020
Infosys
ML / Data Engineer · Pune, India
  • T-SQL ETL workflows supporting 10TB–50TB daily financial analytics.
  • Pandas EDA reduced manual effort by 60%.

Let's Connect

Open to data engineering, ML engineering, and platform roles. Reach out anytime.