Arpan Nookala

Data Scientist & ML Engineer

MS Data Science @ Rutgers University • Specializing in Advanced Analytics, Machine Learning, and Spatial Data Science • Building intelligent solutions that bridge academic research with real-world impact

About Me

I'm a passionate data scientist and machine learning engineer with a strong foundation in advanced analytics and spatial data science. Having completed my Master's in Data Science at Rutgers University, I specialize in bridging the gap between cutting-edge research and practical, real-world applications.

My research focuses on Small Area Estimation, spatial microsimulation, and telework pattern analysis, where I leverage advanced machine learning techniques including GANs, transfer learning, and ensemble methods. I'm particularly interested in how data science can inform urban planning and policy decisions.

With experience at Google and multiple published research papers, I bring both industry expertise and academic rigor to every project. I'm always excited to collaborate on innovative solutions that make a meaningful impact.

Education

Master of Science in Data Science

Rutgers University – New Brunswick

New Jersey

Sep 2023 – May 2025

GPA: 3.8/4.0

Bachelor of Technology in Electronics Engineering

Sardar Patel Institute of Technology

Mumbai, India

Aug 2019 – Jun 2023

GPA: 8.81/10

Minor in Computer Engineering

Professional Experience

Data Science & Machine Learning Analyst

Rutgers Urban and Civic Informatics Laboratory

Under Dr. Piyushimita (Vonu) Thakuriah

New Brunswick, NJ

Feb 2024 – Present

Leading advanced research in Small Area Estimation and telework propensity modeling using cutting-edge ML techniques.

Key Achievements:

Conducting Small Area Estimation (SAE) to model telework propensity at the census block group level, integrating multiple survey data sources including the American Community Survey (ACS), Household Pulse Survey (HPS), Current Population Survey (CPS), and Public Use Microdata Sample (PUMS) using Python, R, and SQL

Engineered a Python implementation of Iterative Proportional Fitting (IPF, also known as raking) that runs 6x faster than existing R-based methods, significantly improving computational performance and scalability

Developed and evaluated advanced empirical models, including a binary classifier using XGBoost (F1 score: 0.85) and Probit modeling for marginal effects and policy insight analysis on telework patterns

Advanced spatial microsimulation methods by incorporating Transfer Learning, Ensemble Learning, and Generative Adversarial Networks (GANs), enhancing the accuracy of small-area telework propensity estimation

Designed and built multi-output joint estimation transfer learning frameworks to improve predictive performance on the HPS dataset, transferring relevant features from CPS and PUMS, and utilizing Conditional Tabular GANs for synthetic population generation

Produced geospatial visualizations and GIS-based analyses using Leaflet and QGIS, facilitating actionable insights into telework patterns and urban planning

Optimized large-scale data management and processing by integrating SQLite, converting data to Parquet format, and employing Apache Spark for distributed computing, substantially enhancing efficiency and scalability

Data Scientist

Google via DKSH Smollan

Remote

Jul 2021 – Jul 2022

Led data science initiatives for Google's retail operations across multiple continents, focusing on pricing intelligence and dashboard optimization.

Key Achievements:

Developed custom Selenium scrapers to extract smartphone and smart home product pricing data from retail websites across Asia, Europe, and North America, storing data using Cloud Function microservices in Google Cloud SQL databases

Optimized dashboard performance by integrating Google BigQuery, reducing dashboard load times by nearly 50% and improving data accessibility

Designed and deployed interactive dashboards in Looker Studio, enabling data-driven insights for product pricing and trend analysis

Led a team of five junior interns, overseeing script development and microservice integration to ensure timely project delivery

Featured Projects

Jul 2025 – Present

Intelligent Multi-Agent Research Discovery Platform

Building a production-ready multi-agent system using LangGraph and CrewAI for intelligent research paper discovery. Features 5 specialized AI agents, semantic search across arXiv/Semantic Scholar/PubMed, personalized recommendations, and real-time agent orchestration with MCP integration.

Python

FastAPI

LangGraph

CrewAI

Next.js

PostgreSQL

Qdrant

MCP

Docker

Oct 2024 – Dec 2024

Research Paper Recommendation System

Developed a Research Paper Recommendation System using TF-IDF and fine-tuned SBERT (all-MiniLM-L6-v2), achieving 78.9% accuracy and an F1 score of 79.64%, with embeddings efficiently stored and retrieved from LanceDB.

Python

SBERT

TF-IDF

LanceDB

NLP

Nov 2024 – Dec 2024

Commodity Trading using Alternative Data

Explored coffee trading strategies integrating weather anomalies and technical signals for trend-following and mean-reversion methods.

Python

Financial Analysis

Weather APIs

MACD

RSI

Oct 2023 – Dec 2023

Credit Card Fraud Detection

Applied logistic regression with SMOTE to handle class imbalance, achieving a ROC AUC of 95% on a real card transaction dataset.

Python

Logistic Regression

SMOTE

Feature Engineering

Jul 2022 – Aug 2023

Deep RL Traffic Control

Built a DDPG-based model to optimize traffic signal timing, reducing average wait times by up to 23% on a grid of intersections.

Python

Deep RL

DDPG

DQN

IEEE Publication

Jul 2021 – Oct 2022

Automated Stock Trading with Short Selling

Integrated short-selling thresholds into DRL-based stock trading, outperforming previous methods by 11.4% per annum.

Python

OpenAI Gym

DRL

Financial Trading

Springer

Jan 2021 – Jun 2021

Fire Detection & Localization

Deployed Google's Inception V3 on images of fire, smoke, and neutral scenes, integrated with IoT for real-time alerts.

Python

Inception V3

OpenCV

IoT

Publications

Deep Reinforcement Learning based Intelligent Traffic Control

First Author

Jul 2022 – Aug 2023

IEEE TENSYMP 2023 (Canberra, Australia)

Presented novel DRL approaches for optimizing traffic signal timing systems.

Abstract:

The development of Intelligent Traffic Signal Control (ITSC) systems is crucial for enhancing traffic flow and mitigating congestion, which is a widespread problem in urban areas globally. Presently, RADAR or inductive loop-based intelligent systems are used in metropolises of developed countries, but the large investment and infrastructure requirements rule out their widespread application. This paper explores a nascent Deep Reinforcement Learning (DRL) approach to the Traffic Signal Control (TSC) problem, as opposed to classical optimization or rule-based approaches of the past. To address the challenges that limit past RL approaches, the study leverages the Deep Deterministic Policy Gradient (DDPG) algorithm to optimize traffic light control policies. The proposed DRL approach shows intelligent behavior and reduces the average delay time and congestion when compared to the traditional RL, past DRL, and fixed-time signal approaches. A comparative analysis of the reward functions is also presented, which reveals insights into the variance of performance.

Deep Reinforcement Learning for Automated Stock Trading: Inclusion of Short Selling

Second Author

Jul 2021 – Oct 2022

26th International Symposium on Methodologies for Intelligent Systems (Cosenza, Italy)

Published in Springer LNAI vol. 13515, focusing on advanced trading strategies.

Abstract:

Multiple facets of the financial industry, such as algorithmic trading, have greatly benefited from their unison with cutting-edge machine learning research in recent years. However, despite significant research efforts directed towards leveraging supervised learning methods alone for designing superior algorithmic trading strategies, existing studies continue to confront significant hurdles like striking the optimum balance of risk and return, incorporating real-world complexities, and minimizing max drawdown periods. This research work proposes a modified deep reinforcement learning (DRL) approach to automated stock trading with the inclusion of short selling, a new thresholding framework, and employs turbulence as a safety switch. The DRL agents' performance is evaluated on the U.S. stock market's DJIA index constituents. The modified DRL agents are shown to outperform previous DRL approaches and the DJIA index, in terms of absolute returns, risk-adjusted returns, and lower max drawdowns, while giving insights into the effects of short selling inclusion and proposed thresholding.

Technical Skills

Machine Learning & AI

Expertise in traditional ML, deep learning, NLP, and reinforcement learning algorithms.

Machine Learning

Linear/Non-linear Regression

Sampling (Bootstrap, Cross-Validation, Regularization)

Dimensionality Reduction (PCA, t-SNE, UMAP)

Decision Trees (XGBoost, Random Forest, Gradient Boosting, etc.)

Bayesian Networks

Clustering (KMeans, Hierarchical, DBSCAN, etc.)

Reinforcement Learning (MDP, PPO/MAPPO)

Deep Learning

CNNs

Transformers

GANs

Multimodal Systems

LLM Fine-tuning (PEFT/SFT/QLoRA, etc.)

Advanced RAG

Diffusion Models

Graph Models

GNNs

Programming Languages

Proficiency in multiple programming languages for diverse project requirements.

Python

C++

JavaScript/TypeScript

Java

Cloud & DevOps

Deploying and managing scalable infrastructure and CI/CD pipelines.

AWS

GCP

Docker

Kubernetes

GitHub Actions

Jenkins

MLflow

Data Systems

Managing and optimizing various database systems for efficient data storage and retrieval.

SQL

NoSQL

Vector DBs

MySQL

PostgreSQL

MongoDB

Redis

LanceDB

Pinecone

MLOps

Implementing best practices for machine learning operations and model lifecycle management.

MLflow

DVC

Model Monitoring

Pipeline Automation

Big Data Processing

Handling and processing large-scale datasets efficiently.

PySpark

Kafka

Airflow

Dask

Data Visualization & GIS

Creating interactive visualizations and geospatial analyses for complex datasets.

Plotly

Superset

Streamlit/Gradio

Tableau

Power BI

Leaflet

Kepler.gl

GeoPandas

QGIS

Full-Stack Development

Building robust and scalable web applications with modern frameworks.

Python

JavaScript

React

Django

Node.js

Next.js

Version Control

Efficient code management and collaboration using version control systems.

Git

GitHub

GitLab