Anav S
Research Program Mentor
PhD candidate at Stanford University
Expertise
Machine Learning, Data Science, Quantitative Modeling, Statistics, Mathematics
Bio
Hi! I am currently a PhD student in Stanford's statistics department. Prior to joining the PhD program, I completed a B.S. in mathematics and an M.S. in statistics, both at Stanford. My interests center on using quantitative models to analyze, interpret, and make use of trends in data. Depending on who you ask, this subject goes by many names (e.g. machine learning, data science, statistical learning, deep learning). I approach these problems through a statistical lens, which lends itself to two main kinds of data-driven tasks: prediction and inference. I am eager to work with students who want to learn how to better work with, model, and understand data sets!
Project ideas
Introduction to Machine Learning
In this project, we will walk through how to set up machine learning experiments and discuss foundational models used for regression and classification tasks. Topics include, but are not limited to, cross-validation and testing, overfitting and underfitting, feature selection and dimensionality reduction, linear regression, logistic regression, and neural networks. The project culminates in applying these techniques to a prediction problem of the student's choosing. Topics will be tailored and scoped to the student's interests and background.
Natural Language Processing (NLP)
In 2018, Google released BERT, a neural language model that helped NLP practitioners surpass the previous state of the art on benchmarks for a wide range of language tasks (e.g. question answering, sentiment analysis, machine translation). In this project, we will learn how deep learning researchers approach problems in language quantitatively and develop an understanding of "contextual word embeddings", the motivation for BERT, from the ground up. Then we will learn how to apply BERT to a language task of your choosing. One example is quantifying political bias in news articles.
Exploring Genomics Data
In this project, the student will explore the 1000 Genomes Project dataset. The student will learn how to form their own hypotheses about the data and validate them quantitatively, and how to construct features and find signals in the dataset. The project will involve both statistical inference and prediction.
Final Notes
If you have a particular dataset in mind, I can help you set up an end-to-end project, starting from stages as early as data scraping and dataset construction.