Mail | LinkedIn | GitHub | Kaggle

EDUCATION

Cornell University, Department of Statistics and Data Science, Ithaca, New York Aug'19 – May'20

Master's in Applied Statistics (Data Science Concentration), GPA: 3.75/4.0

✓ Member of the Society for Industrial and Applied Mathematics (SIAM ID: (XXX) XXX-XXXX7)

Selected Coursework: Data Mining and Machine Learning (Supervised) • Natural Language Processing • Linear Models and Matrices • Applied Stat Comp with SAS • Probability Models and Inference • Machine Learning for Data Science (Unsupervised) • Big Data Management and Analysis • Statistical Data Mining I • Principles of Large Scale ML (Audit)

Birla Institute of Technology and Science (BITS), Pilani, Department of Computer Science Sep'15 – May'19

Bachelor of Engineering - B.E. (Hons.) in Computer Science, Honors: Distinction, GPA: 9.13/10.0
Leadership: 1. President, Business Intelligence and Statistical Committee 2. Football Team Captain, BITS Pilani (2018-19)

SKILLS

Technical: C, Python, Java, R, SQL, SAS (Certified Specialist - 2019), Tableau, Microsoft Excel, Minitab, Spark, AWS SageMaker

Libraries: Python (NumPy, Pandas, Matplotlib, scikit-learn, TensorFlow, Keras, Seaborn, Folium), R (ggplot, dplyr)

EXPERIENCE

Student Data Scientist (Capstone), NASA Ames Research Center, Moffett Field, CA

Jan'20 – Present

• Currently developing a statistical model for battery health monitoring and fault detection in autonomous aerial vehicles (e-UAV).

• Completed the first stage: predicting battery State of Health for any given cycle across different flight plans using Random Forest Regression.

• Building an artificial recurrent neural network (LSTM) to predict State of Health sequentially, using previous cycles as input.

Graduate Teaching Assistant (ILRST 2100 - Introductory Statistics), The ILR School, Cornell University, Ithaca, New York
Jan'20 – Present

• Responsibilities include teaching weekly discussion sections and labs (Minitab), proctoring exams, grading assignments, and holding office hours.

Data Science Intern, Merck, Dubai, United Arab Emirates

Aug'18 – Jan'19

• Built a predictive model using ARIMA in Python to forecast product sales for all brands in Saudi Arabia, Qatar, and Kuwait, informing market and supply chain/distribution planning for the following year.

• Performed unsupervised learning (K-Means clustering) for customer segmentation based on inclination to purchase different brands and medicines in the U.A.E. and Saudi Arabia, identifying the audiences to target with marketing campaigns.

• Developed the first MSD In-Market Sales Dashboard in Tableau for the entire GCC region, helping management review and act promptly on sales deficits, B2B customer issues, and sales force performance.

• Managed the compilation, validation, and maintenance of the database using SQL, providing easy access to it whenever required.

ACADEMIC PROJECTS

• Clustering Neighborhoods of Toronto using K-Means: Produced three clusters of Toronto neighborhoods in Python, based on each neighborhood's top surrounding venues. Scraped the neighborhood data from Wikipedia with Beautiful Soup, cleaned it for exploration, retrieved each neighborhood's coordinates by geocoding, and joined them to the original data. Explored the venues via the Foursquare API and produced the clusters with K-Means clustering.
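The clustering step above could be sketched as follows; the feature matrix here is random stand-in data, not the project's Foursquare features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative feature matrix: rows are neighborhoods, columns are
# frequencies of top venue categories (e.g. coffee shop, park, gym).
rng = np.random.default_rng(0)
venue_freqs = rng.random((30, 10))  # 30 neighborhoods, 10 venue categories

# Three clusters, mirroring the project setup.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(venue_freqs)
labels = kmeans.labels_  # one cluster label per neighborhood
```

Each neighborhood's label can then be joined back to its coordinates for mapping (the project used Folium for that step).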

• Language Modelling and Opinion Spam Classification: Built n-gram language models (unigram and bigram) to classify Chicago hotel reviews as truthful or deceptive. Preprocessed the data (lowercasing, stop-word removal) and implemented smoothing and unknown-word handling. Classified each review by comparing perplexity under the two models; the bigram model was very accurate, reaching 92.19% accuracy. Also classified with a Naïve Bayes classifier on a Bag-of-Words representation (91.4% accuracy). Both models gave similar results, with the n-gram language model slightly better.
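A minimal sketch of the perplexity-based classification idea, with add-one smoothing and toy training sentences in place of the real review corpora:

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Add-one-smoothed bigram model from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for toks in sentences:
        toks = ["<s>"] + toks
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams, len(unigrams)

def perplexity(sentences, model):
    """Per-token perplexity of sentences under the smoothed bigram model."""
    unigrams, bigrams, vocab = model
    log_prob, n = 0.0, 0
    for toks in sentences:
        toks = ["<s>"] + toks
        for prev, cur in zip(toks, toks[1:]):
            p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

# Toy corpora standing in for truthful vs. deceptive training reviews.
truthful = [["great", "location", "friendly", "staff"]]
deceptive = [["my", "husband", "and", "i", "loved", "it"]]
m_true, m_fake = train_bigram(truthful), train_bigram(deceptive)

# Classify a new review by whichever model assigns lower perplexity.
review = [["great", "staff"]]
label = ("truthful" if perplexity(review, m_true) < perplexity(review, m_fake)
         else "deceptive")
```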

• Metaphor Detection with Sequence Labeling Models: Implemented a model that identifies relevant information in a text and tags individual words as metaphor or non-metaphor on a dataset of over 6,500 text observations. Trained a Hidden Markov Model by generating the transition and emission probabilities, and decoded with the Viterbi algorithm, implementing the backpointer and score matrices as NumPy arrays; this gave an F1 score of 51%. Then replaced the transition probabilities with probabilities from Logistic Regression and Naïve Bayes classifiers; Logistic Regression achieved the higher F1 score of 56%.
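The Viterbi decoding with score and backpointer matrices could be sketched like this; the two-state setup and probabilities are toy values, not the trained HMM:

```python
import numpy as np

def viterbi(obs, trans, emit, init):
    """Viterbi decoding. obs: observation indices; trans[i, j] = P(j | i);
    emit[i, k] = P(obs k | state i); init[i] = P(state i at t=0).
    Returns the most likely state sequence."""
    n_states, T = trans.shape[0], len(obs)
    score = np.zeros((n_states, T))              # best path probability
    backptr = np.zeros((n_states, T), dtype=int) # best previous state
    score[:, 0] = init * emit[:, obs[0]]
    for t in range(1, T):
        for s in range(n_states):
            cand = score[:, t - 1] * trans[:, s] * emit[s, obs[t]]
            backptr[s, t] = np.argmax(cand)
            score[s, t] = cand[backptr[s, t]]
    # Follow backpointers from the best final state.
    path = [int(np.argmax(score[:, -1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[path[-1], t]))
    return path[::-1]

# Two states: 0 = non-metaphor, 1 = metaphor (illustrative numbers).
trans = np.array([[0.8, 0.2], [0.4, 0.6]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])
init = np.array([0.7, 0.3])
print(viterbi([0, 1, 1], trans, emit, init))  # -> [0, 1, 1]
```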

• Fine-Grained Sentiment Analysis using Neural Networks: Classified Yelp reviews into 5 sentiment classes, first with a feedforward neural network and then with a recurrent neural network, using Keras. Used a simple Bag-of-Words representation for the feedforward network and GloVe word embeddings for the RNN. Experimented with the number of hidden layers, the number of neurons per hidden layer, and the learning rate. With a stochastic gradient descent optimizer and 10 epochs, the RNN reached 79% accuracy versus 64% for the feedforward network.

• Story Cloze Task on ROCStories: Chose the correct ending to a four-sentence story from two candidate endings using a feature-based classifier. Created samples of right and wrong endings from the training data as positive and negative sets, then trained a binary Logistic Regression classifier on a bag-of-n-words representation to distinguish right from wrong stories. At test time, if the classifier assigned the two candidates different labels they were kept as-is; if it assigned them the same label, the candidate with the lower posterior probability had its label reversed. Classification accuracy on test data was 71.4%.

• Divorce Prediction using Classification: Predicted the divorce class from fifty-four questions answered by each married couple using supervised machine learning. The training data consist of 170 men and women who are either divorced or happily married. Cleaned the data by removing missing values and handling anomalies, then used density plots to gauge each feature's impact on the prediction. Since the majority of features were correlated, applied Principal Component Analysis, a dimensionality-reduction method that removes correlated variables, and fitted Logistic Regression, K-Nearest Neighbors, Naïve Bayes, Random Forests, and Support Vector Machines to the first principal component. Performed 10-fold cross-validation to find the algorithm with the lowest test error; the Random Forest classifier yielded the best results with an accuracy of 98.04%. A summary Jupyter Notebook can be found here.
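The PCA-then-classify pipeline with 10-fold cross-validation could be sketched as below; the 170 x 54 matrix here is synthetic stand-in data, so the scores are illustrative, not the project's results:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 170 x 54 questionnaire matrix:
# each label shifts all 54 (hence correlated) features.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 170)                             # 0 = married, 1 = divorced
X = y[:, None] + rng.normal(scale=0.8, size=(170, 54))  # correlated features

# Reduce to the first principal component, then classify;
# compare models by 10-fold cross-validated accuracy.
for clf in (LogisticRegression(), RandomForestClassifier(random_state=0)):
    pipe = make_pipeline(StandardScaler(), PCA(n_components=1), clf)
    scores = cross_val_score(pipe, X, y, cv=10)
    print(type(clf).__name__, round(scores.mean(), 3))
```

Putting the scaler and PCA inside the pipeline keeps each cross-validation fold's transform fitted only on that fold's training split, avoiding leakage.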