Mail | LinkedIn | GitHub | Kaggle

EDUCATION

Cornell University, Department of Statistics and Data Science, Ithaca, New York Aug'19 – May'20

Master's in Applied Statistics (Data Science Concentration), GPA: 3.75/4.0

✓ Member of the Society for Industrial and Applied Mathematics (SIAM ID: (XXX) XXX-XXXX7)

Selected Coursework: Data Mining and Machine Learning (Supervised) • Natural Language Processing • Linear Models and Matrices • Applied Stat Comp with SAS • Probability Models and Inference • Machine Learning for Data Science (Unsupervised) • Big Data Management and Analysis • Statistical Data Mining I • Principles of Large Scale ML (Audit)

Birla Institute of Technology and Science (BITS), Pilani, Department of Computer Science Sep'15 – May'19

Bachelor of Engineering - B.E. (Hons.) in Computer Science, Honors: Distinction, GPA: 9.13/10.0
Leadership: 1. President, Business Intelligence and Statistical Committee 2. Football Team Captain, BITS Pilani (2018-19)

SKILLS

Technical: C, Python, Java, R, SQL, SAS (Certified Specialist - 2019), Tableau, Microsoft Excel, Minitab, Spark, AWS SageMaker

Libraries: Python (NumPy, Pandas, Matplotlib, scikit-learn, TensorFlow, Keras, Seaborn, Folium), R (ggplot, dplyr)

EXPERIENCE

Student Data Scientist (Capstone), NASA Ames Research Center, Moffett Field, CA

Jan'20 – Present

• Currently developing a statistical model for battery health monitoring and fault detection in autonomous aerial vehicles (e-UAV).

• Completed the first stage: predicting battery State of Health for any given cycle across different flight plans using Random Forest Regression.

• Building an artificial recurrent neural network (LSTM) to predict State of Health sequentially, using previous cycles as input.

Graduate Teaching Assistant (ILRST 2100 - Introductory Statistics), The ILR School, Cornell University, Ithaca, New York
Jan'20 – Present

• Responsibilities include teaching weekly discussion sections and labs (Minitab), proctoring exams, grading assignments, and holding office hours.

Data Science Intern, Merck, Dubai, United Arab Emirates

Aug'18 – Jan'19

• Built a predictive model using ARIMA in Python to forecast product sales for all brands in Saudi Arabia, Qatar, and Kuwait, informing market and supply chain/distribution planning for the following year.

• Performed unsupervised learning (K-Means clustering) for customer segmentation based on inclination to purchase different brands and medicines in the U.A.E. and Saudi Arabia, identifying the audiences to target with marketing campaigns.

• Developed the first MSD In-Market Sales Dashboard in Tableau for the entire GCC region, helping management review and act promptly on sales deficits, B2B customer issues, and sales force performance.

• Managed the compilation, validation, and maintenance of the database using SQL, providing easy access to it whenever required.

ACADEMIC PROJECTS

• Clustering Neighborhoods of Toronto using K-Means: Produced three clusters of Toronto neighborhoods in Python, based on each neighborhood's top surrounding venues. Scraped the neighborhood data from Wikipedia with Beautiful Soup, cleaned it for exploration, retrieved each neighborhood's coordinates by geocoding, and joined them to the original data. Explored the venues via the Foursquare API and produced the clusters with K-Means clustering.
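The clustering step above could be sketched as follows; the feature matrix here is random stand-in data, not the project's Foursquare features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative feature matrix: rows are neighborhoods, columns are
# frequencies of top venue categories (e.g. coffee shop, park, gym).
rng = np.random.default_rng(0)
venue_freqs = rng.random((30, 10))  # 30 neighborhoods, 10 venue categories

# Three clusters, mirroring the project setup.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(venue_freqs)
labels = kmeans.labels_  # one cluster label per neighborhood
```

Each neighborhood's label can then be joined back to its coordinates for mapping (the project used Folium for that step).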

• Language Modelling and Opinion Spam Classification: Built n-gram language models (unigram and bigram) to classify Chicago hotel reviews as truthful or deceptive. Preprocessed the data (lowercasing, stop-word removal) and implemented smoothing and unknown-word handling. Classified each review by comparing perplexity under the two models; the bigram model was very accurate, reaching 92.19% accuracy. Also classified with a Naïve Bayes classifier on a Bag-of-Words representation (91.4% accuracy). Both models gave similar results, with the n-gram language model slightly better.
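A minimal sketch of the perplexity-based classification idea, with add-one smoothing and toy training sentences in place of the real review corpora:

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Add-one-smoothed bigram model from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for toks in sentences:
        toks = ["<s>"] + toks
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams, len(unigrams)

def perplexity(sentences, model):
    """Per-token perplexity of sentences under the smoothed bigram model."""
    unigrams, bigrams, vocab = model
    log_prob, n = 0.0, 0
    for toks in sentences:
        toks = ["<s>"] + toks
        for prev, cur in zip(toks, toks[1:]):
            p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

# Toy corpora standing in for truthful vs. deceptive training reviews.
truthful = [["great", "location", "friendly", "staff"]]
deceptive = [["my", "husband", "and", "i", "loved", "it"]]
m_true, m_fake = train_bigram(truthful), train_bigram(deceptive)

# Classify a new review by whichever model assigns lower perplexity.
review = [["great", "staff"]]
label = ("truthful" if perplexity(review, m_true) < perplexity(review, m_fake)
         else "deceptive")
```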

• Metaphor Detection with Sequence Labeling Models: Implemented a model that identifies relevant information in a text and tags individual words as metaphor or non-metaphor on a dataset of over 6,500 text observations. Trained a Hidden Markov Model by generating the transition and emission probabilities, and decoded with the Viterbi algorithm, implementing the backpointer and score matrices as NumPy arrays; this gave an F1 score of 51%. Then replaced the transition probabilities with probabilities from Logistic Regression and Naïve Bayes classifiers; Logistic Regression achieved the higher F1 score of 56%.
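The Viterbi decoding with score and backpointer matrices could be sketched like this; the two-state setup and probabilities are toy values, not the trained HMM:

```python
import numpy as np

def viterbi(obs, trans, emit, init):
    """Viterbi decoding. obs: observation indices; trans[i, j] = P(j | i);
    emit[i, k] = P(obs k | state i); init[i] = P(state i at t=0).
    Returns the most likely state sequence."""
    n_states, T = trans.shape[0], len(obs)
    score = np.zeros((n_states, T))              # best path probability
    backptr = np.zeros((n_states, T), dtype=int) # best previous state
    score[:, 0] = init * emit[:, obs[0]]
    for t in range(1, T):
        for s in range(n_states):
            cand = score[:, t - 1] * trans[:, s] * emit[s, obs[t]]
            backptr[s, t] = np.argmax(cand)
            score[s, t] = cand[backptr[s, t]]
    # Follow backpointers from the best final state.
    path = [int(np.argmax(score[:, -1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[path[-1], t]))
    return path[::-1]

# Two states: 0 = non-metaphor, 1 = metaphor (illustrative numbers).
trans = np.array([[0.8, 0.2], [0.4, 0.6]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])
init = np.array([0.7, 0.3])
print(viterbi([0, 1, 1], trans, emit, init))  # -> [0, 1, 1]
```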

• Fine-Grained Sentiment Analysis using Neural Networks: Classified Yelp reviews into 5 sentiment classes, first with a feedforward neural network and then with a recurrent neural network, using Keras. Used a simple Bag-of-Words representation for the feedforward network and GloVe word embeddings for the RNN. Experimented with the number of hidden layers, the number of neurons per hidden layer, and the learning rate. With a stochastic gradient descent optimizer and 10 epochs, the RNN reached 79% accuracy versus 64% for the feedforward network.

• Story Cloze Task on ROCStories: Chose the correct ending to a four-sentence story from two candidate endings using a feature-based classifier. Created samples of right and wrong endings from the training data as positive and negative sets, then trained a binary Logistic Regression classifier on a bag-of-n-words representation to distinguish right from wrong stories. At test time, if the classifier assigned the two candidates different labels they were kept as-is; if it assigned them the same label, the candidate with the lower posterior probability had its label reversed. Classification accuracy on test data was 71.4%.

• Divorce Prediction using Classification: Predicted the divorce class from fifty-four questions answered by each married couple using supervised machine learning. The training data consist of 170 men and women who are either divorced or happily married. Cleaned the data by removing missing values and handling anomalies, then used density plots to gauge each feature's impact on the prediction. Since the majority of features were correlated, applied Principal Component Analysis, a dimensionality-reduction method that removes correlated variables, and fitted Logistic Regression, K-Nearest Neighbors, Naïve Bayes, Random Forests, and Support Vector Machines to the first principal component. Performed 10-fold cross-validation to find the algorithm with the lowest test error; the Random Forest classifier yielded the best results with an accuracy of 98.04%. A summary Jupyter Notebook can be found here.
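The PCA-then-classify pipeline with 10-fold cross-validation could be sketched as below; the 170 x 54 matrix here is synthetic stand-in data, so the scores are illustrative, not the project's results:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 170 x 54 questionnaire matrix:
# each label shifts all 54 (hence correlated) features.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 170)                             # 0 = married, 1 = divorced
X = y[:, None] + rng.normal(scale=0.8, size=(170, 54))  # correlated features

# Reduce to the first principal component, then classify;
# compare models by 10-fold cross-validated accuracy.
for clf in (LogisticRegression(), RandomForestClassifier(random_state=0)):
    pipe = make_pipeline(StandardScaler(), PCA(n_components=1), clf)
    scores = cross_val_score(pipe, X, y, cv=10)
    print(type(clf).__name__, round(scores.mean(), 3))
```

Putting the scaler and PCA inside the pipeline keeps each cross-validation fold's transform fitted only on that fold's training split, avoiding leakage.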