Fangshu Lin
XXXX@XXXX.XXX | (XXX) XXX-XXXX | https://www.linkedin.com/in/fangshulin/
SKILLS
Languages: Python, R, SQL, MATLAB, Shell scripting
Tools: scikit-learn, Pandas, NumPy, GeoPandas, SciPy, NLTK, Git
Visualization: Matplotlib, ggplot2, Tableau, D3.js
Big data: Hadoop, Hive, Spark, Athena
EXPERIENCE
Factual, Inc. Los Angeles, CA
Data Scientist Apr 2019 - Apr 2020
• Leveraged billion-row real-world geolocation and in-store visitation data with Hive, Spark and Python
• Built data pipelines to detect anomalies and monitor footfall integrity metrics; reduced troubleshooting time by 40%
• Developed statistical models with device event data; constructed a quality scoring system and deployed into production
• Applied smoothing and sampling methods for visit data at scale; fine-tuned parameters to maintain scale and stability
• Designed lookalike models with data enrichment using autoencoders and locality-sensitive hashing
• Proposed a methodology based on Iterative Proportional Fitting (IPF) for extrapolation of in-store visitation data
• Collaborated on data preprocessing, modeling and performance testing in a Python codebase for product improvements
• Analyzed, visualized and told stories from places, audience segments and visit feed data for the company’s insights blog posts
ARGO Labs New York, NY
Research Analyst Oct 2018 - Apr 2019
• Created a web app with R Shiny to automate geocoding and to calculate and visualize water bills for 200+ CA utilities
• Built data pipelines to perform ETL and profiling on water rate data and report changes using Python, PostgreSQL and Airflow
• Prototyped customized analytical products for the City of Santa Monica and deployed the app to AWS
Purdue University Intelligent Infrastructure Systems Lab West Lafayette, IN
Ph.D. Researcher Aug 2011 - Aug 2013
• Conducted parametric identification and model updating of a nonlinear transfer system coupled with smart devices
• Performed pattern recognition, spectral analysis and error analysis on high-dimensional noisy sensor network signals using time and frequency domain methodologies
• Developed LQG-based optimization algorithms and compensation methods with a Kalman filter for a smart damping system, improving both latency and accuracy
SELECTED PROJECTS
Twitter Fake News Detection (Deep learning, Classification, PyTorch)
• Processed 500k imbalanced text samples and trained an LSTM to detect fake tweets about Hurricane Harvey
• Used pre-trained GloVe embeddings to compensate for limited data; achieved 87% accuracy and an F1 score of 0.9
Topic Modeling on Movie Corpus (Unsupervised machine learning, NLTK)
• Explored hidden topics and semantic patterns of movie synopses from Wikipedia with K-Means Clustering and LDA
• Integrated an LDA-based recommendation system and evaluated the model using similarity metrics
• Examined segmentation results and patterns by visualizing the latent space with PCA and t-SNE
Vulnerability Analysis for Transportation (Network Analysis, Visualization, NetworkX)
• Modeled the London Tube as a directed graph (390 nodes, 2274 edges) weighted by travel time; computed shortest paths with Dijkstra's algorithm
• Proposed a quantitative framework to evaluate the resilience of urban subway systems under multi-level disruptions
• Identified and visualized vulnerable stations with Python, Flask and D3.js
Gas Leak Incidents Prediction (Tree-based regression, Machine learning pipeline, Apache Spark)
• Analyzed the spatial and temporal distribution of gas leak incidents in SF using 4 million fire department records spanning 18 years
• Managed dynamic data pipeline jobs using Spark RDDs and Spark SQL for data processing and feature engineering
• Trained Gradient-Boosted Trees with PySpark MLlib to predict gas leak incidents and examined feature importance
EDUCATION
New York University New York, NY
Master of Applied Informatics August 2018
- Coursework: Applied data science, Machine learning, Text as data/NLP, Big data analytics, Data visualization, Spatial analytics
Tongji University Shanghai, China
Ph.D. in Engineering - Intelligent Systems, joint Ph.D. at Purdue University December 2015
- Coursework: Statistical inference, Probability, Numerical computing, Optimization, Stochastic processes, Signal processing