Fangshu Lin
XXXX@XXXX.XXX | (XXX) XXX-XXXX | https://www.linkedin.com/in/fangshulin/

SKILLS
Languages: Python, R, SQL, MATLAB, Shell scripting
Tools: scikit-learn, Pandas, NumPy, GeoPandas, SciPy, NLTK, Git
Visualization: Matplotlib, ggplot2, Tableau, D3.js
Big data: Hadoop, Hive, Spark, Athena

EXPERIENCE
Factual, Inc. | Los Angeles, CA
Data Scientist | Apr 2019 - Apr 2020
• Leveraged billion-row real-world geolocation and in-store visitation data with Hive, Spark and Python
• Built data pipelines to detect anomalies and monitor footfall integrity metrics; reduced troubleshooting time by 40%
• Developed statistical models with device event data; constructed a quality scoring system and deployed it into production
• Applied smoothing and sampling methods for visit data at scale; fine-tuned parameters to maintain scale and stability
• Built lookalike models with data enrichment using autoencoders and Locality Sensitive Hashing
• Proposed a methodology based on Iterative Proportional Fitting (IPF) for extrapolation of in-store visitation data
• Collaborated on data preprocessing, modeling and performance testing in a Python codebase for product improvements
• Analyzed and visualized places, audience segments and visit feed data to tell stories for the company's insights blog posts

ARGO Labs | New York, NY
Research Analyst | Oct 2018 - Apr 2019
• Created a web app with R Shiny to automate geocoding and to calculate and visualize water bills for 200+ CA utilities
• Built data pipelines to perform ETL and profiling on water rate data and to report changes using Python, PostgreSQL, Airflow
• Prototyped customized analytical products for the City of Santa Monica and deployed the app to AWS

Purdue University Intelligent Infrastructure Systems Lab | West Lafayette, IN
Ph.D Researcher | Aug 2011 - Aug 2013
• Conducted parametric identification and model updating of a nonlinear transfer system coupled with smart devices
• Performed pattern recognition, spectral analysis and error analysis on high-dimensional noisy sensor network signals using time and frequency domain methodologies
• Developed LQG-based optimization algorithms and compensation methods with a Kalman filter for a smart damping system, improving both latency and accuracy

SELECTED PROJECTS
Twitter Fake News Detection (Deep learning, Classification, PyTorch)
• Processed 500k imbalanced text records and trained an LSTM to detect fake tweets about Hurricane Harvey
• Used pre-trained GloVe embeddings to overcome data limitations; achieved 87% accuracy and an F1 score of 0.9

Topic Modeling on Movie Corpus (Unsupervised machine learning, NLTK)
• Explored hidden topics and semantic patterns of movie synopses from Wikipedia with K-Means Clustering and LDA
• Integrated an LDA-based recommendation system and evaluated the model using similarity metrics
• Examined segmentation results and patterns by visualizing the latent space using PCA and t-SNE

Vulnerability Analysis for Transportation (Network Analysis, Visualization, NetworkX)
• Simulated the London Tube as a directed graph (390 nodes, 2274 edges) weighted by travel time using Dijkstra's algorithm
• Proposed a quantitative framework to evaluate resilience of subway systems under multi-level disruptions in urban cities
• Identified and visualized vulnerable stations with Python, Flask and D3.js

Gas Leak Incidents Prediction (Tree-based regression, Machine learning pipeline, Apache Spark)
• Analyzed spatial/temporal distribution of gas leak incidents in SF using 4 million fire department records spanning 18 years
• Managed dynamic data pipeline jobs using Spark RDD and Spark SQL for data processing and feature engineering
• Trained Gradient-Boosted Trees with PySpark MLlib to predict gas leak incidents and examined feature importance

EDUCATION
New York University | New York, NY
Master of Applied Informatics | August 2018
- Coursework: Applied data science, Machine learning, Text as data/NLP, Big data analytics, Data visualization, Spatial analytics

Tongji University | Shanghai, China
PhD of Engineering - Intelligent Systems, joint PhD at Purdue University | December 2015
- Coursework: Statistical inference, Probability, Numerical computing, Optimization, Stochastic processes, Signal processing