My Projects

Data Analysis (Database/Time Series/Big Data)

Beijing Real Estate Investment and House Rental Analysis

(Database Analysis)

  • This project helps us better understand the housing market in Beijing. We used PostgreSQL and Linux shell commands on AWS to investigate city housing prices, loading 300k records from Lianjia.com into a dimensional-schema database. We then queried the database for insights into the investment potential and sourcing strategy of houses in Beijing, for housing agencies and investors, based on different attributes of the house. Finally, we visualized the results with Matplotlib on AWS and delivered them as dashboards to communicate the insights.

  • Some interesting business insights:

    • The most popular house type is 2 bedrooms, 1 living room, 1 kitchen, and 1 bathroom

    • The most popular building type is plate+ brick&wood+ 0.5ladder+ no_elevator

    • February, April, and September are the best months to buy a house; March and August are the best months to sell
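
The query-driven approach described above can be sketched as follows. This is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the table names (`dim_house`, `fact_sale`), columns, and rows are all hypothetical, not the project's actual schema or data.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_house (            -- hypothetical dimension table
    house_id INTEGER PRIMARY KEY,
    layout   TEXT
);
CREATE TABLE fact_sale (            -- hypothetical fact table
    sale_id       INTEGER PRIMARY KEY,
    house_id      INTEGER REFERENCES dim_house(house_id),
    sale_month    INTEGER,
    price_per_sqm REAL
);
""")

# Made-up rows, for illustration only.
conn.executemany("INSERT INTO dim_house VALUES (?, ?)",
                 [(1, "2b1l"), (2, "2b1l"), (3, "3b2l")])
conn.executemany(
    "INSERT INTO fact_sale (house_id, sale_month, price_per_sqm) VALUES (?, ?, ?)",
    [(1, 2, 50000.0), (2, 2, 52000.0), (3, 3, 60000.0), (1, 3, 58000.0)],
)

# Average price per month, cheapest month first.
monthly = conn.execute("""
    SELECT sale_month, AVG(price_per_sqm) AS avg_price
    FROM fact_sale
    GROUP BY sale_month
    ORDER BY avg_price
""").fetchall()

# Number of sales per layout, most common first (fact joined to dimension).
layouts = conn.execute("""
    SELECT h.layout, COUNT(*) AS n
    FROM fact_sale s
    JOIN dim_house h ON h.house_id = s.house_id
    GROUP BY h.layout
    ORDER BY n DESC
""").fetchall()

print(monthly)
print(layouts)
```

Keeping descriptive attributes in a dimension table and measurements in a fact table lets each insight be written as a single join plus a `GROUP BY`.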

Forecasting Visitor Volume for Yellowstone National Park

(Time Series Analysis)

  • This project works with a time series dataset of recreation visit counts, climate information, and economic information for Yellowstone National Park. The main purpose is to apply time series forecasting to this dataset to model and examine the number of recreation visits to Yellowstone over the years.

  • Time series models: Seasonal Dummies and Linear Trend Model, Cyclical Trend Model, Error Model, ARIMA(1,0,0)(2,1,0)s Model, Transfer Function Model.
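
As a rough illustration of the first model in the list, the sketch below fits a seasonal-dummies-plus-linear-trend regression by ordinary least squares in plain Python. The series is synthetic, and the period of 4 and the function names `fit_seasonal_trend` and `forecast` are assumptions made for the example; it is not the project's code or data.

```python
def fit_seasonal_trend(y, period):
    """OLS fit of y[t] = level[t % period] + slope * t."""
    n = len(y)
    # Design matrix: one dummy column per season plus a linear trend column.
    X = [[1.0 if t % period == s else 0.0 for s in range(period)] + [float(t)]
         for t in range(n)]
    k = period + 1
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[t][i] * X[t][j] for t in range(n)) for j in range(k)]
         + [sum(X[t][i] * y[t] for t in range(n))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    coef = [A[i][k] / A[i][i] for i in range(k)]
    return coef[:period], coef[period]


def forecast(levels, slope, t):
    """Predict the value at time index t from the fitted coefficients."""
    return levels[t % len(levels)] + slope * t


# Synthetic quarterly series: seasonal levels plus a trend of 2 per step.
true_levels = [100.0, 150.0, 300.0, 120.0]
series = [true_levels[t % 4] + 2.0 * t for t in range(24)]

levels, slope = fit_seasonal_trend(series, period=4)
next_value = forecast(levels, slope, 24)
print(levels, slope, next_value)
```

The series needs to cover at least two full seasonal cycles, otherwise the seasonal levels and the trend cannot be separated.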

Appliance Electricity Big Data Analysis with Spark

(Big Data Analysis)

  • Applied PySpark and regular expressions to load a specific group of data files from S3 into a Spark DataFrame on AWS EMR clusters. We then used pyspark.sql to run EDA on 20M+ records, ranking appliances by usage frequency, power, and resistance, and visualized the rankings with bar plots. Finally, we built a Random Forest pipeline with pyspark.ml to predict which appliance is running based on the electricity readings.

  • Insights learned from the data:

    • The most frequently used appliances over the whole day are circuit 11, the refrigerator, and the printer.

    • The appliances drawing the most power are the hair dryer, air compressor, and kitchen chopper.

    • More importantly, the analysis showed us that working with the dataset itself can be tough and challenging, especially in real-world cases. Still, we were determined to overcome those difficulties.

Machine Learning

Santander Satisfaction Prediction

(Responsible ML)

  • The goal of this project is to predict Santander customer satisfaction, identify the important features that strongly affect it, and deliver an analysis report to Santander for improving the user experience.

  • We leveraged 3 boosting models and an averaging model to predict client satisfaction using 70k+ records and achieved an ROC AUC score of 84%. We then identified the top 5 factors influencing client satisfaction using Shapley values, and used ICE and PD plots to identify how those key factors influence satisfaction.

  • Models: XGBoost, Gradient Boosting, LightGBM, Averaging Model, and Stacking Model
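
To make the averaging model and the ROC AUC metric concrete, here is a minimal sketch in plain Python. The three score lists stand in for the outputs of three boosting models and are made up, and the helper names `average_scores` and `roc_auc` are hypothetical.

```python
def average_scores(*model_scores):
    """Average per-sample scores across models (a simple averaging ensemble)."""
    return [sum(sample) / len(sample) for sample in zip(*model_scores)]


def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is O(P * N), fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores, for illustration only.
labels = [0, 0, 1, 1]
model_a = [0.1, 0.4, 0.35, 0.8]
model_b = [0.2, 0.3, 0.6, 0.7]
model_c = [0.3, 0.5, 0.4, 0.9]

blended = average_scores(model_a, model_b, model_c)
print(roc_auc(labels, model_a), roc_auc(labels, blended))
```

Averaging can help when the individual models make different ranking mistakes, since the errors partly cancel in the blended score.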

NYC Restaurant Location Selection

(API + Clustering)

  • The goal of the project is to recommend the best locations for restaurants that want to launch their business in Manhattan, based on venue information.

  • I used the Foursquare API to extract venue data for Manhattan; built a K-means model to group Manhattan's neighborhoods into 5 clusters based on the frequency of occurrence of each venue category and mapped the clusters using the folium package; and designed a new index (RPI) to measure the popularity of 3 types of restaurant in each cluster, then made the location recommendation based on RPI.

  • Tools: Foursquare API, Folium Package, K-means Clustering Model
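
The clustering step can be sketched in plain Python as below. The neighborhood vectors are made-up venue-category frequencies, the example uses 2 clusters instead of 5 to stay short, and the `kmeans` helper is a hypothetical minimal implementation rather than the library model used in the project.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Minimal K-means: returns (labels, centroids) for a list of vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Hypothetical frequency vectors: share of (cafes, bars, restaurants) venues.
neighborhoods = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
]

labels, centroids = kmeans(neighborhoods, k=2)
print(labels)
```

Neighborhoods with a similar venue mix end up in the same cluster, which is what makes a per-cluster index such as RPI meaningful.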

Housing Price Prediction

(Kaggle World Top 5%)

  • Our task is to predict house sale prices using 3k+ records with 80 housing features. The best model could be used by realtors and buyers to estimate the price of a house with the specific features they want to sell or buy. It also provides business insights for real estate investors about the factors that influence housing prices.

  • We built 12 regression models, from linear to non-linear techniques and from single models to averaging and stacking models. We concluded that the best model is the stacking model whose first layer includes Elastic Net Regression, KRR, and Gradient Boosting, and whose meta-layer is an averaging model of Lasso, XGBoost, and LightGBM. The final 5-fold cross-validation result is an RMSE of 0.1056, which ranks in the world top 10%.

  • Models: Lasso, Elastic Net, Kernel Ridge Regression, XGBoost, Gradient Boosting, LightGBM, AdaBoost, CatBoost, 2 Averaging Models, and 2 Stacking Models
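
For readers unfamiliar with the evaluation setup, the sketch below shows how a 5-fold cross-validated RMSE can be computed. It uses a one-feature least-squares line on synthetic data as a stand-in for the real models; the helper names and the interleaved fold assignment are assumptions made for the example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def cross_val_rmse(xs, ys, k=5):
    """Average RMSE over k folds; fold i holds out every k-th sample."""
    scores = []
    for fold in range(k):
        held_out = set(range(fold, len(xs), k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys))
                 if i not in held_out]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys))
                if i in held_out]
        slope, intercept = fit_line(*zip(*train))
        predictions = [slope * x + intercept for x, _ in test]
        scores.append(rmse([y for _, y in test], predictions))
    return sum(scores) / k


# Synthetic, perfectly linear data, so the held-out error is essentially zero.
xs = [float(i) for i in range(20)]
ys = [3.0 * x + 5.0 for x in xs]
print(cross_val_rmse(xs, ys, k=5))
```

Each sample is scored only by a model that never saw it during fitting, which is why the cross-validated RMSE is a fairer estimate than the training error.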

Credit Card Default Prediction of Asian Business Bank

  • The project uses customer repayment data from a commercial bank in Asia to fit classification models and predict whether a customer will default on the next repayment date. Our goal is to provide a well-reasoned business recommendation that is practical and of high application value.

  • We provided three different solutions for three different groups of clients. The Conservative solution uses the QDA model because of its relatively high F-score (70.7%) and the highest recall (87.6%), which means it lets the bank retain the highest reserve to manage risk. The Efficient solution uses the Linear-SVM model because of its accuracy of 73.2%. The Aggressive solution uses the Random Forest model because of its high F-score (73.2%) and the highest accuracy (81.9%); it can bring in more revenue and attract more customers.

  • Models: Logistic, LDA, QDA, Linear-SVM, Poly-SVM, KNN (K=27), Decision Tree, Random Forest and ANN.
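
The three solutions above trade off accuracy, recall, and F-score. As a reminder of how these metrics relate, here is a minimal sketch in plain Python. It assumes the F-score is the F1 score (the source does not say which variant was used), and the labels and predictions are made up.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

High recall means few actual defaulters are missed, which is why a risk-averse bank would weight it more heavily than raw accuracy.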

Consulting Projects

Digital Transformation Solution for Costa Cruise in China Market (Thoughtworks)

  • Costa Cruise wants to undergo a digital transformation to better manage its customers and to improve the customer experience by using digital tools throughout the user journey. We provided an end-to-end digital solution covering three stages of the user journey: Pre-Cruise, On-Cruise, and Post-Cruise.

  • The core idea of the solution is to combine fragmented data across multiple touch-points to digitally re-create the customer profile, in order to understand customers and drive personalized engagement and recommendations.

Market-entry Case for Gyrfalcon Ventures - Precision Analytics in Agriculture (GWU)

  • Gyrfalcon Ventures wants to exploit its recently acquired international rights to GAAP, a platform of cutting-edge, market-leading precision agricultural analytics technologies that can be integrated into “quad-copter” drones. However, the company is not sure which market to enter, and the value of GAAP is believed to diminish rapidly because the client lacks the incentive to invest heavily in R&D and competitors are actively innovating.

  • We recommended that the client enter the China market and proposed a course of next steps to make the most of this opportunity within a 3-year window. To reach this recommendation, we analyzed more than 15 metrics from the World Bank and other reputable sources to rank 50 countries.

Telecommunications Company Strategy Feasibility Evaluation (BCG)

  • The client for this project is a telecommunications operator whose profits have been declining for several years. We were engaged to help drive improvements in profitability. One of the hypotheses under consideration for improving profitability is the introduction of handset leasing.

  • We helped the client evaluate the feasibility of launching the handset leasing project through market research and by analyzing the reasons behind the client's declining market share. We then assessed the financial data for the proposal and forecasted revenue growth, costs, and profitability for the client, in line with industry standards. We also collected and evaluated consumer needs, designed an effective, consumer-friendly offering, and identified the target segment for the product.

To be continued...