
- Location
- Coquitlam, British Columbia, Canada
- Bio
- I'm Peter Xu: Data scientist by day, music producer by night. Science Ph.D., data wizard at heart. Made moon phases predict e-commerce trends. Fluent in code, English, and Mandarin. Always chasing the next data thrill.
- Portals
- Toronto, Ontario, Canada
- Categories
- Data analysis, Data modelling, Data science, Data visualization
Work experience
Senior Engineer
China Petroleum
Beijing, Beijing, China
July 2008 - March 2014
• Developed and designed petroleum engineering tools
• Conducted research on seismic wave propagation and boundary condition processing methods
• Developed and implemented numerical algorithms in MATLAB for data analysis and simulations
• Collaborated with fellow engineers to share findings and improve project outcomes
• Published research findings in peer-reviewed scientific journals
Education
Diploma of Data Science, Data Science
Lighthouse Labs Bootcamp
February 2023 - April 2023
Ph.D. of Natural Sciences, Solid Earth Physics, Numerical Simulation of Seismic Wave Propagation
University of the Chinese Academy of Sciences
September 2005 - July 2008
Master of Engineering (M.Eng.), Engineering Mechanics, Optimization Algorithms
Dalian University of Technology
September 2002 - July 2005
Bachelor of Engineering (B.E.) / B.Eng., Engineering Mechanics
Dalian University of Technology
September 1998 - July 2002
Personal projects
Kids News Summarizer
October 2023 - October 2023
This project serves as a demonstration of the capabilities of Python libraries and OpenAI's API integration. It's an automated tool designed to fetch and summarize articles from a fictional children's news website. Using Selenium for dynamic web scraping and OpenAI's API for context-aware summarization, the tool offers a glimpse into the potential applications of these technologies. Note that this tool is purely illustrative and does not perform any actual operations related to real-world content.
Website URL:
https://github.com/peterandu365/KidsNews_Summarizer
---
Disclaimer
This project is purely for demonstration purposes and does not associate with or intend to infringe upon any rights of any news entity. Any resemblance to real organizations or news platforms is coincidental, and no actual summarization or distribution of copyrighted content has been executed through this tool.
----------------------------------------
Time Series Prediction: Moon Phase Effects on Sales
April 2023 - April 2023
In this research-oriented project, I explored the relationship between moon phases and e-commerce sales patterns using time series prediction techniques. Combining SARIMA and LSTM models, I analyzed sales trends and found pronounced spikes in sales during certain moon phases. The analysis indicated that moon phases affected sales volume while the average order size stayed relatively constant, implying that the spikes came from more orders or more customer visits rather than from larger orders.
I also used the Fast Fourier Transform (FFT) as a signal processing method, which proved particularly useful for capturing longer-term trends quickly and accurately. Combining FFT with LSTM made it possible to capture subtle sales patterns, particularly those related to moon phases.
A key finding came from the residual analysis. A significant deviation was observed and, on further investigation, attributed to a once-in-a-century severe snowstorm that delayed peak order deliveries. News articles corroborated this rare weather event, which illustrates how strongly external, unforeseeable events can affect sales predictions.
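To illustrate how a Fourier transform can surface a moon-phase-like cycle, here is a minimal sketch rather than the project's actual code. It assumes a synthetic daily series (the amplitudes, the 29.5-day period, and the 236-day window are made up for illustration) and uses a plain discrete Fourier transform from the standard library; an FFT routine would compute the same spectrum faster.

```python
import cmath
import math


def dominant_period(series):
    """Return the period (in samples) of the strongest non-constant DFT component."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the constant (zero-frequency) level

    best_k, best_magnitude = 1, 0.0
    for k in range(1, n // 2 + 1):
        # DFT coefficient at frequency k cycles per window
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        if abs(coeff) > best_magnitude:
            best_k, best_magnitude = k, abs(coeff)
    return n / best_k


# Synthetic data (assumption): a roughly lunar cycle plus a weaker weekly cycle.
LUNAR_PERIOD_DAYS = 29.5
N_DAYS = 236  # eight full lunar cycles
sales = [
    100
    + 10 * math.sin(2 * math.pi * t / LUNAR_PERIOD_DAYS)
    + 3 * math.sin(2 * math.pi * t / 7)
    for t in range(N_DAYS)
]

print(dominant_period(sales))
```

On real sales data the peak would be less clean, so the spectrum is better read as a hint about which periods to feed into a model than as a result in itself.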
Website URL:
https://github.com/peterandu365/demand_prediction_project
----------------------------------------
Quora Duplicate Question Identification
April 2023 - April 2023
This project focused on identifying duplicate questions in the Quora dataset. The objective was to build a machine learning model that determines whether two given questions are duplicates. The dataset comprised over 400,000 question pairs. I applied several preprocessing techniques, including text cleaning, tokenization, vectorization, and feature extraction with the TF-IDF vectorizer.
I experimented with several models, including Logistic Regression, XGBoost, Random Forest, and LSTM. Each model presented its own challenges and learning curve. For instance, XGBoost handled nonlinear problems well, while LSTM was strongest on prediction problems sensitive to sequence order.
The project gave me insight into the characteristics of these models and how to tune their parameters. I ran into challenges, particularly with LSTM batch sizes and model convergence, but came away with a better understanding of the models' behavior, strengths, and areas for improvement.
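As a rough illustration of the TF-IDF step mentioned above, the sketch below builds TF-IDF vectors from scratch and compares question pairs by cosine similarity. It is not the project's code: the example questions are invented, the helper names are hypothetical, and the smoothed IDF formula is an assumption chosen to resemble what common vectorizers use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed variant).
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented example questions, not taken from the Quora dataset.
questions = [
    "How do I learn Python quickly?",
    "What is the fastest way to learn Python?",
    "Where can I buy sourdough bread?",
]
vectors = tfidf_vectors(questions)
similar_pair = cosine_similarity(vectors[0], vectors[1])
unrelated_pair = cosine_similarity(vectors[0], vectors[2])
print(similar_pair > unrelated_pair)
```

A similarity score like this could serve as one input feature for a classifier; on its own it misses paraphrases that share few words.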
Website URL:
https://github.com/peterandu365/mini-project-V
----------------------------------------
Loan Approval Prediction
March 2023 - March 2023
This project centered on predicting loan approval statuses using advanced machine learning techniques. I aimed to design a model that can predict the outcome based on an applicant's profile information. The project began with an in-depth exploratory data analysis, during which I identified pivotal features such as credit history, applicant income, and loan amount that greatly influence loan approval decisions.
To improve the model's performance, I cleaned the dataset by handling missing values and outliers and by encoding categorical variables. Feature engineering also played a critical role: I introduced new features such as the debt-to-income ratio and transformed existing ones to improve the model's efficiency.
I trained HistGradientBoostingClassifier and XGBoost models and evaluated their performance across various hyperparameter settings. The best model achieved an accuracy of 0.660, with precision and recall reported alongside.
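For readers less familiar with the metrics named above, the following sketch shows how accuracy, precision, and recall are derived from binary predictions. The labels are toy values invented for illustration and are unrelated to the project's results; the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = approved)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels (assumption), not the project's data.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall next to accuracy matters for loan data because approvals and rejections are rarely balanced, and the two kinds of error carry different costs.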
One of the standout features of this project was the model deployment phase. The best-performing model was deployed via Flask and made accessible through an API hosted on AWS. This setup ensures that users can easily input applicant details and promptly receive a loan approval prediction.
Website URL:
https://github.com/peterandu365/mini-project-IV
----------------------------------------
Data Visualization and Analysis with Tableau
March 2023 - March 2023
In this project, I explored the relationship between housing prices and residents' livelihoods by analyzing datasets such as the Canadian House Price Index, the Consumer Price Index, real estate construction, prices, and residents' monthly income. A further aim was to master Tableau, a leading data analysis and visualization tool. The work involved complex data extraction and transformation, primarily using Python to convert data formats such as JSON and XLSX into CSV.
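The JSON-to-CSV conversion mentioned above might look roughly like the sketch below. It assumes the JSON is a flat list of records; the field names (`month`, `hpi`, `cpi`) and values are invented placeholders, and XLSX input is left out because it needs a third-party reader.

```python
import csv
import io
import json


def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)

    # Collect column names in first-seen order so no field is dropped.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing fields are written as empty cells
    return buffer.getvalue()


# Placeholder records (assumption), not real index values.
sample = (
    '[{"month": "2020-01", "hpi": 100.0},'
    ' {"month": "2020-02", "hpi": 101.5, "cpi": 120.0}]'
)
csv_text = json_records_to_csv(sample)
print(csv_text)
```

Writing to an in-memory buffer keeps the function easy to test; in practice the same writer could target a file that Tableau then loads.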
A significant portion of my analysis delved into understanding macro-economic trends and their effects on housing markets. I studied the impact of global economic crises, such as Black Monday (1987) and the Financial Crisis (2007-2009), on different economic indicators in Canada. By plotting various indices and fitting regression lines, I examined the potential of predicting consumer behaviors based on housing price trends.
This project was not only technically intensive but also demanded a thorough understanding of economic indicators and their interplay. The insights derived were visually represented using Tableau's rich visualization features, resulting in comprehensive dashboards and reports.
Website URL:
https://github.com/peterandu365/Data-Visualization-and-Dashboards-with-Tableau
----------------------------------------
Financial Transactions Analysis, Customer Segmentation and PCA Visualization
February 2023 - March 2023
This project focused on advanced data wrangling, visualization, and unsupervised learning techniques, specifically for financial transaction data. Central to the analysis was the application of customer segmentation using clustering methods to categorize consumers based on demographics and banking behaviors. Two main tasks highlighted were:
1. **Advanced PCA Visualizations**: Implemented 3D PCA visualizations, underscoring the value of dimensionality reduction in high-dimensional data. Emphasis was on analyzing and interpreting the most influential features contributing to primary principal components, enabling a deeper understanding of underlying data patterns.
2. **Geolocation Clustering**: To provide spatial insights into customer distribution, address-related columns were transformed into longitude and latitude coordinates. A notable finding was that the Haversine distance metric, which measures great-circle distance between points on a sphere, outperformed the Euclidean distance metric in accuracy for geolocation clustering. This underscores the importance of choosing a distance metric suited to the nature of the data.
Throughout the project, multiple clustering algorithms, including K-Means, Agglomerative, and DBSCAN, were utilized and compared. Additionally, the influence of various preprocessing scalers on clustering performance was meticulously assessed, emphasizing the interdependence of feature engineering and scaler selection in determining clustering outcomes.
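The difference between the two distance metrics in the geolocation task can be seen in a small sketch. This is an illustration under stated assumptions, not the project's code: the Earth is treated as a sphere of radius 6371 km, and the sample coordinates are arbitrary.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def euclidean_degrees(lat1, lon1, lat2, lon2):
    """Straight-line distance computed directly on raw degree values."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


# One degree of longitude at the equator versus at 60 degrees north.
at_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
at_60_north = haversine_km(60.0, 0.0, 60.0, 1.0)
print(round(at_equator, 1), round(at_60_north, 1))
print(euclidean_degrees(0.0, 0.0, 0.0, 1.0), euclidean_degrees(60.0, 0.0, 60.0, 1.0))
```

Euclidean distance on raw degrees treats both pairs as equally far apart, while the Haversine distance shows the northern pair is about half as far, which is one plausible reason a clustering on coordinates would behave differently under the two metrics.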
Website URL:
https://github.com/peterandu365/mini-project-III
----------------------------------------
E-Commerce Data Analysis using SQL
February 2023 - February 2023
This comprehensive e-commerce data analysis project centered on transforming, cleaning, and analyzing a rich dataset from an e-commerce platform. My focus was to use SQL to derive actionable insights from the data while following effective data management practices. I examined consumer behavior, sales trends, and inventory management, and presented the analysis through a series of well-defined metrics and visual representations. The project gave me valuable exposure to real-world data challenges and their solutions.
Website URL:
https://github.com/peterandu365/SQL-Project
----------------------------------------