
# daily_customers_predict
## Feature Engineering of Turku Weather Data
### Code Structure
```python
class TurkuWeatherAnalysis:
    def load_and_preprocess(self, filepath): ...  # Loads and preprocesses the dataset
    def visualize_data(self): ...                 # Generates weather-related plots
    def generate_statistics(self): ...            # Computes annual statistics
    def check_data_period(self): ...              # Displays the available time period
```
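For orientation, a minimal sketch of what `load_and_preprocess` might look like, assuming a CSV export with a `timestamp` column; the column names here are illustrative, not taken from the project:

```python
import pandas as pd

def load_and_preprocess(filepath):
    """Load the Turku weather CSV and prepare it for analysis.

    Assumes a 'timestamp' column plus numeric weather columns;
    the project's actual column names may differ.
    """
    df = pd.read_csv(filepath, parse_dates=["timestamp"])
    df = df.set_index("timestamp").sort_index()
    # Fill short measurement gaps instead of dropping whole rows
    numeric_cols = df.select_dtypes("number").columns
    df[numeric_cols] = df[numeric_cols].interpolate(limit=3)
    df["year"] = df.index.year
    return df
```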
### Data Period
- Start Date: 2021-01-01 00:00:00
- End Date: 2024-09-23 16:00:00
### Annual Weather Statistics
Year | Avg Temp (°C) | Min Temp (°C) | Max Temp (°C) | Temp Std (°C) | Absolute Max Temp (°C) | Absolute Min Temp (°C) | Total Precipitation (mm) | Max Precipitation (mm) | Avg Wind Speed (m/s) | Max Wind Speed (m/s) |
---|---|---|---|---|---|---|---|---|---|---|
2021 | 6.54 | -22.2 | 32.0 | 10.17 | 32.5 | -22.3 | 599.4 | 10.1 | 2.98 | 10.4 |
2022 | 7.16 | -16.7 | 31.1 | 8.69 | 31.6 | -17.0 | 575.0 | 11.6 | 2.88 | 9.8 |
2023 | 6.75 | -17.6 | 32.5 | 9.26 | 33.0 | -17.8 | 649.8 | 9.0 | 2.66 | 10.2 |
2024 | 8.57 | -23.1 | 30.1 | 10.65 | 30.6 | -23.3 | 445.7 | 8.2 | 2.91 | 9.6 |
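Statistics of this shape fall out of a pandas `groupby` over the year. A sketch building on the loader above; the split between "Min/Max Temp" and "Absolute Min/Max Temp" is assumed here to be daily-mean extremes vs. raw instantaneous extremes, and the column names are again illustrative:

```python
# Named aggregation per year over the raw readings
annual = df.groupby("year").agg(
    avg_temp=("temperature", "mean"),
    temp_std=("temperature", "std"),
    abs_min_temp=("temperature", "min"),
    abs_max_temp=("temperature", "max"),
    total_precip=("precipitation", "sum"),
    max_precip=("precipitation", "max"),
    avg_wind=("wind_speed", "mean"),
    max_wind=("wind_speed", "max"),
).round(2)

# Extremes of the daily means (assumed meaning of "Min/Max Temp")
daily = df["temperature"].resample("D").mean()
annual["min_daily_mean_temp"] = daily.groupby(daily.index.year).min().round(2)
annual["max_daily_mean_temp"] = daily.groupby(daily.index.year).max().round(2)
print(annual)
```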
### Visualizations
The following image presents key insights from the weather dataset, including temperature trends, precipitation distribution, humidity relationships, and wind speed analysis.
Description of Visualizations:
- Monthly Temperature Trends (Top Left): Shows how average temperature varies across different months from 2021 to 2024. Warmer months are clearly visible in June-August, while the coldest periods occur in December-February.
- Precipitation Distribution (Top Right): Highlights the frequency and distribution of precipitation values. The majority of precipitation is below 2mm, with some extreme values up to 12mm.
- Temperature vs Humidity (Bottom Left): A scatter plot that shows how temperature correlates with humidity. Higher temperatures tend to have lower humidity, while lower temperatures are associated with higher humidity.
- Wind Speed Distribution (Bottom Right): A box plot summarizing the distribution of wind speeds. The median wind speed is around 3 m/s, with some extreme values reaching up to 10 m/s.
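The four panels could be produced roughly as follows (a sketch of a plausible `visualize_data`; `humidity` and the other column names are illustrative assumptions):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Monthly temperature trends, one line per year
monthly = df.groupby([df.index.year, df.index.month])["temperature"].mean().unstack(0)
monthly.plot(ax=axes[0, 0], title="Monthly Temperature Trends")
axes[0, 0].set_xlabel("Month")

# Precipitation distribution
df["precipitation"].plot.hist(ax=axes[0, 1], bins=40, title="Precipitation Distribution")
axes[0, 1].set_xlabel("Precipitation (mm)")

# Temperature vs humidity scatter
axes[1, 0].scatter(df["temperature"], df["humidity"], s=2, alpha=0.3)
axes[1, 0].set(title="Temperature vs Humidity", xlabel="Temp (°C)", ylabel="Humidity (%)")

# Wind speed box plot
df.boxplot(column="wind_speed", ax=axes[1, 1])
axes[1, 1].set_title("Wind Speed Distribution")

plt.tight_layout()
plt.show()
```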
## Point Session Data Analysis
### Data Period
- Start Date: 2019-01-23 14:00:31.857000
- End Date: 2024-06-28 10:19:29.179000
### Basic Statistics
- Total records: 2,038,060
- Unique sessions: 581,137
- Unique points: 74
### Daily Session Statistics
- Average sessions per day: 440
- Maximum sessions in one day: 1,127
- Minimum sessions in one day: 1
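Figures like these come from a few pandas aggregations; a sketch assuming a `sessions` DataFrame with `session_id`, `point_id`, `timestamp`, and `property_type` columns (names are illustrative):

```python
# Unique sessions, points, and property-type counts
print("Total records:", len(sessions))
print("Unique sessions:", sessions["session_id"].nunique())
print("Unique points:", sessions["point_id"].nunique())
print(sessions["property_type"].value_counts())

# Sessions per calendar day
per_day = sessions.groupby(sessions["timestamp"].dt.date)["session_id"].nunique()
print("Average sessions per day:", round(per_day.mean()))
print("Max / min sessions in one day:", per_day.max(), "/", per_day.min())
```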
### Property Type Distribution
Property Type | Count |
---|---|
dishWeight | 965,872 |
menuDeduction | 610,689 |
wasteWeight | 435,282 |
rating | 17,215 |
menuSelection | 9,002 |
### Top 10 Most Active Points
Point ID | Session Count |
---|---|
jate1 | 192,585 |
jate2 | 192,539 |
koti2-oikea-salaatti2 | 64,539 |
koti2-vasen-salaatti2 | 63,873 |
koti2-oikea-salaatti3 | 62,030 |
koti2-oikea-lammin4 | 57,458 |
koti2-vasen-salaatti3 | 56,686 |
koti2-oikea-salaatti1 | 54,323 |
koti2-vasen-salaatti1 | 52,259 |
koti2-oikea-lammin1 | 49,822 |
### Top 10 Busiest Days
Date | Sessions |
---|---|
2023-09-27 | 1,127 |
2023-11-23 | 1,028 |
2022-10-26 | 1,017 |
2022-11-17 | 993 |
2023-01-09 | 993 |
2019-12-03 | 985 |
2022-08-17 | 967 |
2024-02-07 | 966 |
2022-11-02 | 952 |
2022-11-08 | 930 |
### Common Sequential Patterns
Pattern | Occurrences |
---|---|
wasteWeight | 128,782 |
dishWeight → menuDeduction | 54,954 |
dishWeight → menuDeduction → wasteWeight | 52,007 |
dishWeight | 30,432 |
dishWeight → menuDeduction → dishWeight → menuDeduction → wasteWeight | 24,695 |
dishWeight → wasteWeight | 23,573 |
dishWeight → menuDeduction → dishWeight → menuDeduction | 23,260 |
dishWeight → dishWeight → menuDeduction | 19,946 |
wasteWeight → wasteWeight | 18,773 |
dishWeight → dishWeight → menuDeduction → wasteWeight | 16,099 |
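These patterns can be mined by ordering each session's events chronologically and counting the resulting property-type sequences; a sketch using the same illustrative `sessions` frame as above:

```python
from collections import Counter

# One ordered tuple of property types per session
ordered = sessions.sort_values("timestamp")
sequences = ordered.groupby("session_id")["property_type"].agg(tuple)

# Count identical sequences and show the most common ones
pattern_counts = Counter(sequences)
for pattern, count in pattern_counts.most_common(10):
    print(" → ".join(pattern), count)
```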
### Visualizations
The following image presents key insights from the session dataset, including session distribution, property type occurrences, and busiest points.
Description of Visualizations:
- Session Distribution Over Time: Shows the total number of sessions per day across the dataset timeframe, highlighting peak usage periods.
- Property Type Occurrences: Displays the frequency of different property types such as `dishWeight`, `menuDeduction`, and `wasteWeight` to understand session interactions.
- Busiest Points Activity: Identifies the most active points where session interactions take place, helping to analyze high-traffic areas.
- Sequential Patterns: Reveals the most common sequences in session interactions, showing how different activities relate to each other.
## Overfitting Analysis and Model Performance Comparison
### Overview
This document provides an analysis of overfitting issues in different XGBoost models (M1, M2, M3, M4, M5) and offers recommendations for improvement. The key focus is identifying models that generalize well to unseen data and addressing those that suffer from overfitting.
### Training and Evaluation Process
#### Training Process
- Data Preprocessing:
  - Convert datetime columns to numeric values.
  - Ensure all features are numerical and handle missing values.
  - Standardize features using `StandardScaler` to normalize input values.
- Model Selection and Tuning:
  - Hyperparameter tuning performed using Optuna, optimizing for RMSE.
  - Models trained using the XGBoost regressor with different hyperparameter settings.
  - Training conducted using GPU acceleration (`tree_method='hist'`).
- Training Steps:
  - Train-test split with 80% training data and 20% test data.
  - The study iterates through 500 optimization trials for hyperparameter tuning.
  - The final model is selected based on the best RMSE score.
  - Models are evaluated using cross-validation and learning curves (a condensed sketch of this pipeline appears after this list).
- K-Fold Cross-Validation:
  - 5-fold cross-validation (`cv=5`) is applied to assess model generalization.
  - The dataset is split into 5 parts, with each part serving as the validation set once.
  - The final performance score is averaged across all 5 iterations.
  - This helps ensure the model is robust and does not overfit to a single train-test split.
- Evaluation Metrics:
  - Mean Absolute Error (MAE) to measure absolute prediction error.
  - Root Mean Squared Error (RMSE) to penalize large errors.
  - Training vs. test performance comparison to detect overfitting.
- Learning Curve Analysis:
  - Training and validation scores are analyzed over increasing dataset sizes.
  - This helps identify whether models are underfitting or overfitting.
- Learning Curves Interpretation:
  - M1 & M2: Show significant overfitting; training error remains very low while validation error stays high.
  - M3: Best balance between training and validation errors, indicating good generalization.
  - M4 & M5: Some degree of overfitting, but better than M1 & M2.
- Visualization:
  - Blue line: Training score (lower is better).
  - Red line: Cross-validation score (lower is better).
  - Shaded region: Variance in the cross-validation error.
- Learning Curve Visualizations: learning-curve plots for each model, following the conventions above (a sketch of how these curves can be generated appears below).
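A condensed sketch of the pipeline described above: scaling, an 80/20 split, and an Optuna study minimizing RMSE. `X` and `y` are placeholders for the prepared feature matrix and daily-customer target, and the trial count is reduced here from the project's 500:

```python
import numpy as np
import optuna
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X, y are placeholders for the prepared features and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

def objective(trial):
    params = {
        # 'hist' tree method; in XGBoost 2.x, device='cuda' would add GPU acceleration
        "tree_method": "hist",
        "n_estimators": trial.suggest_int("n_estimators", 100, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return np.sqrt(mean_squared_error(y_test, preds))  # RMSE

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)  # the project used 500 trials
best_model = xgb.XGBRegressor(tree_method="hist", **study.best_params).fit(X_train, y_train)
```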
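And a sketch of the 5-fold cross-validation and learning-curve checks, following the plotting conventions listed above (blue training curve, red validation curve, shaded variance); it reuses `best_model` from the tuning sketch:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score, learning_curve

# 5-fold CV on MAE; scores are negated by sklearn's convention
cv_mae = -cross_val_score(best_model, X_train, y_train,
                          cv=5, scoring="neg_mean_absolute_error")
print("CV MAE per fold:", cv_mae.round(2), "mean:", cv_mae.mean().round(2))

# Learning curve: training vs validation error over growing training sizes
sizes, train_scores, val_scores = learning_curve(
    best_model, X_train, y_train, cv=5,
    scoring="neg_mean_absolute_error",
    train_sizes=np.linspace(0.1, 1.0, 5))
plt.plot(sizes, -train_scores.mean(axis=1), "b-", label="Training score")
plt.plot(sizes, -val_scores.mean(axis=1), "r-", label="Cross-validation score")
plt.fill_between(sizes,
                 -val_scores.mean(axis=1) - val_scores.std(axis=1),
                 -val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.2)
plt.xlabel("Training set size"); plt.ylabel("MAE"); plt.legend(); plt.show()
```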
### Model Performance Metrics
#### Ranking Based on Test MAE
Model | Algorithm | Test MAE | Test RMSE | Train MAE | Train RMSE |
---|---|---|---|---|---|
M3 | XGBoost | 55.28 | 76.80 | 29.85 | 39.95 |
M4 | XGBoost | 70.55 | 109.51 | 54.73 | 78.76 |
M1 | XGBoost | 73.84 | 111.25 | 35.74 | 50.01 |
M5 | XGBoost | 74.57 | 121.47 | 47.47 | 66.65 |
M2 | XGBoost | 76.57 | 116.69 | 9.87 | 13.69 |
#### Overfitting Indicators
Model | Train MAE | Train RMSE | Test MAE | Test RMSE | MAE Gap | RMSE Gap |
---|---|---|---|---|---|---|
M2 | 9.87 | 13.69 | 76.57 | 116.69 | 66.70 | 103.00 |
M5 | 47.47 | 66.65 | 74.57 | 121.47 | 27.10 | 54.82 |
M3 | 29.85 | 39.95 | 55.28 | 76.80 | 25.43 | 36.85 |
M4 | 54.73 | 78.76 | 70.55 | 109.51 | 15.82 | 30.75 |
M1 | 35.74 | 50.01 | 73.84 | 111.25 | 38.10 | 61.24 |
### Key Takeaways
- M2 is the most overfit model, with a MAE gap of 66.70 and RMSE gap of 103.00.
- M1 and M5 also exhibit significant overfitting (MAE gaps of 38.10 and 27.10), with large discrepancies between training and test performance.
- M3 is the best model, balancing test accuracy with minimal overfitting.
- M4 shows relatively stable performance and may also be a good option.
### Recommendations to Address Overfitting
#### 1️⃣ Increase Regularization
Adding L1 (Lasso) or L2 (Ridge) regularization can penalize large weights and reduce overfitting:
```python
'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 10.0),   # L1 regularization
'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 10.0)  # L2 regularization
```
#### 2️⃣ Reduce Model Complexity
- Lower `max_depth` to 3–6 instead of 9–10.
- Reduce `n_estimators` to 100–150 instead of 200–300.
- Increase `min_child_weight` to 5–7 to make splits more selective.
#### 3️⃣ Use More Training Data
- Expand the dataset or apply synthetic data augmentation.
- Implement k-fold cross-validation with `cv=5` to improve generalization.
#### 4️⃣ Adjust Learning Rate
- Use a lower learning rate (0.01–0.1) with a higher number of trees (`n_estimators` 150–200) to smooth decision boundaries.
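Taken together, recommendations 1️⃣, 2️⃣, and 4️⃣ describe a more conservative configuration; one illustrative (untuned) way to express them:

```python
import xgboost as xgb

conservative_params = {
    "tree_method": "hist",
    "max_depth": 4,          # within the suggested 3–6 range (2️⃣)
    "n_estimators": 175,     # more trees paired with a lower learning rate (4️⃣)
    "min_child_weight": 6,   # more selective splits (2️⃣)
    "learning_rate": 0.05,   # within the suggested 0.01–0.1 range (4️⃣)
    "reg_alpha": 1.0,        # L1 regularization (1️⃣)
    "reg_lambda": 5.0,       # L2 regularization (1️⃣)
}
model = xgb.XGBRegressor(**conservative_params)
```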
#### 5️⃣ Feature Selection & Engineering
- Remove highly correlated or redundant features.
- Use SHAP values or feature importance scores to identify and remove unnecessary features (see the sketch below).
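A sketch of the SHAP screening step; it assumes the fitted `best_model` and `X_train` from the tuning sketch, `feature_names` is a placeholder list, and `shap` is an extra dependency:

```python
import numpy as np
import shap

# Mean absolute SHAP value per feature as a global importance score
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_train)
importance = np.abs(shap_values).mean(axis=0)

# Rank features; low-scoring ones are candidates for removal
# (feature_names is a placeholder for the actual feature list)
for name, score in sorted(zip(feature_names, importance), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```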
### Final Recommendation
- Use M3 as the best model since it has the lowest test MAE and RMSE with minimal overfitting.
- Apply regularization and hyperparameter tuning to further optimize M3.
- If stability is a priority, M4 is also a viable option.
### Next Steps
- Implement suggested changes and re-evaluate models.
- Generate new hyperparameter settings to further reduce overfitting.
- Validate improved models with additional test data.
By following these recommendations, we can improve the model's generalization while maintaining strong predictive accuracy.