daily_customers_predict

Feature Engineering of Turku Weather Data

Code Structure

class TurkuWeatherAnalysis:
    - load_and_preprocess(filepath)  # Loads and preprocesses the dataset
    - visualize_data()               # Generates weather-related plots
    - generate_statistics()          # Computes annual statistics
    - check_data_period()            # Displays the available time period
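As a rough illustration of the first method, here is a minimal load_and_preprocess sketch built on pandas. The file layout and column names (timestamp, temperature, precipitation, wind_speed, humidity) are assumptions for illustration, not the dataset's actual schema.

import pandas as pd

class TurkuWeatherAnalysis:
    def load_and_preprocess(self, filepath):
        # Assumed schema: hourly CSV with a 'timestamp' column plus
        # 'temperature', 'precipitation', 'wind_speed' and 'humidity'.
        df = pd.read_csv(filepath, parse_dates=['timestamp'])
        df = df.set_index('timestamp').sort_index()
        # Fill short gaps in the hourly series, then drop what remains.
        df = df.interpolate(limit=3).dropna()
        self.df = df
        return df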

Data Period

  • Start Date: 2021-01-01 00:00:00
  • End Date: 2024-09-23 16:00:00

Annual Weather Statistics

Year | Avg Temp (°C) | Min Temp (°C) | Max Temp (°C) | Temp Std (°C) | Abs Max Temp (°C) | Abs Min Temp (°C) | Total Precip (mm) | Max Precip (mm) | Avg Wind (m/s) | Max Wind (m/s)
---- | ------------- | ------------- | ------------- | ------------- | ----------------- | ----------------- | ----------------- | --------------- | -------------- | --------------
2021 | 6.54 | -22.2 | 32.0 | 10.17 | 32.5 | -22.3 | 599.4 | 10.1 | 2.98 | 10.4
2022 | 7.16 | -16.7 | 31.1 | 8.69 | 31.6 | -17.0 | 575.0 | 11.6 | 2.88 | 9.8
2023 | 6.75 | -17.6 | 32.5 | 9.26 | 33.0 | -17.8 | 649.8 | 9.0 | 2.66 | 10.2
2024 | 8.57 | -23.1 | 30.1 | 10.65 | 30.6 | -23.3 | 445.7 | 8.2 | 2.91 | 9.6

Note: the 2024 figures cover 1 January through 23 September only (see Data Period), so its totals and averages are not directly comparable with the full years.
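The document does not state what distinguishes the plain Min/Max columns from the "Abs" (absolute) ones; one guess is that the plain values are extremes of the daily mean temperature while the absolute values are raw hourly extremes. Under that assumption, and with the column names assumed earlier, a sketch of generate_statistics could look like this:

import pandas as pd

def generate_statistics(df: pd.DataFrame) -> pd.DataFrame:
    # Assumption: 'Min/Max Temp' are extremes of the daily mean,
    # 'Abs Min/Max Temp' are raw hourly extremes.
    temp = df['temperature']
    daily_mean = temp.resample('D').mean()
    year = df.index.year
    stats = pd.DataFrame({
        'avg_temp': temp.groupby(year).mean(),
        'min_temp': daily_mean.groupby(daily_mean.index.year).min(),
        'max_temp': daily_mean.groupby(daily_mean.index.year).max(),
        'temp_std': temp.groupby(year).std(),
        'abs_max_temp': temp.groupby(year).max(),
        'abs_min_temp': temp.groupby(year).min(),
        'total_precip': df['precipitation'].groupby(year).sum(),
        'max_precip': df['precipitation'].groupby(year).max(),
        'avg_wind': df['wind_speed'].groupby(year).mean(),
        'max_wind': df['wind_speed'].groupby(year).max(),
    })
    return stats.round(2)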

Visualizations

The following image presents key insights from the weather dataset, including temperature trends, precipitation distribution, humidity relationships, and wind speed analysis.

[Figure: Weather Data Analysis]

Description of Visualizations:

  1. Monthly Temperature Trends (Top Left): Shows how average temperature varies across the months of 2021–2024. The warmest months fall in June–August, while the coldest periods occur in December–February.
  2. Precipitation Distribution (Top Right): Highlights the frequency and distribution of precipitation values. The majority of observations fall below 2 mm, with occasional extremes up to 12 mm.
  3. Temperature vs Humidity (Bottom Left): A scatter plot of temperature against humidity. Higher temperatures tend to coincide with lower humidity, while lower temperatures are associated with higher humidity.
  4. Wind Speed Distribution (Bottom Right): A box plot summarizing wind speeds. The median is around 3 m/s, with extremes reaching roughly 10 m/s.

Point Session Data Analysis

Data Period

  • Start Date: 2019-01-23 14:00:31.857000
  • End Date: 2024-06-28 10:19:29.179000

Basic Statistics

  • Total records: 2,038,060
  • Unique sessions: 581,137
  • Unique points: 74

Daily Session Statistics

  • Average sessions per day: 440
  • Maximum sessions in one day: 1,127
  • Minimum sessions in one day: 1
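These daily figures fall out of a simple groupby on the event timestamps. A sketch, with the filename and the column names 'sessionId' and 'timestamp' used purely for illustration:

import pandas as pd

events = pd.read_csv('point_sessions.csv', parse_dates=['timestamp'])  # hypothetical filename
per_day = events.groupby(events['timestamp'].dt.date)['sessionId'].nunique()

print(f'Average sessions per day: {per_day.mean():.0f}')
print(f'Maximum sessions in one day: {per_day.max()}')
print(f'Minimum sessions in one day: {per_day.min()}')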

Property Type Distribution

Property Type | Count
------------- | -------
dishWeight | 965,872
menuDeduction | 610,689
wasteWeight | 435,282
rating | 17,215
menuSelection | 9,002

Top 10 Most Active Points

Point ID | Session Count
-------- | -------------
jate1 | 192,585
jate2 | 192,539
koti2-oikea-salaatti2 | 64,539
koti2-vasen-salaatti2 | 63,873
koti2-oikea-salaatti3 | 62,030
koti2-oikea-lammin4 | 57,458
koti2-vasen-salaatti3 | 56,686
koti2-oikea-salaatti1 | 54,323
koti2-vasen-salaatti1 | 52,259
koti2-oikea-lammin1 | 49,822

Top 10 Busiest Days

Date | Sessions
---------- | --------
2023-09-27 | 1,127
2023-11-23 | 1,028
2022-10-26 | 1,017
2022-11-17 | 993
2023-01-09 | 993
2019-12-03 | 985
2022-08-17 | 967
2024-02-07 | 966
2022-11-02 | 952
2022-11-08 | 930

Common Sequential Patterns

Pattern | Occurrences
------- | -----------
wasteWeight | 128,782
dishWeight → menuDeduction | 54,954
dishWeight → menuDeduction → wasteWeight | 52,007
dishWeight | 30,432
dishWeight → menuDeduction → dishWeight → menuDeduction → wasteWeight | 24,695
dishWeight → wasteWeight | 23,573
dishWeight → menuDeduction → dishWeight → menuDeduction | 23,260
dishWeight → dishWeight → menuDeduction | 19,946
wasteWeight → wasteWeight | 18,773
dishWeight → dishWeight → menuDeduction → wasteWeight | 16,099
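One plausible way to obtain such counts is to order each session's events by time, join its property types into a sequence, and tally identical sequences. The column names below are assumptions:

from collections import Counter
import pandas as pd

def count_session_patterns(events: pd.DataFrame) -> Counter:
    # Assumed columns: 'sessionId', 'timestamp', 'propertyType'.
    ordered = events.sort_values(['sessionId', 'timestamp'])
    sequences = ordered.groupby('sessionId')['propertyType'].agg(' → '.join)
    return Counter(sequences)

# count_session_patterns(events).most_common(10) yields a table like the one above.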

Visualizations

The following image presents key insights from the session dataset, including session distribution, property type occurrences, and busiest points.

[Figure: Point Session Data Analysis]

Description of Visualizations:

  1. Session Distribution Over Time: Shows the total number of sessions per day across the dataset timeframe, highlighting peak usage periods.
  2. Property Type Occurrences: Displays the frequency of different property types such as dishWeight, menuDeduction, and wasteWeight to understand session interactions.
  3. Busiest Points Activity: Identifies the most active points where session interactions take place, helping to analyze high-traffic areas.
  4. Sequential Patterns: Reveals the most common sequences in session interactions, showing how different activities relate to each other.

Overfitting Analysis and Model Performance Comparison

Overview

This document provides an analysis of overfitting issues in different XGBoost models (M1, M2, M3, M4, M5) and offers recommendations for improvement. The key focus is identifying models that generalize well to unseen data and addressing those that suffer from overfitting.


Training and Evaluation Process

Training Process

  1. Data Preprocessing:

    • Convert datetime columns to numeric values.
    • Ensure all features are numerical and handle missing values.
    • Standardize features using StandardScaler to normalize input values.
  2. Model Selection and Tuning:

    • Hyperparameter tuning performed using Optuna, optimizing for RMSE.
    • Models trained with the XGBoost regressor under different hyperparameter settings.
    • Training conducted using GPU acceleration (tree_method='hist'); a condensed code sketch of the whole pipeline follows this list.
  3. Training Steps:

    • Train-test split with 80% training data and 20% test data.
    • Model iterates through 500 optimization trials for hyperparameter tuning.
    • Final model is selected based on the best RMSE score.
    • Models are evaluated using cross-validation and learning curves.
  4. K-Fold Cross-Validation:

    • 5-fold cross-validation (cv=5) is applied to assess model generalization.
    • The dataset is split into 5 parts, with each part serving as the validation set once.
    • The final performance score is averaged across all 5 iterations.
    • Helps ensure the model is robust and does not overfit to a single train-test split.
  5. Evaluation Metrics:

    • Mean Absolute Error (MAE) to measure absolute prediction error.
    • Root Mean Squared Error (RMSE) to penalize large errors.
    • Training vs. Test Performance Comparison to detect overfitting.
  6. Learning Curve Analysis:

    • Training and validation scores analyzed over increasing dataset sizes.
    • Helps identify if models are underfitting or overfitting.
  7. Learning Curves Interpretation:

    • M1 & M2: Show significant overfitting as training error remains very low while validation error is high.
    • M3: Best balance between training and validation errors, indicating good generalization.
    • M4 & M5: Some degree of overfitting, but better than M1 & M2.
    • Visualization:
      • Blue line: Training score (lower is better)
      • Red line: Cross-validation score (lower is better)
      • Shaded region: Variance in cross-validation error.
  8. Learning Curve Visualizations:

    • Below are the learning curves for each model:
      • M1: [Figure: Learning Curve M1]
      • M2: [Figure: Learning Curve M2]
      • M3: [Figure: Learning Curve M3]
      • M4: [Figure: Learning Curve M4]
      • M5: [Figure: Learning Curve M5]
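To make the pipeline concrete, here is a condensed sketch of the steps above (split, scaling, Optuna search, cross-validation, learning curve). It is illustrative only: the real feature matrix and target are replaced by synthetic stand-ins, and the search space is abbreviated.

import numpy as np
import optuna
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

# Synthetic stand-in for the preprocessed features and daily-customer target.
X, y = make_regression(n_samples=1500, n_features=10, noise=10.0, random_state=42)

# 80/20 train-test split, features standardized as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'n_estimators': trial.suggest_int('n_estimators', 100, 300),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'tree_method': 'hist',  # add device='cuda' for GPU runs (XGBoost >= 2.0)
    }
    # 5-fold CV; sklearn returns negated RMSE, so flip the sign.
    rmse = -cross_val_score(XGBRegressor(**params), X_train, y_train,
                            cv=5, scoring='neg_root_mean_squared_error').mean()
    return rmse

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=500)  # 500 trials, as in the process above

best_model = XGBRegressor(**study.best_params, tree_method='hist').fit(X_train, y_train)

# Learning curve: training vs cross-validation error over growing training sizes
# (the blue and red lines in the figures above).
sizes, train_scores, val_scores = learning_curve(
    best_model, X_train, y_train, cv=5,
    scoring='neg_root_mean_squared_error',
    train_sizes=np.linspace(0.1, 1.0, 5))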

Model Performance Metrics

Ranking Based on Test MAE

Model | Algorithm | Test MAE | Test RMSE | Train MAE | Train RMSE
----- | --------- | -------- | --------- | --------- | ----------
M3 | XGBoost | 55.28 | 76.80 | 29.85 | 39.95
M4 | XGBoost | 70.55 | 109.51 | 54.73 | 78.76
M1 | XGBoost | 73.84 | 111.25 | 35.74 | 50.01
M5 | XGBoost | 74.57 | 121.47 | 47.47 | 66.65
M2 | XGBoost | 76.57 | 116.69 | 9.87 | 13.69

Overfitting Indicators

Gap = test metric − train metric; rows sorted by MAE gap, descending.

Model | Train MAE | Train RMSE | Test MAE | Test RMSE | MAE Gap | RMSE Gap
----- | --------- | ---------- | -------- | --------- | ------- | --------
M2 | 9.87 | 13.69 | 76.57 | 116.69 | 66.70 | 103.00
M1 | 35.74 | 50.01 | 73.84 | 111.25 | 38.10 | 61.24
M5 | 47.47 | 66.65 | 74.57 | 121.47 | 27.10 | 54.82
M3 | 29.85 | 39.95 | 55.28 | 76.80 | 25.43 | 36.85
M4 | 54.73 | 78.76 | 70.55 | 109.51 | 15.82 | 30.75

Key Takeaways

  • M2 is the most overfit model, with a MAE gap of 66.70 and RMSE gap of 103.00.
  • M5 also exhibits significant overfitting, with large discrepancies between training and test performance.
  • M3 is the best model, balancing test accuracy with minimal overfitting.
  • M4 shows relatively stable performance and may also be a good option.

Recommendations to Address Overfitting

1️⃣ Increase Regularization

Adding L1 (Lasso) or L2 (Ridge) regularization can penalize large weights and reduce overfitting:

# Inside the Optuna objective's parameter dictionary:
'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 10.0),   # L1 (Lasso) regularization
'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 10.0)  # L2 (Ridge) regularization

2️⃣ Reduce Model Complexity

  • Lower max_depth to 3–6 instead of 9–10.
  • Reduce n_estimators to 100–150 instead of 200–300.
  • Increase min_child_weight to 5–7 to make splits more selective.
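In Optuna terms, the tighter search space from the bullets above might read as follows (a sketch, not the repository's actual configuration):

params = {
    'max_depth': trial.suggest_int('max_depth', 3, 6),               # down from 9-10
    'n_estimators': trial.suggest_int('n_estimators', 100, 150),     # down from 200-300
    'min_child_weight': trial.suggest_int('min_child_weight', 5, 7)  # more selective splits
}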

3️⃣ Use More Training Data

  • Expand the dataset or apply synthetic data augmentation.
  • Use k-fold cross-validation (cv=5) so model selection reflects performance across several splits rather than a single one.

4️⃣ Adjust Learning Rate

  • Use a lower learning rate (0.01–0.1) with more trees (n_estimators 150–200); each tree then contributes less, letting the ensemble converge more smoothly.

5️⃣ Feature Selection & Engineering

  • Remove highly correlated or redundant features.
  • Use SHAP values or feature importance scores to identify and remove unnecessary features.
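A minimal SHAP sketch, assuming a fitted XGBoost model (best_model) and a held-out feature matrix X_test from the earlier steps:

import shap

explainer = shap.TreeExplainer(best_model)   # works for tree ensembles such as XGBoost
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)       # ranks features by mean |SHAP| value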

Final Recommendation

  • Use M3 as the best model since it has the lowest test MAE and RMSE with minimal overfitting.
  • Apply regularization and hyperparameter tuning to further optimize M3.
  • If stability is a priority, M4 is also a viable option.

Next Steps

  • Implement suggested changes and re-evaluate models.
  • Generate new hyperparameter settings to further reduce overfitting.
  • Validate improved models with additional test data.

By following these recommendations, we can improve the model's generalization while maintaining strong predictive accuracy.


📌 End of Document