daily_customers_predict

Feature Engineering of Turku Weather Data

Code Structure

class TurkuWeatherAnalysis:
    - load_and_preprocess(filepath)  # Loads and preprocesses the dataset
    - visualize_data()               # Generates weather-related plots
    - generate_statistics()          # Computes annual statistics
    - check_data_period()            # Displays the available time period
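As a rough illustration of the first method, here is a minimal load_and_preprocess sketch built on pandas. The file layout and column names (timestamp, temperature, precipitation, wind_speed, humidity) are assumptions for illustration, not the dataset's actual schema.

import pandas as pd

class TurkuWeatherAnalysis:
    def load_and_preprocess(self, filepath):
        # Assumed schema: hourly CSV with a 'timestamp' column plus
        # 'temperature', 'precipitation', 'wind_speed' and 'humidity'.
        df = pd.read_csv(filepath, parse_dates=['timestamp'])
        df = df.set_index('timestamp').sort_index()
        # Fill short gaps in the hourly series, then drop what remains.
        df = df.interpolate(limit=3).dropna()
        self.df = df
        return df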

Data Period

  • Start Date: 2021-01-01 00:00:00
  • End Date: 2024-09-23 16:00:00

Annual Weather Statistics

Year | Avg Temp (°C) | Min Temp (°C) | Max Temp (°C) | Temp Std (°C) | Abs Max Temp (°C) | Abs Min Temp (°C) | Total Precip (mm) | Max Precip (mm) | Avg Wind (m/s) | Max Wind (m/s)
---- | ------------- | ------------- | ------------- | ------------- | ----------------- | ----------------- | ----------------- | --------------- | -------------- | --------------
2021 | 6.54 | -22.2 | 32.0 | 10.17 | 32.5 | -22.3 | 599.4 | 10.1 | 2.98 | 10.4
2022 | 7.16 | -16.7 | 31.1 | 8.69 | 31.6 | -17.0 | 575.0 | 11.6 | 2.88 | 9.8
2023 | 6.75 | -17.6 | 32.5 | 9.26 | 33.0 | -17.8 | 649.8 | 9.0 | 2.66 | 10.2
2024 | 8.57 | -23.1 | 30.1 | 10.65 | 30.6 | -23.3 | 445.7 | 8.2 | 2.91 | 9.6

Note: the 2024 figures cover 1 January through 23 September only (see Data Period), so its totals and averages are not directly comparable with the full years.
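The document does not state what distinguishes the plain Min/Max columns from the "Abs" (absolute) ones; one guess is that the plain values are extremes of the daily mean temperature while the absolute values are raw hourly extremes. Under that assumption, and with the column names assumed earlier, a sketch of generate_statistics could look like this:

import pandas as pd

def generate_statistics(df: pd.DataFrame) -> pd.DataFrame:
    # Assumption: 'Min/Max Temp' are extremes of the daily mean,
    # 'Abs Min/Max Temp' are raw hourly extremes.
    temp = df['temperature']
    daily_mean = temp.resample('D').mean()
    year = df.index.year
    stats = pd.DataFrame({
        'avg_temp': temp.groupby(year).mean(),
        'min_temp': daily_mean.groupby(daily_mean.index.year).min(),
        'max_temp': daily_mean.groupby(daily_mean.index.year).max(),
        'temp_std': temp.groupby(year).std(),
        'abs_max_temp': temp.groupby(year).max(),
        'abs_min_temp': temp.groupby(year).min(),
        'total_precip': df['precipitation'].groupby(year).sum(),
        'max_precip': df['precipitation'].groupby(year).max(),
        'avg_wind': df['wind_speed'].groupby(year).mean(),
        'max_wind': df['wind_speed'].groupby(year).max(),
    })
    return stats.round(2)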

Visualizations

The following image presents key insights from the weather dataset, including temperature trends, precipitation distribution, humidity relationships, and wind speed analysis.

[Figure: Weather Data Analysis]

Description of Visualizations:

  1. Monthly Temperature Trends (Top Left): Shows how average temperature varies across the months of 2021–2024. The warmest months fall in June–August, while the coldest periods occur in December–February.
  2. Precipitation Distribution (Top Right): Highlights the frequency and distribution of precipitation values. The majority of observations fall below 2 mm, with occasional extremes up to 12 mm.
  3. Temperature vs Humidity (Bottom Left): A scatter plot of temperature against humidity. Higher temperatures tend to coincide with lower humidity, while lower temperatures are associated with higher humidity.
  4. Wind Speed Distribution (Bottom Right): A box plot summarizing wind speeds. The median is around 3 m/s, with extremes reaching roughly 10 m/s.

Point Session Data Analysis

Data Period

  • Start Date: 2019-01-23 14:00:31.857000
  • End Date: 2024-06-28 10:19:29.179000

Basic Statistics

  • Total records: 2,038,060
  • Unique sessions: 581,137
  • Unique points: 74

Daily Session Statistics

  • Average sessions per day: 440
  • Maximum sessions in one day: 1,127
  • Minimum sessions in one day: 1
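These daily figures fall out of a simple groupby on the event timestamps. A sketch, with the filename and the column names 'sessionId' and 'timestamp' used purely for illustration:

import pandas as pd

events = pd.read_csv('point_sessions.csv', parse_dates=['timestamp'])  # hypothetical filename
per_day = events.groupby(events['timestamp'].dt.date)['sessionId'].nunique()

print(f'Average sessions per day: {per_day.mean():.0f}')
print(f'Maximum sessions in one day: {per_day.max()}')
print(f'Minimum sessions in one day: {per_day.min()}')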

Property Type Distribution

Property Type | Count
------------- | -------
dishWeight | 965,872
menuDeduction | 610,689
wasteWeight | 435,282
rating | 17,215
menuSelection | 9,002

Top 10 Most Active Points

Point ID | Session Count
-------- | -------------
jate1 | 192,585
jate2 | 192,539
koti2-oikea-salaatti2 | 64,539
koti2-vasen-salaatti2 | 63,873
koti2-oikea-salaatti3 | 62,030
koti2-oikea-lammin4 | 57,458
koti2-vasen-salaatti3 | 56,686
koti2-oikea-salaatti1 | 54,323
koti2-vasen-salaatti1 | 52,259
koti2-oikea-lammin1 | 49,822

Top 10 Busiest Days

Date | Sessions
---------- | --------
2023-09-27 | 1,127
2023-11-23 | 1,028
2022-10-26 | 1,017
2022-11-17 | 993
2023-01-09 | 993
2019-12-03 | 985
2022-08-17 | 967
2024-02-07 | 966
2022-11-02 | 952
2022-11-08 | 930

Common Sequential Patterns

Pattern | Occurrences
------- | -----------
wasteWeight | 128,782
dishWeight → menuDeduction | 54,954
dishWeight → menuDeduction → wasteWeight | 52,007
dishWeight | 30,432
dishWeight → menuDeduction → dishWeight → menuDeduction → wasteWeight | 24,695
dishWeight → wasteWeight | 23,573
dishWeight → menuDeduction → dishWeight → menuDeduction | 23,260
dishWeight → dishWeight → menuDeduction | 19,946
wasteWeight → wasteWeight | 18,773
dishWeight → dishWeight → menuDeduction → wasteWeight | 16,099
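One plausible way to obtain such counts is to order each session's events by time, join its property types into a sequence, and tally identical sequences. The column names below are assumptions:

from collections import Counter
import pandas as pd

def count_session_patterns(events: pd.DataFrame) -> Counter:
    # Assumed columns: 'sessionId', 'timestamp', 'propertyType'.
    ordered = events.sort_values(['sessionId', 'timestamp'])
    sequences = ordered.groupby('sessionId')['propertyType'].agg(' → '.join)
    return Counter(sequences)

# count_session_patterns(events).most_common(10) yields a table like the one above.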

Visualizations

The following image presents key insights from the session dataset, including session distribution, property type occurrences, and busiest points.

[Figure: Point Session Data Analysis]

Description of Visualizations:

  1. Session Distribution Over Time: Shows the total number of sessions per day across the dataset timeframe, highlighting peak usage periods.
  2. Property Type Occurrences: Displays the frequency of different property types such as dishWeight, menuDeduction, and wasteWeight to understand session interactions.
  3. Busiest Points Activity: Identifies the most active points where session interactions take place, helping to analyze high-traffic areas.
  4. Sequential Patterns: Reveals the most common sequences in session interactions, showing how different activities relate to each other.

Overfitting Analysis and Model Performance Comparison

Overview

This document provides an analysis of overfitting issues in different XGBoost models (M1, M2, M3, M4, M5) and offers recommendations for improvement. The key focus is identifying models that generalize well to unseen data and addressing those that suffer from overfitting.


Training and Evaluation Process

Training Process

  1. Data Preprocessing:

    • Convert datetime columns to numeric values.
    • Ensure all features are numerical and handle missing values.
    • Standardize features using StandardScaler to normalize input values.
  2. Model Selection and Tuning:

    • Hyperparameter tuning performed using Optuna, optimizing for RMSE.
    • Models trained with the XGBoost regressor under different hyperparameter settings.
    • Training conducted using GPU acceleration (tree_method='hist'); a condensed code sketch of the whole pipeline follows this list.
  3. Training Steps:

    • Train-test split with 80% training data and 20% test data.
    • Model iterates through 500 optimization trials for hyperparameter tuning.
    • Final model is selected based on the best RMSE score.
    • Models are evaluated using cross-validation and learning curves.
  4. K-Fold Cross-Validation:

    • 5-fold cross-validation (cv=5) is applied to assess model generalization.
    • The dataset is split into 5 parts, with each part serving as the validation set once.
    • The final performance score is averaged across all 5 iterations.
    • Helps ensure the model is robust and does not overfit to a single train-test split.
  5. Evaluation Metrics:

    • Mean Absolute Error (MAE) to measure absolute prediction error.
    • Root Mean Squared Error (RMSE) to penalize large errors.
    • Training vs. Test Performance Comparison to detect overfitting.
  6. Learning Curve Analysis:

    • Training and validation scores analyzed over increasing dataset sizes.
    • Helps identify if models are underfitting or overfitting.
  7. Learning Curves Interpretation:

    • M1 & M2: Show significant overfitting as training error remains very low while validation error is high.
    • M3: Best balance between training and validation errors, indicating good generalization.
    • M4 & M5: Some degree of overfitting, but better than M1 & M2.
    • Visualization:
      • Blue line: Training score (lower is better)
      • Red line: Cross-validation score (lower is better)
      • Shaded region: Variance in cross-validation error.
  8. Learning Curve Visualizations:

    • Below are the learning curves for each model:
      • M1: [Figure: Learning Curve M1]
      • M2: [Figure: Learning Curve M2]
      • M3: [Figure: Learning Curve M3]
      • M4: [Figure: Learning Curve M4]
      • M5: [Figure: Learning Curve M5]
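To make the pipeline concrete, here is a condensed sketch of the steps above (split, scaling, Optuna search, cross-validation, learning curve). It is illustrative only: the real feature matrix and target are replaced by synthetic stand-ins, and the search space is abbreviated.

import numpy as np
import optuna
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

# Synthetic stand-in for the preprocessed features and daily-customer target.
X, y = make_regression(n_samples=1500, n_features=10, noise=10.0, random_state=42)

# 80/20 train-test split, features standardized as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'n_estimators': trial.suggest_int('n_estimators', 100, 300),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'tree_method': 'hist',  # add device='cuda' for GPU runs (XGBoost >= 2.0)
    }
    # 5-fold CV; sklearn returns negated RMSE, so flip the sign.
    rmse = -cross_val_score(XGBRegressor(**params), X_train, y_train,
                            cv=5, scoring='neg_root_mean_squared_error').mean()
    return rmse

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=500)  # 500 trials, as in the process above

best_model = XGBRegressor(**study.best_params, tree_method='hist').fit(X_train, y_train)

# Learning curve: training vs cross-validation error over growing training sizes
# (the blue and red lines in the figures above).
sizes, train_scores, val_scores = learning_curve(
    best_model, X_train, y_train, cv=5,
    scoring='neg_root_mean_squared_error',
    train_sizes=np.linspace(0.1, 1.0, 5))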

Model Performance Metrics

Ranking Based on Test MAE

Model | Algorithm | Test MAE | Test RMSE | Train MAE | Train RMSE
----- | --------- | -------- | --------- | --------- | ----------
M3 | XGBoost | 55.28 | 76.80 | 29.85 | 39.95
M4 | XGBoost | 70.55 | 109.51 | 54.73 | 78.76
M1 | XGBoost | 73.84 | 111.25 | 35.74 | 50.01
M5 | XGBoost | 74.57 | 121.47 | 47.47 | 66.65
M2 | XGBoost | 76.57 | 116.69 | 9.87 | 13.69

Overfitting Indicators

Gap = test metric − train metric; rows sorted by MAE gap, descending.

Model | Train MAE | Train RMSE | Test MAE | Test RMSE | MAE Gap | RMSE Gap
----- | --------- | ---------- | -------- | --------- | ------- | --------
M2 | 9.87 | 13.69 | 76.57 | 116.69 | 66.70 | 103.00
M1 | 35.74 | 50.01 | 73.84 | 111.25 | 38.10 | 61.24
M5 | 47.47 | 66.65 | 74.57 | 121.47 | 27.10 | 54.82
M3 | 29.85 | 39.95 | 55.28 | 76.80 | 25.43 | 36.85
M4 | 54.73 | 78.76 | 70.55 | 109.51 | 15.82 | 30.75

Key Takeaways

  • M2 is the most overfit model, with a MAE gap of 66.70 and RMSE gap of 103.00.
  • M5 also exhibits significant overfitting, with large discrepancies between training and test performance.
  • M3 is the best model, balancing test accuracy with minimal overfitting.
  • M4 shows relatively stable performance and may also be a good option.

Recommendations to Address Overfitting

1️⃣ Increase Regularization

Adding L1 (Lasso) or L2 (Ridge) regularization can penalize large weights and reduce overfitting:

# Inside the Optuna objective's parameter dictionary:
'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 10.0),   # L1 (Lasso) regularization
'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 10.0)  # L2 (Ridge) regularization

2️⃣ Reduce Model Complexity

  • Lower max_depth to 3–6 instead of 9–10.
  • Reduce n_estimators to 100–150 instead of 200–300.
  • Increase min_child_weight to 5–7 to make splits more selective.
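In Optuna terms, the tighter search space from the bullets above might read as follows (a sketch, not the repository's actual configuration):

params = {
    'max_depth': trial.suggest_int('max_depth', 3, 6),               # down from 9-10
    'n_estimators': trial.suggest_int('n_estimators', 100, 150),     # down from 200-300
    'min_child_weight': trial.suggest_int('min_child_weight', 5, 7)  # more selective splits
}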

3️⃣ Use More Training Data

  • Expand the dataset or apply synthetic data augmentation.
  • Use k-fold cross-validation (cv=5) so model selection reflects performance across several splits rather than a single one.

4️⃣ Adjust Learning Rate

  • Use a lower learning rate (0.01–0.1) with more trees (n_estimators 150–200); each tree then contributes less, letting the ensemble converge more smoothly.

5️⃣ Feature Selection & Engineering

  • Remove highly correlated or redundant features.
  • Use SHAP values or feature importance scores to identify and remove unnecessary features.
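A minimal SHAP sketch, assuming a fitted XGBoost model (best_model) and a held-out feature matrix X_test from the earlier steps:

import shap

explainer = shap.TreeExplainer(best_model)   # works for tree ensembles such as XGBoost
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)       # ranks features by mean |SHAP| value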

Final Recommendation

  • Use M3 as the best model since it has the lowest test MAE and RMSE with minimal overfitting.
  • Apply regularization and hyperparameter tuning to further optimize M3.
  • If stability is a priority, M4 is also a viable option.

Next Steps

  • Implement suggested changes and re-evaluate models.
  • Generate new hyperparameter settings to further reduce overfitting.
  • Validate improved models with additional test data.

By following these recommendations, we can improve the model's generalization while maintaining strong predictive accuracy.


📌 End of Document