StopOdds

Methodology

How we analyze fare inspection patterns on Melbourne public transport

Statistical Approach

Our primary model uses gradient-boosted decision trees (LightGBM) to predict inspection rates while capturing interactions between demographic characteristics. We also fit Poisson regression models for statistical inference. Together, the two approaches give accurate predictions and interpretable statistical estimates.

Exposure Modeling

We use each respondent's trip count as the exposure variable and report rates per 100 trips, so that frequent and infrequent travellers are compared on the same scale.
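A minimal sketch of the rate calculation, using made-up respondents. The function name and the data are illustrative assumptions, not taken from the StopOdds codebase. It also shows why exposure matters: the trip-weighted mean of individual rates equals the pooled rate, while an unweighted mean does not.

```python
def rate_per_100_trips(stops, trips):
    """Inspections per 100 trips for one respondent or one pooled group."""
    if trips <= 0:
        raise ValueError("trips must be positive to serve as exposure")
    return 100.0 * stops / trips


# Hypothetical respondents: (trips in the last 30 days, times inspected)
respondents = [(40, 2), (10, 1), (50, 0)]

total_trips = sum(t for t, _ in respondents)
total_stops = sum(s for _, s in respondents)

# Pooled rate: all stops divided by all trips
pooled = rate_per_100_trips(total_stops, total_trips)

# Trip-weighted mean of individual rates (matches the pooled rate)
weighted = sum(t * rate_per_100_trips(s, t) for t, s in respondents) / total_trips

# Unweighted mean counts a 10-trip rider as much as a 50-trip rider
unweighted = sum(rate_per_100_trips(s, t) for t, s in respondents) / len(respondents)
```

With these three respondents the pooled and trip-weighted rates are both 3 per 100 trips, while the unweighted mean is 5 per 100 trips.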

Incidence Rate Ratios

We report IRRs with 95% confidence intervals to show relative inspection rates between demographic groups.
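As an illustration, an IRR and its 95% interval for two groups can be computed directly from stop and trip totals. This sketch assumes a simple two-group comparison with a Wald interval on the log scale; the regression-based IRRs we publish adjust for other traits, so this is a simplification. The function name and the totals are hypothetical.

```python
import math


def irr_with_ci(stops_a, trips_a, stops_b, trips_b, z=1.96):
    """IRR of group A versus reference group B, with a Wald interval.

    The standard error of log(IRR) for two Poisson counts is
    sqrt(1/stops_a + 1/stops_b).
    """
    if min(stops_a, stops_b) <= 0 or min(trips_a, trips_b) <= 0:
        raise ValueError("need positive stops and trips in both groups")
    irr = (stops_a / trips_a) / (stops_b / trips_b)
    se_log = math.sqrt(1 / stops_a + 1 / stops_b)
    lower = math.exp(math.log(irr) - z * se_log)
    upper = math.exp(math.log(irr) + z * se_log)
    return irr, lower, upper


# Hypothetical totals: group A has 30 stops in 600 trips, group B 10 in 400
irr, lower, upper = irr_with_ci(30, 600, 10, 400)
```

Here the IRR is 2.0 with an interval of roughly 0.98 to 4.09. Because the interval includes 1, these made-up counts would not show a statistically significant difference at the 5% level.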

Privacy Protection

We apply a k-anonymity threshold of k = 50: any group with fewer than 50 respondents is suppressed from public results.

Detailed Methods

We collect anonymous self-reported data including:

  • Trip counts

    Number of public transport journeys in the last 30 days

  • Stop counts

    Number of times fare inspectors checked tickets during those trips

  • Demographic information

    Optional characteristics including age, gender, ethnicity, height, etc.

We use LightGBM gradient boosting as our primary prediction model, with statistical models for inference:

Primary Model: LightGBM

• Gradient boosting machine learning algorithm
• Handles categorical features natively
• Automatically discovers feature interactions
• Weighted by trip counts for robust rate estimation
• SHAP values provide individual explanations

Statistical Model: Poisson Regression

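log(E[stops_i]) = log(trips_i) + β₀ + β₁×age_i + β₂×gender_i + ... + βₖ×trait_k,i

Here log(trips_i) is an exposure offset with its coefficient fixed at 1, so the model describes stops per trip, and exp(β) for a trait is its Incidence Rate Ratio.

The statistical model provides interpretable coefficients and confidence intervals, while LightGBM provides more accurate individual predictions.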
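To make the role of the offset concrete, here is a small sketch that fits the Poisson model with an intercept and a single binary trait by Newton-Raphson, using only the standard library. It is a teaching example under stated assumptions (one 0/1 covariate, hypothetical data, a hand-written solver), not the fitting routine used in production.

```python
import math


def fit_poisson_offset(rows, iterations=25):
    """Fit log(E[stops]) = log(trips) + b0 + b1*x by Newton-Raphson.

    rows: iterable of (trips, stops, x) with x in {0, 1}.
    """
    rows = list(rows)
    total_trips = sum(t for t, _, _ in rows)
    total_stops = sum(s for _, s, _ in rows)
    # Start from the pooled rate with no group effect
    b0, b1 = math.log(total_stops / total_trips), 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for trips, stops, x in rows:
            mu = trips * math.exp(b0 + b1 * x)  # offset acts as a multiplier
            g0 += stops - mu          # score for b0
            g1 += x * (stops - mu)    # score for b1
            h00 += mu                 # information matrix entries
            h01 += x * mu
            h11 += x * x * mu
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system information * delta = score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: (trips, stops, x), where x = 1 marks the trait of interest
data = [(100, 2, 0), (300, 8, 0), (200, 12, 1), (400, 18, 1)]
b0, b1 = fit_poisson_offset(data)
baseline_rate = math.exp(b0)  # stops per trip in the reference group
irr = math.exp(b1)            # rate ratio for the trait
```

With a single binary trait the fitted values reduce to the raw group rates: the reference group has 10 stops in 400 trips (0.025 per trip), and the trait group's rate is twice that, so `irr` converges to 2.0.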

We provide results from both approaches:

LightGBM Predictions
  • Individual risk estimates

    Personalized predictions based on your specific characteristics

  • SHAP explanations

    Shows which factors most influence your prediction

  • Bootstrap confidence intervals

    Uncertainty estimates obtained by refitting the model on resampled data

Statistical Analysis
  • Incidence Rate Ratios (IRRs)

    Relative rates between demographic groups

  • 95% confidence intervals

    Range of plausible values for each IRR; an interval that excludes 1 indicates a statistically significant difference

  • Group-level comparisons

    Population-wide patterns and disparities
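The bootstrap intervals listed under LightGBM Predictions can be illustrated with a simpler estimator. The sketch below resamples respondents with replacement and recomputes a pooled rate per 100 trips each time. Substituting a pooled rate for a retrained LightGBM model is an assumption made to keep the example short; the percentile method, the number of resamples, and the data are likewise illustrative.

```python
import random


def bootstrap_rate_ci(respondents, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the pooled rate per 100 trips.

    respondents: list of (trips, stops) pairs with trips >= 1.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(respondents)
    estimates = []
    for _ in range(n_boot):
        # Resample respondents with replacement
        sample = [respondents[rng.randrange(n)] for _ in range(n)]
        trips = sum(t for t, _ in sample)
        stops = sum(s for _, s in sample)
        estimates.append(100.0 * stops / trips)
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical respondents: (trips, stops)
sample_data = [(40, 2), (10, 1), (50, 0), (20, 1), (30, 0), (60, 3)]
ci_lower, ci_upper = bootstrap_rate_ci(sample_data)
```

The same resampling loop applies to any estimator: replace the pooled rate with a model refit and collect its prediction on each resample.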

We implement multiple privacy protections:

  • K-anonymity

    Groups with <50 respondents are suppressed from results

  • No cross-tabulation

    We don't publish detailed breakdowns that could identify small subgroups

  • Data retention limits

    Raw data is deleted after 12 months; only aggregates are retained

  • Anonymous collection

    No personally identifiable information collected
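The suppression rule can be sketched as a filter applied before publication. The threshold of 50 comes from the text above; the dictionary layout, group labels, and figures are hypothetical.

```python
K_THRESHOLD = 50  # minimum respondents for a group to be published


def suppress_small_groups(group_stats, k=K_THRESHOLD):
    """Keep only groups with at least k respondents."""
    return {
        group: stats
        for group, stats in group_stats.items()
        if stats["respondents"] >= k
    }


# Hypothetical aggregates keyed by group label
group_stats = {
    "group_a": {"respondents": 120, "rate_per_100": 4.1},
    "group_b": {"respondents": 12, "rate_per_100": 9.0},
    "group_c": {"respondents": 50, "rate_per_100": 3.3},
}
published = suppress_small_groups(group_stats)
```

Only `group_a` and `group_c` survive: a group with exactly 50 respondents is kept, because the rule suppresses groups with fewer than 50.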

We validate our models through:

LightGBM Validation
  • Cross-validation

    3-fold cross-validation to assess model stability

  • Hold-out testing

    20% test set for unbiased performance evaluation

  • Feature importance analysis

    Identifying most predictive demographic factors

  • SHAP consistency

    Ensuring explanation quality and interpretability
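One way to arrange the hold-out set and the cross-validation folds is sketched below. It assumes the 20% test set is set aside first and the 3 folds are drawn from the remaining 80%; the text above does not specify the order, so treat that as an assumption. The function name is illustrative.

```python
import random


def holdout_and_folds(n_rows, test_fraction=0.2, n_folds=3, seed=0):
    """Split row indices into a hold-out test set and CV folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    n_test = int(round(n_rows * test_fraction))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    # Deal the training rows into n_folds interleaved folds
    folds = [train_idx[i::n_folds] for i in range(n_folds)]
    return test_idx, folds


test_idx, folds = holdout_and_folds(300)
fold_sizes = [len(f) for f in folds]
```

For 300 submissions this gives 60 hold-out rows and three folds of 80 rows each, with no row appearing in more than one set.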

Statistical Validation
  • Goodness-of-fit tests

    Chi-square tests for Poisson model appropriateness

  • Overdispersion testing

    Automatic switch to a Negative Binomial model when overdispersion is detected

  • Residual analysis

    Examining patterns in statistical model residuals
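A common overdispersion check is the Pearson dispersion statistic: the Pearson chi-square divided by the residual degrees of freedom, which should be close to 1 for a well-specified Poisson model. The sketch below uses that statistic to pick a model family. The cutoff of 1.5 is an assumed value for illustration, not the threshold used in production, and the data are made up.

```python
def pearson_dispersion(observed, expected, n_params):
    """Pearson chi-square divided by residual degrees of freedom."""
    dof = len(observed) - n_params
    if dof <= 0:
        raise ValueError("need more observations than parameters")
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, expected))
    return chi2 / dof


def choose_family(observed, expected, n_params, threshold=1.5):
    """Return 'negative_binomial' if dispersion exceeds the threshold."""
    dispersion = pearson_dispersion(observed, expected, n_params)
    return "negative_binomial" if dispersion > threshold else "poisson"


# Hypothetical observed stop counts and Poisson-fitted means
observed = [0, 5, 1, 9, 0, 7]
expected = [2.0, 3.0, 2.5, 4.0, 1.5, 3.5]
family = choose_family(observed, expected, n_params=2)
```

These counts vary far more than a Poisson model predicts (the dispersion is about 3.9), so the sketch selects the Negative Binomial family.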

Current Model Status

Activation Requirements
  • ≥300 valid submissions

    Sufficient sample size for LightGBM training

  • ≥50 total stops

    Adequate events for pattern detection

  • Multiple demographic groups

    Enables meaningful comparisons and feature learning
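The activation requirements amount to a simple gate on the collected data. In this sketch the thresholds for submissions and stops come from the list above; reading "multiple demographic groups" as at least two, and the record layout, are assumptions.

```python
MIN_SUBMISSIONS = 300
MIN_TOTAL_STOPS = 50
MIN_GROUPS = 2  # assumption: "multiple" read as at least two


def model_can_activate(submissions):
    """submissions: list of dicts with 'stops' and 'group' keys (hypothetical)."""
    total_stops = sum(s["stops"] for s in submissions)
    groups = {s["group"] for s in submissions}
    return (
        len(submissions) >= MIN_SUBMISSIONS
        and total_stops >= MIN_TOTAL_STOPS
        and len(groups) >= MIN_GROUPS
    )


# 300 made-up submissions, two groups, one stop on every sixth record
example = [
    {"stops": 1 if i % 6 == 0 else 0, "group": "a" if i % 2 else "b"}
    for i in range(300)
]
ready = model_can_activate(example)
```

The example exactly meets all three conditions (300 submissions, 50 stops, 2 groups), so `ready` is true; dropping a single record makes it false.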

Quality Assurance
  • Anomaly detection

    Automatic flagging of unusual patterns (e.g., >100 trips/month)

  • Data validation

    Consistency checks and range validation

  • Automatic model selection

    System chooses the best-performing model (LightGBM preferred)

  • Regular retraining

    Models updated weekly with new data
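The anomaly and validation checks can be sketched as a per-submission function. The 100 trips/month threshold is the one quoted above. The flag names are hypothetical, and treating an unusually high trip count as "valid but flagged" rather than rejected is an assumption.

```python
MAX_TRIPS_PER_MONTH = 100  # anomaly threshold quoted in the text


def check_submission(trips, stops):
    """Return (is_valid, flags) for one self-reported submission."""
    # Range validation: counts cannot be negative
    if trips < 0 or stops < 0:
        return False, ["negative_count"]
    # Consistency: inspections are reported as occurring during trips
    if stops > 0 and trips == 0:
        return False, ["stops_without_trips"]
    flags = []
    # Anomaly detection: keep the record but mark it for review
    if trips > MAX_TRIPS_PER_MONTH:
        flags.append("unusually_many_trips")
    return True, flags


typical = check_submission(40, 2)
heavy_user = check_submission(150, 2)
inconsistent = check_submission(0, 1)
```

A typical submission passes with no flags, a 150-trip submission passes with a review flag, and a submission reporting inspections without any trips is rejected.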

Limitations & Considerations

Data Limitations
  • Self-reported data

    Results depend on voluntary responses, which may not represent all transport users

  • Recall bias

    Respondents may not accurately recall how many times they were inspected over 30 days

  • Selection bias

    Survey respondents may differ systematically from non-respondents

  • Geographic coverage

    Results may not be representative of all Melbourne transport routes or times of day

Statistical Considerations
  • Multiple comparisons

    Testing many demographic groups increases the chance of false discoveries

  • Confounding variables

    Unmeasured factors (route, time, behavior) may influence results

  • Sample size variations

    Some demographic groups may have limited representation
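The multiple-comparisons concern can be quantified with a short calculation. Assuming independent tests at the 5% level, the chance of at least one false positive grows quickly with the number of comparisons. The Bonferroni adjustment shown is one common mitigation, included for illustration; the text above does not state which correction, if any, is applied.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance level that caps the family-wise rate at alpha."""
    return alpha / n_tests


fwer_20 = familywise_error_rate(20)
per_test_alpha = bonferroni_alpha(20)
```

With 20 group comparisons, the chance of at least one spurious "significant" result is about 64%, and the Bonferroni per-test level drops to 0.0025.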

For technical questions about our methodology, please contact us at methods@stopodds.com.au

This methodology is open source and available for review on GitHub