StopOdds

Methodology

How we analyze fare inspection patterns on Melbourne public transport

Statistical Approach

StopOdds uses rigorous statistical methods to identify patterns in fare inspection data while protecting individual privacy.

Our analysis model primarily uses gradient boosted decision trees to predict inspection rates while accounting for complex interactions between demographic characteristics. We also maintain Poisson regression models for statistical inference. This hybrid approach provides both accurate predictions and interpretable statistical insights.

Exposure Modeling

We use trip counts as exposure variables to calculate rates per 100 trips, accounting for varying travel patterns.
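The rate calculation can be sketched as follows. This is a minimal illustration, not the production code: the function names and the sample records are hypothetical, and it assumes each respondent contributes one (stops, trips) pair.

```python
def rate_per_100_trips(stops, trips):
    """Return inspections per 100 trips, or None when there is no exposure."""
    if trips <= 0:
        return None
    return 100.0 * stops / trips


def pooled_rate_per_100(records):
    """Pool stops and trips across respondents before dividing.

    `records` is a list of (stops, trips) pairs. Pooling weights each
    respondent by their exposure, so frequent travellers count for more.
    """
    total_stops = sum(s for s, _ in records)
    total_trips = sum(t for _, t in records)
    return rate_per_100_trips(total_stops, total_trips)


# Hypothetical respondents: (stops, trips) over the last 30 days.
sample = [(2, 40), (0, 10), (1, 50)]
print(pooled_rate_per_100(sample))  # 3 stops over 100 trips -> 3.0
```

Dividing pooled totals, rather than averaging each person's own rate, keeps a respondent with 10 trips from counting as much as one with 50.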

Incidence Rate Ratios

We report IRRs with 95% confidence intervals to show relative inspection rates between demographic groups.

Privacy Protection

A k-anonymity threshold (k = 50) suppresses any group with fewer than 50 respondents from public results.

Detailed Methods

1. Data Collection

We collect anonymous self-reported data including:

Trip counts
Number of public transport journeys in the last 30 days
Stop counts
Number of times fare inspectors checked tickets during those trips
Demographic information
Optional characteristics including age, gender, ethnicity, height, etc.

Important: All data is self-reported and voluntary. Results reflect patterns in reported experiences and may not match actual inspection practices across the network.

2. Statistical Modeling

We use LightGBM gradient boosting as our primary prediction model, with statistical models for inference:

Primary Model: LightGBM

• Gradient boosting machine learning algorithm
• Handles categorical features natively
• Automatically discovers feature interactions
• Weighted by trip counts for robust rate estimation
• SHAP values provide individual explanations

Statistical Model: Poisson Regression

log(E[stops_i]) = log(trips_i) + β₀ + β₁×age_i + β₂×gender_i + ... + βₖ×trait_k,i

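In this model, log(trips_i) is an offset whose coefficient is fixed at 1, so the remaining terms describe the log of the inspection rate per trip, and exp(β) for a characteristic is its incidence rate ratio. The statistical model provides interpretable coefficients and confidence intervals, while LightGBM provides more accurate individual predictions.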
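To make the role of the trip-count offset concrete, here is a small pure-Python sketch that fits the Poisson model with a single covariate by Newton-Raphson. It is an illustration under assumptions: the function name, the single 0/1 covariate, and the sample data are hypothetical, and a real analysis would normally use a statistics library with the full set of characteristics.

```python
import math


def fit_poisson_offset(stops, trips, x, iterations=25):
    """Fit log(E[stops]) = log(trips) + b0 + b1 * x by Newton-Raphson.

    Returns (b0, b1). `x` is a single numeric covariate, e.g. a 0/1 group flag.
    """
    b0 = math.log(sum(stops) / sum(trips))  # start at the pooled log-rate
    b1 = 0.0
    for _ in range(iterations):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]].
        g0 = g1 = h00 = h01 = h11 = 0.0
        for y, t, xi in zip(stops, trips, x):
            mu = t * math.exp(b0 + b1 * xi)  # the offset enters as a multiplier
            g0 += y - mu
            g1 += (y - mu) * xi
            h00 += mu
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^-1 g, using the closed-form 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: group 0 has 3 stops in 120 trips, group 1 has 9 in 100.
stops = [1, 2, 0, 3, 4, 2]
trips = [40, 60, 20, 30, 50, 20]
group = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_poisson_offset(stops, trips, group)
print(round(math.exp(b1), 3))  # 3.6, the ratio of the two crude rates
```

With one binary covariate the fitted exp(b1) equals the ratio of the two groups' pooled rates (0.09 / 0.025), which is a convenient check that the offset is doing its job.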

3. Interpreting Results

We provide results from both approaches:

LightGBM Predictions

Individual risk estimates
Personalized predictions based on your specific characteristics
SHAP explanations
Shows which factors most influence your prediction
Bootstrap confidence intervals
Uncertainty estimates from model variation
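The idea behind a bootstrap interval can be sketched with a percentile bootstrap. This is a simplified stand-in: it resamples respondents and recomputes a pooled rate, whereas a bootstrap over a fitted model would retrain the model on each resample. The function names, the 1000 resamples, and the fixed seed are assumptions made for illustration.

```python
import random


def bootstrap_ci(records, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` over resampled records."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(records)
    estimates = []
    for _ in range(n_resamples):
        # Draw n records with replacement and recompute the statistic.
        resample = [records[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    lo_index = int((alpha / 2) * n_resamples)
    hi_index = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_index], estimates[hi_index]


def pooled_rate(records):
    """Stops per 100 trips, pooled over (stops, trips) pairs."""
    return 100.0 * sum(s for s, _ in records) / sum(t for _, t in records)


# Hypothetical respondents: (stops, trips).
data = [(2, 40), (0, 10), (1, 50), (3, 30), (0, 20), (1, 25), (4, 60), (0, 15)]
low, high = bootstrap_ci(data, pooled_rate)
print(f"pooled rate {pooled_rate(data):.1f}, 95% interval {low:.1f} to {high:.1f}")
```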

Statistical Analysis

Incidence Rate Ratios (IRRs)
Relative rates between demographic groups
95% confidence intervals
Uncertainty range for each IRR; an interval that excludes 1 indicates a statistically significant difference
Group-level comparisons
Population-wide patterns and disparities
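For intuition, an unadjusted IRR between two groups and its 95% interval can be computed directly from pooled counts. This sketch uses the standard large-sample (Wald) interval on the log scale. It is not the regression-adjusted estimate described above, and the function name and example counts are hypothetical.

```python
import math


def irr_with_ci(stops_a, trips_a, stops_b, trips_b, z=1.96):
    """Incidence rate ratio of group A vs. group B with a Wald interval.

    Uses the large-sample standard error of log(IRR) for two Poisson counts:
    sqrt(1/stops_a + 1/stops_b). Requires positive counts and exposures.
    """
    if min(stops_a, stops_b) <= 0 or min(trips_a, trips_b) <= 0:
        raise ValueError("counts and exposures must be positive")
    irr = (stops_a / trips_a) / (stops_b / trips_b)
    se = math.sqrt(1.0 / stops_a + 1.0 / stops_b)
    lower = irr * math.exp(-z * se)
    upper = irr * math.exp(z * se)
    return irr, lower, upper


# Hypothetical groups: 30 stops in 1000 trips vs. 20 stops in 1000 trips.
irr, lower, upper = irr_with_ci(30, 1000, 20, 1000)
print(f"IRR {irr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

In this example the IRR is 1.5 but the interval spans 1, so the difference would not be treated as statistically significant at the 5% level.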

4. Privacy & Ethics

We implement multiple privacy protections:

K-anonymity
Groups with <50 respondents are suppressed from results
No cross-tabulation
We don't publish detailed breakdowns that could identify small subgroups
Data retention limits
Raw data deleted after 12 months, only aggregates retained
Anonymous collection
No personally identifiable information collected
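The suppression rule in the list above amounts to a simple filter applied before publication. A minimal sketch, assuming results are keyed by group label with a respondent count per group (the function name and the labels are hypothetical):

```python
K_THRESHOLD = 50  # minimum respondents per published group, as stated above


def suppress_small_groups(group_counts, k=K_THRESHOLD):
    """Return only the groups with at least `k` respondents.

    `group_counts` maps a group label to its respondent count.
    """
    return {group: n for group, n in group_counts.items() if n >= k}


# Hypothetical respondent counts by age band.
counts = {"18-24": 120, "25-34": 49, "35-44": 50}
print(suppress_small_groups(counts))  # {'18-24': 120, '35-44': 50}
```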

5. Model Validation

We validate our models through:

LightGBM Validation

Cross-validation
3-fold validation to assess model stability
Hold-out testing
20% test set for unbiased performance evaluation
Feature importance analysis
Identifying most predictive demographic factors
SHAP consistency
Ensuring explanation quality and interpretability
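One way to combine a 20% hold-out set with 3-fold cross-validation is to set the test rows aside first and build the folds from what remains. The sketch below assumes that arrangement, which the list above does not spell out. The function name and the fixed seed are illustrative.

```python
import random


def holdout_and_folds(n_rows, test_fraction=0.2, n_folds=3, seed=0):
    """Split row indices into a hold-out test set and cross-validation folds.

    Returns (test_indices, folds), where `folds` is a list of
    (train_indices, validation_indices) pairs drawn from the non-test rows.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    test, rest = indices[:n_test], indices[n_test:]
    folds = []
    for k in range(n_folds):
        validation = rest[k::n_folds]  # every n_folds-th remaining row, offset k
        validation_set = set(validation)
        train = [i for i in rest if i not in validation_set]
        folds.append((train, validation))
    return test, folds


test_rows, folds = holdout_and_folds(300)
print(len(test_rows), [len(v) for _, v in folds])  # 60 [80, 80, 80]
```

Keeping the test rows out of every fold means the final performance estimate is never influenced by choices made during cross-validation.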

Statistical Validation

Goodness-of-fit tests
Chi-square tests for Poisson model appropriateness
Overdispersion testing
Automatic switch to a Negative Binomial model when overdispersion is detected
Residual analysis
Examining patterns in statistical model residuals
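A common overdispersion check is the Pearson dispersion statistic: the Pearson chi-square divided by the residual degrees of freedom, which should be close to 1 for a well-fitting Poisson model. The sketch below illustrates that check. The 1.5 cut-off, the function names, and the sample data are assumptions, not the thresholds used in production.

```python
def pearson_dispersion(stops, fitted, n_params):
    """Pearson chi-square statistic divided by residual degrees of freedom."""
    chi_square = sum((y - mu) ** 2 / mu for y, mu in zip(stops, fitted))
    return chi_square / (len(stops) - n_params)


def choose_count_model(dispersion, threshold=1.5):
    """Pick a model family; the 1.5 cut-off is an illustrative assumption."""
    return "negative_binomial" if dispersion > threshold else "poisson"


# Hypothetical data, fitted with an intercept-only model: every respondent
# shares the pooled rate, so fitted stops = trips * (total stops / total trips).
stops = [0, 1, 5, 0, 9, 1]
trips = [20, 30, 40, 25, 45, 40]
pooled = sum(stops) / sum(trips)
fitted = [t * pooled for t in trips]
dispersion = pearson_dispersion(stops, fitted, n_params=1)
print(round(dispersion, 2), choose_count_model(dispersion))
```

Here the statistic is about 3, meaning the counts vary roughly three times more than a Poisson model expects, so the sketch selects the Negative Binomial family.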

Current Model Status

The model is continuously updated as new data arrives. Minimum thresholds ensure statistical reliability.

Activation Requirements

≥300 valid submissions
Sufficient sample size for LightGBM training
≥50 total stops
Adequate events for pattern detection
Multiple demographic groups
Enables meaningful comparisons and feature learning
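The activation requirements above can be expressed as a single gate. In this sketch the record layout (stops, trips, group) and the function name are hypothetical, and "multiple demographic groups" is read as at least two, which is an assumption.

```python
def model_can_activate(records, min_submissions=300, min_stops=50, min_groups=2):
    """Check the activation thresholds against (stops, trips, group) records.

    `min_groups=2` is an assumed reading of "multiple demographic groups".
    """
    total_stops = sum(stops for stops, _, _ in records)
    groups = {group for _, _, group in records}
    return (
        len(records) >= min_submissions
        and total_stops >= min_stops
        and len(groups) >= min_groups
    )


# Hypothetical submissions: 150 from group "a" and 150 from group "b".
submissions = [(1, 20, "a")] * 150 + [(0, 20, "b")] * 150
print(model_can_activate(submissions))  # True
```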

Quality Assurance

Anomaly detection
Automatic flagging of unusual patterns (e.g., >100 trips/month)
Data validation
Consistency checks and range validation
Automatic model selection
System chooses the best-performing model (LightGBM preferred)
Regular retraining
Models updated weekly with new data
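The range validation and anomaly flagging described above might look like the following. Only the 100 trips/month threshold comes from the text. The function name, the flag label, and the rule that a submission needs at least one trip are assumptions for illustration.

```python
MAX_TRIPS_PER_MONTH = 100  # flagging threshold quoted in the list above


def check_submission(trips, stops):
    """Return (is_valid, flags) for one self-reported submission.

    Invalid submissions fail basic range checks; flagged ones pass but look
    unusual and may deserve review.
    """
    flags = []
    # Assumed rule: at least one trip (no exposure otherwise), no negative stops.
    is_valid = (
        isinstance(trips, int) and isinstance(stops, int)
        and trips > 0 and stops >= 0
    )
    if is_valid and trips > MAX_TRIPS_PER_MONTH:
        flags.append("unusually_many_trips")
    return is_valid, flags


print(check_submission(40, 2))   # (True, [])
print(check_submission(150, 3))  # (True, ['unusually_many_trips'])
```

Flagging rather than rejecting high trip counts keeps genuine heavy commuters in the data while still marking them for review.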

Limitations & Considerations

This analysis has important limitations that users should understand when interpreting results.

Data Limitations

Self-reported data
Results depend on voluntary responses which may not represent all transport users
Recall bias
People may not perfectly remember inspection frequencies over 30 days
Selection bias
Survey respondents may differ systematically from non-respondents
Geographic coverage
Results may not be representative of all Melbourne transport routes/times

Statistical Considerations

Multiple comparisons
Testing many demographic groups increases the chance of false discoveries
Confounding variables
Unmeasured factors (route, time, behavior) may influence results
Sample size variations
Some demographic groups may have limited representation

For technical questions about our methodology, please contact us at methods@stopodds.com.au

This methodology is open source and available for review on GitHub