Summary

In early July 2023, we received several reports from subscribers that the Top-Level US eduroam Proxies were not reliably responding to authentication requests, leaving specific users or devices unable to access eduroam. The errors indicated that the TLRS1 (tlrs1.eduroam.us) proxy was not responding to Access-Requests from some users/devices, while other users/devices were able to authenticate reliably through TLRS1 as usual. Some subscribers indicated that the problem had started on July 2nd, although we later learned that it had started earlier.

Working with an affected community member, the US eduroam operations team determined that a problem with MTU/fragmentation handling was causing large (1500 bytes or larger) RADIUS messages to be silently dropped instead of being passed to the target RADIUS server. Upon closer examination, we determined that only TLRS1 was affected, and eventually isolated the problem to the RADIUS servers within the AWS us-east-2 region. When examining why only one region was affected, we realized that the VPN endpoint in the us-east-2 region was not configured the same way as the VPN endpoints in the other regions. We also noted that VPN fragmentation errors were occurring only in the us-east-2 region. Correcting the configuration to match the other regions resolved the issue.
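
For illustration only (this is not the exact procedure the operations team followed), a path-MTU problem like this one can be confirmed from an affected RADIUS server by sending don't-fragment pings of increasing size toward the proxy. The sketch below assumes a Linux host with the iputils ping; ICMP does not traverse exactly the same path as the RADIUS/UDP traffic inside the VPN, so it only approximates the behavior we observed.

```python
# Minimal sketch: step the ping payload size up and see where don't-fragment
# packets toward the affected proxy stop getting replies.
import subprocess

TARGET = "tlrs1.eduroam.us"   # the affected top-level proxy
OVERHEAD = 28                 # 20-byte IPv4 header + 8-byte ICMP header

def ping_df(payload: int) -> bool:
    """Return True if a single don't-fragment ping of this payload size gets a reply."""
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), TARGET],
        capture_output=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for size in (1272, 1372, 1400, 1472):   # 1472 + 28 = a full 1500-byte packet
        ok = ping_df(size)
        status = "reply" if ok else "no reply / needs fragmentation"
        print(f"{size + OVERHEAD:4d}-byte packet: {status}")
```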

Root cause analysis

During a maintenance window on June 16, 2023, we deployed and tested a proposed configuration change intended to improve MTU/fragmentation handling within the eduroam infrastructure. The change did not work as expected, and it was rolled back before the end of the maintenance window. However, the rollback failed to reverse the change on the VPN endpoint serving the AWS us-east-2 region.

Due to the way our load balancing works, RADIUS Access-Requests that arrive at one of our top-level proxies are assigned to a specific RADIUS proxy within a particular AWS region on a persistent basis. Many RADIUS servers reuse a single source port for all Access-Requests, so all of the subsequent RADIUS messages from one of those servers (as identified by IP address/UDP port) are sent to a single RADIUS proxy within the US eduroam infrastructure. This normally works fine, as all of our RADIUS proxies are configured identically.
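
As a rough illustration of the persistence described above (the selection logic and region list here are hypothetical, not the actual US eduroam load balancer), the sketch below pins each upstream RADIUS server, identified by its source IP address and UDP port, to one regional proxy group.

```python
# Illustrative only: deterministic, persistent assignment of an upstream RADIUS
# server (source IP + UDP port) to one AWS region. Region names and hashing
# scheme are assumptions for the example.
import hashlib

REGIONS = ["us-east-1", "us-east-2", "us-west-1", "us-west-2"]

def assigned_region(src_ip: str, src_port: int) -> str:
    """Deterministically map one upstream server to one region."""
    key = f"{src_ip}:{src_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return REGIONS[digest % len(REGIONS)]

# Because the mapping is persistent, every Access-Request from a given server
# lands on the same regional proxy -- which is why a single misconfigured
# region kept affecting the same visited sites.
print(assigned_region("192.0.2.10", 1812))
```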

In this case, however, the VPN tunnel to the RADIUS proxies in the us-east-2 region was misconfigured, and the RADIUS servers at some visited sites were persistently using a RADIUS proxy within the misconfigured region. Visitors to those sites whose devices sent large messages as part of their eduroam authentication process were unable to access eduroam for an extended period. This predominantly affected EAP-TLS users with long certificate chains.
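
A rough back-of-envelope calculation, using typical (assumed, not measured) attribute overheads, shows why an EAP-TLS exchange carrying a long certificate chain produces RADIUS packets that run into a 1500-byte path MTU.

```python
# Approximate on-the-wire size of one RADIUS packet carrying one EAP-TLS
# fragment. The overhead figures are typical values, not measurements from
# this incident.
import math

IP_HEADER = 20          # IPv4, no options
UDP_HEADER = 8
RADIUS_HEADER = 20
EAP_MSG_CHUNK = 253     # an EAP-Message attribute carries at most 253 bytes of EAP data
EAP_MSG_OVERHEAD = 2    # per-attribute type + length bytes
MESSAGE_AUTHENTICATOR = 18
OTHER_ATTRS = 60        # rough allowance for User-Name, State, NAS-*, Proxy-State, ...

def radius_packet_size(eap_fragment: int) -> int:
    """Estimate the IP packet size for a RADIUS message carrying one EAP fragment."""
    chunks = math.ceil(eap_fragment / EAP_MSG_CHUNK)
    return (IP_HEADER + UDP_HEADER + RADIUS_HEADER
            + eap_fragment + chunks * EAP_MSG_OVERHEAD
            + MESSAGE_AUTHENTICATOR + OTHER_ATTRS)

# A supplicant sending ~1400-byte EAP-TLS fragments (a common setting) pushes
# the resulting RADIUS packet past a 1500-byte path MTU.
print(radius_packet_size(1400))   # -> 1538
```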

This issue began to significantly impact international roamers when the highest-volume EU eduroam proxy was load-balanced to the AWS us-east-2 region on July 2, 2023.

Although only a subset of sites and users were affected, this was a widespread service degradation that prevented many US eduroam users from accessing eduroam for an extended period.

 

Mitigations

Over time, we are expanding our staging platform so that proposed changes can be more fully tested under load before they are deployed to production, reducing the need to test proposed changes in production. We are also continuously working to detect a wider set of errors more quickly by implementing automated monitoring for additional portions of the US eduroam infrastructure.

While this issue was occurring, MTU/fragmentation errors were being recorded in the logs on the us-east-2 VPN endpoint, but those errors were not being surfaced. To detect this sort of problem sooner in the future, we will extend our monitoring system to check the logs on the VPN endpoints and to notify our operations team if any of the VPN logs contain errors of an unexpected type or frequency.
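
As a sketch of the kind of check we intend to add (the log path, error patterns, and threshold below are hypothetical, since the VPN software and its log format are not described in this report), a monitor could periodically scan the endpoint logs and alert when fragmentation-related errors appear at an unexpected rate.

```python
# Hypothetical example: scan a VPN endpoint log for fragmentation-related
# errors and alert when they exceed a threshold. Path, patterns, and threshold
# are placeholders, not the real monitoring configuration.
import re
from pathlib import Path

VPN_LOG = Path("/var/log/vpn/endpoint.log")           # hypothetical path
ERROR_PATTERNS = [
    re.compile(r"fragmentation", re.IGNORECASE),       # e.g. MTU/fragmentation failures
    re.compile(r"packet too (big|large)", re.IGNORECASE),
]
THRESHOLD = 10   # alert if more than this many matches in the scanned window

def scan_log(path: Path) -> int:
    """Count log lines matching any of the unexpected error patterns."""
    count = 0
    with path.open(errors="replace") as fh:
        for line in fh:
            if any(p.search(line) for p in ERROR_PATTERNS):
                count += 1
    return count

if __name__ == "__main__":
    hits = scan_log(VPN_LOG)
    if hits > THRESHOLD:
        # In production this would notify the operations team; here we just print.
        print(f"ALERT: {hits} fragmentation-related errors in {VPN_LOG}")
```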

Conclusions

The degraded service in late June/early July of 2023 was caused by an incomplete rollback of a failed attempt to improve our MTU/fragmentation handling in a single AWS region. Until the change was fully rolled back in that region, maximum-sized packets could not be forwarded to the RADIUS servers within that region. Although the problem only affected some users at some eduroam sites, the impact was widespread, and many users were unable to access eduroam from many sites for an extended period.

We will continue to focus on detecting errors more quickly, by extending our automated monitoring systems to cover more of our infrastructure.

Timeline

June 13, 2023: During a maintenance window, an intended improvement to the VPN tunnel configuration was tested and found to cause issues. The change was successfully rolled back from three regions, but was not completely rolled back from the AWS us-east-2 region.

July 2, 2023: The US eduroam load balancing system switched the highest-volume European top-level eduroam proxy to the AWS us-east-2 region, and the issue began affecting many international roamers from US organizations.

July 6, 2023: The first report was received of what would eventually become a widespread issue.

[Attempts were made to debug and resolve the problem. The problem was isolated to TLRS1 and misidentified as a load issue. Capacity was added to the east coast infrastructure, which did not resolve the issue.]

July 11, 2023: Several more US eduroam subscribers reported the issue to help@incommon.org and on the eduroam-admins email list.

July 12, 2023: The issue was escalated to “critical”, and an engineer was assigned to work on the issue full-time until it was fixed.

July 13/14, 2023: With the help of an affected community member, the US eduroam operations team determined the cause of the issue and identified the fix.

July 14, 2023 @ 1:00pm US ET: The issue was resolved by updating the AWS us-east-2 VPN endpoint to match the VPN endpoint configuration in other regions.



