IRTF Open Meeting
=================

Tuesday, 22 March 2022, at 13:30-15:30 UTC
Room: Grand Park Hall 2

Chair: Colin Perkins
Minutes: Mat Ford

## Introduction and Status Update

IRTF Chair

Slides: https://datatracker.ietf.org/meeting/113/materials/slides-113-irtfopen-introduction-and-agenda-00

## Solar Superstorms: planning for an Internet apocalypse

Sangeetha Abdu Jyothi

Paper: https://dl.acm.org/doi/10.1145/3452296.3472916
Slides: https://datatracker.ietf.org/meeting/113/materials/slides-113-irtfopen-solar-superstorms-planning-for-an-internet-apocalypse-00

Q&A:

Wes Hardaker: Thank you for the excellent work. Will pass this on to ham radio friends. I am a root server operator. Root instances or DNS instances may not pose a problem where they are installed, but if they become disconnected from the rest of the system they will fail to get updates, at which point DNSSEC will stop validating after a while as the signatures expire. Did you find islands of connectivity that would otherwise become cut off?

SAJ: Important question. I don't have a complete answer yet. The initial paper only looked at the primary areas impacted, not the end-to-end discussion. Currently I'm doing more work to flesh out a more complete answer. It could be possible that there would be a few very big islands, but not a lot of tiny islands. Preliminary analysis suggests most Asia-to-Europe connectivity will stay up, but US<->Europe is most vulnerable. Complete disconnection is hard to predict, but a significant reduction in capacity between the bigger land masses is possible.

Nicolas Kuhn: I think it is worth pointing out that there are regular solar storms and, apart from the Starlink accident with very low earth orbit satellites, satellites were fine. Do you have pointers to justify your assumptions about satellites being vulnerable?

SAJ: Satellites typically have shielding to protect them from solar activity during their lifespan of 5-10 years. In the past decade or two the number of satellites has grown exponentially, and we have not had a large storm in that time. The Starlink satellites were impacted when they were at a lower altitude than their operating altitude; they were hit by two successive G1 (low intensity) storms and de-orbited. If a G5 (most intense) storm happened, we don't know what the impact would be for satellites, even those at operating altitude. Radiation shielding does offer some protection but it is not guaranteed to protect against a large solar storm. Satellites do have thrusters to correct de-orbiting events, but they require connectivity to Earth ground stations for command and control, and storms could impact this connectivity for 10-12 hours. So very large storms do have the capacity to impact satellites even when operating at higher altitudes. Before the current International Space Station was installed, the US had Skylab, which was destroyed in the 1970s as a consequence of solar storms. We have not experienced such a large storm very recently.

Richard Scheffenegger: Is the impact focussed on the high latitude / sun-facing side? Or are induced currents the same globally (just depending on the inductor loop area)?

SAJ: When it comes to the direct impact on satellites, those facing the sun are at a much higher risk, but the induced currents are caused by the interaction of these magnetic particles with the Earth's magnetic field. In the case of induced currents it is not just the sun-facing side; the dark side is also vulnerable. Higher latitudes on both the sun-facing side and the dark side are equally vulnerable.
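(As a side note to Wes Hardaker's question above, here is a minimal sketch of the failure mode he describes; the signature expiration timestamp is hypothetical and not from the talk. Once a disconnected island can no longer fetch re-signed records, validation fails as soon as the cached RRSIGs expire.)

```python
from datetime import datetime, timezone

# Hypothetical RRSIG expiration for illustration only; real signature
# lifetimes are set by each zone's signing policy.
rrsig_expiration = datetime(2022, 3, 24, tzinfo=timezone.utc)

def days_still_validatable(disconnect_time: datetime) -> float:
    """Days for which cached, signed answers still validate after a cut-off."""
    return (rrsig_expiration - disconnect_time).total_seconds() / 86400

island_cut_off = datetime(2022, 3, 22, tzinfo=timezone.utc)
print(f"{days_still_validatable(island_cut_off):.1f} days of validity left")
# Once the expiration passes, validating resolvers inside the island start
# returning SERVFAIL for signed zones, even though stale data is still cached.
```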
CSP: Is there anything we should be doing differently when we design protocols that would help with resilience to this type of event?

SAJ: We need to consider resilience at every layer of the stack. With DNS, for example, the root servers are well distributed, but we don't know how the entire hierarchical tree would be impacted - do we need to change caching, or change how DNS records are managed? That is not clear. When it comes to protocols, BGP, which allows only a single path, might be too restrictive when capacity is severely limited. That's further analysis we are planning to do. Within an AS, OSPF and other intra-AS routing protocols fare very well because they are decentralised and can use whatever paths are available. The inter-domain protocol needs more investigation.

CSP: Makes sense. I guess there's a whole bunch of coordination issues and management issues with large scale cloud provider infrastructure networks and so on as well. Some interesting problems.

## Unbiased experiments in congested networks

Bruce Spang

Paper: https://arxiv.org/abs/2110.00118
Slides: https://datatracker.ietf.org/meeting/113/materials/slides-113-irtfopen-unbiased-experiments-in-congested-networks-00

Q&A:

Jana Iyengar: Great talk - very illuminating at a minimum. This is something everyone should take into account when conducting these experiments. It shows clearly that small A/B tests aren't necessarily good enough. Wondering if you've examined what would happen if you made a client sticky for a particular A/B test. How exactly was the choice made for serving a particular type of content? The bottleneck has to be shared between control and experimental groups. If the bottleneck is close to the user, and the user is in a bucket that is either control or treatment, then the user would effectively be at the far end of your scale. Have you looked into this? Were your experiments sticky to users?

BS: There are things we can do on the allocation side to avoid some of these issues. The experiments we ran were sticky to users. The overall point you're making is true - if you can guarantee that users will never share resources then you can avoid this bias. If you believe users are not sharing any bottleneck links with each other then you avoid bias. We didn't explore this too much because we found it hard to measure whether users were sharing links, and didn't have a good sense of how often that would be the case. If you were allocating users instead of sessions, my gut instinct is that this is better in terms of interference. You could also allocate networks or particular servers, or try to reduce the probability that treatment and control share the same link - that will reduce the bias of the experiment.

JI: Very helpful. If you tried to use these techniques, it could tell you something about where users do end up sharing bottlenecks as well.

Brian Trammell: If we use what we know about networks to design experiments about networks we can get a lot smarter about this. The switchback experiment looked like a diagram of TDMA; A/B tests are like CDMA. Are there ways to use this multiple access metaphor to find other ways to analyse this? My SRE mind wants to automate the partitioning and watch which things show up in the power spectrum. Wondering if there's a more fundamental way to split this up. Not expecting an answer right now!

BS: Super interesting question, don't have an answer offhand. One thing social networks do is allocate a user and all of their friends to a particular experiment. There could be something there for networks too.
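(A toy numeric sketch of the interference bias discussed in this session; the capacity and "weight" parameters are invented for illustration and this is not the paper's model. It shows how an A/B test over a shared bottleneck can report a large win even when full deployment would change nothing.)

```python
CAPACITY = 100.0          # Mbps available at one bottleneck link (assumed)
CONTROL_WEIGHT = 1.0      # relative share a control flow claims
TREATMENT_WEIGHT = 1.5    # hypothetical: treatment congestion control is more aggressive

def shares(weights, capacity=CAPACITY):
    """Split link capacity between flows in proportion to their weights."""
    total = sum(weights)
    return [capacity * w / total for w in weights]

# A/B test: 5 control and 5 treatment flows sharing the same bottleneck.
mixed = shares([CONTROL_WEIGHT] * 5 + [TREATMENT_WEIGHT] * 5)
ab_control, ab_treatment = mixed[0], mixed[5]

# What full deployment would actually deliver: everyone on control, or everyone
# on treatment, 10 flows either way, so each flow gets an equal split.
full_control = shares([CONTROL_WEIGHT] * 10)[0]
full_treatment = shares([TREATMENT_WEIGHT] * 10)[0]

print(f"A/B test:        control {ab_control:.1f} Mbps, treatment {ab_treatment:.1f} Mbps")
print(f"full deployment: control {full_control:.1f} Mbps, treatment {full_treatment:.1f} Mbps")
# The A/B test reports a 50% win for the treatment (12.0 vs 8.0 Mbps), but that
# win comes from squeezing the control flows on the shared link; deploying the
# treatment to everyone changes nothing (10.0 vs 10.0 Mbps per flow).
```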
Jonathan Morton: One of the big lessons to take away from this is how important it is to understand the testing methodology in detail when we look at a set of purported results, particularly ones that are used for marketing. Fine details can have a big effect, such as how a particular A/B test was conducted and how the potential bias that you identified has been mitigated.

BS: Definitely.

CSP: You said there was a need for better experiment methodology. What, if anything, should we be doing when we're designing and evaluating new protocols to improve confidence in results? Is there any general guidance we should be providing, or is it "read this paper and think about these issues"?

BS: We build good systems with what we do today. This gives us another tool to think about when evaluating algorithms. When designing new algorithms in the IETF, I'd think about the fact that the way we run experiments to evaluate these algorithms can be biased, so think about other ways to run them that might mitigate those effects.

CSP: Makes sense. Thanks!

Recordings of the talks, and links to the papers, are available from https://irtf.org/anrp/