IETF Technical Plenary Session, November 12, 2009 1. Introduction - Olaf Kolkman (Olaf) It's now 4:30 and I would like to start in about a minute, so if everybody can find themselves a seat. So, memorize the Note Well. Ladies and gentlemen, welcome all to the technical plenary, IETF 76 in Hiroshima. During this meeting, we've got some supporting tools for people in jabber land, and also for the people in this room. There's a jabber room for the plenary, jabber.ietf.org, and the presentations that we will use have been uploaded. You can find them on the meeting material site - so you can read along if you're back home. Welcome to those back home. There's a little bit of an opportunistic environment for an experiment: we had the opportunity to find somebody to transcribe in English, and you can see on the screens on the left and right what is being said in this meeting. This might help people who do not catch the accents of some of the speakers or some of the people at the mic. And we would like to know your experience with that - whether you value this experiment. When you step up to the mic and ask questions, again, it's important to swipe your card. I will do so now, so you can also see who I am. And when you are at the mic, please keep it short and keep it to the point. It's something that we heard yesterday. We hope we can establish a tradition. A short paragraph on the agenda for today. Aaron will start with his report. I will follow with the IAB's report. And then we will have a session on internationalization of names and other identifiers; that's a session led by John Klensin, Dave Thaler and Stuart Cheshire. After that, somewhere between 7:00 and 7:10, we will have an open microphone session which will hopefully not last beyond 7:30. It all depends on the conciseness of questions and the number of questions asked. Without further ado, Aaron... 2. 
IRTF Chair Report - Aaron Falk (Aaron) Hello, I'm here to give a brief report on the Internet Research Task Force (IRTF). My name is Aaron Falk. The short status - we had four research groups meeting this week here at the IETF. The Host Identity Protocol (HIP) research group and the Scalable Adaptive Multicast (SAM) research groups have already met. Tomorrow the Delay Tolerant Networking (DTN) and Routing Research Group (RRG) will meet. Another thing we did was to have a review of the Routing Research Group with the IAB this morning. We have six IRTF RFCs waiting to be published. Publication is currently wedged on finalizing the TLP, the trust license, and that looks like (as was reported last night) it will happen around the end of the year or early next year. The document defining the IRTF RFC stream was revised and is now sitting in the RFC Editor's queue, and for those of you keeping track, it was modified so that now the independent stream and the IRTF stream have common rights language. Let's see. What else? In work - there's been some proposed new work. We've been talking for quite some time, I guess almost two years, about an RG on virtual networks, or network virtualization. There's a bar BOF starting in about 40 minutes, so those of you who are going to that will miss the fascinating talk that's coming up later tonight, here in the plenary. But hopefully, this group is finally getting to discussion of a charter for the research group, so that would be good progress. Another topic that's come up, in groups that I've been in a couple of times this week, has been the internet of things, smart objects, and so there's starting to be discussion about whether there should be a research group that's looking at an architecture for those technologies and how they fit into the internet architecture. That is just talk at this point, but I'm giving the community a heads-up. 
One thing we want to make sure is that anything that happens in the IRTF does not slow down the smart grid activities that are going on. So we'll make sure we work around that. I'd like to give a quick snapshot of the different research groups and their energy levels - who is active and who is not. The groups on the right, the 'active' list, are groups that are meeting at IETF or elsewhere, or having very active mailing lists. The colored one at the bottom is the SAM research group, which has moved to the active column. They met this week, and they have actually had a few meetings at non-IETF locations with other conferences. The quiescent groups are not totally inactive; they have mailing lists going along. The Public Key Next Generation Research Group, PKNGRG, had a little trouble getting started, but it sounds like there's some energy. Do you have a question? (Richard Barnes) I was just wondering if the PKNG group will be meeting at IETF 79? (Aaron) I don't think that's been decided yet. I don't think there's a planned meeting now, but if you're on the mailing list, you would hear about it. Is Paul in the room? Can you confirm that's true? (Paul Hoffman) That's correct. (Aaron) Moving right along. So, another thing that I've been doing with the IRTF reports is to take a couple of research groups and give a very quick snapshot of the topic area - what the group is up to, some of their recent work items - to give a flavor of what's going on in these groups. This is really very cursory, just to give you a flavor of some of the research that's happening in the IRTF and maybe help you discover whether there is interest in getting more involved. The first group I want to talk about is the Anti-Spam Research Group. It's an open research group that's looking at anti-spam research, in particular at the open problems. It's been hoped that there would be some standards work that would come out of that, but it's not been as fruitful as was originally hoped. 
But there's a wide range of participation, from not only the standards folks that we see here, but also researchers and other folks who are working in the area. There's lots of industrial activity going on in this space. Because anti-spam is a big industry, there are lots of other activities going on, and so it's important to understand that the research group is not doing standards work. There is some in the IETF, in DKIM. It's not a trade group - there are several of those; I think the large one is MAAWG. And it's not an academic conference. So, this has really been sort of a discussion of technical topics in the area, and they've worked on a couple of documents, but mostly the activity has been on the mailing list. There's a document they produced on DNS blacklists and whitelists that's waiting to be published. And then there's another one on blacklist management - a draft that has been circulated for a while and is sort of waiting to be finalized. Another topic that's been going on in this research group has been starting to develop a taxonomy of the different techniques for fighting spam, and also of different spamming strategies. You can see the URL here if you want to check it on the web. This is really open for contributions. I think that part of the motivation for this is that many people have come up with ideas, often the same ideas repeatedly, for how to solve the spam problem. And so - it's been described in the past as pre-printed rejection slips for why your idea won't work - you can be indexed into the correct part of the wiki when you have an idea, and you don't have to re-circulate threads on the mailing list over and over again. So I think that would be good work. This has turned out to be, like the spam problem in general, hard to make progress in. And the research group - I've heard, and I think the chair has heard, some frustration as to why they have not done more. 
There are a lot of folks doing research in this area and they're focused on publishing papers, sometimes more so than doing collaboration in the IRTF. Also, some of these problems are extremely hard. But one of the values of what's happening on the research group mailing list is that it's starting to capture some of the folklore, some of the wisdom that's passed around between practitioners. There's also misinformation that gets stamped out, and they're making an effort to capture these things in the wiki. So the chair asked me to pass along that the mailing list is really intended for folks in the IETF and elsewhere who have questions about spam and anti-spam related technologies - this research group is intended to be a good discussion point for bringing those topics out. Okay. So, the other research group that I wanted to talk about is the Scalable Adaptive Multicast group. I apologize, it's hard for me to read (the slide) so it's probably hard for you to read. The concept behind this group, if you look at the pictures: the bottom one is intended to be a conventional network, hosts at the edge and routers in the middle. And the goal of the group is really to enable multicast services by taking hybrids of application layer multicast, which is easy to deploy among consenting end systems, and taking advantage of either IP multicast or link layer multicast - any native multicast that might exist. This is what you see in the pictures: at the bottom you have the conventional network, then you have the multicast tree, and then at the top you have a hybrid multicast environment where you have native multicast in one region, and application multicast in regions that don't support it. It takes advantage of the AMT protocol - this is a protocol for tunneling. AMT is Automatic Multicast Tunneling. That connects multicast-enabled clouds over unicast networks. 
And this is technology that was developed in the MBONED environment, and so it's a way of sort of gluing these together. So the SAM RG is trying to create a framework and protocols for integrating these various strategies for enabling multicast. There are a bunch of different communities; this work was initially led out of the Xcast environment, where they have some protocols, and they've got sort of one point they developed in the space. There's also P2P overlays and the IP multicast folks, and then applications include streaming and mobile networks and other kinds of applications. They've developed - I've gotten - some drafts on a framework. This is just another illustration, another version of the same picture, where you've got networks that have neighborhood IP multicast. They might have link layer multicast, application layer multicast, and they're glued together, and they've developed a protocol that's got different kinds of joins - IP multicast join, join-by-gateway, join-by-native-link - and so this is some of the work that's been going on the longest in the group. And it's pretty mature, I understand. Another thing that they've been working on is developing namespace support so that hosts can directly participate in multicast services, along with middleware to make that work. And they've also been trying to build a simulation environment to allow exploration of a wide range of networks in this space. They started with a tool called OMNeT++, and they're extending it to support IP multicast, and then extending that again to support different kinds of overlay strategies. Then finally, to go beyond simulation to testbeds, there's some work that's just being discussed now about building a hybrid multicast testbed that's started with contributions from the different participants in the research group - that is, they're actually globally distributed - with the hope that that will grow for implementing and exploring some of these protocols. 
So in a nutshell that's the status of the IRTF and two of the research groups, and I am open for questions if anybody has any. Okay. Thank you very much. (applause) 3. IAB Report - Olaf Kolkman As far as the IAB report: I love these little crane birds folded out of paper, so I put one on the picture - typical for Hiroshima. And I enjoyed making them during the social. Anyway, about the IAB. I show this slide every time I open the session, basically pointing out what we're about. It's very hard to give a nutshell description of what the IAB is about. But it has a charter, RFC 2850, and we try to describe as much as possible on our home page. You can find the current membership there. There are links to documents, and within the documents section you can find our minutes. It is the goal to have minutes posted not more than two meetings behind. We're very bad at meeting that goal. Just before this meeting, a batch of minutes was published that were approved earlier this week. Correspondence... when we talk to other organizations we usually leave a trail of correspondence, and that is published on our web site as well. Documents are one of our outputs. Recently we published the RFC Editor Model as RFC 5620. I will be talking about the model implementation at the end of this presentation. There are two documents currently in auth48: RFC 5694, which is the P2P architecture - definition, taxonomies, examples and applicability. That is about to be published. There's a final little thing with a header; the same goes for RFC 5704, Uncoordinated Protocol Development Considered Harmful. You've heard a presentation about that in previous IAB reports. Those are two very short documents. There is ongoing document activity. We've been working on a document considering IPv6 NATs. There was a call for comments from the community. Those comments have been incorporated into version 2 of this draft, and we're about to submit this to the RFC Editor once every IAB member has had a chance to sign off on it. 
So that will be going to the RFC Editor shortly. There's another document - IAB Thoughts on Encoding and Internationalized Domain Names. It's part of the inspiration for today's technical session. So basically, it's a call for comments. The technical plenary today is a working session that is based around this document. There are a bunch of draft-iab documents that are sitting somewhere in various states, that didn't have much attention over the last few weeks, at least not visibly. The draft on IAB headers and boilerplates - that document has been finished for a long time and is sitting at the RFC Editor, and it refers to the BIS document. We found a way to get out of there by changing the reference to RFC 3932bis itself if the situation with 3932bis does not get resolved pretty soon. So we want to get that out as soon as possible. That document basically changes the headers of documents, and changes some of the boilerplates, so it's more obvious if a document is an Independent RFC or an IETF Stream RFC or an IAB Stream RFC. We're also working on, and internally have been rehashing, the document that is intended to describe the IANA functions and what the IETF needs out of that. An update on that is imminent, and so is an update on the IP model (on which we've been working, and which will be uploaded as soon as the queue is open). A little bit of news. We had a communication with IANA on the way forward with respect to signing .arpa. We received a plan of action shortly before the IETF, in which there's a two-phase approach, where they will proceed with a temporary setup to get .arpa signed in the fourth quarter of 2009. Given that quarter four of 2009 is still only six weeks old, we expect it will be signed before the end of the year, so it's really imminent. 
After the design has been finished for signing the root zone, that same system will be used for signing .arpa. We responded positively to that plan, and we find it very important to get .arpa signed, to get its key signing key published in the IANA ITAR (the Interim Trust Anchor Repository) and in the signed root whenever that is available, and to make sure there are secure delegations to signed sub-zones; that is now being set in motion. So this is some progress on that front. We made a bunch of appointments. We've re-appointed Thomas Narten as the IETF liaison to the ICANN BoT, and related to that, Henk Uijterwaal for the ICANN NomCom. And finally, we appointed Fred Baker to the Smart Grid Interoperability Panel. A number of you have been to the smart grid bar BOF yesterday and know what this is about. Communication to other bodies... There is an effort underway within the EU to modernize ICT standardization. There was a white paper published by the Commission, and we've reviewed that and basically replied with a number of facts about our process, so that we at least are sure that there's no misunderstanding of how the IETF works. We also provided comments to the ICANN CEO and the ICANN board on a study that appeared recently about scaling the root. You can find those comments in the IAB correspondence section of the web page. Something that is of a more operational nature is the implementation of the RFC Editor model. Just as a recap of the state of affairs - I've been talking about this previously - we're in a transition period. We're moving away from ISI as a service provider, and into an implementation of a model that has been developed over the last few years. Within this model, we've got four legs: the RFC Series Editor, the Independent Submissions Editor, the Publisher, and the Production House. 
The IAOC is responsible for selecting the RFC Production Center and the RFC Publisher, and the IAB is responsible for appointing an advisory group to help us with the selection of RSE and ISE candidates. That has all been done. Looking for the RFC Series Editor is our responsibility, as is the Independent Submission Editor - both are functions within the model that are our job. Where are we with all that? Well, the IAOC, as you heard yesterday, has awarded the Production Center contract to AMS and also the RFC Publisher contract to AMS. And the good news here is that Sandy and Alice are the core members of the Production Center, which means that the continuity of publishing and getting RFCs online is not in danger. This is the good news, so to speak. As far as the Independent Submission Editor goes - that is the editor that assesses the technical quality of documents on the independent stream - we've had significant delays. That delay has been because we've been focusing on trying to find an RFC Series Editor. However, we have candidates and we are currently interviewing and assessing those candidates, and we are basically on track with that now. As far as the RFC Series Editor function goes, we had a call in July, not quite half a year ago. We closed nominations August fifteenth, and the nominations were provided to the ACEF, the committee that helps us with assessing the candidates. They interviewed candidates. They've had long deliberations, and their conclusion was that there was no suitable match between the candidates, the functions, and the expectations of the role - those three variables didn't quite match. And their advice was to seek somebody to manage the transition, to take a step back, and make sure that pieces are in place before going for the long-term solution. "Manage the transition" was the advice. 
The IAB went over this advice, turned it around a couple of times, and finally decided that the transitional RFC Series Editor (RSE) way forward is the best plan, the best way out so to speak. We've defined that job, and you should have seen the announcement with the job description and call for candidates mid last week. There is an ongoing call for candidates, but the evaluation of whatever we have will start November 20 or so - in a week. So, why do we think we will be successful now, or have higher odds of success? Well, there are a couple of things that are different from the situation on July 8. First, there is less uncertainty about the state of the production and publication functions. It is known who is going to execute those functions. There is capable staff there, there is institutional knowledge, which makes the job easier. Also, in the job description there's more focus on the transitional aspects. We've called out that the person who is going to do this needs to refine the role of the RSE after the initial transition, so that it is more clear what the successor will be getting into. There is an explicit task to propose possible modifications to the RFC Editor model in order to make things work better when we go out for the more permanent function. And, because this is a transitional management job, so to speak, it calls for a different type of commitment, a different type of personality, and also a shorter time commitment. So we hope that the pool is wider, deeper, or of different dimensions. One thing we will not do this time, in looking for candidates, is disclose the names of the candidates publicly. We think that was a mistake and we won't do that now. As I said, the call for nominations is now open. We will start evaluation November 23, and we will accept nominations as long as nobody has been announced. 
We believe that this is in line with RFC 5620, the RFC Editor Model, and the general community consensus. That doesn't mean that we went back to the community and asked "is this all okay?" - because there is time pressure. ISI will stop this function December 31st, and on January 1 we will start implementing this new model. We couldn't afford to lose time. That doesn't mean we're not listening if you have any comments or think things should have been done differently. We have been talking, as Bob Hinden said yesterday, with ISI, and Bob Braden in particular, about their willingness to extend the current contract on somewhat of a consultancy basis - so at least we have somebody so that no balls get dropped, and somebody who can actually transfer some institutional memory to whoever gets this job. There are two small links on the slides that you can go and read. Finally, it is worth mentioning that this time we did not have appeals, and that basically closes my presentation. (applause) With that, I would like to invite John, Stuart, and Dave for a session on internationalization. And I'll start up the slide set. 4. Internationalization in Names and Other Identifiers (John Klensin) All right. Good afternoon, everybody. We've come to share some general ideas about internationalization, and where we stand, and where we're going. The plenary's goal is to try to inform the community about this topic. This is not the first time the IAB has tried to do this. We continue to learn more and to try to share that with you. Internationalization is badly understood. It is understood moderately well by a fairly small number of experts, most of whom end up realizing how little we actually understand. But it affects a large number of protocols, a large number of people, and should affect virtually everything we're doing in the IETF, and elsewhere, where there's a user interface. 
We've got a new working draft which contains some recommendations about choices of encodings that we'll talk about. The current version is draft-iab-encoding-01.txt. It is very much still in progress. And more work is needed in this area, both on this document and on other things, and should continue. Internationalization is important and timely because a lot of things are going on around us. Names. Names can have non-ASCII characters in them. Not everybody writes in Roman characters, especially undecorated ones. We're seeing trends towards internationalized domain names. They've been floating around the Internet, fairly widely deployed, since about 2003 - earlier than that outside the public Internet. Pieces of this we'll show you later, and they can be as simple as a string of Latin characters with accents or other decoration on some of them, or as complex as scripts which some of the people in this room can read, and others cannot. The things you can't read become problems. The things you can read become opportunities. We see URLs floating around - actually IRIs - and what we discover is that a great deal of software doesn't know the difference, and a great many people know even less about the difference. We have path names in various kinds of systems that use internationalized identifiers, and use them in ways which are fully interchangeable with plain ASCII things, because the underlying operating systems have been internationalized. Users want to use the Internet in their own languages. It seems obvious. It took us a long time to get there. We're not there yet. At the same time, we've been making progress. The MIME work, which permitted non-ASCII characters in e-mail bodies in a standardized way, was done, in Internet time, a very, very long time ago, and has been working fairly successfully since. In China, IDNs are being used for all government sites. A great many IDNs are deployed in the .cn domain. 
Users in China see domain names as if the top level domains are actually IDNs. We're told that about 35 percent of the domains in Taiwan are IDNs, and almost 14 percent of the domains in Korea are IDNs. There's demand from various parts of the world that use Arabic scripts, and they have special problems because the script runs right to left under normal circumstances, and most of our protocols that use URLs and identifiers have been written around the assumption that things are written from left to right. In general, right to left is not a problem. Mixing right to left and left to right produces various strange effects. (Stuart Cheshire) Thank you, John. So, I'm going to go over some of the basic ideas and some of the terminology. Unicode is a set of characters which are represented using integers. There's actually about a million of them, but most of those are not used. The most commonly used characters fit in the first 65,536. Most of even the less commonly used ones fit in the first 200,000 or so, but the Unicode standard defines up to about a million. These are abstract integers that, for the most part, represent characters. I say 'for the most part' because there are variations. You can have an accented E as a single character, or the E and the combining accent as separate characters. But ignoring those details, roughly speaking it's a set of characters with numbers assigned to them. Now, you can write those numbers on paper with a pen, or on a blackboard with chalk. When we use them in computer systems, we need some way to encode them. And it's easy for us to forget that, but the encoding is important. Here are three of the common encodings: UTF-32 is 32 bits in memory, the normal way of representing integer values. That means there are endian issues. UTF-16 is slightly more compact, because the majority of characters fit in the first 65,536. That means most Unicode characters can be represented by a single 16-bit word, so that takes half the space. 
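As an illustrative aside (an editorial addition, not part of the talk), the precomposed-versus-combining-character variation mentioned above can be demonstrated with Python's standard unicodedata module:

```python
import unicodedata

# U+00E9: precomposed "é" as a single code point
precomposed = "\u00e9"
# U+0065 U+0301: "e" followed by a combining acute accent
combining = "e\u0301"

# They render identically but are different code point sequences
assert precomposed != combining
assert len(precomposed) == 1 and len(combining) == 2

# Unicode normalization maps between the two forms:
# NFC composes, NFD decomposes
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```

This is why protocols that compare internationalized identifiers typically pick a normalization form first; otherwise two visually identical strings can fail to match byte for byte.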
Similarly, there are endian issues with UTF-16. UTF-8 uses a sequence of 8-bit bytes to encode the characters. And UTF-8 has some interesting properties, so I'm going to talk a bit more about that. The IETF policy on character sets and protocols specifies that all protocols starting in January 1998 should be able to use UTF-8. Why is that? Why do we like UTF-8? Well, UTF-8 has the useful property of being ASCII compatible. And what that means is that the Unicode code points from zero to 127 are the same as the ASCII code points. So, decimal 65, hexadecimal 41, represents an upper case A, both in ASCII and Unicode. I'm talking about integers here. When you represent that integer Unicode value using UTF-8, you use the same byte value for values up to 127. This may seem very obvious, but it's an important distinction between the integer value and how you represent it in memory. The property of this is that if I have an ASCII file which is clean 7-bit ASCII, I can wave a magic wand and say that's actually UTF-8, and it is actually valid UTF-8 - and not just 'valid', but valid with the same meaning: it represents the same string of characters. For files that already have other meanings for the octet values 128 and up, like Latin-1 (ISO 8859-1), that property is not true, because those code values have already been given other meanings. But for plain ASCII, UTF-8 is backwards compatible. UTF-8 uses those octet values above 127 to encode the high numbered code points, and I'll explain how that works. So, in blue, we have the Unicode characters that are the same as ASCII characters. It's just a single byte in memory. In the middle, we have the green ones, and those are the octets that start with the top two bits being one, or the top three or the top four. And when you see one of those, that indicates the start of a multi-byte sequence encoding one Unicode code point. On the right in purple, we have the continuation bytes, which all have the top two bits one and zero. 
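A small Python sketch (an editorial addition, using only the standard library) of the byte classes just described - one-byte ASCII, multi-byte lead bytes starting 110/1110/11110, and 10xxxxxx continuation bytes:

```python
# Show the UTF-8 byte patterns for code points of increasing magnitude,
# matching the 1/2/3/4-byte classes described in the talk.
for ch in ("A", "\u00e9", "\u3042", "\U0001f600"):
    data = ch.encode("utf-8")
    print(f"U+{ord(ch):06X} -> {[f'{b:08b}' for b in data]}")

# ASCII "A" is a single byte with the top bit zero
assert "A".encode("utf-8") == b"\x41"
# Two-byte sequences start with 110xxxxx
assert "\u00e9".encode("utf-8")[0] >> 5 == 0b110
# Three-byte sequences start with 1110xxxx
assert "\u3042".encode("utf-8")[0] >> 4 == 0b1110
# All continuation bytes start with 10xxxxxx
assert all(b >> 6 == 0b10 for b in "\u3042".encode("utf-8")[1:])
```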
The nice property of this is, by looking at any octet value in memory, you can tell whether it's a stand-alone character, the start of a sequence, or something in the middle of a multi-byte sequence. This is how they look in memory. So, we have the ASCII characters standing alone, we have the two-byte sequence where we start with the 110 marker, and we have the three-byte and the four-byte sequences. UTF-8 has some nice properties. Part of being ASCII compatible is that UTF-8 encoding results in no zero octets in the middle of the string. That's useful for standard C APIs that expect null-terminated strings. The fact that the bytes are all self-describing makes it robust to errors. If there is corruption in the data, or data is copied and pasted, inserted and deleted, maybe by software that doesn't understand UTF-8, it is possible to recognize. If I give you a megabyte file, and you look at a byte in the middle of the file, then you can tell whether you've got a stand-alone character. If you look at the byte and the top two bits are 10, you know you're in the middle of a sequence, so you have to go forward or back, but you don't have to go very far before you can re-synchronize with the byte stream, and you know how to decode the characters correctly. That is in contrast to other encodings that use shift characters to switch modes, where you really have to parse the data from the start to keep track of what mode you're in. Another nice property of UTF-8 is that because it has this structure, you can tell with very high probability, looking at a file, whether it's valid UTF-8 or whether it is something else like Latin-1. One of the properties is that if I see a byte above 127 in a file, it can't appear by itself, because it must be part of a multi-byte sequence. So there have to be at least two, and the first one has to have the top two or three or four bits set, and the later ones have to have the top two bits be one-zero. 
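The self-synchronization and detection properties can be sketched in Python (an editorial illustration; the resync and looks_like_utf8 helpers are hypothetical names, not from the talk):

```python
def resync(buf: bytes, i: int) -> int:
    """Scan backwards from offset i to the start of the UTF-8
    character containing it, using the self-describing top bits."""
    while i > 0 and buf[i] >> 6 == 0b10:  # 10xxxxxx = continuation byte
        i -= 1
    return i

text = "h\u00e9llo".encode("utf-8")  # b"h\xc3\xa9llo"
# Offset 2 lands on the continuation byte of "é"; resync finds its start.
assert resync(text, 2) == 1

def looks_like_utf8(buf: bytes) -> bool:
    """A trial decode is a cheap heuristic detector: random Latin-1
    bytes above 127 are very unlikely to form valid UTF-8 sequences."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert looks_like_utf8("\u65e5\u672c\u8a9e".encode("utf-8"))
assert not looks_like_utf8("caf\u00e9".encode("latin-1"))
```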
So the probability of 8859-1 text happening to meet that pattern goes down very quickly for all but the shortest files. Another useful property is that a simple byte-wise comparison of two strings of UTF-8 bytes results in them sorting in the same order as sorting the Unicode code points as integers. This is not necessarily what humans would consider to be alphabetical order, but for software, like quicksort, that needs an ordering of things, this is a suitable comparison which results in consistent behavior whether you're using Unicode code points or UTF-8 bytes. One of the criticisms that's often raised against UTF-8 is that while it's great for ASCII - one character is one byte - and it's pretty good for European languages, since most European characters fit in two bytes, for Asian languages characters often take three or four bytes each. And this has led to a concern that it results in big, bloated files on disk. While that may have been a concern 10 or 20 years ago, I think in today's world there are different trade-offs we have to consider. One thing is that everybody inventing their own encoding, which is locally optimal in some particular context, may save a few bytes of memory in that context, but it comes at a big price in interoperability. And when I talk about different contexts here, I don't mean just geographically different places around the world, or different languages, but contexts like e-mail doing one thing and web pages doing a different thing. Also, in the context of applications and working groups, we have a tendency for each community to roll their own solution that they feel meets their needs best, which is different from other people's, and we have a lot of friction at the boundaries when you convert between these different protocols that are using different encodings. We'll have some more examples of that later. 
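A quick Python check (an editorial aside) of the ordering property mentioned above - sorting UTF-8 byte strings with plain byte comparison gives the same order as sorting the decoded strings by code point:

```python
# A mix of ASCII, Latin, and CJK strings (example data, not from the talk)
words = ["zebra", "\u00e9clair", "apple", "\u4e2d\u6587"]

# Python compares str values code point by code point...
by_codepoint = sorted(words)
# ...and bytes values byte by byte; UTF-8 preserves the order.
by_bytes = sorted(w.encode("utf-8") for w in words)

assert [b.decode("utf-8") for b in by_bytes] == by_codepoint
```

This is why a plain memcmp-style comparison is enough to keep, say, a sorted index consistent regardless of whether the keys are stored as code points or as UTF-8 bytes.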
Another aspect is that on most of our disks today, most of the space is taken up with images, audio and video. Text actually takes a very small amount of space. When you view a web page, most of the data that's coming over the network is JPEG images. If you're looking at YouTube, almost all of it is the video data. Those images and video are almost always compressed, because it makes sense to compress them. Ten years ago there were web browsers that would actually gzip the HTML part of the page to make the download faster. I don't believe anybody worries about that anymore, because the text part of the web page is so insignificant compared to the other media that it's not that important. Another interesting observation here is that with today's file formats like HTML and XML, quite often the machine-readable markup tags in that file, which are not there for end users to ever see - they're there to tell your web browser how to render the text - are really just bytes in memory. They have no human meaning, but it's convenient that we use mnemonic text, so we use ASCII characters for things like title and head and body. And even in files containing international text, a lot of that markup is ASCII. And I have had discussions very much like this with engineering teams at Apple. All the applications that Apple ships are internationalized in multiple languages, and inside each application - which you can see for yourself, if you control-click on it and open it up to see the contents - there are files that contain all of the user interface text in different languages. And we had the debate: should it be in UTF-8 or UTF-16? Clearly for western European languages, UTF-8 is more compact. But the argument was that for Asian languages that would be wasteful. So I did an experiment. This is the file path; you can try the experiment for yourself. I had a look at that file. In UTF-16 it was 117K. In UTF-8, it was barely half the size.
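That result is easy to reproduce with made-up data; the key names below are hypothetical stand-ins for the kind of ASCII keys such localization files contain, not the actual contents of any Apple file:

```python
# Hypothetical 'key' = 'value'; lines mixing ASCII keys with Japanese values.
pairs = {
    "CancelButtonTitle": "キャンセル",
    "SaveButtonTitle": "保存",
    "QuitMenuItemTitle": "終了",
}
text = "\n".join('"%s" = "%s";' % (k, v) for k, v in pairs.items())
utf8_size = len(text.encode("utf-8"))      # ASCII 1 byte, Japanese 3 bytes
utf16_size = len(text.encode("utf-16-le")) # everything 2 bytes
# The ASCII keys and punctuation dominate, so UTF-8 wins overall.
assert utf8_size < utf16_size
```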
This is the Japanese localization. I'm thinking, how can that be? I was expecting it to be about the same size or a little bigger, but I wasn't expecting it to be smaller. When I looked at the file, it's because of this... The file is full of these key equals value pairs. And all that text on the left is ASCII text. And the Japanese on the right may be taking three or four bytes per character, but that's not the only thing in the file. So, I believe that the benefits we get from having a consistent text encoding, so we can communicate with each other, are worth paying whatever size or performance overhead there might be. And as this example shows, there may not be a size overhead in many cases. So that's UTF-8. But we know that not everything uses UTF-8. So the other thing we're going to talk about is punycode, which is what's used in international domain names. Now, this is not because the DNS can't handle 8-bit data. The DNS protocol itself can, perfectly well. But many of the applications that use DNS names have been written assuming that the only valid DNS names contain letters, digits and hyphens. So in order to accommodate those applications, punycode was invented. And whereas UTF-8 encodes Unicode code points as octet values in the range from zero up to hex F4, punycode restricts itself to a smaller range of values, listed on the slide. And those are the byte values that correspond to the ASCII characters: hyphen, digits and letters. So, what that means is that when punycode encodes a Unicode string, you get out a series of bytes which, if you interpret them as being ASCII, look like a sequence of characters. If you interpret them as being a punycode encoding of a Unicode string, and do the appropriate decoding, and then display it using the appropriate fonts, they look like rich text. So this is a subtle point. We have the same sequence of bytes in memory, or on disk, or on the wire in the protocol, with two interpretations.
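Python's standard codecs can show this dual interpretation; the label here is just a commonly used example word, not one from the talk:

```python
# The same Unicode label, in raw punycode and A-label (xn--) forms.
label = "bücher"
assert label.encode("punycode") == b"bcher-kva"  # raw punycode
assert label.encode("idna") == b"xn--bcher-kva"  # A-label, as used in DNS
# Interpreted as ASCII it is line noise; decoded, it is rich text again.
assert b"xn--bcher-kva".decode("idna") == label
```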
They can be interpreted as letters, digits and hyphens - not particularly helpful, as it kind of looks like opening a JPEG in emacs. You see a bunch of characters, but that doesn't really communicate what the meaning of the JPEG is. Or, the letters and hyphens can be interpreted as punycode data which represents a Unicode string. Let me give you another example of that. Does this look like standard 7-bit U.S. ASCII or not? Let me zoom in. We'll do a hum. Who would say this is 7-bit ASCII? Can I have a hum? Who would say this looks like rich Unicode text? Hum? Okay. Let me zoom in a bit closer. This is a plain ASCII file. In fact it only contains Xs and spaces. You can edit this file in vi if you want. So, the same data has two interpretations. Seen from a sufficient distance, it looks like Chinese characters, but it can also be interpreted as Xs and spaces. So the meaning of this text depends very much on how you choose to look at it. But I would argue that editing this file in vi would not be the most efficient way of writing Chinese text. So this problem, that the same byte values in memory can be interpreted in different ways, really plagues us today. It was just a few days ago that I was buying a hard disk on Amazon and got these upside-down question marks. I think those are supposed to be dashes. This isn't even in complicated script systems - this is just the characters that any English or American reader would expect to use in plain text. I remember when I had my first computer, an Apple IIe, and it could only do upper case. And then my next one, the BBC Micro, had lower case. And the next one, which was a Macintosh, in about 1985, could actually do curly quotes, and I could write degrees Fahrenheit with a degrees symbol, and I could do em dashes, and I could do Greek alpha signs. I could write not-equals as an equals sign with a line through it, the way I did in school when I was writing with a pen - not exclamation-point equals. We have done it for so long, we forget.
Not-equals is an equals sign with a slash through it. So by 1986 we had gone from typewriter to some fairly nice typography, where I could type what I wanted on my Mac. And here we are, more than 20 years later, and things seem to have gone backwards. I'm not happy. How do we solve this problem? We make the user guess from 30 different encodings - "what do you think this web page might be?" This is not something that we want to impose on users. This is not something that the average end user is even qualified to understand. So, international domain names don't only appear on their own. They appear in context. And here are some examples. They can appear in URLs, they can appear in file paths on Windows. Of all these different encodings, which most of the people in this room would probably recognize as meaning the same thing, these are the only ones that in my mind are really useful, if we have a goal of supporting international text. If you asked a child to draw a Greek alpha symbol, and gave her a pencil and paper, plain pencil and paper, she would draw an alpha symbol. She would not write %cn, % something and say that's an alpha. That's complete insanity. That is not an alpha. An alpha is this thing that looks like an A with a curly tail on the right side. If we want to support international text, it's got to look like international text. But because we have all these protocols that don't have native handling of international text, we keep thinking of ways to encode international text using printable ASCII characters. And when you do that encoding, who decodes it? There's an assumption that if I encode it with percent-something or ampersand-something, then the thing on the receiving side will undo that and put it back to the alpha character it was supposed to be. Well, we got bitten by this yesterday. We sent out an e-mail announcing this plenary. This was not staged. This is real.
And some piece of software somewhere decided that Unicode newlines were no good, so it was going to replace them with the HTML ampersand code for a Unicode newline. And something on the receiving side was supposed to undo that, and turn it back into a newline. Well, nothing did, and this is what you all got in your email. This can get really crazy. Suppose you have a domain name which is part of an e-mail address, which you put in a mailto URL, which is then appearing on a web page in HTML text. Is the domain name supposed to be actual rich text as seen by the user? Or is it supposed to be punycode? It's in an email address, and email uses quoted-printable encoding - an equals sign followed by two hexadecimal characters. Well, in an e-mail address, do we have to do that escaping? And the e-mail address is part of the URL, which has its own percent escaping. And the whole thing is going into a web page, so HTML has its own escaping for representing arbitrary characters. Do we use all of these? A lot of people say yes. It's not clear which ones we wouldn't use out of that four in the nested hierarchy of containers. If you're looking at an HTML file in your editor, you are very far removed from having rich text in front of you on the screen. So we decided we'd try an experiment. What would happen if we didn't do all this encoding? What would happen if we just sent straight 8-bit data over the network? We decided to try this as an email test. Now, the SMTP specification says it's 7-bit only, but we asked the question: what if we disregarded that, and tried it anyway, to see what would happen? So, I sent a test e-mail where I replaced the E in my name with a Greek epsilon, and the I with an iota, and I sent this e-mail by hand, using netcat, so it wasn't my mail client doing encoding. I just put the raw bytes on to the wire and sent them to the SMTP server to see how it would handle it. I did it two ways. The first used the punycode-encoded representation of that first label of the domain name - xn-- something that looks like line noise.
And I did it a second time, just using the UTF-8 representation of that, which I'm showing here as the actual Unicode characters. So to make that really clear, this is the text that I sent using netcat to the SMTP server. This is the first one, using punycode, so this whole email is plain 7-bit ASCII. No surprising byte values in it. The first two lines are the header; after the blank line, the rest is the body. I point this out because headers are handled differently from bodies. Header lines are processed by the mail system. The body by and large is delivered to the user for viewing. The second e-mail is conceptually the same thing, except not using punycode - using just 8-bit UTF-8. So this is the result of the first test. Not surprisingly, the punycode in the body of the message was displayed by all the mail clients we tried as line noise. Which is not surprising, because it's just text in the body of an e-mail message. There's no way that the mail client really knows that that text is actually the representation of an international domain name that's been encoded. We could have some heuristics where it looks through the e-mail. I would not be happy about that - type the wrong thing in e-mail and it magically displays as something else. That seems like going further in the wrong direction. In the from line, we could argue that the mail client does know this is a domain name, because 'user name, angle bracket, user at example.com, close angle bracket' is a clearly structured syntax for an e-mail address, and the mail client knows how to reply to it. It could conceivably decode that text and say 'this is punycode'. The intended meaning of this text is not the xn-- string, it's a rich text name with epsilons and iotas in it. One client did that, which was Outlook on Windows. The second test was the raw 8-bit UTF-8 data.
And I'm very happy to say that in our small set of e-mail clients that we tested, 100% of them displayed the UTF-8 text in the body in a sensible way. We had some more interesting results from the from line. Gmail did this very interesting thing where it clearly received and understood the UTF-8 text perfectly well, because it displayed it to the user as the punycode form. I'm not quite sure why. Possibly for security reasons, because there is concern with confusable characters, which you will hear about in great detail in a few minutes. There is concern with confusable characters that you might get spoofed emails that look like they're from somebody you know but are really not. Turning it into this punycode form, at some level, should avoid that. I'm not sure it really does, because in a world where all of my email comes from line noise, what is the chance of me noticing that the line noise is different in this particular email? I don't know how much of a security feature that really is. But that may be the motivation. Eudora 6 is an old mail client, written I think before UTF-8 was very common. Those characters there are what you get if you interpret the UTF-8 bytes as being ISO 8859-1. And for the last three here, to be fair, I don't think we should blame the Outlook clients, because what appears to have happened is that the mail server that received the mail went through and whacked any characters that were above 127 and changed them to question marks. It didn't do that in the body, you see, but in the header it did do that pre-processing. So it's unclear right now whether it was the mail client that did this, or the mail server that messed it up before the client even saw it. So, back to terminology. Mapping is the process of converting one string into another equivalent one. And we'll talk a little bit later about what that's used for.
Matching is the process of comparing things that are intended to be equivalent as far as the user is concerned - even though the Unicode code points may be different, and the bytes in memory used to represent those Unicode code points may be different, the user intention is the same. Sorting is a question of deciding what order things should be displayed to the user. And the encoding issue has various levels to it. I've talked today about how to encode Unicode code points using UTF-8. There is also the question that the E-acute character can be represented by a single Unicode code point for E-acute, or as the code point for E followed by the combining accent character. So, more terminology. In the IDNA space, an IDNA-valid string is one that contains the Unicode characters allowed to go into international domain names, and those can take two forms. The term commonly used in the IDN community is a U-label: an IDNA-valid string represented in Unicode, by which they mean in whatever is a sensible representation in that operating system. It might be UTF-8, it might be UTF-16, but it is one of the natural forms of encoding Unicode strings. An A-label is that string encoded with the punycode algorithm, with xn-- to call out the fact that this is not just a string of characters in DNS, this is something that's been encoded by punycode, so you have to decode it in order to get the meaning. So I'll wrap up my part of the presentation with an observation. When it comes to writing documents, or writing an e-mail to your family, having the most expressively rich writing tools available is very nice. When it comes to identifiers that are going to be passed around, and are used to identify specific things, then it's not quite so clear. Because the bigger the alphabet, the more ambiguity. Telephone numbers use ten digits. And by and large, we can read those digits without getting too confused. We can hear them over the telephone.
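The two representations of E-acute mentioned here can be seen with Python's unicodedata module (a quick sketch; both strings render identically, which is exactly why matching needs a normalization step):

```python
import unicodedata

precomposed = "\u00e9"  # é as one code point
decomposed = "e\u0301"  # e + combining acute accent
assert precomposed != decomposed  # different code points...
# ...but normalization maps each form onto the other.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```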
Most people who can work a telephone can read and hear the ten digits without getting them too confused. When we go to domain names, we have a bigger alphabet. We have 37 characters, and we start to get a bit of confusion. Os and zeros, Ls and ones and Is - there's a bit of confusion, which is bad, but it's limited to those few examples. When we move to international domain names, the alphabet is tens of thousands of characters, and the number of characters that look similar or identical is much, much greater. So with more expressibility comes more scope for confusion. And I will note that while we're going in this direction, of bigger and bigger alphabets, the computer systems we use went in the opposite direction. They went to binary. Because when you only have one and zero, then there's a lot less scope for confusion in terms of signaling on the wire, with voltage levels. If there are only two voltage levels that are valid, you're high or low. If there are ten that are valid, then a smaller error might mean reading a 5 as a 6. So we know that when we build reliable computer systems, binary has this nice property. So, I leave you with that. And I ask Dave to come up and tell you more. (Dave Thaler) I'm going to talk about matching first. So earlier on, when we talked about definitions, you probably thought that matching meant comparing two things in memory. That is certainly one of the aspects of matching: you look up a database entry, and you know whether to respond or not. There's another problem with matching - the human recognition matching problem. So, let's do another eye test here. We have two strings up here that could be easily confused by a human. Can you spot the difference? Hum if you can spot the difference. Okay. The difference that you can spot here is that on the left, this is .com, and on the right this is .corn. It seems like a great opportunity for some farmers' organization, doesn't it?
This illustrates that even in plain ASCII we have confusion. Now, some of you who have been participating in the RFID experiment are aware of another type of confusion. On this slide, these are not capital Is, they are lower case Ls. More confusion with just ASCII. But wait, it gets worse. This is the Greek alphabet. It looks like 'Ethiopia', but those are not the Latin letters E T H I O P I A. If you take the lower case of both of those, then they look different - the lower case versions of the Greek letters there are fairly distinctive. So as a result, we see the current trend to actually deprecate these various forms and revert to one standard one, or one canonical one if you will, in various identifiers such as IRIs and so on. In IDNA2008, some of these characters are treated as disallowed for these types of reasons. Second eye chart, okay. Look up from your computer and stare at the screen. Hum if you spot the difference. They both look the same. If you can spot the difference, you may need an eye test. The difference here is that all the characters on the right are in the Cyrillic alphabet. There's no visual difference. What's worse is that in ASCII, .py is the TLD for Paraguay. On the right, those are the Cyrillic letters corresponding to .ru, which is Russian. Now, anybody here who actually speaks Russian, or is intimately familiar with internationalization, will be quick to point out one important fact: 'jessica' uses letters that are not in the Russian language. For example, J and S do not appear in the Russian language. This points out that there are alphabets, and languages that use a subset of characters in those alphabets. In order to get the letters that look like 'jessica', you have to combine characters from two different languages, but they're both in the same alphabet.
So what this points out is that if you are a registry that is going to be accepting, say, domain registrations under your zone, then you may want to apply additional restrictions, such as not accepting things that look like, or that contain, characters that are not in your language. If you're .py for Paraguay, and there are certain characters that you don't want to allow, you can restrict that. This particular example requires combining characters from two different languages, and there are other examples that are purely from the same language. On the left, 'epoxy'; on the right, the same-looking Cyrillic string, which may be, say, a five-letter acronym for some Russian organization. The problem is at the human matching layer. Is that the thing you're looking for? Does that match or not match? John is going to talk about a couple more examples. So hopefully your eye tests have been enlightening. (John Klensin) We get more interesting problems when we move beyond the eye tests into a human perception problem, which is that people tend to see what they expect to see. So we have here two strings which look different, but look different only when they're next to each other. The first one is a restaurant, and the second one is in Latin characters and something different altogether. But they look a lot alike if you're not sensitive to what's going on. In general, if you have a sufficiently creative use of fonts, and style sheets from a strange environment, almost anything can look like almost anything else. A number of years ago I came into Bangkok very late at night, and I was exhausted, and I was being driven to the city, and I saw a huge billboard, and it had three characters on it and a red, white, and blue background, and from the characters I was firmly convinced it said USA. Well, it was in Thai; the characters were decorated, and had I seen them outside of that script, maybe I would have understood the difference. Maybe I would not have.
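A registry-style restriction like the one described - rejecting labels that mix scripts - can be approximated with a crude heuristic over Unicode character names (a sketch only; real registries use proper script tables, not name prefixes):

```python
import unicodedata

def scripts_used(label: str) -> set:
    """Very rough script detection from Unicode character names."""
    found = set()
    for ch in label:
        name = unicodedata.name(ch, "UNKNOWN")
        found.add(name.split(" ")[0])  # e.g. LATIN, CYRILLIC, GREEK
    return found

# 'jessica' in Latin, and the same-looking string with a Cyrillic final 'а':
assert scripts_used("jessica") == {"LATIN"}
assert scripts_used("jessic\u0430") == {"LATIN", "CYRILLIC"}
```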
That brings us to another perception test, which snuck up on me, and an audience of other people. We were sitting in a room, two months ago, at an AP meeting, and there was a poster in the back of the room with the sponsors. And on the poster we had these three logos. For the first one, pretend that you're not used to looking at Latin characters. You look at the first one and you don't know whether that character is an A or a star. And then there's a reverse eye test. See the second and third lines there, and convince yourself, assuming you know nothing about Latin alphabets, whether those are the same string, or the same letters, or not. Because this is the problem that you're going to get into when you're seeing characters in scripts and strings that you're not familiar with. And it's a problem when people are not used to looking at Latin characters and the fonts get fancy. People keep carrying out tests in which they say 'these things are confusable or not confusable' when they're looking at things in fonts which are designed to make maximum distinctions. When people get artistic about their writing systems, they're not trying to make maximum distinctions, they're trying to be artistic. And artistic-ness is another source of ambiguity for people as to whether two things are the same or different. We have other kinds of equivalence problems. To anyone who looks closely, or who is vaguely familiar with Chinese, simplified Chinese characters do not look like traditional Chinese characters. But they're equivalent if it's Chinese. If it's Japanese or Korean instead, one of them may be completely unintelligible, which means they are not equivalent anymore. As a consequence of some coding decisions which Unicode made for perfectly good reasons, there are characters in the Arabic script with two different code points but which look exactly the same.
So the two strings seen there, which are the name of the Kingdom of Saudi Arabia, look identical, but would not compare equal if one simply compared the bytes. Two strings, same appearance, different code points. Little simple things like worrying about whether accents go over Es do not get caught in the same way these things do. This is another equivalence issue. What you're looking at are digits from zero to 9 in most cases, and from one to 9 in a few. Are they equivalent? Well, for some purposes, numbers written in two different scripts are the same. For other purposes, they're not. We've seen an interesting situation with Arabic, in that input method mechanisms in parts of the world accept what the user thinks of as Arabic-Indic digits going in and encode European digits. When they decode, they treat the situation as a localization matter, so users see Arabic-Indic digits going in and coming out. But if we compare them to a system in which the actual Arabic-Indic digits are stored, we get not-equal. We've also got in Unicode some western Arabic-Indic digits and some eastern Arabic-Indic digits. They look the same above three, but below three they look different, and all of the code points are different. Are they equal or not equal? And if you think that they're digits, and if you think, as we've said several times this week in various working groups and over the last several years, that user-facing information ought to be internationalized, remember that we show IP addresses in URLs, which users look at and sometimes type. So now assume you see some of these Arabic digits, two or three of them, followed by a period. And then you see another two or three Arabic digits followed by a period. And then another one, two or three Arabic digits followed by a period, and then another one, two or three Arabic digits. Is that an IPv4 address? Or a domain name? And if it's an IPv4 address, do you know what order the octets come in?
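The digit problem is easy to demonstrate: Python's int() accepts any Unicode decimal digits, so all of these parse to the same number even though none of the strings compare equal:

```python
european = "42"
arabic_indic = "\u0664\u0662"          # Arabic-Indic digits four, two
eastern_arabic_indic = "\u06f4\u06f2"  # Extended (Eastern) Arabic-Indic forms
# Numerically equivalent for some purposes...
assert int(european) == int(arabic_indic) == int(eastern_arabic_indic) == 42
# ...but the code points, and thus the stored strings, are all different.
assert len({european, arabic_indic, eastern_arabic_indic}) == 3
```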
The difficulty with all of these things is that they're funny. And then you catch your breath and say, he's not kidding. These are real, serious problems, and they don't have answers, except with a lot of context and a lot of knowledge. Our problems arise not when we're working in our own scripts, but in somebody else's. So now we come back to the place where people started becoming aware of these problems - with internationalization of the DNS. If I can make a string in one script, or partially in one script, look like a string in some other script, I suddenly have an opportunity, especially if I'm what the security people call a bad guy. But those kinds of attacks need not be deliberate. They can be deliberate or accidental, depending on what's going on. We spent a lot of time in the early days of IDNs believing that if only we could prevent people from mixing scripts, we'd be okay. The example Dave gave shows how far from okay that is. We're almost at the point of believing that prohibiting mixed scripts is probably still worthwhile, but it makes so little difference if somebody is trying to mount an attack that it's really not a defense. If you have names in scripts that are not used in the user's area, and the user is not familiar with them, many scripts become indistinguishable chicken scratch to a user who is not used to that script. And all chicken scratches are indistinguishable from other chicken scratches, except for certain species of chickens. We talk from time to time about user interface design, and whether it should warn the user when displaying things from unknown sources, or strange environments, or mixed scripts, but the UI may not be able to tell. We're in a situation with many applications these days where we are coloring, and putting into italics, and marking, and putting lines under or around so many things, that the user cannot keep track of what's a warning and what's emphasis and what's a funny name.
And as was mentioned earlier, some browsers try to fix this problem by displaying A-labels, the xn-- form. Our problem there is that those things are impossible to remember. And one of the things we discovered fairly early is that if we take a user who has been living for years with some nasty, inadequate ASCII transliteration of her name, and instead of offering her that name written properly in its own characters, we offer her something completely non-mnemonic, starting with X and followed by what Stuart calls line noise, then for some reason, the user doesn't think that's an improvement. We've also recently discovered another problem we should have noticed earlier. There are two strings, or one string depending on which operating system you're using, which are confusable with anything. If one of these strings shows up in your environment, and you don't have the fonts or rendering machinery to render it, the system does something. It can turn it to blanks, which is pretty useless. But what most often happens is it's turned into some character which the system uses to represent characters it can't display. A string of question marks can either be a string of question marks, or it can be some set of characters for which you don't have fonts. A string of little boxes can either be a string of little boxes, or some string of characters for which you don't have fonts, or it can be an approximation to a string of question marks. And thus two strings, in an environment in which you don't have the fonts installed, can be confusable with anything. Now, the question is, what does a user do? Well, it should be a warning to the user that something is strange. But we know something about users from our security experience, which is that if we pop up a box which says, 'aha! this is strange - would you like to go ahead anyway?', users almost always do the same thing, which is click okay, and go on. So, this is the string which can get you what you can't even read.
And usually, depending on the operating system, trying to copy this into another environment by some kind of cut-and-paste operation will not work. The number of colorful ways of it not working varies, but the not working is pretty consistent. So, we started talking about mapping before. In a perfect world we would have a consistent system that performs the comparison for us. Now, that sounds obvious. In the ASCII DNS, when that was defined, we wrote a rule which said matching was going to be case-insensitive, and the server goes off and does something case-insensitive. Names get stored in case-sensitive ways, more or less, but queries in one case match stored values in another case. It's all done on the server; nothing changes the other things. If you don't have intelligent mapping on the server, and you want to try to simulate it - which is what we've been trying to do over and over again in the international environment, where we're trying not to change the server, or how we think about things, very much - one of the possibilities is to map both strings into some pre-defined canonical form and compare the results. That sort of works. It doesn't permit matching based upon close-enough principles or something fuzzy, and that's right in some cases and terribly wrong in others. But when we start converting characters, we lose information. If we convert a visual form of one variety into another form which is more easily understood, that's fine for matching purposes. But if we need to recover the original form, and we've made the conversion, we may be in trouble, depending on what we've done. The mapping process inherently loses information when we start changing one character into another one. Sometimes it's pretty harmless. Case conversion may be harmless or not harmless, depending on what it is you're doing. Converting half-width or full-width characters to full width is normally harmless, depending on what you're doing.
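The 'map to a canonical form and compare' approach can be sketched like this, with case folding plus NFC standing in for whatever pre-defined form a protocol might pick; the loss of information is visible immediately:

```python
import unicodedata

def canonical(s: str) -> str:
    """One possible canonical form: case-fold, then Unicode NFC."""
    return unicodedata.normalize("NFC", s.casefold())

# Matching works across case and combining-character differences...
assert canonical("Caf\u00e9") == canonical("cafe\u0301")
# ...but the mapping is lossy: the original capital C is unrecoverable.
assert canonical("Caf\u00e9") == "caf\u00e9"
```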
Unicode has normalization operations which turn strings in one form into strings of another form, making the E-with-accent character, and the E followed by an over-striking non-spacing accent, into the same kind of thing so they can be compared. Usually safe. Unicode has other operations which take characters which somebody thought were perfectly valid independent characters and turn them into something else, because somebody else thought they weren't independent and valid enough. If that conversion is taking a mathematical-script lower case A and turning it into a plain A, it's probably safe. If it's taking a character which is used in somebody's name and changing it into a character which is used in somebody else's name, it's probably not such a hot idea. And the difficulty is that we try to write simplified rules that get all these things right, and there are probably no such rules. So the mapping summary is: making up your own mapping system is probably not a very good idea. People who are experts in it, who spend years worrying about how to get it right, can't get it right either, because there is no 'right'. It depends on context. And finding the correct mapping for a particular use very often depends on the language in use, and very often, when we're trying to do these comparisons - DNS is a perfect example, but not the only one - we don't know what language is being used. If you need language-dependent mapping, and you don't know the language, you're in big trouble. If you use a non-language-dependent mapping in an environment where the user expects a language-dependent mapping, you can expect the user to get upset. In an international world, upset users are probably our fate, but we need to get smarter about how we handle them. (Dave Thaler) Our next topic is the issue of encoding, which is the topic that our working draft focuses on.
So, if we look at some of the RFCs that we have right now, we can step back and construct a simplified architecture - this is the simplified version. We are on a host, we have an application, it sits on top of and uses the DNS resolver library. That's our over-simplified model. There are two problems. And by the way, the IDNA work, for example, talks about inserting the punycode encoding algorithm in between those two. The two problems with this over-simplification: one, DNS is not the only protocol. Different protocols use different encodings today - and I'll get to this in a second. And the second problem is that the public Internet name space in DNS is not the only name space. As John mentioned earlier, the Chinese TLDs are not in the public root. And different name spaces, as we'll see, use different encodings today. So this is the more realistic, more complicated version of that previous picture. On a host, you have an application. That sits on top of some name resolution library, such as sockets or whatever. Between those, they communicate with whatever the native encoding is of the operating system of choice. UTF-8 and UTF-16 are most common. Underneath the name resolution library, you have some variety of protocols - the union of different things that exist on various operating systems. And then this host is attached, for example, to multiple LANs, each of which may or may not be connected to the public internet. And it may also be connected to a VPN, for example. Each of these is a potentially different naming context that you can resolve names in. So let's talk about problem No. 1 first, which is a multitude of name resolution protocols. Now, it turns out that many of these are actually defined to use the same syntax. What that means is if somebody hands you an FQDN, this thing with dots, you cannot tell what protocol is going to be used. It might be resolved by looking in your hosts file. 
It might be resolved by querying DNS, or resolved on the local LAN by LLMNR or mDNS. Each of these is defined to use the same type of identifier space, the same syntax. And so what happens is the name resolution library takes a request from the application and tries to figure out where to send it - which protocol or protocols to try, and in what order. And of course, if you have different implementations of different libraries that end up choosing different orders, you get interesting results. To make it more difficult, different protocols specify different encodings, and so when you put those things together, that means the application can't tell which encoding - or, in the case of multiple name resolution protocols being tried, which *set* of encodings - is going to be attempted, because that's a decision made by the name resolution library. Let's talk just for a moment about the history of what is a legal name. All right. The name resolution library gets something - is that a legal name? What's that something? Let's briefly walk through the history, to understand sort of where the world is at today. Back in 1985, RFC 952 defined the names in the host file: internet host names, gateway names, domain names, or whatever. This is the one that said they contain ASCII letters, digits and hyphens, or LDH. In 1987 is when DNS came along, published in RFC 1034 and 1035, and it includes a section called 'preferred name syntax' which repeats the same description of LDH. The confusion comes from the word 'preferred' there. Well, remember, this was before RFC 2119 language. Is that 'preferred' a SHOULD or a MUST? Is preferred mandatory? There's confusion there. That was 1987. By 1997, ten years later, we had RFC 2181, which was a clarification to the DNS specification, because of a number of areas of ambiguity and confusion that were resulting. These are three direct quotes with emphasis added. 
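The LDH rule being discussed can be sketched as a small validity check. This is a hypothetical helper of my own, written against RFC 952's letters-digits-hyphens rule as relaxed by RFC 1123 to allow a leading digit; real resolvers and registries vary in what they accept:

```python
import re

# LDH labels: ASCII letters, digits and hyphens, not starting or
# ending with a hyphen, at most 63 octets per label.
LDH_LABEL = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")

def is_ldh_name(name: str) -> bool:
    """Check every label of a dotted name against the LDH rule."""
    labels = name.rstrip(".").split(".")
    return all(LDH_LABEL.match(label) for label in labels)

assert is_ldh_name("www.example.com")
assert not is_ldh_name("-bad.example")    # leading hyphen is out
assert not is_ldh_name("héllo.example")   # non-ASCII fails the LDH rule
```

The last assertion is the crux of the talk: under a strict reading of this syntax, internationalized names simply are not legal, which is what forced the later encoding work.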
The first one says 'any binary string' can be used as the label of any resource record. 'Any binary string' can serve as the value of any record that includes a domain name. And, as Stuart mentioned, applications can have restrictions imposed on what particular values are acceptable in their environment. Okay. So, to clarify: the DNS protocol itself places no restrictions whatsoever, but users of entries in DNS could place restrictions, and many have. Now, that was 1997, and in that same year there was work on the IETF character-set policy, which was published in, I think, January of '98 as RFC 2277. This is the one that Stuart referred to, and here are the quotes from it. The first one you saw earlier: 'protocols must be able to use the UTF-8 character set'. It then continues, 'protocols may specify, in addition, how to use other character sets or other character encoding schemes'. And finally, 'using a default other than UTF-8 is acceptable.' What's also worth pointing out is that it's not just what it says, it is also what it doesn't say. What it doesn't say is anything about case, about the E-with-accent, about any types of combined characters, about how things get sorted, etc. The IETF policy did not talk about such cases. And so, as a result, two Unicode strings often cannot be compared to yield what you'd expect without some additional processing. Now, since protocols must be able to use UTF-8, but could potentially use other things, and since the contemporaneous DNS RFC said any binary string is fine, UTF-8 in DNS complies with the IETF policy. So, starting in that year, people started using UTF-8 in private namespaces. By private namespaces, we mean things like enterprises, corporate networks. By private namespace, again, we mean 'not resolvable outside of that particular network, not resolvable from the public internet.' 
In their own world they went off and used UTF-8, and it became widely deployed in those private networks. Five years after that, with UTF-8 already widespread in private namespaces, came the work on the punycode encoding for use in the public DNS name space. So, just to summarize here: UTF-8 is widely deployed in private namespaces; punycode-encoded strings, or A-labels, are deployed in the public DNS name space. Now, within the internationalization community, there's been a bunch of discussion of length issues, and I think it's important for the wider community to understand them. DNS itself introduces a restriction on the length of names: 63 octets per label, 255 octets per name (not including a zero byte at the end if you're passing it around in an API). The point is that non-ASCII characters, as Stuart showed, use a variable number of octets in the encodings that are relevant here. Now, 255 UTF-16 octets, 255 UTF-8 octets, and 255 A-label octets all hold different numbers of characters. So there exist strings that can be represented within the length restrictions as punycode-encoded A-labels, but can't be encoded within the same length restrictions in UTF-8. There also exist strings that can be encoded in UTF-8, but cannot fit within those restrictions once punycode-encoded into an A-label. So, you can imagine some interesting discussions there. Let's recap. We've talked about multiple encodings of the same Unicode characters. There are things we called U-labels: with U think Unicode, with A think ASCII. U-labels are what you usually see written out in the native script. A-labels are things that start with xn--. You have different encodings - say, the top form and the bottom form on the slide - that are used by different protocols and different networks, even within DNS: punycode A-labels on the public Internet, UTF-8 on private intranets. And even different applications pay attention to different RFCs - ones that actually implement the IDNA documents, and ones that don't. 
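The length asymmetry Dave describes is easy to demonstrate. A (contrived, hypothetical) label of 40 copies of 'ü' takes 80 octets in UTF-8, blowing past the 63-octet label limit, yet its punycode A-label stays well under it, because punycode encodes the repeats very compactly:

```python
# DNS limits: 63 octets per label, 255 per name. The same Unicode
# string needs different numbers of octets in different encodings.
label = "\u00fc" * 40                     # 40 copies of 'ü'

utf8 = label.encode("utf-8")              # 2 octets per 'ü'
alabel = b"xn--" + label.encode("punycode")

assert len(utf8) == 80                    # over the 63-octet limit as UTF-8
assert len(alabel) <= 63                  # fits as a punycode A-label
assert len(alabel) < len(utf8)
```

The opposite case also exists, as the talk notes: a label of many distinct non-ASCII characters can fit in UTF-8 while its punycode expansion does not, so neither encoding dominates.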
Because you have all these differences across protocols, networks and so on, you can imagine the confusion that results. If you have one application that launches another application and passes it some name or URL or whatever to use - the launching application may be able to access some directory of stuff, and you click on something, or you cause it to launch another application in some way, and it passes the name along - then whether the application that just got launched can use the identifier in the same way, in general, all bets are off. It may or may not. You may get a failure, or you may get to some different site than what you got to from the launching application. Similarly, if you have two applications that are both trying to do the same thing - two jabber clients, for example - and one happens to work and the other one doesn't happen to work, there would be a switching incentive to say 'all I have to do is switch to the other one.' So let's walk through a couple of examples of applications that have actually tried to do a bunch of work to improve the user experience in these cases: they have to deal with the multiplicity of encodings, and they don't want to get to the wrong place or get failures. So what we found is some applications have tried to improvise algorithms to deal with the multiple-encoding issues that the RFCs don't tell you how to deal with. And most of the time they actually get it right. There are a couple of corner cases where they don't solve it 100 percent. Here's one. You type something into an address bar in a browser. And in this example, the 'IDN-aware' application is one that understands that there exists UTF-8 in some private namespaces that it's connected to, and punycode in the public namespace it's connected to. 
And so it knows which networks it's connected to, and may have some information about what the names are that are likely to appear on them, so it runs some algorithm to decide if this is an intranet or internet name. At this point the string is being held internally in memory in, let's say, UTF-16 or UTF-8, whatever the native storage of the operating system is. In this example, let's say it decides it's an intranet name. So in this case it leaves it in sort of the native encoding, does not run the punycode algorithm, and passes it to the name resolution API. That then goes to DNS using UTF-8, and sends it to the DNS server in UTF-8. If you have host B in this example - one that has chosen to register its name in DNS in the punycode-encoded form, the A-label form - and that's the name that actually matters, that's going to fail. If the host is host A, where it's using the same type of algorithm as the one on the top, it's going to succeed. So the normal expectation is that most of the hosts in that environment are all cooperating, or all have the same knowledge or configuration, and you actually get to host A. If instead it's in the mode of host B, it will fail. Now, let's take the case where the application decided, by looking at the name, that it's going to be an Internet name. In this case it runs punycode on it, producing the xn-- form. This goes to the public DNS. In this example, let's say that name does not in fact exist in the DNS, and so the name resolution API wants to fall back and try a local LAN resolution - try LLMNR or mDNS. In this case, the LLMNR protocol spec says, in effect, that if the name is registered there and resolvable, you'd better ask for it in UTF-8 or you won't get the answer. Here, though, the application already put it in the A-label form before passing it down, so when mDNS puts that out there, it's not going to find a match. 
Most of the time it does the right thing in both environments, but there are corner cases in both where things will fail. The next category is where you have some application that has become IDN-aware, and another application that doesn't do anything - it just takes whatever the user types and passes it directly to name resolution with no inspection or conversion, because the name resolution APIs in this example are UTF-16 APIs. So on the left, the one that is IDN-aware will convert the name using punycode to the A-label form, and it will go out and find the registration in DNS in the punycode-encoded form, whereas the other application passes it down in UTF-16. The stack will convert it to UTF-8, and it will go out and not find it. It doesn't find it - but there actually exist Unicode code points with those binary strings, and any binary string could appear in the DNS. So what if the UTF-8 version magically found its way out there, either accidentally or intentionally? The user would get to a different site than what they expected. Finally, the other category of differences is applications that want to say 'I don't know which one it's going to be, so I'm going to try them both.' In some order... So, consider two applications, one that decides to try the UTF-8 version first, and one that decides to try the A-label first. The one on the left converts it to punycode first, the A-label version, and it goes out and finds the punycode, or A-label, version. The other one might try UTF-8 first, and might find a different version - and either one might be unreachable. So you get non-deterministic behavior. Of course, the other application falls back too, so if the first form was unreachable you get the reverse. So this is what applications actually do today. 
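The two wire identities these examples keep switching between can be made concrete. Below, `to_alabel_form` is a hypothetical helper of my own that does only the bare punycode encoding step - none of IDNA's mapping or validity processing - using the classic "bücher" example from the punycode RFC:

```python
def to_alabel_form(name: str) -> bytes:
    """Public-DNS wire form: punycode-encode each non-ASCII label."""
    return b".".join(
        lbl.encode("ascii") if lbl.isascii()
        else b"xn--" + lbl.encode("punycode")
        for lbl in name.split(".")
    )

name = "b\u00fccher.example"               # hypothetical IDN host name

utf8_form = name.encode("utf-8")           # what a UTF-8 private namespace sees
alabel_form = to_alabel_form(name)         # what the public DNS sees

assert utf8_form == b"b\xc3\xbccher.example"
assert alabel_form == b"xn--bcher-kva.example"
assert utf8_form != alabel_form            # same name, two wire identities
```

Every failure mode in the slides comes down to one side of a lookup using `utf8_form` while the registration exists only under `alabel_form`, or vice versa.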
So, the basic principle, the basic learning from these - the fact of physics, right - is that conversion to an A-label, or to UTF-8, or to whatever else is going to appear on the wire, can only be done by some entity that knows which protocol or namespace is going to be used: what the encoding is that is appropriate for that particular environment, or that type of resolution. And when an application tries to resolve a name, the name resolution library may try multiple of them. So there's no single right choice at the application layer. This leads to two sort of remaining categories of hard issues for the client - using the term generically, whether it's a host or application or whatever - because again, while we're using host names in many of our examples, the problems we're talking about are not limited to host names. Many of the ones we've talked about today may or may not be unique to host names; they could occur in other identifier spaces. The first one is: the client has to guess, or learn, whatever encoding the server expects. In many cases it may be defined by the protocol, and that's fine. But if there are multiple protocols, it's part of the learning or guessing. Names appear inside other types of identifiers, and each identifier type today often has its own encoding conventions. What is this identifier space? Is it UTF-8? Is it A-label form? Is it percent-encoded form or whatever? And anything that converts from one name space to another name space - such as extracting an e-mail address from mail, or extracting a host name from a URL - has to convert between those two sets of conventions. Now, just saying, well, if they all used a single encoding, they wouldn't have to do any of this transcoding in the movement between layers - by comparison, that's the easy part. That's not the hardest part of the problem. That's sufficient only if the only thing you're going to do is display it. 
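The cross-namespace transcoding Dave mentions - pulling a host name out of a URL and re-encoding it for DNS - might look like this sketch. The URL is hypothetical, and as before the bare punycode step stands in for full IDNA processing, which involves more than this:

```python
from urllib.parse import unquote, urlsplit

# Hypothetical URL carrying a percent-encoded UTF-8 host name:
url = "http://b%C3%BCcher.example/index.html"

# Step 1: URL namespace -> Unicode (percent-decoding convention)
name = unquote(urlsplit(url).hostname)

# Step 2: Unicode -> public-DNS namespace (A-label convention)
alabel = b".".join(
    lbl.encode("ascii") if lbl.isascii()
    else b"xn--" + lbl.encode("punycode")
    for lbl in name.split(".")
)

assert name == "b\u00fccher.example"
assert alabel == b"xn--bcher-kva.example"
```

Each arrow between identifier spaces needs its own decode/re-encode pair like this, which is exactly the transcoding burden the talk is describing.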
All the other things besides the encoding issue - comparison, matching, sorting - all require more work. So just like RFC 952 defined what ASCII characters were legal in a host name, we need to define the Unicode subsets for other identifiers. What are the things that are legal? The optimal subset for one protocol or type of identifier may be different from what's optimal for some other one. Now, there also exist cases where, based on, say, implementation differences, the way that two things display visually looks different. Usually, this is due to a bug. The problem is nobody agrees which one is the bug and which is the correct behavior. So that's a hard issue. Stuart - back to you. (Stuart Cheshire) Thank you, Dave. So, Dave is right - having a single encoding does not solve all our problems, although having lots of different encodings definitely does add to them. This is not news. We've known this for a while. There used to be computers using different character sets, and we recognized that if some computers used ASCII, and some used EBCDIC, and the receiver had to work out which it was, this was not going to give a good experience. So the wire protocols used ASCII. And if you had a computer that used the other character set, you needed a mapping table so you could convert to the common language on the wire and convert back upon reception. We recognized that in 1969, but we seem to have forgotten it now. To get out of the current chaos we need to go beyond the current recommendation. Merely supporting UTF-8 as one of many options doesn't solve the problem. I think we need to move to a world where we only use UTF-8, so that when you receive an identifier, or you receive a text string off the network, you don't have to guess what the encoding is, because there is only one encoding. So the summary is: for text that end users see, we want to have rich text, and that means Unicode. 
And for compatibility on the wire, that means using UTF-8 to encode those Unicode code points. The corollary of this is that for identifiers that are protocol identifiers - that are used for communication between computers to tell each computer what to do, and aren't seen by end users - it is much harder to make the argument why those should be Unicode, because the bigger the alphabet, the more scope for confusion and the more chance of things not interoperating. With that, I'd like to open the mic for questions. I think we should do half an hour for questions on this internationalization presentation, and that will leave half an hour for general questions to the IAB. We will take new questions at the middle mic and the end ones, and the in-between ones for follow-ups. Open Mic: (Bob Briscoe) A question for John. How long have we known about the security problems? Because it was sort of disquieting hearing about them - this is the Internet, and we ought to be fixing these things. (John Klensin) What do you mean by 'knowing about the security problems'? (Bob Briscoe) Well, the problems of being able to spoof one character with another, and change fonts, etc. (John Klensin) Since long before this process started. We've known about confusability of characters since we started looking at multiple scripts. We've known about some of these confusion problems in titles of things since we deployed MIME with multiple character sets, and that would have been in - I'm guessing from memory - 1990 or shortly thereafter. I gave a presentation at an ICANN meeting in Melbourne that exhibited some of these abilities to write different things in different scripts. At that time, it was a general warning about these things. We've certainly seen more subtlety as we've understood these things better. 
I used to joke that one of the properties of this whole internationalization situation, when one is actually trying to use the strings and identifiers rather than printing them, is that every time we looked at a new script, we found a new and different set of problems. It was like going through a field and turning over rocks, and each time you found something new. So I'm not certain how to answer your question. This is just endemic in an environment where we're suddenly moving identifiers from a world in which the maximum number of characters we treat as different is around 36, to an environment where the maximum number of characters we treat as different is in the range of tens of thousands. (Bob Briscoe) I guess my question is, your presentation told us about the problems. If we've known about these problems for 19 years or so, could we do a presentation on a solution space? Is there any solution space? (John Klensin) Let me give you a different answer - we've had these problems for somewhere between two and six thousand years. (Bob Briscoe) Time to fix it. (John Klensin) Absolutely time to fix it. (Bob Briscoe) You might do it before something goes to full standard. (John Klensin) The fundamental issues here really rely on two things. One of which is that we can design very, very highly distinguishable fonts. We would possibly need to design highly distinguishable fonts across the entire Unicode set, and they would be so ugly nobody would want to use them. We, in theory, could teach everyone about all of the 6,000 separate languages, and the only slightly smaller number of scripts, in the world, but that isn't going to happen. So the answer to your question is that there's a tremendous amount of reliance on user interface design here. And what we need to understand is that there's both a problem and an opportunity. 
The opportunity, which is very important, is for people to use the internet in their own script, in their own language, in their own environments. That's really important. Our problems arise when we start looking at, and operating in, environments which one of us doesn't understand. I'm gradually learning to recognize a few Chinese characters, but my ability to read Chinese or Japanese or Korean is zero. I don't know about you. But if your situation with regard to Chinese characters is the same as mine, and I send you a message in Chinese characters, we are both going to have a problem. If I send a message in a script I can read, or an identifier in a script I can read, but you can't, you've got a whole series of problems. You can't read the characters, you probably can't figure out how to put the characters into a computer even if you can read them, and you're going to be easily tricked. And we're going to have to learn how to deal with that, just as we've had to learn about the non-interoperability of human languages. If I have a face-to-face conversation with you using a language which only one of us understands, then at a minimum we're going to have an interoperability problem. At a maximum, if I can make that language sound enough like something that you expect to hear, or you can do that to me, then we may have a nasty spoofing problem. And again, these issues are thousands of years old. And we have kind of learned to cope. We learn to cope by being careful, and we learn to cope by remembering that those little boxes are a big warning sign that we may not be able to read something. Many of us have started filtering out any email which arrives in a script we can't read, because we know we're not going to be able to read it anyway. And those are the kinds of things we do. It's very, very close to the user level. And I don't think there are any easy answers. 
But the alternative to this situation is to say, oh, oops, terrible, there might be a security problem, so nobody gets to use their own script - and that answer is completely unacceptable. (Yoshiro Yoneya) All right. From my experience with the internationalization of protocols, one of the hardest issues is to keep backward compatibility. Inventing an encoding is a way to get interoperability, or backward compatibility, with an existing protocol. That's the reason why there are many encodings. So I hope to have generic migration guidelines for protocol internationalization; that would be very good future work. (Stuart Cheshire) I think one of the things we need to be careful of - it's easy to fall into the trap of saying we need to be backward compatible. But then the string actually means something else, and if the thing at the receiving end doesn't know that it means something else, we've not got international text. We have lots of percent signs. (Larry Masinter) This is actually a followup about how long we've known about the problem. I'll take some blame. In 1993, I think, there was an internet draft where I proposed internationalization of URLs, based on discussions in 1992, when I thought it was a simple problem. It just used UTF-8, and there were regular URLs and internationalized ones. But I think part of the problem was the switch in thinking: these weren't names, they weren't identifiers, they were locators. The notion of comparing two of them to see if they were the same was not a requirement. And at the time, there were no caches. And so the notion of figuring out whether or not this URL was the same as that one wasn't part of the protocol stack. And, therefore, some of the problems we're seeing - the idea behind phishing, that you would actually look at the name and believe something merely because you saw it on your screen - didn't have anything to do with where you were trying to go. 
That was a requirement that was added after the fact without a lot of thought. And, if you think about it, we've added on some requirements that maybe shouldn't be there. And so, I think if you look at all of your examples, there are still some problems even if you don't try to compare. But almost all of the problems that you've listed really have to do with comparison, and of locators. You had a lot of things: look at this and that, and are they the same or different? And if you didn't have the problem of a user trying to decide ahead of time whether or not they were the same, you wouldn't see a problem. (Dave Thaler) Larry, these issues exist when a system decides to take a label or a string, which can be user input, and compare it with something which is stored in a database. That's the classic matching and lookup problem, in DNS or otherwise, and then the question is whether the answer - whether or not those matched - meets user expectations. There's no way to avoid that particular problem, other than requiring the user to have universal knowledge of exactly what's stored. And I do mean exactly. (Larry Masinter) No. I think, if you put a human communication in the loop - if you're going to print something on the side of a bus that you want people to type into their computers - it is your responsibility, at the time that you print it on the bus, to do it in a way in which the users will have a satisfactory experience. It is not the responsibility of the intermediate system to make up for the fact that the printing was something that could be an O or could be a zero, or could be an L or a one - you know, you get a password and I can't tell, because the font used was bad. 
It's the responsibility of the printer to do that in a way that will cause appropriate behavior, and not to choose to print things that are unrecognizable or have ambiguous forms. There are lots of systems that never go through that phase of translating into a perceptual representation and translating back, and expecting that to happen. So, I think that we can make progress by being more careful about what we choose to accept as requirements of the overall communication system. (Dave Thaler) I just want to comment on one of the things you said, about whether most of the problems are due to such and such. I want to summarize that we actually talked about at least two big categories of problems. One category of problems is when there are multiple Unicode strings - in other words, multiple sets of Unicode code point numbers - that can be confused, or matched or whatever, with each other. There's one set of things that are inherent in that, and it's a lot about user interface, display, and so on. The second is multiple encodings of a single set of Unicode code points. Those are two fairly different sets of problems that we talked about tonight. (Larry Masinter) I think if you follow the paths, these differing alternate forms that look the same don't fall from the sky. They don't appear magically in the middle of the system. There's some data path that either transmits them and along the way is screwing them up, or there's some human perceptual path along the way that involves printing things out, or reading them out loud, and transcribing them in a way that's inappropriate. (Dave Thaler) Pete, do you have a follow-up on this? (Pete Resnick) I do. I actually disagree with Larry, at one level. We're talking about identifiers being used for user interaction that are also being used for machine interaction, for protocols. And that's inevitably going to get screwed up, because the stuff that we use for user interaction has variants - it's got humans involved. 
Once a user has to type and interpret something, and there are variations of how it might be typed or interpreted based on context, there's nothing to be done. What we've done is increased the probability of that happening, from those 37-odd characters to tens of thousands of characters - incredibly. I used to be much more in the camp of 'we have to straighten this out by using proper encodings, and then this is done' - ten years ago, if you had said to me that today I would say such a thing, I would have thought it ridiculous. You know, e-mail is no longer reliably delivered because of spam. I don't care anymore if e-mail is not delivered because a user cannot type in the e-mail address exactly the way I put it on the screen. There's no way to make that precise. If we get unlucky, the person who chose that e-mail address gets what they paid for. (Stuart Cheshire) I want to add one clarification to Larry's point. When we talk about comparing strings, we're not talking about showing two strings to the user and saying, do you think these are the same? (Larry Masinter) That was one category. (Stuart Cheshire) We were talking about when a DNS server has a million names in its zone files, and a query comes in for the name the user typed - the DNS server has to go through its own files and work out which record that query addresses. And you mentioned the subject of phishing. That's not a requirement that the IETF decided to put on identifiers. That's something that criminals decided would be lucrative for them, and we have to think about the consequences. (Larry Masinter) Let me see if I can clarify something. I'm not saying it's not a problem. I'm trying to point out where I think it is going to be most productive to look for solutions. And that is putting restrictions on what is output or displayed, in such a way that it is unambiguous how to enter it in a way that would be reliable. 
(Stuart Cheshire) Okay. (Larry Masinter) And to focus on that area. (John Klensin) You're asking people who design user input and output procedures to constrain their designs in a way which makes things unambiguous. My experience with telling designers what they can and cannot do has been pretty bad. (Larry Masinter) Somebody is going to have to do something, and trying to patch it somewhere else is not going to be effective. (John Klensin) Another way of looking at this is that these problems would be vastly diminished if we let no one on the internet who wasn't trained to be sophisticated about these kinds of things. And while there were times in my life when I probably would have approved of 'nobody uses a computer unless they pass the training course and get a license', I think that's probably harder to do than constraining designers. (Spencer Dawkins) Spencer Dawkins, and probably the least clued person on this topic that has stood up so far. So I'm thinking the kind of questions I would ask would be triage kinds of questions. So, is this situation getting worse, or have we already hit the bottom? (Stuart Cheshire) It's still getting worse. We think of it as an educational process in which we continue to learn. (Spencer Dawkins) How much better does it have to get before it's good? Before it's okay? I mean, how much do we have to fix? (Stuart Cheshire) I think with a big identifier name space, there are always going to be problems. Our goal is to minimize the unnecessary problems. (Spencer Dawkins) I see e-mails coming through and it's disappointing. (Stuart Cheshire) I think people who are working on this problem have job security - sort of like in security and anti-spam and so on. As long as human languages continue to exist, as long as there are humans using the network, the problems will exist. The one way to make them go away would be to remove all the humans. (Spencer Dawkins) So, tell me if I've got this right. 
That, once upon a time, there was ASCII and there was the other system. And people on each side wanted to get to the resources on the other side. So, there was a death match, and we picked ASCII and life went on. Are we in any danger of losing that kind of coverage today? I mean, do people worry that they can't get places, in other scripts and things like that? Do people see this as a problem? And John has been, you know, demonstrating this on napkins and stuff like that for me for a while, just as a curiosity kind of thing, so I congratulate you guys for managing to scare the hell out of me yet again. But, like I say, I'm kind of curious about that. So, I'll sit down. So you asked a question there, I think at the end - I think part of the question you're implying is 'how often are people actually running into problems today, right?' (Spencer Dawkins) Basically, like I said, the ASCII thing is, there's a computer I need to get to, and I can't get there. There's a computer in Saudi Arabia that I can't type the name of. How big of a problem is that? (Stuart Cheshire) As an example, in some of the cases I showed, applications are trying to deal with the fact that there are multiple encodings; it's the corner cases that fail. People run into that, but not very often. So people have done a good job of compensating for that. But we keep it as rare as possible. But the phishing attacks - whenever somebody tries to do something dangerous, hopefully that isn't accomplished either. I'm going to close the mic lines now, we have about 5 more minutes. Do we have a followup there? (Bob Briscoe) Maybe the question could be better posed as 'do we think there are sufficient protocols and languages that we're standardizing, for applications that need to be secure to be able to be?'
And what I'm thinking is, if you're viewing a font and an encoding through an application that's some business, you know, important thing, legal, whatever, could the application writer say, well, normally in your locale you'd be restricted to this range, so if anything outside that range comes in, I can warn, et cetera, et cetera, and I can sign all your fonts and encodings. Do you think there's enough support there for an application to do that? (Stuart Cheshire) I think there is scope for heuristics to spot specific behavior, but it's trial and error, and they tend to be developed over a long period of time. When you find something that doesn't work, you find the particular heuristic. (Dave Crocker) So I got up before Pete, to ask you, Stuart, about the end of your presentation, but my question is predicated on exactly the point that Pete was making. Which is that much of the mess right now - well, there are inherent complexities in the topics, but most of the mess is a layer violation that we created in simpler times, and the simpler times probably helped things a lot back then, in terms of making the internet usable. Making the arpanet usable. So that e-mail addresses, and later web URLs, and to a large extent domain names, had this user interface use and this over-the-wire use. We made a lot of things simple that way, but we built the problem we have now. And we continue to try to maintain the layer violation, and say that's okay, we have to do that. The end of your presentation didn't phrase it this way, but essentially was saying, no, maybe we really don't, and we certainly should try not to. That is, we should go to a canonical over-the-wire representation. What the little I've touched of this area seems to suffer a lot from - and it will suffer even without this, but it suffers worse - is the difficulty of drawing the distinction between the user interface, human factors, human-side stuff, and the over-the-wire use.
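Stuart's remark about heuristics that spot specific behavior can be made concrete. A common, much-deployed example is flagging labels that mix confusable scripts - the classic homograph trick of a Cyrillic 'а' inside an otherwise Latin name. The sketch below is illustrative only (both function names are made up); real checkers use the Unicode script property and confusables data rather than this crude name-based test.

```python
import unicodedata

def scripts_in(label: str) -> set:
    """Crude script detection: the first word of each letter's Unicode
    character name (e.g. 'LATIN', 'CYRILLIC'). Enough for a demo."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

def looks_suspicious(label: str) -> bool:
    """Flag labels mixing Latin and Cyrillic letters - a common
    homograph pattern (Cyrillic 'а' U+0430 vs Latin 'a' U+0061)."""
    found = scripts_in(label)
    return "LATIN" in found and "CYRILLIC" in found

assert not looks_suspicious("paypal")
assert looks_suspicious("p\u0430ypal")   # Cyrillic а in place of Latin a
```

As Stuart says, such heuristics are trial and error: each one catches a pattern someone already exploited, and legitimate mixed-script names (common in some locales) force exceptions.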
And I totally understand the resistance to it. But our job is to fix problems, and we really need to be careful we don't just maintain them. In the years that the internationalization work has gone on, we've been having to deal with some realities that forced us to make decisions that do maintain them. So I think that your suggestion at the end is, I mean, it's charmingly '70s. It's 'go back to canonical forms over the wire.' And so the question has to do with achievability. How do we get there? Do we get there before we get to IPv6? Do we get there before we retire? Well, some of us anyhow? I mean, it's clearly the right goal. But is there anything practical about the goal, and if so, how? (Stuart Cheshire) I think moving to UTF-8 does not solve all of the problems that we talked about, not by a long way. But it solves one of them: at least we know which characters we're talking about when we're trying to decide if they're equal. Who will solve it is implementers writing software - they need to write their software that way. Working groups writing standards need to specify that. I think I'm less pessimistic than you are about the prospects of moving in a good direction here. And in the interest of time, we'll take the last question. (John Klensin) Dave, I think the other part of the answer is precisely that we have to stop taking the shortcuts. Of assuming that by dropping a mechanism for internationalized characters into something which was designed for ASCII only, that that's a solution to the problem. Occasionally it will be a solution to the problem. But we may have to start thinking, for the first time in our lives, seriously about presentation layers and identifiers which work in this kind of environment, rather than things that have been patched for a little bit of internationalization in an ASCII environment. And I don't think those problems are insurmountable.
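Stuart's claim - that agreeing on UTF-8 on the wire at least tells you which characters you are comparing - can be shown in a few lines. This is an illustrative aside, not from the session: the point is that the same user-visible name yields different raw bytes under different legacy charsets, so byte comparison is meaningless until the encoding is fixed.

```python
# "café" serialized under two plausible legacy choices of charset.
latin1_bytes = "café".encode("latin-1")   # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")       # b'caf\xc3\xa9'

# Byte-wise, the same name looks different depending on the sender's
# charset - a receiver comparing raw bytes cannot tell they match.
assert latin1_bytes != utf8_bytes

# Once both sides know the wire format (say, UTF-8 everywhere),
# decoding recovers the same sequence of characters on both ends.
assert utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1")
```

This is only the first layer; as Stuart notes, after the encoding question is settled you still face normalization, case mapping, and confusables.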
I don't think the problems of getting serious about localization sensitivity are unsolvable. But we need to get serious and start working on them at some stage. (Dave Crocker) The layer violation is the reason why the Internet is successful, and popular. Right? (John Klensin) Was. (Dave Crocker) Was. Well, is. It is. But I'll put in the process plug. We had a bar BOF in Stockholm, and there was a lot of interest in internationalized resource identifiers - in taking that document, which is an RFC, to proposed standard. But having 9 different solutions and 9 different committees for how we approach the problem seems like a bad idea. And there are a lot of different groups working on their own solution for how to go about it. And I'm hoping we can converge into a single, if somewhat interesting, working group. So, I encourage you to consider it. Public dash IRI, and I think IRI for the working group. (Olaf Kolkman) All right. Thank you. With that, I'll ask the rest of the IAB to come up on stage and we'll take general questions. So while the rest of the IAB comes to the stage for the open mic session - there was a suggestion yesterday to keep things short. I would like to remind the audience of that; we all have our own responsibilities here. In previous sessions I wrote a mail to the audience before the plenary saying 'if you have a question, please write us a mail.' That would help us to actually think about an answer, and answer concisely. And it would help to think about the question a little bit. That never happened, really. But I think that might be a mechanism for short mic lines and intelligent answers to your questions. So, please keep that in mind for the rest of the year: the open mic is not the only way you can approach the IAB, or the community. With that said, is there anything somebody wants to bring to the mic? Oh, I should introduce us all. Let's start at the far end with Jon. (Olaf Kolkman) Okay. Thank you.
(Tina Tsou) So, the document that the IETF produced, the one that is about NATs. (Dave Thaler) It is mostly a repeat of a lot of the same points that have been made on the topic. If I can sum up what the RFC, or the RFC-to-be, says about the IAB's thoughts: the most important point is to preserve end-to-end transparency. Now, there are multiple solutions that preserve end-to-end transparency. So IPv6 NAT is something that's in the solution category. Now, it's possible to do translation in ways that preserve end-to-end transparency, it's possible to use tunneling, it may be possible to do other things. The IAB statement is that it's important to preserve end-to-end transparency. And there is no statement saying there must be NAT in IPv6 or not. So the first main point: on the question of whether NAT could be done in a way that preserves end-to-end transparency, the document argues neither for nor against that. The second main point of the document is that there exist a number of things that people see as advantages in IPv4 NATs, that they use them for. Renumbering - all the things brought up in the IETF in the past. Some of them were documented previously in RFC 4864, some of them were not entirely there, and so we elaborated on those. Those are things that people see as requirements for solutions. Today the simplest solution that people see is v6 NAT, but that may or may not be the only or the best solution. And so, the second point was: there are some requirements there that the community needs to work on solutions for. That's basically it, to sum up. Anybody else want to add anything? That's what the IAB's thoughts are. Once you get into ways to meet those requirements, that's for the IETF to figure out. But we wanted to comment on what we believe the requirements are, what the constraints are, and to what extent NAT does or does not meet those requirements.
(Gregory Lebovitz) I think the other thing that we tried to make very clear in the document was that every time you use a NAT to solve one of those problems, you give up something significant. And we tried to call out what those things were - the trade-off and the cost associated. (Dave Oran) Well, I just want to mention that it is somewhat difficult to establish transparency in any translation system, but it is in fact possible. Where you run into trouble is trying to take the simple approach, where you attempt to confine the translation state to individual boxes with no coordination of that state, and that results in the non-invertibility of the transformation and the loss of transparency. So, something I would encourage the community to do is to look at NATs not simply as 'what can I get away with, doing the minimal amount of work, in order to maybe get something that I want', because the consequences in negative terms for transparency are pretty severe. And with somewhat more work, translation-like approaches may in fact be quite acceptable. (Olaf Kolkman) Yes, my apologies for being a little bit fast a minute ago in trying to close the mic lines; I see that people have queued up. Peter, please. (Peter Lothberg) Okay. I'm Peter. So, there was an obvious reason why we have IPv4 NATs, and then people made them do all sorts of fantastic things. I think people have them because they want more addresses, because they have more things inside their houses - and I don't want to go there. But a major use of it is as some kind of gatekeeper, some kind of policy. It's a policy box that sits there and implements what policy I decide I want to have coming into my house. I look out the door, the doorbell rings, how do they look, would I let them in or not. So, last time we forgot to do any work in the IETF and we ended up with a mess.
I heard talks about smart grids, intelligent houses - and assume for a second that we use IPv6 addresses on all of them and they have their own unique address: we still want policy. And those devices are so small, they probably need somebody to help them. So maybe somebody in the IETF should go look at this: okay, in the future we still need a policy control device that sits at the boundary of something and something, in order to enforce policy, to make sure the pool man gets to the pool and the alarm company gets to the alarm, and vice versa. And let's get that done before people make more kludges. (Gregory Lebovitz) Don't we have those? Aren't they called firewalls? (Dave Oran) Can I jump in? So, as somebody who has been skeptical of most things firewalls do for as long as I can remember, I absolutely agree with Peter. However, in some cases, trying to capture the correct policy semantics simply at the individual packet layer, as one would with a firewall, runs into many, many, many problems that make things in fact worse rather than better. If the firewall simply processes on a per-packet basis, there's lots of -- if the box attempts to do packet inspection, either shallow or deep, I think we're all aware of the problems that that type of approach runs into. So there's going to be some kind of application intermediary that's going to be needed for various applications to enforce policy. Get used to it. Don't try to do everything by per-packet processing and firewalls, or even worse, try to guess what the correct policy ought to be for an application by doing intermediate inspection of packets. (Peter Lothberg) I was thinking more of the Swiss Army knife solution here. I wasn't only thinking packet inspection. I was thinking, this is the thing where I actually have stored my policy, and all the devices I have in the house actually go to it - it's the system where the policy gets stored, the database, the PKI, blah, blah, blah. (Dave Oran) Then we agree.
But it doesn't necessarily have to be the gateway box that sits physically at the boundary. (Peter Lothberg) Correct. (Dave Oran) But people don't want to buy many boxes. (Peter Lothberg) Yes, they only want one box that needs to get attacked. Right. (Remi Despres) This is a follow-on to the first question. I think you made the point that end-to-end transparency is something important. And of course, I do agree. Now, yesterday something strange happened with reference to this. Among the interesting technologies which are proposed to restore end-to-end transparency and move in the right direction, there is one for IPv4: the extension of addresses with port ranges. Now, there was a BOF on A plus P. Now, some of the major birds of a feather, those people who are interested in the subject, were not permitted to talk - to present, I mean - to present their contribution to that. And the conclusion was that we would no longer be permitted to talk in any group about this approach to end-to-end transparency. I still expect that there will be a reversal of that decision, that it will be possible to work on A plus P in this area. (Dave Thaler) Part of that is a question for the IESG, and part of it is a question for the IAB. And I'll comment on the IAB portion, which is about end-to-end transparency versus, say, the evolution of the IP model. The IAB has another document about what the assumptions of that model are, the impact of changing them, and whether those changes should be made. Obviously, the whole point of the evolution of the IP model is to say, well, the IP model does evolve, but evolution has to happen carefully. Architecturally, what I think those at the BOF are trying to weigh is this: for IPv4 you have some inherent problems - we know that.
So, on one hand, they - not we the IAB, those in the BOF - are looking at one alternative that might give better end-to-end transparency, but more changes (for some definition of changes). And at the other extreme, there might be less end-to-end transparency. There's no single right answer, because architecturally the answer is: remove the limitations of IPv4 and go to IPv6 - that would be the architectural solution which gives you end-to-end transparency and preserves the model. So we see a tussle between two sets of requirements that are trying to be met, but cannot be met at the same time architecturally. (Remi Despres) For the tussle to be resolved, it should be possible to talk and explain. (Dave Thaler) That's the question for the IESG, not the IAB. (Remi Despres) Okay. Thank you for the information. (Lorenzo Colitti) Since it's open mic - as somebody who deployed an IPv6 network and services, which started because we needed more address space. If I look at the papers - I read the paper about the botnet, where researchers got control of the botnet and had access to all of the machines behind it, and 80% had private IP addresses. That's a high number, and it tells us that the internet would be dead and buried if we didn't have NAT - but let's remember that it was created to fight address shortage. You know the old saying, when all you have is a hammer. Yes, we started doing that because of address shortage - 'well, it gives us security' - well, not really. Multihoming - kind of. If we want those benefits, let's think outside the box. Not do it the same way. David was saying this. People want to do things the same way as they're used to, but there's benefit to a clean slate. There's a protocol that allows you to do things in very different ways - apply security policies through the last 64 bits of the IP address. You can do all of this. Try to think outside the box.
Don't do it the same way, and think of all the operational cost that's involved in having different scopes and different addresses. Finally, we use public IPv6 addresses internally. And I can tell you, it's refreshingly simple to have one address. You just know what's going on. And all the security benefits that NAT ostensibly has - personally, I don't buy them. And I don't think they would be comparable even to the gain that you have when you can actually understand things and have a clean, simple design. (Jon Peterson) That was not a question, but a contribution to the discussion? (Lorenzo Colitti) Open mic, right. (Jon Peterson) Certainly, if 80% of those hosts are behind NATs, that tells you something about the security that NATs grant. (Erik Kline) I'd like to add as well: could we possibly make a requirement that anybody who wants to implement IPv6 NAT actually run a reasonably large network for, say, oh, a year? Because I'm concerned about lots of things being done based off of experiments and not valid requirements. (Jon Peterson) Well, we never did that with IPv4 NATs, and still they were developed. (Erik Kline) Right. But everybody by then had several years of actual experience with IPv4. (Jon Peterson) I think I'm going to disagree with that suggestion. I think it's a dangerous thing, because I guarantee you, whoever puts that amount of effort in will become committed to it. (Olaf Kolkman) I'm going to carefully look around. If there are no further initiatives to move to the mic - and I don't see any - so, now, again, very slowly: going, going, gone. Thank you.
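Lorenzo's aside about applying security policy through the last 64 bits of an IPv6 address - the interface identifier, which stays the same regardless of which /64 prefix the host sits in - can be sketched as follows. This is purely illustrative, not from the session: the function names and the allowed-identifier set are invented, and a real deployment would do this in a firewall or controller, not application code.

```python
import ipaddress

# A hypothetical policy keyed on the interface identifier (low 64 bits)
# of an IPv6 address, independent of the network prefix.
ALLOWED_IIDS = {0x1001}   # e.g. the known IID of a permitted device

def interface_id(addr: str) -> int:
    """Extract the low 64 bits (interface identifier) of an IPv6 address."""
    return int(ipaddress.IPv6Address(addr)) & ((1 << 64) - 1)

def permitted(addr: str) -> bool:
    """Check an address against the policy, ignoring its /64 prefix."""
    return interface_id(addr) in ALLOWED_IIDS

# The same device identifier is recognized under different prefixes.
assert permitted("2001:db8:1:2::1001")
assert permitted("2001:db8:ffff:3::1001")
assert not permitted("2001:db8:1:2::dead")
```

The design point being illustrated is Lorenzo's: with globally unique addresses and no NAT, policy can key on stable, meaningful address bits instead of reverse-engineering translated flows.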