Presence Interdomain Scaling Analysis for SIP/SIMPLE

Summary: Needs a YES.

(Adrian Farrel) Discuss

Discuss (2009-09-23)
A good and thorough job. Thank you.

The Security Considerations section seems light on two counts.

Firstly, the very scaling problems that are described constitute a 
security issue since they will gum up the network, CPU, and memory. To
what extent is a system at risk from the uncontrolled use of presence
subscriptions? Can an attack be mounted by simulating a large number 
of subscriptions?

Secondly, it is worth examining the impact on scaling that is 
introduced by applying security to the messages that are exchanged.
Of course, it should be noted that disabling security to achieve 
better scaling is not recommended, but there may be choices and 
trade-offs worth setting out.

Fix this with a couple of new paragraphs in an RFC Editor note.
Comment (2009-09-23)
I love the new standard unit
"dozens of gigabytes of presence traffic per day"

(Cullen Jennings) Discuss

Discuss (2009-10-08)
In any modeling situation it is hard to decide what is important and what is not and get the underlying model right. I like the underlying model that is in this draft, but I have significant problems with many of the other parts. The constants that are fed into the model seem selected more to show that some other optimization work is needed than chosen to reflect a typical deployment. The results of the model don't support the conclusions. Some of the conclusions I find very difficult to believe. I think this draft needs significant rework and review. 

On to some specifics 

It is not clear if the scope is enterprise networks or consumer networks. These seem to have very different constants for the model, and the two often seem blended by this draft. 

Section 2.2. 

The numbers given are not very useful without knowing what they are on a per-user basis, and you don't provide the number of active and registered users for this data. One of the premises of this work was that we would get, from real networks, information about churn rates, average buddy-list sizes, rates of presence changes, and so on. If we cannot get any real numbers to work with, then the draft should only express an abstract model and not go on to actual rates. 

I'd like someone to explain to me how we can have 2000 logins per second and 4000 logouts.
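A quick steady-state sanity check makes the objection concrete. The login/logout rates are the ones quoted above; the 10-million-user online population is an illustrative assumption, not a figure from the draft:

```python
# Sanity check: in steady state, login and logout rates must balance.
# If logouts (4000/s) exceed logins (2000/s), the online population
# drains at the difference.

LOGINS_PER_SEC = 2000
LOGOUTS_PER_SEC = 4000

def online_after(initial_online: int, seconds: int) -> int:
    """Online users after `seconds` at the stated rates, clamped at zero."""
    net = (LOGINS_PER_SEC - LOGOUTS_PER_SEC) * seconds
    return max(0, initial_online + net)

# With 10 million users online, the network would empty in ~83 minutes:
drain_seconds = 10_000_000 // (LOGOUTS_PER_SEC - LOGINS_PER_SEC)
print(drain_seconds)                   # 5000 (seconds)
print(online_after(10_000_000, 5000))  # 0
```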

Section 2.3.1

C01 - 8 hours seems highly unlikely for a large consumer network. Why?

C02 - The implication that we have 8*3 = 24 state changes per day does not match any real-world stats I have seen. Do you really think this is right?

C03 - This refresh of one hour is the single thing in the draft that I disagree with the most. Why would anyone use 1 hour? If you need fast refreshes for some reliability scheme, then this is far too high, and I have seen deployments with 10 minutes for various reasons. If you don't need it for the reliability scheme, why not use something more like infinity, or 30 days? The results in the conclusions are highly sensitive to this number, and 1 hour is not a choice I can see any rational reason for choosing. 
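The sensitivity complained about here is easy to see: refresh traffic per subscription scales inversely with the refresh interval. A minimal sketch (all intervals illustrative):

```python
# Refresh messages per subscription per day = 86400 / interval_seconds,
# so totals scale inversely with the C03 refresh interval.

def refreshes_per_day(interval_seconds: int) -> float:
    """SUBSCRIBE refreshes per subscription per day for a given interval."""
    return 86400 / interval_seconds

one_hour = refreshes_per_day(3600)           # 24.0 refreshes/day
ten_minutes = refreshes_per_day(600)         # 144.0 refreshes/day
thirty_days = refreshes_per_day(30 * 86400)  # ~0.033 refreshes/day

# Moving from a 1-hour to a 30-day interval divides refresh traffic by
# (30 * 86400) / 3600:
reduction_factor = (30 * 86400) // 3600
print(one_hour, ten_minutes, reduction_factor)  # 24.0 144.0 720
```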

I really don't like the practice of taking message sizes from RFCs. I'd like some real-world data. The example messages in the RFCs are not reflective of real traffic. 

On the calculations that show bytes for various things, I would like to see bytes or bandwidth normalized per user for all of them. 
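The normalization asked for above is a one-liner. All inputs below are illustrative placeholders, not figures from the draft:

```python
# Divide aggregate byte counts by the number of online users to get the
# per-user bandwidth the review asks for.

def per_user_bps(total_bytes_per_day: int, online_users: int) -> float:
    """Average sustained bits/second per online user."""
    return total_bytes_per_day * 8 / 86400 / online_users

# e.g. "dozens of gigabytes per day" spread over 100k online users is
# small per user:
print(per_user_bps(50 * 10**9, 100_000))  # ~46.3 bps per user
```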

In 2.8.1 the number of federated subscriptions moves to 20. Any data to support this in large domains? I can see how this might happen in small domains, but again, back to the scope: it's clear scaling is not a problem in small-domain federation. 

I love how in section 2.10 you just sort of slip in the change to the number of state changes per day per user, going from 24 to 48. How many times do you change your status a day?
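The reason this slipped-in constant matters: each state change generates one NOTIFY per active watcher, so notify load scales linearly with it. A sketch with an illustrative watcher count:

```python
# Fan-out arithmetic for the state-change constant: doubling the
# changes-per-day figure doubles the NOTIFY load linearly.

def notifies_per_user_per_day(changes_per_day: int, watchers: int) -> int:
    """NOTIFYs generated per presentity per day (one per watcher per change)."""
    return changes_per_day * watchers

print(notifies_per_user_per_day(24, 10))  # 240
print(notifies_per_user_per_day(48, 10))  # 480 - doubles with the constant
```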

Section 3.1 

All of the state sizes seem way too big. How did you get these? What do you store for a subscription that takes 2K?

Section 4

Section 4.1: I'm confused about what this is trying to say. I think it says "computing presence creates a load on the presence server". The same goes for 4.2, 4.3, and 4.4. I'm just failing to get the point of these. 

Section 5

Do etags help? I have heard from some people that they do not, because they increase the CPU load of the server, and the CPU is the constrained resource, not anything else. I have no idea if this is true or not, but it would be nice if this work could investigate the impact of etags: how they help, how they make things worse. 

The idea of using RLS as a BCP never seems to have consensus any time it is brought up. The IETF has consistently had a strong desire for privacy around presence data. The market is moving rapidly away from systems that don't support privacy of presence data. Why would we do a BCP that suggests your presence data should be public to all your subscribers, that they should all get the same thing, and that we should basically toss all our privacy? I agree it would reduce CPU load on the presence servers. 

The content indirection discussion seems to sort of avoid the aggregation/privacy issues. 

Section 7.

On the 1st bullet point on RLMI: I believe this may be true, but I don't see how the data here shows that. 

On point 2: it may reduce the number of messages, but not by a lot, and it is far less clear if it reduces the constrained resources. I am skeptical that it reduces the number of bytes substantially, and I don't think the current model is adequate to evaluate that, due to the model's assumptions about the constant size of NOTIFYs. 

Point 3: This seems to be just not true with sigcomp. In the non-sigcomp case, I'd believe this is true, but I don't see how this data or model shows that. 

Point 4: The claim that this cuts the number of bytes in half does not pass the laugh test. The idea that removing the responses to all the requests cuts the message count in half is clear, but I don't know of any system where this would greatly change the performance. Treating the resource load of 200s as the same as that of requests is not an OK model of how these systems work. 
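Why halving the message count need not halve the bytes: the saving depends on the size ratio between a 200 OK and a NOTIFY, and responses are typically much smaller. The message sizes below are illustrative assumptions, not figures from the draft:

```python
# Byte saving from eliminating the 200 OK responses, as a fraction of
# total bytes on the wire for each NOTIFY/200 exchange.

def byte_saving(notify_bytes: int, ok_bytes: int) -> float:
    """Fraction of bytes saved by dropping the 200 OK responses."""
    return ok_bytes / (notify_bytes + ok_bytes)

# With an assumed 2000-byte NOTIFY and 400-byte 200 OK, dropping the
# responses saves only ~17% of the bytes, not 50%:
print(byte_saving(2000, 400))  # ~0.167
```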

Point 5. Uh, given the traffic I see on networks, I don't believe this one. 

You talk about investing lots. Could you say how much per user?

The paragraph that talks about how no one thought about scalability and reliability when designing SIP is pretty insulting. Do you really believe this? 

The paragraph about needing a different protocol between domains: I see no evidence to support that this would help. I believe that certain optimizations are possible where domain A trusts domain B to enforce security policy, but that can be done in the same, not a different, protocol. Either way, I see no data to support this conclusion. I strongly disagree with the general amount of unsubstantiated suggestion that there are much better ways to do this. 

On the paragraph about redundant messages, I think you are totally missing the privacy issues. Given that privacy is per watcher, how are these redundant? When using sigcomp, how much extra data gets sent? Is this a problem?

The final paragraph, about carrying NOTIFY as media, seems, uh, well, pretty confused. If your policy allowed this, it seems that having the intermediate proxies just not Record-Route would accomplish the same thing? Do you agree?

Section 10

The assumptions this draft is making in the conclusions about ways to optimize SIMPLE seem to greatly reduce the privacy capabilities of SIP. 


It is very unclear what you think the constrained resource is. Is it CPU, number of messages, bandwidth, or memory? What is the part we need to model here?

For many deployments, bandwidth is not a constraint, but the ones where it is use sigcomp. I'd like to see the actual bandwidth numbers with sigcomp as well as without.

In summary, you can see I don't think the document actually shows most of the things it claims in the conclusion. I believe that it would be very harmful to publish the document as is. The numbers will be used by folks to justify work that is not needed and to claim that other protocols are 10x better than SIMPLE. I don't mind claims like that if they are true, but working off these numbers, I would not feel they are true. 

For a variety of reasons I did not have as much time as I would have liked to review this draft, and I did not look at the underlying model or calculations carefully. I understand parts of my review may be unclear, and I am glad to work with folks to help clarify and fix any parts of it.

(Robert Sparks) Yes

(Jari Arkko) No Objection

(Ron Bonica) No Objection

(Ross Callon) No Objection

(Ralph Droms) No Objection

Comment (2009-10-07)
Is there a specific reason not to include the total number of registered users in each scenario in section 2?  The enterprise scenario includes:

   o  About half of the registered users were online at peak time.

but that doesn't seem useful unless the number of registered users is provided.

(Pasi Eronen) No Objection

(Russ Housley) No Objection

(Alexey Melnikov) No Objection

(Tim Polk) No Objection

Comment (2009-09-24)
I support Adrian's discuss.

(Dan Romascanu) No Objection

Comment (2009-09-24)
I like the document and I salute this work - it's one of the first I have seen analysing the impact of a successful protocol or application on Internet load and on the CPU load of hosts and servers. I would also have liked to see some mention of the impact and scalability aspects of the management systems and operator tools. Although that impact is not as direct or as easily measurable, it is a serious concern for operators when dealing with successful protocols.

(Magnus Westerlund) No Objection

Roman Danyliw No Record

Martin Duke No Record

Lars Eggert No Record

Comment (2009-10-07)
I really wonder if an ID/RFC is the optimal way to present these results in a readable form...

Benjamin Kaduk No Record

Erik Kline No Record

Murray Kucherawy No Record

Warren Kumari No Record

Francesca Palombini No Record

Alvaro Retana No Record

Zaheduzzaman Sarker No Record

John Scudder No Record

Martin Vigoureux No Record

Éric Vyncke No Record

Robert Wilton No Record