If you work with distributed systems, HTTP referenceable services, and specifically caching systems, you probably should check out On the Intrinsic Locality Properties of Web Reference Streams.
This paper introduces two ideas around how to measure the temporal locality of a reference stream (web logs for example). I prefer to think of this reference stream as an event stream because the implications of the paper is broader than just web server logs.
Typically the popularity distribution of reference streams tend to follow a Zipf curve. Or more accurately, they are Zipf-like. Fitting an event stream to this curve has been proposed as a means to measure the effects of popularity on the stream.
As noted in the 'Zipf-like' paper, there are some implications. For one, the concentration of web accesses follow a (25-40)/70 rule (not a 10/90 or 20/80 rule).
Being able to fit your event stream to this distribution is quite useful when attempting to evaluate the value of adding a cache to the system. For one, you can simulate your event stream against difference cache replacement policies. This is assuming your event distribution fits Zipf. If you have a small number distinct events (just a few thousand to ten of thousands of documents) you may not be able to fit the distribution.
Back to the first paper, which does two things.
First, they recognize there are three operations on web reference streams as they pass through the internet: aggregation, disaggregation, and filtering. Where aggregation is typically a function of a proxy or a server, disaggregation can also be a function of a proxy and web client, and filtering again is in the domain of a proxy and a client.
Second, the notions of entropy and coefficient of variation (CV) are introduced. Entropy is a measure of popularity of events in a stream via the probability an event (reference) happens. CV is a measure of the correlation of events in a stream via the likelihood an event will happen some time after it has happened previously.
Entropy was introduced by C. E. Shannon in this paper. The 'Locality' paper introduces how to calculate normalized entropy. Normalized entropy has no dimension so the entropy of different event streams can be compared.
In R, normalized entropy would be like this:
entropy <- function( events ) {
frequency <- as.numeric( table( events ) )
prob <- frequency / sum( frequency )
HX <- -sum( prob * log2( prob ) )
H0 <- log2( length( unique( events ) ) )
Hn <- HX / H0
# compensate for imprecision
if( ( 1 - Hn ) < 0 )
Hs <- Inf
else
Hs <- -log10( 1 - Hn )
return( Hs )
}
Where 'events' is a vector of symbols (urls) from a web log file.
Sadly I've yet to decipher how to calculate CV (or specifically IAT-CV). Time permitting I'll dig through the references.
But it all boils down to this.
... both aggregation and disaggregation tend to increase the popularity component of temporal locality, while tending to somewhat decrease the correlation component of temporal locality.
Using entropy and CV calculations, the authors have shown the impact of stream aggregation and disaggregation. In terms of a web topology, you know where best to put a cache, and you can evaluate the relative effectiveness of a set of cache replacement policies against a given event stream.
Unfortunately this does not give me a fitted distribution to simulate an event stream.
But do consider there is a strong correlation between entropy and cache hit ratio, but only a correlation. Can Entropy Characterize Performance of Online Algorithms? will hopefully provide more insight.
Leave a comment