I'm looking for datasets for evaluating algorithms for finding top-k on data streams (e.g.).
I currently have a network trace from CAIDA and some self-generated datasets of i.i.d. Zipf-distributed items.
I'm looking for real-life datasets that are heavy-tailed, i.e., for any fixed k, the top-k elements make up only a small portion of the stream.
Any suggestions for heavy-tailed datasets that are available for academic research and are used to evaluate streaming algorithms?
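
To make the heavy-tail criterion concrete, here is a minimal sketch of the check I would run on a candidate dataset: the share of the stream taken up by its k most frequent elements. The function name `top_k_fraction` is my own, and it counts exactly in memory, so it is meant for offline inspection of a dataset, not as a streaming algorithm.

```python
from collections import Counter
from typing import Hashable, Iterable


def top_k_fraction(stream: Iterable[Hashable], k: int) -> float:
    """Return the fraction of stream items covered by the k most frequent elements."""
    counts = Counter(stream)            # exact frequency of every distinct element
    total = sum(counts.values())        # total stream length
    if total == 0:
        return 0.0                      # empty stream: nothing to cover
    top = counts.most_common(k)         # (element, count) pairs, most frequent first
    return sum(count for _, count in top) / total
```

A dataset fits what I am after if this value stays small for the values of k I care about, e.g. `top_k_fraction(items, 100)` well below 1.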