
I have a real dataset of sequences of events and a "fake" dataset generated by an LSTM model. The two datasets share the same vocabulary but differ in length. I'm putting together an evaluation script to assess how similar the two datasets are, and one of the metrics should be a comparison of the rankings of the top 500 most frequent n-grams in the real data and in the fake data. Since the top 500 most frequent n-grams in the real data might differ from those in the fake data, I don't know which ranking measure would give me a clear idea of how similar the two datasets are. Does anyone know of a measure that allows for such a difference? I need one that treats the real dataset as the gold truth and compares the fake dataset against it.

Any help would be much appreciated.

Boris

1 Answer


The best way to proceed is to take the top 500 most frequent n-grams in each dataset, compute the intersection of the two lists, and then use the two rankings restricted to that intersection to compute Kendall's W, a measure of rank agreement between two (or more) rankings:

https://en.wikipedia.org/wiki/Kendall%27s_W

Implementation in Python:

https://stackoverflow.com/questions/48893689/kendalls-coefficient-of-concordance-w-in-python
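For reference, here is a minimal self-contained sketch of the whole procedure (the helper names `top_ngrams`, `kendalls_w` and `compare` are illustrative, not from any library, and it assumes each dataset is a list of token sequences):

```python
from collections import Counter

def top_ngrams(sequences, n, k=500):
    """Return the k most frequent n-grams across all sequences, most frequent first."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return [gram for gram, _ in counts.most_common(k)]

def kendalls_w(rankings):
    """Kendall's W for m complete, tie-free rankings of the same n items.

    `rankings` is a list of m rank lists aligned by item index (ranks 1..n).
    """
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean = sum(rank_sums) / n
    s = sum((rs - mean) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

def compare(real_seqs, fake_seqs, n=2, k=500):
    """Kendall's W over the intersection of the two top-k n-gram lists."""
    real_top = top_ngrams(real_seqs, n, k)
    fake_top = top_ngrams(fake_seqs, n, k)
    shared = [g for g in real_top if g in set(fake_top)]  # kept in real-data order
    if len(shared) < 2:
        raise ValueError("Too few shared n-grams to compare rankings.")
    # Re-rank the shared n-grams within each list (1 = most frequent).
    real_rank = {g: i + 1 for i, g in enumerate(shared)}
    fake_rank = {g: i + 1
                 for i, g in enumerate(g for g in fake_top if g in real_rank)}
    ranks = [[real_rank[g] for g in shared], [fake_rank[g] for g in shared]]
    return kendalls_w(ranks)
```

W ranges from 0 (no agreement) to 1 (identical rankings); for exactly two rankings it is related to Spearman's rho by W = (rho + 1) / 2. Since W says nothing about how many of the top-500 n-grams actually overlap, it may be worth reporting `len(shared) / k` alongside it as a complementary statistic.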
