
This is rather a practical question. I'm looking for an efficient way of calculating the frequency of an event for a large number of samples. Here's a more concrete example.

Let's say that I have a system with millions of users. Each user has many different features that I can use to categorize them into classes. One of them is an event (say, clicking) that each user generates once in a while. I'd like to use the frequency of clicking as an input feature. How would you calculate that frequency efficiently?

The brute-force answer is to store each click as a pair (timestamp, 1). Then, for each new incoming event, I can arrange these pairs into a window made of a list of buckets. Each bucket covers a time range, and its value is the number of pairs that fall into it. Finally, I compute the FFT to transform the time-domain window into a frequency spectrum, which is the input feature for my classifier.
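To make the brute-force approach concrete, here is a minimal sketch in Python with NumPy. The function name `spectrum_from_events`, the window bounds, and the bucket count are all illustrative assumptions, not part of any existing system. It buckets a user's event timestamps into a fixed window and takes the FFT of the bucket counts.

```python
import numpy as np

def spectrum_from_events(timestamps, window_start, window_end, n_buckets):
    """Bucket event timestamps into a fixed window and return the magnitude spectrum.

    All parameter names are hypothetical; the bucket width is
    (window_end - window_start) / n_buckets.
    """
    # Each bucket value is the number of events that fall into its time range.
    counts, _ = np.histogram(
        timestamps, bins=n_buckets, range=(window_start, window_end)
    )
    # rfft is enough because the bucket counts are real-valued.
    return np.abs(np.fft.rfft(counts))

# Made-up data: one event every 4 seconds over a 64-second window.
events = np.arange(0.5, 64.0, 4.0)
spec = spectrum_from_events(events, 0.0, 64.0, 64)
```

Recomputing this for every incoming event costs O(N log N) per user per event for a window of N buckets, which is the cost the question is trying to avoid.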

It seems to me that doing so for millions of users who are constantly generating events is very heavy processing. Is there a lighter way of calculating (or even estimating) such a frequency spectrum for events that occur over time?

Mehran

1 Answer


Sounds like more of a resource issue, but it is still related to data science, because of its final objective.

Dealing with millions of users could require a lot of memory and computing power.

That's why client-side processing should be a priority, for example with JavaScript.

On the other hand, it is worth starting with a data analysis of the clicks (mean number of clicks per person, mean time spent in a session, etc.).

This is important for setting rules on when to call the database and save the information.

For instance, you could count clicks on the client side and save the count to the database every (mean time spent)/2.

The aim is to reduce requests to the server side as much as possible, without having to use a long time-out.

In addition to that, if you collect enough click data, it is possible to do some interesting stats (rush hours, functions performance, most used functions, ...) and adapt the server-side or client-side processing to it.

Nicolas Martin
  • Thanks for the answer. Unfortunately, my use case does not have a client side. I know I mentioned clicks, but that was only to explain the problem better. My actual problem is call logs (phone calls), which have only a server side. While your solution is a totally valid one (reducing the number of times the transformation runs), I was hoping to learn about a trick to estimate the frequency spectrum of my events in an accumulative way. I'm not sure such a trick exists though! Thanks again. – Mehran Jun 21 '22 at 19:25
  • Are call logs sound data? Could you give a small example ? – Nicolas Martin Jun 21 '22 at 22:55
  • No, call logs are tabular data. Like: start_timestamp, end_timestamp, call_status (accepted, rejected, didn't pick up, voicemail), along with source and destination phone numbers. What I'm trying to achieve is to have a vector for how often each phone number makes calls and use this to classify them into categories (for example, whether the phone calls are made by a human or a machine, but not necessarily just that). – Mehran Jun 22 '22 at 18:25
  • Maybe a first data analysis could be helpful to at least define the right frequency range, using a small random sample. For instance, if 80% of users call 8 times a week on average, maybe the week is the best time reference. Then you can fine-tune your model by adding the time of day when the person calls the most (morning, 5-6pm, etc.), or the duration between calls. Once you've done a good data analysis, you can easily apply dimensionality reduction with UMAP, or anomaly detection. – Nicolas Martin Jun 22 '22 at 18:50
  • Thanks, that helps. But I think there's still one aspect of this problem that is misunderstood. I'm trying to implement a "live"/"online" system. Meaning that I need to calculate the frequency spectrum for each client on-the-fly. This is NOT an offline calculation. That's why I asked my question to begin with. Calculating something like FFT offline is completely doable but online, that's a different story, especially for each incoming event. I'm trying to classify each client as they are using the system. – Mehran Jun 23 '22 at 15:51
  • Did you try the sliding DFT? https://stackoverflow.com/questions/6663222/doing-fft-in-realtime – Nicolas Martin Jun 23 '22 at 16:00
  • No, I hadn't, and it sounds very promising. I didn't know about this, and thank you for introducing me to it. – Mehran Jun 23 '22 at 16:07
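
For readers who land here, the sliding DFT mentioned in the comments can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the linked thread: the class name `SlidingDFT`, the window size, and the Poisson test data are all made up. The idea is that when one new bucket count enters the window and the oldest one leaves, each tracked bin is updated as X_k ← (X_k + x_new − x_old) · exp(2πik/N), which is O(1) per bin instead of a full FFT.

```python
from collections import deque

import numpy as np

class SlidingDFT:
    """Hypothetical per-user state: the last `window_size` bucket counts and their DFT bins."""

    def __init__(self, window_size, n_bins):
        self.window = deque([0.0] * window_size, maxlen=window_size)
        k = np.arange(n_bins)
        # Rotation applied to bin k each time the window advances by one bucket.
        self.twiddle = np.exp(2j * np.pi * k / window_size)
        self.bins = np.zeros(n_bins, dtype=complex)

    def push(self, count):
        """Advance the window by one bucket and update every tracked bin."""
        oldest = self.window[0]
        self.window.append(count)  # maxlen makes the deque drop the oldest value
        self.bins = (self.bins + count - oldest) * self.twiddle
        return self.bins

# Made-up stream of per-bucket event counts for a single user.
counts = np.random.default_rng(0).poisson(2.0, size=200)
sdft = SlidingDFT(window_size=32, n_bins=17)
for c in counts:
    sdft.push(float(c))

# Sanity check against a full FFT over the same last 32 buckets.
reference = np.fft.rfft(counts[-32:])
```

One caveat: because each update reuses the previous result, floating-point rounding error accumulates over long runs, so it is common to recompute the bins from the stored window periodically or to apply a damping factor slightly below 1. Tracking only the few bins the classifier actually needs keeps the per-event cost small.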