I have an issue regarding clustering with live, continuously streaming data. Since the data set is ever-growing, I'm not sure what the best way is to run efficient and effective clustering. I've come up with a few possible solutions, including:
Setting a limit on how many data points to keep: whenever the limit is reached and another point arrives, the oldest point is removed. Essentially, this assumes that older data is no longer relevant enough for us to care about what we lose by throwing it out.
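A minimal sketch of this first option, assuming 1-D points and a plain k-means pass over the window; the names (`WINDOW_SIZE`, `recluster`) are illustrative, and any batch clustering algorithm could be swapped in for the helper:

```python
# Option 1 sketch: keep only the most recent N points, recluster the window.
from collections import deque
import random

WINDOW_SIZE = 200  # assumed cap on stored points

def recluster(points, k, iters=25, seed=0):
    """Minimal 1-D k-means over the current window (illustrative helper)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep the old
        # center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return sorted(centers)

# Appending past maxlen silently drops the oldest point, which is
# exactly the eviction policy described above.
window = deque(maxlen=WINDOW_SIZE)

# Synthetic stream alternating between two modes near 0 and 10.
rng = random.Random(1)
stream = [rng.gauss(mu, 0.5) for _ in range(500) for mu in (0.0, 10.0)]
for point in stream:
    window.append(point)

centers = recluster(list(window), k=2)  # centers land near 0 and 10
```

In practice you would recluster every M arrivals rather than on each point, trading freshness of the centers against compute.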
Once there is enough data to produce a good clustering, treat it as "the setup"; as new points arrive, rather than re-clustering all the data, just find which cluster center each new point is closest to and add it to that cluster. The benefit is that you avoid re-clustering on every new point and don't have to store all the other points, only the cluster centers, treating this clustering as "good enough". The downside is that re-running the algorithm on all data points from the beginning may be more accurate.
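The second option can be sketched as follows, again assuming 1-D points; the initial centers stand in for "the setup", and the 1/n running-mean update (the sequential, MacQueen-style k-means rule) is one common way to fold a point into its cluster while storing only the centers and per-cluster counts:

```python
# Option 2 sketch: freeze an initial clustering, then assign each new
# point to its nearest center and nudge that center toward it.
import random

def assign_and_update(centers, counts, point):
    """Assign point to the nearest center, then update that center's
    running mean in place. Only O(k) state is kept, no raw points."""
    i = min(range(len(centers)), key=lambda j: (point - centers[j]) ** 2)
    counts[i] += 1
    centers[i] += (point - centers[i]) / counts[i]  # incremental mean
    return i

# "The setup": centers assumed to come from an initial batch clustering.
centers = [0.0, 10.0]
counts = [1, 1]

# Stream new points drawn from two modes near 0 and 10.
rng = random.Random(2)
for _ in range(1000):
    mu = rng.choice((0.0, 10.0))
    assign_and_update(centers, counts, rng.gauss(mu, 0.5))
```

Note the trade-off mentioned above in code form: the centers drift to track new data, but a point assigned early is never reassigned, so the result can diverge from what a full re-clustering would give.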
While those are some potential solutions I brainstormed, I'd like to know whether there are better-known techniques for this problem. I figure sites like Google had to deal with it somehow (and I'm hoping that "add more RAM, servers, and processors" or "continually expand your data centers" aren't the only answers available).
It sounds like you're looking for online algorithms for clustering.
I suggest searching for "online clustering" on Google Scholar. Maybe the following links will prove useful (at least as a starting point).