Interesting Application of the Zipf Distribution: Data Purging


The Zipf distribution is used to model situations in which a few observations have a very high value (or impact) and account for a large part of the total, while a very long tail of observations have medium, small, or very small values, much like the 80/20 rule. Examples include:

Distribution of keywords (ranked by popularity or traffic volume) in Google searches
Distribution of profiles on LinkedIn, ranked by number of connections
Distribution of Internet websites or webpages, ranked by Internet traffic (pageviews in a given time period)
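
To make the link with the 80/20 rule concrete, here is the usual textbook form of Zipf's law, given as a sketch rather than as the exact model used in this article. The symbols are assumptions introduced for illustration: r is the rank of an item, s > 0 is the exponent, N is the number of items, and C is a normalizing constant.

```latex
% Value (size, traffic, number of connections) of the item of rank r
f(r) = \frac{C}{r^{s}}, \qquad r = 1, 2, \dots, N

% Share of the total accounted for by the top k items
\text{share}(k) = \frac{\sum_{r=1}^{k} r^{-s}}{\sum_{r=1}^{N} r^{-s}}

% Example with s = 1, N = 1000, k = 200 (the top 20%):
% the sums are the harmonic numbers H_{200} \approx 5.878 and H_{1000} \approx 7.485
\text{share}(200) \approx \frac{5.878}{7.485} \approx 0.79
```

With an exponent close to 1, the top 20% of items account for roughly 79% of the total in this example, which is why the distribution is often compared to the 80/20 rule.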

Here we use it to model the distribution of files (or data) on a laptop or in the cloud (for a specific company), ranked by size. The idea of computing this distribution (say) for your laptop is to identify the files that can be deleted to save as much space as possible. Many users have a very large number of files on their computer, many of them of no use, slowly eating away at the large amount of space available for storage. In short, this is an applied, simple, and very practical data storage optimization problem. We also discuss this problem in the context of optimizing the resources used to store user data on large social networks such as Facebook or LinkedIn.
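
As an illustration of how this distribution could be computed, here is a minimal Python sketch. It is not the author's code: the function names (file_sizes, zipf_exponent, top_share) are hypothetical, and the exponent is estimated with a plain least-squares fit of log(size) against log(rank), which is only a rough diagnostic.

```python
import math
import os


def file_sizes(root):
    """Return the sizes (in bytes) of all regular files under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    sizes.append(os.path.getsize(path))
            except OSError:
                continue  # unreadable or vanished file: skip it
    return sorted(sizes, reverse=True)


def zipf_exponent(sizes):
    """Estimate s in size ~ C / rank**s from sizes sorted largest first.

    Uses an ordinary least-squares fit on the log-log scale (an assumption
    made for simplicity); empty files are ignored.
    """
    points = [(math.log(rank), math.log(size))
              for rank, size in enumerate(sizes, start=1) if size > 0]
    if len(points) < 2:
        raise ValueError("need at least two non-empty files")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx  # the fitted slope is negative, the exponent is positive


def top_share(sizes, fraction=0.2):
    """Share of total bytes held by the largest `fraction` of files."""
    total = sum(sizes)
    if total == 0:
        return 0.0
    k = max(1, int(len(sizes) * fraction))
    return sum(sizes[:k]) / total
```

For example, calling sizes = file_sizes(os.path.expanduser("~")) and then top_share(sizes) shows how much of your disk usage comes from the largest 20% of files, and zipf_exponent(sizes) gives a rough idea of how steep the rank-size curve is.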

The data purging process is simple and consists of three steps:

Create a list of old files: look for directories named Archives or Old, and for each one check its size, creation date, most recent file, and the number of files it contains.
Create a list of the 20 largest files. How many of them do you still need or use? On my laptop, the largest file (by far) was an app that Microsoft had automatically installed and updated over time without my knowledge, and which seemed useless to me (I did some research on the Internet about it). I uninstalled it and never experienced any issues thereafter.
Search for large groups of small files that share a specific pattern and that, when bundled together, occupy a lot of space (in my case, I found a lot of images, a few videos, and many very old invoices).
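
The three steps above can be sketched in Python as follows. This is an illustrative sketch, not the procedure the author actually ran: the function names are hypothetical, the "specific pattern" in step 3 is assumed to be the file extension, and modification time is used in place of creation date because creation time is not available portably across operating systems.

```python
import heapq
import os
from collections import defaultdict


def scan(root):
    """Yield (path, size_bytes, mtime) for each regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue
                st = os.stat(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            yield path, st.st_size, st.st_mtime


def archive_dirs(root, names=("archives", "old")):
    """Step 1: summarize directories whose name (case-insensitive) is in `names`."""
    report = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for d in dirnames:
            if d.lower() in names:
                full = os.path.join(dirpath, d)
                files = list(scan(full))
                report.append({
                    "dir": full,
                    "files": len(files),
                    "bytes": sum(size for _, size, _ in files),
                    # Most recent modification time, or None if the directory is empty
                    "newest_mtime": max((m for _, _, m in files), default=None),
                })
    return report


def largest_files(root, n=20):
    """Step 2: the n largest files as (size_bytes, path), largest first."""
    return heapq.nlargest(n, ((size, path) for path, size, _ in scan(root)))


def bytes_by_extension(root):
    """Step 3: (extension, total_bytes, file_count), largest total first."""
    totals = defaultdict(lambda: [0, 0])
    for path, size, _ in scan(root):
        ext = os.path.splitext(path)[1].lower() or "(none)"
        totals[ext][0] += size
        totals[ext][1] += 1
    return sorted(((ext, b, c) for ext, (b, c) in totals.items()),
                  key=lambda t: t[1], reverse=True)
```

None of these functions deletes anything; they only report candidates. An extension with a high file count and a large total in bytes_by_extension is a sign of many small files that add up, such as the images and old invoices mentioned above.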

You can back up all these files before deleting them.
As for large social networks, data purging consists of identifying inactive accounts or profiles, which may represent 60% of all members. For instance, Facebook has far more US profiles than there are inhabitants in the US. The next steps are to identify fake and duplicate accounts, and to consolidate duplicate accounts that are otherwise valid.
We tend to think that the amount of storage space (and Internet bandwidth) that these companies have is infinite, or that storage is so cheap that it does not matter. However, overloaded servers result in errors and slow page loading. They also force these companies to put limits on user connection graphs: on Facebook, you can only have 5,000 friends; on LinkedIn, only 30,000 connections (I reached my limit).
Another way for social networks to manage these large “constellations” of people is to offer premium services to members. LinkedIn could charge $20/month once you reach 5,000 connections, maybe $100/month when you reach 30,000 connections. It would free some space and make people more careful when accepting invitations to connect, creating a better platform: a win-win for both LinkedIn and its members.
Related articles: 

Graph Theory: Six Degrees of Separation Problem
Zipf’s distribution: great illustration

For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on Twitter at @GranvilleDSC or on LinkedIn.