Here is a tiny app that uses deep sampling to find tumors in any disk or directory. It walks the directory tree twice, once to measure it, and the second time to print out the paths to 20 "random" bytes under the directory.
void walk(string sDir, int iPass, int64& n, int64& n1, int64 step){
foreach(string sSubDir in sDir){
walk(sDir + "/" + sSubDir, iPass, n, n1, step);
}
foreach(string sFile in sDir){
string sPath = sDir + "/" + sFile;
int64 len = File.Size(sPath);
if (iPass == 2){
while(n1 <= n+len){
print sPath;
n1 += step;
}
}
n += len;
}
}
void dscan(){
int64 n = 0, n1 = 0, step = 0;
// pass 1, measure
walk(".", 1, n, n1);
print n;
// pass 2, print
step = n/20; n1 = step/2; n = 0;
walk(".", 2, n, n1);
print n;
}
The output looks like this for my Program Files directory:
7,908,634,694
.\ArcSoft\PhotoStudio 2000\Samples\3.jpg
.\Common Files\Java\Update\Base Images\j2re1.4.2-b28\core1.zip
.\Common Files\Wise Installation Wizard\WISDED53B0BB67C4244AE6AD6FD3C28D1EF_7_0_2_7.MSI
.\Insightful\splus62\java\jre\lib\jaws.jar
.\Intel\Compiler\Fortran\9.1\em64t\bin\tselect.exe
.\Intel\Download\IntelFortranProCompiler91\Compiler\Itanium\Data1.cab
.\Intel\MKL\8.0.1\em64t\bin\mkl_lapack32.dll
.\Java\jre1.6.0\bin\client\classes.jsa
.\Microsoft SQL Server\90\Setup Bootstrap\sqlsval.dll
.\Microsoft Visual Studio\DF98\DOC\TAPI.CHM
.\Microsoft Visual Studio .NET 2003\CompactFrameworkSDK\v1.0.5000\Windows CE\sqlce20sql2ksp1.exe
.\Microsoft Visual Studio .NET 2003\SDK\v1.1\Tool Developers Guide\docs\Partition II Metadata.doc
.\Microsoft Visual Studio .NET 2003\Visual Studio .NET Enterprise Architect 2003 - English\Logs\VSMsiLog0A34.txt
.\Microsoft Visual Studio 8\Microsoft Visual Studio 2005 Professional Edition - ENU\Logs\VSMsiLog1A9E.txt
.\Microsoft Visual Studio 8\SmartDevices\SDK\CompactFramework\2.0\v2.0\WindowsCE\wce500\mipsiv\NETCFv2.wce5.mipsiv.cab
.\Microsoft Visual Studio 8\VC\ce\atlmfc\lib\armv4i\UafxcW.lib
.\Microsoft Visual Studio 8\VC\ce\Dll\mipsii\mfc80ud.pdb
.\Movie Maker\MUI\0409\moviemk.chm
.\TheCompany\TheProduct\docs\TheProduct User's Guide.pdf
.\VNI\CTT6.0\help\StatV1.pdf
7,908,634,694
It tells me that the directory is 7.9gb, of which
- ~15% goes to the Intel Fortran compiler
- ~15% goes to VS .NET 2003
- ~20% goes to VS 8
It is simple enough to ask if any of these can be unloaded.
It also tells about file types that are distributed across the file system, but taken together represent an opportunity for space saving:
- ~15% roughly goes to .cab and .MSI files
- ~10% roughly goes to logging text files
It shows plenty of other things in there also, that I could probably do without, like "SmartDevices" and "ce" support (~15%).
It does take linear time, but it doesn't have to be done often.
Examples of things it has found:
- backup copies of DLLs in many saved code repositories, that don't really need to be saved
- a backup copy of someone's hard drive on the server, under an obscure directory
- voluminous temporary internet files
- ancient doc and help files long past being needed