How can I check files that I already processed in a script so I don't process those again? and/or What is wrong with the way I am doing this now?
Hello, I am running tshark with the ring buffer option to dump to files after 5MB or 1 hour. I wrote a python script to read these files in XML and dump into a database, this works fine.
My issue is that this is really process intensive, one of those 5MB can turn into a 200MB file when converted to XML, so I do not want to do any unnecessary processing.
The script is running every 10 minutes and processes ~5 files per run, since is scanning the folder where the files are created for any new entries, I dump a hash of the file into the database and on the next run check the hash and if it isn't in the database I scan the file. The problem is that this does not appear to work every time, it ends up processing files that it has already done. When I check the hash of the file that it keeps trying to process it doesn't show up anywhere in the database, hence why is trying to process it over and over.
I am printing out the filename + hash in the output of the script:
using file /var/ss01/SS01_00086_20100107100828.cap with hash: 982d664b574b84d6a8a5093889454e59 using file /var/ss02/SS02_00053_20100106125828.cap with hash: 8caceb6af7328c4aed2ea349062b74e9 using file /var/ss02/SS02_00075_20100106184519.cap with hash: 1b664b2e900d56ca9750d27ed1ec28fc using file /var/ss02/SS02_00098_20100107104437.cap with hash: e0d7f5b004016febe707e9823f339fce using file /var/ss02/SS02_00095_20100105132356.cap with hash: 41a3938150ec8e2d48ae9498c79a8d0c using file /var/ss02/SS02_00097_20100107103332.cap with hash: 4e08b6926c87f5967484add22a76f220 using file /var/ss02/SS02_00090_20100105122531.cap with hash: 470b378ee5a2f4a14ca28330c2009f56 using file /var/ss03/SS03_00089_20100107104530.cap with hash: 468a01753a97a6a5dfa60418064574cc using file /var/ss03/SS03_00086_20100105122537.cap with hash: 1fb8641f10f733384de01e94926e0853 using file /var/ss03/SS03_00090_20100107105832.cap with hash: d6209e65348029c3d211d1715301b9f8 using file /var/ss03/SS03_00088_20100107103248.cap with hash: 56a26b4e84b853e1f2128c831628c65e using file /var/ss03/SS03_00072_20100105093543.cap with hash: dca18deb04b7c08e206a3b6f62262465 using file /var/ss03/SS03_00050_20100106140218.cap with hash: 36761e3f67017c626563601eaf68a133 using file /var/ss04/SS04_00010_20100105105912.cap with hash: 5188dc70616fa2971d57d4bfe029ec46 using file /var/ss04/SS04_00071_20100107094806.cap with hash: ab72eaddd9f368e01f9a57471ccead1a using file /var/ss04/SS04_00072_20100107100234.cap with hash: 79dea347b04a05753cb4ff3576883494 using file /var/ss04/SS04_00070_20100107093350.cap with hash: 535920197129176c4d7a9891c71e0243 using file /var/ss04/SS04_00067_20100107084826.cap with hash: 64a88ecc1253e67d49e3cb68febb2e25 using file /var/ss04/SS04_00042_20100106144048.cap with hash: bb9bfa773f3bf94fd3af2514395d8d9e using file /var/ss04/SS04_00007_20100105101951.cap with hash: d949e673f6138af2d388884f4a6b0f08
The only files it should be doing are one per folder, so only 4 files. This causes unecessary processing and I have to deal with overlapping cron jobs + other services been affected.
What I am hoping to get from this post is a better way to do this or hopefully someone can tell me why is happening, I know that the latter might be hard since it can be a bunch of reasons.
Here is the code (I am not a coder but a sys admin so be kind :P) line 30-32 handle the hash comparisons. Thanks in advance.