I have two groups of files that contain data in CSV format with a common key (Timestamp) - I need to walk through all the records chronologically.
Group A: 'Environmental Data'
- Filenames are in format A_0001.csv, A_0002.csv, etc.
- Pre-sorted ascending
- Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
- Contains environmental data in CSV/column format
- Very large, several GBs worth of data
Group B: 'Event Data'
- Filenames are in format B_0001.csv, B_0002.csv, etc.
- Pre-sorted ascending
- Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
- Contains event-based data in CSV/column format
- Relatively small compared to Group A files, < 100 MB
What is the best approach?
- Pre-merge: Use one of the various recipes out there to merge the files into a single sorted output and then read it for processing
- Real-time merge: Implement code to 'merge' the files in real-time
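To make the real-time merge option concrete, here is a minimal sketch of how it could look using `heapq.merge`, which lazily merges pre-sorted streams without loading them into memory. The column name `Timestamp` and the glob patterns are assumptions based on the description above; adjust them to the real layout.

```python
import csv
import glob
import heapq

def read_group(pattern):
    """Yield (timestamp, row) pairs from a group of pre-sorted CSV files.

    Assumes each file has a header row containing a 'Timestamp' column.
    Files are visited in filename order, which matches chronological
    order given the A_0001, A_0002, ... naming scheme.
    """
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                # YYYY-MM-DD HH:MM:SS timestamps sort correctly as strings
                yield row["Timestamp"], row

# Lazily merge both streams in timestamp order; memory use stays
# bounded no matter how large the files are.
merged = heapq.merge(read_group("A_*.csv"), read_group("B_*.csv"),
                     key=lambda pair: pair[0])

for ts, row in merged:
    pass  # process each record chronologically here
```

Because this streams both groups, each post-processing iteration re-reads the raw files; whether that beats a one-off pre-merge depends on how many passes you make over the data.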
I will be running lots of iterations of the post-processing side of things. Any thoughts or suggestions? I am using Python.