I am writing a routine to compare two files using memory-mapped file. In case files are too big to be mapped at one go. I split the files and map them part by part. For example, to map a 1049MB file, I split it into 512MB + 512MB + 25MB.
Every thing works fine except one thing: it always take much, much longer to compare the remainder (25MB in this example), though the code logic is exactly the same. 3 observations:
- it does not matter which is compared first, whether the main part (512MB * N) or the remainder (25MB in this example) comes first, the result remains the same
- the extra time in the remainder seems to be spent in the user mode
- Profiling in VS2010 beta 1 shows, the time is spent inside t
std::_Equal()
, but this function is mostly (profiler says 100%) waiting for I/O and other threads.
I tried
- changing the VIEW_SIZE_FACTOR to another value
- replacing the lambda functor with a member function
- changing the file size under test
- changing the order of execution of the remainder to before/after the loop
The result was quite consistent: it takes a lot more time in the remainder part and in the User Mode.
I suspect it has something to do with the fact that the mapped size is not a multiple of mapping alignment (64K on my system), but not sure how.
Below is the complete code for the routine and a timing measured for a 3G file.
Can anyone please explain it, Thanks?
// using memory-mapped file
template <size_t VIEW_SIZE_FACTOR>
struct is_equal_by_mmapT
{
public:
bool operator()(const path_type& p1, const path_type& p2)
{
using boost::filesystem::exists;
using boost::filesystem::file_size;
try
{
if(!(exists(p1) && exists(p2))) return false;
const size_t segment_size = mapped_file_source::alignment() * VIEW_SIZE_FACTOR;
// lanmbda
boost::function<bool(size_t, size_t)> segment_compare =
[&](size_t seg_size, size_t offset)->bool
{
using boost::iostreams::mapped_file_source;
boost::chrono::run_timer t;
mapped_file_source mf1, mf2;
mf1.open(p1, seg_size, offset);
mf2.open(p2, seg_size, offset);
if(! (mf1.is_open() && mf2.is_open())) return false;
if(!equal (mf1.begin(), mf1.end(), mf2.begin())) return false;
return true;
};
boost::uintmax_t size = file_size(p1);
size_t round = size / segment_size;
size_t remainder = size & ( segment_size - 1 );
// compare the remainder
if(remainder > 0)
{
cout << "segment size = "
<< remainder
<< " bytes for the remaining round";
if(!segment_compare(remainder, segment_size * round)) return false;
}
//compare the main part. take much less time, even
for(size_t i = 0; i < round; ++i)
{
cout << "segment size = "
<< segment_size
<< " bytes, round #" << i;
if(!segment_compare(segment_size, segment_size * i)) return false;
}
}
catch(std::exception& e)
{
cout << e.what();
return false;
}
return true;
}
};
typedef is_equal_by_mmapT<(8<<10)> is_equal_by_mmap; // 512MB
output:
segment size = 354410496 bytes for the remaining round
real 116.892s, cpu 56.201s (48.1%), user 54.548s, system 1.652s
segment size = 536870912 bytes, round #0
real 72.258s, cpu 2.273s (3.1%), user 0.320s, system 1.953s
segment size = 536870912 bytes, round #1
real 75.304s, cpu 1.943s (2.6%), user 0.240s, system 1.702s
segment size = 536870912 bytes, round #2
real 84.328s, cpu 1.783s (2.1%), user 0.320s, system 1.462s
segment size = 536870912 bytes, round #3
real 73.901s, cpu 1.702s (2.3%), user 0.330s, system 1.372s
More observations after the suggestions by responders
Further split the remainder into body and tail(remainder = body + tail), where
- body = N * alignment(), and tail < 1 * alignment()
- body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is even.
- body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is exponents of 2.
- body = N * alignment(), and tail = remainder - body. N is random.
the total time remains unchanged, but I can see that the time does not necessary relate to tail, but to size of body and tail. the bigger part takes more time. The time is USER TIME, which is most incomprehensible to me.
I also look at the pages faults through Procexp.exe. the remainder does NOT take more faults than the main loop.
Updates 2
I've performed some test on other workstations, and it seem the issue is related to the hardware configuration.
Test Code
// compare the remainder, alternative way
if(remainder > 0)
{
//boost::chrono::run_timer t;
cout << "Remainder size = "
<< remainder
<< " bytes \n";
size_t tail = (alignment_size - 1) & remainder;
size_t body = remainder - tail;
{
boost::chrono::run_timer t;
cout << "Remainder_tail size = " << tail << " bytes";
if(!segment_compare(tail, segment_size * round + body)) return false;
}
{
boost::chrono::run_timer t;
cout << "Remainder_body size = " << body << " bytes";
if(!segment_compare(body, segment_size * round)) return false;
}
}
Observation:
On another 2 PCs with the same h/w configurations with mine, the result is consistent as following:
------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------
Remainder size = 44840960 bytes
Remainder_tail size = 14336 bytes
real 0.060s, cpu 0.040s (66.7%), user 0.000s, system 0.040s
Remainder_body size = 44826624 bytes
real 13.601s, cpu 7.731s (56.8%), user 7.481s, system 0.250s
segment size = 67108864 bytes, total round# = 19
real 172.476s, cpu 4.356s (2.5%), user 0.731s, system 3.625s
However, running the same code on a PC with a different h/w configuration yielded:
------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------ Remainder size = 44840960 bytes
Remainder_tail size = 14336 bytes
real 0.013s, cpu 0.000s (0.0%), user 0.000s, system 0.000s
Remainder_body size = 44826624 bytes
real 2.468s, cpu 0.188s (7.6%), user 0.047s, system 0.141s
segment size = 67108864 bytes, total round# = 19
real 65.587s, cpu 4.578s (7.0%), user 0.844s, system 3.734s
System Info
My workstation yielding imcomprehensible timing:
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Uniprocessor Free
Original Install Date: 2004-01-27, 23:08
System Up Time: 3 Days, 2 Hours, 15 Minutes, 46 Seconds
System Manufacturer: Dell Inc.
System Model: OptiPlex GX520
System type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x86 Family 15 Model 4 Stepping 3 GenuineIntel ~2992 Mhz
BIOS Version: DELL - 7
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume2
System Locale: zh-cn;Chinese (China)
Input Locale: zh-cn;Chinese (China)
Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory: 3,574 MB
Available Physical Memory: 1,986 MB
Virtual Memory: Max Size: 2,048 MB
Virtual Memory: Available: 1,916 MB
Virtual Memory: In Use: 132 MB
Page File Location(s): C:\pagefile.sys
NetWork Card(s): 3 NIC(s) Installed.
[01]: VMware Virtual Ethernet Adapter for VMnet1
Connection Name: VMware Network Adapter VMnet1
DHCP Enabled: No
IP address(es)
[01]: 192.168.75.1
[02]: VMware Virtual Ethernet Adapter for VMnet8
Connection Name: VMware Network Adapter VMnet8
DHCP Enabled: No
IP address(es)
[01]: 192.168.230.1
[03]: Broadcom NetXtreme Gigabit Ethernet
Connection Name: Local Area Connection 4
DHCP Enabled: Yes
DHCP Server: 10.8.0.31
IP address(es)
[01]: 10.8.8.154
Another workstation yielding "correct" timing: OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Multiprocessor Free
Original Install Date: 5/18/2009, 2:28:18 PM
System Up Time: 21 Days, 5 Hours, 0 Minutes, 49 Seconds
System Manufacturer: Dell Inc.
System Model: OptiPlex 755
System type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x86 Family 6 Model 15 Stepping 13 GenuineIntel ~2194 Mhz
BIOS Version: DELL - 15
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume1
System Locale: zh-cn;Chinese (China)
Input Locale: en-us;English (United States)
Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory: 3,317 MB
Available Physical Memory: 1,682 MB
Virtual Memory: Max Size: 2,048 MB
Virtual Memory: Available: 2,007 MB
Virtual Memory: In Use: 41 MB
Page File Location(s): C:\pagefile.sys
NetWork Card(s): 3 NIC(s) Installed.
[01]: Intel(R) 82566DM-2 Gigabit Network Connection
Connection Name: Local Area Connection
DHCP Enabled: Yes
DHCP Server: 10.8.0.31
IP address(es)
[01]: 10.8.0.137
[02]: VMware Virtual Ethernet Adapter for VMnet1
Connection Name: VMware Network Adapter VMnet1
DHCP Enabled: Yes
DHCP Server: 192.168.154.254
IP address(es)
[01]: 192.168.154.1
[03]: VMware Virtual Ethernet Adapter for VMnet8
Connection Name: VMware Network Adapter VMnet8
DHCP Enabled: Yes
DHCP Server: 192.168.2.254
IP address(es)
[01]: 192.168.2.1
Any explanation theory? Thanks.