views: 528

answers: 6

I am writing a routine to compare two files using memory-mapped files. In case a file is too big to be mapped in one go, I split it and map it part by part. For example, to map a 1049MB file, I split it into 512MB + 512MB + 25MB.

Everything works fine except one thing: it always takes much, much longer to compare the remainder (25MB in this example), even though the code logic is exactly the same. Three observations:

  1. it does not matter which part is compared first; whether the main part (512MB * N) or the remainder (25MB in this example) comes first, the result remains the same
  2. the extra time on the remainder seems to be spent in user mode
  3. profiling in VS2010 Beta 1 shows the time is spent inside std::_Equal(), but this function is mostly (the profiler says 100%) waiting for I/O and other threads

I tried

  • changing the VIEW_SIZE_FACTOR to another value
  • replacing the lambda functor with a member function
  • changing the file size under test
  • changing the order of execution of the remainder to before/after the loop

The result was quite consistent: the remainder part takes a lot more time, and the time is spent in user mode.

I suspect it has something to do with the fact that the mapped size is not a multiple of the mapping alignment (64K on my system), but I am not sure how.

Below is the complete code for the routine and timings measured for a 3GB file.

Can anyone please explain it? Thanks.

// using memory-mapped files
template <size_t VIEW_SIZE_FACTOR>
struct is_equal_by_mmapT
{
public:
    bool operator()(const path_type& p1, const path_type& p2)
    {
        using boost::filesystem::exists;
        using boost::filesystem::file_size;
        using boost::iostreams::mapped_file_source;

        try
        {
            if(!(exists(p1) && exists(p2))) return false;

            // a multiple of the mapping alignment by construction
            const size_t segment_size = mapped_file_source::alignment() * VIEW_SIZE_FACTOR;

            // lambda: compare one segment of both files
            boost::function<bool(size_t, size_t)> segment_compare =
            [&](size_t seg_size, size_t offset)->bool
            {
                boost::chrono::run_timer t;

                mapped_file_source mf1, mf2;

                mf1.open(p1, seg_size, offset);
                mf2.open(p2, seg_size, offset);

                if(!(mf1.is_open() && mf2.is_open())) return false;

                return std::equal(mf1.begin(), mf1.end(), mf2.begin());
            };

            boost::uintmax_t size = file_size(p1);
            size_t round     = static_cast<size_t>(size / segment_size);
            size_t remainder = static_cast<size_t>(size % segment_size);

            // compare the remainder
            if(remainder > 0)
            {
                cout << "segment size = "
                     << remainder
                     << " bytes for the remaining round";
                if(!segment_compare(remainder, segment_size * round)) return false;
            }

            // compare the main part; this takes much less time
            for(size_t i = 0; i < round; ++i)
            {
                cout << "segment size = "
                     << segment_size
                     << " bytes, round #" << i;
                if(!segment_compare(segment_size, segment_size * i)) return false;
            }
        }
        catch(std::exception& e)
        {
            cout << e.what();
            return false;
        }

        return true;
    }
};

typedef is_equal_by_mmapT<(8<<10)> is_equal_by_mmap;  // 8192 * 64K alignment = 512MB

output:

segment size = 354410496 bytes for the remaining round

real 116.892s, cpu 56.201s (48.1%), user 54.548s, system 1.652s

segment size = 536870912 bytes, round #0

real 72.258s, cpu 2.273s (3.1%), user 0.320s, system 1.953s

segment size = 536870912 bytes, round #1

real 75.304s, cpu 1.943s (2.6%), user 0.240s, system 1.702s

segment size = 536870912 bytes, round #2

real 84.328s, cpu 1.783s (2.1%), user 0.320s, system 1.462s

segment size = 536870912 bytes, round #3

real 73.901s, cpu 1.702s (2.3%), user 0.330s, system 1.372s


More observations after the responders' suggestions

I further split the remainder into a body and a tail (remainder = body + tail), where

  • body = N * alignment(), and tail < 1 * alignment()
  • body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is even
  • body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is a power of 2
  • body = N * alignment(), and tail = remainder - body, where N is random

The total time remains unchanged, but I can see that the time does not necessarily relate to the tail, but to the sizes of body and tail: the bigger part takes more time. The time is USER TIME, which is the most incomprehensible part to me.

I also looked at the page faults through Procexp.exe; the remainder does NOT incur more faults than the main loop.


Update 2

I've performed some tests on other workstations, and it seems the issue is related to the hardware configuration.

Test Code

// compare the remainder, alternative way
if(remainder > 0)
{
    cout << "Remainder size = "
         << remainder
         << " bytes \n";

    // alignment_size == mapped_file_source::alignment()
    size_t tail = (alignment_size - 1) & remainder;
    size_t body = remainder - tail;

    {
        boost::chrono::run_timer t;
        cout << "Remainder_tail size = " << tail << " bytes";
        if(!segment_compare(tail, segment_size * round + body)) return false;
    }
    {
        boost::chrono::run_timer t;
        cout << "Remainder_body size = " << body << " bytes";
        if(!segment_compare(body, segment_size * round)) return false;
    }
}

Observation:

On another 2 PCs with the same h/w configuration as mine, the result is consistently as follows:

------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------

Remainder size = 44840960 bytes

Remainder_tail size = 14336 bytes

real 0.060s, cpu 0.040s (66.7%), user 0.000s, system 0.040s

Remainder_body size = 44826624 bytes

real 13.601s, cpu 7.731s (56.8%), user 7.481s, system 0.250s

segment size = 67108864 bytes, total round# = 19

real 172.476s, cpu 4.356s (2.5%), user 0.731s, system 3.625s

However, running the same code on a PC with a different h/w configuration yielded:

------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------

Remainder size = 44840960 bytes

Remainder_tail size = 14336 bytes

real 0.013s, cpu 0.000s (0.0%), user 0.000s, system 0.000s

Remainder_body size = 44826624 bytes

real 2.468s, cpu 0.188s (7.6%), user 0.047s, system 0.141s

segment size = 67108864 bytes, total round# = 19

real 65.587s, cpu 4.578s (7.0%), user 0.844s, system 3.734s

System Info

My workstation, which yields the incomprehensible timing:

OS Name: Microsoft Windows XP Professional

OS Version: 5.1.2600 Service Pack 3 Build 2600

OS Manufacturer: Microsoft Corporation

OS Configuration: Member Workstation

OS Build Type: Uniprocessor Free

Original Install Date: 2004-01-27, 23:08

System Up Time: 3 Days, 2 Hours, 15 Minutes, 46 Seconds

System Manufacturer: Dell Inc.

System Model: OptiPlex GX520

System type: X86-based PC

Processor(s): 1 Processor(s) Installed.

                       [01]: x86 Family 15 Model 4 Stepping 3 GenuineIntel ~2992 Mhz

BIOS Version: DELL - 7

Windows Directory: C:\WINDOWS

System Directory: C:\WINDOWS\system32

Boot Device: \Device\HarddiskVolume2

System Locale: zh-cn;Chinese (China)

Input Locale: zh-cn;Chinese (China)

Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi

Total Physical Memory: 3,574 MB

Available Physical Memory: 1,986 MB

Virtual Memory: Max Size: 2,048 MB

Virtual Memory: Available: 1,916 MB

Virtual Memory: In Use: 132 MB

Page File Location(s): C:\pagefile.sys

NetWork Card(s): 3 NIC(s) Installed.

       [01]: VMware Virtual Ethernet Adapter for VMnet1

             Connection Name: VMware Network Adapter VMnet1

             DHCP Enabled:    No

             IP address(es)

             [01]: 192.168.75.1

       [02]: VMware Virtual Ethernet Adapter for VMnet8

             Connection Name: VMware Network Adapter VMnet8

             DHCP Enabled:    No

             IP address(es)

             [01]: 192.168.230.1

       [03]: Broadcom NetXtreme Gigabit Ethernet

             Connection Name: Local Area Connection 4

             DHCP Enabled:    Yes

             DHCP Server:     10.8.0.31

             IP address(es)

             [01]: 10.8.8.154

Another workstation, which yields "correct" timing:

OS Name: Microsoft Windows XP Professional

OS Version: 5.1.2600 Service Pack 3 Build 2600

OS Manufacturer: Microsoft Corporation

OS Configuration: Member Workstation

OS Build Type: Multiprocessor Free

Original Install Date: 5/18/2009, 2:28:18 PM

System Up Time: 21 Days, 5 Hours, 0 Minutes, 49 Seconds

System Manufacturer: Dell Inc.

System Model: OptiPlex 755

System type: X86-based PC

Processor(s): 1 Processor(s) Installed.

        [01]: x86 Family 6 Model 15 Stepping 13 GenuineIntel ~2194 Mhz

BIOS Version: DELL - 15

Windows Directory: C:\WINDOWS

System Directory: C:\WINDOWS\system32

Boot Device: \Device\HarddiskVolume1

System Locale: zh-cn;Chinese (China)

Input Locale: en-us;English (United States)

Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi

Total Physical Memory: 3,317 MB

Available Physical Memory: 1,682 MB

Virtual Memory: Max Size: 2,048 MB

Virtual Memory: Available: 2,007 MB

Virtual Memory: In Use: 41 MB

Page File Location(s): C:\pagefile.sys

NetWork Card(s): 3 NIC(s) Installed.

       [01]: Intel(R) 82566DM-2 Gigabit Network Connection

             Connection Name: Local Area Connection

             DHCP Enabled:    Yes

             DHCP Server:     10.8.0.31

             IP address(es)

             [01]: 10.8.0.137

       [02]: VMware Virtual Ethernet Adapter for VMnet1

             Connection Name: VMware Network Adapter VMnet1

             DHCP Enabled:    Yes

             DHCP Server:     192.168.154.254

             IP address(es)

             [01]: 192.168.154.1

       [03]: VMware Virtual Ethernet Adapter for VMnet8

             Connection Name: VMware Network Adapter VMnet8

             DHCP Enabled:    Yes

             DHCP Server:     192.168.2.254

             IP address(es)

             [01]: 192.168.2.1

Any explanatory theory? Thanks.

+1  A: 

How fragmented is the file you are comparing with? You can use FSCTL_GET_RETRIEVAL_POINTERS to get the ranges that the file maps to on disk. I suspect the last 25 MB will have a lot of small ranges to account for the performance you have measured.

MSN
Thanks for the tip. But it's unlikely to be caused by fragmentation, as I've tested on multiple files ranging from MBs to GBs.
t.g.
+1  A: 

I wonder if mmap behaves strangely when a segment isn't an even number of pages in size? Maybe you can try handling the last parts of the file by progressively halving your segment sizes until you get to a size that's less than mapped_file_source::alignment(), and handle that last little bit specially.

Also, you say you're doing 512MB blocks, but your code sets the size to 8<<10. It then multiplies that by mapped_file_source::alignment(). Is mapped_file_source::alignment() really 65536?

I would recommend, to be more portable and cause less confusion, that you simply use the size given in the template parameter and require that it be an even multiple of mapped_file_source::alignment() in your code. Or have people pass in the power of two to start at for the block size, or something. Having the block size passed in as a template parameter and then multiplied by some strange implementation-defined constant seems a little odd.

Omnifarious
Regarding alignment(): mapped_file::alignment() is a static member function returning the operating system's virtual memory allocation granularity, which is not implementation-defined but platform-defined. When using mmap, "the offset must be a multiple of the allocation granularity" (http://msdn.microsoft.com/en-us/library/aa366763%28VS.85%29.aspx). Using a FACTOR (VIEW_SIZE_FACTOR) instead of a SIZE (VIEW_SIZE) as the template parameter guarantees that rule is not violated, and also spares the user the labor of finding the granularity.
t.g.
Also, (8<<10) is just a way to ease human reading, as granularities are usually in KBs, and (1<<10) is 1K; looking at the preceding digit (say, 8) tells approximately how big it is in MB. And because it's a const integer, it is calculated at compile time rather than run time.
t.g.
I will change the factor to 1 to test your suggestion. Thanks.
t.g.
My point about the size is that I think it's better to take the actual size of your segments as the template parameter and just error out if it isn't a multiple of the platform-specific value.
Omnifarious
+2  A: 

This behavior looks quite illogical. I wonder what would happen if we tried something stupid. Provided the overall file is larger than 512MB, you could compare a full 512MB again for the last part instead of just the remaining size.

something like:

        if(remainder > 0)
        {
            cout << "segment size = "
                 << remainder
                 << " bytes for the remaining round";

            size_t block_size, offset;
            if(size > segment_size)
            {
                // re-compare a full segment ending at EOF
                block_size = segment_size;
                offset     = size - segment_size;
            }
            else
            {
                block_size = remainder;
                offset     = segment_size * round;
            }
            if(!segment_compare(block_size, offset)) return false;
        }

It seems a really dumb thing to do because we would be comparing part of the file twice, but if your profiling figures are accurate it should be faster.

It won't give us an answer (yet), but if it is indeed faster it means the answer we are looking for lies in what your program does for small blocks of data.

kriss
Hihi, fantastic idea! Thinking outside the memory-mapped box.
Jonas Byström
It does not work because, that way, the alignment rule that "the offset must be a multiple of the allocation granularity" is violated, so the map fails. I did similar things, like further splitting the remainder; see my updates in the OP.
t.g.
+1  A: 

I would try it on Linux or BSD just to see how it acts, out of curiosity.

I have a really rough guess about the problem: I bet that Windows is doing a lot of extra checks to make sure it doesn't map past the end of the file. In the past there have been security problems in some OSes that allowed an mmap user to view filesystem-private data, or data from other files, in the area just past the end of the map, so being careful here is a good idea for an OS designer. So Windows may be using a much more careful "copy data from disk to kernel, zero out unmapped data, copy data to user" instead of the much faster "copy data from disk to user".

Try mapping to just under the end of the file, excluding the last bytes that don't fit into a 64K block.

Zan Lynx
I just tried it on another Windows PC. please see my updates in the OP.
t.g.
+1  A: 

I know this isn't an exact answer to your question, but have you tried side-stepping the entire problem, i.e. just mapping the entire file in one go?

I know little about Win32 memory management, but on Linux you can use the MAP_NORESERVE flag with mmap(), so you don't need to reserve RAM for the entire file size. Considering you are just reading from both files, the OS should be able to throw away pages at any time if it gets short of RAM...

Dave Rigby
When the file is too big, it fails to map. That's why I split the file into parts and map them view by view, which is the recommended way in Win32 as far as I know. When the file is small enough, yes, it can be mapped in one go with no performance problem.
t.g.
A: 

Could it be that a virus scanner is causing these strange results? Have you tried without virus scanner?

Regards,

Sebastiaan

Sebastiaan Megens
Unlikely, as I tried on a few PCs with the same virus scanner installed, and the result only seems related to the h/w configuration. I am guessing it's some kind of CPU or RAM cache effect, but I could not tell how; that's why I posted the h/w configs.
t.g.
I agree, indeed it seems unlikely. But in these strange situations I want to check all my assumptions, so I always check whether the problem is still there with the virus scanner disabled (and the firewall, when reading files from a network). Regards, Sebastiaan
Sebastiaan Megens