July 24, 2008 / steve8

Fast Large File Transfers on Windows Shares? Jumbo Frames?

Sometimes you may want to move large amounts of data over a network. The natural and probably naive assumption is that if you have a Gigabit Ethernet network, the transfer will occur at 1Gb/sec.

Many things can prevent this from happening:

1. Flow control on the transmitting node.

If a server with Gigabit Ethernet is actively transferring data to a 100 Mb/sec client, this “feature” seems to throttle transfers to faster (Gigabit Ethernet) clients for as long as the transfer with the slower client is active.

Some drivers, including Intel’s, let you turn off flow control from the adapter’s properties in Device Manager.

2. SAMBA/CIFS/SMB/OS inefficiency.

SMB/CIFS is very convenient, as it’s fully integrated into Windows, but it is not the lightest or fastest method of transferring data. You can use FTP or many other methods to transfer data quickly, but most of them are not nearly as convenient as SMB. SMB 2.0, which Microsoft introduced with Vista and Windows Server 2008, is much better:

(Note: arrows indicate the direction of transfer, and the first computer listed is the initiator. “2008×64<-box” means that the Windows Server 2008 x64 computer initiated a copy of a file from the “box” computer to itself.)

[Chart: smbcompare_transfers (SMB transfer throughput comparison)]

This improved throughput with SMB 2.0/Vista comes at the cost of higher CPU utilization:

[Chart: smbcompare_cpu (CPU utilization during the same transfers)]

If you want high performance, use SMB 2.0, but don’t expect miracles for free.

————————————————————————————————————————–

On the topic of the OS’s role in network throughput, there are two registry changes worth making on Vista SP1 and Server 2008:

In HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile:

change NetworkThrottlingIndex to 0xffffffff

and change SystemResponsiveness to 0x64.

This essentially disables the Windows MultiMedia Class Scheduler Service’s ability to prioritize multimedia playback at the cost of networking performance.
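If you would rather script the change than edit the registry by hand, here is a minimal sketch using Python 3’s winreg module (an illustration only, assuming 64-bit Python on a default install; run it from an elevated prompt):

```python
# Sketch: apply the two MMCSS registry values described above.
# Run from an elevated (administrator) prompt; a reboot or logoff is typically
# needed before the change takes effect.
import winreg

KEY_PATH = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE) as key:
    # 0xffffffff disables the network throttling index entirely
    winreg.SetValueEx(key, "NetworkThrottlingIndex", 0, winreg.REG_DWORD, 0xFFFFFFFF)
    # 0x64 = 100 decimal, the value given above
    winreg.SetValueEx(key, "SystemResponsiveness", 0, winreg.REG_DWORD, 0x64)
```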

3. Encryption Tax

This may have a greater effect on latency than on bandwidth, since data must be processed at each end of the transfer. Nevertheless, VPN data overhead can be significant, often more than 10% (depending on the packet size distribution, since some costs are fixed per packet).
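To make the fixed-per-packet point concrete, here is a rough sketch; the 60-byte figure is an assumed ballpark for typical tunnel headers, not a measured value:

```python
# Rough illustration: a fixed per-packet cost hurts small packets far more than large ones.
PER_PACKET_OVERHEAD = 60  # assumed ballpark for VPN/tunnel headers, in bytes

for payload in (64, 256, 1400):
    pct = 100.0 * PER_PACKET_OVERHEAD / (payload + PER_PACKET_OVERHEAD)
    print(f"{payload:5d}-byte payload -> {pct:4.1f}% of the wire is VPN overhead")
```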

4. Gigabit Ethernet over the PCI Bus.

It is true that the 32-bit, 33 MHz PCI bus has a theoretical peak bandwidth over 1 Gb/sec (100/3 MHz * 32 bits = ~1.067 Gb/sec).

While this exceeds the 1 Gb/sec that Gigabit Ethernet needs, the PCI bus has protocol overhead, is not full duplex, and shares that bandwidth with every device on the bus. The effective bandwidth of a regular PCI bus is around 600 Mb/sec half-duplex, and some chipsets do better than others. PCIe x1 provides about 2 Gb/sec in each direction (4 Gb/sec aggregate, full duplex), so PCIe Gigabit Ethernet interfaces are not bottlenecked by the bus. The arithmetic is spelled out below.
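Here is that arithmetic in one place (a sketch; the ~600 Mb/sec effective figure is the rough estimate quoted above, not a measurement):

```python
# Legacy 32-bit/33 MHz PCI vs. PCIe x1 for a Gigabit Ethernet adapter.
pci_peak_bps = (100e6 / 3) * 32   # ~1.067 Gb/sec theoretical peak, shared, half duplex
pci_effective_bps = 600e6         # rough real-world estimate from above

# PCIe 1.x x1: 2.5 GT/sec per direction with 8b/10b encoding -> 2 Gb/sec usable each way.
pcie_x1_per_dir_bps = 2.5e9 * 8 / 10

print(f"PCI theoretical peak : {pci_peak_bps / 1e9:.3f} Gb/sec")
print(f"PCI realistic shared : {pci_effective_bps / 1e9:.3f} Gb/sec (below full-duplex GigE needs)")
print(f"PCIe x1 per direction: {pcie_x1_per_dir_bps / 1e9:.3f} Gb/sec (comfortably above 1 Gb/sec)")
```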

5. Hard Disc stuff.

SCSI, SAS, ATA-66/100/133, and SATA1/2 all had impressive throughput rates for their time, but the interface was never the bottleneck.

The discs themselves have sustained transfer rates (STR) limited by:

1. Linear speed L of the head, which is a function of:

  • Rotational velocity V (revolutions per unit of time); on most hard drives this is not dynamic.
  • Location of the data, r (the radial distance of the head from the spindle at the point where the data is being read or written).

So L = V*2*Pi*r. For a normal 7200-revolutions-per-minute, 3.5″ hard disc, the linear velocity at the outer edge is L = 7200*2*Pi*(3.5/2) = ~79,168 inches per minute, and less as you get closer to the center.

This linear velocity decreases linearly as you approach the spindle of the disc.

2. Linear data density d (bits per inch), which is usually proportional to the square root of the areal density (higher density means the head traverses, and can read/write, more sectors for a given linear velocity).

so

Sequential Throughput ~ L * d ~ 2 * Pi * V * r * d

So there you have it: for a given disc, sequential data throughput is linearly related to the distance from the center.

With the same rotational velocity, a 3.5″ disc has an outer edge 40% further from the center, which makes it 40% faster than the outer edge of a similarly dense 2.5″ platter; a quick numeric check follows below.
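Here is that check (a sketch; the radii are the nominal platter sizes, and the density term d is left as a common factor):

```python
import math

RPM = 7200  # rotational velocity V, revolutions per minute (fixed on most drives)

def linear_speed(radius_inches):
    """L = V * 2 * pi * r, in inches per minute."""
    return RPM * 2 * math.pi * radius_inches

# Outer edge of a 3.5" platter (r = 1.75") vs. a 2.5" platter (r = 1.25").
for diameter in (3.5, 2.5):
    r = diameter / 2
    print(f'{diameter}" platter, outer edge: {linear_speed(r):,.0f} inches per minute')

# With equal linear density d, sequential throughput ~ L * d,
# so the 3.5" outer edge is 1.75 / 1.25 = 1.4x (40% faster) than the 2.5" outer edge.
print(f'3.5" vs 2.5" outer-edge ratio: {1.75 / 1.25:.2f}x')
```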

————————————————————————–

It may not strictly be a disc problem, but the effective large-file transfer rate of a disc can be throttled if the data is fragmented across various spots on the platter, since that forces head seeks for what could have been a purely sequential operation. Seeks cost time and transfer no data.

Carefully chosen modern 7200 rpm SATA2 discs with high areal density can perform sustained sequential reads or writes close to 1 Gb/sec at the outer edge. The discs I have been messing with (Western Digital WD6400AAKS) manage around 900 Mb/sec near the outer edge, so they will bottleneck large 1,000 Mb/sec Ethernet transfers, but only a bit.

These graphs are decreasing because the program labels the innermost part of the disc 100% and the outermost 0%. You may also notice the graphs are not linear as I suggested; that is because the horizontal axis is a percentage of the data, not a percentage of the radial distance.

I will not bore anyone with the full math/logic of why this makes sense, but it does: graphed the way HD Tune does it, the curve should theoretically be part of a parabola, which the pictures seem to confirm. A short sketch of the reasoning follows below.
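Here is that sketch, assuming data fills the platter from the outer edge inward at uniform areal density (the radii below are illustrative, not the WD6400AAKS’s actual geometry):

```python
import math

R_OUTER, R_INNER = 1.75, 0.75   # illustrative platter radii in inches

def relative_throughput(pct_of_data):
    """Throughput ~ r, while data per track ~ r, so data passed so far ~ R_outer^2 - r^2.

    Solving for r turns throughput-vs-percent into the square root of a linear
    function of x, i.e. one branch of a sideways parabola rather than a straight line.
    """
    x = pct_of_data / 100.0
    r = math.sqrt(R_OUTER**2 - x * (R_OUTER**2 - R_INNER**2))
    return r / R_OUTER

for pct in (0, 25, 50, 75, 100):
    print(f"{pct:3d}% of capacity -> {relative_throughput(pct) * 100:5.1f}% of outer-edge throughput")
```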

————————————————————————–

For my testing, I used two WD WD6400AAKS drives in RAID 0:

[HD Tune graphs: write_Intel___Raid_0_Volume and read_Intel___Raid_0_Volume (sequential write and read across the RAID 0 volume)]

If the data were placed at the inner edge, where throughput is only around 800 Mb/sec, that could obviously throttle any Gigabit Ethernet transfer relying on it.

In this case I am not worried about data location, because my drives are empty and Windows/NTFS is fairly smart about data placement. Here is where my 10 GB test file was placed:

[Image: discplace (location of the 10 GB test file on the volume)]

Exactly where I would want it, at the outer edge. I can rely on Windows to do this consistently because the disc is otherwise empty.

So with this scenario, I am confident the discs will not bottleneck my transfers.

6. Framing overhead

A standard Ethernet frame carries at most 1500 bytes (12,000 bits) of payload; on a link moving 1,000,000,000 bits every second, the fixed cost attached to each frame adds up to significant framing overhead.

9000-byte (72,000-bit) “jumbo” frames are still small enough to be effectively checked for integrity by Ethernet’s 32-bit CRC, while reducing the per-bit overhead for transfers that are huge compared to the frame size.

Note that both endpoints and every node traversed along the path must support jumbo frames for the transfer to use them.

Further reading on jumbo frames: http://sd.wareonearth.com/~phil/jumbo.html , http://docs.hp.com/en/783/jumbo_final.pdf
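Before getting to measurements, a back-of-the-envelope comparison of wire efficiency (a sketch: 38 bytes covers preamble, MAC headers, FCS and inter-frame gap, and 40 bytes of plain TCP/IPv4 headers is assumed):

```python
# Share of the wire carrying actual file data for bulk TCP transfers.
ETH_OVERHEAD = 38    # preamble + SFD + MAC headers + FCS + inter-frame gap, per frame
TCPIP_HEADERS = 40   # IPv4 (20) + TCP (20) without options, per packet

for label, mtu in (("standard 1500-byte MTU", 1500), ("jumbo 9000-byte MTU", 9000)):
    payload = mtu - TCPIP_HEADERS
    on_wire = mtu + ETH_OVERHEAD
    print(f"{label}: {100.0 * payload / on_wire:.1f}% of the link carries file data")
```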

Well, I wanted to explore this in the real world of Windows, on platforms with abundant processing power:

 

[Chart: smb2through (SMB 2.0 transfer throughput with and without jumbo frames)]

Jumbo Frames make a big difference here, but arguably more important is the observation that the transfer is a lot faster if the current holder of the file initiates the transfer, rather than the recipient.

[Chart: smb2cpu (CPU utilization for the same transfers)]

Here we see the other benefit of jumbo frames: CPU utilization is less than half on the recipient of the transfer, regardless of who initiates it.

—————————————————————————————————————————–

2008×64 is a Q9450 (12 MB, 45 nm quad core) @ 333×8 = 2.66 GHz Core 2 Quad “Yorkfield” CPU

with 8GB of ram

Windows server 2008 x64

Intel 82566DC Gigabit Ethernet on PCIe w/9.11.5.7 drivers from Microsoft on 6/21/2006

DG33TL motherboard

dedicated 2×WD6400AAKS in RAID 0 on ICH9R for the network share, with an NTFS cluster size of 64 kB and HDD & volume caches enabled.

—————————————————————————————————————————–

Both computers are directly connected to a Trendnet TEG-S8/A switch.

—————————————————————————————————————————–

“box” is an E6300 (2 MB, 65 nm dual core) @ 333×7 = 2.33 GHz Core 2 Duo

with 4GB of ram

Vista x64 SP1

Atheros L1 Gigabit Ethernet on PCIe w/2.4.7.13 drivers from Atheros on 4/28/2008

Asus G35 motherboard

dedicated 2×WD6400AAKS in RAID 0 on ICH9R for the network share, with an NTFS cluster size of 64 kB and HDD & volume caches enabled.

—————————————————————————————————————————–

Why does it matter which side initiates?

Before I put this to rest, let’s examine what goes on during the transfer, to see why it is faster when the sender initiates.

Consider the transfer from box to 2008×64. When it’s initiated by the sender, this happens:

[Screenshots: network throughput (net, net2008) and memory usage (ram, ram2008) on box (sender and initiator) and on 2008×64 (recipient)]

This seems pretty normal. Let’s look again at the difference in transfer rates depending on the initiator:

[Chart: boxsendrate (transfer rates from box to 2008×64, sender-initiated vs. recipient-initiated)]

Now let’s see why it’s slower when the recipient initiates:

[Screenshots: network throughput (vistanet, 2008net) and memory usage (vistaram, 2008ram) on box (sender) and on 2008×64 (recipient and initiator)]

Whatever exactly is going on in the sending computer when a transfer is initiated by the recipient, it clearly uses as much RAM as it can find until there is none left. When it runs out of RAM, the transfer continues without any degradation in performance, which leaves me at a loss as to why it needed all that RAM so badly in the first place.

Let’s look at CPU use:

[Chart: boxsend (CPU utilization on box during sender-initiated vs. recipient-initiated transfers)]

Although it uses more CPU cycles on the sending node, with SMB 2.0 on modern multi-core CPUs it is clearly better to have the sender of the file initiate the transfer: throughput is significantly better, and it doesn’t pillage the sending node’s RAM.

—————————————————————————————————————————————

 

 

Conclusion:

  • High performance is possible on Windows shares with SMB 2.0
  • Jumbo frames are good
  • The sending node should initiate large transfers
  • Be careful to rule out disc bottlenecks; data location on the drive is especially important to consider.