August 13, 2008 / steve8

Storage, RAID, and Intel’s ICHxR

If you look around the Internet, in various forums including storagereview, hardocp, and anandtech, you will find hundreds of people talking about RAID.  Almost all of those threads involve someone thinking about using the onboard RAID solution that came with their computer, and a bunch of more “experienced” forum members replying to the effect of: “don’t even bother with it, buy a ‘real’ RAID controller and go from there”.

Let’s take a deep look at storage, and while we are there we can stop and see if Intel’s ubiquitous ICH9R is any good at RAID.

 

Performance analysis of storage is generally partitioned in two: sequential read/write transfer rate (STR), and random read/write performance, usually measured in I/Os per second or as a latency.

 

STR>

Sequential Transfer Rate (STR) is the data throughput a storage device can achieve without having to seek to various parts of the disc.

 

Hard disc primer for STR:

SCSI, SAS, ATA-66/100/133, and SATA1/2 all had impressive throughput rates for their time, but the interface was never the bottleneck.

The discs themselves have sustained transfer rates (STR) limited by:

1. Linear Speed L of the Head which is a function of:

  • Rotational velocity V (revolutions per unit time); on most hard drives this is not dynamic.
  • Location of the data, r (radial distance of the head from the spindle where the data is being operated on).

so L = V*2*Pi*r.  In a normal 7200 revolutions-per-minute, 3.5″ hard disc (radius near 1.5″), the linear velocity L = 7200*2*Pi*1.5 = ~67858 inches per minute at the outer edge, and less as you get closer to the center.

This linear velocity decreases linearly as you approach the spindle of the disc.

2. Linear data density d (bits per inch), which is usually proportional to the square root of the areal density (higher density means the head can traverse and read/write more sectors for a given linear velocity).

so

Sequential Throughput ~ L * d ~ 2 * Pi * V * r * d

So there you have it: on a given disc, sequential data throughput is linearly related to the distance from the center.

This is why 3.5″ drives are generally faster in sequential operations: much of the data sits a lot further from the center than on any 2.5″ disc.
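To make the formula concrete, here is a tiny back-of-the-envelope calculation in Python (just a sketch; the linear density figure is an illustrative assumption, not a measured spec for any particular drive):

from math import pi

rpm = 7200                    # rotational velocity V (revolutions per minute)
bits_per_inch = 1_000_000     # assumed linear data density d -- illustrative only

def linear_speed(radius_in):
    # L = V * 2 * Pi * r  (inches per minute)
    return rpm * 2 * pi * radius_in

def throughput_MB_s(radius_in):
    # Sequential throughput ~ L * d, converted from bits/minute to MB/sec
    return linear_speed(radius_in) * bits_per_inch / 60 / 8 / 1e6

for r in (1.5, 1.0, 0.5):     # outer edge, middle, near the spindle of a 3.5" disc
    print(f"r={r:.1f}in  L={linear_speed(r):7.0f} in/min  STR ~ {throughput_MB_s(r):5.1f} MB/s")

With those made-up numbers the outer edge works out to roughly 140 MB/s and the innermost tracks to roughly a third of that, the same “fast outside, slow inside” shape the HD-Tune graphs below show.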

————————————————————————–

It may not strictly be a disc problem, but effective large-file transfer rate can be throttled if the data is fragmented across various spots on the disc, since that forces head seeks for something that could have been a sequential operation. Seeks cost time and transfer no data.

Carefully chosen modern 7200rpm SATA2 high-areal-density discs like this or this can perform sustained sequential reads or writes close to 1Gb per second at the outer edge. The discs I have been messing with (Western Digital WD6400AAKS) have sequential performance around 900Mb/sec near the outer edge, so they will bottleneck large 1,000Mb/sec Ethernet transfers, but only a bit.

These graphs are decreasing because the program calls 100% the innermost and 0% the outermost part of the disc… also, you may notice the graphs are not linear as I suggested; this is because the horizontal axis is “%” of data, not % of radial distance.

I will not bore anyone with the full math/logic of why this makes sense, but it does… and graphing it the way HD-Tune does here should theoretically yield a quadratic, which the pictures seem to confirm.
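For the curious, a quick sketch of that math (assuming roughly uniform areal density, so data per track ~ r): the amount of data stored between radius r and the outer radius R is proportional to (R^2 – r^2), so the % of data already passed when the head reaches radius r is

% of data ~ (R^2 – r^2) / (R^2 – r_inner^2)

and since throughput ~ r, the %-of-data axis is a quadratic function of throughput. Plotting throughput against % of data therefore traces out part of a sideways parabola rather than a straight line, which is roughly what the HD-Tune curves look like.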

————————————————————————–

 

 

 

 

Latency>

The other major performance specification of any storage system is the latency, or time required to begin a read or write at a random location on the disc.

 

In a traditional rotating disc drive, the latency is not so straightforward to predict based on the geometry.

Clearly the latency cannot be predicted simply from the location of the data; one must also know where the head will be before it needs to access the data.  But let’s talk about average random access times…

Certainly faster rotation will minimize rotational delays, while higher areal density can effectively minimize real world latency if it means you can cram more data at the outer edge, minimizing or eliminating the need for the head to go towards the spindle… but on a disc full of data, the argument is irrelevant.

Let me save a more thorough analysis of seek times for another day, and let’s think of it as a latency (penalty) for each read/write that is not adjacent to the previous I/O.

 

————————————————————————–

 

question: does the revision and/or firmware of a disc drive matter?

answer: I like to use graphs to communicate, so here goes:

(these are all WD6400AAKS Western Digital 640GB SE16 drives, but different versions)

wd6400AAKS cacheless reads log

wd6400AAKS cacheless reads linear

wd6400AAKS cacheless writes

 

 

So, clearly the revision/version of a specific disc drive matters significantly. The 00A7B0 was quicker in latency, but slower in STR than the other two.  The 65A7B0 was the best in STR.

 

OK, everything so far was for a specific type of storage device consisting of 1 head and 1 rotating disc (yes, there are multiple platters, each face getting a head, but they do not seek independently, so effectively we can consider it as 1 face of 1 platter with 1 head).

 

Redundant Array of Inexpensive Discs?>

Although the actual words behind the acronym are not all that applicable today (since many RAID volumes lack redundancy and few are cheap), RAID has taken off as an admirably simple solution to what many situations require:

 

Primer:

Why would anyone want to use raid?

+a given hard disc has around a 5% chance of failing per year (this depends on age, temperature, and utilization), so it’s nice if a failure does not result in data loss, or even in any downtime at all.

+having one larger pool, rather than several smaller pools, to store data is preferable as it eases file management (no shuffling data around to various discs to make room).

+the hope that n discs could be n times faster than 1…

 

 

RAID 1

If your only concern is maintaining data integrity and availability in the event of a disc failure, RAID 1 “mirroring” is the obvious solution. The idea is that all writes are done to 2 discs instead of one. If one disc fails, the array simply reads from the remaining one, since all the data on both drives is identical.

r1

  • write throughput depends on implementation, but could be slower than a single disc.
  • write latency should be a bit worse: since the data must be written on both drives, the write is not complete until both seek, so it’s the slower of the two seeks, every time.
  • read latency could be slightly better than a single disc, as a smart controller could request the data only from the disc whose head is closer to it, or request it from both and take whichever returns it first.
  • read throughput depends on implementation, but could be faster than a single disc, as the controller could have each disc read a different half of the file, halving the time it takes to read the whole file… this is not generally implemented.
  • capacity is halved: since the 2 discs hold identical data, usable storage is only the capacity of one disc, so the redundancy of 2-disc RAID 1 costs 1/2 the capacity.
  • If each disc has an f % annual failure rate, the RAID 1 array will have a 100*((f/100)^2) % failure rate (so if the disc failure rate is 5.0%, the RAID 1 volume failure rate will be 0.25%).

 

RAID 0

If you want a larger pool of storage than a single drive can provide, or you want faster file reads and writes, and you aren’t worried about disc failure, then RAID 0 is the solution.

r0

  • read and write throughput should linearly improve with the number of drives in the array.
  • Latency should be a bit worse than a single disc, as each disc needs to seek to its part of the file, so the longest seek among all the heads is your access time for that read or write.
  • volume capacity is simply the number of discs times the capacity of each disc (if they are not the same size, then the capacity of the smallest disc times the number of discs), no or very little wasted space.
  • there is absolutely no data redundancy; in fact, if each drive has an f% chance to die in a year, then an n-disc RAID 0 volume has a 100*(1-(1-f/100)^n)% chance of failing, resulting in data loss of the entire volume (so a 4-disc RAID 0 array where each drive has a 5% chance of failing annually means the volume has an 18.55% chance of failing each year).

So, raid1 gives you redundancy but no speed gain, while eating up 1/2 of your disc space, and raid0 can yield a lot of streaming speed gains, but will actually make your reliability much worse.
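Before moving on, a quick sanity check of the two failure-rate formulas above, in Python (a sketch, assuming independent disc failures):

f = 0.05                                        # assumed 5% annual failure rate per disc

# RAID 1: a 2-disc mirror is lost only if both discs fail
raid1_fail = f ** 2
print(f"2-disc RAID1: {raid1_fail:.2%} per year")       # 0.25%

# RAID 0: an n-disc stripe is lost if any one disc fails
def raid0_fail(n, f=f):
    return 1 - (1 - f) ** n

print(f"4-disc RAID0: {raid0_fail(4):.2%} per year")    # 18.55%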

 

RAID1+0 or RAID 10

This is a solution for an even number of discs >= 4, where the controller uses RAID 1 for each pair of discs, and RAID 0 to stripe the pairs together into a larger, faster volume.

r10

6-disc raid 10 :
                       RAID 0
       .-----------------------------------.
       |                 |                 |
     RAID 1            RAID 1            RAID 1
   .--------.        .--------.        .--------.
   |        |        |        |        |        |
120 GB   120 GB   120 GB   120 GB   120 GB   120 GB
  A1       A1       A2       A2       A3       A3
  A4       A4       A5       A5       A6       A6
  A7       A7       A8       A8       A9       A9
  A10      A10      A11      A11      A12      A12

If there are n discs in the array:

  • Read throughput should be about n/2 times single-disc speed.
  • Read latency could be close to a single disc: RAID 0’s latency is the worst of all member volumes, but here the member volumes are RAID 1, where each could take only the faster seek of its member discs.
  • Write throughput could be close to n/2 times single-disc speed, but is dependent on the implementation, as each write must be duplicated on both drives of each of the mirrored volumes.
  • Write latency should be in the same ballpark as a single disc, but a bit worse, since both the RAID 1 and RAID 0 layers must complete the seek on all member drives for the operation to complete.
  • total volume size of a RAID 10 array is n/2 times the single-disc capacity, so the cost of redundancy here is 50% of the disc space.
  • If the disc failure rate is f%, then the RAID 10 volume failure rate will be 100*(1-(1-(f/100)^2)^(n/2)) %. If each drive has a 5.0% annual failure rate, a 4-disc RAID 10 array will have a failure rate of 0.499%, while a 6-disc RAID 10 array will have a failure rate of 0.748%.

So raid10 might seem like a decent balance of performance and reliability, but it is at the cost of 50% of the hard drives involved.

 

RAID 5

RAID 5 arrays consist of 3 or more discs where the data is striped across all discs, but for each stripe of data, parity is written on one of the discs, so if one drive fails, the data in each stripe can be recreated or recovered.

r5

  • Read throughput should be n-1 times as fast as a single disc.
  • Read latency should be slightly worse than a single disc, as all but one of the discs must complete their seek for the operation to complete.
  • Write throughput could be n-1 times as fast as a single disc, but there are some important notes on write performance. While data is being written, the controller must calculate the parity for each stripe; this has the potential to slow down writing significantly. Some may call this parity work a latency, but since it is a constant tax on all writes, big or small, I’ll consider it to effectively throttle write throughput. Cache/buffer can buy time for the system to calculate this parity.
  • Write latency in RAID 5 is inherently troublesome.  Not only does the system need to calculate parity for each stripe that is written, and write the corresponding parity data, but if the data being written is smaller than the stripe itself, the controller must first read the stripe, then modify the data (adding/changing bits as needed), then calculate parity on the entire stripe, and write it back entirely.  I will refer to this as the partial-stripe-write issue/penalty.
  • Total usable volume size of a RAID 5 array is about (n-1) * smallest-disc-size, so a 4-disc array will have a usable storage area of 3 * disc-size.
  • An n-disc RAID 5 volume’s failure rate is the odds that 2 (or more) discs fail in the array.  If each disc has a failure rate of f%, the RAID 5 volume failure rate is 100*[1 - ((1-f/100)^n + nC1*(1-f/100)^(n-1)*(f/100))] %.  (For a 4-disc array where the annual failure rate per disc is 5%, the failure rate of the RAID 5 volume is 1.40%; about 3.3% for a 6-disc array.)  In production environments the effective failure rate of the volume will be closer to zero, since a failed disc will be replaced in a matter of hours or days and the array rebuilt, restoring redundancy.
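The same kind of quick check for the RAID 10 and RAID 5 formulas, in Python (again a sketch: independent disc failures, no replacement or rebuild):

from math import comb

f = 0.05                                # assumed annual failure rate per disc

def raid10_fail(n, f=f):
    # the volume survives only if no mirrored pair loses both of its discs
    return 1 - (1 - f ** 2) ** (n // 2)

def raid5_fail(n, f=f):
    # the volume fails if 2 or more of the n discs fail (it tolerates exactly 1)
    survive = (1 - f) ** n + comb(n, 1) * f * (1 - f) ** (n - 1)
    return 1 - survive

print(f"4-disc RAID10: {raid10_fail(4):.3%}")   # ~0.499%
print(f"6-disc RAID10: {raid10_fail(6):.3%}")   # ~0.748%
print(f"4-disc RAID5:  {raid5_fail(4):.2%}")    # ~1.40%
print(f"6-disc RAID5:  {raid5_fail(6):.2%}")    # ~3.3%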


RAID 5 clearly looks like the best and most scalable RAID technology listed above, providing streaming speed boosts and protection from disc failure while only “wasting” 1 disc of the array for parity… but do we need an expensive or obscure RAID controller to get decent write performance in the real world?

Enter the Intel ICH9R.

This is Intel’s current RAID-enabled southbridge, paired with most 3-series northbridges, and many motherboard manufacturers have also paired it with the X48 northbridge.

Obviously the southbridge does many things, but here let’s look at it solely as a raid controller.

During the write of a RAID 5 stripe, parity must be calculated from the data to be written before the write occurs.  Some high-end RAID controllers perform this calculation with the aid of an XOR processor.  Intel’s ICH9R does not have an XOR processor; it leverages the CPU to do the calculation.  In the past this may have been an enormous concern; today we have dual- and quad-core CPUs on the cheap, yielding an excessive amount of unused processing power in most small business machines.
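To illustrate what that XOR work actually is (just the principle; this is not Intel’s implementation), here is a small Python sketch: the parity strip is the XOR of the data strips, any single lost strip can be rebuilt from the others, and a small write can be absorbed by reading the old data and old parity first, which is exactly the read-modify-write behind the partial-stripe-write penalty mentioned above:

import os

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# three data strips of one stripe (random stand-in data)
d0, d1, d2 = (os.urandom(16) for _ in range(3))

# the parity strip is simply the XOR of all the data strips
parity = xor(xor(d0, d1), d2)

# if the disc holding d1 dies, its strip can be rebuilt from the survivors
assert xor(xor(d0, d2), parity) == d1

# a small write that only touches d1: new parity = old parity ^ old data ^ new data
new_d1 = os.urandom(16)
new_parity = xor(xor(parity, d1), new_d1)
assert new_parity == xor(xor(d0, new_d1), d2)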

Another trick expensive RAID controllers use is onboard cache for aiding small writes. This is critically important because the system must calculate the parity of a stripe before it is written, so having a buffer to buy time without holding up other I/Os is key.  Intel’s ICH9R has no onboard cache, but it can simply take some of the very cheap and abundant system memory. It may be more nowadays, but back in the ICH7R days it allocated 4MB of system memory at boot for caching RAID arrays.  That is enough to handle small random writes gracefully, dealing with the partial-stripe-write issue.

So with that out of the way: the ICH9R is a hardware RAID controller in the sense that the RAID arrays are transparent to the OS, but it does utilize system resources for the XOR parity calculation and for volume write caching.

 

 

That’s all well and good, but text is cheap; let’s look at graphs:

Read Throughput:

read_str

 

Read Latency:

read_latency

 

 

 

 

 

Write Throughput:

 

write_str

Here we see the ICH9R doing a pretty good job in RAID 10 and RAID 5.  In RAID 10, performance is almost double a single disc, which is what you would expect/hope for.  In RAID 5 you might hope for triple the performance of a single drive, but it looks like we will have to settle for double.  Clearly, if you are not going to enable disc and RAID cache, you should not use RAID 5 for anything where write performance matters.

 

Write Latency:

write_latency

 

Here we see impressive numbers from the ICH9R in RAID 10: latency is significantly less than a single disc if disc cache is enabled. I’m not sure why this is, but possibly a larger pool of cache from both volumes in the stripe.  RAID 5 is, as expected, worse than a single disc in write latency; with disc and RAID cache enabled the latency is comparable, yet still worse than a single disc.

 

is it really this simple? 

(Can the performance of these drives really be characterized by only these metrics?)
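The “prediction” curves in the graphs that follow come from a simple two-parameter model: each random I/O costs one latency penalty plus the time to stream the requested data at the volume’s STR. A minimal sketch of that model in Python (the example numbers are just round figures in the ballpark of the single-disc results, not exact measurements):

def predicted_iops(request_MiB, latency_s, str_MiB_s):
    # 1st order model: time per I/O = latency + request_size / STR
    return 1.0 / (latency_s + request_MiB / str_MiB_s)

def predicted_MiB_s(request_MiB, latency_s, str_MiB_s):
    return request_MiB * predicted_iops(request_MiB, latency_s, str_MiB_s)

# example: a single disc with ~17.5ms write latency and ~85MiB/sec write STR
for size in (0.064, 1, 16, 512):                    # request size in MiB
    print(size, round(predicted_MiB_s(size, 0.0175, 85.0), 1), "MiB/sec")

Small requests are dominated by the latency term and large requests approach the STR, which is the general shape of the predicted curves below.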

for a single disc:

 1xread_prediction

 

1xwrite_prediction

 

 

Overall, for a single disc, with or without cache, the 2-metric model seems to work pretty well, although around 1MiB request sizes we see actual performance fail to match the prediction, especially for writes.

 

 

What about for Raid1:

2xR1_read_prediction

It’s almost as if it only does striped reads when RAID cache is off and the read data request is larger than 64MiB…  strange.

What about writes?:

2xR1_write_prediction

 

 

Here we see a discrepancy across the board at 1MiB transfers… Clearly HDD cache on, RAID cache off is ideal for RAID 1 on the Intel ICH9R.

 

What about Raid10:

4xR10_read_prediction

 

4xR10_write_prediction

Let’s see what’s going on at small write request sizes:

4xR10_write_prediction_log

Here we see the 1st order latency/str model really failing to model the mid-size transfer random write performance of the drive…

Although that configuration doesn’t have the highest write STR or the best latency, it seems RAID cache should be disabled (with HDD cache on), because the read STR is so much higher.  Intel’s RAID cache seems to have an adverse effect in many scenarios.

 

 

 

 

And finally, Raid 5:

 

4xR5_read_prediction

RAID 5 random read performance is decently approximated by the 1st-order latency/STR model, although there are significantly under-performing real results in the mid-range, from 1MiB to 64MiB request sizes, as seen above.

What about writes?:

4xR5_write_prediction

At least in the mid-range, this simple latency/STR model seems to fail to really characterize performance… Let’s look at the log scale:

4xR5_write_prediction_log

 

Cacheless and disc-cache-only RAID 5 volume configurations seem to act as expected by the simplistic 1st-order latency/STR model…

But that is because the model doesn’t really deal with the cache.  Cache can greatly improve effective random write latency, since the system does not have to wait for the controller to calculate the parity and then physically write the entire stripe to the disc… but as the transfer request size gets large, this cache becomes irrelevant.

 

I am not one to be content to leave things without a full understanding, so let’s dig deeper:

 

OK, so we know that without cache the RAID 5 volume has a latency of 62.5ms on average…

That includes a seek and a partial-stripe-write penalty…

 

We know from a single cacheless disc that the seek time is about 17.5ms,

so the partial-stripe-write penalty = 62.5-17.5 = 45ms

As the write request gets large relative to the stripe size (here 64KiB), the write will probably incur 1 seek and a partial-stripe-write penalty at the start, and another partial-stripe-write penalty at the end of the write, for a total of 1 seek and 2 partial-stripe-write penalties. This means the effective random write latency is 17.5+45+45 = 107.5ms. Knowing this, I can go back and re-calculate the STR on buffered writes; for example, with HDD & RAID cache enabled, the 4-disc RAID 5 volume got 0.31 IOPs on random writes of 512MiB, so

.1075 + 512/STR = 1/0.31 ==> STR = 164.2MiB/sec write with disc and ich9r cache enabled.
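In Python, that back-calculation is just (a trivial check of the arithmetic above):

latency = 0.0175 + 2 * 0.045           # one seek + two partial-stripe-write penalties = 107.5ms
iops_at_512MiB = 0.31                  # measured: 0.31 IOPs for 512MiB random writes

# latency + 512/STR = 1/IOPs  ==>  STR = 512 / (1/IOPs - latency)
str_MiB_s = 512 / (1 / iops_at_512MiB - latency)
print(round(str_MiB_s, 1), "MiB/sec")  # ~164.2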

OK, so let’s model a volume with 107.5ms random write latency and 164.2MiB/sec write STR:

 

 4xR5_write_prediction_yes_log

 

This predicts it dead-on beyond 1MiB request sizes…

 

Now this makes some sense… with small data sizes, the caching can buffer the writes, buying time for the system to read the stripe, make the write in the buffer, and then calculate the parity for that modified stripe without holding up the next operation… (the write is acknowledged to the OS even though the data has not actually been written to the physical platters yet).

 

The combined cache does seem to adequately buffer sequential writes by giving the system time to calculate the parity before it is actually written to the platters. It is not 3x the STR of a single disc, as it theoretically could be, but it’s still admirable.

 

Note: performance was a bit better with OS caching (the “advanced performance” option under the volume properties in the device manager), and while I don’t feel scared using the disc or RAID controller’s cache… I think the slight gains (~5% write STR) aren’t worth the bother of having yet another place data is written to without actually going to the disc.

 

 

Brief Cache Talk:

I focused on various caching schemes throughout the article, and for good reason: caching has massive effects on real-world performance.

There is something worth mentioning here:

With OS-level caching, writes can be buffered in system memory, and it’s possible something could interfere before this data makes it to the hard disc.  While this is probably not a big concern, I’ve seen little-to-zero advantage from it in my testing in Windows; in fact, sometimes the overhead makes things slightly worse.

RAID controller cache, on many fancy controllers, sits on the controller card itself, so it’s a bit independent of the OS, which is good… but if power goes out, whatever data is in that cache never gets written to the discs.  High-end controllers have cache batteries meant to keep that data alive until it can be written to the disc once power returns.  This Intel controller dedicates some of the system RAM to itself on boot, so it’s fairly isolated from the rest of the system.  The battery backup, I think, is like using a hammer to fix a computer… yes, it will be able to get the cache onto the disc, but who’s to say the cache doesn’t hold only partial information of files, resulting in corruption…

Hard disc cache is basically a bit of RAM on the drive itself; it is used for buffering writes and buffering I/Os to allow re-ordering (NCQ). Anything in this buffer will not be saved to disc in the event of power loss to the system.

Disabling cache altogether is also not a solution: if the head of the disc is halfway through a write operation, or a file write, partial writes will still occur, possibly resulting in corrupt/unusable data.

Bottom line: if you care about the integrity of your data, your system simply cannot lose power. Spend the money on redundant PSUs and UPSes rather than the (imho) useless cache battery system.

 

 

Conclusions>

 

Well, this was more a lesson in storage than anything else, but let’s see:

 

The ICH9R does a very admirable job. We aren’t talking about breaking any records here, but it performs respectably as long as you configure the caching properly.  It also has price (nearly free) going for it, as well as an industry giant supporting it… not to mention its enormous userbase and the forward/backward compatibility of its RAID volumes.

In summary, RAID on the ICH9R:

                                   1-disc AHCI    2-drive raid1   4-drive raid10   4-drive raid10       4-drive raid5
optimal configuration              hdd cache on   hdd cache on    hdd cache on     hdd&raid cache on    hdd&raid cache on
cost per usable GB                 13.3 cents     26.6 cents      26.6 cents       26.6 cents           17.7 cents
annual failure rate*               5%             0.25%           0.50%            0.50%                1.4%
Sequential Reads (average)         78MiB/sec      144MiB/sec      237MiB/sec       175MiB/sec           232MiB/sec
Random Read Latency (average)      12.9ms         13.2ms          15.9ms           14.3ms               14.34ms
Sequential Writes (average)        84.6MiB/sec    85.1MiB/sec     148.3MiB/sec     161.5MiB/sec         164.2MiB/sec
Random Latency (avg), small writes 12.9ms         7.35ms          4.3ms            4.1ms                9.65ms
Random Latency (avg), large writes 17.5ms         18.1ms          18.9ms           18.9ms               107.5ms

* annual failure rate assumes no disc replacement / volume rebuild

 

There is more to do, but my time is finite.

to do: present my data and analyze how allowing multiple random IOs to accumulate before a read/write affects performance.

to do: in the future compare the ICH9r with a fancy/expensive raid controller

to do: confirm that Intel’s new ICH10R performs almost identically to the ICH9R

to do:  RAID-Z  <– this is big, but seemingly not ready for prime-time.

to do: clean up this article and fix some slight imperfections here and there.
