SSD Endurance Revisited

Summary

Lots of feedback, lots of links. TL;DR: it's still not likely you'll break anything soon.

A Lengthy Update - No Pretty Pictures But Tons Of Links

My previous article on SSD write endurance seems to have caused quite a stir. I've received lots of constructive feedback via email (and Slashdot), so I'd like to take the time to follow up with some corrections for more common devices. I will, however, keep the original article unmodified, because according to my logs it's been linked to quite a lot. It did also completely reach its goal: judging from some of the sources I've seen and the feedback I got, people started talking and thinking about the actual problems instead of doing a lot of general handwaving. And those who didn't at least seem to have calmed down a bit.

In particular, before I begin, I would like to refer to Allyn's rebuttal over at PC Perspective. That article nicely sums up most of the suggestions I received.

Do keep in mind this is still basically a whole lot of theorycrafting; individual drives may vary considerably. Lots of information on SSDs isn't available to customers under most circumstances, so getting accurate figures is hard at best. It'd be nice if hardware vendors were just a bit more forthcoming with their data. Here's hoping.

SATA 3.0 Bandwidth

As several readers have pointed out, my maximum bandwidth assumption of 6 Gbit/s did not take SATA 3.0's 8b/10b coding into account. This puts the maximum throughput I should've calculated with at a mere 600 MB/s instead of the 750 MB/s I used. However, this actually works in favour of the time estimate: a lower throughput directly results in longer times until blocks start dying, because fewer blocks will be erased in a given time window. Since my goal was to point out that it will in fact take excessively long, this correction would make the graphs look even "better" - so I went ahead and corrected the original graphs. Did I say I'd not touch the original article? Oops :).
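To put a number on that correction, here's a quick back-of-the-envelope sketch (in Python, which is simply my choice here, not anything from the original article):

    # Effective SATA 3.0 payload bandwidth once 8b/10b coding is accounted for.
    line_rate_bits = 6 * 10**9          # 6 Gbit/s SATA 3.0 line rate
    coding_efficiency = 8 / 10          # 8b/10b: 8 payload bits per 10 line bits

    payload = line_rate_bits * coding_efficiency / 8    # bytes per second
    naive = line_rate_bits / 8                          # what the original article used

    print(f"corrected: {payload / 10**6:.0f} MB/s")     # -> 600 MB/s
    print(f"original:  {naive / 10**6:.0f} MB/s")       # -> 750 MB/s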

I'd also assumed that current SSDs would not be able to handle writes at the maximum SATA bandwidth. Ewen took the time to prove me wrong in that regard with a couple of screenshots of his Intel 520 in benchmarks. Good to know, that :).

SLC, MLC, TLC, et al...

The original article only had SLCs in mind, hence the rather high P/E cycle estimates. I purposefully used old data sheets as a reference because technology tends to get better over time, not worse. SSDs seem to be a bit of an exception there: they keep getting faster, but P/E cycles keep degrading. The 1M-writes figure is thus, sadly, very unlikely to apply to any contemporary SLC device. The 100k figure applies to some, but as Allyn wrote in the article above, the source datasheet isn't exactly the best. I've also been tipped off that newer SLCs are closer to 40k writes with sigma at 25%, rather than the figures I've used.

Then again, if you've followed the news, there's been a lot of talk about new tech that places the maximum number of P/E cycles a lot closer to 1M and beyond, so maybe that 1M graph in particular isn't that far off after all for the next generation of drives? Who knows, maybe we'll find out by 2020 or so. I'm keeping my fingers crossed.

Now, consumer-grade SSDs typically employ MLC or even TLC flash. MLCs are only rated for up to 7.5k writes with sigma at about 33%, or even 4k writes with sigma around 25%, while TLCs are rated at 1k writes with sigma being anyone's guess. That is in fact quite a bit worse, but theorycrafting on that crosses into the next section, so we'll come to that in a second.
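To get a rough feel for what those ratings mean, here's a small sketch reusing the same worst-case assumptions as the original article - sustained writes at the full 600 MB/s, perfect wear levelling, and no write amplification. The 240 GB capacity is just a figure I picked for illustration:

    # Rough worst-case wear-out times: sustained writes at the full 600 MB/s,
    # perfect wear levelling, no write amplification. The capacity is illustrative.
    capacity_bytes = 240 * 10**9        # hypothetical 240 GB drive
    write_rate = 600 * 10**6            # bytes/s, the corrected SATA 3.0 payload rate

    ratings = {"SLC (100k)": 100_000, "SLC (40k)": 40_000,
               "MLC (7.5k)": 7_500, "MLC (4k)": 4_000, "TLC (1k)": 1_000}

    for name, pe_cycles in ratings.items():
        total_writable = capacity_bytes * pe_cycles      # bytes until rated wear-out
        days = total_writable / write_rate / 86400
        print(f"{name:>11}: ~{days:6.1f} days of non-stop writing")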

Workloads

The original figures deliberately exaggerated the throughput to give an estimate of the minimal time required to break a device by writing to it. While you might occasionally find a workload where you'll be writing at full speed for a few hours, it's still unlikely that you'll be doing so for days or weeks in a row. In a more realistic setting you'll find yourself at your computer for eight-ish hours a day, with the drive idling away and being bored most of that time.

If you're on OS X, there's a tool called Activity Monitor which will happily tell you how busy - or, more likely, how bored - your hard disk is. Windows 8 will tell you in its Task Manager, and earlier Windows versions ship with a Resource Monitor that does basically the same thing. And of course *BSD and Linux have several tools for that as well, such as GKrellM and any number of window manager widgets. If you use any of these tools on a regular computer in typical office-ish situations, you'll find that the drives are very bored indeed - unless you're currently running some hefty calculations. But then, how often do you do that?
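If you'd rather measure than eyeball a graph, a tiny script along these lines will report how much is actually being written per second. It relies on the third-party psutil package, which is simply what I reached for here:

    """Rough disk write-rate sampler; requires the third-party psutil package."""
    import time
    import psutil

    INTERVAL = 5  # seconds between samples

    before = psutil.disk_io_counters()
    while True:
        time.sleep(INTERVAL)
        after = psutil.disk_io_counters()
        written = after.write_bytes - before.write_bytes
        print(f"~{written / INTERVAL / 10**6:.2f} MB/s written over the last {INTERVAL}s")
        before = after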

Well, not very often at all is how often. I was going to cover this with more theorycrafting, but havard at hblok.net totally beat me to it. This particular article reuses my formulas with saner values for workloads and write cycles (10k and 1k). As an added bonus, they included the Gnuplot script they came up with as well. I strongly recommend reading this piece.
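Just to give a flavour of how the same kind of estimate behaves with a tamer workload, here's a variant using the 10k and 1k cycle ratings mentioned above; the 240 GB capacity and the 40 GB/day write volume are my own illustrative guesses, not havard's figures:

    # Same wear-out estimate, but with a tame daily write volume instead of a
    # saturated SATA link. Capacity and daily volume are illustrative guesses.
    capacity_bytes = 240 * 10**9
    daily_writes = 40 * 10**9           # 40 GB/day is already generous for a desktop

    for name, pe_cycles in {"10k cycles": 10_000, "1k cycles": 1_000}.items():
        years = capacity_bytes * pe_cycles / daily_writes / 365.25
        print(f"{name}: ~{years:,.0f} years at 40 GB/day")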

Spare Blocks

This hasn't been mentioned too often, but it's still worth pointing out. In the previous article I simply assumed that 10% of the drive's space is dedicated to spare blocks. Looking around a bit more, 7% seems to be the likelier assumption, but as much as 20% can be seen in the wild where people have played with drive parameters.
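As a rough illustration of how much that assumption moves the needle, here's a sketch that only counts the extra flash available for wear levelling and deliberately ignores the effect overprovisioning has on write amplification; all numbers are made up:

    # Only counts the extra flash available to wear-level across; the effect of
    # overprovisioning on write amplification is deliberately ignored here.
    user_capacity = 240 * 10**9         # hypothetical drive
    pe_cycles = 4_000                   # illustrative MLC rating

    for spare in (0.07, 0.10, 0.20):
        raw_flash = user_capacity * (1 + spare)
        budget_tb = raw_flash * pe_cycles / 10**12
        print(f"{spare:.0%} spare area: ~{budget_tb:,.0f} TB of rated P/E budget")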

Write Amplification

Allyn's article rightly points out that I've not taken write amplification into account at all. The thing is, if you're writing at the speeds I've used in the calculations, it's extremely unlikely that the drive will actually have to perform very many small writes to begin with. Counting OS and drive caches, the drive will probably be able to write full erase blocks and will only need a very small number of ancillary writes. The 25% speed decrease from my omission of SATA's 8b/10b coding should compensate for those in this (very) particular case.

Write amplification becomes a much bigger problem when calculating with real workloads. There's only so much time and cache space the OS and the drive can use to sensibly delay writes to the flash before things start to get messy and a lot of data needs to be juggled. Additionally, each write access to a file on the drive needs to update at least two completely unrelated areas of the flash: the one where the data itself lives and the one where the filesystem metadata resides. It's safe to say at least the latter will be very small. If you're using a journaled filesystem, or one with redundant metadata areas, each write to a file will incur even more distinct, independent but related writes - flash firmware might be able to bundle some of those, but at least the filesystem journal is unlikely to be one of them.

Estimating write amplification factors gets kind of tricky. The minimum factor is obviously 1: a single write for a single block of data. The maximum factor is also trivially estimated from the size of the erase block - changing a single byte somewhere will, at most, cause an entire block to be erased and rewritten. For MLCs that means our write amplification factor is somewhere between 1 and 2 million - not exactly a useful span to work with. And that's still disregarding filesystem overhead completely.
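Here's that bound spelled out; the 2 MB erase block size is an assumption on my part, and real geometries will vary:

    # Bounds on the write amplification factor (WAF) for a single host write.
    # The 2 MB erase block size is an assumption; real geometries vary.
    ERASE_BLOCK = 2 * 1024 * 1024       # bytes

    def worst_case_waf(host_write_bytes: int) -> float:
        """Worst case: every touched erase block is rewritten in full."""
        blocks_touched = -(-host_write_bytes // ERASE_BLOCK)   # ceiling division
        return blocks_touched * ERASE_BLOCK / host_write_bytes

    print(worst_case_waf(1))            # ~2 million: one byte rewrites a whole block
    print(worst_case_waf(ERASE_BLOCK))  # 1.0: an aligned full-block write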

There is, however, something else that I'm holding back on which may or may not negate this effect. We'll get to that in just a second as well.

Read Disturb

One reader, who'd like to stay anonymous, took the time to send in a report of a use case where she experienced problems completely unrelated to writing but still very much related to endurance. In her case, a small audio file would be played repeatedly and continuously. The file was apparently small enough to cause nearby flash memory to degrade through an effect called read disturb. This effect degrades the pages of a flash memory block when a few pages in that block are read repeatedly - several hundred thousand to several million times, according to this Micron Tech Note.

This effect is rarely mentioned in endurance articles, presumably because it's even rarer in practice than excessive writes. But apparently it does happen - outside of lab conditions, even. Given the number of reads involved, the 1M write count estimate in my original article could actually give an indication of the time it would take to wear out flash in this way. Newer drives might try to counteract this effect by regularly rewriting flash blocks that are read frequently, thus wearing out the drive just like normal writes would, but without any external write commands. Scary thought, that. This isn't included in any of my theorycrafting, but I thought it'd definitely be worth mentioning for some use cases.
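For a very rough sense of the timescales involved, here's a sketch assuming a short clip looping around the clock and every playback actually hitting the flash rather than the OS cache; the clip length and the disturb threshold are made-up, order-of-magnitude numbers:

    # A short clip looping around the clock, every playback re-reading the same
    # flash pages. Clip length and threshold are made-up, order-of-magnitude numbers.
    clip_seconds = 5                    # a short notification-style sound on loop
    reads_until_disturb = 1_000_000     # order of magnitude from the Micron note

    reads_per_day = 24 * 3600 / clip_seconds
    days = reads_until_disturb / reads_per_day
    print(f"~{days:.0f} days of looping before read disturb becomes plausible")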

Comparison with Vendor Data

While exact stats aren't readily available, some vendors do provide minimal guidance in the form of estimates of how much data can be written to an SSD over its lifetime. Curiously, these stats, when plotted against the theorycrafting above, are significantly more optimistic than what the underlying NAND specs would suggest.

This suggests that either there's something they're not telling us, or they're deliberately overprovisioning the devices by unexpected amounts. Alternatively they might just consider the drives writable for quite a bit longer than I would consider them usable. Or this is related to the next point...
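One way to sanity-check such a rating is to read it backwards into an implied P/E cycle count and compare that against the NAND spec. The drive capacity, the total-bytes-written figure and the write amplification factor below are all hypothetical:

    # Reading a vendor endurance rating backwards into an implied P/E cycle count.
    # All three inputs are hypothetical.
    capacity_bytes = 240 * 10**9
    vendor_rating = 500 * 10**12        # a hypothetical "500 TB written" figure
    assumed_waf = 1.5                   # hypothetical average write amplification

    implied_pe = vendor_rating * assumed_waf / capacity_bytes
    print(f"implied P/E cycles: ~{implied_pe:,.0f}")    # compare to the NAND rating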

Maybe The Rated Numbers Aren't Averages

Uh d'oih! OK, there's one thing that I haven't been criticised for, which slightly surprised me. Remember how in the last article I came up with a formula that simply treats the rated number of P/E cycles as an average? Well, the thing is, vendor data includes little more than raw P/E cycle counts. If that. So we're pretty much left to assume we're dealing with an average. Except everyone assumes these numbers are really minima - according to a tip I received, they might even be minima at six sigma. How does that translate to an average? Well, Jim Hardy pointed me to a series of articles of his own, which might hold an answer.

That answer could be in an SNIA white paper dealing with NAND flash endurance that's linked in his series. I suggest you skim over the paper, and then I'd like to direct your attention to figure 2 on page 6, inconspicuously titled "Bit Error Rates (BER) as a function of P/E cycles". If you look closely, you'll find that the SLC NAND actually tested in the white paper has fairly negligible bit error rates for anything under 1.5M writes - and the paper suggests that's for NAND rated at 100k P/E cycles. Even the MLCs come out at what appears to be 40k-100k in the graph. Mind. Blown. Thoroughly.

So what does that mean? Well, for one thing, it suggests that the average number of write cycles a device will actually endure is more like 15x-40x higher than I'd have anticipated. And that's not counting devices that employ ECC strategies, which would drastically improve the number of write cycles the devices can endure. See page 8 for that.

In a nutshell, I'd personally think that this might completely cancel out any write amplification effects the devices might have to endure, and then some. Might need a bit more theorycrafting regarding that, of course, but hey that's why applied maths is so much fun ;).
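Spelled out as a sketch, the argument is simply that the endurance uplift and the average write amplification factor compete head to head; both figures below are hypothetical:

    # The endurance uplift and the average write amplification factor compete
    # head to head; both figures are hypothetical.
    rated_pe = 100_000
    endurance_uplift = 15               # low end of the 15x-40x hinted at by the SNIA data
    average_waf = 10                    # hypothetical realistic average WAF

    effective_pe = rated_pe * endurance_uplift / average_waf
    print(f"effective P/E budget per cell: ~{effective_pe:,.0f}")
    # comes out ahead of the rated figure whenever the uplift exceeds the WAF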

Conclusion

Thanks to everyone for writing in with suggestions - I really appreciate it. I hope I've cleared up some issues with the original article, and I also hope this helps debunk some of the concerns people have about using SSDs for what appear to be "heavy duty" use cases. Seriously, people: unless you plan on putting these things in very extreme situations, don't worry about SSD write cycles. And for the record: office work, gaming, log files and OS swap are NOT extreme use cases. Not even close. They're not even in the same ballpark. Heck, even professional video editing is still going to be hard pressed to come close.

So, thanks for reading everyone! And keep the feedback coming ;).

Written by Magnus Deininger ().