Oops is an understatement here. Tesla has a problem with dead eMMC flash memory in Model S and Model X cars equipped with the MCUv1 (Media Control Unit). According to InsideEVs, this dead part could cost more than $1,800 to repair! The problem stems from excessive writing to a log file, which is causing flash wear. This, combined with the ever-increasing size of Tesla’s firmware (which has grown from 300 MB to 1 GB), leads to a situation where the MCUv1’s storage reaches its maximum endurance and fails.
A closer look at Tesla’s MCUv1 flash memory failure
An application in Tesla’s MCUv1 was writing and rewriting massive amounts of log data. But flash memory lifetime is limited by a finite number of write and erase cycles, typically in the tens of thousands. Over time in these intensive write-rewrite scenarios, the blocks in the flash memory storage eventually “die out,” making them unusable for storing any data. And when blocks fail, portions of the firmware file may become unreadable, which can lead to application crashes or complete failure of the MCU.
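To make the endurance limit concrete, here is a back-of-the-envelope sketch in Python. It assumes perfectly even wear across the whole device, and every number in the example (an 8 GB part, 10,000 write and erase cycles, 10 GB of log writes per day, a write amplification factor of 2) is hypothetical and chosen only for illustration. None of them are measurements from Tesla’s MCUv1, and the function name is made up for this example.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, host_writes_gb_per_day,
                             write_amplification=1.0):
    """Rough flash lifetime estimate, assuming perfectly even wear.

    capacity_gb            -- usable flash capacity (hypothetical value)
    pe_cycles              -- rated write/erase cycles per block (hypothetical)
    host_writes_gb_per_day -- data the applications write each day
    write_amplification    -- physical writes divided by host writes (>= 1)
    """
    if host_writes_gb_per_day <= 0 or write_amplification <= 0:
        raise ValueError("write rate and write amplification must be positive")
    # Total data the flash can absorb before every block is worn out.
    total_writable_gb = capacity_gb * pe_cycles
    # What actually lands on the flash each day.
    flash_writes_gb_per_day = host_writes_gb_per_day * write_amplification
    return total_writable_gb / flash_writes_gb_per_day / 365.0


# Illustrative numbers only: roughly 11 years in this ideal case.
years = estimated_lifetime_years(capacity_gb=8, pe_cycles=10_000,
                                 host_writes_gb_per_day=10,
                                 write_amplification=2.0)
print(f"Estimated lifetime: {years:.1f} years")
```

Doubling the daily log volume halves the result, which is why a chatty logging application matters so much for a part that has to last as long as the car.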
Several suggestions have been made for how the problem could have been prevented:
1. Intelligent wear leveling
We caught up with our Technical Product Manager, Thom Denholm, to explain the wear leveling aspect in more detail. “Wear leveling describes techniques used to ensure even use of the blocks on flash media, with the purpose of achieving the longest possible lifetime of the storage.
The most basic wear leveling design is dynamic wear leveling. For each data write, the algorithm makes sure the block being written to is the least erased block. Unfortunately, this “quick and dirty” design has a flaw on most use cases that will compromise the lifetime of the flash. Any blocks containing data written only once (so-called static data) are left out of rotation for wear leveling. If a system contains a large portion of these static data blocks, that means fewer blocks available to balance the load. This could have perhaps been a problem in the Tesla case.”
Thom’s technical assessment of the situation is spot on! That said, even with good wear leveling at the flash management or file system level, the Tesla MCU would still have failed eventually under such write and erase conditions. Proper wear leveling would, however, have bought considerably more lifetime for the system.
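The effect of static data on dynamic wear leveling can be illustrated with a toy simulation. This is a simplified sketch under stated assumptions, not a model of any real flash controller: each write erases one whole block, the least-erased block in the rotating pool is always chosen, and the “static” variant is idealized so that relocating static data costs nothing. The block counts and erase limit are hypothetical.

```python
def writes_until_worn_out(total_blocks, static_blocks, max_erases, rotate_static):
    """Count block writes until every block in the rotating pool is worn out.

    rotate_static=False models dynamic wear leveling: blocks holding static
    data never enter the rotation, so only the remaining blocks share the load.
    rotate_static=True models an idealized static wear leveling in which all
    blocks rotate (the cost of relocating static data is ignored).
    """
    pool = total_blocks if rotate_static else total_blocks - static_blocks
    if pool <= 0:
        return 0  # no block is available for new writes
    erase_counts = [0] * pool
    writes = 0
    while True:
        # Pick the least-erased block for the next write.
        target = min(range(pool), key=erase_counts.__getitem__)
        if erase_counts[target] >= max_erases:
            return writes  # even the freshest block has reached its limit
        erase_counts[target] += 1
        writes += 1


# Hypothetical device: 100 blocks, 60 of them holding static data,
# each block rated for 50 erases.
dynamic = writes_until_worn_out(100, 60, 50, rotate_static=False)
static = writes_until_worn_out(100, 60, 50, rotate_static=True)
print(dynamic, static)  # prints "2000 5000"
```

In this toy model, the larger the share of static data (a growing firmware image, for example), the smaller the pool that dynamic wear leveling can rotate through, and the sooner that pool wears out.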
2. Retaining all the logged data in RAM
The InsideEVs article suggests that the logged data could instead be moved to RAM to essentially trick the system. This, however, has drawbacks: it could hurt performance, and the logs would sit in volatile memory, where they are lost whenever power is lost. According to Joel Catala, our Director of Engineering, “This solution potentially defeats the initial intentions of the system designers, and might result in troublesome situations if the needed data isn’t available in case of a crash.”
3. The right grade of memory hardware
In a post on LinkedIn about the Tesla issue, Kevin Kilbuck, VP of Business Development for Longsys and Lexar, comments, “As someone who has worked in the semiconductor memory industry for over 30 years, I can state emphatically that not all flash is created equal. There are many ‘grades’ of flash memory produced, ranging from low-end consumer grade, that is really only suitable for things like entry-level (cheap) USB drives and other consumer products, to enterprise-grade flash, which is utilized in high-reliability/endurance applications, such as write-intensive flash storage arrays.”
Kevin continues, “As others have pointed out, the hardware (controller) and software used to manage the flash are also important. That being said, it is not possible to take a low grade of flash and use it in a high-endurance application, no matter how robust your flash management is. The lower grades of flash have other failure mechanisms that error management software/hardware cannot correct. These lower grades are perfectly fine for the intended application, but not much else. I am in no way suggesting that Tesla tried to use a lower grade of flash than they should have, only that the silicon matters as well as the flash management techniques.”
4. Putting the logged data on separate memory hardware
Sure, this could have solved Tesla’s problem. But memory hardware is costly and adds extra weight to the system, and there are better ways to solve the problem.
So what do we at Tuxera feel would have been the best way to prevent Tesla’s MCU storage failure?
Go back to the basics – understand and test your entire storage stack
“Correct understanding of the memory devices (and their limitations), of software components such as data management (file system and flash management), and application behavior is key to designing systems that will be robust and survive the X number of years they are intended to live,” writes Joel Catala, our Director of Engineering. This review of the entire storage stack and all the data workload requirements must happen in the planning and design phase, not as an afterthought.
Joel continues, “At Tuxera, we’ve encountered issues like Tesla’s several times, and we are continuously collaborating with customers and partners on activities such as workload analysis, lifetime estimation, write amplification measures, and ultimately selection of data management software and storage devices.” Tuxera offers a Flash Memory Cost Analysis and Testing service to provide this level of understanding, and to help prevent memory failures.
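As a rough illustration of what a write amplification measurement looks at, the sketch below compares the bytes an application asks to write with the bytes that actually reach the flash. It is a simplified model under stated assumptions: every flush writes whole pages, and nothing else (file system metadata, garbage collection) adds to the traffic. The record size, page size, and function name are hypothetical.

```python
import math


def write_amplification(record_sizes, page_size, records_per_flush):
    """Estimate write amplification for a simple logging workload.

    Records are flushed in batches of `records_per_flush`; each flush is
    assumed to write whole pages, so a partly filled page still costs a
    full page of flash writes.
    """
    if page_size <= 0 or records_per_flush <= 0:
        raise ValueError("page_size and records_per_flush must be positive")
    host_bytes = sum(record_sizes)
    if host_bytes == 0:
        raise ValueError("no data written")
    flash_bytes = 0
    for start in range(0, len(record_sizes), records_per_flush):
        batch_bytes = sum(record_sizes[start:start + records_per_flush])
        # Round each flush up to a whole number of pages.
        flash_bytes += math.ceil(batch_bytes / page_size) * page_size
    return flash_bytes / host_bytes


# Hypothetical workload: 100 log records of 128 bytes on 4096-byte pages.
records = [128] * 100
print(write_amplification(records, 4096, records_per_flush=1))   # prints 32.0
print(round(write_amplification(records, 4096, records_per_flush=32), 2))  # prints 1.28
```

In this model, flushing every small log record on its own multiplies the flash traffic many times over, while batching the same records brings the factor close to 1. Real measurements would also have to account for the file system and the flash management layer.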
Let’s be clear. We’re actually big fans of Tesla over here at Tuxera (did someone say Caraoke? Yuss!). You can typically find a few Teslas in the parking lot behind our HQ here in Finland. But the Tesla MCU memory failure illustrates the need to fully understand everything going on in the storage stack: the hardware, the software, and the use cases and potential workloads, from when the car leaves the factory to 5 years from that date. These factors are the key to designing robust systems that will match the lifetime of the car.
About the author
Tiffiny brings over a decade of technology marketing experience to Tuxera as Head of Marketing. Before joining Tuxera in 2016, she wrote about technology, consumer electronics, and industrial tech for Nokia, Microsoft, KONE Corporation, and many others. Around the office, we know Tiffiny for her love of geek culture, console gaming, and her adoration for Cloud City’s Baron Administrator, Lando Calrissian.