Posted by: Taylorallis
DataCenter, DataManagement, dedup, Storage
I was recently talking about a Storage Magazine article, "Dedupe moves beyond backup."
The conversation led me to revisit some of my past analysis around de-dup, and I ended up looking five years into the past.
Global Compression at StorageTek
At StorageTek there used to be an engineering research and IP department called Advanced Technology Research or “AdTek.” My current business partner and boss, Randy Chalfant, used to run it. A brilliant engineer by the name of Chuck Milligan ran the group after Randy – Chuck is the one who hired me at StorageTek. I eventually ended up heading the department.
I was looking at an old list of Research Probes we were recommending to STK execs for productization – there were 11 cases we presented in 2003 (Grid Storage, Flash/SSD, Encryption, etc.). On the list was "Global Compression." In our pitch to management, we stated that it yielded extremely high compression ratios and had the potential to disrupt tape. We recommended adding it as a feature to the backup disk products STK was looking to bring to market – we even recommended some companies to evaluate for investment. (Unfortunately, some other probes were picked for further research that year!)
Fast forward some years, and my strategy team and I found ourselves briefing Sun executives (after the STK acquisition) on the future of de-duplication, as it has come to be known. I remember saying two things:
1. De-duplication has officially moved from cutting edge to a must-have for disk backup, VTL, and secondary storage
2. Dedup will move from secondary storage to primary storage in the future (we backed up our claims with an excellent 451 Group report on the subject)
Dedup in Primary vs. Secondary Storage
Now we have dedup in primary storage. However, some think primary storage is not always the best place for dedup. The thinking is that de-dup works where there is a lot of…duplication. Primary storage tends to hold more transactional data, while secondary storage holds more duplicate data. While this is true, there is more duplicate data on primary storage than users realize.
I have moved from simply recommending storage strategies to actually implementing them in my new venture (which is much more fun!). Dedup is one of the steps we use with clients to get to a more efficient and optimized storage infrastructure.
We help storage users identify all of the inert data sitting on their primary storage – data that has not been referenced in more than six months. Users are almost always surprised by how much we find – around 40% on average.
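For readers who want a feel for how that kind of assessment works, here is a minimal sketch of scanning a file tree for data that has not been touched in roughly six months. It is an illustration only, not our assessment tooling; the threshold and the reliance on atime/mtime are assumptions (many filesystems mount with noatime, so last-modified time is used as a fallback signal).

```python
import os
import sys
import time

# Assumed threshold: treat anything untouched for ~6 months (180 days) as inert.
SIX_MONTHS_SECONDS = 180 * 24 * 60 * 60

def find_inert_files(root, threshold=SIX_MONTHS_SECONDS):
    """Walk a directory tree and report files not touched within the threshold.

    Uses the later of last-access and last-modified time, since atime is
    often disabled or coarse on production filesystems.
    """
    now = time.time()
    inert_files, inert_bytes, total_bytes = [], 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files that vanish or are unreadable
            total_bytes += st.st_size
            last_touch = max(st.st_atime, st.st_mtime)
            if now - last_touch > threshold:
                inert_files.append(path)
                inert_bytes += st.st_size
    return inert_files, inert_bytes, total_bytes

if __name__ == "__main__":
    files, inert, total = find_inert_files(sys.argv[1])
    pct = 100.0 * inert / total if total else 0.0
    print(f"{len(files)} inert files, {inert / 2**30:.1f} GiB ({pct:.0f}% of scanned capacity)")
```

In a real engagement this kind of scan would come from the array or file-system metadata rather than a walk of every file, but the principle – measure last reference, then total up what has gone cold – is the same.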
The next question is what to do with this data – it needs to be cleaned up or moved in order to return that 40% to free pool capacity.
One clean-up step is dedup – and in some instances a significant amount can be deduplicated. What are duplicates doing on primary storage? A lot of data management practices (or lack thereof) lead to this.
One example: In many cases application engineers will be testing new applications or updates. They need to run tests on real data – but obviously can’t run them on live, production data. So, they make a snap copy of the production data and run the tests against this data set. If they want to run another test, they’ll make another copy and so on. Do they remember to go back into the system and clean up their copies? Most often the answer is no – and this simple process (which is one of many) robs a primary disk system of its precious capacity.
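To make the idea concrete, here is a small sketch of how you might hunt for those forgotten copies by grouping files with identical contents. This is file-level duplicate detection for illustration only; production dedup engines typically work at the block or chunk level, inline or post-process, and the paths and sizing here are hypothetical.

```python
import hashlib
import os
import sys
from collections import defaultdict

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file's contents in 1 MiB chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Group files under root by content hash; groups with more than one entry are duplicates."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_hash[sha256_of(path)].append(path)
            except OSError:
                continue  # unreadable file; skip it
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

if __name__ == "__main__":
    dupes = find_duplicates(sys.argv[1])
    reclaimable = sum(
        os.path.getsize(paths[0]) * (len(paths) - 1) for paths in dupes.values()
    )
    print(f"{len(dupes)} duplicate groups, ~{reclaimable / 2**30:.1f} GiB reclaimable")
```

Even a crude report like this makes the point to application teams: every extra test copy of a production data set is capacity that could be returned to the free pool.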
So, deduplication can have a significant impact on primary storage in addition to secondary storage. But like any storage technology, the way in which it is implemented is the critical part of the equation.