It took me a while to write this. I had to clinically remove all confidential information and potential traces of information that will give you a hint of who or what this case study is about. I am currently engaged as an architect to help the enterprise build an IT strategy. For a start, the strategy is shaping out to be one that will prepare them for a digital future, while keeping existing things running well and dandy. It traverses bi-modal cloud infrastructure and automation; building PaaS and DevOps capabilities for agile development; enabling mobile services securely, and transforming their analytics platform. And it is in the analytics domain that many non-congruent views were surfaced.
The Situation : The enterprise that has been running an ETL / data warehouse / BI setup, wants to build a “next-generation big data analytics” platform. With the ever-present Hadoop fever these days, the enterprise has asked for an analytics platform that removes the “old legacy” layers of their existing setup. The enterprise had expressed a strong desire to remove ETL and data warehouse completely, among many other things.
I had many questions. Can ETL and data warehouse be completely eliminated? If it could, should it? What are their current use cases now? What are their big data use cases? What would succeed or fail if they made such a drastic change? Are skills in place? How much effort would it be for them to make that migration? What’s a more palatable approach? (architecture, of course…)
Learning Points : In short, here is a summary of some decision points for the analytics architecture :
- Not all roads lead to Hadoop. Think outcome. The primary need for this enterprise is still accurate and timely business reporting. There is no doubt that although Hadoop is going through a rapid enrichment program, it will still require expertise in Java, Hive, and other programs. It will likely require a huge amount of resources and effort to make that migration. Today’s MPP datawarehouse and ETL tools have adapters, schemas and systems integration built-in. Decision point : Use DWH/ETL for the primary need.
- Some roads are fantastic for Hadoop. For this particular enterprise, the source data are files. And these are files loaded up by various parties. Hadoop naturally became a consideration here. Land raw files, unchanged into a Hadoop-ready platform. Avoid ‘busy’ app-database layers. In this case, a NAS-based platform with Hadoop. Makes perfect sense for web-based applications. Supports object APIs for integrating cloud sources in the future. More on this later. Decision point : Use Hadoop + multi-protocol scale out NAS as a landing space.
- Leverage great improvements in ETL and Datawarehouse. Like MADLib for example. With all the Hadoop-ing going on, the traditional datawarehouse has undergone some cool augmentations with in-database analytics, and running that in parallel over an MPP architecture. And in this case, ETL is still very much needed.
- It takes time to build a successful big data analytics program. This is crucial. From my previous post (http://wp.me/p6zlyQ-K), a successful big data analytics program actually hinges on the data science capabilities. For this enterprise, it only made sense if we bought them time to grow in that familiarity. So, since the landing space is Hadoop-ready, the enterprise can start exploring and experimenting, within reasonable means, new patterns and associations in their raw data. In-place analytics experiment with Hadoop-with-Query. Visualization with RStudio or Tableau. Cool. At this point, the architectural decisions actually gave sound reasoning around the risks of going all out on Hadoop. The enterprise still gets to explore and make gradual steps toward advanced analytics, while maintaining the sanctity of their business reporting function.
- Finally, with the building blocks in place, the enterprise can easily scale toward a business data lake, with multi-rate ingestion capabilities, with workflows, and with export capabilities between Hadoop and data warehouse. And who is to say that Hadoop would not take over the world eventually, with all the development effort around this space. So, when the time comes for a major shift to unstructured, predictive analytics, they can shift easily.
Outcome : A hybrid analytics platform that doesn’t result in major redundancy, that serves the current purpose (strict, timely business reporting) as well as giving the enterprise a gradual transition toward advanced analytics in the future.