Data warehousing is not dead, but it is changing as new technologies, including Hadoop and cloud platforms, have an impact.
Nostalgia might be a strong driver in music, fashion, and entertainment, but it has never been a captivating force in technology. Mainframes and COBOL are still very much with us, but I don’t run into people polishing up their IBM System/390 as they would an Oldsmobile Toronado. The tech industry is always about the new, so why is one of 2018’s top trends the revival of interest in data warehousing?
Enterprises need to meet present-day requirements for trusted, curated, and integrated data, and those requirements are only going to grow in the years ahead as organizations strive to become more data-driven in their decision making.
Vendors and Technology Trends
To spot evidence of this data warehousing trend, I needed to look no further than the Strata Data Conference in New York this past September. Strata has long been a bastion of big data, with a particular focus on the Apache Hadoop and Spark ecosystem. However, in meetings with vendors at this fall’s event, the dominant topics were data warehousing, data cataloging, metadata and semantic data integration, and governance.
For example, a key focus at the event for Cloudera, which partners with O’Reilly Media to produce Strata, was the Cloudera Data Warehouse. Cloudera’s solution combines two things: the advances in big data platforms that let organizations collect and manage petabytes of data, and functionality to support the concurrency demands generated by democratized, self-service BI and analytics. Cloudera and other vendors made it clear that the revival of interest in data warehousing is not about stepping back. The “modernized” data warehouse is about exploiting the maturity of big data technologies and self-service data preparation and integration to expand the scale, scope, and usability of data warehouses.
Cloudera and erstwhile rival Hortonworks made further news in October when they announced plans to merge and bring to market a unified data platform and Hadoop distribution. Although the combined company under Cloudera management will begin work on the technology merger as soon as the deal closes, both distributions will continue to be supported for at least three more years.
The merger demonstrates the maturity of the Hadoop distribution market as well as the pressure being brought to bear by the growth in cloud-based data management and storage. MapR, which offers its Converged Data Platform on premises and in the cloud, provides the primary alternative Hadoop distribution to the combined company. MapR and Cloudera both have data warehouse modernization and optimization solutions that, for example, advocate use of Hadoop and Spark platforms for ETL workloads.
Cloud computing is unquestionably the biggest change agent in the data management and data warehouse landscape. Vendors that have been prominent in the market for on-premises data warehousing and big data platform solutions are having to adjust fast to the surge of interest in cloud-based services.
Venture capitalists are excited about the market opportunity of cloud as a new platform for BI and data warehouse management; prime evidence of this was the massive $450 million growth funding invested in Snowflake Computing by Sequoia Capital and several other leading VC firms in October. Amazon, with Redshift, is perhaps the most prominent cloud-native competitor for data warehousing given the “data gravity” of its large share of the data storage and Web services market. Google, with BigQuery, has a growing presence, and major platform vendors such as IBM, Microsoft, Oracle, and SAP are also players in the cloud-based data warehousing arena.
Although the Hadoop distribution marketplace may be settling down, that is not to say that technology development in the open source Hadoop and Spark ecosystem is slowing.
Over the past decade, the explosion in Apache Hadoop and Spark ecosystem technologies and frameworks for managing, integrating, and interacting with data in data lakes and hubs has given organizations many options for supporting data science, advanced analytics, streaming, and AI. Only recently, however, have SQL-on-Hadoop and Hadoop- or Spark-native technologies matured to where organizations can offer BI users direct access to data on these systems. TDWI research finds that most access is provided through ODBC or JDBC connectors or through an intermediate layer, such as a data warehouse.
From the beginning of the Hadoop revolution, some business and IT leaders saw data lakes based on Hadoop (and more recently, Spark) clusters as potential replacements for traditional data warehouses that they deemed too limited in scale, performance, and scope to handle the big data tsunami, at least at an affordable price. Other organizations wanted to retain their traditional data warehouses to support existing BI and analytics workloads but develop and deploy data lakes and hubs to handle new data science and advanced analytics workloads. In other words, they preferred to augment their data warehouse by building out a multiplatform architecture.
TDWI recently surveyed organizations about their plans to augment or replace existing data warehouses (the full research will be published in an upcoming Best Practices Report). We found that the majority of organizations do not want to stand pat; about two-thirds (65 percent) of the 232 organizations surveyed want to either augment or replace their existing data warehouse with technology solutions centered on a data lake or hub built with on-premises Hadoop or Spark clusters or systems based in (or native to) the cloud.
In the research, we examined which technology systems or cloud services are in most organizations’ plans when it comes to augmenting or replacing their existing BI, analytics, and data warehousing systems. Cloud-based solutions figured prominently but were not overwhelmingly dominant.
The highest percentage (38 percent) of respondents are planning to use cloud-native BI systems: that is, solutions fully residing in the cloud and built for cloud-based workloads. Nearly a third (31 percent) plan to deploy a cloud-based data lake (compared to 19 percent who plan to deploy an on-premises data lake) and 21 percent are going to go with a cloud-native data warehouse. For on-premises systems, just over one third (35 percent) plan to deploy a new analytics platform and 28 percent plan to go with a Hadoop- or Spark-native BI platform.
Augmentation: Keeping Objectives Clear
Strategies for deploying emerging technologies such as big data platforms, cloud-native services, and AI and machine learning will be the central focus of the TDWI Leadership Summit coming up in Orlando (November 11-12, 2018). If you plan to augment your existing data warehouse — or develop a new one to complement a data lake, hub, or cloud data storage — be sure you have a clear idea of the role of each part of the expanded, multiplatform data architecture.
Three critical areas are:
- Data quality and consistency. Many BI reporting and analytics use cases depend on good data. Organizations can position the data warehouse as the repository of carefully curated data to complement data lakes and data storage that serve as collection points for the mass of data ingested from multiple sources.
- Governance and security. Organizations need to protect sensitive data, especially to adhere to data privacy regulations. The data warehouse can serve as the system of record for sensitive data; administrators can load sensitive data into the warehouse, then guard access to this selected data centrally rather than try to do so across data lakes and other repositories in the architecture.
- Stewardship and recommendations. An important BI trend is to provide users with automated recommendations, usually for data selection but also for selecting filters, visualizations, and even analytics models. As administrators and/or automated, “smart” software learn user preferences, they can use the data warehouse to position commonly used data for easier access.
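One way to make these roles concrete is to write them down as an explicit placement rule. The following is a minimal sketch in Python, not a reference to any vendor's product or API: the `Dataset` fields, the `place` function, and the example data set names are all hypothetical, and the rule itself (anything sensitive, curated, or commonly used goes to the warehouse, everything else stays in the lake) is a simplifying assumption based on the three areas above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical catalog entry; the field names are illustrative assumptions."""
    name: str
    sensitive: bool = False        # subject to data privacy regulations
    curated: bool = False          # cleansed and validated for BI reporting
    frequently_used: bool = False  # commonly requested by BI users


def place(ds: Dataset) -> str:
    """Return 'warehouse' or 'lake' for a data set under the assumed rule."""
    # Sensitive data is guarded centrally, curated data backs reporting,
    # and commonly used data is positioned for easier access.
    if ds.sensitive or ds.curated or ds.frequently_used:
        return "warehouse"
    # Everything else stays at the collection point.
    return "lake"


# Hypothetical catalog used only to exercise the rule.
catalog = [
    Dataset("customer_pii", sensitive=True),
    Dataset("monthly_revenue", curated=True),
    Dataset("top_dashboards_extract", frequently_used=True),
    Dataset("clickstream_raw"),
]

placement = {ds.name: place(ds) for ds in catalog}

for name, target in placement.items():
    print(f"{name}: {target}")
```

In practice the criteria would come from a data catalog or governance policy rather than hand-set flags; the value of writing the rule down is that each part of the multiplatform architecture gets a stated, testable role.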
Data warehousing is not dead, but it is changing as new technologies — running the gamut from scalable, high-performance platforms and better development and administration tools to AI and machine learning — have their impact. The challenge will be to embrace the new technologies and cloud services and position the data warehouse for optimal benefit.