In “Canonical Models Should Be A Core Component Of Your API Strategy,” we explored why your canonical model is so critical to future success with APIs and services. The answer centers on the need for business interoperability among the various constituencies producing and consuming APIs and services. The more widely used you intend your API to be, the more compelling the case for a canonical model.
Firms implementing Big Data solutions do so for many reasons, aiming to:
- Gain deeper insights into customers, sales, and other activity. Rather than throwing away most data (e.g. POS data), Big Data aims to retain more data, including content, to feed analytical models that uncover previously unknown trends and insights. For example, one large retailer analyzed POS data and discovered regional patterns in sales of ladies’ shoes, which it leveraged to increase same-store sales by 15% or more.
- Save money and speed access to data insights. Large firms have moved analytics workloads from data warehouses onto Hadoop-based Big Data infrastructure at one third to one fifth of the cost (or better). These results are leading firms to dramatically shift analytics investments toward Big Data platforms, which are not only cheaper, but also more flexible. However, gaining access to these savings requires the right talent and tools to fully exploit Big Data technologies, e.g. Hadoop, NoSQL, Avro…
- Provision a Web scale data platform for the API & Data Economy. If your firm is serious about exploiting the API and Data Economy, you’ll need a platform to suit. The NoSQL foundation of Big Data platforms has the scale-out architecture you’ll need, whether on premises or in the cloud.
All this goes to show that your Big Data strategy is really just the latest update to your overall information management strategy. Don’t think of Big Data as a special case with unique rules; recognize that it is the logical evolution of your data platform, taking advantage of the compelling price/performance and flexibility of new Big Data platforms.
Business Interoperability Matters For Big Data, Too
Given this understanding that Big Data is really just part of the modernization of your information management strategy, it stands to reason that if you care about business interoperability in other parts of your “information stack” – and you do – then you likely care about it for Big Data, too. However, the more dynamic and free-flowing nature of Big Data brings unique new challenges and opportunities to your approach to business interoperability:
- Unlike a data warehouse, Big Data does not have a fixed data model. That’s the beauty of Big Data, that your “data lake” can contain many different kinds of fish, and that you don’t need to worry about which kind of fish it is until you reel it in and use it. A related factor is schema evolution, one source of Big Data’s flexibility – think of it as fish (data instances) being able to evolve quite rapidly, as new information becomes available.
- Big Data blends data-at-rest with data-in-motion. What some call “Big Fast Data,” the streaming of flexible data at high rates, requires a range of processing techniques and technologies that only partially overlap with the more batch-oriented Big Data technologies associated with uses like sales analytics – although in some cases, even sales analytics benefit from near-real-time processing. Just as fish move between lakes and rivers, business data has a lifecycle that it’s important to understand across your data landscape.
- Your Big Data strategy competes for resources, too. An overly simplistic view of the concept of a “data lake” suggests just throwing everything in there, without thinking too much about the business case for keeping it, or its use for analytics. Big Data’s extremely attractive price/performance makes many use cases feasible that could not previously be justified in the old data warehousing model, but it’s not free, and the talent to exploit it is often in high demand within and across the innovative firms that most need it.
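The schema evolution point in the first bullet above can be made concrete with a small sketch. It assumes a schema is applied only when a record is read, with defaults filling in fields that older records lack, in the spirit of how formats such as Avro resolve a reader's schema against older data. The field names, defaults, and the `read_with_schema` helper are hypothetical and purely illustrative; this is plain Python, not any particular library's API.

```python
# Hypothetical reader schema: field name -> default used when a record lacks it.
# Field names and defaults are illustrative assumptions, not from a real system.
READER_SCHEMA = {
    "store_id": None,
    "sku": None,
    "amount": 0.0,
    "loyalty_tier": "unknown",  # field added in a later schema version
}


def read_with_schema(raw_record, schema=READER_SCHEMA):
    """Apply the schema at read time: fill defaults, ignore unknown fields."""
    return {name: raw_record.get(name, default) for name, default in schema.items()}


# Two "fish" written at different times and stored as-is in the lake.
old_record = {"store_id": "S1", "sku": "SHOE-42", "amount": 59.0}
new_record = {
    "store_id": "S2",
    "sku": "SHOE-43",
    "amount": 75.0,
    "loyalty_tier": "gold",
    "coupon": "SPRING",  # extra field this reader does not know about yet
}

old_view = read_with_schema(old_record)
new_view = read_with_schema(new_record)
```

Both records can sit in the lake unchanged; only the reader decides which fields matter and what to assume when one is missing.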
One of the key concepts often associated with Big Data (but not unique to it) is moving from ETL (Extract-Transform-Load) to ELT (Extract-Load-Transform), or more accurately ELTLTLTLT: the data lake contains raw data in the form in which it was extracted, and many different transforms can then be applied to it for different purposes. This approach is one of the reasons for Big Data’s agility, as it is in some ways easy and quick to toss many different data artifacts into the lake without worrying about whether they all conform to a pre-defined schema. So how can we reconcile these ideas with the idea of business interoperability?
Easy: it’s all about data-in-motion. The business doesn’t care how you format the data in storage, but it does care about what it sees and how it can use it. One of the world’s largest energy firms found that it needed to filter analytics on raw Big Data from oilfield equipment through a canonical access layer before the business – and developers – could easily consume the analytics and make use of them. Otherwise, there were simply too many irritating and ultimately pointless variations in the raw data, given the highly variable data formats from oilfield equipment across all of North America.
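As a rough illustration of a canonical access layer, the sketch below leaves raw records untouched in the lake and maps them to one canonical shape only at the point of consumption. Everything in it is assumed for the example: the vendor formats, the field names, the canonical shape (`well_id`, `pressure_kpa`), and the helper functions are hypothetical and do not describe the energy firm's actual implementation.

```python
# Hypothetical raw readings from two equipment vendors; formats are invented.
RAW_LAKE = [
    {"vendor": "A", "well": "TX-001", "press_psi": 1450.0},
    {"vendor": "B", "WellID": "nd-77", "pressure_kpa": 9000.0},
]

KPA_PER_PSI = 6.894757  # standard psi-to-kPa conversion factor


def to_canonical(raw):
    """Map one raw record to the assumed canonical shape: well_id, pressure_kpa."""
    if raw["vendor"] == "A":
        return {
            "well_id": raw["well"].upper(),
            "pressure_kpa": round(raw["press_psi"] * KPA_PER_PSI, 1),
        }
    if raw["vendor"] == "B":
        return {
            "well_id": raw["WellID"].upper(),
            "pressure_kpa": round(raw["pressure_kpa"], 1),
        }
    raise ValueError(f"no canonical mapping for vendor {raw['vendor']!r}")


def canonical_view(lake):
    """Transform on read; the raw records in the lake are left untouched."""
    return [to_canonical(record) for record in lake]


readings = canonical_view(RAW_LAKE)
```

Consumers of `readings` see one set of field names and one unit, while other transforms remain free to work against the same raw records for different purposes.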
Yet for some use cases, the exquisite rawness of the data is actually a requirement to find the hidden meaning and connections in the data. For example, casinos and intelligence agencies share certain commonalities when it comes to the techniques they use for entity analytics.
What This Means For Your Big Data Strategy
Bottom line, in the wide and wonderful world of Big Data, there is no single approach being used everywhere for everything – it’s a highly variable environment, being used for a wide range of applications. But for some applications of Big Data, interoperability matters, especially at the point of usage/consumption.
As Big Data technologies reshape your data environment, look for opportunities to promote business interoperability, and to make it part of the landscape governed from your canonical model. digitalML expects Avro to play a key role in making this a reality, based on what we’ve heard from some customers already experimenting with it as part of their Hadoop environment. What do you think? Please respond via the comments.