When contemplating the significance of powerful “schema-less” NoSQL databases and Big Data environments, architects may wonder: do schemas still matter? Even if you’re fairly sure they do, there’s plenty of scope to wonder how, why, and under what circumstances. How should schemas figure into your information management strategy going forward?
What is a schema? Wikipedia has a broad, general definition, but within the community of people who care about Big Data, data-in-motion, and data modeling, the term typically covers database schemas, message schemas (XML or JSON), and other specialized kinds of schemas for data management, such as Avro. Bottom line: this kind of schema is a machine-readable model of the data you need to store, access, or move, including the rules and constraints that define what makes a valid data instance.
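To make that definition concrete, here is a hypothetical JSON Schema fragment (the field names and limits are illustrative, not from any particular system) that captures both the structure of a record and the rules for a valid instance:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "SalesRecord",
  "type": "object",
  "required": ["region", "fiscal_year", "amount"],
  "properties": {
    "region":      { "type": "string" },
    "fiscal_year": { "type": "integer", "minimum": 2000 },
    "amount":      { "type": "number", "minimum": 0 }
  }
}
```

A validator can reject any instance that lacks a required field or violates a constraint – exactly the “rules on what makes a valid data instance” described above.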
The benefits of the schema-less approach are nicely summarized by Gary Bhattacharjee of Morgan Stanley in Forbes (May 2012): “The way it has typically been done for 20 years is that IT asked the business what they want, creates a data structure and writes structured query language, sources the data, conforms it to the table and writes a structured query. Then you give it to them and they often say that is not what they wanted. Since Hadoop stores everything and it is schema-less, I can carve up a record or an output from whatever combination the business wants. If I give the business 10 fields filtered in a certain way and they want another 10, I already have those and can write the code and deliver the results almost on demand.”
Schemas Play Multiple Roles
Part of the potential for confusion comes from the way the industry overloads the term “schema”: there are multiple kinds of schemas, for storage, for query, and for data-in-motion. A “schema-less” data store does not require you to design a schema before you can store any data in it, the way a relational DBMS does. Yet even if an application uses no explicit schema when it accesses data from a “schema-less” data store, the application still has an implicit schema, defined by the structures it uses to manipulate and present that data to people or other applications (as via an API). More often, though, NoSQL stores also offer a query schema as a convenience to the developer.
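A minimal Python sketch (field names are hypothetical) shows how an “implicit schema” surfaces in application code even when the store itself enforces nothing:

```python
# A "schema-less" document store happily accepts any shape of record.
raw = {"customer": "ACME", "total": "1200.50"}  # as fetched from the store

# But the application's access code defines an implicit schema:
# it assumes a 'customer' string and a numeric 'total' are present.
def present_invoice(doc):
    return f"{doc['customer']}: ${float(doc['total']):,.2f}"

print(present_invoice(raw))  # works only while records match that implicit shape
```

The store imposed no structure, but the moment `present_invoice` runs, the record had better conform to the shape the code expects.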
There are a number of expert resources you can consult to better understand the finer points of schema vs schema-less approaches:
- As is often the case, Martin Fowler (Chief Scientist of Thoughtworks) offers one of the best and most complete combinations of explanation, debunking, and guidance.
- Even passionate advocates of the schema-less approach like Edward Capriolo (Data Architect at The Huffington Post) can see some scenarios where schemas come in handy.
- IBM offers some sage guidance on the pros and cons of schema-on-read vs schema-on-write.
- Pete Stiglich, a Principal at Clarity Solution Group, did some great analysis when he was a Sr. Technical Architect at Perficient.
- Merv Adrian, an analyst at Gartner, has been carrying on an interesting discussion of this topic on Twitter with several other Big Data experts.
The Evolution Of “Schema-Less” Approaches
Given this context, what’s interesting to me today is the way some large firms we work with at digitalML are evolving their thinking about schemas, where Big Data is concerned. I’ll summarize the thought process in this simplified way:
- Yes, it’s easy and quick to just dump a lot of “raw” data (in the format it has from whatever source provides it – files, tables, documents, streams) into your data lake (shorthand for a Big Data store of some kind), and that may well be the best approach for certain use cases such as entity analytics.
- However, if your purpose in filling your data lake is to build a general-purpose resource for business analytics and BI, you may find that the work required to transform all that raw data into a usable form is so great that, if you must repeat it every time you access the data, the cost becomes too high to deliver reasonable performance.
- Still, if your analytics use cases vary so widely that each has unique requirements for viewing the data, then the benefit of trying to define schemas, even if only for read, is limited. Schemas are most useful when they represent a view of data shared by multiple business stakeholders, where an agreed format matters to achieving business goals, such as a more consistent way to analyze sales results from around the globe, or interoperability between various parts of the application landscape or partner ecosystem.
- Schemas can also play an important role in data quality (see Figure 1). Unlike entity analytics use cases which favor keeping the data raw (so as not to remove the very data that may reveal fraud), most business analytics use cases benefit from using schemas to improve data quality, which may require schema-on-write to be part of the picture.
- You can do all-of-the-above, if you want – store the raw data, then do the analysis through a particular schema-on-read, then write that data back, a bit like caching a materialized view in the old DBMS world. Or if appropriate, you can transform raw data to conform to your desired schema on write when you first load data into the lake, thus making that the easiest schema to support on read – but not the only one.
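The “all-of-the-above” pattern described in the last bullet can be sketched in a few lines of Python (record shapes are hypothetical; a real lake would use Spark, Hive, or similar):

```python
# Raw records land in the lake exactly as their sources produced them.
lake = [
    {"src": "emea", "rev": "1200", "cur": "EUR"},
    {"src": "amer", "revenue": 950, "currency": "USD"},
]

# Schema-on-read: conform each raw record to an agreed view at query time.
def conform(rec):
    return {
        "region":   rec.get("src", rec.get("region")),
        "revenue":  float(rec.get("rev", rec.get("revenue", 0))),
        "currency": rec.get("cur", rec.get("currency", "unknown")),
    }

# Optionally write the conformed records back -- like caching a materialized
# view -- so later readers get the agreed schema without redoing the transform.
conformed_view = [conform(r) for r in lake]
lake.extend(conformed_view)
```

The raw records remain untouched for use cases that need them (entity analytics, say), while the conformed copies make the agreed schema the cheapest one to read.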
What all this means is that schemas become a tool for managing your data lake for best overall capability, performance, and cost, shaped by your mix of use cases.
Schema Evolution Adds Power To Your Toolkit
One of the things that makes this new approach to using schemas more feasible and flexible in the Big Data world is the emergence of technologies like Avro, Protocol Buffers, and Thrift. Avro is particularly interesting because it serializes data in a very compact binary representation, while offering the same level of schema flexibility as name/value pairs. In theory, each instance can be stored with a different schema; in practice, schemas typically change less often – such as a different schema for each fiscal or tax year.
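For instance, a hypothetical Avro schema for one year’s sales records (the field names are illustrative) is itself just a JSON document:

```json
{
  "type": "record",
  "name": "SalesResult",
  "namespace": "example.sales",
  "fields": [
    {"name": "region",      "type": "string"},
    {"name": "fiscal_year", "type": "int"},
    {"name": "amount",      "type": "double"},
    {"name": "channel",     "type": ["null", "string"], "default": null}
  ]
}
```

A later year’s schema could add further fields (with defaults), and Avro’s schema-resolution rules would let a single reader schema span data written under both versions.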
In such an approach, my data lake could contain sales results for multiple years, each stored with all the information available each year and in each geography, no matter how much those differ between regions or across years. But then comes the magic – I can analyze that data as if it all has the same schema, using Avro’s built-in support for schema evolution.
Of course it’s not really magic, so there are some things to bear in mind, but if you use it correctly, you really can have the kind of schema flexibility you need for a proper data lake containing multiple kinds of fish, while easing access to that diverse data for BI use cases that used to run on a data warehouse with a single schema. Among the things to consider:
- If an app reads and updates a record using a schema version that is missing one element, that element will be written back with its default value, even if another app previously stored a real value for it using a different version of the schema. Therefore you need to be careful about which processes update which records in the lake – only operate on the right fish.
- As you evolve the schema, you need to deal with change in particular ways to preserve the ability to sensibly synthesize a workable schema on read (which Avro will do quite nicely, automatically, if it can): adding elements is easy, deleting them is tricky, renaming them is verboten, as is changing their type (instead, add the element with the new type, and deprecate the older version of that element for new uses).
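A pure-Python sketch of Avro-style schema resolution (this is not the Avro library itself, and the field names are illustrative) shows why adding an element with a default is easy while renaming one breaks readers:

```python
MISSING = object()  # sentinel: the reader schema declares no default for a field

def resolve(record, reader_fields):
    """Project a written record onto a reader schema, Avro-resolution style.

    reader_fields maps field name -> default value, or MISSING if the
    reader schema provides no default for that field.
    """
    out = {}
    for name, default in reader_fields.items():
        if name in record:
            out[name] = record[name]   # the writer supplied this field
        elif default is not MISSING:
            out[name] = default        # fill it in from the reader's default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out

# Records written under an older schema, before 'channel' existed:
old = {"region": "EMEA", "amount": 1200.0}

# A newer reader schema adds 'channel' with a default -- old data still resolves:
resolve(old, {"region": MISSING, "amount": MISSING, "channel": "unknown"})

# But renaming 'region' to 'geo' breaks: no writer value, no default.
# resolve(old, {"geo": MISSING, "amount": MISSING})  # raises ValueError
```

This is the essence of why the rules above matter: the reader can only synthesize a workable view when every field it needs either exists in the written data or carries a default.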
These are but a couple of the more interesting constraints, but they are representative of the kinds of things you have to do to make use of Avro schema evolution. Still, considering that schema evolution used to be available only through particular advanced DBMSs that supported it (e.g. Prism), Avro provides a much simpler and more democratic approach today, meaning a far wider range of use cases can actually make use of it.
Yes, Schemas Matter
So it’s clear that schemas will continue to play an important role in most firms’ information management strategies going forward, even as their role continues to evolve. I think it’s safe to predict, given the way this is playing out, that:
- There will be more and more schemas to manage. All this flexibility comes with the burden of more schema management effort, if you want to take advantage of these capabilities.
- A lot of the effort involved will be keeping track of the versions of the schemas, the differences between them, and the mappings between data in its various formats (raw, normalized, Avro schemas, JSON schemas, etc.).
- Another area of effort will be managing the contracts of interfaces for APIs and services that use those schemas, including the emerging class of data adapter services that exist to move data in and out of your data lake.
As it happens, this is the sweet spot for the ignite Service Design Platform, which has been helping the world’s largest firms manage large populations of schemas for years, while rapidly evolving to support emerging types of schemas such as JSON and Avro. As more firms find themselves needing to manage large populations of schemas, ignite will become an essential component of more firms’ approach to information management.