The Hidden Elephant in 'Big Data' Modeling - by Len Silverston
"We must understand the data and therefore continue to develop data models, even in this 'Big Data' era," exclaimed the data warehouse lead.
"You traditional guys! We are in a new age and we have new technologies and tools that allow us to immediately store tremendous amounts of data and use it, without the ridiculously long and expensive cycles of data modeling and design. Wake up and start living in our new world!" said the data scientist.
"If you use the data without first understanding its nature, you open up huge risks of misrepresenting it, acting on it wrongly, and making bad decisions based on misunderstood information. Just because Big Data is here does not mean the need for modeling and governing data has gone away."
"Are you trying to control our environment, slow us down, hamper progress, and impose the same old bureaucracies again?!"
There is an elephant in the room with the ‘Big Data’ movement. The elephant is perceived separation. We have always had various factions and 'camps', which are the root cause of data silos. Data silos come from people silos. The way that people think directly influences how integrated or disintegrated our systems are. For example, when an application team believes that the data it creates is primarily its own (I call this data 'mine'ing, as when people say 'this data is mine')[i], the result is separate, disintegrated data stores. Now we have another form of this perceived separation, namely between the traditional data management world of data warehouses, master data management, application databases, and other structured environments and the new 'Big Data' world of unstructured[ii], NoSQL (not only SQL) data stores and analysis.
The key to effective data management is integration. Integration means ‘combining into a whole’ while disintegration means ‘separating into parts’ or ‘crumbling’. In a disintegrated world, data is more separated, less useful, and not as powerful an asset. Therefore, it is important to recognize how structured and unstructured data environments can be used together and how the various sub-organizations and groups that work with structured and unstructured data can collaborate for the greater good. All these groups are involved in the organization’s data; however, if they act in silos there will be inconsistent semantics, communication issues, and misunderstandings that can cause major problems such as misreporting, an inability to see the whole picture, and less efficient ways of conducting business. We need to be able to integrate, or in other words, holistically combine various aspects of data management. We need to be able to collaborate between various groups to accomplish our objectives. This includes integrating structured, relational database management environments with unstructured, Big Data environments such as Hadoop, graph databases, document-object databases, key-value stores, and other types of database storage.
Some believe that data modeling is not as important in the Big Data arena. However, understanding the nature of the data is actually more important there, given how much more complex the space is. Modeling is also important in facilitating common semantics and communication. Modeling is the art of describing something to better understand it, for example, a model of an airplane or a model of a building. Thus, ‘modeling’ of data in our complex world is important since, if we don’t understand the data (and often we don’t), this resource can not only be less useful but can also cause damage. For example, we could misrepresent and misuse sales figures, customer sentiments, or other data if it is not properly understood and communicated with a shared vocabulary. A simple example is a customer report in which some people count prospects as customers and others do not, so the same report yields different numbers.
Thus, there is still a critical need for data modeling. However, in this age of Big Data, there are other ways to perform data modeling that are more appropriate and effective.
How Does Data Modeling Change in a Big Data Environment?
The advent of many different types of NoSQL data stores now allows data to be loaded and explored without first modeling it. There are now many graph databases, document-object databases, key-value stores, and other types of database storage where data can be loaded without having a database design in place first. Some call this ‘Schema on Read’, where a model can be applied after the data is read, versus ‘Schema on Write’, where the database schema needs to exist before data is loaded.
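The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard-library sqlite3 module stands in for a ‘Schema on Write’ store, a plain list of JSON strings stands in for a ‘Schema on Read’ store, and the customer table, its fields, and the sample records are hypothetical.

```python
import json
import sqlite3

# Schema on write: the table definition must exist before any row is loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (1, "Acme"))


def load_with_unknown_column():
    """Try to load a field ('segment') that the schema does not define."""
    try:
        conn.execute(
            "INSERT INTO customer (id, name, segment) VALUES (?, ?, ?)",
            (2, "Globex", "retail"),
        )
        return True
    except sqlite3.OperationalError:
        # The store rejects the row because 'segment' was never modeled.
        return False


write_accepted = load_with_unknown_column()

# Schema on read: raw records are stored as they arrive (hypothetical data).
raw_store = [
    '{"id": 1, "name": "Acme"}',
    '{"id": 2, "name": "Globex", "segment": "retail"}',
]


def read_customers(raw_records):
    """Apply a hypothetical 'customer' model only when the data is read."""
    customers = []
    for line in raw_records:
        record = json.loads(line)
        customers.append({
            "id": int(record["id"]),
            "name": str(record["name"]),
            # Fields missing from a record get a default at read time.
            "segment": record.get("segment", "unknown"),
        })
    return customers


customers = read_customers(raw_store)
```

In this sketch the relational store refuses the record with the unmodeled field, while the raw store accepts both records and the model is imposed later by read_customers. The model has not disappeared; it has moved from load time to read time.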
Thus, in Big Data environments, the paradigm has changed from:
- Model, load, query (explore)
to:
- Load, query (explore), and model.
Thus, in an unstructured environment, it is not that we don’t model; it is that we model after loading and exploring the data first. Also, we don't have to model everything. Modeling can be applied if and when needed, for instance, when we think there is a need to put some of this data into a data warehouse or simply to understand it better.
This has the tremendous advantage of allowing the data to be explored without the up-front overhead associated with data modeling. It also allows for much more flexibility: because the data is not constrained by a fixed database schema, a change in the data feed does not force another round of data modeling and database design before the data can be loaded.
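The load, explore, and model sequence described above can be sketched as follows. This is a minimal, hypothetical example rather than a prescribed method: the order feed, the field names, and the rule of treating a field as part of the core model only when every record carries it are all assumptions made for illustration.

```python
import json
from collections import defaultdict


def load(raw_lines):
    """Load: parse each record as-is, with no predefined schema."""
    return [json.loads(line) for line in raw_lines]


def explore(records):
    """Explore: profile which fields occur, how often, and with what types."""
    profile = defaultdict(lambda: {"count": 0, "types": set()})
    for record in records:
        for field, value in record.items():
            profile[field]["count"] += 1
            profile[field]["types"].add(type(value).__name__)
    return dict(profile)


def model(profile, total, min_coverage=1.0):
    """Model: keep the fields present in at least min_coverage of the records."""
    return sorted(
        field
        for field, stats in profile.items()
        if stats["count"] / total >= min_coverage
    )


# Hypothetical feed: the second record carries a field the first one lacks,
# as might happen when the feed changes over time.
feed = [
    '{"order_id": 1, "amount": 25.0}',
    '{"order_id": 2, "amount": 40.5, "channel": "web"}',
]

records = load(feed)
profile = explore(records)
core_fields = model(profile, len(records))
optional_fields = sorted(set(profile) - set(core_fields))
```

Here the new channel field loads without any redesign, and the profile produced by explore gives the modeler evidence for deciding later whether that field belongs in a more formal model, for instance one destined for a data warehouse.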
How to Collaborate
This requires a huge change in mindset and paradigm. When different groups think quite differently, there is often separation. The challenge is for the various groups to operate collaboratively rather than separately.
Some key collaboration principles are:
- Have a common, shared vision and purpose: In both unstructured and structured environments, there is a need to use the data of the enterprise, and it is important to recognize the common vision that both groups may have, for example, to analyze data in order to help serve the enterprise.
- Understand motivations: People often have underlying motivations that aren’t always apparent. A technique called motivational modeling helps uncover the real wants and needs of the key players involved. By understanding each other’s real needs and wants, we can work together more effectively.
- Develop trust: Trust is earned through ‘Character’ and ‘Competence’. When we are willing to ‘get outside of ourselves’ and work on collaborating, we demonstrate character. We then look for small, incremental wins that show actual results and thus demonstrate competence.
- Communicate effectively: An important part of communicating is letting go of any agenda and just listening for a while. This can help bridge paradigm differences. It is very useful for each group to develop some understanding of the other’s expertise, which helps each group ‘cross to the other side’.
- Manage conflict: There are very useful frameworks for managing conflict, such as William Ury’s five-step conflict management approach[iii]. A key concept from this framework is to first recognize when we are reacting. For example, there may be some fear in moving to new ways of doing things, so it is important to take a step back, be objective, and then, instead of ‘reacting’, find a way to ‘respond’ intelligently.
About the Author:
Len Silverston is a best-selling author, consultant, and a fun and top-rated speaker in the fields of data modeling, data governance, and human behavior in the data management industry, where he has pioneered new approaches to effectively tackle enterprise data management. With over 30 years of experience as a data management consultant helping organizations worldwide, he is well known for his work on "Universal Data Models", which are described in his series The Data Model Resource Book.
This topic of how to model data collaboratively in a Big Data environment was discussed in more detail in the Embarcadero-hosted webinar by Len Silverston on “The Key to Big Data Modeling: Collaboration”. Register here to watch the on-demand session.
[i] This concept is further defined in the article 'Data Mining Versus Data Oursing' at www.universaldatamodels.com/publications/articles
[ii] By the way, 'unstructured' is a bit of a misnomer since all data that we use has some structure. For example, we may use a meta-tag defining what is in the field, otherwise we would not be able to make sense of the data at all. Unstructured data is mainly used to distinguish from data that is more defined including showing the relationships between data, for example as in a relational data structure.
[iii] From “Getting Past No: Negotiating with Difficult People” by William Ury