Imagine the scenario. You are a marketing manager, your primary BI tool is Excel, and you want a list of customers to focus on in your next upsell campaign. You are happily slicing and dicing the data available in your spreadsheet when it occurs to you that the company captures other data about your customers that would help you understand them better. This data is not showing up in your spreadsheet today, so what do you do?
You call your contact in the IT organisation and tell him your woes. He’s a friendly guy, so he does what he does and puts an updated version of your original spreadsheet, now including the new data, on a Google Drive. Now you can test your hypothesis and potentially make better decisions about which customers are good prospects for your new products.
Is this Self-service BI?
Needing to call your IT guy to move data from somewhere into a spreadsheet, which you can then access and exploit yourself, doesn’t really feel like self-service. In fact I would go so far as to call this “IT-Service,” simply because the IT function had to service the business requirement.
Now imagine that we scale this up. Instead of a single marketing manager, we have hundreds of business users, data scientists and other consumers of data who need access to new or different data. We now need to operate a queuing system in IT to support these requests. The IT folks become less friendly, the requests take longer to fulfil, and the business users become frustrated. Finger pointing begins, and the business decides to operate outside of IT, bypassing data security, governance and controls. ‘Shadow IT’ is initiated, with all the issues that follow: data security could be compromised, there is no centralised definition of data, everyone does their own thing, and nothing good is shared.
This is not a new story – I have seen this first hand for the last two decades. At its root is the manual movement of data.
To counter this, organisations created the Enterprise Data Warehouse (EDW) on database technologies such as Oracle and Teradata, a place where everyone could find data for their reporting needs. These worked for a while, but because we were concerned about the cost of data storage, we structured our data to remove any duplication. There are shelves of books on the best practices for doing this, but fundamentally these EDWs were far too complex for end users to exploit. These complex data structures meant that when a business user pressed a button, the system needed to join a few, a dozen or tens of tables. Doing that much processing at query time inevitably produced slow response times, complaints from frustrated end users, and eventually ‘Shadow IT’.
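To make the join problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The four tables and their contents are hypothetical, invented purely for illustration; a real EDW would have far more tables and far more rows. Even this toy "sales by region and product" question needs three joins.

```python
import sqlite3

# Hypothetical, heavily simplified normalised schema (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region   (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT,
                       region_id INTEGER REFERENCES region);
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sale     (sale_id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer,
                       product_id INTEGER REFERENCES product,
                       amount REAL);

INSERT INTO region   VALUES (1, 'North'), (2, 'South');
INSERT INTO customer VALUES (1, 'Acme', 1), (2, 'Globex', 2);
INSERT INTO product  VALUES (1, 'Basic'), (2, 'Premium');
INSERT INTO sale     VALUES (1, 1, 1, 100.0), (2, 1, 2, 250.0), (3, 2, 1, 80.0);
""")

# One simple business question, three joins. Every extra attribute the
# user wants (segment, channel, calendar, ...) tends to add another join.
SALES_BY_REGION_AND_PRODUCT = """
SELECT r.region_name, p.product_name, SUM(s.amount) AS total
FROM sale s
JOIN customer c ON c.customer_id = s.customer_id
JOIN region   r ON r.region_id   = c.region_id
JOIN product  p ON p.product_id  = s.product_id
GROUP BY r.region_name, p.product_name
ORDER BY r.region_name, p.product_name
"""

rows = conn.execute(SALES_BY_REGION_AND_PRODUCT).fetchall()
for row in rows:
    print(row)
```

On a handful of rows this is instant; the point is only the shape of the query that sits behind a single button press.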
To resolve the performance and complexity issues, data marts and OLAP cubes were designed and created. Again, there are shelves of books on the best practices for doing this, but essentially these involve a duplicate of the data from the EDW that is moved and structured to simplify access for business users. This approach worked for a while, until a business user asked for other data to be included in the data mart or cube, or the data volumes that needed to be moved and restructured became too large for our technology choices. We have seen this pattern before: end user queries run slow, business users get frustrated, and eventually Shadow IT is initiated.
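A rough sketch of that trade-off, again with Python's sqlite3 and hypothetical tables: the mart is a flattened copy that answers its original question without joins, but a question about an attribute that was never copied across fails until someone rebuilds the mart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical source tables standing in for the EDW.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT,
                       segment TEXT);
CREATE TABLE sale     (sale_id INTEGER PRIMARY KEY, customer_id INTEGER,
                       amount REAL);
INSERT INTO customer VALUES (1, 'Acme', 'Enterprise'), (2, 'Globex', 'SMB');
INSERT INTO sale     VALUES (1, 1, 100.0), (2, 1, 250.0), (3, 2, 80.0);

-- The "data mart": a flattened, pre-joined copy built for one use case.
CREATE TABLE sales_mart AS
SELECT c.customer_name, SUM(s.amount) AS total_amount
FROM sale s
JOIN customer c ON c.customer_id = s.customer_id
GROUP BY c.customer_name;
""")

# The original question is now a simple, join-free query.
mart_rows = conn.execute(
    "SELECT customer_name, total_amount FROM sales_mart ORDER BY customer_name"
).fetchall()
print(mart_rows)

# A new question needs 'segment', which was never copied into the mart.
try:
    conn.execute(
        "SELECT segment, SUM(total_amount) FROM sales_mart GROUP BY segment"
    )
    new_data_available = True
except sqlite3.OperationalError as exc:
    new_data_available = False
    print("Back to IT:", exc)
```

The failed query is the moment the business user picks up the phone to IT again.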
So to the rise of Hadoop. The promise of cheap data storage and data processing. Now all our problems would be solved! We could store all the data in a simple-to-use format for the end users, performance would never be an issue again, and Shadow IT would be banished, never to return.
Whatever your feelings on Hadoop, it’s no longer the popular kid in the playground. The fact that there is only a single Hadoop vendor in the marketplace speaks volumes. The reasons are likely to be manifold, but in my experience it’s just too complicated to own, manage and operate, not to mention that the inbuilt SQL engines were originally designed to support batch processing. So running hundreds of queries at once and expecting a response in a couple of seconds is unrealistic.
Another approach to resolve this performance problem is through the use of “Self Service BI tools,” which became more and more popular. These tools tend to be from the “small data era,” and were originally designed about 20 years ago. Their approach is simply to move all the necessary data from Hadoop and structure it in their own “caches, servers, and data stores”.
Now remember we are on Hadoop and the data is bigger. Moving data not only takes time and effort to manage; you also cannot move all of it (you would need another cluster similar in size to the Hadoop one). So the end users will need to rely on IT to make changes when they need other data.
This seems like a familiar pattern. Remember what happened previously? It feels like we have just reintroduced “IT-Service,” with our old friend “Shadow IT” eventually making a comeback when end users become frustrated at their inability to respond and react quickly to the needs of their business.
But, hang on, now we have the Cloud. That’s the answer to all our problems, right?
Unlimited data storage and data processing! Now all our problems really are solved. That Hadoop thing was difficult to manage, but now someone else is doing that for us. Also, Cloud is cheap, right?
Before reading on, it’s worth Googling something called “cloud bill shock”. This is a real thing. Unlimited computer processing with minimal management costs is cool, but someone does have to pay for it. So we have replaced end users waiting for queries to run with end users paying a lot of money for queries to run. We have yet to really understand the impact of this on organisations and working culture, but the “Self-Service BI Tools” are back. “Reduce your cloud cost,” they say, “by moving the data into our cache/service/server”. Feel familiar? Guess what comes next?
So how do we stop making the same mistakes? Firstly let’s really qualify the term “Self-Service BI” when someone says it. What does it mean? What happens if I need new/different data in my analysis? If data needs to move, this is a red flag. This is not self-service. This is not a good path to tread, for we know what lies down this path, we have been here before.
So what’s the alternative?
How about an approach that enables end users to serve themselves? To get new data themselves? To get performance without having to move data or manually optimise it?
How about a combination of hiding the complexities of data access from the end user, whilst autonomously ensuring that the data structures in the underlying data systems are optimised for the end users? How about doing this in a controlled and governed way?
How about a situation where IT can serve the role of providing comprehensive data services that are secured, and relevant to the business whilst data optimisation happens autonomously?
Sounds like a way to get off the path to “Shadow IT”.
If you want Self-Service BI, you need AtScale. To get a demo, go to: www.atscale.com/demo.