Semantic Layer, Data Observability and more with Sanjeev Mohan, SanjMo Ex VP Gartner

Data-Driven Podcast

Listen to Sanjeev Mohan talk about Data Mesh, Data Fabric and the development of Data Products. You’ll hear about data consumers that need access to data and also about how data engineering often becomes a bottleneck. Sanjeev also shares his thoughts on Data Observability and the significance of the semantic layer.


Data has lived in its own silo for the longest time, owned by the data people. There is a reason why applications and infrastructure got more structured and data has taken its time: data is a different beast. Applications don’t change that often. When infrastructure changes, it’s a major event.

Data observability vendors collect these metrics, visualize them in some sort of time series graph, and do some sort of prediction of what your cost will be or what can break down the road. So they’re being proactive. It is not just notifying you that something broke, go fix it. But they do that too, so they have alerts and notifications. If something breaks in my very diverse, heterogeneous data and analytics ecosystem, how do I know where it broke? Doing that root cause analysis is one of the tasks that data observability products have.

Transcript

Dave Mariani: Hi everyone, and welcome to AtScale’s Data-Driven Podcast. Today’s special guest is Sanjeev Mohan, the principal of SanjMo. He has been an industry analyst, and a Gartner analyst, for years. I’m so happy that Sanjeev is on the podcast because we’ve got lots of questions for him, and we’re gonna have a really great discussion on the state of the industry for data and analytics and what the future looks like. So, Sanjeev, welcome to the podcast.

Sanjeev Mohan: Thank you, Dave. It’s such a pleasure. We’ve known each other for so many years, and we’ve met at different events. And now here I am on your podcast. Thank you so much for having me.

Dave Mariani: Yeah. You know, the data and analytics community is still small enough that you tend to know each other, so it’s really great to develop those relationships. I always tell anybody who’s young: always take that meeting, because you never know where it’s gonna go, and it usually pays off in the long run. So always take the opportunity to meet people and get to know them. Absolutely. So, Sanjeev, let’s talk about you for a minute here. Why don’t you tell listeners a little bit about your background and how you got to where you are today? Because you have a fascinating story.

Sanjeev Mohan: Thank you. I started my career, mainly in the US, at Oracle. I was part of the database team in the early 1990s. We really got very deep into the internals of Oracle and helped our initial customers. I have to say, at that time Silicon Valley was still growing up. It was a very young place. I enjoyed my time there, but then the dot-com boom happened, so I quickly jumped into the dot-com space. That was probably the most exhilarating time of our lives in terms of excitement, building out all these websites from scratch. We were on a mission to change the world, till the dot-com bust happened. So we went through that time. That was quite a journey. And I feel like we’re back to those times, now that data is so pivotal to everything that we do. It’s the foundation on which new companies are being built. You know, there’s a famous saying from Marc Andreessen: software is eating the world. I think data is eating the world now. Yeah.

Dave Mariani: I agree. So, Sanjeev, what happened after that? Because you’ve been a very, very big person in the industry. What happened after the dot-com boom and bust, and what got you more involved in being an industry expert?

Sanjeev Mohan: Yes, thanks for that, Dave. After the dot-com bust, I spent a few years in the consulting space, and then I decided to join Gartner. Gartner is an absolutely amazing place. It gives its people this platform to research topics. My background was database management systems. I’d been a DBA and a data architect. I had built data warehouses. And I remember it was about five years ago when I got on a Gartner client inquiry, and this guy says, “Well, how do I get ready for the GDPR?” And I’m asking him, “GD what? What’s GDPR?” I had never heard of such a strange acronym. And that opened up this whole world of data governance for me, which up to that point I had avoided to the best of my ability, because to me it was so confining.

Sanjeev Mohan: It was all about compliance and fines, Sarbanes-Oxley, that kind of thing. But all of a sudden, data governance became this way to expose all your data assets in a very secure, governed, curated manner. So I added data governance to my coverage, and I spent many years at Gartner covering this space, till July of 2021, when I decided that although I could have stayed at Gartner (it’s an amazing place, no doubts about it), I wanted to be independent so I could advise more people and have an even wider breadth of coverage. Now, what I do is not just DBMS and data governance. I’ve gotten into data observability, which is a brand new space. Hopefully I played some part in defining the data observability space. DataOps is super exciting. Semantic layers. There’s a whole variety of topics coming around in this expanding ecosystem: metrics layers, feature stores. So this is what I do now.

Dave Mariani: Yeah. So that leads me into: what should be top of mind for a data and analytics leader? I think you just rattled off a bunch of really important topics, so maybe we need to drill down a little bit on that. Let’s start with data governance, because you’re right that when it came to data governance, you could think of it as how to constrain users. But the constraint on making data available for consumption has always been the fear that you’re going to be exposing the wrong data to the wrong people. Correct. So if you do have a strong governance practice and a strong governance layer, then you can share more data, because you have more confidence that the data is only gonna be seen by the people who should be seeing it, right?

Sanjeev Mohan: Yeah. In fact, people say the sexiest job of the 21st century is data scientist. I think it is the people who do data governance. They’re the ones who are helping companies leverage data assets. The difference between a successful company and an unsuccessful company is that the successful company is leveraging the data and the insights that are hidden in plain sight. Mm-hmm. And to do that, you need data governance.

Dave Mariani: Yeah. And look, that’s a key piece of the semantic layer from my perspective, because there are two parts to it, right? There’s making data available to everyone in the tool where they live, so you don’t force them to learn new habits or new ways of querying data. That’s one. But then you’ve got to make sure that they get access to all the data that they should have access to. Traditionally, IT has been very suspicious about making data available because of those fears of a breach or of the data getting into the wrong hands. So I think that’s a really empowering thing, data for everyone. You’re right, Sanjeev. I’ve seen customers who really take that approach of “we’re gonna make data available to everyone who needs it,” and they’re much more successful than those that are trying to dole out access to data based on some arcane rules of who gets to see what.

Sanjeev Mohan: Yeah, agreed. IT was the keeper of all the technology at one point in time, but we’ve gone through some major inflection points in just the last few years. Cloud is one of them. Machine learning is another. So the businesses have so much more power now that they can sidestep IT and have a whole environment of their own. So now we are in a world where IT and business are two sides of the same coin. They’re completely aligned, or they should be aligned. Otherwise, the companies are not going to be able to get the most out of their data assets.

Dave Mariani: You know, Sanjeev, when it comes to the business and IT partnership, there’s a lot of talk out there about data mesh. And yes, when you were at Gartner, you guys were pushing data fabric. That was a big push in some of your research, right? Can you explain to the audience a little bit about the difference between data fabric and data mesh, and what we should care about as data and analytics leaders?

Sanjeev Mohan: So the first question is, you know, we always start with why. Why are we even talking about these things? And the reason is that we are now analyzing far more data than ever before. I was blown away by this IDC research that said that in the next two years, 750 million new apps will be developed, which is more than the number of apps developed in the last 40 years.

Dave Mariani: Wow.

Sanjeev Mohan: So this is the amount, the desire, of data consumers that need to access data. And IT, or data engineers, become sort of the bottleneck in this whole process. The idea behind data mesh and data fabric is: how do we bring agility into developing data products? Data mesh is an organizational concept, and data fabric is a technology implementation. So they are literally apples and bananas. People compare them, but you can’t really compare them. What data mesh says is that data engineers are hard to find, and you cannot constantly keep going to the same data engineers and saying, “I need this new data product delivered to me now, or yesterday.” So decentralize your data environment into domains. Since every business domain is the expert on its business semantics and the meaning of its data, they’re the ones who should develop the products and make them available through data sharing or some other mechanism.

Sanjeev Mohan: So that’s data mesh. What data fabric says is that we’ve had a profusion of best-of-breed technologies. There are just so many best-of-breed technologies. So integrate them in a way that you’re spending less time trying to stitch together different components. Not only that, data fabric becomes a data integration pattern. Mm-hmm. You put in a knowledge graph and a semantic layer, and that becomes key. The business people can go directly to a knowledge graph. They understand it, because it’s an intuitive business term, not the name of a table and a column. And so they’re very quickly able to start using the system.
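
To make the point about business terms versus table and column names concrete, here is a minimal sketch in Python of what a semantic layer lookup might look like. It is only an illustration: the metric names, tables, and column expressions are hypothetical and do not come from the conversation or from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """One business term and the physical expression behind it."""
    business_name: str
    table: str        # hypothetical physical table name
    expression: str   # hypothetical SQL aggregate over that table


# A tiny, hypothetical semantic layer: business users see the keys,
# the layer knows which table and column expression each one maps to.
SEMANTIC_LAYER = {
    "Total Revenue": Metric("Total Revenue", "fact_orders", "SUM(order_amt_usd)"),
    "Ad Impressions": Metric("Ad Impressions", "fact_ad_events", "COUNT(impr_id)"),
}


def to_sql(business_term: str) -> str:
    """Translate a business term into a SQL query string."""
    metric = SEMANTIC_LAYER[business_term]
    return f'SELECT {metric.expression} AS "{metric.business_name}" FROM {metric.table}'


if __name__ == "__main__":
    print(to_sql("Total Revenue"))
```

The business user asks for "Total Revenue" and never needs to know that it lives in a column called `order_amt_usd`; that translation is kept in one place.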

Dave Mariani: Okay. That’s a fantastic explanation. So they’re not opposed to each other. One’s a technology implementation, or an architectural pattern, and data mesh is more of an organizational style. Yes. So you talked about data mesh and decentralizing the creation of data products. Sanjeev, how do you prevent the chaos from reoccurring? You know, we went through the self-service BI revolution. Tableau really pushed that, and the business was empowered to create their own data products. But the enterprise suffered on consistency and control. Really, that’s why I started AtScale. I brought Tableau into Yahoo, and now everybody had their own definition of what an ad impression was or what a click was, and there was no way to standardize that. So different dashboards had different numbers, and nobody trusted the data. So how can we implement a data mesh, decentralized style of data product creation? Because I really think that’s the way you scale. But how do we do it with control and consistency, without letting things go crazy?

Sanjeev Mohan: That is a great question. The problem that we are facing in the market right now is that people are developing their own solution, putting the stamp of a data mesh on it, and declaring victory. Zhamak Dehghani did an amazing job of defining the four principles, or pillars, of data mesh. What she did not do was get into the technology side. She didn’t define how people should do it. So now there are multiple definitions, and you are absolutely right: if you just decentralize everything, you have duplication of data. It becomes the same problem as a set of data silos, plus a host of other problems. What I have been advising clients is that it’s a mixture of decentralization and centralization. Decentralize your domain-driven development, because subject matter experts sit in the domains and they know the data the best. But governance should always be centralized.

Sanjeev Mohan: Why do I say that? Because let’s say you have three different departments. Sales defines customer in a certain way. Marketing has a totally different definition of what a customer is. The product team has yet another definition. How can you possibly create a business-level report on customers when you have three different definitions? So that’s one problem. The second problem is that accessing the customer data requires some policy. Sales people should be allowed to see certain aspects of a customer because they’re selling to them. But the finance team may have totally different needs, because they have to bill them and maybe do collections, right? Mm-hmm. So they need bank details, which sales people should not have. If you’ve got a decentralized environment, then you’ve got decentralized policies, which means there’s no common way to govern it. So I believe that the governance layer needs to be centralized. There’s a common definition and a common set of data access policies, and that is sort of your window into the data product. The data products themselves can then be decentralized. So you need this combination to make it work.
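
The sales-versus-finance example can be sketched in a few lines of Python. This is a minimal illustration of one central policy applied to every domain’s view of a customer record, not a real governance product. The role names, field names, and the customer definition string are all hypothetical.

```python
# One shared, centrally owned definition (hypothetical wording).
CUSTOMER_DEFINITION = "An account with at least one signed contract"

# One central access policy: which customer fields each role may see.
# Roles and field names are assumptions made for this sketch.
ACCESS_POLICY = {
    "sales": {"customer_id", "name", "contact_email"},
    "finance": {"customer_id", "name", "billing_address", "bank_details"},
}


def visible_fields(role: str, record: dict) -> dict:
    """Return only the fields of `record` that `role` is allowed to see.

    Unknown roles get an empty view (deny by default).
    """
    allowed = ACCESS_POLICY.get(role, set())
    return {key: value for key, value in record.items() if key in allowed}


if __name__ == "__main__":
    customer = {
        "customer_id": 42,
        "name": "Example Corp",
        "contact_email": "buyer@example.com",
        "billing_address": "1 Example Street",
        "bank_details": "IBAN-PLACEHOLDER",
    }
    print(visible_fields("sales", customer))
    print(visible_fields("finance", customer))
```

Because every domain’s data product goes through the same `ACCESS_POLICY`, sales never sees bank details no matter which team built the product, which is the "window into the data product" idea described above.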

Dave Mariani: I completely agree with you on that. You know, some of our customers, before the whole data mesh concept became popular, called it the hub-and-spoke style. Yeah. So at the hub there was central governance, and it was not just governance, it was also architecture, right? It was: what tools are we gonna use? Let’s make sure we’re all using the same tools and the same standards, but also conformed dimensions. You talked about the customer dimension. Maybe that should be defined in one domain that owns it, and everybody else has to use it, or it’s defined by the central team and used by all the different domains. But there still needs to be that governance. I like to think of it as hub and spoke, where the spokes are the domains and the hub is that central governance entity, whether it be a data team or what have you, because I think that’s a style that could work. Correct. Versus a free-for-all, full decentralization.

Sanjeev Mohan: I completely agree. In fact, the point you raise is extremely important, and we sometimes forget it. We treat data mesh as something that’s brand spanking new. Mm-hmm. What data mesh did really well was package the existing principles in a very neat way, so people could fathom what to do. But the techniques and the approaches are not new. What you’ve been doing with hub and spoke is what we are doing today. So sometimes people think this is a brand new way of doing things. No, it’s not. We’ve been doing this stuff, but now we have a structure, and we call that structure a data mesh.

Dave Mariani: Yeah. And I do like the concepts. Look, I like having common terminology. I like having a data steward, having data domains. It really does help when it comes to communicating and talking about how technology fits and enables that picture. So that’s great. If I were at Yahoo and running analytics again, I would definitely choose it. I would call it hub and spoke, but it would be that sort of decentralization with standardization. I don’t think you can have decentralization without some sort of standardization in governance, but I definitely would go that route.

Sanjeev Mohan: So if I may add one, one quick thing and go on a bit of a rant.

Dave Mariani: Love it. Rant on.

Sanjeev Mohan: Okay. Alright. So in our industry, we love terminologies and hype, and we invent things just because they’re fashionable. One of them that is used for data mesh is that it’s a techno-social problem, or sociotechnical problem. And I was taken aback. I’m like, wow, that sounds pretty nifty. What is sociotechnical? When I dug deeper, it turns out “social” is people and processes. So focus on people, the consumers, and the process, and less on technology. But sociotechnical, if I parse it, is people, process, technology. Mm-hmm. I’m like, oh wait, we’ve been talking about that for 15 years now. But that sounds old-fashioned, whereas sociotechnical does not. So anyway, hashtag rant off.

Dave Mariani: Yeah. I love it. I love it. You know, one of the other things you brought up at the very beginning that was top of mind for you is data observability. That’s also getting talked about a lot now, and I’ll be the first to admit that I don’t necessarily get it yet. So can you help me, and help the audience, understand what data observability is in your mind, and why it’s important for data and analytics leaders to be thinking about?

Sanjeev Mohan: Correct. Okay, perfect. I personally feel data observability should get a lot more attention than it gets today. Why do I say this? Because observability, not data observability per se but just observability, has been around in infrastructure and applications for over a decade. And there are some heavy-hitting companies in this space that have done extremely well: Splunk, Datadog, Dynatrace, AppDynamics, New Relic, Sumo Logic. The list is vast. Why do we have so many companies doing observability? Because it’s so important for them to track how the application is performing, how the packets are being moved, how the security and firewalls and all of the infrastructure pieces are doing. Well, guess what: all that infrastructure is moving to the cloud. So AWS cares about it. Google Cloud and Microsoft Azure care about it. But as an end user, I only care about my data.

Sanjeev Mohan: So why would I not have full visibility into how my data is moving in a pipeline, from data producers to data transformation engines, whether it’s dbt or consumers like Looker, or into what is happening in Snowflake with all the transformations that get pushed down? Data observability is a way for me to get full-stack visibility. For example, data quality: how is my data quality changing? Performance, but performance at the data level, the Spark level, for instance. Mm-hmm. Infrastructure: how cost effective is it? Even FinOps. We are now in a bit of a tight market situation. We are not as much into throwing as many resources as it takes to get the best performance. Cost is a very important thing. So I need to know how my query is doing. What resources is it consuming?

Sanjeev Mohan: All of these things are the metrics that data observability vendors will collect, visualize in some sort of time series graph, and use for some sort of prediction of what your cost will be or what can break down the road. So they’re being proactive. It is not just notifying you that something broke, go fix it. But they do that too, so they have alerts and notifications. If something breaks in my very diverse, heterogeneous data and analytics ecosystem, how do I know where it broke? Doing that root cause analysis is one of the tasks that data observability products have.
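
The collect, predict, and alert loop described here can be sketched very roughly in Python. This is a toy illustration under stated assumptions, not how any observability vendor actually works: the metric (daily row counts or daily cost), the z-score threshold, and the straight-line forecast are all choices made for the example.

```python
from statistics import mean, stdev


def is_anomaly(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it is more than `threshold` sample standard
    deviations away from the mean of `history` (a simple z-score check)."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two data points
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


def forecast_next(history: list[float]) -> float:
    """Fit a least-squares straight line through the time series and
    extrapolate one step ahead (a stand-in for real forecasting)."""
    n = len(history)
    xs = list(range(n))
    x_bar = mean(xs)
    y_bar = mean(history)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar + slope * (n - x_bar)


if __name__ == "__main__":
    # Hypothetical daily row counts loaded by a pipeline.
    row_counts = [1000, 1010, 990, 1005, 995]
    if is_anomaly(row_counts, latest=400):
        print("ALERT: today's row count looks wrong; check upstream jobs")

    # Hypothetical daily warehouse cost, trending upward.
    daily_cost = [10, 12, 14, 16]
    print("Projected cost tomorrow:", forecast_next(daily_cost))
```

The anomaly check is the reactive alert ("something broke"), while the forecast is the proactive side ("what will my cost be"). Root cause analysis would additionally need lineage between pipeline steps, which this sketch does not attempt.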

Dave Mariani: That’s a great explanation. That definitely helps. So look, we’re all familiar with monitoring, because you’ve got to monitor your services and make sure that your application server or your web server is meeting load and all that kind of stuff. So I think what you’re saying is that it’s applying that same kind of visibility to data, not just to the hardware and software infrastructure that you’re managing. Right. Very good. Okay, I got it. So DataOps was the last topic that you are really covering, Sanjeev. What’s happening in the world of DataOps? Help us understand DataOps and that concept, because that word and term is used a lot, and why we should care.

Sanjeev Mohan: Thank you for asking that. I think DataOps is yet another super critical, important space. But again, like everything, it’s confusing. The term can mean different things. In fact, just a few years ago, everybody was calling themselves a DataOps company, because the way people defined DataOps was: you’ve got the data producers and you’ve got the data consumers, and everything in between is DataOps. But that’s incorrect. What is in between a data producer and a data consumer is data management. That data management layer is not DataOps. Let me define it a little further. What is data management? ELT. I have products, tools, and approaches for doing replication, change data capture, and ingestion of data. Then I do data transformation, integrate that data with multiple different places, create my data models, and persist it in some sort of data lake or lakehouse or cloud data warehouse.

Sanjeev Mohan: Then I do my analytics, and I have my semantic layer, governance, all of that. All of this is data management. So what is DataOps? DataOps is how I do my code development and configuration in an automated manner, so I can do my testing and very quickly push my code from my development environment to QA to prod. So DataOps can be applied to anything. Maybe I use a BI tool, and the BI tool lets me create my reports and dashboards. But how do I do it so that I have version control, and I can push different versions of my report into GitHub? If there is something wrong, I can revert back. So doing this continuous integration and continuous delivery and deployment (CI/CD), testing as far ahead as possible, automating this whole process, introducing collaboration.

Sanjeev Mohan: All of these things fall under DataOps. DataOps is a layer that you put on top of data management. You can put it on top of building your business logic, or you can put it on top of, maybe, your cloud migration initiative. So that’s why DataOps is so important. Again, going back to observability: observability has existed for the longest time on the application side. We used to call it application performance management. Now it’s come to data. Same thing with DataOps. DevOps came into existence in 2007, and we’ve been applying agile development practices to software development. It’s taken a few years, but now data has caught up to those best practices. So DataOps, to make a long story short, is actually a collection of techniques, practices, and technologies. Mm-hmm.
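
As a small illustration of the "test it automatically before you push from dev to QA to prod" idea, here is a sketch in Python of a pipeline step with data quality checks that a CI job could run on every commit. The transformation, the field names, and the specific checks are hypothetical, chosen only to show the shape of the practice.

```python
def transform(rows: list[dict]) -> list[dict]:
    """Hypothetical pipeline step: drop rows that have no customer_id
    and normalize the amount to a float."""
    return [
        {"customer_id": row["customer_id"], "amount": float(row["amount"])}
        for row in rows
        if row.get("customer_id") is not None
    ]


def failed_checks(rows: list[dict]) -> list[str]:
    """Return the names of the data quality checks that failed.

    An empty list means the output is safe to promote to the next
    environment; a CI job would fail the build otherwise.
    """
    failures = []
    if not rows:
        failures.append("non_empty")
    if any(row["amount"] < 0 for row in rows):
        failures.append("non_negative_amount")
    ids = [row["customer_id"] for row in rows]
    if len(ids) != len(set(ids)):
        failures.append("unique_customer_id")
    return failures


if __name__ == "__main__":
    sample = [
        {"customer_id": 1, "amount": "19.99"},
        {"customer_id": None, "amount": "5.00"},
        {"customer_id": 2, "amount": "7"},
    ]
    output = transform(sample)
    problems = failed_checks(output)
    # Exit with an error if any check failed, so an automated pipeline stops.
    raise SystemExit(f"Checks failed: {problems}" if problems else 0)
```

Keeping the transformation and its checks in the same version-controlled repository is what lets a bad change be caught, and reverted, before it reaches a report.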

Dave Mariani: Yeah. Look, I’m a software developer, so I understand DevOps. I understand what that means. And to me, that’s the analogy, right? It’s really those software principles that we’ve been honing over the years. We’re now applying them to our data pipelines with the same kind of rigor. And you’re right, observability is similar to that. It’s all these best practices that haven’t been applied to data. Data’s been sort of the wild west. Yes. We’re starting to get a little bit more buttoned up, aren’t we, when it comes to data and operating data products?

Sanjeev Mohan: Correct. Yeah. Data has lived in its own silo for the longest time, owned by the data people. There is a reason why applications and infrastructure got more structured and data has taken its time: data is a different beast. Applications don’t change that often. When the infrastructure changes, it’s a major event. Data changes all the time. You’re ingesting data, everything is fine, and then a pipeline breaks somewhere or something happens, and data quality goes out the window. If you don’t observe it, you will not know, till the CFO finds out and says, “This report is wrong,” and then all hell breaks loose, because the CFO is putting the pressure on and it becomes reactive. So observability is trying to avoid that. And so is DataOps. Actually, observability is part of DataOps overall, if you think about it. Mm-hmm. Look at Kubernetes. Now, we are not going to go into Kubernetes today, but I just want to draw a point. Kubernetes came out to make this whole compute infrastructure self-healing and automated, but it was for stateless applications. What happens if an application crashes? Kubernetes kicks in and says, “No problem. I brought it back up.” What happens if your database crashes? It’s a totally d

Dave Mariani: Yeah.

Sanjeev Mohan: Your data could be corrupted. So it took Kubernetes years before data could adopt it. It’s a natural progression. Data has state and can get corrupted, and it takes time for data to adopt some of these best practices.

Dave Mariani: Yeah, I love that. It’s an exciting time for data and analytics. Yeah. Data and analytics leaders have a lot to think about, but I think it’s really a renaissance when you think about it, right? I think we’re coming out of the dark ages, where we had stone tools and knives, and we’re starting to actually do something that’s more akin to what we’ve learned in the application development realm, and

Sanjeev Mohan: Exciting times. Yeah,

Dave Mariani: It’s exciting times. So, Sanjeev, you’re so good at explaining things in a way that simplifies, and not with a bunch of words, not a word salad. You’ve been fantastic at helping me and the rest of the audience understand what these things mean. So before I let you go, I’m gonna ask you to predict the future. Sanjeev, what do you think is gonna happen over the next, I’m not even gonna say 10 years, that’s way too long. Yeah. Even the next three to five years. What do you think we should have top of mind, in terms of looking at and thinking about, as data and analytics leaders?

Sanjeev Mohan: So I see the big challenge, and where the movement is, in simplification of our data architecture. It’s become just overly complicated. Keeping up is literally a 24-by-seven job. Mm-hmm. And I’m doing this 24 by seven. It’s okay, because that’s my job. I’m a research analyst. But Dave, you’ve got a company to run. Businesses have their day jobs. How are they supposed to keep up with this technology? So, you know, I don’t want to pick on particular technologies, but look at the work that Databricks and Snowflake are doing, for example. Mm-hmm. They are introducing more features and trying to have an integrated environment, so you don’t always need to go best of breed. Now, you do have to go best of breed in many cases when you are in a hybrid, multi-cloud environment, because one vendor is not going to give you everything that you need. But I do see that there is an attempt to abstract the technical details. I’ll give you a very simple example. I’ve been to a bunch of conferences this year, and it just amuses me that a few years ago at those conferences, people were asking about master data management. Mm-hmm. This year, no, they’re not asking about MDM. They’re saying, “I have a business problem to solve. How do I do it?”

Dave Mariani: Right.

Sanjeev Mohan: Yeah. So that is the shift that I see happening, where people are very keen to abstract the technology. Let the technologists take care of it, but the business needs to get self-service. Low-code, no-code: people are on both sides of the aisle on whether it’s a good thing or not. But having that self-service, that ability to create applications, run them, bring them up and down, and not have to worry about resources, that’s why you see there’s so much emphasis on serverless these days. Mm-hmm.

Dave Mariani: Mm-hmm.

Sanjeev Mohan: You know, not even SaaS. SaaS was great. SaaS is perfect. In fact, I’m right now attending this premier SaaS conference called SaaStr. But even in SaaS, you sometimes have to figure out, okay, what cluster, what do I need? Serverless takes it even a step further. I don’t even know there’s any infrastructure. It’s an API call for me.

Dave Mariani: Yeah. I’m completely with you on that. Look, nobody wants to manage infrastructure. They just wanna get results. And the cloud has really helped us accelerate down that path, hasn’t it? Serverless is just another step along that path. But, you know, Sanjeev, this has been amazing. I always love talking to you. I always learn so much when I talk to you. So thank you so much for joining the podcast, and to everybody out there, stay data driven.

Sanjeev Mohan: Thank you.
