Data Mesh, Data Ownership, Self-Service Data Infrastructure with Sharon Richardson

Listen to Sharon Richardson of Databricks talk about the data mesh concept and how Databricks enables organizations to implement it. Elif and Sharon also discuss the four principles of data mesh: domain-oriented data ownership, managing data like a product, a common self-service data infrastructure as a platform, and federated computational governance. They focus on the challenge of understanding what is meant by federated computational governance and how technology enables it.

Meet our Guest

Sharon Richardson

Regional Data+AI Strategist at Databricks

Dr Sharon Richardson, PhD is a Data and AI strategist at Databricks, where she helps clients envision potential and develop strategies to benefit from data-driven innovations at scale on the Databricks platform. Sharon has a PhD in cognitive and spatial data science, awarded in 2020 by University College London, and over 25 years of experience in a variety of roles advising global organisations on digital trends and technologies transforming the 'intelligence - decision - action' cycle that underpins performance.

Meet our Host

Elif Tutuk

Global Head of Product at AtScale

Elif Tutuk has extensive business intelligence and analytics expertise, having spent the last 12 years with Qlik. Her most recent role prior to AtScale was as Qlik’s vice president of innovation and design, overseeing a global team of user experience designers, product designers, and engineers. Her innovations have led to patents for search and conversational analytics, data analysis, data management and more. Her research and technology development for augmented intelligence (a combination of data science and AI) has led to the rise of third-generation analytics. Prior to Qlik, Elif started her career as a developer and analyst. Elif is a founding member of the Innovation Forum at Forum Ventures, a leading venture group investing in early-stage software as a service startups. In this role, she serves as an advisor and mentor to founders and executives. Elif also recently won the Women in Tech Outstanding Leadership award, in recognition of her outstanding leadership contributions to the cloud industry. She was also named a winner of the Business Intelligence Group’s Artificial Intelligence Excellence Awards program, acknowledging her work leading the charge to blend AI into analytics to further the AI/human interaction with data while striving to eliminate bias.

Transcript

Elif Tutuk: Hello everyone. Welcome to the Data-Driven Podcast. Today I’m your host, Elif Tutuk. I’m the global head of product at AtScale, and I’m super excited to have our guest, Sharon Richardson, with us today to talk about data mesh and how Databricks enables organizations with the data mesh concept. Sharon is the AI and data strategist at Databricks, and she has also done a lot of good research on decisions and decision intelligence, which is very interesting, and that’s why I think we are in the data business, right, to be able to support decision making. So welcome to the podcast, Sharon.

Sharon Richardson: Thank you. Thanks for the lovely introduction.

Elif Tutuk: So I think we are going through a very exciting time in the data and analytics space. There are so many innovations coming up. I love what is going on at Databricks and the recent innovations that you have been making in the last couple of years. And one of my personal interests is the data mesh concept. I joined AtScale last year to run the global product organization, mainly because I spent two decades of my career innovating self-service analytics solutions. For me, data mesh is very close to that concept in terms of how you enable the organization, where the people who are close to the business take ownership of data and data product creation, but you can still do that with central governance in place and then enable that decentralized data product innovation. So with that, can you tell us how you answer the question of what data mesh is? I think there are a lot of discussions going on around that.

Sharon Richardson: Yeah, sure. I personally stick with the four principles that were originally articulated by Zhamak Dehghani, and I apologize if I didn’t quite pronounce her name correctly. She wrote two fantastic articles, the first I think in late 2019, followed up with a more reasoned approach in 2020 that introduced these four principles, which were all about how to get the most value from your analytical data at scale. And I think keeping focused and centered on one set of definitions gives consistency, because obviously, with Databricks as a technology vendor, there’s always that excuse that we could come up with a nicely curated alternate set of principles that fit us very well. And that’s not the way to do this, because it is as much about organization structure and operating models.

Sharon Richardson: It really is, and should be, technology agnostic. The data mesh concept itself is that; for us, it’s how do we respond with the technology to make it happen. So it’s those original four principles. The first one is that you need domain-oriented data ownership, which means decentralizing control. The second, and it’s the principle I’ve probably focused on the most, is thinking about managing data like a product. I think that’s a critical point, and I think that’s where we’ve really seen the maturing in our approach to data management in recent years. The third one is about having that common self-service data infrastructure as a platform. Back to exactly what you said: without that, you risk ending up fully decentralized into separate silos that cannot communicate and cannot work with one another. That third principle is all about stopping that from happening.

Sharon Richardson: And then the fourth is that you need governance, and the principle states federated computational governance. I think that’s perhaps the area that provokes the most discussion in terms of what it means and what it requires. But yes, as for data mesh, I’ve always felt that Zhamak did such a fantastic job of articulating it, even though many people say, oh, well, there’s nothing original there, we were already doing a data mesh. It’s like, well, you might have been, but we didn’t have something to converge on. And I think that’s where she’s been brilliant in articulating those principles. It benefits everyone if we all agree that those principles are a good foundation to start from when you’re thinking about data mesh.

Elif Tutuk: Yeah, perfect. I love how you approach it. Let’s stick to the core definition of data mesh. And you’re right, every vendor has its own version. Actually, I was doing another fireside chat with Zhamak yesterday for our Semantic Layer Summit, and it was great to hear these same principles from her as well. So, just as you said, one thing that I also like about this definition is treating data as a product, and that goes to thinking about user needs. But then the other thing is, I think maybe we can do more of a deep dive on this federated governance part of it, and how technology enables it. We can focus on that. So do you want to expand more on that?

Sharon Richardson: But in terms of the challenge, it’s more the challenge of what we mean by federated computational governance. With the term federated, we can argue about what the definition means, because does it mean that all the domains agree jointly, as a community, on what governance means for the organization? Because that would be a truly federated decision. I live in Switzerland, you know, it’s a federated set of cantons. It’s been quite fascinating seeing referendums happening all the time, with voting on every government decision. It’s truly, truly federated. It’s actually quite unusual to see; I’m not sure any other country does it quite to this extent. So if we are really thinking about it as an organizational principle, it’s quite a challenge because, depending on the number of domains you’ve got, bringing consensus at a truly federated level is hard. For me, that’s why I call it out.

Sharon Richardson: I think it’s one of the challenges. There’s another interpretation as well. My background is very much a blend: I originally started out with a computer science background, but also with psychology. So I’ve always been very interested in the human aspects and the impact on humans of advances in technology. And on the computing side of federated, a number of years ago we started looking at federated search capabilities. The idea was that, while you wouldn’t necessarily have one index, you would federate your query across many indexes and then bring the combined results together. So that’s a different approach to federated. Federated governance from that perspective could be for an organization that has potentially multiple platforms in operation that may be running on completely different technologies, not just Databricks; they could be running on other technologies too. Each may have its own governance.

Sharon Richardson: So in that case, we are now thinking about federated, and particularly when we bring computational in, as: is it a case of being able to federate elements of the governance out to these platforms, but in a way that there is an end-to-end governed solution? This is why I call it out. When I see customers that have adopted data mesh, I think they can be very quick to talk about how they’re structured around domains, that they’re bringing product thinking in, that they’ve got self-service infrastructure. But ask about federated computational governance, and then I think the conversation is either less mature or, and this is not so much what I think as what I see, it tends to be more like centralized governance. And in the conversations I have with a lot of organizations, when they really want to simplify their demands right down to the most basic level within a data mesh, it’s generally: how do we have global consistency, global standards, global governance, with local autonomy and local flexibility, so that the domains can operate effectively in their own environment? They can be agile, they can be innovative, they can drive forward at different speeds, because some parts of the business operate at very different speeds to other areas of the business.

Sharon Richardson: And we don’t want to hold anyone back, but there are global constraints that we do have to place on the domains. And so I do tend to see that most end up adopting global governance. I’ll use global rather than centralized, because centralized can be such a loaded term when we’re talking about a data mesh. But it is about: what are the global principles that need to come into play? Who decides on them, who agrees them, and then how do you implement them in as automated a way as possible? So, a bit of a long-winded answer, but that’s why I think that fourth principle drives the most debate and head scratching when people are starting to think seriously about embracing data mesh.
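The federated search idea Sharon describes (no single index; the same query is sent to many indexes and the combined results are brought together) can be sketched in a few lines of Python. This is only an illustrative toy: the two indexes, their contents, and the word-count scoring are all assumptions made up for the example, not anything described in the conversation.

```python
# Hypothetical per-platform indexes: each maps a document id to its text.
sales_index = {"s1": "customer churn report", "s2": "quarterly revenue"}
support_index = {"t1": "customer complaints log", "t2": "churn survey"}


def make_searcher(index):
    """Return a search function that scores documents by query-term occurrences."""
    def search(query):
        terms = query.lower().split()
        hits = []
        for doc_id, text in index.items():
            words = text.lower().split()
            score = sum(words.count(term) for term in terms)
            if score > 0:
                hits.append((doc_id, score))
        return hits
    return search


def federated_search(query, searchers):
    """Send the same query to every index and merge the results into one list."""
    merged = []
    for source, search in searchers.items():
        for doc_id, score in search(query):
            merged.append((source, doc_id, score))
    # Highest score first; ties are ordered by source and id so output is stable.
    return sorted(merged, key=lambda hit: (-hit[2], hit[0], hit[1]))


searchers = {
    "sales": make_searcher(sales_index),
    "support": make_searcher(support_index),
}
results = federated_search("customer churn", searchers)
```

Each index keeps its own storage and its own search logic; only the query and the merged result list are shared. That mirrors the reading of federated governance in which each platform keeps its own controls while something above them combines the outcome.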

Elif Tutuk: That’s very interesting, how you put it. It’s not only federated governance, but federated computational governance. And let’s come back to the search concept that you mentioned; I would love to hear more about that. What I’ve been focusing on within AtScale, within the product, is thinking about how we can have the common definitions be centrally owned initially. There has to be one definition of customer, one definition of product. But I’m also acknowledging that every business unit, marketing for example, may still have its own spin, its own version of that customer, because they are the ones who are making the decisions based on the business needs that they see for their focus area. So what I was thinking is to still have a central definition, but then to be able to register that definition, the definition of that data, and make it available, and then let the business owners start using it and then version that definition and the structure of the data based on their needs, but still have some type of data contract to be able to keep track of who’s using what version and what business questions they are answering with it.

Elif Tutuk: Is this aligning with your thinking?

Sharon Richardson: Yeah, no, I think it absolutely makes sense, and it’s an interesting conversation I’ve been having quite recently with a client too, which is trying to figure out how much you can achieve at a foundation layer. And that foundation is probably going to be centralized. So, using your analogy of the customer definition, what are the absolute musts? No matter what the different requirements are across the domains, there are certain definitions or criteria that must be satisfied, and so they are going to be managed as that foundation. And that foundation is probably going to sit centrally, but it’s also acknowledging that yes, there may be use cases where a specific domain needs to work with data in a different way, or needs to work with different features or different categorizations or thresholds.

Sharon Richardson: For example, a simple one might be spatial data. You might have spatial data coded down to the postcode or zip code level in your data set. But you might have a domain, and this is going to be a little bit silly and not quite applicable, but just to give the example, that’s acquiring an external data set. It might be an academic data set, or one that’s already been given a certain amount of aggregation, so they’ve aggregated at some kind of regional grouping. And it might not be a particularly standard one, but it’s the one that this data set is working with. Well, if you want to analyze spatially, you’re going to have to effectively recode your data and code it to these spatial categories, but they are of no use to anybody else whatsoever.

Sharon Richardson: So it’s an extra layer. You almost think of it as that level one foundation transformation, and then we’re going to do a level two transformation for this use case. But I think the criterion that’s important is that whatever you do down the line in the domain shouldn’t undo, challenge, or conflict with that foundation level, because that’s when inconsistencies can come in, where you are producing models or analytics that are simply not reproducible anywhere else or, even worse, would be challenged elsewhere, with someone saying, I can’t reproduce those numbers, they don’t make sense to me. That’s what you don’t want happening. And this is very much my personal opinion, rather than anything specific to either Databricks or the industry, but I really feel you need it to be this one-directional approach, where your foundation has all the red lines that you cannot cross. You have the freedom to do more, but whatever you do, it must not undo what was done before. You don’t want to have to send data backwards and forwards and say, oh, well, we’ve changed it here and now we’re going to copy it over there, and we’ve got to try and unpick this and do it back. That’s when the complexity comes in that I think a lot of organizations have experienced in the past when pipelines go wrong. So it’s avoiding those scenarios.
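The one-directional rule Sharon describes, where a level two transformation may add to the foundation but must never undo it, can be illustrated with a small Python sketch based on her postcode example. All names and values here (`foundation`, `postcode_to_region`, `study_region`) are hypothetical and chosen only for illustration.

```python
# Foundation (level one) records: the centrally governed fields.
foundation = [
    {"customer_id": 1, "postcode": "8001"},
    {"customer_id": 2, "postcode": "3011"},
]

# Hypothetical non-standard regional grouping used by one external data set.
postcode_to_region = {"8001": "north-east", "3011": "west"}


def level_two(records, mapping):
    """Add a domain-only column without altering any foundation field."""
    derived = []
    for record in records:
        enriched = dict(record)  # copy, so the foundation record is never mutated
        enriched["study_region"] = mapping.get(record["postcode"])
        derived.append(enriched)
    return derived


def preserves_foundation(foundation_rows, derived_rows):
    """Check the one-directional rule: every foundation field is present and unchanged."""
    if len(foundation_rows) != len(derived_rows):
        return False
    return all(
        all(derived.get(key) == value for key, value in base.items())
        for base, derived in zip(foundation_rows, derived_rows)
    )


domain_view = level_two(foundation, postcode_to_region)
```

A check like `preserves_foundation` is one simple way a domain pipeline could assert that it only adds to the foundation layer. A pipeline that overwrote `postcode` with the regional code would fail it, which is exactly the case that makes numbers irreproducible elsewhere.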

Elif Tutuk: Yeah, I do like this approach. I’ve been thinking about what we went through during the pandemic, right? Every organization, every enterprise, already had their data warehouses, data marts, star schemas, but the business dynamics and the environment changed so fast that they weren’t able to reflect the new business moment. Going back to your original example, maybe there is a need to open a new warehouse or close some of the warehouses. You have to enable the business units to apply that business moment to the data in a flexible and agile way, so that, as you said, the core stays there as the definition, but on the edges there are different versions of it, agreeing with it without overriding it. Maybe this is a great way to talk about something else: I know you have been talking about different data mesh options. One that I’ve been referring to is hub and spoke, where the hub is the central definition and the spokes are the edges. What is your perspective on those different data mesh options, and what are you seeing at customers?

Sharon Richardson: Yeah. Well, I touched on it in the article that’s been published online by Databricks too. To truly interpret data mesh is to produce what has tended to be called a harmonized data mesh. So there is no central authority if you are truly harmonizing as a data mesh. But in reality, I would say the majority of the customers I talk to about it will lean towards a hub and spoke model. It may be a very, very light hub, but there often is some need for centralization, for multiple different reasons, and it varies by organization. For some, to be truly harmonized means having the luxury of having all the skills you need available within each domain. And that simply may not be practical for many organizations. Data science is an expensive field. You may not need high-end data scientists in every domain, but there may be times when each domain wants access to data science resource.

Sharon Richardson: Well, this is where hub and spoke starts to make more sense, which is to have a core of that high-quality but scarce skilled resource that can be reused across domains. But then absolutely, things that are specific to a domain will be skilled within the domain. So some sharing of skills across domains is generally necessary. We often use the phrase center of excellence, for example, and it can be very light touch. It doesn’t mean creating a bottleneck in the process, which is the criticism, and rightly so, of many centralized environments, but it’s about how to be as efficient and effective as possible with the resources that you’ve got available. The other scenario that we see tends to lean towards hub and spoke is thinking about data acquisition strategies. If you’re acquiring data from third parties externally, is that data only needed by one domain?

Sharon Richardson: Because if it is, that’s very straightforward: the domain makes the acquisition. But what if that data is valuable to multiple domains? Which one owns it? Which one owns the acquisition? Which one decides to notify the other domains that we’ve now got this data set available? Again, it makes sense that there should at least be some form of centralized process for managing data acquisition strategies if data is going to go across domains. It can also apply internally, and this will depend on how you’ve defined and scoped your domains too, but a challenge can be where a data source, even internally, spans the domain definitions. Because if we’re being truly organized, a data source should have a logical home in one domain, and in one domain only. That’s the source, that’s where it’s processed, and its use elsewhere will be as a product. It may go on to be consumed by the domains, but in a product environment.

Sharon Richardson: But there may be some data sources where this is still not quite clear cut. So again, you may say, well, for these, we are going to have some kind of central function that then serves out across multiple domains. And since I’ve mentioned acquisition, I’ll carry that through to the logical conclusion too, which is the sharing of data externally. There’s lots and lots of interest and thought right now around data marketplaces, commercialization of data, data becoming literally a product in its own right that can be used externally. And so again, do you want each domain managing this process, or do you want a conduit that you control? Now, you can argue that if you’ve got your governance structure well set up in a harmonized environment, it should all happen very seamlessly and very automated. But again, it’s the overhead. Governance isn’t just about technology and code, it’s policy, and those policies being kept up to date with changes in regulatory requirements, which are also being updated across the industry, including in our own field of data and AI. The EU AI Act has just been updated, I think in the last two days, to start tackling how we treat generative AI models such as ChatGPT. So again, do we expect every domain to be tracking these changes, or would it make more sense for there to be a central hub that has responsibility for these sorts of decisions?

Elif Tutuk: That’s very interesting. There are a lot of things there, again, to deep dive into. Interestingly, I’ve been having conversations with our customers about this concept as well, because with the semantic layer they are making the data business ready, right? And they have vendors, they have partners. So the thinking is how they can enable their external consumers with that business-ready data so that they can actually start analyzing their own performance. If you’re a retailer, you may have vendors, and you already have their supply chain data and delivery times. So if you can take the data that you are already using, which is business ready, the inventory management data, and open that up to the vendor, then they can start doing their own self-service performance analysis and make decisions. That is a very interesting concept. As you said, it’s around the data marketplace. So from all of those concepts that you have seen, how do you see Databricks supporting these types of initiatives, whether that is external data sharing or having central governance, but also creating a workspace so that the business units can start collaborating around data with the definitions of the data and the governance that need to be in place?

Sharon Richardson: Yeah, I mean, from a Databricks terminology perspective, this is what we look to do with data mesh. We are not making any claims. We don’t sell a data mesh product, and we don’t think you can. It’s not a technology, it is an organizational principle. It’s how do you structure, how do you organize? So for us, it’s more a case of being clear on how you align our capabilities to the terminology. At the simplest level, for example, we talk about workspaces as being an environment in which you can have dedicated compute resource to do all the different data processing and analytics capabilities that you want. You can have multiple workspaces within a Databricks account, and they’re basically sitting on top of a shared infrastructure. So you can already see how we would align that terminology.

Sharon Richardson: So a workspace would likely be our logical mapping to a domain. It’s not an absolute mapping. You could have many domains within a workspace; there’s no hard boundary here. But if we want to keep it nice and simple as an analogy, think of the workspace-to-domain mapping, and then our Unity Catalog product is the one that goes across all the workspaces. So that’s the global governance layer, which means you can put in certain principles to determine how you manage access to your data and how you audit access to it, so you’re monitoring. This is particularly important when you start to allow your data to flow externally as well. So we have Delta Sharing, the Delta Sharing protocol, that’s available. We very much connect that with Unity Catalog, so you can audit and see exactly who has been accessing what data.

Sharon Richardson: We do things like automated lineage too, so that you’ve got that flow and can see how the data has been transformed from its original ingest all the way through to the different products it then becomes. So that’s the global layer. And as we’ve already mentioned, it’s all running on a shared infrastructure service, so that fits the third principle. And then, back to data as a product: Unity Catalog for us comes in here because it’s about making the data assets discoverable. And we talk a lot about what we mean by that as well, because generally the first thing people think of is, well, it’s a table, right, it’s a table of data. And it’s like, well, it can be a lot more than that today. We’re talking about AI as well. So it’s not just data sets. We’re talking about models, code, pipelines, notebooks, dashboards. These are all potential forms of data product that you want to make discoverable in a secure way. And that’s why, for us, Unity Catalog is what underpins that global governance and product thinking principle.
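As a rough illustration of the pattern Sharon outlines (one global governance layer that spans every workspace, decides who may read what, and records each access for audit), here is a toy model in Python. It is explicitly not the Unity Catalog API; the `GlobalCatalog` class, the asset name, and the domain names are all invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalCatalog:
    """Toy global governance layer shared by all domains (hypothetical, not a real API)."""
    grants: dict = field(default_factory=dict)     # asset name -> set of domains allowed to read
    audit_log: list = field(default_factory=list)  # (domain, asset, outcome) tuples

    def grant(self, asset, domain):
        """Allow one domain to read one asset."""
        self.grants.setdefault(asset, set()).add(domain)

    def read(self, asset, domain):
        """Check access and record the attempt, whether or not it is allowed."""
        allowed = domain in self.grants.get(asset, set())
        self.audit_log.append((domain, asset, "allowed" if allowed else "denied"))
        return allowed


catalog = GlobalCatalog()
catalog.grant("sales.orders", "marketing")

ok = catalog.read("sales.orders", "marketing")    # marketing was granted access
blocked = catalog.read("sales.orders", "finance")  # finance was not
```

The point of the sketch is only the shape: domains keep their own workspaces, while access rules and the audit trail live in one place that every domain goes through.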

Elif Tutuk: That’s perfect, Sharon. We have an awesome integration with Databricks, and as AtScale we also have an integration with Unity Catalog. The way that I see our integrations with catalog services is that the semantic layer has its own unique metadata value, right? It’s about the semantics of the data, the business definitions. And by integrating with Unity Catalog, we can now make those data sets exposed within AtScale so that you can define the business definitions on them. And then the users can use their BI and AI tool of choice. And as you said, I really see it as, maybe, what’s the right word? How we can open up the power of Databricks and Unity Catalog to wider analytics consumers. That is how I see AtScale, sitting on top of Databricks to enable that. And I think the other thing that, sorry, I was gonna

Sharon Richardson: Absolutely. I think that’s just a great example of what we were talking about earlier. This whole directional flow from that foundation, right the way through and then going right up into the business logic layer as well, this is what is really transformative as we’re starting to see people mature in their use of cloud analytics platforms.

Elif Tutuk: Yeah. And you have listed some awesome capabilities, like data lineage. When you talk about the data product and thinking about user needs, the data needs to be trustworthy; the user needs to understand where this data is coming from and what type of transformation has been done. Maybe you can expand more on that lineage, how it’s done in Databricks, and also what value you’re seeing at your customers.

Sharon Richardson: Yeah, so this is really what underpinned our arrival with Lakehouse as a category two to three years ago, because it was really the introduction of Delta Lake, this ability to bring ACID transaction capabilities to all data types. And it’s worth giving a caveat about my background, because for most people, your background is either that you came from the data warehouse world of very structured and typed data, or you came from the more messy data lake and the precursors to it, which were file systems and knowledge stores and all the messy unstructured ones. I’m from that world. I’m from the unstructured side of the world, and the beauty of Delta Lake has been bringing these two together, which is something I never really envisaged would ever happen. 20 years ago we always used to joke, you know, we’d go, yeah, there’s the data warehouse.

Sharon Richardson: You can have your lovely, very well curated and managed data, but all the cool stuff’s happening over here. And it’s chaos. It’s very, very hard to put any kind of transactional integrity on the ingest process when we’re talking about unstructured data sources. So bringing that capability at the foundation level, that’s what’s been the real game changer, because it’s having that layer of integrity on the top that then enables you to start tracking the lineage. You’ve got versioning capability, so you can start to see what’s happening to this data and where it is happening as it goes through transformation pipelines. And on top of that, it doesn’t matter what that underlying data is, whether it’s a more traditional structured table or whether it’s an image; it could be IoT streaming events coming from an Internet of Things device in a field somewhere. No matter what the data is, now it’s all coming through this process. And that for me is where the real power comes in, because then you can combine these different sources of data together much more effectively. Whereas in the past we perhaps had a dashboard for one set of data and we had a model and a Jupyter Notebook or an RStudio output for the other set of data, now they’re coming together, and that for us has been the real change with the Lakehouse category.
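The link Sharon draws between a layer of integrity at the foundation and being able to track versions and lineage can be shown with a deliberately simplified Python toy. This is not Delta Lake and does not use its API; the `VersionedTable` class and the sensor data are assumptions made up for the example, and a real system would not store a full copy of the data per version.

```python
class VersionedTable:
    """Toy append-only table: every commit adds a new immutable snapshot."""

    def __init__(self):
        self._log = []

    def commit(self, rows, note):
        """Store a full snapshot of the rows and return its version number."""
        version = len(self._log)
        snapshot = tuple(dict(row) for row in rows)  # copy so later edits cannot leak in
        self._log.append({"version": version, "note": note, "rows": snapshot})
        return version

    def read(self, version=None):
        """Read the latest snapshot, or an earlier one by version number."""
        if not self._log:
            return []
        entry = self._log[-1] if version is None else self._log[version]
        return [dict(row) for row in entry["rows"]]

    def history(self):
        """Lineage-style view: which change produced which version."""
        return [(entry["version"], entry["note"]) for entry in self._log]


table = VersionedTable()
v0 = table.commit(
    [{"sensor": "a", "temp_c": 21.5}, {"sensor": "b", "temp_c": None}],
    "raw ingest",
)
v1 = table.commit(
    [{"sensor": "a", "temp_c": 21.5}],
    "drop rows with missing temperature",
)
```

Because nothing is ever overwritten, `table.read(v0)` still returns the raw ingest after the cleaning step, and `table.history()` answers the question of what happened to the data and in which order.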

Elif Tutuk: Yeah, I’m super happy that you touched on the Lakehouse concept. I just want to make a joke: you are coming from unstructured, my background is structured, so I think we are building the Lakehouse in this call. But maybe for our audience, can you expand more on how we have had the data warehouse and then data lakes, and now, what is a lakehouse?

Sharon Richardson: Yeah, and I’ll try not to talk too much, because I can talk for the country. But if you think about it, historically we’ve had to have separate platforms, because the data warehouse is a very well established technology. It’s what, 30, 40 years on? It was the 1970s, you know, you stop counting additional decades after a certain point. But it was very much Codd, back in the 1970s, who envisaged relational storage systems to manage what was an explosion of data in that era. And so data warehouses are very, very well established, but really designed in an era of thinking that data means numbers, numerical data, hence we can structure it. Think of spreadsheets: even though we use spreadsheets for absolutely everything, if you envisage one, you’re generally thinking of numbers in the fields, not your travel list or your shopping list or all the other things that we also tend to use them for.

Sharon Richardson: So this world is very, very mature, very well established. The data lake is just over a decade old. It’s a much younger technology, and it really came into being around the start of the century with this whole explosion of what we call big data. And whilst some people roll their eyes, big data did have a very specific definition, which was data that defied all traditional management techniques, partly because of the volume, which was on a scale we’d never seen or had to handle before. But the more interesting thing was these Vs. Doug Laney of META Group, which became part of Gartner, first articulated these Vs back in 2001. He started with three in the published article. I actually discovered in conversation with him last year that there were 12, but they narrowed it down to three for the article.

Sharon Richardson: But the three big ones were volume, velocity and variety. So the sheer volume of data, the speed at which data is being captured and changes, and the variety: all of a sudden being able to query and digitize, you know, our photos. We think it normal today that all imagery is captured digitally. Movies are now captured predominantly digitally; they’re not using traditional analog film anymore. That’s only happened in the last 10 to 15 years, and in fact for movies, I think, only in the last five to six years. At the turn of the century, not many people even had digital cameras yet. So it’s easy to forget how quickly that change has happened. So data lakes really arose out of this demand: all of a sudden we’re digitizing handwriting, images, signals, things that previously were in an analog world, and how do we process them when they are on such a scale?

Sharon Richardson: You certainly can’t do it in a data warehouse. You can try and do it on your laptop and, good luck, it’ll fall over quite quickly. That was the rise of the data lake, which was a very, very clever separation of storage and compute as cloud computing became a thing, to be able to distribute the processing overhead across many, many computers and start analyzing this. It didn’t happen in isolation, cuz at the same time, literally within a five-year period, we had the deep learning papers released, first by Geoffrey Hinton and others, that triggered the arrival of neural networks. Again, they weren’t a new invention, but having a method to actually use them was what changed, because they used the graphics processing unit, the GPU. Nvidia then blew up the GPU capabilities. You know, we had this whole, I’ve got a plot of it somewhere, happily share it.

Sharon Richardson: You know, we had this convergence of these different trends, which meant we were now able to analyze unstructured data on a scale we’d never seen before. But no ACID transactions, no integrity. You just throw your files in and you start playing. If you want to secure it, you either secure the entire file or none of the file. There’s no nicely being able to select columns or rows and have attribute-level controls. None of that in the lake world. And this is what’s changing very, very rapidly at the moment with the arrival of the lakehouse category, because we’ve now been able to bring these two together with this clever format, Delta Lake. That means what’s been a luxury in warehouse land, we’re now starting to be able to apply. And the beauty of that is that the payback goes the other way too, because by bringing the two worlds together, we can start bringing a lot of the data science tools to the warehouse data sets too. And that’s equally exciting. We talk a lot about the serious stuff at the bottom, the security and the integrity, but it’s also the models that we can now build by bringing these two worlds together. So I think it’s a really exciting time.

Elif Tutuk: Yeah, yeah, a hundred percent agree. That’s why we started our podcast saying that this is a very exciting time in data analytics. And, to be honest, I love what Databricks has done with the Lakehouse. Now we have many joint customers, especially with Databricks SQL, right? AtScale has been focusing on how you make that structured data business ready, so that no matter what BI tool you use, you can just consume that data. And now, working with the Databricks Lakehouse, we are really opening up that value of the data to be consumed. One of the exciting things for me to see was how an Excel user can actually connect to Databricks and then just do their analysis with the tool.

Sharon Richardson: It’s absolutely phenomenal.

Elif Tutuk: Yeah, cause if I’m a finance user, I don’t want to use a dashboarding tool. There are dashboarding tool users, you know, using Tableau, and they love using that. But if I’m a finance user, I want to use Excel. And just enabling them with the power, scalability and performance of all the things that the Databricks Lakehouse provides, through AtScale and with Excel, is just an eye-opener for me. And then providing the governance on top of that, so that you don’t live in a hell of metrics where nobody knows the definition of them. So that is one of the awesome use cases that I’m seeing with the Databricks Lakehouse from our joint customers. Do you have any examples of joint customers, or data mesh examples that you’re seeing from customers?

Sharon Richardson: Yeah, we’ve actually had a few customers present at our conferences over the last two years, and they’ve written articles on how they’ve deployed a data mesh designed architecture on the Databricks Lakehouse. What is quite interesting is that, in at least a couple of instances, they say that it is something they were already doing. And, reverting back to what we were saying earlier, it’s a natural evolution, because systems tend to cycle through centralized to decentralized, to centralized, to decentralized. And we were definitely going through one of these phases anyway over the past two years. I think that’s partly because when you’re experimenting with a brand new technology, such as moving from on-premise systems into a cloud environment, it’s absolutely logical to begin with a relatively simple centralized environment. It’s usually quite small, it’s quite contained, it’s really proving the art of the possible. It’s once you want to scale that, then the desire comes and... Excuse me one moment, I’m just gonna cough, I’m gonna mute my

Sharon Richardson: Sorry about that. I’ve had a bit of a bug this week. Where was I? Yes, so from centralized to decentralized, it is a natural evolution. You start centralized, and then when you want to scale, centralized always hits bottleneck problems at scale. It’s the classic network hierarchy: it’s centralized, decentralized. And I think the beauty with data mesh, and Zhamak’s vision on this, has been to try and step in and say, well, let’s come up with a solution that stops this cycle from yo-yoing backwards and forwards and settles in. And so I think that’s why... I’m gonna cough again, one moment.

Elif Tutuk: I think we have been talking very fast and trying to cover a lot of topics. So

Sharon Richardson: Yeah, totally fatal. I’ve dried my throat out. So, for example, one customer is Michelin. They have very much talked about this exact journey that they went on. They had that single centralized platform, which was great for that initial proof of moving and committing into a cloud-based environment. But then with popularity came those bottlenecks. And so they’ve since deployed fully as a harmonized data mesh, and they have described their journey and published it on their website. If you search for Michelin three years after data mesh, you’ll find the article. I think they wrote that last year. So, three years after data mesh: data mesh was only announced barely three years ago, I think under three years before the article was written. So it’s making that exact point, that they were already on this journey, but it’s helped give them a framework to strengthen it and to build something with resilience for scaling going forward. And very similarly, HSBC spoke at our conference last year, and they literally titled their talk how we accidentally built a petabyte scale cybersecurity data mesh in Azure with Delta. It’s exactly that scenario.

Sharon Richardson: So I think the reality is that most customers who today already say they’ve got a data mesh were probably on that journey to decentralization as the article started to come out, and it gave them a structure: well, how can we do this in a way that we don’t fully decentralize and create costs or technical debt or silos? How can we prevent that happening? And that’s where data mesh, I think, has been really successful.

Elif Tutuk: Yeah, yeah, exactly. And I think you’re a hundred percent right. On data governance, I think customers and users have been going through this journey, because every time over maybe the last 10 years, since I first started having data governance conversations, it was always: okay, do we achieve this centrally, or how do we share with the business units? So those discussions have already been happening and, as you said, they all started central, and then that created a bottleneck, and now they are starting to enable the business units. That’s why it’s an exciting time from a solution innovation perspective as well, because now we are seeing the actual user needs and, from the technology side, we are providing the right solutions to support those needs.

Elif Tutuk: So that is awesome to hear. Maybe one final topic to talk about, Sharon: there are many data sources, and I know Databricks has been looking at the federation concept as well. I think it is important to acknowledge, and it ties back to external data and being able to share data, which triggered my thinking about the other data sources and how federation through Databricks can enable that as well. Do you have any insights that you can share around that?

Sharon Richardson: Well, that’s a tough one, because not really, I think, at this stage, other than to say you’re absolutely right. Unity Catalog is the absolute foundation for us for governance going forward. Will we see some rapid evolutions as we round out that product further? Definitely. Some of the public announcements we have already made, for example, have been about how we’re now very much unifying across all the data use cases so that they feed from Unity Catalog. The initial focus has been the more traditional data warehouse capabilities, so through DB SQL and serverless into more traditional reporting, but also now bringing in things such as feature store and model deployment, all the AI capabilities, and having those surfaced and discoverable as well. So there’s lots of developments there.

Sharon Richardson: But also looking at how you think about the end-to-end flow of data, where it comes into the system, where it goes out: at what point can we actually start being able to stitch full end-to-end governance pipelines together? This is where it shows, and it’s why I call out this fourth principle: the federated computational governance is itself, I think, at quite an early stage. I think most people would agree with that. I think we will see some iterations, some learnings coming through at a technology-agnostic level even, independent of anything Databricks or other companies are doing. I saw a really interesting phrasing of it in an article, and I can’t remember the source, and I feel terrible that I can’t cite it, but they described a frustration with governance, saying it slows the project down cuz nobody can agree on it and everyone’s fighting over the policies, and they were saying maybe we should just be adopting minimum viable governance.

Sharon Richardson: And I thought that was a really interesting take. There’s a risk there, obviously, cuz governance steps in once you’re moving, particularly if you’re in a regulated industry. But I thought it was actually quite a smart observation as well, because it was acknowledging that this whole industry is still moving and evolving and innovating at a fairly rapid pace. I mean, you’ve only got to look at what’s going on with all the hype and discussions around ChatGPT right now. So actually adopting that minimum level... and apologies, I am going to cough again.

Elif Tutuk: No, I love that, and this comes back to treating the data as a product. It is all about applying product management. In the product world you talk a lot about the minimum viable product, and I have a tendency to say minimum lovable product. So from that perspective, I just love that minimum viable governance.

Sharon Richardson: No, you’ve absolutely hit the nail on the head there. I think if you are truly thinking about data as a product, then minimum viable governance starts to make sense, because how much do you know about where that product will go and how it will be used? It’s a bit like, you can sell a car; is it your fault if somebody then crashes it? It’s a really blunt example there. But, well, no, you can’t necessarily control the driver, but there are certain expectations. Provide seat belts, install airbags. There are things you can do to protect people when the unforeseen or unfortunate happens, something you would rather didn’t happen but that is outside of your control. We can’t yet imagine, and this is a big debate, cuz we are now in danger of moving into the world of AI ethics and responsible AI. These are challenges that governments are grappling with right now, let alone companies, let alone technology vendors. And so they are going to change our perception of what is or isn’t governance, and they’re certainly gonna change expectations on governance. So thinking about putting in that minimal layer and having the agility and flexibility to keep improving, to evolve it as you need around a data product, there’s a lot of thinking that makes sense there, I think.

Elif Tutuk: Yeah. Yeah, I a hundred percent agree. Wow. I mean, again, we could just go on talking with you, Sharon, around all those different things, but I think this was a super interesting discussion. I love how you approach the data mesh, sticking to the definition of it and always keeping in mind that it’s a practice, an organizational behavior, not a technology. And I love how you have focused on the fourth point, federated computational governance; that was a really nice touch. So, any final thoughts that you want to add?

Sharon Richardson: Gosh, off the top of my head, no would be the quick answer, cause I think we’ve covered a broad range. What I will always come back to, particularly with data mesh, is that it is an organizational principle, well, structure, more than anything. It’s supposed to be technology agnostic. And I think that’s a good thing, cause it’s thinking about: how do you operate your business? How do you become data driven? How do you do that in an effective way that can scale with resilience, with robustness? And this is where I think Zhamak’s been really good at articulating this from day one. I get very frustrated if I see organizations trying to say, well, we’ve got a data mesh product and you can install it. It’s like, no, that’s really not what it’s about. I think it’s really good in that it’s showing a maturing in the thinking of cloud analytics platforms. It’s still a very, very exciting time. There’s lots of growth and innovation going on, but there’s also a maturing, a recognition that this is becoming part of business as usual. And I think that’s a very good thing to see too.

Elif Tutuk: Yeah. Great. Thank you, Sharon. From the product side at AtScale, I’m super excited about the co-innovation opportunities that we have with Databricks on many things. And, just as you said, it’s an exciting time for us, together with all of the vendors, to be able to innovate as the user needs are surfacing even more now with the data mesh concept. So thank you very much for joining us today. It was a great pleasure to have you on the Data-Driven podcast and I look forward to our future discussions.

Sharon Richardson: Likewise, thank you so much for the invite. It’s been an absolute pleasure.

