Modern Data Stack with Chad Sanderson

Listen to this podcast with Chad Sanderson, head of the data platform team at Convoy, which includes Convoy’s Data Engineering team, Data Warehouse, Data Pipeline tooling, Experimentation Platform, Machine Learning Platform, Analytics Platform, and Streaming. Hear how Chad’s team is building out one of the most advanced Experimentation and Machine Learning Platforms in the world from the ground up, servicing thousands of carriers every day to ship freight more efficiently.

Meet our Guest

Chad Sanderson

Head of Data Platform at Convoy

Chad Sanderson is currently Head of Data Platform at Convoy and was formerly an experimentation product leader at Microsoft, Sephora, and Subway.

Meet our Host

Dave Mariani

Chief Technology Officer, Founder, AtScale

Dave is the founder of AtScale and is the Chief Technology Officer. Prior to AtScale, he ran engineering and data at Klout and Yahoo! where he built the world’s largest multi-dimensional cube.

The biggest thing that’s affecting us is how the direction of the modern data stack was a pretty clear initial divergence from some of the original data warehousing philosophies. It seems like horseshoe theory is in play, where we’re kind of circling back to a lot of the stuff that was really popular in the nineties when it came to data warehouse design, entity-relationship diagrams, things like that.

Anytime something breaks, the data scientists have to go and act like Sherlock Holmes and figure that out, or try to throw it to the data engineering team so they can go and figure it out. So what we’re trying to do is create a clear concept of ownership from the source, having a software engineer actually own the production code, the events in this case. We’re using Kafka streams, and those Kafka events are tied directly to the semantic events and semantic entities that are most relevant to our business.

Transcript

Dave Mariani: Hi everyone, and welcome to the AtScale Data-Driven podcast. Today’s special guest is Chad Sanderson. Chad, welcome to the podcast.

Chad Sanderson: Thanks for having me.

Dave Mariani: Thanks for joining us. So, Chad is the head of product and data platform for Convoy. Chad’s based in Seattle, right?

Chad Sanderson: That’s right.

Dave Mariani: Yeah. So, tell us a little bit about yourself and about your company Convoy.

Chad Sanderson: So, as for myself, I lead what’s called the data platform team at Convoy. It is the end-to-end data infrastructure organization. So that’s everything from instrumentation and data ingestion to ETL and the data warehouse, and then all of the applications that sit on top of the data warehouse, like our metrics layer, experimentation software, and our machine learning stack as well. Before being at Convoy, I was at Microsoft on the artificial intelligence platform team, which is also a data infrastructure organization. And before that, I kind of bounced around. I worked at some e-commerce companies like Sephora and Subway. I was at Oracle for a little bit. So, kind of an interesting mix of businesses. As for Convoy, Convoy is a digital freight marketplace. It’s also a broker. That means it sits between a shipper that’s trying to move some freight between facilities and a carrier, which is a business that’s trying to take that freight from place to place.

Chad Sanderson: We use an auction model. We will bid on RFPs that are offered by the shipper. We will then win those loads, put those loads into our marketplace, and provide functionality for our carriers to bid on those loads. So it’s relatively small data. We’re not operating with massive amounts of data. Our computational spend is quite low. That means what we’re really focused on is the complexity of the data and really high data quality, because our entire marketplace is based on machine learning models. If we didn’t have those ML models, there would be no company. So ensuring that the data going into those models is very, very high quality is of utmost importance.

Dave Mariani: Yeah. Well, especially since you’re running or participating in an auction, your data has got to be right and your algorithms have got to be right, or else you’re going to lose money or not be able to get those loads that would be profitable for you. So it is truly a data-driven business, that’s for sure.

Chad Sanderson: That’s right.

Dave Mariani: That’s awesome. So how did you end up, Chad, being in data and analytics? What was your path into the industry?

Chad Sanderson: It was a pretty non-traditional path, actually. I got my degree from Georgia Southern University in creative writing and linguistics. I spent about a year and a half roaming through Thailand as a freelance writer. I moved into design and started getting into social media analytics, working with a couple of companies while I was traveling through Southeast Asia, and I did that for about three to four years. I worked with a whole bunch of really interesting companies, from event-based athletic startups to reality TV shows. Then I came back to the U.S., and I really liked the social media analytics work that I did. So I got a role at a really small business in the northern foothills of Georgia, in a town called Cartersville, working for a company that made grill parts.

Chad Sanderson: I was doing internal analytics for them. The specific type of digital analytics I was doing was called conversion rate optimization. It’s using web data to figure out how to optimize websites and apps and stuff like that. That turned out to be a really interesting set of skills to have, and it led directly to a position with Oracle, where I was working on their third-party marketing software. They have a tool called Oracle Maxymiser. It’s an A/B testing software. Having that experience was also incredibly neat. So then I got really deep into the experimentation infrastructure side of things. I bounced around doing A/B testing. That was the job that brought me to Microsoft, to work on experimentation infrastructure. It also brought me to Convoy, where they asked me to build out their internal experimentation platform. And then that grew to cover all the data pipelines and all aspects of our data stack.

Dave Mariani: You know, Chad, it’s a common story I hear a lot. The path into data and analytics isn’t always through the traditional route of going to college to become a computer scientist or data engineer. It always seems to be different paths. I have a similar non-traditional path myself. I studied economics at UCLA, and here I am in technology as a technology founder. So I guess when it comes to data and analytics, you could really come from anywhere. And the message to the listeners out there is, if you’re interested in this subject, there are a lot of different paths to get into a position like the one you’re in.

Dave Mariani: So that’s an awesome story. You mentioned data-driven businesses like Convoy and the work you were doing on conversion attribution, and that’s where I got a lot of my interest in analytics as well, in early email marketing, PR, and digital advertising, where it’s all about using data to improve those conversion rates, which turns into real dollars. So it’s a very exciting connection, a direct link from your work to producing revenue. It’s very cool. So, Chad, you’re a pretty prolific writer about tools and the modern analytics stack. That’s what attracted me to you and made me want to get you on the podcast. You’ve written a lot about the data stack. So talk to me a little bit about what kinds of tools, what kinds of things you’ve been doing at Convoy and before that. Where do you see the industry going? What are some of the exciting things that you’re working on or participating in when it comes to the modern data and analytics stack?

Chad Sanderson: Yeah. I think there are a few really interesting trends that I’ve been seeing play out, not just in the industry, but also at a very micro level within Convoy, happening somewhat rapidly, on an advanced timeline. The first one that’s pretty major is the separation of data science skill sets. Originally you had what a lot of people refer to as a full-stack data scientist. This is the type of role that we hired really frequently when we were at Microsoft. If you wanted to be a data scientist at Microsoft, you had to also be a computer scientist by training, and usually you had to have some level of experience as well. So the folks coming out of school that were joining our team were PhDs in something, but they knew how to write code.

Chad Sanderson: They were pretty good developers. They knew a lot of the fundamentals of software engineering, and oftentimes they knew how to work in a software engineering team. And that’s changing. This is certainly the case at Convoy. There are still a few of those types of folks, but I would say that for the majority of data scientists we hire, their background is primarily in very data-science-specific skill sets: SQL, they know Python, they know how to build models, they know how to explore data really well, they know how to do experiment analysis, they know how to create features. But there’s not as much skill on the software engineering side, or even on the data architecture or data modeling side. And that schism in skill sets has kind of come at odds with a lot of the tools in the modern data stack. Folks don’t really know what to do with that. So that’s one really interesting thing that’s happened in

Dave Mariani: Chad, actually, let’s drill down on that, because that’s really interesting. A lot of people out there have real trouble hiring data scientists, and you talked about a different profile altogether. So where do you go hunting for those people? How do you hire them? Where do you go to find them at Convoy to meet that skill set?

Chad Sanderson: So a lot of people have analyst backgrounds and are really good at SQL. Maybe they spent some time on the business side, or on the sales side, or on the HR or finance side, and they want to make a transition into product. That’s usually a pretty good profile for someone who can start off as more of an entry-level data scientist and then transition up to doing more complex things as they learn Python and some of these other tools and techniques. Another one is folks who are coming out of school. It’s becoming a lot more frequent for data science specialists to come directly from the colleges. These are people who maybe took a couple of courses in computer science, but generally their focus is on data, on working with data and doing machine learning.

Chad Sanderson: So there are a lot more folks like that starting to come out now. And even the bootcamps have been a pretty good source for us as well. If someone understands the fundamentals of how to work with data, you have a pretty good training program in place, and you really need people, and Convoy is in that position because our models are so important, then getting in that skill set and building them up is worth it. I think we’re in a hyper-competitive market right now, as you said, and the days of being extremely selective with your talent are just kind of over for the immediate future.

Dave Mariani: Yeah. I really love to grow people into roles. So find people that have the base skill sets and whose motivations are right. Having them be able to grow into the role is great for them, because they increase their market value and they’re interested and excited to learn new things. And it’s great for the company, because you get to shape that growth. So I really like that strategy of finding somebody with the right skills and then growing them into the role. So when you look at the kids straight out of school that you’re talking about, the ones with a background in data science, what are those majors? What are some of the schools and majors that you’re looking at to find those people? Where are you fishing for them?

Chad Sanderson: I mean, we go everywhere. We definitely focus local. The University of Washington is a big source of our hires, but even some of the bigger, more technical schools like Waterloo and places like that, we’ve had a lot of people come out of there. In terms of the majors, it really ranges across the board. We’ve had quite a number of folks with economics backgrounds that joined the team as scientists. We’ve had mathematicians with a focus on data join the team. There are quite literally data science tracks in a lot of these universities now, so they’re coming out with that explicit understanding. So it ranges across a pretty wide gamut of things. But what we look for is: do you understand how to work with a team? Do you have a really fundamental grasp of SQL, and can you write it efficiently? Do you have an analytical way of thinking about and addressing problems? And do your cultural values fit with the company?

Dave Mariani: So you’re definitely seeing that transition from a traditional data scientist into more of a citizen data scientist.

Chad Sanderson: Yup. I think that’s definitely happening. And this may be a good thing or a bad thing, but data is so fundamental to Convoy that understanding some SQL is really important, given the, in my opinion, limited tool sets that we have. So even if you’re on the ops side, for example, and you need to really quickly go into a database and figure out what the status of a particular shipment is, or you need some details about a shipment that aren’t covered in the UI that’s provided for the ops team, you will have to go and write a query, go into Snowflake, and understand how to do that. Being able to be independent and not have to wait for a data scientist to go and build that query for you is important. So we are seeing more of a development of the skill sets, even in the citizen data science role, as you said.

Dave Mariani: Yeah. SQL is definitely the key language to know. You can’t really do much without that skill set. So, I interrupted you. Besides that transitioning role of the data scientist, what other things are you seeing in the industry that are affecting you and your business?

Chad Sanderson: I think another big thing that’s affecting us is how the direction of the modern data stack was a pretty clear initial divergence from some of the original data warehousing philosophies. And it seems like horseshoe theory is in play, where we’re kind of circling back to a lot of the stuff that was really popular in the nineties when it came to data warehouse design, entity-relationship diagrams, things like that. I think basically what happened is, and this is as someone who was a relative outsider to the startup data space for a pretty long period of time and is only recently getting into it, you had the cloud data warehouse really take off. You had Snowflake and BigQuery, things like that. And that made the ability to pipe data into this relatively cheap storage very easy.

Chad Sanderson: You can just use Fivetran. You could use a CDC method. There are so many different ways of getting the data in, and so many different sources of data. And the modern technology team really resembled Facebook and Amazon in wanting to move extremely quickly. So instead of doing it the old way, where we plan out our entire data map, we do all the transformations really far upstream, and the data engineers own all of that process, which means you need this very heavy upstream governance layer, teams are like, we don’t do that anymore. We want to move faster. So let’s just dump everything that we have into Snowflake, and then we can do all the transforms within Snowflake. And that’s where dbt really started taking off. Like, okay, just take a SQL view, right?

Chad Sanderson: Write some SQL, version the whole data warehouse, and basically treat it like a code base. I think that had pros and cons. The pros were that you can move a lot faster and you can do a lot more with your data as these businesses are evolving very quickly. But the cons are, as a lot of folks are starting to realize, it’s not that great for data governance, and not having data governance and data quality is a really big deal, especially when machine learning is so fundamental to your business. And then, based on the first thing I said, if you’re treating the data scientist like a software engineer, and they have that computer science background, everything that I just said makes a lot of sense, right? Version control, CI/CD, they have all the typical software engineering tools. But if that’s not what their skill set is, then you could be potentially creating a mess.
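
As a rough illustration of the ELT pattern described above (load raw data first, then transform it inside the warehouse with a SQL view that lives in version control), here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for a cloud warehouse; it is not dbt or Snowflake, and the table, view, and column names are hypothetical.

```python
import sqlite3

# Hypothetical raw rows, loaded exactly as they arrived: everything is text,
# nothing has been cleaned or typed upstream.
RAW_ROWS = [
    ("s1", "BOOKED", "1200"),
    ("s2", "delivered", "950.50"),
    ("s3", " Delivered ", "1430"),
]


def load_raw(conn, rows):
    """Extract + Load: copy the source data into the warehouse untouched."""
    conn.execute(
        "CREATE TABLE raw_shipments (shipment_id TEXT, status TEXT, price TEXT)"
    )
    conn.executemany("INSERT INTO raw_shipments VALUES (?, ?, ?)", rows)


def transform_in_warehouse(conn):
    """Transform: a SQL view (kept in version control) cleans the raw table."""
    conn.execute(
        """
        CREATE VIEW delivered_shipments AS
        SELECT shipment_id,
               CAST(price AS REAL) AS price_usd
        FROM raw_shipments
        WHERE TRIM(LOWER(status)) = 'delivered'
        """
    )


def run_elt(rows):
    """Run load then transform, and read back the modeled view."""
    conn = sqlite3.connect(":memory:")
    load_raw(conn, rows)
    transform_in_warehouse(conn)
    query = "SELECT shipment_id, price_usd FROM delivered_shipments ORDER BY shipment_id"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(run_elt(RAW_ROWS))
```

Because the view is just SQL text, it can be reviewed and versioned like application code, which is the speed benefit discussed above. Nothing in the sketch says who owns the raw table or the view, which is the governance gap the conversation turns to next.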

Dave Mariani: So, for the audience, Chad, you mentioned dbt. You’re talking about moving from the traditional ETL to more of an ELT, where dbt has been a really popular tool for performing transformations right in the database, in the data warehouse. So it’s a very different model. What are your solutions to the downsides of this new trend? How are you solving those problems, Chad, the governance problems and the like?

Chad Sanderson: So the way that we think about it, and this goes back to the relatively new concept of data mesh, the fundamental philosophy around data mesh is that the team that produces the data should also be the one that owns whatever the downstream data mart becomes. Whatever valuable business data is built on top of that, that team owns it. And that’s not the world that we live in today. At least at Convoy, that’s what we’re transitioning to. Today, anybody can really own anything. The data warehouse is pretty open. Lineage is really hard. So it’s super difficult to understand, based on all these random transformations that are happening and all the dependencies on transformations. It’s just really opaque who owns what.

Chad Sanderson: Anytime something breaks, the data scientists have to go and act like Sherlock Holmes and figure that out, or try to throw it to the data engineering team so they can go and figure it out. So what we’re trying to do is actually create that clear concept of ownership from the source, having a software engineer actually own the production code, the events in this case. We’re using Kafka streams, and those Kafka events are tied directly to the semantic events that are most relevant to our business, the semantic events and the semantic entities. So that’s one part of it. And that does involve an upfront definition. So it’s kind of going back to the old world, where you have an ERD and you’re defining: what are the entities that are really important? What are the real-world things that happen? How is all that stuff connected? What are the relationships between them? And then the software engineer goes and implements those, and from there we pipe it into the data warehouse. The other thing I think is important is this concept of a semantic layer. The first thing I described is the upstream business logic, and the semantic layer is more like the downstream business logic. It’s once we have that

Dave Mariani: I think of it as physical versus logical. So, yeah.

Chad Sanderson: That’s a great way of describing it, the physical layer versus the logical layer. A concept like margin, which does not exist in the real world, has to be the subject of many different transformations, combining different types of costs and revenue and things like that. There needs to be some way to express that clearly in one place, so that 50 people aren’t each coming up with their own definition of margin, and then our finance team has no idea what to do with it. So those two layers are where Convoy is invested right now. We’re still in the middle of this transition, so I can’t say I know exactly how it’s going to go, but the hope is that once that plan emerges, you will have a very clear ownership boundary between the software engineers and what they own, the data that the downstream team actually cares about. And then you have very clear ownership over these semantic, logical concepts in the data warehouse.
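
One way to picture "express that clearly in one place" is a small metric registry in which each logical concept has exactly one definition and one named owner. The sketch below is only an illustration of that idea: the registry, the field names, and the margin formula are all assumptions made for the example, not Convoy's actual definition or any particular semantic-layer product.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Metric:
    """A logical business concept with exactly one definition and one owner."""
    name: str
    owner: str
    description: str
    compute: Callable[[Dict[str, float]], float]


# The single place where logical metrics are defined (hypothetical registry).
METRICS: Dict[str, Metric] = {}


def register(metric: Metric) -> None:
    """Add a metric, refusing a second definition under the same name."""
    if metric.name in METRICS:
        raise ValueError(f"metric already defined: {metric.name}")
    METRICS[metric.name] = metric


def evaluate(name: str, row: Dict[str, float]) -> float:
    """Every consumer computes the metric through the shared definition."""
    return METRICS[name].compute(row)


# Hypothetical formula and field names, chosen only for illustration.
register(
    Metric(
        name="margin",
        owner="finance",
        description="Shipper revenue minus carrier cost and accessorial cost",
        compute=lambda row: row["shipper_revenue"]
        - row["carrier_cost"]
        - row["accessorial_cost"],
    )
)

if __name__ == "__main__":
    shipment = {
        "shipper_revenue": 1500.0,
        "carrier_cost": 1200.0,
        "accessorial_cost": 50.0,
    }
    print(evaluate("margin", shipment))
```

The duplicate check in `register` is the point of the design: a second team cannot quietly introduce a competing definition of margin, so anyone downstream who asks for "margin" gets the same answer.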

Dave Mariani: Yeah, and I love that. Of course I love semantic layers. That’s all we’re about. But I want to go back a little bit to the concept of data mesh, because there’s a lot of confusion. When people think of data mesh, they think of almost like query federation, and that’s really not what it’s about, right? Data mesh is more of an organizational architecture, where you’re trying to decentralize data management and the like. Can you help the listeners, Chad, understand what the core tenets behind data mesh are, and what’s so different about a data mesh compared to what we’ve traditionally done when it came to data and analytics?

Chad Sanderson: So the core of data mesh is really domain-driven design. It’s a way to enable domain-driven design at the organizational level. What it involves is, number one, the software engineer actually has to own the quality. The product team, which includes a data person like an analytics engineer, a product manager, maybe a data product manager, and the software engineer, are all actively working to develop this data domain, and they own it end to end. So if you have a team, for example, that owns the service that generates new shipments, in Convoy’s case, then that team would also be the one that’s responsible for building out, if you’re thinking from the Kimball perspective, all the data marts that are associated with shipments, and they essentially own that pipeline end to end.

Chad Sanderson: And every team that owns a service or owns an entity or owns a domain is doing the same thing. And then they’re sharing that data. So if I am on a team that maybe really cares about the ETAs of shipments and I need pricing data, then I’m going to go over to the pricing data mart, which is owned by the pricing team, who owns the pricing service. So it’s the federation of the ownership of business concepts and domains and entities. And it truly is, as you described, a massive organizational shift. That is not the way that we design teams today. And if you’re doing that, if you’re owning the entire pipeline, what it basically necessitates is that you have a group within your product organization that is thinking explicitly about how the data evolves over time.

Chad Sanderson: And that involves data product managers. That’s where this concept is starting to become a lot more popular: what are the needs of our customers that are going to be accessing this data? You’re going to need something like an analytics engineer, whose job is to do all the joins and all the modeling, and who requests the schemas that need to be implemented in production. And that’s just not how we’re set up today. At least that’s certainly not how Convoy is set up today, and that’s not how a lot of companies are. So I think when people talk about data mesh, they think about it almost like it’s a technology solution, and it really isn’t so much.
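
The ownership idea discussed above, where a schema is defined up front and a named producing team is accountable for the events it emits, can be sketched as a small event contract. This is an assumption-laden illustration in plain Python: the contract class, the event name, the team name, and the fields are hypothetical, and no Kafka client or schema registry is involved.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class EventContract:
    """An upfront schema for one semantic event, plus the team accountable for it."""
    event_name: str
    owning_team: str
    fields: Dict[str, type]

    def violations(self, event: Dict[str, Any]) -> List[str]:
        """Return a list of contract problems; an empty list means the event conforms."""
        problems = []
        for field, expected in self.fields.items():
            if field not in event:
                problems.append(f"missing field: {field}")
            elif not isinstance(event[field], expected):
                problems.append(
                    f"wrong type for {field}: expected {expected.__name__}"
                )
        for field in event:
            if field not in self.fields:
                problems.append(f"unexpected field: {field}")
        return problems


# Hypothetical contract: names and fields are illustrative only.
SHIPMENT_CREATED = EventContract(
    event_name="shipment_created",
    owning_team="shipments-service-team",
    fields={
        "shipment_id": str,
        "origin": str,
        "destination": str,
        "weight_lbs": float,
    },
)

if __name__ == "__main__":
    bad_event = {"shipment_id": 42, "origin": "SEA", "destination": "PDX"}
    for problem in SHIPMENT_CREATED.violations(bad_event):
        # The contract tells you both what broke and which team to ask.
        print(f"{SHIPMENT_CREATED.owning_team}: {problem}")
```

A producer could run such a check before publishing, so that a broken event is caught by the team that owns it instead of being discovered downstream by a data scientist.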

Dave Mariani: It’s not. It’s truly an organizational restructuring, and a restructuring of who owns the data too. It’s not a centralized team anymore. It’s distributed, decentralized. So, Chad, along those lines, some of our customers have a hub-and-spoke model, and I hear that a lot in the industry as well. So how do a hub-and-spoke model and a data mesh model relate? Are they the same? Are they different? Are they part of the same story? What’s your opinion on that?

Chad Sanderson: I think that they are different, but there’s overlap. I don’t see how you really get to a data mesh model without some type of hub. In my opinion, and maybe I’ll be proven wrong over time as we continue down this path, there are probably always going to be things that are more cross-organizational, or where the cost of lifting them and turning them into a totally separate service and pipeline that a single team can own is just not feasible. So, for example, in a monolithic system, you might have a monolith that’s generating some really important entity or business concept. And maybe that was the case from the very beginning of the company.

Chad Sanderson: And nobody really owns it. There’s not a team. This is actually the case at Convoy today, where we have a concept called shipments. That entity is produced by our monolith, and there is no shipments team. So what do you do in that situation, where there is no shipments team and you have this really critical concept that affects basically every single team at the company? I think if you doggedly push for data mesh without thinking about the realities of the business, then you could wind up in a really bad place that doesn’t help your customers. So I do think that having that central governance layer that acts as a hub and understands where the gaps in the company are and what roles we need to fill, a set of specialists that can look across these different areas, probably will be required as teams make the transition. And then maybe at some point in the future, we will achieve true federation, where the hub model is not actually needed anymore and everybody is equipped to own their domains and iterate on them. But it’s hard for me to imagine how that would work completely.

Dave Mariani: Yeah, I’m with you. I think that, at a minimum, you need a hub. You need a centralized team today just to choose the tool sets and to establish standards, right? Because if you’re going to decentralize the ownership of these domains, you still need to play by the same rules. Otherwise you’re not going to be able to combine the domain data to get those composite views. I’m an old dimensional guy, and I think of conformed dimensions. A calendar is a conformed dimension, and you’ve got to be talking about time the same way, regardless of whether you’re talking about shipments or CTS. It’s the same thing with the product hierarchy. It’s the same thing with an organizational hierarchy. You want to have some core entities, and somebody needs to own those.

Dave Mariani: And maybe it’s not a domain owner. Maybe it’s the hub. And then the spokes, the domain teams, get to plug and play with some of the common objects. So I definitely see the hub-and-spoke model as compatible with the data mesh concept, which is really about decentralizing ownership of those data domains. I think you need them both, and I’m excited to see how this develops in organizations. It’s going to start with the Convoys, because you can move fast and you can experiment and you can try these new things out. It’s a lot harder for a lot of big enterprises to really move and make that kind of a sea change. So I definitely hope that organizations like Convoy and you, Chad, can work this out for the rest of us.

Chad Sanderson: We will try our best. And I think it’s a really good point. One of the things that I try to advise folks who are considering data mesh is to actually not do that. Don’t even think about data mesh. What I would advise is to start from the problems that your customers have and the problems that the business actually has, and then work backwards. If having a hub team is not a part of whatever article on data mesh you read, it doesn’t really matter that much if, when you do your analysis of the business needs, it’s very clear that there would be gaps in ownership. So that’s what I recommend.

Chad Sanderson: We actually started this whole process around governance and quality, and I had never heard of data mesh. We had never even heard that term. But what we understood was that the data science teams did not have the ability to own data quality themselves, because there was this very nebulous upstream arrangement where software engineers were essentially just dumping whatever they wanted into production tables and then saying, don’t use this data. But the data scientists didn’t have a choice, because there was nothing else. And that created this crazy ownership model, where we had a schema that we call BI, which was basically a sandbox playground where data scientists would do a lot of modeling, essentially creating pseudo data marts there. And that got out of control, because you had a crazy amount of dependencies, the SQL wasn’t scalable, and you had stuff from four or five years ago that hadn’t been updated. Our data team was asking for a solution, and this is what we came up with. It so happens that I think the federated model is the way of the future, but you just start from the problem set first and then work backwards.

Dave Mariani: I like that. That’s great advice: work backwards, start small, figure it out, and solve one problem at a time. Don’t try to boil the ocean. So, Chad, you’ve been talking a lot about data quality, and for Convoy, as you mentioned at the start, data quality is critical. If you get it wrong, you lose money. Data quality is so big and nebulous. How do you define data quality, and then how do you implement a data quality program at Convoy? How’d you do it?

Chad Sanderson: That’s a pretty good question. We have a lot of the somewhat traditional views on data quality. We have SLAs in place, and we have a lot of monitors to check for query failures and, in general, things that are breaking. So that’s important, and that’s the responsibility of our central data engineering team. They own the infrastructure in our pipeline up to a certain point. Basically, all the data lands as raw JSON in our system, and there’s an automated macro that transforms it: it parses it and renames it. We call that our source schema. Up until that point, data engineers own it. They own all the quality for it. They make sure that things are not breaking and that, if we’re merging a whole bunch of pipelines, like merging Fivetran data with Kafka data, it’s all deduplicated and all that stuff.
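As a rough illustration of the landing step described above (parse raw JSON, rename fields into a source schema, deduplicate events that arrive through more than one pipeline), here is a minimal Python sketch. The field names, the rename mapping, and the deduplication key are all hypothetical; this is not Convoy’s actual macro.

```python
import json

# Hypothetical mapping from raw field names to source-schema column names.
RENAMES = {"shipmentId": "shipment_id", "ts": "event_time", "evt": "event_type"}


def to_source_rows(raw_lines):
    """Parse raw JSON lines and rename fields to the source-schema names."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({RENAMES.get(key, key): value for key, value in record.items()})
    return rows


def deduplicate(rows, key_fields=("shipment_id", "event_type", "event_time")):
    """Keep the first row seen for each key, preserving arrival order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# The same event arriving through two pipelines (say, a batch load and a stream).
batch_lines = ['{"shipmentId": 1, "evt": "picked_up", "ts": "2021-06-01T10:00:00Z"}']
stream_lines = [
    '{"shipmentId": 1, "evt": "picked_up", "ts": "2021-06-01T10:00:00Z"}',
    '{"shipmentId": 1, "evt": "delivered", "ts": "2021-06-02T09:30:00Z"}',
]

source_rows = deduplicate(to_source_rows(batch_lines + stream_lines))
```

Here the duplicated “picked_up” event is kept once, so `source_rows` holds two rows with the renamed columns.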

Chad Sanderson: And that’s definitely really valuable, but I think data quality is obviously a lot more than that. There’s the Monte Carlo metric that they’ve been talking about a lot recently, data downtime. That’s a really important thing to include. Then there’s the accuracy of the data, which is typically talked about as a major part of data quality, but I find that for accuracy, meaning “are we measuring the right things?”, there’s not a clear way to measure that. It’s much more qualitative. Are we emitting the right things in the first place? Are we actually capturing all the data that we need? Is the data discoverable? There’s a whole body of quality work that focuses on usability: can teams find the data that they need, and can they actually trust the decisions that they come to if they leverage this data?

Chad Sanderson: So we’ve been focusing on that a lot recently, and the reason is that we have found that side of the house, which we’ve just been calling the data usability subset of data quality, is the bottleneck to the speed of our data team. How quickly can they iterate over models? How much time do they actually spend monitoring data versus doing deep work, like quality model development, experiments, or analytics? And trustworthiness of data as well: if we run an experiment where we deploy a new model, how confident are we that that model was trained on the right data? That’s a really hard question that you can’t really answer a hundred percent ever, but I think there are ways of getting close.

Dave Mariani: So how do you measure your data quality? Do you have a dashboard somewhere? How do you know if you’re getting better?

Chad Sanderson: Yeah. So we have dashboards around our SLAs. We monitor freshness and timeliness and uptime and all those great things. That is a responsibility of our data engineering team. We have an on-call, and if anything ever breaks, their responsibility is to go and try to understand what’s going on in the pipeline and fix it. They also respond to ad hoc requests from folks downstream who understand where things are breaking, where maybe we have a little bit less visibility. And to be clear, that visibility layer extends only to what the central data infrastructure team actually has control over. Everything else is kind of the wild west. If something breaks there, then generally it’s on the team that built it.
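To make the freshness part of such an SLA dashboard concrete, here is a small Python sketch of a freshness check. The table names, SLA thresholds, and load timestamps are invented for illustration, and the function is an assumption about how such a check could look, not a description of Convoy’s monitors.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLAs: the maximum acceptable age of the newest row in each table.
FRESHNESS_SLAS = {
    "source.shipments": timedelta(hours=1),
    "source.carrier_events": timedelta(minutes=15),
}


def stale_tables(last_loaded_at, now, slas=FRESHNESS_SLAS):
    """Return the tables whose newest data is older than their SLA allows."""
    breaches = {}
    for table, max_age in slas.items():
        loaded = last_loaded_at.get(table)
        # A table that has never loaded counts as a breach (age reported as None).
        if loaded is None or now - loaded > max_age:
            breaches[table] = None if loaded is None else now - loaded
    return breaches


now = datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)
last_loaded_at = {
    "source.shipments": now - timedelta(minutes=20),
    "source.carrier_events": now - timedelta(minutes=45),
}
breaches = stale_tables(last_loaded_at, now)
```

In this example only `source.carrier_events` breaches its SLA, which is the kind of signal an on-call engineer would be paged on.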

Chad Sanderson: We expect them to have good monitors and leverage dbt to add the right alerts and tests and things like that, although the frequency with which that actually happens is quite low. But if they come to us and it’s a serious breaking issue, then we will do our best to work with them. The other things that I mentioned, about accuracy and whether the data is discoverable, are not really something that we measure extremely well in an automated way. We do have surveys that go out every quarter, sometimes every other quarter, that ask for certain things almost like an NPS score: How easy is it to do this set of tasks? Can you find the data that you need?

Chad Sanderson: How long does it take you to find the data that you need? What is your experience working with our variety of infrastructure tools? Where do you experience the most pain? How does this impact your work? Those are the types of questions that we ask. We produce a score, and then every quarter we try to report out on that score and incrementally make it better. Of course, the hard thing is that these numbers are hyper-variable because the data team is not that big. If you have any major issue that happens in a quarter, it can swing that number one way or the other. So it’s okay as a directional metric, but really it’s more about just going out and having conversations with people and understanding the general sentiment of the team.
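The variability point is easy to see with numbers. The sketch below assumes the survey is scored like a standard Net Promoter Score (0 to 10 ratings, promoters minus detractors); the transcript only says the score is “almost like” NPS, and the responses here are invented.

```python
def nps(scores):
    """Net Promoter Score on a 0-10 scale: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("need at least one response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


# Hypothetical responses from a ten-person data team in two consecutive quarters.
last_quarter = [9, 9, 10, 8, 7, 9, 6, 8, 10, 7]
# Same team, but two people hit a painful outage and dropped their ratings.
this_quarter = [9, 5, 10, 8, 7, 4, 6, 8, 10, 7]

swing = nps(this_quarter) - nps(last_quarter)
```

With only ten respondents, two people changing their answers moves the score from 40 to 0, which is why a number like this works better as a directional signal than as a precise metric.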

Dave Mariani: Yeah. I really do like that whole concept of having a survey. You’re right that it can swing, especially if the organization is on the smaller side, but ultimately those are the customers, right? Are your customers satisfied or not? That’s a good way of judging, and of at least putting a stake in the ground and hoping you can move in the right direction of making them more satisfied over time. So, Chad, you’ve been fantastic, because you’ve given some really concrete advice for how to make this all work. It’s been fantastic having you on the podcast because your advice is actionable, and I love that. So, just as a closing question: for people out there who are looking to do what you’re doing, what’s some of your best advice for how they can achieve what you’re achieving and improve their data-driven decision-making?

Chad Sanderson: I would probably suggest two things. The first thing that I would do is go and talk to your internal customers. I find this is not something that data teams do enough. We can oftentimes operate from best practices and analogy: this thing worked at another company, and so I want to do it now at this company. You should go out and speak to people. Take 30 minutes and try to talk to as many folks on the data team as you can, about the widest variety of things that you can. Really try to understand their day-to-day, where they struggle, and what their specific problems are. Then, based on those specific problems, try to extrapolate the impact on the business. If you have a machine learning team and they’re spending 75% of their time every single week just managing the data to get features, then making the data more discoverable, or solving whatever that problem is, is much more impactful than upgrading to the next version of dbt, which may be the dominant thing on the roadmap.

Chad Sanderson: Right, that’s one thing. The other thing that I would recommend, which I’ve had a lot of success with in getting Convoy to prioritize data projects, is to always frame things in terms of business ROI. I think another problem that people might have is they’ll say, “We want to upgrade to these technologies, implement these solutions, and organize the team in this way.” A lot of it is very theoretical and philosophical, and that doesn’t resonate with business leaders, because they’re trying to prioritize based on what can make the business money. They’re thinking very explicitly: if I have one engineering head, where can I put that person in order to get the most bang for my buck? And a lot of data teams just are not competitive with other organizations because they don’t present their problems in that way.

Chad Sanderson: So when we were making this big data quality push, the way that I framed it was a narrative around why data quality actually impacts Convoy at a fundamental level. We have these machine learning algorithms. Here are the ones that we think are untrustworthy. Here’s how much revenue is flowing through those models today. If they were X percent off, here’s what the impact of that would be on the business. Here are the things that we would need to do in order to get to a trustworthy place of quality. Here are the resources we would need. And here’s how quickly we’d be able to see results. When you frame it that way, you actually become competitive with customer-facing features in terms of priority.
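The “if they were X percent off” framing comes down to simple arithmetic. The Python sketch below shows one hedged way to lay it out. Every figure (revenue per model, the 2 percent error, the project cost) is made up for illustration, and treating exposure as revenue times error percentage is a simplifying assumption rather than a method stated in the conversation.

```python
def revenue_at_risk(annual_revenue_through_model, percent_off):
    """Revenue exposed if the model's decisions are off by `percent_off` percent."""
    if not 0 <= percent_off <= 100:
        raise ValueError("percent_off must be between 0 and 100")
    return annual_revenue_through_model * percent_off / 100


# All figures are invented for illustration.
models = {"pricing": 50_000_000, "matching": 20_000_000}
exposure = {name: revenue_at_risk(revenue, 2) for name, revenue in models.items()}

# Hypothetical cost of the data quality project, in the same currency.
project_cost = 700_000
payback_ratio = sum(exposure.values()) / project_cost
```

Under these made-up numbers, a 2 percent error exposes 1.4 million in revenue against a 700,000 project, a ratio of 2 to 1, which is the kind of figure a business leader can weigh against a customer-facing feature.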

Dave Mariani: I love it. That’s such actionable advice, such good advice. Chad, you’re awesome. Thank you so much for joining me today. Chad, head of product and data platform for Convoy, thanks for joining the podcast. And to all the listeners out there, thanks for listening, and be data-driven. Thanks.