"Now I recognize there are a million reasons why this wouldn’t work"
Are there really a million reasons this wouldn't work? I don't have a deep intuition about what technical blockers would prevent something like this from existing. Seems like elements of it could actually happen today if we really wanted.
There are maybe really only five reasons why it wouldn't work, but you're right: if we want it enough, it would be possible.
There aren't a million, but the few blockers I can think of aren't easy to solve. For instance, there are thousands of syntactic differences between any data warehouse and DuckDB. SQLGlot has mapped maybe 1% of that syntax for transpilation. Any significantly advanced dbt project is going to use syntax that is not yet mapped. Also, would you still read the source data from the real warehouse? If not, copying it out of the warehouse and into a file system that DuckDB can read from isn't a trivial feat.
Yea, I had the same thoughts. Some of it can be overcome by using dbt macros, but there will still be major differences. Unit testing is really tough for data, and I don't know if my fantasy world would really make it useful or practical, to be honest.
I confess my naïveté about language servers -- would it not be feasible to write a dbt implementation of the Language Server Protocol in a reasonable amount of time? Surely dbt as a language can't be any more complex than, say, Terraform or other purpose-built languages. Are there technical reasons why a mixed SQL-Jinja-YAML cocktail would be intractable to parse?
Granted this doesn't address your point about file consolidation, which is a good one. Putting the templated SQL into the YAML feels incremental enough that it might be a feasible way forward.
Yea, you're right that you could get some of the benefits of a language server on the SQL-Jinja alone. SQL already has language servers, which can provide some limited functionality. Jinja has no concept of types, though, and is very hard to debug. Don't forget there's also the Python interpreter in the mix somewhere too.
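The debugging pain is easy to demonstrate: by default, Jinja renders an undefined (e.g. misspelled) variable as an empty string rather than raising, so typos only surface as broken SQL downstream. A small sketch with an illustrative template:

```python
from jinja2 import Template

# "schma" is a typo for "schema" -- Jinja's default Undefined silently
# renders it as an empty string instead of raising an error.
sql = Template("SELECT * FROM {{ schma }}.orders").render(schema="analytics")
print(sql)  # SELECT * FROM .orders
```

Passing `undefined=jinja2.StrictUndefined` to the environment turns this into a loud error at render time, but there is still no notion of types for a language server to lean on.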
Not so sure about the YAML side. I don't think YAML is an expressive enough language, which is why so much of what we end up writing there when building metrics or tests can feel so clunky. Incremental improvements are definitely possible, but the whole point of this post was to remove constraints and think of what dbt might look like if we shed the weight of our past decisions.
Definitely. There's a lot of value to divergent thinking so I really appreciate your original post.
So my first thought is to toss out a platitude like "YAML isn't meant to be a very expressive language," but of course that's missing the point. Instead I want to learn about what would provide a good balance of expressivity and light weight. It makes me think of something like TypeScript -> JS transpilation, where you can impose type safety on yourself but in a manner that can easily be ignored for those who don't need (or don't think they need) it.
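Python's gradual typing offers the same opt-in story as the TypeScript -> JS analogy: annotations are ignored at runtime and only bite if you choose to run a checker like mypy. A small sketch (the function is purely illustrative):

```python
def tag(order_id: int) -> str:
    # The annotations are documentation plus optional tooling input;
    # nothing enforces them when the code actually runs.
    return f"order-{order_id}"

print(tag(42))      # order-42
print(tag("oops"))  # order-oops -- runs fine; only a type checker would object
```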
I'd say both LookML and Malloy offer some great alternatives to the dbt model, and it feels like they strike the right balance. Malloy is written in TypeScript, so a very similar model to what you're proposing too.
Hi there,
It's not fantasy land!
It's a good idea and has been built already, at least once by the team I work on at Criteo.
I've written about how it works and our experience with it here: https://medium.com/criteo-engineering/scheduling-data-pipelines-at-criteo-part-1-8b257c6c8e55
Our unit-test definitions look like what you described here. We just have a little more complexity for partitioned tables and some meta-information like the test name.
The language support we built allowed us to implement column lineage for SparkSQL; however, we don't yet have anything like relationships, nor an integration with downstream BI tools.
We have some improvements to our data-quality metrics support on our very packed roadmap, but we haven't thought about leveraging relationship info as you describe; for now those are just text in our data catalog documentation.
My dream is that we can find the time to open-source it so we could have more hands to implement such ideas!