"Now I recognize there are a million reasons why this wouldn’t work"

Are there really a million reasons this wouldn't work? I don't have a deep intuition about what technical blockers would prevent something like this from existing. Seems like elements of it could actually happen today if we really wanted.

Mar 9·edited Mar 9

I confess my naïveté about language servers -- would it not be feasible to write a dbt implementation of the Language Server Protocol in a reasonable amount of time? Surely dbt as a language can't be any more complex than, say, Terraform or other purpose-built languages. Are there technical reasons why a mixed SQL-Jinja-YAML cocktail would be intractable to parse?
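For what it's worth, the hard part of an LSP server is the language analysis, not the protocol itself: the base protocol is just JSON-RPC messages framed with a Content-Length header over stdio. A stdlib-only Python sketch of that framing (helper names are my own, purely illustrative):

```python
import io
import json

def write_message(stream, payload: dict) -> None:
    """Frame a JSON-RPC message with the LSP Content-Length header."""
    body = json.dumps(payload).encode("utf-8")
    stream.write(b"Content-Length: %d\r\n\r\n" % len(body))
    stream.write(body)

def read_message(stream) -> dict:
    """Read one LSP-framed JSON-RPC message from a byte stream."""
    length = None
    while True:
        line = stream.readline().strip()
        if not line:
            break  # blank line ends the header section
        name, _, value = line.partition(b":")
        if name.lower() == b"content-length":
            length = int(value)
    return json.loads(stream.read(length))

# Round-trip the kind of `initialize` response a server would send.
buf = io.BytesIO()
write_message(buf, {"jsonrpc": "2.0", "id": 1,
                    "result": {"capabilities": {"definitionProvider": True}}})
buf.seek(0)
msg = read_message(buf)
```

Everything dbt-specific (resolving `ref()`, expanding Jinja, reading YAML configs) would live behind handlers for requests like `textDocument/definition`; the transport above stays the same for any language.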

Granted this doesn't address your point about file consolidation, which is a good one. Putting the templated SQL into the YAML feels incremental enough that it might be a feasible way forward.


Hi there,

It's not fantasy land!

It's a good idea and has been built already, at least once by the team I work on at Criteo.

I've written about how it works and our experience with it here: https://medium.com/criteo-engineering/scheduling-data-pipelines-at-criteo-part-1-8b257c6c8e55

Our unit-test definitions look like what you described here. We just have a little more complexity for partitioned tables and some meta-information like the test name.

The language support we built allowed us to implement column lineage for SparkSQL, but we don't yet have anything like relationships, nor an integration with downstream BI tools.
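To illustrate the idea of column lineage at its simplest: map each output column of a query back to the source column it came from. A toy, regex-based Python sketch (function name is my own) that only handles a single flat SELECT; real SparkSQL lineage needs a full parser, which is what the language support above provides:

```python
import re

def toy_column_lineage(sql: str) -> dict:
    """Map output columns of `SELECT col [AS alias], ... FROM table`
    back to their source columns. Toy example: no joins, subqueries,
    or expressions, which a real SQL parser would be needed for."""
    m = re.match(r"\s*select\s+(.+?)\s+from\s+(\w+)",
                 sql, re.IGNORECASE | re.DOTALL)
    if not m:
        raise ValueError("unsupported statement")
    cols, table = m.group(1), m.group(2)
    lineage = {}
    for item in cols.split(","):
        parts = re.split(r"\s+as\s+", item.strip(), flags=re.IGNORECASE)
        source = parts[0]
        alias = parts[-1] if len(parts) > 1 else source
        lineage[alias] = f"{table}.{source}"
    return lineage

result = toy_column_lineage("SELECT id, amount AS total FROM orders")
# {'id': 'orders.id', 'total': 'orders.amount'}
```

Once you have this mapping per model, chaining it across models gives end-to-end lineage, which is what makes features like relationship checks or BI-tool integration possible.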

We have some improvements to our data-quality metrics support on our very packed roadmap, but we haven't thought about leveraging relationship info as you describe; for now those relationships are just text in our data catalog documentation.

My dream is that we can find the time to open-source it, so we'd have more hands to implement such ideas!
