<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Pedram's Data Based: Deep Dives]]></title><description><![CDATA[Technical content, deep dives, tool explorations, and more]]></description><link>https://databased.pedramnavid.com/s/deep-dives</link><image><url>https://substackcdn.com/image/fetch/$s_!Gq30!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2af60e2c-8ad1-48ec-ad43-345f51acbdb3_1280x1280.png</url><title>Pedram&apos;s Data Based: Deep Dives</title><link>https://databased.pedramnavid.com/s/deep-dives</link></image><generator>Substack</generator><lastBuildDate>Fri, 17 Apr 2026 02:55:55 GMT</lastBuildDate><atom:link href="https://databased.pedramnavid.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Pedram Navid]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[pedram@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[pedram@substack.com]]></itunes:email><itunes:name><![CDATA[Pedram Navid]]></itunes:name></itunes:owner><itunes:author><![CDATA[Pedram Navid]]></itunes:author><googleplay:owner><![CDATA[pedram@substack.com]]></googleplay:owner><googleplay:email><![CDATA[pedram@substack.com]]></googleplay:email><googleplay:author><![CDATA[Pedram Navid]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Doing Data The Hard Way Part 1: Extracting Data]]></title><description><![CDATA[It's one table Michael, how hard could it be?]]></description><link>https://databased.pedramnavid.com/p/doing-data-the-hard-way-part-1-extracting</link><guid 
isPermaLink="false">https://databased.pedramnavid.com/p/doing-data-the-hard-way-part-1-extracting</guid><dc:creator><![CDATA[Pedram Navid]]></dc:creator><pubDate>Tue, 02 May 2023 13:30:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HyzY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2e44ee-a025-47bf-9428-fbf73d6ae288_1424x656.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my last post, I promised you all a deep dive into doing data the hard way. In this post, I hope to deliver on that promise.</p><p>I&#8217;ll explore the first part of every data journey: getting data out of a system. The story is likely to be a familiar one to many of you. Data about something you care about exists in some system, and you want to extract that data and store it somewhere. How hard can it be?</p><p>I&#8217;m going to skip over the pleasantries of why you might want to do this and pretend that we all understand it&#8217;s something that needs to be done. There are many types of systems that have data that we might want, but for our example, we will cover a common use case: application databases.</p><p>You can&#8217;t extract data without also putting it somewhere so we&#8217;ll also discuss the merits of two different strategies: saving the data in some structured format like a CSV or a Parquet file, or writing the data directly into a Data Warehouse.</p><p>Let&#8217;s get started</p><h2>Querying Data</h2><p>Given some source system, we will need to query that system to retrieve some subset of data. If our source system is a SQL database, then naturally, we&#8217;ll use SQL. </p><p>In the brute force method, we could extract all data from all tables that we&#8217;re interested in, ignoring system load and storage costs. This is a stateless process that is often overlooked but is rather simple. 
It might look something like this in Postgres:</p><pre><code>\copy customers TO './customers.csv' CSV DELIMITER ','
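-- A hedged variant, not from the original post: \copy also accepts a query in
-- parentheses, and the HEADER option writes column names as the first row.
-- The output file name here is an assumption for illustration.
\copy (SELECT * FROM customers) TO './customers_with_header.csv' WITH (FORMAT csv, HEADER)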
</code></pre><p>We might soon realize that while this is a simple method, it comes with some costs: namely, it can be expensive to run. Our backend database may not appreciate the load. (You are running this against a replica, right?)</p><p>One solution is to only fetch data that has changed since your last run. This is known as an incremental data load. </p><p>We have two options:</p><ol><li><p>Use Change Data Capture (CDC): subscribe to a log of all database events and fetch all events that occurred after our last load. </p></li><li><p>Use a column that updates every time a row is updated, and only fetch records that have been updated since the last load.</p></li></ol><p>The first option tends to be more accurate, but also a bigger pain to set up. It requires configuration of the database server, possibly even a restart. To really understand it, you&#8217;ll need to understand the WAL, or write-ahead log, in Postgres.</p><h3>A quick detour on databases</h3><p>The WAL is a mysterious place within a database. </p><div class="paywall-jump" data-component-name="PaywallToDOM"></div><p>While on the surface a database appears to be a collection of tables that mirrors a spreadsheet with many tabs, under the hood it is a sequence of events. The WAL is the ledger that keeps track of these events. Anytime you insert, update, or delete a row, a record of that transaction is kept, just like it would be in an accounting ledger.</p><p>The balance of all these transactions, much like the bank balance you have, is the collective sum of all these events. The WAL exists to ensure that, in the event a database goes down, a record of everything that happened since the last backup is persisted. It can also be used to sync data to other systems, such as a replica database or even your silly little ETL job.</p><p>A plugin, such as wal2json, can translate these events into something a little more manageable, as we see here. </p><pre><code>{
        "change": [
                {
                        "kind": "insert",
                        "schema": "public",
                        "table": "inventory",
                        "columnnames": ["id", "item", "qty"],
                        "columntypes": ["integer", "character varying(30)", "integer"],
                        "columnvalues": [1, "apples", 100]
                }
        ]
}
{
        "change": [
                {
                        "kind": "update",
                        "schema": "public",
                        "table": "inventory",
                        "columnnames": ["id", "item", "qty"],
                        "columntypes": ["integer", "character varying(30)", "integer"],
                        "columnvalues": [1, "apples", 96],
                        "oldkeys": {
                                "keynames": ["id"],
                                "keytypes": ["integer"],
                                "keyvalues": [1]
                        }
                }
        ]
}
</code></pre><p>Every change to a database emits an event, and nearly every database has its own way of emitting these events. There is no single standard for CDC, so it is up to the downstream implementations to handle the varying logic. Now that you know what a WAL is, let&#8217;s pick an option.</p><h3>Incremental Options</h3><p>The WAL, while appealing, comes with many complexities. We&#8217;d have to store JSON data and process each row. </p><p>Instead, we&#8217;ll opt for using the updated_at column. </p><p>These columns are typically maintained automatically (by a database trigger or by the application) and update whenever a row changes. Be cautious: sometimes they don&#8217;t update, for example when a bulk change bypasses the application, and this can cause inconsistencies.</p><p>This simplifies our future queries quite a bit. Now we can filter on data updated since our last sync.</p><pre><code>select * from customers
-- if incremental run
where updated_at &gt;= {{last_sync_date}}</code></pre><h3>Data Storage and Encoding</h3><p>Once the data has been queried, it needs to be saved. The key decision here is whether to save the data in a binary or text format. You&#8217;re already familiar with text formats: CSV and JSON are the most popular. The benefit of text formats is that they are easy for humans to read, but this comes at the expense of efficiency and introduces ambiguity around data types. </p><p>If you&#8217;ve ever had to parse the text &#8220;01-03-12&#8221; into a date format, you&#8217;ll appreciate the horrors of ambiguity. The structure of JSON also makes for large files, as keys are repeated for every row.</p><p>Binary formats solve these issues by encoding data in a machine-readable format. Binary formats encode the data&#8217;s schema and types and offer more efficient storage of data. Parquet, for example, offers techniques like column-level compression and bit-packing. But nothing is ever easy when it comes to data.</p><p>Parquet has a few well-defined standard types. Your typical types such as INT and FLOAT along with strings, dates, and timestamps are all well supported. The problem is your database may not have the same type system. For example, Postgres has the <a href="https://www.postgresql.org/docs/current/datatype-net-types.html">inet and cidr</a> types which are ways of encoding network addresses. </p><p>Here you may wish to resort to trusty strings for unknown data types by default, allowing downstream consumers to cast these as needed or simply use them as is. </p><p>To extract the data into Parquet, you could use a wonderful little tool like <a href="https://github.com/exyi/pg2parquet">pg2parquet</a>, which can take either a table or a query and output a Parquet file, or even take advantage of DuckDB&#8217;s <a href="https://duckdb.org/2022/09/30/postgres-scanner.html">Postgres scanner.</a> </p><p>With pg2parquet, you can extract an entire table with a single command:</p><pre><code> pg2parquet export \
     --table businesses \
     -H $POSTGRES_HOST \
     -U $POSTGRES_USER \
     --password $POSTGRES_PASSWORD \
     -o ./businesses.parquet \
     --dbname can_i_haz_replica</code></pre><p>You&#8217;ll want to save this to a persistent storage through your cloud provider, such as S3 or GCS for safe-keeping. Don&#8217;t let your hard work go to waste!</p><h3>Skipping Storage</h3><p>Pedram, you might say, why store data twice? I could simply write each row to my data warehouse right away! </p><p>And you are correct, you could do this! In fact, if you think this is a good idea, I&#8217;d encourage you to try it. What you will eventually run into is a failure, and failures are never fun.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HyzY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2e44ee-a025-47bf-9428-fbf73d6ae288_1424x656.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HyzY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2e44ee-a025-47bf-9428-fbf73d6ae288_1424x656.png 424w, https://substackcdn.com/image/fetch/$s_!HyzY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2e44ee-a025-47bf-9428-fbf73d6ae288_1424x656.png 848w, https://substackcdn.com/image/fetch/$s_!HyzY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2e44ee-a025-47bf-9428-fbf73d6ae288_1424x656.png 1272w, https://substackcdn.com/image/fetch/$s_!HyzY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2e44ee-a025-47bf-9428-fbf73d6ae288_1424x656.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!HyzY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2e44ee-a025-47bf-9428-fbf73d6ae288_1424x656.png" width="727" height="334.91011235955057" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb2e44ee-a025-47bf-9428-fbf73d6ae288_1424x656.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:656,&quot;width&quot;:1424,&quot;resizeWidth&quot;:727,&quot;bytes&quot;:302455,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HyzY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2e44ee-a025-47bf-9428-fbf73d6ae288_1424x656.png 424w, https://substackcdn.com/image/fetch/$s_!HyzY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2e44ee-a025-47bf-9428-fbf73d6ae288_1424x656.png 848w, https://substackcdn.com/image/fetch/$s_!HyzY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2e44ee-a025-47bf-9428-fbf73d6ae288_1424x656.png 1272w, https://substackcdn.com/image/fetch/$s_!HyzY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2e44ee-a025-47bf-9428-fbf73d6ae288_1424x656.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If you chose to write directly into the warehouse and something in the process failed, you&#8217;re left scratching your head. How do I debug this issue? What was the row that caused the issue? Can I even see what the data looked like? Why did I choose this career? </p><p>Instead, if you chose to write the data to an intermediate layer, you have some more options available to you. You can take the exact file that failed and inspect it, load it into a dev environment, scan for gremlins, and hopefully address them so that your pipeline becomes more robust to failures in the future. What a happy data engineer you&#8217;ve become!</p><h2>Loading Data</h2><p>Loading data can be fraught with difficulties. 
You will need to insert new rows, update existing rows, and delete removed rows. All of these actions require a primary key. </p><p>The MERGE DML statement allows you to perform all of the above in one go. </p><pre><code><code>MERGE dataset.Inventory T
USING dataset.NewArrivals S
ON T.product = S.product
WHEN MATCHED THEN
  UPDATE SET quantity = T.quantity + S.quantity
WHEN NOT MATCHED THEN
  INSERT (product, quantity) VALUES(product, quantity)</code></code></pre><p>Issues arise, however, whenever schemas change. New columns in your source data mean tables must be altered before an insert. Even worse, if a source data type changes, then you may have a bigger problem on your hands. </p><p>It can often be helpful to add helper columns as you load data, such as a timestamp of when the current batch was loaded, in case rollbacks are needed.</p><h2>Now do it all again </h2><p>Let&#8217;s pretend this wonderful journey proceeded smoothly for you. The next question is, can you do it again tomorrow? Will you know if it ran successfully?</p><p> Doing a task once is much easier than doing a task every day. In my next post, we will look at orchestration and scheduling, and all the other bits that come with it.</p><p>Until next time!</p>]]></content:encoded></item><item><title><![CDATA[Streaming Data Pipelines with Striim + DuckDB]]></title><description><![CDATA[Big thanks to Striim for getting me a preview of their new developer experience and sponsoring this post. Last month I got a sneak preview of Striim&#8217;s new developer experience that makes it easy to get started with CDC using BigQuery or Snowflake. 
If you missed my]]></description><link>https://databased.pedramnavid.com/p/streaming-data-pipelines-with-striim</link><guid isPermaLink="false">https://databased.pedramnavid.com/p/streaming-data-pipelines-with-striim</guid><dc:creator><![CDATA[Pedram Navid]]></dc:creator><pubDate>Tue, 31 Jan 2023 18:01:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/66e257ce-5432-4eba-9c7e-509d70899b6f_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Big thanks to Striim for getting me a preview of their new developer experience and sponsoring this post.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C7bP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8517dd8e-b714-4973-bf5d-d66d8b0de2d2_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C7bP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8517dd8e-b714-4973-bf5d-d66d8b0de2d2_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!C7bP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8517dd8e-b714-4973-bf5d-d66d8b0de2d2_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!C7bP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8517dd8e-b714-4973-bf5d-d66d8b0de2d2_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!C7bP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8517dd8e-b714-4973-bf5d-d66d8b0de2d2_1024x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!C7bP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8517dd8e-b714-4973-bf5d-d66d8b0de2d2_1024x1024.png" width="495" height="495" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8517dd8e-b714-4973-bf5d-d66d8b0de2d2_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:495,&quot;bytes&quot;:2031737,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C7bP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8517dd8e-b714-4973-bf5d-d66d8b0de2d2_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!C7bP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8517dd8e-b714-4973-bf5d-d66d8b0de2d2_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!C7bP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8517dd8e-b714-4973-bf5d-d66d8b0de2d2_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!C7bP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8517dd8e-b714-4973-bf5d-d66d8b0de2d2_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>Last month I got a sneak preview of Striim&#8217;s new developer experience that makes it easy to get started with CDC using BigQuery or Snowflake. If you missed my <a href="https://twitter.com/pdrmnvd/status/1605620847741325312?s=20">thread</a> about that, check it out. In this post, I&#8217;ll look at how you can leverage Striim, Parquet, and DuckDB for real-time data ingestion with fast data analysis. </p><p>Data pipelines have traditionally been batch, and batch pipelines are usually easier to reason about. Data comes in once a day. I run all my transformations and load them sometime between 12:01 AM and 7 AM UTC (or was it PT? Timezones are hard.) 
Views and tables get updated, people look at data from yesterday, they get the answers to all their questions, and life is good. Life is simple.&nbsp;</p><p>Unfortunately, the good old days are dead. Now we run operational workflows off data constantly fed into the data warehouse. We need to reduce the lag of all the various components that ingest, digest, transform, and reform our data as much as possible. For example, we have personalized workflows that send automated emails to prospects and customers who expect us not only to understand every interaction they&#8217;ve had with us but also to anticipate every human desire they could conceivably have in the next fifteen minutes.&nbsp;</p><p>We&#8217;ve started pushing batch to the limits of streaming. Some of these batch tools can run as often as every 5 minutes, pushing the boundaries of what is and isn&#8217;t streaming anymore.</p><p>This is what got me interested in the streaming and CDC space in the first place. I wanted to know if there was a better way. After a dizzying stroll down Debezium Lane, and a confusing jaunt through Kafka Caverns, I received a nice demo from the fine folks at <a href="https://striim.com">Striim</a>.&nbsp;</p><p>Striim is an enterprise-grade CDC platform, and I am but a lowly developer with toy examples, yet it works just as well for me. For my first attempt in the tweet above, I set up a simple Postgres instance, piped data into it, and watched as Striim fed my BigQuery tables with change capture data every few seconds.&nbsp;</p><p>But BigQuery is old news. Today, I wanted to see if I could get our lord and savior, DuckDB, to work with Striim. The setup was simple: use a GCP Writer to save streaming data to Parquet. Then, use DuckDB&#8217;s HTTPFS extension to read data from Parquet files in bulk. Write queries. Enjoy streaming.</p><p>Let&#8217;s dive in.&nbsp;</p><h2>Striim Setup</h2><p>I decided to use one of the built-in data generators to get started quickly. 
These data generators are great for sketching out ideas since they let you avoid the messy parts of connecting data systems, such as permissions and IP allow-lists. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pXZJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd944dbe4-c57d-4c54-846b-82c9ced25086_528x860.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pXZJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd944dbe4-c57d-4c54-846b-82c9ced25086_528x860.png 424w, https://substackcdn.com/image/fetch/$s_!pXZJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd944dbe4-c57d-4c54-846b-82c9ced25086_528x860.png 848w, https://substackcdn.com/image/fetch/$s_!pXZJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd944dbe4-c57d-4c54-846b-82c9ced25086_528x860.png 1272w, https://substackcdn.com/image/fetch/$s_!pXZJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd944dbe4-c57d-4c54-846b-82c9ced25086_528x860.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pXZJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd944dbe4-c57d-4c54-846b-82c9ced25086_528x860.png" width="318" height="517.9545454545455" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d944dbe4-c57d-4c54-846b-82c9ced25086_528x860.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:860,&quot;width&quot;:528,&quot;resizeWidth&quot;:318,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pXZJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd944dbe4-c57d-4c54-846b-82c9ced25086_528x860.png 424w, https://substackcdn.com/image/fetch/$s_!pXZJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd944dbe4-c57d-4c54-846b-82c9ced25086_528x860.png 848w, https://substackcdn.com/image/fetch/$s_!pXZJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd944dbe4-c57d-4c54-846b-82c9ced25086_528x860.png 1272w, https://substackcdn.com/image/fetch/$s_!pXZJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd944dbe4-c57d-4c54-846b-82c9ced25086_528x860.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The <strong>ContinousGenerator</strong> can be set to various levels of throughput: Low Throughput sends about ten messages per second, Medium sends hundreds per second, and Spike produces variable traffic with high spikes, which can be handy for testing the resiliency of your pipelines.</p><p>Next up, I used a <strong>Query</strong> cell, which operates on an incoming stream and allows you to do transformations as data is produced. This can save lots of expensive compute in your warehouse by shifting the transformations left, closer to the data source. You can also do data-masking as data arrives, to make staying compliant easier. 
I wrote a simple query that takes the generated data, masks sensitive information, and outputs the results to a GDPR stream.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MuZ9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36bbf5d-07be-4138-8f9e-7f833594b352_992x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MuZ9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36bbf5d-07be-4138-8f9e-7f833594b352_992x500.png 424w, https://substackcdn.com/image/fetch/$s_!MuZ9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36bbf5d-07be-4138-8f9e-7f833594b352_992x500.png 848w, https://substackcdn.com/image/fetch/$s_!MuZ9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36bbf5d-07be-4138-8f9e-7f833594b352_992x500.png 1272w, https://substackcdn.com/image/fetch/$s_!MuZ9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36bbf5d-07be-4138-8f9e-7f833594b352_992x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MuZ9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36bbf5d-07be-4138-8f9e-7f833594b352_992x500.png" width="551" height="277.7217741935484" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f36bbf5d-07be-4138-8f9e-7f833594b352_992x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:992,&quot;resizeWidth&quot;:551,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MuZ9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36bbf5d-07be-4138-8f9e-7f833594b352_992x500.png 424w, https://substackcdn.com/image/fetch/$s_!MuZ9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36bbf5d-07be-4138-8f9e-7f833594b352_992x500.png 848w, https://substackcdn.com/image/fetch/$s_!MuZ9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36bbf5d-07be-4138-8f9e-7f833594b352_992x500.png 1272w, https://substackcdn.com/image/fetch/$s_!MuZ9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36bbf5d-07be-4138-8f9e-7f833594b352_992x500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Finally, to be able to analyze this data in DuckDB, I elected to write the data in Parquet format to Google Cloud Storage, although S3 would work just as well. To do that, I used the GCP Writer Target. After creating a Service Account in Google Cloud, I set up a few basic settings, such as the path to the bucket and the format I&#8217;d like the files saved in. </p><p>One setting to be aware of is the Upload Policy, which determines how frequently files are written and, as a consequence, how large they are. Finding a good balance here is important, as too many small files or too few large ones can both hinder performance.</p><p>I set the Upload Policy to write every 100,000 events or every 1 minute. 
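</p><p>As a rough sanity check on those two thresholds, the arithmetic can be run as a throwaway DuckDB query. This sketch assumes a steady rate of about 900 events per second, which is roughly what this pipeline reported once it was running, and it assumes an upload is triggered by whichever threshold is reached first. That second point is an assumption and is not confirmed from Striim&#8217;s documentation.</p><pre><code>-- Assumed steady rate of 900 events per second; adjust to match your generator.
SELECT
    900 * 60       AS events_per_minute,  -- 54000, well under the 100,000-event threshold
    100000 / 900.0 AS seconds_to_100k;    -- about 111 seconds, longer than the 1-minute timer</code></pre><p>Under those assumptions the one-minute timer fires first, so each file would hold roughly 54,000 events.</p><p>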
I set the ParquetFormatter as the output option.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ceBj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fdbcf61-1467-4880-9a8b-a3304d43fd1b_754x524.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ceBj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fdbcf61-1467-4880-9a8b-a3304d43fd1b_754x524.png 424w, https://substackcdn.com/image/fetch/$s_!ceBj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fdbcf61-1467-4880-9a8b-a3304d43fd1b_754x524.png 848w, https://substackcdn.com/image/fetch/$s_!ceBj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fdbcf61-1467-4880-9a8b-a3304d43fd1b_754x524.png 1272w, https://substackcdn.com/image/fetch/$s_!ceBj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fdbcf61-1467-4880-9a8b-a3304d43fd1b_754x524.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ceBj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fdbcf61-1467-4880-9a8b-a3304d43fd1b_754x524.png" width="411" height="285.6286472148541" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fdbcf61-1467-4880-9a8b-a3304d43fd1b_754x524.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:524,&quot;width&quot;:754,&quot;resizeWidth&quot;:411,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ceBj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fdbcf61-1467-4880-9a8b-a3304d43fd1b_754x524.png 424w, https://substackcdn.com/image/fetch/$s_!ceBj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fdbcf61-1467-4880-9a8b-a3304d43fd1b_754x524.png 848w, https://substackcdn.com/image/fetch/$s_!ceBj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fdbcf61-1467-4880-9a8b-a3304d43fd1b_754x524.png 1272w, https://substackcdn.com/image/fetch/$s_!ceBj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fdbcf61-1467-4880-9a8b-a3304d43fd1b_754x524.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>With the setup complete, all that is needed is for the app to be deployed and started. You even have a preview feature to watch data as it is fed through the system. 
You can see I&#8217;m fetching about 900 messages every second, and after about a minute the data will write to GCP.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ib2S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592a62d6-f92a-4aa0-b747-c3290dcc851d_1600x971.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ib2S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592a62d6-f92a-4aa0-b747-c3290dcc851d_1600x971.png 424w, https://substackcdn.com/image/fetch/$s_!ib2S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592a62d6-f92a-4aa0-b747-c3290dcc851d_1600x971.png 848w, https://substackcdn.com/image/fetch/$s_!ib2S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592a62d6-f92a-4aa0-b747-c3290dcc851d_1600x971.png 1272w, https://substackcdn.com/image/fetch/$s_!ib2S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592a62d6-f92a-4aa0-b747-c3290dcc851d_1600x971.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ib2S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592a62d6-f92a-4aa0-b747-c3290dcc851d_1600x971.png" width="1456" height="884" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/592a62d6-f92a-4aa0-b747-c3290dcc851d_1600x971.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:884,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ib2S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592a62d6-f92a-4aa0-b747-c3290dcc851d_1600x971.png 424w, https://substackcdn.com/image/fetch/$s_!ib2S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592a62d6-f92a-4aa0-b747-c3290dcc851d_1600x971.png 848w, https://substackcdn.com/image/fetch/$s_!ib2S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592a62d6-f92a-4aa0-b747-c3290dcc851d_1600x971.png 1272w, https://substackcdn.com/image/fetch/$s_!ib2S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592a62d6-f92a-4aa0-b747-c3290dcc851d_1600x971.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>What&#8217;s neat is that Striim even displays the total End to End Lag so you can have insight into how delayed your pipelines are. 
In my case, the lag was about 30 seconds from creation to write.</p><p>After running for a while, the Parquet files are loaded in GCP and now it&#8217;s time to analyze the results with DuckDB.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eWp2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2033dfe6-1d79-4da6-9103-2ef139156f9e_458x546.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eWp2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2033dfe6-1d79-4da6-9103-2ef139156f9e_458x546.png 424w, https://substackcdn.com/image/fetch/$s_!eWp2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2033dfe6-1d79-4da6-9103-2ef139156f9e_458x546.png 848w, https://substackcdn.com/image/fetch/$s_!eWp2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2033dfe6-1d79-4da6-9103-2ef139156f9e_458x546.png 1272w, https://substackcdn.com/image/fetch/$s_!eWp2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2033dfe6-1d79-4da6-9103-2ef139156f9e_458x546.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eWp2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2033dfe6-1d79-4da6-9103-2ef139156f9e_458x546.png" width="364" height="433.93886462882097" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2033dfe6-1d79-4da6-9103-2ef139156f9e_458x546.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:458,&quot;resizeWidth&quot;:364,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eWp2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2033dfe6-1d79-4da6-9103-2ef139156f9e_458x546.png 424w, https://substackcdn.com/image/fetch/$s_!eWp2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2033dfe6-1d79-4da6-9103-2ef139156f9e_458x546.png 848w, https://substackcdn.com/image/fetch/$s_!eWp2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2033dfe6-1d79-4da6-9103-2ef139156f9e_458x546.png 1272w, https://substackcdn.com/image/fetch/$s_!eWp2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2033dfe6-1d79-4da6-9103-2ef139156f9e_458x546.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Analyze with DuckDB</h2><p>There are many ways to use DuckDB given that it&#8217;s a small portable binary. 
The CLI is a great place for simple prototyping but I prefer using Datagrip for writing queries.&nbsp;</p><p>After creating a new DuckDB connection and enabling single-session mode, I added a small startup script to ensure that every time I connect to DuckDB my GCP credentials are entered.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GFTY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecbde811-07b9-4aa8-8a73-76b2593131dd_1106x734.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GFTY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecbde811-07b9-4aa8-8a73-76b2593131dd_1106x734.png 424w, https://substackcdn.com/image/fetch/$s_!GFTY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecbde811-07b9-4aa8-8a73-76b2593131dd_1106x734.png 848w, https://substackcdn.com/image/fetch/$s_!GFTY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecbde811-07b9-4aa8-8a73-76b2593131dd_1106x734.png 1272w, https://substackcdn.com/image/fetch/$s_!GFTY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecbde811-07b9-4aa8-8a73-76b2593131dd_1106x734.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GFTY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecbde811-07b9-4aa8-8a73-76b2593131dd_1106x734.png" width="507" height="336.47197106690777" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ecbde811-07b9-4aa8-8a73-76b2593131dd_1106x734.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1106,&quot;resizeWidth&quot;:507,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GFTY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecbde811-07b9-4aa8-8a73-76b2593131dd_1106x734.png 424w, https://substackcdn.com/image/fetch/$s_!GFTY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecbde811-07b9-4aa8-8a73-76b2593131dd_1106x734.png 848w, https://substackcdn.com/image/fetch/$s_!GFTY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecbde811-07b9-4aa8-8a73-76b2593131dd_1106x734.png 1272w, https://substackcdn.com/image/fetch/$s_!GFTY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecbde811-07b9-4aa8-8a73-76b2593131dd_1106x734.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>The <a href="https://duckdb.org/docs/guides/import/s3_export">docs on setting up S3 or GCS</a> access are pretty straightforward. A few simple SET commands and then you&#8217;re ready to query!</p><pre><code><code>INSTALL httpfs;
LOAD httpfs;
-- Google Cloud Storage is reached through its S3-compatible endpoint; the two keys below are placeholders for a GCS HMAC key pair.
SET s3_endpoint='storage.googleapis.com';
SET s3_access_key_id='MY_ACCESS_KEY';
SET s3_secret_access_key='MY_SECRET';</code></code></pre><p>To start, I ran a simple query to see how many records each file contains. With the <code>filename=TRUE</code> option, DuckDB adds each row&#8217;s source filename as a column, which I use for aggregation.</p><pre><code>SELECT
 
    filename,
    COUNT(1) AS n_records
FROM parquet_scan('s3://my-duckdb-bucket/striim-out.*', filename=TRUE)
GROUP BY filename
ORDER BY 1;</code></pre><p>In about 7 seconds, DuckDB scanned 760,000 records across 14 files of roughly 55,000 records each to generate a count of records by file. And the best part is there&#8217;s no Spark cluster to maintain. As shown below, grouping by filename makes it easy to see how many records were written to each file.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-twD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647f3d01-7ce7-434d-ac85-8c2e5f9afa34_640x444.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-twD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647f3d01-7ce7-434d-ac85-8c2e5f9afa34_640x444.png 424w, https://substackcdn.com/image/fetch/$s_!-twD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647f3d01-7ce7-434d-ac85-8c2e5f9afa34_640x444.png 848w, https://substackcdn.com/image/fetch/$s_!-twD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647f3d01-7ce7-434d-ac85-8c2e5f9afa34_640x444.png 1272w, https://substackcdn.com/image/fetch/$s_!-twD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647f3d01-7ce7-434d-ac85-8c2e5f9afa34_640x444.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-twD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647f3d01-7ce7-434d-ac85-8c2e5f9afa34_640x444.png" width="512" height="355.2" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/647f3d01-7ce7-434d-ac85-8c2e5f9afa34_640x444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:444,&quot;width&quot;:640,&quot;resizeWidth&quot;:512,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-twD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647f3d01-7ce7-434d-ac85-8c2e5f9afa34_640x444.png 424w, https://substackcdn.com/image/fetch/$s_!-twD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647f3d01-7ce7-434d-ac85-8c2e5f9afa34_640x444.png 848w, https://substackcdn.com/image/fetch/$s_!-twD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647f3d01-7ce7-434d-ac85-8c2e5f9afa34_640x444.png 1272w, https://substackcdn.com/image/fetch/$s_!-twD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647f3d01-7ce7-434d-ac85-8c2e5f9afa34_640x444.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We can even do fast text processing. In 6 seconds, I can categorize all products by whether they have Heavy or Lightweight in their name and aggregate across both dimensions.</p><pre><code>SELECT
    product_name LIKE '%Lightweight%' AS is_lightweight,
    product_name LIKE '%Heavy%' AS is_heavy,
    COUNT(1) AS count_products
FROM parquet_scan('s3://my-duckdb-bucket/striim-out.*')
GROUP BY 1, 2;</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bpHn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81c82003-2740-4bee-837d-603adfaf9277_1122x202.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bpHn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81c82003-2740-4bee-837d-603adfaf9277_1122x202.png 424w, https://substackcdn.com/image/fetch/$s_!bpHn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81c82003-2740-4bee-837d-603adfaf9277_1122x202.png 848w, https://substackcdn.com/image/fetch/$s_!bpHn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81c82003-2740-4bee-837d-603adfaf9277_1122x202.png 1272w, https://substackcdn.com/image/fetch/$s_!bpHn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81c82003-2740-4bee-837d-603adfaf9277_1122x202.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bpHn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81c82003-2740-4bee-837d-603adfaf9277_1122x202.png" width="1122" height="202" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/81c82003-2740-4bee-837d-603adfaf9277_1122x202.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:202,&quot;width&quot;:1122,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bpHn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81c82003-2740-4bee-837d-603adfaf9277_1122x202.png 424w, https://substackcdn.com/image/fetch/$s_!bpHn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81c82003-2740-4bee-837d-603adfaf9277_1122x202.png 848w, https://substackcdn.com/image/fetch/$s_!bpHn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81c82003-2740-4bee-837d-603adfaf9277_1122x202.png 1272w, https://substackcdn.com/image/fetch/$s_!bpHn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81c82003-2740-4bee-837d-603adfaf9277_1122x202.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The best part is this data is constantly updated by Striim. Every minute a new batch of 55,000 records arrives.&nbsp;</p><p>That was fun, but we needed to go faster. Just for fun, I cranked up the generator to see how it would handle a higher rate and set the Upload Limit to 25,000 records per file. 
I easily hit 20,000 messages per second, and the end-to-end lag was just a few seconds. Striim had no problem with the throughput. In just a few minutes, I had 60 Parquet files ready for DuckDB to process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AiPs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8debd398-9475-49fe-98d2-66b2699290b4_1600x810.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AiPs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8debd398-9475-49fe-98d2-66b2699290b4_1600x810.png 424w, https://substackcdn.com/image/fetch/$s_!AiPs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8debd398-9475-49fe-98d2-66b2699290b4_1600x810.png 848w, https://substackcdn.com/image/fetch/$s_!AiPs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8debd398-9475-49fe-98d2-66b2699290b4_1600x810.png 1272w, https://substackcdn.com/image/fetch/$s_!AiPs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8debd398-9475-49fe-98d2-66b2699290b4_1600x810.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AiPs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8debd398-9475-49fe-98d2-66b2699290b4_1600x810.png" width="1456" height="737" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8debd398-9475-49fe-98d2-66b2699290b4_1600x810.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:737,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AiPs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8debd398-9475-49fe-98d2-66b2699290b4_1600x810.png 424w, https://substackcdn.com/image/fetch/$s_!AiPs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8debd398-9475-49fe-98d2-66b2699290b4_1600x810.png 848w, https://substackcdn.com/image/fetch/$s_!AiPs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8debd398-9475-49fe-98d2-66b2699290b4_1600x810.png 1272w, https://substackcdn.com/image/fetch/$s_!AiPs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8debd398-9475-49fe-98d2-66b2699290b4_1600x810.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>With 60 files now to process, DuckDB took just under 30 seconds to count every record in every file. 
The product name query now took 23 seconds on 1.45 million records.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UW9u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf50d02e-b2ea-4358-82cd-03fe1b25f68d_1128x214.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UW9u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf50d02e-b2ea-4358-82cd-03fe1b25f68d_1128x214.png 424w, https://substackcdn.com/image/fetch/$s_!UW9u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf50d02e-b2ea-4358-82cd-03fe1b25f68d_1128x214.png 848w, https://substackcdn.com/image/fetch/$s_!UW9u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf50d02e-b2ea-4358-82cd-03fe1b25f68d_1128x214.png 1272w, https://substackcdn.com/image/fetch/$s_!UW9u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf50d02e-b2ea-4358-82cd-03fe1b25f68d_1128x214.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UW9u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf50d02e-b2ea-4358-82cd-03fe1b25f68d_1128x214.png" width="1128" height="214" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df50d02e-b2ea-4358-82cd-03fe1b25f68d_1128x214.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:214,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UW9u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf50d02e-b2ea-4358-82cd-03fe1b25f68d_1128x214.png 424w, https://substackcdn.com/image/fetch/$s_!UW9u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf50d02e-b2ea-4358-82cd-03fe1b25f68d_1128x214.png 848w, https://substackcdn.com/image/fetch/$s_!UW9u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf50d02e-b2ea-4358-82cd-03fe1b25f68d_1128x214.png 1272w, https://substackcdn.com/image/fetch/$s_!UW9u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf50d02e-b2ea-4358-82cd-03fe1b25f68d_1128x214.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>As one final test, I decided to push some regex and aggregates down to see the impact on performance, and DuckDB held up well. This query took under a minute to query all 1.45 million records, and I didn&#8217;t have to store a single file locally. (And if you were wondering, the average of the last 4 digits of a phone number is 5001).</p><pre><code>SELECT
    date_trunc('minute', CAST(TIME AS TIMESTAMP)) AS DATE,
    avg(CAST(regexp_extract(Phone_Number, '\d+') AS NUMERIC)) AS avg_number
FROM parquet_scan('s3://my-duckdb-bucket/striim-out.*')
GROUP BY 1</code></pre><h2>Wrapping Up</h2><p>I hope this was a helpful exploration of how you can use Striim and DuckDB to process real-time analytic queries quickly and easily. Gone are the days of Kafka, Zookeeper and Debezium. In less than 30 minutes you can get a CDC stream set up, write to a cloud bucket location, and query with DuckDB for blazing-fast analytics.&nbsp;</p><p>If you want to give Striim a try, <a href="https://signup-developer.striim.com/">you can sign up here</a> with my referral code <strong>tAlaDngxjQ</strong>.</p>]]></content:encoded></item><item><title><![CDATA[Deep Dive: What the Heck is Entity Resolution]]></title><description><![CDATA[or record linkage, or identity mapping, or data matching.]]></description><link>https://databased.pedramnavid.com/p/entity-resolution</link><guid isPermaLink="false">https://databased.pedramnavid.com/p/entity-resolution</guid><dc:creator><![CDATA[Pedram Navid]]></dc:creator><pubDate>Fri, 11 Nov 2022 00:35:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/h_600,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F03cd8208-403f-4c17-b515-562993c8ffdd_654x532.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Entity Resolution, Identity Mapping, Record Linkage, Data Matching, and Record Matching</em>. The names are many, but the concept is deceptively simple. In this Deep Dive, we'll look at Entity Resolution and some of its core components.&nbsp;</p><p>If you prefer video, I gave a talk on&nbsp;<a href="https://www.youtube.com/watch?v=cL2dBMuY2lw&amp;t=533s">Entity Resolution at dbt's office hours</a>&nbsp;2 years ago. 
If you prefer books, there's no other book I'd recommend more than&nbsp;<a href="https://www.amazon.com/Data-Matching-Techniques-Data-Centric-Applications/dp/3642430015/ref=sr_1_4?crid=CCBB9W3V844A&amp;keywords=entity+resolution&amp;qid=1668092009&amp;sprefix=entity+resolution%2Caps%2C166&amp;sr=8-4&amp;ufe=app_do%3Aamzn1.fos.006c50ae-5d4c-4777-9bc0-4513d670b6bc">Peter Christen's Data Matching.</a> If you prefer my Substack, you&#8217;re in the right place.</p><p>Let's dive in.</p><h2>What is Entity Resolution?</h2><p>Entity resolution is all about combining multiple records of things. There are two parts to entity resolution: first is the entity, and second is the record of that entity in some database.</p><p>The entity can be anything from a person to a company to a physical product. I'll use companies as examples here, but the underlying logic applies to any entity you want to dedupe.&nbsp;</p><p>A record of that entity might exist in a spreadsheet, a database, or across multiple databases.&nbsp;</p><p>What's important is that there is no unique identifier representing that entity. If you were trying to dedupe people and had their Social Security Number or another national identification number, then the problem would be relatively easy. However, absent a single unique indicator, if we want to match or dedupe these records, then we need a way to resolve them to a single entity: hence, entity resolution.</p><h2>An Illustrative Example</h2><p>Let's say you work at a small B2B company with data in various systems of record: your production database, your Salesforce instance, and several spreadsheets of data with leads captured at various events.&nbsp;</p><p>Anyone can sign in to your product by providing their company name and email address. Your Salesforce instance has accounts with company names, locations, websites, and contact information. 
The leads spreadsheet has similar information hand-captured by a person running the event.</p><p>It might look something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8PHU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40b2f7f-f6b5-4c9b-b87d-689781f1d704_1602x422.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8PHU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40b2f7f-f6b5-4c9b-b87d-689781f1d704_1602x422.png 424w, https://substackcdn.com/image/fetch/$s_!8PHU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40b2f7f-f6b5-4c9b-b87d-689781f1d704_1602x422.png 848w, https://substackcdn.com/image/fetch/$s_!8PHU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40b2f7f-f6b5-4c9b-b87d-689781f1d704_1602x422.png 1272w, https://substackcdn.com/image/fetch/$s_!8PHU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40b2f7f-f6b5-4c9b-b87d-689781f1d704_1602x422.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8PHU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40b2f7f-f6b5-4c9b-b87d-689781f1d704_1602x422.png" width="1456" height="384" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/d40b2f7f-f6b5-4c9b-b87d-689781f1d704_1602x422.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:384,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:155037,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8PHU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40b2f7f-f6b5-4c9b-b87d-689781f1d704_1602x422.png 424w, https://substackcdn.com/image/fetch/$s_!8PHU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40b2f7f-f6b5-4c9b-b87d-689781f1d704_1602x422.png 848w, https://substackcdn.com/image/fetch/$s_!8PHU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40b2f7f-f6b5-4c9b-b87d-689781f1d704_1602x422.png 1272w, https://substackcdn.com/image/fetch/$s_!8PHU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40b2f7f-f6b5-4c9b-b87d-689781f1d704_1602x422.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>How do you decide whether or not these different records are part of the same underlying entity? You might keep it simple and decide that two entities are the same if they share the same website, but even websites change over time. 
While a simple solution may be sufficient, you're entering the realm of entity resolution if you're not satisfied with that.</p><h2>The 5 Steps to Entity Resolution</h2><p>This deep dive will go through the five keys to entity resolution. There is much more nuance and depth beyond this post, but this should be enough to get us started.</p><ol><li><p>Pre-processing</p></li><li><p>Indexing</p></li><li><p>Comparing</p></li><li><p>Classifying</p></li><li><p>Merging</p></li></ol><h2>Pre-processing</h2><p>Before we embark on our journey, starting with a good foundation is essential. As much as possible, we want to clean our underlying data. Everything from trimming extra whitespaces to lowercasing all the characters, removing stop-words, or even stemming and&nbsp;<a href="https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html">lemmatizing</a>&nbsp;is possible here. The specifics are highly context-dependent, and always an iterative process. As you perform these steps on data samples, your appreciation for what you need to do to improve the quality of matches will increase, and you will refine your pre-processing.</p><p>You'll want to abstract the pre-processing steps as much as possible. Regular expressions can be convenient here, and knowing how to use them correctly can increase the performance of your system. 
For example, in Python,&nbsp;<a href="https://docs.python.org/3/howto/regex.html#compiling-regular-expressions">compiling your regular expressions</a>&nbsp;before using them will improve performance, and taking advantage of&nbsp;<a href="https://docs.python.org/3/library/functools.html#functools.cache">a cache</a>&nbsp;to avoid repetitive computations can save significant time as you process millions of rows.</p><p>If you're using dbt, macros are helpful to reduce code duplication, and as you find incremental improvements, you only have to apply them in one place.</p><p>Some common pre-processing steps I've seen are:</p><ul><li><p>Making everything lowercase and removing whitespace</p></li><li><p>Splitting an email into user and domain</p></li><li><p>Cleaning company names to remove stop words such as ... <em>Inc.</em>, ... <em>LLC</em>, <em>The</em> ..., <em>A</em> ...</p></li><li><p>Converting words such as null, na, n/a to actual NULLs</p></li><li><p>Filtering out demo/test/internal users</p></li><li><p>Parsing and cleaning websites and addresses</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ogec!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5dd7f0-845b-4f03-a790-781ea98f1520_1894x462.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ogec!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5dd7f0-845b-4f03-a790-781ea98f1520_1894x462.png 424w, 
https://substackcdn.com/image/fetch/$s_!ogec!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5dd7f0-845b-4f03-a790-781ea98f1520_1894x462.png 848w, https://substackcdn.com/image/fetch/$s_!ogec!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5dd7f0-845b-4f03-a790-781ea98f1520_1894x462.png 1272w, https://substackcdn.com/image/fetch/$s_!ogec!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5dd7f0-845b-4f03-a790-781ea98f1520_1894x462.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ogec!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5dd7f0-845b-4f03-a790-781ea98f1520_1894x462.png" width="1456" height="355" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/4e5dd7f0-845b-4f03-a790-781ea98f1520_1894x462.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:355,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:129232,&quot;alt&quot;:&quot;Example dbt macro for cleaning company names&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Example dbt macro for cleaning company names" title="Example dbt macro for cleaning company names" 
srcset="https://substackcdn.com/image/fetch/$s_!ogec!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5dd7f0-845b-4f03-a790-781ea98f1520_1894x462.png 424w, https://substackcdn.com/image/fetch/$s_!ogec!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5dd7f0-845b-4f03-a790-781ea98f1520_1894x462.png 848w, https://substackcdn.com/image/fetch/$s_!ogec!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5dd7f0-845b-4f03-a790-781ea98f1520_1894x462.png 1272w, https://substackcdn.com/image/fetch/$s_!ogec!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5dd7f0-845b-4f03-a790-781ea98f1520_1894x462.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Example Macro for cleaning company names using Regular Expressions</figcaption></figure></div></li></ul><h2>Indexing and Blocking</h2><p>Once you have cleaned your data, the next step is to index the data to improve performance. Consider this example: you have 100,000 records in Database A and 10,000 in Database B, with no common indicator. How many comparisons are you performing if you look at the website and name?</p><p>The formula for this is <code>(m * n) * p</code>, where <code>m</code> and <code>n</code> are the number of records in each table, and <code>p</code> is the number of indicators. Plugging in our numbers, 100,000 * 10,000 * 2 gives 2,000,000,000, or two billion comparisons, and that's on relatively small tables. 
You can see how this will not scale beyond a million records.&nbsp;</p><p>Our only hope is to reduce the search space. Of course, we can't compare every single record against each other, but we can compare within a subset with a three-step process of <strong>blocking</strong>, <strong>indexing</strong> and <strong>reverse-indexing.</strong>&nbsp;</p><p><strong>Blocking</strong> is splitting your full set of records into smaller blocks so that you avoid comparing every record with every other record. So the obvious question is, how do you block records together before you know how to match them?&nbsp;</p><p>In the simplest example, you might block by comparing only records from the same country, state, or zip code, or you could look at the first letter of a name.&nbsp;</p><p>There are also algorithmic options. If your entities are names of people or companies, you could use various functions to reduce the search space by compressing information.&nbsp;</p><p>Soundex is one such algorithm, developed a century ago for use on names. It gives similar-sounding names the same code, which makes it usable as a blocking key, and it is supported in many data warehouses, including Snowflake.</p><pre><code>select
    soundex('pedram') as pedram,
    soundex('pedrum') as pedrum,
    soundex('peter') as peter,
    soundex('pedro') as pedro

&gt; PEDRAM  PEDRUM  PETER  PEDRO
  p365    p365    p360   p360</code></pre><p>You can see that Peter and Pedro are blocked together, and Pedram and Pedrum are as well. </p><p>There are many different blocking techniques,&nbsp;<a href="https://arxiv.org/pdf/1905.06167.pdf">and this survey paper</a>&nbsp;reviews many of them, but the principle behind them is essentially the same.&nbsp;</p><p>You could also use multiple blocking keys to improve accuracy. 
For example, you might run Soundex on first and last names and compare similar blocks across either first or last names. But, of course, the trade-off is always between performance and accuracy.&nbsp;</p><p>Once you've defined your blocking function, you can apply it to every database record. This step is called indexing. For example, suppose we ran every name through a Soundex function. Each record now has a Soundex code associated with it.</p><p><code>Record 1 - D130<br>Record 2 - D130<br>Record 3 - F235<br>Record 4 - F235<br>Record 5 - D130</code></p><p>Next, we combine all records with the same blocking key into a subgroup for comparison purposes. To do this, we rely on a reverse index: for every blocking key, identify all the rows that belong to that block.</p><p><code>D130: {1, 2, 5}</code></p><p><code>F235: {3, 4}</code></p><p>In doing so, we can efficiently work on a block of similar records, and can even distribute this work in parallel. Once we have our reverse index, we are ready to proceed to Comparing.</p><h2>Comparing and Classifying</h2><p>I group comparing and classifying into one topic here because they are interrelated. Comparing is the act of summarizing the similarity between two records, and classifying is the act of deciding whether two records are 'similar enough.'&nbsp;</p><p>There are many ways to compare two records. First, you can look at equality for any column, which is the most straightforward comparison. </p><p><code>case when a.name = b.name then 1 else 0 end</code></p><p>On strings, you can look at how similar they are using similarity functions such as the&nbsp;<a href="https://docs.snowflake.com/en/sql-reference/functions-string.html">edit distance</a>&nbsp;or the&nbsp;<a href="https://docs.snowflake.com/en/sql-reference/functions/jarowinkler_similarity.html">Jaro-Winkler similarity score</a>. </p><p>You can compare numbers by absolute or percent differences. 
You could look at dates, ages, times, or geographies.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W5gh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F94a419d4-7d37-4464-b718-ef00c8fd932f_2232x1782.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W5gh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F94a419d4-7d37-4464-b718-ef00c8fd932f_2232x1782.png 424w, https://substackcdn.com/image/fetch/$s_!W5gh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F94a419d4-7d37-4464-b718-ef00c8fd932f_2232x1782.png 848w, https://substackcdn.com/image/fetch/$s_!W5gh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F94a419d4-7d37-4464-b718-ef00c8fd932f_2232x1782.png 1272w, https://substackcdn.com/image/fetch/$s_!W5gh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F94a419d4-7d37-4464-b718-ef00c8fd932f_2232x1782.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W5gh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F94a419d4-7d37-4464-b718-ef00c8fd932f_2232x1782.png" width="1456" height="1162" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/94a419d4-7d37-4464-b718-ef00c8fd932f_2232x1782.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1162,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:530916,&quot;alt&quot;:&quot;Example code of scoring functions&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Example code of scoring functions" title="Example code of scoring functions" srcset="https://substackcdn.com/image/fetch/$s_!W5gh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F94a419d4-7d37-4464-b718-ef00c8fd932f_2232x1782.png 424w, https://substackcdn.com/image/fetch/$s_!W5gh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F94a419d4-7d37-4464-b718-ef00c8fd932f_2232x1782.png 848w, https://substackcdn.com/image/fetch/$s_!W5gh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F94a419d4-7d37-4464-b718-ef00c8fd932f_2232x1782.png 1272w, https://substackcdn.com/image/fetch/$s_!W5gh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F94a419d4-7d37-4464-b718-ef00c8fd932f_2232x1782.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example scoring functions</figcaption></figure></div><p>To classify, you can average the results from above to generate a score for each record. You can weigh individual fields, for example, giving greater weight to the last name over the first name. You might decide on a threshold that balances false positives and negatives or lean on machine learning or other more advanced techniques for classification.</p><p>There's no perfect way to compare and classify, but the goal is always the same: to create a list of tuple pairs of matched records. </p><p>One complication: records A and B might match, and records B and C might match, but records A and C might not. 
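</p><p>A minimal sketch of that complication, assuming three hypothetical records that carry only an age and a made-up rule that two ages match when they are at most three years apart:</p><pre><code># Hypothetical records: each one carries only an age.
ages = {"A": 30, "B": 33, "C": 36}


def is_match(left, right, tolerance=3):
    """Made-up rule: two records match when their ages differ by at most `tolerance`."""
    return tolerance >= abs(ages[left] - ages[right])


# Compare every pair once and record whether the rule fires.
matches = {pair: is_match(*pair) for pair in [("A", "B"), ("B", "C"), ("A", "C")]}
for pair, matched in matches.items():
    print(pair, matched)</code></pre><p>Here A matches B and B matches C, but A and C are six years apart and do not match. 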
Therefore after matching, you need to process all the records to perform a merge.&nbsp;</p><p>For example, suppose you have seven records and have compared them with some arbitrary matching formula. You end up with the following tuple pairs of matched records.</p><ul><li><p><code>{1, 2} </code></p></li><li><p><code>{3, 4}</code></p></li><li><p><code>{5, 6}</code></p></li><li><p><code>{1, 7}</code></p></li><li><p><code>{2} </code></p></li><li><p><code>{6, 5}</code></p></li><li><p><code>{7, 5}</code></p></li></ul><p>If we graphed these pairs, it would look something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fztN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7b4b1117-50c7-4282-9db1-850ca890727e_774x814.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fztN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7b4b1117-50c7-4282-9db1-850ca890727e_774x814.png 424w, https://substackcdn.com/image/fetch/$s_!fztN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7b4b1117-50c7-4282-9db1-850ca890727e_774x814.png 848w, https://substackcdn.com/image/fetch/$s_!fztN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7b4b1117-50c7-4282-9db1-850ca890727e_774x814.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fztN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7b4b1117-50c7-4282-9db1-850ca890727e_774x814.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fztN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7b4b1117-50c7-4282-9db1-850ca890727e_774x814.png" width="434" height="456.42894056847547" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/7b4b1117-50c7-4282-9db1-850ca890727e_774x814.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:774,&quot;resizeWidth&quot;:434,&quot;bytes&quot;:56086,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fztN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7b4b1117-50c7-4282-9db1-850ca890727e_774x814.png 424w, https://substackcdn.com/image/fetch/$s_!fztN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7b4b1117-50c7-4282-9db1-850ca890727e_774x814.png 848w, https://substackcdn.com/image/fetch/$s_!fztN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7b4b1117-50c7-4282-9db1-850ca890727e_774x814.png 
1272w, https://substackcdn.com/image/fetch/$s_!fztN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7b4b1117-50c7-4282-9db1-850ca890727e_774x814.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>However, since some pairs share a common record, we need to connect these pairs. So how do we do that? 
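</p><p>One way to connect them, sketched here with a hand-rolled union-find rather than a graph library (the function name and the <code>parent</code> bookkeeping are illustrative assumptions, not code from the original analysis): treat each record as a node, each matched pair as an edge, and collect the records that can be reached from one another.</p><pre><code># Matched pairs from the list above; (2,) is the lone single-record entry.
pairs = [(1, 2), (3, 4), (5, 6), (1, 7), (2,), (6, 5), (7, 5)]


def connected_components(pairs):
    """Group record ids that are linked, directly or through other records."""
    parent = {}

    def find(node):
        # Follow parent pointers to the group's root, shortening the path as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for pair in pairs:
        root = find(pair[0])
        for other in pair[1:]:
            parent[find(other)] = root  # merge the two groups

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in groups.values())


print(connected_components(pairs))  # prints [[1, 2, 5, 6, 7], [3, 4]]</code></pre><p>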
We use the aptly named&nbsp;<a href="https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.components.connected_components.html">connected components algorithm!</a></p><p>Using this algorithm, we reduce the above example to two distinct entities:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Og5W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F03cd8208-403f-4c17-b515-562993c8ffdd_654x532.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Og5W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F03cd8208-403f-4c17-b515-562993c8ffdd_654x532.png 424w, https://substackcdn.com/image/fetch/$s_!Og5W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F03cd8208-403f-4c17-b515-562993c8ffdd_654x532.png 848w, https://substackcdn.com/image/fetch/$s_!Og5W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F03cd8208-403f-4c17-b515-562993c8ffdd_654x532.png 1272w, https://substackcdn.com/image/fetch/$s_!Og5W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F03cd8208-403f-4c17-b515-562993c8ffdd_654x532.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Og5W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F03cd8208-403f-4c17-b515-562993c8ffdd_654x532.png" width="394" height="320.50152905198775" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/03cd8208-403f-4c17-b515-562993c8ffdd_654x532.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:532,&quot;width&quot;:654,&quot;resizeWidth&quot;:394,&quot;bytes&quot;:37643,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Og5W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F03cd8208-403f-4c17-b515-562993c8ffdd_654x532.png 424w, https://substackcdn.com/image/fetch/$s_!Og5W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F03cd8208-403f-4c17-b515-562993c8ffdd_654x532.png 848w, https://substackcdn.com/image/fetch/$s_!Og5W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F03cd8208-403f-4c17-b515-562993c8ffdd_654x532.png 1272w, https://substackcdn.com/image/fetch/$s_!Og5W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F03cd8208-403f-4c17-b515-562993c8ffdd_654x532.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We identified two entities using those seven records and are tantalizingly close to the end: our last step is merging records.</p><h2>Merging Records</h2><p>Given two or more records that refer to the same underlying entity, we must decide how to merge the information. For example, if the names are different, what do we keep?</p><p>The first approach is randomly picking one as the master record and using all the information in that one. 
This approach is the easiest and least subtle, but it can often be sufficient for our needs.</p><p>A better approach is to rank sources on a per-record or per-field basis. So, for example, we might pick first and last names from Database A but use addresses from Database B.&nbsp;</p><p>Both methods result in a loss of information, so another approach is to keep what is called the union set: essentially, we keep all distinct elements across all records. At the very least, we want to keep the union set of table primary keys for better debugging.&nbsp;</p><p>Suppose Database A has a record with primary key 123 and Database B has a record with primary key 456; we might merge these two records such that the primary key field is now <code>{A: 123, B: 456}</code>.</p><p>Another option is to use ranges. If we are merging company information and have several different sources for the number of employees, we might include them as a range. If Record A had 100 employees, Record B had 250 employees, and Record C had 175, we might merge these three records as [100, 250].&nbsp;</p><p>You can imagine many other ways to merge records, but the goal is to preserve the right level of detail for your particular use case.&nbsp;</p><h2>Wrapping Up</h2><p>Once merged, your work is largely done. You have created a set of candidates, classified them, linked records, and merged them together, but your journey has just begun. There is much more in this field to learn. 
Decisions on blocking keys, classification methods, and supervised versus unsupervised learning are just a few of the topics left to explore. </p><p>You may also want to check out some libraries and products in this space, such as the Python <a href="https://recordlinkage.readthedocs.io/en/latest/index.html">Record Linkage</a> library and the many available <a href="https://arxiv.org/pdf/2008.04443.pdf">research papers</a> on this topic.</p><p>I hope you enjoyed this deep dive. If there&#8217;s any topic you&#8217;re interested in, <a href="mailto:pedram@pedramnavid.com">reach out</a> and let me know!</p>]]></content:encoded></item><item><title><![CDATA[Deep Dive: What The Heck Is the Metrics Layer]]></title><description><![CDATA[also known as the semantic layer, previously known as the random queries in my BI tools]]></description><link>https://databased.pedramnavid.com/p/what-is-the-metrics-layer</link><guid isPermaLink="false">https://databased.pedramnavid.com/p/what-is-the-metrics-layer</guid><dc:creator><![CDATA[Pedram Navid]]></dc:creator><pubDate>Wed, 14 Sep 2022 19:36:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0WpE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a7b5b0-2c42-4f1b-b89c-67fb86dc0092_822x660.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There has been a lot of buzz about the metrics layer. As always, we start with a little trip down memory lane:</p><p>In January 2021, Base Case (an investor in Supergrain, which was then a headless BI tool) explored the future of headless BI as a solution to unbundling metrics from BI. In April 2021, Benn Stancil&nbsp;<a href="https://benn.substack.com/p/metrics-layer">made a case for the metrics layer</a>. In October that year, Drew&nbsp;<a href="https://github.com/dbt-labs/dbt-core/issues/4071">opened an issue</a>&nbsp;that generated more discussion. 
In December 2021, the Metrics Layer achieved&nbsp;<a href="https://www.youtube.com/watch?v=MdSMSbQxnO0&amp;ab_channel=dbt">keynote status</a>&nbsp;at dbt Coalesce (with a long journey through the history of standardization).</p><p>Since then, Supergrain has pivoted from headless BI to a marketing tool (ok,&nbsp;<em>a warehouse-native approach to customer engagement</em>).&nbsp;Transform, which was a metrics engine, is shifting toward self-serve BI.</p><p>dbt continues development on the metrics layer, which it has since renamed the&nbsp;<a href="https://www.getdbt.com/blog/dbt-semantic-layer/">semantic layer</a>. In October at Coalesce, we&#8217;re likely to hear more on the metrics layer, although we&#8217;ve&nbsp;<a href="https://docs.getdbt.com/blog/getting-started-with-the-dbt-semantic-layer">gotten the occasional update</a>.</p><p>To date, Amit Prakash has done the best job exploring&nbsp;<a href="https://www.thoughtspot.com/blog/the-metrics-layer-has-growing-up-to-do">the metrics layer on ThoughtSpot&#8217;s blog.</a>&nbsp;In it, he describes six classes of metrics and three solutions for what a proper semantic layer could look like. I won&#8217;t go into all the details since the post already does a great job, and his writing is clear and approachable.</p><p>Instead, we&#8217;ll go one step closer to code and look at three implementations of the metrics layer and what a world without it looks like.</p><h2>The Activation Metric</h2><p>For the rest of this post, we&#8217;ll look at a metric that I think shows the true power of a well-defined metrics layer.</p><p>Pretend we&#8217;re a B2B SaaS company, where users can sign up for our product and belong to one or more workspaces. Each workspace has one or more users. 
A workspace is active if its users perform some activation event within 24 hours of workspace creation.</p><p>We&#8217;ll call the metric of interest&nbsp;<em>activation rate</em>, and we&#8217;ll define it as follows:</p><blockquote><p>The&nbsp;<em>activation rate</em>&nbsp;is the ratio of active workspaces to all workspaces over a certain period.</p></blockquote><p>More concretely, every day, we have a list of all workspaces and a flag for whether that workspace was active on that day or not. The count of all workspaces on that day is the total number of workspaces. The count of all workspaces where the flag is&nbsp;<em>true</em>&nbsp;is the count of active workspaces.</p><p>We may want to report on the activation rate daily, weekly, or monthly. In addition, we&#8217;ll want to know the change in the activation rate over time.</p><h2>First, in SQL</h2><p>Let&#8217;s define everything in SQL to get a baseline. We&#8217;ll start with a basic table:</p><pre><code>select reporting_day, workspace_id, is_active from workspace_details;

####

reporting_day | workspace_id | is_active
--------------|--------------|----------|
2022-07-04    | 100          | true
2022-07-04    | 101          | false
...</code></pre><p>So far, so good. Now let&#8217;s count workspaces:</p><pre><code>select

reporting_day,
count(distinct workspace_id) as n_workspaces,
sum(case when is_active then 1 else 0 end) as n_active_ws

from workspace_details
group by 1

####

reporting_day | n_workspaces | n_active_ws|
--------------|--------------|------------|
2022-07-04    |       2      |         1
...</code></pre><p>Now, if we want to know the activation rate, we divide active over the total. For simplicity, we&#8217;ll pretend we&#8217;re using Snowflake, which allows us to refer to columns created in the same select statement.</p><pre><code>select

reporting_day,
count(distinct workspace_id) as n_workspaces,
sum(case when is_active then 1 else 0 end) as n_active_ws,
n_active_ws / n_workspaces as activation_rate

from workspace_details
group by 1

####

reporting_day | n_workspaces | n_active_ws | activation_rate
--------------|--------------|-------------|---------------
2022-07-04    |       2      |         1   | 0.5
...</code></pre><p></p><p>So far, so good. We could take this SQL, create a dbt model, and then use any reporting tool to visualize the activation rate over time. We can even start looking at change over time. But, first, let&#8217;s make the activation rates easier to read with some formatting.</p><pre><code>select

reporting_day,
count(distinct workspace_id) as n_workspaces,
sum(case when is_active then 1 else 0 end) as n_active_ws,

round(100 * (n_active_ws / n_workspaces), 2) as activation_rate,

activation_rate - lag(activation_rate) over(order by reporting_day) as abs_change,
round(100 * abs_change / lag(activation_rate) over(order by reporting_day), 2) as pct_change


from workspace_details
group by 1
order by 1

####

reporting_day | n_workspaces | n_active_ws | activation_rate | abs_change | pct_change
--------------|--------------|-------------|-----------------|------------|-----------
2022-07-04    |       2      |         1   |      50.0       |     -      |     -
2022-07-05    |       3      |         2   |      66.6       |   +16.6    |   +33.3%

...</code></pre><p>We&#8217;ve made a ton of progress and haven&#8217;t needed to touch a metrics layer, so what&#8217;s the big deal? The real pain comes when your stakeholder now asks you for these numbers at a weekly, monthly, and quarterly aggregate. Pain is imminent.</p><p>What&#8217;s worse is if your end-users don&#8217;t understand how these measures are defined, they might start doing silly things like this:</p><pre><code>select

date_trunc('month', reporting_day) as reporting_month,
avg(activation_rate) as avg_activation_rate

from...
</code></pre><p>Instead of finding the average over a period by adding the individual components and calculating the rate, they might average a ratio and end up with incorrect measures. We don&#8217;t want that.</p><h2>Enter the Metrics Layer</h2><p>A metrics layer solves these and other problems. 
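</p><p>To make the averaging problem above concrete, here is a small Python sketch. The first two rows reuse the daily counts from the earlier example; the third day is a made-up assumption added to show the skew.</p><pre><code># Hypothetical daily rows: (reporting_day, n_workspaces, n_active_ws)
daily = [
    ("2022-07-04", 2, 1),
    ("2022-07-05", 3, 2),
    ("2022-07-06", 10, 2),  # assumed day, not from the original data
]

# Wrong: average the daily rates, so a small day counts as much as a large one.
avg_of_rates = sum(active / total for _, total, active in daily) / len(daily)

# Right: add up the components first, then take the ratio once.
total_active = sum(active for _, _, active in daily)
total_workspaces = sum(total for _, total, _ in daily)
rate_of_sums = total_active / total_workspaces

print(round(avg_of_rates, 3), round(rate_of_sums, 3))</code></pre><p>The average of the daily rates comes out near 0.456, while the rate computed from the summed components is 0.333, because the averaged version gives a two-workspace day the same weight as a ten-workspace day.</p><p>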
Let&#8217;s look at how Looker approaches this.</p><h3>Looker</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0WpE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a7b5b0-2c42-4f1b-b89c-67fb86dc0092_822x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0WpE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a7b5b0-2c42-4f1b-b89c-67fb86dc0092_822x660.png 424w, https://substackcdn.com/image/fetch/$s_!0WpE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a7b5b0-2c42-4f1b-b89c-67fb86dc0092_822x660.png 848w, https://substackcdn.com/image/fetch/$s_!0WpE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a7b5b0-2c42-4f1b-b89c-67fb86dc0092_822x660.png 1272w, https://substackcdn.com/image/fetch/$s_!0WpE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a7b5b0-2c42-4f1b-b89c-67fb86dc0092_822x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0WpE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a7b5b0-2c42-4f1b-b89c-67fb86dc0092_822x660.png" width="822" height="660" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/e8a7b5b0-2c42-4f1b-b89c-67fb86dc0092_822x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:822,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:343550,&quot;alt&quot;:&quot;screenshot of the looker application&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="screenshot of the looker application" title="screenshot of the looker application" srcset="https://substackcdn.com/image/fetch/$s_!0WpE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a7b5b0-2c42-4f1b-b89c-67fb86dc0092_822x660.png 424w, https://substackcdn.com/image/fetch/$s_!0WpE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a7b5b0-2c42-4f1b-b89c-67fb86dc0092_822x660.png 848w, https://substackcdn.com/image/fetch/$s_!0WpE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a7b5b0-2c42-4f1b-b89c-67fb86dc0092_822x660.png 1272w, https://substackcdn.com/image/fetch/$s_!0WpE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a7b5b0-2c42-4f1b-b89c-67fb86dc0092_822x660.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Looker has a syntax-aware UI with a great reference built-in as you code, making the development experience smoother.</figcaption></figure></div><p>In Looker, we define metrics in LookML files. We have a view, which represents a model of data, usually built from an existing table or view in the data warehouse. Here&#8217;s what that might look like:</p><pre><code>view: workspace_activation {
  sql_table_name: "METRICS"."WORKSPACE_ACTIVATION"
    ;;

  dimension_group: date {
    type: time
    timeframes: [
      raw,
      time,
      date,
      week,
      month,
      quarter,
      year
    ]
    sql: ${TABLE}."REPORTING_DATE" ;;
  }

  dimension: is_active_workspace {
    type: yesno
    sql: ${TABLE}."IS_ACTIVE_WORKSPACE" ;;
  }

  dimension: workspace_id {
    type: string
    primary_key: yes
    sql: ${TABLE}."WORKSPACE_ID" ;;
  }

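  # Assumption: the table also exposes a WORKSPACE_NAME column for the filter below.
  dimension: workspace_name {
    type: string
    sql: ${TABLE}."WORKSPACE_NAME" ;;
  }

  measure: count_workspaces {
    type: count_distinct
    description: "# of Workspaces"
    sql: ${workspace_id} ;;
    # Looker filter expressions negate a string match with a leading "-"
    filters: [workspace_name: "-Demo Workspace"]
  }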

  measure: count_active_workspaces {
    type: count_distinct
    description: "# of Unique Workspaces Active within 1 Day"
    sql: ${workspace_id} ;;
    filters: [is_active_workspace: "yes"]
  }

  measure: activation_rate {
    type: number
    sql: ${count_active_workspaces} / ${count_workspaces} ;;
    value_format_name: percent_1
  }


}
</code></pre><p>A lot is going on here, but the thing to notice is that we are defining measures as formulas, not as fully formed SQL tables. We also specify what formatting to use on measures, how to drill into the details when going from aggregate views to detailed views, how to aggregate measures, and all the ways we want to break down our reporting date.</p><p>From this code, Looker can dynamically generate the SQL needed without us having to worry about different granularities.</p><p>We could go further and start defining joins between this table and other tables, for example, if we wanted to break down workspaces by paid vs. not-paid or attribution category.</p><p>Without a metrics layer, we&#8217;d have to anticipate and perform all these joins upfront. With a metrics layer, we can specify relationships between tables and let the BI tool join as needed, only on the columns the user requests. As a result, our users never need to consider what types of joins to use.</p><h2>dbt Metrics</h2><p>Let&#8217;s try to do the same with the dbt metrics layer to better understand what we&#8217;ve got. The big caveat is that dbt metrics are not yet complete and are undergoing active development. So things may change, and rough edges might need time to polish.</p><p>We&#8217;ll first define our metrics in the dbt YAML file:</p><pre><code>metrics:
  - name: count_workspaces
    label: '# Workspaces'
    model: ref('active_workspace')
    type: count_distinct
    sql: workspace_id

    timestamp: reporting_day
    time_grains: [day, week, month]
    filters:
      - field: workspace_name
        operator: '!='
        value: "'Demo Workspace'"


  - name: count_active_workspaces
    label: 'Active Workspaces'
    model: ref('active_workspace')
    type: count_distinct
    sql: workspace_id

    timestamp: reporting_day
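    time_grains: [day, week, month]
    # Assumption: the model exposes the activity flag as is_active. Without this
    # filter the metric would count every workspace, not just the active ones.
    filters:
      - field: is_active
        operator: 'is'
        value: 'true'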

  - name: activation_rate
    label: 'Activation Rate'
    type: expression
    sql: "100.0 * {{ metric('count_active_workspaces') }} / {{ metric('count_workspaces') }}"
    
    timestamp: reporting_day
    time_grains: [day, week, month]</code></pre><p>And we&#8217;ll create a model that allows us to select from these new metrics:</p><pre><code>select * from 
  {{ metrics.calculate(
   [metric('count_workspaces'), metric('count_active_workspaces'), metric('activation_rate')], 
    grain='week',
  )}}</code></pre><p>There are a few key things to note here. In the Looker model, the organizing principle was a View, which contained dimensions, measures, and time grains all within one namespaced View object. dbt took a different approach: metrics are self-contained units. Each metric must specify which dbt model it should run against, the timestamp column, and which time grains to support.</p><p>This approach is already leading us to some duplicated code: all three metrics above repeat the same timestamp and time grains.</p><p>Another point to consider is the expression metric. We can&#8217;t refer to another metric directly there; we need to wrap it in jinja, leading to jinja in YAML, which can be a parsing nightmare without a good IDE. While Looker&#8217;s IDE can parse, highlight, and show errors within your LookML code and expressions, we don&#8217;t have that level of tooling for dbt.</p><p>You&#8217;ll also note that I am defining how to query my metrics within dbt using the <code>metrics.calculate</code> macro from the dbt_metrics package. For now, there&#8217;s no support for reading these metrics outside of dbt itself, although dbt has partnerships with BI tools, and I expect they&#8217;ll be announcing better ways to interact with dbt&#8217;s metrics layer soon enough.</p><p>Filtering is also clunkier. In Looker, we provide an array of expressions to filter on, while in dbt we build our filters as yaml, explicitly defining what operator to use.</p><p>One final observation: there is no support for joins. In Looker, you can define relationships between different tables and explicitly define which related views should be available to a user within an Explore.
Until support for joins arrives in dbt, it&#8217;s hard to see any value in an isolated semantic layer.</p><h3>Lightdash</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!50En!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9fbf4f60-22b9-4821-aa70-3a884c461415_1281x929.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!50En!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9fbf4f60-22b9-4821-aa70-3a884c461415_1281x929.png 424w, https://substackcdn.com/image/fetch/$s_!50En!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9fbf4f60-22b9-4821-aa70-3a884c461415_1281x929.png 848w, https://substackcdn.com/image/fetch/$s_!50En!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9fbf4f60-22b9-4821-aa70-3a884c461415_1281x929.png 1272w, https://substackcdn.com/image/fetch/$s_!50En!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9fbf4f60-22b9-4821-aa70-3a884c461415_1281x929.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!50En!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9fbf4f60-22b9-4821-aa70-3a884c461415_1281x929.png" width="1281" height="929" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/9fbf4f60-22b9-4821-aa70-3a884c461415_1281x929.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:929,&quot;width&quot;:1281,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:413592,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!50En!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9fbf4f60-22b9-4821-aa70-3a884c461415_1281x929.png 424w, https://substackcdn.com/image/fetch/$s_!50En!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9fbf4f60-22b9-4821-aa70-3a884c461415_1281x929.png 848w, https://substackcdn.com/image/fetch/$s_!50En!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9fbf4f60-22b9-4821-aa70-3a884c461415_1281x929.png 1272w, https://substackcdn.com/image/fetch/$s_!50En!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9fbf4f60-22b9-4821-aa70-3a884c461415_1281x929.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">In Lightdash, the presentation layer is separate from the code. The UI is intuitive and easy to navigate, but changes to it require changing your dbt model code.</figcaption></figure></div><p>Lightdash is a BI tool that is tightly integrated within the dbt ecosystem. It offers two ways of expressing metrics: the first uses the native dbt metrics layer we discussed above. But beyond that, it also has a metrics implementation that you can leverage, which has some added benefits, such as joins and formatting.</p><p>Like dbt, you add your metrics implementation directly in your dbt yml file. However, the structure is a little different. The metrics are specified in a meta tag underneath the column. You can define multiple metrics related to a column in-line, and there is no need to duplicate dimensions across metrics.</p><pre><code>version: 2
models:
  - name: active_workspace
    columns:
      - name: reporting_day
        description: "Day of report"
        meta:
          dimension:
            type: date
      - name: workspace_id
        description: "The Id of the workspace"
        meta:
          metrics:
            count_workspaces:
              type: count_distinct
            count_active_workspaces:
              type: count_distinct
              sql: "case when is_active then workspace_id else null end"
            activation_rate:
              type: number
              sql: (1.0 * ${count_active_workspaces} / ${count_workspaces})
              round: 2
              format: percent
</code></pre><p>We also have convenient helpers for rounding and formatting numbers. The templating is simplified, and there&#8217;s no reliance on dbt ref macros. We can directly specify a metric using the <code>${metric_name}</code> format.</p><p>There are some downsides to the integration with dbt. Namely, any change to your metrics requires a full dbt refresh, which can be slow. There&#8217;s also the question of where a metric belongs: not all metrics should live under a particular dbt column definition; perhaps a separate metric definition file could be more maintainable long-term.</p><p>That said, the reporting is simplified quite a bit. It&#8217;s easy to query the metrics using the Lightdash UI, and there&#8217;s no need to write custom code to fetch a metric. But then, your metrics are only accessible within Lightdash, although this could be alleviated with APIs that make metrics more accessible beyond just Lightdash. Given the open nature of the product, I wouldn&#8217;t be surprised if metrics became more accessible over time.</p><h2>What it all means</h2><p>All three tools have different trade-offs, and their strengths and weaknesses tell of the challenges a metrics layer faces. Looker deeply integrates its metrics layer within the Looker ecosystem. Dimensions and measures are defined within the same application, and Looker&#8217;s semantic understanding of LookML allows for a rich parsing and developer experience. Looker can write to Git for version control, but most development occurs within the Looker ecosystem.</p><p>Despite its strength, there are also pitfalls. Measures defined within Looker are not easily accessed. While Looker exposes an API, we haven&#8217;t seen it become a standard metrics layer across the data stack, perhaps because the high entry price makes it prohibitive for smaller companies.</p><p>That said, a well-configured Looker instance can reduce the burden on data teams. 
Giving end-users views they can query on their own, without going back to the data team every time they need just one more column, can be powerful. That power has led to increased interest in a universal metrics layer solution.</p><p>With dbt, it&#8217;s clear that they are trying to stake their place within the data ecosystem as a natural fit for a universal metrics layer. Much of the modern data stack already integrates with dbt, and dbt is widely adopted and available to nearly any data team. However, dbt is also moving toward a cloud-based and server-based model, and full adoption of the metrics layer will likely involve some subscription requirements.</p><p>Pricing aside, the real challenge with dbt is delivering an ergonomic and performant solution. The current jinja/yaml-based definition of metrics, the lack of any significant development tooling, and a gap in the features that would make it broadly applicable all remain open problems.</p><p>Since the metrics layer was announced, there has been very little news, although development is still active. Just last week, dbt changed the API by renaming some fields. Unfortunately, this active development also makes it difficult to recommend. Without stability, data teams are unlikely to want to develop against it.</p><p>Lightdash is in an exciting place as well. In some ways, they are trying both to integrate with dbt and to develop their own metrics definitions apart from it. Too much reliance on dbt can bring challenges, especially as there&#8217;s no clear roadmap for where the metrics layer is going. On the other hand, saving your metrics definition next to your dbt code can have a lot of ergonomic benefits. The outstanding question is whether other apps can leverage the metric definitions.
If not, Lightdash may approach Looker-status, another BI silo for metrics.</p><p>So the real question I have is this:&nbsp;<strong>Can a metrics layer be universal enough to gain applicability across the data stack yet still be designed in such a way as to be relevant to BI tools?</strong></p><p>We are still a ways off from having an answer to that question, but I&#8217;m excited to see how we get there.</p>]]></content:encoded></item><item><title><![CDATA[Deep Dive: What the heck is Airflow]]></title><description><![CDATA[This is the first installment in the Deep Dive series, where I go deep on a particular product or category.]]></description><link>https://databased.pedramnavid.com/p/deep-dive-airflow</link><guid isPermaLink="false">https://databased.pedramnavid.com/p/deep-dive-airflow</guid><dc:creator><![CDATA[Pedram Navid]]></dc:creator><pubDate>Mon, 22 Aug 2022 03:20:34 GMT</pubDate><enclosure
url="https://substackcdn.com/image/fetch/$s_!BrZY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc8da285-6e61-4ab9-ae37-5ba139a96ea2_1021x481.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is the first installment in the Deep Dive series, where I go deep on a particular product or category. Some of these will be free, and some will be paid. This one is paid and was a special request by a paid subscriber. I hope you enjoy!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://pedram.substack.com/subscribe?coupon=24c49df0&quot;,&quot;text&quot;:&quot;Get 20% off for 1 year&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://pedram.substack.com/subscribe?coupon=24c49df0"><span>Get 20% off for 1 year</span></a></p><h2>A Short History of Orchestration</h2><p>Apache Airflow belongs to a class of tools called orchestrators, but to understand what it is and why people use it, we need to travel back a little bit to its origins at Airbnb.</p><p><a href="https://airflow.apache.org/docs/apache-airflow/stable/project.html">Airflow was created in 2014</a> at Airbnb and released in 2015. The original blog post <a href="https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8">announcing the release is still up</a> and is a good resource for reminding ourselves of where Airflow came from and what the world was like then.</p><p>At Airbnb, data engineers used tools like&nbsp;<a href="https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c">Apache Hive</a>&nbsp;as a data warehouse, with much of their infrastructure built on Hadoop and Spark.
There were many problems to be solved and jobs to be done: data extraction, cleaning, quality checks, and long-term storage.</p><p>Airbnb was also performing a lot of computation. They needed to know everything from how guests felt about their accommodations to how their hosts felt about their guests. They needed to understand how well their recommendations were doing and whether their experiments were working well. They needed to compute sessions from all the clickstream data on both their app and the web.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://databased.pedramnavid.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Like what you&#8217;re reading? The rest of this article is only for paid subscribers.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>
      <p>
          <a href="https://databased.pedramnavid.com/p/deep-dive-airflow">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Counting Things: Counting Users Part 2 ]]></title><description><![CDATA[come on get a little bit closer baby, cause tonight is the night]]></description><link>https://databased.pedramnavid.com/p/count-things-counting-users-part</link><guid isPermaLink="false">https://databased.pedramnavid.com/p/count-things-counting-users-part</guid><dc:creator><![CDATA[Pedram Navid]]></dc:creator><pubDate>Tue, 09 Aug 2022 03:49:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!T507!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf55b494-298e-4abb-a299-bec34ab98cca_1844x1484.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://pedram.substack.com/p/counting-users">In my last post</a>, I walked through an example of counting people who visit your site, and the complexities that come with it. Next, we&#8217;ll explore what happens when a visitor becomes a user and emits a couple innocuous events.</p><h3>When Two Become One</h3><p>Let&#8217;s say Rachel visited your site over the past few months. For simplicity, she was kind enough to persist cookies, use the same device across both sites, and generally be friendly toward your site tracking. We are using something like <a href="https://www.rudderstack.com/">Rudderstack</a>, <a href="http://segment.io">Segment</a>, <a href="http://amplitude.com">Amplitude</a>, <a href="http://mixpanel.com">Mixpanel</a>, <a href="https://jitsu.com/">Jitsu</a>, or <a href="http://snowplowanalytics.com">Snowplow</a> for event tracking.</p><p>Rachel clicks the giant, blinking, iridescent &#10024;<strong>sign-up&#10024;</strong> button your growth team so thoughtfully placed in the middle of your website. She signs up with her email address and creates a password. Somewhere, a growth marketer wakes from her dreams. 
Success.</p><p>If your engineering team was kind and generous, they also instrumented the sign-up event and the subsequent sign-in, and now you have three types of events, and they might look like this.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cAGj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fff13ede1-57fe-47de-a0d4-d7064504e39b_523x253.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cAGj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fff13ede1-57fe-47de-a0d4-d7064504e39b_523x253.png 424w, https://substackcdn.com/image/fetch/$s_!cAGj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fff13ede1-57fe-47de-a0d4-d7064504e39b_523x253.png 848w, https://substackcdn.com/image/fetch/$s_!cAGj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fff13ede1-57fe-47de-a0d4-d7064504e39b_523x253.png 1272w, https://substackcdn.com/image/fetch/$s_!cAGj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fff13ede1-57fe-47de-a0d4-d7064504e39b_523x253.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cAGj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fff13ede1-57fe-47de-a0d4-d7064504e39b_523x253.png" width="523" height="253" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/ff13ede1-57fe-47de-a0d4-d7064504e39b_523x253.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:253,&quot;width&quot;:523,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34172,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cAGj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fff13ede1-57fe-47de-a0d4-d7064504e39b_523x253.png 424w, https://substackcdn.com/image/fetch/$s_!cAGj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fff13ede1-57fe-47de-a0d4-d7064504e39b_523x253.png 848w, https://substackcdn.com/image/fetch/$s_!cAGj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fff13ede1-57fe-47de-a0d4-d7064504e39b_523x253.png 1272w, https://substackcdn.com/image/fetch/$s_!cAGj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fff13ede1-57fe-47de-a0d4-d7064504e39b_523x253.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Our tracked events table (let&#8217;s call it <code>tracks</code>) looks like this:</p><pre><code>anonymous_id | event_name | event_time | user_id  | source
123          | viewed_page| 2022-06-01 | null     | web
123          | signed_up  | 2022-08-01 | aaaa-111 | nodejs
123          | signed_in  | 2022-08-01 | aaaa-111 | nodejs</code></pre><p>A few things to note:</p><ol><li><p>Your old rows don&#8217;t get updated when new information arrives. In June, we didn&#8217;t know the user id of our visitor, since they had not signed up. But in August we did have that information.</p></li><li><p>Events take place in different contexts. The first event was emitted from your marketing website on the front-end. The second and third were server-side events sent from the backend directly to your event-tracker. The first event may not always fire, depending on ad-blocking, network blips, or browser behaviour.</p></li></ol><p>These nuances will make the lives of your data practitioners hard, so it&#8217;s important to have lots of sympathy and moral support for them when they inevitably start working on sessionization. <em>Help is available, and they are not alone.</em></p><h3>What Can We Do With Events?</h3><p>Given just the three events, we can ask many different types of questions:</p><h4>Attribution</h4><p><em>What are the leading sources of user sign-ups?</em></p><p><em>For people who signed up, what was the first page they visited on our marketing site? Or the last?</em></p><h4>Adoption</h4><p><em>How many people signed up for my product each day?</em></p><p><em>How long does it take for an average visitor to sign up?</em></p><p><em>What percent of visitors end up signing up for our product, and how does that change over time?</em></p><h4>Engagement</h4><p><em>How many people who sign in to our product every day are new users?</em></p><p><em>How many of them are existing?</em></p><p><em>How many users stopped signing in?</em></p><p><em>How many users came back after a break?</em></p><div><hr></div><h3>Stitching User Events</h3><p>Before we can start chipping away at our newly formed backlog of questions, we still have to solve the fundamental problem of <em>user stitching</em>.
We want to associate every event we have with the user id, even if the user was not known until later.</p><p>Given the simplified example above, we can create a mapping of <code>anonymous_id &#8594; user_id </code>by using a <a href="https://docs.snowflake.com/en/sql-reference/functions-analytic.html">window function</a>. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hTXq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F145e66d3-0a42-4d74-8693-8b4a38a6aebe_802x162.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hTXq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F145e66d3-0a42-4d74-8693-8b4a38a6aebe_802x162.png 424w, https://substackcdn.com/image/fetch/$s_!hTXq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F145e66d3-0a42-4d74-8693-8b4a38a6aebe_802x162.png 848w, https://substackcdn.com/image/fetch/$s_!hTXq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F145e66d3-0a42-4d74-8693-8b4a38a6aebe_802x162.png 1272w, https://substackcdn.com/image/fetch/$s_!hTXq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F145e66d3-0a42-4d74-8693-8b4a38a6aebe_802x162.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hTXq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F145e66d3-0a42-4d74-8693-8b4a38a6aebe_802x162.png" width="802" height="162" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/145e66d3-0a42-4d74-8693-8b4a38a6aebe_802x162.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:162,&quot;width&quot;:802,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18763,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hTXq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F145e66d3-0a42-4d74-8693-8b4a38a6aebe_802x162.png 424w, https://substackcdn.com/image/fetch/$s_!hTXq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F145e66d3-0a42-4d74-8693-8b4a38a6aebe_802x162.png 848w, https://substackcdn.com/image/fetch/$s_!hTXq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F145e66d3-0a42-4d74-8693-8b4a38a6aebe_802x162.png 1272w, https://substackcdn.com/image/fetch/$s_!hTXq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F145e66d3-0a42-4d74-8693-8b4a38a6aebe_802x162.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">For simplicity, I&#8217;m using Snowflake syntax; in other implementations you may need to specify <code>ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING</code> to look past the current row for the last value.</figcaption></figure></div><p>If you&#8217;re new to window functions, this can look daunting. Think of a window as a slice of a table. We want to operate on every row that has the same <code>anonymous_id</code>. In each slice, apply a function to get a result, and add it as a new column. In this case, we&#8217;re applying the <code>last</code> function, which returns the value from the last row in that window.</p><p>Here&#8217;s a little illustration of how a window function might work. Start with the partition highlighted in yellow, order the rows within it by timestamp, take the last value, and use it to fill every row in that partition.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T507!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf55b494-298e-4abb-a299-bec34ab98cca_1844x1484.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T507!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf55b494-298e-4abb-a299-bec34ab98cca_1844x1484.png 424w, https://substackcdn.com/image/fetch/$s_!T507!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf55b494-298e-4abb-a299-bec34ab98cca_1844x1484.png 848w,
https://substackcdn.com/image/fetch/$s_!T507!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf55b494-298e-4abb-a299-bec34ab98cca_1844x1484.png 1272w, https://substackcdn.com/image/fetch/$s_!T507!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf55b494-298e-4abb-a299-bec34ab98cca_1844x1484.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T507!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf55b494-298e-4abb-a299-bec34ab98cca_1844x1484.png" width="1456" height="1172" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/df55b494-298e-4abb-a299-bec34ab98cca_1844x1484.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1172,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:468676,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T507!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf55b494-298e-4abb-a299-bec34ab98cca_1844x1484.png 424w, https://substackcdn.com/image/fetch/$s_!T507!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf55b494-298e-4abb-a299-bec34ab98cca_1844x1484.png 
848w, https://substackcdn.com/image/fetch/$s_!T507!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf55b494-298e-4abb-a299-bec34ab98cca_1844x1484.png 1272w, https://substackcdn.com/image/fetch/$s_!T507!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf55b494-298e-4abb-a299-bec34ab98cca_1844x1484.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Once the events have been stitched, we have our desired output: a 
mapping of anonymous id to user id. </p><p>Conversely, we also know that any anonymous id that isn&#8217;t mapped to a user id has not signed up, or cannot otherwise be identified.<br></p><pre><code>anonymous_id | user_id 
123          | aaaa-111
124          | aaaa-111
125          | aaaa-111
234          | bbbb-222 
789          | NULL      &lt;- this person has never signed up</code></pre><p>With the above, we can now start to chip away at our questions from before. </p><h4>Attribution</h4><p><em>What are the leading sources of user sign-ups?</em></p><p><em>For people who signed up, what was the first page they visited on our marketing site? Or the last?</em></p><p>What we know:</p><ul><li><p>all the visitors to our website and how they got there (from <a href="https://pedram.substack.com/p/counting-users">Counting Users, Part 1</a>)</p></li><li><p>which visitors eventually became users</p></li></ul><p>So naturally, we can look at where visitors came from before signing up to understand where our sign-ups come from.</p><p>There are many models for attributing sign-ups to visits (all of them bad, some of them useful). The simplest ones look at either the first or the last thing someone did before they converted. Consider the following events:</p><pre><code>anonymous_id| path   | utm        | referrer | event_time
123         | /      | cpc-google | google   | 2 days ago
123         | /blog/ | NULL       | bing     | yesterday</code></pre><p>We start by joining <code>anonymous_id</code> above to the <code>stitching</code> table we created in the previous step, and let&#8217;s say we find that the anonymous user <code>123</code> is actually user <code>aaaa-111</code>.</p><p>Pretend we also have a table that tells us when the user signed up, and it was <code>today</code>. We can either give credit to a cost-per-click advertisement (first-touch attribution) or to our blog (last-touch attribution). </p><p>We can get more complex if we wish. Maybe we only want to look back a certain number of days. For example, does it make sense to give credit to a paid ad from 18 months ago if someone signs up today? </p><p>We might want to categorize different types of web traffic according to some rules by bucketing similar traffic together, such as social media sources. (There&#8217;s a great <a href="https://github.com/dbt-labs/segment/blob/main/seeds/referrer_mapping.csv">dbt seed file</a> in the segment package that helps with this.)</p><p>We may want to go beyond attributing conversion events, to better understand what brings new visitors to our site for the first time. For every user, we could look at the first page they visited and categorize that traffic to understand &#8216;landing pages&#8217;.</p><p>Hard to believe, but the answers to every single one of these questions start with just a few events and stitching.</p><h4>Adoption</h4><p><em>How many people signed up for my product each day? </em></p><p><em>How long does it take for an average visitor to sign up?</em></p><p><em>What percent of visitors end up signing up for our product, and how does that change over time?</em></p><p>Understanding adoption is also made possible by the same types of events we used for attribution. If we look only at events from signed-in users, which carry a <code>user_id</code>, we can count how many of them are active each day like so:</p><pre><code>select 
date_trunc('day', timestamp) as event_day,
count(distinct user_id) 

from tracks
group by event_day</code></pre><p>This query counts the number of distinct users active on each day. We use count distinct because a single user often has multiple events a day.</p><p>If we want to know how long it takes for a visitor to sign up, we can look at the time elapsed between their first visit and their sign-up event.</p><pre><code>with conversions as (
  /* Assume one signed-up event per user for simplicity */
    select
    user_id,
    timestamp as signup_date

    from tracks 
    where event_name = 'signed-up'

)

select distinct

stitched.user_id,
/* order the window so first_value reliably returns the earliest event */
first_value(tracks.timestamp) over(partition by stitched.user_id order by tracks.timestamp) as first_event_date,
conversions.signup_date,
datediff('days', first_event_date, conversions.signup_date) as days_to_signup

from tracks 
join stitched using(anonymous_id)
/* bring in the sign-up date for each stitched user */
join conversions on conversions.user_id = stitched.user_id</code></pre><p>We use a window function again, this time to get the first event. We count the days between the first event and the conversion to see how long it takes for someone to sign up.</p><p>We can also perform a very rudimentary funnel analysis by counting the number of new visitors and sign-ups each day. </p><p>To help us, let&#8217;s imagine a helper column called <code>blended_user_id</code>. It is the user id if it&#8217;s known, or the anonymous id if not. </p><p>We find the first event ever for a particular blended user id, and then find the first sign-up event for each user. Count the number of times each of those events happens, every day, and get a funnel count of visitors &#8594; users.</p><pre><code>with visitors as (
    select 

    date_trunc('days', first_seen) as day,
    count(*) as new_visitors

    from (
        /* first event ever for each blended user id */
        select blended_user_id, min(timestamp) as first_seen
        from stitched_tracks
        group by blended_user_id
    ) as first_events
    group by day
),

signups as (

    select 

    date_trunc('days', first_signup) as day,
    count(*) as new_signups

    from (
        /* first sign-up event for each blended user id */
        select blended_user_id, min(timestamp) as first_signup
        from stitched_tracks
        where event_name = 'signed-up'
        group by blended_user_id
    ) as first_signups
    group by day
)

select 

day,
new_signups,
new_visitors

from visitors
full join signups using (day)</code></pre><h4>Engagement</h4><p><em>How many people who sign-in to our product every day are new users? </em></p><p><em>How many of them are existing? </em></p><p><em>How many users stopped signing in? </em></p><p><em>How many users came back after a break?</em></p><p>We can even start to get into some fun churn and retention analysis. One really simple (and not useful) way to measure churn might be to count:</p><ul><li><p>Anyone who signed in today that signed in yesterday (retention)</p></li><li><p>Anyone who signed in yesterday that didn&#8217;t sign in today (churn)</p></li></ul><p>We&#8217;re using some really fun SQL now, by joining a single table to itself and offsetting the day in the join condition.</p><pre><code>with daily_activity as (
  select distinct
    date_trunc('day', timestamp) as day,
    user_id
  from tracks
  where user_id is not null
),

retained as (
select
  today.day,
  count(distinct today.user_id) as retained
from daily_activity today
join daily_activity yesterday
  on today.user_id = yesterday.user_id
  and today.day = yesterday.day + interval '1 day'
group by today.day
),

churned as (
select
  yesterday.day + interval '1 day' as day,
  count(distinct yesterday.user_id) as churned
from daily_activity yesterday
left join daily_activity today
  on today.user_id = yesterday.user_id
  and today.day = yesterday.day + interval '1 day'
where today.user_id is null
group by 1
)

select 
day,
coalesce(retained, 0) as retained,
coalesce(churned, 0) as churned

from retained
full join churned using (day)
order by 1
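;

/* A hedged sketch, not taken from the Sisense post: one possible way to
   answer "how many users came back after a break?" with the same self-join
   idea. A user counts as reactivated on a day if they were active that day,
   were not active the day before, and had been active on some earlier day.
   It assumes the same tracks table and Snowflake-style interval syntax;
   first_seen and reactivated are illustrative names. */
with daily_activity as (
  select distinct
    date_trunc('day', timestamp) as day,
    user_id
  from tracks
  where user_id is not null
),

first_seen as (
  /* earliest active day per user, so brand-new users are excluded below */
  select
    user_id,
    min(day) as first_day
  from daily_activity
  group by user_id
)

select
  today.day,
  count(distinct today.user_id) as reactivated
from daily_activity today
join first_seen
  on first_seen.user_id = today.user_id
left join daily_activity yesterday
  on yesterday.user_id = today.user_id
  and today.day = yesterday.day + interval '1 day'
where yesterday.user_id is null
  and today.day &gt; first_seen.first_day
group by today.day
order by 1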
</code></pre><p>This example draws heavily on the <strong><a href="https://www.sisense.com/blog/use-self-joins-to-calculate-your-retention-churn-and-reactivation-metrics/">Sisense blog</a>, </strong>so feel free to give it a read to really understand what&#8217;s going on. Don&#8217;t sweat it if this one makes your head hurt; the goal here is really to show you how much you can do with just a couple of events.</p><p>I hope this was a useful foray into the depths you can go to with event streams. The world only gets more complicated from here as you try to do things like tie ad spend to revenue by connecting Salesforce Accounts to Product Signups through intermediary tables. Yuck! Let&#8217;s pretend we never spoke of such things. </p><div><hr></div><p>Did you enjoy this post? Do you have ideas for future metrics to cover? Maybe you think Cohort Analysis is something you&#8217;ve always wanted to learn more about, or you think there&#8217;s nothing hotter than a well-defined activation metric. 
Well, leave a comment or <a href="mailto:pedram@pedramnavid.com">drop me an email!</a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Counting Things: Counting Users Part 1]]></title><description><![CDATA[one of the easiest things to define]]></description><link>https://databased.pedramnavid.com/p/counting-users</link><guid isPermaLink="false">https://databased.pedramnavid.com/p/counting-users</guid><dc:creator><![CDATA[Pedram Navid]]></dc:creator><pubDate>Sat, 23 Jul 2022 22:45:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3lFY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F09cafd56-ce17-46db-9807-7b461bf54569_637x443.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is a new series as part of my promise to write more in-depth content for people who care about data. Every [irregular interval] I will cover a metric that we all know and love and go deep on it. The goal is not just to define metrics, but to show the thought process that can go into them. If you&#8217;re a data practitioner, I hope you&#8217;ll learn something new. If you work with data practitioners, I hope you learn the value that data teams can bring to an organization.</p><div><hr></div><p>It&#8217;s a question as old as time: how many users do we have? Well, that depends on what you mean by &#8216;users&#8217;, and &#8216;we&#8217;, and &#8216;have&#8217;. </p><p>If I were to press you to define them, you might have a few definitions for each. 
For example,</p><p>A user is:</p><ul><li><p>Someone who visited our website, or</p></li><li><p>Someone who has logged in to our application</p></li><li><p>Any account or customer within our CRM</p></li></ul><p>We might mean:</p><ul><li><p>All the teams at our company</p></li><li><p>Anything that the marketing team is responsible for</p></li><li><p>Anyone that sales knows about</p></li></ul><p>Have could mean:</p><ul><li><p>The number of users we have today</p></li><li><p>The number of users we&#8217;ve ever had</p></li><li><p>The number of users we&#8217;ve had on a given day, at that point in time, and subsequently into the future</p></li></ul><p>Let&#8217;s dig into the first one for now. We&#8217;ll return to the second one later.</p><h2>How many people visited our website?</h2><p>Let&#8217;s take the first one: how do you know when someone visits our website? Well, we have event tracking, so our event tracker can tell us when someone views any of our pages. But what data does an event tracker provide? Let&#8217;s take <a href="https://www.rudderstack.com/docs/destinations/warehouse-destinations/warehouse-schema/#standard-rudderstack-properties">Rudderstack&#8217;s standard schema</a> and explore it further. When you save their data to your warehouse, you get something like this:</p><ul><li><p>anonymous_id: The user&#8217;s anonymous ID</p></li><li><p>event: the name of the event</p></li><li><p>context_ip: The IP address of the device </p></li><li><p>context_&lt;props&gt;: Additional properties on the event</p></li><li><p>id: the event&#8217;s unique id</p></li><li><p>url / path: the URL and path where the event was captured</p></li><li><p>timestamps: various timestamps with slight nuances that don&#8217;t matter here</p></li></ul><p>It seems we&#8217;re in the clear. 
If we want to know how many people visited our website, we can just<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> count the distinct number of anonymous IDs. This data stuff is easier than you think!</p><p>But, let&#8217;s push our curiosity out a little more. What..is an anonymous id? Well, we don&#8217;t have to go very far to find out. Rudderstack is open-source so we can <a href="https://github.com/rudderlabs/rudder-sdk-js-autotrack/blob/0e249fc65b4f36646047dacf9462cf2fb65fd2b8/analytics.js">find out for ourselves.</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3lFY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F09cafd56-ce17-46db-9807-7b461bf54569_637x443.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3lFY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F09cafd56-ce17-46db-9807-7b461bf54569_637x443.png 424w, https://substackcdn.com/image/fetch/$s_!3lFY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F09cafd56-ce17-46db-9807-7b461bf54569_637x443.png 848w, https://substackcdn.com/image/fetch/$s_!3lFY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F09cafd56-ce17-46db-9807-7b461bf54569_637x443.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3lFY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F09cafd56-ce17-46db-9807-7b461bf54569_637x443.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3lFY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F09cafd56-ce17-46db-9807-7b461bf54569_637x443.png" width="637" height="443" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/09cafd56-ce17-46db-9807-7b461bf54569_637x443.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:443,&quot;width&quot;:637,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53895,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3lFY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F09cafd56-ce17-46db-9807-7b461bf54569_637x443.png 424w, https://substackcdn.com/image/fetch/$s_!3lFY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F09cafd56-ce17-46db-9807-7b461bf54569_637x443.png 848w, https://substackcdn.com/image/fetch/$s_!3lFY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F09cafd56-ce17-46db-9807-7b461bf54569_637x443.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3lFY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F09cafd56-ce17-46db-9807-7b461bf54569_637x443.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>In here, <code>storage</code> refers to the device&#8217;s local storage where cookies are saved. If the ID already exists, then your anonymous id is what it was previously. 
But if not, then <a href="https://github.com/rudderlabs/rudder-sdk-js/blob/480d37d3dc1119de42c13fbbb3e836f967236fc8/utils/utils.js#L37">Rudderstack will generate one for you</a> and save it to your cookies.</p><p>Interesting, so if you clear your cookies, then you get a new anonymous id. If you use a different browser, you get a new anonymous id. If you use private mode, you get a new anonymous id. If you use a different device, like your laptop, your work laptop, or your phone, then you get a new anonymous id. And with the war on cookies from Safari and Firefox, this problem is getting worse. Turns out one person can have many different anonymous ids. </p><p>Well, what about the IP address? Couldn&#8217;t we just use that to dedupe? Let&#8217;s think a bit more about that one too. How do devices get an IP address? Let&#8217;s not dive too deep, but from a router. But multiple people connect to the same router. Especially at work, or at school, or at the airport, or on public wifi. We could have many, many different people all on the same IP address. </p><p>Wow, maybe counting things isn&#8217;t so easy after all?</p><p>So what do we do? <em>Well, in the absence of the right answer, we often have to make do with a good enough answer.</em> Let&#8217;s say that when we count visitors to our website, we will count the distinct anonymous ids, knowing full well that this number overstates the true number of people visiting our website.</p><p>Our final code might look something like this:</p><pre><code>select count(distinct anonymous_id) from events;</code></pre><h3>How many net new people visited our website every week?</h3><p>Okay, so we know how many &#8216;people&#8217; visited our website. But that&#8217;s not actionable data. Is 20,000 good? Is 50,000 bad? What we almost always want with analytics is understanding change over time.</p><p>The first question to ask is which time grain we should use to break down our data. We can count the number of users every minute, hour, day, week, month, or year. </p><p>Picking the right time grain is context-dependent, but the choice revolves around having enough time for the noise to level off but not so much time that the results are not actionable. Also consider how often you will look at the data. Looking at data daily or more frequently than that is not healthy, or good for the soul, and I&#8217;m all about making sure you&#8217;re living a healthy, happy life.</p><p>Daily data is subject to fluctuations based on weekends and holidays. Monthly data is smoother, but it lacks immediacy. Do you want to wait 20 days to find out your website broke on the 10th? Let&#8217;s go with weekly: it smooths out the weekends and provides a nice balance.</p><p>The simple approach to counting people by week might look something like this:</p><pre><code>select 

date_trunc('week', timestamp) as week, 
count(distinct anonymous_id) as visitors 

from events
group by 1</code></pre><p>What we end up measuring here is the number of unique visitors to our website, every week. If Pedram and Claire both visit the website every week, but no one new shows up, we&#8217;ll have a steady rate of 2 weekly users. Fine, but not exciting enough.</p><p>What we&#8217;re interested in is how many new people we&#8217;re bringing to our website. We want new people joining so we can create a healthy top-of-funnel pipeline to drive our marketing and sales motions. Without new people visiting, we&#8217;ll run out of sales, our company will die, and we will be sad forever. We&#8217;re all about happiness here.</p><p>So instead, let&#8217;s find out when we first saw a user:</p><pre><code>select

date_trunc('week', timestamp) as week,
anonymous_id,
row_number() over(partition by anonymous_id order by timestamp) as event_date_index

from events;</code></pre><p>This cute little <code>row_number</code> function does nothing more than count up from 1 until there are no more rows. But the magic is in the partition. A partition is nothing more than a group, so we&#8217;re asking our little function to number every user&#8217;s events separately, from their 1st to their last (we ordered from oldest to newest, but could also have done it in descending order with <code>order by timestamp desc</code>).</p><p>Now we can do something fun. We can find the first time a user visited our website by filtering that previous query.</p><pre><code>with numbered_events as (
  select

  date_trunc('week', timestamp) as week,
  anonymous_id,
  row_number() over(partition by anonymous_id order by timestamp) as event_date_index

  from events
)

select 

week, 
count(anonymous_id) as new_visitors

from numbered_events
where event_date_index = 1
group by 1</code></pre><p>We use a CTE<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> because we can&#8217;t filter on <code>row_number</code> using WHERE or HAVING<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><p>So we answered our first question and dug into it a little bit. We have a better sense of the limitations of the data, and why it&#8217;s hard, but that doesn&#8217;t mean it&#8217;s not useful. Every week we can keep an eye on our overall users and see how the number is trending. We might even take our models further and use things like the referrer and UTM parameters to better understand not just how many users we have, but where they come from! On to our next question.</p><h3>Next Time: How Many People Use Our Product?</h3><p>Now that we&#8217;ve solved the first question, in our next one we&#8217;ll dig into our product itself. There, our users have authenticated, so counting should be easier. But we might end up with some more interesting questions, like how many of them use our product every day? </p><p>Did you enjoy this post? Do you have ideas for future metrics to cover? Maybe you think Cohort Analysis is something you&#8217;ve always wanted to learn more about, or you think there&#8217;s nothing hotter than a well-defined activation metric. Well, leave a comment or <a href="mailto:pedram@pedramnavid.com">drop me an email!</a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Whenever I use the word <em>just</em> to show how easy something is, the thing I&#8217;m describing is actually really hard.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>A CTE, or Common Table Expression, is a way of taking a snippet of SQL, putting it in a little metaphorical box, and giving it a name so you can reuse it later. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>We can use QUALIFY in Snowflake, but not every database supports that clause.</p></div></div>]]></content:encoded></item></channel></rss>