The Hidden Benefits of Databricks' Managed Tables

Databricks manages the entire life cycle of managed tables, resulting in no “zombies” files after DROP operations. There’s also a significant benefit in predictive optimization, while you can theoretically optimize the timing for OPTIMIZE, ANALYZE, and VACUUM operations manually, it’s extremely difficult in practice. Managed tables provide an additional advantage that impacts both performance and cost.

Since Databricks assumes that managed tables are modified only by Databricks itself, it can cache references to all Parquet files used in Delta Lake and avoid expensive list operations. This is a theory, but I decided to test it in practice.

I created two identical tables that differ only in their storage location and populated them with the same number of rows:

CREATE TABLE transactions_external -- name transactions_managed
(
  txn_id     BIGINT,
  txn_ts     TIMESTAMP,
  account_id INT,
  amount     DECIMAL(10,2),
  txn_type   STRING,
  note       STRING,
  txn_date   DATE GENERATED ALWAYS AS (CAST(txn_ts AS DATE))
)
PARTITIONED BY (txn_date)
LOCATION 'abfss://unity@westus2uc.dfs.core.windows.net/transactions_ext' -- that line skipped for maned table
TBLPROPERTIES (
  delta.autoOptimize.autoCompact = false,
  delta.autoOptimize.optimizeWrite = false
);

-- Insert identical data to the external table
INSERT INTO transactions_external (txn_id, txn_ts, account_id, amount, txn_type, note)
SELECT * FROM tx_1mln;

-- I did some tests without cache
SET use_cached_result = false;  

-- test statement
SELECT COUNT(*) FROM transactions_managed;
SELECT COUNT(*) FROM transactions_external;

I ran count operations on both tables with caching disabled to ensure accurate measurements. Here’s the clock for the external tables:

On most query plans, we can observe that managed tables are slightly faster.

However, I decided go one level deeper and configured storage logs to be sent to Log Analytics in Azure. Note: Only enable this for debugging purposes; Log Analytics can be very expensive per GB in production environments. If you want to try this, you’ll find this option in the “Diagnostic setting” of your storage account:

I then ran a simple query to analyze recent storage operations. I ran all counts in different minutes to spot the issue.

StorageBlobLogs
| where TimeGenerated > ago(30m)
| project TimeGenerated,
          OperationName,
          ResponseHeaderSize,
          ResponseBodySize
| order by TimeGenerated desc

Here’s the managed table:

And here is the external table:

The Key Difference

The critical difference lies in the ListFilesystemDir operations. External tables require additional directory listing operations that managed tables can avoid through caching.

While each storage operation costs only a small amount, these ListFilesystemDir operations can accumulate into significant additional costs for external tables, especially at scale.

Managed tables not only provide operational simplicity but also deliver measurable performance improvements and cost savings through optimized storage access patterns.

Hubert Dudek

Databricks MVP | Advisor to Databricks Product Board and Technical advisor to SunnyData

https://www.linkedin.com/in/hubertdudek/
Next
Next

Keeping Up With Agent Bricks