I wrote a small script to analyze HQL scripts in bulk and find out how various tables are related to each other. Contents of an example file sample.hql are shown below.

CREATE TABLE IF NOT EXISTS some_generated_data_table
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LOCATION '/tmp/some_datasource_export.csv' AS
SELECT b.user_id, b.age, count(distinct a.book_id) AS book_count,
		avg(a.price) AS average_price,
		avg(a.num_pages) AS average_num_pages
FROM some_external_table a
JOIN some_other_external_table b
ON a.book_id=b.book_id
GROUP BY b.user_id,b.age;

Passing this file to process-file returns a structure that contains three groups, representing

  • Create - Table that may be created

  • Insert - Tables into which data may be inserted

  • Referred - Tables from which data may be read and funneled into the above

(process-file "sample.hql")
;=>
({:create #{"some_generated_data_table"},
  :insert #{},
  :refer  #{"some_external_table" "some_other_external_table"}})

Now, you could pass many such expressions, across multiple files, through this function and analyze the combined output to build a graph of how various tables are related to each other, and in a very naive way, understand how data flows across the various tables too.