Enabling static analysis of SQL queries at Meta

  • UPM is our internal standalone library to perform static analysis of SQL code and enhance SQL authoring.
  • UPM takes SQL code as input and represents it as a data structure called a semantic tree.
  • Infrastructure teams at Meta leverage UPM to build SQL linters, catch user mistakes in SQL code, and perform data lineage analysis at scale.

Executing SQL queries against our data warehouse is important to the workflows of many engineers and data scientists at Meta for analytics and monitoring use cases, either as part of recurring data pipelines or for ad-hoc data exploration.

While SQL is extremely powerful and very popular among our engineers, we’ve also faced some challenges over the years, namely:

  • A need for static analysis capabilities: In a growing number of use cases at Meta, we must understand programmatically what happens in SQL queries before they are executed against our query engines, a task called static analysis. These use cases range from performance linters (suggesting query optimizations that query engines cannot perform automatically) to data lineage analysis (tracing how data flows from one table to another). This was hard for us to do for two reasons. First, while query engines internally have some capabilities to analyze a SQL query in order to execute it, this query analysis component is typically deeply embedded inside the query engine’s code. It is not easy to extend, and it is not intended for consumption by other infrastructure teams. Second, each query engine has its own analysis logic, specific to its own SQL dialect; as a result, a team that wants to build a piece of analysis for SQL queries would have to reimplement it from scratch inside each SQL query engine.
  • A limiting type system: Initially, we used only the fixed set of built-in Hive data types (string, integer, boolean, etc.) to describe table columns in our data warehouse. As our warehouse grew more complex, this set of types became insufficient, as it left us unable to catch common categories of user errors, such as unit errors (imagine making a UNION between two tables, both of which contain a column called timestamp, but one is encoded in milliseconds and the other in nanoseconds) or ID comparison errors (imagine a JOIN between two tables, each with a column called user_id, where the IDs are in fact issued by different systems and therefore cannot be compared).

How UPM works

To address these challenges, we have built UPM (Unified Programming Model). UPM takes in an SQL query as input and represents it as a hierarchical data structure called a semantic tree.

For example, if you pass in this query to UPM:

SELECT
COUNT(DISTINCT user_id) AS n_users
FROM login_events

UPM will return this semantic tree:

SelectQuery(
    items=[
        SelectItem(
            name="n_users",
            type=upm.Integer,
            value=CallExpression(
                function=upm.builtin.COUNT_DISTINCT,
                arguments=[ColumnRef(name="user_id", parent=Table("login_events"))],
            ),
        )
    ],
    parent=Table("login_events"),
)

Other tools can then use this semantic tree for different use cases, such as:

  1. Static analysis: A tool can inspect the semantic tree and then output diagnostics or warnings about the query (such as a SQL linter).
  2. Query rewriting: A tool can modify the semantic tree to rewrite the query.
  3. Query execution: UPM can act as a pluggable SQL front end, meaning that a database engine or query engine can use a UPM semantic tree directly to generate and execute a query plan. (The term front end in this context is borrowed from the world of compilers; the front end is the part of a compiler that converts higher-level code into an intermediate representation that will ultimately be used to generate an executable program.) Alternatively, UPM can render the semantic tree back into a target SQL dialect (as a string) and pass that to the query engine.
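As an illustration of the first use case, here is a minimal sketch of a lint rule that walks a semantic tree. The node classes below are hypothetical stand-ins loosely modeled on the tree shown earlier; they are not UPM's actual API, and the rule itself (flagging exact distinct counts) is only an assumed example of a performance lint.

```python
from dataclasses import dataclass

# Hypothetical node classes, loosely modeled on the tree shown earlier.
# They are illustrative only and do not reflect UPM's real API.

@dataclass
class Table:
    name: str

@dataclass
class ColumnRef:
    name: str
    parent: Table

@dataclass
class CallExpression:
    function: str    # e.g. "COUNT_DISTINCT"
    arguments: list  # nested ColumnRef / CallExpression nodes

@dataclass
class SelectItem:
    name: str
    value: object    # ColumnRef or CallExpression

@dataclass
class SelectQuery:
    items: list
    parent: Table


def find_calls(expr, function):
    """Yield every CallExpression under `expr` that calls `function`."""
    if isinstance(expr, CallExpression):
        if expr.function == function:
            yield expr
        for arg in expr.arguments:
            yield from find_calls(arg, function)


def lint_exact_distinct(query):
    """Return one warning per exact distinct count found in the query."""
    warnings = []
    for item in query.items:
        for _ in find_calls(item.value, "COUNT_DISTINCT"):
            warnings.append(
                f"{item.name}: exact COUNT(DISTINCT ...) on {query.parent.name} "
                "may be expensive; consider an approximate count"
            )
    return warnings


login_events = Table("login_events")
tree = SelectQuery(
    items=[
        SelectItem(
            name="n_users",
            value=CallExpression("COUNT_DISTINCT", [ColumnRef("user_id", login_events)]),
        )
    ],
    parent=login_events,
)

for warning in lint_exact_distinct(tree):
    print(warning)
```

Because the rule only inspects tree nodes, it never has to parse SQL text or know which engine will eventually run the query.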

A unified SQL language front end

UPM allows us to provide a single language front end to our SQL users so that they only need to work with a single language (a superset of the Presto SQL dialect), whether their target engine is Presto, Spark, or XStream, our in-house stream processing service.

This unification is also beneficial to our data infrastructure teams: Teams that own SQL static analysis or rewriting tools can use UPM semantic trees as a standard interop format, without worrying about parsing, analysis, or integration with different SQL query engines and SQL dialects. Similarly, much like Velox can act as a pluggable execution engine for data management systems, UPM can act as a pluggable language front end for data management systems, saving teams the effort of maintaining their own SQL front end.
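To make the idea of rendering one tree into several dialects concrete, here is a small sketch. The classes, the `FUNCTION_NAMES` table, and the `render` function are all assumptions made for illustration and are not part of UPM. The only engine-specific fact relied on is that Presto spells the approximate distinct count `approx_distinct` while Spark SQL spells it `approx_count_distinct`.

```python
from dataclasses import dataclass

# Hypothetical, simplified tree nodes (not UPM's real classes).

@dataclass
class ColumnRef:
    name: str

@dataclass
class Call:
    function: str    # dialect-neutral function name
    arguments: list

@dataclass
class SelectItem:
    name: str
    value: object

@dataclass
class SelectQuery:
    items: list
    table: str

# Per-dialect spelling of the same logical function.
FUNCTION_NAMES = {
    "presto": {"APPROX_DISTINCT": "approx_distinct"},
    "spark": {"APPROX_DISTINCT": "approx_count_distinct"},
}


def render_expr(expr, dialect):
    """Render an expression node as SQL text for the given dialect."""
    if isinstance(expr, ColumnRef):
        return expr.name
    if isinstance(expr, Call):
        name = FUNCTION_NAMES[dialect][expr.function]
        args = ", ".join(render_expr(arg, dialect) for arg in expr.arguments)
        return f"{name}({args})"
    raise TypeError(f"unsupported node: {expr!r}")


def render(query, dialect):
    """Render a whole SELECT query as a SQL string."""
    items = ", ".join(
        f"{render_expr(item.value, dialect)} AS {item.name}" for item in query.items
    )
    return f"SELECT {items} FROM {query.table}"


tree = SelectQuery(
    items=[SelectItem("n_users", Call("APPROX_DISTINCT", [ColumnRef("user_id")]))],
    table="login_events",
)

print(render(tree, "presto"))
print(render(tree, "spark"))
```

The same tree yields `SELECT approx_distinct(user_id) AS n_users FROM login_events` for Presto and `SELECT approx_count_distinct(user_id) AS n_users FROM login_events` for Spark, so tools upstream of the renderer never deal with dialect differences.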

Enhanced type-checking

UPM also allows us to provide enhanced type-checking of SQL queries.

In our warehouse, each table column is assigned a “physical” type from a fixed list, such as integer or string. Additionally, each column can have an optional user-defined type; while it does not affect how the data is encoded on disk, this type can supply semantic information (e.g., Email, TimestampMilliseconds, or UserID). UPM can take advantage of these user-defined types to improve static type-checking of SQL queries.

For example, an SQL query author might want to UNION data from two tables that contain information about different login events:

In this query, the author is trying to combine timestamps in milliseconds from the table user_login_events_mobile with timestamps in nanoseconds from the table user_login_events_desktop, an understandable mistake, as the two columns have the same name. But because the tables’ schemas have been annotated with user-defined types, UPM’s typechecker catches the error before the query reaches the query engine; it then notifies the author in their code editor. Without this check, the query would have completed successfully, and the author might not have noticed the mistake until much later.
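The check described above can be sketched roughly as follows. This is a simplified illustration, not UPM's typechecker: the `Column` class, the `check_union` function, the column names, the `bigint` physical type, and the `TimestampNanoseconds` type name are all assumptions. The sketch also assumes that columns without a user-defined type skip the semantic comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    physical_type: str               # how the data is encoded, e.g. "bigint"
    user_type: Optional[str] = None  # optional semantic annotation


def check_union(left, right):
    """Compare two column lists positionally, as UNION does, and return errors."""
    if len(left) != len(right):
        return [f"UNION inputs have {len(left)} and {len(right)} columns"]
    errors = []
    for pos, (lcol, rcol) in enumerate(zip(left, right), start=1):
        if lcol.physical_type != rcol.physical_type:
            errors.append(
                f"column {pos} ({lcol.name}): physical types "
                f"{lcol.physical_type} and {rcol.physical_type} differ"
            )
        elif lcol.user_type and rcol.user_type and lcol.user_type != rcol.user_type:
            # Same encoding on disk, but different meaning.
            errors.append(
                f"column {pos} ({lcol.name}): cannot UNION "
                f"{lcol.user_type} with {rcol.user_type}"
            )
    return errors


# Assumed schemas for the two tables named in the text.
user_login_events_mobile = [
    Column("user_id", "bigint", "UserID"),
    Column("timestamp", "bigint", "TimestampMilliseconds"),
]
user_login_events_desktop = [
    Column("user_id", "bigint", "UserID"),
    Column("timestamp", "bigint", "TimestampNanoseconds"),
]

for error in check_union(user_login_events_mobile, user_login_events_desktop):
    print(error)
```

A check based on physical types alone would accept this UNION, since both timestamp columns share the same encoding; only the user-defined annotation exposes the unit mismatch.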

Column-level data lineage

Data lineage, meaning an understanding of how data flows within our warehouse and through to consumption surfaces, is a foundational piece of our data infrastructure. It enables us to answer data quality questions (e.g., “This data looks incorrect; where is it coming from?” and “Data in this table were corrupted; which downstream data assets were impacted?”). It also helps with data refactoring (“Is this table safe to delete? Is anyone still depending on it?”).

To help us answer these critical questions, our data lineage team has built a query analysis tool that takes UPM semantic trees as input. The tool examines all recurring SQL queries to build a column-level data lineage graph across our entire warehouse. For example, given this query:

INSERT INTO user_logins_daily_agg
SELECT
   DATE(login_timestamp) AS day,
   COUNT(DISTINCT user_id) AS n_users
FROM user_login_events
GROUP BY 1

Our UPM-powered column lineage analysis would deduce these edges:

[
   {
      from: "user_login_events.login_timestamp",
      to: "user_logins_daily_agg.day",
      transform: "DATE"
   },
   {
      from: "user_login_events.user_id",
      to: "user_logins_daily_agg.n_users",
      transform: "COUNT_DISTINCT"
   }
]

By putting this information together for every query executed against our data warehouse each day, the tool shows us a global view of the full column-level data lineage graph.
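The per-query edge extraction can be sketched as a walk over each output column's expression. The node classes and the `lineage_edges` function below are hypothetical and much simpler than a real semantic tree; the sketch only handles a single-table INSERT ... SELECT like the example query.

```python
from dataclasses import dataclass

# Hypothetical, simplified tree nodes (not UPM's real classes).

@dataclass
class ColumnRef:
    name: str
    table: str

@dataclass
class Call:
    function: str
    arguments: list

@dataclass
class SelectItem:
    name: str
    value: object

@dataclass
class InsertSelect:
    target: str  # table being written to
    items: list  # output columns of the SELECT


def column_refs(expr, transforms=()):
    """Yield (ColumnRef, enclosing function names) for every column in `expr`."""
    if isinstance(expr, ColumnRef):
        yield expr, transforms
    elif isinstance(expr, Call):
        for arg in expr.arguments:
            yield from column_refs(arg, transforms + (expr.function,))


def lineage_edges(query):
    """Return one edge per source column feeding each output column."""
    edges = []
    for item in query.items:
        for ref, transforms in column_refs(item.value):
            edges.append({
                "from": f"{ref.table}.{ref.name}",
                "to": f"{query.target}.{item.name}",
                # Outermost function first; IDENTITY for a plain column copy.
                "transform": " > ".join(transforms) or "IDENTITY",
            })
    return edges


query = InsertSelect(
    target="user_logins_daily_agg",
    items=[
        SelectItem("day", Call("DATE", [ColumnRef("login_timestamp", "user_login_events")])),
        SelectItem("n_users", Call("COUNT_DISTINCT", [ColumnRef("user_id", "user_login_events")])),
    ],
)

for edge in lineage_edges(query):
    print(edge)
```

Collecting the edges from every recurring query into one set would give the warehouse-wide graph; a real implementation would also need to handle joins, subqueries, and wildcard selects, which this sketch leaves out.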

What’s next for UPM

We look forward to more exciting work as we continue to unlock UPM’s full potential at Meta. Eventually, we hope all Meta warehouse tables will be annotated with user-defined types and other metadata, and that enhanced type-checking will be strictly enforced in every authoring surface. Most tables in our Hive warehouse already leverage user-defined types, but we are rolling out stricter type-checking rules gradually, to facilitate the migration of existing SQL pipelines.

We have already integrated UPM into the main surfaces where Meta’s developers write SQL, and our long-term goal is for UPM to become Meta’s unified SQL front end: deeply integrated into all our query engines, exposing a single SQL dialect to our developers. We also intend to iterate on the ergonomics of this unified SQL dialect (for example, by allowing trailing commas in SELECT clauses and by supporting syntax constructs like SELECT * EXCEPT <some_columns>, which already exist in some SQL dialects) and to ultimately raise the level of abstraction at which people write their queries.