TY - JOUR
T1 - Saggitarius
T2 - A DSL for Specifying Grammatical Domains
AU - Miltner, Anders
AU - Loehr, Devon
AU - Mong, Arnold
AU - Fisher, Kathleen
AU - Walker, David
N1 - Publisher Copyright: © 2023 Owner/Author.
PY - 2023/10/16
Y1 - 2023/10/16
N2 - Common data types like dates, addresses, phone numbers and tables can have multiple textual representations, and many heavily-used languages, such as SQL, come in several dialects. These variations can cause data to be misinterpreted, leading to silent data corruption, failure of data processing systems, or even security vulnerabilities. Saggitarius is a new language and system designed to help programmers reason about the format of data, by describing grammatical domains - -that is, sets of context-free grammars that describe the many possible representations of a datatype. We describe the design of Saggitarius via example and provide a relational semantics. We show how Saggitarius may be used to analyze a data set: given example data, it uses an algorithm based on semi-ring parsing and MaxSAT to infer which grammar in a given domain best matches that data. We evaluate the effectiveness of the algorithm on a benchmark suite of 110 example problems, and we demonstrate that our system typically returns a satisfying grammar within a few seconds with only a small number of examples. We also delve deeper into a more extensive case study on using Saggitarius for CSV dialect detection. Despite being general-purpose, we find that Saggitarius offers comparable results to hand-tuned, specialized tools; in the case of CSV, it infers grammars for 84% of benchmarks within 60 seconds, and has comparable accuracy to custom-built dialect detection tools.
AB - Common data types like dates, addresses, phone numbers and tables can have multiple textual representations, and many heavily-used languages, such as SQL, come in several dialects. These variations can cause data to be misinterpreted, leading to silent data corruption, failure of data processing systems, or even security vulnerabilities. Saggitarius is a new language and system designed to help programmers reason about the format of data, by describing grammatical domains - -that is, sets of context-free grammars that describe the many possible representations of a datatype. We describe the design of Saggitarius via example and provide a relational semantics. We show how Saggitarius may be used to analyze a data set: given example data, it uses an algorithm based on semi-ring parsing and MaxSAT to infer which grammar in a given domain best matches that data. We evaluate the effectiveness of the algorithm on a benchmark suite of 110 example problems, and we demonstrate that our system typically returns a satisfying grammar within a few seconds with only a small number of examples. We also delve deeper into a more extensive case study on using Saggitarius for CSV dialect detection. Despite being general-purpose, we find that Saggitarius offers comparable results to hand-tuned, specialized tools; in the case of CSV, it infers grammars for 84% of benchmarks within 60 seconds, and has comparable accuracy to custom-built dialect detection tools.
KW - Grammar Induction
KW - Semiring Parsing
KW - Syntax-Guided Synthesis
UR - http://www.scopus.com/inward/record.url?scp=85175013981&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85175013981&partnerID=8YFLogxK
U2 - 10.1145/3622869
DO - 10.1145/3622869
M3 - Article
SN - 2475-1421
VL - 7
JO - Proceedings of the ACM on Programming Languages
JF - Proceedings of the ACM on Programming Languages
IS - OOPSLA2
M1 - 293
ER -