mirror of https://github.com/apache/datafusion-sqlparser-rs.git synced 2025-10-09 21:42:05 +00:00

Extensible SQL Lexer and Parser for Rust

Nickolay Ponomarev 9a8b6a8e64 Rework keyword/identifier parsing (1/8)		2019-01-31 03:57:16 +03:00

Fold Token::{Keyword, Identifier, DoubleQuotedString} into one Token::SQLWord, which has the necessary information (was it a known keyword and/or was it quoted).

This lets the parser easily accept DoubleQuotedString (a quoted identifier) everywhere it expects an Identifier in the same match arm. (To complete support of quoted identifiers, or "delimited identifiers" as the spec calls them, a TODO in parse_tablename() ought to be addressed.)

As an aside, per <https://en.wikibooks.org/wiki/SQL_Dialects_Reference/Data_structure_definition/Delimited_identifiers> sqlite seems to be the only one supporting 'identifier' (which is rather hairy, since it can also be a string literal), and `identifier` seems only to be supported by MySQL. I didn't implement either one.

This also allows the use of `parse`/`expect_keyword` machinery for non-reserved keywords: previously they relied on the keyword being a Token::Keyword, which wasn't a Token::Identifier, and so wasn't accepted as one.

Now whether a keyword can be used as an identifier can be decided by the parser. (I didn't add a blacklist of "reserved" keywords, so that any keyword which doesn't have a special meaning in the parser could be used as an identifier. The list of keywords in the dialect could be re-used for that purpose at a later stage.)

docs	Update docs on writing custom parsers	2018-09-08 07:29:34 -06:00
examples	tokenizer delegates to dialect now	2018-09-08 14:49:25 -06:00
src	Rework keyword/identifier parsing (1/8)	2019-01-31 03:57:16 +03:00
tests	Rework keyword/identifier parsing (1/8)	2019-01-31 03:57:16 +03:00
.gitignore	roughing out classic pratt parser	2018-02-08 07:49:24 -07:00
.travis.yml	add travis build script	2018-09-03 11:03:04 -06:00
Cargo.toml	(cargo-release) start next development iteration 0.2.2-alpha.0	2019-01-13 09:19:46 -07:00
LICENSE.TXT	replace with code from datafusion	2018-09-03 09:56:39 -06:00
README.md	Update README	2019-01-12 10:00:00 -07:00

README.md

Extensible SQL Lexer and Parser for Rust

The goal of this project is to build a SQL lexer and parser that can parse SQL conforming to the ANSI SQL:2011 standard, while also making it easy to support custom dialects, so that this crate can serve as a foundation for vendor-specific parsers.

This parser is currently being used by the DataFusion query engine and LocustDB.

Example

The current code is capable of parsing some trivial SELECT and CREATE TABLE statements.

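// Module paths below assume the crate layout at the time of this README.
use sqlparser::dialect::GenericSqlDialect;
use sqlparser::sqlparser::*;

let sql = "SELECT a, b, 123, myfunc(b) \
           FROM table_1 \
           WHERE a > b AND b < 100 \
           ORDER BY a DESC, b";

let dialect = GenericSqlDialect {}; // or AnsiSqlDialect, or your own dialect ...

// parse_sql returns a Result; unwrap() panics if the SQL cannot be parsed.
let ast = Parser::parse_sql(&dialect, sql.to_string()).unwrap();

println!("AST: {:?}", ast);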

This outputs

AST: SQLSelect { projection: [SQLIdentifier("a"), SQLIdentifier("b"), SQLLiteralLong(123), SQLFunction { id: "myfunc", args: [SQLIdentifier("b")] }], relation: Some(SQLIdentifier("table_1")), selection: Some(SQLBinaryExpr { left: SQLBinaryExpr { left: SQLIdentifier("a"), op: Gt, right: SQLIdentifier("b") }, op: And, right: SQLBinaryExpr { left: SQLIdentifier("b"), op: Lt, right: SQLLiteralLong(100) } }), order_by: Some([SQLOrderBy { expr: SQLIdentifier("a"), asc: false }, SQLOrderBy { expr: SQLIdentifier("b"), asc: true }]), group_by: None, having: None, limit: None }

Design

This parser is implemented using the Pratt Parser design, which is a top-down operator-precedence parser.

I am a fan of this design pattern over parser generators for the following reasons:

- Code is simple to write and can be concise and elegant (unfortunately, this is far from true of the current implementation, but I hope to fix that using some macros)
- Performance is generally better than code generated by parser generators
- Debugging is much easier with hand-written code
- It is far easier to extend and to add dialect-specific extensions than with a parser generator
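
To make the Pratt design mentioned above more concrete, here is a minimal, self-contained sketch of top-down operator-precedence parsing. It is not taken from this crate: the names `Tok`, `Expr`, `PrattParser`, `precedence`, and `parse`, the precedence values, and the whitespace-only tokenizer are all assumptions made for illustration.

```rust
// Minimal Pratt (top-down operator-precedence) parser sketch.
// Every name and precedence value here is an illustrative assumption,
// not code from this crate.

#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Num(i64),
    Ident(String),
    Op(String),
}

#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Ident(String),
    Binary { left: Box<Expr>, op: String, right: Box<Expr> },
}

/// Binding power of each infix operator; higher binds tighter.
fn precedence(op: &str) -> Option<u8> {
    match op {
        "OR" => Some(1),
        "AND" => Some(2),
        "<" | ">" | "=" => Some(3),
        "+" | "-" => Some(4),
        "*" | "/" => Some(5),
        _ => None,
    }
}

/// Toy tokenizer: tokens must be separated by whitespace.
fn tokenize(input: &str) -> Vec<Tok> {
    input
        .split_whitespace()
        .map(|w| {
            if let Ok(n) = w.parse::<i64>() {
                Tok::Num(n)
            } else if precedence(&w.to_uppercase()).is_some() {
                Tok::Op(w.to_uppercase())
            } else {
                Tok::Ident(w.to_string())
            }
        })
        .collect()
}

struct PrattParser {
    tokens: Vec<Tok>,
    pos: usize,
}

impl PrattParser {
    fn new(input: &str) -> Self {
        PrattParser { tokens: tokenize(input), pos: 0 }
    }

    /// Parse an expression, consuming only infix operators that bind
    /// tighter than `min_prec`. Equal precedence stops the loop, which
    /// makes every operator left-associative.
    fn parse_expr(&mut self, min_prec: u8) -> Result<Expr, String> {
        let mut left = self.parse_prefix()?;
        loop {
            let (op, prec) = match self.tokens.get(self.pos) {
                Some(Tok::Op(op)) => (op.clone(), precedence(op).unwrap_or(0)),
                Some(other) => return Err(format!("expected operator, found {:?}", other)),
                None => break,
            };
            if prec <= min_prec {
                break;
            }
            self.pos += 1; // consume the operator
            let right = self.parse_expr(prec)?;
            left = Expr::Binary { left: Box::new(left), op, right: Box::new(right) };
        }
        Ok(left)
    }

    /// Parse an operand: a number or an identifier.
    fn parse_prefix(&mut self) -> Result<Expr, String> {
        let tok = self.tokens.get(self.pos).cloned();
        self.pos += 1;
        match tok {
            Some(Tok::Num(n)) => Ok(Expr::Num(n)),
            Some(Tok::Ident(name)) => Ok(Expr::Ident(name)),
            other => Err(format!("expected operand, found {:?}", other)),
        }
    }
}

/// Render the tree fully parenthesized so the grouping is visible.
fn show(e: &Expr) -> String {
    match e {
        Expr::Num(n) => n.to_string(),
        Expr::Ident(s) => s.clone(),
        Expr::Binary { left, op, right } => format!("({} {} {})", show(left), op, show(right)),
    }
}

fn parse(input: &str) -> Result<String, String> {
    PrattParser::new(input).parse_expr(0).map(|e| show(&e))
}

fn main() {
    println!("{}", parse("a > b AND b < 100").unwrap()); // ((a > b) AND (b < 100))
    println!("{}", parse("1 + 2 * 3 - 4").unwrap()); // ((1 + (2 * 3)) - 4)
}
```

In this sketch, extending the grammar means adding an arm to `precedence` or `parse_prefix`, which is the kind of hand-written extensibility the list above argues for.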

Supporting custom SQL dialects

This is a work in progress, but I have started some notes on writing a custom SQL parser.
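
As a rough illustration of what a dialect hook can look like, the sketch below lets a dialect decide which words are keywords and which characters may appear in identifiers. The `Dialect` trait defined here, along with `MyDialect`, `Word`, and `classify`, is a hypothetical stand-in written for this example; it should not be read as this crate's actual API, so consult the notes in `docs` for the real interface.

```rust
// Hypothetical sketch of dialect-driven word classification.
// `Dialect`, `MyDialect`, `Word`, and `classify` are illustrative names
// defined locally; they are assumptions, not this crate's API.

trait Dialect {
    /// Upper-case keywords this dialect recognizes.
    fn keywords(&self) -> Vec<&'static str>;
    /// Whether `ch` may start an unquoted identifier.
    fn is_identifier_start(&self, ch: char) -> bool;
    /// Whether `ch` may continue an unquoted identifier.
    fn is_identifier_part(&self, ch: char) -> bool;
}

/// An invented vendor dialect that also allows `@name` identifiers.
struct MyDialect;

impl Dialect for MyDialect {
    fn keywords(&self) -> Vec<&'static str> {
        vec!["SELECT", "FROM", "WHERE", "LIMIT"]
    }

    fn is_identifier_start(&self, ch: char) -> bool {
        ch.is_ascii_alphabetic() || ch == '_' || ch == '@'
    }

    fn is_identifier_part(&self, ch: char) -> bool {
        self.is_identifier_start(ch) || ch.is_ascii_digit()
    }
}

#[derive(Debug, PartialEq)]
enum Word {
    Keyword(String),
    Identifier(String),
    Invalid(String),
}

/// Ask the dialect whether `word` is a keyword, an identifier, or neither.
fn classify(dialect: &dyn Dialect, word: &str) -> Word {
    let upper = word.to_uppercase();
    if dialect.keywords().iter().any(|k| *k == upper) {
        return Word::Keyword(upper);
    }
    let mut chars = word.chars();
    let valid = match chars.next() {
        Some(first) => {
            dialect.is_identifier_start(first) && chars.all(|c| dialect.is_identifier_part(c))
        }
        None => false,
    };
    if valid {
        Word::Identifier(word.to_string())
    } else {
        Word::Invalid(word.to_string())
    }
}

fn main() {
    let dialect = MyDialect;
    for w in &["select", "table_1", "@user_id", "9lives"] {
        println!("{:?}", classify(&dialect, w));
    }
}
```

Keeping these decisions behind a trait means a vendor-specific parser can swap in its own rules without touching the shared tokenizing logic.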

Contributing

Contributors are welcome! Please see the current issues and feel free to file more!