Parsing Program Source
Emacs provides various ways to parse program source text and produce a syntax tree. In a syntax tree, text is no longer considered a one-dimensional stream of characters, but a structured tree of nodes, where each node represents a piece of text. Thus, a syntax tree can enable interesting features like precise fontification, indentation, navigation, structured editing, etc. Emacs has a simple facility for parsing balanced expressions (Parsing Expressions). There is also the SMIE library for generic navigation and indentation (SMIE). In addition to those, Emacs also provides integration with the tree-sitter library if support for it was compiled in. The tree-sitter library implements an incremental parser and has support for a wide range of programming languages.
-
treesit-available-p - This function returns non-nil if tree-sitter features are available for the current Emacs session.
To be able to parse the program source using the tree-sitter library and access the syntax tree of the program, a Lisp program needs to load a language grammar library, and create a parser for that language and the current buffer. After that, the Lisp program can query the parser about specific nodes of the syntax tree. Then, it can access various kinds of information about each node, and search for nodes using a powerful pattern-matching syntax. This chapter explains how to do all this, and also how a Lisp program can work with source files that mix multiple programming languages.
Tree-sitter Language Grammar
Loading a language grammar
Tree-sitter relies on a language grammar to parse text in that language. In Emacs, a language grammar is represented by a symbol. For example, the C language grammar is represented as the symbol c, and c can be passed to tree-sitter functions as the language argument. Tree-sitter language grammars are distributed as dynamic libraries. In order to use a language grammar in Emacs, you need to make sure that the dynamic library is installed on the system. Emacs looks for language grammars in several places, in the following order:
- first, in the list of directories specified by the variable treesit-extra-load-path;
- then, in the tree-sitter subdirectory of the directory specified by user-emacs-directory (Init File);
- and finally, in the system's default locations for dynamic libraries.
In each of these directories, Emacs looks for a file with file-name extensions specified by the variable dynamic-library-suffixes. If Emacs cannot find the library or has problems loading it, Emacs signals the treesit-load-language-error error. The data of that signal could be one of the following:
-
(not-found ERROR-MSG ...) - This means that Emacs could not find the language grammar library.
-
(symbol-error ERROR-MSG) - This means that Emacs could not find in the library the expected function that every language grammar library should export.
-
(version-mismatch ERROR-MSG) - This means that the version of the language grammar library is incompatible with that of the tree-sitter library.
In all of these cases, error-msg might provide additional details about the failure.
-
treesit-language-available-p - This function returns non-nil if the language grammar for language exists and can be loaded. If detail is non-nil, return (t . nil) when language is available, and (nil . DATA) when it's unavailable. data is the signal data of treesit-load-language-error.
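As a rough illustration of the checks described above, a Lisp program might report whether a grammar can be loaded before relying on it. This is only a sketch: the helper name my-treesit-grammar-status is hypothetical and not part of Emacs, and the language c is used merely as an example.

```elisp
;; Hypothetical helper (not part of Emacs): describe whether the
;; grammar for LANGUAGE can be loaded, and why not if it cannot.
(defun my-treesit-grammar-status (language)
  "Return a string describing whether LANGUAGE's grammar can be loaded."
  (if (not (and (fboundp 'treesit-available-p) (treesit-available-p)))
      "tree-sitter support is not available in this Emacs"
    ;; With a non-nil DETAIL argument the result is a cons cell:
    ;; (t . nil) on success, (nil . DATA) on failure.
    (let ((result (treesit-language-available-p language t)))
      (if (car result)
          (format "grammar for `%s' is available" language)
        (format "grammar for `%s' is unavailable: %S"
                language (cdr result))))))

(message "%s" (my-treesit-grammar-status 'c))
```

When the grammar is missing, the DATA part should carry one of the signal-data forms listed earlier, such as (not-found ERROR-MSG ...).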
By convention, the file name of the dynamic library for language is libtree-sitter-LANGUAGE.EXT, where ext is the system-specific extension for dynamic libraries. Also by convention, the function provided by that library is named tree_sitter_LANGUAGE. If a language grammar library doesn't follow this convention, you should add an entry
(LANGUAGE LIBRARY-BASE-NAME FUNCTION-NAME)
to the list in the variable treesit-load-name-override-list, where library-base-name is the basename of the dynamic library's file name (usually, libtree-sitter-LANGUAGE), and function-name is the function provided by the library (usually, tree_sitter_LANGUAGE). For example,
(cool-lang "libtree-sitter-coool" "tree_sitter_cooool")
for a language that considers itself too "cool" to abide by conventions.
-
treesit-library-abi-version - This function returns the version of the language grammar Application Binary Interface (ABI) supported by the tree-sitter library. By default, it returns the latest ABI version supported by the library, but if min-compatible is non-nil, it returns the oldest ABI version which the library still can support. Language grammar libraries must be built for ABI versions between the oldest and the latest versions supported by the tree-sitter library, otherwise the library will be unable to load them.
-
treesit-language-abi-version - This function returns the ABI version of the language grammar library loaded by Emacs for language. If language is unavailable, this function returns nil.
-
treesit-language-display-name - This function translates language to an appropriate display name. For example, it translates ruby to "Ruby" and cpp to "C++". Most languages have "regular" names, and their display name is simply the symbol name with the first letter capitalized. For languages that have "irregular" names, treesit-language-display-name-alist maps language symbols to their display names. If a major mode package uses a language with an "irregular" name, it should add a mapping to treesit-language-display-name-alist on load.
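The ABI functions above can be combined into a small diagnostic. The following is a minimal sketch; my-treesit-abi-report is a hypothetical name introduced here for illustration.

```elisp
;; Hypothetical helper (not part of Emacs): collect the ABI versions
;; relevant to LANGUAGE in one place.
(defun my-treesit-abi-report (language)
  "Return (OLDEST GRAMMAR NEWEST) ABI versions for LANGUAGE, or nil.
OLDEST and NEWEST bound what the tree-sitter library supports.
GRAMMAR is the ABI version of the grammar loaded for LANGUAGE, or
nil when that grammar is unavailable.  Return nil when tree-sitter
support itself is unavailable."
  (when (and (fboundp 'treesit-available-p) (treesit-available-p))
    (list (treesit-library-abi-version t)          ; oldest supported
          (treesit-language-abi-version language)  ; grammar's own ABI
          (treesit-library-abi-version))))         ; latest supported

(my-treesit-abi-report 'c)
```

A grammar that loads successfully should have its version between the first and last elements of the returned list.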
Concrete syntax tree
A syntax tree is what a parser generates. In a syntax tree, each node represents a piece of text, and is connected to each other by a parent-child relationship. For example, if the source text is
1 + 2
its syntax tree could be
              +--------------+
              | root "1 + 2" |
              +--------------+
                     |
    +--------------------------------+
    |       expression "1 + 2"       |
    +--------------------------------+
       |             |            |
+------------+  +--------------+  +------------+
| number "1" |  | operator "+" |  | number "2" |
+------------+  +--------------+  +------------+
We can also represent it as an s-expression:
(root (expression (number) (operator) (number)))
Node types
Names like root, expression, number, and operator specify the type of the nodes. However, not all nodes in a syntax tree have a type. Nodes that don't have a type are known as anonymous nodes, and nodes with a type are named nodes. Anonymous nodes are tokens with fixed spellings, including punctuation characters like bracket ], and keywords like return.
Field names
To make the syntax tree easier to analyze, many language grammars assign field names to child nodes. For example, a function_definition node could have a declarator and a body:
(function_definition declarator: (declaration) body: (compound_statement))
Exploring the syntax tree
To aid in understanding the syntax of a language and in debugging Lisp programs that use the syntax tree, Emacs provides an "explore" mode, which displays the syntax tree of the source in the current buffer in real time. Emacs also comes with an "inspect mode", which displays information of the nodes at point in the mode-line.
-
Command treesit-explore - This command pops up a window displaying the syntax tree of the source in the current buffer. Selecting text in the source buffer highlights the corresponding nodes in the syntax tree display. Clicking on nodes in the syntax tree highlights the corresponding text in the source buffer. To switch to another parser, use treesit-explorer-switch-parser.
-
Command treesit-inspect-mode - This minor mode displays on the mode-line the node that starts at point. For example, the mode-line can display
PARENT FIELD: (NODE (CHILD (...)))
where node, child, etc., are nodes which begin at point. parent is the parent of node. node is displayed in a bold typeface. field-names are the field names of node and of child, etc. If no node starts at point, i.e., point is in the middle of a node, then the mode line displays the earliest node that spans point, and its immediate parent. This minor mode doesn't create parsers on its own. It uses the first parser in (treesit-parser-list) (Using Parser).
Reading the grammar definition
Authors of language grammars define the grammar of a programming language, which determines how a parser constructs a concrete syntax tree out of the program text. In order to use the syntax tree effectively, you need to consult the grammar file. The grammar file is usually grammar.js in a language grammar's project repository. The link to a language grammar's home page can be found on tree-sitter's homepage. The grammar definition is written in JavaScript. For example, the rule matching a function_definition node may look like
function_definition: $ => seq(
  $.declaration_specifiers,
  field('declarator', $.declaration),
  field('body', $.compound_statement)
)
The rules are represented by functions that take a single argument $, representing the whole grammar. The function itself is constructed by other functions: the seq function puts together a sequence of children; the field function annotates a child with a field name. If we write the above definition in the so-called Backus-Naur Form (BNF) syntax, it would look like
function_definition := <declaration_specifiers> <declaration> <compound_statement>
and the node returned by the parser would look like
(function_definition (declaration_specifier) declarator: (declaration) body: (compound_statement))
Below is a list of functions that one can see in a grammar definition. Each function takes other rules as arguments and returns a new rule.
-
seq(RULE1, RULE2, ...) - matches each rule one after another.
-
choice(RULE1, RULE2, ...) - matches one of the rules in its arguments.
-
repeat(RULE) - matches rule zero or more times. This is like the * operator in regular expressions.
-
repeat1(RULE) - matches rule one or more times. This is like the + operator in regular expressions.
-
optional(RULE) - matches rule zero or one times. This is like the ? operator in regular expressions.
-
field(NAME, RULE) - assigns field name name to the child node matched by rule.
-
alias(RULE, ALIAS) - makes nodes matched by rule appear as alias in the syntax tree generated by the parser. For example, alias(preprocessor_call_exp, call_expression) makes any node matched by preprocessor_call_exp appear as call_expression.
Below are grammar functions of lesser importance for reading a language grammar.
-
token(RULE) - marks rule to produce a single leaf node. That is, instead of generating a parent node with individual child nodes under it, everything is combined into a single leaf node (Retrieving Nodes).
-
token.immediate(RULE) - Normally, grammar rules ignore preceding whitespace; this changes rule to match only when there is no preceding whitespace.
-
prec(N, RULE) - gives rule the level-n precedence.
-
prec.left([N,] RULE) - marks rule as left-associative, optionally with level n.
-
prec.right([N,] RULE) - marks rule as right-associative, optionally with level n.
-
prec.dynamic(N, RULE) - this is like prec, but the precedence is applied at runtime instead.
The documentation of the tree-sitter project has more about writing a grammar. Read especially "The Grammar DSL" section.
Using Tree-sitter Parser
This section describes how to create and configure a tree-sitter parser. In Emacs, each tree-sitter parser is associated with a buffer. As the user edits the buffer, the associated parser and syntax tree are automatically kept up-to-date.
-
treesit-max-buffer-size - This variable contains the maximum size of buffers in which tree-sitter can be activated. Major modes should check this value when deciding whether to enable tree-sitter features.
-
treesit-languages-require-line-column-tracking - Emacs by default doesn't keep track of line and column numbers for positions in a buffer. However, some language grammars utilize the line and column information for parsing. If parsers of these languages are created in a buffer, Emacs will turn on line and column tracking and report this information to these parsers. Once the buffer starts tracking line and column, it never stops doing so. And once a parser is created as tracking/not-tracking line and column, it stays that way regardless of changes to this variable. This variable is a list of languages that require line and column tracking. The vast majority of languages don't need line and column information. So far, only Haskell is known to need it. Users can use treesit-tracking-line-column-p and treesit-parser-tracking-line-column-p to check if a buffer or parser is tracking line and column, respectively.
-
treesit-parser-create - Create a parser for the specified buffer and language (Language Grammar), with tag. If buffer is omitted or nil, it stands for the current buffer. By default, this function reuses a parser if one already exists for language with tag in buffer, but if no-reuse is non-nil, this function always creates a new parser. tag can be any symbol except t, and defaults to nil. Different parsers can have the same tag.
Given a parser, we can query information about it.
-
treesit-parser-buffer - This function returns the buffer associated with parser.
-
treesit-parser-language - This function returns the language used by parser.
-
treesit-parser-p - This function checks if object is a tree-sitter parser, and returns non-nil if it is, and nil otherwise.
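The functions above might be exercised as follows. This is a sketch under the assumption that the C grammar is installed; the function name my-treesit-parser-demo is hypothetical, and the optional tag argument is deliberately left out.

```elisp
;; Hypothetical demo (not part of Emacs); assumes a C grammar.
(defun my-treesit-parser-demo ()
  "Create a C parser in a temporary buffer and inspect it.
Return nil when tree-sitter or the C grammar is unavailable."
  (when (and (fboundp 'treesit-available-p) (treesit-available-p)
             (treesit-language-available-p 'c))
    (with-temp-buffer
      (insert "int main(void) { return 0; }\n")
      (let ((parser (treesit-parser-create 'c)))
        (list (treesit-parser-p parser)
              ;; The parser belongs to the buffer it was created in.
              (eq (treesit-parser-buffer parser) (current-buffer))
              (treesit-parser-language parser)
              ;; By default a second request reuses the same parser.
              (eq parser (treesit-parser-create 'c)))))))

(my-treesit-parser-demo)
```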
There is no need to explicitly parse a buffer, because parsing is done automatically and lazily. A parser only parses when a Lisp program queries for a node in its syntax tree. Therefore, when a parser is first created, it doesn't parse the buffer; it waits until the Lisp program queries for a node for the first time. Similarly, when some change is made in the buffer, a parser doesn't re-parse immediately. When a parser does parse, it checks for the size of the buffer. Tree-sitter can only handle buffers no larger than about 4GB. If the size exceeds that, Emacs signals the treesit-buffer-too-large error with signal data being the buffer size. Once a parser is created, Emacs automatically adds it to the internal parser list. Every time a change is made to the buffer, Emacs updates parsers in this list so they can update their syntax tree incrementally.
-
treesit-parser-list - This function returns the parser list of buffer, filtered by language and tag. If buffer is nil or omitted, it defaults to the current buffer. If language is non-nil, only include parsers for that language. Only parsers with tag are included; tag defaults to nil. If tag is t, include parsers in the returned list regardless of their tag.
-
treesit-parser-delete - This function deletes parser.
Normally, a parser "sees" the whole buffer, but when the buffer is narrowed (Narrowing), the parser will only see the accessible portion of the buffer. As far as the parser can tell, the hidden region was deleted. When the buffer is later widened, the parser thinks text is inserted at the beginning and at the end. Although parsers respect narrowing, modes should not use narrowing as a means to handle a multi-language buffer; instead, set the ranges in which the parser should operate (Multiple Languages). Because a parser parses lazily, when the user or a Lisp program narrows the buffer, the parser is not affected immediately; as long as the mode doesn't query for a node while the buffer is narrowed, the parser is oblivious to the narrowing. Besides creating a parser for a buffer, a Lisp program can also parse a string. Unlike a buffer, parsing a string is a one-off operation, and there is no way to update the result.
-
treesit-parse-string - This function parses string using language, and returns the root node of the generated syntax tree. Do not use this function in a loop: this is a convenience function intended for one-off use, and it isn't optimized; for heavy workload, use a temporary buffer instead.
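For a one-off parse, something along these lines may suffice. This is a minimal sketch; my-treesit-string-root-type is a hypothetical helper, and treesit-node-type is the accessor that returns a node's type as a string.

```elisp
;; Hypothetical helper (not part of Emacs): parse a string once and
;; look at the root node of the result.
(defun my-treesit-string-root-type (string language)
  "Parse STRING with LANGUAGE and return the type of the root node.
Return nil when tree-sitter or the grammar for LANGUAGE is unavailable."
  (when (and (fboundp 'treesit-available-p) (treesit-available-p)
             (treesit-language-available-p language))
    (treesit-node-type (treesit-parse-string string language))))

;; One-off use only; for repeated parsing prefer a temporary buffer.
(my-treesit-string-root-type "int x = 1;" 'c)
```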
Be notified by changes to the parse tree
A Lisp program might want to be notified of text affected by incremental parsing. For example, inserting a comment-closing token converts text before that token into a comment. Even though the text is not directly edited, it is deemed to be "changed" nevertheless. Emacs lets a Lisp program register callback functions (a.k.a. notifiers) for these kinds of changes. A notifier function takes two arguments: ranges and parser. ranges is a list of cons cells of the form (START . END), where start and end mark the start and the end positions of a range. parser is the parser issuing the notification. Every time a parser reparses a buffer, it compares the old and new parse-tree, computes the ranges in which nodes have changed, and passes the ranges to notifier functions. Note that the initial parse is also considered a "change", so notifier functions are called on the initial parse, with range being the whole buffer.
-
treesit-parser-add-notifier - This function adds function to parser's list of after-change notifier functions. function must be a function symbol, not a lambda function (Anonymous Functions).
-
treesit-parser-remove-notifier - This function removes function from the list of parser's after-change notifier functions. function must be a function symbol, rather than a lambda function.
-
treesit-parser-notifiers - This function returns the list of parser's notifier functions.
A Lisp program can also choose to force a parser to reparse and get the changed regions immediately with treesit-parser-changed-regions.
-
treesit-parser-changed-regions - This function forces parser to reparse, and returns the affected regions: a list of (START . END). If the parser has nothing new to reparse, or the affected regions are empty, this function returns nil.
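Registering a notifier could look roughly like this. The sketch assumes a C grammar is installed; my-treesit-last-ranges, my-treesit-record-changes, and my-treesit-notifier-demo are hypothetical names. Note that the notifier is passed as a function symbol, as required.

```elisp
;; Hypothetical names throughout; none of these are part of Emacs.
(defvar my-treesit-last-ranges nil
  "Ranges most recently passed to `my-treesit-record-changes'.")

(defun my-treesit-record-changes (ranges parser)
  "Remember RANGES, a list of (START . END) reported by PARSER."
  (setq my-treesit-last-ranges ranges)
  (message "Parser for %s reported changes in %S"
           (treesit-parser-language parser) ranges))

(defun my-treesit-notifier-demo ()
  "Add a notifier to a C parser and return the parser's notifiers.
Return nil when tree-sitter or the C grammar is unavailable."
  (when (and (fboundp 'treesit-available-p) (treesit-available-p)
             (treesit-language-available-p 'c))
    (with-temp-buffer
      (insert "int x = 1;\n")
      (let ((parser (treesit-parser-create 'c)))
        (treesit-parser-add-notifier parser 'my-treesit-record-changes)
        ;; Parsing is lazy; asking for the root node triggers a parse.
        (treesit-parser-root-node parser)
        (prog1 (treesit-parser-notifiers parser)
          (treesit-parser-remove-notifier
           parser 'my-treesit-record-changes))))))

(my-treesit-notifier-demo)
```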
Substitute parser for another language
Sometimes, a grammar for language B is a strict superset of the grammar of another language A. Then it makes sense to reuse configurations (font-lock rules, indentation rules, etc.) of language A for language B. For that purpose, treesit-language-remap-alist allows users to remap language A into language B.
-
treesit-language-remap-alist - The value of this variable should be an alist of (LANGUAGE-A . LANGUAGE-B). When such a pair exists in the alist, creating a parser for language-a actually creates a parser for language-b. By extension, anything that creates a node or makes a query of language-a will be redirected to use language-b instead. This mapping is completely transparent: the parser created by treesit-parser-create reports whatever language was given to it, and the same goes for nodes created by this parser. For example, if language cpp is mapped to cuda:
(setq treesit-language-remap-alist '((cpp . cuda)))
(treesit-parser-language (treesit-parser-create 'cpp)) => 'cpp
(treesit-parser-language (treesit-parser-create 'cuda)) => 'cuda
Each parser reports the language it was created for, even though both parsers are actually cuda parsers.
Retrieving Nodes
Here are some terms and conventions we use when documenting tree-sitter functions. A node in a syntax tree spans some portion of the program text in the buffer. We say that a node is "smaller" or "larger" than another if it spans, respectively, a smaller or larger portion of buffer text than the other node. Since nodes that are deeper ("lower") in the tree are children of the nodes that are "higher" in the tree, it follows that a lower node will always be smaller than a node that is higher in the node hierarchy. A node that is higher up in the syntax tree contains one or more smaller nodes as its children, and therefore spans a larger portion of buffer text. When a function cannot find a node, it returns nil. For convenience, all functions that take a node as argument and return a node, also accept the node argument of nil and in that case just return nil. Nodes are not automatically updated when the associated buffer is modified, and there is no way to update a node once it is retrieved. Using an outdated node signals the treesit-node-outdated error. The printed representation of a tree-sitter node uses the hash notation described in Printed Representation. It looks like #<treesit-node TYPE in POS1-POS2>, where type is the type of the node (which comes from the tree-sitter grammar used by the buffer), and pos1 and pos2 are buffer positions of the node's span. Tree-sitter nodes have no read syntax.
Retrieving nodes from syntax tree
-
treesit-node-at - This function returns a leaf node at buffer position pos. A leaf node is a node that doesn't have any child nodes. This function tries to return a node whose span covers pos: the node's beginning position is less than or equal to pos, and the node's end position is greater than or equal to pos. If no leaf node's span covers pos (e.g., pos is in the whitespace between two leaf nodes), this function returns the first leaf node after pos. Finally, if there is no leaf node after pos, return the first leaf node before pos. If parser-or-lang is a parser object, this function uses that parser; if parser-or-lang is a language, this function uses the first parser for that language in the current buffer, or creates one if none exists; if parser-or-lang is nil, this function tries to guess the language at pos by calling treesit-language-at (Multiple Languages). If this function cannot find a suitable node to return, it returns nil. If named is non-nil, this function looks only for named nodes (named node). Example:
;; Find the node at point in a C parser's syntax tree.
(treesit-node-at (point) 'c)
=> #<treesit-node primitive_type in 23-27>
-
treesit-node-on - This function returns the smallest node that covers the region of buffer text between beg and end. In other words, the start of the node is before or at beg, and the end of the node is at or after end. Beware: calling this function on an empty line that is not inside any top-level construct (function definition, etc.) most probably will give you the root node, because the root node is the smallest node that covers that empty line. Most of the time, you want to use treesit-node-at instead. If parser-or-lang is a parser object, this function uses that parser; if parser-or-lang is a language, this function uses the first parser for that language in the current buffer, or creates one if none exists; if parser-or-lang is nil, this function tries to guess the language at beg by calling treesit-language-at. If named is non-nil, this function looks for a named node only (named node).
-
treesit-parser-root-node - This function returns the root node of the syntax tree generated by parser.
-
treesit-buffer-root-node - This function finds the first parser for language in the current buffer, or creates one if none exists, and returns the root node generated by that parser. If language is omitted, it uses the first parser in the parser list. If it cannot find an appropriate parser, it returns nil.
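To contrast the retrieval functions above, the following sketch fetches a leaf node, a covering node, and the root node from the same buffer. It assumes a C grammar is installed; my-treesit-retrieve-demo is a hypothetical name.

```elisp
;; Hypothetical demo (not part of Emacs); assumes a C grammar.
(defun my-treesit-retrieve-demo ()
  "Return the types of a leaf node, a covering node, and the root.
Return nil when tree-sitter or the C grammar is unavailable."
  (when (and (fboundp 'treesit-available-p) (treesit-available-p)
             (treesit-language-available-p 'c))
    (with-temp-buffer
      (insert "int x = 1;\n")
      (let* ((parser (treesit-parser-create 'c))
             ;; Leaf node whose span covers the first character.
             (leaf (treesit-node-at (point-min) parser))
             ;; Smallest node covering the whole buffer text.
             (covering (treesit-node-on (point-min) (point-max) parser))
             (root (treesit-parser-root-node parser)))
        (mapcar #'treesit-node-type (list leaf covering root))))))

(my-treesit-retrieve-demo)
```

Because the covering region here is the entire buffer, treesit-node-on is expected to return a large node, which illustrates why treesit-node-at is usually the better choice for "the node at point".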
Given a node, a Lisp program can retrieve other nodes starting from it, or query for information about this node.
Retrieving nodes from other nodes
By kinship
-
treesit-node-parent - This function returns the immediate parent of node. If node is more than 1000 levels deep in a parse tree, the return value is undefined. Currently it returns nil, but that could change in the future.
-
treesit-node-child - This function returns the n'th child of node. If named is non-nil, it counts only named nodes (named node). For example, in a node that represents a string "text", there are three child nodes: the opening quote ", the string text text, and the closing quote ". Among these nodes, the first child is the opening quote ", and the first named child is the string text. This function returns nil if there is no n'th child. n could be negative, e.g., −1 represents the last child.
-
treesit-node-children - This function returns all of node's children as a list. If named is non-nil, it retrieves only named nodes.
-
treesit-node-next-sibling - This function finds the next sibling of node. If named is non-nil, it finds the next named sibling.
-
treesit-node-prev-sibling - This function finds the previous sibling of node. If named is non-nil, it finds the previous named sibling.
By field name
To make the syntax tree easier to analyze, many language grammars assign field names to child nodes (field name). For example, a function_definition node could have a declarator child and a body child.
-
treesit-node-child-by-field-name - This function finds the child of node whose field name is field-name, a string.
;; Get the child that has "body" as its field name.
(treesit-node-child-by-field-name node "body")
=> #<treesit-node compound_statement in 45-89>
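Kinship and field-name access can be combined, as in the sketch below. It assumes a C grammar in which a function definition carries a body field, as in the example earlier in this chapter; my-treesit-kinship-demo is a hypothetical name.

```elisp
;; Hypothetical demo (not part of Emacs); assumes a C grammar.
(defun my-treesit-kinship-demo ()
  "Walk from the root to a function body and back to its parent.
Return nil when tree-sitter or the C grammar is unavailable."
  (when (and (fboundp 'treesit-available-p) (treesit-available-p)
             (treesit-language-available-p 'c))
    (with-temp-buffer
      (insert "int main(void) { return 0; }\n")
      (let* ((root (treesit-buffer-root-node 'c))
             ;; First named child of the root node.
             (fn (treesit-node-child root 0 t))
             (body (treesit-node-child-by-field-name fn "body")))
        (list (treesit-node-type fn)
              (and body (treesit-node-type body))
              ;; The parent of the body should be FN again.
              (and body (treesit-node-eq (treesit-node-parent body) fn))
              ;; Types of all named children of FN.
              (mapcar #'treesit-node-type
                      (treesit-node-children fn t)))))))

(my-treesit-kinship-demo)
```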
By position
-
treesit-node-first-child-for-pos - This function finds the first child of node that extends beyond buffer position pos. "Extends beyond" means the end of the child node is greater than or equal to pos. This function only looks for immediate children of node, and doesn't look in its grandchildren. If named is non-nil, it looks for the first named child (named node).
-
treesit-node-descendant-for-range - This function finds the smallest descendant node of node that spans the region of text between positions beg and end. It is similar to treesit-node-at. If named is non-nil, it looks for the smallest named child.
Searching for node
-
treesit-search-subtree - This function traverses the subtree of node (including node itself), looking for a node for which predicate returns non-nil. predicate is a regexp that is matched against each node's type, or a predicate function that takes a node and returns non-nil if the node matches. predicate can also be a thing symbol or thing definition (User-defined Things). Using an undefined thing doesn't raise an error; the function simply returns nil. This function returns the first node that matches, or nil if none matches predicate. By default, this function only traverses named nodes, but if all is non-nil, it traverses all the nodes. If backward is non-nil, it traverses backwards (i.e., it visits the last child first when traversing down the tree). If depth is non-nil, it must be a number that limits the tree traversal to that many levels down the tree. If depth is nil, it defaults to 1000.
-
treesit-search-forward - Like treesit-search-subtree, this function also traverses the parse tree and matches each node with predicate (except for start), where predicate can be a regexp or a predicate function. predicate can also be a thing symbol or thing definition (User-defined Things). Using an undefined thing doesn't raise an error; the function simply returns nil. For a tree like the one below where start is marked S, this function traverses as numbered from 1 to 12:
              12
              |
     S--------3----------11
     |        |          |
o--o-+--o  1--+--2    6--+-----10
|  |                  |        |
o  o                +-+-+   +--+--+
                    |   |   |  |  |
                    4   5   7  8  9
Note that this function doesn't traverse the subtree of start, and it always traverses leaf nodes first, before moving upwards. Like treesit-search-subtree, this function only searches for named nodes by default, but if all is non-nil, it searches for all nodes. If backward is non-nil, it searches backwards. While treesit-search-subtree traverses the subtree of a node, this function starts with node start and traverses every node that comes after it in the buffer position order, i.e., nodes with start positions greater than the end position of start. In the tree shown above, treesit-search-subtree traverses node S (start) and nodes marked with o, whereas this function traverses the nodes marked with numbers. This function is useful for answering questions like "what is the first node after start in the buffer that satisfies some condition?"
-
treesit-search-forward-goto - This function moves point to the start or end of the next node after node in the buffer that matches predicate. If start is non-nil, stop at the beginning rather than the end of a node. This function guarantees that the matched node it returns makes progress in terms of buffer position: the start/end position of the returned node is always greater than that of node. Arguments predicate, backward, and all are the same as in treesit-search-forward.
-
treesit-induce-sparse-tree - This function creates a sparse tree from root's subtree. It takes the subtree under root, and combs it so only the nodes that match predicate are left. Like previous functions, the predicate can be a regexp string that matches against each node's type, or a function that takes a node and returns non-nil if it matches. predicate can also be a thing symbol or thing definition (User-defined Things). Using an undefined thing doesn't raise an error; the function simply returns nil. For example, given the subtree on the left that consists of both numbers and letters, if predicate is "letter only", the returned tree is the one on the right.
    a                 a                    a
    |                 |                    |
+---+---+         +---+---+            +---+---+
|   |   |         |   |   |            |   |   |
b   1   2         b   |   |            b   c   d
    |   |     =>      |   |      =>        |
    c   +--+          c   +                e
    |   |  |          |   |
 +--+   d  4       +--+   d
 |  |              |
 e  5              e
If process-fn is non-nil, instead of returning the matched nodes, this function passes each node to process-fn and uses the returned value instead. If non-nil, depth limits the number of levels to go down from root. If depth is nil, it defaults to 1000. Each node in the returned tree looks like (TREE-SITTER-NODE . (CHILD ...)). The tree-sitter-node of the root of this tree will be nil if root doesn't match predicate. If no node matches predicate, the function returns nil.
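The search functions in this list might be used together as sketched below. The example assumes a C grammar whose identifier nodes have the type identifier; my-treesit-search-demo is a hypothetical name.

```elisp
;; Hypothetical demo (not part of Emacs); assumes a C grammar.
(defun my-treesit-search-demo ()
  "Search a small C buffer for identifier nodes.
Return nil when tree-sitter or the C grammar is unavailable."
  (when (and (fboundp 'treesit-available-p) (treesit-available-p)
             (treesit-language-available-p 'c))
    (with-temp-buffer
      (insert "int a = 1;\nint b = a;\n")
      (let* ((root (treesit-buffer-root-node 'c))
             ;; Anchored regexp so only the exact type matches.
             (ident "\\`identifier\\'")
             ;; First match inside ROOT's subtree.
             (first-id (treesit-search-subtree root ident))
             ;; Next match after FIRST-ID in buffer order.
             (next-id (and first-id
                           (treesit-search-forward first-id ident)))
             ;; Sparse tree of matches; PROCESS-FN maps each matched
             ;; node to its text.
             (sparse (treesit-induce-sparse-tree
                      root ident
                      (lambda (node) (treesit-node-text node t)))))
        (list (and first-id (treesit-node-text first-id t))
              (and next-id (treesit-node-text next-id t))
              sparse)))))

(my-treesit-search-demo)
```

Since the root node does not itself match the regexp, the car of the returned sparse tree is expected to be nil, with the matches appearing among its children.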
More convenience functions
-
treesit-node-get - This is a convenience function that chains together multiple node accessor functions. For example, to get node's parent's next sibling's second child's text:
(treesit-node-get node
  '((parent 1)
    (sibling 1 nil)
    (child 1 nil)
    (text nil)))
instructions is a list of INSTRUCTIONs of the form (FN ARG...). The following fn's are supported:
-
(child IDX NAMED) - Get the idx'th child.
-
(parent N) - Go to parent n times.
-
(field-name) - Get the field name of the current node.
-
(type) - Get the type of the current node.
-
(text NO-PROPERTY) - Get the text of the current node.
-
(children NAMED) - Get a list of children.
-
(sibling STEP NAMED) - Get the nth prev/next sibling, negative step means prev sibling, positive means next sibling.
Note that arguments like named and no-property can't be omitted, unlike in their original functions.
-
treesit-filter-child - This function finds immediate children of node that satisfy predicate. The predicate function takes a node as argument and should return non-nil to indicate that the node should be kept. If named is non-nil, this function only examines named nodes.
-
treesit-parent-until - This function repeatedly finds the parents of node, and returns the parent that satisfies predicate. predicate can be either a function that takes a node as argument and returns t or nil, or a regexp matching node type names, or other valid predicates described in treesit-thing-settings. If no parent satisfies predicate, this function returns nil. Normally this function only looks at the parents of node but not node itself. But if include-node is non-nil, this function returns node if node satisfies predicate.
-
treesit-parent-while - This function goes up the tree starting from node, and keeps doing so as long as the nodes satisfy predicate, a function that takes a node as argument. That is, this function returns the highest parent of node that still satisfies predicate. Note that if node satisfies predicate but its immediate parent doesn't, node itself is returned.
-
treesit-node-top-level - This function returns the highest parent of node that has the same type as node. If no such parent exists, it returns nil. Therefore this function is also useful for testing whether node is top-level. If predicate is nil, this function uses node's type to find the parent. If predicate is non-nil, this function searches the parent that satisfies predicate. If include-node is non-nil, this function returns node if node satisfies predicate.
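A common use of treesit-parent-until is finding the enclosing construct of a node. The sketch below assumes a C grammar in which function definitions have the type function_definition; my-treesit-enclosing-function-type is a hypothetical name, and the predicate is given as a function.

```elisp
;; Hypothetical demo (not part of Emacs); assumes a C grammar.
(defun my-treesit-enclosing-function-type ()
  "Return the type of the function node enclosing a `return' keyword.
Return nil when tree-sitter or the C grammar is unavailable."
  (when (and (fboundp 'treesit-available-p) (treesit-available-p)
             (treesit-language-available-p 'c))
    (with-temp-buffer
      (insert "int main(void) { return 0; }\n")
      (goto-char (point-min))
      (search-forward "return")
      (let* ((leaf (treesit-node-at (match-beginning 0) 'c))
             ;; Climb until a parent has the wanted type.
             (fn (treesit-parent-until
                  leaf
                  (lambda (node)
                    (equal (treesit-node-type node)
                           "function_definition")))))
        (and fn (treesit-node-type fn))))))

(my-treesit-enclosing-function-type)
```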
Accessing Node Information
Basic information of Node
Every node is associated with a parser, and that parser is associated with a buffer. The following functions retrieve them.
-
treesit-node-parser - This function returns node's associated parser.
-
treesit-node-buffer - This function returns node's parser's associated buffer.
-
treesit-node-language - This function returns node's parser's associated language.
Each node represents a portion of text in the buffer. Functions below find relevant information about that text.
-
treesit-node-start - Return the start position of node.
-
treesit-node-end - Return the end position of node.
-
treesit-node-text - Return the buffer text that node represents, as a string. (If node is retrieved from parsing a string, it will be the text from that string.)
Here are some predicates on tree-sitter nodes:
-
treesit-node-p - Checks if object is a tree-sitter syntax node.
-
treesit-node-eq - Checks if node1 and node2 refer to the same node in a tree-sitter syntax tree. This function uses the same equivalence metric as equal. You can also compare nodes using equal (Equality Predicates).
Property information
In general, nodes in a concrete syntax tree fall into two categories: named nodes and anonymous nodes. Whether a node is named or anonymous is determined by the language grammar (named node). Apart from being named or anonymous, a node can have other properties. A node can be "missing": such nodes are inserted by the parser in order to recover from certain kinds of syntax errors, i.e., something should probably be there according to the grammar, but is not there. This can happen during editing of the program source, when the source is not yet in its final form. A node can be "extra": such nodes represent things like comments, which can appear anywhere in the text. A node can be "outdated", if its parser has reparsed at least once after the node was created. A node "has error" if the text it spans contains a syntax error. It can be that the node itself has an error, or one of its descendants has an error. A node is considered live if its parser is not deleted, and the buffer to which it belongs is a live buffer (Killing Buffers).
-
treesit-node-check - This function returns non-nil if node has the specified property. property can be named, missing, extra, outdated, has-error, or live.
-
treesit-node-type - Named nodes have "types" (node type). For example, a named node can be a string_literal node, where string_literal is its type. The type of an anonymous node is just the text that the node represents; e.g., the type of a = node is just =. This function returns node's type as a string.
Information as a child or parent
-
treesit-node-index - This function returns the index of node as a child node of its parent. If named is non-nil, it only counts named nodes (named node).
-
treesit-node-field-name - A child of a parent node could have a field name (field name). This function returns the field name of node as a child of its parent.
-
treesit-node-field-name-for-child - This function returns the field name of the n'th child of node. It returns nil if there is no n'th child, or the n'th child doesn't have a field name. Note that n counts both named and anonymous children, and n can be negative, e.g., -1 represents the last child.
-
treesit-node-child-count - This function returns the number of children of node. If named is non-nil, it only counts named children (named node).
Convenience functions
-
treesit-node-enclosed-p - This function returns non-nil if smaller is enclosed in larger. smaller and larger can be either a cons (BEG . END) or a node. Return non-nil if larger's start <= smaller's start and smaller's end <= larger's end. If strict is t, compare with < rather than <=. If strict is partial, consider that larger encloses smaller when at least one side is strictly enclosing.
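The comparison rule can be easier to follow when written out. The sketch below restates it for plain (BEG . END) cons cells only; my-range-enclosed-p is a hypothetical name used for illustration and is not how Emacs implements treesit-node-enclosed-p.

```elisp
(defun my-range-enclosed-p (smaller larger &optional strict)
  "Return non-nil if range SMALLER lies within range LARGER.
Both arguments are cons cells of the form (BEG . END).  STRICT is
nil, t, or `partial', following the description above.
Hypothetical illustration, not the Emacs implementation."
  (let ((s-beg (car smaller)) (s-end (cdr smaller))
        (l-beg (car larger))  (l-end (cdr larger)))
    (pcase strict
      ;; Both sides must be strictly inside.
      ('t (and (< l-beg s-beg) (< s-end l-end)))
      ;; Enclosed, and at least one side strictly inside.
      ('partial (and (<= l-beg s-beg s-end l-end)
                     (or (< l-beg s-beg) (< s-end l-end))))
      ;; Plain enclosure; equal boundaries are allowed.
      (_ (<= l-beg s-beg s-end l-end)))))
```

With this definition, a range that shares its start with the larger range, such as (1 . 5) inside (1 . 10), counts as enclosed by default and with partial, but not when strict is t.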
Pattern Matching Tree-sitter Nodes
Tree-sitter lets Lisp programs match patterns using a small declarative language. This pattern matching consists of two steps: first tree-sitter matches a pattern against nodes in the syntax tree, then it captures specific nodes that matched the pattern and returns the captured nodes. We describe first how to write the most basic query pattern and how to capture nodes in a pattern, then the pattern-matching function, and finally the more advanced pattern syntax.
Basic query syntax
A query consists of multiple patterns. Each pattern is an s-expression that matches a certain node in the syntax tree. A pattern has the form (TYPE (CHILD...)). For example, a pattern that matches a binary_expression node that contains number_literal child nodes would look like
(binary_expression (number_literal))
To capture a node using the query pattern above, append @CAPTURE-NAME after the node pattern you want to capture. For example,
(binary_expression (number_literal) @number-in-exp)
captures number_literal nodes that are inside a binary_expression node with the capture name number-in-exp. We can capture the binary_expression node as well, with, for example, the capture name biexp:
(binary_expression (number_literal) @number-in-exp) @biexp
Query function
Now we can introduce the query functions.
-
treesit-query-capture - This function matches patterns in query within node. The argument query can be either an s-expression, a string, or a compiled query object. For now, we focus on the s-expression syntax; string syntax and compiled queries are described at the end of the section. The argument node can also be a parser or a language symbol. A parser means use its root node, a language symbol means find or create a parser for that language in the current buffer, and use the root node. The function returns all the captured nodes in an alist with elements of the form (CAPTURE_NAME . NODE). If node-only is non-nil, it returns the list of nodes instead. By default the entire text of node is searched, but if beg and end are both non-nil, they specify the region of buffer text where this function should match nodes. Any matching node whose span overlaps with the region between beg and end is captured; it doesn't have to be completely contained in the region. If grouped is non-nil, this function returns a grouped list of lists of captured nodes. The grouping is determined by query. Captures in the same match of a pattern in query are grouped together. This function raises the treesit-query-error error if query is malformed. The signal data contains a description of the specific error. You can use treesit-query-validate to validate and debug the query.
For example, suppose node's text is 1 + 2, and query is
(setq query
      '((binary_expression
         (number_literal) @number-in-exp) @biexp))
Matching that query would return
(treesit-query-capture node query)
=> ((biexp . <NODE FOR "1 + 2">)
(number-in-exp . <NODE FOR "1">)
(number-in-exp . <NODE FOR "2">))
As mentioned earlier, query could contain multiple patterns. For example, it could have two top-level patterns:
(setq query
      '((binary_expression) @biexp
        (number_literal) @number))
-
treesit-query-string - This function parses string as language, matches its root node with query, and returns the result.
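To show how the returned alist might be consumed, here is a minimal sketch. The helper my-number-literal-texts is hypothetical, and the node types binary_expression and number_literal are taken from the example above; whether they exist depends on the grammar in use.

```elisp
(require 'treesit nil t)

(defun my-number-literal-texts (node)
  "Return the texts of number literals in binary expressions under NODE.
Hypothetical helper; the node types depend on the grammar in use."
  (mapcar (lambda (capture)
            ;; Each element has the form (CAPTURE-NAME . NODE).
            (treesit-node-text (cdr capture) t))
          (treesit-query-capture
           node
           '((binary_expression (number_literal) @number-in-exp)))))
```

In a buffer with a C parser, (my-number-literal-texts (treesit-buffer-root-node 'c)) would be expected to collect the operands of expressions such as 1 + 2 as strings.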
More query syntax
Besides node type and capture name, tree-sitter's pattern syntax can express anonymous node, field name, wildcard, quantification, grouping, alternation, anchor, and predicate.
Anonymous node
An anonymous node is written verbatim, surrounded by quotes. A pattern matching (and capturing) keyword return would be
"return" @keyword
Wild card
In a pattern, (_) matches any named node, and _ matches any named or anonymous node. For example, to capture any named child of a binary_expression node, the pattern would be
(binary_expression (_) @in-biexp)
Field name
It is possible to capture child nodes that have specific field names. In the pattern below, declarator and body are field names, indicated by the colon following them.
(function_definition declarator: (_) @func-declarator body: (_) @func-body)
It is also possible to capture a node that doesn't have a certain field, say, a function_definition without a body field:
(function_definition !body) @func-no-body
Quantify node
Tree-sitter recognizes quantification operators :*, :+, and :?. Their meanings are the same as in regular expressions: :* matches the preceding pattern zero or more times, :+ matches one or more times, and :? matches zero or one times. For example, the following pattern matches type_declaration nodes that have zero or more long keywords.
(type_declaration "long" :*) @long-type
The following pattern matches a type declaration that may or may not have a long keyword:
(type_declaration "long" :?) @long-type
Grouping
Similar to groups in regular expressions, we can bundle patterns into groups and apply quantification operators to them. For example, to express a comma-separated list of identifiers, one could write
(identifier) ("," (identifier)) :*
Alternation
Again, similar to regular expressions, we can express "match any one of these patterns" in a pattern. The syntax is a vector of patterns. For example, to capture some keywords in C, the pattern would be
[ "return" "break" "if" "else" ] @keyword
Anchor
The anchor operator :anchor can be used to enforce juxtaposition, i.e., to enforce two things to be directly next to each other. The two "things" can be two nodes, or a child and the end of its parent. For example, to capture the first child, the last child, or two adjacent children:
;; Anchor the child with the end of its parent.
(compound_expression (_) @last-child :anchor)
;; Anchor the child with the beginning of its parent.
(compound_expression :anchor (_) @first-child)
;; Anchor two adjacent children.
(compound_expression
 (_) @prev-child
 :anchor
 (_) @next-child)
Note that the enforcement of juxtaposition ignores any anonymous nodes.
Predicate
It is possible to add predicate constraints to a pattern. For example, with the following pattern:
( (array :anchor (_) @first (_) @last :anchor) (:eq? @first @last) )
tree-sitter only matches arrays where the first element is equal to the last element. To attach a predicate to a pattern, we need to group them together. Currently there are three predicates: :eq?, :match?, and :pred?.
-
Predicate :eq? - Matches if arg1 is equal to arg2. Arguments can be either strings or capture names. Capture names represent the text that the captured node spans in the buffer. Note that this is more like equal in Elisp, but eq? is the convention used by tree-sitter. Previously we supported the :equal predicate but it's now considered deprecated.
-
Predicate :match? - Matches if the text that capture-name's node spans in the buffer matches regular expression regexp, given as a string literal. Matching is case-sensitive. The ordering of the arguments doesn't matter. Previously we supported the :match predicate but it's now considered deprecated.
-
Predicate :pred? - Matches if function fn returns non-nil when passed each node in nodes as arguments. The function runs with the current buffer set to the buffer of the node being queried. Be very careful when using this predicate, since it can be expensive when used in a tight loop. Previously we supported the :pred predicate but it's now considered deprecated.
Note that a predicate can only refer to capture names that appear in the same pattern. Indeed, it makes little sense to refer to capture names in other patterns.
String patterns
Besides s-expressions, Emacs allows tree-sitter's native query syntax to be used by writing queries as strings. It largely resembles the s-expression syntax. For example, the following query
(treesit-query-capture
node '((addition_expression
left: (_) @left
"+" @plus-sign
right: (_) @right) @addition
["return" "break"] @keyword))
is equivalent to
(treesit-query-capture
node "(addition_expression
left: (_) @left
\"+\" @plus-sign
right: (_) @right) @addition
[\"return\" \"break\"] @keyword")
Most patterns can be written directly as s-expressions inside a string. Only a few of them need modification:
- Anchor :anchor is written as . (a period).
- :? is written as ?.
- :* is written as *.
- :+ is written as +.
- :eq?, :match? and :pred? are written as #eq?, #match? and #pred?, respectively. In general, predicates change the : to #.
For example,
'(( (compound_expression :anchor (_) @first (_) :* @rest) (:match? "love" @first) ))
is written in string form as
"( (compound_expression . (_) @first (_)* @rest) (#match? \"love\" @first) )"
Compiling queries
If a query is intended to be used repeatedly, especially in tight loops, it is important to compile that query, because a compiled query is much faster than an uncompiled one. A compiled query can be used anywhere a query is accepted.
-
treesit-query-compile - This function compiles query for language into a compiled query object and returns it. By default, Emacs lazily compiles query, meaning query isn't actually compiled until it's used. To compile query immediately, pass non-nil for eager. To tell an actually compiled query apart from one that hasn't been compiled, use treesit-query-eagerly-compiled-p. If query is malformed or language can't be loaded, this function signals the treesit-query-error error, and the signal data contains a description of the specific error. This can only happen when eager is non-nil, since otherwise Emacs doesn't actually compile query. You can use treesit-query-validate to validate and debug the query.
There are some additional functions for queries: treesit-query-language returns the language of a query; treesit-query-source returns the original string or sexp source query of a compiled query; treesit-query-valid-p checks whether a query is valid; treesit-query-expand converts an s-expression query into the string format; and treesit-pattern-expand converts a pattern.
Tree-sitter grammars change over time. To support multiple possible versions of a grammar, a Lisp program can use treesit-query-first-valid to pick the right query to use. For example, if a grammar has a (defun) node in one version, and later renamed it to (function_definition), a Lisp program can use
(treesit-query-first-valid 'lang '((defun) @defun) '((function_definition) @defun))
to support both versions of the grammar. For more details, consider reading the tree-sitter project's documentation about pattern-matching. The documentation can be found at https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries.
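One common way to benefit from compiled queries is to compile once and reuse the result. The following is a minimal sketch of that idea; the variable my-c-keyword-query and the function my-c-keyword-nodes are hypothetical names, and the sketch assumes a C grammar whose anonymous keyword nodes are named as shown.

```elisp
(require 'treesit nil t)

(defvar my-c-keyword-query nil
  "Cached compiled query for a few C keywords (hypothetical variable).")

(defun my-c-keyword-nodes (node)
  "Return nodes under NODE that are one of a few C keywords.
The query is compiled on first use and reused afterwards."
  (unless my-c-keyword-query
    (setq my-c-keyword-query
          (treesit-query-compile
           'c '(["return" "break" "if" "else"] @keyword))))
  ;; A non-nil NODE-ONLY argument returns bare nodes instead of
  ;; (CAPTURE-NAME . NODE) pairs.
  (treesit-query-capture node my-c-keyword-query nil nil t))
```

Because the compiled query is tied to the C language, node is assumed to come from a C parser.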
User-defined "Things" and Navigation
It's often useful to be able to identify and find certain things in a buffer, like function and class definitions, statements, code blocks, strings, comments, etc., in terms of node types defined by the tree-sitter grammar used in the buffer. Emacs allows Lisp programs to define what kinds of tree-sitter nodes correspond to each "thing". This enables handy features like jumping to the next function, marking the code block at point, transposing two function arguments, etc. The "things" feature in Emacs is independent of the pattern matching feature of tree-sitter (Pattern Matching), and comparatively less powerful, but more suitable for navigation and traversing the buffer text in terms of the tree-sitter parse tree. You can define things with treesit-thing-settings, retrieve the predicate of a defined thing with treesit-thing-definition, and test if a thing is defined with treesit-thing-defined-p.
-
treesit-thing-settings - This is an alist of thing definitions for each language supported by the grammar used in a buffer; it should be defined by the buffer's major mode (the default value is nil). The key of each entry is a language symbol (e.g., c for C, cpp for C++, etc.), and the value is a list of thing definitions of the form (THING PRED), where thing is a symbol representing the thing, and pred specifies what kinds of tree-sitter nodes are considered as this thing. The symbol used to define the thing can be anything meaningful for the major mode: defun, defclass, sentence, comment, string, etc. To support tree-sitter based navigation commands (List Motion), the mode should define two things: list and sexp.
pred can be a regexp string that matches the type of the node; it can be a function that takes a node as the argument and returns a boolean that indicates whether the node qualifies as the thing; or it can be a cons (REGEXP . FN), which is a combination of a regular expression regexp and a function fn: the node has to both match regexp and satisfy fn to qualify as the thing.
pred can also be recursively defined. It can be (or PRED...), meaning that satisfying any one of the preds qualifies the node as the thing. It can be (and PRED...), meaning that satisfying all of the preds qualifies the node as the thing. It can be (not PRED), meaning that not satisfying pred qualifies the node. Finally, pred can refer to other things defined in this list. For example, (or sexp sentence) defines something that's either a sexp thing or a sentence thing, as defined by some other rules in the alist. There are two pre-defined predicates: named and anonymous, which qualify, respectively, named and anonymous nodes of the tree-sitter grammar. They can be combined with and to narrow down the match.
Here's an example treesit-thing-settings for C and C++:
((c
  (defun "function_definition")
  (sexp (not "[](),[{}]"))
  (comment "comment")
  (string "raw_string_literal")
  (text (or comment string)))
 (cpp
  (defun ("function_definition" . cpp-ts-mode-defun-valid-p))
  (defclass "class_specifier")
  (comment "comment")))
Note that this example is modified for didactic purposes, and isn't exactly how tree-sitter based C and C++ modes define things. Emacs builtin functions already make use of some thing definitions. Command treesit-forward-sexp uses the sexp definition if the major mode defines it (List Motion); treesit-forward-list, treesit-down-list, treesit-up-list, and treesit-show-paren-data use the list definition (its symbol list has the symbol property treesit-thing-symbol to avoid ambiguity with the function that has the same name); treesit-forward-sentence uses the sentence definition. Defun movement functions like treesit-end-of-defun use the defun definition (the defun definition is overridden by treesit-defun-type-regexp for backward compatibility). Major modes can also define comment, string, and text things (to match comments and strings). The rest of this section lists a few functions that take advantage of the thing definitions. Besides the functions below, some other functions listed elsewhere also utilize the thing feature, e.g., tree-traversing functions like treesit-search-forward, treesit-induce-sparse-tree, etc. (Retrieving Nodes).
-
treesit-node-match-p - This function checks whether node represents a thing. If node represents thing, return non-nil, otherwise return nil. For convenience, if node is nil, this function just returns nil. The thing can be either a thing symbol like defun, or simply a predicate that defines a thing, like "function_definition", or (or comment string). By default, if thing is undefined or malformed, this function signals the treesit-invalid-predicate error. If ignore-missing is t, this function doesn't signal the error when thing is undefined and just returns nil; but it still signals the error if thing is a malformed predicate.
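The following sketch shows one way a mode might install thing definitions and test a node against them. The function names my-c-setup-things and my-node-text-p are hypothetical, the node types are assumptions about the C grammar, and the sketch assumes an Emacs version that provides the thing feature described here.

```elisp
(require 'treesit nil t)

(defun my-c-setup-things ()
  "Install a few illustrative thing definitions for C in this buffer."
  (setq-local treesit-thing-settings
              '((c
                 (defun "function_definition")
                 (comment "comment")
                 (string "string_literal")
                 ;; `text' refers to the two things defined above.
                 (text (or comment string))))))

(defun my-node-text-p (node)
  "Return non-nil if NODE is a `text' thing (a comment or a string).
Return nil, rather than signaling, if `text' is not defined."
  (treesit-node-match-p node 'text t))
```

After calling my-c-setup-things in a buffer with a C parser, (my-node-text-p node) would be expected to return non-nil for a comment node and nil for, say, a type name.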
Functions below are responsible for finding things and moving across them, and they have to deal with the fact that a buffer sometimes contains multiple adjacent or nested parsers. By default, these functions try to be helpful and search in every relevant parser at point, from most specific (deepest embedded) to the least. Lisp programs should be cautious and assess whether this behavior is desired when using these functions as building blocks of other functions; if not, explicitly pass a parser or language.
-
treesit-thing-prev - This function returns the first node before position in the current buffer that is the specified thing. If no such node exists, it returns nil. It's guaranteed that, if a node is returned, the node's end position is less than or equal to position. In other words, this function never returns a node that encloses position. Again, thing can be either a symbol or a predicate. If parser is non-nil, only use that parser's parse tree. Otherwise try each parser covering point, from the most specific (deepest-embedded) to the least specific. If there are multiple parsers with the same embed level at position, which parser is tried first is undefined. If parser is a language symbol, the function limits the parsers it tries to the ones for that language.
-
treesit-thing-next - This function is similar to treesit-thing-prev, only it returns the first node after position that's the thing. It also guarantees that if a node is returned, the node's start position is greater than or equal to position. The parser parameter is the same as in treesit-thing-prev.
-
treesit-navigate-thing - This function builds upon treesit-thing-prev and treesit-thing-next and provides functionality that a navigation command would find useful. It returns the position after moving across arg instances of thing from position. If there aren't enough things to navigate across, it returns nil. The function doesn't move point. A positive arg means moving forward that many instances of thing; negative arg means moving backward. If side is beg, this function returns the position of the beginning of thing; if it's end, it returns the position at the end of thing. Like in treesit-thing-prev, thing can be a thing symbol defined in treesit-thing-settings, or a predicate, and parser can be either nil, a parser, or a language symbol. Like in that function, parser decides which parsers or languages are searched. When there are multiple parsers available, this function tries each until it succeeds. tactic determines how this function moves between things. It can be nested, top-level, restricted, parent-first, or nil. nested or nil means normal nested navigation: first try to move across siblings; if there aren't any siblings left in the current level, move to the parent, then its siblings, and so on. top-level means only navigate across top-level things and ignore nested things. restricted means movement is restricted within the thing that encloses position, if there is such a thing. This tactic is useful for commands that want to stop at the current nesting level and not move up. parent-first means move to the parent if there is one; and move to siblings if there's no parent.
-
treesit-thing-at - This function returns the smallest node that's the thing and encloses position; if there's no such node, it returns nil. The returned node must enclose position, i.e., its start position is less than or equal to position, and its end position is greater than or equal to position. If strict is non-nil, this function uses strict comparison, i.e., start position must be strictly smaller than position, and end position must be strictly greater than position. thing can be either a thing symbol defined in treesit-thing-settings, or a predicate. If parser is non-nil, only use that parser's parse tree. Otherwise try each parser covering point, from the most specific (deepest-embedded) to the least specific. If there are multiple parsers with the same embed level at position, which parser is tried first is undefined. parser can also be a language symbol.
There are also some convenient wrapper functions. treesit-beginning-of-thing moves point to the beginning of a thing, treesit-end-of-thing moves to the end of a thing, and treesit-thing-at-point returns the thing at point. There are also defun commands that specifically use the defun definition (as a fallback of treesit-defun-type-regexp), like treesit-beginning-of-defun, treesit-end-of-defun, and treesit-defun-at-point. In addition, these functions use treesit-defun-tactic as the navigation tactic. They are described in more detail in other sections (Tree-sitter Major Modes).
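As a minimal sketch of how a navigation command might be layered on treesit-navigate-thing, consider the following. The command name my-forward-defun-end is hypothetical, and the sketch assumes the current major mode has defined a defun thing in treesit-thing-settings.

```elisp
(require 'treesit nil t)

(defun my-forward-defun-end (&optional n)
  "Move point to the end of the Nth next `defun' thing, if there is one.
Leave point unchanged when there are not enough things to cross."
  (interactive "p")
  ;; `treesit-navigate-thing' only computes a position; it returns nil
  ;; when it cannot move, and never moves point itself.
  (let ((dest (treesit-navigate-thing (point) (or n 1) 'end 'defun)))
    (when dest
      (goto-char dest))))
```

Passing a negative prefix argument makes the same call search backward, since the sign of arg selects the direction.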
Parsing Text in Multiple Languages
Sometimes, the source of a programming language could contain snippets of other languages; HTML + CSS + JavaScript is one example. In that case, text segments written in different languages need to be assigned different parsers. Traditionally, this is achieved by using narrowing. While tree-sitter works with narrowing (narrowing), the recommended way is instead to specify regions of buffer text (i.e., ranges) in which a parser will operate. This section describes functions for setting and getting ranges for a parser.
Generally when there are multiple languages at play, there is a "primary", or "host" language. The parser for this language, the primary parser, parses the entire buffer. Parsers for other languages are "embedded" or "guest" parsers, which only work on part of the buffer. The parse tree of the primary parser is usually used to determine the ranges in which the embedded parsers operate. Major modes should set treesit-primary-parser to the primary parser before calling treesit-major-mode-setup, so that Emacs can configure the primary parser correctly for font-lock and other features.
Lisp programs should call treesit-update-ranges to make sure the ranges for each parser are correct before using parsers in a buffer, and call treesit-language-at to figure out the language responsible for the text at some position. These two functions don't work by themselves; they need major modes to set treesit-range-settings and optionally treesit-language-at-point-function, which do the actual work. These functions and variables are explained in more detail towards the end of the section. In short, multi-language major modes should set treesit-primary-parser, treesit-range-settings, and optionally treesit-language-at-point-function before calling treesit-major-mode-setup.
Getting and setting ranges
-
treesit-parser-set-included-ranges - This function sets up parser to operate on ranges. The parser will only read the text of the specified ranges. Each range in ranges is a pair of the form (BEG . END). The ranges in ranges must come in order and must not overlap. That is, the following condition must hold:
(require 'cl-lib)
(cl-loop for idx from 1 to (1- (length ranges))
         for prev = (nth (1- idx) ranges)
         for next = (nth idx ranges)
         ;; Each range is well-formed and ends no later than
         ;; the next one begins.
         always (<= (car prev) (cdr prev)
                    (car next) (cdr next)))
If ranges violates this constraint, or something else went wrong, this function signals the treesit-range-invalid error. The signal data contains a specific error message and the ranges we are trying to set. This function can also be used for disabling ranges. If ranges is nil, the parser is set to parse the whole buffer. Example:
(treesit-parser-set-included-ranges parser '((1 . 9) (16 . 24) (24 . 25)))
-
treesit-parser-included-ranges - This function returns the ranges set for parser. The return value is the same as the ranges argument of treesit-parser-set-included-ranges: a list of cons cells of the form (BEG . END). If parser doesn't have any ranges, the return value is nil.
(treesit-parser-included-ranges parser)
=> ((1 . 9) (16 . 24) (24 . 25))
-
treesit-query-range - This function matches source with query and returns the ranges of captured nodes. The return value is a list of cons cells of the form (BEG . END), where beg and end specify the beginning and the end of a region of text. For convenience, source can be a language symbol, a parser, or a node. If it's a language symbol, this function matches in the root node of the first parser using that language; if a parser, this function matches in the root node of that parser; if a node, this function matches in that node. The argument query is the query used to capture nodes (Pattern Matching). The capture names don't matter. The arguments beg and end, if both non-nil, limit the range in which this function queries. Like other query functions, this function raises the treesit-query-error error if query is malformed.
Supporting multiple languages in Lisp programs
It should suffice for general Lisp programs to call the following two functions in order to support program sources that mix multiple languages.
-
treesit-update-ranges - This function updates ranges for parsers in the buffer. It makes sure the parsers' ranges are set correctly between beg and end, according to treesit-range-settings. If omitted, beg defaults to the beginning of the buffer, and end defaults to the end of the buffer. For example, fontification functions use this function before querying for nodes in a region.
-
treesit-language-at - This function returns the language of the text at buffer position pos. Under the hood it calls treesit-language-at-point-function and returns its return value. If treesit-language-at-point-function is nil, this function returns the language of the deepest parser by embed level among parsers returned by treesit-parsers-at. If there is no parser at that buffer position, it returns nil.
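A general Lisp program might combine the two functions along the following lines. This is only a sketch: my-language-at is a hypothetical helper, and it assumes that updating the ranges for a one-character region around the position is sufficient for the caller's purposes.

```elisp
(require 'treesit nil t)

(defun my-language-at (pos)
  "Return the tree-sitter language symbol at POS, or nil.
Hypothetical helper: refresh parser ranges around POS first, then
ask `treesit-language-at'."
  (when (and (fboundp 'treesit-available-p) (treesit-available-p))
    ;; Make sure embedded parsers cover the right text near POS.
    (treesit-update-ranges pos (min (point-max) (1+ pos)))
    (treesit-language-at pos)))
```

In a buffer without any tree-sitter parser, or in an Emacs built without tree-sitter support, this helper simply returns nil.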
Supporting multiple languages in major modes
Normally, in a set of languages that can be mixed together, there is a host language and one or more embedded languages. A Lisp program usually first parses the whole document with the host language's parser, retrieves some information, sets ranges for the embedded languages with that information, and then parses the embedded languages. Take a buffer containing HTML, CSS, and JavaScript as an example. A Lisp program will first parse the whole buffer with an HTML parser, then query the parser for style_element and script_element nodes, which correspond to CSS and JavaScript text, respectively. Then it sets the range of the CSS and JavaScript parsers to the range which their corresponding nodes span. Given a simple HTML document:
<html>
<script>1 + 2</script>
<style>body { color: "blue"; }</style>
</html>
a Lisp program will first parse with an HTML parser, then set ranges for CSS and JavaScript parsers:
;; Create parsers.
(setq html (treesit-parser-create 'html))
(setq css (treesit-parser-create 'css))
(setq js (treesit-parser-create 'javascript))
;; Set CSS ranges.
(setq css-range
      (treesit-query-range
       'html
       '((style_element (raw_text) @capture))))
(treesit-parser-set-included-ranges css css-range)
;; Set JavaScript ranges.
(setq js-range
      (treesit-query-range
       'html
       '((script_element (raw_text) @capture))))
(treesit-parser-set-included-ranges js js-range)
Emacs automates this process in treesit-update-ranges. A multi-language major mode should set treesit-range-settings so that treesit-update-ranges knows how to perform this process automatically. Major modes should use the helper function treesit-range-rules to generate a value that can be assigned to treesit-range-settings. The settings in the following example directly translate into operations shown above.
(setq treesit-range-settings
      (treesit-range-rules
       :embed 'javascript
       :host 'html
       '((script_element (raw_text) @capture))
       :embed 'css
       :host 'html
       '((style_element (raw_text) @capture))))
;; Major modes with multiple languages can optionally set
;; `treesit-language-at-point-function' (which see).
(setq treesit-language-at-point-function
      (lambda (pos)
        (let* ((node (treesit-node-at pos 'html))
               (parent (treesit-node-parent node)))
          (cond
           ((and node parent
                 (equal (treesit-node-type node) "raw_text")
                 (equal (treesit-node-type parent) "script_element"))
            'javascript)
           ((and node parent
                 (equal (treesit-node-type node) "raw_text")
                 (equal (treesit-node-type parent) "style_element"))
            'css)
           (t 'html)))))
-
treesit-range-rules - This function is used to set treesit-range-settings. It takes care of compiling queries and other post-processing, and outputs a value that treesit-range-settings can have. It takes a series of query-specs, where each query-spec is a query preceded by zero or more keyword/value pairs. Each query is a tree-sitter query in either the string, s-expression, or compiled form, or a function. If query is a tree-sitter query, it should be preceded by two keyword/value pairs, where the :embed keyword specifies the embedded language, and the :host keyword specifies the host language. If the query is given the :local keyword whose value is t, the range set by this query has a dedicated local parser; otherwise the range shares a parser with other ranges for the same language. By default, a parser sees its ranges as a continuum, rather than treating them as separate independent segments. Therefore, if the embedded ranges are semantically independent segments, they should be processed by local parsers, described below. The local parser set to a range can be retrieved by treesit-local-parsers-at and treesit-local-parsers-on. treesit-update-ranges uses query to figure out how to set the ranges for parsers for the embedded language. It queries query in a host language parser, computes the ranges which the captured nodes span, and applies these ranges to embedded language parsers. If query is a function, it doesn't need any keyword and value pair. It should be a function that takes 2 arguments, start and end, and sets the ranges for parsers in the current buffer in the region between start and end. It is fine for this function to set ranges in a larger region that encompasses the region between start and end.
-
treesit-range-settings - This variable helps treesit-update-ranges in updating the ranges for parsers in the buffer. It is a list of settings, where the exact format of a setting is considered internal. You should use treesit-range-rules to generate a value that this variable can have.
-
treesit-language-at-point-function - This variable's value should be a function that takes a single argument, pos, which is a buffer position, and returns the language of the buffer text at pos. This variable is used by treesit-language-at.
-
treesit-parsers-at - This function returns all parsers at pos in the current buffer. pos defaults to point. The returned parsers are sorted by decreasing embed level. If language is non-nil, return parsers only for that language. If with-host is non-nil, return a list of (PARSER . HOST-PARSER) where host-parser is the host parser which created the parser. If only is nil, return all parsers including the primary parser. Otherwise, the argument only can be a list of symbols that specify what parsers to include in the return value. If only contains the symbol local, include local parsers. Local parsers are those which only parse a limited region marked by an overlay with a non-nil treesit-parser-local-p property. If only contains the symbol global, include non-local parsers excluding the primary parser. If only contains the symbol primary, include the primary parser.
-
treesit-local-parsers-at - This function returns all the local parsers at pos in the current buffer. pos defaults to point. Local parsers are those which only parse a limited region marked by an overlay with a non-nil treesit-parser-local-p property. If language is non-nil, only return parsers for that language.
-
treesit-local-parsers-on - This function is the same as treesit-local-parsers-at, but it returns the local parsers in the range between beg and end instead of at point. beg and end default to the entire accessible portion of the buffer.
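As an illustration of the query-spec format accepted by treesit-range-rules, here is a minimal sketch for an HTML buffer that embeds JavaScript and CSS. It reuses the node types from the language-at-point example above (script_element, style_element, raw_text). The helper name my-html--setup-ranges is hypothetical, and the sketch assumes that the html, javascript, and css grammars are installed.

```elisp
(require 'treesit)

;; Hypothetical helper, intended to be called from a major mode body.
;; Assumes the html, javascript and css grammars are available.
(defun my-html--setup-ranges ()
  "Install range rules for JavaScript and CSS embedded in HTML."
  (setq-local treesit-range-settings
              (treesit-range-rules
               ;; Text inside <script> is parsed by the javascript parser.
               :embed 'javascript
               :host 'html
               '((script_element (raw_text) @capture))

               ;; Text inside <style> is parsed by the css parser.
               :embed 'css
               :host 'html
               '((style_element (raw_text) @capture)))))
```

If each embedded block should be parsed independently of the others, adding :local t to the corresponding query-spec would give each range its own local parser, which can then be retrieved with treesit-local-parsers-at.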
Developing major modes with tree-sitter
This section covers some general guidelines on developing tree-sitter integration for a major mode. A major mode supporting tree-sitter features should roughly follow this pattern:
(define-derived-mode woomy-mode prog-mode "Woomy"
  "A mode for Woomy programming language."
  (when (treesit-ready-p 'woomy)
    (setq-local treesit-variables ...)
    ...
    (treesit-major-mode-setup)))
treesit-ready-p automatically emits a warning if conditions for enabling tree-sitter aren't met. If a tree-sitter major mode shares setup with its "native" counterpart, one can create a "base mode" that contains the common setup, like this:
(define-derived-mode woomy--base-mode prog-mode "Woomy"
  "An internal mode for Woomy programming language."
  (common-setup)
  ...)

(define-derived-mode woomy-mode woomy--base-mode "Woomy"
  "A mode for Woomy programming language."
  (native-setup)
  ...)

(define-derived-mode woomy-ts-mode woomy--base-mode "Woomy"
  "A mode for Woomy programming language."
  (when (treesit-ready-p 'woomy)
    (setq-local treesit-variables ...)
    ...
    (treesit-major-mode-setup)))
-
treesit-ready-p - This function checks for conditions for activating tree-sitter. It checks whether Emacs was built with tree-sitter, whether the buffer's size is not too large for tree-sitter to handle, and whether the grammar for language is available on the system (Language Grammar). This function emits a warning if tree-sitter cannot be activated. If quiet is message, the warning is turned into a message; if quiet is t, no warning or message is displayed. If all the necessary conditions are met, this function returns non-nil; otherwise it returns nil.
-
treesit-major-mode-setup - This function activates some tree-sitter features for a major mode. Currently, it sets up the following features:
- If treesit-font-lock-settings (Parser-based Font Lock) is non-nil, it sets up fontification.
- If either treesit-simple-indent-rules or treesit-indent-function (Parser-based Indentation) is non-nil, it sets up indentation.
- If treesit-defun-type-regexp is non-nil, it sets up navigation functions for beginning-of-defun and end-of-defun (List Motion).
- If treesit-defun-name-function is non-nil, it sets up add-log functions used by add-log-current-defun.
- If treesit-simple-imenu-settings (Imenu) is non-nil, it sets up Imenu.
- If treesit-outline-predicate (Outline Minor Mode) is non-nil, it sets up Outline minor mode.
- If sexp and/or sentence are defined in treesit-thing-settings (User-defined Things), it enables navigation commands that move, respectively, by sexps and sentences by defining variables such as forward-sexp-function and forward-sentence-function.
For more information on these built-in tree-sitter features, see Parser-based Font Lock, Parser-based Indentation, and List Motion. For supporting mixing of multiple languages in a major mode, see Multiple Languages. Besides beginning-of-defun and end-of-defun, Emacs provides some additional functions for working with defuns: treesit-defun-at-point returns the defun node at point, and treesit-defun-name returns the name of a defun node.
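The two defun helpers mentioned above can be combined as in the following sketch. The command name my-show-current-defun-name is hypothetical, and the sketch assumes that the current major mode has set treesit-defun-type-regexp and treesit-defun-name-function; when they are unset, both helpers return nil and the fallback message is shown.

```elisp
(require 'treesit)

;; Hypothetical command, for illustration only.
(defun my-show-current-defun-name ()
  "Display the name of the defun at point in the echo area."
  (interactive)
  ;; `treesit-defun-at-point' returns nil when no defun is found, and
  ;; `treesit-defun-name' returns nil when given nil, so no extra
  ;; check is needed between the two calls.
  (let ((name (treesit-defun-name (treesit-defun-at-point))))
    (message "%s" (or name "No named defun at point"))))
```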
-
treesit-defun-at-point - This function returns the defun node at point, or nil if none is found. It respects treesit-defun-tactic: if its value is top-level, this function returns the top-level defun, and if its value is nested, it returns the immediate enclosing defun. This function requires treesit-defun-type-regexp to work. If it is nil, this function simply returns nil.
-
treesit-defun-name - This function returns the defun name of node. It returns nil if there is no defun name for node, or if node is not a defun node, or if node is nil. Depending on the language and major mode, the defun names are names like function name, class name, struct name, etc. If treesit-defun-name-function is nil, this function always returns nil.
-
treesit-defun-name-function - If non-nil, this variable's value should be a function that is called with a node as its argument, and returns the defun name of the node. The function should have the same semantics as treesit-defun-name: if the node is not a defun node, or the node is a defun node but doesn't have a name, or the node is nil, it should return nil.
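A function suitable for treesit-defun-name-function might look like the sketch below. Everything specific to the grammar is an assumption made for illustration: the imaginary woomy grammar is taken to use a node type called "function_definition" whose name is stored in a field called "name". The helper names woomy--defun-name and woomy--setup-defun are likewise hypothetical.

```elisp
(require 'treesit)

;; Assumption: the hypothetical woomy grammar has "function_definition"
;; nodes with a "name" field.  Real grammars use their own node types.
(defun woomy--defun-name (node)
  "Return the defun name of NODE, or nil if there is none."
  (when (and node
             (equal (treesit-node-type node) "function_definition"))
    (let ((name (treesit-node-child-by-field-name node "name")))
      ;; A defun node without a name field yields nil, as required.
      (when name
        (treesit-node-text name t)))))

(defun woomy--setup-defun ()
  "Set the defun-related variables; call before `treesit-major-mode-setup'."
  (setq-local treesit-defun-type-regexp "function_definition")
  (setq-local treesit-defun-name-function #'woomy--defun-name))
```

The guard on node covers all three cases in which the function must return nil: a nil node, a node that is not a defun, and a defun without a name.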
Tree-sitter C API Correspondence
Emacs's tree-sitter integration doesn't expose every feature provided by tree-sitter's C API. Missing features include:
- Creating a tree cursor and navigating the syntax tree with it.
- Setting timeout and cancellation flag for a parser.
- Setting the logger for a parser.
- Printing a DOT graph of the syntax tree to a file.
- Copying and modifying a syntax tree. (Emacs doesn't expose a tree object.)
- Using (row, column) coordinates as position.
- Updating a node with changes. (In Emacs, retrieve a new node instead of updating the existing one.)
- Querying statistics of a language grammar.
In addition, Emacs makes some changes to the C API to make the API more convenient and idiomatic:
- Instead of using byte positions, the Emacs Lisp API uses character positions.
- Null nodes are converted to nil.
Below is the correspondence between all C API functions and their Lisp counterparts. Sometimes one Lisp function corresponds to multiple C functions, and many C functions don't have a Lisp counterpart.
ts_parser_new                               treesit-parser-create
ts_parser_delete
ts_parser_set_language
ts_parser_language                          treesit-parser-language
ts_parser_set_included_ranges               treesit-parser-set-included-ranges
ts_parser_included_ranges                   treesit-parser-included-ranges
ts_parser_parse
ts_parser_parse_string                      treesit-parse-string
ts_parser_parse_string_encoding
ts_parser_reset
ts_parser_set_timeout_micros
ts_parser_timeout_micros
ts_parser_set_cancellation_flag
ts_parser_cancellation_flag
ts_parser_set_logger
ts_parser_logger
ts_parser_print_dot_graphs
ts_tree_copy
ts_tree_delete
ts_tree_root_node
ts_tree_language
ts_tree_edit
ts_tree_get_changed_ranges
ts_tree_print_dot_graph
ts_node_type                                treesit-node-type
ts_node_symbol
ts_node_start_byte                          treesit-node-start
ts_node_start_point
ts_node_end_byte                            treesit-node-end
ts_node_end_point
ts_node_string                              treesit-node-string
ts_node_is_null
ts_node_is_named                            treesit-node-check
ts_node_is_missing                          treesit-node-check
ts_node_is_extra                            treesit-node-check
ts_node_has_changes
ts_node_has_error                           treesit-node-check
ts_node_parent                              treesit-node-parent
ts_node_child                               treesit-node-child
ts_node_field_name_for_child                treesit-node-field-name-for-child
ts_node_child_count                         treesit-node-child-count
ts_node_named_child                         treesit-node-child
ts_node_named_child_count                   treesit-node-child-count
ts_node_child_by_field_name                 treesit-node-child-by-field-name
ts_node_child_by_field_id
ts_node_next_sibling                        treesit-node-next-sibling
ts_node_prev_sibling                        treesit-node-prev-sibling
ts_node_next_named_sibling                  treesit-node-next-sibling
ts_node_prev_named_sibling                  treesit-node-prev-sibling
ts_node_first_child_for_byte                treesit-node-first-child-for-pos
ts_node_first_named_child_for_byte          treesit-node-first-child-for-pos
ts_node_descendant_for_byte_range           treesit-node-descendant-for-range
ts_node_descendant_for_point_range
ts_node_named_descendant_for_byte_range     treesit-node-descendant-for-range
ts_node_named_descendant_for_point_range
ts_node_edit
ts_node_eq                                  treesit-node-eq
ts_tree_cursor_new
ts_tree_cursor_delete
ts_tree_cursor_reset
ts_tree_cursor_current_node
ts_tree_cursor_current_field_name
ts_tree_cursor_current_field_id
ts_tree_cursor_goto_parent
ts_tree_cursor_goto_next_sibling
ts_tree_cursor_goto_first_child
ts_tree_cursor_goto_first_child_for_byte
ts_tree_cursor_goto_first_child_for_point
ts_tree_cursor_copy
ts_query_new
ts_query_delete
ts_query_pattern_count
ts_query_capture_count
ts_query_string_count
ts_query_start_byte_for_pattern
ts_query_predicates_for_pattern
ts_query_step_is_definite
ts_query_capture_name_for_id
ts_query_string_value_for_id
ts_query_disable_capture
ts_query_disable_pattern
ts_query_cursor_new
ts_query_cursor_delete
ts_query_cursor_exec                        treesit-query-capture
ts_query_cursor_did_exceed_match_limit
ts_query_cursor_match_limit
ts_query_cursor_set_match_limit
ts_query_cursor_set_byte_range
ts_query_cursor_set_point_range
ts_query_cursor_next_match
ts_query_cursor_remove_match
ts_query_cursor_next_capture
ts_language_symbol_count
ts_language_symbol_name
ts_language_symbol_for_name
ts_language_field_count
ts_language_field_name_for_id
ts_language_field_id_for_name
ts_language_symbol_type
ts_language_version
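The following sketch illustrates how a single Lisp function can stand in for several C functions in the correspondence above, using an optional argument or a property symbol to select the variant. The function name my-treesit-c-api-demo is hypothetical, and the sketch assumes that the C grammar is installed; when it is not, the function simply returns nil.

```elisp
(require 'treesit)

;; Hypothetical demo function.  Returns nil unless Emacs was built with
;; tree-sitter and the C grammar is available.
(defun my-treesit-c-api-demo ()
  "Return a list of values obtained from a small parsed C snippet."
  (when (and (treesit-available-p)
             (treesit-language-available-p 'c))
    (let* ((root (treesit-parse-string "int x;" 'c)) ; ts_parser_parse_string
           (child (treesit-node-child root 0))       ; ts_node_child
           (named (treesit-node-child root 0 t)))    ; ts_node_named_child
      (list (treesit-node-type child)                ; ts_node_type
            (treesit-node-check named 'named)        ; ts_node_is_named
            (treesit-node-check named 'missing)      ; ts_node_is_missing
            (treesit-node-child-count root)          ; ts_node_child_count
            (treesit-node-child-count root t)        ; ts_node_named_child_count
            ;; Positions are character positions, not byte offsets.
            (treesit-node-start named)               ; ts_node_start_byte
            (treesit-node-end named)))))             ; ts_node_end_byte
```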