{"id":116,"date":"2024-04-07T23:54:09","date_gmt":"2024-04-07T21:54:09","guid":{"rendered":"https:\/\/lukasmielczarek.de\/?page_id=116"},"modified":"2024-04-07T23:54:31","modified_gmt":"2024-04-07T21:54:31","slug":"bachelorarbeit-integrating-supertag-features-into-neural-discontinuous-constituent-parsing","status":"publish","type":"page","link":"https:\/\/lukasmielczarek.de\/en\/bachelorarbeit-integrating-supertag-features-into-neural-discontinuous-constituent-parsing\/","title":{"rendered":"Bachelor's Thesis: Integrating Supertag Features into Neural Discontinuous Constituent Parsing"},"content":{"rendered":"<p>Syntactic parsing lies at the foundation of many natural-language processing applications. The term denotes the automatic extraction of a syntactic representation from a natural language text, usually in the form of a sequence of words and punctuation marks. The representation is designed to capture information about the elements\u2018 grammatical relations and functions. This output is traditionally used as a basis for higher-level tasks like grammar checking, semantic analysis, question answering and information extraction (Jurafsky and Martin, 2009, chapter 12).<\/p>\n\n\n\n<p>One widely used description of syntax is so-called \\textit{constituent structure}. Rooted in X-bar theory (Chomsky, 1970), the idea of constituency arises from the observation that groups of words can behave as single units (Jurafsky and Martin, 2009, chapter 12). Take for instance the noun phrases in (1.1).<\/p>\n\n\n\n<p>(1.1)<br>a. the memory<br>b. a new computer<br>c. the person she really likes<\/p>\n\n\n\n<p>Despite having completely different meanings, these three terms can occur in similar syntactic environments, for example before a verb as in (1.2).<\/p>\n\n\n\n<p>(1.2)<br>a. the memory <em>fades<\/em><br>b. a new computer <em>arrives<\/em>\u2026<br>c. the person she really likes <em>waits<\/em>\u2026<\/p>\n\n\n\n<p>However, this is not true for all of the individual components of the phrases, as can be seen in (1.3). Therefore, one assumes the existence of an abstract category called a <em>constituent<\/em> to which one or more words can belong and for which rules like \u201enoun phrases can be followed by verbs\u201c can be postulated (Jurafsky and Martin, 2009, chapter 12).<\/p>\n\n\n\n<p>(1.3)<br>a. *the<em> fades<\/em><br>b.<em> *<\/em>new <em>arrives<\/em>\u2026<br>c. *likes <em>spends<\/em>\u2026<\/p>\n\n\n\n<p>Note: The asterisk marks constructions that are not grammatical.<\/p>\n\n\n\n<p>Traditional views of constituency demand that constituents consist of adjacent words. This is rooted in a system for modelling languages called <em>context-free grammar<\/em> (CFG) (Chomsky, 1956), a generative device that derives sentences by using rewrite rules, starting with an initial symbol. The derivation path marks a hierarchy of constituents that can be arranged into a <em>derivation\/parse tree<\/em>. Figure 1.1 gives an example for such a description. Note that the tree only contains non-crossing edges which is why it is called <em>projective<\/em> or <em>continuous<\/em>.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/mielczarek.de\/wp-content\/uploads\/2023\/11\/image-800x322.png\" alt=\"\" class=\"wp-image-727\"\/><\/figure>\n\n\n\n<p>This view of constituency was adopted by many <em>treebanks<\/em>. Treebanks are linguistic corpora in which the sentences have been annotated with syntactic representations (e.g. constituency trees) (Jurafsky and Martin, 2009, chapter 12). However, this leads to difficulties when analysing syntactic phenomena that are believed to exhibit non-local dependencies. While the analysis of German or other languages with varying degrees of free word order evidently questions the projective approach, the English language also exhibits a fair amount of discontinuous phenomena, see (1.4) and (1.5).<\/p>\n\n\n\n<p>(1.4)<br>a. sie <em>packte<\/em> ihre Werkzeuge <em>ein<\/em><br>she packed her tools up<br>\u2019she packed her tools\u2018<\/p>\n\n\n\n<p>b. <em>den Dank<\/em> m\u00f6chte ich nicht <em>annehmen<\/em><br>the thanks want he not accept<br>\u201ahe does not want to accept the thanks\u2018<\/p>\n\n\n\n<p>c. <em>ohne Scham <\/em>scheinen sie <em>das komplette Buffet aufgegessen<\/em> zu haben<br>without shame seem they the complete buffet eaten to have<br>\u201athey seem to have eaten the entire buffet without shame\u2018<\/p>\n\n\n\n<p>(1.5)<br>a. <em>areas of the factory<\/em> were particularly dusty <em>where the crocidolite was used<\/em> (Evang, 2011)<br>b. <em>a man<\/em> entered <em>who was wearing a black suit<\/em> (McCawley, 1982)<\/p>\n\n\n\n<p>Some treebanks like the English <em>Penn Treebank<\/em> (Marcus et al., 1993) opted to mark discontinuous phenomena through indexing and trace nodes, i.e. empty nodes in the place where a discontinuous subsection of the sentence would be expected in \u201estandard\u201c word order. This is motivated by the tradition of analysing discontinuities as instances of transformation or syntactic movement as introduced by Chomsky (1975). The tree in figure 1.2 features three cases of co-indexing. While such a strategy acknowledges the existence of discontinuous relationships, the long-range dependencies of the markers cannot be fully captured by traditional parsing techniques for projective grammars and often remain unused.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/mielczarek.de\/wp-content\/uploads\/2023\/11\/image-1-800x320.png\" alt=\"\" class=\"wp-image-728\"\/><\/figure>\n\n\n\n<p>In a number of treebanks like the German NeGra (Skut et al., 1998) and TIGER treebanks (Brants et al., 2004) or English <em>discontinuous Penn Treebank<\/em> (Evang and Kallmeyer, 2011) (DPTB) long-range dependencies are represented by crossing edges and constituents with non-adjacent elements. The resulting trees are called <em>discontinuous constituent trees<\/em>. Figure 1.3 shows an example from the NeGra corpus.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/mielczarek.de\/wp-content\/uploads\/2023\/11\/image-2-800x307.png\" alt=\"\" class=\"wp-image-729\"\/><\/figure>\n\n\n\n<p>In order to capture syntactic descriptions with discontinuous constituents, several formalisms have been proposed. Among these <em>linear context-free rewriting systems<\/em> (LCFRS) (Vijay-Shanker et al., 1987) have been shown to be a natural candidate for accurately covering the notion of discontinuous constituents employed by the aforementioned treebanks.<\/p>\n\n\n\n<p>Being a foundation for many natural language processing tasks, parsing speed is a key concern and a driver of active research. Parsing LCFRS is non-trivial in terms of efficiency. The adaptation of traditional chart parsing for LCFRS results in a polynomial time complexity of O(n^(3*dim(G))) where n is the length of a sentence and dim(G) is the maximal number of discontinuous blocks of a constituent in the grammar. Transition-based parsing approaches aim at reducing this factor by doing away with the need for an explicit grammar. Instead, an artificial neural network is trained to produce discontinuous constituent trees given only raw text input using <em>supervised learning<\/em> on large annotated corpora. Errors in prediction are backpropagated through the network resulting in optimisations of its parameters. In this way, it implicitly learns to identify regularities that it can effectively use to predict trees for unseen data. A recent and elegant proposal for a neural stack-free transition-based parser developed by Coavoux and Cohen (2019) successfully allows for the derivation of any discontinuous constituent tree over a sentence in worst-case quadratic time.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/lukasmielczarek.de\/wp-content\/uploads\/2024\/04\/Bachelorarbeit_Mielczarek_Lukas_online.pdf\">Download Thesis<\/a><\/div>\n\n\n\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/github.com\/filemon11\/discoparset-supertag\">Download Code<\/a><\/div>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>Syntactic parsing lies at the foundation of many natural-language processing applications. The term denotes the automatic extraction of a syntactic representation from a natural language text, usually in the form of a sequence of words and punctuation marks. The representation is designed to capture information about the elements\u2018 grammatical relations and functions. This output is [&hellip;]<\/p>","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-116","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/lukasmielczarek.de\/en\/wp-json\/wp\/v2\/pages\/116","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lukasmielczarek.de\/en\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/lukasmielczarek.de\/en\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/lukasmielczarek.de\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lukasmielczarek.de\/en\/wp-json\/wp\/v2\/comments?post=116"}],"version-history":[{"count":2,"href":"https:\/\/lukasmielczarek.de\/en\/wp-json\/wp\/v2\/pages\/116\/revisions"}],"predecessor-version":[{"id":119,"href":"https:\/\/lukasmielczarek.de\/en\/wp-json\/wp\/v2\/pages\/116\/revisions\/119"}],"wp:attachment":[{"href":"https:\/\/lukasmielczarek.de\/en\/wp-json\/wp\/v2\/media?parent=116"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}