Metadata processing with Metafacture and Eclipse <p><small>Cross-posted to: <a href="https://fsteeg.wordpress.com/2013/04/28/metadata-processing-with-metafacture-and-eclipse/">https://fsteeg.wordpress.com/2013/04/28/metadata-processing-with-metafacture-and-eclipse/</a></small></p> <p/>
A lot has happened job-wise for me since my last post: in the summer I quit my little trip into the startup world (where I used Clojure and was not just the only Eclipse user on the team, but actually the only IDE user) and joined hbz, the Library Service Center of the German state of North Rhine-Westphalia. The <em>library</em> part is actually about the book thing, and it sort of divides not only the general public, but the Eclipse community as well: <blockquote class="twitter-tweet"><p lang="en" dir="ltr"><a href="https://twitter.com/waynebeaton">@waynebeaton</a> <a href="https://twitter.com/DonaldOJDK">@DonaldOJDK</a> <a href="https://twitter.com/IanSkerrett">@IanSkerrett</a> I go with my kids to get library books every week. They think it&#39;s retro.</p>&mdash; Alex Blewitt (@alblue) <a href="https://twitter.com/alblue/status/308980332684771328">March 5, 2013</a></blockquote> <p/>
I’m with Alex Blewitt’s kids on this one. And as for what ‘retro’ means, I recently saw a nice definition: <p/> <blockquote>"Gamers say we're 'retro', which I guess means 'old, but cool'" -- Wreck-It Ralph</blockquote> <p/>
OK, so libraries are old, but cool. But what’s in the library world for a programmer? Metadata! <blockquote class="twitter-tweet"><p lang="en" dir="ltr">...and if you&#39;re interested in metadata, you should be interested in library metadata. That&#39;s where metadata began.</p>&mdash; Bob DuCharme (@bobdc) <a href="https://twitter.com/bobdc/status/304960413467045889">February 22, 2013</a></blockquote> <p/>
At hbz, we don’t deal with books, but with information about books. I’m in the <em>Linked Open Data</em> group, where we work on bringing this library metadata to the web. At the heart of this task is processing and transforming the metadata, which is stored in different formats. In general, metadata provides information about specific aspects of some data. For libraries, the data is books, and the aspects described by the metadata are things like the book’s author, title, etc. Each of these aspects can be represented as a key-value pair, so metadata has a general structure: key-value pairs, grouped to describe one piece of data. <p/>
Given this general structure, metadata processing isn’t a problem specific to libraries. Even if you take only open, textual formats like XML or JSON, there are still myriad ways to express a specific set of metadata (e.g. which keys to use, or how to represent nested structure). Given this reality of computing, metadata processing is not only at the heart of library data, but central to any data processing or information system. Therefore I was very happy to learn about a useful toolkit for metadata processing called <a href="https://github.com/culturegraph/metafacture-core/wiki">Metafacture</a>, developed by our partners in the <a href="https://github.com/culturegraph">Culturegraph</a> project at the German National Library.
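<p/>
To make that structure concrete before we dive in: a single record in this model is one group of key-value pairs. Here is a tiny, made-up example; the two keys are the MARC field identifiers that will show up again in the transformation rules below, while the values are invented: <p/>
<pre>
record 1
  24500.a:  Some Example Title
  8564 .u:  http://example.org/fulltext/1
</pre>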
<p/> <a href="http://fsteeg.com/images/metafacture-ide-a.png"><img class="alignnone size-full" alt="metafacture-ide-A" src="http://fsteeg.com/images/metafacture-ide-a.png" /></a> <p/>
Metafacture is a framework for metadata processing. One basic idea of Metafacture is to separate the transformation rules from the specific input and output formats. The transformation converts key-value pairs to other key-value pairs and is defined declaratively in an XML file (the <em>Morph</em> file). The framework provides support for different input and output formats and is extensible for custom formats. <p/>
The interaction of input, transformation rules, and output makes up the transformation workflow. This workflow is defined in Metafacture with a domain-specific language called Flux. Since I knew about the awesome power of <a href="http://www.eclipse.org/Xtext">Xtext</a>, I started working on tools for this DSL, in particular an Xtext-based editor and an Eclipse launcher for Flux files. I'll use these tools here to give you an idea of what Flux is about. We start the workflow by specifying the path to our input file, relative to the location of the Flux file (using a variable declared with the <code>default</code> keyword). The first interesting part of the workflow is how to interpret this input path. What we mean here is the path to a local, uncompressed file: <p/>
<a href="http://fsteeg.com/images/metafacture-ide-b.png"><img class="alignnone size-full" alt="metafacture-ide-B" src="http://fsteeg.com/images/metafacture-ide-b.png" /></a> <p/>
Next, we specify how to decode the information stored in the file (XML in our case): <p/>
<a href="http://fsteeg.com/images/metafacture-ide-c.png"><img class="alignnone size-full" alt="metafacture-ide-C" src="http://fsteeg.com/images/metafacture-ide-c.png" /></a> <p/>
And, as a separate step, how to handle this information to make it available in the grouped key-value structure used for the actual transformation (in our case, this will e.g. flatten the subfields in the input by combining each top-level field key with its subfield keys): <p/>
<a href="http://fsteeg.com/images/metafacture-ide-d.png"><img class="alignnone size-full" alt="metafacture-ide-D" src="http://fsteeg.com/images/metafacture-ide-d.png" /></a> <p/>
At this point we’re ready to trigger the actual transformation, which is defined in the XML Morph file. To understand the basic idea of how the transformation works, let’s open the morph.xml file with the <a href="http://www.eclipse.org/webtools/">WTP</a> XML editor and have a look at the design tab: <p/>
<a href="http://fsteeg.com/images/metafacture-ide-e.png"><img class="alignnone size-full" alt="metafacture-ide-E" src="http://fsteeg.com/images/metafacture-ide-e.png" /></a> <p/>
What we see here are two transformation rules for different fields of an input record. The first one just changes the attribute key: we want to map the input field <code>8564 .u</code> (the URL field identifier in the MARC format) to <code>http://lobid.org/vocab/lobid#fulltextOnline</code> (the identifier used for full-text references in our linked open data). The second rule does the same for the field <code>24500.a</code>, mapping it to <code>http://iflastandards.info/ns/isbd/elements/P1004</code> (the title field), but this second rule also changes the value: it removes newlines and their surrounding spaces by calling the <em>replace</em> Morph function. There are many options in the Metafacture Morph language, but this should give you the basic idea.
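<p/>
In the XML source, those two rules might look roughly like this. This is a sketch from memory of the Morph syntax (<code>data</code> elements with <code>source</code> and <code>name</code> attributes, plus a nested <code>replace</code> function); the exact pattern for matching newlines with surrounding spaces is just an illustration: <p/>
<pre><code>&lt;rules&gt;
  &lt;!-- key change only: map the MARC URL field to our LOD property --&gt;
  &lt;data source="8564 .u"
        name="http://lobid.org/vocab/lobid#fulltextOnline" /&gt;
  &lt;!-- key change plus value change: map the title field and
       remove newlines with their surrounding spaces --&gt;
  &lt;data source="24500.a"
        name="http://iflastandards.info/ns/isbd/elements/P1004"&gt;
    &lt;replace pattern="\s*\n\s*" with="" /&gt;
  &lt;/data&gt;
&lt;/rules&gt;</code></pre>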
For details, see the Metafacture <a href="https://github.com/culturegraph/metafacture-core/wiki#morph">Morph user guide</a>. <p/>
After the actual transformation, we have new key-value pairs, which we now want to encode somehow. In our use case at hbz, we want to create linked open data to be processed with Hadoop, so we wrote an encoder for N-Triples, a line-based RDF graph serialization: <p/>
<a href="http://fsteeg.com/images/metafacture-ide-f.png"><img class="alignnone size-full" alt="metafacture-ide-F" src="http://fsteeg.com/images/metafacture-ide-f.png" /></a> <p/>
Notice how we didn’t just pipe into the <code>encode-ntriples</code> command, but first opened a <code>stream-tee</code> to branch the output of the <code>morph</code> command into two different receivers: one generating the mentioned N-Triples, the other creating a renderable representation of the graph in the Graphviz DOT language: <p/>
<a href="http://fsteeg.com/images/metafacture-ide-g.png"><img class="alignnone size-full" alt="metafacture-ide-G" src="http://fsteeg.com/images/metafacture-ide-g.png" /></a> <p/>
We now have a complete Flux workflow, which we can run by selecting <em>Run -&gt; Run As -&gt; Flux Workflow</em> on the Flux file (in the editor or the explorer view). This will generate the output files and refresh the workspace: <p/>
<a href="http://fsteeg.com/images/metafacture-ide-h.png"><img class="alignnone size-full" alt="metafacture-ide-H" src="http://fsteeg.com/images/metafacture-ide-h.png" /></a> <p/>
Since we’re running in Eclipse, it’s easy to integrate this with other tools. For instance, I’ve added the Zest feature to our build to make the <a href="http://wiki.eclipse.org/Zest/DOT#Zest_Graph_View">Zest graph view</a> available. This view can listen to DOT files in the workspace and render their content with Zest. When enabled, the view will update as soon as the output DOT file is written by the Flux workflow, and display the resulting graph. <p/>
Check out the <a href="https://github.com/culturegraph/metafacture-ide/wiki/User-Guide">Metafacture IDE user guide</a> for detailed instructions on this sample setup. The Metafacture IDE is in an early alpha stage, but we think it’s already useful. It can be installed from the <a href="http://marketplace.eclipse.org/content/metafacture-ide">Eclipse Marketplace</a>. We’re happy about any kind of feedback, contributions, etc. There’s a <a href="https://github.com/culturegraph/metafacture-ide/wiki/Developer-Guide">Metafacture IDE developer guide</a> and further information on all <a href="http://culturegraph.github.com">Metafacture modules on GitHub</a>. <p/>
One of the best things about Metafacture is that it is built to be extensible. That makes sense: a toolkit can supply tons of building blocks for metadata processing, but it will never cover every format out of the box. So Metafacture helps users solve their own problems by providing hooks into the framework. For instance, both Flux commands (e.g. <code>open-file</code> and <code>decode-xml</code> in the Flux file above) and Morph functions (e.g. <code>replace</code> in the Morph XML file above) are actually Java classes implementing specific interfaces that are called using reflection internally. The additional content assist information shown above is generated from annotations on these classes. <p/>
With its stream-based processing pipeline, inspired by an architecture developed by <a href="http://github.com/jprante">Jörg Prante</a> at hbz, Metafacture is efficient and can deal with big data.
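<p/>
To make the extensibility point concrete, here is a minimal, self-contained sketch of that pattern in Java. All names here are illustrative stand-ins I made up, not the actual Metafacture API: <p/>
<pre><code>import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Stand-in for a framework annotation whose text could feed content assist.
@Retention(RetentionPolicy.RUNTIME)
@interface Description { String value(); }

// Stand-in for a framework interface that commands implement.
interface StreamReceiver {
    void startRecord(String id);
    void literal(String key, String value);
    void endRecord();
}

// A custom command: a plain class the framework could look up by name
// and instantiate reflectively, e.g. Class.forName(name).newInstance().
@Description("Writes each key-value pair as one line")
class LinePrinter implements StreamReceiver {
    public void startRecord(String id) { System.out.println("record " + id); }
    public void literal(String key, String value) {
        System.out.println("  " + key + ": " + value);
    }
    public void endRecord() { System.out.println(); }
}</code></pre> <p/>
And putting all the pieces together, the complete workflow might read roughly like this as Flux source. The command names are the ones mentioned above; the names of the handler and DOT encoder commands and the exact branching syntax for <code>stream-tee</code> are my guesses, so treat this as a sketch, not as copy-paste Flux: <p/>
<pre><code>default file = "input.xml";

file
| open-file
| decode-xml
| handle-marcxml      // make the decoded XML available as grouped key-value pairs
| morph("morph.xml")  // apply the transformation rules
| stream-tee | {
    encode-ntriples | write("output.nt")
  } {
    encode-dot | write("output.dot")
  };</code></pre>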
With its declarative, modular setup of Morph and Flux files, Metafacture also provides accurate, complete, and reproducible documentation of how the data was created. <p/>
<a href="http://fsteeg.com/images/metafacture-ide-i.png"><img class="alignnone size-full" alt="metafacture-ide-I" src="http://fsteeg.com/images/metafacture-ide-i.png" /></a> <p/>
I think the nature of this work, as well as the organizational challenges in the library community, where multiple public and commercial entities both cooperate and compete, make Eclipse a great model - both technically (build for extensibility, use a common platform, etc.) and for open source governance (transparency, vendor neutrality, etc.). I’m therefore very happy that hbz recently joined the Eclipse Foundation as an <a href="http://www.eclipse.org/membership/showMember.php?member_id=1072">associate member</a>.

Run MWE2 workflows for Xtext 2.0 in a Tycho build <p><small>Cross-posted to: <a href="https://fsteeg.wordpress.com/2011/07/15/run-mwe2-workflows-for-xtext-2-0-in-a-tycho-build/">https://fsteeg.wordpress.com/2011/07/15/run-mwe2-workflows-for-xtext-2-0-in-a-tycho-build/</a></small></p> <p/>
I've been doing some work on the Zest build to make the MWE workflow (which generates a bunch of files from an <a href="http://www.eclipse.org/Xtext/">Xtext</a> grammar) part of the <a href="http://www.eclipse.org/tycho/">Tycho</a> build. I guess I just couldn't stand committing tons of generated changes after the very small tweaks to the grammar I made earlier today. <p/>
To hook the workflow into the Tycho build, I was basically able to use the info in the very good and detailed <a href="http://kthoms.wordpress.com/2010/08/18/building-xtext-projects-with-maven-tycho/">tutorial by Karsten Thoms</a>. To make it work for our specific project setup and for the latest Xtext version, I still had to tweak some minor things, so I thought I'd do a quick writeup of what I actually did to make it work. This assumes you already have an Xtext project and that you're already building with Tycho in general. If not, check out the tutorial mentioned above.
<p/>
To add the MWE workflow to the Tycho build, all I had to do was make three small changes. First, add a plugin repository in the <a href="http://git.eclipse.org/c/gef/org.eclipse.zest.git/tree/pom.xml">parent pom</a>: <pre><code>&lt;pluginRepositories&gt;
  &lt;pluginRepository&gt;
    &lt;id&gt;fornax-snapshots&lt;/id&gt;
    &lt;url&gt;http://www.fornax-platform.org/archiva/repository/snapshots/&lt;/url&gt;
    &lt;snapshots&gt;&lt;enabled&gt;true&lt;/enabled&gt;&lt;/snapshots&gt;
  &lt;/pluginRepository&gt;
&lt;/pluginRepositories&gt;</code></pre> <p/>
Second, add a plugin pointing to the MWE workflow file in the Xtext <a href="http://git.eclipse.org/c/gef/org.eclipse.zest.git/tree/org.eclipse.zest.dot.core/pom.xml">grammar project pom</a>: <pre><code>&lt;build&gt;
  &lt;plugins&gt;
    &lt;plugin&gt;
      &lt;groupId&gt;org.fornax.toolsupport&lt;/groupId&gt;
      &lt;artifactId&gt;fornax-oaw-m2-plugin&lt;/artifactId&gt;
      &lt;version&gt;3.3.0-SNAPSHOT&lt;/version&gt;
      &lt;configuration&gt;
        &lt;workflowEngine&gt;mwe2&lt;/workflowEngine&gt;
        &lt;workflowDescriptor&gt;src/parser/GenerateLang.mwe2&lt;/workflowDescriptor&gt;
      &lt;/configuration&gt;
      &lt;executions&gt;
        &lt;execution&gt;
          &lt;phase&gt;generate-sources&lt;/phase&gt;
          &lt;goals&gt;&lt;goal&gt;run-workflow&lt;/goal&gt;&lt;/goals&gt;
        &lt;/execution&gt;
      &lt;/executions&gt;
    &lt;/plugin&gt;
  &lt;/plugins&gt;
&lt;/build&gt;</code></pre> <p/>
And third, define the Xtext grammar project source folder as a resource directory: <pre><code>&lt;build&gt;
  &lt;resources&gt;
    &lt;resource&gt;&lt;directory&gt;src&lt;/directory&gt;&lt;/resource&gt;
  &lt;/resources&gt;
&lt;/build&gt;</code></pre> <p/>
When running <em>mvn clean install</em>, the MWE workflow now runs as part of the grammar project build.

Computational Representation of Linguistic Structures using Domain-Specific Languages <i>Abstract</i>: We (Fabian Steeg, Christoph Benden, Paul O. Samuelsdorff) describe a modular system for generating sentences from formal definitions of underlying linguistic structures using domain-specific languages. The system uses Java in general, Prolog for lexical entries, and custom domain-specific languages based on Functional Grammar and Functional Discourse Grammar notation, implemented using the ANTLR parser generator. We show how linguistic and technological parts can be brought together in a natural language processing system and how domain-specific languages can be used as a tool for consistent formal notation in linguistic description.
<p/> arXiv: <a href="http://arxiv.org/abs/0805.3366">0805.3366</a>; 12 pages

Linguistic DSLs with ANTLR <p><small>Cross-posted to: <a href="https://fsteeg.wordpress.com/2007/10/06/linguistic-dsls-with-antlr/">https://fsteeg.wordpress.com/2007/10/06/linguistic-dsls-with-antlr/</a></small></p>
As an update to <a href="http://fgram.sourceforge.net/">our Functional Grammar project</a> I've added some experimental ANTLR v3 grammar files for <i>Functional Discourse Grammar</i> (<a href="https://lingweb.eva.mpg.de/linguipedia/index.php/Functional_Discourse_Grammar">FDG</a>) structures on the <i>Interpersonal</i> and <i>Representational Levels</i> (IL and RL) and updated the project page (with some details in an updated version of our overview paper). For example, for the RL it is possible to generate a parser in all sorts of programming languages (thanks to ANTLR v3) that will parse linguistic descriptions such as the following serial verb construction in Jamaican Creole (<i>Im tek naif kot mi</i>, 'He cut me with a knife'): <pre>
(p1:[
  (Past e1:[
    (f1:tek[ (x1:im(x1))Ag (x2:naif(x2))Inst ](f1))
    (f2:kot[ (x1:im(x1))Ag (x3:mi(x3))Pat ](f2))
  ](e1))
](p1))
</pre> The grammar looks like this (naming is based on FDG terminology): <pre>
pcontent   : '(' OPERATOR? 'p' X ( ':' head '(' 'p' X ')' )* ')' FUNCTION? ;
soaffairs  : '(' OPERATOR? 'e' X ( ':' head '(' 'e' X ')' )* ')' FUNCTION? ;
property   : '(' OPERATOR? 'f' X ( ':' head '(' 'f' X ')' )* ')' FUNCTION? ;
individual : '(' OPERATOR? 'x' X ( ':' head '(' 'x' X ')' )* ')' FUNCTION? ;
location   : '(' OPERATOR? 'l' X ( ':' head '(' 'l' X ')' )* ')' FUNCTION? ;
time       : '(' OPERATOR? 't' X ( ':' head '(' 't' X ')' )* ')' FUNCTION? ;
head       : LEMMA? ( '[' ( soaffairs | property | individual | location | time )* ']' )? ;
</pre> With this same short grammar it is basically possible to parse all valid RL representations, and the same holds for IL structures. Such an approach makes sense for two reasons, I believe: <ol><li>It provides a way for linguists to define and use a truly formal description language for linguistic structures on all linguistic levels, instead of the more common almost-formal description languages, which tend to differ in the details from paper to paper, even within one theory or framework</li><li>It might in the long run provide a way for detailed computational representation of linguistic knowledge in a domain-specific language (DSL) specialized for the descriptive linguist, and thus make detailed linguistic knowledge available for natural language processing</li></ol>On the practical side, a next step could be to create a usable tool based on the FDG grammar files that validates an entered structure against them and provides pretty-printed HTML output of the structure (with indices and the like). For integration reasons (e.g. with <a href="http://www.spinfo.uni-koeln.de/space/Forschung/Tesla">Tesla</a>) this would make sense as an Eclipse plug-in (and as a small RCP app for simplified usage as a validator only).
Also, with <a href="http://wiki.eclipse.org/Provide_an_Eclipse_IDE_generation_environment_derived_from_a_language_grammar">some interesting stuff</a> going on around Eclipse, and with the ANTLR grammar files already in place, this might be relatively easy to do, even including useful features like auto-complete and an outline view.