Content Augmentation Tools

This page describes the Content Augmentation Tools developed during the TAO project. These tools include:

 

KCIT (Key Concept Identification Tool)

a tool for automatically retrieving key concepts from software-related legacy content with respect to a domain ontology.

Content Augmentation Service

a Web service that produces ontology-aware annotations over any content with regard to a given ontology.

Content Manager
a client application that uses the Content Augmentation Service to attach ontology-aware annotations to the given content.

 

 

KCIT (Key Concept Identification Tool)

 

KCIT combines several of GATE's generic language analysis components (e.g., sentence splitter, tokeniser, Flexible Gazetteer) with some newly developed ones, the main one being the OntoRoot Gazetteer. The OntoRoot Gazetteer uses generic language analysers, such as the gazetteer and the morphological analyser, to identify key concepts effectively and robustly.
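The idea of combining a gazetteer with morphological analysis can be illustrated with a small Python sketch. It is not the OntoRoot Gazetteer implementation: the root_form function is a crude stand-in for a real morphological analyser, and the concept URIs, labels and function names are hypothetical. It only shows why matching on root forms lets a plural in the text still hit a singular ontology label.

```python
import re
from typing import Dict, List, Tuple


def root_form(word: str) -> str:
    """Crude stand-in for a morphological analyser (assumption): lower-case
    the word and strip a plural "s"."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    return w


def build_gazetteer(labels: Dict[str, str]) -> Dict[Tuple[str, ...], str]:
    """Map the root forms of each (hypothetical) ontology label to its URI."""
    return {
        tuple(root_form(t) for t in label.split()): uri
        for uri, label in labels.items()
    }


def annotate(text: str, gazetteer: Dict[Tuple[str, ...], str]) -> List[Tuple[str, str]]:
    """Return (matched text, concept URI) pairs, preferring the longest match."""
    tokens = re.findall(r"[A-Za-z]+", text)
    roots = [root_form(t) for t in tokens]
    longest = max((len(key) for key in gazetteer), default=0)
    hits: List[Tuple[str, str]] = []
    i = 0
    while i < len(roots):
        for n in range(min(longest, len(roots) - i), 0, -1):
            uri = gazetteer.get(tuple(roots[i:i + n]))
            if uri:
                hits.append((" ".join(tokens[i:i + n]), uri))
                i += n
                break
        else:
            i += 1  # no label starts at this token
    return hits
```

With the labels {"ex:Gazetteer": "gazetteer", "ex:ProcessingResource": "processing resource"}, annotating "Gazetteers are processing resources." yields matches for "Gazetteers" and "processing resources", although neither label appears in the text in exactly that form.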

 

OntoRoot Gazetteer is published as a GATE plugin named Ontology-based Gazetteer. The source is available with the latest GATE distribution from svn, and the documentation is published in the GATE User Manual. More details on how to use this gazetteer are available in the README file distributed with the source code of the GATE distribution. More information about how to get GATE from svn is available here.

 

KCIT is domain independent, which means it can easily be used with any ontology. KCIT is a generic language processing tool aimed at processing regular text. In the TAO project, the legacy content consists of software artefacts, many of which lack full sentences, proper punctuation, or capitalisation. To adapt KCIT to software artefacts, among which source code and documentation play the most important role, we customised a few generic GATE processing resources, such as the Tokeniser and Sentence Splitter, to work with source code, WSDL files and the like.

 

We prepared a self-contained application with instructions on how to use KCIT and how to configure it for use with any ontology. To download and try it, click here (7.1M). Documentation is available in the accompanying README file.

 

To make KCIT available outside the GATE GUI, we exposed it as a Web service.

 

For more information about KCIT, contact Danica Damljanovic.

 

Content Augmentation Service

 

KCIT is exposed as the Content Augmentation Service, which is available at:

http://gate.ac.uk/caservice/services/CAService. To view the WSDL file, click here.

 

This service exposes three main methods that a client application can call:

 

1. Boolean loadOntology(String ontologyURI) - this method accepts the URI of any ontology that is available online. If null or the empty String ("") is provided, it loads the default ontology, which is the GATE domain ontology and knowledge base available at http://gate.ac.uk/ns/gate-kb. If the given URI is the same as the URI of the ontology that is already loaded, it does nothing. This ensures that the ontology is loaded only once.
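The loading rules above can be modelled in a few lines of Python. This is a toy model of the described contract, not the service's code: the OntologyHolder class, the fetch callback and the fetch_count counter are hypothetical, and because the meaning of the Boolean return value is not documented here, the sketch simply assumes that True means the requested ontology is now available.

```python
from typing import Callable, Optional

# Default ontology URI as stated in the description of loadOntology.
DEFAULT_ONTOLOGY = "http://gate.ac.uk/ns/gate-kb"


class OntologyHolder:
    """Toy model of the loadOntology behaviour (hypothetical class)."""

    def __init__(self, fetch: Callable[[str], object]) -> None:
        self._fetch = fetch              # hypothetical loader, e.g. a download
        self.loaded_uri: Optional[str] = None
        self.ontology: object = None
        self.fetch_count = 0             # only here to make the behaviour visible

    def load_ontology(self, ontology_uri: Optional[str]) -> bool:
        # null or "" selects the default GATE ontology and knowledge base.
        uri = ontology_uri or DEFAULT_ONTOLOGY
        # Same URI as the one already loaded: do nothing.
        if uri == self.loaded_uri:
            return True
        self.ontology = self._fetch(uri)
        self.loaded_uri = uri
        self.fetch_count += 1
        return True                      # assumed meaning: ontology is available
```

Calling load_ontology("") and then load_ontology(None) on the same holder triggers a single fetch, since both resolve to the default URI.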

 

2. String processText(String docText, String ontologyUrl, String encoding, boolean markupAware) - this method first calls loadOntology with the given ontologyUrl. Hence, if null or the empty String ("") is provided, the GATE domain ontology/knowledge base is loaded.

The remaining parameters are:

 

docText - the text to be processed, given as a String.
encoding - optional; if null or the empty String ("") is provided, the default value UTF-8 is used.
markupAware - controls whether the markup of the language (in XML, the tags) is ignored during processing. For example, when processing an HTML document, 'true' should be used (so that the tags are excluded from processing), whereas when processing a WSDL file this parameter should be set to 'false'.
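A client may want to apply these parameter conventions before calling the service. The following Python sketch shows one way to do so. Both helper functions are hypothetical, and choosing markupAware from the file extension is an assumption: the text above only states the recommended values for HTML documents and WSDL files.

```python
from typing import Optional

# Default encoding as stated in the parameter description.
DEFAULT_ENCODING = "UTF-8"


def effective_encoding(encoding: Optional[str]) -> str:
    """Return the encoding that applies: UTF-8 when null or "" is given."""
    return encoding if encoding else DEFAULT_ENCODING


def suggested_markup_aware(file_name: str) -> bool:
    """Suggest a markupAware value from the file name (heuristic assumption)."""
    # WSDL files: the description recommends 'false'.
    if file_name.lower().endswith(".wsdl"):
        return False
    # HTML documents and, by assumption, everything else: 'true'.
    return True
```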

 

3. String processURL(String docURL, String ontologyUrl, String encoding, boolean markupAware) - this method accepts the same parameters as the previous one, except that instead of the text to be processed it takes docURL, the URL of a file that is available online.
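As a rough illustration of what a processURL call might look like on the wire, the Python sketch below builds a SOAP 1.1 request body with the standard library only. The operation namespace SERVICE_NS is a placeholder, and the element layout (unqualified child elements named after the parameters) is an assumption; the real values must be taken from the service's WSDL file. The request is only constructed, not sent.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:caservice"  # placeholder; use the namespace from the WSDL


def build_process_url_request(doc_url: str, ontology_url: str = "",
                              encoding: str = "",
                              markup_aware: bool = True) -> bytes:
    """Build a SOAP envelope for processURL (layout is an assumption)."""
    ET.register_namespace("soapenv", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}processURL")
    params = (
        ("docURL", doc_url),
        ("ontologyUrl", ontology_url),   # "" selects the default ontology
        ("encoding", encoding),          # "" selects UTF-8
        ("markupAware", "true" if markup_aware else "false"),
    )
    for name, value in params:
        ET.SubElement(call, name).text = value
    return ET.tostring(envelope, encoding="utf-8")
```

The resulting bytes could then be posted to the service endpoint with any HTTP client. A SOAP toolkit that reads the WSDL file directly is usually a more robust choice than building envelopes by hand.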

 

For more information about the CA Service, contact Danica Damljanovic.

Content Augmentation Manager

 

The Content Augmentation Manager service automatically identifies the instances of a given ontology that appear in a web page. Those instances can then be used to semantically annotate the web page. It connects to the Content Augmentation Service to extract the relevant information, i.e. named entities such as Persons, Locations, Organisations, Dates, and so on. That information is controlled and consolidated by ITM according to the given ontology model and the existing instances of its concepts.

 

The CA Manager service can be tested at the following address: http://62.210.155.132/ca-test/

 

You need to provide the URL of the web page you want to process as well as the URL of the ontology model to be used in the extraction process. As output, an RDF fragment is automatically generated containing references to the ontology instances annotated in the given web page. If you use Firefox, you can visualise the RDF fragments more easily with the Tabulator plugin, if it is installed.

 

Below are some test results for different web page/ontology pairs:

 

Results of annotating the GATE web site with regard to the GATE ontology.

Results of annotating the Mondeca web site with regard to the PROTON ontology.

Results of annotating the Wine web site with regard to the wine ontology.

 

CA Manager is open-source software and is available to download from SourceForge.net. The manual is available here.

 

For more information about CA Manager, contact Florence Amardeilh.

 

Related Publications

Amardeilh F. "Semantic Annotation & Ontology Population", In Semantic Web Engineering in the Knowledge Society, Cardoso Jorge and Lytras Miltiadis D. (Eds), Idea Group Reference, 2008. Abstract - PDF.

Related Deliverables