Configuring the Textproc Module
The module has two functions: managing how text is modeled according to the ontology, and providing methods to automatically analyse text.
Both functions can be configured via the four tabs on the admin pages under "Site Configuration" -> "WissKI module settings" -> "Textproc".
In order to get started, you only need to specify the modeling of text as described in the next subsection.
Text and ontology
Like in wikis, WissKI allows describing objects by free text. This text is not only displayed together with the instance; the text itself is also modeled as an instance that is linked at the ontology level with the object instance. That is, you can distinguish between the object and a description of that very same object. Consequently, you can express, for example, that the description mentions some other objects without implying that the object itself somehow refers to them.
The system needs to know how to model texts and their links to objects. This modelling information is provided by defining a special pathbuilder group with paths. See the Installation Guide on how to create paths and groups. A sample group using the ECRM (versions 101001 & 120111) can be downloaded here or on GitHub. You may import it into your path definitions. The group and paths can be set on the Document Settings tab through the four select boxes:
- Group for documents: This field should point to the special group. It defines the ontology concept the text instances will be modeled as. If you use the sample path definitions, you should be able to select a group called "Document".
- Path for subject of a document: This field defines how the text is linked to the instance it describes (the one with which the text is displayed). If you use the sample path definitions, set it to "Topic".
- Path for referred objects: This field defines how the text is linked to instances it mentions, i.e. those that are annotated in the text. In the sample paths, it's "refers to".
- Path for creator of a document: This field is experimental. It should normally be set to "
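The modelling described above can be pictured as a small RDF fragment. The following Turtle sketch uses ECRM classes and properties that plausibly back the sample "Document" group; the exact class and property numbers are assumptions for illustration, so verify them against your own path definitions.

```turtle
# Hypothetical sketch of the document modelling, NOT the authoritative
# ECRM mapping — check your imported sample paths for the real one.
@prefix ecrm: <http://erlangen-crm.org/current/> .
@prefix ex:   <http://example.org/> .

ex:text_1 a ecrm:E31_Document ;
    ecrm:P129_is_about ex:object_1 ;   # "Topic": the instance the text describes
    ecrm:P67_refers_to ex:object_2 .   # "refers to": an object mentioned in the text
```

The key point is that `ex:text_1` and `ex:object_1` are distinct instances: statements about the description never collapse into statements about the described object.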
Furthermore, you can set the default language of the texts in your system. You cannot currently set the language for individual texts; the default language setting applies to all texts in the system. Only German and English are supported. However, most pre-installed analysis methods are configured to work sufficiently well with any (European) language, so you may just try out other languages.
Automatic text analysis
The automatic text analysis can be configured on the other three tabs. The default tab, List, shows a list of all active components for analysing text. For each component, its user-defined name and its type (i.e. its functionality) are given. Each component can be configured by clicking its Edit link.
During the installation process of the Textproc module, the "Default vocabulary detection" component will already be configured and activated. It detects mentions of vocabulary entries, both local and imported, and marks them according to the defined pathbuilder groups.
Adding and editing an analysis method
The user may add other analysis components for a better automatic detection of entities. The Add tab shows a list of available component types together with a short description of what they can detect. By clicking on a link, the user may create and configure a new component.
Vocabulary Detection
This component is always pre-installed as the default vocabulary detection. It scans the text for occurrences of vocabulary entries, looking up entries in all vocabularies defined in the Vocabulary Control module that have been indexed. (Vocabularies from the local store are always indexed.) If a text snippet matches a vocabulary entry, an annotation is added to the snippet that links it with the entity the vocabulary entry represents. E.g. if you have defined a vocabulary of persons with their names and you run the automatic text analysis by pressing the Send button in the editor, the method will look for occurrences of the persons' names. If it finds one, it will create an annotation of group "Person" (displayed with the appropriate icon and color) that links to the specified person.
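The lookup step can be illustrated with a minimal sketch: scan the text for known entry labels and record annotations as (start, end, entity) spans. All names here are hypothetical, and WissKI's real matching is index-based and considerably more sophisticated.

```python
# Naive vocabulary lookup, for illustration only: the vocabulary maps
# entry labels to the entities they represent (made-up identifiers).
vocabulary = {"John Smith": "person:123", "New York": "place:456"}

def annotate(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, entity) spans for every label found in the text."""
    annotations = []
    for label, entity in vocabulary.items():
        start = text.find(label)
        if start != -1:
            annotations.append((start, start + len(label), entity))
    return sorted(annotations)

print(annotate("John Smith moved to New York."))
```

Each returned span would correspond to one annotation in the editor, linked to the matched entity.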
As it is pre-installed, you usually need not add this component yourself. On the configuration page, you can tune the parameters of the analysis algorithm. This is usually not necessary, as the default values should work in most cases. The fields are:
- Place coordinates check: These settings are only of interest if you have instances representing geo-referenced places and the vocabularies in question provide coordinate information via the latitude/longitude fields. Such instances may be reranked according to their geographic distance to some reference points. Reference points may either be set statically (in the "preferred coords" textarea) or calculated from other approved annotations in the text.
- Place classes: Here you can specify which pathbuilder groups' instances should be treated as places. WissKI expects a whitespace-separated list of group IDs.
- Use coordinates of approved annotations: Toggle this if you want approved place annotations to act as reference points. If enabled, already annotated places will tend to attract new place annotations, i.e. places in the neighborhood of existing places will be preferred.
- Preferred coords: Here you can specify the static reference points, line by line, where each line contains
- Latitude factor, Longitude factor: Both fields define a factor for the latitudinal/longitudinal difference that determines how much impact the latitude/longitude has on the reranking. For example, if you only want to rerank places diverging from a certain latitude, you may set the Longitude factor to 0. Negative values favour places in the neighborhood of reference points, positive values favour remote places.
- The next 5 text fields give the weights and factors for exact and partial hits. The default values should work sufficiently well in most scenarios.
- Rank offset exact: Rank of a one-word hit, like "John", "Germany", "Chair", etc.
- Rank offset contains: Rank of a complete multi-word hit, like "John Smith", "New York", etc.
- Rank offset length contains: Additional rank for each word in a complete multi-word hit
- Rank offset guess: Rank of a partial multi-word hit, like only "John" for "John Smith", only "York" for "New York", etc.
- Rank offset length guess: Additional rank for each word in a partial multi-word hit
- Has lemma, Has pos: These fields are only of interest if you use a preprocessor that provides lemmata and/or part-of-speech (POS) tags. For each group, you may specify positive or negative weights that rerank a hit if the word(s) has/have lemmata or a certain POS. For example, "Bath" may be either an ordinary word or a settlement. In order to suppress misdetection as a place, you can define that a token "Bath" tagged as a normal noun (as opposed to a proper noun) should be downranked. "Has lemma" accepts one factor for each field, "Has pos" accepts one weight per POS per line; the syntax being
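To make the five rank offsets concrete, here is a hedged sketch of how they might combine into a single score. The additive model, the function name and the default values are assumptions for illustration only; WissKI's actual scoring may differ.

```python
# Hypothetical additive scoring model for vocabulary hits.
# Parameter names mirror the settings above; values are made up.
def rank_hit(hit_words: int, entry_words: int,
             offset_exact: float = 10.0,
             offset_contains: float = 20.0,
             offset_length_contains: float = 5.0,
             offset_guess: float = 2.0,
             offset_length_guess: float = 1.0) -> float:
    """Score a match of hit_words words against an entry of entry_words words."""
    if entry_words == 1 and hit_words == 1:
        # One-word hit, e.g. "John" matching the entry "John"
        return offset_exact
    if hit_words == entry_words:
        # Complete multi-word hit, e.g. "John Smith"
        return offset_contains + entry_words * offset_length_contains
    # Partial multi-word hit, e.g. only "John" for the entry "John Smith"
    return offset_guess + hit_words * offset_length_guess

# A complete multi-word hit should outrank a partial one:
assert rank_hit(2, 2) > rank_hit(1, 2)
```

The point of the separate "length" offsets is that longer complete matches gain rank with every word, while partial matches stay cheap, so "John Smith" beats a lone "John".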
Person Name Detection
This component detects person names that are not in a vocabulary, i.e. it tries to identify persons that are so far unknown to the system. As such, it will always create an annotation for a new instance of the person group. If the vocabulary detection detects an existing person, the existing person will be linked instead of a new instance. On the configuration page, there are the following settings:
- Group: The group with which the annotations will be associated, i.e. the group that represents persons in your system. This setting is mandatory and needs to be set on each WissKI information space.
- Database table name: The component uses a table of name parts, like given names, to detect names. The default database table is filled with name parts extracted from Wikipedia. You usually need not change it.
- Rankings: You can tell the component how you expect names to be built up in your texts: for example, in most of Europe, names are given in the form or . You may alter that scheme here or give the patterns alternate weights. Again, at least for most European languages the default patterns should be sufficient.
Date & Time Detection
This component detects and annotates date formats like "21 August 1920". Annotations are then converted to new instances of a group that represents Dates/Time-Spans. On the configuration page you can select the group that represents Dates/Time-Spans.
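A pattern for the date format cited above ("21 August 1920") can be sketched as a regular expression. This is an illustration only; the actual component presumably recognises many more formats than this.

```python
import re

# Hypothetical pattern for dates of the form "<day> <month name> <year>".
DATE_RE = re.compile(
    r"\b(\d{1,2})\s+"
    r"(January|February|March|April|May|June|July"
    r"|August|September|October|November|December)\s+"
    r"(\d{4})\b"
)

m = DATE_RE.search("The church was restored on 21 August 1920 by the parish.")
print(m.group(1, 2, 3))
```

Each match would then be turned into a new instance of the configured Dates/Time-Spans group.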
Detection by Regular Expressions
This component can be used if you want to annotate occurrences of a specific textual pattern as entities. Examples are identifiers that follow certain rules, like inventory numbers, or names with specific suffixes, like -sky or -vich. In the image example, a pattern for a museum's inventory numbers is given: one of the letters A, G or X, followed by a hyphen, followed by four to six digits. This will link all occurrences of that pattern to new instances of museum objects. If the vocabulary detection component finds this pattern as a museum object in a vocabulary, the existing instance will be linked instead of creating a new instance.
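The inventory-number pattern from the example translates directly into a regular expression; a quick check in Python (the sample text is made up):

```python
import re

# One of the letters A, G or X, a hyphen, then four to six digits.
INV_RE = re.compile(r"\b[AGX]-\d{4,6}\b")

text = "Compare object G-12345 with the fragment X-0042."
print(INV_RE.findall(text))
```

You can use such a snippet to test a pattern before entering it in the component's configuration.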
In the "Preprocessor" tab you can define for each supported language an external program to preprocess (i.e. POS-tag and lemmatize) the text. Currently only German and English are supported. Using a preprocessor is not mandatory altough it yields better analysis results.
WissKI expects the following input and output formats: input must be line-based, one token/word per line. Output is also line-based, of the form <lemma>\t<tag>. The <tag> is optional.
An example of a preprocessing program with this input/output format is the TreeTagger (<http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/>).
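A minimal sketch of a program obeying this interface: one token per line in, one <lemma>\t<tag> line out, with the tag column omitted where unknown. The tiny lemma and tag tables are made-up illustrations, not a real tagger.

```python
# Toy lookup tables standing in for a real lemmatizer/tagger.
LEMMAS = {"churches": "church", "was": "be"}
TAGS = {"churches": "NNS", "was": "VBD"}

def preprocess(tokens: list[str]) -> list[str]:
    """Map each input token to an output line "<lemma>\t<tag>" (tag optional)."""
    lines = []
    for token in tokens:
        lemma = LEMMAS.get(token.lower(), token)
        tag = TAGS.get(token.lower())
        # The <tag> column is optional; emit it only when known.
        lines.append(f"{lemma}\t{tag}" if tag else lemma)
    return lines

print("\n".join(preprocess(["The", "churches", "was"])))
```

A real preprocessor would read the token lines from standard input and write the output lines to standard output, which is how an external program would be wired into the Preprocessor tab.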