Annotext 3
From AHRC Helix Wiki
annotext™
Technical Details
annotext uses both JavaScript code (on the client-side) and PHP code (on the server side) to work. PHP scripts deal mostly with importing texts or glossaries/dictionaries from external sources and managing the edited texts, while the JavaScript code is responsible for the most important aspects of this application, namely editing and displaying texts and glossaries/dictionaries, and executing commands like word lookup.
Server Side
Server requirements: a web server, PHP with the mbstring extension enabled, and iconv (as a binary).
On the server the two main components, Display and Editor, are located in two different directories. Note that the Editor uses the global.js file found in the display directory and the global.php file found in the root directory.
Information about the texts is stored in the tab-delimited text file texts.lst. Every text has an ID number (specified in texts.lst). The XML files with the actual texts, as well as the XML glossaries, are stored in the texts directory; a file containing a text is named in the following way: <text_id>.xml; glossary files are named in the following way: <text_id>-<to_lang>.xml where to_lang is (the abbreviation of) the language in which the definitions are written within this glossary. The same text may be translated into more than one language. Dictionaries (which are generic in nature and are not used in the display part, only during editing) are kept in the dicts directory and their names are formed as follows: <from_lang>-<to_lang>.xml where from_lang is the language in which the original words are and to_lang is the language in which the definitions are.
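The naming scheme above can be summarised in a few lines of JavaScript. This is only an illustrative sketch: the function names and the sample language abbreviations ("de", "en") are assumptions, not part of annotext.

```javascript
// Hypothetical helpers that build file paths following the naming scheme
// described above. None of these function names exist in annotext itself.

// A text lives in the texts directory as <text_id>.xml
function textFileName(textId) {
  return "texts/" + textId + ".xml";
}

// A glossary lives next to its text as <text_id>-<to_lang>.xml
function glossaryFileName(textId, toLang) {
  return "texts/" + textId + "-" + toLang + ".xml";
}

// A dictionary lives in the dicts directory as <from_lang>-<to_lang>.xml
function dictFileName(fromLang, toLang) {
  return "dicts/" + fromLang + "-" + toLang + ".xml";
}
```

For example, under these assumptions a German text with ID 12 and an English glossary would use texts/12.xml, texts/12-en.xml, and dicts/de-en.xml.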
Display
(All files mentioned in this paragraph are in the display directory.)
The list_texts.php script lists the available texts and provides links for displaying or downloading each one. To allow downloading a text, the download_text.php script puts all files necessary for viewing the text into a temporary ZIP archive and provides a link to it. It also allows the user to download each file separately, in case the user already has most of the files. The other option is to open the text for viewing, in which case the display_text.php script just prints out a frameset that includes the text’s XML file, the XML glossary (in a zero-sized, i.e. invisible, frame), and a control frame on top that displays word definitions and provides options. The rest is left to the client-side JavaScript code.
Editor
(All files mentioned in this paragraph are in the editor directory.)
The index page of the editor provides links to: editing a text, uploading a new text, importing an edited XML text, importing a dictionary, and exporting a dictionary. The Edit Text link leads to list_texts.php, which lists all texts with links to edit (Open), export, auto-define, and delete a text. The Open button functions much like its analog in the display part: it prints out a frameset with a control frame, the text frame, and two invisible frames: the glossary and the dictionary (note that there is no dictionary in the display part, only in the editor). The export and delete scripts allow downloading and deleting a text, respectively. The Auto-def button is intended to make the auto_define.php script go through the text, automatically look up every word in the dictionary, and create glossary entries for the words it finds, so that the author can start out with a certain set of definitions instead of having to write or copy them manually. Currently, the auto_define.php script is not functional.
The next option for a user of the Editor is to “Upload New Text”. This link opens the new_text_form.php script which generates a form for uploading a new text (and the target of that form is new_text.php). From here an author can import a text from a plain-text file with a few different conversion options. Firstly, the character encoding has to be converted to UTF-8 if it is not already such. Secondly, the author can specify what general format the text file follows – whether each new line in the file should be considered a new paragraph, or whether paragraphs are delimited by blank lines, etc. (Note: conversion from HTML is not yet implemented). The author also has the option to import a glossary in the Old Annotext tab-delimited text file format. The author has to specify the language of the text and what language this text will be initially translated into (so that annotext knows what dictionary to associate with it). If no dictionary exists for the specified pair of languages, a blank one is created automatically.
The third option is to import a text. The text to be imported must have been created with annotext 3.0. (Note: any text following the TEI standard will also be accepted, but some of its information, such as tags and header data, may be ignored or overwritten by the editor.)
Importing a dictionary is intended to merge the newly imported dictionary with any existing dictionary for the same pair of languages in a meaningful way, for example by taking only entries that do not already exist, or by appending definitions to existing definitions. The import_dict.php script and the lib_combine_dicts.php file that it uses are not implemented yet.
The Export Dictionary function allows the user to download a dictionary (through the export_dict.php script).
Client Side
annotext makes use of the Document Object Model Level 2 standard (DOM2) to manipulate XML nodes and other objects. The Display part works both with Internet Explorer 5+ and Netscape 6+. The Editor part requires manipulation of selected text and other advanced functions which are not provided by DOM2 and therefore annotext has to use browser-specific features and functions. Thus the Editor only works with Netscape 6 or newer (and not with Internet Explorer).
One major feature of this application is that it is run on the client side, but its files are saved on the server side. Thus there needs to be two-way communication between those different tiers. The server-to-client communication is done just by opening the needed files through HTTP. The client-to-server communication, however, is more complex. The only instance when that type of communication is needed is when saving a file that the Editor has changed. This is accomplished by using JavaScript to send an HTTP POST request to the server while the application is running on the client side. Netscape/Mozilla includes an XMLHttpRequest class which can be used to send a whole XML document (in its dynamically modified state) to the server through a POST request (see sendDoc and saveDoc functions in global.js).
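The client-to-server step can be sketched as follows. This is not the actual sendDoc/saveDoc code from global.js, which may differ; the function name postXmlString, the url parameter, and the injected XhrClass parameter are assumptions made so the sketch can run outside a browser.

```javascript
// Sketch only: posts an XML string to the server with an XMLHttpRequest-like class.
// XhrClass is passed in (an assumption for illustration); in a browser it would
// be the built-in XMLHttpRequest.
function postXmlString(xmlString, url, XhrClass, onDone) {
  var request = new XhrClass();
  request.open("POST", url, true); // true = asynchronous request
  request.setRequestHeader("Content-Type", "text/xml; charset=UTF-8");
  request.onreadystatechange = function () {
    if (request.readyState === 4) { // 4 = request finished
      var ok = request.status >= 200 && request.status < 300;
      onDone(ok, request.status);
    }
  };
  request.send(xmlString);
}
```

In a browser, the dynamically modified document could first be turned into a string with new XMLSerializer().serializeToString(doc) and then passed to this function together with XMLHttpRequest and the URL of the server script that stores the file.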
Almost all of annotext’s client-side JavaScript code is within three .js files: global.js in the display directory (used by both Editor and Display), display_text.js (display-specific code), and edit_text.js (editor-specific code).
global.js (found in the display directory, although it is also used by the editor) contains generic helper functions (mostly pertaining to strings and DOM nodes), configuration constants, and a few classes that allow the Editor and Display parts to create, look up, and manipulate glossary/dictionary entries, including one large class, Dict, which implements both glossaries and dictionaries.
The Entry class represents a dictionary or glossary entry (i.e. a word/phrase and its definition, along with other properties). The Entry objects are a medium of exchange between the user interface and the Dict objects.
The Gloss class is used to store the state of currently displayed entries. It manages the actual interaction between the user interface and the Dict objects (by passing Entry objects to and from the Dict objects).
Finally, the Dict class manages a glossary or dictionary. It has methods for word lookup, adding and deleting entries, and saving the dictionary to the server. This class does not load all entries from the dictionary into a data structure of its own; rather, it works directly on the XML node structure provided by the web browser (in accordance with DOM2). Every dictionary/glossary XML file contains a collection of <word> and <phrase> tags representing the entries, and a so-called redirect table at the beginning of the file. Each word entry can have multiple match forms (stored in <match> tags within the <word> tag), which allow the same glossary/dictionary entry to match more than one word (in practice, used for different forms of the same word). In fact, the “original” or <canonical> form of each word is NOT searched when looking up a word; only the match forms are.
When searching for a word, we need to search through all match forms of all word entries. As explained below, the Dict class uses binary search for looking up words, which only works if the entries are sorted alphabetically by match form. However, since there may be multiple match forms for the same word entry, it is impossible to sort entries by all match forms at once. The solution is the following: we keep the entries sorted by their first match form and maintain additional “redirect tables” for sorting them based on their second, third, etc. match forms. The first redirect table contains pointers to all word entries that have at least 2 match forms, and those pointers are arranged in such a way that if you follow them one by one and obtain the word entries they point to, you will get a list of word entries sorted by their second match form. Similarly, the second redirect table contains pointers to word entries sorted by their third match form, and so on.
Thus the number of redirect tables needed is one less than the number of match forms in the word entry that has the most match forms (the first match form is covered by the ordering of the entries themselves). When searching for a particular word, we perform binary search on the first level (directly on the entries, which are sorted by their first match form), then we perform binary search on the second level, using the first redirect table to obtain the right ordering of entries by second match form, and so on.
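The multi-level lookup described above can be sketched with plain arrays standing in for the XML nodes. This is a minimal illustration, not the Dict implementation: the names binarySearch and lookupWord, the object layout of the entries, and the sample words are all assumptions, and strings are compared with JavaScript's default ordering.

```javascript
// Generic binary search over positions 0..length-1; keyAt(i) returns the
// sort key at position i. Returns the matching position or -1.
function binarySearch(length, keyAt, target) {
  var lo = 0, hi = length - 1;
  while (lo <= hi) {
    var mid = (lo + hi) >> 1;
    var key = keyAt(mid);
    if (key === target) return mid;
    if (key < target) lo = mid + 1; else hi = mid - 1;
  }
  return -1;
}

// entries: sorted by matches[0].
// redirects[k]: indices of entries that have at least k+2 match forms,
// ordered by matches[k+1].
function lookupWord(entries, redirects, word) {
  // First level: search the entries directly by their first match form.
  var pos = binarySearch(entries.length, function (i) {
    return entries[i].matches[0];
  }, word);
  if (pos !== -1) return entries[pos];

  // Further levels: search each redirect table by the next match form.
  for (var level = 0; level < redirects.length; level++) {
    var table = redirects[level];
    pos = binarySearch(table.length, function (i) {
      return entries[table[i]].matches[level + 1];
    }, word);
    if (pos !== -1) return entries[table[pos]];
  }
  return null; // the canonical form is never searched, only match forms
}

// Hypothetical sample data (sorted by first match form: am, go, house, ran).
var entries = [
  { canonical: "be", matches: ["am", "is", "was"] },
  { canonical: "go", matches: ["go", "went"] },
  { canonical: "house", matches: ["house"] },
  { canonical: "run", matches: ["ran", "run"] }
];
var redirects = [
  [0, 3, 1], // by second match form: is, run, went
  [0]        // by third match form: was
];
```

With this data, lookupWord(entries, redirects, "went") finds the "go" entry through the first redirect table, while looking up "be" finds nothing, because "be" appears only as a canonical form.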
The Dict class does load the redirect tables into memory (in the form of arrays of indices of word entries) and manipulates them from there.
When adding a word entry to a Dict object, the addEntry function uses a few shortcuts. For example, if the new entry is a modification of an existing entry and its match forms have not changed, we can simply replace the old entry without re-indexing anything. However, if we are inserting a new entry, or an entry with modified match forms, we need to work out where to insert it so that the entries remain sorted by first match form, and then where to insert a pointer to it in each redirect table so that their ordering by second, third, etc. match form remains correct. There is an additional complication with redirect tables: the “pointers” they store are really the indices of the word entries (integers, not actual memory pointers). If we insert a new entry somewhere among the existing entries, every entry after it changes its index. So when inserting a new entry, we have to increase every index stored in the redirect tables that is greater than or equal to the index of the just-inserted entry. The addEntry function contains a few smaller helper functions, such as getInsertIndex, reindexRedirects, createEntryElement, and haveSameMatches.
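The insertion and re-indexing logic can be illustrated on plain arrays. The sketch below is an assumption about how such an insert could work, not the addEntry code itself: findInsertIndex and insertEntry are hypothetical names (loosely playing the roles of getInsertIndex and reindexRedirects), and the entries are simple objects rather than XML nodes.

```javascript
// Lower-bound binary search: first position whose key is not less than `key`.
function findInsertIndex(length, keyAt, key) {
  var lo = 0, hi = length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (keyAt(mid) < key) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// Inserts `entry` so that entries stay sorted by first match form and every
// redirect table stays sorted by its own match form. Returns the new index.
function insertEntry(entries, redirects, entry) {
  var index = findInsertIndex(entries.length, function (i) {
    return entries[i].matches[0];
  }, entry.matches[0]);

  // Entries at or after `index` are about to shift by one, so every stored
  // index >= index must be increased.
  for (var t = 0; t < redirects.length; t++) {
    for (var j = 0; j < redirects[t].length; j++) {
      if (redirects[t][j] >= index) redirects[t][j]++;
    }
  }
  entries.splice(index, 0, entry);

  // Add a pointer to the new entry in one redirect table per extra match form.
  for (var level = 1; level < entry.matches.length; level++) {
    if (!redirects[level - 1]) redirects[level - 1] = [];
    var table = redirects[level - 1];
    var pos = findInsertIndex(table.length, function (i) {
      return entries[table[i]].matches[level];
    }, entry.matches[level]);
    table.splice(pos, 0, index);
  }
  return index;
}

// Hypothetical sample data, sorted by first match form.
var entries = [
  { canonical: "be", matches: ["am", "is", "was"] },
  { canonical: "go", matches: ["go", "went"] },
  { canonical: "house", matches: ["house"] },
  { canonical: "run", matches: ["ran", "run"] }
];
var redirects = [[0, 3, 1], [0]];
var insertedAt = insertEntry(entries, redirects,
  { canonical: "eat", matches: ["eat", "ate"] });
```

Here the new entry lands at index 1, so the stored indices of "go" and "run" move from 1 and 3 to 2 and 4 before the pointer for "ate" is placed at the front of the first redirect table.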
When deleting an entry, we need to go through the process of (linearly) determining the node’s index and then re-indexing the redirect tables before removing the actual node.
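Deletion is the mirror image of insertion and can be sketched the same way. As before, this is an illustration on plain arrays under assumed names (removeEntry is hypothetical), not the deleteEntry code from global.js.

```javascript
// Removes `entry` from `entries` and fixes up the redirect tables.
// Returns true if the entry was found and removed.
function removeEntry(entries, redirects, entry) {
  var index = entries.indexOf(entry); // linear scan for the entry's index
  if (index === -1) return false;

  for (var t = 0; t < redirects.length; t++) {
    var kept = [];
    for (var j = 0; j < redirects[t].length; j++) {
      var ref = redirects[t][j];
      if (ref === index) continue;            // drop pointers to the removed entry
      kept.push(ref > index ? ref - 1 : ref); // later entries move up by one
    }
    redirects[t] = kept;
  }
  entries.splice(index, 1);
  return true;
}

// Hypothetical sample data, sorted by first match form.
var entries = [
  { canonical: "be", matches: ["am", "is", "was"] },
  { canonical: "go", matches: ["go", "went"] },
  { canonical: "house", matches: ["house"] },
  { canonical: "run", matches: ["ran", "run"] }
];
var redirects = [[0, 3, 1], [0]];
var removed = removeEntry(entries, redirects, entries[1]); // remove "go"
```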
Phrases are handled quite differently from words. When created, each phrase entry has a specific ID number and its “id” parameter (of the <phrase> XML tag) is set to “n<ID>”. Phrases need not be sorted in any way, and do not have multiple things to match, so they are implemented as a straightforward list of <phrase> tags. DOM2 provides convenient functions for inserting/appending and deleting a specific node, so phrase addition and deletion is easy (see addEntry and deletePhrase functions; note that addEntry is used for inserting both word and phrase entries, while the deletion functions are specialized: deleteEntry for words and deletePhrase for phrases).
The save function uses saveDoc (discussed earlier) to save the modified dictionary/glossary to the server.
display_text.js is used to display a text and allow for instant word/phrase lookup. It also allows the user to view only a particular chapter/section at a time (which removes the need for annoying scrolling back and forth).
(Note that at an earlier stage of the development of annotext, the Display also included a dictionary instead of just the glossary, so there are some pieces of code remaining from that time; however, they do not interfere with the way this script currently works.)
When a text is opened for displaying, the initializeDisplay function is executed first. It sets up event handlers/listeners and creates objects such as the Dict object for the glossary. Note that Netscape/Mozilla seems to have a bug in its document.getElementById function (as of version 7.0) so in lines 41-45 of display_text.js we have to replace the built-in functions with custom-made ones (which have actually been written with this particular use in mind). Next, initializeDisplay checks if the text is separated into multiple parts, and, if it is, displays them one at a time, along with a drop-down box to choose the part from. The way the display by part is achieved is the following: all parts except for one are set to a specific class name that specifies CSS styles that make that part invisible (“hidden”). We can easily change those styles dynamically when the user changes the selection for chapter/section, and then it is the browser’s responsibility to display only the part that the user wants.
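The display-by-part technique amounts to toggling a class name. A minimal sketch follows; the function name showOnlyPart, the class name "hidden", and the CSS rule mentioned in the comment are assumptions, and the parts are any objects with a className property (DOM elements in a browser).

```javascript
// Shows exactly one part and hides the rest by assigning a CSS class.
// Assumes a stylesheet rule such as:  .hidden { display: none; }
// Note: this overwrites className, so it suits parts with no other classes.
function showOnlyPart(parts, visibleIndex) {
  for (var i = 0; i < parts.length; i++) {
    parts[i].className = (i === visibleIndex) ? "" : "hidden";
  }
}
```

A change handler on the drop-down box could call showOnlyPart with the selected index, leaving the browser to re-render the page so that only the chosen chapter/section is visible.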
display_text.js also includes functions for handling clicks on words/phrases in the text and looking up their meaning in the glossary.
edit_text.js contains the editing code. That includes advanced functions for dealing with nodes and especially text selections. There are quite a few global variables defined in this script that need to be accessed by multiple functions and maintain their values over time. For instance, the g_busy variable specifies whether the application is currently engaged in an activity that requires some time and cannot do anything else (called “suspend mode”).
The markSelection function determines what tag you want to mark the selection with and what the selection is, and calls setTag to put your selection into a new tag.
This script needs to keep track of what the user is currently doing because there are a few different “modes” of operation – for instance, selecting something and marking it up as a “Term/phrase” when another term/phrase is being displayed will add the new selection to the term/phrase being displayed, as opposed to creating a new term/phrase (which is what would happen if you are in “regular” mode and no phrase is displayed). Keeping those in mind, the comments in the code explain just about everything that the script does; please refer to them. Note that this script defines an extension of the Gloss class, called GlossEdit, which adds features for editing the gloss that is to be displayed (as opposed to just displaying it). It then extends GlossEdit further to create the GlossEditPhrase class which contains phrase-editing-specific function implementations. With this 3-level class interface we can keep a high level of abstraction from the actual implementation and can easily interchange between different uses of the Gloss objects.
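The three-level class structure can be pictured with prototype-based inheritance. The sketch below only shows the shape of such a hierarchy: the class names come from the text, but every method (show, setDefinition, addSelection) and every entry field is a hypothetical stand-in, not the real interface in edit_text.js.

```javascript
// Level 1: display only. Remembers which entry is currently shown.
function Gloss() {
  this.current = null;
}
Gloss.prototype.show = function (entry) {
  this.current = entry;
};

// Level 2: adds editing of the displayed entry.
function GlossEdit() {
  Gloss.call(this);
}
GlossEdit.prototype = Object.create(Gloss.prototype);
GlossEdit.prototype.constructor = GlossEdit;
GlossEdit.prototype.setDefinition = function (definition) {
  if (this.current) this.current.definition = definition;
};

// Level 3: adds phrase-specific editing, such as appending a newly marked
// selection to the phrase that is currently displayed.
function GlossEditPhrase() {
  GlossEdit.call(this);
}
GlossEditPhrase.prototype = Object.create(GlossEdit.prototype);
GlossEditPhrase.prototype.constructor = GlossEditPhrase;
GlossEditPhrase.prototype.addSelection = function (text) {
  if (this.current) this.current.parts.push(text);
};
```

Code that only needs to display an entry can treat any of the three objects as a Gloss, which is the kind of interchangeability the paragraph above describes.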
Annotext 3 was originally written by Todor Kalaydjiev; now maintained by Phil Fazio.
Annotext™ is Copyright © 1992-2008, Trustees of Dartmouth College.
Last updated: 2008 Apr. 19, for Annotext 3.1
